This Jupyter notebook showcases how to build a Product Purchase Recommendation System. (Please refer to the attached Data Exploration notebook for a deeper understanding of the Product Orders dataset used here).


Recommendation System for Product Purchases

Step 0: Business Understanding

The client seeks to increase sales by recommending products that users are likely to purchase, based on data.

Objectives

1) Check for Missing Values & Understand Data.

2) Calculate the Most Popular Product: 
 =>We do this below by finding the product with MAXIMUM SALES BY VOLUME and BY DOLLAR AMOUNT.

3) Find Company with Maximum Purchase.

4) Given the Most Selling Product, find Similar Products (using ITEM-BASED COLLABORATIVE FILTERING).
 =>This requires calculating the cosine similarity (or alternatively the Pearson correlation, which produces similar results) between the Most Selling Product and other products.

5) Make a Product Recommendation for a Specific User, based on the user's previous purchases

Project Overview:

This project shows an example of applying ITEM-BASED COLLABORATIVE FILTERING. Given the Most Selling Product in our dataset, we seek to find similar products by calculating either the cosine similarity or the Pearson correlation between the Most Selling Product and other products.

To define similarity between products, we build a co-ocurrence matrix with users as rows and products as columhs, then calculate the correlation of all columns (i.e., all products) with each other. This is a way to measure similarity between any two products in terms of how two products tend to be purchased by the same users in similar amounts. For example, if a user purchased large amounts of PRODUCT A, we will recommend PRODUCT B if other users who purchased large amounts of PRODUCT A also purchased product B.

An alternative approach would be to look for similarity between products in terms of product information such as Product Title, Product description, SEO search keywords, or Product Size.

What is important to understand here is, "Why are we looking for similar products.. ?" If a customer is trying to purchase product X, we should we make a recommendation for them to consider product Y, as long as product Y is similar to product X (a substitute product). A different goal would be to show products that are often purchased together with product X (complementary products), if a user has already purchased product X or is adding product X to a current order.

Another consideration is that some products tend to have recurrent purchases while others do not. For example, if someone already bought a wheelchair, that user is unlikely to buy another one, so showing different wheelchair options only makes sense BEFORE the purchase and not afterwards. In this case, suggesting products for wheelchair users or wheelchair purchasers would be more effective.

In this project, the strategy will be to focus on co-ocurring purchases -- recommending products that tend to be purchased together by users (whether the products are purchased together on the same order or not).

Step 0: Import and Clean Data

Load libraries and dataset

In [1]:
import pandas as pd
df = pd.read_csv("PBL 5 recommendation data.csv", encoding ='iso-8859-1')

Select the most important features

In [2]:
df = df[["Customers.id","Customers.customer_type", "Customers.company","Orders.id", "Orders.subtotal","Orders.total", \
         "Orders.customer_id", "Orders.company","Products.vendor","Orders.order_number","Order_Items.id", "Order_Items.parent", "Order_Items.qty", "Order_Items.price", "Products.id", "Products.name"]]
df.head(3)
Out[2]:
Customers.id Customers.customer_type Customers.company Orders.id Orders.subtotal Orders.total Orders.customer_id Orders.company Products.vendor Orders.order_number Order_Items.id Order_Items.parent Order_Items.qty Order_Items.price Products.id Products.name
0 797 0.0 Company0 3758 57.20 64.29 797 Company0 1.0 3758 5284 3758 1 57.20 2310.0 Basic Steel Rollators,Green
1 3 0.0 Company1 23 20.00 29.99 3 Company1 1.0 23 31 23 4 5.00 177.0 Urinary Drain Bags
2 3 0.0 Company1 9531 68.78 78.73 3 Company1 1.0 9531 11655 9531 1 68.78 1.0 SensiCare Nitrile Exam Gloves,Blue,XX-Large

Drop duplicates

  • Orders.id, Order-Items.parent and Orders.order_number are the same
  • Customers.id is the same as Orders.customer_id
  • Customers.company is the same as Orders.company
In [3]:
df = df[["Customers.id", "Customers.company","Orders.id", "Orders.total","Orders.subtotal","Order_Items.id", "Order_Items.qty", \
         "Order_Items.price", "Products.id", "Products.name"]]
df.head(3)
Out[3]:
Customers.id Customers.company Orders.id Orders.total Orders.subtotal Order_Items.id Order_Items.qty Order_Items.price Products.id Products.name
0 797 Company0 3758 64.29 57.20 5284 1 57.20 2310.0 Basic Steel Rollators,Green
1 3 Company1 23 29.99 20.00 31 4 5.00 177.0 Urinary Drain Bags
2 3 Company1 9531 78.73 68.78 11655 1 68.78 1.0 SensiCare Nitrile Exam Gloves,Blue,XX-Large

Step 1: Data Understanding and Feature Engineering

   Explore Data

    (Please see attached the Data Exploration notebook for further reference)

  • What does each row in the dataframe represent?  
    => As shown below in the section "Visualize a particular order", each row in our dataframe represents a product sale. There are several rows per order, depending on the number of products in that order.
  • Missing data and Data Cleaning

       => The most important features for our model are NOT missing data.

  • Should we clean data?

       => There are many duplicate columns.

        => We will perform our analysis on a data subset containing only the most important features, as shown below.

   Visualize a particular order

We select id = 16186 for visualization and testing purposes throughout this notebook:

In [4]:
dfs= df[df["Orders.id"] == 16186] 
dfs
Out[4]:
Customers.id Customers.company Orders.id Orders.total Orders.subtotal Order_Items.id Order_Items.qty Order_Items.price Products.id Products.name
3845 3399 NaN 16186 386.18 376.23 18833 2 29.00 11168.0 Disposable Tissue/Poly Flat Stretcher Sheets,B...
3846 3399 NaN 16186 386.18 376.23 18834 9 3.25 11161.0 Standard Facial Tissues
3847 3399 NaN 16186 386.18 376.23 18835 2 19.81 1475.0 Micro-Kill+ Disinfectant Wipes
3848 3399 NaN 16186 386.18 376.23 18836 2 19.80 25612.0 Dynarex 3672, Triangular Bandage 36x36x51 - 12/Bx
3849 3399 NaN 16186 386.18 376.23 18837 1 48.99 25558.0 Dynarex 3596, Athletic Tape, 1½" x 15 y...
3850 3399 NaN 16186 386.18 376.23 18838 1 27.35 2144.0 Caring Non-Sterile Latex Self-Adherent Wrap,Tan
3851 3399 NaN 16186 386.18 376.23 18839 3 17.05 4324.0 Solstice Nitrile Powder-Free Exam Gloves,Dark ...
3852 3399 NaN 16186 386.18 376.23 18840 1 24.10 12539.0 Caring Plastic Adhesive Bandages,Natural,No
3853 3399 NaN 16186 386.18 376.23 18841 1 13.83 25269.0 Dynarex 3102, Stretch Gauze Bandage Roll Non-S...
3854 3399 NaN 16186 386.18 376.23 18842 2 22.17 1629.0 Avant Gauze Non-Woven Sterile Sponges

   Engineer a New Feature for Total Item Price

Engineer the total_Items.price variable as an item's price multiplied by its amount in any given order, and show results for our test id = 16186.

In [5]:
# Add column with total Items.price. 
df["Total_Items.price"] = df["Order_Items.price"] * df["Order_Items.qty"]
df[df["Orders.id"] == 16186]
Out[5]:
Customers.id Customers.company Orders.id Orders.total Orders.subtotal Order_Items.id Order_Items.qty Order_Items.price Products.id Products.name Total_Items.price
3845 3399 NaN 16186 386.18 376.23 18833 2 29.00 11168.0 Disposable Tissue/Poly Flat Stretcher Sheets,B... 58.00
3846 3399 NaN 16186 386.18 376.23 18834 9 3.25 11161.0 Standard Facial Tissues 29.25
3847 3399 NaN 16186 386.18 376.23 18835 2 19.81 1475.0 Micro-Kill+ Disinfectant Wipes 39.62
3848 3399 NaN 16186 386.18 376.23 18836 2 19.80 25612.0 Dynarex 3672, Triangular Bandage 36x36x51 - 12/Bx 39.60
3849 3399 NaN 16186 386.18 376.23 18837 1 48.99 25558.0 Dynarex 3596, Athletic Tape, 1½" x 15 y... 48.99
3850 3399 NaN 16186 386.18 376.23 18838 1 27.35 2144.0 Caring Non-Sterile Latex Self-Adherent Wrap,Tan 27.35
3851 3399 NaN 16186 386.18 376.23 18839 3 17.05 4324.0 Solstice Nitrile Powder-Free Exam Gloves,Dark ... 51.15
3852 3399 NaN 16186 386.18 376.23 18840 1 24.10 12539.0 Caring Plastic Adhesive Bandages,Natural,No 24.10
3853 3399 NaN 16186 386.18 376.23 18841 1 13.83 25269.0 Dynarex 3102, Stretch Gauze Bandage Roll Non-S... 13.83
3854 3399 NaN 16186 386.18 376.23 18842 2 22.17 1629.0 Avant Gauze Non-Woven Sterile Sponges 44.34

We want to know the total revenue per product, and the total quantity of items that were sold for every product. Therefore, we will GROUP ORDERS BY PRODUCT, and CALCULATE THE SUM for Total Item Price and Item quantity:

In [6]:
pop_products = df.groupby("Products.id")["Total_Items.price", "Order_Items.qty"].sum()
pop_products.head(5)
Out[6]:
Total_Items.price Order_Items.qty
Products.id
1.0 68.78 1
19.0 273.24 6
20.0 197.12 2
22.0 29.00 1
30.0 189.56 5

Sort to obtain the Most Popular Product by Sales Quantity:

In [7]:
# Sort results by Item quantity
pop_products.sort_values("Order_Items.qty", ascending = False).head(3)
Out[7]:
Total_Items.price Order_Items.qty
Products.id
1846.0 13705.56 396
2107.0 12542.26 228
1672.0 5559.40 220
In [8]:
# Find product name
df[df["Products.id"] == 1846.0].loc[:,["Products.name"]].head(1)
Out[8]:
Products.name
846 Disposable Emergency Blanket,Not Applicable
In [9]:
# Find product name
df[df["Order_Items.id"] == 18049].loc[:,["Products.name"]].head(1)
Out[9]:
Products.name
3631 NaN

Sort to obtain the Most popular by Revenue:

In [10]:
pop_products.sort_values("Total_Items.price", ascending = False).head(3)
Out[10]:
Total_Items.price Order_Items.qty
Products.id
1846.0 13705.56 396
2107.0 12542.26 228
911.0 8449.00 68
In [11]:
# Find product name
df[df["Products.id"] == 1846.0].loc[:,["Products.name"]].head(1)
Out[11]:
Products.name
846 Disposable Emergency Blanket,Not Applicable

Step 3: Calculate company with maximum purchase

In [12]:
df1 = df.groupby("Customers.company")["Total_Items.price"].sum()
In [13]:
df1.sort_values(ascending = False).head()
Out[13]:
Customers.company
Company59     13186.41
Company343     5628.56
Company17      4167.15
Company281     3061.96
Company145     2696.38
Name: Total_Items.price, dtype: float64

So company 59 is the one with the most sales amount. But we are missing too many companies and if we had more information, we could try to fill the missing values in that column.

Step 4: Item-Based Collaborative Filtering: Given the Most Selling Product, Find Similar Products

First, create a dataframe with Customer id, Product Id, and Total Item Purchases

In [14]:
df_grouped = df.groupby(["Customers.id", "Products.id"])["Order_Items.qty"].sum()
dfg = pd.DataFrame(df_grouped)

dfg.reset_index()
dfg.shape
dfg = dfg.reset_index()
dfg.head()
Out[14]:
Customers.id Products.id Order_Items.qty
0 3 1.0 1
1 3 177.0 4
2 4 983.0 1
3 5 991.0 1
4 7 1379.0 1

Build Pivot Table of Product Quantity Sales, with Customers as rows and Products as Columns

In [15]:
prodsales = dfg.pivot_table(index = ["Customers.id"], columns = ["Products.id"], values = "Order_Items.qty" )
prodsales.head()
Out[15]:
Products.id 1.0 19.0 20.0 22.0 30.0 35.0 62.0 64.0 65.0 74.0 ... 25170.0 25269.0 25356.0 25527.0 25558.0 25612.0 25694.0 25908.0 25920.0 26175.0
Customers.id
3 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 1710 columns

We see below that the best-selling product 1846.0 was purchased by customers 699, 1305 and 1352

In [16]:
prodsales.columns
prod1 = prodsales[1846.0]
prod1.dropna().head(3)
Out[16]:
Customers.id
699      81.0
1305      3.0
1352    300.0
Name: 1846.0, dtype: float64

Fill NAs in order to calculate correlation matrix without errors

In [17]:
prodsales.fillna(0, inplace = True)
prodsales.head()
Out[17]:
Products.id 1.0 19.0 20.0 22.0 30.0 35.0 62.0 64.0 65.0 74.0 ... 25170.0 25269.0 25356.0 25527.0 25558.0 25612.0 25694.0 25908.0 25920.0 26175.0
Customers.id
3 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1710 columns

Try the correlation of best-selling product 1846.0 with all other products

In [18]:
prodsales.corrwith(prod1).sort_values(ascending = False).head()
Out[18]:
Products.id
1846.0     1.000000
1825.0     0.009221
5506.0    -0.000430
1807.0    -0.000430
13301.0   -0.000430
dtype: float64

Interpretation of why product 1825 which is correlated to best-selling product 1846.0

As seen below, product 1825 was purchased by the same customer 1305 as our best-selling product 1846.0. Therefore product 1825 shows up as correlated to 1846.0, in this case because they were purchased by the same people.

In [19]:
# Identify customer who purchased product 1825
prodsales[1825][prodsales[1825]>0]
Out[19]:
Customers.id
1305    1.0
Name: 1825.0, dtype: float64

Build Correlation Matrix for all Products

In [20]:
corrMatrix = prodsales.corr(method='pearson', min_periods = 100)
corrMatrix.dropna().head()
Out[20]:
Products.id 1.0 19.0 20.0 22.0 30.0 35.0 62.0 64.0 65.0 74.0 ... 25170.0 25269.0 25356.0 25527.0 25558.0 25612.0 25694.0 25908.0 25920.0 26175.0
Products.id
1.0 1.000000 -0.000397 -0.000477 -0.000337 -0.000563 -0.000337 -0.000337 -0.000337 -0.000337 -0.000337 ... -0.000337 -0.000337 -0.000337 -0.000337 -0.000337 -0.000337 -0.000337 -0.000337 -0.000337 -0.000337
19.0 -0.000397 1.000000 -0.000562 -0.000397 -0.000662 -0.000397 -0.000397 -0.000397 -0.000397 -0.000397 ... -0.000397 -0.000397 -0.000397 -0.000397 -0.000397 -0.000397 -0.000397 -0.000397 -0.000397 -0.000397
20.0 -0.000477 -0.000562 1.000000 -0.000477 -0.000796 -0.000477 -0.000477 -0.000477 -0.000477 -0.000477 ... -0.000477 -0.000477 -0.000477 -0.000477 -0.000477 -0.000477 -0.000477 -0.000477 -0.000477 -0.000477
22.0 -0.000337 -0.000397 -0.000477 1.000000 -0.000563 -0.000337 -0.000337 -0.000337 -0.000337 -0.000337 ... -0.000337 -0.000337 -0.000337 -0.000337 -0.000337 -0.000337 -0.000337 -0.000337 -0.000337 -0.000337
30.0 -0.000563 -0.000662 -0.000796 -0.000563 1.000000 -0.000563 -0.000563 -0.000563 -0.000563 -0.000563 ... -0.000563 -0.000563 -0.000563 -0.000563 -0.000563 -0.000563 -0.000563 -0.000563 -0.000563 -0.000563

5 rows × 1710 columns

Make a Product Recommendation for a Specific Customer

We will pick Customer id = 400, who likes product 858.0 (as shown below), and then find similar products:

In [21]:
prodsales.head(3)
Out[21]:
Products.id 1.0 19.0 20.0 22.0 30.0 35.0 62.0 64.0 65.0 74.0 ... 25170.0 25269.0 25356.0 25527.0 25558.0 25612.0 25694.0 25908.0 25920.0 26175.0
Customers.id
3 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 rows × 1710 columns

In [22]:
myPurchases = prodsales.iloc[400][prodsales.iloc[400]>0]
myPurchases = pd.DataFrame(myPurchases)
myPurchases.columns = [[ "Quantity Purchased"]]
myPurchases
Out[22]:
Quantity Purchased
Products.id
858.0 1.0

Now we Build a Recommendation List

In [23]:
simCandidates = pd.Series()
for i in range(0, len(myPurchases)):
    print ("Adding similarity for ", myPurchases.index[i])
    sims = corrMatrix[myPurchases.index[i]].dropna()
    sims = sims.map(lambda x: x* myPurchases.iloc[i,0])
    
    # Add the score to the list of similarity candidates
    simCandidates = simCandidates.append(sims)

# Print results
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(10)
Adding similarity for  858.0
Out[23]:
858.0     1.000000
968.0     0.056531
845.0     0.036730
1672.0   -0.002310
153.0    -0.002310
2150.0   -0.002310
831.0    -0.002310
1366.0   -0.002310
1278.0   -0.002310
1509.0   -0.002310
dtype: float64

For each item that the user purchased, what we did was:

  • Retrieve similar products from the correlation matrix
  • Scale the correlation by how many products the user purchased (so the more quantity that the user bought of a product, the more that product counts to recommend similar ones)

We identified products 968.0 and 845.0 as the ones we could recommend.