This Jupyter notebook showcases how to build a Product Purchase Recommendation System. (Please refer to the attached Data Exploration notebook
for a deeper understanding of the Product Orders dataset used here).

Recommendation System for Product Purchases¶

Step 0: Business Understanding¶

The client seeks to increase sales by recommending products that users are likely to purchase, based on data.

Objectives¶

1) Check for Missing Values & Understand Data.

2) Calculate the Most Popular Product:
=>We do this below by finding the product with MAXIMUM SALES BY VOLUME and BY DOLLAR AMOUNT.

3) Find Company with Maximum Purchase.

4) Given the Most Selling Product, find Similar Products (using ITEM-BASED COLLABORATIVE FILTERING).
=>This requires calculating the cosine similarity (or alternatively the Pearson correlation, which produces similar results) between the Most Selling Product and other products.

5) Make a Product Recommendation for a Specific User, based on the user's previous purchases

Project Overview:¶

This project shows an example of applying ITEM-BASED COLLABORATIVE FILTERING. Given the Most Selling Product in our dataset, we seek to find similar products by calculating either the cosine similarity or the Pearson correlation between the Most Selling Product and other products.
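As a rough illustration of how the two measures relate, the sketch below computes cosine similarity between product columns and shows that the Pearson correlation is the cosine similarity of the mean-centered columns. The data, the frame `m`, and the helper `cosine_matrix` are hypothetical and not part of this project's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical purchase quantities: rows are users, columns are products.
m = pd.DataFrame({"A": [10, 8, 0, 0, 1], "B": [5, 4, 0, 1, 0], "C": [0, 0, 3, 6, 0]})

def cosine_matrix(frame):
    """Cosine similarity between all pairs of columns (hypothetical helper)."""
    x = frame.to_numpy(dtype=float)
    norms = np.linalg.norm(x, axis=0)
    sim = (x.T @ x) / np.outer(norms, norms)
    return pd.DataFrame(sim, index=frame.columns, columns=frame.columns)

cos_sim = cosine_matrix(m)

# Pearson correlation equals the cosine similarity of the mean-centered columns.
pearson_via_cosine = cosine_matrix(m - m.mean())
pearson = m.corr(method="pearson")

print(cos_sim.round(3))
print(pearson.round(3))
```

With very sparse purchase data the column means are close to zero, which is one reason the two measures tend to rank products similarly.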

To define similarity between products, we build a co-occurrence matrix with users as rows and products as columns, then calculate the correlation of all columns (i.e., all products) with each other. This measures the similarity between any two products in terms of how they tend to be purchased by the same users in similar amounts. For example, if a user purchased large amounts of PRODUCT A, we will recommend PRODUCT B if other users who purchased large amounts of PRODUCT A also purchased PRODUCT B.
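A minimal sketch of this idea on made-up data (the users, products, and quantities are hypothetical): products A and B are bought by the same users in proportional amounts, while product C is bought by different users.

```python
import pandas as pd

# Hypothetical toy matrix: rows are users, columns are products, values are quantities.
toy = pd.DataFrame(
    {"A": [10, 8, 0, 0], "B": [5, 4, 0, 0], "C": [0, 0, 3, 6]},
    index=["user1", "user2", "user3", "user4"],
)

# Pearson correlation between every pair of product columns.
toy_corr = toy.corr(method="pearson")

# Products most similar to A, excluding A itself.
similar_to_a = toy_corr["A"].drop("A").sort_values(ascending=False)
print(similar_to_a)
```

Here B ranks first for A because the same users bought both in similar proportions, so B would be the product to recommend to a heavy buyer of A.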

An alternative approach would be to look for similarity between products in terms of product information such as Product Title, Product description, SEO search keywords, or Product Size.

What is important to understand here is, "Why are we looking for similar products?" If a customer is trying to purchase product X, we should recommend that they consider product Y, as long as product Y is similar to product X (a substitute product). A different goal would be to show products that are often purchased together with product X (complementary products), if a user has already purchased product X or is adding product X to a current order.
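For the complementary-products goal, one simple option (not the method used in this notebook) is to count how often two products appear on the same order. The sketch below assumes a hypothetical frame `lines` with one row per product per order, using this notebook's column names.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical order lines: one row per product per order.
lines = pd.DataFrame({
    "Orders.id": [1, 1, 1, 2, 2, 3],
    "Products.id": [10, 20, 30, 10, 20, 30],
})

pair_counts = Counter()
for _, products in lines.groupby("Orders.id")["Products.id"]:
    # Count each unordered product pair once per order.
    for pair in combinations(sorted(set(products)), 2):
        pair_counts[pair] += 1

# Pairs that most often share an order are candidate complementary products.
print(pair_counts.most_common(3))
```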

Another consideration is that some products tend to have recurrent purchases while others do not. For example, if someone already bought a wheelchair, that user is unlikely to buy another one, so showing different wheelchair options only makes sense BEFORE the purchase and not afterwards. In this case, suggesting products for wheelchair users or wheelchair purchasers would be more effective.

In this project, the strategy will be to focus on co-occurring purchases: recommending products that tend to be purchased together by users (whether the products are purchased together on the same order or not).

Step 0: Import and Clean Data¶

Load libraries and dataset¶

import pandas as pd
df = pd.read_csv("PBL 5 recommendation data.csv", encoding ='iso-8859-1')

Select the most important features¶

df = df[["Customers.id","Customers.customer_type", "Customers.company","Orders.id", "Orders.subtotal","Orders.total", \
         "Orders.customer_id", "Orders.company","Products.vendor","Orders.order_number","Order_Items.id", "Order_Items.parent", "Order_Items.qty", "Order_Items.price", "Products.id", "Products.name"]]
df.head(3)

Drop duplicates¶

Orders.id, Order_Items.parent and Orders.order_number are the same
Customers.id is the same as Orders.customer_id
Customers.company is the same as Orders.company
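Before dropping a column as a duplicate, it can be worth confirming that the two columns really match row by row. The sketch below is one way to do that; the helper `same_values` and the toy rows are hypothetical, and missing values in both columns are treated as equal.

```python
import pandas as pd

# Hypothetical rows using this notebook's column names.
toy = pd.DataFrame({
    "Customers.id": [3, 3, 797],
    "Orders.customer_id": [3, 3, 797],
    "Customers.company": ["Company1", None, "Company0"],
    "Orders.company": ["Company1", None, "Company0"],
    "Orders.total": [29.99, 78.73, 64.29],
})

def same_values(frame, col_a, col_b):
    """Return True if two columns match row by row, counting NaN/None pairs as equal."""
    a, b = frame[col_a], frame[col_b]
    return bool(((a == b) | (a.isna() & b.isna())).all())

print(same_values(toy, "Customers.id", "Orders.customer_id"))
print(same_values(toy, "Customers.company", "Orders.company"))
```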

df = df[["Customers.id", "Customers.company","Orders.id", "Orders.total","Orders.subtotal","Order_Items.id", "Order_Items.qty", \
         "Order_Items.price", "Products.id", "Products.name"]]
df.head(3)

Step 1: Data Understanding and Feature Engineering¶

Explore Data

(Please see the attached Data Exploration notebook for further reference)

What does each row in the dataframe represent?
=> As shown below in the section "Visualize a particular order", each row in our dataframe represents a product sale. There are several rows per order, depending on the number of products in that order.

Missing data and Data Cleaning

=> The most important features for our model are NOT missing data.

Should we clean data?

=> There are many duplicate columns.

=> We will perform our analysis on a data subset containing only the most important features, as shown below.
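A quick way to check the missing-data claim is to count missing values per column. The sketch below uses a few hypothetical rows with this notebook's column names; on the real data the same calls would be applied to df.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using this notebook's column names.
toy = pd.DataFrame({
    "Customers.id": [797, 3, 3399],
    "Customers.company": ["Company0", "Company1", np.nan],
    "Order_Items.qty": [1, 4, 2],
    "Products.id": [2310.0, 177.0, 11168.0],
})

# Count and share of missing values per column.
missing = toy.isna().sum()
missing_share = toy.isna().mean()
print(missing)

# Columns the collaborative-filtering model relies on (an assumption of this sketch).
model_cols = ["Customers.id", "Products.id", "Order_Items.qty"]
print(missing[model_cols].sum() == 0)
```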

Visualize a particular order

We select id = 16186 for visualization and testing purposes throughout this notebook:

dfs= df[df["Orders.id"] == 16186] 
dfs

Engineer a New Feature for Total Item Price

Engineer the Total_Items.price variable as an item's price multiplied by its quantity in any given order, and show results for our test id = 16186.

# Add column with total Items.price. 
df["Total_Items.price"] = df["Order_Items.price"] * df["Order_Items.qty"]
df[df["Orders.id"] == 16186]
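As a sanity check on the new feature, the line totals of an order should add up to its Orders.subtotal. The sketch below re-creates the ten item lines of order 16186 from the output shown in this notebook (quantities and prices only) and compares their sum with the subtotal of 376.23.

```python
import pandas as pd

# Item lines of order 16186, copied from the output shown in this notebook.
order = pd.DataFrame({
    "Order_Items.qty": [2, 9, 2, 2, 1, 1, 3, 1, 1, 2],
    "Order_Items.price": [29.00, 3.25, 19.81, 19.80, 48.99, 27.35, 17.05, 24.10, 13.83, 22.17],
})
order["Total_Items.price"] = order["Order_Items.price"] * order["Order_Items.qty"]

# The line totals should add up to the order's Orders.subtotal.
order_subtotal = 376.23
line_total = round(float(order["Total_Items.price"].sum()), 2)
print(line_total, order_subtotal)
```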

Step 2: Calculate the Most Popular Product¶

We want to know the total revenue per product, and the total quantity of items that were sold for every product. Therefore, we will GROUP ORDERS BY PRODUCT, and CALCULATE THE SUM for Total Item Price and Item quantity:

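# Select several columns with a list (double brackets); tuple-style selection is not supported in recent pandas
pop_products = df.groupby("Products.id")[["Total_Items.price", "Order_Items.qty"]].sum()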
pop_products.head(5)

Sort to obtain the Most Popular Product by Sales Quantity:

# Sort results by Item quantity
pop_products.sort_values("Order_Items.qty", ascending = False).head(3)

# Find product name
df[df["Products.id"] == 1846.0].loc[:,["Products.name"]].head(1)

# Find product name
df[df["Order_Items.id"] == 18049].loc[:,["Products.name"]].head(1)

Sort to obtain the Most Popular Product by Revenue:

pop_products.sort_values("Total_Items.price", ascending = False).head(3)

# Find product name
df[df["Products.id"] == 1846.0].loc[:,["Products.name"]].head(1)
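The product id does not have to be read off the sorted table and typed in by hand. A possible alternative, sketched here on hypothetical rows, uses idxmax to get the id of the top product and a lookup Series to get its name.

```python
import pandas as pd

# Hypothetical order lines; column names follow this notebook.
toy = pd.DataFrame({
    "Products.id": [1.0, 2.0, 2.0, 3.0],
    "Products.name": ["Gloves", "Wipes", "Wipes", "Tape"],
    "Order_Items.qty": [1, 5, 7, 2],
    "Total_Items.price": [68.78, 10.0, 14.0, 90.0],
})

totals = toy.groupby("Products.id")[["Total_Items.price", "Order_Items.qty"]].sum()

# idxmax returns the index label (here the product id) of the largest value.
top_by_qty = totals["Order_Items.qty"].idxmax()
top_by_revenue = totals["Total_Items.price"].idxmax()

# Map product ids to names once, then look up the winners.
names = toy.drop_duplicates("Products.id").set_index("Products.id")["Products.name"]
print(names[top_by_qty], names[top_by_revenue])
```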

Step 3: Calculate company with maximum purchase¶

df1 = df.groupby("Customers.company")["Total_Items.price"].sum()

df1.sort_values(ascending = False).head()

Customers.company
Company59     13186.41
Company343     5628.56
Company17      4167.15
Company281     3061.96
Company145     2696.38
Name: Total_Items.price, dtype: float64

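So Company59 is the company with the highest total purchase amount. However, Customers.company is missing for many rows (for example, every row of order 16186 shows NaN), so this ranking covers only part of the sales. With more information, we could try to fill the missing values in that column.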
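To see how much revenue is left out of the ranking, the missing companies can be kept as their own group, or the ranking can be done by Customers.id instead. The sketch below uses hypothetical rows and assumes Customers.id is complete.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using this notebook's column names.
toy = pd.DataFrame({
    "Customers.id": [1, 2, 2, 3],
    "Customers.company": ["Company59", np.nan, np.nan, "Company17"],
    "Total_Items.price": [100.0, 250.0, 50.0, 40.0],
})

# dropna=False keeps the rows without a company as their own NaN group.
by_company = toy.groupby("Customers.company", dropna=False)["Total_Items.price"].sum()
unattributed = by_company[by_company.index.isna()].sum()
print("Revenue with no company:", unattributed)

# Grouping by Customers.id gives a ranking that does not depend on the company column.
by_customer = toy.groupby("Customers.id")["Total_Items.price"].sum()
top_customer = by_customer.idxmax()
print("Top customer id:", top_customer)
```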

Step 4: Item-Based Collaborative Filtering: Given the Most Selling Product, Find Similar Products¶

First, create a dataframe with Customer id, Product Id, and Total Item Purchases¶

df_grouped = df.groupby(["Customers.id", "Products.id"])["Order_Items.qty"].sum()
dfg = pd.DataFrame(df_grouped)

# Move Customers.id and Products.id from the index back into regular columns
dfg = dfg.reset_index()
dfg.head()

Build Pivot Table of Product Quantity Sales, with Customers as rows and Products as Columns¶

prodsales = dfg.pivot_table(index = ["Customers.id"], columns = ["Products.id"], values = "Order_Items.qty" )
prodsales.head()
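Note that pivot_table aggregates with the mean by default; that is harmless here because the quantities were already summed per customer and product. As a sketch on hypothetical rows, the grouping, pivoting and zero-filling can also be done in a single call with aggfunc="sum" and fill_value=0.

```python
import numpy as np
import pandas as pd

# Hypothetical order lines; customer 3 bought product 1.0 on two separate orders.
toy = pd.DataFrame({
    "Customers.id": [3, 3, 4, 4],
    "Products.id": [1.0, 1.0, 1.0, 19.0],
    "Order_Items.qty": [1, 2, 4, 6],
})

# One step: sum quantities per (customer, product) and fill missing pairs with 0.
one_step = toy.pivot_table(index="Customers.id", columns="Products.id",
                           values="Order_Items.qty", aggfunc="sum", fill_value=0)

# Two steps, as in this notebook: group and sum first, then pivot and fill.
grouped = toy.groupby(["Customers.id", "Products.id"])["Order_Items.qty"].sum().reset_index()
two_step = grouped.pivot_table(index="Customers.id", columns="Products.id",
                               values="Order_Items.qty").fillna(0)
print(one_step)
```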

We see below that the best-selling product 1846.0 was purchased by customers 699, 1305 and 1352¶

prodsales.columns
prod1 = prodsales[1846.0]
prod1.dropna().head(3)

Customers.id
699      81.0
1305      3.0
1352    300.0
Name: 1846.0, dtype: float64

Fill NAs in order to calculate correlation matrix without errors¶

prodsales.fillna(0, inplace = True)
prodsales.head()

Try the correlation of best-selling product 1846.0 with all other products¶

# Re-select the column after fillna so that the zero-filled values are used
prodsales.corrwith(prodsales[1846.0]).sort_values(ascending = False).head()

Products.id
1846.0     1.000000
1825.0     0.009221
5506.0    -0.000430
1807.0    -0.000430
13301.0   -0.000430
dtype: float64

Interpretation of why product 1825 is correlated with best-selling product 1846.0¶

As seen below, product 1825 was purchased by customer 1305, who also purchased our best-selling product 1846.0. Therefore product 1825 shows up as correlated with 1846.0, although the correlation (0.009221) is very weak, since the two products share only this one customer.

# Identify customer who purchased product 1825
prodsales[1825][prodsales[1825]>0]

Customers.id
1305    1.0
Name: 1825.0, dtype: float64
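The many identical small negative values in the correlation output (e.g., -0.000430) are typical of a sparse, zero-filled matrix. As a sketch under simplified assumptions (hypothetical data, each product bought by exactly one customer), two products bought by different single customers out of N have a Pearson correlation of -1/(N-1), whatever the quantities.

```python
import numpy as np
import pandas as pd

n_customers = 1000  # hypothetical number of rows in the pivot table
toy = pd.DataFrame(0.0, index=range(n_customers), columns=["P1", "P2", "P3"])
toy.loc[0, "P1"] = 5.0  # P1 bought only by customer 0
toy.loc[1, "P2"] = 2.0  # P2 bought only by customer 1
toy.loc[0, "P3"] = 1.0  # P3 bought only by customer 0, like P1

corr = toy.corr()

# Different single buyers: a tiny negative value that depends only on N.
print(corr.loc["P1", "P2"], -1 / (n_customers - 1))
# Same single buyer: perfect correlation.
print(corr.loc["P1", "P3"])
```

These near-zero negatives therefore mean "no shared customers" rather than a real negative relationship between the products.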

Build Correlation Matrix for all Products¶

corrMatrix = prodsales.corr(method='pearson', min_periods = 100)
corrMatrix.dropna().head()
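One caveat on min_periods: it counts rows where both columns are non-missing. Because the NAs were filled with 0 earlier, every customer counts as an observation for every pair of products, so the threshold no longer filters on actual co-purchases. The sketch below shows the difference on hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical quantities with missing values where a customer did not buy a product.
raw = pd.DataFrame({
    "P1": [1.0, 2.0, 3.0, np.nan],
    "P2": [2.0, 4.0, np.nan, np.nan],
    "P3": [1.0, 3.0, 2.0, 5.0],
})

# P1 and P2 overlap on 2 rows, P1 and P3 on 3 rows.
with_nan = raw.corr(min_periods=3)
filled = raw.fillna(0).corr(min_periods=3)

print(with_nan)  # pairs with fewer than 3 shared rows are NaN
print(filled)    # no NaN: after fillna(0) every pair has all 4 rows
```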

Make a Product Recommendation for a Specific Customer¶

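We will pick the customer at row position 400 of prodsales (selected with iloc, so this is not necessarily Customers.id = 400), who purchased product 858.0 (as shown below), and then find similar products:¶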

prodsales.head(3)

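# Purchases of the customer at row position 400 (iloc is position-based, not Customers.id)
myPurchases = prodsales.iloc[400][prodsales.iloc[400]>0]
myPurchases = pd.DataFrame(myPurchases)
# Use a flat list of names; a nested list would create a MultiIndex
myPurchases.columns = ["Quantity Purchased"]
myPurchases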
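Since prodsales is indexed by Customers.id, iloc[400] selects the row at position 400, while loc would select the row whose Customers.id equals a given value. A small sketch of the difference, on a hypothetical pivot table:

```python
import pandas as pd

# Hypothetical pivot table indexed by Customers.id (labels are not 0, 1, 2, ...).
toy = pd.DataFrame(
    {858.0: [0.0, 2.0, 0.0], 968.0: [1.0, 0.0, 0.0]},
    index=pd.Index([3, 400, 797], name="Customers.id"),
)

by_position = toy.iloc[1]  # second row, whatever its label is
by_label = toy.loc[400]    # row whose Customers.id equals 400

# Keep only the products this customer actually purchased.
purchases = by_label[by_label > 0].rename("Quantity Purchased").to_frame()
print(purchases)
```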

Now we Build a Recommendation List¶

simList = []
for i in range(0, len(myPurchases)):
    print ("Adding similarity for ", myPurchases.index[i])
    sims = corrMatrix[myPurchases.index[i]].dropna()
    # Weight each correlation by the quantity the customer purchased
    sims = sims.map(lambda x: x* myPurchases.iloc[i,0])
    
    # Add the scores to the list of similarity candidates
    simList.append(sims)

# Series.append was removed in pandas 2.0; combine the pieces with pd.concat
simCandidates = pd.concat(simList)

# Print results
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(10)

Adding similarity for  858.0

858.0     1.000000
968.0     0.056531
845.0     0.036730
1672.0   -0.002310
153.0    -0.002310
2150.0   -0.002310
831.0    -0.002310
1366.0   -0.002310
1278.0   -0.002310
1509.0   -0.002310
dtype: float64

For each item that the user purchased, what we did was:

1) Retrieve similar products from the correlation matrix.
2) Scale the correlation by the quantity of that item the user purchased (so the more of a product the user bought, the more it counts towards recommending similar ones).

Product 858.0 tops the list only because it is the item the customer already purchased (its correlation with itself is 1). Excluding it, we identified products 968.0 and 845.0 as the ones we could recommend.
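When a customer has purchased several products, the same candidate can appear more than once in the list, and the purchased products themselves rank at the top. One possible refinement, sketched below with a hypothetical helper `recommend` and made-up correlation values, sums the scores per candidate and drops the items already purchased.

```python
import pandas as pd

def recommend(corr_matrix, purchases, top_n=10):
    """Score each candidate as the sum of correlation * purchased quantity (hypothetical helper)."""
    scored = [corr_matrix[item].dropna() * qty for item, qty in purchases.items()]
    if not scored:
        return pd.Series(dtype="float64")
    candidates = pd.concat(scored)
    # A candidate similar to several purchased items appears several times: add the scores.
    candidates = candidates.groupby(level=0).sum()
    # Do not recommend what the customer already bought.
    candidates = candidates.drop(labels=list(purchases.index), errors="ignore")
    return candidates.sort_values(ascending=False).head(top_n)

# Hypothetical correlation matrix and purchase quantities (not the notebook's real values).
items = [858.0, 968.0, 845.0, 1672.0]
toy_corr = pd.DataFrame(
    [[1.0, 0.06, 0.04, -0.01],
     [0.06, 1.0, 0.02, 0.0],
     [0.04, 0.02, 1.0, 0.0],
     [-0.01, 0.0, 0.0, 1.0]],
    index=items, columns=items,
)
my_purchases = pd.Series({858.0: 2.0})

recs = recommend(toy_corr, my_purchases)
print(recs)
```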

Output of df.head(3) after dropping the duplicate columns:
	Customers.id	Customers.company	Orders.id	Orders.total	Orders.subtotal	Order_Items.id	Order_Items.qty	Order_Items.price	Products.id	Products.name
0	797	Company0	3758	64.29	57.20	5284	1	57.20	2310.0	Basic Steel Rollators,Green
1	3	Company1	23	29.99	20.00	31	4	5.00	177.0	Urinary Drain Bags
2	3	Company1	9531	78.73	68.78	11655	1	68.78	1.0	SensiCare Nitrile Exam Gloves,Blue,XX-Large

Output for order 16186 (Orders.id == 16186):
	Customers.id	Customers.company	Orders.id	Orders.total	Orders.subtotal	Order_Items.id	Order_Items.qty	Order_Items.price	Products.id	Products.name
3845	3399	NaN	16186	386.18	376.23	18833	2	29.00	11168.0	Disposable Tissue/Poly Flat Stretcher Sheets,B...
3846	3399	NaN	16186	386.18	376.23	18834	9	3.25	11161.0	Standard Facial Tissues
3847	3399	NaN	16186	386.18	376.23	18835	2	19.81	1475.0	Micro-Kill+ Disinfectant Wipes
3848	3399	NaN	16186	386.18	376.23	18836	2	19.80	25612.0	Dynarex 3672, Triangular Bandage 36x36x51 - 12/Bx
3849	3399	NaN	16186	386.18	376.23	18837	1	48.99	25558.0	Dynarex 3596, Athletic Tape, 1½" x 15 y...
3850	3399	NaN	16186	386.18	376.23	18838	1	27.35	2144.0	Caring Non-Sterile Latex Self-Adherent Wrap,Tan
3851	3399	NaN	16186	386.18	376.23	18839	3	17.05	4324.0	Solstice Nitrile Powder-Free Exam Gloves,Dark ...
3852	3399	NaN	16186	386.18	376.23	18840	1	24.10	12539.0	Caring Plastic Adhesive Bandages,Natural,No
3853	3399	NaN	16186	386.18	376.23	18841	1	13.83	25269.0	Dynarex 3102, Stretch Gauze Bandage Roll Non-S...
3854	3399	NaN	16186	386.18	376.23	18842	2	22.17	1629.0	Avant Gauze Non-Woven Sterile Sponges

Output for order 16186 after adding the Total_Items.price column:
	Customers.id	Customers.company	Orders.id	Orders.total	Orders.subtotal	Order_Items.id	Order_Items.qty	Order_Items.price	Products.id	Products.name	Total_Items.price
3845	3399	NaN	16186	386.18	376.23	18833	2	29.00	11168.0	Disposable Tissue/Poly Flat Stretcher Sheets,B...	58.00
3846	3399	NaN	16186	386.18	376.23	18834	9	3.25	11161.0	Standard Facial Tissues	29.25
3847	3399	NaN	16186	386.18	376.23	18835	2	19.81	1475.0	Micro-Kill+ Disinfectant Wipes	39.62
3848	3399	NaN	16186	386.18	376.23	18836	2	19.80	25612.0	Dynarex 3672, Triangular Bandage 36x36x51 - 12/Bx	39.60
3849	3399	NaN	16186	386.18	376.23	18837	1	48.99	25558.0	Dynarex 3596, Athletic Tape, 1½" x 15 y...	48.99
3850	3399	NaN	16186	386.18	376.23	18838	1	27.35	2144.0	Caring Non-Sterile Latex Self-Adherent Wrap,Tan	27.35
3851	3399	NaN	16186	386.18	376.23	18839	3	17.05	4324.0	Solstice Nitrile Powder-Free Exam Gloves,Dark ...	51.15
3852	3399	NaN	16186	386.18	376.23	18840	1	24.10	12539.0	Caring Plastic Adhesive Bandages,Natural,No	24.10
3853	3399	NaN	16186	386.18	376.23	18841	1	13.83	25269.0	Dynarex 3102, Stretch Gauze Bandage Roll Non-S...	13.83
3854	3399	NaN	16186	386.18	376.23	18842	2	22.17	1629.0	Avant Gauze Non-Woven Sterile Sponges	44.34

Output of pop_products.head(5):
	Total_Items.price	Order_Items.qty
Products.id
1.0	68.78	1
19.0	273.24	6
20.0	197.12	2
22.0	29.00	1
30.0	189.56	5

Output of pop_products.sort_values(...).head(3), sorted in descending order:
	Total_Items.price	Order_Items.qty
Products.id
1846.0	13705.56	396
2107.0	12542.26	228
1672.0	5559.40	220

Output of corrMatrix.dropna().head():
Products.id	1.0	19.0	20.0	22.0	30.0	35.0	62.0	64.0	65.0	74.0	...	25170.0	25269.0	25356.0	25527.0	25558.0	25612.0	25694.0	25908.0	25920.0	26175.0
Products.id
1.0	1.000000	-0.000397	-0.000477	-0.000337	-0.000563	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337	...	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337
19.0	-0.000397	1.000000	-0.000562	-0.000397	-0.000662	-0.000397	-0.000397	-0.000397	-0.000397	-0.000397	...	-0.000397	-0.000397	-0.000397	-0.000397	-0.000397	-0.000397	-0.000397	-0.000397	-0.000397	-0.000397
20.0	-0.000477	-0.000562	1.000000	-0.000477	-0.000796	-0.000477	-0.000477	-0.000477	-0.000477	-0.000477	...	-0.000477	-0.000477	-0.000477	-0.000477	-0.000477	-0.000477	-0.000477	-0.000477	-0.000477	-0.000477
22.0	-0.000337	-0.000397	-0.000477	1.000000	-0.000563	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337	...	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337	-0.000337
30.0	-0.000563	-0.000662	-0.000796	-0.000563	1.000000	-0.000563	-0.000563	-0.000563	-0.000563	-0.000563	...	-0.000563	-0.000563	-0.000563	-0.000563	-0.000563	-0.000563	-0.000563	-0.000563	-0.000563	-0.000563

Output of prodsales.head() after filling NAs with 0:
Products.id	1.0	19.0	20.0	22.0	30.0	35.0	62.0	64.0	65.0	74.0	...	25170.0	25269.0	25356.0	25527.0	25558.0	25612.0	25694.0	25908.0	25920.0	26175.0
Customers.id
3	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
5	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
7	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
8	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

Output of prodsales.head(3):
Products.id	1.0	19.0	20.0	22.0	30.0	35.0	62.0	64.0	65.0	74.0	...	25170.0	25269.0	25356.0	25527.0	25558.0	25612.0	25694.0	25908.0	25920.0	26175.0
Customers.id
3	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
5	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0