This Jupyter notebook showcases how to build a Product Purchase Recommendation System. (Please refer to the attached Data Exploration notebook
for a deeper understanding of the Product Orders dataset used here).
The client seeks to increase sales by recommending products that users are likely to purchase, based on data.
1) Check for Missing Values & Understand Data.
2) Identify the Most Popular Product:
=>We do this below by finding the product with MAXIMUM SALES BY VOLUME and BY DOLLAR AMOUNT.
3) Find Company with Maximum Purchase.
4) Given the Best-Selling Product, find Similar Products (using ITEM-BASED COLLABORATIVE FILTERING).
=>This requires calculating the cosine similarity (or alternatively the Pearson correlation, which produces similar results) between the Best-Selling Product and other products.
5) Make a Product Recommendation for a Specific User, based on the user's previous purchases.
This project shows an example of applying ITEM-BASED COLLABORATIVE FILTERING. Given the Best-Selling Product in our dataset, we seek to find similar products by calculating either the cosine similarity or the Pearson correlation between the Best-Selling Product and other products.
To define similarity between products, we build a co-occurrence matrix with users as rows and products as columns, then calculate the correlation of every column (i.e., every product) with every other column. This measures how strongly any two products tend to be purchased by the same users in similar amounts. For example, if a user purchased large amounts of PRODUCT A, we will recommend PRODUCT B if other users who purchased large amounts of PRODUCT A also purchased PRODUCT B.
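As a quick illustration of the claim that cosine similarity and Pearson correlation rank products in much the same way, here is a minimal sketch run on a small made-up user-by-product quantity matrix (the user and product names below are hypothetical and not part of the dataset):
import numpy as np
import pandas as pd
# Toy user-by-product quantity matrix (made-up numbers, for illustration only)
toy = pd.DataFrame(
    {"Product_A": [5, 0, 3, 2], "Product_B": [4, 1, 3, 2], "Product_C": [0, 6, 0, 1]},
    index=["user_1", "user_2", "user_3", "user_4"],
)
def cosine_sim(x, y):
    """Cosine similarity between two purchase-quantity vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
# Cosine similarity of Product_A with the other products
for col in ["Product_B", "Product_C"]:
    print(col, "cosine similarity:", round(cosine_sim(toy["Product_A"], toy[col]), 3))
# Pearson correlation of every product with Product_A (the approach used later in this notebook)
print(toy.corrwith(toy["Product_A"]).round(3))
Both measures rank Product_B as far more similar to Product_A than Product_C, because the same users buy A and B in similar amounts.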
An alternative approach would be to look for similarity between products in terms of product information such as Product Title, Product description, SEO search keywords, or Product Size.
What is important to understand here is why we are looking for similar products. If a customer is trying to purchase product X, we can recommend that they also consider product Y, as long as product Y is similar to product X (a substitute product). A different goal would be to show products that are often purchased together with product X (complementary products), for example when a user has already purchased product X or is adding product X to a current order.
Another consideration is that some products tend to have recurrent purchases while others do not. For example, if someone already bought a wheelchair, that user is unlikely to buy another one, so showing different wheelchair options only makes sense BEFORE the purchase and not afterwards. In this case, suggesting products for wheelchair users or wheelchair purchasers would be more effective.
In this project, the strategy will be to focus on co-occurring purchases -- recommending products that tend to be purchased together by the same users (whether or not the products appear on the same order).
import pandas as pd
df = pd.read_csv("PBL 5 recommendation data.csv", encoding ='iso-8859-1')
df = df[["Customers.id","Customers.customer_type", "Customers.company","Orders.id", "Orders.subtotal","Orders.total", \
"Orders.customer_id", "Orders.company","Products.vendor","Orders.order_number","Order_Items.id", "Order_Items.parent", "Order_Items.qty", "Order_Items.price", "Products.id", "Products.name"]]
df.head(3)
df = df[["Customers.id", "Customers.company","Orders.id", "Orders.total","Orders.subtotal","Order_Items.id", "Order_Items.qty", \
"Order_Items.price", "Products.id", "Products.name"]]
df.head(3)
(Please see the attached Data Exploration notebook for further reference.)
=> The most important features for our model have NO missing data.
=> There are many duplicate columns.
=> We will perform our analysis on a data subset containing only the most important features, as shown below.
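Below is a minimal sketch of the kind of missing-value check referred to above, run on the df subset defined earlier (the exact counts depend on the data file):
# Count and share of missing values per column in the working subset
missing = pd.concat(
    [df.isnull().sum().rename("missing_count"),
     df.isnull().mean().round(3).rename("missing_share")],
    axis=1,
)
print(missing.sort_values("missing_count", ascending=False))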
We select Orders.id = 16186 for visualization and testing purposes throughout this notebook:
dfs = df[df["Orders.id"] == 16186]
dfs
Engineer the Total_Items.price variable as an item's price multiplied by its quantity in a given order, and show the results for our test id = 16186.
# Add column with total Items.price.
df["Total_Items.price"] = df["Order_Items.price"] * df["Order_Items.qty"]
df[df["Orders.id"] == 16186]
We want to know the total revenue per product, and the total quantity of items that were sold for every product. Therefore, we will GROUP ORDERS BY PRODUCT, and CALCULATE THE SUM for Total Item Price and Item quantity:
# Total revenue and total quantity sold per product
pop_products = df.groupby("Products.id")[["Total_Items.price", "Order_Items.qty"]].sum()
pop_products.head(5)
Sort to obtain the Most Popular Product by Sales Quantity:
# Sort results by Item quantity
pop_products.sort_values("Order_Items.qty", ascending = False).head(3)
# Find product name
df[df["Products.id"] == 1846.0].loc[:,["Products.name"]].head(1)
# Find product name (looked up by order item id)
df[df["Order_Items.id"] == 18049].loc[:,["Products.name"]].head(1)
Sort to obtain the Most Popular Product by Revenue:
pop_products.sort_values("Total_Items.price", ascending = False).head(3)
# Find product name
df[df["Products.id"] == 1846.0].loc[:,["Products.name"]].head(1)
# Total revenue per company
df1 = df.groupby("Customers.company")["Total_Items.price"].sum()
df1.sort_values(ascending = False).head()
So company 59 is the one with the highest total sales amount. However, the company column has many missing values; with more information, we could try to fill them in.
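To quantify how much of the company column is missing, a quick check along these lines could be used (a sketch; the numbers depend on the dataset):
# How many rows have no company recorded?
missing_company = df["Customers.company"].isna()
print("Rows without a company:", missing_company.sum(), "out of", len(df))
print("Share of rows without a company:", round(missing_company.mean(), 3))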
# Total quantity purchased per (customer, product) pair
df_grouped = df.groupby(["Customers.id", "Products.id"])["Order_Items.qty"].sum()
dfg = pd.DataFrame(df_grouped)
dfg.reset_index()
dfg.shape
dfg = dfg.reset_index()
dfg.head()
# User-by-product matrix: rows are customers, columns are products, values are quantities purchased
prodsales = dfg.pivot_table(index = ["Customers.id"], columns = ["Products.id"], values = "Order_Items.qty" )
prodsales.head()
prodsales.columns
# Quantities of the best-selling product (1846.0) purchased by each customer
prod1 = prodsales[1846.0]
prod1.dropna().head(3)
# Treat missing entries as zero purchased quantity
prodsales.fillna(0, inplace = True)
prodsales.head()
# Correlate every product column with the best-selling product
prodsales.corrwith(prod1).sort_values(ascending = False).head()
As seen below, product 1825 was purchased by customer 1305, who also purchased our best-selling product 1846.0. Therefore product 1825 shows up as correlated with 1846.0, in this case because both products were purchased by the same customer.
# Identify customer who purchased product 1825
prodsales[1825][prodsales[1825]>0]
# Pairwise Pearson correlations between all product columns;
# min_periods = 100 requires at least 100 overlapping observations per product pair
corrMatrix = prodsales.corr(method='pearson', min_periods = 100)
corrMatrix.dropna().head()
prodsales.head(3)
# Purchases of the user in row position 400 of the user-by-product matrix
myPurchases = prodsales.iloc[400][prodsales.iloc[400] > 0]
myPurchases = pd.DataFrame(myPurchases)
myPurchases.columns = ["Quantity Purchased"]
myPurchases
simCandidates = pd.Series(dtype=float)
for i in range(0, len(myPurchases)):
    print("Adding similarity for", myPurchases.index[i])
    # Correlations of this purchased product with all other products
    sims = corrMatrix[myPurchases.index[i]].dropna()
    # Weight each similarity score by the quantity the user purchased
    sims = sims.map(lambda x: x * myPurchases.iloc[i, 0])
    # Add the scores to the list of similarity candidates
    simCandidates = pd.concat([simCandidates, sims])
# Print results
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(10)
For each item that the user purchased, we retrieved that product's correlations with all other products, weighted them by the quantity the user bought, and combined the weighted scores into a single ranked list of candidates. From this ranking, we identified products 968.0 and 845.0 as the ones we could recommend.
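The per-user steps above could also be wrapped into a small helper for reuse; this is only a sketch, and the function name recommend_for_user is hypothetical:
def recommend_for_user(user_position, prodsales, corrMatrix, top_n=10):
    """Score candidate products for one user (selected by row position),
    using the quantity-weighted item-item correlations built above."""
    purchases = prodsales.iloc[user_position]
    purchases = purchases[purchases > 0]
    scores = []
    for product_id, qty in purchases.items():
        # Correlations of this purchased product with all others, weighted by the quantity bought
        scores.append(corrMatrix[product_id].dropna() * qty)
    candidates = pd.concat(scores).sort_values(ascending=False)
    return candidates.head(top_n)
# Same user as above (row position 400)
recommend_for_user(400, prodsales, corrMatrix)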