Recommendation systems have become part of our everyday digital world. Large-scale recommender systems power e-commerce sites, video-on-demand platforms, music streaming services and many other products. As each individual's digital footprint grows, companies can understand user behaviour better than ever before. With this data, it becomes much easier for companies to provide personalized experiences and to recommend new content and products that are relevant to each user, based on that user's previous activity.
According to Wikipedia, “A recommender system, or a recommendation system (sometimes replacing ‘system’ with a synonym such as platform or engine), is a subclass of information filtering system that seeks to predict the ‘rating’ or ‘preference’ a user would give to an item. They are primarily used in commercial applications.” In brief, recommendation systems are algorithms that suggest relevant items to users, where the items may be movies to watch, text to read, products to buy, advertisements or anything else, depending on the industry.
There is a difference between data mining techniques, such as the Apriori algorithm used in Market Basket Analysis, and machine-learning-based recommendation systems. The Apriori algorithm is limited to finding frequent itemsets in a dataset for Boolean association rules, using prior knowledge of frequent-itemset properties. Applied to the retail world, this association rule mining is called Market Basket Analysis. Association rules do not capture an individual’s preferences; instead, they find relationships between sets of items across all transactions. This is what distinguishes them from recommendation systems, which are based on each user’s own preferences. To see the difference, look at the “Frequently Bought Together” and “Customers who bought this item also bought” sections on each product page on Amazon. The products shown under “Frequently Bought Together” are based on data mining rules, whereas the products shown under “Customers who bought this item also bought” come from the recommendation system.
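To make the idea of association rules more concrete, here is a minimal sketch that computes support and confidence for item pairs. It is a simplification and not the full Apriori algorithm (it only looks at pairs and does no candidate pruning). The transactions, the function name `pair_rules` and the `min_support` threshold are all made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy transactions (made-up data, for illustration only)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def pair_rules(transactions, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(transactions)
    # How many transactions contain each single item
    item_counts = Counter(item for t in transactions for item in t)
    # How many transactions contain each unordered pair of items
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is not frequent enough
        # confidence(a -> b) = count(a and b) / count(a)
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Highest confidence first; ties are broken alphabetically
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))

rules = pair_rules(transactions)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Note that nothing in this computation refers to a particular customer: the rules describe the transactions as a whole, which is exactly why they differ from a personalized recommender.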
Recommendation systems can generate substantial revenue for companies and help them stand out from their competitors. On Kaggle and other platforms, there are competitions worth millions of dollars to build better recommendation systems, which shows how important such systems are.
Social media data is also quite useful for predicting what users are likely to do next based on their current activity. For example, Spotify can predict whether a user will like or dislike a certain song and make suggestions accordingly.
We at ALPES have worked intensively on different types of recommendation systems. The two main types are collaborative filtering methods and content-based methods. In this article, we go through both and, to keep things simple, use movie recommendation as the running use case.
Collaborative Filtering
Collaborative filtering systems are based on past/historic interactions recorded between users and items (clicked, watched, purchased, liked, rated, etc.) in order to produce new recommendations. These interactions are stored in the user-item interactions matrix.
For example, let’s say user X likes romantic movies and suspense thrillers a lot. User Y also enjoys romantic movies but has never watched a suspense thriller. The collaborative filter will recommend suspense thrillers to user Y, based on the taste for romantic movies that the two users share. This scenario can go two ways: either user Y finds out that he/she likes suspense thrillers a lot, in which case there are a lot of new things to add to his/her watch list, or user Y really doesn’t enjoy suspense thrillers, in which case the recommendation has not been successful.
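The X/Y example above can be sketched as a small user-based collaborative filter. This is only an illustration under simple assumptions: the interaction matrix is made up, interactions are binary, similarity is cosine similarity between user rows, and the function name `recommend_for` is hypothetical. It also assumes every user has at least one interaction (otherwise the norm would be zero).

```python
import numpy as np

# Toy user-item interaction matrix (made-up data): rows are users, columns are genres.
# 1 means the user liked the genre, 0 means no recorded interaction.
users = ["X", "Y", "Z"]
items = ["Romance", "Suspense thriller", "Documentary"]
interactions = np.array([
    [1, 1, 0],  # X likes romance and suspense thrillers
    [1, 0, 0],  # Y likes romance only
    [0, 0, 1],  # Z likes documentaries only
], dtype=float)

def recommend_for(user, k=1):
    """Recommend up to k unseen items, weighted by the tastes of similar users."""
    u = users.index(user)
    norms = np.linalg.norm(interactions, axis=1)
    # Cosine similarity between the target user and every user
    sims = interactions @ interactions[u] / (norms * norms[u])
    sims[u] = 0.0  # a user should not vote for themselves
    # Score each item by the similarity-weighted sum of other users' interactions
    scores = sims @ interactions
    scores[interactions[u] > 0] = -np.inf  # hide items the user already interacted with
    top = np.argsort(-scores)[:k]
    return [items[i] for i in top]

print(recommend_for("Y"))
```

Because Y shares a taste for romance with X, the item X liked and Y has not seen gets the highest score. Whether Y actually enjoys it is something only later feedback can tell.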
Content Based
Unlike collaborative filtering, which relies only on user-item interactions, content-based approaches use additional information about users and/or items. In a movie recommender system, this additional information can be the age, gender, job or any other personal information for users, as well as the category, the main actors, the duration or other characteristics for the movies.
Compared with the example above, here we take only the content of the movie into consideration, so user Y will keep getting recommendations for romantic films or similar ones. Of course, there are many attributes on which we can calculate similarity: in the case of movies, we can build our recommender system on genre only, or we may want to include the director, the main actors and so on.
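As a minimal sketch of a genre-only content-based recommender, the snippet below describes each movie by binary genre flags and ranks the other movies by cosine similarity. The catalogue, the genre list and the function name `similar_movies` are invented for illustration.

```python
import numpy as np

# Made-up catalogue: each movie is described by binary flags for the genres below.
genres = ["Romance", "Thriller", "Comedy"]
movies = {
    "Movie A": [1, 0, 0],  # pure romance
    "Movie B": [1, 0, 1],  # romantic comedy
    "Movie C": [0, 1, 0],  # pure thriller
    "Movie D": [1, 1, 0],  # romantic thriller
}

def similar_movies(title, k=2):
    """Return the k movies whose genre vectors are most similar to `title`."""
    names = list(movies)
    feats = np.array([movies[n] for n in names], dtype=float)
    # Normalise each row so that a dot product equals cosine similarity
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ feats[names.index(title)]
    order = np.argsort(-sims, kind="stable")
    return [names[i] for i in order if names[i] != title][:k]

print(similar_movies("Movie A"))
```

A user who watched the pure romance film is only shown films that share the romance genre; the pure thriller never appears, which matches the behaviour described above. Adding director or actor columns would simply extend the feature vectors.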
Now, let’s focus on how to build a recommendation system from scratch using ALPES SNN.
For a content-based filtering approach, we need to convert the text describing each movie into vector form and then find the closest matches to the vector of a given input movie title. Those closest matches are our recommendations.
- Data Fetching:
I have taken a dataset from Kaggle (https://www.kaggle.com/rounakbanik/the-movies-dataset). According to its metadata, this dataset has over 26 million ratings from over 270,000 users on 45,000 movies. To keep this blog as simple as possible, I am using only “links_small.csv” and “movies_metadata.csv”, which restricts the data to the 9099 movies listed in the “links_small” file. Download these two files and place them in the working folder.
- Data Loading:
We load the required data files and process the data by removing null values, if any. This is a pre-processing step. We choose “description” as the feature for this recommendation system and convert it to vectors in the next steps.
Code:
Step 1:
# Reading the CSV files
import pandas as pd
data = pd.read_csv('./movies_metadata.csv')
data_links = pd.read_csv('./links_small.csv')
print(data.shape, data_links.shape)
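Step 2:
# A lookup operation on all movies that are present in the links_small dataset
# Some 'id' values may not be valid integers, so coerce those to NaN and drop the rows
data['id'] = pd.to_numeric(data['id'], errors='coerce')
data = data.dropna(subset=['id']).astype({'id': int})
data_process = data[data['id'].isin(data_links['tmdbId'])]
print(data_process.shape)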
Step 3:
# Work on a copy so the assignments below do not write into a slice of `data`
data_process = data_process.copy()
# Replacing null values, if any, with empty strings
data_process['tagline'] = data_process['tagline'].fillna('')
data_process['overview'] = data_process['overview'].fillna('')
# Merging overview and tagline together, separated by a space
data_process['description'] = data_process['overview'] + ' ' + data_process['tagline']
- Feature Vectorization:
We take the pre-processed data and use “description” as the feature. We now use the spaCy library to convert the text into real-valued vectors. We have chosen the pre-trained “en_core_web_lg” spaCy model to generate the vectors because it has a larger vocabulary than the smaller spaCy models. It returns a 300-dimensional vector for each movie description (by default, spaCy computes a document vector as the average of its token vectors).
Code:
Step 1:
# Loading the model
# (it must be installed first: python -m spacy download en_core_web_lg)
import spacy
nlp = spacy.load("en_core_web_lg")
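Step 2:
# Generating vectors
import numpy as np
from tqdm import tqdm
# Collect one 300-dimensional vector per description, then stack them into one array
vectors = []
for text in tqdm(data_process['description']):
    doc = nlp(text)
    vectors.append(doc.vector)
# Shape: (number of movies, 300), with rows in the same order as data_process
data_vectors = np.vstack(vectors)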
- Training SNN:
With the above vectors and data, we train the SNN. Using the trained model, we then query it with a movie name to get the recommendations.
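To illustrate what such a query amounts to, here is a rough stand-in that uses a brute-force cosine-similarity search over the vectors. This is not the ALPES SNN and says nothing about its API; the function name `recommend` and its parameters are hypothetical, and the demo data is made up.

```python
import numpy as np

def recommend(title, titles, vectors, k=5):
    """Return the k titles whose vectors are closest (by cosine similarity) to `title`."""
    titles = list(titles)
    idx = titles.index(title)  # raises ValueError if the title is unknown
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against all-zero vectors
    sims = unit @ unit[idx]
    sims[idx] = -np.inf  # never recommend the query movie itself
    top = np.argsort(-sims)[:k]
    return [titles[i] for i in top]

# Tiny made-up example with 3-dimensional vectors instead of 300-dimensional ones
demo_titles = ["A", "B", "C", "D"]
demo_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 0.0],  # e.g. a movie with an empty description
]
print(recommend("A", demo_titles, demo_vectors, k=2))
```

With the variables from this article, a call could look like recommend(some_title, data_process['title'], data_vectors), provided that data_vectors has exactly one row per row of data_process, in the same order, and that some_title occurs in the title column. A brute-force search like this compares the query with every movie, which is fine for a few thousand titles but is the part a dedicated nearest-neighbour model is meant to speed up.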