Updated: Aug 7
In this blog post, we will delve into the world of video recommendations for a short video platform. With an extensive inventory and active user engagement, recommending videos becomes an exciting challenge. We will explore the strategies and techniques employed to curate personalized video recommendations that captivate users and keep them coming back for more. Let's dive in!
1. Introduction and Overview
What are recommendation systems?
A recommendation system predicts the “rating” or “preference” a user would give to an item. Better predictions increase user engagement and purchases, which in turn drives revenue.
The purpose of the blog post is to explain how a recommendation system can be built for short video content.
There are generally two types of recommendation systems:
Content-based recommendation: recommends videos based on user attributes (location, language, gender, etc.) or video attributes (caption, category, hashtags, etc.).
Collaborative filtering: recommends videos based on the behaviour of similar users.
Let U denote a user and V a video. If U1 likes/views V1, V2, and V3, U2 likes/views V1, V3, and V4, and U3 likes/views only V1, we can recommend V3 to U3, since V3 co-occurs with V1 in the histories of both U1 and U2.
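The intuition above can be sketched as a tiny co-occurrence count. This is a minimal pure-Python illustration, not the production algorithm; the toy like/view data below is made up:

```python
from collections import Counter

# Toy like/view history (illustrative data only)
history = {
    "U1": {"V1", "V2", "V3"},
    "U2": {"V1", "V3", "V4"},
    "U3": {"V1"},
}

def recommend(user, history, top_n=1):
    """Recommend unseen videos that co-occur most often with the user's videos."""
    seen = history[user]
    scores = Counter()
    for other, videos in history.items():
        if other == user:
            continue
        # Count each unseen video once per co-viewer who shares a video with `user`
        if seen & videos:
            for v in videos - seen:
                scores[v] += 1
    return [v for v, _ in scores.most_common(top_n)]

print(recommend("U3", history))  # → ['V3']
```

V3 scores 2 (it appears alongside V1 for both U1 and U2), beating V2 and V4, which each score 1.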
1.2 Problem Statement
To recommend videos similar to those a user has viewed or liked. We will use collaborative filtering and retrain the model every hour on the most recent 10 days of data.
We will extract the user view history for the most recent 10 days, train a collaborative filtering model, and store the similar videos for each video in a collection. The script will be scheduled to run every hour.
For a user, we check whether they have any view history in the past 10 days; if so, we extract the recently viewed videos and fetch their similar videos.
We then filter the similar videos based on postType and recency, giving more preference to recent videos and to videos from Influencers and Premium users.
1.4 Business metric
The business metric is the user engagement rate, which is determined by the following factors:
Total number of videos viewed by users.
Total time spent watching videos.
Total number of videos viewed, excluding scroll-past views (views with a watch duration of 0).
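The three factors above can be aggregated directly from visit records. A minimal sketch, assuming visit records shaped like documents in the Visit collection described later (the field names `userId`, `postId`, `duration` match that collection; the sample data is made up):

```python
def engagement_metrics(visits):
    """Aggregate the three engagement factors from raw visit records."""
    total_views = len(visits)
    total_watch_time = sum(v["duration"] for v in visits)
    # Views with duration 0 are scroll-pasts, not real views
    engaged_views = sum(1 for v in visits if v["duration"] > 0)
    return total_views, total_watch_time, engaged_views

visits = [
    {"userId": "U1", "postId": "V1", "duration": 12},
    {"userId": "U1", "postId": "V2", "duration": 0},
    {"userId": "U2", "postId": "V1", "duration": 7},
]
print(engagement_metrics(visits))  # → (3, 19, 2)
```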
1.5 Business Constraint
Latency requirements: The API should have a response time of less than 100ms.
2. Data Analysis
2.1 Understanding Data
Description: The data is present in MongoDB and the following collections were used:
User Collection: Contains all the user information, including the following fields.
'_id', 'isAdult', 'postLanguage', 'accountStatus', 'following_count', 'followers_count', 'post_count', 'view_count', 'like_count', 'saved_count', 'comment_count', 'isPremium', 'isProfileVerified', 'isInfluencer', 'isStarApplicant', 'gender', 'language', 'location', 'created_at'
Post Collection: Contains all the information about a video, including the fields below.
['_id', 'userId', 'categoryId', 'subCategoryId', 'shareCount', 'likeCount', 'commentCount', 'viewsCount', 'reportPostsCount', 'status', 'language', 'created_at', 'createdBy', 'accessibility', 'postLink', 'hashtag', 'caption', 'mediaLocation', 'gender', 'logoStatus', 'hashtagIds', 'postType', 'uploadSource', 'isModerated', 'ownerData', 'song', 'isSpam', 'isPremium', 'country' ]
Visit Collection: Contains posts visited by the user with timestamps.
'_id', 'user', 'duration', 'userId', 'postId', 'createdAt', 'updatedAt'
2.2 Exploratory Data Analysis
We checked the distribution of postType and found that there are fewer videos from Influencers and Premium users than normal users. We will give more preference to Influencers and Premium videos based on this distribution.
We checked the number of unique registered users per day and whether they have a view history in the past 10 days. We saw a ratio of 7:3, where 7 is the share of unique users who had a view history.
We also checked for statistical description like the average view time of a user in one month, the average number of videos viewed by a user, gender distribution, unregistered to registered user ratio, etc.
2.3 Data Preprocessing
Below are the steps used:
We extract the most recent 10 days of visit data from the Visit collection, which contains the postIds visited by each userId along with the duration and timestamp of each view.
Use multiprocessing to extract the data.
Drop records with null userIds.
Group by userId to get each user's list of videos viewed, with durations and timestamps, and store the result in a separate Mongo collection (userHistory). The reason for storing user history separately when we already have the Visit collection is explained in section 3.3.
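The steps above can be sketched as a pure-Python stand-in for the Mongo aggregation (field names follow the Visit collection; the sample records are made up, and in production this would run via pymongo with multiprocessing):

```python
from collections import defaultdict

def build_user_history(visits):
    """Group visit records by userId, dropping records with a null userId.

    Mirrors the userHistory documents: one entry per user with parallel
    lists of postIds, durations, and timestamps.
    """
    history = defaultdict(lambda: {"postIds": [], "durations": [], "timestamps": []})
    for v in visits:
        if v.get("userId") is None:  # drop null userIds
            continue
        h = history[v["userId"]]
        h["postIds"].append(v["postId"])
        h["durations"].append(v["duration"])
        h["timestamps"].append(v["createdAt"])
    return dict(history)

visits = [
    {"userId": "U1", "postId": "V1", "duration": 5, "createdAt": "2023-08-01T10:00"},
    {"userId": None, "postId": "V2", "duration": 3, "createdAt": "2023-08-01T10:01"},
    {"userId": "U1", "postId": "V3", "duration": 8, "createdAt": "2023-08-01T10:02"},
]
print(build_user_history(visits)["U1"]["postIds"])  # → ['V1', 'V3']
```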
LightFM is a Python implementation of a number of popular recommendation algorithms for both implicit and explicit feedback. It’s easy to use, fast (via multithreaded model estimation), and produces high-quality results. We will train a Collaborative Filtering model using the LightFM library.
1. Preprocessing: We have extracted the visit data, which contains userId and postId. We keep only records with a duration of at least 2 seconds: a duration of 0 means the user simply scrolled past the video, and including such records would hurt model performance. We then create a dataset object and an interaction matrix.
2. Training the model: We now train the model on the interaction matrix and obtain the user and item mappings.
3. Generating similar videos: Since we also have item embeddings and user embeddings, we follow item-item similarity and save the similar videos corresponding to each video.
Since there are millions of videos, running an exact KNN to find the nearest neighbours is not feasible; instead, we use Annoy, an implementation of Approximate Nearest Neighbours (ANN).
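The duration filter and interaction matrix from step 1 can be sketched as follows. This is a pure-Python stand-in for illustration only; in practice LightFM's `Dataset` builds a sparse interactions matrix, and the sample visits are made up:

```python
def build_interactions(visits, min_duration=2):
    """Filter scroll-past views and build a dense user x item interaction matrix."""
    kept = [(v["userId"], v["postId"]) for v in visits if v["duration"] >= min_duration]
    users = sorted({u for u, _ in kept})
    items = sorted({i for _, i in kept})
    u_idx = {u: r for r, u in enumerate(users)}
    i_idx = {i: c for c, i in enumerate(items)}
    matrix = [[0] * len(items) for _ in users]
    for u, i in kept:
        matrix[u_idx[u]][i_idx[i]] = 1
    return users, items, matrix

visits = [
    {"userId": "U1", "postId": "V1", "duration": 10},
    {"userId": "U1", "postId": "V2", "duration": 0},   # scrolled past, dropped
    {"userId": "U2", "postId": "V1", "duration": 4},
]
users, items, matrix = build_interactions(visits)
print(items, matrix)  # → ['V1'] [[1], [1]]
```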
For each video:
Get the 200 most similar videos using Annoy.
Get the 20 most similar videos out of those 200 using cosine similarity.
Save the postId along with the 20 similar postIds, createdAt, and postType in a new Mongo collection, ‘Recommendations’.
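The two-stage neighbour search can be sketched as a cosine re-rank over a candidate set. This is a pure-Python illustration only: in production the 200 candidates come from an Annoy index rather than a hand-picked list, and the embeddings come from the trained LightFM model (the 2-d vectors below are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_similar(target_id, embeddings, candidates, k=20):
    """Re-rank ANN candidates by exact cosine similarity and keep the top k."""
    target = embeddings[target_id]
    scored = [(c, cosine(target, embeddings[c])) for c in candidates if c != target_id]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [c for c, _ in scored[:k]]

# Toy 2-d item embeddings (illustrative; real ones come from LightFM)
embeddings = {"V1": [1.0, 0.0], "V2": [0.9, 0.1], "V3": [0.0, 1.0]}
print(top_similar("V1", embeddings, ["V2", "V3"], k=1))  # → ['V2']
```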
We will also store top videos based on views, for completely new users with no view history.
Since video popularity trends over time, we use only the most recent 10 days of data, and because the Visit collection updates continuously, we schedule the job to run every hour. We will use cron for scheduling, so the model retrains hourly on fresh data.
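A crontab entry for the hourly retrain might look like this (the script name and paths are illustrative, not the actual deployment layout):

```shell
# Run the training/recommendation-generation job at the top of every hour
0 * * * * /usr/bin/python3 /opt/recsys/train_recommendations.py >> /var/log/recsys/train.log 2>&1
```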
We will create a script, ‘generate_recommendations.py’, which takes a userId as input and returns a list of recommended videos within 100 ms.
Check if the userId has a view history in the past 10 days from the userHistory collection. We will use userHistory collection instead of visit collection since going through the entire visit collection is time-consuming.
If the userId has a history in the past 10 days, then get the latest 20 videos viewed by the user using the ‘updatedAt’ field.
For each postId, fetch the similar postIds from the ‘Recommendations’ collection. With 20 viewed videos and 20 similar videos each, this gives 400 candidate videos.
We then filter the candidates based on postType and recency: posts created by Influencers and Premium users are preferred, recent videos are preferred, and the top 20 filtered videos are returned.
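The re-ranking step can be sketched as a simple scoring function. The boost factors and the recency decay below are illustrative assumptions, not the production weights; in practice they would be tuned against the postType distribution observed in the EDA:

```python
from datetime import datetime, timedelta, timezone

# Illustrative boost factors (assumed, not the real tuned weights)
TYPE_BOOST = {"influencer": 2.0, "premium": 1.5, "normal": 1.0}

def rank_candidates(posts, now=None, top_n=20):
    """Score candidate posts by recency and postType; return the top_n postIds."""
    now = now or datetime.now(timezone.utc)
    def score(post):
        age_days = (now - post["createdAt"]).total_seconds() / 86400
        recency = 1.0 / (1.0 + age_days)  # newer → closer to 1
        return recency * TYPE_BOOST.get(post["postType"], 1.0)
    ranked = sorted(posts, key=score, reverse=True)
    return [p["postId"] for p in ranked[:top_n]]

now = datetime(2023, 8, 7, tzinfo=timezone.utc)
posts = [
    {"postId": "V1", "postType": "normal", "createdAt": now - timedelta(days=1)},
    {"postId": "V2", "postType": "influencer", "createdAt": now - timedelta(days=2)},
]
print(rank_candidates(posts, now=now, top_n=2))  # → ['V2', 'V1']
```

Here the two-day-old Influencer post outranks the one-day-old normal post because its type boost outweighs the recency penalty.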
3.4 Cold Start Recommendations
If the user has no view history, we fetch the top videos by view count from the ‘Recommendations’ collection and return the top 20.
We will create a Flask API that takes a userId as input and returns the recommended postIds as the response, using the ‘generate_recommendations.py’ script.
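A minimal sketch of the API layer, assuming Flask is installed; the route path and the stubbed `generate_recommendations` function below are illustrative stand-ins for the real script's logic:

```python
from flask import Flask, jsonify, request

def generate_recommendations(user_id):
    """Stand-in for the logic in generate_recommendations.py (returns dummy postIds)."""
    return ["V1", "V2"]

app = Flask(__name__)

@app.route("/recommendations")
def recommendations():
    user_id = request.args.get("userId")
    return jsonify({"userId": user_id, "postIds": generate_recommendations(user_id)})

if __name__ == "__main__":
    app.run()  # dev server only; production runs under Gunicorn
```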
Although Flask has a built-in server, it is not suitable for production, so we will run Flask under Gunicorn. We will use Supervisor to monitor Gunicorn: it keeps Gunicorn running in the background and restarts it automatically on reboot.
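A minimal Supervisor program entry for Gunicorn might look like this (the program name, paths, and worker count are illustrative assumptions):

```ini
[program:recsys_api]
command=/opt/recsys/venv/bin/gunicorn -w 4 -b 127.0.0.1:8000 app:app
directory=/opt/recsys
autostart=true
autorestart=true
stdout_logfile=/var/log/recsys/gunicorn.log
redirect_stderr=true
```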
A/B testing: We will be testing the recommendation engine for 5% of the users present on the app and evaluate the performance based on the engagement rate of the user.
4.3 Model Architecture
OS: Ubuntu, 16-core, 32 GB RAM.
Python 3.7, Jupyterlab
API- Flask with Gunicorn
Follow-up post: explore more advanced methods such as multi-armed bandits (MAB) and reinforcement learning (RL), and include more non-trivial features.