Recommendation-System-1- Collaborative Filtering and Content-based Filtering
Introduction to Recommendation Systems
A recommendation system suggests items to users based on their visit history or other data.
If we represent the items and users in a table, it looks like this:
users\items | item 1 | item 2 | item 3 |
---|---|---|---|
user 1 | 1 | - | - |
user 2 | - | 1 | - |
user 3 | - | - | 1 |
user 4 | 1 | 1 | - |
In this table, each row represents a user and each column represents an item. A cell with value 1 means that user bought the corresponding item. For example, the cell user 1 - item 1 = 1 means user 1 bought item 1.
The job of the recommendation system is to predict whether the empty cells in the table will become 1 or not, i.e., whether a user will buy an item. Such a system encourages users to consume items by recommending things that attract them, which helps companies make profits.
Collaborative filtering is one common family of methods for building recommendation systems. Its main idea is to use the similarity between users or between items to rank items, then pick and recommend the top K items to users.
Content-based filtering instead uses the content features of items and the contents a user may like to compute the similarity between items and users and make recommendations.
There are three common ways to make recommendations:
- User-based: first compute the similarity between users, then recommend to the current user the top K items liked by the most similar users.
- Item-based: recommend new items to the current user based on the similarity between those items and the items the user has bought before.
- Content-based: compute the similarity between a user and an item based on the tags/topics the user likes and the tags/topics the item contains, then recommend the most similar items.
Cold Start Problem
In recommendation systems, we also face the problem that when a new user or a new item arrives on the platform, we don't have enough information about them (such as a new user's preferences, or a new item's tags/contents) to make recommendations.
The cold start problem can be divided into two types:
- Cold start of users: we don't have enough information about new users and cannot determine their preferences to make recommendations
- Cold start of items: we don't have enough information about new items and don't know which tags and contents they may have
User-based filtering (similarity between users)
The main idea of user-based filtering is to first compute the similarity between every pair of users based on their item vectors, then pick the user who is most similar to the current user.
From the item list of the selected user, we recommend the top K most frequently visited items to the current user.
How it works
- Assumption
Assume we have a user-item table
users\items | item 1 | item 2 | item 3 | item 4 |
---|---|---|---|---|
user 1 | 1 | - | 1 | 1 |
user 2 | 1 | 1 | 1 | - |
user 3 | - | - | 1 | 1 |
user 4 | 1 | 1 | - | 1 |
where 1 represents the rating from the user on the item the user purchased, and - means the user did not buy the item.
Denote the $i^{th}$ user and the $j^{th}$ user as ui and uj, respectively, and let N(ui) and N(uj) be the sets of items the $i^{th}$ user and the $j^{th}$ user purchased. |N(ui)| is the number of items the $i^{th}$ user purchased.
- Step 1: Compute Similarity between a pair of users
In order to compute the similarity between a pair of users, we need a way to measure the distance between users.
One common way is cosine similarity, denoted as cos(ui, uj):
$$
cos(ui, uj) = \frac{ |N(ui) \cap N(uj) |}{\sqrt{|N(ui)| \times|N(uj)|} }
$$
where $|N(ui) \cap N(uj)|$ means the size of the intersection set between the set of items purchased by ui and the set of items purchased by uj.
Example:
In the table above, to compute cos(u1, u2), we see the intersection of u1's items and u2's items is [item1, item3], so $|N(u1) \cap N(u2)|=2$. Both u1 and u2 purchased 3 items, so $|N(u1)|=|N(u2)|=3$.
Then similarity between user 1 and user 2 is:
$$
cos(u1, u2) = \frac{ 2}{\sqrt{3 \times 3} }
$$
Based on this, we can compute the similarity between every pair of users in the table.
There are other choices for computing similarity, such as Euclidean distance, dot product, and Pearson's correlation.
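The Step 1 computation can be sketched in a few lines of Python; the `purchases` dictionary below is a hypothetical encoding of the example table:

```python
from math import sqrt

# N(u): the set of items each user purchased, taken from the example table.
purchases = {
    "u1": {"item1", "item3", "item4"},
    "u2": {"item1", "item2", "item3"},
    "u3": {"item3", "item4"},
    "u4": {"item1", "item2", "item4"},
}

def user_cosine(n_ui, n_uj):
    """cos(ui, uj) = |N(ui) ∩ N(uj)| / sqrt(|N(ui)| * |N(uj)|)"""
    if not n_ui or not n_uj:
        return 0.0
    return len(n_ui & n_uj) / sqrt(len(n_ui) * len(n_uj))

# cos(u1, u2) = 2 / sqrt(3 * 3)
print(round(user_cosine(purchases["u1"], purchases["u2"]), 3))  # 0.667
```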
- Step 2: Predict the interest of the $i^{th}$ user, ui, on the $x^{th}$ item, ix
In this step, we first pick the top K users who are most similar to ui, then keep those among them who rated item ix.
Finally, we compute the estimated rating of ui on item ix using the user similarities and the ratings from those users. The estimated rating of ui on item ix is
$$
R(ui, ix) = \sum_{v\in M} { cos(ui, v) \times R(v, ix)}
$$
where M is the intersection between the set of the top K similar users and users who rated the item ix. The $R(v, ix)$ is the rating of user v on item ix.
Note that we can also use a K-Nearest Neighbor approach to find the top K similar users, using Euclidean distance or another distance measure instead of cosine similarity.
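A minimal sketch of Step 2, reusing the same set-based cosine similarity; the `ratings` dictionary and the all-ones ratings are assumptions for illustration:

```python
from math import sqrt

# rating[user][item]; a missing key means the user did not rate the item.
ratings = {
    "u1": {"item1": 1, "item3": 1, "item4": 1},
    "u2": {"item1": 1, "item2": 1, "item3": 1},
    "u3": {"item3": 1, "item4": 1},
    "u4": {"item1": 1, "item2": 1, "item4": 1},
}

def user_cosine(ri, rj):
    return len(set(ri) & set(rj)) / sqrt(len(ri) * len(rj))

def predict_rating(user, item, k=2):
    """R(ui, ix) = sum over M of cos(ui, v) * R(v, ix),
    where M = users among the top-k most similar to `user` who rated `item`."""
    others = [v for v in ratings if v != user]
    others.sort(key=lambda v: user_cosine(ratings[user], ratings[v]), reverse=True)
    return sum(
        user_cosine(ratings[user], ratings[v]) * ratings[v][item]
        for v in others[:k]
        if item in ratings[v]
    )

score = predict_rating("u1", "item2")  # u1 has not bought item2 yet
```

Here u3 is most similar to u1 but never rated item2, so only u2's rating (weighted by cos(u1, u2) = 2/3) contributes to the estimate.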
Properties
Advantages
- Easy to implement
- It is good for exploring group interests in items, since it measures similarity between users in a group
- Good to use when new items arrive faster than new users
(Computing the similarity between a new user and all existing users is expensive when there are millions of users, but updating after adding a new item is much easier: we just update N(ui))
Disadvantages
- Sparsity: the percentage of items rated by each user is low
- User-based collaborative filtering is a memory-based method, since it requires memory to store the similarity between users
- When new users arrive faster than new items, it is time-expensive, especially with millions of users
- Cold start problem: without sufficient information about users or items, it is hard to estimate similarity reliably
- Updates are slow when there are large numbers of users or items
When to use
- when we want to explore group interests or similarity between users in a group
- The number of users is smaller than the number of items, or new users arrive more slowly than new items
Item-based filtering (similarity between items)
The main idea of item-based filtering is to compute the similarity between a new item and the items a user has purchased, in order to predict whether the user will buy the new item.
How it works
Using the user-item table above
users\items | item 1 | item 2 | item 3 | item 4 |
---|---|---|---|---|
user 1 | 1 | - | 1 | 1 |
user 2 | 1 | 1 | 1 | - |
user 3 | - | - | 1 | 1 |
user 4 | 1 | 1 | - | 1 |
Denote the $i^{th}$ item as xi, let M(xi) be the set of users who purchased item xi, and |M(xi)| the size of this set.
- Step 1: Compute similarity between a pair of items
The cosine similarity of items xi and xj is analogous to the similarity between users:
$$
cos(xi, xj) = \frac{ |M(xi) \cap M(xj) |}{\sqrt{|M(xi)| \times|M(xj)|} }
$$
In this example, $cos(x1, x2) = \frac{ |M(x1) \cap M(x2) |}{\sqrt{|M(x1)| \times|M(x2)|} } = \frac{2}{\sqrt{3\times 2} } $
- Step 2: Predict Rating of user ui on item xi
We pick the K items most similar to item xi based on the computed item similarities, then find the intersection Q between these K items and the set of items user ui purchased.
The estimated rating of user ui on item xi is
$$
R(ui, xi) = \sum_{y \in Q} { cos(xi, y) \times R(ui, y)}
$$
where Q is the intersection set between K items and the set of items user ui purchased. y is item from Q. $R(ui, y)$ is rating of user ui on item y.
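Both steps can be sketched together. `buyers` encodes M(x) from the table above, and `user_ratings` is a hypothetical rating history for u3:

```python
from math import sqrt

# M(x): the set of users who purchased item x, from the table above.
buyers = {
    "item1": {"u1", "u2", "u4"},
    "item2": {"u2", "u4"},
    "item3": {"u1", "u2", "u3"},
    "item4": {"u1", "u3", "u4"},
}
# Hypothetical ratings R(u, y) on the items each user purchased.
user_ratings = {"u3": {"item3": 1, "item4": 1}}

def item_cosine(xi, xj):
    """cos(xi, xj) = |M(xi) ∩ M(xj)| / sqrt(|M(xi)| * |M(xj)|)"""
    return len(buyers[xi] & buyers[xj]) / sqrt(len(buyers[xi]) * len(buyers[xj]))

def predict_rating(user, item, k=2):
    """R(ui, xi) = sum over Q of cos(xi, y) * R(ui, y),
    where Q = top-k items most similar to `item` ∩ the user's history."""
    others = [y for y in buyers if y != item]
    others.sort(key=lambda y: item_cosine(item, y), reverse=True)
    q = set(others[:k]) & set(user_ratings[user])
    return sum(item_cosine(item, y) * user_ratings[user][y] for y in q)

score = predict_rating("u3", "item1")  # would u3 like item1?
```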
Properties
- Advantages
- Item-based collaborative filtering is good for personalized recommendation, since it is based on the similarity of the items a user purchased, which reflects the user's preferences
- Good to use when there is a large number of users
- Easy to compute as well
- Good Explainability
- Disadvantages
- It is a memory-based model as well, so it needs to store information about items and users. With millions of items, the computation is expensive
- Cold start problem: for a new item, it is hard to compute item similarity because few users have used it
- Sparsity: the number of users using new items can be small, making it hard to compute similarity
- Time complexity is high when there are millions of items and users
When to use
- when we want to make personalized recommendations to users
- when the number of items is not very large, or items are updated less frequently than users' information
- When we have enough information about items and cold start effect is small.
Content-based filtering (based on tags/content of items)
Suppose items carry tags/contents, and users prefer certain tags/contents and hence like the items that contain them.
Based on this setting, we can construct an item vector $v_i$ and a user vector $u_i$ of the same shape.
Each entry of an item vector indicates whether the item contains the corresponding tag/content; each entry of the user vector indicates how much the user likes or dislikes that content/tag.
The user vector is given by this formula:
$$
u_i = \frac{1}{n}\sum_{p=1}^{n}v_p - \frac{1}{m}\sum_{q=1}^{m}v_q
$$
where $v_p$ ranges over the $n$ item vectors the user likes and $v_q$ over the $m$ item vectors the user dislikes.
Each entry of the user vector is the difference between the level of preferring and the level of disliking that content. If an entry is negative, the user probably dislikes that content.
After computing the user vector, we can compute the similarity between user vector and item vectors to see which item is most likely to be purchased by user based on user’s interests.
The similarity can be computed by dot product similarity:
$$
similarity(ui, vi) = <ui, vi>
$$
where <ui, vi> is dot product of ui, vi.
Note that we do not normalize ui and vi here: when there are many possible contents but each item contains only a few, the vectors are very sparse. Normalizing a sparse vector makes the non-zero values very small and even close to 0, which complicates computation and may lose precision.
The steps in content-based filtering are as follows:
- Compute item-content vectors and user vectors
- Compute similarity between user and items
- Rank items to recommend
Example
In an item-content table:
item | tag1 | tag2 | tag3 |
---|---|---|---|
v1 | 1 | 0 | 1 |
v2 | 1 | 0 | 0 |
v3 | 0 | 1 | 0 |
v4 | 0 | 1 | 1 |
In a user-item table:
user | item1 | item2 | item3 | item4 |
---|---|---|---|---|
u1 | 1 | 1 | -1 | -1 |
- Step1: compute item vectors and user vector
In the item-content table, 1 means item vi contains the tag/content, and 0 means it doesn't. In the user-item table, 1 indicates the user likes the item and -1 means the user dislikes it.
Hence item vectors are v1 = [1, 0, 1], v2 = [1, 0, 0], v3= [0, 1, 0], v4=[0, 1, 1]
Then user vector u1 = $\frac{v1+v2}{2} - \frac{v3+v4}{2}$ = [1, 0, 0.5] - [0, 1, 0.5] = [1, -1, 0].
In this case, we can see user 1 likes tag1, dislikes tag2, and is neutral about tag3.
- Step2: compute similarity between user and items
Sim(u1, v1) = <u1, v1> = 1 * 1 + 0 * (-1) + 1 * 0 = 1
Sim(u1, v2) = <u1, v2> = 1 * 1 + 0 * (-1) + 0 * 0 = 1
Sim(u1, v3) = <u1, v3> = 0 * 1 + 1 * (-1) + 0 * 0 = -1
Sim(u1, v4) = <u1, v4> = 0 * 1 + 1 * (-1) + 1 * 0 = -1
- Step3: ranking and recommend
Based on these scores, user 1 likes v1, v2 and dislikes v3, v4, so we recommend v1 and v2.
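The worked example above can be reproduced in a few lines of pure Python, mirroring the two tables:

```python
# Item-content vectors from the item-content table: [tag1, tag2, tag3].
items = {
    "v1": [1, 0, 1],
    "v2": [1, 0, 0],
    "v3": [0, 1, 0],
    "v4": [0, 1, 1],
}
likes, dislikes = ["v1", "v2"], ["v3", "v4"]  # from the user-item table

def mean(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# User vector: average of liked item vectors minus average of disliked ones.
u1 = [p - n for p, n in zip(mean([items[v] for v in likes]),
                            mean([items[v] for v in dislikes]))]
# u1 == [1.0, -1.0, 0.0]

# Unnormalized dot-product similarity, then rank items by score.
scores = {name: sum(a * b for a, b in zip(u1, v)) for name, v in items.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # v1, v2 come first
```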
Update of user vector
Since the user vector is just the difference between the average of the liked item vectors and the average of the disliked item vectors, when the $k^{th}$ item arrives we only need to update one of the two averages.
Suppose the user has already rated $k-1$ items, liking $k_1$ of them and disliking $k_2$ of them, so $k_1 + k_2 = k-1$. The average of the liked item vectors is
$$
u^+_{k-1}= \frac{1}{k_1}\sum^{k_1}_{i=1}v_i
$$
and the average of the disliked item vectors is
$$
u^-_{k-1} = \frac{1}{k_2}\sum^{k_2}_{j=1}v_j
$$
The user vector after $k-1$ items is
$$
u_{k-1} = u_{k-1}^+ -u_{k-1}^-
$$
When the $k^{th}$ item arrives, if the user likes it, we update only the like average:
$$
u_{k}^+ = \frac{k_1 u_{k-1}^+ + v_k }{k_1 + 1} , \quad u_{k}^- = u_{k-1}^-
$$
If the user dislikes it, we update only the dislike average:
$$
u_{k}^- = \frac{k_2 u_{k-1}^- + v_k }{k_2 + 1} , \quad u_{k}^+ = u_{k-1}^+
$$
This enables us to update the user vector on the fly.
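A small sketch of this online update, maintaining running averages over the liked and disliked item vectors (the function name and state layout are illustrative):

```python
def update_user_vector(u_pos, k1, u_neg, k2, v_k, liked):
    """Fold the newly rated item vector v_k into the like-average (currently
    over k1 items) or the dislike-average (currently over k2 items), then
    recompute the user vector as their difference."""
    if liked:
        u_pos = [(k1 * p + x) / (k1 + 1) for p, x in zip(u_pos, v_k)]
        k1 += 1
    else:
        u_neg = [(k2 * n + x) / (k2 + 1) for n, x in zip(u_neg, v_k)]
        k2 += 1
    user_vec = [p - n for p, n in zip(u_pos, u_neg)]
    return u_pos, k1, u_neg, k2, user_vec

# Continuing the earlier example: u1's like/dislike averages after 4 items,
# then a hypothetical 5th item with vector [0, 0, 1] that the user likes.
u_pos, k1, u_neg, k2, u1 = update_user_vector(
    [1.0, 0.0, 0.5], 2, [0.0, 1.0, 0.5], 2, [0, 0, 1], liked=True)
```

Each update touches only one running average, so the cost is linear in the vector length regardless of how many items the user has rated.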
Properties
Advantages
- Easy and very fast to compute similarity, even with large numbers of items or users
- Utilizes content information to explore potential contents a user likes
- Compared with item-based and user-based methods, it is less sensitive to the cold start problem, since the contents of an item can be defined easily
Disadvantages
- Sparsity of item vectors: item vectors can be sparse when there are many contents and the contents follow a long-tail distribution
- Very easy to converge to a narrow scope: since contents/tags may follow a long-tail distribution and most items have similar contents, a small change in an item vector can lead to a large change in similarity.
For example, for a user vector u = [100, 0, 1, -100], the values across contents are quite different due to the long-tail distribution. For item vectors v1 = [1, 0, 0, 0] and v2 = [0, 0, 1, 0], sim(u, v1) and sim(u, v2) will be very different.
When to use
- When we have rich information about item contents and want to use those contents in recommendation