Latent Factor Recommender System#
The NetFlix Challenge
Training data 100 million ratings, 480,000 users, 17,770 movies. 6 years of data: 2000-2005
Test data Last few ratings of each user (2.8 million)
Competition 2,700+ teams, $1 million prize for 10% improvement on Netflix
Evaluation criterion: Root Mean Square Error (RMSE)
Netflix’s system RMSE: 0.9514
U (m x m) ,
import numpy as np
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(A)
# Singular-value decomposition
U, s, VT = np.linalg.svd(A)
# create n x n Sigma matrix
Sigma = np.diag(s)
# reconstruct matrix
PT = Sigma.dot(VT)
#B = U.dot(Sigma.dot(VT))
print(PT)
[[1 2 3]
[4 5 6]
[7 8 9]]
[[-8.08154958e+00 -9.64331175e+00 -1.12050739e+01]
[-8.29792976e-01 -8.08611173e-02 6.68070742e-01]
[-1.36140716e-16 2.72281431e-16 -1.36140716e-16]]
# Singular-value decomposition
A = np.array([[1, 2], [3, 4], [5, 6]])
U, s, VT = np.linalg.svd(A)
# create n x n Sigma matrix
Sigma = np.zeros((A.shape[0], A.shape[1]))
# populate Sigma with n x n diagonal matrix
Sigma[:A.shape[1], :A.shape[1]] = np.diag(s)
# reconstruct matrix
PT = Sigma.dot(VT)
B = U.dot(PT)
print('A = \n', A, '\n')
print('U = \n', U, '\n')
print('Sigma = \n', Sigma, '\n')
print('VT = \n', VT, '\n')
print('PT = \n', PT, '\n')
print('B = \n', B, '\n')
A =
[[1 2]
[3 4]
[5 6]]
U =
[[-0.2298477 0.88346102 0.40824829]
[-0.52474482 0.24078249 -0.81649658]
[-0.81964194 -0.40189603 0.40824829]]
Sigma =
[[9.52551809 0. ]
[0. 0.51430058]
[0. 0. ]]
VT =
[[-0.61962948 -0.78489445]
[-0.78489445 0.61962948]]
PT =
[[-5.90229186 -7.47652631]
[-0.40367167 0.3186758 ]
[ 0. 0. ]]
B =
[[1. 2.]
[3. 4.]
[5. 6.]]
# Singular-value decomposition
A = np.array([[1, 2, 3],
[4, 5, 6]])
U,S,VT = np.linalg.svd(A)
# create n x n Sigma matrix
Sigma = np.zeros((A.shape[1], A.shape[1]))
# populate Sigma with n x n diagonal matrix
if A.shape[1] > S.shape[0]:
S = np.append(S, 0)
Sigma[:A.shape[1], :A.shape[1]] = np.diag(S)
PT= Sigma.dot(VT)
PT = PT[0:A.shape[0]]
B = U.dot(PT)
print('A = \n', A, '\n')
print('U = \n', U, '\n')
print('Sigma = \n', Sigma, '\n')
print('VT = \n', VT, '\n')
print('PT = \n', PT, '\n')
print('B = \n', B, '\n')
A =
[[1 2 3]
[4 5 6]]
U =
[[-0.3863177 -0.92236578]
[-0.92236578 0.3863177 ]]
Sigma =
[[9.508032 0. 0. ]
[0. 0.77286964 0. ]
[0. 0. 0. ]]
VT =
[[-0.42866713 -0.56630692 -0.7039467 ]
[ 0.80596391 0.11238241 -0.58119908]
[ 0.40824829 -0.81649658 0.40824829]]
PT =
[[-4.07578082 -5.38446431 -6.69314779]
[ 0.62290503 0.08685696 -0.44919112]]
B =
[[1. 2. 3.]
[4. 5. 6.]]
SVD gives minimum reconstruction error (Sum of Squared Errors, SSE)
SSE and RMSE are monotonically related:
Great news: SVD is minimizing RMSE
# https://beckernick.github.io/matrix-factorization-recommender/
import pandas as pd
import numpy as np
ratings_list = [i.strip().split("::") for i in open('/Users/datalab/bigdata/cjc/ml-1m/ratings.dat', 'r').readlines()]
users_list = [i.strip().split("::") for i in open('/Users/datalab/bigdata/cjc/ml-1m/users.dat', 'r').readlines()]
movies_list = [i.strip().split("::") for i in open('/Users/datalab/bigdata/cjc/ml-1m/movies.dat', 'r', encoding = 'iso-8859-15').readlines()]
ratings_df = pd.DataFrame(ratings_list, columns = ['UserID', 'MovieID', 'Rating', 'Timestamp'], dtype = int)
movies_df = pd.DataFrame(movies_list, columns = ['MovieID', 'Title', 'Genres'])
movies_df['MovieID'] = movies_df['MovieID'].astype('int64')
ratings_df['UserID'] = ratings_df['UserID'].astype('int64')
ratings_df['MovieID'] = ratings_df['MovieID'].astype('int64')
movies_df.head()
MovieID | Title | Genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
ratings_df.head()
UserID | MovieID | Rating | Timestamp | |
---|---|---|---|---|
0 | 1 | 1193 | 5 | 978300760 |
1 | 1 | 661 | 3 | 978302109 |
2 | 1 | 914 | 3 | 978301968 |
3 | 1 | 3408 | 4 | 978300275 |
4 | 1 | 2355 | 5 | 978824291 |
# 注意:使用0填充缺失值
R_df = ratings_df.pivot(index = 'UserID', columns ='MovieID', values = 'Rating').fillna(0)
R_df.head()
MovieID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 3943 | 3944 | 3945 | 3946 | 3947 | 3948 | 3949 | 3950 | 3951 | 3952 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
UserID | |||||||||||||||||||||
1 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 3706 columns
R = R_df.to_numpy(dtype=np.int16)
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k = 50)
sigma = np.diag(sigma)
all_user_predicted_ratings = U.dot( sigma.dot(Vt)) + user_ratings_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
preds_df
# each row is a user
# each column is a movie
MovieID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 3943 | 3944 | 3945 | 3946 | 3947 | 3948 | 3949 | 3950 | 3951 | 3952 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4.288861 | 0.143055 | -0.195080 | -0.018843 | 0.012232 | -0.176604 | -0.074120 | 0.141358 | -0.059553 | -0.195950 | ... | 0.027807 | 0.001640 | 0.026395 | -0.022024 | -0.085415 | 0.403529 | 0.105579 | 0.031912 | 0.050450 | 0.088910 |
1 | 0.744716 | 0.169659 | 0.335418 | 0.000758 | 0.022475 | 1.353050 | 0.051426 | 0.071258 | 0.161601 | 1.567246 | ... | -0.056502 | -0.013733 | -0.010580 | 0.062576 | -0.016248 | 0.155790 | -0.418737 | -0.101102 | -0.054098 | -0.140188 |
2 | 1.818824 | 0.456136 | 0.090978 | -0.043037 | -0.025694 | -0.158617 | -0.131778 | 0.098977 | 0.030551 | 0.735470 | ... | 0.040481 | -0.005301 | 0.012832 | 0.029349 | 0.020866 | 0.121532 | 0.076205 | 0.012345 | 0.015148 | -0.109956 |
3 | 0.408057 | -0.072960 | 0.039642 | 0.089363 | 0.041950 | 0.237753 | -0.049426 | 0.009467 | 0.045469 | -0.111370 | ... | 0.008571 | -0.005425 | -0.008500 | -0.003417 | -0.083982 | 0.094512 | 0.057557 | -0.026050 | 0.014841 | -0.034224 |
4 | 1.574272 | 0.021239 | -0.051300 | 0.246884 | -0.032406 | 1.552281 | -0.199630 | -0.014920 | -0.060498 | 0.450512 | ... | 0.110151 | 0.046010 | 0.006934 | -0.015940 | -0.050080 | -0.052539 | 0.507189 | 0.033830 | 0.125706 | 0.199244 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6035 | 2.392388 | 0.233964 | 0.413676 | 0.443726 | -0.083641 | 2.192294 | 1.168936 | 0.145237 | -0.046551 | 0.560895 | ... | 0.188493 | -0.004439 | -0.042271 | -0.090101 | 0.276312 | 0.133806 | 0.732374 | 0.271234 | 0.244983 | 0.734771 |
6036 | 2.070760 | 0.139294 | -0.012666 | -0.176990 | 0.261243 | 1.074234 | 0.083999 | 0.013814 | -0.030179 | -0.084956 | ... | -0.161548 | 0.001184 | -0.029223 | -0.047087 | 0.099036 | -0.192653 | -0.091265 | 0.050798 | -0.113427 | 0.033283 |
6037 | 0.619089 | -0.161769 | 0.106738 | 0.007048 | -0.074701 | -0.079953 | 0.100220 | -0.034013 | 0.007671 | 0.001280 | ... | -0.053546 | 0.005835 | 0.007551 | -0.024082 | -0.010739 | -0.008863 | -0.099774 | -0.013369 | -0.030354 | -0.114936 |
6038 | 1.503605 | -0.036208 | -0.161268 | -0.083401 | -0.081617 | -0.143517 | 0.106668 | -0.054404 | -0.008826 | 0.205801 | ... | -0.006104 | 0.008933 | 0.007595 | -0.037800 | 0.050743 | 0.024052 | -0.172466 | -0.010904 | -0.038647 | -0.168359 |
6039 | 1.996248 | -0.185987 | -0.156478 | 0.104143 | -0.030001 | 0.105521 | -0.168477 | -0.058174 | 0.122714 | -0.119716 | ... | 0.238088 | -0.047046 | -0.043259 | 0.038256 | 0.055693 | 0.149593 | 0.587989 | -0.006641 | 0.127067 | 0.285001 |
6040 rows × 3706 columns
def recommend_movies(preds_df, user_row_number, movies_df, ratings_df, num_recommendations=5):
# Get and sort the user's predictions
sorted_user_predictions = preds_df.iloc[user_row_number].sort_values(ascending=False)
# Get the user's data and merge in the movie information.
userID = user_row_number + 1
user_data = ratings_df[ratings_df.UserID == userID]
user_full = (user_data.merge(movies_df, how = 'left', left_on = 'MovieID', right_on = 'MovieID').
sort_values(['Rating'], ascending=False)
)
print('UserID {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
print('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
# Recommend the highest predicted rating movies that the user hasn't seen yet.
potential_movie_df= movies_df[~movies_df['MovieID'].isin(user_full['MovieID'])]
predicted_movie_df = pd.DataFrame(sorted_user_predictions).reset_index()
predicted_movie_df['MovieID'] = predicted_movie_df['MovieID'].astype('int64')
recommendations = (
potential_movie_df.merge(predicted_movie_df, how = 'left', on = 'MovieID').
rename(columns = {user_row_number: 'Predictions'}).
sort_values('Predictions', ascending = False).
iloc[:num_recommendations, :-1]
)
return user_full, recommendations
already_rated, predictions = recommend_movies(preds_df, 0, movies_df, ratings_df, 10)
UserID 1 has already rated 53 movies.
Recommending the highest 10 predicted ratings movies not already rated.
already_rated[:3]
UserID | MovieID | Rating | Timestamp | Title | Genres | |
---|---|---|---|---|---|---|
0 | 1 | 1193 | 5 | 978300760 | One Flew Over the Cuckoo's Nest (1975) | Drama |
46 | 1 | 1029 | 5 | 978302205 | Dumbo (1941) | Animation|Children's|Musical |
40 | 1 | 1 | 5 | 978824268 | Toy Story (1995) | Animation|Children's|Comedy |
predictions
MovieID | Title | Genres | |
---|---|---|---|
311 | 318 | Shawshank Redemption, The (1994) | Drama |
32 | 34 | Babe (1995) | Children's|Comedy|Drama |
356 | 364 | Lion King, The (1994) | Animation|Children's|Musical |
1975 | 2081 | Little Mermaid, The (1989) | Animation|Children's|Comedy|Musical|Romance |
1235 | 1282 | Fantasia (1940) | Animation|Children's|Musical |
1974 | 2080 | Lady and the Tramp (1955) | Animation|Children's|Comedy|Musical|Romance |
1972 | 2078 | Jungle Book, The (1967) | Animation|Children's|Comedy|Musical |
1990 | 2096 | Sleeping Beauty (1959) | Animation|Children's|Musical |
1981 | 2087 | Peter Pan (1953) | Animation|Children's|Fantasy|Musical |
348 | 356 | Forrest Gump (1994) | Comedy|Romance|War |
比较三种矩阵分解的方法
特征值分解 Eigen value decomposition
只能用于方阵
奇异值分解 Singular value decomposition
需要填充稀疏矩阵中的缺失元素
计算复杂度高
梯度下降 Gradient Descent
广泛使用!
Including bias#
is the global bias, measured by the overall mean rating is the bias for user x, measured by the mean rating given by user x. is the bias for movie i, measured by the mean ratings of movie i. is the user-movie interaction

Further reading:#
Y. Koren, Collaborative filtering with temporal dynamics, KDD ’09