Movie Recommendation with Turi Create#
import turicreate as tc
# set canvas to show sframes and sgraphs in ipython notebook
# import matplotlib.pyplot as plt
# %matplotlib inline
# download data from: http://files.grouplens.org/datasets/movielens/ml-1m.zip
data = tc.SFrame.read_csv('/Users/datalab/bigdata/cjc/ml-1m/ratings.dat', delimiter='\n',
                          header=False)['X1'].apply(lambda x: x.split('::')).unpack()
for col in data.column_names():
    data[col] = data[col].astype(int)
data = data.rename({'X.0': 'user_id', 'X.1': 'movie_id', 'X.2': 'rating', 'X.3': 'timestamp'})
#data.save('ratings')
Finished parsing file /Users/datalab/bigdata/cjc/ml-1m/ratings.dat
Parsing completed. Parsed 100 lines in 0.281192 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as
column_type_hints=[str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/datalab/bigdata/cjc/ml-1m/ratings.dat
Parsing completed. Parsed 1000209 lines in 0.372092 secs.
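The cell above reads each line as a single string column and then splits it on `::`, because the MovieLens files use a two-character delimiter. A minimal plain-Python sketch of the same per-line parsing, without Turi Create, might look like this. The `Rating` type and the `parse_rating_line` helper are hypothetical names introduced for illustration; the two sample lines are rebuilt from the first rows of the ratings table shown in this notebook.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One MovieLens rating record (hypothetical helper type)."""
    user_id: int
    movie_id: int
    rating: int
    timestamp: int


def parse_rating_line(line: str) -> Rating:
    """Split one 'user::movie::rating::timestamp' line into integer fields."""
    fields = line.strip().split("::")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    user_id, movie_id, rating, timestamp = (int(f) for f in fields)
    return Rating(user_id, movie_id, rating, timestamp)


# Sample lines rebuilt from the first two rows of the ratings table.
sample_lines = ["1::1193::5::978300760\n", "1::661::3::978302109\n"]
ratings = [parse_rating_line(line) for line in sample_lines]
print(ratings[0])
```

Converting every field to `int` mirrors the `astype(int)` loop in the notebook; a malformed line raises an error here instead of being silently kept as a string.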
users = tc.SFrame.read_csv('/Users/datalab/bigdata/cjc/ml-1m/users.dat', delimiter='\n',
                           header=False)['X1'].apply(lambda x: x.split('::')).unpack()
users = users.rename({'X.0': 'user_id', 'X.1': 'gender', 'X.2': 'age', 'X.3': 'occupation', 'X.4': 'zip-code'})
users['user_id'] = users['user_id'].astype(int)
users.save('users')
Finished parsing file /Users/datalab/bigdata/cjc/ml-1m/users.dat
Parsing completed. Parsed 100 lines in 0.028041 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as
column_type_hints=[str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/datalab/bigdata/cjc/ml-1m/users.dat
Parsing completed. Parsed 6040 lines in 0.007235 secs.
#items = tc.SFrame.read_csv('/Users/datalab/bigdata/ml-1m/movies.dat', delimiter='\n', header=False)#['X1'].apply(lambda x: x.split('::')).unpack()
# items = items.rename({'X.0': 'movie_id', 'X.1': 'title', 'X.2': 'genre'})
# items['movie_id'] = items['movie_id'].astype(int)
# items.save('items')
data
user_id | movie_id | rating | timestamp |
---|---|---|---|
1 | 1193 | 5 | 978300760 |
1 | 661 | 3 | 978302109 |
1 | 914 | 3 | 978301968 |
1 | 3408 | 4 | 978300275 |
1 | 2355 | 5 | 978824291 |
1 | 1197 | 3 | 978302268 |
1 | 1287 | 5 | 978302039 |
1 | 2804 | 5 | 978300719 |
1 | 594 | 4 | 978302268 |
1 | 919 | 4 | 978301368 |
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
#items
users
user_id | gender | age | occupation | zip-code |
---|---|---|---|---|
1 | F | 1 | 10 | 48067 |
2 | M | 56 | 16 | 70072 |
3 | M | 25 | 15 | 55117 |
4 | M | 45 | 7 | 02460 |
5 | M | 25 | 20 | 55455 |
6 | F | 50 | 9 | 55117 |
7 | M | 35 | 1 | 06810 |
8 | M | 25 | 12 | 11413 |
9 | M | 25 | 17 | 61614 |
10 | F | 35 | 1 | 95370 |
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
#data = data.join(items, on='movie_id')
#data
train_set, test_set = data.random_split(0.95, seed=1)
m = tc.recommender.create(train_set, 'user_id', 'movie_id', 'rating')
Preparing data set.
Data has 949852 observations with 6040 users and 3701 items.
Data prepared in: 0.550091s
Training ranking_factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter | Description | Value |
+--------------------------------+--------------------------------------------------+----------+
| num_factors | Factor Dimension | 32 |
| regularization | L2 Regularization on Factors | 1e-09 |
| solver | Solver used for training | adagrad |
| linear_regularization | L2 Regularization on Linear Coefficients | 1e-09 |
| ranking_regularization | Rank-based Regularization Weight | 0.25 |
| max_iterations | Maximum Number of Iterations | 25 |
+--------------------------------+--------------------------------------------------+----------+
Optimizing model using SGD; tuning step size.
Using 118731 / 949852 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value |
+---------+-------------------+------------------------------------------+
| 0 | 16.6667 | Not Viable |
| 1 | 4.16667 | Not Viable |
| 2 | 1.04167 | Not Viable |
| 3 | 0.260417 | Not Viable |
| 4 | 0.0651042 | 1.8722 |
| 5 | 0.0325521 | 1.94425 |
| 6 | 0.016276 | 1.95877 |
| 7 | 0.00813802 | 2.0441 |
+---------+-------------------+------------------------------------------+
| Final | 0.0651042 | 1.8722 |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------+-------------+
| Iter. | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size |
+---------+--------------+-------------------+-----------------------+-------------+
| Initial | 110us | 2.44718 | 1.1172 | |
+---------+--------------+-------------------+-----------------------+-------------+
| 1 | 536.251ms | 2.09737 | 1.13925 | 0.0651042 |
| 2 | 1.05s | 1.85594 | 1.06079 | 0.0651042 |
| 3 | 1.55s | 1.79883 | 1.03161 | 0.0651042 |
| 4 | 2.06s | 1.77231 | 1.02676 | 0.0651042 |
| 5 | 2.57s | 1.75455 | 1.02264 | 0.0651042 |
| 10 | 5.81s | 1.66968 | 0.995516 | 0.0651042 |
| 20 | 12.34s | 1.58039 | 0.969493 | 0.0651042 |
| 25 | 15.69s | 1.54869 | 0.961055 | 0.0651042 |
+---------+--------------+-------------------+-----------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training RMSE.
Final objective value: 1.57752
Final training RMSE: 0.95536
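The training log reports a root-mean-square error (RMSE) between predicted and observed ratings. As a rough illustration of what that number measures, here is a small sketch of the standard RMSE formula. The `rmse` helper and the toy ratings are made up for illustration and are not taken from the model above.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy data: three observed ratings and three hypothetical predictions.
actual = [5, 3, 4]
predicted = [4.0, 3.0, 5.0]
error = rmse(actual, predicted)
print(round(error, 4))
```

Read this way, an RMSE near 0.96 on a 1 to 5 scale means a typical prediction is off by roughly one star.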
m
Class : RankingFactorizationRecommender
Schema
------
User ID : user_id
Item ID : movie_id
Target : rating
Additional observation features : 1
User side features : []
Item side features : []
Statistics
----------
Number of observations : 949852
Number of users : 6040
Number of items : 3701
Training summary
----------------
Training time : 21.9973
Model Parameters
----------------
Model class : RankingFactorizationRecommender
num_factors : 32
binary_target : 0
side_data_factorization : 1
solver : auto
nmf : 0
max_iterations : 25
Regularization Settings
-----------------------
regularization : 0.0
regularization_type : normal
linear_regularization : 0.0
ranking_regularization : 0.25
unobserved_rating_value : -1.7976931348623157e+308
num_sampled_negative_examples : 4
ials_confidence_scaling_type : auto
ials_confidence_scaling_factor : 1
Optimization Settings
---------------------
init_random_sigma : 0.01
sgd_convergence_interval : 4
sgd_convergence_threshold : 0.0
sgd_max_trial_iterations : 5
sgd_sampling_block_size : 131072
sgd_step_adjustment_interval : 4
sgd_step_size : 0.0
sgd_trial_sample_minimum_size : 10000
sgd_trial_sample_proportion : 0.125
step_size_decrease_rate : 0.75
additional_iterations_if_unhealthy : 5
adagrad_momentum_weighting : 0.9
num_tempering_iterations : 4
tempering_regularization_start_value : 0.0
track_exact_loss : 0
m2 = tc.item_similarity_recommender.create(train_set,
'user_id', 'movie_id', 'rating',
similarity_type='pearson')
Warning: Ignoring columns timestamp;
To use these columns in scoring predictions, use a model that allows the use of additional features.
Preparing data set.
Data has 949852 observations with 6040 users and 3701 items.
Data prepared in: 0.426101s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 27.234ms | 16.5 |
| 42.954ms | 100 |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 73.627ms | 0 | 2 |
| 2.79s | 100 | 3701 |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 2.82252s
m2
Class : ItemSimilarityRecommender
Schema
------
User ID : user_id
Item ID : movie_id
Target : rating
Additional observation features : 0
User side features : []
Item side features : []
Statistics
----------
Number of observations : 949852
Number of users : 6040
Number of items : 3701
Training summary
----------------
Training time : 2.8226
Model Parameters
----------------
Model class : ItemSimilarityRecommender
threshold : 0.001
similarity_type : pearson
training_method : auto
Other Settings
--------------
max_data_passes : 4096
max_item_neighborhood_size : 64
nearest_neighbors_interaction_proportion_threshold : 0.05
target_memory_usage : 8589934592
sparse_density_estimation_sample_size : 4096
degree_approximation_threshold : 4096
seed_item_set_size : 50
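The second model uses `similarity_type='pearson'`. One common way to define Pearson similarity between two items is the correlation of their ratings over the users who rated both. The sketch below follows that textbook definition; Turi Create's implementation may differ in details such as how the means are computed, so treat this as an illustration rather than the library's code. The item dictionaries are toy data.

```python
import math


def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation of two items over users who rated both.

    Each argument maps user_id -> rating for one item.
    Returns 0.0 when the correlation is undefined.
    """
    common_users = sorted(set(ratings_a) & set(ratings_b))
    if len(common_users) < 2:
        return 0.0
    xs = [ratings_a[u] for u in common_users]
    ys = [ratings_b[u] for u in common_users]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return covariance / math.sqrt(var_x * var_y)


# Toy data: users 1-3 rated both items; users 4 and 5 rated only one.
item_a = {1: 5, 2: 3, 3: 4, 4: 1}
item_b = {1: 4, 2: 2, 3: 3, 5: 5}
item_c = {1: 1, 2: 3, 3: 2}

sim_ab = pearson_similarity(item_a, item_b)
sim_ac = pearson_similarity(item_a, item_c)
print(sim_ab, sim_ac)
```

Because the ratings are mean-centered, two items count as similar when the same users rate them above or below their respective averages, even if one item is rated lower overall.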
result = tc.recommender.util.compare_models(test_set,
[m, m2],
user_sample=.5, skip_set=train_set)
compare_models: using 2811 users to estimate model performance
PROGRESS: Evaluate model M0
recommendations finished on 1000/2811 queries. users per second: 10084.7
recommendations finished on 2000/2811 queries. users per second: 10557.4
Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff | mean_recall | mean_precision |
+--------+----------------------+----------------------+
| 1 | 0.004372314245294037 | 0.03344005691924596 |
| 2 | 0.008439255238125647 | 0.030771967271433692 |
| 3 | 0.011792773608123091 | 0.029764022293371297 |
| 4 | 0.014103362205887681 | 0.027303450729277888 |
| 5 | 0.017724646480050326 | 0.026894343649946677 |
| 6 | 0.01985047799128097 | 0.02549507885687179 |
| 7 | 0.023037645809147193 | 0.025054632311836182 |
| 8 | 0.02564717744662357 | 0.024101743151903235 |
| 9 | 0.027494985038662042 | 0.023123443614372085 |
| 10 | 0.02954846065621093 | 0.022483102098897183 |
+--------+----------------------+----------------------+
[10 rows x 3 columns]
Overall RMSE: 0.988323739301448
Per User RMSE (best)
+---------+----------------------+-------+
| user_id | rmse | count |
+---------+----------------------+-------+
| 4695 | 0.008856667044261357 | 1 |
+---------+----------------------+-------+
[1 rows x 3 columns]
Per User RMSE (worst)
+---------+-------------------+-------+
| user_id | rmse | count |
+---------+-------------------+-------+
| 1102 | 2.957562522855876 | 1 |
+---------+-------------------+-------+
[1 rows x 3 columns]
Per Item RMSE (best)
+----------+----------------------+-------+
| movie_id | rmse | count |
+----------+----------------------+-------+
| 3674 | 0.012974611607248221 | 1 |
+----------+----------------------+-------+
[1 rows x 3 columns]
Per Item RMSE (worst)
+----------+--------------------+-------+
| movie_id | rmse | count |
+----------+--------------------+-------+
| 3886 | 3.4432479133103597 | 1 |
+----------+--------------------+-------+
[1 rows x 3 columns]
PROGRESS: Evaluate model M1
recommendations finished on 1000/2811 queries. users per second: 23065.4
recommendations finished on 2000/2811 queries. users per second: 24766.9
Precision and recall summary statistics by cutoff
+--------+-------------+----------------+
| cutoff | mean_recall | mean_precision |
+--------+-------------+----------------+
| 1 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 |
| 10 | 0.0 | 0.0 |
+--------+-------------+----------------+
[10 rows x 3 columns]
Overall RMSE: 0.977554609754323
Per User RMSE (best)
+---------+-----------------------+-------+
| user_id | rmse | count |
+---------+-----------------------+-------+
| 3872 | 4.440892098500626e-16 | 1 |
+---------+-----------------------+-------+
[1 rows x 3 columns]
Per User RMSE (worst)
+---------+--------------------+-------+
| user_id | rmse | count |
+---------+--------------------+-------+
| 5214 | 3.2845314102161183 | 2 |
+---------+--------------------+-------+
[1 rows x 3 columns]
Per Item RMSE (best)
+----------+------+-------+
| movie_id | rmse | count |
+----------+------+-------+
| 1842 | 0.0 | 1 |
+----------+------+-------+
[1 rows x 3 columns]
Per Item RMSE (worst)
+----------+------+-------+
| movie_id | rmse | count |
+----------+------+-------+
| 572 | 4.0 | 1 |
+----------+------+-------+
[1 rows x 3 columns]
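The comparison output reports precision and recall at each cutoff. Under the usual definitions, precision at cutoff k is the fraction of the top k recommendations that appear in the user's held-out items, and recall is the fraction of the held-out items that appear in the top k. A minimal sketch of that calculation for one user follows; the `precision_recall_at_k` helper and all item ids are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    recommended: item ids ordered from best to worst.
    relevant: set of item ids the user interacted with in the test set.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data: five ranked recommendations and three held-out items.
recommended = [10, 20, 30, 40, 50]
relevant = {20, 50, 60}

precision_at_5, recall_at_5 = precision_recall_at_k(recommended, relevant, 5)
precision_at_1, recall_at_1 = precision_recall_at_k(recommended, relevant, 1)
print(precision_at_5, recall_at_5)
```

The `mean_precision` and `mean_recall` columns would then be these per-user values averaged over the sampled users, assuming the standard definition.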
Getting similar items#
m.get_similar_items([1287]) # movie_id is Ben-Hur
movie_id | similar | score | rank |
---|---|---|---|
1287 | 1262 | 0.8935538530349731 | 1 |
1287 | 1272 | 0.8684239983558655 | 2 |
1287 | 2662 | 0.8668187260627747 | 3 |
1287 | 3366 | 0.8548122048377991 | 4 |
1287 | 2948 | 0.8543752431869507 | 5 |
1287 | 3062 | 0.8494184017181396 | 6 |
1287 | 2947 | 0.8432653546333313 | 7 |
1287 | 3836 | 0.8384832739830017 | 8 |
1287 | 1304 | 0.8308332562446594 | 9 |
1287 | 1250 | 0.8267531394958496 | 10 |
help(m.get_similar_items)
Help on method get_similar_items in module graphlab.toolkits.recommender.util:
get_similar_items(self, items=None, k=10, verbose=False) method of graphlab.toolkits.recommender.ranking_factorization_recommender.RankingFactorizationRecommender instance
Get the k most similar items for each item in items.
Each type of recommender has its own model for the similarity
between items. For example, the item_similarity_recommender will
return the most similar items according to the user-chosen
similarity; the factorization_recommender will return the
nearest items based on the cosine similarity between latent item
factors.
Parameters
----------
items : SArray or list; optional
An :class:`~graphlab.SArray` or list of item ids for which to get
similar items. If 'None', then return the `k` most similar items for
all items in the training set.
k : int, optional
The number of similar items for each item.
verbose : bool, optional
Progress printing is shown.
Returns
-------
out : SFrame
A SFrame with the top ranked similar items for each item. The
columns `item`, 'similar', 'score' and 'rank', where
`item` matches the item column name specified at training time.
The 'rank' is between 1 and `k` and 'score' gives the similarity
score of that item. The value of the score depends on the method
used for computing item similarities.
Examples
--------
>>> sf = graphlab.SFrame({'user_id': ["0", "0", "0", "1", "1", "2", "2", "2"],
'item_id': ["a", "b", "c", "a", "b", "b", "c", "d"]})
>>> m = graphlab.item_similarity_recommender.create(sf)
>>> nn = m.get_similar_items()
In the output above, `score` is the similarity between the queried item and each item in the `similar` column.
# m.get_similar_items([1287]).join(items, on={'similar': 'movie_id'}).sort('rank')
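According to the help text above, a factorization recommender returns the nearest items by cosine similarity between latent item factors. The sketch below shows that idea on toy vectors. The item names and 3-dimensional factors are invented for illustration (the model trained in this notebook uses 32 factors per item), and `most_similar` is a hypothetical helper, not a Turi Create function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(item_id, factors, k):
    """Return the k items whose factor vectors are closest to item_id's."""
    query = factors[item_id]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in factors.items()
        if other != item_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy latent factors for four hypothetical items.
factors = {
    "A": [1.0, 0.0, 0.0],
    "B": [2.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0],
    "D": [1.0, 1.0, 0.0],
}

neighbors = most_similar("A", factors, k=2)
print(neighbors)
```

Cosine similarity ignores vector length, which is why "B" (same direction as "A", twice the magnitude) ranks first with a score of 1.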
Making recommendations#
recs = m.recommend()
recommendations finished on 1000/6040 queries. users per second: 11685.2
recommendations finished on 2000/6040 queries. users per second: 11654.4
recommendations finished on 3000/6040 queries. users per second: 11658.6
recommendations finished on 4000/6040 queries. users per second: 11321.5
recommendations finished on 5000/6040 queries. users per second: 11502.9
recommendations finished on 6000/6040 queries. users per second: 11105.9
recs
user_id | movie_id | score | rank |
---|---|---|---|
1 | 318 | 5.045622686663793 | 1 |
1 | 1198 | 4.862424055854009 | 2 |
1 | 50 | 4.76625474802606 | 3 |
1 | 593 | 4.766107517102884 | 4 |
1 | 858 | 4.747795152286218 | 5 |
1 | 1196 | 4.689315832028316 | 6 |
1 | 2858 | 4.678970253834652 | 7 |
1 | 2396 | 4.5986619915758835 | 8 |
1 | 110 | 4.588308471063303 | 9 |
1 | 2571 | 4.573408636072801 | 10 |
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
data[data['user_id'] == 4]
user_id | movie_id | rating | timestamp |
---|---|---|---|
4 | 3468 | 5 | 978294008 |
4 | 1210 | 3 | 978293924 |
4 | 2951 | 4 | 978294282 |
4 | 1214 | 4 | 978294260 |
4 | 1036 | 4 | 978294282 |
4 | 260 | 5 | 978294199 |
4 | 2028 | 5 | 978294230 |
4 | 480 | 4 | 978294008 |
4 | 1196 | 2 | 978294199 |
4 | 1198 | 5 | 978294199 |
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
# m.recommend(users=[4], k=20).join(items, on='movie_id')
Recommendations for new users#
recent_data = tc.SFrame()
recent_data['movie_id'] = [30, 1000, 900, 883, 251, 200, 199, 180, 120, 991, 1212]
recent_data['user_id'] = 99999
recent_data['rating'] = [2, 1, 3, 4, 0, 0, 1, 1, 1, 2, 3]
recent_data
movie_id | user_id | rating |
---|---|---|
30 | 99999 | 2 |
1000 | 99999 | 1 |
900 | 99999 | 3 |
883 | 99999 | 4 |
251 | 99999 | 0 |
200 | 99999 | 0 |
199 | 99999 | 1 |
180 | 99999 | 1 |
120 | 99999 | 1 |
991 | 99999 | 2 |
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
m2.recommend(users=[99999], new_observation_data=recent_data)#.join(items, on='movie_id').sort('rank')
user_id | movie_id | score | rank |
---|---|---|---|
99999 | 3881 | 5.0 | 1 |
99999 | 3607 | 5.0 | 2 |
99999 | 1830 | 5.0 | 3 |
99999 | 989 | 5.0 | 4 |
99999 | 3172 | 5.0 | 5 |
99999 | 3233 | 5.0 | 6 |
99999 | 787 | 5.0 | 7 |
99999 | 3382 | 5.0 | 8 |
99999 | 3656 | 5.0 | 9 |
99999 | 3280 | 5.0 | 10 |
Saving and loading models#
m.save('my_model')
m_again = tc.load_model('my_model')
m_again