Movie recommendations with Turicreate#

import turicreate as tc
# set canvas to show sframes and sgraphs in ipython notebook
# import matplotlib.pyplot as plt
# %matplotlib inline
# download data from: http://files.grouplens.org/datasets/movielens/ml-1m.zip
# ratings.dat uses '::' as its delimiter, which read_csv cannot parse directly:
# read each line as a single string, split on '::', and unpack into columns
data = tc.SFrame.read_csv('/Users/datalab/bigdata/cjc/ml-1m/ratings.dat', delimiter='\n', 
                          header=False)['X1'].apply(lambda x: x.split('::')).unpack()
for col in data.column_names():
    data[col] = data[col].astype(int)
data = data.rename({'X.0': 'user_id', 'X.1': 'movie_id', 'X.2': 'rating', 'X.3': 'timestamp'})
#data.save('ratings')
Finished parsing file /Users/datalab/bigdata/cjc/ml-1m/ratings.dat
Parsing completed. Parsed 100 lines in 0.281192 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/datalab/bigdata/cjc/ml-1m/ratings.dat
Parsing completed. Parsed 1000209 lines in 0.372092 secs.
users = tc.SFrame.read_csv('/Users/datalab/bigdata/cjc/ml-1m/users.dat', delimiter='\n', 
                                 header=False)['X1'].apply(lambda x: x.split('::')).unpack()
users = users.rename({'X.0': 'user_id', 'X.1': 'gender', 'X.2': 'age', 'X.3': 'occupation', 'X.4': 'zip-code'})
users['user_id'] = users['user_id'].astype(int)
users.save('users')
Finished parsing file /Users/datalab/bigdata/cjc/ml-1m/users.dat
Parsing completed. Parsed 100 lines in 0.028041 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/datalab/bigdata/cjc/ml-1m/users.dat
Parsing completed. Parsed 6040 lines in 0.007235 secs.
#items = tc.SFrame.read_csv('/Users/datalab/bigdata/ml-1m/movies.dat', delimiter='\n', header=False)#['X1'].apply(lambda x: x.split('::')).unpack()
# items = items.rename({'X.0': 'movie_id', 'X.1': 'title', 'X.2': 'genre'})
# items['movie_id'] = items['movie_id'].astype(int)
# items.save('items')
data
+---------+----------+--------+-----------+
| user_id | movie_id | rating | timestamp |
+---------+----------+--------+-----------+
|    1    |   1193   |   5    | 978300760 |
|    1    |   661    |   3    | 978302109 |
|    1    |   914    |   3    | 978301968 |
|    1    |   3408   |   4    | 978300275 |
|    1    |   2355   |   5    | 978824291 |
|    1    |   1197   |   3    | 978302268 |
|    1    |   1287   |   5    | 978302039 |
|    1    |   2804   |   5    | 978300719 |
|    1    |   594    |   4    | 978302268 |
|    1    |   919    |   4    | 978301368 |
+---------+----------+--------+-----------+
[1000209 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
#items
users
+---------+--------+-----+------------+----------+
| user_id | gender | age | occupation | zip-code |
+---------+--------+-----+------------+----------+
|    1    |   F    |  1  |     10     |  48067   |
|    2    |   M    |  56 |     16     |  70072   |
|    3    |   M    |  25 |     15     |  55117   |
|    4    |   M    |  45 |     7      |  02460   |
|    5    |   M    |  25 |     20     |  55455   |
|    6    |   F    |  50 |     9      |  55117   |
|    7    |   M    |  35 |     1      |  06810   |
|    8    |   M    |  25 |     12     |  11413   |
|    9    |   M    |  25 |     17     |  61614   |
|    10   |   F    |  35 |     1      |  95370   |
+---------+--------+-----+------------+----------+
[6040 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
#data = data.join(items, on='movie_id')
#data
train_set, test_set = data.random_split(0.95, seed=1)
m = tc.recommender.create(train_set, 'user_id', 'movie_id', 'rating')
Preparing data set.
    Data has 949852 observations with 6040 users and 3701 items.
    Data prepared in: 0.550091s
Training ranking_factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 32       |
| regularization                 | L2 Regularization on Factors                     | 1e-09    |
| solver                         | Solver used for training                         | adagrad  |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
| ranking_regularization         | Rank-based Regularization Weight                 | 0.25     |
| max_iterations                 | Maximum Number of Iterations                     | 25       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 118731 / 949852 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 16.6667           | Not Viable                               |
| 1       | 4.16667           | Not Viable                               |
| 2       | 1.04167           | Not Viable                               |
| 3       | 0.260417          | Not Viable                               |
| 4       | 0.0651042         | 1.8722                                   |
| 5       | 0.0325521         | 1.94425                                  |
| 6       | 0.016276          | 1.95877                                  |
| 7       | 0.00813802        | 2.0441                                   |
+---------+-------------------+------------------------------------------+
| Final   | 0.0651042         | 1.8722                                   |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
+---------+--------------+-------------------+-----------------------+-------------+
| Initial | 110us        | 2.44718           | 1.1172                |             |
+---------+--------------+-------------------+-----------------------+-------------+
| 1       | 536.251ms    | 2.09737           | 1.13925               | 0.0651042   |
| 2       | 1.05s        | 1.85594           | 1.06079               | 0.0651042   |
| 3       | 1.55s        | 1.79883           | 1.03161               | 0.0651042   |
| 4       | 2.06s        | 1.77231           | 1.02676               | 0.0651042   |
| 5       | 2.57s        | 1.75455           | 1.02264               | 0.0651042   |
| 10      | 5.81s        | 1.66968           | 0.995516              | 0.0651042   |
| 20      | 12.34s       | 1.58039           | 0.969493              | 0.0651042   |
| 25      | 15.69s       | 1.54869           | 0.961055              | 0.0651042   |
+---------+--------------+-------------------+-----------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training RMSE.
       Final objective value: 1.57752
       Final training RMSE: 0.95536
m
Class                            : RankingFactorizationRecommender

Schema
------
User ID                          : user_id
Item ID                          : movie_id
Target                           : rating
Additional observation features  : 1
User side features               : []
Item side features               : []

Statistics
----------
Number of observations           : 949852
Number of users                  : 6040
Number of items                  : 3701

Training summary
----------------
Training time                    : 21.9973

Model Parameters
----------------
Model class                      : RankingFactorizationRecommender
num_factors                      : 32
binary_target                    : 0
side_data_factorization          : 1
solver                           : auto
nmf                              : 0
max_iterations                   : 25

Regularization Settings
-----------------------
regularization                   : 0.0
regularization_type              : normal
linear_regularization            : 0.0
ranking_regularization           : 0.25
unobserved_rating_value          : -1.7976931348623157e+308
num_sampled_negative_examples    : 4
ials_confidence_scaling_type     : auto
ials_confidence_scaling_factor   : 1

Optimization Settings
---------------------
init_random_sigma                : 0.01
sgd_convergence_interval         : 4
sgd_convergence_threshold        : 0.0
sgd_max_trial_iterations         : 5
sgd_sampling_block_size          : 131072
sgd_step_adjustment_interval     : 4
sgd_step_size                    : 0.0
sgd_trial_sample_minimum_size    : 10000
sgd_trial_sample_proportion      : 0.125
step_size_decrease_rate          : 0.75
additional_iterations_if_unhealthy : 5
adagrad_momentum_weighting       : 0.9
num_tempering_iterations         : 4
tempering_regularization_start_value : 0.0
track_exact_loss                 : 0
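Because a target column was supplied, tc.recommender.create picked a RankingFactorizationRecommender automatically. The sketch below is not part of the original run; it shows how the same kind of model could be created explicitly, with the users SFrame loaded earlier passed in as side features. The num_factors and max_iterations values are illustrative, not tuned.

# Sketch only: explicit constructor with user side features (gender, age, occupation, zip-code).
m_explicit = tc.ranking_factorization_recommender.create(
    train_set,
    user_id='user_id',
    item_id='movie_id',
    target='rating',
    user_data=users,      # side features from users.dat
    num_factors=32,       # illustrative values, not tuned
    max_iterations=25)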
m2 = tc.item_similarity_recommender.create(train_set, 
                                                 'user_id', 'movie_id', 'rating',
                                 similarity_type='pearson')
Warning: Ignoring columns timestamp;
    To use these columns in scoring predictions, use a model that allows the use of additional features.
Preparing data set.
    Data has 949852 observations with 6040 users and 3701 items.
    Data prepared in: 0.426101s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 27.234ms                       | 16.5       |
| 42.954ms                       | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 73.627ms                            | 0                | 2               |
| 2.79s                               | 100              | 3701            |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 2.82252s
m2
Class                            : ItemSimilarityRecommender

Schema
------
User ID                          : user_id
Item ID                          : movie_id
Target                           : rating
Additional observation features  : 0
User side features               : []
Item side features               : []

Statistics
----------
Number of observations           : 949852
Number of users                  : 6040
Number of items                  : 3701

Training summary
----------------
Training time                    : 2.8226

Model Parameters
----------------
Model class                      : ItemSimilarityRecommender
threshold                        : 0.001
similarity_type                  : pearson
training_method                  : auto

Other Settings
--------------
max_data_passes                  : 4096
max_item_neighborhood_size       : 64
nearest_neighbors_interaction_proportion_threshold : 0.05
target_memory_usage              : 8589934592
sparse_density_estimation_sample_size : 4096
degree_approximation_threshold   : 4096
seed_item_set_size               : 50
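similarity_type='pearson' measures item-item similarity on the rating values themselves; 'cosine' and 'jaccard' are the other options turicreate accepts. A minimal sketch of the cosine variant, not run in the original notebook:

# Sketch only: same item-based model, but with cosine similarity instead of Pearson correlation.
m2_cosine = tc.item_similarity_recommender.create(train_set,
                                                  'user_id', 'movie_id', 'rating',
                                                  similarity_type='cosine')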
result = tc.recommender.util.compare_models(test_set, 
                                                  [m, m2],
                                            user_sample=.5, skip_set=train_set)
compare_models: using 2811 users to estimate model performance
PROGRESS: Evaluate model M0
recommendations finished on 1000/2811 queries. users per second: 10084.7
recommendations finished on 2000/2811 queries. users per second: 10557.4
Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |     mean_recall      |    mean_precision    |
+--------+----------------------+----------------------+
|   1    | 0.004372314245294037 | 0.03344005691924596  |
|   2    | 0.008439255238125647 | 0.030771967271433692 |
|   3    | 0.011792773608123091 | 0.029764022293371297 |
|   4    | 0.014103362205887681 | 0.027303450729277888 |
|   5    | 0.017724646480050326 | 0.026894343649946677 |
|   6    | 0.01985047799128097  | 0.02549507885687179  |
|   7    | 0.023037645809147193 | 0.025054632311836182 |
|   8    | 0.02564717744662357  | 0.024101743151903235 |
|   9    | 0.027494985038662042 | 0.023123443614372085 |
|   10   | 0.02954846065621093  | 0.022483102098897183 |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.988323739301448

Per User RMSE (best)
+---------+----------------------+-------+
| user_id |         rmse         | count |
+---------+----------------------+-------+
|   4695  | 0.008856667044261357 |   1   |
+---------+----------------------+-------+
[1 rows x 3 columns]


Per User RMSE (worst)
+---------+-------------------+-------+
| user_id |        rmse       | count |
+---------+-------------------+-------+
|   1102  | 2.957562522855876 |   1   |
+---------+-------------------+-------+
[1 rows x 3 columns]


Per Item RMSE (best)
+----------+----------------------+-------+
| movie_id |         rmse         | count |
+----------+----------------------+-------+
|   3674   | 0.012974611607248221 |   1   |
+----------+----------------------+-------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+----------+--------------------+-------+
| movie_id |        rmse        | count |
+----------+--------------------+-------+
|   3886   | 3.4432479133103597 |   1   |
+----------+--------------------+-------+
[1 rows x 3 columns]

PROGRESS: Evaluate model M1
recommendations finished on 1000/2811 queries. users per second: 23065.4
recommendations finished on 2000/2811 queries. users per second: 24766.9
Precision and recall summary statistics by cutoff
+--------+-------------+----------------+
| cutoff | mean_recall | mean_precision |
+--------+-------------+----------------+
|   1    |     0.0     |      0.0       |
|   2    |     0.0     |      0.0       |
|   3    |     0.0     |      0.0       |
|   4    |     0.0     |      0.0       |
|   5    |     0.0     |      0.0       |
|   6    |     0.0     |      0.0       |
|   7    |     0.0     |      0.0       |
|   8    |     0.0     |      0.0       |
|   9    |     0.0     |      0.0       |
|   10   |     0.0     |      0.0       |
+--------+-------------+----------------+
[10 rows x 3 columns]


Overall RMSE: 0.977554609754323

Per User RMSE (best)
+---------+-----------------------+-------+
| user_id |          rmse         | count |
+---------+-----------------------+-------+
|   3872  | 4.440892098500626e-16 |   1   |
+---------+-----------------------+-------+
[1 rows x 3 columns]


Per User RMSE (worst)
+---------+--------------------+-------+
| user_id |        rmse        | count |
+---------+--------------------+-------+
|   5214  | 3.2845314102161183 |   2   |
+---------+--------------------+-------+
[1 rows x 3 columns]


Per Item RMSE (best)
+----------+------+-------+
| movie_id | rmse | count |
+----------+------+-------+
|   1842   | 0.0  |   1   |
+----------+------+-------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+----------+------+-------+
| movie_id | rmse | count |
+----------+------+-------+
|   572    | 4.0  |   1   |
+----------+------+-------+
[1 rows x 3 columns]
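compare_models scores every model on the same sample of held-out users; each model can also be evaluated on its own. A minimal sketch follows; the dictionary keys named in the comments are assumptions about the returned results, not taken from the original output.

# Sketch: per-model evaluation on the held-out ratings.
rmse_m = m.evaluate_rmse(test_set, target='rating')            # assumed keys: 'rmse_overall', per-user/per-item tables
pr_m = m.evaluate_precision_recall(test_set, cutoffs=[5, 10])  # assumed key: 'precision_recall_overall'
print(rmse_m['rmse_overall'])
print(pr_m['precision_recall_overall'])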

Getting similar items#

m.get_similar_items([1287])  # movie_id 1287 is Ben-Hur
+----------+---------+--------------------+------+
| movie_id | similar |       score        | rank |
+----------+---------+--------------------+------+
|   1287   |   1262  | 0.8935538530349731 |  1   |
|   1287   |   1272  | 0.8684239983558655 |  2   |
|   1287   |   2662  | 0.8668187260627747 |  3   |
|   1287   |   3366  | 0.8548122048377991 |  4   |
|   1287   |   2948  | 0.8543752431869507 |  5   |
|   1287   |   3062  | 0.8494184017181396 |  6   |
|   1287   |   2947  | 0.8432653546333313 |  7   |
|   1287   |   3836  | 0.8384832739830017 |  8   |
|   1287   |   1304  | 0.8308332562446594 |  9   |
|   1287   |   1250  | 0.8267531394958496 |  10  |
+----------+---------+--------------------+------+
[10 rows x 4 columns]
help(m.get_similar_items)
Help on method get_similar_items in module graphlab.toolkits.recommender.util:

get_similar_items(self, items=None, k=10, verbose=False) method of graphlab.toolkits.recommender.ranking_factorization_recommender.RankingFactorizationRecommender instance
    Get the k most similar items for each item in items.
    
    Each type of recommender has its own model for the similarity
    between items. For example, the item_similarity_recommender will
    return the most similar items according to the user-chosen
    similarity; the factorization_recommender will return the
    nearest items based on the cosine similarity between latent item
    factors.
    
    Parameters
    ----------
    items : SArray or list; optional
        An :class:`~graphlab.SArray` or list of item ids for which to get
        similar items. If 'None', then return the `k` most similar items for
        all items in the training set.
    
    k : int, optional
        The number of similar items for each item.
    
    verbose : bool, optional
        Progress printing is shown.
    
    Returns
    -------
    out : SFrame
        A SFrame with the top ranked similar items for each item. The
        columns `item`, 'similar', 'score' and 'rank', where
        `item` matches the item column name specified at training time.
        The 'rank' is between 1 and `k` and 'score' gives the similarity
        score of that item. The value of the score depends on the method
        used for computing item similarities.
    
    Examples
    --------
    
    >>> sf = graphlab.SFrame({'user_id': ["0", "0", "0", "1", "1", "2", "2", "2"],
                              'item_id': ["a", "b", "c", "a", "b", "b", "c", "d"]})
    >>> m = graphlab.item_similarity_recommender.create(sf)
    >>> nn = m.get_similar_items()

The 'score' column gives the similarity score of each returned item; its scale depends on the similarity method the model uses.

# m.get_similar_items([1287]).join(items, on={'similar': 'movie_id'}).sort('rank')
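get_similar_items also accepts several item ids at once, and k controls how many neighbours come back per item. A small sketch (the extra movie ids are taken from the data shown above):

# Sketch: 5 nearest neighbours (cosine similarity of latent factors) for three movies.
m.get_similar_items([1287, 1196, 2571], k=5)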

Making recommendations#

recs = m.recommend()
recommendations finished on 1000/6040 queries. users per second: 11685.2
recommendations finished on 2000/6040 queries. users per second: 11654.4
recommendations finished on 3000/6040 queries. users per second: 11658.6
recommendations finished on 4000/6040 queries. users per second: 11321.5
recommendations finished on 5000/6040 queries. users per second: 11502.9
recommendations finished on 6000/6040 queries. users per second: 11105.9
recs
+---------+----------+--------------------+------+
| user_id | movie_id |       score        | rank |
+---------+----------+--------------------+------+
|    1    |   318    | 5.045622686663793  |  1   |
|    1    |   1198   | 4.862424055854009  |  2   |
|    1    |    50    |  4.76625474802606  |  3   |
|    1    |   593    | 4.766107517102884  |  4   |
|    1    |   858    | 4.747795152286218  |  5   |
|    1    |   1196   | 4.689315832028316  |  6   |
|    1    |   2858   | 4.678970253834652  |  7   |
|    1    |   2396   | 4.5986619915758835 |  8   |
|    1    |   110    | 4.588308471063303  |  9   |
|    1    |   2571   | 4.573408636072801  |  10  |
+---------+----------+--------------------+------+
[60400 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
data[data['user_id'] == 4]
+---------+----------+--------+-----------+
| user_id | movie_id | rating | timestamp |
+---------+----------+--------+-----------+
|    4    |   3468   |   5    | 978294008 |
|    4    |   1210   |   3    | 978293924 |
|    4    |   2951   |   4    | 978294282 |
|    4    |   1214   |   4    | 978294260 |
|    4    |   1036   |   4    | 978294282 |
|    4    |   260    |   5    | 978294199 |
|    4    |   2028   |   5    | 978294230 |
|    4    |   480    |   4    | 978294008 |
|    4    |   1196   |   2    | 978294199 |
|    4    |   1198   |   5    | 978294199 |
+---------+----------+--------+-----------+
[? rows x 4 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
# m.recommend(users=[4], k=20).join(items, on='movie_id')
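recommend can also be restricted to particular users; by default it skips movies the user has already rated. A minimal sketch for user 4 (exclude_known is the default and is written out only to make that explicit):

# Sketch: top-20 recommendations for user 4, excluding movies user 4 has already rated.
m.recommend(users=[4], k=20, exclude_known=True)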

Recommendations for new users#

recent_data = tc.SFrame()
recent_data['movie_id'] = [30, 1000, 900, 883, 251, 200, 199, 180, 120, 991, 1212] 
recent_data['user_id'] = 99999
recent_data['rating'] = [2, 1, 3, 4, 0, 0, 1, 1, 1, 2, 3]
recent_data
+----------+---------+--------+
| movie_id | user_id | rating |
+----------+---------+--------+
|    30    |  99999  |   2    |
|   1000   |  99999  |   1    |
|   900    |  99999  |   3    |
|   883    |  99999  |   4    |
|   251    |  99999  |   0    |
|   200    |  99999  |   0    |
|   199    |  99999  |   1    |
|   180    |  99999  |   1    |
|   120    |  99999  |   1    |
|   991    |  99999  |   2    |
+----------+---------+--------+
[11 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
m2.recommend(users=[99999], new_observation_data=recent_data)#.join(items, on='movie_id').sort('rank')
+---------+----------+-------+------+
| user_id | movie_id | score | rank |
+---------+----------+-------+------+
|  99999  |   3881   |  5.0  |  1   |
|  99999  |   3607   |  5.0  |  2   |
|  99999  |   1830   |  5.0  |  3   |
|  99999  |   989    |  5.0  |  4   |
|  99999  |   3172   |  5.0  |  5   |
|  99999  |   3233   |  5.0  |  6   |
|  99999  |   787    |  5.0  |  7   |
|  99999  |   3382   |  5.0  |  8   |
|  99999  |   3656   |  5.0  |  9   |
|  99999  |   3280   |  5.0  |  10  |
+---------+----------+-------+------+
[10 rows x 4 columns]
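The same new_observation_data mechanism works with the factorization model as well; a hedged sketch, not run in the original notebook:

# Sketch: scoring the same new user 99999 with the factorization model m.
m.recommend(users=[99999], new_observation_data=recent_data, k=5)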

Saving and loading models#

m.save('my_model')
m_again = tc.load_model('my_model')
m_again