
Forecasting and nowcasting with Google Flu Trends#

Rather than predicting the future, nowcasting attempts to use ideas from forecasting to measure the current state of the world; it attempts to “predict the present” (Choi and Varian 2012). Nowcasting has the potential to be especially useful to governments and companies that require timely and accurate measures of the world.

Source: https://www.bitbybitbook.com/en/1st-ed/observing-behavior/strategies/forecasting/

Predicting the Present with Google Trends#

Hyunyoung Choi and Hal Varian

Google, Inc., California, USA

In this paper we show how to use search engine data to forecast near-term values of economic indicators. Examples include automobile sales, unemployment claims, travel destination planning and consumer confidence.

Choi, Hyunyoung, and Hal Varian. 2012. “Predicting the Present with Google Trends.” Economic Record 88 (June):2–9. https://doi.org/10.1111/j.1475-4932.2012.00809.x.
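Their core specification is simple: fit a seasonal autoregressive baseline, then the same model augmented with a Google Trends regressor, so the value of the search data shows up as the improvement in fit. The sketch below illustrates the idea on simulated monthly data; all series and names here are hypothetical placeholders, not the paper's actual data.

# A minimal sketch of the Choi-Varian "predict the present" setup:
# a seasonal AR baseline vs. the same model augmented with a Google
# Trends regressor. The data below are simulated placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 120
trends = rng.normal(size=n)                     # hypothetical Trends index
sales = np.zeros(n)
for t in range(12, n):
    # AR(1) + seasonal lag-12 dynamics plus a Trends signal
    sales[t] = 0.6 * sales[t-1] + 0.3 * sales[t-12] + 0.2 * trends[t] \
               + rng.normal(scale=0.1)

df_cv = pd.DataFrame({'y': sales, 'trends': trends})
df_cv['y_lag1'] = df_cv['y'].shift(1)
df_cv['y_lag12'] = df_cv['y'].shift(12)
df_cv = df_cv.dropna()

baseline = smf.ols('y ~ y_lag1 + y_lag12', data=df_cv).fit()
augmented = smf.ols('y ~ y_lag1 + y_lag12 + trends', data=df_cv).fit()
print(baseline.rsquared, augmented.rsquared)    # the Trends term should improve fit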

Detecting influenza epidemics using search engine query data#

  • Jeremy Ginsberg et al. (2009) Detecting influenza epidemics using search engine query data. Nature. 457, pp:1012–1014 https://www.nature.com/articles/nature07634#Ack1

  • Google Query Data https://static-content.springer.com/esm/art%3A10.1038%2Fnature07634/MediaObjects/41586_2009_BFnature07634_MOESM271_ESM.xls Query fractions for the top 100 search queries, sorted by mean Z-transformed correlation with CDC-provided ILI percentages across the nine regions of the United States. (XLS 5264 kb)

  • CDC’s ILI Data. We use the weighted version of CDC’s ILI activity level as the estimation target (available at gis.cdc.gov/grasp/fluview/fluportaldashboard.html). The weekly revisions of CDC’s ILI are available at the CDC website for all recorded seasons (from week 40 of a given year to week 20 of the subsequent year). Click Download Data to get the data.


For example, ILI report revision at week 50 of season 2012–2013 is available at www.cdc.gov/flu/weekly/weeklyarchives2012-2013/data/senAllregt50.htm; ILI report revision at week 9 of season 2014–2015 is available at www.cdc.gov/flu/weekly/weeklyarchives2014-2015/data/senAllregt09.html.
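Once downloaded, the weighted ILI series can be read with pandas. A minimal sketch follows; the filename ILINet.csv and the column names (REGION TYPE, % WEIGHTED ILI) are assumptions based on typical FluView exports and may need adjusting for a given season.

# Hedged sketch: load a FluView export and plot the weighted ILI series.
# The filename and column names below are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

ili = pd.read_csv('ILINet.csv', skiprows=1)          # some exports carry a one-line title
ili = ili[ili['REGION TYPE'] == 'National']          # keep the national series
ili['% WEIGHTED ILI'] = pd.to_numeric(ili['% WEIGHTED ILI'], errors='coerce')
plt.plot(ili['% WEIGHTED ILI'].to_numpy())
plt.ylabel('% weighted ILI')
plt.show()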

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import ElasticNetCV

# Each sheet of the Nature supplement holds one query's weekly fraction
# by US region; sheet 1 is the top-ranked query. Row 1 is the header.
df = pd.read_excel('41586_2009_BFnature07634_MOESM271_ESM.xls', sheet_name=1, header=1)
df.head()
Date United States New England Region Mid-Atlantic Region East North Central Region West North Central Region South Atlantic Region East South Central Region West South Central Region Mountain Region Pacific Region
0 2003-06-01 0.778 0.979 0.990 0.838 0.673 0.732 0.922 0.809 0.486 0.621
1 2003-06-08 0.850 0.932 0.806 0.879 0.839 0.852 0.479 0.800 1.383 1.008
2 2003-06-15 0.838 1.018 0.892 0.839 0.568 0.751 1.130 1.111 0.702 0.777
3 2003-06-22 0.828 0.615 1.149 0.676 0.730 0.867 1.149 0.498 0.936 0.734
4 2003-06-29 0.747 0.896 0.768 0.829 0.612 0.530 0.854 0.491 1.081 1.015
plt.plot(df['Date'], df['United States']);
plt.plot(df['Date'], df['Mid-Atlantic Region']);

Figure 1: An evaluation of how many top-scoring queries to include in the ILI-related query fraction.


Maximal performance at estimating out-of-sample points during cross-validation was obtained by summing the top 45 search queries. A steep drop in model performance occurs after adding query 81, which is ‘oscar nominations’.

# Combine the top 45 queries into one data frame (one sheet per query).
# Note: Ginsberg et al. summed these fractions into a single variable;
# here we keep 45 separate columns so a penalized regression can weight
# each query individually.
queries = {'date': df['Date'].tolist()}
for i in range(1, 46):
    df = pd.read_excel('41586_2009_BFnature07634_MOESM271_ESM.xls', sheet_name=i, header=1)
    queries['query' + str(i)] = df['United States'].tolist()
dat = pd.DataFrame.from_dict(queries)
dat.head()
date query1 query2 query3 query4 query5 query6 query7 query8 query9 ... query36 query37 query38 query39 query40 query41 query42 query43 query44 query45
0 2003-06-01 0.778 5.297 6.096 0.893 1.036 1.357 0.124 0.366 0.675 ... 0.483 0.131 0.633 0.173 0.241 0.848 0.138 0.190 2.027 2.133
1 2003-06-08 0.850 5.348 6.097 1.005 0.899 1.584 0.096 0.432 0.574 ... 0.527 0.178 0.716 0.227 0.257 1.153 0.145 0.164 2.245 2.335
2 2003-06-15 0.838 4.961 5.772 0.868 0.811 1.515 0.084 0.392 0.563 ... 0.509 0.156 0.760 0.213 0.234 1.123 0.138 0.213 2.338 2.311
3 2003-06-22 0.828 4.480 5.140 0.733 0.883 1.942 0.052 0.326 0.478 ... 0.417 0.128 0.715 0.198 0.192 1.220 0.146 0.198 2.231 2.237
4 2003-06-29 0.747 3.910 4.409 0.637 0.726 1.580 0.049 0.352 0.364 ... 0.429 0.080 0.671 0.150 0.153 1.114 0.138 0.165 2.005 2.085

5 rows × 46 columns

The Parable of Google Flu: Traps in Big Data Analysis#

David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani

Science 14 Mar 2014: Vol. 343, Issue 6176, pp. 1203-1205 DOI: 10.1126/science.1248506

In February 2013, Google Flu Trends (GFT) made headlines but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2). This happened despite the fact that GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data (3, 4), what lessons can we draw from this error?

Source: https://science.sciencemag.org/content/343/6176/1203.summary

Data & Code

https://science.sciencemag.org/content/sci/suppl/2014/03/12/343.6176.1203.DC1/1248506.Lazer.SM.revision1.pdf

Lazer, David; Kennedy, Ryan; King, Gary; Vespignani, Alessandro, 2014, “Replication data for: The Parable of Google Flu: Traps in Big Data Analysis”, https://doi.org/10.7910/DVN/24823, Harvard Dataverse

# Merge in the nowcasting target: 'cflu' is the CDC % ILI series from
# the Lazer et al. replication data
dat2 = pd.read_csv('../GFT2.0/parable/ParableOfGFT(Replication).csv')
dat3 = dat2[['date', 'cflu']]
data = pd.merge(dat, dat3, how='right', on='date')
data.head()
date query1 query2 query3 query4 query5 query6 query7 query8 query9 ... query37 query38 query39 query40 query41 query42 query43 query44 query45 cflu
0 2003-09-28 1.853 6.679 7.824 1.072 2.399 1.623 0.162 0.606 0.731 ... 0.224 0.741 0.324 0.496 1.267 0.294 0.204 2.097 2.905 0.884021
1 2003-10-05 1.976 6.310 8.259 1.194 2.733 1.589 0.167 0.607 0.662 ... 0.210 0.852 0.295 0.442 1.329 0.322 0.225 2.233 2.713 1.027731
2 2003-10-12 2.834 6.911 9.009 1.228 3.304 1.581 0.221 0.664 0.783 ... 0.259 0.890 0.306 0.413 1.392 0.353 0.201 2.305 2.874 1.282964
3 2003-10-19 3.501 7.492 9.611 1.291 3.846 1.619 0.326 0.698 0.841 ... 0.260 0.900 0.352 0.450 1.357 0.383 0.255 2.279 2.965 1.326605
4 2003-10-26 3.721 7.121 9.352 1.309 3.876 1.640 0.288 0.674 0.762 ... 0.260 0.950 0.293 0.393 1.367 0.319 0.271 2.317 2.986 1.773040

5 rows × 47 columns

# data.to_csv('gft_ili_us.csv', index=False)
# Keep only weeks with query data and convert the date column for plotting
data = data[data['query1'].notna()]
data['date'] = pd.to_datetime(data['date'])
data['date'] = data['date'].dt.date
plt.plot(data['date'], data['query1'], label = 'query1')
plt.plot(data['date'], data['query2'], label = 'query2')
plt.plot(data['date'], data['query3'], label = 'query3')
plt.plot(data['date'], data['cflu'],  label = 'CDC ILI')
plt.legend()
plt.show()
_images/c13bb5f0c00a6b555b3c05594c7b520c16f77ab9c7af7f0742fc2174bdb0b090.png

Using this ILI-related query fraction as the explanatory variable, we fit a final linear model to weekly ILI percentages between 2003 and 2007 for all nine regions together, thus obtaining a single, region-independent coefficient. The model was able to obtain a good fit with CDC-reported ILI percentages, with a mean correlation of 0.90 (min = 0.80, max = 0.96, n = 9 regions; Fig. 2).
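The model in Ginsberg et al. (2009) is a linear fit on the log-odds scale, logit(P) = β0 + β1·logit(Q) + ε, where P is the percentage of ILI-related physician visits and Q is the ILI-related query fraction. A minimal sketch of that specification on the merged data follows; it is illustrative only, assumes both series are expressed as percentages, and uses query1 in place of the paper's summed top-45 query fraction.

# Sketch of the Ginsberg et al. specification: regress logit(P) on logit(Q).
# Assumes 'cflu' and 'query1' are percentages; query1 stands in for the
# paper's summed top-45 query fraction.
import numpy as np
import statsmodels.api as sm

def logit(p):
    return np.log(p / (1 - p))

P = logit(data['cflu'] / 100)       # CDC % ILI as a fraction
Q = logit(data['query1'] / 100)     # query fraction as a fraction
gft_fit = sm.OLS(P, sm.add_constant(Q)).fit()
print(gft_fit.params)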

Figure 2: A comparison of model estimates for the mid-Atlantic region (black) against CDC-reported ILI percentages (red), including points over which the model was fit and validated.


A correlation of 0.85 was obtained over 128 points from this region to which the model was fit, whereas a correlation of 0.96 was obtained over 42 validation points. Dotted lines indicate 95% prediction intervals. The region comprises New York, New Jersey and Pennsylvania.

Figure 3: ILI percentages estimated by our model (black) and provided by the CDC (red) in the mid-Atlantic region, showing data available at four points in the 2007-2008 influenza season.


# Add one to seven weeks of lagged CDC ILI as autoregressive features
# (in the spirit of ARGO, which combines search data with recent ILI values)
for i in range(1, 8):
    data["lag_{}".format(i)] = data['cflu'].shift(i)
print("done")
data = data.fillna(0)
done
y = data['cflu']
date = data['date']
X = data.drop(['cflu', 'date'], axis = 1)
len(y)
242
y
0      0.884021
1      1.027731
2      1.282964
3      1.326605
4      1.773040
         ...   
237    1.165197
238    1.020348
239    0.877607
240    0.825107
241    0.787315
Name: cflu, Length: 242, dtype: float64
X
query1 query2 query3 query4 query5 query6 query7 query8 query9 query10 ... query43 query44 query45 lag_1 lag_2 lag_3 lag_4 lag_5 lag_6 lag_7
0 1.853 6.679 7.824 1.072 2.399 1.623 0.162 0.606 0.731 0.145 ... 0.204 2.097 2.905 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 1.976 6.310 8.259 1.194 2.733 1.589 0.167 0.607 0.662 0.185 ... 0.225 2.233 2.713 0.884021 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 2.834 6.911 9.009 1.228 3.304 1.581 0.221 0.664 0.783 0.259 ... 0.201 2.305 2.874 1.027731 0.884021 0.000000 0.000000 0.000000 0.000000 0.000000
3 3.501 7.492 9.611 1.291 3.846 1.619 0.326 0.698 0.841 0.312 ... 0.255 2.279 2.965 1.282964 1.027731 0.884021 0.000000 0.000000 0.000000 0.000000
4 3.721 7.121 9.352 1.309 3.876 1.640 0.288 0.674 0.762 0.321 ... 0.271 2.317 2.986 1.326605 1.282964 1.027731 0.884021 0.000000 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
237 2.683 8.227 11.683 1.636 2.690 2.322 0.241 0.858 0.933 0.292 ... 0.281 2.707 1.569 1.311296 1.652751 2.033656 2.692446 3.215218 3.832486 4.532340
238 2.290 7.648 10.502 1.435 2.359 2.239 0.165 0.769 0.850 0.225 ... 0.254 2.568 1.510 1.165197 1.311296 1.652751 2.033656 2.692446 3.215218 3.832486
239 1.766 7.375 10.081 1.258 2.031 2.194 0.142 0.735 0.775 0.214 ... 0.237 2.480 1.468 1.020348 1.165197 1.311296 1.652751 2.033656 2.692446 3.215218
240 1.446 7.132 9.531 1.097 1.930 2.190 0.114 0.654 0.657 0.188 ... 0.269 2.436 1.378 0.877607 1.020348 1.165197 1.311296 1.652751 2.033656 2.692446
241 1.317 6.807 9.315 1.000 1.752 2.551 0.126 0.619 0.672 0.168 ... 0.259 2.470 1.423 0.825107 0.877607 1.020348 1.165197 1.311296 1.652751 2.033656

242 rows × 52 columns

# Temporal split: first 50 weeks for training, the rest for testing
N = 50
X_train = X.iloc[:N,]
X_test = X.iloc[N:,]
y_train = y[:N]
y_test = y[N:]

# Fit an elastic net with cross-validated regularization.
# Note: the `normalize` argument was deprecated in scikit-learn 1.0 and
# removed in 1.2; on recent versions, standardize the features in a
# pipeline instead (see the sketch below).
cv_model = ElasticNetCV(l1_ratio=0.5, eps=1e-3, n_alphas=200, fit_intercept=True, 
                        normalize=True, precompute='auto', max_iter=200, tol=0.006, cv=10, 
                        copy_X=True, verbose=0, n_jobs=-1, positive=False, random_state=0)

# Train the model
cv_model.fit(X_train, y_train)

# Report the selected alpha, l1_ratio, and number of iterations
print('Best alpha: %.8f' % cv_model.alpha_)
print('Best l1_ratio: %.3f' % cv_model.l1_ratio_)
print('Iterations: %d' % cv_model.n_iter_)
Best alpha: 0.06455398
Best l1_ratio: 0.500
Iterations: 63
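If you are on scikit-learn 1.2 or later, where `normalize` no longer exists, a rough equivalent standardizes the features in a pipeline. This is a sketch, not a byte-for-byte reproduction of the run above, since the scaling details differ.

# Sketch for recent scikit-learn (>= 1.2): scale features in a pipeline
# instead of passing normalize=True; results may differ slightly.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=0.5, n_alphas=200, max_iter=1000, cv=10,
                 n_jobs=-1, random_state=0),
)
pipe.fit(X_train, y_train)
print('Best alpha: %.8f' % pipe.named_steps['elasticnetcv'].alpha_)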
# Evaluate on the training and test sets
y_train_pred = cv_model.predict(X_train)
y_pred = cv_model.predict(X_test)
# r2_score and mean_squared_error expect (y_true, y_pred)
print('Train r2 score: ', r2_score(y_train, y_train_pred))
print('Test r2 score: ', r2_score(y_test, y_pred))
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_pred)
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
print('Train RMSE: %.4f' % train_rmse)
print('Test RMSE: %.4f' % test_rmse)
Train r2 score:  0.899526504124001
Test r2 score:  0.8509986476456004
Train RMSE: 0.4792
Test RMSE: 0.4345
plt.style.use('ggplot')
plt.rcParams.update({'figure.figsize': (15, 5)})

# Observed CDC ILI over the full period versus out-of-sample predictions
plt.plot(date, y, label='CDC ILI')
plt.plot(date[N:], y_pred, label='Predicted')
plt.legend()
plt.show()

However, this apparent success story eventually turned into an embarrassment.

  1. Google Flu Trends with all its data, machine learning, and powerful computing did not dramatically outperform a simple and easier-to-understand heuristic. This suggests that when evaluating any forecast or nowcast, it is important to compare against a baseline.

  2. Its ability to predict the CDC flu data was prone to short-term failure and long-term decay because of drift and algorithmic confounding.

These two caveats complicate future nowcasting efforts, but they do not doom them. In fact, by using more careful methods, Lazer et al. (2014) and Yang, Santillana, and Kou (2015) were able to avoid these two problems.
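Point 1 is easy to check directly: a naive heuristic that simply carries last week's CDC value forward is a surprisingly strong baseline, and any nowcast should beat it. A quick check using the split from the first example above:

# Naive lag-1 baseline: predict this week's ILI with last week's value.
# Compare its RMSE against the elastic net's test RMSE above.
baseline_pred = y.shift(1).iloc[N:]
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
print('Naive lag-1 baseline RMSE: %.4f' % baseline_rmse)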

References#

  • Goel, Sharad, Jake M. Hofman, Sébastien Lahaie, David M. Pennock, and Duncan J. Watts. 2010. “Predicting Consumer Behavior with Web Search.” Proceedings of the National Academy of Sciences of the USA 107 (41):17486–90. https://doi.org/10.1073/pnas.1005962107.

  • Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176):1203–5. https://doi.org/10.1126/science.1248506.

  • Yang, Shihao, Mauricio Santillana, and S. C. Kou. 2015. “Accurate Estimation of Influenza Epidemics Using Google Search Data via ARGO.” Proceedings of the National Academy of Sciences of the USA 112 (47):14473–8. https://doi.org/10.1073/pnas.1515373112.

Learning by Doing#

https://github.com/JEstebanMejiaV/The.Analytics.Edge/blob/352d59a27d2c376f268b1dbdf838e9ee77989d36/Unit 2 - Linear Regression/Detecting Flu Epidemics via Search Engine Query Data.ipynb

# Weekly CDC ILI percentages and an aggregate Google query fraction,
# 2004-2011 (training data from the exercise linked above)
dat = pd.read_csv('FluTrain.csv')
dat.head()
Week ILI Queries
0 2004-01-04 - 2004-01-10 2.418331 0.237716
1 2004-01-11 - 2004-01-17 1.809056 0.220452
2 2004-01-18 - 2004-01-24 1.712024 0.225764
3 2004-01-25 - 2004-01-31 1.542495 0.237716
4 2004-02-01 - 2004-02-07 1.437868 0.224436
dat['Week']
0      2004-01-04 - 2004-01-10
1      2004-01-11 - 2004-01-17
2      2004-01-18 - 2004-01-24
3      2004-01-25 - 2004-01-31
4      2004-02-01 - 2004-02-07
                ...           
412    2011-11-27 - 2011-12-03
413    2011-12-04 - 2011-12-10
414    2011-12-11 - 2011-12-17
415    2011-12-18 - 2011-12-24
416    2011-12-25 - 2011-12-31
Name: Week, Length: 417, dtype: object
# Again add one to seven weeks of lagged ILI as autoregressive features
for i in range(1, 8):
    dat["lag_{}".format(i)] = dat['ILI'].shift(i)
print("done")
dat = dat.fillna(0)
done
y = dat['ILI']
# Each 'Week' value is a range like '2004-01-04 - 2004-01-10';
# keep the first date as the week's timestamp
week = pd.to_datetime(dat['Week'].str[:10])
X = dat.drop(['ILI', 'Week'], axis=1)
y
0      2.418331
1      1.809056
2      1.712024
3      1.542495
4      1.437868
         ...   
412    1.465723
413    1.518106
414    1.663954
415    1.852736
416    2.124130
Name: ILI, Length: 417, dtype: float64
X
Queries lag_1 lag_2 lag_3 lag_4 lag_5 lag_6 lag_7
0 0.237716 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.220452 2.418331 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 0.225764 1.809056 2.418331 0.000000 0.000000 0.000000 0.000000 0.000000
3 0.237716 1.712024 1.809056 2.418331 0.000000 0.000000 0.000000 0.000000
4 0.224436 1.542495 1.712024 1.809056 2.418331 0.000000 0.000000 0.000000
... ... ... ... ... ... ... ... ...
412 0.478088 1.655415 1.462212 1.440892 1.452843 1.305461 1.252586 1.236957
413 0.464807 1.465723 1.655415 1.462212 1.440892 1.452843 1.305461 1.252586
414 0.479416 1.518106 1.465723 1.655415 1.462212 1.440892 1.452843 1.305461
415 0.537849 1.663954 1.518106 1.465723 1.655415 1.462212 1.440892 1.452843
416 0.618858 1.852736 1.663954 1.518106 1.465723 1.655415 1.462212 1.440892

417 rows × 8 columns

# Temporal split: first 100 weeks for training, the rest for testing
N = 100
X_train = X.iloc[:N,]
X_test = X.iloc[N:,]
y_train = y[:N]
y_test = y[N:]

# Elastic net with cross-validated regularization (same caveat about
# `normalize` as above)
cv_model = ElasticNetCV(l1_ratio=0.5, eps=1e-3, n_alphas=200, fit_intercept=True, 
                        normalize=True, precompute='auto', max_iter=200, tol=0.006, cv=10, 
                        copy_X=True, verbose=0, n_jobs=-1, positive=False, random_state=0)

# Train the model
cv_model.fit(X_train, y_train)

# Report the selected alpha, l1_ratio, and number of iterations
print('Best alpha: %.8f' % cv_model.alpha_)
print('Best l1_ratio: %.3f' % cv_model.l1_ratio_)
print('Iterations: %d' % cv_model.n_iter_)
Best alpha: 0.00034012
Best l1_ratio: 0.500
Iterations: 15
# Evaluate on the training and test sets
y_train_pred = cv_model.predict(X_train)
y_pred = cv_model.predict(X_test)
# r2_score and mean_squared_error expect (y_true, y_pred)
print('Train r2 score: ', r2_score(y_train, y_train_pred))
print('Test r2 score: ', r2_score(y_test, y_pred))
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_pred)
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
print('Train RMSE: %.4f' % train_rmse)
print('Test RMSE: %.4f' % test_rmse)
Train r2 score:  0.8704190473192481
Test r2 score:  0.9216895094890315
Train RMSE: 0.2868
Test RMSE: 0.3293
# Observed ILI over the full period versus out-of-sample predictions
plt.plot(week, y, label='CDC ILI')
plt.plot(week[N:], y_pred, label='Predicted')
plt.legend()
plt.show()


By Cheng-Jun Wang

© Copyright 2022.