Pantheon Project

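An online tool created by César Hidalgo.

  • Hidalgo is now a professor at the MIT Media Lab.

    • He has said: "Truly famous people are also well known outside their own fields."

  • A person is as famous as the number of languages in which their Wikipedia page is available.

To be included in Pantheon, a person's fame must cross national and linguistic boundaries: their biography must appear in at least 25 Wikipedia language editions.

This single requirement narrows the field from all minor or lesser-known celebrities down to 11,341 people, each distinctive and charismatic.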
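The inclusion rule amounts to a simple threshold on the number of Wikipedia language editions. A minimal sketch on toy data follows; the column name `n_languages` and the three rows are invented for illustration and are not taken from the Pantheon files.

```python
import pandas as pd

MIN_LANGUAGES = 25  # inclusion threshold quoted in the text

# Toy records; names and counts are invented for illustration only.
people = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C"],
    "n_languages": [193, 24, 25],
})

# Keep only biographies present in at least MIN_LANGUAGES editions.
included = people[people["n_languages"] >= MIN_LANGUAGES]
print(included["name"].tolist())
```

The comparison is inclusive, so a biography with exactly 25 editions passes the filter.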

Yu, A. Z., et al. (2016). Pantheon 1.0, a manually verified dataset of globally famous biographies. Scientific Data 3:150075. doi: 10.1038/sdata.2015.75

https://pantheon.world/data/datasets

  • pantheon.tsv

A tab delimited file containing a row of data per person found in the Pantheon 1.0 dataset.

  • wikilangs.tsv

A tab delimited file of all the different Wikipedia language editions that each biography has a presence in.

  • pageviews_2008-2013.tsv

A file containing the monthly pageview data for each individual, for all the Wikipedia language editions in which they have a presence.

Please refer to the methods section for more information on how this data was created. For detailed descriptions of these datasets, please refer to our data descriptor paper.
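Because these files are tab-delimited, `pandas.read_csv` needs `sep="\t"` to parse them. The sketch below reads from an in-memory string so that it runs without the download; the helper name `load_tsv` and the two columns are placeholders, not the real schema of `pantheon.tsv`.

```python
import io
import pandas as pd

# Stand-in for the contents of a tab-delimited file such as pantheon.tsv.
# The column names here are placeholders, not the dataset's actual schema.
sample_tsv = "name\tlanguages\nPerson A\t193\nPerson B\t40\n"

def load_tsv(source):
    """Read a tab-delimited file (path or file-like object) into a DataFrame."""
    return pd.read_csv(source, sep="\t")

people = load_tsv(io.StringIO(sample_tsv))
print(people.shape)
```

With a downloaded copy, the call would presumably be `load_tsv('pantheon.tsv')`, assuming the file sits in the working directory.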

  • Jara-Figueroa, C., Yu, A.Z. and Hidalgo, C.A., 2015. The medium is the memory: how communication technologies shape what we remember. arXiv preprint arXiv:1512.05020.

  • Yu, A.Z., Ronen, S., Hu, K., Lu, T. and Hidalgo, C.A., 2016. Pantheon 1.0, a manually verified dataset of globally famous biographies. Scientific data, 3.

  • Ronen, S., Gonçalves, B., Hu, K.Z., Vespignani, A., Pinker, S. and Hidalgo, C.A., 2014. Links that speak: The global language network and its association with global fame. Proceedings of the National Academy of Sciences, 111(52), pp.E5616-E5622.

  • Cesar A. Hidalgo and Ali Almossawi. “The Data-Visualization Revolution.” Scientific American. March 2014.

  • Hidalgo, C. A. “The Last 20 Inches: Data’s Treacherous Journey from the Screen to the Mind.” MIT Technology Review. March 2014.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('./data/person_2020_update.csv',low_memory=False)
df.head()
id wd_id wp_id slug name occupation prob_ratio gender twitter alive ... deathdate deathyear bplace_geacron_name dplace_geacron_name is_group l_ age non_en_page_views coefficient_of_variation hpi
0 18934 Q9458 18934 Muhammad Muhammad RELIGIOUS FIGURE 0.0 M NaN False ... 0632-06-08 632.0 Mecca NaN False 27.918400 1450.0 5160422.0 3.199355 100.000000
1 17414699 Q720 17414699 Genghis_Khan Genghis Khan MILITARY PERSONNEL 0.0 M NaN False ... 1227-08-18 1227.0 NaN NaN False 25.843621 858.0 3249211.0 2.753641 97.723669
2 18079 Q762 18079 Leonardo_da_Vinci Leonardo da Vinci INVENTOR 0.0 M NaN False ... 1519-05-02 1519.0 NaN NaN False 17.545406 568.0 5362406.0 4.796629 97.460691
3 14627 Q935 14627 Isaac_Newton Isaac Newton PHYSICIST 0.0 M NaN False ... 1727-03-31 1726.0 NaN NaN False 21.608920 378.0 3431331.0 4.632474 96.836567
4 17914 Q255 17914 Ludwig_van_Beethoven Ludwig van Beethoven COMPOSER 0.0 M NaN False ... 1827-03-26 1827.0 NaN Austria False 19.796430 250.0 5179518.0 3.926626 96.583969

5 rows × 34 columns

len(df)
88937
df.iloc[0]
id                                         18934
wd_id                                      Q9458
wp_id                                      18934
slug                                    Muhammad
name                                    Muhammad
occupation                      RELIGIOUS FIGURE
prob_ratio                              0.000000
gender                                         M
twitter                                      NaN
alive                                      False
l                                            193
hpi_raw                                       36
bplace_name                                Mecca
bplace_lat                             21.416667
bplace_lon                             39.816667
bplace_geonameid                    21021.000000
bplace_country                      Saudi Arabia
birthdate                                    NaN
birthyear                             570.000000
dplace_name                               Medina
dplace_lat                             24.466667
dplace_lon                             39.600000
dplace_geonameid                    36636.000000
dplace_country                      Saudi Arabia
deathdate                             0632-06-08
deathyear                             632.000000
bplace_geacron_name                        Mecca
dplace_geacron_name                          NaN
is_group                                   False
l_                                     27.918400
age                                  1450.000000
non_en_page_views                 5160422.000000
coefficient_of_variation                3.199355
hpi                                   100.000000
age_group                   (1003.333, 2006.667]
Name: 0, dtype: object
df.columns
Index(['id', 'wd_id', 'wp_id', 'slug', 'name', 'occupation', 'prob_ratio',
       'gender', 'twitter', 'alive', 'l', 'hpi_raw', 'bplace_name',
       'bplace_lat', 'bplace_lon', 'bplace_geonameid', 'bplace_country',
       'birthdate', 'birthyear', 'dplace_name', 'dplace_lat', 'dplace_lon',
       'dplace_geonameid', 'dplace_country', 'deathdate', 'deathyear',
       'bplace_geacron_name', 'dplace_geacron_name', 'is_group', 'l_', 'age',
       'non_en_page_views', 'coefficient_of_variation', 'hpi'],
      dtype='object')

The HPI (Historical Popularity Index) combines five quantities: the number of languages L, the effective number of languages L*, the age of the historical character A, the number of PageViews in non-English Wikipedias v_NE (calculated in 2016), and the coefficient of variation of PageViews in all languages, CV (also calculated in 2016).

https://pantheon.world/about/methods
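One way these five ingredients can be combined is sketched below. It assumes the form given in the Pantheon 1.0 data descriptor (Yu et al., 2016), namely HPI = ln L + ln L* + log4 A + ln v_NE - ln CV, with an extra penalty of (70 - A)/7 when A < 70, and L* taken as the exponential of the Shannon entropy of pageview shares across languages. Treat all of this as an assumption to check against the methods page: the function names are illustrative, and the `hpi` column in the 2020 file is rescaled and may be computed differently.

```python
import math

def effective_languages(pageviews_by_language):
    """Effective number of languages: exp of the Shannon entropy of pageview shares."""
    total = sum(pageviews_by_language)
    shares = [v / total for v in pageviews_by_language if v > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return math.exp(entropy)

def hpi_sketch(L, L_star, A, v_ne, cv):
    """Unnormalised HPI under the assumed Pantheon 1.0 formula (not the 2020 score)."""
    score = (
        math.log(L)          # number of language editions
        + math.log(L_star)   # effective number of languages
        + math.log(A, 4)     # age enters with a base-4 logarithm
        + math.log(v_ne)     # non-English pageviews
        - math.log(cv)       # bursty attention (high CV) lowers the score
    )
    if A < 70:
        # Assumed penalty for figures younger than 70 years.
        score -= (70 - A) / 7
    return score

# Pageviews spread evenly over four languages behave like four "effective" languages.
print(effective_languages([10, 10, 10, 10]))
```

Under this form, pageviews concentrated in a single language give L* close to 1, so a biography with many thin translations gains little from L* even when L is large.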

plt.style.use('ggplot')
plt.plot(df['l_'], df['hpi'], 'o')
plt.xscale('log')
plt.xlabel('l_ (log scale)')
plt.ylabel('hpi')
plt.show()
_images/ed02725947b6291c6b13ece3a52dae6e0ea9570642e452a9403604193d0dc8c9.png
plt.style.use('ggplot')
plt.plot(df['non_en_page_views'], df['hpi'], 'o')
plt.xscale('log')
plt.xlabel('non_en_page_views (log scale)')
plt.ylabel('hpi')
plt.show()
_images/c021453413eed39a61ec37a81d665de7e3d02f9262fa0e5ba160384501e83fa5.png
plt.style.use('ggplot')
plt.plot(df['l'], df['hpi'], 'o')
plt.xlabel('l (number of languages)')
plt.ylabel('hpi')
plt.show()
_images/5a018c5c381b576990cc7d75af7115819a1c1ab63ed520a1e35b97a66e0c9ec8.png
plt.hist(df['hpi']);
_images/f9d52bfc3d6d0490a88c00eb3c2071f3517aebd2d55760565a0e81e22c24b909.png
plt.hist(df['non_en_page_views']);
_images/67daf59b341fc7b88f46c7e6145bcfa192c0664274e5d119ea36f4831c5b1592.png
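If `non_en_page_views` is heavily right-skewed, as page-view counts often are, a histogram with equal-width bins puts almost everything in the first bar. A minimal sketch of log-spaced bins follows; it uses synthetic log-normal data as a stand-in for the real column, so the numbers themselves mean nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed counts standing in for df['non_en_page_views'].
views = rng.lognormal(mean=10, sigma=2, size=1000)

# 20 bins whose edges are evenly spaced on a log scale over the observed range.
bins = np.geomspace(views.min(), views.max(), num=21)
counts, edges = np.histogram(views, bins=bins)
print(counts)
```

The same `bins` array can be passed to `plt.hist(views, bins=bins)` together with `plt.xscale('log')`. Zero or missing values would need to be dropped first, since they have no place on a log axis.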
sns.pointplot(x='gender', y = 'hpi', data = df, color = 'blue', linestyles='');
_images/765711a99ef44ae04d6529e101ea6d0dceac7c04a60c864beb6f917467b70a8e.png
sns.pointplot(x='gender', y = 'hpi', ci = 'sd', data = df, color = 'blue', linestyles='');
_images/149ef785188bd07e9997bce264ef2d4c4d772b461873d2e40653c6760d5e02c0.png
# Distribution of HPI by gender
sns.boxplot(x="gender", y="hpi",
            palette=["m", "g"],
            data=df);
_images/b0df5396920df34cceba72c2f7b88016fc33adbe0e28772e222d26d9da885853.png
# Bar plot of mean HPI by gender, with standard-deviation error bars
g = sns.catplot(
    data=df, kind="bar",
    x="gender", y="hpi", hue="gender",
    ci="sd", palette="dark", alpha=.6, height=6
)
_images/4918442e0d1eed7c63e8cc07747ee1293d0115401f359d5384957fcfc83baa52.png
df['age_group']=pd.cut(df['age'], bins = 6)
plt.figure(figsize = (16, 5))
sns.pointplot(x="age_group", y="hpi", data=df)
plt.xlabel('Age', fontsize = 16)
plt.ylabel('HPI', fontsize = 16)
plt.show()
_images/1a35872a05742c5b9fb516a2e217bc2ca215ed65d7cf8616d758cd4c20746d06.png
plt.figure(figsize = (16, 5))
sns.pointplot(x="age_group", y="hpi", data=df, hue = 'gender')
plt.xlabel('Age', fontsize = 16)
plt.ylabel('HPI', fontsize = 16)
plt.show()
_images/1ffc0b0907e2ffa7bfb5d8d3a2f75d53ed80891a418cca3f110e68bf7e9fe24a.png
plt.figure(figsize = (16, 5))
sns.pointplot(x="age_group", y="non_en_page_views", data=df, hue = 'gender')
plt.xlabel('Age', fontsize = 16)
plt.ylabel('Non-English page views', fontsize = 16)
plt.show()
_images/14e871e0c442cee17526b2e5cb87973359b962807c8fcb0c0888f6561847ed99.png
plt.figure(figsize = (16, 5))
sns.pointplot(x="age_group", y="l", data=df, hue = 'gender')
plt.xlabel('Age', fontsize = 16)
plt.ylabel('Number of languages (l)', fontsize = 16)
plt.show()
_images/3a599fe69206b53420d3dfeb32e5eda37d660b0777a1a2cbba526d7dc77e32f0.png
plt.figure(figsize = (16, 5))
sns.pointplot(x="age_group", y="l", ci = 'sd', data=df, hue = 'gender')
plt.xlabel('Age', fontsize = 16)
plt.ylabel('Number of languages (l)', fontsize = 16)
plt.show()
_images/94ee85efb058df452b73bf959d721fa184ec74674c8bbb2015e87deb9decd87f.png
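The point plots show one mean per age bin and gender, but not how many people sit behind each point. A small sketch of the matching summary table follows, on invented toy data that only mimics the `age`, `gender` and `hpi` columns. Because `pd.cut` with an integer `bins` makes equal-width intervals, some bins can end up sparse or empty, and the `count` column makes that visible.

```python
import pandas as pd

# Toy data; the values are invented and only mimic the columns used in the plots.
toy = pd.DataFrame({
    "age": [10, 20, 500, 520, 1900, 2000],
    "gender": ["M", "F", "M", "F", "M", "M"],
    "hpi": [50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
})

# Equal-width binning, as in pd.cut(df['age'], bins=6), with 3 bins here.
toy["age_group"] = pd.cut(toy["age"], bins=3)

# Mean HPI and group size for every observed (age_group, gender) cell.
summary = (
    toy.groupby(["age_group", "gender"], observed=True)["hpi"]
       .agg(["mean", "count"])
)
print(summary)
```

On the real data the same call would use `df` in place of `toy`. A mean based on a handful of people deserves less weight than one based on thousands.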
df['occupation'].unique()
array(['RELIGIOUS FIGURE', 'MILITARY PERSONNEL', 'INVENTOR', 'PHYSICIST',
       'COMPOSER', 'PHILOSOPHER', 'POLITICIAN', 'ASTRONOMER', 'EXPLORER',
       'PAINTER', 'WRITER', 'MATHEMATICIAN', 'PSYCHOLOGIST',
       'SOCIAL ACTIVIST', 'BIOLOGIST', 'ECONOMIST', 'NOBLEMAN', 'BOXER',
       'HISTORIAN', 'ACTOR', 'CHEMIST', 'PHYSICIAN', 'SOCIOLOGIST',
       'SOCCER PLAYER', 'OCCULTIST', 'ASTRONAUT', 'COMPANION', 'DESIGNER',
       'MUSICIAN', 'SINGER', 'ARCHITECT', 'INSPIRATION', 'EXTREMIST',
       'FILM DIRECTOR', 'COMPUTER SCIENTIST', 'DIPLOMAT', 'PRODUCER',
       'GEOGRAPHER', 'MAFIOSO', 'PILOT', 'BUSINESSPERSON',
       'FASHION DESIGNER', 'ARTIST', 'CELEBRITY', 'SCULPTOR', 'ENGINEER',
       'DANCER', 'PIRATE', 'LINGUIST', 'LAWYER', 'COMEDIAN',
       'ARCHAEOLOGIST', 'MARTIAL ARTS', 'COMIC ARTIST', 'ATHLETE',
       'COACH', 'RACING DRIVER', 'MAGICIAN', 'GEOLOGIST', 'CONDUCTOR',
       'MOUNTAINEER', 'PUBLIC WORKER', 'CHESS PLAYER', 'JUDGE',
       'JOURNALIST', 'POLITICAL SCIENTIST', 'ANTHROPOLOGIST',
       'PHOTOGRAPHER', 'SWIMMER', 'PORNOGRAPHIC ACTOR', 'CYCLIST',
       'TENNIS PLAYER', 'AMERICAN FOOTBALL PLAYER', 'STATISTICIAN',
       'BASKETBALL PLAYER', 'PRESENTER', 'MODEL', 'CRITIC', 'SKATER',
       'BASEBALL PLAYER', 'GYMNAST', 'WRESTLER', 'REFEREE', 'BULLFIGHTER',
       'GAME DESIGNER', 'YOUTUBER', 'SKIER', 'CHEF', 'GO PLAYER',
       'HOCKEY PLAYER', 'FENCER', 'GOLFER', 'POKER PLAYER',
       'TABLE TENNIS PLAYER', 'SNOOKER', 'CRICKETER', 'HANDBALL PLAYER',
       'VOLLEYBALL PLAYER', 'RUGBY PLAYER', 'GAMER', 'BADMINTON PLAYER'],
      dtype=object)
plt.figure(figsize = (8, 20))

sns.boxplot(x="hpi", y="occupation", data=df,
            whis=[0, 100], width=.6, palette="vlag");
_images/e0974121fe4cf356f84689e57cd298bbc96f596c76f2516d0281c05dadabfbcb.png
plt.figure(figsize = (8, 20))

sns.boxplot(x="hpi", y="occupation", data=df[df['alive']==True],
            whis=[0, 100], width=.6, palette="vlag");
_images/6788852f60ca9ceab048f418f1999099d3b57995deebc11c8e537a92ba2fb386.png
import numpy as np

# Keep only people with known coordinates for both birthplace and deathplace
dat = df[df['bplace_lat'].notna() & df['dplace_lat'].notna()]
len(dat)
36123
# People born in year 0 or earlier
dat0 = dat[dat['birthyear']<=0]
len(dat0)
700
dat0['name']
5            Alexander the Great
6                      Aristotle
8                  Julius Caesar
10                         Plato
11                         Jesus
                  ...           
42995              Prince Vijaya
43349    Seleucus VII Philometor
45324                  Panyassis
46802                     Alexis
54763                Agathokleia
Name: name, Length: 700, dtype: object
import plotly.graph_objects as go

fig = go.Figure()

#'bplace_lat', 'bplace_lon','dplace_lat', 'dplace_lon'

fig.add_trace(go.Scattergeo(
    #locationmode = 'USA-states',
    lon = dat0['bplace_lon'],
    lat = dat0['bplace_lat'],
    hovertext = dat0['name'],
    mode = 'markers',
    marker = dict(
        size = 2,
        color = 'rgb(255, 0, 0)',
        line = dict(
            width = 3,
            color = 'rgba(68, 68, 68, 0)'
        )
    )))


fig.update_layout(
    title_text = 'Pantheon Project',
    showlegend = False,
    geo = dict(
        scope = 'world',
        projection_type = 'natural earth',
        showland = True,
        landcolor = 'rgb(243, 243, 243)',
        countrycolor = 'rgb(204, 204, 204)',

    ),
)

fig.show()
import plotly.graph_objects as go

fig = go.Figure()

#'bplace_lat', 'bplace_lon','dplace_lat', 'dplace_lon'


# One trace per person: a line from birthplace to deathplace
for i in dat0.index:
    fig.add_trace(
        go.Scattergeo(
            lon = [dat0.loc[i, 'bplace_lon'], dat0.loc[i, 'dplace_lon']],
            lat = [dat0.loc[i, 'bplace_lat'], dat0.loc[i, 'dplace_lat']],
            mode = 'lines',
            line = dict(width = 1,color = 'red'),
            opacity = 0.5,
            hovertext = dat0.loc[i, 'name'],  # this person's name, not the whole column
            hoverinfo="text",
        )
    )

fig.update_layout(
    title_text = 'Pantheon Project',
    showlegend = False,
    geo = dict(
        scope = 'world',
        projection_type = 'natural earth',
        showland = True,
        landcolor = 'rgb(243, 243, 243)',
        countrycolor = 'rgb(204, 204, 204)',

    ),
)

fig.show()
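Adding one trace per person works for 700 rows but gets slow as the number of lines grows. A common alternative is a single trace whose coordinate lists contain `None` between segments, which Plotly draws as gaps. The helper below is a sketch of that idea in plain Python; its name and the toy coordinates are made up for illustration.

```python
def segments_to_trace_coords(pairs):
    """Flatten (start, end) coordinate pairs into one list with None gaps."""
    coords = []
    for start, end in pairs:
        # The trailing None tells the plotting library to break the line here.
        coords.extend([start, end, None])
    return coords

# Toy birth/death longitudes and latitudes (invented values).
lon_pairs = [(39.8, 39.6), (12.5, 28.9)]
lat_pairs = [(21.4, 24.5), (41.9, 41.0)]

lons = segments_to_trace_coords(lon_pairs)
lats = segments_to_trace_coords(lat_pairs)
print(lons)
```

With the real data, the pairs could come from `zip(dat0['bplace_lon'], dat0['dplace_lon'])` and the result be passed to one `go.Scattergeo(lon=lons, lat=lats, mode='lines')` call. The trade-off is that per-person hover text becomes harder to attach, since all segments share one trace.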