The Pantheon Project

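An online tool created by César Hidalgo.

Hidalgo is now a professor at the MIT Media Lab. He once said: "Truly famous people are also well known outside their own fields." By this measure, the more languages a person's Wikipedia page exists in, the more famous that person is.

To be listed in the Pantheon, a person's fame must cross national and language barriers: their Wikipedia biography must exist in at least 25 languages.

This single requirement narrows the field from every minor or little-known celebrity down to 11,341 people, each distinctive and full of charisma.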
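The inclusion rule is a simple threshold on the number of language editions. A minimal sketch on toy data, assuming a column `l` that counts Wikipedia language editions (the name matches the `l` column of the table loaded later in these notes; the three rows are made up for illustration):

```python
import pandas as pd

# Toy rows for illustration only; not taken from the Pantheon data.
people = pd.DataFrame({
    "name": ["person_a", "person_b", "person_c"],
    "l": [193, 25, 12],  # number of Wikipedia language editions
})

MIN_LANGUAGES = 25  # threshold quoted in the text

# Keep only biographies present in at least MIN_LANGUAGES editions.
included = people[people["l"] >= MIN_LANGUAGES]
print(included["name"].tolist())
```

The comparison is inclusive, so a biography with exactly 25 editions is kept.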

Yu, A. Z., et al. (2016). Pantheon 1.0, a manually verified dataset of globally famous biographies. Scientific Data 3:150075. doi: 10.1038/sdata.2015.75

https://pantheon.world/data/datasets

pantheon.tsv

A tab-delimited file containing one row of data per person in the Pantheon 1.0 dataset.

wikilangs.tsv

A tab-delimited file of all the different Wikipedia language editions that each biography has a presence in.

pageviews_2008-2013.tsv

A file containing the monthly pageview data for each individual, for all the Wikipedia language editions in which they have a presence.

Please refer to the methods section for more information on how this data was created. For detailed descriptions of these datasets, please refer to our data descriptor paper.
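Tab-delimited files such as these can be read with `pandas.read_csv` by passing `sep="\t"`. A minimal sketch using an in-memory stand-in for a file; the column names `name` and `lang` are placeholders, not the actual headers of the Pantheon files:

```python
import io
import pandas as pd

# Stand-in for a .tsv file on disk; on real data, pass the file path instead.
# The column names here are hypothetical.
tsv_text = "name\tlang\nperson_a\ten\nperson_a\tfr\nperson_b\ten\n"

langs = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Count how many language editions each (toy) biography appears in.
editions_per_person = langs.groupby("name")["lang"].nunique()
print(editions_per_person)
```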

Jara-Figueroa, C., Yu, A.Z. and Hidalgo, C.A., 2015. The medium is the memory: how communication technologies shape what we remember. arXiv preprint arXiv:1512.05020.
Yu, A.Z., Ronen, S., Hu, K., Lu, T. and Hidalgo, C.A., 2016. Pantheon 1.0, a manually verified dataset of globally famous biographies. Scientific data, 3.
Ronen, S., Gonçalves, B., Hu, K.Z., Vespignani, A., Pinker, S. and Hidalgo, C.A., 2014. Links that speak: The global language network and its association with global fame. Proceedings of the National Academy of Sciences, 111(52), pp.E5616-E5622.
Cesar A. Hidalgo and Ali Almossawi. “The Data-Visualization Revolution.” Scientific American. March 2014.
Hidalgo, C. A. “The Last 20 Inches: Data’s Treacherous Journey from the Screen to the Mind.” MIT Technology Review. March 2014.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('./data/person_2020_update.csv',low_memory=False)
df.head()

	id	wd_id	wp_id	slug	name	occupation	gender	twitter	alive	...	deathdate	deathyear	bplace_geacron_name	dplace_geacron_name	is_group	l_	age	non_en_page_views	coefficient_of_variation	hpi
0	18934	Q9458	18934	Muhammad	Muhammad	RELIGIOUS FIGURE	M	NaN	False	...	0632-06-08	632.0	Mecca	NaN	False	27.918400	1450.0	5160422.0	3.199355	100.000000
1	17414699	Q720	17414699	Genghis_Khan	Genghis Khan	MILITARY PERSONNEL	M	NaN	False	...	1227-08-18	1227.0	NaN	NaN	False	25.843621	858.0	3249211.0	2.753641	97.723669
2	18079	Q762	18079	Leonardo_da_Vinci	Leonardo da Vinci	INVENTOR	M	NaN	False	...	1519-05-02	1519.0	NaN	NaN	False	17.545406	568.0	5362406.0	4.796629	97.460691
3	14627	Q935	14627	Isaac_Newton	Isaac Newton	PHYSICIST	M	NaN	False	...	1727-03-31	1726.0	NaN	NaN	False	21.608920	378.0	3431331.0	4.632474	96.836567
4	17914	Q255	17914	Ludwig_van_Beethoven	Ludwig van Beethoven	COMPOSER	M	NaN	False	...	1827-03-26	1827.0	NaN	Austria	False	19.796430	250.0	5179518.0	3.926626	96.583969

5 rows × 34 columns

len(df)

df.iloc[0]

id                                         18934
wd_id                                      Q9458
wp_id                                      18934
slug                                    Muhammad
name                                    Muhammad
occupation                      RELIGIOUS FIGURE
prob_ratio                              0.000000
gender                                         M
twitter                                      NaN
alive                                      False
l                                            193
hpi_raw                                       36
bplace_name                                Mecca
bplace_lat                             21.416667
bplace_lon                             39.816667
bplace_geonameid                    21021.000000
bplace_country                      Saudi Arabia
birthdate                                    NaN
birthyear                             570.000000
dplace_name                               Medina
dplace_lat                             24.466667
dplace_lon                             39.600000
dplace_geonameid                    36636.000000
dplace_country                      Saudi Arabia
deathdate                             0632-06-08
deathyear                             632.000000
bplace_geacron_name                        Mecca
dplace_geacron_name                          NaN
is_group                                   False
l_                                     27.918400
age                                  1450.000000
non_en_page_views                 5160422.000000
coefficient_of_variation                3.199355
hpi                                   100.000000
age_group                   (1003.333, 2006.667]
Name: 0, dtype: object

df.columns

Index(['id', 'wd_id', 'wp_id', 'slug', 'name', 'occupation', 'prob_ratio',
       'gender', 'twitter', 'alive', 'l', 'hpi_raw', 'bplace_name',
       'bplace_lat', 'bplace_lon', 'bplace_geonameid', 'bplace_country',
       'birthdate', 'birthyear', 'dplace_name', 'dplace_lat', 'dplace_lon',
       'dplace_geonameid', 'dplace_country', 'deathdate', 'deathyear',
       'bplace_geacron_name', 'dplace_geacron_name', 'is_group', 'l_', 'age',
       'non_en_page_views', 'coefficient_of_variation', 'hpi'],
      dtype='object')

The HPI combines the number of languages L, the effective number of languages L*, the age of the historical character A, the number of PageViews in non-English Wikipedias v_NE (calculated in 2016), and the coefficient of variation of PageViews in all languages, CV (also calculated in 2016).

https://pantheon.world/about/methods
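The description lists the ingredients of the HPI but not how they are combined. A minimal sketch, assuming the formula published in the Pantheon 1.0 data descriptor (natural logs of L, L*, v_NE and CV, a base-4 log of age, and a penalty for biographies younger than 70 years). The function name `hpi_v1` is made up here, and the 2020 file loaded above may have been computed or rescaled differently:

```python
import math

def hpi_v1(L, L_star, age, v_ne, cv):
    """Sketch of the HPI formula, assumed from the Pantheon 1.0 paper.

    L      -- number of Wikipedia language editions
    L_star -- effective number of language editions
    age    -- years since birth (A)
    v_ne   -- page views in non-English editions
    cv     -- coefficient of variation of page views
    """
    score = (math.log(L) + math.log(L_star) + math.log(age, 4)
             + math.log(v_ne) - math.log(cv))
    if age < 70:
        # Assumed penalty for recent biographies.
        score -= (70 - age) / 7
    return score

# Inputs taken from the first row of the table above (Muhammad).
raw = hpi_v1(L=193, L_star=27.9184, age=1450, v_ne=5160422, cv=3.199355)
print(round(raw, 2))
```

The result is on a raw scale and should not be expected to equal the `hpi` or `hpi_raw` columns of this file.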

plt.style.use('ggplot')
plt.plot(df['l_'], df['hpi'], 'o')
plt.xscale('log')
plt.show()

_images/ed02725947b6291c6b13ece3a52dae6e0ea9570642e452a9403604193d0dc8c9.png

plt.style.use('ggplot')
plt.plot(df['non_en_page_views'], df['hpi'], 'o')
plt.xscale('log')
plt.show()

_images/c021453413eed39a61ec37a81d665de7e3d02f9262fa0e5ba160384501e83fa5.png

plt.style.use('ggplot')
plt.plot(df['l'], df['hpi'], 'o')
#plt.xscale('log')
plt.show()

_images/5a018c5c381b576990cc7d75af7115819a1c1ab63ed520a1e35b97a66e0c9ec8.png

plt.hist(df['hpi']);

_images/f9d52bfc3d6d0490a88c00eb3c2071f3517aebd2d55760565a0e81e22c24b909.png

plt.hist(df['non_en_page_views']);

_images/67daf59b341fc7b88f46c7e6145bcfa192c0664274e5d119ea36f4831c5b1592.png
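With linear bins, a histogram of a quantity that spans several orders of magnitude puts almost every value into the first bin. One option is to use one bin per decade. A minimal sketch on made-up page-view counts (only the largest value is taken from the table above):

```python
import numpy as np

# Toy page-view counts spanning several orders of magnitude (illustrative only).
views = np.array([120, 950, 3_400, 18_000, 75_000, 410_000, 5_160_422])

# Log-spaced bins need strictly positive values; drop zeros and NaNs on real data.
positive = views[views > 0]

# One bin per decade, from the decade below the minimum to the one above the maximum.
lo = int(np.floor(np.log10(positive.min())))
hi = int(np.ceil(np.log10(positive.max())))
bins = 10.0 ** np.arange(lo, hi + 1)

counts, edges = np.histogram(positive, bins=bins)
print(dict(zip(edges[:-1], counts)))
```

The same `bins` array can be passed to `plt.hist(df['non_en_page_views'], bins=bins)` together with `plt.xscale('log')`.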

sns.pointplot(x='gender', y = 'hpi', data = df, color = 'blue', linestyles='');

_images/765711a99ef44ae04d6529e101ea6d0dceac7c04a60c864beb6f917467b70a8e.png

sns.pointplot(x='gender', y = 'hpi', ci = 'sd', data = df, color = 'blue', linestyles='');

_images/149ef785188bd07e9997bce264ef2d4c4d772b461873d2e40653c6760d5e02c0.png
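The point plot with `ci = 'sd'` draws each group's mean with a standard-deviation bar. The underlying numbers can also be read directly with a `groupby`. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy values for illustration only; not taken from the Pantheon data.
toy = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F"],
    "hpi": [80.0, 70.0, 60.0, 75.0, 65.0],
})

# Group size, mean and sample standard deviation (ddof=1) per gender.
summary = toy.groupby("gender")["hpi"].agg(["count", "mean", "std"])
print(summary)
```

The bar length in the plot may differ slightly from `std` here, depending on which degrees-of-freedom convention the installed seaborn version uses. Newer seaborn releases spell the option `errorbar='sd'` and deprecate `ci`.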

sns.boxplot(x="gender", y="hpi",
            palette=["m", "g"],
            data=df);

_images/b0df5396920df34cceba72c2f7b88016fc33adbe0e28772e222d26d9da885853.png

# Draw a bar plot of mean HPI by gender, with standard-deviation error bars
g = sns.catplot(
    data=df, kind="bar",
    x="gender", y="hpi", hue="gender",
    ci="sd", palette="dark", alpha=.6, height=6
)

_images/4918442e0d1eed7c63e8cc07747ee1293d0115401f359d5384957fcfc83baa52.png

df['age_group']=pd.cut(df['age'], bins = 6)
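`pd.cut` with an integer `bins` splits the observed range into equal-width intervals, so with skewed data most rows can land in a single bin. `pd.qcut` instead chooses the edges so that each bin holds roughly the same number of rows. A small comparison on made-up ages:

```python
import pandas as pd

# Toy ages for illustration only.
ages = pd.Series([20, 35, 50, 80, 120, 300, 900, 2500])

equal_width = pd.cut(ages, bins=4)   # same interval width, uneven counts
equal_count = pd.qcut(ages, q=4)     # uneven widths, similar counts

print(equal_width.value_counts(sort=False))
print(equal_count.value_counts(sort=False))
```

Which one is preferable depends on whether the plot should compare equal spans of time or groups of similar size.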

plt.figure(figsize = (16, 5))
sns.pointplot(x="age_group", y="hpi", data=df)
plt.xlabel('Age', fontsize = 16)
plt.ylabel('HPI', fontsize = 16)
plt.show()

_images/1a35872a05742c5b9fb516a2e217bc2ca215ed65d7cf8616d758cd4c20746d06.png

plt.figure(figsize = (16, 5))
sns.pointplot(x="age_group", y="hpi", data=df, hue = 'gender')
plt.xlabel('Age', fontsize = 16)
plt.ylabel('HPI', fontsize = 16)
plt.show()

_images/1ffc0b0907e2ffa7bfb5d8d3a2f75d53ed80891a418cca3f110e68bf7e9fe24a.png

plt.figure(figsize = (16, 5))
sns.pointplot(x="age_group", y="non_en_page_views", data=df, hue = 'gender')
plt.xlabel('Age', fontsize = 16)
plt.ylabel('Non-English page views', fontsize = 16)
plt.show()

_images/14e871e0c442cee17526b2e5cb87973359b962807c8fcb0c0888f6561847ed99.png

plt.figure(figsize = (16, 5))
sns.pointplot(x="age_group", y="l", data=df, hue = 'gender')
plt.xlabel('Age', fontsize = 16)
plt.ylabel('Language Impact', fontsize = 16)
plt.show()

_images/3a599fe69206b53420d3dfeb32e5eda37d660b0777a1a2cbba526d7dc77e32f0.png

plt.figure(figsize = (16, 5))
sns.pointplot(x="age_group", y="l", ci = 'sd', data=df, hue = 'gender')
plt.xlabel('Age', fontsize = 16)
plt.ylabel('Language Impact', fontsize = 16)
plt.show()

_images/94ee85efb058df452b73bf959d721fa184ec74674c8bbb2015e87deb9decd87f.png

df['occupation'].unique()

array(['RELIGIOUS FIGURE', 'MILITARY PERSONNEL', 'INVENTOR', 'PHYSICIST',
       'COMPOSER', 'PHILOSOPHER', 'POLITICIAN', 'ASTRONOMER', 'EXPLORER',
       'PAINTER', 'WRITER', 'MATHEMATICIAN', 'PSYCHOLOGIST',
       'SOCIAL ACTIVIST', 'BIOLOGIST', 'ECONOMIST', 'NOBLEMAN', 'BOXER',
       'HISTORIAN', 'ACTOR', 'CHEMIST', 'PHYSICIAN', 'SOCIOLOGIST',
       'SOCCER PLAYER', 'OCCULTIST', 'ASTRONAUT', 'COMPANION', 'DESIGNER',
       'MUSICIAN', 'SINGER', 'ARCHITECT', 'INSPIRATION', 'EXTREMIST',
       'FILM DIRECTOR', 'COMPUTER SCIENTIST', 'DIPLOMAT', 'PRODUCER',
       'GEOGRAPHER', 'MAFIOSO', 'PILOT', 'BUSINESSPERSON',
       'FASHION DESIGNER', 'ARTIST', 'CELEBRITY', 'SCULPTOR', 'ENGINEER',
       'DANCER', 'PIRATE', 'LINGUIST', 'LAWYER', 'COMEDIAN',
       'ARCHAEOLOGIST', 'MARTIAL ARTS', 'COMIC ARTIST', 'ATHLETE',
       'COACH', 'RACING DRIVER', 'MAGICIAN', 'GEOLOGIST', 'CONDUCTOR',
       'MOUNTAINEER', 'PUBLIC WORKER', 'CHESS PLAYER', 'JUDGE',
       'JOURNALIST', 'POLITICAL SCIENTIST', 'ANTHROPOLOGIST',
       'PHOTOGRAPHER', 'SWIMMER', 'PORNOGRAPHIC ACTOR', 'CYCLIST',
       'TENNIS PLAYER', 'AMERICAN FOOTBALL PLAYER', 'STATISTICIAN',
       'BASKETBALL PLAYER', 'PRESENTER', 'MODEL', 'CRITIC', 'SKATER',
       'BASEBALL PLAYER', 'GYMNAST', 'WRESTLER', 'REFEREE', 'BULLFIGHTER',
       'GAME DESIGNER', 'YOUTUBER', 'SKIER', 'CHEF', 'GO PLAYER',
       'HOCKEY PLAYER', 'FENCER', 'GOLFER', 'POKER PLAYER',
       'TABLE TENNIS PLAYER', 'SNOOKER', 'CRICKETER', 'HANDBALL PLAYER',
       'VOLLEYBALL PLAYER', 'RUGBY PLAYER', 'GAMER', 'BADMINTON PLAYER'],
      dtype=object)

plt.figure(figsize = (8, 20))

sns.boxplot(x="hpi", y="occupation", data=df,
            whis=[0, 100], width=.6, palette="vlag");

_images/e0974121fe4cf356f84689e57cd298bbc96f596c76f2516d0281c05dadabfbcb.png

plt.figure(figsize = (8, 20))

sns.boxplot(x="hpi", y="occupation", data=df[df['alive']==True],
            whis=[0, 100], width=.6, palette="vlag");

_images/6788852f60ca9ceab048f418f1999099d3b57995deebc11c8e537a92ba2fb386.png

import numpy as np

# Keep only people with a known birthplace latitude and death-place latitude
dat = df[df['bplace_lat'].notna() & df['dplace_lat'].notna()]
len(dat)

dat0 = dat[dat['birthyear']<=0]
len(dat0)

dat0['name']

          Alexander the Great
                    Aristotle
                Julius Caesar
                       Plato
                       Jesus
                  ...           
            Prince Vijaya
  Seleucus VII Philometor
                Panyassis
                   Alexis
              Agathokleia
Name: name, Length: 700, dtype: object

import plotly.graph_objects as go

fig = go.Figure()

# Columns used: 'bplace_lat', 'bplace_lon', 'dplace_lat', 'dplace_lon'

fig.add_trace(go.Scattergeo(
    #locationmode = 'USA-states',
    lon = dat0['bplace_lon'],
    lat = dat0['bplace_lat'],
    hovertext = dat0['name'],
    mode = 'markers',
    marker = dict(
        size = 2,
        color = 'rgb(255, 0, 0)',
        line = dict(
            width = 3,
            color = 'rgba(68, 68, 68, 0)'
        )
    )))


fig.update_layout(
    title_text = 'Pantheon Project',
    showlegend = False,
    geo = dict(
        scope = 'world',
        projection_type = 'natural earth',
        showland = True,
        landcolor = 'rgb(243, 243, 243)',
        countrycolor = 'rgb(204, 204, 204)',

    ),
)

fig.show()

import plotly.graph_objects as go

fig = go.Figure()

# Columns used: 'bplace_lat', 'bplace_lon', 'dplace_lat', 'dplace_lon'


# Draw one line per person, from birthplace to place of death.
for i in dat0.index:
    fig.add_trace(
        go.Scattergeo(
            lon = [dat0.loc[i, 'bplace_lon'], dat0.loc[i, 'dplace_lon']],
            lat = [dat0.loc[i, 'bplace_lat'], dat0.loc[i, 'dplace_lat']],
            mode = 'lines',
            line = dict(width = 1,color = 'red'),
            opacity = 0.5,
            hovertext = dat0.loc[i, 'name'],  # this person's name only
            hoverinfo="text",
        )
    )

fig.update_layout(
    title_text = 'Pantheon Project',
    showlegend = False,
    geo = dict(
        scope = 'world',
        projection_type = 'natural earth',
        showland = True,
        landcolor = 'rgb(243, 243, 243)',
        countrycolor = 'rgb(204, 204, 204)',

    ),
)

fig.show()
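Adding one trace per person, as in the loop above, creates several hundred traces and can make the figure slow. A common alternative is a single trace in which the segments are separated by `None`, which plotly treats as a gap while `connectgaps` is left at its default. A minimal sketch of building those lists; the second row is made up, and the first uses the Mecca and Medina coordinates shown earlier:

```python
import pandas as pd

# Two toy rows; only the first reuses values printed earlier in these notes.
toy = pd.DataFrame({
    "name": ["Muhammad", "person_b"],
    "bplace_lon": [39.816667, 10.0],
    "bplace_lat": [21.416667, 50.0],
    "dplace_lon": [39.600000, 12.5],
    "dplace_lat": [24.466667, 41.9],
})

lons, lats, names = [], [], []
for row in toy.itertuples(index=False):
    # The trailing None ends the current segment so lines are not chained together.
    lons += [row.bplace_lon, row.dplace_lon, None]
    lats += [row.bplace_lat, row.dplace_lat, None]
    names += [row.name, row.name, None]

print(len(lons), "coordinates for", len(toy), "segments")
```

The three lists can then be passed to a single `go.Scattergeo(lon=lons, lat=lats, hovertext=names, mode='lines')` trace.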