Sentiment Analysis with the NRC Lexicon
The NRC lexicon is a crowdsourced word-emotion association lexicon created at the Institute for Information Technology, National Research Council Canada.
Mohammad, Saif M., and Peter D. Turney. “Crowdsourcing a word–emotion association lexicon.” Computational Intelligence 29, no. 3 (2013): 436-465.
http://sentiment.nrc.ca/lexicons-for-research/
The Sentiment and Emotion Lexicons are available as a complete bundle (about 100 MB) or as individual lexicons:
Manually created lexicons: NRC Emotion Lexicon; NRC Emotion Intensity Lexicon; NRC Valence, Arousal, and Dominance (VAD) Lexicon; NRC Sentiment Composition Lexicons (SCL-NMA, SCL-OPP, SemEval-2015 English Twitter, SemEval-2016 Arabic Twitter); NRC Word-Colour Association Lexicon.
Automatically generated lexicons: NRC Hashtag Emotion Lexicon; NRC Hashtag Sentiment Lexicon; NRC Hashtag Affirmative Context Sentiment Lexicon and NRC Hashtag Negated Context Sentiment Lexicon; NRC Emoticon Lexicon (a.k.a. Sentiment140 Lexicon); NRC Emoticon Affirmative Context Lexicon and NRC Emoticon Negated Context Lexicon.
Vosoughi et al. “The spread of true and false news online.” Science 359 (9 March 2018): 1146–1151.
We categorized the emotion in the replies by using the leading lexicon curated by the National Research Council Canada (NRC), which provides a comprehensive list of ~140,000 English words and their associations with eight emotions.
False news was more novel than true news, which suggests that people were more likely to share novel information. Whereas false stories inspired fear, disgust, and surprise in replies, true stories inspired anticipation, sadness, joy, and trust.
import pandas as pd
lexion_df = pd.read_excel('./Textmining/NRC-Emotion-Lexicon-v0.92-In105Languages-Nov2017Translations.xlsx')
lexion_df.head()
 | English (en) | Afrikaans (af) | Albanian (sq) | Amharic (am) | Arabic (ar) | Armenian (hy) | Azeerbaijani (az) | Basque (eu) | Belarusian (be) | Bengali (bn) | ... | Positive | Negative | Anger | Anticipation | Disgust | Fear | Joy | Sadness | Surprise | Trust |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | aback | uit die veld geslaan | prapa | ተጭኗል | الى الوراء | շեղում | sanki | aback | ззаду | পশ্চাতে | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | abacus | abakus | numërator | abacus | طبلية تاج | անբավարարություն | abacus | abako | абака | গণনা-যন্ত্রবিশেষ | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | abandon | verlaat | braktis | ውጣ | تخلى | լքել | tərk et | bertan behera | адмовіцца ад | বর্জিত করা | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
3 | abandoned | verlate | braktisur | ተትቷል | مهجور | լքված | tərk etdi | abandonatutako | закінуты | পরিত্যক্ত | ... | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
4 | abandonment | verlating | braktisje | ማቋረጥ | التخلي عن | հրաժարվելով | ləğv | abandono | пакіданне | বিসর্জন | ... | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
5 rows × 115 columns
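A quick sanity check after loading is to sum each of the ten label columns and to see which words carry no label at all. The sketch below does this on a small hand-made frame (`toy_df`, a hypothetical stand-in that copies only the five rows previewed above), so it runs without the spreadsheet. On the real data the same two expressions would be applied to `lexion_df`.

```python
import pandas as pd

label_cols = ['Positive', 'Negative', 'Anger', 'Anticipation', 'Disgust',
              'Fear', 'Joy', 'Sadness', 'Surprise', 'Trust']

# Toy stand-in for lexion_df: only the five rows shown in the preview.
toy_df = pd.DataFrame(
    [['aback',       0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     ['abacus',      0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
     ['abandon',     0, 1, 0, 0, 0, 1, 0, 1, 0, 0],
     ['abandoned',   0, 1, 1, 0, 0, 1, 0, 1, 0, 0],
     ['abandonment', 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]],
    columns=['English (en)'] + label_cols)

# Number of words carrying each label.
label_counts = toy_df[label_cols].sum()

# Words with all-zero labels can never contribute to any score.
no_label_mask = toy_df[label_cols].sum(axis=1) == 0
unlabeled = toy_df.loc[no_label_mask, 'English (en)'].tolist()

print(label_counts)
print(unlabeled)
```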
lexion_df.columns.tolist()
['English (en)',
'Afrikaans (af)',
'Albanian (sq)',
'Amharic (am)',
'Arabic (ar)',
'Armenian (hy)',
'Azeerbaijani (az)',
'Basque (eu)',
'Belarusian (be)',
'Bengali (bn)',
'Bosnian (bs)',
'Bulgarian (bg)',
'Catalan (ca)',
'Cebuano (ceb)',
'Chinese (Simplified) (zh-CN)',
'Chinese (Traditional) (zh-TW)',
'Corsican (co)',
'Croatian (hr)',
'Czech (cs)',
'Danish (da)',
'Dutch (nl)',
'English (en).1',
'Esperanto (eo)',
'Estonian (et)',
'Finnish (fi)',
'French (fr)',
'Frisian (fy)',
'Galician (gl)',
'Georgian (ka)',
'German (de)',
'Greek (el)',
'Gujarati (gu)',
'Haitian Creole (ht)',
'Hausa (ha)',
'Hawaiian (haw)',
'Hebrew (iw)',
'Hindi (hi)',
'Hmong (hmn)',
'Hungarian (hu)',
'Icelandic (is)',
'Igbo (ig)',
'Indonesian (id)',
'Irish (ga)',
'Italian (it)',
'Japanese (ja)',
'Javanese (jw)',
'Kannada (kn)',
'Kazakh (kk)',
'Khmer (km)',
'Korean (ko)',
'Kurdish (ku)',
'Kyrgyz (ky)',
'Lao (lo)',
'Latin (la)',
'Latvian (lv)',
'Lithuanian (lt)',
'Luxembourgish (lb)',
'Macedonian (mk)',
'Malagasy (mg)',
'Malay (ms)',
'Malayalam (ml)',
'Maltese (mt)',
'Maori (mi)',
'Marathi (mr)',
'Mongolian (mn)',
'Myanmar (Burmese) (my)',
'Nepali (ne)',
'Norwegian (no)',
'Nyanja (Chichewa) (ny)',
'Pashto (ps)',
'Persian (fa)',
'Polish (pl)',
'Portuguese (Portugal, Brazil) (pt)',
'Punjabi (pa)',
'Romanian (ro)',
'Russian (ru)',
'Samoan (sm)',
'Scots Gaelic (gd)',
'Serbian (sr)',
'Sesotho (st)',
'Shona (sn)',
'Sindhi (sd)',
'Sinhala (Sinhalese) (si)',
'Slovak (sk)',
'Slovenian (sl)',
'Somali (so)',
'Spanish (es)',
'Sundanese (su)',
'Swahili (sw)',
'Swedish (sv)',
'Tagalog (Filipino) (tl)',
'Tajik (tg)',
'Tamil (ta)',
'Telugu (te)',
'Thai (th)',
'Turkish (tr)',
'Ukrainian (uk)',
'Urdu (ur)',
'Uzbek (uz)',
'Vietnamese (vi)',
'Welsh (cy)',
'Xhosa (xh)',
'Yiddish (yi)',
'Yoruba (yo)',
'Zulu (zu)',
'Positive',
'Negative',
'Anger',
'Anticipation',
'Disgust',
'Fear',
'Joy',
'Sadness',
'Surprise',
'Trust']
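With this many columns, it can be handy to look a language up by the code in parentheses rather than typing the full column name. The helper below (`column_for`, a hypothetical name) is a minimal sketch over a shortened copy of the column list. It anchors the match at the end of the name, so the duplicate column that pandas renamed to 'English (en).1' is not picked up.

```python
import re

# Shortened copy of the column list above, for illustration only.
columns = ['English (en)', 'Chinese (Simplified) (zh-CN)',
           'Chinese (Traditional) (zh-TW)', 'English (en).1',
           'Portuguese (Portugal, Brazil) (pt)', 'Positive', 'Trust']

def column_for(code, columns):
    """Return the first column name ending in '(code)', or None if absent."""
    pattern = re.compile(r'\(' + re.escape(code) + r'\)$')
    for col in columns:
        if pattern.search(col):
            return col
    return None

print(column_for('zh-CN', columns))
print(column_for('en', columns))
print(column_for('xx', columns))
```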
chinese_df = lexion_df[['Chinese (Simplified) (zh-CN)', 'Positive', 'Negative',
'Anger','Anticipation', 'Disgust', 'Fear', 'Joy', 'Sadness', 'Surprise', 'Trust']]
chinese_df.head()
 | Chinese (Simplified) (zh-CN) | Positive | Negative | Anger | Anticipation | Disgust | Fear | Joy | Sadness | Surprise | Trust |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 吓了一跳 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 算盘 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 放弃 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
3 | 弃 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
4 | 放弃 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
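Because the Chinese column is a translation of the English entries, different English words can collapse to the same Chinese word with different labels: rows 2 and 4 above are both 放弃, but only row 4 is marked for Anger and Surprise. One possible way to make this explicit, which the original notebook does not do, is to group by the Chinese word and keep a label if any source row has it. The frame `toy_cn` is a hypothetical stand-in holding just the five rows shown.

```python
import pandas as pd

word_col = 'Chinese (Simplified) (zh-CN)'
label_cols = ['Positive', 'Negative', 'Anger', 'Anticipation', 'Disgust',
              'Fear', 'Joy', 'Sadness', 'Surprise', 'Trust']

# Toy stand-in for chinese_df: the five rows shown above.
toy_cn = pd.DataFrame(
    [['吓了一跳', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     ['算盘',     0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
     ['放弃',     0, 1, 0, 0, 0, 1, 0, 1, 0, 0],
     ['弃',       0, 1, 1, 0, 0, 1, 0, 1, 0, 0],
     ['放弃',     0, 1, 1, 0, 0, 1, 0, 1, 1, 0]],
    columns=[word_col] + label_cols)

# One row per Chinese word; a label is 1 if any duplicate row had it.
merged = (toy_cn.groupby(word_col, sort=False)[label_cols]
                .max()
                .reset_index())
print(merged)
```

Taking the maximum corresponds to a union of labels, which is also what happens implicitly when per-emotion word lists are built from the unmerged table: a word lands in a list as soon as one of its rows has the flag set.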
# Build the emotion word lists
Positive, Negative, Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise, Trust = [[] for i in range(10)]
for idx, row in chinese_df.iterrows():
    if row['Positive']==1:
        Positive.append(row['Chinese (Simplified) (zh-CN)'])
    if row['Negative']==1:
        Negative.append(row['Chinese (Simplified) (zh-CN)'])
    if row['Anger']==1:
        Anger.append(row['Chinese (Simplified) (zh-CN)'])
    if row['Anticipation']==1:
        Anticipation.append(row['Chinese (Simplified) (zh-CN)'])
    if row['Disgust']==1:
        Disgust.append(row['Chinese (Simplified) (zh-CN)'])
    if row['Fear']==1:
        Fear.append(row['Chinese (Simplified) (zh-CN)'])
    if row['Joy']==1:
        Joy.append(row['Chinese (Simplified) (zh-CN)'])
    if row['Sadness']==1:
        Sadness.append(row['Chinese (Simplified) (zh-CN)'])
    if row['Surprise']==1:
        Surprise.append(row['Chinese (Simplified) (zh-CN)'])
    if row['Trust']==1:
        Trust.append(row['Chinese (Simplified) (zh-CN)'])
print('Word lists built')
Word lists built
Anger[:10]
['弃', '放弃', '痛恨', '可恶', '废除', '厌恶', '滥用', '诅咒', '指控', '被告']
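Testing `word in some_list` scans the list from the start, and the lists can hold the same word more than once when several English entries share a translation. If lookups become slow, one option is to convert each list to a set. The sketch below uses short stand-in lists (the first Anger entries printed above, and the Fear entries implied by the five-row preview); `emotion_lists`, `emotion_sets`, and `labels_for` are hypothetical names.

```python
# Stand-ins for the full word lists; only a few entries each.
emotion_lists = {
    'anger': ['弃', '放弃', '痛恨', '可恶'],
    'fear': ['放弃', '弃', '放弃'],  # duplicate from two English source words
}

# Sets drop duplicates and give constant-time membership tests on average.
emotion_sets = {name: set(words) for name, words in emotion_lists.items()}

def labels_for(word):
    """Return the sorted names of all categories that contain the word."""
    return sorted(name for name, words in emotion_sets.items() if word in words)

print(labels_for('放弃'))
print(labels_for('算盘'))
```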
positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, trust = [0 for i in range(10)]
[positive, negative]
[0, 0]
import jieba
import time
def emotion_caculate(text):
    # Count lexicon hits per category in one text, plus the total token count
    positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, trust = [0 for i in range(10)]
    wordlist = jieba.lcut(text)  # segment the Chinese text into a list of words
    wordset = set(wordlist)
    for word in wordset:
        freq = wordlist.count(word)  # occurrences of this word in the text
        if word in Positive:
            positive+=freq
        if word in Negative:
            negative+=freq
        if word in Anger:
            anger+=freq
        if word in Anticipation:
            anticipation+=freq
        if word in Disgust:
            disgust+=freq
        if word in Fear:
            fear+=freq
        if word in Joy:
            joy+=freq
        if word in Sadness:
            sadness+=freq
        if word in Surprise:
            surprise+=freq
        if word in Trust:
            trust+=freq
    emotion_info = {
        'positive': positive,
        'negative': negative,
        'anger': anger,
        'anticipation': anticipation,
        'disgust': disgust,
        'fear': fear,
        'joy': joy,
        'sadness': sadness,
        'surprise': surprise,
        'trust': trust,
        'length': len(wordlist)
    }
    indexs = ['length', 'positive', 'negative', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust']
    return pd.Series(emotion_info, index=indexs)
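# Sample text, roughly: "If this country keeps going easy on people who make counterfeit and shoddy food and drugs, the consequences will be really bad. Prison? Sentence them to death, and quickly."
emotion_caculate(text='这个国家再对这些制造假冒伪劣食品药品的人手软的话,那后果真的会相当糟糕。坐牢?从快判个死刑')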
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/8b/hhnbt0nd4zsg2qhxc28q23w80000gn/T/jieba.cache
Loading model cost 0.730 seconds.
Prefix dict has been built successfully.
length 25
positive 1
negative 2
anger 1
anticipation 0
disgust 1
fear 1
joy 0
sadness 1
surprise 0
trust 3
dtype: int64
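The result includes `length` alongside the raw counts, which makes it possible to compare texts of different sizes by dividing each count by the number of tokens. The sketch below shows one way to do that with a `Counter`. It takes an already segmented token list (the kind of list `jieba.lcut` returns) so that it runs without jieba. The function name `emotion_rates`, the two-category `toy_lexicon`, and the hand-written token list are all assumptions for illustration, not the NRC data or real segmenter output.

```python
from collections import Counter

def emotion_rates(tokens, lexicon):
    """Return (counts, rates) per category for a list of tokens.

    lexicon maps a category name to a set of words; rates are counts
    divided by the number of tokens (0.0 for an empty text).
    """
    freq = Counter(tokens)
    n = len(tokens)
    counts = {name: sum(c for word, c in freq.items() if word in words)
              for name, words in lexicon.items()}
    rates = {name: (c / n if n else 0.0) for name, c in counts.items()}
    return counts, rates

# Hypothetical toy lexicon and hand-written tokens.
toy_lexicon = {'negative': {'放弃', '糟糕'}, 'fear': {'放弃'}}
tokens = ['后果', '真的', '会', '相当', '糟糕', '不要', '放弃', '放弃']

counts, rates = emotion_rates(tokens, toy_lexicon)
print(counts)
print(rates)
```

Iterating over the text's distinct words rather than over the lexicon keeps the cost proportional to the text length, which matters when the lexicon is much larger than a single document.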