Sentiment Analysis with the NRC Lexicon


The NRC lexicon is a word-emotion association lexicon produced by the Institute for Information Technology, National Research Council Canada. Its annotations were collected through crowdsourcing.

Mohammad, Saif M., and Peter D. Turney. “Crowdsourcing a word–emotion association lexicon.” Computational Intelligence 29, no. 3 (2013): 436-465.

http://sentiment.nrc.ca/lexicons-for-research/

The Sentiment and Emotion Lexicons (the complete bundle is about 100 MB)

Individual lexicons for download:

Manually created lexicons: NRC Emotion Lexicon; NRC Emotion Intensity Lexicon; NRC Valence, Arousal, and Dominance (VAD) Lexicon; NRC Sentiment Composition Lexicons (SCL-NMA, SCL-OPP, SemEval-2015 English Twitter, SemEval-2016 Arabic Twitter); NRC Word-Colour Association Lexicon

Automatically generated lexicons: NRC Hashtag Emotion Lexicon; NRC Hashtag Sentiment Lexicon; NRC Hashtag Affirmative Context Sentiment Lexicon and NRC Hashtag Negated Context Sentiment Lexicon; NRC Emoticon Lexicon (a.k.a. Sentiment140 Lexicon); NRC Emoticon Affirmative Context Lexicon and NRC Emoticon Negated Context Lexicon

Vosoughi et al. “The spread of true and false news online.” Science 359 (9 March 2018): 1146–1151.

We categorized the emotion in the replies by using the leading lexicon curated by the National Research Council Canada (NRC), which provides a comprehensive list of ~140,000 English words and their associations with eight emotions.

False news was more novel than true news, which suggests that people were more likely to share novel information. Whereas false stories inspired fear, disgust, and surprise in replies, true stories inspired anticipation, sadness, joy, and trust.

import pandas as pd

# Load the multilingual NRC Emotion Lexicon; reading .xlsx needs an Excel engine such as openpyxl.
lexion_df = pd.read_excel('./Textmining/NRC-Emotion-Lexicon-v0.92-In105Languages-Nov2017Translations.xlsx')
lexion_df.head()

	English (en)	Afrikaans (af)	Albanian (sq)	Amharic (am)	Arabic (ar)	Armenian (hy)	Azeerbaijani (az)	Basque (eu)	Belarusian (be)	Bengali (bn)	...	Negative	Anger	Fear	Sadness	Surprise	Trust
0	aback	uit die veld geslaan	prapa	ተጭኗል	الى الوراء	շեղում	sanki	aback	ззаду	পশ্চাতে	...	0	0	0	0	0	0
1	abacus	abakus	numërator	abacus	طبلية تاج	անբավարարություն	abacus	abako	абака	গণনা-যন্ত্রবিশেষ	...	0	0	0	0	0	1
2	abandon	verlaat	braktis	ውጣ	تخلى	լքել	tərk et	bertan behera	адмовіцца ад	বর্জিত করা	...	1	0	1	1	0	0
3	abandoned	verlate	braktisur	ተትቷል	مهجور	լքված	tərk etdi	abandonatutako	закінуты	পরিত্যক্ত	...	1	1	1	1	0	0
4	abandonment	verlating	braktisje	ማቋረጥ	التخلي عن	հրաժարվելով	ləğv	abandono	пакіданне	বিসর্জন	...	1	1	1	1	1	0

5 rows × 115 columns

lexion_df.columns.tolist()

['English (en)',
 'Afrikaans (af)',
 'Albanian (sq)',
 'Amharic (am)',
 'Arabic (ar)',
 'Armenian (hy)',
 'Azeerbaijani (az)',
 'Basque (eu)',
 'Belarusian (be)',
 'Bengali (bn)',
 'Bosnian (bs)',
 'Bulgarian (bg)',
 'Catalan (ca)',
 'Cebuano (ceb)',
 'Chinese (Simplified) (zh-CN)',
 'Chinese (Traditional) (zh-TW)',
 'Corsican (co)',
 'Croatian (hr)',
 'Czech (cs)',
 'Danish (da)',
 'Dutch (nl)',
 'English (en).1',
 'Esperanto (eo)',
 'Estonian (et)',
 'Finnish (fi)',
 'French (fr)',
 'Frisian (fy)',
 'Galician (gl)',
 'Georgian (ka)',
 'German (de)',
 'Greek (el)',
 'Gujarati (gu)',
 'Haitian Creole (ht)',
 'Hausa (ha)',
 'Hawaiian (haw)',
 'Hebrew (iw)',
 'Hindi (hi)',
 'Hmong (hmn)',
 'Hungarian (hu)',
 'Icelandic (is)',
 'Igbo (ig)',
 'Indonesian (id)',
 'Irish (ga)',
 'Italian (it)',
 'Japanese (ja)',
 'Javanese (jw)',
 'Kannada (kn)',
 'Kazakh (kk)',
 'Khmer (km)',
 'Korean (ko)',
 'Kurdish (ku)',
 'Kyrgyz (ky)',
 'Lao (lo)',
 'Latin (la)',
 'Latvian (lv)',
 'Lithuanian (lt)',
 'Luxembourgish (lb)',
 'Macedonian (mk)',
 'Malagasy (mg)',
 'Malay (ms)',
 'Malayalam (ml)',
 'Maltese (mt)',
 'Maori (mi)',
 'Marathi (mr)',
 'Mongolian (mn)',
 'Myanmar (Burmese) (my)',
 'Nepali (ne)',
 'Norwegian (no)',
 'Nyanja (Chichewa) (ny)',
 'Pashto (ps)',
 'Persian (fa)',
 'Polish (pl)',
 'Portuguese (Portugal, Brazil) (pt)',
 'Punjabi (pa)',
 'Romanian (ro)',
 'Russian (ru)',
 'Samoan (sm)',
 'Scots Gaelic (gd)',
 'Serbian (sr)',
 'Sesotho (st)',
 'Shona (sn)',
 'Sindhi (sd)',
 'Sinhala (Sinhalese) (si)',
 'Slovak (sk)',
 'Slovenian (sl)',
 'Somali (so)',
 'Spanish (es)',
 'Sundanese (su)',
 'Swahili (sw)',
 'Swedish (sv)',
 'Tagalog (Filipino) (tl)',
 'Tajik (tg)',
 'Tamil (ta)',
 'Telugu (te)',
 'Thai (th)',
 'Turkish (tr)',
 'Ukrainian (uk)',
 'Urdu (ur)',
 'Uzbek (uz)',
 'Vietnamese (vi)',
 'Welsh (cy)',
 'Xhosa (xh)',
 'Yiddish (yi)',
 'Yoruba (yo)',
 'Zulu (zu)',
 'Positive',
 'Negative',
 'Anger',
 'Anticipation',
 'Disgust',
 'Fear',
 'Joy',
 'Sadness',
 'Surprise',
 'Trust']

# Keep the Simplified Chinese translation together with the ten label columns.
chinese_df = lexion_df[['Chinese (Simplified) (zh-CN)', 'Positive', 'Negative',
                        'Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy',
                        'Sadness', 'Surprise', 'Trust']]
chinese_df.head()

	Chinese (Simplified) (zh-CN)	Negative	Anger	Fear	Sadness	Surprise	Trust
0	吓了一跳	0	0	0	0	0	0
1	算盘	0	0	0	0	0	1
2	放弃	1	0	1	1	0	0
3	弃	1	1	1	1	0	0
4	放弃	1	1	1	1	1	0
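In the preview above, rows 2 and 4 ("abandon" and "abandonment" in the English column) share the Chinese word 放弃 but carry different flags. The membership tests used later treat a word as belonging to an emotion if any of its rows is flagged, so the effective label is the per-column maximum. The following is a minimal pure-Python sketch of that merge on the five preview rows only; `merge_duplicate_words`, `COLUMNS` and `ROWS` are hypothetical names introduced for illustration, and only the columns visible in the preview are included.

```python
# Columns and rows copied from the chinese_df.head() preview (visible columns only).
COLUMNS = ["Negative", "Anger", "Fear", "Sadness", "Surprise", "Trust"]
ROWS = [
    ("吓了一跳", [0, 0, 0, 0, 0, 0]),
    ("算盘", [0, 0, 0, 0, 0, 1]),
    ("放弃", [1, 0, 1, 1, 0, 0]),
    ("弃", [1, 1, 1, 1, 0, 0]),
    ("放弃", [1, 1, 1, 1, 1, 0]),
]


def merge_duplicate_words(rows, columns):
    """Merge rows that share a word by taking the maximum of each 0/1 flag."""
    merged = {}
    for word, flags in rows:
        previous = merged.get(word, [0] * len(columns))
        merged[word] = [max(a, b) for a, b in zip(previous, flags)]
    # Attach column names so each word maps to a {label: flag} dict.
    return {word: dict(zip(columns, flags)) for word, flags in merged.items()}


merged = merge_duplicate_words(ROWS, COLUMNS)
print(len(merged))      # 4 distinct words from 5 rows
print(merged["放弃"])   # union of the flags of both 放弃 rows
```

On the full table, grouping `chinese_df` by the Chinese column and calling `.max()` would be one pandas way to get the same result, assuming the label columns contain only 0 and 1.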

# Build the emotion word lists: one list per label, holding the Chinese words flagged 1.

word_col = 'Chinese (Simplified) (zh-CN)'
labels = ['Positive', 'Negative', 'Anger', 'Anticipation', 'Disgust',
          'Fear', 'Joy', 'Sadness', 'Surprise', 'Trust']
(Positive, Negative, Anger, Anticipation, Disgust,
 Fear, Joy, Sadness, Surprise, Trust) = [
    chinese_df.loc[chinese_df[label] == 1, word_col].tolist() for label in labels
]

print('Word lists built')

Word lists built

Anger[:10]

['弃', '放弃', '痛恨', '可恶', '废除', '厌恶', '滥用', '诅咒', '指控', '被告']

positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, trust = [0 for i in range(10)]
[positive, negative]

[0, 0]

import jieba
from collections import Counter

# Output name -> set of lexicon words (sets give fast membership tests).
emotion_words = {
    'positive': set(Positive), 'negative': set(Negative), 'anger': set(Anger),
    'anticipation': set(Anticipation), 'disgust': set(Disgust), 'fear': set(Fear),
    'joy': set(Joy), 'sadness': set(Sadness), 'surprise': set(Surprise),
    'trust': set(Trust),
}


def emotion_caculate(text):
    """Segment `text` with jieba and count the tokens found in each emotion word list."""
    wordlist = jieba.lcut(text)
    wordfreq = Counter(wordlist)

    # 'length' is the number of tokens, including punctuation tokens returned by jieba.
    emotion_info = {'length': len(wordlist)}
    for emotion, words in emotion_words.items():
        # Add up the frequency of every token that belongs to this emotion's word set.
        emotion_info[emotion] = sum(freq for word, freq in wordfreq.items() if word in words)

    indexs = ['length', 'positive', 'negative', 'anger', 'anticipation', 'disgust',
              'fear', 'joy', 'sadness', 'surprise', 'trust']
    return pd.Series(emotion_info, index=indexs)
        

emotion_caculate(text='这个国家再对这些制造假冒伪劣食品药品的人手软的话，那后果真的会相当糟糕。坐牢？从快判个死刑')

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/8b/hhnbt0nd4zsg2qhxc28q23w80000gn/T/jieba.cache
Loading model cost 0.730 seconds.
Prefix dict has been built successfully.

length          25
positive         1
negative         2
anger            1
anticipation     0
disgust          1
fear             1
joy              0
sadness          1
surprise         0
trust            3
dtype: int64
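The counts above are raw frequencies, so longer texts tend to score higher on every emotion. One common way to compare texts of different lengths is to divide each count by `length`. The sketch below assumes that normalisation is wanted; `emotion_shares` is a hypothetical helper and not part of the original notebook, and the input dict simply copies the output shown above.

```python
def emotion_shares(counts):
    """Divide each emotion count by the token count; empty texts yield 0.0."""
    length = counts["length"]
    return {
        name: (value / length if length else 0.0)
        for name, value in counts.items()
        if name != "length"
    }


# Counts copied from the example output above.
example = {
    "length": 25, "positive": 1, "negative": 2, "anger": 1, "anticipation": 0,
    "disgust": 1, "fear": 1, "joy": 0, "sadness": 1, "surprise": 0, "trust": 3,
}

shares = emotion_shares(example)
print(shares["trust"])     # 3 of 25 tokens
print(shares["negative"])  # 2 of 25 tokens
```

Because the helper only uses `["length"]` and `.items()`, it should also accept the Series returned by `emotion_caculate` directly, though that has not been checked against the notebook's data.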