A Brief Introduction to Word Embedding Models#

Introduction to Word Embeddings: Analyzing Meaning through Word Embeddings

Using vectors to represent things

The Geometry of Culture#

Analyzing Meaning through Word Embeddings

Austin C. Kozlowski; Matt Taddy; James A. Evans

https://arxiv.org/abs/1803.09288

Word embeddings represent semantic relations between words as geometric relationships between vectors in a high-dimensional space, operationalizing a relational model of meaning consistent with contemporary theories of identity and culture.

  • Dimensions induced by word differences (e.g. man - woman, rich - poor, black - white, liberal - conservative) in these vector spaces closely correspond to dimensions of cultural meaning (see the code sketch below).

  • Macro-cultural investigation with a longitudinal analysis of the coevolution of gender and class associations in the United States over the 20th century

The success of these high-dimensional models motivates a move towards “high-dimensional theorizing” of meanings, identities and cultural processes.

_images/gender_class.png
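
A minimal sketch of this projection idea, assuming a pre-trained gensim KeyedVectors model (such as the Google News vectors loaded later in this section, here called `model`); the word pairs and probe words are illustrative, not the authors' exact lists.

import numpy as np

# Build a "gender" dimension by averaging word-vector differences (illustrative pairs).
pairs = [('man', 'woman'), ('he', 'she'), ('king', 'queen')]
gender_dim = np.mean([model[a] - model[b] for a, b in pairs], axis=0)

def project(word):
    # cosine similarity between a word vector and the cultural dimension
    v = model[word]
    return float(np.dot(v, gender_dim) / (np.linalg.norm(v) * np.linalg.norm(gender_dim)))

# Positive scores lean toward the first pole ('man'), negative toward 'woman'.
for w in ['engineer', 'nurse', 'rich', 'poor']:
    print(w, round(project(w), 3))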

HistWords#

HistWords is a collection of tools and datasets for analyzing language change using word vector embeddings.

  • The goal of this project is to facilitate quantitative research in diachronic linguistics, history, and the digital humanities.

  • We used the historical word vectors in HistWords to study the semantic evolution of more than 30,000 words across 4 languages.

  • This study led us to propose two statistical laws that govern the evolution of word meaning.

https://nlp.stanford.edu/projects/histwords/

williamleif/histwords

Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change

_images/wordpaths-final.png

Word embeddings quantify 100 years of gender and ethnic stereotypes#

http://www.pnas.org/content/early/2018/03/30/1720347115

_images/sex.png

The Illustrated Word2vec#

Jay Alammar. https://jalammar.github.io/illustrated-word2vec/

Personality Embeddings#

What are you like?

Big Five personality traits: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism

  • the five-factor model (FFM)

  • the OCEAN model

  • Openness (开放性): imagination, aesthetic sensitivity, emotional richness, curiosity for the new, creativity, and intellect.

  • Conscientiousness (责任心): competence, fairness, orderliness, dutifulness, achievement striving, self-discipline, deliberation, and restraint.

  • Extraversion (外倾性): warmth, gregariousness, assertiveness, activity, excitement seeking, and optimism.

  • Agreeableness (宜人性): trust, altruism, straightforwardness, compliance, modesty, and empathy.

  • Neuroticism (神经质) / emotional stability (情绪稳定性): how well one keeps emotions such as anxiety, hostility, depression, self-consciousness, impulsiveness, and vulnerability in balance, i.e. the ability to remain emotionally stable.

# Personality Embeddings: What are you like?
jay = [-0.4, 0.8, 0.5, -0.2, 0.3]
john = [-0.3, 0.2, 0.3, -0.4, 0.9]
mike = [-0.5, -0.4, -0.2, 0.7, -0.1]

Cosine Similarity#

The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:

\[ \mathbf{A}\cdot\mathbf{B} =\left\|\mathbf{A}\right\|\left\|\mathbf{B}\right\|\cos\theta \]
\[ \text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }, \]

where \(A_i\) and \(B_i\) are components of vector \(A\) and \(B\) respectively.

from numpy import dot
from numpy.linalg import norm

def cos_sim(a, b):
    return dot(a, b)/(norm(a)*norm(b))

cos_sim([1, 0, -1], [-1,-1, 0])
-0.4999999999999999
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])
array([[-0.5]])
\[\text{CosineDistance} = 1 - \text{CosineSimilarity}\]
from scipy import spatial
# spatial.distance.cosine computes 
# the Cosine distance between 1-D arrays.
1-spatial.distance.cosine([1, 0, -1], [-1,-1, 0])
-0.5
cos_sim(jay, john)
0.6582337075311759
cos_sim(jay, mike)
-0.3683509554826695

Cosine similarity works for any number of dimensions.

  • We can represent people (and things) as vectors of numbers (which is great for machines!).

  • We can easily calculate how similar vectors are to each other.

Word Embeddings#

Google News Word2Vec#

You can download Google’s pre-trained model here.

  • It’s 1.5GB!

  • It includes word vectors for a vocabulary of 3 million words and phrases

  • It is trained on roughly 100 billion words from a Google News dataset.

  • The vector length is 300 features.

http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

Using the Gensim library in Python, we can

  • add and subtract word vectors,

  • find the most similar words to the resulting vector.

import gensim
# Load Google's pre-trained Word2Vec model.
filepath = '/Users/datalab/bigdata/GoogleNews-vectors-negative300.bin'
model = gensim.models.KeyedVectors.load_word2vec_format(filepath, binary=True) 
model['woman'][:5]
array([ 0.24316406, -0.07714844, -0.10302734, -0.10742188,  0.11816406],
      dtype=float32)
model.most_similar('woman')
[('man', 0.7664012908935547),
 ('girl', 0.7494640946388245),
 ('teenage_girl', 0.7336829900741577),
 ('teenager', 0.631708562374115),
 ('lady', 0.6288785934448242),
 ('teenaged_girl', 0.614178478717804),
 ('mother', 0.6076306104660034),
 ('policewoman', 0.6069462299346924),
 ('boy', 0.5975908637046814),
 ('Woman', 0.5770983099937439)]
model.similarity('woman', 'man')
0.76640123
cos_sim(model['woman'], model['man'])
0.76640123
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)
[('queen', 0.7118193507194519),
 ('monarch', 0.6189674735069275),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133)]
model.most_similar(positive=['London', 'China'], negative=['Beijing'], topn=5)
[('UK', 0.6304796934127808),
 ('Britain', 0.6012536287307739),
 ('EURASIAN_NATURAL_RESOURCES_CORP.', 0.5524139404296875),
 ('Europe', 0.5444058775901794),
 ('United_Kingdom', 0.5375455021858215)]
model.most_similar(positive=['big', 'worst'], negative=['bad'], topn=5)
[('biggest', 0.7630176544189453),
 ('greatest', 0.6113720536231995),
 ('major', 0.5540789365768433),
 ('huge', 0.5368192195892334),
 ('Biggest', 0.5305125713348389)]
\[\text{King} - \text{Queen} = \text{Man} - \text{Woman}\]
model.most_similar(positive=['go', 'do'], negative=['did'], topn=5)
[('come', 0.6050394177436829),
 ('get', 0.5739777684211731),
 ('goes', 0.5202917456626892),
 ('happen', 0.5166401267051697),
 ('sit', 0.5145819187164307)]
_images/word2vec.png

Now that we’ve looked at trained word embeddings,

  • let’s learn more about the training process.

  • But before we get to word2vec, we need to look at a conceptual parent of word embeddings: the neural language model.

The neural language model#

“You shall know a word by the company it keeps” J.R. Firth

Bengio et al. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research 3:1137–1155.

After being trained, early neural language models (Bengio 2003) would calculate a prediction in three steps (sketched in code below):

_images/neural-language-model-prediction.png _images/bengio.png

The output of the neural language model is a probability score for all the words the model knows.

  • We’re referring to the probability as a percentage here,

  • but 40% would actually be represented as 0.4 in the output vector.
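
A minimal sketch of those three steps in PyTorch, with illustrative (untrained) layers and sizes: look up the embeddings of the context words, combine them through a hidden layer, and turn the score of every vocabulary word into a probability with a softmax.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes; a real model (e.g. Bengio 2003) is trained on a large corpus.
vocab_size, embd_size, context_size, hidden_size = 10000, 100, 3, 128

embeddings = nn.Embedding(vocab_size, embd_size)           # step 1: look up embeddings
hidden = nn.Linear(context_size * embd_size, hidden_size)  # step 2: combine the context
output = nn.Linear(hidden_size, vocab_size)                # step 3: score every word

context = torch.LongTensor([[12, 7, 256]])                 # indices of the context words
x = embeddings(context).view(1, -1)
probs = F.softmax(output(torch.tanh(hidden(x))), dim=1)    # one probability per word
print(probs.shape)  # torch.Size([1, 10000]); the probabilities sum to 1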

Language Model Training#

  • We get a lot of text data (say, all Wikipedia articles, for example).

  • We have a window (say, of three words) that we slide against all of that text.

  • The sliding window generates training samples for our model

_images/lm-sliding-window-4.png

As this window slides against the text, we (virtually) generate a dataset that we use to train a model.

Instead of only looking two words before the target word, we can also look at two words after it.

_images/continuous-bag-of-words-example.png

If we do this, the dataset we’re virtually building and training the model against would look like this:

_images/continuous-bag-of-words-dataset.png

This architecture is called Continuous Bag of Words (CBOW): https://arxiv.org/pdf/1301.3781.pdf
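
As a quick illustration of the dataset such a window builds (the create_cbow_dataset function in the PyTorch section below does the same thing), here is a toy sliding window with two context words on each side; the sentence is just an example.

# Toy example: two words of context on each side of the target word.
sentence = "thou shalt not make a machine in the likeness of a human mind".split()
for i in range(2, len(sentence) - 2):
    context = sentence[i-2:i] + sentence[i+1:i+3]
    target = sentence[i]
    print(context, '->', target)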

Skip-gram#

Instead of guessing a word based on its context (the words before and after it), this other architecture tries to guess neighboring words using the current word.

_images/skipgram.png

https://arxiv.org/pdf/1301.3781.pdf

_images/cbow.png _images/skipgram-sliding-window-samples.png

The pink boxes are in different shades because this sliding window actually creates four separate samples in our training dataset.

  • We then slide our window to the next position:

  • Which generates our next four examples:

_images/skipgram-language-model-training.png _images/skipgram-language-model-training-4.png _images/skipgram-language-model-training-5.png _images/language-model-expensive.png

Negative Sampling#

Instead of predicting a neighboring word directly, we switch to a model that takes the input and output word and produces a score indicating whether they are neighbors:

  • 0 for “not neighbors”, 1 for “neighbors”.

_images/are-the-words-neighbors.png _images/word2vec-negative-sampling-2.png

To keep the model from simply returning 1 for every pair, we need to introduce negative samples to our dataset:

  • samples of words that are not neighbors.

  • Our model needs to return 0 for those samples.

  • This leads to a great tradeoff of computational and statistical efficiency.

Skipgram with Negative Sampling (SGNS)

Word2vec Training Process#

_images/word2vec-training-update.png
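
A minimal sketch of one such training update under negative sampling, assuming two embedding matrices (one for input words, one for context words) and the (input word, output word, 0/1 label) samples described above; the sizes and names are illustrative, and the PyTorch section below trains a simplified single-matrix version end to end.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embd_size = 44, 100                  # illustrative sizes
in_embed = nn.Embedding(vocab_size, embd_size)   # "word" (input) matrix
out_embed = nn.Embedding(vocab_size, embd_size)  # "context" (output) matrix
optimizer = torch.optim.SGD(list(in_embed.parameters()) + list(out_embed.parameters()), lr=0.01)

def sgns_step(focus_id, context_id, label):
    # score a pair by a dot product, push it toward label 1 (neighbors) or 0 (negative sample)
    score = (in_embed(torch.LongTensor([focus_id])) *
             out_embed(torch.LongTensor([context_id]))).sum(dim=1)
    loss = F.binary_cross_entropy_with_logits(score, torch.Tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

sgns_step(3, 7, 1)   # a "neighbors" pair
sgns_step(3, 21, 0)  # a negative sample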

Pytorch word2vec#

jojonki/word2vec-pytorch

bamtercelboo/pytorch_word2vec

# see http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)
<torch._C.Generator at 0x116c571d0>
text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""

text = text.replace(',', '').replace('.', '').lower().split()
# By deriving a set from `text`, we deduplicate the words
vocab = set(text)
vocab_size = len(vocab)
print('vocab_size:', vocab_size)

w2i = {w: i for i, w in enumerate(vocab)}
i2w = {i: w for i, w in enumerate(vocab)}
vocab_size: 44
# context window size is two
def create_cbow_dataset(text):
    data = []
    for i in range(2, len(text) - 2):
        context = [text[i - 2], text[i - 1],
                   text[i + 1], text[i + 2]]
        target = text[i]
        data.append((context, target))
    return data

cbow_train = create_cbow_dataset(text)
print('cbow sample', cbow_train[0])
cbow sample (['we', 'are', 'to', 'study'], 'about')
def create_skipgram_dataset(text):
    import random
    data = []
    for i in range(2, len(text) - 2):
        data.append((text[i], text[i-2], 1))
        data.append((text[i], text[i-1], 1))
        data.append((text[i], text[i+1], 1))
        data.append((text[i], text[i+2], 1))
        # negative sampling
        for _ in range(4):
            if random.random() < 0.5 or i >= len(text) - 3:
                rand_id = random.randint(0, i-1)
            else:
                rand_id = random.randint(i+3, len(text)-1)
            data.append((text[i], text[rand_id], 0))
    return data


skipgram_train = create_skipgram_dataset(text)
print('skipgram sample', skipgram_train[0])
skipgram sample ('about', 'we', 1)
class CBOW(nn.Module):
    def __init__(self, vocab_size, embd_size, context_size, hidden_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embd_size)
        self.linear1 = nn.Linear(2*context_size*embd_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, inputs):
        embedded = self.embeddings(inputs).view((1, -1))
        hid = F.relu(self.linear1(embedded))
        out = self.linear2(hid)
        log_probs = F.log_softmax(out, dim = 1)
        return log_probs
    
    def extract(self, inputs):
        embeds = self.embeddings(inputs)
        return embeds
class SkipGram(nn.Module):
    def __init__(self, vocab_size, embd_size):
        super(SkipGram, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embd_size)
    
    def forward(self, focus, context):
        embed_focus = self.embeddings(focus).view((1, -1)) # input
        embed_ctx = self.embeddings(context).view((1, -1)) # output
        score = torch.mm(embed_focus, torch.t(embed_ctx)) # input*output
        log_probs = F.logsigmoid(score) # sigmoid
        return log_probs
    
    def extract(self, focus):
        embed_focus = self.embeddings(focus)
        return embed_focus

torch.mm performs a matrix multiplication of two matrices.

torch.t expects the input to be a matrix (2-D tensor) and transposes dimensions 0 and 1. It can be seen as a shorthand for transpose(input, 0, 1).
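
A quick check of these two operations on toy tensors:

import torch

a = torch.ones(1, 3)             # shape (1, 3)
b = torch.ones(1, 3) * 2         # shape (1, 3)
print(torch.mm(a, torch.t(b)))   # (1, 3) x (3, 1) -> tensor([[6.]])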

embd_size = 100
learning_rate = 0.001
n_epoch = 30
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
def train_cbow():
    hidden_size = 64
    losses = []
    loss_fn = nn.NLLLoss()
    model = CBOW(vocab_size, embd_size, CONTEXT_SIZE, hidden_size)
    print(model)
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    for epoch in range(n_epoch):
        total_loss = .0
        for context, target in cbow_train:
            ctx_idxs = [w2i[w] for w in context]
            ctx_var = Variable(torch.LongTensor(ctx_idxs))

            model.zero_grad()
            log_probs = model(ctx_var)

            loss = loss_fn(log_probs, Variable(torch.LongTensor([w2i[target]])))

            loss.backward()
            optimizer.step()

            total_loss += loss.data.item()
        losses.append(total_loss)
    return model, losses 
def train_skipgram():
    losses = []
    loss_fn = nn.MSELoss()
    model = SkipGram(vocab_size, embd_size)
    print(model)
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    
    for epoch in range(n_epoch):
        total_loss = .0
        for in_w, out_w, target in skipgram_train:
            in_w_var = Variable(torch.LongTensor([w2i[in_w]]))
            out_w_var = Variable(torch.LongTensor([w2i[out_w]]))
            
            model.zero_grad()
            log_probs = model(in_w_var, out_w_var)
            loss = loss_fn(log_probs[0], Variable(torch.Tensor([target])))
            
            loss.backward()
            optimizer.step()

            total_loss += loss.data.item()
        losses.append(total_loss)
    return model, losses
cbow_model, cbow_losses = train_cbow()
sg_model, sg_losses = train_skipgram()
CBOW(
  (embeddings): Embedding(44, 100)
  (linear1): Linear(in_features=400, out_features=64, bias=True)
  (linear2): Linear(in_features=64, out_features=44, bias=True)
)
SkipGram(
  (embeddings): Embedding(44, 100)
)
import matplotlib.pyplot as plt

plt.figure(figsize= (10, 4))
plt.subplot(121)
plt.plot(range(n_epoch), cbow_losses, 'r-o', label = 'CBOW Losses')
plt.legend()
plt.subplot(122)
plt.plot(range(n_epoch), sg_losses, 'g-s', label = 'SkipGram Losses')
plt.legend()
plt.tight_layout()
_images/e0e533f62149b697a39be11984134982da41109a81caf981026e6789783156c0.png
cbow_vec = cbow_model.extract(Variable(torch.LongTensor([v for v in w2i.values()])))
cbow_vec = cbow_vec.data.numpy()
len(cbow_vec[0])
100
sg_vec = sg_model.extract(Variable(torch.LongTensor([v for v in w2i.values()])))
sg_vec = sg_vec.data.numpy()
len(sg_vec[0])
100
# Reduce dimensionality with PCA
from sklearn.decomposition import PCA
X_reduced = PCA(n_components=2).fit_transform(sg_vec)

# Plot the 2-D projection of all word vectors
import matplotlib.pyplot as plt
import matplotlib

fig = plt.figure(figsize = (20, 10))
ax = fig.gca()
ax.set_facecolor('black')
ax.plot(X_reduced[:, 0], X_reduced[:, 1], '.', markersize = 1, alpha = 0.4, color = 'white')
# Highlight the vectors of a few selected words
words = list(w2i.keys())
# Set a Chinese font, otherwise Chinese characters cannot be displayed in the figure
for w in words:
    if w in w2i:
        ind = w2i[w]
        xy = X_reduced[ind]
        plt.plot(xy[0], xy[1], '.', alpha =1, color = 'red')
        plt.text(xy[0], xy[1], w, alpha = 1, color = 'white', fontsize = 20)
_images/1885356bce901c5609348555ad9d541563565dbd9c652c2cfdfff15e3b235a06.png

NGram Word Embedding Model#

This notebook is the companion source code for Lesson VI of the "Deep Learning with PyTorch" (火炬上的深度学习) course produced by Swarma AI Academy (集智AI学园), http://campus.swarma.org

The idea: train an artificial neural network to predict the next word from the previous N words, and obtain each word's vector as a by-product.

We use Liu Cixin's famous science-fiction novel The Three-Body Problem (《三体》) as the corpus to demonstrate how to train word vectors with an NGram model.

  • Preprocessing consists of three steps: 1) read the file, 2) tokenize, 3) split the corpus into (N+1)-grams to prepare the training data.

  • We do not strip punctuation beforehand, partly to keep the code simple and partly because the tokenizer automatically treats each punctuation mark as a token of its own, so it needs no special handling.

with open("../data/3body.txt", 'r') as f:
    text = str(f.read())
import jieba, re
temp = jieba.lcut(text)
words = []
for i in temp:
    # Filter out all punctuation
    i = re.sub("[\s+\.\!\/_,$%^*(+\"\'””《》]+|[+——!,。?、~@#¥%……&*():]+", "", i)
    if len(i) > 0:
        words.append(i)
print(len(words))
7754
text[:100]
'八万五千三体时(约8.6个地球年)后。\n\n元首下令召开三体世界全体执政官紧急会议,这很不寻常,一定有什么重大的事件发生。\n\n两万三体时前,三体舰队启航了,它们只知道目标的大致方向,却不知道它的距离。也'
print(*words[:50])
八万五千 三体 时 约 86 个 地球 年 后 元首 下令 召开 三体 世界 全体 执政官 紧急会议 这 很 不 寻常 一定 有 什么 重大 的 事件 发生 两万 三体 时前 三体 舰队 启航 了 它们 只 知道 目标 的 大致 方向 却 不 知道 它 的 距离 也许 目标
trigrams = [([words[i], words[i + 1]], words[i + 2]) for i in range(len(words) - 2)]
# Print the first three elements to inspect them
print(trigrams[:3])
[(['八万五千', '三体'], '时'), (['三体', '时'], '约'), (['时', '约'], '86')]
# Build the vocabulary
vocab = set(words)
print(len(vocab))
word_to_idx = {w: [i, 0] for i, w in enumerate(vocab)}  # word -> [index, count]
idx_to_word = {i: w for i, w in enumerate(vocab)}       # index -> word
for w in words:
     word_to_idx[w][1] +=1
2000

Build the NGram neural network model (a three-layer network):

  1. Input layer: an embedding layer. It first maps each input word index to a one-hot vector (e.g. 001000) whose dimension is the vocabulary size, and then maps it through a linear layer to the word's vector representation of dimension embedding_dim.

  2. A linear layer from the concatenated context (context_size * embedding_dim dimensions) to 128 dimensions, followed by a nonlinear ReLU.

  3. A linear layer from 128 dimensions to the vocabulary size, followed by a log softmax that gives the predicted probability of each word.

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

import torch

class NGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGram, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # embedding layer
        self.linear1 = nn.Linear(context_size * embedding_dim, 128) # linear layer
        self.linear2 = nn.Linear(128, vocab_size) # linear layer

    def forward(self, inputs):
        # Embedding lookup; conceptually it maps each word index to a one-hot vector, then applies a linear layer to obtain the word vector
        embeds = self.embeddings(inputs).view(1, -1)
        # Linear layer followed by ReLU
        out = F.relu(self.linear1(embeds))
        
        # Linear layer followed by log softmax
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim = 1)
        return log_probs
    def extract(self, inputs):
        embeds = self.embeddings(inputs)
        return embeds
losses = [] # record the loss of each epoch
criterion = nn.NLLLoss() # negative log-likelihood loss (a common objective for multi-class classification)
model = NGram(len(vocab), 10, 2) # NGram model: embedding dimension 10, context window N = 2
optimizer = optim.SGD(model.parameters(), lr=0.001) # stochastic gradient descent optimizer
# Train for 100 epochs
for epoch in range(100):
    total_loss = torch.Tensor([0])
    for context, target in trigrams:
        # Prepare the model input: map the context words to their indices
        context_idxs = [word_to_idx[w][0] for w in context]
        # Wrap as a PyTorch Variable
        context_var = Variable(torch.LongTensor(context_idxs))
        # Zero the gradients: PyTorch accumulates gradients on backward(), so they must be cleared before each update
        optimizer.zero_grad()
        # Forward pass: compute the log probability of every word
        log_probs = model(context_var)
        # Compute the loss; the target word is also converted to its index and wrapped as a Variable
        loss = criterion(log_probs, Variable(torch.LongTensor([word_to_idx[target][0]])))
        # Backpropagate the gradients
        loss.backward()
        # Update the network parameters
        optimizer.step()
        # Accumulate the loss
        total_loss += loss.data
    losses.append(total_loss)
    print('Epoch {}, loss: {:.2f}'.format(epoch, total_loss.numpy()[0]))
Epoch 0, loss: 56704.61
Epoch 1, loss: 53935.28
Epoch 2, loss: 52241.16
Epoch 3, loss: 51008.51
Epoch 4, loss: 50113.76
Epoch 5, loss: 49434.07
Epoch 6, loss: 48879.33
Epoch 7, loss: 48404.71
Epoch 8, loss: 47983.95
Epoch 9, loss: 47600.01
Epoch 10, loss: 47240.32
Epoch 11, loss: 46897.53
Epoch 12, loss: 46566.24
Epoch 13, loss: 46241.59
Epoch 14, loss: 45920.18
Epoch 15, loss: 45599.50
Epoch 16, loss: 45277.74
Epoch 17, loss: 44953.10
Epoch 18, loss: 44624.41
Epoch 19, loss: 44290.34
Epoch 20, loss: 43950.63
Epoch 21, loss: 43604.48
Epoch 22, loss: 43251.90
Epoch 23, loss: 42891.99
Epoch 24, loss: 42524.64
Epoch 25, loss: 42149.46
Epoch 26, loss: 41766.14
Epoch 27, loss: 41374.89
Epoch 28, loss: 40975.62
Epoch 29, loss: 40568.36
Epoch 30, loss: 40153.31
Epoch 31, loss: 39730.61
Epoch 32, loss: 39300.70
Epoch 33, loss: 38863.39
Epoch 34, loss: 38419.11
Epoch 35, loss: 37968.16
Epoch 36, loss: 37510.99
Epoch 37, loss: 37048.06
Epoch 38, loss: 36579.82
Epoch 39, loss: 36106.78
Epoch 40, loss: 35629.46
Epoch 41, loss: 35148.57
Epoch 42, loss: 34665.39
Epoch 43, loss: 34180.25
Epoch 44, loss: 33693.93
Epoch 45, loss: 33207.48
Epoch 46, loss: 32721.72
Epoch 47, loss: 32237.36
Epoch 48, loss: 31755.00
Epoch 49, loss: 31275.05
Epoch 50, loss: 30798.38
Epoch 51, loss: 30325.62
Epoch 52, loss: 29857.59
Epoch 53, loss: 29394.65
Epoch 54, loss: 28937.08
Epoch 55, loss: 28485.72
Epoch 56, loss: 28041.07
Epoch 57, loss: 27603.33
Epoch 58, loss: 27173.14
Epoch 59, loss: 26750.82
Epoch 60, loss: 26336.92
Epoch 61, loss: 25931.60
Epoch 62, loss: 25534.87
Epoch 63, loss: 25147.07
Epoch 64, loss: 24768.02
Epoch 65, loss: 24397.92
Epoch 66, loss: 24036.68
Epoch 67, loss: 23684.69
Epoch 68, loss: 23341.30
Epoch 69, loss: 23006.46
Epoch 70, loss: 22680.18
Epoch 71, loss: 22361.95
Epoch 72, loss: 22051.86
Epoch 73, loss: 21749.46
Epoch 74, loss: 21454.48
Epoch 75, loss: 21167.06
Epoch 76, loss: 20886.72
Epoch 77, loss: 20613.04
Epoch 78, loss: 20346.13
Epoch 79, loss: 20085.52
Epoch 80, loss: 19831.27
Epoch 81, loss: 19583.16
Epoch 82, loss: 19341.03
Epoch 83, loss: 19104.43
Epoch 84, loss: 18873.11
Epoch 85, loss: 18646.91
Epoch 86, loss: 18425.87
Epoch 87, loss: 18209.80
Epoch 88, loss: 17998.34
Epoch 89, loss: 17791.97
Epoch 90, loss: 17589.94
Epoch 91, loss: 17392.24
Epoch 92, loss: 17199.04
Epoch 93, loss: 17009.97
Epoch 94, loss: 16824.82
Epoch 95, loss: 16643.87
Epoch 96, loss: 16466.76
Epoch 97, loss: 16293.54
Epoch 98, loss: 16123.99
Epoch 99, loss: 15957.75

Training took 12m 24s!

# Extract every word's vector from the trained model
vec = model.extract(Variable(torch.LongTensor([v[0] for v in word_to_idx.values()])))
vec = vec.data.numpy()

# Reduce dimensionality with PCA
from sklearn.decomposition import PCA
X_reduced = PCA(n_components=2).fit_transform(vec)
# Plot the 2-D projection of all word vectors
import matplotlib.pyplot as plt
import matplotlib

fig = plt.figure(figsize = (20, 10))
ax = fig.gca()
ax.set_facecolor('black')
ax.plot(X_reduced[:, 0], X_reduced[:, 1], '.', markersize = 1, alpha = 0.4, color = 'white')
# Highlight the vectors of a few selected words
words = ['智子', '地球', '三体', '质子', '科学', '世界', '文明', '太空', '加速器', '平面', '宇宙', '信息']
# Set a Chinese font, otherwise Chinese characters cannot be displayed in the figure
zhfont1 = matplotlib.font_manager.FontProperties(fname='/Library/Fonts/华文仿宋.ttf', size = 35)
for w in words:
    if w in word_to_idx:
        ind = word_to_idx[w][0]
        xy = X_reduced[ind]
        plt.plot(xy[0], xy[1], '.', alpha =1, color = 'red')
        plt.text(xy[0], xy[1], w, fontproperties = zhfont1, alpha = 1, color = 'white')
_images/c85e2f07de1f0a85af0cb1de4f1532ce6da9a6dfca7d17ba918f1887706d5bf5.png
# Define a function to compute cosine similarity
import numpy as np
def cos_similarity(vec1, vec2):
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    norm = norm1 * norm2
    dot = np.dot(vec1, vec2)
    result = dot / norm if norm > 0 else 0
    return result
    
# Find the vectors closest to the target word among all word vectors and sort them by similarity
def find_most_similar(word, vectors, word_idx):
    vector = vectors[word_to_idx[word][0]]
    simi = [[cos_similarity(vector, vectors[num]), key] for num, key in enumerate(word_idx.keys())]
    sort = sorted(simi)[::-1]
    words = [i[1] for i in sort]
    return words

# Words closest to 智子 (the Sophon)
find_most_similar('智子', vec, word_to_idx)[:10]
['局部', '一场', '来', '错误', '一生', '正中', '航行', '地面', '只是', '政府']

Gensim Word2vec#

import gensim as gensim
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
from gensim.models.word2vec import LineSentence
f = open("./data/三体.txt", 'r')
lines = []

import jieba
import re

for line in f:
    temp = jieba.lcut(line)
    words = []
    for i in temp:
        # Filter out all punctuation
        i = re.sub("[\s+\.\!\/_,$%^*(+\"\'””《》]+|[+——!,。?、~@#¥%……&*():;‘]+", "", i)
        if len(i) > 0:
            words.append(i)
    if len(words) > 0:
        lines.append(words)
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/8b/hhnbt0nd4zsg2qhxc28q23w80000gn/T/jieba.cache
Loading model cost 0.685 seconds.
Prefix dict has been built successfully.
# Train with gensim's Word2Vec implementation.
# Parameters: vector_size is the dimensionality of the word vectors; window is the context width; min_count is the minimum word-frequency threshold
model = Word2Vec(lines, vector_size = 20, window = 2 , min_count = 0)
model.wv.most_similar('三体', topn = 100)
[('地球', 0.9974961876869202),
 ('外星', 0.9959420561790466),
 ('中', 0.9958519339561462),
 ('的', 0.9958161115646362),
 ('信息', 0.9955942034721375),
 ('文明', 0.9955913424491882),
 ('时间', 0.9953882694244385),
 ('宇宙', 0.9951216578483582),
 ('发展', 0.9950591325759888),
 ('与', 0.9950133562088013),
 ('一个', 0.9949650764465332),
 ('它', 0.9945959448814392),
 ('四个', 0.9944809079170227),
 ('世界', 0.9944674968719482),
 ('加速器', 0.9942856431007385),
 ('太阳', 0.9941897988319397),
 ('通过', 0.9940699934959412),
 ('放到', 0.9940448999404907),
 ('将', 0.9940245151519775),
 ('社会', 0.9938994646072388),
 ('实验', 0.9938514232635498),
 ('目前', 0.993848979473114),
 ('其他', 0.9938352704048157),
 ('结构', 0.9938239455223083),
 ('存在', 0.9937348961830139),
 ('舰队', 0.9936554431915283),
 ('这个', 0.9936324954032898),
 ('历史', 0.993557333946228),
 ('内', 0.9935263991355896),
 ('使', 0.9934986233711243),
 ('巨大', 0.9934812188148499),
 ('出现', 0.9934180974960327),
 ('最近', 0.9934037327766418),
 ('大殿', 0.9933778047561646),
 ('毁灭', 0.993355929851532),
 ('等', 0.9933347105979919),
 ('在', 0.9933309555053711),
 ('这样', 0.9933069944381714),
 ('而', 0.9932495355606079),
 ('杨冬', 0.9932450652122498),
 ('周围', 0.9932316541671753),
 ('那个', 0.9932215213775635),
 ('产生', 0.9931648373603821),
 ('为', 0.9931637644767761),
 ('所有', 0.9931597113609314),
 ('沙瑞山', 0.9931430220603943),
 ('不同', 0.9931164383888245),
 ('你们', 0.9930866360664368),
 ('人们', 0.993083119392395),
 ('上', 0.993078351020813),
 ('人类', 0.9930549263954163),
 ('人', 0.9930544495582581),
 ('时', 0.9930477142333984),
 ('那些', 0.9930435419082642),
 ('一些', 0.9930272698402405),
 ('该', 0.9929757714271545),
 ('频率', 0.9928855895996094),
 ('技术', 0.9928467273712158),
 ('美好', 0.9928457736968994),
 ('主要', 0.9928236603736877),
 ('疯狂', 0.9928078055381775),
 ('努力', 0.9927535653114319),
 ('进行', 0.9927522540092468),
 ('需要', 0.992743730545044),
 ('被', 0.9927397966384888),
 ('发生', 0.992729127407074),
 ('接触', 0.9927168488502502),
 ('复杂', 0.99271559715271),
 ('研究', 0.9927100539207458),
 ('和', 0.9926401376724243),
 ('于', 0.992630660533905),
 ('发出', 0.9926106333732605),
 ('三', 0.9926024675369263),
 ('运动', 0.9925642013549805),
 ('只能', 0.9925564527511597),
 ('一片', 0.9925556182861328),
 ('已经', 0.9925482273101807),
 ('自己', 0.992435872554779),
 ('计算', 0.9924312829971313),
 ('那样', 0.9924249649047852),
 ('这些', 0.9924024343490601),
 ('第一次', 0.9923712611198425),
 ('一张', 0.9923089742660522),
 ('基地', 0.9922966361045837),
 ('辐射', 0.9922692775726318),
 ('混乱', 0.9922570586204529),
 ('一名', 0.9922410249710083),
 ('其', 0.9922134280204773),
 ('比', 0.9921837449073792),
 ('为了', 0.992165744304657),
 ('计算机', 0.9921648502349854),
 ('个', 0.99213045835495),
 ('天线', 0.9921154975891113),
 ('之间', 0.9921114444732666),
 ('两次', 0.9920859336853027),
 ('建造', 0.9920815825462341),
 ('杨母', 0.9920707941055298),
 ('后面', 0.9920520186424255),
 ('接收', 0.9920087456703186),
 ('如何', 0.9919811487197876)]
# Project the word vectors into 2-D space
import numpy as np
from sklearn.decomposition import PCA

rawWordVec = []
word2ind = {}
for i, w in enumerate(model.wv.key_to_index):
    rawWordVec.append(model.wv[w])
    word2ind[w] = i
rawWordVec = np.array(rawWordVec)
X_reduced = PCA(n_components=2).fit_transform(rawWordVec)
# Draw a "starry sky" plot
# Plot the 2-D projection of all word vectors
import matplotlib.pyplot as plt
import matplotlib
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly

fig = plt.figure(figsize = (15, 10))
ax = fig.gca()
ax.set_facecolor('black')
ax.plot(X_reduced[:, 0], X_reduced[:, 1], '.', markersize = 1, alpha = 0.5, color = 'white')
# Highlight the vectors of a few selected words
words = ['智子', '地球', '三体', '质子', '科学', '世界', '文明', '太空', '加速器', '平面', '宇宙', '进展','的']
# Set a Chinese font, otherwise Chinese characters cannot be displayed in the figure
#zhfont1 = matplotlib.font_manager.FontProperties(fname='/Library/Fonts/华文仿宋.ttf', size=26)
for w in words:
    if w in word2ind:
        ind = word2ind[w]
        xy = X_reduced[ind]
        plt.plot(xy[0], xy[1], '.', alpha =1, color = 'red')
        #plt.text(xy[0], xy[1], w, fontproperties = zhfont1, alpha = 1, color = 'yellow')
        plt.text(xy[0], xy[1], w, alpha = 1, color = 'yellow', fontsize = 16)
_images/09f2a763d0528fdb6658e6f88786200e0b349e4978f05162fb322de60b33a871.png
# Draw a "starry sky" plot
# Plot the 2-D projection of all word vectors
fig = plt.figure(figsize = (15, 10))
ax = fig.gca()
ax.set_facecolor('black')
ax.plot(X_reduced[:, 0], X_reduced[:, 1], '.', markersize = 1, alpha = 0.3, color = 'white')
# Highlight the vectors of a few selected words
words = ['智子', '地球', '三体', '质子', '科学', '世界', '文明', '太空', '加速器', '平面', '宇宙', '进展','的']
# Set a Chinese font, otherwise Chinese characters cannot be displayed in the figure
zhfont1 = matplotlib.font_manager.FontProperties(fname='/Library/Fonts/华文仿宋.ttf', size=26)
for w in words:
    if w in word2ind:
        ind = word2ind[w]
        xy = X_reduced[ind]
        plt.plot(xy[0], xy[1], '.', alpha =1, color = 'red')
        plt.text(xy[0], xy[1], w, fontproperties = zhfont1, alpha = 1, color = 'yellow')
_images/a89c6e9fcb0a356e3403493890b959bd477b366ac7e7d7b23d0b4dcd6239f89b.png