In natural language processing (NLP) projects, word embeddings are a common way to represent words. Each word is mapped to a vector that encodes its semantic information. Word embeddings have been shown to improve performance on NLP tasks such as machine translation, text classification, and sentiment analysis.
1. Training your own word embeddings
You train the embeddings yourself, on your own dataset and with your own model. However, training word embeddings can require a large amount of data and compute.
Some popular approaches to training word embeddings include Word2Vec (CBOW and Skip-gram), GloVe, and fastText; a minimal Word2Vec sketch follows below.
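As a rough illustration, here is a minimal sketch of training Word2Vec embeddings with the gensim library. The toy corpus, vector_size, and the other parameters are placeholder assumptions for illustration, not values from this article:

import gensim
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (placeholder data)
sentences = [
    ["some", "days", "the", "sunsets", "would", "be", "purple", "and", "pink"],
    ["some", "days", "they", "were", "a", "blazing", "orange"],
]
# vector_size=20 matches the embedding size used later in this article
model = Word2Vec(sentences, vector_size=20, window=3, min_count=1, epochs=50)
print(model.wv["sunsets"])                       # the learned 20-dimensional vector
print(model.wv.most_similar("sunsets", topn=3))  # nearest words in the toy corpus

On a corpus this small the "similar" words are meaningless; the point is only the workflow of fitting a model and looking up vectors.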
2. Using pre-trained word embeddings
You use word embeddings that others have already trained. This saves a great deal of time and effort and usually yields good performance. Many pre-trained embeddings, such as the GloVe and fastText vectors, can be downloaded for free online; a sketch of loading one is shown below.
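As a minimal sketch, assuming you have downloaded a GloVe text file such as glove.6B.100d.txt (the file name and path are assumptions, not something provided by this article), the vectors can be loaded into a dictionary like this:

import numpy as np

# Each line of a GloVe text file is "<word> <v1> <v2> ... <vN>"
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:   # hypothetical local path
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

print(embeddings["sunset"].shape)   # (100,) -- one vector per word in the vocabulary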
The text first has to be represented as tokens: each of these smaller text units is called a token, and the process of breaking text into tokens is called tokenization.
There are many powerful Python libraries for tokenization.
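For example, here is a minimal sketch using NLTK, assuming nltk is installed and its "punkt" tokenizer data is available:

import nltk
nltk.download("punkt")                  # one-time download of the tokenizer data
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Some days the sunsets would be purple and pink.")
print(tokens)   # ['Some', 'days', 'the', 'sunsets', ..., 'pink', '.']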
One-hot encoding and word embeddings are the two most popular ways to map tokens to vectors.
A classic line from the film Flipped will serve as the sample text:
Some days the sunsets would be purple and pink. And some days they were a blazing orange setting fire to the clouds on the horizon. It was during one of those sunsets that my father's idea of the whole being greater than the sum of its parts moved from my head to my heart.
One-hot encoding was briefly introduced earlier; the example below captures the idea, and a short code sketch follows it.
Sport feature: ["football", "basketball", "badminton", "table tennis"]
football => 1000
basketball => 0100
badminton => 0010
table tennis => 0001
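As a quick sketch, this mapping is just the rows of a 4×4 identity matrix (the ordering of the sports above is the only assumption):

import numpy as np

sports = ["football", "basketball", "badminton", "table tennis"]
one_hot = np.eye(len(sports), dtype=int)            # each row is a one-hot vector
mapping = {sport: row for sport, row in zip(sports, one_hot)}
print(mapping["badminton"])                         # [0 0 1 0]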
The Python code below turns the quote above into vectors of this form: [0,0,0,0,0,0,0,0,0,0,1,0, ... ,0,0,0].
import torch
import numpy as np
import string  # used to strip punctuation

# The classic line from Flipped
s = '''Some days the sunsets would be purple and pink. And some days they were a blazing orange setting fire to the clouds on the horizon. It was during one of those sunsets that my father's idea of the whole being greater than the sum of its parts moved from my head to my heart.'''
# Remove punctuation and lowercase everything
for c in string.punctuation:
    s = s.replace(c, " ")
s = s.lower()
# Build a vocabulary that maps each unique word to an integer index
vocab = dict((word, index) for index, word in enumerate(np.unique(s.split())))
# Replace each word in the sentence with its vocabulary index
s = [vocab.get(w) for w in s.split()]
# One row per token, one column per vocabulary word
b = np.zeros((len(s), len(vocab)))
for index, i in enumerate(s):
    b[index, i] = 1
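A quick sanity check on the result (the exact numbers depend on the tokenized text, so this is only a sketch):

print(b.shape)   # (number of tokens, vocabulary size)
index_to_word = {i: w for w, i in vocab.items()}
print(index_to_word[int(np.argmax(b[0]))])   # recovers the first token, "some"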
PyTorch already provides a ready-made module for word embeddings: torch.nn.Embedding. Its num_embeddings argument is the size of the vocabulary (the number of unique words), and embedding_dim is the length of the vector used to represent each word.
torch.nn.Embedding(
num_embeddings: int,
embedding_dim: int,
padding_idx: Optional[int] = None,
max_norm: Optional[float] = None,
norm_type: float = 2.0,
scale_grad_by_freq: bool = False,
sparse: bool = False,
_weight: Optional[torch.Tensor] = None,
_freeze: bool = False,
device=None,
dtype=None,
)
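As a small sketch of one of these arguments (the values below are made up for illustration): padding_idx marks an index whose embedding stays at zero, which is useful when sentences are padded to the same length.

import torch

emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
batch = torch.LongTensor([[2, 5, 0, 0]])   # 0 is used here as the padding token
print(emb(batch)[0, 2])                    # the padding row is all zeros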
Below is a complete example of encoding the text with word embeddings:
import torch
import numpy as np
import string  # used to strip punctuation

# The classic line from Flipped
s = '''Some days the sunsets would be purple and pink. And some days they were a blazing orange setting fire to the clouds on the horizon. It was during one of those sunsets that my father's idea of the whole being greater than the sum of its parts moved from my head to my heart.'''
# Remove punctuation and lowercase everything
for c in string.punctuation:
    s = s.replace(c, " ")
s = s.lower()
# Build a vocabulary that maps each unique word to an integer index
vocab = dict((word, index) for index, word in enumerate(np.unique(s.split())))
# Convert the sentence into a list of word indices
s = [vocab.get(w) for w in s.split()]
# Vectorization method 2: word embeddings
em = torch.nn.Embedding(len(vocab), 20)  # represent each word with a 20-dimensional vector
s_em = em(torch.LongTensor(s))
The resulting s_em looks like this:
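The values themselves are random at this point, because the embedding weights have not been trained yet, but the shape can be checked directly:

print(s_em.shape)   # (number of tokens, 20): one 20-dimensional vector per token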