IT博客汇 | NLP Study Notes (2) - Text Vectorization and Tokenization

Posted by 52txr on 2024-05-17 20:09:00

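In natural language processing (NLP) projects, word embeddings are a common way to represent words. An embedding maps each word to a vector that encodes semantic information about that word. Word embeddings have been shown to improve performance on NLP tasks such as machine translation, text classification, and sentiment analysis.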

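Main approaches to word embeddings

1. Training your own word embeddings

You train the embeddings on your own dataset with your own model. However, this can require a large amount of data and compute.

Some popular ways to train word embeddings include: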

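Co-occurrence (count-based) methods: these learn embeddings from how often words appear together in a text corpus. GloVe, for example, is a popular method fit to global word co-occurrence statistics.
Neural network (prediction-based) methods: these train a neural network to learn embeddings. Word2Vec, for example, is a popular method that trains a shallow network to predict a word from its context (CBOW) or the context from a word (skip-gram).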
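To make the co-occurrence idea concrete, here is a minimal sketch that counts how often pairs of words appear within a fixed window of each other. The toy corpus, the window size of 2, and the helper name `cooccurrence_counts` are illustrative assumptions; real count-based methods start from statistics like these but add weighting and fitting steps that are not shown here.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count unordered word pairs that appear within `window` positions of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only to the right so each pair of positions is counted once
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pair = tuple(sorted((word, tokens[j])))
            counts[pair] += 1
    return counts

tokens = "the cat sat on the mat the cat slept".split()
counts = cooccurrence_counts(tokens, window=2)
print(counts[("cat", "the")])  # 2
```

Words that share many high-count neighbours end up with similar rows in such a table, which is the signal that embedding methods compress into dense vectors.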

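2. Pretrained word embeddings

You use embeddings that someone else has already trained. This saves a lot of time and effort and usually gives good performance. Many pretrained word embeddings are freely available online: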

Google News word vectors: Word2Vec embeddings trained on the Google News corpus.
Stanford word embeddings (GloVe): embeddings trained on several corpora, including Wikipedia and Twitter.
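Once pretrained vectors have been loaded into a matrix, PyTorch can wrap them in an embedding layer with `torch.nn.Embedding.from_pretrained`. The sketch below uses a tiny made-up matrix in place of a real download; the three words, their vector values, and the name `word_to_index` are all hypothetical.

```python
import torch

# Hypothetical "pretrained" vectors: 3 words, 4 dimensions each (made-up numbers)
words = ["sunset", "purple", "orange"]
vectors = torch.tensor([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
])
word_to_index = {w: i for i, w in enumerate(words)}

# freeze=True (the default) keeps the pretrained vectors fixed during training
em = torch.nn.Embedding.from_pretrained(vectors, freeze=True)

vec = em(torch.tensor([word_to_index["purple"]]))
print(vec.shape)                # torch.Size([1, 4])
print(em.weight.requires_grad)  # False
```

Passing `freeze=False` instead would let the vectors be fine-tuned together with the rest of the model.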

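The word-embedding processing pipeline

1. Converting text to tokens

The text first has to be split into tokens. Each smaller unit of text is called a token, and the process of breaking text into tokens is called tokenization.

Python has many powerful libraries for tokenization.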
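As a minimal illustration that needs no third-party library, the sketch below splits English text into lowercase word tokens with a regular expression. The helper name `tokenize` and the pattern are assumptions chosen for this example; dedicated tokenizers handle contractions, punctuation, and other languages far more carefully.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes as tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Some days the sunsets would be purple and pink.")
print(tokens)
# ['some', 'days', 'the', 'sunsets', 'would', 'be', 'purple', 'and', 'pink']
```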

One-hot encoding and word embeddings are the two most popular ways to map tokens to vectors.

2. Vectorization example

A classic line from the film Flipped (《怦然心动》):

Some days the sunsets would be purple and pink. And some days they were a blazing orange setting fire to the clouds on the horizon. It was during one of those sunsets that my father's idea of the whole being greater than the sum of its parts moved from my head to my heart.

2.1 One-hot encoding

One-hot encoding was introduced in the previous post; the example below shows the basic idea.

Sport feature: ["football", "basketball", "badminton", "table tennis"]

football => 1000

basketball => 0100

badminton => 0010

table tennis => 0001
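The mapping above can be reproduced in a few lines of plain Python. This is only a sketch of the idea, with the category names written in English; the helper name `one_hot` is an assumption for this example.

```python
sports = ["football", "basketball", "badminton", "table tennis"]

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

# Each category gets a vector as long as the number of categories
codes = {name: one_hot(i, len(sports)) for i, name in enumerate(sports)}

for name, code in codes.items():
    print(name, "=>", "".join(map(str, code)))
# football => 1000
# basketball => 0100
# badminton => 0010
# table tennis => 0001
```

The vector length grows with the number of categories, which is why one-hot vectors become very long and sparse for a realistic vocabulary.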

Now let's use Python to turn each word of the quote above into a vector of the form [0,0,0,0,0,0,0,0,0,0,1,0, ... ,0,0,0].

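import string
import numpy as np

# A classic line from the film Flipped
s = '''Some days the sunsets would be purple and pink. And some days they were a blazing orange setting fire to the clouds on the horizon. It was during one of those sunsets that my father's idea of the whole being greater than the sum of its parts moved from my head to my heart.'''

# Lowercase the text and replace every punctuation character with a space
s = s.lower()
for c in string.punctuation:
    s = s.replace(c, " ")

tokens = s.split()

# Map each unique word to an integer index (np.unique returns the words sorted)
vocab = {word: index for index, word in enumerate(np.unique(tokens))}

# Replace each token with its index in the vocabulary
ids = [vocab[w] for w in tokens]

# One row per token, one column per vocabulary word
b = np.zeros((len(ids), len(vocab)))

# Set the column that matches each token's index to 1
for row, col in enumerate(ids):
    b[row, col] = 1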

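2.2 Word-embedding encoding

PyTorch already provides a ready-made module for word embeddings. num_embeddings is the size of the vocabulary, i.e. the number of unique words, and embedding_dim is the length of the vector used to represent each word.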

torch.nn.Embedding(
    num_embeddings: int,
    embedding_dim: int,
    padding_idx: Optional[int] = None,
    max_norm: Optional[float] = None,
    norm_type: float = 2.0,
    scale_grad_by_freq: bool = False,
    sparse: bool = False,
    _weight: Optional[torch.Tensor] = None,
    _freeze: bool = False,
    device=None,
    dtype=None,
)
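A small usage sketch may help read the signature above. It builds a layer for a hypothetical vocabulary of 10 tokens with 4-dimensional vectors and reserves index 0 for padding; these sizes and the example batch are assumptions made only for illustration.

```python
import torch

torch.manual_seed(0)  # make the random initial weights reproducible

# 10 distinct tokens, each mapped to a 4-dimensional vector; index 0 is padding
em = torch.nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)

# A batch of 2 sequences with 3 token indices each (0 marks padding)
batch = torch.tensor([[1, 2, 0], [3, 0, 0]])
out = em(batch)

print(em.weight.shape)  # torch.Size([10, 4])
print(out.shape)        # torch.Size([2, 3, 4])
print(out[0, 2])        # the padding row is initialized to zeros
```

The output keeps the shape of the index tensor and appends one extra dimension of size embedding_dim.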

Below is the complete example code for word-embedding encoding:

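import string
import numpy as np
import torch

# A classic line from the film Flipped
s = '''Some days the sunsets would be purple and pink. And some days they were a blazing orange setting fire to the clouds on the horizon. It was during one of those sunsets that my father's idea of the whole being greater than the sum of its parts moved from my head to my heart.'''

# Lowercase the text and replace every punctuation character with a space
s = s.lower()
for c in string.punctuation:
    s = s.replace(c, " ")

tokens = s.split()

# Map each unique word to an integer index (np.unique returns the words sorted)
vocab = {word: index for index, word in enumerate(np.unique(tokens))}

# nn.Embedding expects integer indices, so convert each token to its vocabulary index
ids = [vocab[w] for w in tokens]

# Vectorization method 2: word embeddings
em = torch.nn.Embedding(len(vocab), 20)  # represent each word with a 20-dimensional vector

s_em = em(torch.LongTensor(ids))  # shape: (number of tokens, 20)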

The resulting s_em looks like this:

[Figure: the result after word embedding]
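The two encodings in this post are closely related: looking up an index in an embedding layer gives the same result as multiplying the corresponding one-hot vector by the layer's weight matrix. The sketch below checks this on a hypothetical vocabulary of 5 words with 3-dimensional vectors; the sizes and indices are assumptions for illustration.

```python
import torch

vocab_size, dim = 5, 3
em = torch.nn.Embedding(vocab_size, dim)
ids = torch.tensor([4, 0, 2])

# Lookup: pick rows 4, 0 and 2 of the weight matrix
looked_up = em(ids)

# Same result via one-hot vectors: (3, 5) @ (5, 3) -> (3, 3)
one_hot = torch.nn.functional.one_hot(ids, num_classes=vocab_size).float()
multiplied = one_hot @ em.weight

print(torch.allclose(looked_up, multiplied))  # True
```

In other words, the embedding layer is a trainable lookup table that avoids ever building the long, sparse one-hot vectors.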