
AI Study Notes - Word2Vec Explained in Detail

Writing this up takes effort, so please leave a like.

We start with a small corpus of text and split it into individual tokens:

sentences = ["jack like dog", "jack like cat", "jack like animal",
             "dog cat animal", "banana apple cat dog like", "dog fish milk like",
             "dog cat animal like", "jack like apple", "apple like", "jack like banana",
             "apple banana jack movie book music like", "cat dog hate", "cat dog like"]
sentence_list = " ".join(sentences).split()  # ['jack', 'like', 'dog', ...]
vocab = list(set(sentence_list))
word2idx = {w: i for i, w in enumerate(vocab)}
print("word2idx", word2idx)
vocab_size = len(vocab)
print("vocab_size", vocab_size)

The output is as follows; there are 13 words in total:

word2idx {'banana': 0, 'animal': 1, 'hate': 2, 'like': 3, 'jack': 4, 'dog': 5, 'fish': 6, 'apple': 7, 'book': 8, 'milk': 9, 'music': 10, 'cat': 11, 'movie': 12}
vocab_size 13
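One caveat: `list(set(...))` does not guarantee the same ordering across Python runs, so the `word2idx` mapping printed above (and the plot labels later) can differ between executions. A small sketch of a reproducible variant, using a shortened two-sentence corpus for illustration:

```python
sentences = ["jack like dog", "jack like cat"]  # shortened for illustration
sentence_list = " ".join(sentences).split()
vocab = sorted(set(sentence_list))  # sorted() gives a stable, reproducible order
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for w, i in word2idx.items()}  # reverse lookup for later
print(word2idx)  # {'cat': 0, 'dog': 1, 'jack': 2, 'like': 3}
```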

Next we generate the training set. The rule is to pair the center word with each word in the window before it and each word in the window after it, i.e. [center word, context word] pairs; the values below are vocabulary indices.

skip_grams = []
for idx in range(C, len(sentence_list) - C):center = word2idx[sentence_list[idx]]context_idx = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))context = [word2idx[sentence_list[i]] for i in context_idx]for w in context:skip_grams.append([center, w])print(skip_grams)
print(len(skip_grams))

The skip_grams variable prints as follows (truncated):

[[5, 4], [5, 3], [5, 4], [5, 3], [4, 3], [4, 5], [4, 3], [4, 11], [3, 5], [3, 4], [3, 11], [3, 4], [11, 4], [11, 3], [11, 4], [11, 3], [4, 3], [4, 11], [4, 3], [4, 1], [3, 11], [3, 4], [3, 1], [3, 5], [1, 4], [1, 3], [1, 5], [1, 11], [5, 3], [5, 1], [5, 11], [5, 1], [11, 1], [11, 5], [11, 1], [11, 0], [1, 5], [1, 11], [1, 0], [1, 7], [0, 11], [0, 1], [0, 7], [0, 11], [7, 1], [7, 0], [7, 11], [7, 5], [11, 0], [11, 7], [11, 5], [11, 3], [5, 7], [5, 11], [5, 3], [5, 5],
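To sanity-check the generation rule, here is a hand-traceable sketch on a two-sentence corpus (the indices here are illustrative, assigned by first appearance rather than by `set()`):

```python
sentences = ["jack like dog", "jack like cat"]  # shortened for illustration
sentence_list = " ".join(sentences).split()
# ['jack', 'like', 'dog', 'jack', 'like', 'cat']
word2idx = {w: i for i, w in enumerate(dict.fromkeys(sentence_list))}
# {'jack': 0, 'like': 1, 'dog': 2, 'cat': 3}

C = 2  # window size
skip_grams = []
for idx in range(C, len(sentence_list) - C):  # centers: 'dog', then 'jack'
    center = word2idx[sentence_list[idx]]
    context_idx = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))
    for i in context_idx:
        skip_grams.append([center, word2idx[sentence_list[i]]])

print(skip_grams)
# [[2, 0], [2, 1], [2, 0], [2, 1], [0, 1], [0, 2], [0, 1], [0, 3]]
```

Each center produces 2C = 4 pairs, one per context word in its window.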

Now that we have a training set, we need to build the model. Suppose the one-hot input encoding is 13-dimensional and the trained word vectors are 2-dimensional; this reduces the vocabulary dimensionality and captures relationships between words, as shown in the figure below:

[Figure: network mapping the 13-dim one-hot input to a 2-dim hidden layer and back to a 13-dim output]

The code is below. The W matrix is the one we need to train; the V matrix is discarded once training finishes.

class Word2Vec(nn.Module):
    def __init__(self):
        super(Word2Vec, self).__init__()
        self.W = nn.Parameter(torch.randn(vocab_size, m).type(dtype))
        self.V = nn.Parameter(torch.randn(m, vocab_size).type(dtype))

    def forward(self, X):
        # X : [batch_size, vocab_size]
        hidden = torch.mm(X, self.W)       # [batch_size, m]
        output = torch.mm(hidden, self.V)  # [batch_size, vocab_size]
        return output
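Because X is one-hot, the product X @ W simply selects one row of W, which is exactly that word's embedding. A quick check (sizes here are illustrative):

```python
import torch

vocab_size, m = 5, 2  # illustrative sizes
torch.manual_seed(0)
W = torch.randn(vocab_size, m)

x = torch.zeros(1, vocab_size)
x[0, 3] = 1.0            # one-hot vector for word index 3
hidden = torch.mm(x, W)  # the same multiply as in forward()

# the one-hot matmul just picks out row 3 of W
print(torch.allclose(hidden[0], W[3]))  # True
```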

For backpropagation we again use CrossEntropyLoss to compute the error; CrossEntropyLoss was introduced in the previous chapter. The loss between the model's output and the labels drives backpropagation; if you want to understand how the weights themselves are adjusted, you can also look back through the earlier chapters.
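As a quick reminder of what CrossEntropyLoss expects: raw logits for each class and integer class labels (it applies softmax internally), which is why the context-word index can be used directly as the target without one-hot encoding it. The numbers below are made up for illustration:

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
# logits for 2 samples over a 4-word vocabulary (arbitrary values)
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0],
                       [0.1, 0.2, 3.0, 0.0]])
targets = torch.tensor([0, 2])  # class indices, not one-hot vectors
loss = loss_fn(logits, targets)
print(loss.item())
```

The loss is low here because each sample's largest logit already matches its target index.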


The complete Word2Vec code:


import torch
import numpy as np
import torch.nn as nn
import torch.optim as optimizer
import torch.utils.data as Data

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.FloatTensor

sentences = ["jack like dog", "jack like cat", "jack like animal",
             "dog cat animal", "banana apple cat dog like", "dog fish milk like",
             "dog cat animal like", "jack like apple", "apple like", "jack like banana",
             "apple banana jack movie book music like", "cat dog hate", "cat dog like"]
sentence_list = " ".join(sentences).split()  # ['jack', 'like', 'dog', ...]
vocab = list(set(sentence_list))
word2idx = {w: i for i, w in enumerate(vocab)}
vocab_size = len(vocab)

# model parameters
C = 2           # window size
batch_size = 8
m = 2           # word embedding dim

skip_grams = []
for idx in range(C, len(sentence_list) - C):
    center = word2idx[sentence_list[idx]]
    context_idx = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))
    context = [word2idx[sentence_list[i]] for i in context_idx]
    for w in context:
        skip_grams.append([center, w])

def make_data(skip_grams):
    input_data = []
    output_data = []
    for a, b in skip_grams:
        input_data.append(np.eye(vocab_size)[a])  # one-hot for the center word
        output_data.append(b)                     # context-word index as label
    return input_data, output_data

input_data, output_data = make_data(skip_grams)
input_data, output_data = torch.Tensor(input_data), torch.LongTensor(output_data)
dataset = Data.TensorDataset(input_data, output_data)
loader = Data.DataLoader(dataset, batch_size, True)

class Word2Vec(nn.Module):
    def __init__(self):
        super(Word2Vec, self).__init__()
        self.W = nn.Parameter(torch.randn(vocab_size, m).type(dtype))
        self.V = nn.Parameter(torch.randn(m, vocab_size).type(dtype))

    def forward(self, X):
        # X : [batch_size, vocab_size]
        hidden = torch.mm(X, self.W)       # [batch_size, m]
        output = torch.mm(hidden, self.V)  # [batch_size, vocab_size]
        return output

model = Word2Vec().to(device)
loss_fn = nn.CrossEntropyLoss().to(device)
optim = optimizer.Adam(model.parameters(), lr=1e-3)

for epoch in range(2000):
    for i, (batch_x, batch_y) in enumerate(loader):
        batch_x = batch_x.to(device)
        batch_y = batch_y.to(device)
        pred = model(batch_x)
        loss = loss_fn(pred, batch_y)
        if (epoch + 1) % 1000 == 0:
            print(epoch + 1, i, loss.item())
        optim.zero_grad()
        loss.backward()
        optim.step()

import matplotlib.pyplot as plt
for i, label in enumerate(vocab):
    W, V = model.parameters()
    x, y = float(W[i][0]), float(W[i][1])
    plt.scatter(x, y)
    plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                 ha='right', va='bottom')
plt.show()
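Besides plotting, a common way to inspect the trained embeddings is a nearest-neighbor lookup: rows of W are the word vectors, and related words should end up with similar rows. A sketch of a cosine-similarity query; a random W stands in for the trained model.W here so the snippet runs on its own, and the vocabulary is illustrative:

```python
import torch
import torch.nn.functional as F

vocab = ['jack', 'like', 'dog', 'cat']  # illustrative vocabulary
torch.manual_seed(0)
W = torch.randn(len(vocab), 2)          # stand-in for the trained model.W

def nearest(word, k=2):
    # cosine similarity between this word's vector and every row of W
    v = W[vocab.index(word)].unsqueeze(0)
    sims = F.cosine_similarity(v, W, dim=1)
    sims[vocab.index(word)] = -2.0      # exclude the query word itself
    return [vocab[i] for i in sims.topk(k).indices.tolist()]

print(nearest('dog'))
```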

