
Simple Chinese tokenization, vocabulary construction, and sequential dataset wrapping in Python

news 2026/1/23 22:03:57

Contents

- Overview
- Code
  - Tokenization
  - Vocabulary construction
  - Sequential data generation
  - Building a data.TensorDataset

Overview

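This post tokenizes Chinese text, strips non-Chinese characters, and builds index dictionaries, so that the text can be supplied to training as numeric (index-encoded) data.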

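The text to process is Zhu Ziqing's 《背影》, used here as an example. In the original screenshot the sentences were broken into lines by hand; leaving them unbroken works just as well.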


Code

Tokenization

import re

# Tokenize: strip punctuation and other non-Chinese characters, keeping the line structure
def tokenize_lines(txt_path, encoding='utf-8'):
    with open(txt_path, 'r', encoding=encoding) as f:
        lines = f.readlines()
    # Characters to remove; this enumeration is not exhaustive
    chars_to_remove = (r'[，。？；、：“”：！～()『』「」\\【】\"\[\]➕〈〉/／<>（）‰\％《》\＊\?\-\.…·○０１２３４５６７８９0123456789•\n\t abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ—\*\x0c!#\$%&\'+,:\=@\^_]')
    return [re.sub(chars_to_remove, '', line).strip() for line in lines]
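Since the removal list above is admittedly incomplete, one alternative is to invert the rule and keep only Chinese characters. The sketch below is not the author's method: the helper name keep_chinese_only and the sample sentence are illustrative, and it assumes that "Chinese" means the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
import re

# Matches any character outside the basic CJK Unified Ideographs block
# (U+4E00 to U+9FFF). Rarer characters in the extension blocks are also
# removed; that limitation is accepted here for simplicity.
NON_CHINESE = re.compile(r'[^\u4e00-\u9fff]')


def keep_chinese_only(line):
    """Return `line` with every non-Chinese character removed (hypothetical helper)."""
    return NON_CHINESE.sub('', line)


sample = '我与父亲不相见已二年余了，abc 123!\n'
cleaned = keep_chinese_only(sample)
print(cleaned)        # the sentence without punctuation, letters, digits, or newline
print(list(cleaned))  # character-level tokens
```

A whitelist like this needs no maintenance when new punctuation shows up in the corpus, at the cost of also dropping any rare characters outside the chosen range.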

Vocabulary construction

Note:
For a given corpus, word_to_idx_dict and idx_to_word_dict should come out identical every time the corpus is loaded. Index 0 is also added here to represent unknown words.

import json
import os

# Build the vocabulary: word_to_idx and idx_to_word
class Vocab:
    def __init__(self, tokens, multi_line: bool):
        self.tokens = tokens
        # For multi-line tokens, flatten the 2-D list of sentences into one sequence
        if multi_line:
            word_list = [ch for line in tokens for ch in line]
        else:
            word_list = tokens
        # Create the dictionaries with index 0 (token '0') reserved for unknown words
        self.word_to_idx_dict = {'0': 0}
        self.idx_to_word_dict = {0: '0'}
        # Ids are assigned in order of first appearance
        for word in word_list:
            if word not in self.word_to_idx_dict:
                word_id = len(self.word_to_idx_dict)
                self.word_to_idx_dict[word] = word_id
                self.idx_to_word_dict[word_id] = word

    def __len__(self):
        return len(self.idx_to_word_dict)

    def word2idx(self, word):
        return self.word_to_idx_dict.get(word, 0)

    def idx2word(self, idx):
        return self.idx_to_word_dict.get(idx, '0')

    def save_dict(self, dir_path):
        # os.path.join works whether or not dir_path ends with a separator
        with open(os.path.join(dir_path, 'idx_to_word_dict.json'), 'w', encoding='utf-8') as f:
            json.dump(self.idx_to_word_dict, f, ensure_ascii=False, indent=4)
        with open(os.path.join(dir_path, 'word_to_idx_dict.json'), 'w', encoding='utf-8') as f:
            json.dump(self.word_to_idx_dict, f, ensure_ascii=False, indent=4)
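The mapping is reproducible because ids are handed out in order of first appearance while walking a list. Building the vocabulary from a set() instead could give a different order between interpreter runs, since string hashing is randomized by default. The sketch below isolates that idea and shows how text becomes an index corpus; build_word_to_idx and the toy lines are illustrative stand-ins, not part of the original code.

```python
def build_word_to_idx(words):
    """Assign ids in order of first appearance; id 0 is reserved for unknown words."""
    word_to_idx = {'0': 0}
    for word in words:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)
    return word_to_idx


# Character-level tokens from two already-cleaned lines (toy data)
lines = ['我爱我家', '家爱我']
words = [ch for line in lines for ch in line]

word_to_idx = build_word_to_idx(words)

# Encode new text as indices; characters missing from the vocabulary map to 0
corpus = [word_to_idx.get(ch, 0) for ch in '我爱大家']

print(word_to_idx)  # {'0': 0, '我': 1, '爱': 2, '家': 3}
print(corpus)       # [1, 2, 0, 3]
```

Running the builder twice on the same input gives the same dictionary, which is the property the note above asks for.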

A save_dict(self, dir_path) method is also implemented; it saves both dictionaries to local JSON files.
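One detail to watch when reading the saved files back: JSON object keys are always strings, so the integer keys of idx_to_word_dict are stored as '0', '1', and so on. The source does not show a loading step, so the round trip below is only a sketch of how one might restore the integer keys; the toy dictionary and the temporary directory are assumptions for the example.

```python
import json
import os
import tempfile

# Toy idx-to-word mapping with integer keys, like idx_to_word_dict
idx_to_word = {0: '0', 1: '我', 2: '爱'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'idx_to_word_dict.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(idx_to_word, f, ensure_ascii=False, indent=4)
    with open(path, 'r', encoding='utf-8') as f:
        raw = json.load(f)

# The ids come back as strings, so convert them to int before using
# the mapping for index lookups
restored = {int(idx): word for idx, word in raw.items()}

print(list(raw.keys()))
print(restored == idx_to_word)
```

The word_to_idx_dict file needs no such conversion, because its keys were strings to begin with and its values are stored as JSON numbers.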

Sequential data generation

Training an RNN requires sequential data: each target sequence is the input sequence shifted forward by one position.

import numpy as np

# time_size is the sequence length
def create_random_sequential(corpus, time_size=4):
    xs_tmp = corpus[:-1]  # inputs
    ys_tmp = corpus[1:]   # targets: the inputs shifted by one position
    # Number of full rows; any trailing remainder is dropped
    line_num = len(xs_tmp) // time_size
    xs = np.zeros((line_num, time_size), dtype=np.int32)
    ys = np.zeros((line_num, time_size), dtype=np.int32)
    for i in range(line_num):
        xs[i] = xs_tmp[i * time_size: (i + 1) * time_size]
        ys[i] = ys_tmp[i * time_size: (i + 1) * time_size]
    return xs, ys
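To make the windowing concrete, the sketch below reproduces the same slicing with a reshape instead of a loop and runs it on a toy corpus of ten indices. The name make_windows is illustrative. Note that the rows are consecutive, non-overlapping windows in corpus order: despite "random" in the original function name, nothing is shuffled at this stage.

```python
import numpy as np


def make_windows(corpus, time_size=4):
    """Loop-free equivalent of the windowing above (illustrative helper)."""
    corpus = np.asarray(corpus, dtype=np.int32)
    xs_tmp = corpus[:-1]  # inputs
    ys_tmp = corpus[1:]   # targets, shifted by one position
    line_num = len(xs_tmp) // time_size
    usable = line_num * time_size  # elements past this point are dropped
    xs = xs_tmp[:usable].reshape(line_num, time_size)
    ys = ys_tmp[:usable].reshape(line_num, time_size)
    return xs, ys


xs, ys = make_windows(np.arange(10), time_size=4)
print(xs)  # [[0 1 2 3]
           #  [4 5 6 7]]
print(ys)  # [[1 2 3 4]
           #  [5 6 7 8]]
```

With ten indices and time_size=4, nine input positions are available, so only two full rows fit and the last index (9) is never used.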

Building a data.TensorDataset

import torch
from torch.utils import data

# Wrap the (input, target) arrays in a TensorDataset
def make_dataset(corpus, time_size):
    xs, ys = create_random_sequential(corpus, time_size)
    xs = torch.tensor(xs)
    ys = torch.tensor(ys)
    return data.TensorDataset(xs, ys)
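A TensorDataset is typically consumed through a DataLoader, which handles batching and, if wanted, shuffling. The source stops at the dataset, so the following is only a sketch of a possible next step: the toy corpus, the batch size, and the choice of int64 are assumptions (int64 is used because losses such as nn.CrossEntropyLoss expect int64 class-index targets).

```python
import numpy as np
import torch
from torch.utils import data

# Toy index corpus standing in for word indices from the vocabulary (assumption)
corpus = np.arange(17, dtype=np.int64)
time_size = 4

# Same windowing as above: inputs and targets shifted by one position
line_num = (len(corpus) - 1) // time_size
usable = line_num * time_size
xs = torch.tensor(corpus[:-1][:usable].reshape(line_num, time_size))
ys = torch.tensor(corpus[1:][:usable].reshape(line_num, time_size))

dataset = data.TensorDataset(xs, ys)
# shuffle=False keeps rows in corpus order; set shuffle=True to randomize batches
loader = data.DataLoader(dataset, batch_size=2, shuffle=False)

batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
print(len(dataset))
print(batch_shapes)
```

Each batch has shape (batch_size, time_size), which matches what an embedding layer followed by a batch-first RNN would take as input.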


http://www.mrgr.cn/news/9847.html