Data
The complete data is available in a Google Drive folder: https://drive.google.com/open?id=1RIIbsS-vxR7Dlo2_v6FWHDFE7q1XPPgj
To reproduce the code in this post, do the following:
1) Download the following files:
- glove.6B.50d.txt (Subfolder GloVe)
- training_10000.csv (Subfolder MAIN FILES)
- validation_1000.csv (Subfolder MAIN FILES)
- testing_same_structure_1000.csv (Subfolder MAIN FILES)
- testing_different_structure_100.csv (Subfolder MAIN FILES)
- saved_model_10000_gpu.pt (Subfolder SAVED MODELS)
2) Adjust the variables: wherever num_training_examples, num_validation_examples, embedding_dim, test_dataframe_same_structure, test_dataframe_different_structure and the saved model file name appear in the code, set them to match the size of the data you are using.
3) Adjust the hyperparameters: choose the model parameters yourself, or refer to the contents of the SAVED MODELS folder.
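Before running anything, it can help to check that the downloaded files match the sizes you plan to pass in. The following is a minimal sketch, assuming the files sit in the working directory; expected_files and missing_files are hypothetical helper names, while the file-name patterns are the ones used by creating_variables later in this post.

```python
from pathlib import Path


def expected_files(num_training_examples, num_validation_examples, embedding_dim):
    """Return the file names the code looks for, given the chosen sizes."""
    return [
        "glove.6B.%dd.txt" % embedding_dim,
        "training_%d.csv" % num_training_examples,
        "validation_%d.csv" % num_validation_examples,
    ]


def missing_files(file_names, data_dir="."):
    """List the expected files that are not present in data_dir (assumed location)."""
    return [name for name in file_names if not (Path(data_dir) / name).is_file()]


files = expected_files(10000, 1000, 50)
print("Expected:", files)
print("Missing:", missing_files(files))
```

If the second list is not empty, either download the missing files or change the three numbers so that they match the files you have.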
Code
Libraries
import pandas as pd
Define helper functions to build the variables used in training and validation
def create_dataframe(csvfile):
Model definition
class Encoder(nn.Module):
Building data and variables
Define a function that calls all the helper functions to initialize the data and variables and to load the pretrained word vectors.

def creating_variables(num_training_examples, num_validation_examples, embedding_dim):
    print(str(datetime.datetime.now()).split('.')[0], "Creating variables for training and validation...")

    # File names are derived from the example counts and the embedding size
    training_dataframe = create_dataframe('training_%d.csv' % num_training_examples)
    vocab = create_vocab(training_dataframe)
    word_to_id = create_word_to_id(vocab)
    id_to_vec, emb_dim = create_id_to_vec(word_to_id, 'glove.6B.%dd.txt' % embedding_dim)

    validation_dataframe = create_dataframe('validation_%d.csv' % num_validation_examples)

    print(str(datetime.datetime.now()).split('.')[0], "Variables created.\n")
    return training_dataframe, vocab, word_to_id, id_to_vec, emb_dim, validation_dataframe
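The body of create_id_to_vec is not shown in this post, so the following is only a sketch of how GloVe vectors are commonly read: each line of a GloVe text file holds a word followed by its space-separated vector components. The function name load_glove_vectors, the "<UNK>" token and the in-memory sample are assumptions made for illustration, not the author's implementation.

```python
import io


def load_glove_vectors(lines, word_to_id):
    """Map word IDs to embedding vectors for vocabulary words found in GloVe-format lines."""
    id_to_vec = {}
    emb_dim = None
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in word_to_id:
            vector = [float(v) for v in values]
            id_to_vec[word_to_id[word]] = vector
            emb_dim = len(vector)
    return id_to_vec, emb_dim


# Tiny in-memory stand-in for a GloVe file (3 dimensions instead of 50)
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
word_to_id = {"<UNK>": 0, "the": 1, "dog": 2}

id_to_vec, emb_dim = load_glove_vectors(sample, word_to_id)
print(id_to_vec, emb_dim)
```

Vocabulary words without a pretrained vector ("dog" and the unknown token here) are simply absent from the result; a real pipeline has to decide how to initialize them, for example randomly.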
Model building
Call Encoder and DualEncoder to build the model.

def creating_model(hidden_size, p_dropout):
    print(str(datetime.datetime.now()).split('.')[0], "Calling model...")

    # emb_dim and vocab are the module-level variables returned by creating_variables
    encoder = Encoder(
        emb_size = emb_dim,
        hidden_size = hidden_size,
        vocab_size = len(vocab),
        p_dropout = p_dropout)

    dual_encoder = DualEncoder(encoder)

    print(str(datetime.datetime.now()).split('.')[0], "Model created.\n")
    print(dual_encoder)

    return encoder, dual_encoder
Computing training and validation accuracy
def increase_count(correct_count, score, label):
Model training function
def train_model(learning_rate, l2_penalty, epochs):
Build the data
training_dataframe, vocab, word_to_id, id_to_vec, emb_dim, validation_dataframe = creating_variables(num_training_examples = 10000,
Set the hidden size and dropout probability, then build the model
encoder, dual_encoder = creating_model(hidden_size = 50,
Set the learning rate, number of epochs and L2 regularization strength, then start training
train_model(learning_rate = 0.0001,
Load the trained model for testing
dual_encoder.load_state_dict(torch.load('saved_model_10000_examples.pt'))
Test method 1:
The test set has the same structure as the training and validation sets (context, response, label).
Evaluation metric: accuracy

test_dataframe_same_structure = pd.read_csv('testing_same_structure_1000.csv')
Test function

def testing_same_structure():
    test_correct_count = 0

    for index, row in test_dataframe_same_structure.iterrows():
        context_ids, response_ids, label = load_ids_and_labels(row, word_to_id)

        # Column vectors of word IDs, shape (sequence_length, 1); uncomment .cuda() to run on a GPU
        context = autograd.Variable(torch.LongTensor(context_ids).view(-1, 1)) #.cuda()
        response = autograd.Variable(torch.LongTensor(response_ids).view(-1, 1)) #.cuda()
        label = autograd.Variable(torch.FloatTensor(torch.from_numpy(np.array(label).reshape(1, 1)))) #.cuda()

        score = dual_encoder(context, response)
        test_correct_count = increase_count(test_correct_count, score, label)

    test_accuracy = get_accuracy(test_correct_count, test_dataframe_same_structure)
    return test_accuracy
Accuracy

test_accuracy = testing_same_structure()
print("Test accuracy for %d training examples and %d test examples: %.2f"
      % (len(training_dataframe), len(test_dataframe_same_structure), test_accuracy))
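The bodies of increase_count and get_accuracy are not shown in this post. One plausible reading, given that the second test method passes the model output through torch.sigmoid, is that the model returns a logit and a prediction counts as correct when the sigmoid of that logit falls on the same side of 0.5 as the label. The sketch below illustrates that assumption in plain Python; count_if_correct, accuracy and the 0.5 threshold are hypothetical, not the author's code.

```python
import math


def sigmoid(x):
    """Squash a raw score (logit) into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def count_if_correct(correct_count, logit, label, threshold=0.5):
    """Increment the counter when the thresholded prediction matches the label."""
    predicted = 1 if sigmoid(logit) >= threshold else 0
    return correct_count + (1 if predicted == int(label) else 0)


def accuracy(correct_count, num_examples):
    """Fraction of examples that were classified correctly."""
    return correct_count / num_examples


# (logit, label) pairs standing in for model outputs on four test rows
examples = [(2.0, 1), (-1.5, 0), (0.3, 0), (-0.2, 1)]

correct = 0
for logit, label in examples:
    correct = count_if_correct(correct, logit, label)

print(accuracy(correct, len(examples)))
```

Here the first two pairs are classified correctly and the last two are not, so the sketch prints 0.5.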
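Test method 2
The test set has a different structure from the training/validation sets (1 question, i.e. the context, 1 ground-truth response and 9 incorrect distractor responses).
Evaluation metric: recall
Load the data

test_dataframe_different_structure = pd.read_csv('testing_different_structure_100.csv')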
Store the word IDs of each utterance in dictionaries
Outer dictionary "ids_per_example_and_candidate": keys = examples, values = inner dictionaries
Inner dictionaries "ids_per_candidate": keys = candidate names, values = list of word IDs

def load_ids(test_dataframe_different_structure, word_to_id):
    print(str(datetime.datetime.now()).split('.')[0], "Loading test IDs...")

    max_context_len = 160
    ids_per_example_and_candidate = {}

    for i, example in test_dataframe_different_structure.iterrows():
        ids_per_candidate = {}

        # Series.items() replaces iteritems(), which was removed in pandas 2.0
        for column_name, cell in example.items():
            id_list = []
            words = str(cell).split()

            # Truncate long utterances to the first max_context_len words
            if len(words) > max_context_len:
                words = words[:max_context_len]

            for word in words:
                if word in word_to_id:
                    id_list.append(word_to_id[word])
                else:
                    id_list.append(0) #UNK

            ids_per_candidate[column_name] = id_list

        ids_per_example_and_candidate[i] = ids_per_candidate

    print(str(datetime.datetime.now()).split('.')[0], "Test IDs loaded.")
    return ids_per_example_and_candidate

ids_per_example_and_candidate = load_ids(test_dataframe_different_structure, word_to_id)
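To make the nested dictionary concrete, here is a minimal sketch of the same lookup applied to a single hand-written row. It mirrors the rules in load_ids (truncate to 160 words, map unknown words to ID 0); the helper name words_to_ids, the "<UNK>" token name and the toy vocabulary are assumptions for illustration.

```python
def words_to_ids(text, word_to_id, max_len=160, unk_id=0):
    """Split text on whitespace, truncate to max_len words, and map each word to its ID."""
    words = str(text).split()[:max_len]
    return [word_to_id.get(word, unk_id) for word in words]


# Toy vocabulary; ID 0 is reserved for unknown words, as in load_ids
word_to_id = {"<UNK>": 0, "hello": 1, "world": 2}

# One test row, keyed by column name
row = {"Context": "hello there world", "Ground Truth Utterance": "hello world"}

ids_per_candidate = {name: words_to_ids(cell, word_to_id) for name, cell in row.items()}
print(ids_per_candidate)
# {'Context': [1, 0, 2], 'Ground Truth Utterance': [1, 2]}
```

The word "there" is not in the toy vocabulary, so it becomes 0. In the full function, one such inner dictionary is stored per test example under its row index.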
Store the scores in dictionaries
Outer dictionary "scores_per_example_and_candidate": keys = examples, values = inner dictionaries
Inner dictionaries "scores_per_candidate": keys = candidate names, values = score

def load_scores():
    print(str(datetime.datetime.now()).split('.')[0], "Computing test scores...")

    scores_per_example_and_candidate = {}

    for example, utterance_ids_dict in sorted(ids_per_example_and_candidate.items()):
        score_per_candidate = {}

        for utterance_name, ids_list in sorted(utterance_ids_dict.items()):
            context = autograd.Variable(torch.LongTensor(utterance_ids_dict['Context']).view(-1, 1))#.cuda()

            # Score every column except the context itself against the context
            if utterance_name != 'Context':
                candidate_response = autograd.Variable(torch.LongTensor(utterance_ids_dict[utterance_name]).view(-1, 1))#.cuda()

                # The sigmoid maps the raw model output to a value between 0 and 1
                score = torch.sigmoid(dual_encoder(context, candidate_response))
                score_per_candidate["Score with " + utterance_name] = score.data[0][0]

        scores_per_example_and_candidate[example] = score_per_candidate

    print(str(datetime.datetime.now()).split('.')[0], "Test scores computed.")
    return scores_per_example_and_candidate

scores_per_example_and_candidate = load_scores()
Define how recall is computed:
The evaluation metric computed here is recall@k: the fraction of test examples for which the ground-truth response is among the k highest-scoring candidates.

def get_recall_at_k(k):
    count_true_hits = 0

    for example, score_per_candidate_dict in sorted(scores_per_example_and_candidate.items()):
        # Keep the k candidates with the highest scores
        top_k = dict(sorted(score_per_candidate_dict.items(), key=operator.itemgetter(1), reverse=True)[:k])

        if 'Score with Ground Truth Utterance' in top_k:
            count_true_hits += 1

    number_of_examples = len(scores_per_example_and_candidate)
    recall_at_k = count_true_hits/number_of_examples
    return recall_at_k

print("recall_at_5 =",get_recall_at_k(k = 5)) #Baseline expectation: 5/10 = 0.5 for random guess
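As a small worked example of recall@k, the sketch below applies the same ranking logic to two hand-made examples with three candidates each instead of ten. The function takes the score dictionary as an argument rather than reading a global, and the distractor key names and score values are made up for illustration.

```python
import operator


def recall_at_k(scores_per_example, k, truth_key="Score with Ground Truth Utterance"):
    """Fraction of examples whose ground-truth response is among the k best-scored candidates."""
    hits = 0
    for scores in scores_per_example.values():
        ranked = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
        top_k_names = [name for name, _ in ranked[:k]]
        if truth_key in top_k_names:
            hits += 1
    return hits / len(scores_per_example)


# Hypothetical scores: the ground truth ranks first in example 0 and second in example 1
toy_scores = {
    0: {"Score with Ground Truth Utterance": 0.9,
        "Score with Distractor_0": 0.4,
        "Score with Distractor_1": 0.2},
    1: {"Score with Ground Truth Utterance": 0.3,
        "Score with Distractor_0": 0.8,
        "Score with Distractor_1": 0.1},
}

print(recall_at_k(toy_scores, k=1))  # 0.5
print(recall_at_k(toy_scores, k=2))  # 1.0
```

With 10 candidates per example, as in the real test set, a random ranking would place the ground truth in the top k with probability k/10, which is where the 0.5 baseline for k = 5 comes from.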
It is recommended to enable the commented-out .cuda() calls and train on a GPU.