RNN-based Language Models and Machine Translation (NMT)


Language models built on RNNs have achieved state-of-the-art results in machine translation. This article briefly introduces language models, machine translation, the RNN-based seq2seq architecture, and ways to improve it.

Language Models

A language model computes the probability of a sequence of words, $P(w_1, w_2, \dots, w_T)$.

In machine translation, language models are used for:

  • Word ordering: p(the cat is small) > p(small the is cat)
  • Word choice: p(walking home after school) > p(walking house after school)
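The word-ordering preference can be reproduced with a toy bigram model; the probabilities below are invented purely for illustration:

```python
import math

# Hypothetical bigram probabilities; unseen bigrams get a small floor.
bigram_p = {
    ("the", "cat"): 0.20, ("cat", "is"): 0.30, ("is", "small"): 0.25,
}

def sequence_log_prob(words, floor=1e-6):
    """Chain-rule score of a sequence under the bigram model."""
    return sum(math.log(bigram_p.get(pair, floor))
               for pair in zip(words, words[1:]))

good = sequence_log_prob("the cat is small".split())
bad = sequence_log_prob("small the is cat".split())
assert good > bad  # p(the cat is small) > p(small the is cat)
```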

Traditional Language Models

Traditional language models rely on two assumptions to factor the joint probability of a word sequence into a product of per-word conditional probabilities:

  • Each word depends only on the words that precede it.
  • Each word depends only on the $k$ words immediately preceding it (the Markov assumption).

The conditional probability of each word is estimated from n-gram counts.


A major drawback of traditional language models is that improving accuracy requires increasing the n of the n-grams, and raising n increases the required memory exponentially.
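The blow-up is easy to quantify: with a vocabulary of $V$ words, a full n-gram table has up to $V^n$ entries, so each increment of n multiplies the worst-case memory by $V$. A quick sketch (the vocabulary size is hypothetical):

```python
V = 50_000  # hypothetical vocabulary size

def ngram_table_upper_bound(n, vocab_size=V):
    """Worst-case number of distinct n-grams that must be counted."""
    return vocab_size ** n

for n in (1, 2, 3):
    print(n, ngram_table_upper_bound(n))
```

In practice smoothing and pruning keep real tables far below this bound, but the exponential trend in n is what makes large n impractical.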

RNN-based Language Models

An RNN-based language model exploits the fact that an RNN naturally consumes sequences: a fully connected layer and a softmax layer are stacked on top of the hidden-layer neurons to produce a probability distribution over the output word.
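The output side of one step can be sketched in plain numpy; the sizes and random weights below are made up for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One step of an RNN language model's output side: hidden state ->
# fully connected layer -> softmax over the vocabulary.
H, V = 4, 10                        # hidden size, vocabulary size
rng = np.random.default_rng(0)
W_out = rng.normal(size=(V, H))     # fully connected layer weights
b_out = np.zeros(V)

h_t = rng.normal(size=H)            # hidden state at step t
p_w = softmax(W_out @ h_t + b_out)  # distribution over the next word
```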


However, RNNs are difficult to train; common tricks include:

  • Gradient clipping
  • Initialization with the identity matrix, combined with ReLUs
  • Class-based word prediction: $p(w_t|h) = p(c_t|h)\,p(w_t|c_t)$
  • Optimizers such as RMSProp and Adam
  • Gated units such as GRU and LSTM
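Of these, gradient clipping is the easiest to show concretely. A minimal numpy sketch of clipping by global norm (the same rule as TensorFlow's tf.clip_by_global_norm):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient arrays so their combined (global)
    L2 norm is at most clip_norm; leave small gradients untouched."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= clip_norm:
        return grads
    scale = clip_norm / global_norm
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0])]            # global norm = 5
clipped = clip_by_global_norm(grads, 1.0)  # rescaled to norm 1.0
```

The direction of the update is preserved; only its magnitude is capped, which prevents a single exploding gradient from destroying the weights.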

Machine Translation

Statistical Machine Translation

A statistical machine translation pipeline, in brief, involves two steps:

  1. Build an alignment from source to target.
  2. Generate candidate combinations from the alignment, and use a language model to pick the most probable one.
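The second step amounts to a reranking problem. A toy sketch in the noisy-channel style, with all scores invented for illustration:

```python
# Hypothetical scores for two candidate orderings from the alignment step.
tm = {"the cat is small": 0.3, "cat the is small": 0.4}   # P(source | target)
lm = {"the cat is small": 0.8, "cat the is small": 0.01}  # P(target)

# Noisy-channel choice: argmax over candidates of P(source|target) * P(target)
best = max(tm, key=lambda e: tm[e] * lm[e])
```

Even though the second candidate aligns slightly better, the language model vetoes its word order, so the fluent candidate wins.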


The RNN-based seq2seq Architecture

The seq2seq structure

The RNN-based seq2seq architecture consists of an encoder and a decoder; the decoder operates in two modes, training and inference.


Improving seq2seq

  • In the basic seq2seq decoder, each step sees only $h_{t-1}$ and $x_t$; additional inputs can be added, such as the previous output $y_{t-1}$ and the encoder state $s_{enc}$.
  • Deepen the network.
  • Use a bidirectional RNN, which combines left and right context.
  • Feed the input sequence in reverse order, which makes the optimization problem easier.
  • Use GRU or LSTM cells.
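The input-reversal trick is a one-liner in practice (the graph-building code later applies it with tf.reverse). A sketch:

```python
def reverse_source(batch):
    """Reverse every source sequence; early source words then sit
    close to the early target words, shortening the gradient path."""
    return [seq[::-1] for seq in batch]

print(reverse_source([[1, 2, 3], [4, 5]]))  # → [[3, 2, 1], [5, 4]]
```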

Attention

The core idea of attention is to store the encoder states and let the decoder selectively consult them at every step. Concretely:

  1. Score the decoder's current state against the state of each encoder step.
  2. Normalize the scores (e.g., with a softmax).
  3. Linearly combine the encoder states, using the normalized scores as weights.
  4. Feed the combined state to the decoder as extra input at the current step.
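Steps 1–3 fit in a few lines of numpy. The sketch below uses dot-product scoring and toy state values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(dec_state, enc_states):
    """Dot-product attention over the stored encoder states."""
    scores = enc_states @ dec_state    # 1. score every encoder step
    weights = softmax(scores)          # 2. normalize the scores
    context = weights @ enc_states     # 3. weighted sum of the states
    return context, weights

enc_states = np.array([[1.0, 0.0],    # encoder state, step 1
                       [0.0, 1.0],    # step 2
                       [1.0, 1.0]])   # step 3  (T x H, toy values)
dec_state = np.array([1.0, 0.0])      # decoder state, current step
context, weights = attention_context(dec_state, enc_states)
```

The resulting context vector is what step 4 concatenates onto the decoder's input.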


Search in the decoder

At inference time, the decoder can select the output sequence in several ways:

  • Exhaustive search: enumerate every possible sequence; impractical.
  • Ancestral sampling: sample from the distribution at each step; efficient and unbiased, but high variance.
  • Greedy search: take the most probable token at each step; efficient, but easily misses the best sequence.
  • Beam search: the usual choice. With beam size K, keep K hypotheses; each step expands them to roughly K² candidates and keeps the best K. Higher quality, at a higher computational cost.
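A minimal beam search sketch, with a toy search space whose probabilities are invented for illustration; note that greedy search (K=1) would pick "a" first and end with total probability 0.06, while the beam keeps "b" alive and finds the better path:

```python
import math

def beam_search(step_fn, start, k, max_len):
    """Keep only the k most probable partial sequences at each step.
    step_fn(seq) returns (token, prob) continuations; an empty list
    marks a finished sequence."""
    beams = [([start], 0.0)]  # (sequence, log-probability)
    for _ in range(max_len):
        expanded = []
        for seq, lp in beams:
            conts = step_fn(seq)
            if not conts:                  # finished: carry over as-is
                expanded.append((seq, lp))
                continue
            for tok, p in conts:
                expanded.append((seq + [tok], lp + math.log(p)))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:k]
    return beams

table = {
    ("<s>",): [("a", 0.6), ("b", 0.4)],
    ("<s>", "a"): [("x", 0.1)],   # greedy path ends at 0.6 * 0.1 = 0.06
    ("<s>", "b"): [("y", 0.9)],   # beam path ends at 0.4 * 0.9 = 0.36
}
best_seq, best_lp = beam_search(lambda s: table.get(tuple(s), []),
                                "<s>", k=2, max_len=2)[0]
```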


TensorFlow Example

Below is example code for TensorFlow 1.1, adapted in part from the Udacity Deep Learning Nanodegree sample code.

Inputs:

  • input: a rank-2 tensor holding the source sequences, padded with PAD to a common length; fed to the encoder.
  • targets: a rank-2 tensor holding the target sequences, each ending in EOS and padded with PAD; the decoder's training target.

The network is built from the following functions:

  • model_inputs: builds the network inputs.
  • process_decoder_input: strips the last column from targets and prepends GO to every sequence, producing the decoder input dec_input.
  • encoding_layer: embeds input and runs it through an RNN, yielding enc_output and enc_state.
  • decoding_layer_train: produces the decoder's training-mode output dec_outputs_train.
  • decoding_layer_infer: produces the decoder's inference-mode output dec_outputs_infer.
  • decoding_layer: embeds dec_input and runs the decoder RNN for both training and inference, yielding dec_outputs_train and dec_outputs_infer.
  • seq2seq_model: wires the encoder and decoder together.
# prepare input
def model_inputs():
    """
    Create TF Placeholders for input, targets, learning rate, and lengths of source and target sequences.
    :return: Tuple (input, targets, learning rate, keep probability, target sequence length,
             max target sequence length, source sequence length)
    """
    # input parameters
    input = tf.placeholder(tf.int32, [None, None], name="input")
    targets = tf.placeholder(tf.int32, [None, None], name="targets")
    # training parameters
    learning_rate = tf.placeholder(tf.float32, name="learning_rate")
    keep_prob = tf.placeholder(tf.float32, name="keep_prob")
    # sequence length parameters
    target_sequence_length = tf.placeholder(tf.int32, [None], name="target_sequence_length")
    max_target_sequence_length = tf.reduce_max(target_sequence_length)
    source_sequence_length = tf.placeholder(tf.int32, [None], name="source_sequence_length")

    return (input, targets, learning_rate, keep_prob, target_sequence_length,
            max_target_sequence_length, source_sequence_length)


def process\_decoder\_input(target\_data, target\_vocab\_to\_int, batch\_size):
"""
Preprocess target data for encoding
:param target\_data: Target Placehoder
:param target\_vocab\_to\_int: Dictionary to go from the target words to an id
:param batch\_size: Batch Size
:return: Preprocessed target data
"""

x = tf.strided\_slice(target\_data, [0,0], [batch\_size, -1], [1,1])
y = tf.concat([tf.fill([batch\_size, 1], target\_vocab\_to\_int['<GO>']), x], 1)
return y


# Encoding
def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob,
                   source_sequence_length, source_vocab_size,
                   encoding_embedding_size):
    """
    Create encoding layer
    :param rnn_inputs: Inputs for the RNN
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param keep_prob: Dropout keep probability
    :param source_sequence_length: a list of the lengths of each sequence in the batch
    :param source_vocab_size: vocabulary size of source data
    :param encoding_embedding_size: embedding size of source data
    :return: tuple (RNN output, RNN state)
    """
    # embed the input sequences
    enc_inputs = tf.contrib.layers.embed_sequence(rnn_inputs, source_vocab_size, encoding_embedding_size)
    # construct the rnn cell
    cell = tf.contrib.rnn.MultiRNNCell([
        tf.contrib.rnn.LSTMCell(rnn_size)
        for _ in range(num_layers)])
    # rnn forward pass
    enc_output, enc_state = tf.nn.dynamic_rnn(cell, enc_inputs, sequence_length=source_sequence_length,
                                              dtype=tf.float32)
    return enc_output, enc_state

# Decoding
## Decoding Training
def decoding_layer_train(encoder_state, dec_cell, dec_embed_input,
                         target_sequence_length, max_summary_length,
                         output_layer, keep_prob):
    """
    Create a decoding layer for training
    :param encoder_state: Encoder State
    :param dec_cell: Decoder RNN Cell
    :param dec_embed_input: Decoder embedded input
    :param target_sequence_length: The lengths of each sequence in the target batch
    :param max_summary_length: The length of the longest sequence in the batch
    :param output_layer: Function to apply the output layer
    :param keep_prob: Dropout keep probability
    :return: BasicDecoderOutput containing training logits and sample_id
    """
    # feed the ground-truth target tokens at every step (teacher forcing)
    helper = tf.contrib.seq2seq.TrainingHelper(dec_embed_input, target_sequence_length)
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state, output_layer=output_layer)
    dec_outputs, dec_state = tf.contrib.seq2seq.dynamic_decode(decoder, impute_finished=True,
                                                               maximum_iterations=max_summary_length)
    return dec_outputs

## Decoding Inference
def decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id,
                         end_of_sequence_id, max_target_sequence_length,
                         vocab_size, output_layer, batch_size, keep_prob):
    """
    Create a decoding layer for inference
    :param encoder_state: Encoder state
    :param dec_cell: Decoder RNN Cell
    :param dec_embeddings: Decoder embeddings
    :param start_of_sequence_id: GO ID
    :param end_of_sequence_id: EOS ID
    :param max_target_sequence_length: Maximum length of target sequences
    :param vocab_size: Size of decoder/target vocabulary
    :param output_layer: Function to apply the output layer
    :param batch_size: Batch size
    :param keep_prob: Dropout keep probability
    :return: BasicDecoderOutput containing inference logits and sample_id
    """
    # feed back the decoder's own (greedy) predictions at every step
    start_tokens = tf.tile(tf.constant([start_of_sequence_id], dtype=tf.int32), [batch_size], name='start_tokens')
    helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(dec_embeddings,
                                                      start_tokens, end_of_sequence_id)
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state, output_layer=output_layer)
    dec_outputs, dec_state = tf.contrib.seq2seq.dynamic_decode(decoder, impute_finished=True,
                                                               maximum_iterations=max_target_sequence_length)
    return dec_outputs

## Decoding Layer
from tensorflow.python.layers import core as layers_core

def decoding_layer(dec_input, encoder_state,
                   target_sequence_length, max_target_sequence_length,
                   rnn_size,
                   num_layers, target_vocab_to_int, target_vocab_size,
                   batch_size, keep_prob, decoding_embedding_size):
    """
    Create decoding layer
    :param dec_input: Decoder input
    :param encoder_state: Encoder state
    :param target_sequence_length: The lengths of each sequence in the target batch
    :param max_target_sequence_length: Maximum length of target sequences
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :param target_vocab_size: Size of target vocabulary
    :param batch_size: The size of the batch
    :param keep_prob: Dropout keep probability
    :return: Tuple of (Training BasicDecoderOutput, Inference BasicDecoderOutput)
    """
    # embed the target sequences
    dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)
    # construct the decoder lstm cell
    dec_cell = tf.contrib.rnn.MultiRNNCell([
        tf.contrib.rnn.LSTMCell(rnn_size)
        for _ in range(num_layers)])
    # output layer mapping the decoder outputs to vocabulary logits
    output_layer = layers_core.Dense(target_vocab_size,
                                     kernel_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.1))
    # decoder train
    with tf.variable_scope("decoding") as decoding_scope:
        dec_outputs_train = decoding_layer_train(encoder_state, dec_cell, dec_embed_input,
                                                 target_sequence_length, max_target_sequence_length,
                                                 output_layer, keep_prob)
    # decoder inference (reuses the training weights)
    start_of_sequence_id = target_vocab_to_int["<GO>"]
    end_of_sequence_id = target_vocab_to_int["<EOS>"]
    with tf.variable_scope("decoding", reuse=True) as decoding_scope:
        dec_outputs_infer = decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id,
                                                 end_of_sequence_id, max_target_sequence_length,
                                                 target_vocab_size, output_layer, batch_size, keep_prob)
    return dec_outputs_train, dec_outputs_infer

# Seq2seq
def seq2seq_model(input_data, target_data, keep_prob, batch_size,
                  source_sequence_length, target_sequence_length,
                  max_target_sentence_length,
                  source_vocab_size, target_vocab_size,
                  enc_embedding_size, dec_embedding_size,
                  rnn_size, num_layers, target_vocab_to_int):
    """
    Build the Sequence-to-Sequence part of the neural network
    :param input_data: Input placeholder
    :param target_data: Target placeholder
    :param keep_prob: Dropout keep probability placeholder
    :param batch_size: Batch Size
    :param source_sequence_length: Sequence Lengths of source sequences in the batch
    :param target_sequence_length: Sequence Lengths of target sequences in the batch
    :param source_vocab_size: Source vocabulary size
    :param target_vocab_size: Target vocabulary size
    :param enc_embedding_size: Encoder embedding size
    :param dec_embedding_size: Decoder embedding size
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :return: Tuple of (Training BasicDecoderOutput, Inference BasicDecoderOutput)
    """
    # embedding and encoding
    enc_output, enc_state = encoding_layer(input_data, rnn_size, num_layers, keep_prob,
                                           source_sequence_length, source_vocab_size,
                                           enc_embedding_size)
    # process target data
    dec_input = process_decoder_input(target_data, target_vocab_to_int, batch_size)
    # embedding and decoding
    dec_outputs_train, dec_outputs_infer = decoding_layer(dec_input, enc_state,
                                                          target_sequence_length, tf.reduce_max(target_sequence_length),
                                                          rnn_size,
                                                          num_layers, target_vocab_to_int, target_vocab_size,
                                                          batch_size, keep_prob, dec_embedding_size)
    return dec_outputs_train, dec_outputs_infer

# Build Graph
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""

save_path = 'checkpoints/dev'
(source_int_text, target_int_text), (source_vocab_to_int, target_vocab_to_int), _ = helper.load_preprocess()
max_target_sentence_length = max([len(sentence) for sentence in source_int_text])

train_graph = tf.Graph()
with train_graph.as_default():
    input_data, targets, lr, keep_prob, target_sequence_length, max_target_sequence_length, source_sequence_length = model_inputs()

    #sequence_length = tf.placeholder_with_default(max_target_sentence_length, None, name='sequence_length')
    input_shape = tf.shape(input_data)

    train_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                   targets,
                                                   keep_prob,
                                                   batch_size,
                                                   source_sequence_length,
                                                   target_sequence_length,
                                                   max_target_sequence_length,
                                                   len(source_vocab_to_int),
                                                   len(target_vocab_to_int),
                                                   encoding_embedding_size,
                                                   decoding_embedding_size,
                                                   rnn_size,
                                                   num_layers,
                                                   target_vocab_to_int)

    training_logits = tf.identity(train_logits.rnn_output, name='logits')
    inference_logits = tf.identity(inference_logits.sample_id, name='predictions')

    masks = tf.sequence_mask(target_sequence_length, max_target_sequence_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer
        optimizer = tf.train.AdamOptimizer(lr)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)


# Training
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""

def get\_accuracy(target, logits):
"""
Calculate accuracy
"""

max\_seq = max(target.shape[1], logits.shape[1])
if max\_seq - target.shape[1]:
target = np.pad(
target,
[(0,0),(0,max\_seq - target.shape[1])],
'constant')
if max\_seq - logits.shape[1]:
logits = np.pad(
logits,
[(0,0),(0,max\_seq - logits.shape[1])],
'constant')

return np.mean(np.equal(target, logits))

# Split data to training and validation sets
train\_source = source\_int\_text[batch\_size:]
train\_target = target\_int\_text[batch\_size:]
valid\_source = source\_int\_text[:batch\_size]
valid\_target = target\_int\_text[:batch\_size]
(valid\_sources\_batch, valid\_targets\_batch, valid\_sources\_lengths, valid\_targets\_lengths ) = next(get\_batches(valid\_source,

with tf.Session(graph=train\_graph) as sess:
sess.run(tf.global\_variables\_initializer())

for epoch\_i in range(epochs):

for batch\_i, (source\_batch, target\_batch, sources\_lengths, targets\_lengths) in enumerate(
get\_batches(train\_source, train\_target, batch\_size,
source\_vocab\_to\_int['<PAD>'],
target\_vocab\_to\_int['<PAD>'])):

\_, loss = sess.run(
[train\_op, cost],
{input\_data: source\_batch,
targets: target\_batch,
lr: learning\_rate,
target\_sequence\_length: targets\_lengths,
source\_sequence\_length: sources\_lengths,
keep\_prob: keep\_probability})


if batch\_i % display\_step == 0 and batch\_i > 0:


batch\_train\_logits = sess.run(
inference\_logits,
{input\_data: source\_batch,
source\_sequence\_length: sources\_lengths,
target\_sequence\_length: targets\_lengths,
keep\_prob: 1.0})


batch\_valid\_logits = sess.run(
inference\_logits,
{input\_data: valid\_sources\_batch,
source\_sequence\_length: valid\_sources\_lengths,
target\_sequence\_length: valid\_targets\_lengths,
keep\_prob: 1.0})


train\_acc = get\_accuracy(target\_batch, batch\_train\_logits)


valid\_acc = get\_accuracy(valid\_targets\_batch, batch\_valid\_logits)


print('Epoch {:>3} Batch {:>4}/{} - Train Accuracy: {:>6.4f}, Validation Accuracy: {:>6.4f}, Loss: {:>6.4f}'
.format(epoch\_i, batch\_i, len(source\_int\_text) // batch\_size, train\_acc, valid\_acc, loss))

# Save Model
saver = tf.train.Saver()
saver.save(sess, save\_path)
print('Model Trained and Saved')