Introduction to Style Transfer


Style transfer turns an image into a new image that keeps the original content but takes on a given artistic style. This post explains how a machine can learn to perform style transfer, covering two approaches: solving an optimization problem directly, and solving with a transform network.

Style Transfer

Style transfer takes an existing style image $S$ and a content image $C$ and produces a target image $T$ that carries the style of $S$ while preserving the content of $C$.

This post walks through two ways to approach style transfer:

  1. Cast style transfer as an optimization problem: define a loss $L_c$ between $T$ and $C$, a loss $L_s$ between $T$ and $S$, and an additional smoothness loss $L_v$ on the image, then obtain $T$ by solving the optimization problem $\min_T \sum_i L_i$.
  2. Rather than treating the target image $T$ itself as the variable to optimize, build a transform network that maps the content image $C$ to the target image $T$. The losses are constructed as in approach 1, and the problem is solved by learning the parameters of the transform network.

Optimization Problem

Overview

First, state the problem: given a style image $S$ and a content image $C$, find a target image $T$ that carries the style of $S$ while preserving the content of $C$.

Next, define several loss functions:

  • $L_s$: the distance between $T$ and $S$ in style
  • $L_c$: the distance between $T$ and $C$ in content
  • $L_v$: a measure of how non-smooth $T$ is

Finally, solve the optimization problem:

$$\min_{T} \alpha_s L_s(T,S) + \alpha_c L_c(T,C) + \alpha_v L_v(T)$$

Loss Functions

In the optimization problem above, $L_s$ and $L_c$ are computed from a pre-trained VGG network.

First, a brief introduction to VGG: it is a fixed network architecture, shown below; configurations D and E are the ones typically used, commonly known as VGG-16 and VGG-19:

(Figure: VGG network configuration table)

So why can $L_s$ and $L_c$ be computed from a pre-trained VGG network?

In a trained VGG network, each layer abstracts the input further, and the deeper the layer, the more abstract the features it produces. The features at each layer therefore represent the image at a different level of granularity, and the distance between feature maps can be used to judge how similar two images are in content. Each convolutional layer of VGG produces a feature map; assume its size is $C \times H \times W$ (channels × height × width).
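For intuition, here is a minimal sketch, separate from the full listing below, of pulling one layer's feature map from a pre-trained VGG16 in Keras; the layer name block2_conv2 and the input size are only example choices:

import numpy as np
from keras.models import Model
from keras.applications.vgg16 import VGG16, preprocess_input

# Load VGG16 without its classifier head; weights are pre-trained on ImageNet
vgg = VGG16(weights='imagenet', include_top=False)

# Build a sub-model that outputs the feature map of one chosen layer
feat_model = Model(inputs=vgg.input,
                   outputs=vgg.get_layer('block2_conv2').output)

img = np.random.rand(1, 512, 512, 3).astype('float32') * 255.
features = feat_model.predict(preprocess_input(img))
print(features.shape)   # (1, H', W', C) for that layer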

Assume that at layer $l$ the feature map of $T$ is $T^l$ and the feature map of $C$ is $C^l$. Then $L_c$ is computed as:

$$L_c(T^l, C^l) = || T^l-C^l ||$$

Assume that at layer $l$ the feature map of $S$ is $S^l$ and the feature map of $T$ is $T^l$. Then $L_s$ is computed as:

$$L_s(T^l, S^l) = || G(T^l)-G(S^l) ||$$

Here $G$ denotes the Gram matrix. $G(S^l)$ is computed by first reshaping $S^l$ into a two-dimensional matrix of size $C \times (H \cdot W)$ and then multiplying $S^l$ by its own transpose, $S^l {S^l}^T$. In effect, $G(x)$ captures how the different feature channels of the feature map $x$ interact with one another, which is why it is used to measure style.
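As a quick illustration of this reshape-and-multiply step, here is a minimal NumPy sketch (the array fmap and its shape are assumptions for the example):

import numpy as np

# Illustrative feature map of shape (C, H, W); the values are random placeholders
fmap = np.random.rand(64, 32, 32).astype('float32')

C, H, W = fmap.shape
F = fmap.reshape(C, H * W)   # flatten each channel into a row of length H*W
G = F.dot(F.T)               # Gram matrix of shape (C, C)
print(G.shape)               # (64, 64)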

As for $L_v$, it can be understood as:

$$L_v(T) = \|T-T_{\text{shifted 1 px horizontally}}\| + \|T-T_{\text{shifted 1 px vertically}}\| $$
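A minimal NumPy sketch of this smoothness term is given below; the array img and its shape are assumptions for the example (note the Keras code later in this post uses a slightly different exponent on the squared differences):

import numpy as np

# Illustrative image array of shape (H, W, 3)
img = np.random.rand(256, 256, 3).astype('float32')

# Differences between each pixel and its horizontal / vertical neighbour
dx = img[:, 1:, :] - img[:, :-1, :]
dy = img[1:, :, :] - img[:-1, :, :]

# Norm of horizontal differences plus norm of vertical differences
l_v = np.linalg.norm(dx) + np.linalg.norm(dy)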

Training

Once the loss function $L$ has been constructed, solving the following optimization problem gives the result:

$$\min_{T} \alpha_s L_s(T,S) + \alpha_c L_c(T,C) + \alpha_v L_v(T)$$

The optimization problem is solved with L-BFGS (a quasi-Newton method), which converges faster than plain gradient descent on this problem.

Example

I am a die-hard fan of LeBron James, so the results of applying different style transfers to a photo of James are shown below. One note: the first image in the second column is the result without the smoothness loss $L_v$, and you can see it contains a lot of noise; the second image in the second column is the result with $L_v$ added, and the picture is much cleaner.

(Figure: style-transfer results on the photo, with and without the smoothness loss $L_v$)

Code

The following code is adapted from Siraj Raval on YouTube.

# Load library
from __future__ import print_function

import time
from PIL import Image
import numpy as np

from keras import backend
from keras.models import Model
from keras.applications.vgg16 import VGG16

from scipy.optimize import fmin_l_bfgs_b
from scipy.misc import imsave

# Load and preprocess the content and style images
height = 512
width = 512

content_image_path = 'images/hugo.jpg'
content_image = Image.open(content_image_path)
content_image = content_image.resize((height, width))
content_image

style_image_path = 'images/styles/wave.jpg'
style_image = Image.open(style_image_path)
style_image = style_image.resize((height, width))
style_image

content_array = np.asarray(content_image, dtype='float32')
content_array = np.expand_dims(content_array, axis=0)
print(content_array.shape)

style_array = np.asarray(style_image, dtype='float32')
style_array = np.expand_dims(style_array, axis=0)
print(style_array.shape)

# Subtract mean pixel values and flip RGB to BGR so the inputs match the
# preprocessing used when VGG16 was trained on ImageNet
content_array[:, :, :, 0] -= 103.939
content_array[:, :, :, 1] -= 116.779
content_array[:, :, :, 2] -= 123.68
content_array = content_array[:, :, :, ::-1]

style_array[:, :, :, 0] -= 103.939
style_array[:, :, :, 1] -= 116.779
style_array[:, :, :, 2] -= 123.68
style_array = style_array[:, :, :, ::-1]

content_image = backend.variable(content_array)
style_image = backend.variable(style_array)
combination_image = backend.placeholder((1, height, width, 3))

input_tensor = backend.concatenate([content_image,
                                    style_image,
                                    combination_image], axis=0)

# Reuse a model pre-trained for image classification to define loss functions
model = VGG16(input_tensor=input_tensor, weights='imagenet',
              include_top=False)
layers = dict([(layer.name, layer.output) for layer in model.layers])

content_weight = 0.025
style_weight = 5.0
total_variation_weight = 1.0

# Loss
loss = backend.variable(0.)
# The content loss
def content_loss(content, combination):
    return backend.sum(backend.square(combination - content))

layer_features = layers['block2_conv2']
content_image_features = layer_features[0, :, :, :]
combination_features = layer_features[2, :, :, :]

loss += content_weight * content_loss(content_image_features,
                                       combination_features)
# The style loss
def gram_matrix(x):
    features = backend.batch_flatten(backend.permute_dimensions(x, (2, 0, 1)))
    gram = backend.dot(features, backend.transpose(features))
    return gram

def style_loss(style, combination):
    S = gram_matrix(style)
    C = gram_matrix(combination)
    channels = 3
    size = height * width
    return backend.sum(backend.square(S - C)) / (4. * (channels ** 2) * (size ** 2))

feature_layers = ['block1_conv2', 'block2_conv2',
                  'block3_conv3', 'block4_conv3',
                  'block5_conv3']
for layer_name in feature_layers:
    layer_features = layers[layer_name]
    style_features = layer_features[1, :, :, :]
    combination_features = layer_features[2, :, :, :]
    sl = style_loss(style_features, combination_features)
    loss += (style_weight / len(feature_layers)) * sl
# The total variation loss
def total_variation_loss(x):
    a = backend.square(x[:, :height-1, :width-1, :] - x[:, 1:, :width-1, :])
    b = backend.square(x[:, :height-1, :width-1, :] - x[:, :height-1, 1:, :])
    return backend.sum(backend.pow(a + b, 1.25))

loss += total_variation_weight * total_variation_loss(combination_image)

# Define needed gradients and solve the optimisation problem
grads = backend.gradients(loss, combination_image)
outputs = [loss]
outputs += grads
f_outputs = backend.function([combination_image], outputs)

def eval_loss_and_grads(x):
    x = x.reshape((1, height, width, 3))
    outs = f_outputs([x])
    loss_value = outs[0]
    grad_values = outs[1].flatten().astype('float64')
    return loss_value, grad_values

class Evaluator(object):
    # Caches the loss and gradients from a single evaluation so that
    # fmin_l_bfgs_b, which requests them separately, does not recompute them

    def __init__(self):
        self.loss_value = None
        self.grads_values = None

    def loss(self, x):
        assert self.loss_value is None
        loss_value, grad_values = eval_loss_and_grads(x)
        self.loss_value = loss_value
        self.grad_values = grad_values
        return self.loss_value

    def grads(self, x):
        assert self.loss_value is not None
        grad_values = np.copy(self.grad_values)
        self.loss_value = None
        self.grad_values = None
        return grad_values

evaluator = Evaluator()

# Train
x = np.random.uniform(0, 255, (1, height, width, 3)) - 128.

iterations = 10

for i in range(iterations):
    print('Start of iteration', i)
    start_time = time.time()
    x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x.flatten(),
                                     fprime=evaluator.grads, maxfun=20)
    print('Current loss value:', min_val)
    end_time = time.time()
    print('Iteration %d completed in %ds' % (i, end_time - start_time))

# Evaluation
x = x.reshape((height, width, 3))
x = x[:, :, ::-1]
x[:, :, 0] += 103.939
x[:, :, 1] += 116.779
x[:, :, 2] += 123.68
x = np.clip(x, 0, 255).astype('uint8')

Image.fromarray(x)

Transform Network

Architecture

Treating style transfer as an optimization problem has the following drawback:

  • Every new image requires solving the optimization problem from scratch, which is very inefficient if a large number of images need to be converted to the same style.

So consider building a transformer that maps an image $C$ to the target image $T$. During training, only the parameters of the transformer need to be learned. Once training is finished, any new image can simply be fed through the transformer to obtain its stylized version, which is far more efficient.

The style transfer in this section uses exactly this transformer construction: a pre-trained VGG network extracts the features that define the loss function, and the loss is minimized by adjusting the parameters of the transformer. The setup is illustrated below:

(Figure: transform network followed by a fixed, pre-trained VGG loss network)

Training

The loss functions are defined exactly as in the optimization-problem section. Writing $x$ for the input content image, $y_c$ for the content target, and $y_s$ for the style image, the problem solved here is:

$$\begin{split}
\hat{y} &= f_W(x) \\
\min_{W}\ &\alpha_s L_s(\hat{y},y_s) + \alpha_c L_c(\hat{y},y_c) + \alpha_v L_v(\hat{y})
\end{split}$$
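The post does not include code for this approach; the following is only a minimal, illustrative Keras sketch of a transform network $f_W$. The layer choices and the name transform_net are assumptions for the example, not the exact architecture from the reference:

from keras.layers import Input, Conv2D, Activation, BatchNormalization, add
from keras.models import Model

def transform_net(height=256, width=256):
    # Image in, image out: map a content image to a stylised image
    inputs = Input(shape=(height, width, 3))

    x = Conv2D(32, (9, 9), padding='same')(inputs)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)

    # A single residual block (real transform networks stack several)
    shortcut = x
    x = Conv2D(32, (3, 3), padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(32, (3, 3), padding='same')(x)
    x = BatchNormalization()(x)
    x = add([x, shortcut])

    # Project back to 3 channels; tanh keeps the output in a bounded range
    outputs = Conv2D(3, (9, 9), padding='same', activation='tanh')(x)
    return Model(inputs, outputs)

# Training would keep the VGG loss network frozen, feed batches of content
# images x through transform_net to get y_hat = f_W(x), compute the same
# style/content/smoothness losses as above on y_hat, and update only the
# transform network's weights.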

References

  1. A Neural Algorithm of Artistic Style
  2. Perceptual Losses for Real-Time Style Transfer and Super-Resolution