Ariyo Sanmi, PhD

DATA SCIENTIST | MACHINE LEARNING ENGINEER

Neural Style Transfer

Creating Art from Images Using Convolutional Neural Networks.

 

Introduction

Among many other things, deep learning can be used to compose artistic images by transferring the style of one image onto another.

Style transfer is both interesting and fun, as it showcases the rich internal representations that neural networks learn.

Style transfer combines the content of one image with the style of another. The process involves:

  • Extracting the style from the style image
  • Extracting the content from the content image
  • Fusing the extracted style and content together to form the target image

The objects in the target image and their arrangement resemble those of the content image, while its colors and textures resemble those of the style image.

In the example displayed above, the content image is the cute little dragon, and the style image is the shipwreck of the Minotaur. The generated target image still contains the dragon, but it is stylized with the ocean waves and ship wreckage of the Minotaur.
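Before any features can be extracted, the content and style images have to be loaded and converted into normalized tensors. The helper below is a minimal sketch of that step, assuming PIL, torchvision transforms, and the ImageNet normalization statistics that pretrained VGG19 expects; the function name, default size, and file paths are illustrative rather than the exact code behind this article.

from PIL import Image
from torchvision import transforms

def load_image(img_path, max_size=400):
    """ Load an image, resize it, and convert it to a normalized tensor
        with a batch dimension, ready to be fed through VGG19. """
    image = Image.open(img_path).convert('RGB')

    # keep images reasonably small so the optimization stays fast
    size = min(max(image.size), max_size)

    in_transform = transforms.Compose([
        transforms.Resize(size),
        transforms.ToTensor(),
        # ImageNet statistics used to train VGG19
        transforms.Normalize((0.485, 0.456, 0.406),
                             (0.229, 0.224, 0.225))])

    # add the batch dimension: (1, 3, H, W)
    return in_transform(image).unsqueeze(0)

# illustrative usage with the example images described above
content = load_image('images/dragon.jpg')
style = load_image('images/shipwreck_of_the_minotaur.jpg')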

What is the content of an image?

As an image passes through a CNN, each layer generates feature maps containing increasingly complex representations (features) of the image, such as the objects it contains and their arrangement. For neural style transfer, we want only the content of an image, leaving out its texture and color, and CNNs are very good at extracting such content representations from spatial data.

What is the style of an image?

The style of an image consists of its colors, brush strokes, textures, and curvature.

To obtain the style of an image, we measure the correlations between the feature maps produced within each layer of the convolutional model.

The similarities and differences between these feature maps, measured across several layers, capture the style of the image.
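A toy sketch of that idea (purely illustrative, using random numbers): the dot product between two flattened feature maps measures how strongly they co-activate, and collecting those dot products for every pair of feature maps is exactly the Gram matrix computed later in this article.

import torch

# a toy feature tensor: one image, 4 feature maps (channels), 8x8 spatial grid
features = torch.randn(1, 4, 8, 8)

# flatten each feature map into a vector of length h*w
flat = features.view(4, 8 * 8)

# the correlation between feature map 0 and feature map 1 is their dot product;
# collecting every such pair of dot products gives the 4x4 Gram matrix
corr_01 = torch.dot(flat[0], flat[1])
gram = torch.mm(flat, flat.t())

print(corr_01.item(), gram[0, 1].item())  # the same value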

 

Intermediate layers of the network are selected to extract the content and style representations of an image:

def get_features(image, model, layers=None):
    """ Run an image forward through a model and get the features for
        a set of layers. Default layers are for VGGNet matching Gatys et al (2016).
    """

    ## Map layer indices in VGGNet to the content and style representations
    if layers is None:
        layers = {'0': 'conv1_1',
                  '5': 'conv2_1',
                  '10': 'conv3_1',
                  '19': 'conv4_1',
                  '21': 'conv4_2',  # content representation
                  '28': 'conv5_1'}

    features = {}
    x = image

    # model._modules is a dictionary holding each module in the model
    for name, layer in model._modules.items():
        x = layer(x)
        if name in layers:
            features[layers[name]] = x

    return features

Model

The VGG19 architecture is used to extract the content and style features of our input images. We load a pretrained VGG19 model and freeze its weights.

from torchvision import models

# get the "features" portion of VGG19 (we will not need the "classifier" portion)
vgg = models.vgg19(pretrained=True).features

# freeze all VGG parameters since we're only optimizing the target image
for param in vgg.parameters():
    param.requires_grad_(False)

print(vgg)

Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU(inplace=True)
  (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): ReLU(inplace=True)
  (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (6): ReLU(inplace=True)
  (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): ReLU(inplace=True)
  (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU(inplace=True)
  (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (13): ReLU(inplace=True)
  (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (15): ReLU(inplace=True)
  (16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (17): ReLU(inplace=True)
  (18): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (19): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (20): ReLU(inplace=True)
  (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (22): ReLU(inplace=True)
  (23): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (24): ReLU(inplace=True)
  (25): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (26): ReLU(inplace=True)
  (27): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (29): ReLU(inplace=True)
  (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (31): ReLU(inplace=True)
  (32): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (33): ReLU(inplace=True)
  (34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (35): ReLU(inplace=True)
  (36): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
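With the frozen VGG19 in hand, each image only needs a single forward pass to cache its feature maps. A brief sketch of that step, assuming `content` and `style` are the preprocessed tensors from the loading step and that a GPU is used when one is available (the variable names are illustrative):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# move the frozen network and both images to the same device
vgg.to(device)
content, style = content.to(device), style.to(device)

# one forward pass per image is enough, since VGG's weights never change
content_features = get_features(content, vgg)
style_features = get_features(style, vgg)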

Compute Gram Matrix

Since the style of an image is captured at multiple layers of the model, we obtain style representations at multiple scales. Each feature map tensor is first flattened and then multiplied by its own transpose to get the Gram matrix. The resulting Gram matrix contains non-localized information about the similarities and differences between that layer's feature maps, i.e., its style.

def gram_matrix(tensor):
    """ Calculate the Gram Matrix of a given tensor
        Gram Matrix: https://en.wikipedia.org/wiki/Gramian_matrix
    """

    # get the batch_size, depth, height, and width of the Tensor
    _, d, h, w = tensor.size()

    # reshape so we're multiplying the features for each channel
    tensor = tensor.view(d, h * w)

    # calculate the gram matrix
    gram = torch.mm(tensor, tensor.t())

    return gram
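Before the optimization loop starts, the style Gram matrices can be computed once (the style image never changes), and the target image is usually initialized as a trainable copy of the content image. The layer weights and the content/style weights below are illustrative values in the spirit of Gatys et al., not necessarily the exact ones used to produce the images in this article:

# precompute the Gram matrix for each style layer (the style image never changes)
style_grams = {layer: gram_matrix(style_features[layer]) for layer in style_features}

# start the target image as a clone of the content image and let it be optimized
target = content.clone().requires_grad_(True).to(device)

# how much each style layer contributes; earlier layers capture finer textures
style_weights = {'conv1_1': 1.0,
                 'conv2_1': 0.75,
                 'conv3_1': 0.2,
                 'conv4_1': 0.2,
                 'conv5_1': 0.2}

# alpha (content) and beta (style) weights for the total loss
content_weight = 1    # alpha
style_weight = 1e6    # beta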

Loss Function

The content loss is the mean squared difference between the content representation (the conv4_2 features) of the target image and that of the content image.

The style loss is the mean squared distance between the Gram matrices of the target image and those of the style image, computed at each style layer and weighted per layer.

The two losses, each scaled by its own weight, are added together to get the total loss.

Backpropagation and an optimizer are then used to minimize this loss over many iterations until the process converges, that is, until the target image's content and style closely match the desired content and style representations.

import matplotlib.pyplot as plt
from torch import optim

# for displaying the target image, intermittently
show_every = 400

# iteration hyperparameters
optimizer = optim.Adam([target], lr=0.003)
steps = 5000  # decide how many iterations to update your image (5000)

for ii in range(1, steps+1):

    # get the features from your target image
    target_features = get_features(target, vgg)

    # the content loss
    content_loss = torch.mean((target_features['conv4_2'] - content_features['conv4_2'])**2)

    # the style loss
    # initialize the style loss to 0
    style_loss = 0
    # then add to it for each layer's gram matrix loss
    for layer in style_weights:
        # get the "target" style representation for the layer
        target_feature = target_features[layer]
        target_gram = gram_matrix(target_feature)
        _, d, h, w = target_feature.shape
        # get the "style" style representation
        style_gram = style_grams[layer]
        # the style loss for one layer, weighted appropriately
        layer_style_loss = style_weights[layer] * torch.mean((target_gram - style_gram)**2)
        # add to the style loss
        style_loss += layer_style_loss / (d * h * w)

    # calculate the *total* loss
    total_loss = content_weight * content_loss + style_weight * style_loss

    # update your target image
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    # display intermediate images and print the loss
    if ii % show_every == 0:
        print('Total loss: ', total_loss.item())
        plt.imshow(im_convert(target))
        plt.show()
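The loop above displays intermediate results through an `im_convert` helper that is defined elsewhere in the notebook. A minimal sketch of what such a helper might look like, undoing the ImageNet normalization applied at load time so the tensor can be shown with matplotlib:

import numpy as np

def im_convert(tensor):
    """ Detach a tensor from the graph, move it to the CPU, and undo the
        ImageNet normalization so it can be displayed as an image. """
    image = tensor.to("cpu").clone().detach()
    image = image.numpy().squeeze()      # drop the batch dimension
    image = image.transpose(1, 2, 0)     # (C, H, W) -> (H, W, C)
    image = image * np.array((0.229, 0.224, 0.225)) + np.array((0.485, 0.456, 0.406))
    image = image.clip(0, 1)             # keep pixel values displayable
    return image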

And here comes my personal favorite: fusing an image of President Obama and my favorite basketball player, Steph Curry, with an image of The Great Wave off Kanagawa.

If you’d like to know more about Neural Style Transfer, check out the Udacity Deep Learning Nanodegree.

Also, the code for this article can be found on my GitHub here.
