Super Resolution with SRGAN: from Dogs to Everything Else

Irene's Cauldron
7 min read · May 17, 2019

Introduction

Super resolution is the challenging task of enhancing the resolution of low-resolution images. It has attracted significant research interest in recent years, and many breakthroughs have been made. This project focuses on one of the breakthrough models: the Super-Resolution GAN (SRGAN).

This blog post has 4 major parts: an introduction to the original model from the paper, an examination of one implementation on a specific dataset (dog images), a comparison of 4 modified models, and a test of how well the model generalizes to images other than dogs. The idea is to see which architectural changes affect the results, and how well the model generalizes to different types of images.

The SRGAN model was proposed in 2017 by a group of researchers from Twitter. The major difference of this model lies in its choice of loss function. Instead of using a conventional loss function such as MSE, the model adopts a "perceptual loss function" that better reflects the differences between images that human eyes can perceive. The avant-garde loss function is summarized below. Instead of measuring pixel-wise differences, it sums the Euclidean distance between feature representations of the super-resolved image and the original high-resolution image, extracted from layers of a pretrained VGG network, and adds a small adversarial loss term.

The formula of the perceptual loss: (snapshot from the paper)
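For reference, here is a reconstruction of that loss in the paper's notation: the perceptual loss is a weighted sum of a content loss and an adversarial loss, and the VGG content loss is a mean squared error over feature maps φ_{i,j}:

```latex
% Perceptual loss: content loss plus a weighted adversarial loss
l^{SR} = l^{SR}_{X} + 10^{-3}\, l^{SR}_{Gen}

% VGG content loss: Euclidean distance between the feature maps \phi_{i,j}
% of the real high-resolution image and of the generated image
l^{SR}_{VGG/i.j} = \frac{1}{W_{i,j} H_{i,j}}
    \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}}
    \left( \phi_{i,j}\big(I^{HR}\big)_{x,y}
         - \phi_{i,j}\big(G_{\theta_G}(I^{LR})\big)_{x,y} \right)^{2}
```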

As for the GAN structure, a pretrained VGG19 network is used to extract the feature maps for the content loss, the generator is a deep residual network, and the discriminator is a standard CNN binary classification network. While the discriminator generally follows standard practice, the generator is a bit unusual in that it uses skip connections and sub-pixel convolution layers for upsampling. The complete architecture is shown below.

The illustration of the architecture: (snapshot from the paper)
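To make those two building blocks concrete, here is a minimal Keras-style sketch of a residual block and a ×2 sub-pixel upsampling block, roughly following the paper's figure; the filter sizes and the framework are my own assumptions, not the exact code used later.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Conv-BN-PReLU-Conv-BN with an identity skip connection."""
    skip = x
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.PReLU(shared_axes=[1, 2])(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Add()([x, skip])

def upsample_block(x, filters=256):
    """Conv followed by a x2 sub-pixel (pixel-shuffle) upsampling step."""
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)
    return layers.PReLU(shared_axes=[1, 2])(x)
```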

Baseline Model: One Implementation

My implementation of the SRGAN model builds on this repo. It uses pretrained VGG16 weights to calculate the perceptual loss. The architectures of the generator and discriminator are summarized in the figures below.

Summary of the Generator
Summary of the Discriminator
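To illustrate how a VGG16-based perceptual loss like the one mentioned above can be wired up, here is a rough sketch; the chosen layer (`block3_conv3`), the input size, and the exact framework calls are assumptions rather than the repo's actual code.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# Frozen VGG16 feature extractor; the layer choice here is only an assumption.
vgg = VGG16(weights="imagenet", include_top=False, input_shape=(256, 256, 3))
vgg.trainable = False
feature_extractor = Model(vgg.input, vgg.get_layer("block3_conv3").output)

def perceptual_loss(hr_images, sr_images):
    """MSE between VGG16 feature maps of real and generated images
    (VGG preprocessing of the inputs is omitted for brevity)."""
    return tf.reduce_mean(tf.square(feature_extractor(hr_images)
                                    - feature_extractor(sr_images)))
```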

Training with a Different Dataset

In the paper, the model was trained using 350 thousand images from ImageNet. To find out how well the model generalizes to different image content, I decided to retrain it on the Stanford Dogs Dataset, which contains 20,580 images of 120 dog breeds and is a subset of ImageNet. After training, the model is tested on unseen images (photos that I took) of objects other than dogs. The Stanford Dogs Dataset can be downloaded from the link in the references below.

The low-resolution images for training are downsampled from the original high-resolution images by a scale factor of 2. Each batch has 64 downsampled images.
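As a sketch of how such a low-resolution/high-resolution pair can be produced (the actual resizing method and pixel scaling in the repo may differ):

```python
import numpy as np
from PIL import Image

def make_lr_hr_pair(path, hr_size=256, scale=2):
    """Load an image, resize it to hr_size, and downsample it by `scale`."""
    hr = Image.open(path).convert("RGB").resize((hr_size, hr_size), Image.BICUBIC)
    lr = hr.resize((hr_size // scale, hr_size // scale), Image.BICUBIC)
    # Scale pixel values to [-1, 1], a common convention for GAN training.
    to_array = lambda img: np.asarray(img, dtype=np.float32) / 127.5 - 1.0
    return to_array(lr), to_array(hr)
```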

Results

The three pictures below show the improvement of the model as the number of epochs increases. All of these pictures are from the validation set. The left column shows the low-resolution images (the input), the middle column the original high-resolution images (the ground truth), and the right column the super-resolution images (the output). After the first epoch, the output is very blurred and the model does not pick up the colors correctly. Visible improvements appear after the 10th epoch, but the colors are still a little off and the quality of the output is worse than the input. After 200 epochs, the colors are correct and the output is visibly better than the input. The original implementation trained for 100 epochs; I ran 200 epochs to see whether the additional epochs would improve the results, and they do bring visible improvements in image quality.

Results after the 1st epoch
10th Epoch
200th epoch

The loss graph for 200 epochs is shown below.

Modified Model

For both the generator and the discriminator, I tried different modifications and will discuss 4 variations in this blog. For the generator, the original implementation in the repo (not the paper) uses 6 residual blocks of 64 channels each, whose structure can be found in the repo; I modified the structure by changing the number of residual blocks to 5 or 12. The original discriminator has 9 convolutional layers, with the channels expanding from 64 (layer 1) to 1024 (layer 9). To see how the number of channels affects the result, I created two different versions: one has 5 layers with 512 channels each, and the other has 9 layers with 128 channels each. The first version is therefore shallower but wider, while the second keeps the depth but reduces the number of channels. These modifications were chosen to see how the tradeoff between the number of layers and the number of channels affects the outcome.
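Here is a sketch of how these variations can be parameterized, reusing the residual and upsampling blocks sketched earlier; the builder functions and the stride pattern are illustrative, not the repo's actual API.

```python
from tensorflow.keras import layers

def build_generator_body(inputs, n_residual_blocks=6, filters=64):
    """Generator body with a configurable number of residual blocks
    (6 in the baseline, 5 or 12 in the modified versions)."""
    x = layers.Conv2D(filters, 9, padding="same")(inputs)
    x = layers.PReLU(shared_axes=[1, 2])(x)
    long_skip = x
    for _ in range(n_residual_blocks):
        x = residual_block(x, filters)          # as sketched earlier
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, long_skip])
    x = upsample_block(x)                       # x2 upsampling, as sketched earlier
    return layers.Conv2D(3, 9, padding="same", activation="tanh")(x)

def discriminator_body(x, channel_schedule):
    """Stack of conv blocks; the channel schedule is the knob being varied:
    9 values growing from 64 to 1024 for the baseline,
    [512] * 5 or [128] * 9 for the two modified discriminators."""
    for i, channels in enumerate(channel_schedule):
        # Alternate strides so the feature map shrinks gradually (a simplification).
        x = layers.Conv2D(channels, 3, strides=2 if i % 2 else 1, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
    return x
```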

While experimenting with other possible modifications, I also tried removing the batch normalization layers and changing the activation functions; these changes did not lead to visible differences in the results.

Combining the 2 versions of the generator with the 2 versions of the discriminator gives 4 different models, whose high-level architectures are summarized below.

The performance on the same training set is summarized below. Two metrics are used for the comparison: peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). As we can see, none of the modified models performs as well as the original one. It also seems that shrinking the discriminator's channels makes the performance significantly worse (model 1 vs. model 2; model 3 vs. model 4). Similarly, adding more residual blocks to the generator also hurts performance (model 1 vs. model 3; model 2 vs. model 4).
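Both metrics are easy to compute with scikit-image; a sketch of the per-image evaluation (assuming uint8 image arrays, not necessarily the exact code I used):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(hr_image, sr_image):
    """PSNR and SSIM between a ground-truth HR image and a super-resolved output.
    Both inputs are HxWx3 uint8 arrays."""
    psnr = peak_signal_noise_ratio(hr_image, sr_image, data_range=255)
    # On older scikit-image versions, use multichannel=True instead of channel_axis.
    ssim = structural_similarity(hr_image, sr_image, channel_axis=-1, data_range=255)
    return psnr, ssim
```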

Quality of the Modified Models:

Although the metrics show that the 5 models have different levels of performance, their outputs look roughly the same to the human eye. The following pictures show the results from the best model (model 0) and the worst model (model 4). It is quite hard to tell which row comes from the better model; the answer is given at the end of the post.

A visual comparison of the best and worst models: which one is better?

How does the final model generalize to other types of images?

Using the best architecture, I retrained the model for 1000 epochs and tested it on images of objects other than dogs, taken under a variety of exposure conditions. The following images show that the model generalizes quite well to these new pictures. (left: low resolution; middle: ground truth; right: super-resolution output) However, the model performs differently on different objects, even though the overall quality is quite good.

Result on Human
Result on cat: the color tone is a little off
Night scene: the color is also a little off
Words: the super-resolution output has darker edges
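Producing these test outputs is just a forward pass through the trained generator. A minimal sketch, with the model checkpoint and file paths as placeholders:

```python
import numpy as np
from PIL import Image
from tensorflow.keras.models import load_model

# Placeholder path: load whichever generator checkpoint your training run saved.
generator = load_model("generator_1000_epochs.h5", compile=False)

lr = np.asarray(Image.open("my_photo.jpg").convert("RGB"), dtype=np.float32)
lr = lr / 127.5 - 1.0                              # same [-1, 1] scaling as training
sr = generator.predict(lr[np.newaxis, ...])[0]     # add and then drop the batch dimension
Image.fromarray(((sr + 1.0) * 127.5).astype(np.uint8)).save("my_photo_sr.png")
```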

If we zoom in, we will find that the output has smoother edges and more noise.

Output
Original

Future Work

Future work may include changing the training dataset and testing on a wider variety of image qualities, exposure conditions, and object types.

References

Ledig et al., “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” CVPR 2017

Links:

Repo of the original implementation

Repo of my modification

Stanford Dogs Dataset

*Answer: the top row comes from model 4 and the bottom row from model 0 (the best one)
