SwinIR: image super-resolution, denoising and JPEG compression artifact reduction

This article describes SwinIR, a state-of-the-art architecture for image super-resolution, denoising and JPEG compression artifact reduction. It follows the articles describing Transformers and Swin Transformers that can be found here. It also covers shallow and deep feature extraction with residual Swin Transformer blocks, HQ image reconstruction, and the per-pixel, perceptual and Charbonnier losses.

The paper can be found here.

The official PyTorch implementation can be found here.

Table of contents

  1. Introduction
  2. SwinIR architecture
    1. Shallow feature extraction
    2. Deep feature extraction
      1. Residual Swin Transformer Block
        1. Multihead self-attention (MSA)
        2. Multi-layer perceptron (MLP)
        3. Layer Norm (LN)
      2. End convolution
    3. HQ Image Reconstruction
  3. Losses
    1. Image Super-Resolution
    2. Image Denoising / JPEG compression artifact reduction
  4. Conclusion

1) Introduction

SwinIR consists of three parts: shallow feature extraction, deep feature extraction and high-quality image reconstruction. In particular:

  • The shallow feature extraction module uses a convolution layer to extract shallow features, which are directly transmitted to the reconstruction module so as to preserve low-frequency information,
  • the deep feature extraction module is composed of several residual Swin Transformer blocks (RSTB), each of which has several Swin Transformer layers together with a residual connection,
  • the high-quality image reconstruction module fuses the shallow and deep features to produce the output image.

SwinIR outperforms state-of-the-art methods on different tasks by up to 0.14∼0.45 dB, while the total number of parameters can be reduced by up to 67%.

Swin Transformer has shown great promise as it integrates the advantages of both CNN and Transformer:

  • the advantage of CNN to process images of large size, thanks to the local attention mechanism,
  • the advantage of Transformer to model long-range dependencies, thanks to the shifted window scheme.

This article will describe each part of SwinIR in detail.

2) SwinIR architecture

2-1) Shallow feature extraction

A single 3×3 convolutional layer HSF(·) is used to extract the shallow feature from the low-quality image, with 96 output channels by default in the official implementation:

Code found in the official github implementation
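
A minimal sketch of this shallow feature extraction step, assuming a 3-channel input (the names `conv_first` and `embed_dim` mirror the official repo, but the snippet is only illustrative):

```python
import torch
import torch.nn as nn

# Shallow feature extraction: one 3x3 convolution mapping the RGB input to
# embed_dim feature channels (96 by default in the official implementation).
num_in_ch, embed_dim = 3, 96
conv_first = nn.Conv2d(num_in_ch, embed_dim, kernel_size=3, stride=1, padding=1)

lq = torch.randn(1, num_in_ch, 64, 64)   # a low-quality input patch
shallow = conv_first(lq)                 # shape: (1, 96, 64, 64)
```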

2-2) Deep feature extraction

The deep feature extraction module HDF(·) follows HSF(·) and is composed of K residual Swin Transformer blocks (RSTB) followed by a 3×3 convolutional layer:
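
From the paper, with F_0 the shallow feature:

F_DF = H_DF(F_0)

where H_DF is computed block by block, F_i being the output of the i-th RSTB and H_CONV the final 3×3 convolutional layer:

F_i = H_RSTBi(F_i-1), i = 1, 2, ..., K

F_DF = H_CONV(F_K)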

Using a convolution layer at the end of feature extraction brings the inductive bias of the convolution operation into the Transformer-based network, and lays a better foundation for the later aggregation of shallow and deep features.

Shallow features mainly contain low frequencies, while deep features focus on recovering lost high frequencies. With a long skip connection, SwinIR can transmit the low-frequency information directly to the reconstruction module, which helps the deep feature extraction module focus on high-frequency information and stabilizes training.
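
A minimal sketch of this long skip connection, with hypothetical module names (`conv_first`, `rstb_blocks`, `conv_after_body`) loosely inspired by the official repo and plain convolutions standing in for the RSTBs:

```python
import torch
import torch.nn as nn

class DeepFeatureExtractionSketch(nn.Module):
    """Sketch of the long skip connection around the deep feature extractor.
    rstb_blocks stands in for the K residual Swin Transformer blocks; plain
    convolutions are used here only to keep the example self-contained."""
    def __init__(self, embed_dim=96, num_blocks=4):
        super().__init__()
        self.conv_first = nn.Conv2d(3, embed_dim, 3, 1, 1)        # shallow feature extraction
        self.rstb_blocks = nn.Sequential(                         # placeholder for the K RSTBs
            *[nn.Conv2d(embed_dim, embed_dim, 3, 1, 1) for _ in range(num_blocks)])
        self.conv_after_body = nn.Conv2d(embed_dim, embed_dim, 3, 1, 1)

    def forward(self, lq):
        shallow = self.conv_first(lq)
        deep = self.conv_after_body(self.rstb_blocks(shallow))
        # long skip connection: low-frequency shallow features bypass the deep module
        return deep + shallow
```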

2-2-1) Residual Swin Transformer Block

The residual Swin Transformer block (RSTB) is a residual block with Swin Transformer layers (STL) and convolutional layers:

  • Transformer can be viewed as a specific instantiation of spatially varying convolution, while convolutional layers with spatially invariant filters can enhance the translational equivariance of SwinIR,
  • The residual connection provides an identity-based connection from different blocks to the reconstruction module, allowing the aggregation of different levels of features (a sketch of the block structure follows this list).
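
A sketch of that structure, assuming the Swin Transformer layers are passed in as a module (spatial tensors are used for simplicity; the official code flattens the feature map into tokens around the STLs):

```python
import torch.nn as nn

class RSTBSketch(nn.Module):
    """Sketch of a residual Swin Transformer block: a stack of Swin Transformer
    layers, a trailing 3x3 convolution, and a block-level residual connection.
    nn.Identity() keeps the sketch runnable in place of real STLs."""
    def __init__(self, dim, stl_stack=None):
        super().__init__()
        self.stl_stack = stl_stack if stl_stack is not None else nn.Identity()
        self.conv = nn.Conv2d(dim, dim, 3, 1, 1)

    def forward(self, x):                        # x: (B, dim, H, W)
        return self.conv(self.stl_stack(x)) + x  # residual around the whole block
```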

The Swin Transformer layer (STL) is based on the standard multi-head self-attention of the original Transformer layer.

2-2-1-a) Multihead self-attention (MSA)

(Figure captions: Multi-Head Attention, from the paper “Attention Is All You Need”; the shifted window scheme, from the Swin Transformer paper.)

However, some modifications have been made by the SwinIR authors:

  • With an input of size H×W×C, the Swin Transformer first reshapes the feature map to (HW/M²) × M² × C, i.e. it partitions the input into HW/M² non-overlapping local windows of size M×M. It then computes the standard self-attention separately for each window. The attention matrix is computed as:
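
Attention(Q, K, V) = SoftMax(QKᵀ / √d + B) V

(with Q, K and V the query, key and value matrices of a window and d the query/key dimension, as given in the SwinIR paper)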

where B is the learnable relative positional encoding. The attention function is also performed h times in parallel and the results are concatenated, giving the multi-head self-attention (MSA).

2-2-1-b) Multi-layer perceptron (MLP)

A multi-layer perceptron (MLP) with two fully-connected layers and a GELU non-linearity between them is added for further feature transformation. A similar FFNN block was also used in Image Transformer (with ReLU instead of GELU).
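
A minimal sketch of such an MLP block (the hidden expansion ratio is an assumption, not quoted from the paper):

```python
import torch.nn as nn

class MLPSketch(nn.Module):
    """Feed-forward block of the STL: two fully-connected layers with a GELU
    non-linearity in between, applied to each token independently."""
    def __init__(self, dim, hidden_ratio=4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * hidden_ratio)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim * hidden_ratio, dim)

    def forward(self, x):          # x: (B, num_tokens, dim)
        return self.fc2(self.act(self.fc1(x)))
```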

The GELU activation function is xΦ(x), where Φ(x) is the standard Gaussian cumulative distribution function. The GELU non-linearity weights inputs by their percentile, rather than gating inputs by their sign as ReLU does. Consequently, GELU can be thought of as a smoother ReLU (from paperswithcode):

GELU(x) = xΦ(x) = x · P(X ≤ x), where X ~ N(0, 1)

2-2-1-c) Layer Norm (LN)

The LayerNorm (LN) layer is added before both MSA and MLP, and the residual connection is employed for both modules as:

X_MSA = MSA(LN(X)) + X

X_STL = MLP(LN(X_MSA)) + X_MSA
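
Putting MSA, MLP and LN together, here is a pre-norm STL sketch; nn.MultiheadAttention is used only as a stand-in for the window-based MSA with relative positional bias, which is not what the official code does:

```python
import torch
import torch.nn as nn

class STLSketch(nn.Module):
    """Sketch of one Swin Transformer layer in pre-norm form."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, x):                                  # x: (B, num_tokens, dim)
        h = self.norm1(x)
        x = self.attn(h, h, h, need_weights=False)[0] + x  # X_MSA = MSA(LN(X)) + X
        x = self.mlp(self.norm2(x)) + x                    # X_STL = MLP(LN(X_MSA)) + X_MSA
        return x
```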

2-2-2) End convolution

In the paper, the authors mention adding a convolution layer at the end of the block for feature enhancement. However, in the official implementation, we can notice that a block of three convolutions with LeakyReLU can also be used instead of the single convolution:
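
A sketch of the two options (the exact hyper-parameters of the three-convolution bottleneck below are reproduced from memory and should be double-checked against the repo):

```python
import torch.nn as nn

def make_end_conv(dim, variant='1conv'):
    """End convolution of the block: either a single 3x3 convolution, or a
    bottleneck of three convolutions with LeakyReLU that saves parameters."""
    if variant == '1conv':
        return nn.Conv2d(dim, dim, 3, 1, 1)
    # bottleneck variant: reduce channels, 1x1 conv, then expand back
    return nn.Sequential(
        nn.Conv2d(dim, dim // 4, 3, 1, 1), nn.LeakyReLU(negative_slope=0.2, inplace=True),
        nn.Conv2d(dim // 4, dim // 4, 1, 1, 0), nn.LeakyReLU(negative_slope=0.2, inplace=True),
        nn.Conv2d(dim // 4, dim, 3, 1, 1))
```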

2-3) HQ Image Reconstruction

Both shallow and deep features are fused in the reconstruction module for high-quality image reconstruction. For the implementation, they used the “sub-pixel convolution layer” to upsample the features.

This “sub-pixel convolution layer” comes from the paper “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network”, which can be found here. Its authors demonstrated that the common practice of upscaling the input with bicubic interpolation before reconstruction is sub-optimal. They presented the first convolutional neural network (CNN) capable of real-time SR of 1080p videos. Their sub-pixel convolution layer learns an array of upscaling filters to upscale the final LR feature maps into the HR output.

For classical super-resolution, the implementation is a Conv2d followed by LeakyReLU, then the Upsample module, then a final Conv2d:

Upsample class
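
Below is a sketch of a PixelShuffle-based Upsample module in the spirit of the official one, together with the classical-SR reconstruction head described above (argument names and channel numbers are assumptions):

```python
import math
import torch.nn as nn

class UpsampleSketch(nn.Sequential):
    """Sub-pixel upsampler: each step expands the channels by r^2 with a conv,
    then nn.PixelShuffle(r) rearranges them into an r-times larger feature map.
    Scales 2^n and 3 are handled, as in the official Upsample module."""
    def __init__(self, scale, num_feat):
        layers = []
        if (scale & (scale - 1)) == 0:                # scale is a power of 2
            for _ in range(int(math.log2(scale))):
                layers += [nn.Conv2d(num_feat, 4 * num_feat, 3, 1, 1), nn.PixelShuffle(2)]
        elif scale == 3:
            layers += [nn.Conv2d(num_feat, 9 * num_feat, 3, 1, 1), nn.PixelShuffle(3)]
        else:
            raise ValueError(f'scale {scale} is not supported (use 2^n or 3)')
        super().__init__(*layers)

# Classical-SR head: Conv2d + LeakyReLU, sub-pixel upsampling, final Conv2d to RGB.
embed_dim, num_feat, scale = 96, 64, 4
reconstruction = nn.Sequential(
    nn.Conv2d(embed_dim, num_feat, 3, 1, 1), nn.LeakyReLU(inplace=True),
    UpsampleSketch(scale, num_feat),
    nn.Conv2d(num_feat, 3, 3, 1, 1))
```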

The PixelShuffle(upscale_factor) layer rearranges elements in a tensor of shape (*, C x r², H, W) to a tensor of shape (*, C, H x r, W x r), where r is an upscale factor. This is useful for implementing efficient sub-pixel convolution with a stride of 1/r.
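
A quick shape check of nn.PixelShuffle illustrating this rearrangement:

```python
import torch
import torch.nn as nn

r = 2
x = torch.randn(1, 64 * r**2, 32, 32)   # (N, C*r^2, H, W)
y = nn.PixelShuffle(r)(x)
print(y.shape)                          # torch.Size([1, 64, 64, 64]) = (N, C, H*r, W*r)
```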

3) Losses

3-1) Image Super-Resolution

The authors optimized the parameters of SwinIR by minimizing the L1 pixel loss:
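
L = || I_RHQ − I_HQ ||_1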

where I_RHQ is obtained by taking I_LQ as the input of SwinIR, and I_HQ is the corresponding ground-truth HQ image.

For real-world image SR, they used a combination of pixel loss, GAN loss and perceptual loss to improve visual quality.

  • The per-pixel loss uses the absolute error to compare pixel values,
  • The perceptual loss is used when comparing two similar images: it captures high-level differences, like content and style discrepancies, by applying the mean squared error on high-level features of the reconstructed image and of the ground truth rather than on raw pixels (a minimal sketch follows this list):
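
A minimal sketch of a VGG-based perceptual loss, assuming a VGG-19 feature extractor truncated at an arbitrary layer (the layer index and backbone are illustrative choices, not the authors' exact setup):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLossSketch(nn.Module):
    """MSE between VGG-19 features of the prediction and of the target."""
    def __init__(self, layer_index=35):
        super().__init__()
        self.features = vgg19(weights='IMAGENET1K_V1').features[:layer_index].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)       # the feature extractor stays frozen
        self.mse = nn.MSELoss()

    def forward(self, pred, target):      # pred, target: (B, 3, H, W) in [0, 1]
        return self.mse(self.features(pred), self.features(target))
```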

3-2) Image Denoising / JPEG compression artifact reduction

The authors used the Charbonnier loss:
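
L_Char = √( || I_RHQ − I_HQ ||² + ε² )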

where ε is a constant empirically set to 0.001.

4) Conclusion

In this article, a new application area has been investigated: image enhancement with image super-resolution, image denoising and JPEG compression artifact reduction. The architecture presented is transformer-based, which complements previous articles on object detection with other transformer-based architectures.

This area also introduces new types of loss, such as the perceptual loss, which uses the output of multiple layers of a VGG-16/19 network to evaluate the similarity of two images in terms of style and content, instead of simply applying a loss on a pixel-by-pixel comparison between the prediction and the ground truth.