Enhanced Standard Compatible Image Compression
arXiv:2009.14754v2 [eess.IV] 15 Dec 2021

Abstract—Recent deep neural network-based research to enhance image compression performance can be divided into three categories: learnable codecs, postprocessing networks, and compact representation networks. The learnable codec has been designed for end-to-end learning beyond the conventional compression modules. The postprocessing network increases the quality of decoded images using example-based learning. The compact representation network is learned to reduce the capacity of an input image, reducing the bit rate while maintaining the quality of the decoded image. However, these approaches are not compatible with existing codecs or are not optimal for increasing coding efficiency. Specifically, it is difficult to achieve optimal learning in previous studies using a compact representation network due to the inaccurate consideration of the codecs. In this paper, we propose a novel standard compatible image compression framework based on auxiliary codec networks (ACNs). In addition, ACNs are designed to imitate image degradation operations of the existing codec, which delivers more accurate gradients to the compact representation network. Therefore, compact representation and postprocessing networks can be learned effectively and optimally. We demonstrate that the proposed framework based on the JPEG and High Efficiency Video Coding standard substantially outperforms existing image compression algorithms in a standard compatible manner.

Index Terms—Image compression, deep neural networks, compact representation, JPEG, High Efficiency Video Coding.

I. INTRODUCTION

compression standards, such as JPEG2000 [2], H.264/AVC [3], and High Efficiency Video Coding (HEVC) [4]. Recent video coding standards [3], [4] have adopted prediction-based coding methods to reduce the spatial and temporal redundancy of input video. Prediction-based coding increases the complexity of the compression algorithm but produces much better compression performance.

On the other hand, compression frameworks with end-to-end trainable deep neural networks [5]–[14] (learnable codecs in this paper) have been proposed based on the rapid development of deep learning. The approaches use trainable networks to produce bitstreams and reconstruct the original image (Fig. 1 (a)). Although these kinds of approaches structurally consider the compression ratio and reconstruction quality, their performance is still undesirable and incompatible with standard codecs, which decreases the algorithm's utility.

It is easy to propose a method to restore an image after the compression process to improve the compression performance while being compatible with standard codecs. Following the developments of convolutional neural networks (CNN), such as ResNet [15], DenseNet [16], and attention networks [17], [18], the CNN-based image postprocessing algorithms [19]–[28] have drastically improved the performance of image restoration. These kinds of approaches are designated as a postprocessing network (PPNet) in this paper. Although
[Figure 1, panels (c) and (d). Diagram labels: CRNet, Codec, PPNet, ACN, Bitstream.]
(c) Standard compatible frameworks based on the compact representation network (CRNet)
(d) Proposed framework based on the auxiliary codec network (ACN)
Fig. 1. Conceptual comparison between frameworks. Green and red arrows indicate forward (or inference) and backward pass (or gradients) to train the CRNet, respectively. Gray modules indicate that it is not differentiable or a standard codec. Blue modules indicate a differentiable network.
do not consider the degradation process through the standard codec (Fig. 1 (c)) because the standard codec, including the quantization process, is a nondifferentiable module.

In this paper, we propose a novel standard compatible end-to-end image compression framework based on auxiliary codec networks (ACNs). The ACNs are designed to imitate the forward image degradation process of existing codecs in differentiable networks to provide the correct backward gradients for training the CRNet (Fig. 1 (d)). These gradients allow the compactly represented image to account for both the degradation process by the ACN and the reconstruction process by the PPNet. Based on ACNs, both the CRNet and PPNet are learned together to achieve better image compression performance in a standard compatible manner. In addition, a bit estimation network (BENet) is proposed for training as a regularization function to prevent undesired bit-rate increments. As recent CRNet-based methods [35]–[39] generate models at a single level, the proposed framework is also trained and optimized per codec and rate.

The contributions of this paper are summarized as follows:

  •   We propose a novel CNN architecture called the ACN, based on the prior of the image compression process to effectively and precisely train the CRNet.
  •   Based on the ACN, we propose an enhanced compression framework based on the collaborative learning scheme between the ACN, PPNet, and CRNet. Furthermore, the BENet facilitates training using a proper bit prediction, preventing undesirable artifacts in compactly represented images.
  •   The framework is compatible with compression algorithms from the standard codecs to learning-based codecs and any off-the-shelf image restoration networks. Based on the highly accurate ACNs for two standard codecs, JPEG and HEVC, our framework exhibits state-of-the-art results compared to other image compression algorithms, including standards and learnable codecs.

II. RELATED WORK

A. Compression Frameworks Based on End-to-end Trainable Networks (Learnable Codecs)

As deep learning has been successful in the field of image processing, Toderici et al. [5], [6] first proposed an end-to-end deep neural network-based approach in image compression. An input image with dimensions reduced through an auto-encoder is stored as a binary vector for a given compression rate and is optimized for minimum distortion.

As the strong modeling capacity of neural networks became apparent, many follow-up studies have been conducted. Theis et al. [7] proposed a compressive auto-encoder based on a residual neural network [15] and used a Laplace-smoothed histogram as the entropy model. Ballé et al. [8] jointly optimized the entire model for rate-distortion performance using a generalized divisive normalization transform. Further, Ballé et al. [9] proposed a hyperprior to effectively capture spatial redundancy in the latent encoding. In addition, Johnston et al. [10] proposed a priming technique and spatially adaptive entropy model for image compression. Moreover, Li et al. [11] proposed a content-weighted method based on spatially adaptive importance map learning. Mentzer et al. [12] proposed a model that concurrently trains a context model with an encoder and used three-dimensional convolutional networks. Minnen et al. [13] and Lee et al. [14] combined a context-adaptive entropy model and hyperprior, producing substantial performance improvements.

The primary difficulties of deep image compression algorithms include making the nondifferentiable quantization process end-to-end trainable, designing an entropy model that predicts the bitstream generated from coefficients, and enabling compression considering both the bit rate and distortion. However, although many deep network-based algorithms have been developed, it is challenging to replace conventional compression schemes due to compatibility. Furthermore, although the state-of-the-art approaches outperform even the Better Portable Graphics (BPG) [40] codec, which is designed based on the intra-mode of HEVC, a significant performance improvement has not been demonstrated.
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. XX, NO. X, X XXXX                                                                    3
[Figure 2. Diagram labels: CRNet, BENet, ACN, Codec, PPNet, Train Phase, Test Phase, Codec-mimicking network, Prediction Image, Residual Image.]
Fig. 2. (a) Illustration of the end-to-end learning pipeline for image compression. (b)-(d) Detailed structures of the ACN and BENet.
codec. However, the codec-related functions Φ and φ have a nondifferentiable quantization operation that creates problems for the backpropagation algorithm. To overcome this problem, the codec-related functions were replaced with differentiable neural networks h and p, as follows:

$$\theta_f^*, \theta_g^* \approx \operatorname*{argmin}_{\theta_f, \theta_g} \; \delta(x, g(h(f(x)))) + \lambda\, p(f(x)), \tag{4}$$

where $\theta_f$ and $\theta_g$ are the parameters of the functions f and g, respectively. In (4), we can reach the ideally optimal solution $\theta_f$ and $\theta_g$ if the two neural networks h and p perfectly model the real codec modules Φ and φ. All parts of the objective function are composed of learnable neural networks, enabling backpropagation in the end-to-end learning scheme.

We defined h as the ACN and p as BENet in this paper. Both the ACN and BENet have fixed parameters in the process of optimizing (4). The overall pipeline of the proposed compression framework is illustrated in detail in Fig. 2 (a). The original image passes through the CRNet and is expressed as a compact image to reduce the amount of information. Next, the BENet calculates the number of predicted bits, and the ACN generates an imitated decoded image from the compact image. Finally, the PPNet performs restoration of the original image. The codec imitation module consisting of the ACN and BENet was used only for training and can be used as a gradient path for training CRNet. In the testing phase, CRNet and PPNet were used with existing codecs, such as conventional preprocessing and postprocessing modules.

B. Auxiliary Codec Network

The parameter values $\theta_f^*$ and $\theta_g^*$ were obtained by optimizing the objective function in (4) using the approximation (imitation) function of the codec modules Φ and φ. However, the obtained parameters may be different when the actual codec is applied. The output of h should be as close as possible to that of the actual codec module Φ to reduce these differences and perform optimal learning. In this section, we propose novel CNN architectures of the ACN that closely approximate two typical standard codecs, JPEG and HEVC intra coding (HEVC-intra). The objective function to train the ACNs is as follows:

$$\theta_h^* = \operatorname*{argmin}_{\theta_h} \; \| h(x) - \Phi(x) \|_2^2. \tag{5}$$

The architectures of networks imitating JPEG and HEVC-intra are depicted in Fig. 2 (b) and (c). In the JPEG codec,
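As a toy illustration of why a differentiable imitation of the codec is useful, the following Python sketch compares the gradient through a quantizer with the gradient through a surrogate fitted with the squared-error objective in (5). It is a minimal sketch under strong assumptions: the "codec" is a scalar uniform quantizer, the surrogate is a linear function fitted in closed form rather than a CNN, and all names (`codec`, `fit_linear_surrogate`, `surrogate`) are hypothetical and do not come from the paper.

```python
# Toy 1-D illustration; every name here is hypothetical and not from the paper.

def codec(x, step=0.25):
    """Stand-in for the real codec Phi: uniform quantization (nondifferentiable)."""
    return round(x / step) * step

def numerical_grad(fn, x, eps=1e-4):
    """Central finite-difference estimate of d fn / d x."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

def fit_linear_surrogate(xs, ys):
    """Closed-form least squares for h(x) = a*x + b,
    i.e. the minimizer of sum (h(x) - Phi(x))^2 over a and b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

xs = [i / 100 for i in range(101)]   # training inputs in [0, 1]
ys = [codec(x) for x in xs]          # codec outputs Phi(x)
a, b = fit_linear_surrogate(xs, ys)

def surrogate(x):
    """Differentiable stand-in h for the codec."""
    return a * x + b

x0 = 0.3
grad_codec = numerical_grad(codec, x0)          # exactly 0.0: the staircase is flat here
grad_surrogate = numerical_grad(surrogate, x0)  # approximately a, which is close to 1
mse = sum((surrogate(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

The quantizer passes no gradient to whatever precedes it, whereas the fitted surrogate passes a gradient close to 1 while staying close to the quantizer output in the mean-squared sense. A linear surrogate cannot reproduce codec artifacts, which is where a learned network would be expected to do better.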
[Figure 3, panels (b) and (c). Diagram labels: Step 3: x, CRNet, VCNN, h(f(x)); Step 2: x, CRNet, ACN, PPNet, g(h(f(x))).]
(b) Alternate learning with a virtual codec neural network (VCNN) [39]
(c) Simultaneous learning with the proposed auxiliary codec network (ACN)
Fig. 3. Comparison of the learning process of compact representation network (CRNet)-based methods. The blue module indicates the status updated in each step, and the yellow module indicates the fixed status. Green and red arrows indicate a forward pass (or inference) and a backward pass (or gradients) to train the module, respectively. The yellow double arrow indicates the argument of the loss function for the backward pass.

PPNet through pretraining. The initial state is defined in the following equations:

$$\theta_f^0 = \operatorname*{argmin}_{\theta_f} \; \| f(x) - F_s(x) \|_2^2, \tag{11}$$

$$\theta_g^0 = \operatorname*{argmin}_{\theta_g} \; \| g(\Phi(F_s(x))) - x \|_2^2. \tag{12}$$

degraded bicubic downsampled image to the original image. The pretraining strategy for all networks provides a good initialization point, making the optimal parameter closer to the ideal and obtaining a faster convergence rate. This result is verified using a comparative experiment in Section V-B6.

2) Iterative Fine-tuning Updating: As the fine-tuning of the end-to-end model progresses, the CRNet learns in the direction of optimizing the objective function. However, as mentioned, the use of the approximation function affects the learning of the entire model. The ACN and BENet, which have fixed weights, gradually decrease the approximation accuracy to the standard codec because the unseen compact image from the CRNet is input as the whole model is trained. Therefore,
[Figure: rate-distortion plots of PSNR (dB) and SSIM versus bits per pixel (bpp). Legends: Proposed, Bicubic+Post-processing, JPEG; and Scale Factor = 0.5, Scale Factor = 0.75, Scale Factor = 1, JPEG.]

CRNet and PPNet are updated for each cycle of the minibatch, and the ACN and BENet are updated using the output value [...] proposed algorithm, the parameters of the CRNet and PPNet [...] and θg∗.

3) Comparison with the other CRNet-based Methods: Recent CRNet-based papers [38], [39] proposed a learning strategy that bypasses the nondifferentiable codec, adopting [...] passed through the codec and PPNet. In both methods, CRNet and PPNet are trained alternately, which is an incomplete optimization process.

In this paper, we adopted a method of simultaneously learning an end-to-end network so that the CRNet and PPNet [...]

V. EXPERIMENTAL RESULTS

A. Setting

Previous studies based on the CRNet and PPNet [38], [39] have displayed performance improvement in various image codecs, such as JPEG, JPEG2000 [2] and BPG [40]. In this
Fig. 5. Visual comparison of different codec-mimicking network structures on the (a) Lighthouse image of the LIVE1 dataset at a quality factor of 10,
and (b) Kimono of the HEVC Test Sequence [51] at an HEVC quality parameter of 42. The result of JPEG-based ACN has blocking artifacts and ringing
artifacts around edges similar to the image compressed with JPEG. The result of ACN with prediction image, unlike other structures, is less blurry and has
compression artifacts similar to HEVC.
TABLE I
QUANTITATIVE PEAK SIGNAL-TO-NOISE RATIO (PSNR; dB) COMPARISON OF JPEG IMITATION PERFORMANCE BY DIFFERENT CODEC-MIMICKING NETWORK STRUCTURES ON SET14 AND LIVE1 DATASETS

                                   QF = 10          QF = 20          QF = 40          QF = 80
vs JPEG                         Set14   LIVE1    Set14   LIVE1    Set14   LIVE1    Set14   LIVE1
Original image                  27.49   27.03    29.85   29.30    32.20   31.62    37.00   36.32
VCNN [39] depth = 6             29.89   29.56    32.15   31.60    34.29   33.60    37.98   37.49
VCNN [39] depth = 20            29.93   29.68    32.33   31.77    34.37   33.67    37.98   37.49
JPEG-based ACN depth N = 9      43.85   44.50    39.82   39.82    40.34   40.29    39.53   38.97
JPEG-based ACN depth N = 11     45.24   45.41    41.78   41.75    40.33   40.25    39.76   39.25
JPEG-based ACN depth N = 12     46.07   46.27    42.32   42.28    41.27   41.25    43.08   42.61
TABLE II
QUANTITATIVE PEAK SIGNAL-TO-NOISE RATIO (PSNR; dB) COMPARISON OF HEVC IMITATION PERFORMANCE BY DIFFERENT CODEC-MIMICKING NETWORK STRUCTURES ON HEVC TEST SEQUENCES [51]

vs HEVC                         QP = 32   QP = 37   QP = 42   QP = 47
Original image                   36.07     33.06     30.13     27.40
VCNN [39]                        38.54     36.17     33.94     32.07
ACN without prediction image     39.12     36.63     34.37     32.36
ACN with prediction image        40.87     39.09     37.53     36.06
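The imitation scores in Tables I and II are PSNR values measured against the real codec output ("vs JPEG", "vs HEVC"). As a reminder of how such a number is obtained, here is a minimal Python sketch of the standard PSNR formula. It assumes 8-bit pixel values (peak of 255) stored as flat sequences, and the function name `psnr` is chosen here for illustration; the paper does not describe its evaluation code.

```python
import math

def psnr(reference, test, peak=255.0):
    """PSNR in dB between two equally sized sequences of pixel values.

    `peak` is the maximum possible pixel value (255 assumed for 8-bit images).
    """
    if len(reference) != len(test):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(peak ** 2 / mse)
```

For imitation quality, `reference` would be the image decoded by the real codec and `test` the output of the codec-mimicking network for the same input, so a higher value means the network reproduces the codec more closely.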
facilitate performance improvement; thus, λbit and λreg are set         particular, when Depth N of the JPEG-based ACN is 12, the
to 5 × 10−5 and 1, respectively, for all QPs. The HEVC-intra-           imitation performance is over 40dB in all QFs. The residual
based model is implemented in the HEVC reference software               block-based CNN structure cannot follow the behavior of the
HM 16.20 [48], [49] with the all intra configuration and tested         JPEG codec regardless of the depth of the network.
based on test conditions, configurations, and sequences pro-               In contrast, the proposed JPEG-based ACN generates a
posed by the Joint Collaborative Team on Video Coding [51].             decoded image similar to the output of JPEG. The learned
The test sequences can be divided into Classes A, B, C, D,              ACN expresses contouring and ringing artifacts, which are
and E according to the spatial resolution. When evaluating              typical compression artifacts of JPEG. As the proposed ACN
the compression performance of HEVC-intra-based models,                 method closely follows the codec operation, backpropagation
the results are expressed in terms of the Bjøntegaard delta             can be performed for the CRNet with a small error.
(BD) rate [57] reductions for the luma component. In both                  In addition, we conducted a comparative experiment on
codec model test situations, we adopted the peak signal-to-             the HEVC-intra-based ACN. We compared the HEVC-based
noise ratio (PSNR) and the structural similarity index measure          ACN with the VCNN and compared using the original image
(SSIM) [58] as image quality evaluation metrics.                        alone and with a prediction image for the input image of the
                                                                        HEVC-intra-based ACN. The results in Fig. 5 (b) and Table II
                                                                        indicate that the proposed HEVC-intra-based ACN structure
B. Ablation Study
                                                                        has superior HEVC imitation performance compared with the
   In this section, we present the evaluation of the contribu-          VCNN with a Resblock-based CNN structure. Furthermore,
tion of each network of the proposed framework. We also                 we improved the ACN to take the prediction image as input
performed ablation studies to analyze the importance of each            with the original image, dramatically improving the imitation
loss term. In addition, we tested the effects of pretraining and        performance.
the iterative update algorithm proposed in Section IV-B. All               3) Bit Estimation Network: To validate the structure of
ablation experiments were conducted on the LIVE1 dataset and            the proposed BENet, we compared it with ResNet [15], a
evaluated on the rate-distortion plane.                                 representative network architecture for regression. As
   1) Compact Representation Network: To confirm the effect             shown in Table III, BENet demonstrated better bit prediction
of the proposed CRNet, we compared the CRNet with                       performance than ResNet. In particular, BENet has the
frameworks that simply downsample and                                   advantage of far fewer parameters (2.098M, versus 15.199M
restore [29]–[31]. As illustrated in Fig. 4 (a)-(d), the CRNet          for ResNet-18 and 29.287M for ResNet-50).
outperformed the bicubic downsampling preprocessing at two                 4) Simultaneous Learning Strategy: We conducted a com-
scale factors: 0.5 and 0.75. When the scale factor was 0.5,             parative experiment with the learning strategies of recent
a higher performance improvement was obtained because the               CRNet-based papers [38], [39]. In practice, the performance
CRNet and PPNet have a larger capacity to compress and                  of the gradient backpropagation should be compared to know
restore spatially as the scale factor decreases.                        how well the learning strategies mimic the codec role. How-
   Fig. 4 (e) and (f) exhibit the performance analysis according        ever, real nondifferentiable codecs have no ground truth for
to the scale factor. A smaller scale factor results in a greater        gradient propagation. Therefore, we analyzed the mimicking
information loss for the original image; thus, the compression          ability of the proposed method through the final compression
efficiency is improved only at a low bit rate. In contrast,             performance after optimizing CRNet and PPNet.
in a high bit-rate environment, it is better to maintain the               In Fig. 6, we compared the compression methods using
original scale. An efficient scale factor exists according to the       preprocessing and postprocessing networks. The
bit rate, suggesting that the scale factor should be determined         methodologies selected for comparison are as follows: first,
adaptively according to the target rate.                                the method for learning the CRNet and PPNet alternately by
   2) Auxiliary Codec Network: To prove the superiority of              approximating the codec as an identity function as in [38]
the architecture of the proposed ACN, we conducted a perfor-            (green line) and using the VCNN [39] (magenta line), and
mance comparison by imitating real codecs with several ACN              second, the method for simultaneous optimization by directly
structures. First, we compared the JPEG-based structure with            connecting the CRNet and PPNet (cyan line), using the VCNN
the residual block-based CNN structure of the VCNN [39].                (purple line) and proposed JPEG-based ACN (blue line).
The results in Fig. 5 (a) and Table I reveal that the JPEG-based        Furthermore, we performed alternate learning of
ACN mimics JPEG-decoded images better than the VCNN. In                 the CRNet and PPNet with the JPEG-based ACN (red line) to
                                                              TABLE III
   QUANTITATIVE PERCENT (%) ERROR AND NUMBER OF PARAMETERS COMPARISON OF BITS PER PIXEL (BPP) ESTIMATION BY DIFFERENT BIT
                              ESTIMATION NETWORK (BENET) STRUCTURES ON SET14 AND LIVE1 DATASETS
      ResNet-18 [15]   1.598%   0.999%   1.687%   1.535%    1.571%     1.591%    1.134%    1.327%   1.430%          15.199M
      ResNet-50 [15]   1.873%   1.934%   1.520%   1.364%    0.945%     1.180%    1.705%    2.005%   1.566%          29.287M
         BENet         1.001%   1.456%   1.221%   1.305%    1.072%     1.909%    1.211%    1.661%   1.355%          2.098M
compare the influence of alternate learning and simultaneous learning.
For a fair comparison, all experiments used the same fixed network
structure for the CRNet and PPNet and the same training database as
described in Section V-A.
   In Fig. 6 (a), the proposed JPEG-based ACN with simultaneous learning
outperforms the other methods. In learning the CRNet and PPNet
alternately by approximating the codec as an identity function, the
codec characteristics are repeatedly reflected in the PPNet, which
yields good performance. However, an error occurs because the codec is
assumed to be an identity function when learning the CRNet.
Additionally, limitations exist in the VCNN, as it is difficult to
sufficiently transfer the codec characteristics to the CRNet because of
the poor approximation of the codec.
   In the case of simultaneous learning by directly connecting the CRNet
and PPNet and with the Resblock-based VCNN, the compression performance
is significantly worse than that of the other methods. Table I
demonstrates that the VCNN is closer to the JPEG-decoded image than the
original image (identity function) is. However, as illustrated in the
visual comparison in Fig. 5 (a), the codec-mimicking images generated by
the VCNN show almost none of the JPEG compression noise patterns and
instead contain unpredictable noise. End-to-end learning with the VCNN,
which is updated through iterative learning and is changeable, unlike an
identity function that generates a constant value, leads to worse
optimization. Therefore, the VCNN can generate even greater error
propagation than the identity function approximation.
   In alternate learning, the PPNet is continuously trained from a
decoded image obtained from the real codec to compensate for
approximation errors. However, for simultaneous learning, if the codec
approximation is not sufficiently accurate, error propagation continues
to the CRNet and PPNet during training, resulting in performance
degradation. The JPEG-based ACN exhibited better performance in
simultaneous learning than in alternating learning. This result reveals
that the proposed learning method, in which the CRNet and PPNet are
optimized simultaneously, is more effective if good mimicking
performance is guaranteed. In Fig. 6 (b), the experimental results
demonstrate that, as the imitation performance gradually decreased with
a smaller value of N, the overall compression performance also gradually
decreased. The experiments indicate that the imitation performance is
positively correlated with the overall compression performance and that
better mimicking performance improves the compression performance
through a more precise optimization of the preprocessing network.
   5) Bit and Regularization Loss: The rate-distortion performance was
evaluated according to the two loss weights, λbit and λreg, to determine
the effectiveness of the bit and regularization losses. The
rate-distortion results are plotted as traces over the training epochs
to show how the performance changes as training progresses. As displayed
in Fig. 7 (a), when learning the network without considering the bit
loss (λbit = 0), the final reconstructed image becomes closer to the
original image, but the number of bits generated during compression
increases significantly, resulting in poor coding efficiency. In
contrast, using a proper λbit prevents these problems. Fig. 7 (b)
indicates that, with the regularization loss, training converges more
stably because the structure of natural images is preserved. If the
weight of the regularization loss is too large, the learning of the
CRNet is restricted, and the performance changes little from the initial
state. Fig. 7 (c) indicates that the coding efficiency is better when
learning with the regularization loss (λreg = 0.1) than when learning
without it. A standard codec is designed to compress natural images;
thus, the regularization loss prevents the decrease in coding efficiency
caused by compressing images with unnatural patterns.
   6) Pretraining Strategy: We analyzed the effect of the pretraining
strategy described in Section IV-B1. Fig. 8 presents the performance
comparison with and without pretraining. The experimental results reveal
faster convergence with better coding efficiency when performing
pretraining on the CRNet and PPNet. The absence of pretraining means
that the parameters of the CRNet and PPNet are initialized with random
values. In this case, it is challenging to converge in the desired
direction because the ACN and BENet do not work properly at the
beginning of the training process.
   7) Iterative Updating Strategy: The iterative updating process is
proposed in Section IV-B2 to reduce errors due to the approximation
functions. The fixed ACN is compared with the repeatedly updated ACN to
demonstrate the effectiveness of the proposed training scheme. Fig. 9 (a)
depicts the comparison of the codec imitation performance according to
the training epochs, and Fig. 9 (b) displays the change in coding
efficiency as training progresses. When the ACN is fixed, the imitation
performance of the ACN decreases gradually as the epoch increases. In
contrast, when the ACN is continuously updated, the imitation
performance does not deteriorate, and it converges. Guaranteeing the
performance of the ACN helps learning to increase the coding efficiency
of the entire framework with less approximation error.
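The effect described for the fixed and the repeatedly updated ACN can be illustrated with a toy numerical sketch. Everything in it is an assumption made for illustration and none of it is the paper's implementation: `codec` is a hypothetical stand-in (a `tanh` curve, not JPEG or HEVC), the "ACN" is reduced to a linear least-squares proxy, and the preprocessing output is assumed to drift by a constant amount per epoch. The sketch only shows why a proxy fitted once loses imitation accuracy when its inputs move away from the data it was fitted on, whereas a proxy refitted every epoch does not.

```python
import numpy as np

def codec(x):
    """Hypothetical stand-in for a nondifferentiable codec (toy assumption)."""
    return np.tanh(x)

def fit_proxy(x):
    """Fit a linear proxy (slope, intercept) to the codec on the samples x."""
    slope, intercept = np.polyfit(x, codec(x), 1)
    return slope, intercept

def imitation_error(proxy, x):
    """Mean squared error between the proxy output and the codec output."""
    slope, intercept = proxy
    return float(np.mean((slope * x + intercept - codec(x)) ** 2))

def run(epochs=10, drift=0.2):
    base = np.linspace(-0.5, 0.5, 101)
    fixed = fit_proxy(base)          # fitted once, never updated
    fixed_err, updated_err = [], []
    for epoch in range(epochs + 1):
        # Toy assumption: the preprocessing output drifts as training proceeds.
        x = base + drift * epoch
        updated = fit_proxy(x)       # refitted on the current outputs
        fixed_err.append(imitation_error(fixed, x))
        updated_err.append(imitation_error(updated, x))
    return fixed_err, updated_err

fixed_err, updated_err = run()
print(f"final fixed-proxy error:   {fixed_err[-1]:.4f}")
print(f"final updated-proxy error: {updated_err[-1]:.6f}")
```

In this sketch the refitted proxy can never be worse than the fixed one on the current samples, because it is the least-squares optimum for them; a learned network gives no such guarantee, so the sketch illustrates the trend rather than proving it.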
[Fig. 6 plots. Axes: PSNR (dB) vs. bits per pixel (bpp). Legend (a): JPEG-based ACN + S. Learning; JPEG-based ACN + A. Learning; VCNN + S. Learning; VCNN + A. Learning [39]; IdentityApprox. + S. Learning; IdentityApprox. + A. Learning [38]; JPEG. Legend (b) includes: JPEG-based ACN Depth N = 12; JPEG-based ACN Depth N = 11; JPEG.]
Fig. 6. Rate-distortion curve comparison of compression methods using
preprocessing and postprocessing networks. (a) Comparison results according
to network architectures and learning methods (simultaneous learning (S.
Learning) or alternating learning (A. Learning)), (b) Comparison result of
the codec-mimicking networks with simultaneous learning.

[Fig. 7 plots. (a), (b): PSNR (dB) training traces starting from Epoch = 0, for λ = 5×10⁻⁵, 3×10⁻⁵, 1×10⁻⁵, 5×10⁻⁶, 0 in (a) and λ = 1, 0.3, 0.1, 0.03, 0 in (b). (c): PSNR (dB) vs. bits per pixel (bpp), With RegLoss and Without RegLoss.]
Fig. 7. Rate-distortion performance comparison according to the weight of the
loss function. Rate-distortion point of compressed image (JPEG QF = 80)
with increasing learning epochs according to (a) the bit loss λbit, and (b)
regularization loss λreg, and (c) rate-distortion performance comparison with
λreg = 0.1 or without regularization loss λreg after the training process is
complete.

[Fig. 8 plots. (a): x-axis epochs (0 to 50). (b): x-axis bits per pixel (bpp). Legend: With pretraining; Without pretraining.]
Fig. 8. Performance comparison with and without pretraining. (a) Change in
the peak signal-to-noise ratio (PSNR) between the original and reconstructed
images (JPEG QF = 80) in the end-to-end model according to the training
epochs and (b) comparison of the rate-distortion performance after the training
process is complete.
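The rate-distortion plots in this section place PSNR (dB) against bits per pixel (bpp). A minimal sketch of how these two quantities are commonly computed follows. The function names `psnr` and `bits_per_pixel` are hypothetical, the peak value of 255 assumes 8-bit images, and this section does not spell out the exact evaluation protocol (for example, whether PSNR is averaged over color channels), so the sketch uses the standard textbook definitions.

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images.

    Assumes 8-bit images by default (peak = 255).
    """
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

def bits_per_pixel(num_bytes, height, width):
    """Bit rate of a compressed file relative to the ORIGINAL image size."""
    return 8.0 * num_bytes / (height * width)

# Toy usage: a constant offset of 1 gray level gives MSE = 1.
original = np.zeros((4, 4), dtype=np.uint8)
decoded = np.ones((4, 4), dtype=np.uint8)
print(f"PSNR: {psnr(original, decoded):.2f} dB")
print(f"bpp:  {bits_per_pixel(1024, 128, 128):.2f}")
```

For a framework that downsamples before encoding, the bit count would normally be divided by the pixel count of the original image rather than the compact one, so that methods with different scale factors share one rate axis; that convention is assumed here.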
[Plots comparing the updated ACN and the fixed ACN (cf. Fig. 9 in the text). Left panel: PSNR (dB) vs. epochs (0 to 50). Right panel: PSNR (dB) vs. bits per pixel (bpp), training traces starting from Epoch = 0. Legend: Updated ACN; Fixed ACN.]

[Fig. 10 plots. (a), (b): PSNR (dB) vs. bits per pixel (bpp). (c), (d): SSIM vs. bits per pixel (bpp). Legend: Proposed; Zhao [39]; Jiang [38]; JPEG.]
Fig. 10. Rate-distortion performance per peak signal-to-noise ratio (PSNR)
and structural similarity index measure (SSIM) of different compression
algorithms on test image datasets: (a), (c) Set14 and (b), (d) LIVE1.

[Fig. 11 plots. Axes: PSNR (dB) vs. bit rate (kbps). Legend: Proposed; Li [36]; Lee [14]; Minnen [13]; Ballé [9]; BPG; HEVC-intra.]
Fig. 11. Rate-distortion curves of several HEVC test sequences [51]: (a)
PeopleOnStreet (2160×1440), (b) Cactus (1920×1080), (c) BasketballDrill
(832×480), and (d) Johnny (1280×720).

Moreover, the proposed method exhibits the best performance
on all test datasets. The difference between the proposed
algorithm and the existing method is the learning method for
the CRNet. From the experimental results, the well-trained
CRNet improves compression performance.
   2) HEVC-intra-based Model: The proposed method was
compared with the conventional codecs, learnable codecs,
and the competing standard compatible algorithm to apply
the proposed method to the recent standard codec and
exhibit state-of-the-art performance. For conventional codecs, the
BPG image codec was also selected. The image compression
C. Comparison with State-of-the-art Methods                                                                                                      frameworks proposed by Ballé [9], Minnen [13], and Lee [14]
   1) JPEG-based Model: We compared the coding efficiency                                                                                        were selected for the learnable codec, and the work by Li [36]
of the proposed JPEG-based ACN with depth N of 12 with                                                                                           was selected as the competing standard compatible algorithm.
JPEG and other standard compatible compact representation                                                                                        The rate-distortion performance comparison of the typical
frameworks, such as those by Jiang [38] and Zhao [39]. We                                                                                        sequences of each class is illustrated in Fig. 11.
used the experimental results described in [39]; however, in                                                                                        The performance of the learnable codec algorithms does not
[38], the compression rate and distortion results for high QF                                                                                    exceed that of the HEVC in the test sequences. In contrast,
are not described, so we newly trained the model from [38]                                                                                       the standard compatible frameworks are proposed to boost the
and used these results.                                                                                                                          performance of the existing codec, exhibiting better coding
   Considering that the CRNet is a useful tool under a low                                                                                       efficiency than the other frameworks. In particular, our method
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. XX, NO. X, X XXXX                                                                                   13
                                                           TABLE IV
      BD-R ATE C OMPARISON OF S TANDARD C OMPATIBLE F RAMEWORKS ON HEVC COMMON TEST CONDITIONS [51] WITH A LL I NTRA M AIN
                                                             CONFIGURATIONS
reached state-of-the-art performance.                                  experimental results reveal that this approach outperforms the
   For a detailed performance comparison between the stan-             existing codecs and end-to-end learnable image compression
dard compatible frameworks, the BD-rate performance was                algorithms. For future work, we will extend this work to video
compared in the HEVC test sequences. Considering that the              compression tasks, which are more challenging and complex to
standard compatible framework is effective in a low bit-rate           model because of the high complexity and temporal dynamics.
environment, the QP of the reference HEVC was set to 32, 37,
42, and 47. The BD-rate results are summarized in Table IV.
The results reveal that the proposed method achieves an                                            R EFERENCES
average BD-rate reduction of 15.2% in all classes on the PSNR          [1] G. K. Wallace, “The jpeg still picture compression standard,” IEEE
metric and 22.9% on the SSIM metric. The proposed method                   transactions on consumer electronics, vol. 38, no. 1, pp. xviii–xxxiv,
also outperforms Li’s frame-level and block-level scheme                   1992.
                                                                       [2] M. Rabbani, “Jpeg2000: Image compression fundamentals, standards
[36] for both the PSNR and SSIM metrics. In addition, [36]                 and practice,” Journal of Electronic Imaging, vol. 11, no. 2, p. 286,
assumed that the standard codec is an identity function, similar           2002.
to that in [38]; therefore, the CRNet and PPNet are directly           [3] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview
connected, and the CRNet is learned through joint learning of              of the h. 264/avc video coding standard,” IEEE Transactions on circuits
                                                                           and systems for video technology, vol. 13, no. 7, pp. 560–576, 2003.
the entire network. The proposed framework aims to reduce              [4] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of
errors in learning for the CRNet by modeling the standard                  the high efficiency video coding (hevc) standard,” IEEE Transactions
codec with the ACN and improving the coding efficiency.                    on circuits and systems for video technology, vol. 22, no. 12, pp. 1649–
                                                                           1668, 2012.
                                                                       [5] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen,
                       VI. C ONCLUSION                                     S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image com-
                                                                           pression with recurrent neural networks,” in International Conference
   In this paper, we proposed a standard compatible deep neu-              on Learning Representations, 2016.
ral network-based framework for image compression. Within              [6] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor,
this framework, image compression is performed optimally                   and M. Covell, “Full resolution image compression with recurrent neural
through the existing off-the-shelf standard codecs, CRNet,                 networks,” in Proceedings of the IEEE Conference on Computer Vision
                                                                           and Pattern Recognition, 2017, pp. 5306–5314.
and PPNet. The ACN was proposed for optimal learning of                [7] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image com-
the entire network and are designed to imitate the forward                 pression with compressive autoencoders,” in International Conference
degradation processes of existing codecs, such as JPEG and                 on Learning Representations, 2017.
                                                                       [8] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image
HEVC. Proper training strategies were proposed to minimize                 compression,” in International Conference on Learning Representations,
errors due to the objective function with approximation. The               2017.
[9] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in International Conference on Learning Representations, 2018.
[10] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, “Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4385–4393.
[11] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for content-weighted image compression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3214–3223.
[12] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, “Conditional probability models for deep image compression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4394–4402.
[13] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems, 2018, pp. 10771–10780.
[14] J. Lee, S. Cho, and S.-K. Beack, “Context-adaptive entropy model for end-to-end optimized image compression,” in International Conference on Learning Representations, 2019.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[17] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[18] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[19] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654.
[20] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[21] T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4799–4807.
[22] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang, “Non-local recurrent network for image restoration,” in Advances in Neural Information Processing Systems, 2018, pp. 1673–1682.
[23] L. Cavigelli, P. Hager, and L. Benini, “CAS-CNN: A deep convolutional neural network for image compression artifact suppression,” in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 752–759.
[24] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, “Deep generative adversarial compression artifact removal,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4826–4835.
[25] Y. Tai, J. Yang, X. Liu, and C. Xu, “MemNet: A persistent memory network for image restoration,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4539–4547.
[26] Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang, and T. S. Huang, “D3: Deep dual-domain based fast restoration of JPEG-compressed images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2764–2772.
[27] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144.
[28] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 286–301.
[29] A. M. Bruckstein, M. Elad, and R. Kimmel, “Down-scaling for better transform compression,” IEEE Transactions on Image Processing, vol. 12, no. 9, pp. 1132–1144, 2003.
[30] W. Lin and L. Dong, “Adaptive downsampling to improve image compression at low bit rates,” IEEE Transactions on Image Processing, vol. 15, no. 9, pp. 2513–2521, 2006.
[31] X. Wu, X. Zhang, and X. Wang, “Low bit-rate image compression via adaptive down-sampling and constrained least squares upconversion,” IEEE Transactions on Image Processing, vol. 18, no. 3, pp. 552–561, 2009.
[32] M. Afonso, F. Zhang, and D. R. Bull, “Video compression based on spatio-temporal resolution adaptation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 275–280, 2018.
[33] Y. Li, D. Liu, H. Li, L. Li, F. Wu, H. Zhang, and H. Yang, “Convolutional neural network-based block up-sampling for intra frame coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2316–2330, 2018.
[34] Y. Zhang, D. Zhao, J. Zhang, R. Xiong, and W. Gao, “Interpolation-dependent image downsampling,” IEEE Transactions on Image Processing, vol. 20, no. 11, pp. 3291–3296, 2011.
[35] H. Kim, M. Choi, B. Lim, and K. Mu Lee, “Task-aware image downscaling,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 399–414.
[36] Y. Li, D. Liu, H. Li, L. Li, Z. Li, and F. Wu, “Learning a convolutional neural network for image compact-resolution,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1092–1107, 2018.
[37] W. Sun and Z. Chen, “Learned image downscaling for upscaling using content adaptive resampler,” IEEE Transactions on Image Processing, vol. 29, pp. 4027–4040, 2020.
[38] F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao, “An end-to-end compression framework based on convolutional neural networks,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 3007–3018, 2017.
[39] L. Zhao, H. Bai, A. Wang, and Y. Zhao, “Learning a virtual codec based on deep convolutional neural network to compress image,” Journal of Visual Communication and Image Representation, vol. 63, p. 102589, 2019.
[40] F. Bellard, “BPG image format,” URL https://bellard.org/bpg, 2015.
[41] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2015.
[42] C. Dong, Y. Deng, C. Change Loy, and X. Tang, “Compression artifacts reduction by a deep convolutional network,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 576–584.
[43] X. Zhang, W. Yang, Y. Hu, and J. Liu, “DMCNN: Dual-domain multi-scale convolutional neural network for compression artifacts removal,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 390–394.
[44] B. Zheng, Y. Chen, X. Tian, F. Zhou, and X. Liu, “Implicit dual-domain convolutional network for robust color image compression artifact reduction,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[45] T. Kim, H. Lee, H. Son, and S. Lee, “SF-CNN: A fast compression artifacts removal via spatial-to-frequency convolutional neural networks,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3606–3610.
[46] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool, “Dynamic filter networks,” in Advances in Neural Information Processing Systems, 2016, pp. 667–675.
[47] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[48] K. McCann, B. Bross, W. Han, I. Kim, K. Sugimoto, and G. Sullivan, “High efficiency video coding (HEVC) test model 16 (HM 16) encoder description,” JCT-VC, Doc. JCTVC N, vol. 1002, 2014.
[49] HM16.20 reference software. [Online]. Available: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.20/
[50] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[51] F. Bossen et al., “Common test conditions and software reference configurations,” JCTVC-L1100, vol. 12, p. 7, 2013.
[52] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, and L. Zhang, “NTIRE 2017 challenge on single image super-resolution: Methods and results,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 114–125.
[53] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[54] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in ICLR: International Conference on Learning Representations, 2015.
[55] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International Conference on Curves and Surfaces. Springer, 2010, pp. 711–730.
[56] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440–3451, 2006.
[57] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” VCEG-M33, 2001.
[58] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
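
Supplementary note on the BD-rate metric. The BD-rate values discussed with Table IV follow the Bjontegaard metric [57]. The sketch below shows one common way such a value is computed, assuming a cubic polynomial fit of log-rate against quality (PSNR here) and the NumPy library. It is an illustration only, not the evaluation script used for Table IV; the function name bd_rate and the sample rate and PSNR points are hypothetical and are not taken from the paper.

```python
import numpy as np


def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Bjontegaard delta rate (in percent) of a test codec against an anchor.

    Each codec is described by four or more (rate, quality) points,
    for example one point per QP. Hypothetical helper for illustration.
    """
    log_ra = np.log10(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log10(np.asarray(rate_test, dtype=float))
    qa = np.asarray(quality_anchor, dtype=float)
    qt = np.asarray(quality_test, dtype=float)

    # Fit log10(rate) as a cubic polynomial of quality for each codec.
    fit_a = np.polyfit(qa, log_ra, 3)
    fit_t = np.polyfit(qt, log_rt, 3)

    # Integrate both fits over the quality interval shared by the two curves.
    lo = max(qa.min(), qt.min())
    hi = min(qa.max(), qt.max())
    int_a = np.polyval(np.polyint(fit_a), hi) - np.polyval(np.polyint(fit_a), lo)
    int_t = np.polyval(np.polyint(fit_t), hi) - np.polyval(np.polyint(fit_t), lo)

    # Average log-rate gap, converted back to a percentage rate change.
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0


if __name__ == "__main__":
    # Illustrative points only: the test codec spends 20% fewer bits
    # than the anchor at every quality level.
    psnr = [30.0, 33.0, 36.0, 39.0]
    anchor_bpp = [0.10, 0.20, 0.40, 0.80]
    test_bpp = [0.8 * r for r in anchor_bpp]
    print(f"BD-rate: {bd_rate(anchor_bpp, psnr, test_bpp, psnr):.2f}%")
```

A negative return value means the test codec needs fewer bits than the anchor at equal quality, so the reported 15.2% BD-rate reduction corresponds to a value of about -15.2. Passing SSIM values as the quality arrays gives the SSIM-based variant in the same way.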