2005.00305v3 - Defocus Deblurring Using Dual-Pixel Data
Abstract. Defocus blur arises in images that are captured with a shal-
low depth of field due to the use of a wide aperture. Correcting defocus
blur is challenging because the blur is spatially varying and difficult to
estimate. We propose an effective defocus deblurring method that ex-
ploits data available on dual-pixel (DP) sensors found on most modern
cameras. DP sensors are used to assist a camera’s auto-focus by captur-
ing two sub-aperture views of the scene in a single image shot. The two
sub-aperture images are used to calculate the appropriate lens position
to focus on a particular scene region and are discarded afterwards. We
introduce a deep neural network (DNN) architecture that uses these dis-
carded sub-aperture images to reduce defocus blur. A key contribution
of our effort is a carefully captured dataset of 500 scenes (2000 images)
where each scene has: (i) an image with defocus blur captured at a large
aperture; (ii) the two associated DP sub-aperture views; and (iii) the
corresponding all-in-focus image captured with a small aperture. Our
proposed DNN produces results that are significantly better than con-
ventional single image methods in terms of both quantitative and percep-
tual metrics – all from data that is already available on the camera but
ignored. The dataset, code, and trained models are available at https:
//github.com/Abdullah-Abuolaim/defocus-deblurring-dual-pixel.
1 Introduction
This paper addresses the problem of defocus blur. To understand why defocus
blur is difficult to avoid, it is important to understand the mechanism governing
image exposure. An image’s exposure to light is controlled by adjusting two
parameters: shutter speed and aperture size. The shutter speed controls the
duration of light falling on the sensor, while the aperture controls the amount
of light passing through the lens. The reciprocity between these two parameters
allows the same exposure to occur by fixing one parameter and adjusting the
other. For example, when a camera is placed in aperture-priority mode, the
aperture remains fixed while the shutter speed is adjusted to control how long
light is allowed to pass through the lens. The drawback is that a slow shutter
speed can result in motion blur if the camera and/or an object in the scene moves
while the shutter is open, as shown in Fig. 1. Conversely, in shutter-priority
2 A. Abuolaim et al.
[Fig. 1 panels: Image A, narrow aperture (f/22, ISO 3200, shutter speed 0.33 sec); Image B, wide aperture (f/4, ISO 3200, shutter speed 0.0025 sec); left (L) and right (R) dual-pixel (DP) views available from the DP sensor for image B; image B deblurred using the L and R dual-pixel images.]
Fig. 1: Images A and B are of the same scene and same approximate exposure.
Image A is captured with a narrow aperture (f/22) and slow shutter speed.
Image A has a wide depth of field (DoF) and little defocus blur, but exhibits
motion blur from the moving object due to the long shutter speed. Image B is
captured with a wide aperture (f/4) and a fast shutter speed. Image B exhibits
defocus blur due to the shallow DoF, but has no motion blur. Our proposed
DNN uses the two sub-aperture views from the dual-pixel sensor of image B to
deblur image B, resulting in a much sharper image.
mode, the shutter speed remains fixed while the aperture adjusts its size. The
drawback of a variable aperture is that a wide aperture results in a shallow depth
of field (DoF), causing defocus blur to occur in scene regions outside the DoF,
as shown in Fig. 1. There are many computer vision applications that require
a wide aperture but still want an all-in-focus image. An excellent example is
cameras on self-driving cars, or cameras on cars that map environments, where
the camera must use a fixed shutter speed and the only way to get sufficient
light is a wide aperture at the cost of defocus blur.
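The reciprocity between aperture and shutter speed described above can be made concrete with a short calculation. The sketch below assumes the standard photographic relation that exposure is proportional to t / N^2 for shutter time t and f-number N, with ISO held fixed; the function name and the example settings are illustrative and are not taken from the paper.

```python
def equivalent_shutter(t_ref, n_ref, n_new):
    """Shutter time that keeps exposure constant when the f-number changes.

    Assumes exposure is proportional to t / N**2 with ISO held fixed.
    """
    return t_ref * (n_new / n_ref) ** 2


# Hypothetical example: a scene metered at f/4 and 1/400 s.
t_wide = 1 / 400
t_narrow = equivalent_shutter(t_wide, n_ref=4.0, n_new=22.0)
print(f"f/4 at {t_wide:.4f} s  ->  f/22 at {t_narrow:.4f} s")
```

Under this model, stopping down from f/4 to f/22 lengthens the shutter time by a factor of (22/4)^2, roughly 30, which is why a narrow aperture trades defocus blur for a higher risk of motion blur.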
Our aim is to reduce the unwanted defocus blur. The novelty of our approach
lies in the use of data available from dual-pixel (DP) sensors used by modern
cameras. DP sensors are designed with two photodiodes at each pixel location
on the sensor. The DP design provides the functionality of a simple two-sample
light-field camera and was developed to improve how cameras perform autofocus.
Specifically, the two-sample light-field provides two sub-aperture views of the
scene, denoted in this paper as left and right views. The light rays coming from
scene points that are within the camera’s DoF (i.e., points that are in focus) will
have no difference in phase between the left and right views. However, light rays
coming from scene points outside the camera’s DoF (i.e., points that are out of
focus) will exhibit a detectable disparity in the left/right views that is directly
correlated to the amount of defocus blur. We refer to it as defocus disparity.
Cameras use this phase shift information to determine how to move the lens
to focus on a particular location in the scene. After autofocus calculations are
performed, the DP information is discarded by the camera’s hardware.
Contribution. We propose a deep neural network (DNN) to perform defocus
deblurring that uses the DP images from the sensor available at capture time.
In order to train the proposed DNN, a new dataset of 500 carefully captured
images exhibiting defocus blur and their corresponding all-in-focus image is col-
lected. This dataset consists of 2000 images – 500 DoF blurred images with their
2 Related work
Related work is discussed regarding (1) defocus blur, (2) datasets, and (3) ap-
plications exploiting DP sensors.
Defocus deblurring. Related methods in the literature can be categorized into:
(1) defocus detection methods [8, 27, 31, 35, 38, 39] or (2) defocus map estimation
and deblurring methods [4, 15, 18, 22, 28]. While defocus detection is relevant to
our problem, we focus on the latter category as these methods share the goal of
ultimately producing a sharp deblurred result.
A common strategy for defocus deblurring is to first compute a defocus map
and use that information to guide the deblurring. Defocus map estimation meth-
ods [4, 15, 18, 22, 28] estimate the amount of defocus blur per pixel for an image
with defocus blur. Representative works include Karaali et al. [15], which uses
image gradients to calculate the blur amount difference between the original im-
age edges and their re-blurred ones. Park et al. [22] introduced a method based
on hand-crafted and deep features that were extracted from a pre-trained blur
classification network. The combined feature vector was fed to a regression net-
work to estimate the blur amount on edges and then later deblur the image. Shi
et al. [28] proposed an effective blur feature using a sparse representation and
image decomposition to detect just noticeable blur. Methods that directly deblur
the image include D’Andrès et al.’s [4] approach, which uses regression trees to
deblur the image. Recent work by Lee et al. [18] introduced a DNN architecture
to estimate an image defocus map using a domain adaptation approach. This
approach also introduced the first large-scale dataset for DNN-based training.
Our work is inspired by Lee et al.’s [18] success in applying DNNs for the DoF
deblurring task. Our distinction from the prior work is the use of the DP sensor
information available at capture time.
Defocus blur datasets. There are several datasets available for defocus deblur-
ring. The CUHK [27] and DUT [38] datasets have been used for blur detection
and provide real images with their corresponding binary masks of blur/sharp
regions. The SYNDOF [18] dataset provided data for defocus map estimation,
in which their defocus blur is synthesized based on a given depth map of pinhole
image datasets. The datasets of [18, 27, 38] do not provide the corresponding
ground truth all-in-focus image. The RTF [4] dataset provided light-field images
captured by a Lytro camera for the task of defocus deblurring. In their data, each
blurred image has a corresponding all-in-focus image. However, the RTF dataset
is small, with only 22 image pairs. While there are other similar and much larger
light-field datasets [11,29], these datasets were introduced for different tasks (i.e.,
depth from focus and synthesizing a 4D RGBD light field), which are different
from the task of this paper. In general, the images captured by Lytro cameras
are not representative of DSLR and smartphone cameras, because they apply
synthetic defocus blur, and have a relatively small spatial resolution [3].
As our approach is to utilize the DP data for defocus deblurring, we found
it necessary to capture a new dataset. Our DP defocus blur dataset provides
500 pairs of images of unrepeated scenes; each pair has a defocus blurred image
with its corresponding sharp image. The two DP views of the blurred image are
also provided, resulting in a total of 2000 images. Details of our dataset capture
are provided in Sec. 4. Similar to the patch-wise training approach followed
in [18, 22], we extract a large number of image patches from our dataset to train
our DNN.
DP sensor applications. The DP sensor design was developed by Canon for
the purpose of optimizing camera autofocus. DP sensors perform what is termed
phase difference autofocus (PDAF) [1, 2, 14], in which the phase difference be-
tween the left and right sub-aperture views of the primary lens is calculated
to measure the blur amount. Using this phase information, the camera’s lens
is adjusted such that the blur is minimized. While intended for autofocus, the
DP images have been found useful for other tasks, such as depth map estima-
tion [6, 24], reflection removal [25], and synthetic DoF [33]. Our work is inspired
by these prior methods and examines the use of DP data for the task of defocus
blur removal.
3 DP image formation
[Fig. 2 panels: (A) DP camera model (main lens, aperture, DP imaging sensor, L photodiode unit); (B) DP L/R signals, in-focus case (no shift); (C) DP L/R signals, out-of-focus case (blur size); (D) final combined signal readout; (E) L view; (F) R view; (G) combined. Plot axes: intensity versus position on sensor.]
Fig. 2: Image formation diagram for a DP sensor. (A) Shows a thin-lens camera
and a DP sensor. The light rays from different halves of the main lens fall on
different left and right photodiodes. (B) Scene points that are within the DoF
(highlighted in gray) have no phase shift between their L/R views. Scene points
outside DoF have a phase shift as shown in (C). The L/R signals are aggregated
and the corresponding combined signal is shown in (D). The blur size of the L
signal is smaller than the combined one in the out-of-focus case. The defocus
disparity is noticeable between the captured L/R images (see (E) and (F)). The
final combined image in (G) has more blur. Our DNN leverages this additional
information available in the L/R views for image defocus deblurring.
between their DP L/R views (Fig. 2-B). The light rays coming from the out-of-
focus regions spread across multiple DP units and therefore produce a difference
between their DP L/R views, as shown in Fig. 2-C. Intuitively, this information
can be exploited by a DNN to learn where regions of the image exhibit blur
and the extent of this blur. The final output image is a combination of the L/R
views, as shown in Fig. 2-G.
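The relationship between the L/R signals and the combined signal can be illustrated with an idealized 1D model. The sketch below assumes that an out-of-focus scene point produces a uniform box response of `blur_width` pixels in the combined image and that each photodiode records exactly one half of that box; this split, the function names, and the chosen width are simplifying assumptions for illustration, not the paper's image formation model.

```python
import numpy as np


def dp_point_responses(blur_width):
    """Idealized 1D responses of a DP sensor to one out-of-focus point.

    Assumption: the combined response is a uniform box of `blur_width`
    pixels, and each photodiode sees exactly one half of it.
    """
    if blur_width < 2 or blur_width % 2:
        raise ValueError("blur_width must be an even integer >= 2")
    combined = np.full(blur_width, 1.0 / blur_width)
    left = combined.copy()
    left[blur_width // 2:] = 0.0   # left photodiode: first half only
    right = combined - left        # right photodiode: the remaining half
    return left, right, combined


def centroid(signal):
    """Intensity-weighted mean position of a 1D signal."""
    positions = np.arange(signal.size)
    return float((positions * signal).sum() / signal.sum())


left, right, combined = dp_point_responses(blur_width=8)
print("L support:", np.count_nonzero(left), "px")
print("combined support:", np.count_nonzero(combined), "px")
print("L/R centroid disparity:", centroid(right) - centroid(left), "px")
```

In this toy model each view is half as wide as the combined response, and the distance between the two centroids grows linearly with the blur width, which mirrors the statement that the defocus disparity is correlated with the amount of blur. Which view shifts in which direction depends on whether the point lies in front of or behind the focal plane.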
By examining real examples shown in Fig. 3 it becomes apparent how a
DNN can leverage these two sub-aperture views as input to deblur the image. In
particular, patches containing regions that are out-of-focus will exhibit a notable
defocus disparity in the two views that is directly correlated to the amount of
defocus blur. By training a DNN with sufficient examples of the L/R views
and the corresponding all-in-focus image, the DNN can learn how to detect and
correct blurred regions. Animated examples of the difference between the DP
views are provided in the supplemental materials.
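To make the notion of defocus disparity between the two views concrete, the following sketch estimates an integer shift between a left and a right 1D signal by maximizing their cross-correlation. It is a toy illustration on synthetic data, not the procedure used by the camera or by the DNN; the function name, the circular shifting, and the random test signal are all assumptions.

```python
import numpy as np


def estimate_disparity(left, right, max_shift=8):
    """Integer shift s maximizing the correlation of left[i] with right[i + s].

    Uses circular shifts (np.roll) for simplicity; a real patch would need
    proper border handling.
    """
    left = left - left.mean()
    right = right - right.mean()
    shifts = range(-max_shift, max_shift + 1)
    scores = [float(np.dot(left, np.roll(right, -s))) for s in shifts]
    return shifts[int(np.argmax(scores))]


rng = np.random.default_rng(0)
texture = rng.standard_normal(128)          # synthetic 1D "scene" row

in_focus = estimate_disparity(texture, texture)                  # identical views
out_of_focus = estimate_disparity(texture, np.roll(texture, 3))  # shifted view
print(in_focus, out_of_focus)  # prints: 0 3
```

An in-focus patch yields a correlation peak at zero shift, while an out-of-focus patch yields a peak away from zero; a learned model can exploit the same cue implicitly.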
4 Dataset collection
Our first task is to collect a dataset with the necessary DP information for
training our DNN. While most consumer cameras employ PDAF sensors, we are
aware of only two camera manufacturers that provide DP data – Google and
Canon. Specifically, Google’s research team has released an application to read
DP data [9] from the Google Pixel 3 and 4 smartphones. However, smartphone
[Fig. 3 panels: left DP view (L) and right DP view (R); for an in-focus region and an out-of-focus region, the L patch, the R patch, and their L/R cross-correlation.]
Fig. 3: An input image I is shown with a spatially varying defocus blur. The two
dual-pixel (DP) images (L and R) corresponding to I are captured at imaging
time. In-focus and out-of-focus patches in the L and R DP image patches exhibit
different amounts of pixel disparity as shown by the cross-correlation of the two
patches. This information helps the DNN to learn the extent of blur in different
regions of the image.
cameras are currently not suitable for our problem for two reasons. First, smart-
phone cameras use fixed apertures that cannot be adjusted for data collection.
Second, smartphone cameras have a narrow aperture and exhibit a large DoF; in
fact, most smartphone cameras go to great lengths to simulate a shallow DoF by
purposely introducing defocus blur [33]. As a result, our dataset is captured using a Canon
EOS 5D Mark IV DSLR camera, which provides the ability to save and extract
full-frame DP images.
Using the Canon camera, we capture a pair of images of the same static
scene at two aperture sizes – f /4 and f /22 – which are the maximum (widest)
and minimum (narrowest) apertures possible for our lens configuration. The lens
position and focal length remain fixed during image capture. Scenes are captured
in aperture-priority mode, in which the exposure compensation between the
image pairs is done automatically by adjusting the shutter speed. The image
captured at f /4 has the smallest DoF and results in the blurred input image IB .
The image captured at f /22 has the largest DoF and serves as the all-in-focus
target image denoted as IS (sharp image). Focus distance and focal length differ
across captured pairs in order to capture a diverse range of defocus blur types.
Our captured images offer the following benefits over prior datasets:
High-quality images. Our captured images are low-noise (captured at a low
ISO, which equates to low noise [23]) and at the full resolution of 6720 × 4480. All images,
including the left/right DP views, are processed to sRGB and encoded with
a lossless 16-bit depth per RGB channel.
Real and diverse defocus blur. Unlike other existing datasets, our dataset
provides real defocus blur and in-focus pairs indicative of real camera optics.
Varying scene contents. To provide a wide range of object categories, we
collect 500 pairs of unique indoor/outdoor scenes with a large variety of scene
contents. Our dataset is also free of faces to avoid privacy issues.
Fig. 4: An example of an image pair with the camera settings used for capturing.
IL and IR represent the Left and Right DP views extracted from IB . The focal
length, ISO, and focus distance are fixed between the two captures of IB and IS .
The aperture size is different, and hence the shutter speed and DoF are accord-
ingly different too. In-focus and out-of-focus zoomed-in patches are extracted
from each image and shown in green and red boxes, respectively.
The f /4 (blurry) and f /22 (sharp) image pairs are carefully imaged static
scenes with the camera fixed on a tripod. To further avoid camera shake, the
camera was controlled remotely to allow hands-free operation. Fig. 4 shows an
example of an image pair from our dataset. The left and right DP views of IB
are provided by the camera and denoted as IL and IR respectively. The ISO
setting is fixed for each image pair. Fig. 4 shows the DP L/R views for only
image IB , because DP L/R views of IS are visually identical due to the fact IS
is our all-in-focus ground truth.
[Fig. 5 diagram residue: DPDNet architecture with E-Block 1 to E-Block 4, a bottleneck, and D-Block 1 to D-Block 4; diagram labels 64, 128, 256, 512, and 1024; output I∗S with 3 channels. Legend: 3 × 3 conv with ReLU, skip connection, 2 × 2 max pool, dropout layer, 2 × 2 up-conv, 1 × 1 conv with sigmoid.]
where DPDNet is our proposed architecture, and θDPDNet is the set of weights
and parameters.
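Because the network consumes the two DP views and its input layer has six channels (see the training procedure), one natural reading is that the two three-channel views are concatenated along the channel axis. The sketch below illustrates that reading with NumPy arrays; the L-then-R channel order and the function name are assumptions and are not confirmed by the text.

```python
import numpy as np


def stack_dp_views(left_view, right_view):
    """Concatenate two H x W x 3 DP views into one H x W x 6 array.

    The L-then-R channel order is an assumption made for illustration.
    """
    if left_view.shape != right_view.shape or left_view.shape[-1] != 3:
        raise ValueError("expected two H x W x 3 arrays of equal shape")
    return np.concatenate([left_view, right_view], axis=-1)


patch_l = np.zeros((512, 512, 3), dtype=np.float32)
patch_r = np.ones((512, 512, 3), dtype=np.float32)
network_input = stack_dp_views(patch_l, patch_r)
print(network_input.shape)  # (512, 512, 6)
```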
Training procedure. The size of input and output layers is set to 512×512×6
and 512 × 512 × 3, respectively. This is because we train not on the full-size
images but on the extracted image patches. We adopt the weight initialization
strategy proposed by He et al. [12] and use the Adam optimizer [16] to train the model.
The initial learning rate is set to 2 × 10^{-5}, which is decreased by half every 60
epochs. We train our model with mini-batches of size 5 using MSE loss between
Method               Indoor                         Outdoor                        Combined
                     PSNR   SSIM   MAE    LPIPS     PSNR   SSIM   MAE    LPIPS     PSNR   SSIM   MAE    LPIPS
EBDB [15]            25.77  0.772  0.040  0.297     21.25  0.599  0.058  0.373     23.45  0.683  0.049  0.336
DMENet [18]          25.50  0.788  0.038  0.298     21.43  0.644  0.063  0.397     23.41  0.714  0.051  0.349
JNB [28]             26.73  0.828  0.031  0.273     21.10  0.608  0.064  0.355     23.84  0.715  0.048  0.315
Our DPDNet-Single    26.54  0.816  0.031  0.239     22.25  0.682  0.056  0.313     24.34  0.747  0.044  0.277
Our DPDNet           27.48  0.849  0.029  0.189     22.90  0.726  0.052  0.255     25.13  0.786  0.041  0.223
Table 1: The quantitative results for different defocus deblurring methods. The
testing on the dataset is divided into three scene categories: indoor, outdoor,
and combined. The top result numbers are highlighted in green and the second
top in blue. DPDNet-Single is our DPDNet variation that is trained with only a
single blurred input. Our DPDNet that uses the two L/R DP views achieved the
best results on all scene categories for all metrics. Note: the testing set consists
of 37 indoor and 39 outdoor scenes.
where n is the size of the image patch in pixels. During the training phase, we
set the dropout rate to 0.4. All the models described in the subsequent sections
are implemented using Python with the Keras framework on top of TensorFlow
and trained with an NVIDIA TITAN X GPU. We set the maximum number of
training epochs to 200.
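The training hyperparameters above can be summarized in a few lines. The sketch below is a framework-independent illustration, assuming that "decreased by half every 60 epochs" means a step schedule that drops at epochs 60, 120, and 180, and that the MSE loss averages the squared error over all entries of the patch; the function names are hypothetical and this is not the authors' Keras code.

```python
import numpy as np


def learning_rate(epoch, initial_lr=2e-5, drop_every=60):
    """Step schedule: halve the learning rate every `drop_every` epochs."""
    return initial_lr * 0.5 ** (epoch // drop_every)


def mse_loss(predicted, target):
    """Mean squared error averaged over all entries of the patch."""
    diff = np.asarray(predicted, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.mean(diff ** 2))


# Learning rate at a few epochs within the 200-epoch budget.
for epoch in (0, 59, 60, 120, 199):
    print(epoch, learning_rate(epoch))
```

A function of this shape could be passed to a scheduler callback in most training frameworks; with a 200-epoch budget the rate is halved three times under the assumed schedule.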
6 Experimental results
We first describe our data preparation procedure and then evaluation metrics
used. This is followed by quantitative and qualitative results to evaluate our
proposed method with existing deblurring methods. We also discuss the time
analysis and test the robustness of our DP method against different aperture
settings.
Data preparation. Our dataset has an equal number of indoor and outdoor
scenes. We divide the data into 70% training, 15% validation, and 15% testing
sets. Each set has a balanced number of indoor/outdoor scenes. To prepare the
data for training, we first downscale our images to be 1680 × 1120 in size. Next,
image patches are extracted by sliding a window of size 512 × 512 with 60%
overlap. We empirically found this image size and patch size to work well. An
ablation study of different architecture settings is provided in the supplemental
materials. We compute the sharpness energy (i.e., by applying a Sobel filter) of
the in-focus image patches and sort them. We discard the 30% of the patches that
have the lowest sharpness energy. Such patches represent homogeneous regions,
cause ambiguity regarding the amount of blur, and adversely affect the
DNN’s training, as found in [22].
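The patch preparation steps described above can be sketched as follows. This is a minimal illustration with NumPy, assuming that a 60% overlap corresponds to a stride of 40% of the patch size rounded down, that windows which would cross the image border are skipped, and that the sharpness energy is the mean squared Sobel gradient magnitude of a grayscale patch; none of these details is specified in the text, and the function names are hypothetical.

```python
import numpy as np


def window_starts(length, patch=512, overlap=0.6):
    """Start offsets of a sliding window with the given fractional overlap."""
    stride = int(patch * (1.0 - overlap))   # 204 px for 512 px and 60% overlap
    return list(range(0, length - patch + 1, stride))


def sobel_energy(gray):
    """Mean squared Sobel gradient magnitude over the patch interior."""
    g = np.asarray(gray, dtype=np.float64)
    # Horizontal derivative: right column minus left column, weights 1-2-1.
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]) - (g[:-2, :-2] + 2 * g[1:-1, :-2] + g[2:, :-2])
    # Vertical derivative: bottom row minus top row, weights 1-2-1.
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]) - (g[:-2, :-2] + 2 * g[:-2, 1:-1] + g[:-2, 2:])
    return float(np.mean(gx ** 2 + gy ** 2))


def keep_sharpest(patches, discard_fraction=0.3):
    """Indices of the patches kept after dropping the lowest-energy fraction."""
    energies = np.array([sobel_energy(p) for p in patches])
    n_drop = int(len(patches) * discard_fraction)
    order = np.argsort(energies, kind="stable")   # lowest energy first
    return sorted(int(i) for i in order[n_drop:])


xs = window_starts(1680)   # downscaled image width
ys = window_starts(1120)   # downscaled image height
print(len(xs) * len(ys), "patches per image")
```

Under these assumptions a 1680 × 1120 image yields a 6 × 3 grid of windows. The energy would be computed on the in-focus (sharp) patch of each pair, and the same indices would then select the corresponding blurred and DP patches.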
[Fig. 6 row labels: blurred input; EBDB [15]; DMENet [18]; JNB [28]; our DPDNet-Single; DPDNet; ground truth.]
ground truth defocus map and provide only the sharp image, since our approach
in this work is to solve directly for defocus deblurring. Therefore, we tested the
DMENet on our dataset using IB as input without retraining. For deblurring,
DMENet adopts a non-blind deconvolution algorithm proposed by [17]. Our
results are compared against code provided by the authors. Unfortunately, the
methods in [4, 22] do not have the deblurring code available for comparison.
To show the advantage of utilizing DP data for defocus deblurring, we in-
troduce a variation of our DPDNet that accepts only a single input (i.e., IB )
and uses exactly the same architecture settings along with the same training
procedure as shown in Fig. 5. We refer to this variation as DPDNet-Single in
Table 1. Our proposed architecture is fully convolutional, which enables testing
any image size during the testing phase. Therefore, all the subsequent results are
reported on the testing set using the full image for all methods. Table 1 reports
our findings by testing on three scene categories: indoor, outdoor, and combined.
Top result numbers are highlighted in green and the second top ones in blue.
Our DPDNet method has a significantly better deblurring ability based on all
metrics for all testing categories. Furthermore, the DP data is the key factor that makes
our DPDNet method outperform the others, especially the single-image variant
(i.e., DPDNet-Single), which has exactly the same architecture but does
not utilize DP views. Interestingly, all methods have better deblurring results
for indoor scenes, due to the fact that outdoor scenes tend to have larger depth
variations, and thereby more defocus blur.
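For readers who want to reproduce this kind of comparison, the sketch below computes two common signal-processing metrics between a deblurred image and its ground truth. It assumes images stored as floating-point arrays scaled to [0, 1], and the choice of mean absolute error (MAE) and peak signal-to-noise ratio (PSNR) is an assumption made for illustration; perceptual metrics such as LPIPS [36] require a pretrained network and are not covered here.

```python
import numpy as np


def mae(predicted, target):
    """Mean absolute error between two images of equal shape."""
    return float(np.mean(np.abs(predicted - target)))


def psnr(predicted, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with maximum value `peak`."""
    mse = float(np.mean((predicted - target) ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))


# Toy example: a constant error of 0.1 on a mid-gray image.
target = np.full((4, 4, 3), 0.5)
predicted = target + 0.1
print(f"MAE = {mae(predicted, target):.3f}, PSNR = {psnr(predicted, target):.2f} dB")
```

Higher PSNR and lower MAE indicate a result closer to the ground truth.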
Qualitative results. In Fig. 6, we present the qualitative results of different
defocus deblurring methods. The first row shows the input image with a spa-
tially varying defocus blur; the last row shows the corresponding ground truth
sharp image. The rows in between present different methods, including ours.
This figure also shows two zoomed-in cropped patches in green and red to fur-
ther illustrate the difference visually. From the visual comparison with other
methods, our DPDNet has the best deblurring ability and is quite similar to the
ground truth. EBDB [15], DMENet [18], and JNB [28] are not able to handle
spatially varying blur, and their outputs are almost indistinguishable from the input image.
EBDB [15] also introduces artifacts in some cases. Our single image
method (i.e., DPDNet-Single) has better deblurring ability compared to other
traditional deblurring methods, but it is not at the level of our method that
utilizes DP views for deblurring. Our DPDNet method, as shown visually, is
effective in handling spatially varying blur. For example, in the second row, the
image has a part that is in focus and another is not; our DPDNet method is
able to determine the deblurring amount required for each pixel, in which the
in-focus part is left untouched. Further qualitative results are provided in our
supplemental materials, including results on DP data obtained from a smart-
phone camera.
Time analysis. We evaluate different defocus deblurring methods
based on the time required to process a testing image of size 1680 × 1120 pixels.
Our DPDNet directly computes the sharp image in a single pass, whereas other
[Table 2 column headers: Method; Time (sec) ↓ for defocus map estimation, defocus deblurring, and total.]
Table 2: Time analysis of different defocus deblurring methods. The last column
is the total time required to process a testing image of size 1680 × 1120 pixels.
Our DPDNet is about 1.2 × 10^{3} times faster than the second-best method
(i.e., DMENet).
methods [15,18,28] use two passes: (1) defocus map estimation and (2) non-blind
deblurring based on the estimated defocus map.
Non-learning-based methods (i.e., EBDB [15] and JNB [28]) do not utilize the
GPU and use only the CPU. For the deep-learning method (i.e., DMENet [18]),
it utilizes the GPU for the first pass; however, the deblurring routine is applied
on a CPU. This time evaluation is performed using Intel Core i7-6700 CPU and
NVIDIA TITAN X GPU. Our DPDNet operates in a single pass and can process
the testing image of size 1680 × 1120 pixels about 1.2 × 10^{3} times faster compared
to the second-best method (i.e., DMENet), as shown in Table 2.
Robustness to different aperture settings. In our dataset, the image pairs
are captured using aperture settings corresponding to f-stops f /22 and f /4.
Recall that f /4 results in the shallowest DoF and thus the most defocus blur. Our
DPDNet is trained on diverse images with many different depth values; thus,
our training data spans the worst-case blur that would be observed with any
aperture settings. To test the ability of our DPDNet in generalizing for scenes
with different aperture settings, we capture image pairs with aperture settings
f /10 and f /16 for the blurred image and again f /22 for the corresponding
ground truth image. Our DPDNet is applied to these less blurred images. Fig. 7
shows the results for four scenes, where each scene’s image has its LPIPS measure
compared with the ground truth. For better visual comparison, Fig. 7 provides
zoomed-in patches that are cropped from the blurred input (red box) and the
deblurred one (green box). These results show that our DPDNet is able to deblur
scenes with different aperture settings that have not been used during training.
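The claim that f /4 bounds the blur seen at narrower apertures can be checked with the standard thin-lens circle-of-confusion formula, c = (f / N) · (f / (s − f)) · |d − s| / d, where f is the focal length, N the f-number, s the focus distance, and d the object distance. The sketch below evaluates it for the four apertures mentioned above; the focal length and distances are hypothetical values chosen only for illustration, and real lenses deviate from the thin-lens model.

```python
def blur_circle_diameter(focal_length, f_number, focus_distance, object_distance):
    """Thin-lens circle-of-confusion diameter on the sensor.

    All lengths share one unit (millimetres here); the result is in that unit.
    """
    aperture_diameter = focal_length / f_number
    magnification = focal_length / (focus_distance - focal_length)
    relative_defocus = abs(object_distance - focus_distance) / object_distance
    return aperture_diameter * magnification * relative_defocus


# Hypothetical setup: 50 mm lens focused at 1 m, object at 2 m.
for n in (4, 10, 16, 22):
    c = blur_circle_diameter(50.0, n, 1000.0, 2000.0)
    print(f"f/{n}: blur circle = {c:.4f} mm")
```

In this model the blur diameter is inversely proportional to the f-number, so for a fixed scene the blur at f /10 or f /16 is always smaller than at f /4, which is consistent with the observation that a network trained on f /4 data has already seen stronger blur.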
7 Applications
Image blur can have a negative impact on some computer vision tasks, as found
in [10]. Here we investigate defocus blur effect on two common computer vision
tasks – namely, image segmentation and monocular depth estimation.
Image segmentation. The first two columns in Fig. 8 demonstrate the nega-
tive effect of defocus blur on the task of image segmentation. We use the PSPNet
segmentation model from [37], and test two images: one is the blurred input
image IB and the other is the deblurred one I∗S using our DPDNet deblurring model.
Fig. 8: The effect of defocus blur on some computer vision tasks. The first two
columns show the image segmentation results using the PSPNet [37] segmenta-
tion model. The segmentation results are affected by the blurred image IB , where
a large portion is segmented as unknown in cyan. The last two columns show the
results of the monocular depth estimation using the monodepth model from [7].
The depth estimation is highly affected by the defocus blur and produced wrong
results. Deblurring IB using our DP deblurring method has significantly im-
proved the results for both tasks.
The segmentation results are affected by IB – only the foreground tree was cor-
rectly segmented. PSPNet assigns cyan color to unknown categories, where a
large portion of IB is segmented as unknown. On the other hand, the segmen-
tation results of I∗S are much better, in which more categories are segmented
correctly. With that said, image DoF deblurring using our DP method can be
beneficial for the task of image segmentation.
Monocular depth estimation. The monocular depth estimation is the task
of estimating scene depth using a single image. In the last two columns of Fig. 8,
we show the direct effect of defocus blur on this task. We use the monodepth
model from [7] to test the two images IB and I∗S in order to examine the change
in performance. The result of monodepth is strongly affected by the defocus blur:
the estimated depth map is completely wrong. In contrast, the result of
monodepth improves significantly when it is given the deblurred
image produced by our DPDNet deblurring model. Therefore, deblurring images using
our DPDNet can be useful for the task of monocular depth map estimation.
8 Conclusion
We have presented a novel approach to reduce the effect of defocus blur present
in images captured with a shallow DoF. Our approach leverages the DP data
that is available in most modern camera sensors but is currently ignored for
other uses. We show that the DP images are highly effective in reducing DoF blur
when used in a DNN framework. As part of this effort, we have captured a new
image dataset consisting of blurred and sharp image pairs along with their DP
images. Experimental results show that leveraging the DP data provides state-
of-the-art quantitative results on both signal processing and perceptual metrics.
We also demonstrate that our deblurring method can be beneficial for other
computer vision tasks. We believe our captured dataset and DP-based method
are useful for the research community and will help spur additional ideas about
both defocus deblurring and applications that can leverage data from DP sensors.
Acknowledgments. This study was funded in part by the Canada First Re-
search Excellence Fund for the Vision: Science to Applications (VISTA) pro-
gramme and an NSERC Discovery Grant. Dr. Brown contributed to this article
in his personal capacity as a professor at York University. The views expressed
are his own and do not necessarily represent the views of Samsung Research.
References
1. Abuolaim, A., Brown, M.S.: Online lens motion smoothing for video autofocus. In:
WACV (2020)
2. Abuolaim, A., Punnappurath, A., Brown, M.S.: Revisiting autofocus for smart-
phone cameras. In: ECCV (2018)
3. Boominathan, V., Mitra, K., Veeraraghavan, A.: Improving resolution and depth-
of-field of light field cameras using a hybrid imaging system. In: ICCP (2014)
4. D’Andrès, L., Salvador, J., Kochale, A., Süsstrunk, S.: Non-parametric blur map
regression for depth of field extension. TIP 25(4), 1660–1673 (2016)
5. Fish, D., Brinicombe, A., Pike, E., Walker, J.: Blind deconvolution by means of the
Richardson–Lucy algorithm. Journal of the Optical Society of America (A) 12(1),
58–65 (1995)
6. Garg, R., Wadhwa, N., Ansari, S., Barron, J.T.: Learning single camera depth
estimation using dual-pixels. In: ICCV (2019)
7. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth esti-
mation with left-right consistency. In: CVPR (2017)
8. Golestaneh, S.A., Karam, L.J.: Spatially-varying blur detection based on multiscale
fused and sorted transform coefficients of gradient magnitudes. In: CVPR (2017)
9. Google: Google research: Android app to capture dual-pixel data.
https://github.com/google-research/google-research/tree/master/dual_pixels
(2019), last accessed: March, 2020
10. Guo, Q., Feng, W., Chen, Z., Gao, R., Wan, L., Wang, S.: Effects of blur and
deblurring to visual object tracking. arXiv preprint arXiv:1908.07904 (2019)
11. Hazirbas, C., Soyer, S.G., Staab, M.C., Leal-Taixé, L., Cremers, D.: Deep depth
from focus. In: ACCV (2018)
12. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-
level performance on ImageNet classification. In: ICCV (2015)
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR (2016)
14. Jang, J., Yoo, Y., Kim, J., Paik, J.: Sensor-based auto-focusing system using multi-
scale feature extraction and phase correlation matching. Sensors 15(3), 5747–5762
(2015)
15. Karaali, A., Jung, C.R.: Edge-based defocus blur estimation with adaptive scale
selection. TIP 27(3), 1126–1137 (2017)
16. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
17. Krishnan, D., Fergus, R.: Fast image deconvolution using hyper-laplacian priors.
In: NeurIPS (2009)
18. Lee, J., Lee, S., Cho, S., Lee, S.: Deep defocus map estimation using domain
adaptation. In: CVPR (2019)
19. Levin, A., Fergus, R., Durand, F., Freeman, W.T.: Image and depth from a conventional
camera with a coded aperture. ACM Transactions on Graphics 26(3),
70 (2007)
20. Mao, X., Shen, C., Yang, Y.B.: Image restoration using very deep convolutional
encoder-decoder networks with symmetric skip connections. In: NeurIPS (2016)
21. Odena, A., Dumoulin, V., Olah, C.: Deconvolution and checkerboard artifacts.
Distill 1(10), e3 (2016)
22. Park, J., Tai, Y.W., Cho, D., So Kweon, I.: A unified approach of multi-scale deep
and hand-crafted features for defocus estimation. In: CVPR (2017)
23. Plotz, T., Roth, S.: Benchmarking denoising algorithms with real photographs. In:
CVPR (2017)
24. Punnappurath, A., Abuolaim, A., Afifi, M., Brown, M.S.: Modeling defocus-
disparity in dual-pixel sensors. In: ICCP (2020)
25. Punnappurath, A., Brown, M.S.: Reflection removal using a dual-pixel sensor. In:
CVPR (2019)
26. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: MICCAI (2015)
27. Shi, J., Xu, L., Jia, J.: Discriminative blur detection features. In: CVPR (2014)
28. Shi, J., Xu, L., Jia, J.: Just noticeable defocus blur detection and estimation. In:
CVPR (2015)
29. Srinivasan, P.P., Wang, T., Sreelal, A., Ramamoorthi, R., Ng, R.: Learning to
synthesize a 4D RGBD light field from a single image. In: ICCV (2017)
30. Srivastava, R.K., Greff, K., Schmidhuber, J.: Training very deep networks. In:
NeurIPS (2015)
31. Tang, C., Zhu, X., Liu, X., Wang, L., Zomaya, A.: Defusionnet: Defocus blur
detection via recurrently fusing and refining multi-scale deep features. In: CVPR
(2019)
32. Tao, X., Gao, H., Shen, X., Wang, J., Jia, J.: Scale-recurrent network for deep
image deblurring. In: CVPR (2018)
33. Wadhwa, N., Garg, R., Jacobs, D.E., Feldman, B.E., Kanazawa, N., Carroll, R.,
Movshovitz-Attias, Y., Barron, J.T., Pritch, Y., Levoy, M.: Synthetic depth-of-
field with a single-camera mobile phone. ACM Transactions on Graphics 37(4),
64 (2018)
34. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., et al.: Image quality assess-
ment: from error visibility to structural similarity. TIP 13(4), 600–612 (2004)
35. Yi, X., Eramian, M.: Lbp-based segmentation of defocus blur. TIP 25(4), 1626–
1638 (2016)
36. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable
effectiveness of deep features as a perceptual metric. In: CVPR (2018)
37. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In:
CVPR (2017)
38. Zhao, W., Zhao, F., Wang, D., Lu, H.: Defocus blur detection via multi-stream
bottom-top-bottom fully convolutional network. In: CVPR (2018)
39. Zhao, W., Zheng, B., Lin, Q., Lu, H.: Enhancing diversity of defocus blur detectors
via cross-ensemble network. In: CVPR (2019)
Supplemental Materials
S1 Ablation study
As described in Sec. 5 of the main paper, our DPDNet takes the two dual-pixel
L/R views, IL and IR , as inputs to estimate the sharp image I∗S . In addition to
the L/R views, our dataset also provides the corresponding combined image IB
that would be output by the camera. In this section, we examine training our
DPDNet with all three images, namely IL , IR , and IB . We refer to this variation
as DPDNet(IL , IR , IB ).
Table 3 shows the results of the three-input DPDNet, DPDNet(IL , IR , IB ),
vs. the two-input one, DPDNet(IL , IR ), proposed in the main paper. The two
variants perform similarly on all metrics, with only slight differences. Our
conclusion is that training and testing the DPDNet with the extra input IB
provides no noticeable improvement. Such results are expected, since IB is a
combination of IL and IR .
1 https://github.com/Abdullah-Abuolaim/defocus-deblurring-dual-pixel
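Table 3. Each group of four values lists PSNR, SSIM, MAE, and LPIPS.
Method                | Indoor                  | Outdoor                 | Combined
DPDNet(IL , IR , IB ) | 27.32 0.842 0.029 0.191 | 22.94 0.723 0.052 0.257 | 25.07 0.781 0.041 0.225
DPDNet(IL , IR )      | 27.48 0.849 0.029 0.189 | 22.90 0.726 0.052 0.255 | 25.13 0.786 0.041 0.223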
In this section, we train a “lighter” version of our DPDNet with fewer E-Blocks
and D-Blocks. This is done by reducing E-Block 1 and D-Block 4. We refer
to this light version as DPDNet-Light. In Table 4, we provide a comparison of
DPDNet-Light and our full DPDNet that is proposed in the main paper.
Table 4 shows that our full DPDNet performs better than the lighter one.
Nevertheless, the loss in performance is small, which suggests that
DPDNet-Light could be an option for environments with limited computational
resources.
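Table 4. Each group of four values lists PSNR, SSIM, MAE, and LPIPS.
Method       | Indoor                  | Outdoor                 | Combined
DPDNet-Light | 27.08 0.824 0.030 0.225 | 22.81 0.701 0.053 0.309 | 24.89 0.761 0.042 0.268
DPDNet       | 27.48 0.849 0.029 0.189 | 22.90 0.726 0.052 0.255 | 25.13 0.786 0.041 0.223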
Our DPDNet is a fully convolutional network. This facilitates training with dif-
ferent input patch sizes with no change required in the network architecture. As
such, we consider training with two different patch sizes, namely 256 × 256 pixels
and 512 × 512 pixels referred to as DPDNet256 and DPDNet512 , respectively.
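Because the network is fully convolutional, changing the patch size only changes how the training patches are cropped. The following is a minimal sketch of cropping spatially aligned patches, assuming the images are NumPy arrays of shape (H, W, 3). The function name `crop_aligned_patches`, the uniform random-crop strategy, and the demo image size are illustrative assumptions, not the authors' data pipeline.

```python
import numpy as np

def crop_aligned_patches(left, right, sharp, patch_size, rng):
    """Crop the same random window from the L view, R view, and sharp target."""
    h, w = left.shape[:2]
    if patch_size > h or patch_size > w:
        raise ValueError("patch_size exceeds image size")
    # Pick the top-left corner so the window stays inside the image.
    top = int(rng.integers(0, h - patch_size + 1))
    col = int(rng.integers(0, w - patch_size + 1))
    window = (slice(top, top + patch_size), slice(col, col + patch_size))
    return left[window], right[window], sharp[window]

rng = np.random.default_rng(0)
# Hypothetical image size, chosen only for this demo.
left, right, sharp = (rng.random((600, 800, 3)) for _ in range(3))
for size in (256, 512):
    l, r, s = crop_aligned_patches(left, right, sharp, size, rng)
    print(size, l.shape, r.shape, s.shape)
```

Using one window for all three images keeps the input views and the target pixel-aligned, which is what allows the same architecture to be trained at either patch size.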
Table 5 shows that the two different input sizes perform similarly. In
particular, the input patch size does not change the performance drastically as
long as it is larger than the blur size.
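Table 5. Each group of four values lists PSNR, SSIM, MAE, and LPIPS.
Method     | Indoor                  | Outdoor                 | Combined
DPDNet256  | 27.28 0.847 0.029 0.195 | 22.86 0.734 0.050 0.257 | 25.01 0.789 0.040 0.227
DPDNet512  | 27.48 0.849 0.029 0.189 | 22.90 0.726 0.052 0.255 | 25.13 0.786 0.041 0.223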
Our dataset provides high-quality images that are processed to an sRGB
encoding with a lossless 16-bit depth per RGB channel. Since we are targeting
dual-pixel information that would be obtained directly from the camera’s
hardware, we would expect a real hardware implementation to have such high
bit-depth images. However, since most standard encodings still rely on 8-bit
images, we provide a comparison of training our DPDNet with 8-bit (DPDNet8−bit )
and 16-bit (DPDNet16−bit ) input data.
Based on the numbers in Table 7, DPDNet16−bit performs slightly better. In
particular, it has a lower LPIPS distance for all categories. Training with
16-bit images is therefore helpful because of the extra information they carry,
and 16-bit data is more representative of the hardware’s data.
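To make the bit-depth difference concrete, the sketch below quantizes a synthetic 16-bit image to 8 bits and measures the error this introduces. It assumes images are unsigned-integer NumPy arrays; the helper names (`to_unit_range`, `psnr`, `mae`) and the random test image are illustrative assumptions and are not taken from the authors' code.

```python
import numpy as np

def to_unit_range(img):
    """Map an unsigned-integer image to float64 in [0, 1]."""
    return img.astype(np.float64) / np.iinfo(img.dtype).max

def psnr(pred, target):
    """Peak signal-to-noise ratio in dB for images in [0, 1]."""
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(1.0 / mse))

def mae(pred, target):
    """Mean absolute error for images in [0, 1]."""
    return float(np.mean(np.abs(pred - target)))

rng = np.random.default_rng(0)
# Synthetic 16-bit image used only for illustration.
img16 = rng.integers(0, 65536, size=(64, 64, 3), dtype=np.uint16)
img8 = (img16 >> 8).astype(np.uint8)  # keep the 8 most significant bits

x16, x8 = to_unit_range(img16), to_unit_range(img8)
print("levels per channel:", 2 ** 16, "vs", 2 ** 8)
print("MAE introduced by 8-bit quantization:", mae(x8, x16))
print("PSNR of 8-bit vs. 16-bit:", psnr(x8, x16))
```

The quantization error is small per pixel, which is in line with the modest differences reported between the two variants.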
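Table 6. Each group of four values lists PSNR, SSIM, MAE, and LPIPS.
Method      | Indoor                  | Outdoor                 | Combined
DPDNet0%    | 27.21 0.838 0.030 0.205 | 22.86 0.721 0.051 0.275 | 24.98 0.778 0.041 0.241
DPDNet15%   | 27.19 0.840 0.029 0.194 | 22.94 0.721 0.052 0.254 | 25.01 0.779 0.041 0.225
DPDNet30%   | 27.48 0.849 0.029 0.189 | 22.90 0.726 0.052 0.255 | 25.13 0.786 0.041 0.223
DPDNet45%   | 27.21 0.839 0.030 0.194 | 22.90 0.724 0.051 0.258 | 25.00 0.780 0.041 0.227
Table 7. Each group of four values lists PSNR, SSIM, MAE, and LPIPS.
Method        | Indoor                  | Outdoor                 | Combined
DPDNet8−bit   | 27.37 0.834 0.029 0.196 | 23.10 0.723 0.052 0.258 | 25.18 0.777 0.041 0.228
DPDNet16−bit  | 27.48 0.849 0.029 0.189 | 22.90 0.726 0.052 0.255 | 25.13 0.786 0.041 0.223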
One may be curious whether motion deblurring methods can be used to address the
defocus blur problem. While defocus and motion blur both blur the underlying
latent image, the physical image formation processes of these two types of blur
are different. Therefore, methods that solve for motion blur are not expected
to give good results on defocus blur. However, as a validity check, we tested
the scale-recurrent motion deblurring method (SRNet) in [32] on our testing
set. This method achieved an average LPIPS of 0.452 and PSNR of 20.12, which
is worse than all other existing methods that solve for defocus deblurring.
Fig. 9 shows the results of applying the motion deblurring network SRNet [32]
to an input image from our dataset.
(c) SRNet [32] output image. (d) Our DPDNet output image.
Fig. 9: Qualitative deblurring results using SRNet [32] and our DPDNet.
S3 Use cases
As discussed in Sec. 1 of the main paper, defocus blur is related to the size
of the aperture used at capture time. The size of the aperture is often
dictated by the desired exposure, which is a function of aperture, shutter
speed, and ISO setting. As a result, there is a trade-off between image noise
(from ISO gain), motion blur (shutter speed), and defocus blur (aperture). This
trade-off is referred to as the exposure triangle. In this section, we show
some common cases where defocus deblurring is required.
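The trade-off can be illustrated with a short calculation. The sketch below uses the standard photographic relation that image brightness is proportional to t · ISO / N², where t is the shutter time and N is the f-number. The function name and the 0.25-second reference shutter time are hypothetical and chosen only for illustration; the f-numbers and ISO values mirror those listed for Fig. 10.

```python
def equivalent_shutter_time(t_ref, n_ref, iso_ref, n_new, iso_new):
    """Shutter time that keeps image brightness equal to a reference setting.

    Assumes brightness is proportional to t * ISO / N**2, with N the f-number.
    """
    return t_ref * (n_new / n_ref) ** 2 * (iso_ref / iso_new)

# Hypothetical reference exposure: 0.25 s at f/22 and ISO 100.
t_ref = 0.25

# Option 1: keep f/22 and raise the ISO to 3200 (faster shutter, more noise).
t_high_iso = equivalent_shutter_time(t_ref, 22, 100, 22, 3200)

# Option 2: keep ISO 100 and open the aperture to f/8 (faster shutter, more defocus blur).
t_wide_aperture = equivalent_shutter_time(t_ref, 22, 100, 8, 100)

print(f"ISO 3200 at f/22: {t_high_iso:.4f} s")
print(f"ISO 100 at f/8:   {t_wide_aperture:.4f} s")
```

Both options shorten the shutter time and so reduce motion blur, but each pays for it on a different side of the exposure triangle.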
Moving camera. Global motion blur is more likely to occur with moving cameras,
such as hand-held cameras (I1 in Fig. 10-A). One way to handle motion blur is
to set a fast shutter speed, which can be done by increasing either the image
gain (i.e., ISO) or the aperture size. However, a higher ISO can introduce
noise, as stated in [23] (Fig. 10-B), and a wider aperture can introduce
undesired defocus blur, as shown in I3 (Fig. 10-C). For such cases, we offer
two solutions: apply the motion deblurring method SRNet [32] to I1 (result
shown in Fig. 10-D) or apply our defocus deblurring method to I3 (result shown
in Fig. 10-E). Our defocus deblurring method obtains a sharper and cleaner
image, as demonstrated in Fig. 10-E.
(A) I1 at f/22 and 100 ISO (B) I2 at f/22 and 3200 ISO
(C) I3 at f/8 and 100 ISO (D) Motion deblurring of I1 using SRNet [32]
Fig. 10: Relation between image noise, motion blur, and defocus blur with a
moving camera. The number shown on each image is the shutter speed. Zoomed-in
cropped patches are also provided. (A) shows an image I1 that suffers from
motion blur. (B) shows an image I2 that fixes the motion blur by increasing the
ISO; however, I2 has more noise. (C) shows another image I3 that handles the
motion blur by increasing the aperture size; nevertheless, I3 suffers from
defocus blur. (D) shows the result of deblurring I1 using the motion deblurring
method SRNet [32]. The image in (E) is the sharp and clean image obtained using
our DPDNet to deblur I3 .
Moving object. In this scenario, we have a stationary camera and a scene
object that is moving (i.e., Newton’s cradle in Fig. 11). Fig. 11-A shows an
image with motion blur, in which the object moves too fast for the shutter
speed. In Fig. 11-B, the ISO is significantly increased in order to make the
shutter speed faster; nevertheless, the pendulum is still too fast and the
motion blur remains pronounced. Another way to increase the shutter speed is to
open the aperture wider, as shown in Fig. 11-C, and this setting handles the
motion blur. However, capturing at a wider aperture introduces undesired
defocus blur. To get a sharper image, we can use the motion deblurring method
SRNet [32] to deblur I1 (result shown in Fig. 11-D) and I2 (result shown in
Fig. 11-E), or apply our defocus deblurring method to I3 (result shown in
Fig. 11-F). Our defocus deblurring method obtains a sharper image than the
motion deblurring method, as demonstrated in Fig. 11-F.
(A) I1 at f/22 and 3.2k ISO (B) I2 at f/22 and 16k ISO (C) I3 at f/4 and 3.2k ISO
(D) Motion deblurring SRNet(I1 )[32] (E) Motion deblurring SRNet(I2 )[32] (F) Our DPDNet(I3,L , I3,R )
Fig. 11: Relation between motion blur and defocus blur with a moving object.
The number shown on each image is the shutter speed. (A) shows an image I1 with
a moving object that suffers from motion blur. Image I2 in (B) tries to fix the
motion blur by increasing the ISO, but the motion blur is still pronounced. I3
in (C) handles the motion blur by setting the aperture wide; nevertheless, this
introduces defocus blur. (D) and (E) show the results of deblurring I1 and I2 ,
respectively, using the motion deblurring method SRNet [32]. The image in (F)
is sharp and is obtained by deblurring I3 using our DPDNet.
that provide DP data, namely, Google Pixel 3 and 4 smartphones and Canon
EOS 5D Mark IV DSLR. The smartphone camera currently has limitations that
make it challenging to use for training the DPDNet. First, the Google Pixel
smartphone cameras do not have adjustable apertures, so we are unable to
capture corresponding “sharp” images using a small aperture as we did with the
Canon camera. Second, the data currently available from the Pixel smartphones
are not full-frame, but are limited to only one of the green channels in the
raw Bayer frame. Finally, the smartphone has a very small aperture, so most
images do not exhibit defocus blur. In fact, many smartphone cameras
synthetically apply defocus blur to produce the shallow DoF effect.
As a result, the experiments here are provided to serve as a proof of concept
that our method should generalize to other DP sensors. To this end, we examined
DP images available in the dataset from [6] to find images exhibiting defocus
blur. The L/R views of these images are available in the “animated dp examples”
directory, located in the same directory as this PDF file.
Table 8 (column headers only): Method | Average LPIPS ↓ on the DP L view | Average LPIPS ↓ on the DP R view
To use our DPDNet, we replicate the single green channel to form a 3-channel
image that matches our DPDNet input. Fig. 12 shows the deblurring results on
images captured by the Pixel camera. The image on the left is the combined
input image and the image on the right is the one deblurred using our DPDNet.
Note that the Pixel Android application, used to extract DP data, does not
provide the combined image [9]. To obtain it, we average the two views. Fig. 12
visually demonstrates that our DPDNet is able to generalize and deblur images
that are captured by the smartphone camera. Because it is not possible to
adjust the aperture on the smartphone camera to capture a ground truth image,
we cannot report quantitative numbers. The results of two more full images are
shown in Fig. 13.
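The two preprocessing steps described for the Pixel data, channel replication and view averaging, can be sketched as follows. This assumes each DP view is a single-channel NumPy array of shape (H, W); the function names, the random demo data, and the channel-wise concatenation of the two views into one network input are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def replicate_to_three_channels(gray):
    """Repeat a single-channel (H, W) image into an (H, W, 3) image."""
    return np.repeat(gray[..., np.newaxis], 3, axis=-1)

def combine_views(left, right):
    """Approximate the combined image by averaging the two DP views."""
    return (left.astype(np.float64) + right.astype(np.float64)) / 2.0

rng = np.random.default_rng(0)
# Hypothetical green-channel DP views, values in [0, 1].
left_g = rng.random((4, 6))
right_g = rng.random((4, 6))

left_rgb = replicate_to_three_channels(left_g)
right_rgb = replicate_to_three_channels(right_g)
combined_rgb = replicate_to_three_channels(combine_views(left_g, right_g))

# Assumption: the two 3-channel views are stacked along the channel axis.
net_input = np.concatenate([left_rgb, right_rgb], axis=-1)
print(left_rgb.shape, combined_rgb.shape, net_input.shape)
```

The averaged image serves only as the blurred reference for visual comparison; the network itself consumes the two replicated views.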
S5 More results
Quantitative results. In Table 8, we evaluate other methods on each single DP
view separately using the average LPIPS. Note that, in the ideal case, a single
DP L or R view is formed with a half-disc point spread function. When the two
views are combined to form the final output image, the blur kernel looks like a
full-disc kernel [24]. Non-blind defocus deblurring methods assume a full-disc
kernel, so the blur kernel of the combined image aligns better with their
assumption. More details about DP view formation and modeling DP blur kernels
can be found in [24].
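The relation between the half-disc and full-disc kernels can be checked numerically. The sketch below builds idealized binary disc masks on a pixel grid and verifies that averaging the two normalized half-disc kernels gives the normalized full-disc kernel. The function name, the kernel radius, which half is assigned to which view, and the even split of the center column are assumptions made for illustration; real DP kernels deviate from this ideal model.

```python
import numpy as np

def disc_kernels(radius):
    """Return idealized (left, right, full) disc kernels, each summing to 1."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    full = (x ** 2 + y ** 2 <= radius ** 2).astype(np.float64)
    # Split the disc along the vertical axis; share the center column evenly.
    center = 0.5 * full * (x == 0)
    left = full * (x < 0) + center
    right = full * (x > 0) + center
    return left / left.sum(), right / right.sum(), full / full.sum()

left_k, right_k, full_k = disc_kernels(radius=7)

# Averaging the two views corresponds to averaging their blur kernels.
combined_k = 0.5 * (left_k + right_k)
print("max |combined - full|:", float(np.abs(combined_k - full_k).max()))
```

Under this ideal model, a method that assumes a full-disc kernel is matched to the combined image but mismatched to either single view.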
In addition to the above, we report in Table 9 the average LPIPS numbers for
other methods on the images used to test DPDNet robustness to different
aperture settings. Note that the LPIPS numbers here are lower than the numbers
in Table 1 of the main paper. The reason is that for the robustness test we
used f/10 and f/16, which result in less defocus blur than the images captured
at f/4 (a much wider aperture than f/10 and f/16).
Fig. 12: The results of using our DPDNet to deblur images captured by a Pixel
smartphone camera. The image on the left is the combined input image with
defocus blur and the one on the right is the deblurred one. Our DPDNet is able
to generalize well to images captured by a smartphone camera.
Fig. 13: Qualitative deblurring results using our DPDNet for images captured by
a smartphone camera.