ImagePairs
Building a realistic data set for super-resolution research by using a beam-splitter camera rig
Super resolution refers to techniques that derive a high-resolution image from one or more lower-resolution images of the same scene. It is an ill-posed problem because high-frequency visual details of the scene are lost in the low-resolution images. To overcome this, many machine-learning approaches have been proposed that train a model to recover the lost details in new scenes, including recent successful efforts using deep-learning techniques. Data itself plays a significant role in the machine-learning process, especially for deep-learning approaches, which are data hungry. Solving the problem therefore depends as much on the process of gathering and organizing data as it does on the machine-learning technique itself.
In this project, we propose a new data-acquisition technique for gathering a real image data set that can be used as input for super-resolution, noise-cancellation, and quality-enhancement techniques. We use a beam splitter to capture the same scene with two cameras: one low resolution and one high resolution. Because we also release the raw images, this large-scale data set can be used for other tasks such as ISP (image signal processing) generation. Unlike the small-scale data sets currently used for these tasks, our data set includes 11,421 pairs of low- and high-resolution images of diverse scenes. To our knowledge, this is the most complete data set for super resolution, ISP, and image-quality enhancement. The benchmarking results show how the new data set can be used to achieve significant quality improvements in real-world image super resolution.
Hardware design
The high-resolution camera had a 20.1-megapixel, 1/2.4″-format CMOS image sensor supporting 5,344 (horizontal) × 3,752 (vertical) frame capture, a 1.12 μm pixel size, and a lens focal length of f = 4.418 mm (F/1.94), providing a 68.2° × 50.9° field of view (FOV). The camera also featured bidirectional auto-focus (open-loop VCM) and two-axis optical image stabilization (closed-loop VCM). The lower-resolution, fixed-focus camera had a similar FOV with approximately half the angular pixel resolution: a 5-MP, 1/4″-format CMOS image sensor supporting 2,588 × 1,944 frame capture, a 1.4 μm pixel size, and a lens focal length of f = 2.9 mm (F/2.4), providing a 64° (horizontal) × 50.3° (vertical) FOV.
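The quoted fields of view follow directly from the sensor geometry and focal lengths (full FOV = 2·atan(sensor extent / 2f)); the short Python sketch below checks the numbers above.

```python
import math

def fov_deg(pixels, pixel_size_um, focal_mm):
    """Full field of view in degrees from pixel count, pixel pitch, and focal length."""
    sensor_mm = pixels * pixel_size_um / 1000.0
    return 2 * math.degrees(math.atan(sensor_mm / (2 * focal_mm)))

# High-resolution camera: 5,344 x 3,752 px, 1.12 um pixels, f = 4.418 mm
print(fov_deg(5344, 1.12, 4.418), fov_deg(3752, 1.12, 4.418))  # ~68.2, ~50.9 deg
# Low-resolution camera: 2,588 x 1,944 px, 1.4 um pixels, f = 2.9 mm
print(fov_deg(2588, 1.4, 2.9), fov_deg(1944, 1.4, 2.9))        # ~64.0, ~50.3 deg
```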
To capture frames on both cameras simultaneously with a common perspective, the FOVs of both cameras were combined using a Thorlabs BS013 50/50 non-polarizing beam-splitter cube. The two optical paths were then aligned so that the pointing angles of the optical axes coincided at far distance and the entrance pupils of the two cameras (nodes) coincided at near distance.
The high-resolution camera, placed behind the combiner cube in the transmission optical path, was mounted on a Thorlabs K6XS 6-axis stage so that the x and y position of the entrance pupil was centered within the cube and the z position was brought into close proximity to the cube. The tip and tilt of the camera's center-field pointing angle were aligned to a distant target, while rotation about the camera's optical axis was aligned by matching pixel rows to a horizontal line target.
The low-resolution camera was placed behind the combiner cube in the lateral 90° folded optical path and was also mounted on a 6-axis stage. It was aligned in x, y, and z so that its entrance pupil optically overlapped that of the high-resolution camera. The tip-tilt pointing angle as well as the camera rotation about the optical axis were adjustable to achieve similar scene capture. To refine the overlap toward pixel accuracy, a live-capture tool displayed the absolute difference of the image content between the two cameras so that center pointing and rotation leveling could be adjusted with high sensitivity. Once the spatial and angular offsets were substantially nulled, the camera was mechanically locked in position. The unused combiner optical path was painted with carbon black to limit image contrast loss due to scatter.
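The live difference view is straightforward to reproduce. The sketch below shows one way to implement it with OpenCV, assuming both cameras are exposed as standard capture devices; the device indices and window handling are illustrative and are not the actual rig software.

```python
import cv2

# Hypothetical capture indices for the two cameras (illustrative only).
CAM_HI, CAM_LO = 0, 1

cap_hi = cv2.VideoCapture(CAM_HI)
cap_lo = cv2.VideoCapture(CAM_LO)

while True:
    ok_hi, frame_hi = cap_hi.read()
    ok_lo, frame_lo = cap_lo.read()
    if not (ok_hi and ok_lo):
        break

    # Resize to a common resolution so the frames can be compared pixel-wise.
    h, w = frame_lo.shape[:2]
    frame_hi_small = cv2.resize(frame_hi, (w, h), interpolation=cv2.INTER_AREA)

    # The absolute difference highlights residual pointing/rotation misalignment:
    # the view goes dark as the two cameras converge on the same scene.
    diff = cv2.absdiff(frame_hi_small, frame_lo)
    cv2.imshow("alignment difference", diff)

    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap_hi.release()
cap_lo.release()
cv2.destroyAllWindows()
```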
The two cameras still had a residual difference in perspective due to their different lens focal lengths, which was addressed with a local alignment technique described in the next section.
Data set formation
We call our data set ImagePairs because it is composed of pairs of images of the exact same scene captured with two different cameras: one low-resolution image (1,752 × 1,166 pixels) and one high-resolution image that is exactly twice as large in each dimension (3,504 × 2,332 pixels). Unlike other real-world data sets, we do not use zoom levels or scaling to increase the number of pairs, so each pair represents a unique scene. This means we captured 11,421 distinct scenes with the device, generating 11,421 image pairs.
For each image pair, metadata such as gain, exposure, lens position, and scene category were stored. Each image pair was assigned to a category that can later be used for training purposes. These categories include Document, Board, Office, Face, Car, Tree, Sky, Object, Night, and Outdoor. The pairs were divided into a training set and a test set, containing 8,591 and 2,830 pairs respectively.
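For readers who want to consume the pairs in a training pipeline, a minimal PyTorch-style loader might look like the sketch below. The directory layout and the `_lr.png`/`_hr.png` file-name suffixes are assumptions for illustration and should be adjusted to match the released archives.

```python
import os
from glob import glob

from PIL import Image
from torch.utils.data import Dataset

class ImagePairsDataset(Dataset):
    """Minimal loader for low/high-resolution pairs (layout is assumed, not official)."""

    def __init__(self, root, transform=None):
        # Assumed naming convention: <scene>_lr.png / <scene>_hr.png in one folder.
        self.lr_paths = sorted(glob(os.path.join(root, "*_lr.png")))
        self.transform = transform

    def __len__(self):
        return len(self.lr_paths)

    def __getitem__(self, idx):
        lr_path = self.lr_paths[idx]
        hr_path = lr_path.replace("_lr.png", "_hr.png")
        lr = Image.open(lr_path).convert("RGB")  # 1,752 x 1,166
        hr = Image.open(hr_path).convert("RGB")  # 3,504 x 2,332
        if self.transform is not None:
            lr, hr = self.transform(lr, hr)
        return lr, hr
```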
The two cameras differed in perspective due to their different lens focal lengths. To make the two images of a pair correspond at the pixel level, we applied the following steps: (1) ISP; (2) image undistortion; (3) pair alignment; and (4) margin cropping. The accompanying example figures illustrate the accuracy of the technique across widely varying scenes.
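The released pairs were produced with a local alignment technique. As a rough, global approximation of steps (2) to (4), one could undistort the low-resolution frame and warp it onto the high-resolution frame with a single feature-based homography, as in the sketch below; the calibration parameters, feature detector, and margin are illustrative assumptions, not the pipeline used to build the data set.

```python
import cv2
import numpy as np

def align_pair(lo_rgb, hi_rgb, camera_matrix, dist_coeffs, margin=32):
    """Rough global alignment of a low/high-resolution pair (illustrative only)."""
    # (2) Undistort the low-resolution frame with its calibrated lens model.
    lo_undist = cv2.undistort(lo_rgb, camera_matrix, dist_coeffs)

    # (3) Estimate a homography from ORB feature matches between the pair.
    orb = cv2.ORB_create(4000)
    k_lo, d_lo = orb.detectAndCompute(cv2.cvtColor(lo_undist, cv2.COLOR_BGR2GRAY), None)
    k_hi, d_hi = orb.detectAndCompute(cv2.cvtColor(hi_rgb, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d_lo, d_hi)
    src = np.float32([k_lo[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k_hi[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    # Warp the low-res frame into the high-res geometry at half scale
    # (the low-res image is exactly half the high-res size in each dimension).
    h, w = hi_rgb.shape[:2]
    half_scale = np.diag([0.5, 0.5, 1.0])
    lo_aligned = cv2.warpPerspective(lo_undist, half_scale @ H, (w // 2, h // 2))

    # (4) Crop a safety margin where the warp leaves undefined borders.
    lo_crop = lo_aligned[margin:-margin, margin:-margin]
    hi_crop = hi_rgb[2 * margin:-2 * margin, 2 * margin:-2 * margin]
    return lo_crop, hi_crop
```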
Super-resolution benchmark
We trained three 2× super-resolution methods on the ImagePairs training set: SRGAN, EDSR, and WDSR. All methods were trained on low- and high-resolution RGB images; we did not use raw images as input. We used the same high-resolution patch size of 128 × 128 and a batch size of 16 for all training runs, and each method was trained for 150,000 iterations. For evaluation, we ran the trained models on a centered quarter crop of the images in the ImagePairs test set (a sketch of the metric computation follows the table). The following table reports the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) for the models trained on ImagePairs as well as for models trained on the DIV2K data set with similar parameters.
Model | Training Data | PSNR (dB) | SSIM |
---|---|---|---|
Bicubic | — | 21.451 | 0.712 |
SRGAN | DIV2K | 21.906 | 0.699 |
WDSR | DIV2K | 21.299 | 0.697 |
EDSR | DIV2K | 21.298 | 0.697 |
SRGAN | ImagePairs | 22.161 | 0.673 |
WDSR | ImagePairs | 23.805 | 0.767 |
EDSR | ImagePairs | 23.845 | 0.764 |
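Below is a minimal sketch of the evaluation described above, assuming the super-resolved output and the ground-truth high-resolution image are 8-bit RGB NumPy arrays and using scikit-image's metric implementations.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def center_quarter(img):
    """Centered quarter crop (half the width and half the height) of an image."""
    h, w = img.shape[:2]
    return img[h // 4: 3 * h // 4, w // 4: 3 * w // 4]

def evaluate_pair(sr, hr):
    """PSNR and SSIM on the centered quarter crop of an 8-bit RGB result."""
    sr_c, hr_c = center_quarter(sr), center_quarter(hr)
    psnr = peak_signal_noise_ratio(hr_c, sr_c, data_range=255)
    ssim = structural_similarity(hr_c, sr_c, channel_axis=-1, data_range=255)
    return psnr, ssim
```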
The PSNR and SSIM of the methods trained on DIV2K are comparable with bicubic upsampling; in some cases they perform worse than bicubic because some super-resolution methods can amplify noise. When the same models were trained on the proposed ImagePairs data set, all of them improved on their DIV2K-trained counterparts in PSNR. WDSR and EDSR did a good job of noise cancellation, outperforming their DIV2K-trained counterparts by at least 2 dB in PSNR and by about 0.07 in SSIM. SRGAN, on the other hand, which is not optimized for PSNR, mainly focuses on color correction and not much on noise cancellation.
The following figure shows a qualitative comparison of these methods when trained on the ImagePairs data set. Needless to say, these models perform much better on noise cancellation, color correction, and super resolution when trained on this data set than when trained on DIV2K.
For more details on this project, please see our IEEE conference paper, linked from this page.
Downloadable data sets
We have made the full set of training and testing image pairs available for download.
Download Link | Image Pairs | Size |
---|---|---|
ImagePairs Training Data Set | 8,591 | 111 GB |
ImagePairs Testing Data Set | 2,830 | 37 GB |