Safer YOLO, in the Dark (I)

Irene's Cauldron
6 min read · Oct 26, 2019

As a state-of-the-art real-time object detection model, YOLO has been widely studied and is commonly implemented by CV engineers. In a project that aims to build a wearable device for vision-impaired users, I modified the original YOLO v3 model to make it safer and more robust.

I made two main changes to make YOLO more adaptable: (1) enabling it to work in dark environments, and (2) adding class-sensitive loss functions to make it safer. This post focuses on the first change.

Feedback is very welcome and appreciated.

My major takeaways from the project

  1. Think outside the box.
  2. Solving real-life problems is much more complicated than building models. Corner cases and small details add an extra challenge to the technical problems, and also make the whole learning experience rewarding.

The goal: improve YOLO’s performance in low-light environments

While YOLOv3 models can achieve mAP-50 scores of over 50% on the COCO 2014 validation set, their performance decreases significantly on darker images. In my experiment, I ran YOLOv3-416 on a darkened version of the COCO dataset and its mAP dropped to below 30%. The darker images were produced by applying gamma correction (gamma = 0.3) and multiplying every pixel value by 0.3, which is only a very rough way to create darker versions of the original images.
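For reference, here is a minimal sketch of that darkening step, assuming OpenCV/NumPy and 8-bit images. The gamma and scale values match those mentioned above, but the exact gamma-correction convention is not spelled out here, so treat the exponent below as an assumption.

```python
import cv2
import numpy as np

def darken(image, gamma=0.3, scale=0.3):
    """Roughly simulate a low-light version of an 8-bit image."""
    img = image.astype(np.float32) / 255.0
    # Gamma correction; assuming the out = in ** (1/gamma) convention,
    # which darkens the image when gamma < 1.
    img = np.power(img, 1.0 / gamma)
    # Global intensity reduction: multiply every pixel value by `scale`.
    img = img * scale
    return np.clip(img * 255.0, 0, 255).astype(np.uint8)

# Example usage on one COCO validation image (path is illustrative).
bright = cv2.imread("val2014/example.jpg")
cv2.imwrite("val2014_dark/example.jpg", darken(bright))
```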

So the goal is: how can we improve YOLO’s performance when we use it in dark environments? More specifically, how can we achieve this with deep learning alone, without relying on additional hardware? I assume the problem could be solved by adding lighting equipment or perhaps infrared camera units, but due to hardware constraints (i.e. weight, battery life, appearance, etc.), it would be better if we could solve it from the software side.

Why not train YOLO on dark images?

There are two major concerns. First, object detection on dark images could be intrinsically harder, if not impossible: the statistical distribution of pixel intensities in dark images may not carry enough information for YOLO to learn from. For example, texture and silhouette information is largely lost in the dark.

Second, it would take a lot of time and manual effort to produce dark versions of the training images that match the distribution of real-life images taken in the dark. The images in ImageNet and COCO come with a wide range of lighting conditions, and I did not find a single formula that can take any image and produce a reasonably good dark version of it. For instance, if we choose gamma adjustment, then the gamma value is a hyperparameter that needs to be chosen for each image. Since it takes tens of thousands of images to train YOLO, the manual image-editing work looks daunting.

The See in the Dark (SID) Model

SID is a learning-based low-light image processing model proposed by Chen et al. (CVPR 2018). It uses a U-Net architecture to recover details from pictures taken in extremely low-light conditions. It is an end-to-end image processing pipeline that takes in raw sensor data and outputs an RGB image as if it had been taken in well-lit conditions.

Architecture of SID, taken from the paper

The results look really great.

Sample result from the paper

My proposed solution: Modified SID + YOLO

The very high-level idea: use a trained SID as a pre-processing pipeline which adjusts the inputs to “normal” lighting conditions and then run YOLO on the output of SID.
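A minimal sketch of this two-stage pipeline is below. The `enhance` and `detect` callables are placeholders for whatever trained SID and YOLO implementations you have; the names and the 416×416 input size are assumptions on my part, not a specific API.

```python
import cv2
import numpy as np

def detect_in_the_dark(dark_bgr, enhance, detect, size=(416, 416)):
    """Run SID as a pre-processing step, then YOLO on its output.

    `enhance`: callable wrapping the trained SID model; takes an H x W x 3
               float image in [0, 1] and returns a restored image in [0, 1].
    `detect`:  callable wrapping YOLOv3; takes an 8-bit image and returns
               boxes, classes and confidence scores.
    """
    x = cv2.resize(dark_bgr, size).astype(np.float32) / 255.0
    restored = enhance(x)                                    # "re-lit" image
    restored = (np.clip(restored, 0.0, 1.0) * 255).astype(np.uint8)
    return detect(restored)
```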

I made two modifications to the SID model. First, instead of using raw sensor data, I used RGB images directly. To be honest, I do not know how to get the raw sensor data from our camera. (The original paper uses SLR cameras, while our prototype uses a low-budget camera unit.) Second, the authors of SID went to great effort to capture image pairs (well-lit and dark) in controlled environments. I did not have the time to take all of those images and instead used OpenCV to produce the dark images.

The really cool thing is that you don’t need many dark images to train a reasonably good SID model. I manually adjusted fewer than 80 images of size 1080×1920 and randomly cropped them to 416×416 during training.
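The paired random crop is the only augmentation used here; below is a minimal sketch of it, assuming NumPy arrays and crops taken identically from both images so the supervision stays pixel-aligned.

```python
import numpy as np

def random_paired_crop(bright, dark, size=416):
    """Take the same random size x size crop from a bright/dark image pair."""
    h, w = bright.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return (bright[top:top + size, left:left + size],
            dark[top:top + size, left:left + size])
```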

These two modifications make my model sub-optimal compared with its full potential, but the results were still sufficiently good. The small training set was purely a consequence of time constraints.

Training data: quality control

One natural and legitimate concern is: are the manually adjusted images really representative of what the camera captures in real low-light environments? To make sure the adjusted dark images are reasonably “real”, I compared how the pixel histograms change in a “real pair” and in a “fake pair”.

The “real pair” was taken with the same SLR camera on a tripod, with the only difference being the ISO level. The dark image in the fake pair was made from the original image with gamma correction, linear contrast reduction, and added noise.
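A hedged sketch of how such a fake dark image can be produced with NumPy is below; the specific gamma, contrast, and noise values are illustrative rather than the exact parameters I used.

```python
import numpy as np

def make_fake_dark(bright_bgr, gamma=0.3, contrast=0.6, noise_sigma=8.0):
    """Synthesize a low-light image from a well-lit one:
    gamma correction, linear contrast reduction, then additive noise."""
    img = bright_bgr.astype(np.float32) / 255.0
    # Gamma correction (out = in ** (1/gamma) convention, darkens for gamma < 1).
    img = np.power(img, 1.0 / gamma)
    # Linear contrast reduction: scale intensities toward black.
    img = img * contrast
    # Sensor-like additive Gaussian noise, applied in the 8-bit range.
    img = img * 255.0 + np.random.normal(0.0, noise_sigma, img.shape)
    return np.clip(img, 0, 255).astype(np.uint8)
```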

Top row: “real pair”; bottom row: fake pair (my cat and his friends)

The histograms of the two pairs show that in both the real and the fake pair, the pixel intensity distribution gets squeezed and shifted in similar ways. Information is lost in both darker versions.

Top left: real, light; top right: real, dark; bottom left: real, light; bottom right: fake, dark

Sample training pairs:

SID Training Results

Training specs: 74 pairs of bright and dark images, with random cropping to extend the dataset during training; 500 epochs; all other hyperparameters the same as in the original paper.
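The original SID implementation is in TensorFlow; purely to illustrate the objective, here is a PyTorch-style sketch of one training step with the paper’s L1 loss, assuming `sid` is a U-Net-like module mapping dark RGB crops to bright ones.

```python
import torch
import torch.nn.functional as F

def train_step(sid, optimizer, dark_batch, bright_batch):
    """One training step: predict a well-lit image from the dark input and
    minimize the L1 (mean absolute error) against the ground-truth crop.

    Batches are assumed to be N x 3 x 416 x 416 tensors in [0, 1].
    """
    sid.train()
    optimizer.zero_grad()
    restored = sid(dark_batch)
    loss = F.l1_loss(restored, bright_batch)   # same L1 objective as the paper
    loss.backward()
    optimizer.step()
    return loss.item()
```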

That looks good, how about testing these images on YOLO?

SID+YOLO: Qualitative Results

Below are sample cases in which SID has a different impact on, or interaction with, YOLO. I used the pre-trained YOLO weights (trained on COCO).

Scenario 1: when SID improves YOLO

Scenario 2: when SID improves YOLO by a small margin

Scenario 3: when SID makes YOLO worse

YOLO correctly detects the keyboard in the dark image, but fails to do so in the reconstructed image (middle picture). It also makes another error, classifying the books as a microwave.

Scenario 4: when SID makes no improvement

Even in low-light conditions, YOLO correctly detects the two objects. I checked the confidence scores across the three images, and they were all very close, in the high 90s. It seems that darkness has less impact on (1) classes with higher APs (like people and dogs) and (2) larger instances (relative to the image size).

Scenario 5: when both YOLO and SID fail

Even in normal lighting, YOLO thinks the dove is a teddy bear.

Limitations and Future Improvements

  1. Prepare a suitable test dataset to quantitatively analyze the impact of SID on YOLO.
  2. Instead of training SID and YOLO separately, stack them together and train end-to-end.
  3. Use more image pairs (both in number and in content, lighting variety) to train the SID model, in order to improve the quality of the reconstructed images.
  4. Texture details are lost in the SID-reconstructed images, which hurts YOLO’s performance. The current SID uses L1 loss; what about replacing it with a human-perception-based loss, or adding a GAN loss to encourage photorealism (e.g. DeePSiM)? A rough sketch of a perceptual loss is shown below.
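To illustrate that last point, here is a hedged sketch of a VGG-feature perceptual loss that could replace or complement the plain L1 objective. The layer cut-off and the 0.1 weighting are assumptions of mine, not values from the SID paper or DeePSiM.

```python
import torch
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(torch.nn.Module):
    """Compare images in a frozen, pre-trained VGG16 feature space
    instead of (or in addition to) raw pixel space."""
    def __init__(self, cut=16):                          # layers up to relu3_3
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
        self.features = vgg[:cut].eval()
        for p in self.features.parameters():
            p.requires_grad = False                       # frozen feature extractor

    def forward(self, restored, target):
        # ImageNet mean/std normalization is omitted here for brevity.
        return F.l1_loss(self.features(restored), self.features(target))

# Possible combined objective: pixel L1 plus a small perceptual term.
# loss = F.l1_loss(restored, bright) + 0.1 * perceptual(restored, bright)
```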
