YOLOv5

What is YOLO

YOLO (You Only Look Once) is a CNN-based object detector created by Joseph Redmon (who stepped away from CV research in 2020).

The original YOLO was the first object detection network to treat drawing bounding boxes and identifying class labels as a single problem, solved in one forward pass of one end-to-end network.

It was implemented in Darknet, a custom framework written in C and CUDA, and was aimed at real-time detection. YOLO predicts object bounding boxes together with class labels directly from images.

YOLO divides the input image into a grid. Each grid cell predicts objects independently, and the per-cell detections are then combined into the final result.
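As a rough illustration of the grid idea, the sketch below finds which cell contains the center of an object. The function name `responsible_cell` is hypothetical, and the default grid size of 7 is an assumption taken from the original YOLO paper; other versions use other sizes.

```python
def responsible_cell(x_center, y_center, grid_size=7):
    """Return (row, col) of the grid cell that contains a normalized box center.

    x_center and y_center are fractions of the image width and height.
    """
    if not (0.0 <= x_center <= 1.0 and 0.0 <= y_center <= 1.0):
        raise ValueError("center coordinates must be normalized to [0, 1]")
    # Clamp so a center exactly on the right or bottom edge stays in the last cell.
    col = min(int(x_center * grid_size), grid_size - 1)
    row = min(int(y_center * grid_size), grid_size - 1)
    return row, col


if __name__ == "__main__":
    # An object centered in the middle of the image on a 7x7 grid.
    print(responsible_cell(0.5, 0.5))
```

In this simplified view, only the cell returned here is asked to predict the object, which is why detections from all cells have to be merged afterwards.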

YOLOv3

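The original YOLO ended in fully connected layers (the equivalent of nn.Linear in PyTorch), which limited it to a small, fixed number of bounding boxes per grid cell.

To predict multiple boxes per location, later versions pair the grid with anchor boxes: predefined box shapes that each grid cell refines instead of predicting box dimensions from scratch.

Starting with YOLOv2, the fully connected layers were removed and the detection head became fully convolutional (nn.Conv2d in PyTorch terms), following the one-stage detection paradigm also used by SSD.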

YOLOv2 introduced several iterative improvements: BatchNorm, higher resolution training, and anchor boxes.
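The anchor-box idea can be sketched as follows, assuming the parameterization described in the YOLOv2 paper: a sigmoid keeps the predicted center inside its grid cell, and the anchor width and height are scaled by an exponential. The function name `decode_box` and the convention that anchors are given in grid-cell units are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cell_col, cell_row, anchor_w, anchor_h, grid_size):
    """Turn raw network outputs into a normalized (x_center, y_center, w, h) box.

    (tx, ty, tw, th) are the raw outputs for one anchor in one grid cell.
    anchor_w and anchor_h are the anchor dimensions in grid-cell units.
    """
    # The sigmoid bounds the offset to (0, 1), so the center stays inside the cell.
    bx = (sigmoid(tx) + cell_col) / grid_size
    by = (sigmoid(ty) + cell_row) / grid_size
    # The anchor shape is scaled multiplicatively; exp keeps the size positive.
    bw = anchor_w * math.exp(tw) / grid_size
    bh = anchor_h * math.exp(th) / grid_size
    return bx, by, bw, bh


if __name__ == "__main__":
    # All-zero outputs give the anchor itself, centered in cell (6, 6) of a 13x13 grid.
    print(decode_box(0.0, 0.0, 0.0, 0.0, 6, 6, 2.0, 3.0, 13))
```

With all raw outputs at zero the decoded box is simply the anchor placed at the middle of its cell, so the network only has to learn corrections to a reasonable starting shape.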

YOLOv3 further advanced the architecture with multi-scale predictions and the Darknet-53 backbone.
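To make the multi-scale prediction concrete, the sketch below works out the output tensor shape at each scale. The defaults (416-pixel input, strides of 32, 16 and 8, three anchors per scale, 80 classes) are assumptions matching the commonly cited COCO configuration of YOLOv3, and the helper names are hypothetical.

```python
def yolov3_output_shapes(input_size=416, num_classes=80,
                         anchors_per_scale=3, strides=(32, 16, 8)):
    """Return a (channels, grid_h, grid_w) tuple for each detection scale."""
    shapes = []
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError("input_size must be divisible by every stride")
        grid = input_size // stride
        # Each anchor predicts 4 box values, 1 objectness score, and the class scores.
        channels = anchors_per_scale * (5 + num_classes)
        shapes.append((channels, grid, grid))
    return shapes


def total_predictions(shapes, anchors_per_scale=3):
    """Count the candidate boxes produced across all scales."""
    return sum(anchors_per_scale * h * w for _, h, w in shapes)


if __name__ == "__main__":
    shapes = yolov3_output_shapes()
    print(shapes)
    print(total_predictions(shapes))
```

The coarse grid is suited to large objects and the fine grid to small ones, at the cost of many more candidate boxes that must be filtered afterwards.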

YOLOv4

YOLOv4 (Alexey Bochkovskiy et al.) introduced several key improvements: CSPDarknet53 backbone, PANet for feature aggregation, Mosaic data augmentation, and CIoU loss. The linked article covers popular data augmentation techniques from that era.
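The CIoU loss mentioned above extends plain IoU with a penalty for the distance between box centers and a penalty for mismatched aspect ratios. The following is a minimal sketch of the published CIoU formula for a single pair of boxes, not the YOLOv4 training code; the function name `ciou` and the (x1, y1, x2, y2) corner format are assumptions.

```python
import math


def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU between two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Plain IoU: intersection area over union area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return iou - rho2 / c2 - alpha * v


if __name__ == "__main__":
    # The loss used for training would be 1 - ciou(predicted, target).
    print(1 - ciou((0, 0, 1, 1), (2, 2, 3, 3)))
```

Unlike plain IoU, this value still changes when the boxes do not overlap at all, which gives the optimizer a useful signal early in training.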

YOLO format (darknet)

This format contains one text file per image, with one line per object: a numeric class ID followed by the center x, center y, width, and height of the bounding box. A separate label map maps the numeric IDs to human-readable strings. The box values are normalized to the range [0, 1] relative to the image width and height, which keeps them valid after the image is scaled or stretched. The format became popular because the Darknet framework's implementations of the various YOLO models use it.
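As an example of working with this format, the sketch below converts one annotation line into pixel coordinates. It assumes the usual line layout of `class_id x_center y_center width height`; the function name `yolo_line_to_pixel_box` and the sample values are made up for illustration.

```python
def yolo_line_to_pixel_box(line, image_width, image_height):
    """Parse one YOLO annotation line into (class_id, (x_min, y_min, x_max, y_max))."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError("expected 5 fields: class_id x_center y_center width height")
    class_id = int(parts[0])
    x_center, y_center, width, height = (float(p) for p in parts[1:])
    # Convert from normalized center/size to pixel corner coordinates.
    x_min = (x_center - width / 2) * image_width
    y_min = (y_center - height / 2) * image_height
    x_max = (x_center + width / 2) * image_width
    y_max = (y_center + height / 2) * image_height
    return class_id, (x_min, y_min, x_max, y_max)


if __name__ == "__main__":
    # A hypothetical annotation for a 640x480 image.
    print(yolo_line_to_pixel_box("0 0.5 0.5 0.25 0.5", 640, 480))
```

Because the stored values are fractions of the image size, the same line can be reused unchanged after the image is resized; only `image_width` and `image_height` differ.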

The same format (or very similar variants) is still widely used by modern YOLO implementations, including Ultralytics YOLOv5 and later versions.

YOLOv5

The initial release of YOLOv5 is fast, accurate, and easy to use. While YOLOv5 has yet to introduce novel model architecture improvements to the family of YOLO models, it introduces a new PyTorch training and deployment framework that improves the practical state of the art for object detectors.

It was developed by Glenn Jocher at Ultralytics. Mosaic augmentation (introduced in the YOLOv4 paper) was further popularised in the YOLOv5 training pipeline, which made the models very accessible and easy to train.
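Mosaic augmentation combines four training images into one, so their labels have to be moved into the combined image as well. The sketch below shows only that label remapping for a deliberately simplified case: a fixed 2x2 tiling where each source image is shrunk to exactly one quadrant. Real training pipelines typically use a random split point and cropping, so this is an assumption-laden illustration rather than the YOLOv5 implementation, and `remap_to_mosaic` is a hypothetical name.

```python
def remap_to_mosaic(box, quadrant):
    """Map a normalized YOLO box from a source image into a 2x2 mosaic.

    box is (class_id, x_center, y_center, width, height) in the source image.
    quadrant is (row, col), each 0 or 1, naming the tile the image is placed in.
    """
    class_id, x_center, y_center, width, height = box
    row, col = quadrant
    if row not in (0, 1) or col not in (0, 1):
        raise ValueError("quadrant indices must be 0 or 1")
    # Each tile covers half of the mosaic in each direction, so sizes halve
    # and centers are shifted by the tile offset before halving.
    return (class_id,
            (x_center + col) / 2,
            (y_center + row) / 2,
            width / 2,
            height / 2)


if __name__ == "__main__":
    # A box from the image placed in the bottom-left tile.
    print(remap_to_mosaic((3, 0.5, 0.5, 0.2, 0.4), (1, 0)))
```

Since the annotations are normalized, the remapping needs no knowledge of the pixel sizes of the source images.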