Dataset: AI-TOD-R contains 28,036 images annotated with 752,460 oriented bounding boxes across 8 classes, and it has the smallest mean object size ($10.6^{2}$ pixels) among existing oriented object detection datasets.
Benchmark: This benchmark covers both fully-supervised and label-efficient (SSOD, SAOD, WSOD) methods. Here, ``L.'', ``U.'', ``S. L.'', and ``C. L.'' denote labelled, unlabelled, sparsely labelled, and coarsely labelled images, respectively.
Dynamic Unbiased Learning: Compared to prior art (left), our proposed pipeline (right) mitigates the model's learning bias against oriented tiny objects with a dynamically updated prior and a coarse-to-fine sample learning scheme.
Images inevitably contain objects at extremely tiny scales when observation approaches the camera's physical limits. Although extreme, this scenario is ubiquitous in the real world, from micro-vision (medical and cell imaging) to macro-vision (drone and satellite imagery). In these specialized domains, imaging typically adopts a bird's-eye view to capture objects' primary features, so objects appear in arbitrary orientations. These oriented tiny objects pose severe challenges to object detection:
These challenges give rise to the following problems:
AI-TOD-R is obtained via a semi-automatic labelling process, where coarse labels are generated by H2RBox-v2 and fine labels are refined manually. AI-TOD-R contains 28,036 images annotated with 752,460 oriented bounding boxes across 8 classes, and it has the smallest mean object size ($10.6^{2}$ pixels) among existing oriented object detection datasets.
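For concreteness, here is a minimal sketch of how the mean object size statistic (the square root of the oriented box area, following the AI-TOD convention) could be computed from the annotations. The DOTA-style annotation format and the `annotations/` path are illustrative assumptions, not the dataset's actual tooling.

```python
# Minimal sketch: mean object size over all oriented annotations.
# Assumes DOTA-style files (x1 y1 x2 y2 x3 y3 x4 y4 class [difficulty]
# per line); the directory layout is hypothetical.
import glob
import math

import cv2
import numpy as np

sizes = []
for path in glob.glob("annotations/*.txt"):  # hypothetical path
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 9:
                continue  # skip header or metadata lines
            pts = np.array(parts[:8], dtype=np.float32).reshape(4, 2)
            # Oriented box = minimum-area rectangle around the polygon.
            _, (w, h), _ = cv2.minAreaRect(pts)
            # Object size defined as sqrt(w * h), as in AI-TOD.
            sizes.append(math.sqrt(w * h))

print(f"mean object size: {np.mean(sizes):.1f} px")  # ~10.6 on AI-TOD-R
```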
Statistical analysis of AI-TOD-R. In addition to extremely tiny object sizes, this dataset also exhibits arbitrary object orientations, dense arrangements, and class imbalance challenges.
More examples from AI-TOD-R:
Their tiny size and low confidence make oriented tiny objects easily suppressed or ignored during model training. The vanilla optimization process inevitably traps them in the dilemmas of biased prior setting and biased sample learning, severely impeding detection performance:
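To see the biased prior setting concretely, consider the toy, axis-aligned computation below (rotation and centre offsets only lower the overlap further): even a perfectly centred object of roughly the dataset's mean size comes nowhere near a typical 0.5 positive-assignment IoU threshold against the smallest preset anchor. The anchor size and threshold are common defaults, used here as assumptions.

```python
# Toy illustration of the biased prior: even a perfectly centred tiny
# object barely overlaps a typical preset anchor, so max-IoU assignment
# leaves it without positive samples.
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

tiny_gt = (12, 12, 20, 20)   # an 8x8 object; AI-TOD-R's mean is ~10.6x10.6
anchor = (0, 0, 32, 32)      # a typical smallest anchor (stride-8 level)
print(iou(tiny_gt, anchor))  # 0.0625 -- far below a 0.5 positive threshold
```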
Towards scale-unbiased sample learning, we reformulate the training process into a Dynamic Coarse-to-Fine Learning (DCFL) pipeline with a dynamically updated prior and a coarse-to-fine sample learning scheme; a schematic sketch follows.
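The snippet below is only a schematic sketch of such a coarse-to-fine assignment step under simplifying assumptions, not the paper's exact implementation: candidates are first gathered coarsely by centre distance, then re-ranked by a model-predicted matching quality so the positive set adapts as training progresses. The function name and the `coarse_k`/`fine_k` budgets are illustrative.

```python
# Schematic sketch of a coarse-to-fine assignment step (not the paper's
# exact implementation): a coarse candidate set is picked by distance,
# then refined with a quality score predicted by the model.
import torch

def coarse_to_fine_assign(prior_xy, gt_xy, pred_quality,
                          coarse_k=16, fine_k=4):
    """prior_xy: (P, 2) prior centres; gt_xy: (G, 2) GT centres;
    pred_quality: (P, G) model-estimated matching quality (e.g. IoU of
    current predictions with each GT). Returns a (P, G) bool mask of
    positive prior/GT pairs."""
    dist = torch.cdist(prior_xy, gt_xy)              # (P, G)
    # Coarse step: for each GT keep the coarse_k closest priors,
    # regardless of how small the prior/GT overlap is.
    coarse_idx = dist.topk(coarse_k, dim=0, largest=False).indices
    pos = torch.zeros_like(dist, dtype=torch.bool)
    pos.scatter_(0, coarse_idx, True)
    # Fine step: inside the coarse set, re-rank by predicted quality so
    # the positive set evolves with the model during training.
    quality = pred_quality.masked_fill(~pos, float("-inf"))
    fine_idx = quality.topk(fine_k, dim=0).indices
    fine = torch.zeros_like(pos)
    fine.scatter_(0, fine_idx, True)
    return fine & pos
```

In the actual pipeline the prior itself is also updated dynamically; please refer to the paper for the full formulation.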
Experiments are performed on AI-TOD-R for benchmarking, and on seven other datasets to investigate DCFL's superiority in detecting oriented tiny objects, its adaptability to various architectures, and its generalization to different scenarios.
We benchmark both fully-supervised paradigms (e.g., architecture, representation, refinement, assignment, backbone) and label-efficient paradigms (e.g., SSOD, SAOD, WSOD) with this dataset.
ID | Method | Backbone | Schedule | AP | AP$_{0.5}$ | AP$_{0.75}$ | AP$_{vt}$ | AP$_{t}$ | AP$_{s}$ | AP$_{m}$ | #Params. |
---|---|---|---|---|---|---|---|---|---|---|---|
Architecture: | |||||||||||
1 | RetinaNet-O | R50 | 1× | 7.3 | 23.9 | 1.8 | 2.2 | 5.9 | 11.1 | 15.4 | 36.3M |
2 | FCOS-O | R50 | 1× | 11.0 | 33.6 | 3.7 | 3.0 | 8.9 | 15.7 | 22.0 | 31.9M |
3 | Faster R-CNN-O | R50 | 1× | 10.2 | 30.8 | 3.6 | 0.6 | 7.8 | 19.0 | 22.9 | 41.1M |
4 | RoI Transformer | R50 | 1× | 10.5 | 34.0 | 2.2 | 1.1 | 8.8 | 16.9 | 20.3 | 55.1M |
5 | Oriented R-CNN | R50 | 1× | 11.2 | 33.2 | 4.3 | 0.6 | 9.1 | 19.5 | 23.2 | 41.1M |
6 | Deformable DETR-O | R50 | 1× | 8.4 | 26.7 | 2.0 | 4.8 | 9.3 | 8.6 | 7.3 | 40.8M |
7 | ARS-DETR | R50 | 1× | 14.3 | 41.1 | 5.8 | 6.3 | 14.5 | 17.6 | 18.7 | 41.1M |
Representation: | |||||||||||
8 | KLD (RetinaNet-O) | R50 | 1× | 7.8 | 24.8 | 2.3 | 3.1 | 6.7 | 10.3 | 15.8 | 36.3M |
9 | KFIoU (RetinaNet-O) | R50 | 1× | 8.1 | 25.2 | 2.8 | 2.0 | 6.6 | 12.3 | 17.1 | 36.3M |
10 | Oriented RepPoints | R50 | 1× | 13.0 | 40.3 | 4.2 | 5.2 | 12.2 | 16.8 | 21.4 | 36.6M |
11 | PSC (RetinaNet-O) | R50 | 1× | 4.5 | 15.8 | 1.2 | 1.0 | 3.7 | 8.2 | 12.7 | 36.4M |
12 | Gliding Vertex | R50 | 1× | 8.1 | 27.4 | 2.1 | 0.9 | 6.7 | 14.7 | 17.9 | 41.1M |
Refinement: | |||||||||||
13 | R3Det | R50 | 1× | 8.1 | 24.4 | 1.8 | 0.4 | 6.6 | 15.2 | 19.1 | 38.6M |
14 | S2A-Net | R50 | 1× | 10.8 | 33.4 | 3.3 | 4.3 | 11.2 | 13.0 | 16.0 | 38.6M |
Assignment: | |||||||||||
15 | ATSS-O | R50 | 1× | 10.9 | 33.8 | 3.1 | 2.7 | 8.9 | 15.5 | 19.4 | 36.0M |
16 | SASM | R50 | 1× | 11.4 | 35.0 | 3.7 | 3.6 | 10.2 | 15.4 | 19.8 | 36.6M |
17 | CFA | R50 | 1× | 12.4 | 38.7 | 4.0 | 5.0 | 11.9 | 16.5 | 18.8 | 36.6M |
Backbone: | |||||||||||
18 | Oriented R-CNN | R101 | 1× | 11.2 | 33.0 | 4.1 | 0.5 | 8.9 | 19.8 | 24.4 | 60.1M |
19 | Oriented R-CNN | Swin-T | 1× | 12.0 | 34.6 | 4.6 | 0.7 | 9.9 | 20.8 | 25.3 | 44.8M |
20 | Oriented R-CNN | LSKNet-T | 1× | 11.1 | 33.4 | 3.8 | 0.6 | 9.2 | 18.9 | 22.6 | 21.0M |
21 | ReDet | ReR50 | 1× | 11.6 | 32.8 | 4.8 | 1.4 | 9.5 | 19.4 | 23.2 | 31.6M |
Ours: | |||||||||||
22 | DCFL (RetinaNet-O) | R50 | 1× | 12.3 (+5.0) | 36.7 (+12.8) | 4.5 (+2.7) | 4.3 | 10.7 | 17.2 | 22.2 | 36.1M |
23 | DCFL (RetinaNet-O) | R50 | 40e | 15.2 (+7.9) | 44.9 (+21.0) | 5.1 (+3.3) | 4.9 | 13.1 | 19.7 | 25.9 | 36.1M |
24 | DCFL (Oriented R-CNN) | R50 | 1× | 15.7 (+4.5) | 47.0 (+13.8) | 5.8 (+1.5) | 6.3 | 14.8 | 19.6 | 22.5 | 41.1M |
25 | DCFL (Oriented R-CNN) | R50 | 40e | 17.1 (+5.9) | 49.0 (+15.8) | 7.2 (+2.9) | 6.4 | 16.0 | 21.6 | 24.9 | 41.1M |
26 | DCFL ($\rm{S^2A}$-Net) | R50 | 1× | 13.7 (+2.9) | 39.7 (+6.3) | 5.3 (+2.0) | 4.7 | 12.4 | 18.6 | 22.6 | 38.6M |
27 | DCFL ($\rm{S^2A}$-Net) | R50 | 40e | 17.5 (+6.7) | 49.6 (+16.2) | 7.9 (+4.6) | 6.5 | 15.7 | 22.6 | 27.4 | 38.6M |
Method | Category | AP (10% OBB) | AP$_{0.5}$ | AP$_{vt}$ | AP$_{t}$ | AP (20% OBB) | AP$_{0.5}$ | AP$_{vt}$ | AP$_{t}$ | AP (30% OBB) | AP$_{0.5}$ | AP$_{vt}$ | AP$_{t}$ | AP (100% HBB) | AP$_{0.5}$ | AP$_{vt}$ | AP$_{t}$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unbiased Teacher | SSOD | 7.6 | 24.7 | 0.4 | 6.0 | 8.1 | 24.7 | 0.5 | 6.1 | 8.1 | 25.4 | 0.4 | 6.1 | - | - | - | - |
Soft Teacher | SSOD | 9.4 | 29.0 | 0.3 | 7.6 | 10.2 | 31.1 | 0.5 | 7.9 | 10.4 | 32.2 | 0.6 | 7.8 | - | - | - | - |
SOOD | SSOD | 9.4 | 29.3 | 2.8 | 8.1 | 12.1 | 35.7 | 3.5 | 10.2 | 13.0 | 38.8 | 3.9 | 11.1 | - | - | - | - |
Co-mining | SAOD | 6.4 | 20.4 | 0.5 | 4.4 | 8.0 | 24.1 | 0.4 | 6.1 | 8.2 | 25.0 | 0.4 | 6.8 | - | - | - | - |
H2RBox | WSOD | - | - | - | - | - | - | - | - | - | - | - | - | 11.4 | 39.1 | 3.4 | 9.4
H2RBox-v2 | WSOD | - | - | - | - | - | - | - | - | - | - | - | - | 11.7 | 38.2 | 4.6 | 9.5
DCFL shows superior performance across diverse object detection scenarios, including small oriented object detection (AI-TOD-R, SODA-A), oriented object detection with massive numbers of tiny objects (DOTA-v1.5, DOTA-v2.0), generic oriented object detection (DOTA-v1.0, DIOR-R), and horizontal object detection (VisDrone, MS COCO). Notably, DCFL introduces no additional inference cost over the baseline. Please refer to the paper for details.
DCFL significantly improves detection performance on oriented tiny objects by reducing both false negative and false positive predictions.
Visualization analysis of the predicted results. First row: Oriented R-CNN; second row: DCFL. True positive, false negative, and false positive predictions are shown in green, red, and blue boxes, respectively.
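As a reference for how such a visualization can be produced, here is a minimal sketch: predictions are greedily matched one-to-one to ground truths at an IoU threshold; matched predictions are TPs (green), unmatched predictions are FPs (blue), and unmatched ground truths are FNs (red). It reuses the axis-aligned `iou()` helper from the earlier sketch; a rotated IoU would be used for oriented boxes in practice, and the 0.5 threshold is an assumption.

```python
# Minimal sketch: split predictions into TP/FP and unmatched ground
# truths into FN for visualization. Reuses iou() from the sketch above;
# oriented boxes would need a rotated IoU instead.
def split_tp_fp_fn(preds, gts, iou_thr=0.5):
    """preds: list of (x1, y1, x2, y2, score), sorted by descending score;
    gts: list of (x1, y1, x2, y2). Greedy one-to-one matching."""
    matched = set()
    tp, fp = [], []
    for box in preds:
        best_iou, best_gt = 0.0, None
        for g, gt in enumerate(gts):
            if g not in matched:
                overlap = iou(box[:4], gt)
                if overlap > best_iou:
                    best_iou, best_gt = overlap, g
        if best_iou >= iou_thr:
            matched.add(best_gt)
            tp.append(box)   # drawn in green
        else:
            fp.append(box)   # drawn in blue
    fn = [gt for g, gt in enumerate(gts) if g not in matched]  # drawn in red
    return tp, fp, fn
```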
@article{xu2024oriented,
title={Oriented Tiny Object Detection: A Dataset, Benchmark, and Dynamic Unbiased Learning},
author={Xu, Chang and Zhang, Ruixiang and Yang, Wen and Zhu, Haoran and Xu, Fang and Ding, Jian and Xia, Gui-Song},
journal={arXiv preprint},
year={2024}
}