Oriented Tiny Object Detection:
A Dataset, Benchmark, and Dynamic Unbiased Learning

Wuhan University, EPFL, KAUST
Under review
*Indicates Equal Contribution



Motivation

Images will inevitably contain objects at extremely tiny scales when observations approach the camera's physical limit. Although extreme, this scenario is ubiquitous in the real world, from micro-vision (medical and cell imaging) to macro-vision (drone and satellite imagery). In these specialized domains, imaging typically adopts a bird's-eye view to capture objects' primary features, so the objects appear in arbitrary orientations. These oriented tiny objects pose severe challenges to object detection:

  • Difficult Label Acquisition: The limited appearance information makes labelling quite difficult.
  • Weak Feature Representation: The features of oriented tiny objects are easily lost in deep networks.
  • Dense Object Arrangement: There can be thousands of objects per image due to the overhead view.
Application Scenarios


These challenges give rise to the following problems:

  • Existing methods struggle to deliver satisfactory performance: 77% of the objects in DOTA-v2 fall in the size range of $10^{2}$–$50^{2}$ pixels, yet the SOTA performance remains below 30% $\rm{AP^{@50:5:95}}$.
  • Detecting oriented tiny objects remains a hard nut to crack, yet the community lacks task-specific datasets and benchmarks to promote the development of detection methods.
This motivates us to build a task-specific dataset, benchmark and method for oriented tiny objects.

Dataset

AI-TOD-R is obtained via a semi-automatic labelling process, where coarse labels are generated by H2RBox-v2 and then refined manually into fine labels. AI-TOD-R contains 28,036 images annotated with 752,460 oriented bounding boxes across 8 classes, and it has the smallest mean object size ($10.6^{2}$ pixels) among existing oriented object detection datasets.
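For concreteness, here is a minimal loading sketch. It assumes AI-TOD-R ships DOTA-style annotation text files (one object per line: eight polygon coordinates, a class name, and a difficulty flag); this format is an assumption for illustration, so please check the official release before relying on it.

```python
# A minimal loading sketch, ASSUMING AI-TOD-R follows the DOTA annotation
# convention (per line: x1 y1 x2 y2 x3 y3 x4 y4 class difficulty).
import cv2
import numpy as np

def parse_dota_line(line):
    """Parse one annotation line into an oriented box (cx, cy, w, h, angle_deg) and a label."""
    parts = line.split()
    poly = np.array(list(map(float, parts[:8])), dtype=np.float32).reshape(4, 2)
    label = parts[8]
    # minAreaRect recovers the oriented bounding box from the 4-point polygon.
    (cx, cy), (w, h), angle = cv2.minAreaRect(poly)
    return (cx, cy, w, h, angle), label

obb, label = parse_dota_line("10 10 22 10 22 18 10 18 ship 0")
print(label, obb)  # ship, a ~12x8 oriented box centered near (16, 14)
```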

AI-TOD-R Statistics

AI-TOD-R's statistical analysis. In addition to extremely tiny object sizes, this dataset also exhibits arbitrary object orientations, dense arrangement, and class imbalance challenges.

More examples from AI-TOD-R:

AI-TOD-R Visualization

Method

The objects' tiny size and low confidence make them easily suppressed or ignored during model training. The vanilla optimization process inevitably pushes them into significantly biased prior setting and biased sample learning dilemmas, severely impeding detection performance:

  • Biased prior setting: Most prior positions deviate from tiny objects' main area under the fixed prior setting constrained by feature strides (e.g., 8, 16, 32, 64); the toy example after this list makes this concrete.
  • Biased sample learning: Oriented tiny objects are assigned far fewer positive samples than other objects due to sub-optimal measurement and assignment strategies.
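The snippet below (illustrative, not from the paper's codebase) quantifies the prior-setting bias by counting how many stride-aligned prior centers fall inside an object's box; the grid spacing, image size, and box sizes are arbitrary choices.

```python
# A toy example of biased prior setting: with prior centers fixed to a
# stride-8 grid, a very tiny object may contain no prior center at all,
# while a medium object covers dozens of them.
import numpy as np

def centers_inside(box_xyxy, stride=8, img_size=256):
    """Count stride-aligned prior centers that fall inside an axis-aligned box."""
    cs = np.arange(stride / 2, img_size, stride)  # centers at stride/2 + i*stride
    xs, ys = np.meshgrid(cs, cs)
    x1, y1, x2, y2 = box_xyxy
    inside = (xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)
    return int(inside.sum())

print(centers_inside((13, 13, 19, 19)))  # 6x6 object: 0 prior centers
print(centers_inside((9, 9, 19, 19)))    # 10x10 object: 1 prior center
print(centers_inside((9, 9, 73, 73)))    # 64x64 object: 64 prior centers
```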

Towards scale-unbiased sample learning, we reformulate the training process into a Dynamic Coarse-to-Fine Learning (DCFL) pipeline with:

  • Dynamic prior setting: Prior positions are dynamically updated to better accommodate tiny objects' extreme shapes.
  • Coarse-to-fine sample learning: The coarse step warrants sufficient and diverse positive samples for each object; the fine step guarantees learning quality by fitting each ground truth (gt) with a Dynamic Gaussian Mixture Model as a constraint to filter out low-quality samples (see the sketch after this list).
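The following is a simplified sketch of the coarse-to-fine idea under a single-Gaussian approximation, not the released DCFL implementation: each gt's oriented box is converted to a 2-D Gaussian, the coarse step keeps the k priors nearest in Mahalanobis distance, and the fine step drops candidates whose Gaussian density is low. The helpers obb_to_gaussian and coarse_to_fine, and the values of k and density_thr, are hypothetical.

```python
# Simplified coarse-to-fine sample selection (NOT the released DCFL code).
import numpy as np

def obb_to_gaussian(cx, cy, w, h, theta):
    """Convert an oriented box to a 2-D Gaussian (mean, covariance)."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([(w / 2) ** 2, (h / 2) ** 2])  # half-axis lengths as variances
    return np.array([cx, cy]), R @ S @ R.T

def coarse_to_fine(prior_centers, gt_obb, k=16, density_thr=0.3):
    mu, sigma = obb_to_gaussian(*gt_obb)
    inv = np.linalg.inv(sigma)
    d = prior_centers - mu                      # (N, 2) offsets to the gt center
    maha = np.einsum('ni,ij,nj->n', d, inv, d)  # squared Mahalanobis distance
    # Coarse step: a sufficient, diverse candidate pool (k nearest priors).
    coarse = np.argsort(maha)[:k]
    # Fine step: keep candidates whose normalized density is high enough.
    density = np.exp(-0.5 * maha[coarse])       # in (0, 1], 1 at the gt center
    return coarse[density > density_thr]

priors = np.stack(np.meshgrid(np.arange(4, 64, 8),
                              np.arange(4, 64, 8)), -1).reshape(-1, 2)
pos = coarse_to_fine(priors.astype(float), (32, 32, 12, 6, np.pi / 6))
print(len(pos), "positive priors selected")
```

Modeling boxes as Gaussians makes the sample-quality measure aware of both an object's scale and its orientation, which a fixed stride-based center-sampling rule is not.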

Method Overview


Benchmark and Experiments


Experiments are performed on AI-TOD-R for benchmarking, and also on seven other datasets to investigate DCFL's superiority in detecting oriented tiny objects, its adaptability to various architectures, and its generalization to different scenarios.

AI-TOD-R Benchmark

We benchmark both fully-supervised paradigms (e.g., architecture, representation, refinement, assignment, backbone) and label-efficient paradigms (e.g., semi-supervised (SSOD), sparsely annotated (SAOD), and weakly-supervised (WSOD) object detection) with this dataset.


Main results of fully-supervised methods on AI-TOD-R. APvt, APt, APs, and APm denote AP on very tiny, tiny, small, and medium objects, respectively; "-" in the Schedule column denotes the default training schedule.

| ID | Method | Backbone | Schedule | AP | AP50 | AP75 | APvt | APt | APs | APm | #Params |
|----|--------|----------|----------|------|------|------|------|------|------|------|---------|
| | Architecture: | | | | | | | | | | |
| 1 | RetinaNet-O | R50 | - | 7.3 | 23.9 | 1.8 | 2.2 | 5.9 | 11.1 | 15.4 | 36.3M |
| 2 | FCOS-O | R50 | - | 11.0 | 33.6 | 3.7 | 3.0 | 8.9 | 15.7 | 22.0 | 31.9M |
| 3 | Faster R-CNN-O | R50 | - | 10.2 | 30.8 | 3.6 | 0.6 | 7.8 | 19.0 | 22.9 | 41.1M |
| 4 | RoI Transformer | R50 | - | 10.5 | 34.0 | 2.2 | 1.1 | 8.8 | 16.9 | 20.3 | 55.1M |
| 5 | Oriented R-CNN | R50 | - | 11.2 | 33.2 | 4.3 | 0.6 | 9.1 | 19.5 | 23.2 | 41.1M |
| 6 | Deformable DETR-O | R50 | - | 8.4 | 26.7 | 2.0 | 4.8 | 9.3 | 8.6 | 7.3 | 40.8M |
| 7 | ARS-DETR | R50 | - | 14.3 | 41.1 | 5.8 | 6.3 | 14.5 | 17.6 | 18.7 | 41.1M |
| | Representation: | | | | | | | | | | |
| 8 | KLD (RetinaNet-O) | R50 | - | 7.8 | 24.8 | 2.3 | 3.1 | 6.7 | 10.3 | 15.8 | 36.3M |
| 9 | KFIoU (RetinaNet-O) | R50 | - | 8.1 | 25.2 | 2.8 | 2.0 | 6.6 | 12.3 | 17.1 | 36.3M |
| 10 | Oriented RepPoints | R50 | - | 13.0 | 40.3 | 4.2 | 5.2 | 12.2 | 16.8 | 21.4 | 36.6M |
| 11 | PSC (RetinaNet-O) | R50 | - | 4.5 | 15.8 | 1.2 | 1.0 | 3.7 | 8.2 | 12.7 | 36.4M |
| 12 | Gliding Vertex | R50 | - | 8.1 | 27.4 | 2.1 | 0.9 | 6.7 | 14.7 | 17.9 | 41.1M |
| | Refinement: | | | | | | | | | | |
| 13 | R3Det | R50 | - | 8.1 | 24.4 | 1.8 | 0.4 | 6.6 | 15.2 | 19.1 | 38.6M |
| 14 | S2A-Net | R50 | - | 10.8 | 33.4 | 3.3 | 4.3 | 11.2 | 13.0 | 16.0 | 38.6M |
| | Assignment: | | | | | | | | | | |
| 15 | ATSS-O | R50 | - | 10.9 | 33.8 | 3.1 | 2.7 | 8.9 | 15.5 | 19.4 | 36.0M |
| 16 | SASM | R50 | - | 11.4 | 35.0 | 3.7 | 3.6 | 10.2 | 15.4 | 19.8 | 36.6M |
| 17 | CFA | R50 | - | 12.4 | 38.7 | 4.0 | 5.0 | 11.9 | 16.5 | 18.8 | 36.6M |
| | Backbone: | | | | | | | | | | |
| 18 | Oriented R-CNN | R101 | - | 11.2 | 33.0 | 4.1 | 0.5 | 8.9 | 19.8 | 24.4 | 60.1M |
| 19 | Oriented R-CNN | Swin-T | - | 12.0 | 34.6 | 4.6 | 0.7 | 9.9 | 20.8 | 25.3 | 44.8M |
| 20 | Oriented R-CNN | LSKNet-T | - | 11.1 | 33.4 | 3.8 | 0.6 | 9.2 | 18.9 | 22.6 | 21.0M |
| 21 | ReDet | ReR50 | - | 11.6 | 32.8 | 4.8 | 1.4 | 9.5 | 19.4 | 23.2 | 31.6M |
| | Ours: | | | | | | | | | | |
| 22 | DCFL (RetinaNet-O) | R50 | - | 12.3 (+5.0) | 36.7 (+12.8) | 4.5 (+2.7) | 4.3 | 10.7 | 17.2 | 22.2 | 36.1M |
| 23 | DCFL (RetinaNet-O) | R50 | 40e | 15.2 (+7.9) | 44.9 (+21.0) | 5.1 (+3.3) | 4.9 | 13.1 | 19.7 | 25.9 | 36.1M |
| 24 | DCFL (Oriented R-CNN) | R50 | - | 15.7 (+4.5) | 47.0 (+13.8) | 5.8 (+1.5) | 6.3 | 14.8 | 19.6 | 22.5 | 41.1M |
| 25 | DCFL (Oriented R-CNN) | R50 | 40e | 17.1 (+5.9) | 49.0 (+15.8) | 7.2 (+2.9) | 6.4 | 16.0 | 21.6 | 24.9 | 41.1M |
| 26 | DCFL ($\rm{S^2A}$-Net) | R50 | - | 13.7 (+2.9) | 39.7 (+6.3) | 5.3 (+2.0) | 4.7 | 12.4 | 18.6 | 22.6 | 38.6M |
| 27 | DCFL ($\rm{S^2A}$-Net) | R50 | 40e | 17.5 (+6.7) | 49.6 (+16.2) | 7.9 (+4.6) | 6.5 | 15.7 | 22.6 | 27.4 | 38.6M |

Main results of label-efficient methods on AI-TOD-R. The 10%/20%/30% settings train with the corresponding fraction of oriented-box (OBB) labels; the HBB setting trains with 100% horizontal-box labels.

| Method | Category | 10% AP | 10% AP50 | 10% APvt | 10% APt | 20% AP | 20% AP50 | 20% APvt | 20% APt | 30% AP | 30% AP50 | 30% APvt | 30% APt | HBB AP | HBB AP50 | HBB APvt | HBB APt |
|--------|----------|--------|----------|----------|---------|--------|----------|----------|---------|--------|----------|----------|---------|--------|----------|----------|---------|
| Unbiased Teacher | SSOD | 7.6 | 24.7 | 0.4 | 6.0 | 8.1 | 24.7 | 0.5 | 6.1 | 8.1 | 25.4 | 0.4 | 6.1 | - | - | - | - |
| Soft Teacher | SSOD | 9.4 | 29.0 | 0.3 | 7.6 | 10.2 | 31.1 | 0.5 | 7.9 | 10.4 | 32.2 | 0.6 | 7.8 | - | - | - | - |
| SOOD | SSOD | 9.4 | 29.3 | 2.8 | 8.1 | 12.1 | 35.7 | 3.5 | 10.2 | 13.0 | 38.8 | 3.9 | 11.1 | - | - | - | - |
| Co-mining | SAOD | 6.4 | 20.4 | 0.5 | 4.4 | 8.0 | 24.1 | 0.4 | 6.1 | 8.2 | 25.0 | 0.4 | 6.8 | - | - | - | - |
| H2RBox | WSOD | - | - | - | - | - | - | - | - | - | - | - | - | 11.4 | 39.1 | 3.4 | 9.4 |
| H2RBox-v2 | WSOD | - | - | - | - | - | - | - | - | - | - | - | - | 11.7 | 38.2 | 4.6 | 9.5 |

DCFL's Detection Performance

DCFL shows superior performance across diverse object detection scenarios, including small oriented object detection (AI-TOD-R, SODA-A), oriented object detection with massive numbers of tiny objects (DOTA-v1.5, DOTA-v2), generic oriented object detection (DOTA-v1, DIOR-R), and horizontal object detection (VisDrone, MS COCO). Notably, DCFL introduces no additional inference cost compared to the baseline. Please refer to the paper for details.

Visualization of Detection Results

DCFL significantly improves the detection performance on oriented tiny objects by suppressing both false negative and false positive predictions.



Visualization analysis of the predicted results. First row: Oriented R-CNN; second row: DCFL. True positive, false negative, and false positive predictions are shown in green, red, and blue boxes, respectively.


BibTeX

@article{xu2024oriented,
  title={Oriented Tiny Object Detection: A Dataset, Benchmark, and Dynamic Unbiased Learning},
  author={Xu, Chang and Zhang, Ruixiang and Yang, Wen and Zhu, Haoran and Xu, Fang and Ding, Jian and Xia, Gui-Song},
  journal={arXiv preprint},
  year={2024}
}