Dataset: AI-TOD-R contains 28,036 images annotated with 752,460 oriented bounding boxes across 8 classes, and it has the smallest mean object size ($10.6^{2}$ pixels) among existing oriented object detection datasets.
Benchmark: This benchmark covers both fully-supervised and label-efficient (SSOD, SAOD, WSOD) methods. Here, ``L.'', ``U.'', ``S. L.'', and ``C. L.'' denote labelled, unlabelled, sparsely labelled, and coarsely labelled images, respectively.
Dynamic Unbiased Learning: Compared to prior art (left), our proposed pipeline (right) mitigates the model's learning bias against oriented tiny objects with a dynamically updated prior and a coarse-to-fine sample learning scheme.
Images inevitably contain objects at extremely tiny scales when observation approaches the camera's physical limits. Although extreme, this scenario is ubiquitous in the real world, from micro-vision (medical and cell imaging) to macro-vision (drone and satellite imagery). In these specialized domains, imaging typically adopts a bird's-eye view to capture objects' primary features, so objects appear in arbitrary orientations. These oriented tiny objects pose severe challenges to object detection:
These challenges give rise to the following problems:
AI-TOD-R is obtained via a semi-automatic labelling process, where coarse labels are generated by H2RBox-v2 and fine labels are refined manually. AI-TOD-R contains 28,036 images annotated with 752,460 oriented bounding boxes across 8 classes, and it has the smallest mean object size ($10.6^{2}$ pixels) among existing oriented object detection datasets.
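For concreteness, here is a minimal sketch of how the mean object size statistic (the square root of the oriented box area, following the AI-TOD convention) could be computed from the annotations. The DOTA-style annotation format and the `annotations/` path are illustrative assumptions, not the dataset's actual tooling.

```python
# Minimal sketch: mean object size over all oriented annotations.
# Assumes DOTA-style files (x1 y1 x2 y2 x3 y3 x4 y4 class [difficulty]
# per line); the directory layout is hypothetical.
import glob
import math

import cv2
import numpy as np

sizes = []
for path in glob.glob("annotations/*.txt"):  # hypothetical path
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 9:
                continue  # skip header or metadata lines
            pts = np.array(parts[:8], dtype=np.float32).reshape(4, 2)
            # Oriented box = minimum-area rectangle around the polygon.
            _, (w, h), _ = cv2.minAreaRect(pts)
            # Object size defined as sqrt(w * h), as in AI-TOD.
            sizes.append(math.sqrt(w * h))

print(f"mean object size: {np.mean(sizes):.1f} px")  # ~10.6 on AI-TOD-R
```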
Statistical analysis of AI-TOD-R. In addition to extremely tiny object sizes, this dataset also exhibits arbitrary object orientations, dense arrangements, and class imbalance challenges.
More examples from AI-TOD-R:
Their tiny size and low confidence make oriented tiny objects easily suppressed or ignored during model training. The vanilla optimization process inevitably traps them in the dilemmas of biased prior setting and biased sample learning, severely impeding detection performance:
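To see the biased prior setting concretely, consider the toy, axis-aligned computation below (rotation and centre offsets only lower the overlap further): even a perfectly centred object of roughly the dataset's mean size comes nowhere near a typical 0.5 positive-assignment IoU threshold against the smallest preset anchor. The anchor size and threshold are common defaults, used here as assumptions.

```python
# Toy illustration of the biased prior: even a perfectly centred tiny
# object barely overlaps a typical preset anchor, so max-IoU assignment
# leaves it without positive samples.
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

tiny_gt = (12, 12, 20, 20)   # an 8x8 object; AI-TOD-R's mean is ~10.6x10.6
anchor = (0, 0, 32, 32)      # a typical smallest anchor (stride-8 level)
print(iou(tiny_gt, anchor))  # 0.0625 -- far below a 0.5 positive threshold
```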
Towards scale-unbiased sample learning, we reformulate the training process into a Dynamic Coarse-to-Fine Learning (DCFL) pipeline with a dynamically updated prior and a coarse-to-fine sample learning scheme; a schematic sketch follows.
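The snippet below is only a schematic sketch of such a coarse-to-fine assignment step under simplifying assumptions, not the paper's exact implementation: candidates are first gathered coarsely by centre distance, then re-ranked by a model-predicted matching quality so the positive set adapts as training progresses. The function name and the `coarse_k`/`fine_k` budgets are illustrative.

```python
# Schematic sketch of a coarse-to-fine assignment step (not the paper's
# exact implementation): a coarse candidate set is picked by distance,
# then refined with a quality score predicted by the model.
import torch

def coarse_to_fine_assign(prior_xy, gt_xy, pred_quality,
                          coarse_k=16, fine_k=4):
    """prior_xy: (P, 2) prior centres; gt_xy: (G, 2) GT centres;
    pred_quality: (P, G) model-estimated matching quality (e.g. IoU of
    current predictions with each GT). Returns a (P, G) bool mask of
    positive prior/GT pairs."""
    dist = torch.cdist(prior_xy, gt_xy)              # (P, G)
    # Coarse step: for each GT keep the coarse_k closest priors,
    # regardless of how small the prior/GT overlap is.
    coarse_idx = dist.topk(coarse_k, dim=0, largest=False).indices
    pos = torch.zeros_like(dist, dtype=torch.bool)
    pos.scatter_(0, coarse_idx, True)
    # Fine step: inside the coarse set, re-rank by predicted quality so
    # the positive set evolves with the model during training.
    quality = pred_quality.masked_fill(~pos, float("-inf"))
    fine_idx = quality.topk(fine_k, dim=0).indices
    fine = torch.zeros_like(pos)
    fine.scatter_(0, fine_idx, True)
    return fine & pos
```

In the actual pipeline the prior itself is also updated dynamically; please refer to the paper for the full formulation.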
Experiments are performed on AI-TOD-R for benchmarking, and on seven other datasets to investigate DCFL's superiority in detecting oriented tiny objects, its adaptability to various architectures, and its generalization to different scenarios.
We benchmark both fully-supervised paradigms (e.g., architecture, representation, refinement, assignment, backbone) and label-efficient paradigms (e.g., SSOD, SAOD, WSOD) with this dataset.
ID | Method | Backbone | Schedule | AP | AP$_{0.5}$ | AP$_{0.75}$ | AP$_{vt}$ | AP$_{t}$ | AP$_{s}$ | AP$_{m}$ | #Params. |
---|---|---|---|---|---|---|---|---|---|---|---|
Architecture: | |||||||||||
1 | RetinaNet-O | R50 | 1× | 7.3 | 23.9 | 1.8 | 2.2 | 5.9 | 11.1 | 15.4 | 36.3M |
2 | FCOS-O | R50 | 1× | 11.0 | 33.6 | 3.7 | 3.0 | 8.9 | 15.7 | 22.0 | 31.9M |
3 | Faster R-CNN-O | R50 | 1× | 10.2 | 30.8 | 3.6 | 0.6 | 7.8 | 19.0 | 22.9 | 41.1M |
4 | RoI Transformer | R50 | 1× | 10.5 | 34.0 | 2.2 | 1.1 | 8.8 | 16.9 | 20.3 | 55.1M |
5 | Oriented R-CNN | R50 | 1× | 11.2 | 33.2 | 4.3 | 0.6 | 9.1 | 19.5 | 23.2 | 41.1M |
6 | Deformable DETR-O | R50 | 1× | 8.4 | 26.7 | 2.0 | 4.8 | 9.3 | 8.6 | 7.3 | 40.8M |
7 | ARS-DETR | R50 | 1× | 14.3 | 41.1 | 5.8 | 6.3 | 14.5 | 17.6 | 18.7 | 41.1M |
Representation: | |||||||||||
8 | KLD (RetinaNet-O) | R50 | 1× | 7.8 | 24.8 | 2.3 | 3.1 | 6.7 | 10.3 | 15.8 | 36.3M |
9 | KFIoU (RetinaNet-O) | R50 | 1× | 8.1 | 25.2 | 2.8 | 2.0 | 6.6 | 12.3 | 17.1 | 36.3M |
10 | Oriented RepPoints | R50 | 1× | 13.0 | 40.3 | 4.2 | 5.2 | 12.2 | 16.8 | 21.4 | 36.6M |
11 | PSC (RetinaNet-O) | R50 | 1× | 4.5 | 15.8 | 1.2 | 1.0 | 3.7 | 8.2 | 12.7 | 36.4M |
12 | Gliding Vertex | R50 | 1× | 8.1 | 27.4 | 2.1 | 0.9 | 6.7 | 14.7 | 17.9 | 41.1M |
Refinement: | |||||||||||
13 | R3Det | R50 | 1× | 8.1 | 24.4 | 1.8 | 0.4 | 6.6 | 15.2 | 19.1 | 38.6M |
14 | S2A-Net | R50 | 1× | 10.8 | 33.4 | 3.3 | 4.3 | 11.2 | 13.0 | 16.0 | 38.6M |
Assignment: | |||||||||||
15 | ATSS-O | R50 | 1× | 10.9 | 33.8 | 3.1 | 2.7 | 8.9 | 15.5 | 19.4 | 36.0M |
16 | SASM | R50 | 1× | 11.4 | 35.0 | 3.7 | 3.6 | 10.2 | 15.4 | 19.8 | 36.6M |
17 | CFA | R50 | 1× | 12.4 | 38.7 | 4.0 | 5.0 | 11.9 | 16.5 | 18.8 | 36.6M |
Backbone: | |||||||||||
18 | Oriented R-CNN | R101 | 1× | 11.2 | 33.0 | 4.1 | 0.5 | 8.9 | 19.8 | 24.4 | 60.1M |
19 | Oriented R-CNN | Swin-T | 1× | 12.0 | 34.6 | 4.6 | 0.7 | 9.9 | 20.8 | 25.3 | 44.8M |
20 | Oriented R-CNN | LSKNet-T | 1× | 11.1 | 33.4 | 3.8 | 0.6 | 9.2 | 18.9 | 22.6 | 21.0M |
21 | ReDet | ReR50 | 1× | 11.6 | 32.8 | 4.8 | 1.4 | 9.5 | 19.4 | 23.2 | 31.6M |
Ours: | |||||||||||
22 | DCFL (RetinaNet-O) | R50 | 1× | 12.3 (+5.0) | 36.7 (+12.8) | 4.5 (+2.7) | 4.3 | 10.7 | 17.2 | 22.2 | 36.1M |
23 | DCFL (RetinaNet-O) | R50 | 40e | 15.2 (+7.9) | 44.9 (+21.0) | 5.1 (+3.3) | 4.9 | 13.1 | 19.7 | 25.9 | 36.1M |
24 | DCFL (Oriented R-CNN) | R50 | 1× | 15.7 (+4.5) | 47.0 (+13.8) | 5.8 (+1.5) | 6.3 | 14.8 | 19.6 | 22.5 | 41.1M |
25 | DCFL (Oriented R-CNN) | R50 | 40e | 17.1 (+5.9) | 49.0 (+15.8) | 7.2 (+2.9) | 6.4 | 16.0 | 21.6 | 24.9 | 41.1M |
26 | DCFL ($\rm{S^2A}$-Net) | R50 | 1× | 13.7 (+2.9) | 39.7 (+6.3) | 5.3 (+2.0) | 4.7 | 12.4 | 18.6 | 22.6 | 38.6M |
27 | DCFL ($\rm{S^2A}$-Net) | R50 | 40e | 17.5 (+6.7) | 49.6 (+16.2) | 7.9 (+4.6) | 6.5 | 15.7 | 22.6 | 27.4 | 38.6M |
Method | Category | AP (10% OBB) | AP$_{0.5}$ | AP$_{vt}$ | AP$_{t}$ | AP (20% OBB) | AP$_{0.5}$ | AP$_{vt}$ | AP$_{t}$ | AP (30% OBB) | AP$_{0.5}$ | AP$_{vt}$ | AP$_{t}$ | AP (100% HBB) | AP$_{0.5}$ | AP$_{vt}$ | AP$_{t}$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unbiased Teacher | SSOD | 7.6 | 24.7 | 0.4 | 6.0 | 8.1 | 24.7 | 0.5 | 6.1 | 8.1 | 25.4 | 0.4 | 6.1 | - | - | - | - |
Soft Teacher | SSOD | 9.4 | 29.0 | 0.3 | 7.6 | 10.2 | 31.1 | 0.5 | 7.9 | 10.4 | 32.2 | 0.6 | 7.8 | - | - | - | - |
SOOD | SSOD | 9.4 | 29.3 | 2.8 | 8.1 | 12.1 | 35.7 | 3.5 | 10.2 | 13.0 | 38.8 | 3.9 | 11.1 | - | - | - | - |
Co-mining | SAOD | 6.4 | 20.4 | 0.5 | 4.4 | 8.0 | 24.1 | 0.4 | 6.1 | 8.2 | 25.0 | 0.4 | 6.8 | - | - | - | - |
H2RBox | WSOD | - | - | - | - | - | - | - | - | - | - | - | - | 11.4 | 39.1 | 3.4 | 9.4
H2RBox-v2 | WSOD | - | - | - | - | - | - | - | - | - | - | - | - | 11.7 | 38.2 | 4.6 | 9.5
DCFL shows superior performance across diverse object detection scenarios, including small oriented object detection (AI-TOD-R, SODA-A), oriented object detection with massive numbers of tiny objects (DOTA-v1.5, DOTA-v2.0), generic oriented object detection (DOTA-v1.0, DIOR-R), and horizontal object detection (VisDrone, MS COCO). Notably, DCFL introduces no additional inference cost over the baseline. Please refer to the paper for details.
DCFL significantly improves detection performance on oriented tiny objects by reducing both false negative and false positive predictions.
Visualization analysis of the predicted results. First row: Oriented R-CNN; second row: DCFL. True positive, false negative, and false positive predictions are shown in green, red, and blue boxes, respectively.
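As a reference for how such a visualization can be produced, here is a minimal sketch: predictions are greedily matched one-to-one to ground truths at an IoU threshold; matched predictions are TPs (green), unmatched predictions are FPs (blue), and unmatched ground truths are FNs (red). It reuses the axis-aligned `iou()` helper from the earlier sketch; a rotated IoU would be used for oriented boxes in practice, and the 0.5 threshold is an assumption.

```python
# Minimal sketch: split predictions into TP/FP and unmatched ground
# truths into FN for visualization. Reuses iou() from the sketch above;
# oriented boxes would need a rotated IoU instead.
def split_tp_fp_fn(preds, gts, iou_thr=0.5):
    """preds: list of (x1, y1, x2, y2, score), sorted by descending score;
    gts: list of (x1, y1, x2, y2). Greedy one-to-one matching."""
    matched = set()
    tp, fp = [], []
    for box in preds:
        best_iou, best_gt = 0.0, None
        for g, gt in enumerate(gts):
            if g not in matched:
                overlap = iou(box[:4], gt)
                if overlap > best_iou:
                    best_iou, best_gt = overlap, g
        if best_iou >= iou_thr:
            matched.add(best_gt)
            tp.append(box)   # drawn in green
        else:
            fp.append(box)   # drawn in blue
    fn = [gt for g, gt in enumerate(gts) if g not in matched]  # drawn in red
    return tp, fp, fn
```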
@article{xu2024oriented,
title={Oriented Tiny Object Detection: A Dataset, Benchmark, and Dynamic Unbiased Learning},
author={Xu, Chang and Zhang, Ruixiang and Yang, Wen and Zhu, Haoran and Xu, Fang and Ding, Jian and Xia, Gui-Song},
journal={arXiv preprint},
year={2024}
}