Active self-learning for object detection in an imbalanced data environment: The TAAST approach
Dmytro Ivanov

As computer vision is developed and applied ever more widely, there is a growing need to reduce the cost of manual data annotation, especially when detecting rare objects under a long-tailed class distribution. The purpose of the study was to improve the detection of rare object categories by refining the active self-training strategy. The study used the Tail-Aware Active Self-Training (TAAST) approach, which selects frames according to the entropy-based uncertainty of predictions, class rarity, and semantic diversity in the feature space of the Contrastive Language-Image Pretraining (CLIP) model, followed by pseudo-labeling with the You Only Look Once detector, version 8 (YOLOv8). In experiments on the Large Vocabulary Instance Segmentation dataset, version 1.0 (LVIS v1.0), and on nuImages-imbalanced, the proposed strategy increased AP_rare by 6.3-6.4 percentage points compared with the baseline Random and Uncertainty Sampling approaches. Overall accuracy did not decrease; it rose to 36.0-43.2% mAP, depending on the dataset. The annotation efficiency indicator reached 42-43%, which was 9-10 points higher than that of competing strategies. The results were statistically reliable: the confidence intervals for AP_rare obtained with Tail-Aware Active Self-Training did not overlap with those of the baseline Random and Uncertainty-only strategies, which indicated that the advantage of the method was unlikely to be due to chance. The results therefore demonstrated the reliability and stability of the proposed approach. After two active iterations, the model reached a performance plateau, which significantly reduced computational costs.
The practical significance of the study lies in creating an effective tool for automated deployment of computer vision models under a limited annotation budget.
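The frame-selection idea described above (uncertainty, class rarity, and diversity in an embedding space) can be illustrated with a minimal sketch. This is not the study's implementation: the additive score, the weights `w_unc`, `w_rare`, `w_div`, the inverse-frequency rarity term, the greedy cosine-distance diversity term, and the function names are all assumptions made for illustration. The inputs are assumed to be per-frame class probabilities aggregated from a detector and image embeddings such as those produced by a CLIP image encoder.

```python
import numpy as np


def normalized_entropy(probs, eps=1e-12):
    """Shannon entropy of each row of an (n, c) probability matrix, scaled to [0, 1]."""
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1) / np.log(probs.shape[1])


def select_frames(probs, class_counts, embeddings, budget,
                  w_unc=1.0, w_rare=1.0, w_div=1.0):
    """Greedily pick `budget` frames by uncertainty + rarity + diversity.

    probs        : (n, c) per-frame class probabilities (assumed aggregation
                   of detector outputs)
    class_counts : (c,) number of labeled instances per class
    embeddings   : (n, d) image embeddings, e.g. from a CLIP image encoder
    The additive score and its weights are illustrative assumptions.
    """
    n = probs.shape[0]
    unc = normalized_entropy(probs)

    # Rarity: expected inverse class frequency, scaled so the rarest class is 1.
    inv_freq = 1.0 / (np.asarray(class_counts, dtype=float) + 1.0)
    inv_freq /= inv_freq.max()
    rare = probs @ inv_freq

    # Diversity: halved cosine distance (range [0, 1]) to the nearest frame
    # already selected; 1.0 for every frame before anything is selected.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    min_dist = np.ones(n)

    taken = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(min(budget, n)):
        score = w_unc * unc + w_rare * rare + w_div * min_dist
        score[taken] = -np.inf          # never pick the same frame twice
        i = int(np.argmax(score))
        selected.append(i)
        taken[i] = True
        dist = (1.0 - emb @ emb[i]) / 2.0
        min_dist = np.minimum(min_dist, dist)
    return selected


# Toy data: frame 0 is a confident common-class frame, frames 1 and 3 are
# uncertain near-duplicates, frame 2 is a fairly confident rare-class frame.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.34, 0.33, 0.33],
                  [0.05, 0.05, 0.90],
                  [0.34, 0.33, 0.33]])
class_counts = np.array([1000, 100, 4])
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 1.0]])

picked = select_frames(probs, class_counts, embeddings, budget=2)
print(picked)
```

In this toy example the diversity term is what keeps the second pick from being the near-duplicate of the first: once an uncertain frame is chosen, its twin loses the diversity bonus and the rare-class frame is preferred.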