Evaluation of human and AI cooperation in pair programming on the example of CodeLlama and GPT-4

Oleksandr Deineha; Olena Arshava; Iryna Zhovtonizhko

https://doi.org/10.31649/vitce/2.2025.47

From Vol. 22, No. 2, 2025

Received 28.03.2025, Revised 27.06.2025, Accepted 28.08.2025

The aim of the study was to experimentally evaluate the effectiveness of human interaction with large language models during programming tasks performed in a pair programming format. Two models were compared: GPT-4, developed by OpenAI, and CodeLlama 70B Instruct, created by Meta on an open architecture. Five typical scenarios of using artificial intelligence in software development were examined: generating functions from a description, refactoring code, explaining logic, debugging errors, and jointly designing application structure. Twenty specialists with different levels of programming experience took part in the study, evenly divided into two groups. GPT-4 outperformed CodeLlama on the aggregate productivity indicators; in particular, it achieved an 89% success rate in function generation, with a higher code quality score (Pylint = 8.3) and higher explainability (4.3 out of 5). CodeLlama, in turn, showed advantages in refactoring, with lower cognitive load among experienced developers (Task Load Index = 51.3 versus 63.8) and lower code complexity by the Halstead Volume metric (22.6 versus 27.4). The statistical significance of the observed differences was analysed (t(38) = 4.12, p < 0.01; F(2, 14) = 5.84, p < 0.05), which supports the reliability of the empirical observations. GPT-4 proved better suited to designing from scratch and to explanatory tasks, whereas CodeLlama was more effective at optimising existing code and was better received by Senior-level users. The practical value of the study lies in forming well-founded recommendations for developers, IT teams, and technical leads on the appropriate use of AI language models in their workflows. The results make it possible to build optimal interaction scenarios depending on the programmer's experience, the nature of the task (generation, refactoring, explanation, debugging), and the expected productivity metrics, which supports more effective adoption of AI assistants in development environments.
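The abstract reports a Halstead Volume and a t statistic but does not say how either was computed. The following is a minimal sketch, assuming the textbook definitions: Halstead Volume V = N · log2(n), where N is the total number of operators and operands and n is the number of distinct ones, and a pooled-variance two-sample t-test with n1 + n2 − 2 degrees of freedom. The function names and sample values are hypothetical and are not taken from the study.

```python
import math
from statistics import mean, variance


def halstead_volume(distinct_operators, distinct_operands,
                    total_operators, total_operands):
    """Halstead Volume V = N * log2(n) (standard definition, assumed here)."""
    vocabulary = distinct_operators + distinct_operands   # n
    length = total_operators + total_operands             # N
    return length * math.log2(vocabulary)


def pooled_t_test(sample_a, sample_b):
    """Two-sample t statistic with pooled variance.

    Returns (t, df), where df = len(a) + len(b) - 2.
    This is an assumed test design; the abstract only reports t(38).
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled sample variance, weighted by each group's degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(sample_a) - mean(sample_b)) / std_err
    return t, df


if __name__ == "__main__":
    # Hypothetical token counts for a small function.
    print(halstead_volume(2, 2, 3, 3))
    # Hypothetical per-task scores for two groups.
    print(pooled_t_test([1, 2, 3], [2, 3, 4]))
```

Under this formula, a reported t(38) corresponds to 40 observations across the two compared groups; how those observations map onto participants and tasks is not stated in the abstract.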

large language models; code generation; cognitive load; code refactoring; programming; statistical analysis

47-62

Deineha, O., Arshava, O., & Zhovtonizhko, I. (2025). Evaluation of human and AI cooperation in pair programming on the example of CodeLlama and GPT-4. Information Technologies and Computer Engineering, 22(2), 47-62. https://doi.org/10.31649/vitce/2.2025.47
