Prompt engineering for large language models in test case generation
Anatolii Husakovskyi
The relevance of the study is determined by the need to enhance the effectiveness of software testing, where large language models and prompt engineering techniques open new opportunities for the automated generation of high-quality test cases. The purpose of the study is to evaluate the effectiveness of prompt engineering strategies in test case generation by large language models. The methodology compares four prompt engineering techniques (zero-shot, few-shot, chain-of-thought, and role prompting) for unit test generation with the CodeLlama 2 and StarCoder language models in the PyTest and JUnit environments, evaluated against the criteria of code coverage, relevance, defect detection, and integration suitability.
The analysis demonstrated that few-shot and role prompting provide the best balance between the quantity and quality of tests, with coverage of 85-100% and relevance of 88-95%. Chain-of-thought proved effective for complex logic and identified 16 of 20 embedded defects (80%), whereas zero-shot was limited to basic checks, with coverage of 55-65% and accuracy of 70-75%.
CodeLlama 2 demonstrated stable test generation with high consistency across repeated queries (90%), an average generation time of 16.2 s, and 52 tests per module, covering basic and complex scenarios, including edge cases and exceptions. StarCoder was faster (14.7 s) and generated 50 tests, with slightly lower stability (87%) and reduced coverage of complex scenarios, which makes it effective for rapid validation of basic functions.
The highest levels of readability, modularity, and integration suitability for CI/CD pipelines were observed with role prompting. Few-shot ensured a strong balance between structured output and practical test readiness, while chain-of-thought and zero-shot exhibited specific limitations. Combined use of models and prompting strategies enables optimisation of the test generation process, enhancing relevance, coverage, and the effectiveness of automated testing. The results of the study may be applied in automated software testing, integration into continuous integration and delivery pipelines, and training of quality assurance engineers in effective test generation methods.
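To make the four compared strategies concrete, the following minimal Python sketch shows how a prompt for each of them might be assembled. The template wording, the function names `build_prompt` and `detection_rate`, and the few-shot example are illustrative assumptions and are not taken from the study. No model is called; the sketch only builds the prompt text.

```python
# Hypothetical prompt templates for the four strategies named in the abstract.
# The wording is an assumption for illustration, not the prompts used in the study.

# A single worked example for few-shot prompting (hypothetical).
FEW_SHOT_EXAMPLE = (
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "def test_add_positive():\n"
    "    assert add(2, 3) == 5\n"
)


def build_prompt(strategy: str, source: str, framework: str = "PyTest") -> str:
    """Return the prompt text for one strategy and one code fragment."""
    task = f"Write {framework} unit tests for the following code:\n{source}"
    if strategy == "zero-shot":
        # The bare task, with no examples, reasoning cues, or persona.
        return task
    if strategy == "few-shot":
        # Prepend a worked example so the model can imitate its structure.
        return f"Example of code followed by a test:\n{FEW_SHOT_EXAMPLE}\n{task}"
    if strategy == "chain-of-thought":
        # Ask for explicit intermediate reasoning before the tests.
        return (
            f"{task}\n"
            "First list the branches, edge cases, and exceptions step by step, "
            "then write the tests."
        )
    if strategy == "role":
        # Assign a persona that frames the expected quality of the output.
        return (
            "You are a senior QA engineer preparing tests for a CI/CD pipeline.\n"
            f"{task}"
        )
    raise ValueError(f"unknown strategy: {strategy}")


def detection_rate(found: int, embedded: int) -> float:
    """Share of embedded defects that the generated tests detected, in percent."""
    return 100.0 * found / embedded
```

With the figures reported above, `detection_rate(16, 20)` gives 80.0, which matches the 80% stated for chain-of-thought prompting.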