Boxi Yu

I am a Senior Research Fellow at Lero, the Science Foundation Ireland Research Centre for Software, directed by Prof. Lionel C. Briand. I obtained my Ph.D. from The Chinese University of Hong Kong, Shenzhen in 2025, supervised by Prof. Pinjia He.

My research focuses on Trustworthy AI, Code Agents, and Automated Testing. I proposed Retromorphic Testing, a technique for automatically constructing test oracles for modern software. My work has been published at top-tier venues including ICML, ICSE, ISSTA, ESEC/FSE, and ACL.

News

Apr 30, 2026	Our ICML and ICML Position papers were accepted: “SWE-ABS: Adversarial Benchmark Strengthening Exposes Inflated Success Rates on Test-based Benchmark” and “How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs”.
May 20, 2025	Our paper “UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench” was accepted by ACL’2025.
May 20, 2024	Our extended abstract “DSPy Guardrails: Building Safe LLM Applications via Self-Refining Language Model Pipelines” was accepted by the Compound AI Systems Workshop (June 13, 2024, in San Francisco at the Data + AI Summit).
Dec 15, 2023	Our paper “Testing Graph Database Systems via Equivalent Query Rewriting” was accepted by ICSE’2024.
Oct 11, 2023	We introduce “Retromorphic Testing,” a new, general methodology for the test oracle problem. It is a black-box technique that constructs a dual-program architecture to test the target software, inspired by the concept of an inverse function.
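The inverse-function idea can be illustrated with a small round-trip check. This is only a sketch under assumptions: it assumes the relation being checked is backward(forward(x)) == x, and the names retromorphic_violations, buggy_to_binary, and from_binary are hypothetical illustrations, not taken from the paper.

```python
from typing import Callable, Iterable, List, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def retromorphic_violations(
    forward: Callable[[X], Y],
    backward: Callable[[Y], X],
    inputs: Iterable[X],
) -> List[X]:
    """Return every input x for which backward(forward(x)) != x."""
    return [x for x in inputs if backward(forward(x)) != x]


def buggy_to_binary(n: int) -> str:
    """Hypothetical program under test: silently truncates n to 8 bits."""
    return format(n & 0xFF, "b")


def from_binary(s: str) -> int:
    """Hypothetical backward program: parse a binary string back to an int."""
    return int(s, 2)


# Inputs that do not survive the round trip expose the truncation bug.
violations = retromorphic_violations(buggy_to_binary, from_binary, range(300))
print(violations[:3], len(violations))
```

No expected output has to be written by hand for any input: the backward program acts as the oracle, which is why the check needs no knowledge of the internals of the program under test.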

Selected publications

ICML

SWE-ABS: Adversarial Benchmark Strengthening Exposes Inflated Success Rates on Test-based Benchmark

Boxi Yu, and others

ICML’26: International Conference on Machine Learning, 2026

ACL

UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

Boxi Yu, Yuxuan Zhu, Pinjia He, and Daniel Kang

ACL, 2025

arXiv

Retromorphic Testing: A New Approach to the Test Oracle Problem

Boxi Yu, Qiuyang Mang, Qingshuo Guo, and Pinjia He

arXiv, 2023

ICML Position

How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs

Jialun Cao, Yuk-Kit Chan, Zixuan Ling, Wenxuan Wang, Shuqing Li, Mingwei Liu, Ruixi Qiao, Yuting Han, Chaozheng Wang, Boxi Yu, and 5 more authors

ICML Position, 2026

ICSE

Deep Learning or Classical Machine Learning? An Empirical Study on Log-Based Anomaly Detection

Boxi Yu, Jiayi Yao, Qiuai Fu, Zhiqing Zhong, Haotian Xie, Yaoliang Wu, Yuchi Ma, and Pinjia He

ICSE’24: International Conference on Software Engineering, 2024

CASW

DSPy Guardrails: Building Safe LLM Applications via Self-Refining Language Model Pipelines

Boxi Yu and Pinjia He

Compound AI Systems Workshop, 2024

ESEC/FSE

Automated Testing and Improvement of Named Entity Recognition Systems

Boxi Yu, Yiyan Hu, Qiuyang Mang, Wenhan Hu, and Pinjia He

ESEC/FSE’23: Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023


Named entity recognition (NER) systems have seen rapid progress in recent years due to the development of deep neural networks. These systems are widely used in various natural language processing applications, such as information extraction, question answering, and sentiment analysis. However, the complexity and intractability of deep neural networks can make NER systems unreliable in certain circumstances, resulting in incorrect predictions. For example, NER systems may misidentify female names as chemicals or fail to recognize the names of minority groups, leading to user dissatisfaction. To tackle this problem, we introduce TIN, a novel, widely applicable approach for automatically testing and repairing various NER systems. The key idea for automated testing is that the NER predictions of the same named entities under similar contexts should be identical. The core idea for automated repairing is that similar named entities should have the same NER prediction under the same context. We use TIN to test two SOTA NER models and two commercial NER APIs, i.e., Azure NER and AWS NER. We manually verify 784 of the suspicious issues reported by TIN and find that 702 are erroneous issues, leading to high precision (85.0%-93.4%) across four categories of NER errors: omission, over-labeling, incorrect category, and range error. For automated repairing, TIN achieves a high error reduction rate (26.8%-50.6%) over the four systems under test, which successfully repairs 1,056 out of the 1,877 reported NER errors.
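The testing idea in this abstract (the same entity under similar contexts should receive the same prediction) can be sketched as a simple consistency check. This is an illustrative sketch only, not TIN itself: toy_ner is a hypothetical stand-in with a deliberately planted context-sensitive bug, and the similar-context variants are written by hand rather than generated automatically.

```python
from typing import Callable, Dict, List, Optional, Tuple

Prediction = Dict[str, str]  # entity text -> predicted label
Issue = Tuple[str, str, Optional[str], Optional[str]]


def find_suspicious(
    ner: Callable[[str], Prediction],
    sentence: str,
    variants: List[str],
    entities: List[str],
) -> List[Issue]:
    """Report (variant, entity, original label, new label) for each mismatch."""
    base = ner(sentence)
    issues: List[Issue] = []
    for variant in variants:
        pred = ner(variant)
        for entity in entities:
            if base.get(entity) != pred.get(entity):
                issues.append((variant, entity, base.get(entity), pred.get(entity)))
    return issues


def toy_ner(sentence: str) -> Prediction:
    """Hypothetical NER system: mislabels 'Alice' when 'mixed' is nearby."""
    tokens = sentence.rstrip(".").split()
    labels: Prediction = {}
    if "Alice" in tokens:
        labels["Alice"] = "CHEMICAL" if "mixed" in tokens else "PERSON"
    if "Paris" in tokens:
        labels["Paris"] = "LOCATION"
    return labels


issues = find_suspicious(
    toy_ner,
    "Alice prepared the solution in Paris.",
    ["Alice made the solution in Paris.", "Alice mixed the solution in Paris."],
    ["Alice", "Paris"],
)
print(issues)
```

A mismatch is only a suspicious issue, not a confirmed error, which is consistent with the abstract reporting precision from manual verification.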

ISSTA

ROME: Testing Image Captioning Systems via Recursive Object Melting

Boxi Yu, Zhiqing Zhong, Jiaqi Li, Yixing Yang, Shilin He, and Pinjia He

In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2023

ISSTA

Automated Testing of Image Captioning Systems

Boxi Yu, Zhiqing Zhong, Xinran Qin, Jiayi Yao, Yuancheng Wang, and Pinjia He

In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022


Image captioning (IC) systems, which automatically generate a text description of the salient objects in an image (real or synthetic), have seen great progress over the past few years due to the development of deep neural networks. IC plays an indispensable role in human society, for example, labeling massive photos for scientific studies and assisting visually-impaired people in perceiving the world. However, even the top-notch IC systems, such as Microsoft Azure Cognitive Services and IBM Image Caption Generator, may return incorrect results, leading to the omission of important objects, deep misunderstanding, and threats to personal safety. To address this problem, we propose MetaIC, the first metamorphic testing approach to validate IC systems. Our core idea is that the object names should exhibit directional changes after object insertion. Specifically, MetaIC (1) extracts objects from existing images to construct an object corpus; (2) inserts an object into an image via novel object resizing and location tuning algorithms; and (3) reports image pairs whose captions do not exhibit differences in an expected way. In our evaluation, we use MetaIC to test one widely-adopted image captioning API and five state-of-the-art (SOTA) image captioning models. Using 1,000 seeds, MetaIC successfully reports 16,825 erroneous issues with high precision (84.9%-98.4%). There are three kinds of errors: misclassification, omission, and incorrect quantity. We visualize the errors reported by MetaIC, which shows that flexible overlapping setting facilitates IC testing by increasing and diversifying the reported errors. In addition, MetaIC can be further generalized to detect label errors in the training dataset, which has successfully detected 151 incorrect labels in MS COCO Caption, a standard dataset in image captioning.
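Step (3) of the approach above, reporting image pairs whose captions do not change in the expected way, can be sketched as a set comparison over object names. This is a simplified sketch under assumptions, not MetaIC's actual relation: it assumes a non-overlapping insertion, so the new caption should mention every original object plus the inserted one. The helpers objects_in and violates_insertion_relation are hypothetical, and object names are matched as exact lowercase words, so plurals are ignored.

```python
from typing import Set


def objects_in(caption: str, vocabulary: Set[str]) -> Set[str]:
    """Extract known object names that appear as whole words in a caption."""
    words = set(caption.lower().replace(".", "").replace(",", "").split())
    return {obj for obj in vocabulary if obj in words}


def violates_insertion_relation(
    caption_before: str,
    caption_after: str,
    inserted: str,
    vocabulary: Set[str],
) -> bool:
    """True if the new caption lacks an original object or the inserted one."""
    before = objects_in(caption_before, vocabulary)
    after = objects_in(caption_after, vocabulary)
    expected = before | {inserted}
    return not (expected <= after)


vocab = {"dog", "ball", "cat"}

# Consistent pair: the inserted ball appears and the dog is still described.
ok_pair = violates_insertion_relation(
    "A dog on the grass.", "A dog and a ball on the grass.", "ball", vocab
)

# Suspicious pair: the dog has been replaced by a cat in the new caption.
bad_pair = violates_insertion_relation(
    "A dog on the grass.", "A cat and a ball on the grass.", "ball", vocab
)
print(ok_pair, bad_pair)
```

The captions here are hard-coded strings; in practice they would come from the image captioning system under test, before and after the object is inserted into the image.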