EdiVal-Agent
Automated, object-centric evaluation for multi-turn instruction-based image editing.
Tianyu Chen*1, Yasi Zhang*2, Zhi Zhang2, Peiyu Yu2, Shu Wang2, Zhendong Wang3, Kevin Lin3, Xiaofei Wang3, Zhengyuan Yang3, Linjie Li3, Chung-Ching Lin3, Jianwen Xie4, Oscar Leong†2, Lijuan Wang†3, Ying Nian Wu†2, Mingyuan Zhou†1,3
1University of Texas at Austin 2University of California, Los Angeles 3Microsoft 4Lambda, Inc.
*Equal contribution †Equal advising
Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images—resulting in limited coverage and inheriting biases from prior generative models—or (ii) rely solely on zero-shot vision–language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise.
To address this, we introduce EdiVal-Agent, an automated, scalable, and fine-grained evaluation framework for multi-turn instruction-based editing from an object-centric perspective, supported by a suite of expert tools. Given an image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions. For evaluation, it integrates VLMs with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to evaluate content consistency, and leverages human preference models to judge visual quality. We show that combining VLMs with object detectors yields stronger agreement with human judgments on instruction following than either VLMs alone or CLIP-based metrics. Furthermore, the pipeline’s modular design allows future tools to be seamlessly integrated, enhancing evaluation accuracy over time.
Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 11 state-of-the-art editing models spanning autoregressive (AR), flow-matching, and diffusion paradigms. The analysis below shows which model comes out on top.
Resources: Paper PDF (preprint) · Code repository (coming soon) · Contact
🍌 Nano Banana is the SOTA multi-turn editor! 🔷 OpenAI’s GPT-Image-1 is the runner-up.
Results of multi-turn editing. Instruction following, content consistency, and overall across three sequential editing turns. Overall is the geometric mean of instruction following and content consistency. Best per column is shown in dark red; second-best in light red. Latency is seconds per image (lower is better).
- 🍌 Nano Banana: Best speed–quality trade-off with top Overall scores at Turn 1/Turn 2/Turn 3 (81.48 / 67.70 / 56.24) and fast latency (9.7 s/img).
- 🎯 GPT-Image-1: Strongest instruction following, but slowed by high latency (71.3 s/img) and weaker consistency; Nano Banana trails by only ~4.2 (Turn 2) and 3.0 (Turn 3) points.
- 🔄 Consistency: FLUX.1-Kontext-dev leads, with Nano Banana close behind; GPT-Image-1 ranks second-to-last due to more regenerative/restyling edits.
- 🌐 Qwen-Image-Edit: Best open-source editing model. Instruction following is strong at Turn 1 (Overall 78.36) but degrades quickly with more turns, likely due to single-turn training and limited edit-history handling.
Abbreviations: IF = instruction following; CC = content consistency; T1–T3 = editing turns 1–3.

| Technique | Model | Date | Latency (s/img) | IF T1 | IF T2 | IF T3 | CC T1 | CC T2 | CC T3 | Overall T1 | Overall T2 | Overall T3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Autoregressive | Nano Banana | 25.08.26 | 9.70 | 70.70 | 50.66 | 35.35 | 93.91 | 90.48 | 89.48 | 81.48 | 67.70 | 56.24 |
| | GPT-Image-1 | 25.07.16 | 71.30 | 73.12 | 54.89 | 38.35 | 81.00 | 77.78 | 75.50 | 76.96 | 65.34 | 53.81 |
| | Gemini 2.0 Flash | 25.02.05 | 8.20 | 68.07 | 45.96 | 28.42 | 90.58 | 85.10 | 80.88 | 78.52 | 62.54 | 47.94 |
| Flow Matching | Qwen-Image-Edit | 25.08.04 | 115.08 | 72.90 | 44.06 | 22.55 | 84.22 | 80.52 | 77.98 | 78.36 | 59.56 | 41.93 |
| | Step1X-Edit | 25.04.25 | 20.42 | 61.89 | 34.97 | 17.83 | 92.76 | 88.52 | 85.21 | 75.77 | 55.64 | 38.98 |
| | FLUX.1-Kontext-dev | 25.06.25 | 29.21 | 59.97 | 32.69 | 16.61 | 95.32 | 92.24 | 90.22 | 75.61 | 54.91 | 38.71 |
| | OmniGen | 24.09.11 | 19.70 | 54.72 | 24.48 | 10.66 | 93.00 | 88.42 | 83.92 | 71.34 | 46.52 | 29.91 |
| Diffusion | AnyEdit | 24.11.24 | 3.93 | 41.07 | 16.32 | 7.22 | 86.42 | 78.91 | 70.10 | 59.58 | 35.89 | 22.50 |
| | UltraEdit | 24.07.07 | 3.15 | 51.37 | 17.70 | 6.36 | 86.80 | 84.50 | 82.40 | 66.78 | 38.67 | 22.89 |
| | MagicBrush | 23.06.16 | 4.08 | 42.31 | 15.73 | 4.90 | 86.96 | 81.26 | 76.86 | 60.66 | 35.75 | 19.41 |
| | InstructPix2Pix | 23.12.15 | 4.09 | 37.41 | 10.66 | 2.80 | 76.85 | 68.36 | 60.30 | 53.62 | 26.99 | 12.99 |
Based on the data in this table, across three turns Nano Banana offers the best speed–quality trade-off—highest Overall at T1/T2/T3 (81.48/67.70/56.24) with 9.7 s/img. GPT-Image-1 delivers the strongest instruction following across turns, but its latency (71.3 s/img)* and weaker consistency leave it second in Overall; Nano Banana trails GPT-Image-1 by only ~4.2 (T2) and 3.0 (T3) points on instruction following. For consistency, FLUX.1-Kontext-dev leads across turns with Nano Banana close behind, whereas GPT-Image-1 ranks second-to-last—consistent with more regenerative/restyling behavior that can erode pixel- or feature-level stability despite aesthetic gains. Gemini 2.0 Flash is competitive at T1 (second-best Overall) but exhibits a steeper decline by T3. Among open-source systems, Qwen-Image-Edit is strongest at T1 (Overall 78.36) yet degrades rapidly with additional turns, likely due to exposure bias from single-turn training on real images and a short edit-history window that forces the model to operate on its own outputs.
* Closed-source latencies were measured in the provider’s hosted web UI; open-source latencies on a single NVIDIA A100 GPU with default settings.
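
As a quick check on how the Overall column is computed (the caption states it is the geometric mean of instruction following and content consistency), here is a minimal sketch in Python using Nano Banana's Turn-1 numbers from the table above:

```python
# Sanity check of the Overall column: the caption defines Overall as the
# geometric mean of instruction following (IF) and content consistency (CC).
# The numbers below are Nano Banana's Turn-1 scores from the table.
import math

def overall(if_score: float, cc_score: float) -> float:
    """Geometric mean of instruction following and content consistency."""
    return math.sqrt(if_score * cc_score)

print(round(overall(70.70, 93.91), 2))  # 81.48, matching the Overall T1 cell
```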
Overview of Our Workflow
Overview of our workflow and representative model performance. For visualization we adopt two thresholds: a consistency score of at least 90 and a visual quality score of at least 6. Details of the automated evaluation pipeline are provided in Figure 2 and Section 2. In multi-turn editing the models exhibit distinct weaknesses—GPT-Image-1 struggles with content consistency, Qwen-Image-Edit underperforms in both visual quality and content consistency, and FLUX.1-Kontext-dev lags in instruction following—whereas Nano Banana shows no single dominant weakness. A comprehensive analysis is presented in Section 4 and Table 2.

Framework of EdiVal-Agent
Framework of EdiVal-Agent. It first decomposes images into semantically meaningful objects—such as metal yellow sign and metal brown pole—and identifies their contextual relationships (e.g. both are in the foreground). It then generates diverse, grounded editing scenarios based on that analysis (for example, Change the color of metal brown pole to gray). Finally, it evaluates editor outputs along instruction following, content consistency, and visual quality by integrating Qwen2.5-VL with Grounding-DINO for instruction checks, DINOv3 features plus pixel-level L1 distances for consistency, and HPSv3 for human preference alignment. The agentic pipeline is tool-agnostic and can readily incorporate stronger experts as they become available.
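
To make the orchestration concrete, the sketch below outlines one evaluation turn along the three axes described above. The callables standing in for Grounding-DINO, Qwen2.5-VL, DINOv3, and HPSv3, and their signatures, are assumptions for illustration rather than the released EdiVal-Agent API:

```python
# Hypothetical sketch of a single evaluation turn along the three axes above.
# The tool wrappers (detect, vlm_check, region_similarity, hps_score) are
# stand-ins for Grounding-DINO, Qwen2.5-VL, DINOv3, and HPSv3; their names
# and signatures are illustrative, not the released EdiVal-Agent API.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class TurnScores:
    instruction_following: float  # VLM + detector judgment, 0-100
    content_consistency: float    # feature/pixel similarity on untouched regions, 0-100
    visual_quality: float         # human-preference model score

def evaluate_turn(
    source_img,
    edited_img,
    instruction: str,
    untouched_boxes: Sequence[tuple],         # boxes of objects the edit should not change
    detect: Callable,                         # open-vocabulary detector (e.g., Grounding-DINO)
    vlm_check: Callable[..., float],          # VLM-based instruction check (e.g., Qwen2.5-VL)
    region_similarity: Callable[..., float],  # DINOv3 feature + pixel-L1 similarity
    hps_score: Callable[..., float],          # human-preference model (e.g., HPSv3)
) -> TurnScores:
    # 1) Instruction following: the detector grounds the edited object, the VLM verifies the edit.
    boxes = detect(edited_img, instruction)
    if_score = vlm_check(source_img, edited_img, instruction, boxes)
    # 2) Content consistency: average similarity over regions that should stay unchanged.
    cc_score = sum(region_similarity(source_img, edited_img, b) for b in untouched_boxes) / max(len(untouched_boxes), 1)
    # 3) Visual quality: learned human-preference score on the edited image.
    vq_score = hps_score(edited_img)
    return TurnScores(if_score, cc_score, vq_score)
```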

Why does Qwen-Image-Edit fail in multi-turn editing? Exposure bias: evidence from marginal instruction following across turns

We analyze instruction-following ability across nine edit tasks. For a given turn, the marginal task success rate is the proportion of prompts for which the requested edit in that turn is successfully implemented. In contrast, the instruction following score reported in Table 2 corresponds to the multi-turn task success rate at turn i—the fraction of images for which all edits up to turn i are successful.
Figure 4 shows the evolution of per-turn performance. Models that are AR (autoregressive) and condition on the full editing history are relatively stable across turns: their marginal task success rates change only slightly between turns. By contrast, non-AR models with a very short history window (effectively conditioning only on the previous output) suffer substantial degradation in marginal task success, particularly for flow-matching style models.
A striking example is Qwen-Image-Edit. It is the strongest open-source system at turn 1 (Overall 78.36 vs. 81.48 for Nano Banana) but degrades much faster over subsequent turns. We hypothesize that this is primarily an exposure bias issue: many single-turn edit models are trained to operate on real images and ground-truth inputs, not on their own earlier outputs. When a model must operate on its own previous edits, small distribution mismatches compound across turns and reduce stability—this effect is exacerbated when the model can attend only to a short slice of the history.
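
To pin down the distinction between the marginal and multi-turn success rates used above, here is a toy sketch; the per-prompt outcomes are made up purely for illustration:

```python
# Illustrative computation of the two success-rate views, from a per-prompt
# boolean matrix success[p][t] = "the edit at turn t of prompt p succeeded".
# The toy outcomes below are made up for illustration.
success = [
    [True,  True,  False],  # prompt 1: turns 1-3
    [True,  False, True],   # prompt 2
    [True,  True,  True],   # prompt 3
]

def marginal_rate(success, turn):
    """Fraction of prompts whose turn-`turn` edit succeeded (0-indexed turn)."""
    return sum(row[turn] for row in success) / len(success)

def multiturn_rate(success, turn):
    """Fraction of prompts for which ALL edits up to `turn` succeeded (Table 2 style)."""
    return sum(all(row[: turn + 1]) for row in success) / len(success)

print(marginal_rate(success, 2))   # 2/3: turn-3 edits that landed
print(multiturn_rate(success, 2))  # 1/3: prompts with all three turns correct
```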
Content Consistency: more than just following instructions.
Consistency metrics rely on Grounding-DINO to capture the same spatial regions across the editing trajectory. DINOv3 embeddings and pixel distances are computed over those regions to detect drift in untouched objects and background areas.
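
A hypothetical sketch of the per-region computation is below; the `embed` callable stands in for a DINOv3 feature extractor, and the way the feature and pixel terms are combined here is illustrative rather than the paper's exact formula:

```python
# Hypothetical region-level consistency: compare the same grounded box before
# and after editing using (stand-in) DINOv3 features plus a pixel-level L1 term.
import numpy as np

def crop(img: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    x0, y0, x1, y1 = box
    return img[y0:y1, x0:x1]

def region_consistency(src: np.ndarray, out: np.ndarray, box, embed) -> float:
    a, b = crop(src, box), crop(out, box)
    # Semantic drift: cosine similarity of (stand-in) DINOv3 embeddings.
    fa, fb = embed(a), embed(b)
    cos = float(fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-8))
    # Low-level drift: mean absolute pixel difference (L1), mapped to [0, 1].
    l1 = 1.0 - float(np.mean(np.abs(a.astype(float) - b.astype(float))) / 255.0)
    # Illustrative combination only; the released metric may weight these differently.
    return 100.0 * 0.5 * (cos + l1)
```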
How We Measure Object Consistency




Illustration of object consistency. Instruction: “Remove brick beige house.” The grounding box, extracted from the raw input image, highlights the localized region used to compute unchanged-object consistency. The corresponding consistency score is shown in brackets.
How We Measure Background Consistency


Illustration of background consistency. Instruction: “Remove brick beige house.” The background mask (in black) is derived by excluding all detected object regions from the entire image. The background consistency score is computed over this masked area to assess unintended alterations to the background.
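
The background term can be sketched as below; the mask construction follows the description above (exclude every detected object box from the full image), while the specific use of mean L1 distance as the score is an assumption for illustration:

```python
# Hypothetical background-consistency sketch: zero out every detected object
# box, then measure drift only on the remaining (background) pixels.
import numpy as np

def background_mask(shape: tuple[int, int], boxes) -> np.ndarray:
    mask = np.ones(shape, dtype=bool)          # True = background pixel
    for x0, y0, x1, y1 in boxes:               # boxes from the open-vocabulary detector
        mask[y0:y1, x0:x1] = False
    return mask

def background_consistency(src: np.ndarray, out: np.ndarray, boxes) -> float:
    mask = background_mask(src.shape[:2], boxes)
    diff = np.abs(src.astype(float) - out.astype(float)).mean(axis=-1)  # per-pixel L1
    score = 1.0 - diff[mask].mean() / 255.0    # 1 = background unchanged
    return 100.0 * score
```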
Visual Quality: aesthetic appeal or fidelity preservation?
Beyond instruction following and content consistency, the perceptual quality of the edited image is a key dimension. We therefore report (i) a learned aesthetic score and (ii) several low-level image statistics that can surface systematic artifacts and drift in multi-turn editing pipelines.
| Technique | Model | HPSv3 T1 | HPSv3 T2 | HPSv3 T3 | \|Δ\| vs. Base T1 | \|Δ\| vs. Base T2 | \|Δ\| vs. Base T3 |
|---|---|---|---|---|---|---|---|
| Autoregressive | Nano Banana | 4.94 | 5.12 | 5.26 | 0.56 | 0.73 | 0.88 |
| | GPT-Image-1 | 6.65 | 6.59 | 6.56 | 2.27 | 2.21 | 2.18 |
| | Gemini 2.0 Flash | 4.44 | 4.23 | 4.07 | 0.05 | 0.15 | 0.32 |
| Flow Matching | Qwen-Image-Edit | 5.86 | 5.72 | 5.15 | 1.47 | 1.34 | 0.77 |
| | Step1X-Edit | 4.06 | 3.34 | 2.76 | 0.33 | 1.04 | 1.63 |
| | FLUX.1-Kontext-dev | 5.12 | 5.07 | 5.04 | 0.73 | 0.69 | 0.65 |
| | OmniGen | 4.61 | 4.07 | 3.50 | 0.23 | 0.31 | 0.89 |
| Diffusion | AnyEdit | 3.66 | 2.80 | 1.95 | 0.72 | 1.58 | 2.44 |
| | UltraEdit | 4.79 | 4.68 | 4.36 | 0.41 | 0.30 | 0.02 |
| | MagicBrush | 3.85 | 3.08 | 2.36 | 0.54 | 1.30 | 2.02 |
| | InstructPix2Pix | 3.20 | 2.38 | 1.44 | 1.18 | 2.01 | 2.94 |
We quantify aesthetics with HPSv3, which we found to generalize reliably to our generated images, whereas alternatives (e.g., RAHF) underperform in this setting. We do not fold these quality metrics into the aggregate “Overall” score, as preferences differ on whether an edited image should strictly preserve the input style or pursue beautification.
To disentangle these preferences, we report the absolute change in aesthetic score relative to the base image: Δ_i = |HPS_{turn i} − HPS_{base}|. Smaller Δ indicates stronger style fidelity to the base image; larger Δ reflects greater beautification or stylistic drift. As summarized in the table above, GPT-Image-1 achieves the highest aesthetic scores across turns and remains stable. Qwen-Image-Edit is the next strongest on absolute HPS. For preserving the base image’s look (small Δ), Gemini 2.0 Flash shows the least drift, with Nano Banana also performing well.
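
A worked toy example of the drift metric (the base and per-turn scores below are made-up numbers, not entries from the table):

```python
# Toy illustration of Δ_i = |HPS_{turn i} − HPS_{base}|; all numbers are made up.
hps_base = 4.50                  # aesthetic score of the unedited base image
hps_turns = [4.80, 5.10, 3.90]   # per-turn HPSv3 scores of the edited images

deltas = [round(abs(h - hps_base), 2) for h in hps_turns]
print(deltas)  # [0.3, 0.6, 0.6] -> smaller values mean stronger style fidelity
```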




This figure provides qualitative examples from Qwen-Image-Edit. The edited images exhibit elevated luminance and noticeable high-frequency bright artifacts (e.g., white streaks or “line” textures) that degrade perceptual quality, with luminance quintiles increasing substantially. Correspondingly, HPS drops from 6.19 to 4.19 and 3.34, suggesting that HPS is sensitive to over-exposure to some extent. In contrast, when querying VLMs about the visual quality of these images, the returned scores do not change in the first two turns and remain consistently above 50, reflecting a positive evaluation under the [0, 100] scale—even though the T2/T3 edited images show significant artifacts.
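
For reference, a minimal sketch of the luminance statistic mentioned above; whether the paper uses this exact BT.601 luma definition is an assumption:

```python
# Compute luminance quintiles of an RGB image; luma weights follow ITU-R BT.601.
import numpy as np

def luminance_quintiles(img: np.ndarray) -> np.ndarray:
    """img: HxWx3 uint8 RGB array. Returns the 20th/40th/60th/80th luma percentiles."""
    r, g, b = (img[..., c].astype(float) for c in range(3))
    luma = 0.299 * r + 0.587 * g + 0.114 * b
    return np.percentile(luma, [20, 40, 60, 80])

# A brighter, over-exposed edit shifts these percentiles upward, which is the
# pattern observed for the Qwen-Image-Edit examples in the figure above.
```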
Multi-Turn vs. Complex Prompts: The power of Chain of Edit.
Autoregressive editors benefit from step-by-step reasoning—multi-turn execution outperforms complex single-shot prompts. Flow-matching models prefer complex prompts, avoiding repeated exposure to their own outputs.
| Technique | Model | Multi-turn (T3) | Complex (C3) |
|---|---|---|---|
| Autoregressive | Nano Banana | 35.35 | 28.14 |
| | GPT-Image-1 | 38.35 | 28.78 |
| | Gemini 2.0 Flash | 28.42 | 21.89 |
| Flow Matching | Qwen-Image-Edit | 22.55 | 27.62 |
| | Step1X-Edit | 17.83 | 15.73 |
| | FLUX.1-Kontext-dev | 16.61 | 19.58 |
| | OmniGen | 10.66 | 11.01 |
| Diffusion | AnyEdit | 7.22 | 2.80 |
| | UltraEdit | 6.36 | 8.22 |
| | MagicBrush | 4.90 | 4.55 |
| | InstructPix2Pix | 2.80 | 2.80 |
Appendix · Multi-turn Quality Examples
T1: Change the color of pumpkin to purple; T2: Change the background to forest; T3: Remove fabric orange bow. Each strip shows the input image followed by three consecutive turns executed by the same model.
Nano Banana




GPT-Image-1




Gemini 2.0 Flash




Qwen-Image-Edit




Step1X-Edit




FLUX.1-Kontext-dev




OmniGen




AnyEdit




UltraEdit




MagicBrush




InstructPix2Pix




BibTeX
@article{chen2025edival,
  title   = {EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing},
  author  = {Chen, Tianyu and Zhang, Yasi and Zhang, Zhi and Yu, Peiyu and Wang, Shu and Wang, Zhendong and Lin, Kevin and Wang, Xiaofei and Yang, Zhengyuan and Li, Linjie and Lin, Chung-Ching and Xie, Jianwen and Leong, Oscar and Wang, Lijuan and Wu, Ying Nian and Zhou, Mingyuan},
  journal = {arXiv preprint},
  volume  = {arXiv:2509.13399},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.13399}
}