EdiVal-Agent
Automated, object-centric evaluation for multi-turn instruction-based image editing.
Tianyu Chen*1, Yasi Zhang*2, Zhi Zhang2, Peiyu Yu2, Shu Wang2, Zhendong Wang3, Kevin Lin3, Xiaofei Wang3, Zhengyuan Yang3, Linjie Li3, Chung-Ching Lin3, Jianwen Xie4, Oscar Leong†2, Lijuan Wang†3, Ying Nian Wu†2, Mingyuan Zhou†1,3
1University of Texas at Austin 2University of California, Los Angeles 3Microsoft 4Lambda, Inc.
*Equal contribution †Equal advising
Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images—resulting in limited coverage and inheriting biases from prior generative models—or (ii) rely solely on zero-shot vision–language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise.
To address this, we introduce EdiVal-Agent, an automated and fine-grained evaluation framework grounded in an object-centric perspective that handles both single-turn and multi-turn instruction-based editing. Given an input image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions while dynamically updating object pools across turns.
These stages enable three complementary metrics tailored for multi-turn evaluation: 1) EdiVal-IF, which measures instruction following by combining open-vocabulary object detectors for symbolic checks with VLMs for detector-guided semantic verification; 2) EdiVal-CC, which evaluates content consistency by computing semantic similarity of unchanged objects and background using the evolving object pools; and 3) EdiVal-VQ, which quantifies changes in overall visual quality with human preference models.
Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering nine instruction types and thirteen state-of-the-art editing models across in-context autoregressive, flow-matching, and diffusion paradigms. We label closed-source systems such as GPT-Image-1, Nano Banana, and Gemini 2.0 Flash Image as in-context because they integrate with autoregressive language models in web interfaces and support in-context multi-turn editing. We further compare multi-turn editing with single-shot complex editing to surface paradigm-specific behaviors, and show that EdiVal-Agent pinpoints existing failure modes to guide the next generation of editing models.
🏆 Seedream 4.0 leads overall; Nano Banana keeps the speed crown.
Results of multi-turn editing. We report EdiVal-IF (instruction following), EdiVal-CC (content consistency), and EdiVal-O (overall) over three turns, alongside latency. EdiVal-O is the geometric mean of EdiVal-IF and EdiVal-CC. Dark red marks the best score per column, light red the runner-up.
- 🌟 Seedream 4.0: New overall leader with EdiVal-O scores of 83.81 / 69.95 / 59.76 and strong instruction following across turns.
- 🍌 Nano Banana: Fastest high-performing model (9.7 s/img) and still second overall, with balanced EdiVal-IF and EdiVal-CC.
- 🎯 GPT-Image-1: Highest EdiVal-IF among the in-context (autoregressive) models, but by far the slowest closed-source system at 71.3 s/img.
- 🌐 Qwen-Image-Edit: Strong turn-one EdiVal-IF but a steep drop by turn three, underscoring exposure-bias concerns for open models; still the best open-source performer.
| Technique | Model | In-Context | Latency (s/img) | EdiVal-IF T1 | EdiVal-IF T2 | EdiVal-IF T3 | EdiVal-CC T1 | EdiVal-CC T2 | EdiVal-CC T3 | EdiVal-O T1 | EdiVal-O T2 | EdiVal-O T3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unknown | Seedream 4.0 | No | 15.78 | 75.93 | 55.58 | 41.59 | 92.51 | 88.03 | 85.86 | 83.81 | 69.95 | 59.76 |
| | Nano Banana | Yes | 9.70 | 70.70 | 50.66 | 35.35 | 93.91 | 90.48 | 89.48 | 81.48 | 67.70 | 56.24 |
| | GPT-Image-1 | Yes | 71.30 | 73.12 | 54.89 | 38.35 | 81.00 | 77.78 | 75.50 | 76.96 | 65.34 | 53.81 |
| | Gemini 2.0 Flash | Yes | 8.20 | 68.07 | 45.96 | 28.42 | 90.58 | 85.10 | 80.88 | 78.52 | 62.54 | 47.94 |
| Flow Matching | FLUX.1-Kontext-max | No | 10.34 | 69.49 | 46.89 | 31.83 | 93.93 | 90.90 | 88.40 | 80.79 | 65.29 | 53.04 |
| | Qwen-Image-Edit | No | 115.08 | 72.90 | 44.06 | 22.55 | 84.22 | 80.52 | 77.98 | 78.36 | 59.56 | 41.93 |
| | Step1X-Edit | No | 20.42 | 61.89 | 34.97 | 17.83 | 92.76 | 88.52 | 85.21 | 75.77 | 55.64 | 38.98 |
| | FLUX.1-Kontext-dev | No | 29.21 | 59.97 | 32.69 | 16.61 | 95.32 | 92.24 | 90.22 | 75.61 | 54.91 | 38.71 |
| | OmniGen | No | 19.70 | 54.72 | 24.48 | 10.66 | 93.00 | 88.42 | 83.92 | 71.34 | 46.52 | 29.91 |
| Diffusion | AnyEdit | No | 3.93 | 41.07 | 16.32 | 7.22 | 86.42 | 78.91 | 70.10 | 59.58 | 35.89 | 22.50 |
| | UltraEdit | No | 3.15 | 51.37 | 17.70 | 6.36 | 86.80 | 84.50 | 82.40 | 66.78 | 38.67 | 22.89 |
| | MagicBrush | No | 4.08 | 42.31 | 15.73 | 4.90 | 86.96 | 81.26 | 76.86 | 60.66 | 35.75 | 19.41 |
| | InstructPix2Pix | No | 4.09 | 37.41 | 10.66 | 2.80 | 76.85 | 68.36 | 60.30 | 53.62 | 26.99 | 12.99 |
Seedream 4.0 comes out on top: it delivers the best results across all three rounds of edits while staying reasonably quick at about 16 seconds per image. Nano Banana hits the best speed-quality sweet spot at roughly 10 seconds per image, placing second overall and staying close to Seedream 4.0 in both instruction following and content consistency. GPT-Image-1 excels at doing exactly what you ask, but its very long processing time (around 71 seconds per image) and weaker content consistency pull down its overall score; it tends to favor eye-catching outputs over faithful preservation of unedited content. Among open-source tools, Qwen-Image-Edit starts strong but drops off in later rounds, likely because small errors snowball with each edit. Overall, closed-source tools still hold a clear edge, and aside from Qwen-Image-Edit, our rankings match a major community leaderboard based on human votes as of September 12, 2025.
* Closed-source latencies were measured in the provider’s hosted web UI; open-source latencies on a single NVIDIA A100 GPU with default settings.
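For reference, EdiVal-O in the table above is the geometric mean of EdiVal-IF and EdiVal-CC at each turn. A minimal sketch of that aggregation (the function name is ours, not part of any released API):

```python
import math

def edival_overall(if_score: float, cc_score: float) -> float:
    """Aggregate instruction following (EdiVal-IF) and content consistency
    (EdiVal-CC) into the overall score (EdiVal-O) via the geometric mean."""
    return math.sqrt(if_score * cc_score)

# Sanity check against the Seedream 4.0 row at turn 1:
# sqrt(75.93 * 92.51) ≈ 83.81, matching the reported EdiVal-O (T1).
print(round(edival_overall(75.93, 92.51), 2))  # 83.81
```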
Overview of Our Workflow
Overview of our workflow and representative model performance. For visualization we adopt two thresholds: a consistency score of at least 90 and a visual quality score of at least 6. Details of the automated evaluation pipeline are provided in Figure 2 and Section 2. In multi-turn editing the models exhibit distinct weaknesses—GPT-Image-1 struggles with content consistency, Qwen-Image-Edit underperforms in both visual quality and content consistency, and FLUX.1-Kontext-dev lags in instruction following—whereas Nano Banana shows no single dominant weakness. A comprehensive analysis is presented in Section 4 and Table 2.
Framework of EdiVal-Agent
Framework of EdiVal-Agent. It first decomposes images into semantically meaningful objects (such as “metal yellow sign” and “metal brown pole”) and identifies their contextual relationships (e.g., both are in the foreground). It then generates diverse, grounded editing scenarios based on that analysis (for example, “Change the color of metal brown pole to gray”). Finally, it evaluates editor outputs along instruction following, content consistency, and visual quality by integrating Qwen2.5-VL with Grounding-DINO for instruction checks, DINOv3 features plus pixel-level L1 distances for consistency, and HPSv3 for human preference alignment. The agentic pipeline is tool-agnostic and can readily incorporate stronger experts as they become available.
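To make the three stages concrete, here is a heavily condensed sketch of the agentic loop. Every callable (`decompose`, `propose`, `editor`, `evaluate`, `update_pool`) is a hypothetical stand-in for the actual tool calls (VLM, Grounding-DINO, HPSv3, etc.), not the released implementation:

```python
def run_edival_agent(image, editor, decompose, propose, evaluate, update_pool, num_turns=3):
    """Sketch of the three-stage EdiVal-Agent loop (illustrative only).

    decompose(image)            -> object pool (name -> attributes / location)
    propose(pool)               -> a context-aware editing instruction
    editor(image, instruction)  -> edited image (the model under evaluation)
    evaluate(...)               -> per-turn IF / CC / VQ scores
    update_pool(pool, instr)    -> object pool after the edit is applied
    """
    pool = decompose(image)                       # Stage 1: object-centric decomposition
    current, scores = image, []
    for _ in range(num_turns):
        instruction = propose(pool)               # Stage 2: grounded instruction synthesis
        current = editor(current, instruction)    # run the editing model under test
        scores.append(evaluate(image, current, instruction, pool))  # Stage 3: scoring
        pool = update_pool(pool, instruction)     # keep the object pool in sync across turns
    return scores
```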
Why Does Qwen-Image-Edit Fail in Multi-Turn Editing? Exposure Bias: Evidence from Marginal Instruction Following Across Turns
When we look at how well each model follows instructions over multiple editing steps, we measure two things. The first is the marginal success rate—how often a model successfully applies the edit for a specific turn. The second, shown in the results table, is the multi-turn success rate, which checks if the model keeps succeeding across all edits in a sequence.
Top performers like Seedream 4.0, Nano Banana, and FLUX.1-Kontext-max manage to stay consistent across turns, even though some of them do not actually “remember” earlier edits. Other models, however, struggle more as the sequence gets longer.
A good example is Qwen-Image-Edit. It starts strong, nearly matching the leaders on the first edit, but its performance drops quickly afterward. This likely happens because it was trained mostly on perfect, single-edit examples—not on its own previous outputs. When asked to keep refining its own generations, small mistakes tend to build up over time, especially if the model cannot fully reference its past work.
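To make the two rates concrete, here is a minimal sketch using our own pass/fail bookkeeping, not the benchmark's actual code:

```python
def marginal_rate(records, turn):
    """Fraction of sequences whose edit at `turn` (0-indexed) succeeded,
    regardless of what happened in earlier turns."""
    return sum(r[turn] for r in records) / len(records)

def multi_turn_rate(records, turn):
    """Fraction of sequences that succeeded at every turn up to and
    including `turn` -- the cumulative metric reported in the table."""
    return sum(all(r[: turn + 1]) for r in records) / len(records)

# records[i][t] is True if sequence i passed the instruction check at turn t.
records = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [False, True, True],
]
print(marginal_rate(records, 1))    # 0.75: three of four turn-2 edits succeeded
print(multi_turn_rate(records, 1))  # 0.5: only two sequences passed turns 1 and 2
```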
Content Consistency: more than just following instructions.
Consistency metrics rely on Grounding-DINO to capture the same spatial regions across the editing trajectory. DINOv3 embeddings and pixel distances are computed over those regions to detect drift in untouched objects and background areas.
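A simplified sketch of that computation is below. The `encode` argument stands in for a DINOv3-style feature extractor, and the box format and the equal weighting of the semantic and pixel terms are our assumptions for illustration, not the exact EdiVal-CC recipe:

```python
import numpy as np

def region_consistency(img_before, img_after, box, encode):
    """Score how much a supposedly-unchanged region drifted between turns.

    img_before, img_after: H x W x 3 uint8 arrays on the same canvas.
    box: (x0, y0, x1, y1) pixel box from the detector on the raw input image.
    encode: callable mapping an image crop to a 1-D feature vector
            (a DINOv3-style backbone in the actual pipeline).
    """
    x0, y0, x1, y1 = box
    crop_a = img_before[y0:y1, x0:x1]
    crop_b = img_after[y0:y1, x0:x1]

    # Semantic similarity of the two crops (cosine similarity of features).
    fa, fb = encode(crop_a), encode(crop_b)
    cos = float(np.dot(fa, fb) / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-8))

    # Low-level pixel drift (mean L1 distance, normalized to [0, 1]).
    l1 = float(np.abs(crop_a.astype(np.float32) - crop_b.astype(np.float32)).mean() / 255.0)

    # Blend the two signals; the 0.5 / 0.5 weighting is an assumption.
    return 100.0 * (0.5 * cos + 0.5 * (1.0 - l1))
```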
How We Measure Object Consistency
Illustration of object consistency. Instruction: “Remove brick beige house.” The grounding box, extracted from the raw input image, highlights the localized region used to compute unchanged-object consistency. The corresponding consistency score is shown in brackets.
How We Measure Background Consistency
Illustration of background consistency. Instruction: “Remove brick beige house.” The background mask (in black) is derived by excluding all detected object regions from the entire image. The background consistency score is computed over this masked area to assess unintended alterations to the background.
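A minimal sketch of building such a mask by excluding every detected object box, with the box format and the L1 drift statistic as our assumptions for illustration:

```python
import numpy as np

def background_mask(height, width, boxes):
    """Return a boolean mask that is True on background pixels only,
    i.e. everywhere except inside the detected object boxes."""
    mask = np.ones((height, width), dtype=bool)
    for x0, y0, x1, y1 in boxes:          # (x0, y0, x1, y1) pixel coordinates (assumed format)
        mask[y0:y1, x0:x1] = False        # exclude each detected object region
    return mask

def background_l1(img_before, img_after, mask):
    """Mean L1 drift between input and edited image over background pixels only."""
    diff = np.abs(img_before.astype(np.float32) - img_after.astype(np.float32))
    return float(diff[mask].mean() / 255.0)
```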
Visual Quality: aesthetic appeal or fidelity preservation?
Beyond instruction following and content consistency, the perceptual quality of the edited image is a key dimension. We therefore report (i) a learned aesthetic score and (ii) several low-level image statistics that can surface systematic artifacts and drift in multi-turn editing pipelines.
| Technique | Model | HPS T1 | HPS T2 | HPS T3 | Δ vs. Base T1 | Δ vs. Base T2 | Δ vs. Base T3 |
|---|---|---|---|---|---|---|---|
| Unknown | Seedream 4.0 | 5.14 | 5.15 | 5.15 | 0.76 | 0.77 | 0.77 |
| | Nano Banana | 4.94 | 5.12 | 5.26 | 0.56 | 0.73 | 0.88 |
| | GPT-Image-1 | 6.65 | 6.59 | 6.56 | 2.27 | 2.21 | 2.18 |
| | Gemini 2.0 Flash | 4.44 | 4.23 | 4.07 | 0.05 | 0.15 | 0.32 |
| Flow Matching | FLUX.1-Kontext-max | 5.12 | 5.07 | 5.04 | 0.41 | 0.49 | 0.47 |
| | Qwen-Image-Edit | 5.86 | 5.72 | 5.15 | 1.47 | 1.34 | 0.77 |
| | Step1X-Edit | 4.06 | 3.34 | 2.76 | 0.33 | 1.04 | 1.63 |
| | FLUX.1-Kontext-dev | 5.12 | 5.07 | 5.04 | 0.73 | 0.69 | 0.65 |
| | OmniGen | 4.61 | 4.07 | 3.50 | 0.23 | 0.31 | 0.89 |
| Diffusion | AnyEdit | 3.66 | 2.80 | 1.95 | 0.72 | 1.58 | 2.44 |
| | UltraEdit | 4.79 | 4.68 | 4.36 | 0.41 | 0.30 | 0.02 |
| | MagicBrush | 3.85 | 3.08 | 2.36 | 0.54 | 1.30 | 2.02 |
| | InstructPix2Pix | 3.20 | 2.38 | 1.44 | 1.18 | 2.01 | 2.94 |
We quantify aesthetics with HPSv3, which we found to generalize reliably to our generated images, whereas alternatives (e.g., RAHF) underperform in this setting. We do not fold these quality metrics into the aggregate “Overall” score, as preferences differ on whether an edited image should strictly preserve the input style or pursue beautification.
To disentangle these preferences, we report the absolute change in aesthetic score relative to the base image, $\Delta_i = \lvert \mathrm{HPS}_{\mathrm{turn}\,i} - \mathrm{HPS}_{\mathrm{base}} \rvert$. Smaller Δ indicates stronger style fidelity to the base image; larger Δ reflects greater beautification or stylistic drift. As summarized in the table above, GPT-Image-1 achieves the highest aesthetic scores across turns and remains stable. Qwen-Image-Edit is the next strongest on absolute HPS. For preserving the base image's look (small Δ), Gemini 2.0 Flash shows the least drift, with Nano Banana also performing well.
This figure provides qualitative examples from Qwen-Image-Edit. The edited images exhibit elevated luminance and noticeable high-frequency bright artifacts (e.g., white streaks or “line” textures) that degrade perceptual quality, with luminance quintiles increasing substantially. Correspondingly, HPS drops from 6.19 to 4.19 and 3.34, suggesting that HPS is sensitive to over-exposure to some extent. In contrast, when querying VLMs about the visual quality of these images, the returned scores do not change in the first two turns and remain consistently above 50, reflecting a positive evaluation under the [0, 100] scale—even though the T2/T3 edited images show significant artifacts.
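A small sketch of the kind of low-level statistic referenced here: per-image luminance percentiles, which rise sharply when outputs drift toward over-exposure. Using the standard Rec. 709 luma weights and quintile boundaries is our choice for illustration:

```python
import numpy as np

def luminance_percentiles(img, qs=(20, 40, 60, 80)):
    """Compute luma percentiles for an H x W x 3 uint8 RGB image.

    Rising upper percentiles across turns indicate the over-exposure /
    bright-streak artifacts discussed above (our choice of statistic).
    """
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    luma = 0.2126 * r + 0.7152 * g + 0.0722 * b   # Rec. 709 luma
    return np.percentile(luma, qs)
```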
Multi-Turn vs. Complex Prompts: The Power of Chain-of-Edit
In-context editors benefit from step-by-step reasoning: multi-turn execution outperforms complex single-shot prompts. Flow-matching models, by contrast, tend to prefer complex prompts, which avoid repeated exposure to their own outputs.
| Technique | Model | Multi-turn EdiVal-IF (T3) | Complex-prompt EdiVal-IF (C3) |
|---|---|---|---|
| Unknown | Nano Banana | 35.35 | 28.14 |
| | GPT-Image-1 | 38.35 | 28.78 |
| | Gemini 2.0 Flash | 28.42 | 21.89 |
| Flow Matching | Qwen-Image-Edit | 22.55 | 27.62 |
| | Step1X-Edit | 17.83 | 15.73 |
| | FLUX.1-Kontext-dev | 16.61 | 19.58 |
| | OmniGen | 10.66 | 11.01 |
| Diffusion | AnyEdit | 7.22 | 2.80 |
| | UltraEdit | 6.36 | 8.22 |
| | MagicBrush | 4.90 | 4.55 |
| | InstructPix2Pix | 2.80 | 2.80 |
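The two protocols compared above can be sketched as follows, where `edit` is a hypothetical wrapper around any editor in the table that takes an image and an instruction string:

```python
def run_multi_turn(edit, image, instructions):
    """Chain-of-edit protocol: feed each instruction separately,
    always editing the model's own previous output."""
    current = image
    for instruction in instructions:
        current = edit(current, instruction)
    return current

def run_complex(edit, image, instructions):
    """Single-shot protocol: merge all instructions into one complex
    prompt and apply it to the original image in a single call."""
    # Joining with "; " is an illustrative choice, not the benchmark's exact prompt format.
    return edit(image, "; ".join(instructions))
```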
Appendix · Multi-turn Quality Examples
T1: Change the color of pumpkin to purple
T2: Change the background to forest
T3: Remove fabric orange bow
Each strip shows the input image followed by three consecutive turns executed by the same model.
Models shown (one strip per model): Seedream 4.0, Nano Banana, GPT-Image-1, FLUX.1-Kontext-max, Gemini 2.0 Flash, Qwen-Image-Edit, Step1X-Edit, FLUX.1-Kontext-dev, OmniGen, AnyEdit, UltraEdit, MagicBrush, InstructPix2Pix.
BibTeX
@article{chen2025edival,
title={EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing},
author={Chen, Tianyu and Zhang, Yasi and Zhang, Zhi and Yu, Peiyu and Wang, Shu and Wang, Zhendong and Lin, Kevin and Wang, Xiaofei and Yang, Zhengyuan and Li, Linjie and others},
journal={arXiv preprint arXiv:2509.13399},
year={2025}
}