EdiVal-Agent

Automated, object-centric evaluation for multi-turn instruction-based image editing.

Tianyu Chen*1, Yasi Zhang*2, Zhi Zhang2, Peiyu Yu2, Shu Wang2, Zhendong Wang3, Kevin Lin3, Xiaofei Wang3, Zhengyuan Yang3, Linjie Li3, Chung-Ching Lin3, Jianwen Xie4, Oscar Leong†2, Lijuan Wang†3, Ying Nian Wu†2, Mingyuan Zhou†1,3

1University of Texas at Austin   2University of California, Los Angeles   3Microsoft   4Lambda, Inc.

*Equal contribution   †Equal advising

Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images—resulting in limited coverage and inheriting biases from prior generative models—or (ii) rely solely on zero-shot vision–language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise.

To address this, we introduce EdiVal-Agent, an automated, scalable, and fine-grained evaluation framework for multi-turn instruction-based editing from an object-centric perspective, supported by a suite of expert tools. Given an image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions. For evaluation, it integrates VLMs with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to evaluate content consistency, and leverages human preference models to judge visual quality. We show that combining VLMs with object detectors yields stronger agreement with human judgments in instruction-following evaluation than either VLMs alone or CLIP-based metrics. Furthermore, the pipeline’s modular design allows future tools to be seamlessly integrated, enhancing evaluation accuracy over time.

Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 11 state-of-the-art editing models spanning autoregressive (AR), flow-matching, and diffusion paradigms. The analysis below reveals which model comes out on top.

Resources: Paper PDF (preprint) · Code repository (coming soon) · Contact

🍌 Nano Banana is the SOTA multi-turn editor! 🔷 OpenAI’s GPT-Image-1 is the runner-up.

Results of multi-turn editing. Instruction following, content consistency, and overall across three sequential editing turns. Overall is the geometric mean of instruction following and content consistency. Best per column is shown in dark red; second-best in light red. Latency is seconds per image (lower is better).

  • 🍌 Nano Banana: Best speed–quality trade-off, with top Overall scores at Turn 1/Turn 2/Turn 3 (81.48 / 67.70 / 56.24) and fast latency (9.7 s/img).
  • 🎯 GPT-Image-1: Strongest instruction following, but slowed by high latency (71.3 s/img) and weaker consistency; Nano Banana trails it by only ~4.2 (Turn 2) and ~3.0 (Turn 3) points on instruction following.
  • 🔄 Consistency: FLUX.1-Kontext-dev leads, with Nano Banana close behind; GPT-Image-1 ranks second-to-last because it regenerates or restyles content more aggressively.
  • 🌐 Qwen-Image-Edit: Best open-source editing model. Strong at Turn 1 (Overall 78.36) but degrades quickly over additional turns, likely from single-turn training and limited edit-history handling.
| Technique | Model | Date | Latency (s/img) | Instruction Following (T1 / T2 / T3) | Content Consistency (T1 / T2 / T3) | Overall (T1 / T2 / T3) |
| --- | --- | --- | --- | --- | --- | --- |
| Autoregressive | Nano Banana | 25.08.26 | 9.70 | 70.70 / 50.66 / 35.35 | 93.91 / 90.48 / 89.48 | 81.48 / 67.70 / 56.24 |
| Autoregressive | GPT-Image-1 | 25.07.16 | 71.30 | 73.12 / 54.89 / 38.35 | 81.00 / 77.78 / 75.50 | 76.96 / 65.34 / 53.81 |
| Autoregressive | Gemini 2.0 Flash | 25.02.05 | 8.20 | 68.07 / 45.96 / 28.42 | 90.58 / 85.10 / 80.88 | 78.52 / 62.54 / 47.94 |
| Flow Matching | Qwen-Image-Edit | 25.08.04 | 115.08 | 72.90 / 44.06 / 22.55 | 84.22 / 80.52 / 77.98 | 78.36 / 59.56 / 41.93 |
| Flow Matching | Step1X-Edit | 25.04.25 | 20.42 | 61.89 / 34.97 / 17.83 | 92.76 / 88.52 / 85.21 | 75.77 / 55.64 / 38.98 |
| Flow Matching | FLUX.1-Kontext-dev | 25.06.25 | 29.21 | 59.97 / 32.69 / 16.61 | 95.32 / 92.24 / 90.22 | 75.61 / 54.91 / 38.71 |
| Flow Matching | OmniGen | 24.09.11 | 19.70 | 54.72 / 24.48 / 10.66 | 93.00 / 88.42 / 83.92 | 71.34 / 46.52 / 29.91 |
| Diffusion | AnyEdit | 24.11.24 | 3.93 | 41.07 / 16.32 / 7.22 | 86.42 / 78.91 / 70.10 | 59.58 / 35.89 / 22.50 |
| Diffusion | UltraEdit | 24.07.07 | 3.15 | 51.37 / 17.70 / 6.36 | 86.80 / 84.50 / 82.40 | 66.78 / 38.67 / 22.89 |
| Diffusion | MagicBrush | 23.06.16 | 4.08 | 42.31 / 15.73 / 4.90 | 86.96 / 81.26 / 76.86 | 60.66 / 35.75 / 19.41 |
| Diffusion | InstructPix2Pix | 23.12.15 | 4.09 | 37.41 / 10.66 / 2.80 | 76.85 / 68.36 / 60.30 | 53.62 / 26.99 / 12.99 |

Based on the data in this table, Nano Banana offers the best speed–quality trade-off across the three turns: highest Overall at T1/T2/T3 (81.48 / 67.70 / 56.24) with 9.7 s/img latency. GPT-Image-1 delivers the strongest instruction following at every turn, but its latency (71.3 s/img)* and weaker consistency keep it behind Nano Banana in Overall; Nano Banana trails GPT-Image-1 by only ~4.2 (T2) and ~3.0 (T3) points on instruction following. For consistency, FLUX.1-Kontext-dev leads across turns with Nano Banana close behind, whereas GPT-Image-1 ranks second-to-last, consistent with more regenerative/restyling behavior that can erode pixel- or feature-level stability despite aesthetic gains. Gemini 2.0 Flash is competitive at T1 (second-best Overall) but declines more steeply by T3. Among open-source systems, Qwen-Image-Edit is strongest at T1 (Overall 78.36) yet degrades rapidly with additional turns, likely due to exposure bias from single-turn training on real images and a short edit-history window that forces the model to operate on its own outputs.

* Closed-source latencies were measured in the provider’s hosted web UI; open-source latencies on a single NVIDIA A100 GPU with default settings.
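
As a quick check on the Overall column, the geometric-mean definition from the table caption can be applied directly to the instruction-following and content-consistency values. A minimal sketch in Python:

```python
from math import sqrt

def overall(instruction_following: float, content_consistency: float) -> float:
    """Overall score as the geometric mean of the two sub-scores (both on a 0-100 scale)."""
    return sqrt(instruction_following * content_consistency)

# Nano Banana, Turn 1: sqrt(70.70 * 93.91) ~= 81.48, matching the table above.
print(round(overall(70.70, 93.91), 2))  # 81.48
```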

Overview of Our Workflow

Overview of our workflow and representative model performance. For visualization we adopt two thresholds: a consistency score of at least 90 and a visual quality score of at least 6. Details of the automated evaluation pipeline are provided in Figure 2 and Section 2. In multi-turn editing the models exhibit distinct weaknesses—GPT-Image-1 struggles with content consistency, Qwen-Image-Edit underperforms in both visual quality and content consistency, and FLUX.1-Kontext-dev lags in instruction following—whereas Nano Banana shows no single dominant weakness. A comprehensive analysis is presented in Section 4 and Table 2.

Overview of the EdiVal-Agent workflow

Framework of EdiVal-Agent

Framework of EdiVal-Agent. It first decomposes images into semantically meaningful objects, such as “metal yellow sign” and “metal brown pole”, and identifies their contextual relationships (e.g., both are in the foreground). It then generates diverse, grounded editing scenarios based on that analysis (for example, “Change the color of metal brown pole to gray”). Finally, it evaluates editor outputs along instruction following, content consistency, and visual quality by integrating Qwen2.5-VL with Grounding-DINO for instruction checks, DINOv3 features plus pixel-level L1 distances for consistency, and HPSv3 for human preference alignment. The agentic pipeline is tool-agnostic and can readily incorporate stronger experts as they become available.

Framework diagram of EdiVal-Agent
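
To make the instruction check concrete, here is a minimal sketch of how a detector verdict and a VLM verdict might be combined for a color-change instruction. The helpers `detect_boxes` and `ask_vlm` are hypothetical placeholders for an open-vocabulary detector (e.g., Grounding-DINO) and a VLM (e.g., Qwen2.5-VL); the exact prompts and thresholds used by EdiVal-Agent may differ, and a PIL-style image object is assumed.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float
    label: str
    score: float

# Hypothetical placeholders: wire these to an open-vocabulary detector
# (e.g., Grounding-DINO) and a VLM (e.g., Qwen2.5-VL) of your choice.
def detect_boxes(image, text_query: str) -> list[Box]: ...
def ask_vlm(image, question: str) -> bool: ...

def check_color_change(edited_image, target_object: str, new_color: str,
                       det_threshold: float = 0.35) -> bool:
    """Instruction check for 'Change the color of <object> to <color>'.

    The detector confirms the object is still present and localizable;
    the VLM, prompted on the detected crop, confirms the new attribute.
    """
    boxes = [b for b in detect_boxes(edited_image, target_object)
             if b.score >= det_threshold]
    if not boxes:
        return False  # target object missing -> instruction not followed
    best = max(boxes, key=lambda b: b.score)
    crop = edited_image.crop((best.x0, best.y0, best.x1, best.y1))  # PIL-style crop
    return ask_vlm(crop, f"Is the {target_object} now {new_color}? Answer yes or no.")
```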

Why does Qwen-Image-Edit fail in multi-turn editing? Exposure bias: evidence from marginal instruction following across turns

Line plot of marginal instruction-following success rate across turns

We analyze instruction-following ability across nine edit tasks. For a given turn, the marginal task success rate is the proportion of prompts for which the requested edit in that turn is successfully implemented. In contrast, the instruction following score reported in Table 2 corresponds to the multi-turn task success rate at turn i—the fraction of images for which all edits up to turn i are successful.
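
Both rates can be derived from the same per-turn outcomes. A minimal sketch, assuming a boolean matrix `success[i, t]` that records whether the turn-(t+1) edit of sequence i was judged correct:

```python
import numpy as np

# success[i, t] is True if the turn-(t+1) edit of sequence i was judged correct.
success = np.array([
    [True,  False, True ],
    [True,  True,  False],
    [False, True,  True ],
])  # toy data: 3 sequences x 3 turns

# Marginal task success rate at turn t: fraction of sequences whose turn-t edit succeeded.
marginal = success.mean(axis=0)

# Multi-turn task success rate at turn t (as in Table 2): fraction of sequences
# for which ALL edits up to and including turn t succeeded.
multi_turn = np.logical_and.accumulate(success, axis=1).mean(axis=0)

print(marginal)    # ~ [0.67, 0.67, 0.67]
print(multi_turn)  # ~ [0.67, 0.33, 0.00]
```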

Figure 4 shows the evolution of per-turn performance. Models that are AR (autoregressive) and condition on the full editing history are relatively stable across turns: their marginal task success rates change only slightly between turns. By contrast, non-AR models with a very short history window (effectively conditioning only on the previous output) suffer substantial degradation in marginal task success, particularly for flow-matching style models.

A striking example is Qwen-Image-Edit. It is the strongest open-source system at turn 1 (Overall 78.36 vs. 81.48 for Nano Banana) but degrades much faster over subsequent turns. We hypothesize that this is primarily an exposure bias issue: many single-turn edit models are trained to operate on real images and ground-truth inputs, not on their own earlier outputs. When a model must operate on its own previous edits, small distribution mismatches compound across turns and reduce stability—this effect is exacerbated when the model can attend only to a short slice of the history.

Content Consistency: more than just following instructions.

Consistency metrics rely on Grounding-DINO to capture the same spatial regions across the editing trajectory. DINOv3 embeddings and pixel distances are computed over those regions to detect drift in untouched objects and background areas.
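
As a rough sketch of the object-level score: the same detector box is cropped from the base and edited images and compared with a semantic term plus a pixel-level term. The `embed` placeholder (standing in for a DINOv3-style feature extractor), the 50/50 blend, and the 0–100 rescaling are illustrative assumptions rather than the exact EdiVal-Agent formula.

```python
import numpy as np

def embed(region: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for a semantic feature extractor (e.g., a DINOv3 encoder)."""
    ...

def object_consistency(base_img: np.ndarray, edited_img: np.ndarray,
                       box: tuple[int, int, int, int],
                       alpha: float = 0.5) -> float:
    """Consistency of one unedited object region, rescaled to a 0-100 range.

    box = (x0, y0, x1, y1) in pixel coordinates from the base image; the same
    coordinates are cropped from the edited image (images are H x W x 3 arrays).
    """
    x0, y0, x1, y1 = box
    a = base_img[y0:y1, x0:x1].astype(np.float32) / 255.0
    b = edited_img[y0:y1, x0:x1].astype(np.float32) / 255.0

    # Semantic term: cosine similarity between the two crop embeddings, in [-1, 1].
    fa, fb = embed(a), embed(b)
    cos = float(fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-8))

    # Pixel term: mean absolute difference, in [0, 1].
    l1 = float(np.abs(a - b).mean())

    # Illustrative combination: blend the two terms and rescale to 0-100.
    return 100.0 * (alpha * (cos + 1.0) / 2.0 + (1.0 - alpha) * (1.0 - l1))
```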

How We Measure Object Consistency

Object consistency visualization for raw input
Raw input
Object consistency visualization for Nano Banana
Nano Banana (98.05)
Object consistency visualization for GPT-Image-1
GPT-Image-1 (95.19)
Object consistency visualization for Qwen-Image-Edit
Qwen-Image-Edit (94.96)

Illustration of object consistency. Instruction: “Remove brick beige house.” The grounding box, extracted from the raw input image, highlights the localized region used to compute unchanged-object consistency. The corresponding consistency score is shown in parentheses.

How We Measure Background Consistency

Base image showing the background mask used for consistency scoring
Base image (+ mask)
Edited image illustrating background drift
Qwen-Image-Edit (Turn 3)

Illustration of background consistency. Instruction: “Remove brick beige house.” The background mask (in black) is derived by excluding all detected object regions from the entire image. The background consistency score is computed over this masked area to assess unintended alterations to the background.
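
A minimal sketch of how the background region can be constructed from the detector boxes of the base image; the exact masking and scoring used by EdiVal-Agent may differ.

```python
import numpy as np

def background_mask(height: int, width: int,
                    boxes: list[tuple[int, int, int, int]]) -> np.ndarray:
    """Boolean mask that is True on background pixels, i.e. everything outside all boxes."""
    mask = np.ones((height, width), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = False
    return mask

def background_l1(base_img: np.ndarray, edited_img: np.ndarray,
                  boxes: list[tuple[int, int, int, int]]) -> float:
    """Mean absolute pixel change over the background region only (images are H x W x 3)."""
    h, w = base_img.shape[:2]
    mask = background_mask(h, w, boxes)
    a = base_img.astype(np.float32) / 255.0
    b = edited_img.astype(np.float32) / 255.0
    return float(np.abs(a - b)[mask].mean())
```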

Visual Quality: aesthetic appeal or fidelity preservation?

Beyond instruction following and content consistency, the perceptual quality of the edited image is a key dimension. We therefore report (i) a learned aesthetic score and (ii) several low-level image statistics that can surface systematic artifacts and drift in multi-turn editing pipelines.

| Technique | Model | HPSv3 (T1 / T2 / T3) | Abs. Δ vs. Base (T1 / T2 / T3) |
| --- | --- | --- | --- |
| Autoregressive | Nano Banana | 4.94 / 5.12 / 5.26 | 0.56 / 0.73 / 0.88 |
| Autoregressive | GPT-Image-1 | 6.65 / 6.59 / 6.56 | 2.27 / 2.21 / 2.18 |
| Autoregressive | Gemini 2.0 Flash | 4.44 / 4.23 / 4.07 | 0.05 / 0.15 / 0.32 |
| Flow Matching | Qwen-Image-Edit | 5.86 / 5.72 / 5.15 | 1.47 / 1.34 / 0.77 |
| Flow Matching | Step1X-Edit | 4.06 / 3.34 / 2.76 | 0.33 / 1.04 / 1.63 |
| Flow Matching | FLUX.1-Kontext-dev | 5.12 / 5.07 / 5.04 | 0.73 / 0.69 / 0.65 |
| Flow Matching | OmniGen | 4.61 / 4.07 / 3.50 | 0.23 / 0.31 / 0.89 |
| Diffusion | AnyEdit | 3.66 / 2.80 / 1.95 | 0.72 / 1.58 / 2.44 |
| Diffusion | UltraEdit | 4.79 / 4.68 / 4.36 | 0.41 / 0.30 / 0.02 |
| Diffusion | MagicBrush | 3.85 / 3.08 / 2.36 | 0.54 / 1.30 / 2.02 |
| Diffusion | InstructPix2Pix | 3.20 / 2.38 / 1.44 | 1.18 / 2.01 / 2.94 |

We quantify aesthetics with HPSv3, which we found to generalize reliably to our generated images, whereas alternatives (e.g., RAHF) underperform in this setting. We do not fold these quality metrics into the aggregate “Overall” score, as preferences differ on whether an edited image should strictly preserve the input style or pursue beautification.

To disentangle these preferences, we report the absolute change in aesthetic score relative to the base image: Δ_i = |HPS_{turn i} − HPS_{base}|. Smaller Δ indicates stronger style fidelity to the base image; larger Δ reflects greater beautification or stylistic drift. As summarized in the table above, GPT-Image-1 achieves the highest aesthetic scores across turns and remains stable. Qwen-Image-Edit is the next strongest on absolute HPS. For preserving the base image’s look (small Δ), Gemini 2.0 Flash shows the least drift, with Nano Banana also performing well.
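
The drift itself is a one-liner; the sketch below applies it to the Qwen-Image-Edit example shown underneath (input HPS 4.25, then 6.19 / 4.19 / 3.34 across turns).

```python
def hps_drift(hps_base: float, hps_turns: list[float]) -> list[float]:
    """Absolute aesthetic drift |HPS_turn_i - HPS_base| for each turn."""
    return [round(abs(h - hps_base), 2) for h in hps_turns]

# Qwen-Image-Edit example below: input HPS 4.25, then 6.19 / 4.19 / 3.34 per turn.
print(hps_drift(4.25, [6.19, 4.19, 3.34]))  # [1.94, 0.06, 0.91]
```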

Input image
Input · HPS 4.25
Qwen-Image-Edit after turn one
Turn 1 · HPS 6.19
Qwen-Image-Edit after turn two
Turn 2 · HPS 4.19
Qwen-Image-Edit after turn three
Turn 3 · HPS 3.34

This figure provides qualitative examples from Qwen-Image-Edit. The edited images exhibit elevated luminance and noticeable high-frequency bright artifacts (e.g., white streaks or “line” textures) that degrade perceptual quality, with luminance quintiles increasing substantially. Correspondingly, HPS drops from 6.19 to 4.19 and 3.34, suggesting that HPS is sensitive to over-exposure to some extent. In contrast, when querying VLMs about the visual quality of these images, the returned scores do not change in the first two turns and remain consistently above 50, reflecting a positive evaluation under the [0, 100] scale—even though the T2/T3 edited images show significant artifacts.
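
The over-exposure pattern can also be surfaced with a cheap low-level statistic. A minimal sketch that computes per-image luminance quintiles; the Rec. 601 luma weights are an assumption about the exact luminance definition.

```python
import numpy as np

def luminance_quintiles(img_rgb: np.ndarray) -> np.ndarray:
    """20th/40th/60th/80th/100th percentiles of per-pixel luminance, in [0, 255].

    A large upward shift of the upper quintiles across turns is a cheap signal
    of over-exposure or bright high-frequency artifacts.
    """
    # Rec. 601 luma approximation (assumed; other weightings behave similarly).
    lum = (0.299 * img_rgb[..., 0].astype(np.float32)
           + 0.587 * img_rgb[..., 1].astype(np.float32)
           + 0.114 * img_rgb[..., 2].astype(np.float32))
    return np.percentile(lum, [20, 40, 60, 80, 100])
```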

Multi-Turn vs. Complex Prompts: The power of Chain of Edit.

Autoregressive editors benefit from step-by-step execution: running the edits turn by turn outperforms a single complex prompt. Flow-matching models instead do better with the complex prompt, which avoids repeated exposure to their own outputs.

| Technique | Model | Multi-turn (T3) | Complex (C3) |
| --- | --- | --- | --- |
| Autoregressive | Nano Banana | 35.35 | 28.14 |
| Autoregressive | GPT-Image-1 | 38.35 | 28.78 |
| Autoregressive | Gemini 2.0 Flash | 28.42 | 21.89 |
| Flow Matching | Qwen-Image-Edit | 22.55 | 27.62 |
| Flow Matching | Step1X-Edit | 17.83 | 15.73 |
| Flow Matching | FLUX.1-Kontext-dev | 16.61 | 19.58 |
| Flow Matching | OmniGen | 10.66 | 11.01 |
| Diffusion | AnyEdit | 7.22 | 2.80 |
| Diffusion | UltraEdit | 6.36 | 8.22 |
| Diffusion | MagicBrush | 4.90 | 4.55 |
| Diffusion | InstructPix2Pix | 2.80 | 2.80 |

Appendix · Multi-turn Quality Examples

T1: Change the color of pumpkin to purple; T2: Change the background to forest; T3: Remove fabric orange bow. Each strip shows the input image followed by three consecutive turns executed by the same model.

Nano Banana

Nano Banana input
Input
Nano Banana turn 1
Turn 1
Nano Banana turn 2
Turn 2
Nano Banana turn 3
Turn 3

GPT-Image-1

GPT-Image-1 input
Input
GPT-Image-1 turn 1
Turn 1
GPT-Image-1 turn 2
Turn 2
GPT-Image-1 turn 3
Turn 3

Gemini 2.0 Flash

Gemini input
Input
Gemini turn 1
Turn 1
Gemini turn 2
Turn 2
Gemini turn 3
Turn 3

Qwen-Image-Edit

Qwen input
Input
Qwen turn 1
Turn 1
Qwen turn 2
Turn 2
Qwen turn 3
Turn 3

Step1X-Edit

Step1X input
Input
Step1X turn 1
Turn 1
Step1X turn 2
Turn 2
Step1X turn 3
Turn 3

FLUX.1-Kontext-dev

FLUX input
Input
FLUX turn 1
Turn 1
FLUX turn 2
Turn 2
FLUX turn 3
Turn 3

OmniGen

OmniGen input
Input
OmniGen turn 1
Turn 1
OmniGen turn 2
Turn 2
OmniGen turn 3
Turn 3

AnyEdit

AnyEdit input
Input
AnyEdit turn 1
Turn 1
AnyEdit turn 2
Turn 2
AnyEdit turn 3
Turn 3

UltraEdit

UltraEdit input
Input
UltraEdit turn 1
Turn 1
UltraEdit turn 2
Turn 2
UltraEdit turn 3
Turn 3

MagicBrush

MagicBrush input
Input
MagicBrush turn 1
Turn 1
MagicBrush turn 2
Turn 2
MagicBrush turn 3
Turn 3

InstructPix2Pix

InstructPix2Pix input
Input
InstructPix2Pix turn 1
Turn 1
InstructPix2Pix turn 2
Turn 2
InstructPix2Pix turn 3
Turn 3

BibTeX

@article{chen2025edival,
  title={EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing},
  author={Chen, Tianyu and Zhang, Yasi and Zhang, Zhi and Yu, Peiyu and Wang, Shu and Wang, Zhendong and Lin, Kevin and Wang, Xiaofei and Yang, Zhengyuan and Li, Linjie and Lin, Chung-Ching and Xie, Jianwen and Leong, Oscar and Wang, Lijuan and Wu, Ying Nian and Zhou, Mingyuan},
  journal={arXiv preprint arXiv:2509.13399},
  year={2025},
  url={https://arxiv.org/abs/2509.13399}
}