EdiVal-Agent
Automated, object-centric evaluation for multi-turn instruction-based image editing.
Tianyu Chen*1, Yasi Zhang*2, Zhi Zhang2, Peiyu Yu2, Shu Wang2, Zhendong Wang3, Kevin Lin3, Xiaofei Wang3, Zhengyuan Yang3, Linjie Li3, Chung-Ching Lin3, Jianwen Xie4, Oscar Leong†2, Lijuan Wang†3, Ying Nian Wu†2, Mingyuan Zhou†1,3
1University of Texas at Austin 2University of California, Los Angeles 3Microsoft 4Lambda, Inc.
*Equal contribution †Equal advising
Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images—resulting in limited coverage and inheriting biases from prior generative models—or (ii) rely solely on zero-shot vision–language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise.
To address this, we introduce EdiVal-Agent, an automated and fine-grained evaluation framework grounded in an object-centric perspective that handles both single-turn and multi-turn instruction-based editing. Given an input image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions while dynamically updating object pools across turns.
These stages enable three complementary metrics tailored for multi-turn evaluation: 1) EdiVal-IF, which measures instruction following by combining open-vocabulary object detectors for symbolic checks with VLMs for detector-guided semantic verification; 2) EdiVal-CC, which evaluates content consistency by computing semantic similarity of unchanged objects and background using the evolving object pools; and 3) EdiVal-VQ, which quantifies changes in overall visual quality with human preference models.
Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering nine instruction types and thirteen state-of-the-art editing models across in-context autoregressive, flow-matching, and diffusion paradigms. We label closed-source systems such as GPT-Image-1, Nano Banana, and Gemini 2.0 Flash Image as in-context because they integrate with autoregressive language models in web interfaces and support in-context multi-turn editing. We further compare multi-turn editing with single-shot complex editing to surface paradigm-specific behaviors, and show that EdiVal-Agent pinpoints existing failure modes to guide the next generation of editing models.
🏆 Seedream 4.0 leads overall; Nano Banana keeps the speed crown.
Results of multi-turn editing. We report EdiVal-IF (instruction following), EdiVal-CC (content consistency), and EdiVal-O (overall) over three turns, alongside latency. EdiVal-O is the geometric mean of EdiVal-IF and EdiVal-CC. Dark red marks the best score per column, light red the runner-up.
- 🌟 Seedream 4.0: New overall leader with EdiVal-O scores of 83.81 / 69.95 / 59.76 and strong instruction following across turns.
- 🍌 Nano Banana: Fastest high-performing model (9.7 s/img) and still second overall, with balanced EdiVal-IF and EdiVal-CC.
- 🎯 GPT-Image-1: Highest EdiVal-IF among the in-context (autoregressive) models, but by far the slowest closed-source system at 71.3 s/img.
- 🌐 Qwen-Image-Edit: Strong turn-one EdiVal-IF but a steep drop by turn three, underscoring exposure-bias concerns for open models; still the best open-source performer.
| Technique | Model | In-Context | Latency (s/img) | EdiVal-IF T1 | EdiVal-IF T2 | EdiVal-IF T3 | EdiVal-CC T1 | EdiVal-CC T2 | EdiVal-CC T3 | EdiVal-O T1 | EdiVal-O T2 | EdiVal-O T3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unknown | Seedream 4.0 | No | 15.78 | 75.93 | 55.58 | 41.59 | 92.51 | 88.03 | 85.86 | 83.81 | 69.95 | 59.76 |
| | Nano Banana | Yes | 9.70 | 70.70 | 50.66 | 35.35 | 93.91 | 90.48 | 89.48 | 81.48 | 67.70 | 56.24 |
| | GPT-Image-1 | Yes | 71.30 | 73.12 | 54.89 | 38.35 | 81.00 | 77.78 | 75.50 | 76.96 | 65.34 | 53.81 |
| | Gemini 2.0 Flash | Yes | 8.20 | 68.07 | 45.96 | 28.42 | 90.58 | 85.10 | 80.88 | 78.52 | 62.54 | 47.94 |
| Flow Matching | FLUX.1-Kontext-max | No | 10.34 | 69.49 | 46.89 | 31.83 | 93.93 | 90.90 | 88.40 | 80.79 | 65.29 | 53.04 |
| | Qwen-Image-Edit | No | 115.08 | 72.90 | 44.06 | 22.55 | 84.22 | 80.52 | 77.98 | 78.36 | 59.56 | 41.93 |
| | Step1X-Edit | No | 20.42 | 61.89 | 34.97 | 17.83 | 92.76 | 88.52 | 85.21 | 75.77 | 55.64 | 38.98 |
| | FLUX.1-Kontext-dev | No | 29.21 | 59.97 | 32.69 | 16.61 | 95.32 | 92.24 | 90.22 | 75.61 | 54.91 | 38.71 |
| | OmniGen | No | 19.70 | 54.72 | 24.48 | 10.66 | 93.00 | 88.42 | 83.92 | 71.34 | 46.52 | 29.91 |
| Diffusion | AnyEdit | No | 3.93 | 41.07 | 16.32 | 7.22 | 86.42 | 78.91 | 70.10 | 59.58 | 35.89 | 22.50 |
| | UltraEdit | No | 3.15 | 51.37 | 17.70 | 6.36 | 86.80 | 84.50 | 82.40 | 66.78 | 38.67 | 22.89 |
| | MagicBrush | No | 4.08 | 42.31 | 15.73 | 4.90 | 86.96 | 81.26 | 76.86 | 60.66 | 35.75 | 19.41 |
| | InstructPix2Pix | No | 4.09 | 37.41 | 10.66 | 2.80 | 76.85 | 68.36 | 60.30 | 53.62 | 26.99 | 12.99 |
Seedream 4.0 comes out on top: it delivers the best results across all three rounds of edits while staying reasonably quick at about 16 seconds per image. Nano Banana hits the best speed-quality sweet spot at roughly 10 seconds per image, placing second overall and staying close to Seedream 4.0 in both instruction following and content consistency. GPT-Image-1 excels at doing exactly what you ask, but its very long processing time (around 71 seconds per image) and weaker content consistency pull down its overall score; it tends to favor eye-catching outputs over faithful preservation of unedited content. Among open-source tools, Qwen-Image-Edit starts strong but drops off in later rounds, likely because small errors snowball with each edit. Overall, closed-source tools still hold a clear edge, and aside from Qwen-Image-Edit, our rankings match a major community leaderboard based on human votes as of September 12, 2025.
* Closed-source latencies were measured in the provider’s hosted web UI; open-source latencies on a single NVIDIA A100 GPU with default settings.
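For reference, EdiVal-O in the table above is the geometric mean of EdiVal-IF and EdiVal-CC at each turn. A minimal sketch of that aggregation (the function name is ours, not part of any released API):

```python
import math

def edival_overall(if_score: float, cc_score: float) -> float:
    """Aggregate instruction following (EdiVal-IF) and content consistency
    (EdiVal-CC) into the overall score (EdiVal-O) via the geometric mean."""
    return math.sqrt(if_score * cc_score)

# Sanity check against the Seedream 4.0 row at turn 1:
# sqrt(75.93 * 92.51) ≈ 83.81, matching the reported EdiVal-O (T1).
print(round(edival_overall(75.93, 92.51), 2))  # 83.81
```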
Overview of Our Workflow
Overview of our workflow and representative model performance. For visualization we adopt two thresholds: a consistency score of at least 90 and a visual quality score of at least 6. Details of the automated evaluation pipeline are provided in Figure 2 and Section 2. In multi-turn editing the models exhibit distinct weaknesses—GPT-Image-1 struggles with content consistency, Qwen-Image-Edit underperforms in both visual quality and content consistency, and FLUX.1-Kontext-dev lags in instruction following—whereas Nano Banana shows no single dominant weakness. A comprehensive analysis is presented in Section 4 and Table 2.
Framework of EdiVal-Agent
Framework of EdiVal-Agent. It first decomposes images into semantically meaningful objects (such as “metal yellow sign” and “metal brown pole”) and identifies their contextual relationships (e.g., both are in the foreground). It then generates diverse, grounded editing scenarios based on that analysis (for example, “Change the color of metal brown pole to gray”). Finally, it evaluates editor outputs along instruction following, content consistency, and visual quality by integrating Qwen2.5-VL with Grounding-DINO for instruction checks, DINOv3 features plus pixel-level L1 distances for consistency, and HPSv3 for human preference alignment. The agentic pipeline is tool-agnostic and can readily incorporate stronger experts as they become available.
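To make the three stages concrete, here is a heavily condensed sketch of the agentic loop. Every callable (`decompose`, `propose`, `editor`, `evaluate`, `update_pool`) is a hypothetical stand-in for the actual tool calls (VLM, Grounding-DINO, HPSv3, etc.), not the released implementation:

```python
def run_edival_agent(image, editor, decompose, propose, evaluate, update_pool, num_turns=3):
    """Sketch of the three-stage EdiVal-Agent loop (illustrative only).

    decompose(image)            -> object pool (name -> attributes / location)
    propose(pool)               -> a context-aware editing instruction
    editor(image, instruction)  -> edited image (the model under evaluation)
    evaluate(...)               -> per-turn IF / CC / VQ scores
    update_pool(pool, instr)    -> object pool after the edit is applied
    """
    pool = decompose(image)                       # Stage 1: object-centric decomposition
    current, scores = image, []
    for _ in range(num_turns):
        instruction = propose(pool)               # Stage 2: grounded instruction synthesis
        current = editor(current, instruction)    # run the editing model under test
        scores.append(evaluate(image, current, instruction, pool))  # Stage 3: scoring
        pool = update_pool(pool, instruction)     # keep the object pool in sync across turns
    return scores
```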
Why Does Qwen-Image-Edit Fail in Multi-Turn Editing? Exposure Bias: Evidence from Marginal Instruction Following Across Turns
When we look at how well each model follows instructions over multiple editing steps, we measure two things. The first is the marginal success rate—how often a model successfully applies the edit for a specific turn. The second, shown in the results table, is the multi-turn success rate, which checks if the model keeps succeeding across all edits in a sequence.
Top performers like Seedream 4.0, Nano Banana, and FLUX.1-Kontext-max manage to stay consistent across turns, even though some of them do not actually “remember” earlier edits. Other models, however, struggle more as the sequence gets longer.
A good example is Qwen-Image-Edit. It starts strong, nearly matching the leaders on the first edit, but its performance drops quickly afterward. This likely happens because it was trained mostly on perfect, single-edit examples—not on its own previous outputs. When asked to keep refining its own generations, small mistakes tend to build up over time, especially if the model cannot fully reference its past work.
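To make the two rates concrete, here is a minimal sketch using our own pass/fail bookkeeping, not the benchmark's actual code:

```python
def marginal_rate(records, turn):
    """Fraction of sequences whose edit at `turn` (0-indexed) succeeded,
    regardless of what happened in earlier turns."""
    return sum(r[turn] for r in records) / len(records)

def multi_turn_rate(records, turn):
    """Fraction of sequences that succeeded at every turn up to and
    including `turn` -- the cumulative metric reported in the table."""
    return sum(all(r[: turn + 1]) for r in records) / len(records)

# records[i][t] is True if sequence i passed the instruction check at turn t.
records = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [False, True, True],
]
print(marginal_rate(records, 1))    # 0.75: three of four turn-2 edits succeeded
print(multi_turn_rate(records, 1))  # 0.5: only two sequences passed turns 1 and 2
```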
Content Consistency: more than just following instructions.
Consistency metrics rely on Grounding-DINO to capture the same spatial regions across the editing trajectory. DINOv3 embeddings and pixel distances are computed over those regions to detect drift in untouched objects and background areas.
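A simplified sketch of that computation is below. The `encode` argument stands in for a DINOv3-style feature extractor, and the box format and the equal weighting of the semantic and pixel terms are our assumptions for illustration, not the exact EdiVal-CC recipe:

```python
import numpy as np

def region_consistency(img_before, img_after, box, encode):
    """Score how much a supposedly-unchanged region drifted between turns.

    img_before, img_after: H x W x 3 uint8 arrays on the same canvas.
    box: (x0, y0, x1, y1) pixel box from the detector on the raw input image.
    encode: callable mapping an image crop to a 1-D feature vector
            (a DINOv3-style backbone in the actual pipeline).
    """
    x0, y0, x1, y1 = box
    crop_a = img_before[y0:y1, x0:x1]
    crop_b = img_after[y0:y1, x0:x1]

    # Semantic similarity of the two crops (cosine similarity of features).
    fa, fb = encode(crop_a), encode(crop_b)
    cos = float(np.dot(fa, fb) / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-8))

    # Low-level pixel drift (mean L1 distance, normalized to [0, 1]).
    l1 = float(np.abs(crop_a.astype(np.float32) - crop_b.astype(np.float32)).mean() / 255.0)

    # Blend the two signals; the 0.5 / 0.5 weighting is an assumption.
    return 100.0 * (0.5 * cos + 0.5 * (1.0 - l1))
```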
How We Measure Object Consistency
Illustration of object consistency. Instruction: “Remove brick beige house.” The grounding box, extracted from the raw input image, highlights the localized region used to compute unchanged-object consistency. The corresponding consistency score is shown in brackets.
How We Measure Background Consistency
Illustration of background consistency. Instruction: “Remove brick beige house.” The background mask (in black) is derived by excluding all detected object regions from the entire image. The background consistency score is computed over this masked area to assess unintended alterations to the background.
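A minimal sketch of building such a mask by excluding every detected object box, with the box format and the L1 drift statistic as our assumptions for illustration:

```python
import numpy as np

def background_mask(height, width, boxes):
    """Return a boolean mask that is True on background pixels only,
    i.e. everywhere except inside the detected object boxes."""
    mask = np.ones((height, width), dtype=bool)
    for x0, y0, x1, y1 in boxes:          # (x0, y0, x1, y1) pixel coordinates (assumed format)
        mask[y0:y1, x0:x1] = False        # exclude each detected object region
    return mask

def background_l1(img_before, img_after, mask):
    """Mean L1 drift between input and edited image over background pixels only."""
    diff = np.abs(img_before.astype(np.float32) - img_after.astype(np.float32))
    return float(diff[mask].mean() / 255.0)
```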
Visual Quality: aesthetic appeal or fidelity preservation?
Beyond instruction following and content consistency, the perceptual quality of the edited image is a key dimension. We therefore report (i) a learned aesthetic score and (ii) several low-level image statistics that can surface systematic artifacts and drift in multi-turn editing pipelines.
| Technique | Model | HPS T1 | HPS T2 | HPS T3 | Δ vs. Base T1 | Δ vs. Base T2 | Δ vs. Base T3 |
|---|---|---|---|---|---|---|---|
| Unknown | Seedream 4.0 | 5.14 | 5.15 | 5.15 | 0.76 | 0.77 | 0.77 |
| | Nano Banana | 4.94 | 5.12 | 5.26 | 0.56 | 0.73 | 0.88 |
| | GPT-Image-1 | 6.65 | 6.59 | 6.56 | 2.27 | 2.21 | 2.18 |
| | Gemini 2.0 Flash | 4.44 | 4.23 | 4.07 | 0.05 | 0.15 | 0.32 |
| Flow Matching | FLUX.1-Kontext-max | 5.12 | 5.07 | 5.04 | 0.41 | 0.49 | 0.47 |
| | Qwen-Image-Edit | 5.86 | 5.72 | 5.15 | 1.47 | 1.34 | 0.77 |
| | Step1X-Edit | 4.06 | 3.34 | 2.76 | 0.33 | 1.04 | 1.63 |
| | FLUX.1-Kontext-dev | 5.12 | 5.07 | 5.04 | 0.73 | 0.69 | 0.65 |
| | OmniGen | 4.61 | 4.07 | 3.50 | 0.23 | 0.31 | 0.89 |
| Diffusion | AnyEdit | 3.66 | 2.80 | 1.95 | 0.72 | 1.58 | 2.44 |
| | UltraEdit | 4.79 | 4.68 | 4.36 | 0.41 | 0.30 | 0.02 |
| | MagicBrush | 3.85 | 3.08 | 2.36 | 0.54 | 1.30 | 2.02 |
| | InstructPix2Pix | 3.20 | 2.38 | 1.44 | 1.18 | 2.01 | 2.94 |
We quantify aesthetics with HPSv3, which we found to generalize reliably to our generated images, whereas alternatives (e.g., RAHF) underperform in this setting. We do not fold these quality metrics into the aggregate “Overall” score, as preferences differ on whether an edited image should strictly preserve the input style or pursue beautification.
To disentangle these preferences, we report the absolute change in aesthetic score relative to the base image, $\Delta_i = \lvert \mathrm{HPS}_{\mathrm{turn}\,i} - \mathrm{HPS}_{\mathrm{base}} \rvert$. Smaller Δ indicates stronger style fidelity to the base image; larger Δ reflects greater beautification or stylistic drift. As summarized in the table above, GPT-Image-1 achieves the highest aesthetic scores across turns and remains stable. Qwen-Image-Edit is the next strongest on absolute HPS. For preserving the base image's look (small Δ), Gemini 2.0 Flash shows the least drift, with Nano Banana also performing well.
This figure provides qualitative examples from Qwen-Image-Edit. The edited images exhibit elevated luminance and noticeable high-frequency bright artifacts (e.g., white streaks or “line” textures) that degrade perceptual quality, with luminance quintiles increasing substantially. Correspondingly, HPS drops from 6.19 to 4.19 and 3.34, suggesting that HPS is sensitive to over-exposure to some extent. In contrast, when querying VLMs about the visual quality of these images, the returned scores do not change in the first two turns and remain consistently above 50, reflecting a positive evaluation under the [0, 100] scale—even though the T2/T3 edited images show significant artifacts.
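A small sketch of the kind of low-level statistic referenced here: per-image luminance percentiles, which rise sharply when outputs drift toward over-exposure. Using the standard Rec. 709 luma weights and quintile boundaries is our choice for illustration:

```python
import numpy as np

def luminance_percentiles(img, qs=(20, 40, 60, 80)):
    """Compute luma percentiles for an H x W x 3 uint8 RGB image.

    Rising upper percentiles across turns indicate the over-exposure /
    bright-streak artifacts discussed above (our choice of statistic).
    """
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    luma = 0.2126 * r + 0.7152 * g + 0.0722 * b   # Rec. 709 luma
    return np.percentile(luma, qs)
```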
Multi-Turn vs. Complex Prompts: The Power of Chain-of-Edit
In-context editors benefit from step-by-step reasoning: multi-turn execution outperforms complex single-shot prompts. Flow-matching models, by contrast, tend to prefer complex prompts, which avoid repeated exposure to their own outputs.
| Technique | Model | Multi-turn EdiVal-IF (T3) | Complex-prompt EdiVal-IF (C3) |
|---|---|---|---|
| Unknown | Nano Banana | 35.35 | 28.14 |
| | GPT-Image-1 | 38.35 | 28.78 |
| | Gemini 2.0 Flash | 28.42 | 21.89 |
| Flow Matching | Qwen-Image-Edit | 22.55 | 27.62 |
| | Step1X-Edit | 17.83 | 15.73 |
| | FLUX.1-Kontext-dev | 16.61 | 19.58 |
| | OmniGen | 10.66 | 11.01 |
| Diffusion | AnyEdit | 7.22 | 2.80 |
| | UltraEdit | 6.36 | 8.22 |
| | MagicBrush | 4.90 | 4.55 |
| | InstructPix2Pix | 2.80 | 2.80 |
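The two protocols compared above can be sketched as follows, where `edit` is a hypothetical wrapper around any editor in the table that takes an image and an instruction string:

```python
def run_multi_turn(edit, image, instructions):
    """Chain-of-edit protocol: feed each instruction separately,
    always editing the model's own previous output."""
    current = image
    for instruction in instructions:
        current = edit(current, instruction)
    return current

def run_complex(edit, image, instructions):
    """Single-shot protocol: merge all instructions into one complex
    prompt and apply it to the original image in a single call."""
    # Joining with "; " is an illustrative choice, not the benchmark's exact prompt format.
    return edit(image, "; ".join(instructions))
```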
Appendix · Multi-turn Quality Examples
T1: Change the color of pumpkin to purple
T2: Change the background to forest
T3: Remove fabric orange bow
Each strip shows the input image followed by three consecutive turns executed by the same model.
Models shown (one strip per model): Seedream 4.0, Nano Banana, GPT-Image-1, FLUX.1-Kontext-max, Gemini 2.0 Flash, Qwen-Image-Edit, Step1X-Edit, FLUX.1-Kontext-dev, OmniGen, AnyEdit, UltraEdit, MagicBrush, InstructPix2Pix.
BibTeX
@article{chen2025edival,
title={EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing},
author={Chen, Tianyu and Zhang, Yasi and Zhang, Zhi and Yu, Peiyu and Wang, Shu and Wang, Zhendong and Lin, Kevin and Wang, Xiaofei and Yang, Zhengyuan and Li, Linjie and others},
journal={arXiv preprint arXiv:2509.13399},
year={2025}
}