Explicit reward criteria for visual generation

Auto-Rubric as Reward

From Implicit Preferences to Explicit Multimodal Generative Criteria

Juanxi Tian1,2,*, Fengyuan Liu1,*, Jiaming Han3, Yilei Jiang3, Yongliang Wu4, Yesheng Liu1, Haodong Li1, Furong Xu2, Wanhua Li1,†

1Nanyang Technological University 2Ant Group 3CUHK MMLab 4UIUC

*Equal first authorship †Corresponding author

Auto-Rubric as Reward converts a small amount of labeled visual supervision into readable rubric text, supports both pointwise and pairwise VLM grading, and lets practitioners freely scale up the rubric dimensions they care about. On top of that, we provide a concise pairwise online RL algorithm for diffusion models that emphasizes data efficiency, training stability, and scalability, showing that Rubric as Reward extends beyond multimodal reasoning into multimodal generation, including text-to-image and image editing.

Text-to-Image · Image Editing · VLM-as-Judge · Auto-Rubric · Rubric as Reward
Pipeline Overview
Pipeline overview for Auto-Rubric as Reward
Core Thesis Readable rubrics replace hidden reward heuristics.

The judge no longer improvises a standard from scratch on every comparison.

Cross-Judge Benefit Explicit rubrics make judgment criteria portable.

Once the desired dimensions are written down, different VLM judges stop improvising the evaluation standard and start aligning around the same task-specific notion of quality.

01

Generate Criteria

Start from labeled supervision and ask a VLM to spell out the exact visual dimensions that should matter for scoring or ranking.

02

Verify Before Use

Keep only rubrics that recover the intended answer in pointwise or pairwise settings, and revise them when they miss the target.

03

Auto-Rubric As Reward

Reuse the verified rubric set inside a frozen judge and connect it to a minimal pairwise online RL loop for diffusion training.

Abstract

Auto-Rubric as Reward

Aligning multimodal generative models with human preferences requires reward signals that preserve the compositional and multi-dimensional structure of judgment. Auto-Rubric as Reward (ARR) reframes reward modeling from implicit weight optimization to explicit, criteria-based decomposition: before any comparison, it externalizes a VLM's internalized preference knowledge into prompt-specific rubrics, verifies those criteria against minimal supervision, and consolidates them into a reusable structured protocol for pointwise grading, pairwise evaluation, and reward construction. By converting latent preference structure into inspectable multimodal criteria, ARR reduces positional bias, improves data efficiency, and exposes a stable factorized interface for both zero-shot evaluation and downstream generative alignment.

Why it matters

The central bottleneck is not that VLMs lack preference knowledge, but that existing scalar and pairwise objectives fail to expose a stable factorized interface for applying it. ARR addresses this mismatch by transforming holistic, latent judgments into explicit and independently verifiable multimodal criteria, thereby improving interpretability, reducing reward hacking risk, and suppressing positional bias.

Where it works

ARR supports both pointwise and pairwise VLM evaluation, then extends naturally into multimodal generation through rubric-conditioned policy optimization. In the paper, this interface scales from evaluator fidelity benchmarks to text-to-image and image-editing post-training, where explicit criteria become a reusable supervision substrate rather than a task-specific prompt trick.

Contributions

Designed for interpretability without giving up performance

A

Explicit Reward Language

ARR externalizes implicit multimodal preferences into prompt-conditioned natural-language rubrics that are interpretable, verifiable, and highly data-efficient.

B

Pointwise And Pairwise Auto-Rubric

The same rubric interface supports scalar-style pointwise grading and pairwise comparison, allowing one structured preference representation to unify evaluation, ranking, and reward construction.

C

Diagnosing the Interface Bottleneck

The paper argues that multimodal alignment is bottlenecked less by missing knowledge than by the absence of a stable, factorized interface for expressing and applying preference.

D

A Concise Diffusion Online RL Algorithm

We also introduce a concise pairwise diffusion online RL algorithm that emphasizes data efficiency, training stability, and scalability, validating Rubric as Reward in text-to-image and image editing rather than only multimodal reasoning.
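
To make the last point concrete, below is a minimal sketch of one pairwise online RL step for a diffusion policy. It illustrates the general recipe rather than the paper's exact ARR-RPO objective; sample_with_logprob (sampling an image while accumulating denoising log-probabilities as an autograd-tracked tensor) and pairwise_reward (the rubric-conditioned judge) are assumed hooks.

def pairwise_rl_step(prompt, policy, optimizer, sample_with_logprob, pairwise_reward):
    """One sketch update: sample two images, judge them under the rubric, reinforce the winner.

    Assumes PyTorch-style tensors and optimizer; logp_* are summed denoising log-probs.
    """
    img_a, logp_a = sample_with_logprob(policy, prompt)
    img_b, logp_b = sample_with_logprob(policy, prompt)
    r = pairwise_reward(prompt, img_a, img_b)     # +1 if A preferred, -1 if B, 0 on a tie
    adv_a, adv_b = float(r), float(-r)            # signed advantages for the pair
    loss = -(adv_a * logp_a + adv_b * logp_b)     # push probability toward the judged winner
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())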

Method

From preference examples to reusable ARR rewards

Method diagram for Auto-Rubric as Reward
The pipeline generates task-specific rubrics, verifies them on supervision, organizes the surviving criteria into reusable structure, and connects a frozen ARR judge to pairwise reward construction for diffusion training.
Step 1

Query-Specific Rubric Generation

Each supervised example becomes a prompt for extracting the dimensions that should matter, whether the downstream task needs pointwise scoring or pairwise ranking.
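
As a minimal sketch of this step (assuming a generic text VLM client exposed as call_vlm(prompt) -> str; the prompt template and JSON schema here are illustrative, not the paper's exact ones), rubric generation can look like:

import json
from typing import Callable, List

# Illustrative prompt template; the paper's exact wording may differ.
RUBRIC_PROMPT = """You are writing an evaluation rubric for the generation prompt:
"{prompt}"
List the visual dimensions that should decide quality for this prompt
(e.g. composition, object attributes, lighting, artifacts).
Return a JSON list of {{"criterion": str, "what_to_check": str, "weight": float}}."""

def generate_rubric(prompt: str, call_vlm: Callable[[str], str]) -> List[dict]:
    """Ask a VLM to externalize the criteria that should matter for `prompt`."""
    raw = call_vlm(RUBRIC_PROMPT.format(prompt=prompt))
    criteria = json.loads(raw)                    # assumes the judge returns valid JSON
    total = sum(c["weight"] for c in criteria) or 1.0
    return [{**c, "weight": c["weight"] / total} for c in criteria]  # normalize weights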

Step 2

Verification And Revision

The same rubric must recover the desired supervision signal. If the generated criteria fail under grading, the system refines them instead of passing noisy reward logic downstream.
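
A hedged sketch of this verify-then-revise loop on one labeled pair is shown below; grade (rubric-conditioned pointwise scoring) and revise_rubric (criteria refinement from failure feedback) are hypothetical hooks, and the paper's actual revision protocol may differ.

from typing import Callable, List, Optional

def verify_rubric(rubric: List[dict],
                  image_a: str, image_b: str, human_winner: str,
                  grade: Callable[[List[dict], str], float],
                  revise_rubric: Callable[[List[dict], str], List[dict]],
                  max_rounds: int = 3) -> Optional[List[dict]]:
    """Keep a rubric only if grading under it recovers the labeled winner; else revise."""
    for _ in range(max_rounds):
        score_a = grade(rubric, image_a)          # rubric-conditioned pointwise score
        score_b = grade(rubric, image_b)
        predicted = "A" if score_a > score_b else "B"
        if predicted == human_winner:
            return rubric                          # verified: safe to reuse downstream
        feedback = f"Rubric preferred {predicted}, but the label says {human_winner}."
        rubric = revise_rubric(rubric, feedback)   # refine the criteria, then re-check
    return None                                    # never recovered the label: discard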

Step 3

Structured Reward Reuse

Verified rubrics are grouped into reusable themes and tips, then consumed by a frozen VLM judge whose outputs can power evaluation or pairwise online RL for generation models.
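
One hedged way to turn that frozen judge into a pairwise reward is sketched below; judge_pair is an assumed hook returning "A" or "B" for a rubric-conditioned comparison, and querying both orderings is a simple guard against the positional bias discussed above.

from typing import Callable

def pairwise_reward(rubric_text: str, image_a: str, image_b: str,
                    judge_pair: Callable[[str, str, str], str]) -> float:
    """+1 if A wins under both orderings, -1 if B does, 0 when the judge contradicts itself."""
    first = judge_pair(rubric_text, image_a, image_b)    # A shown first
    swapped = judge_pair(rubric_text, image_b, image_a)  # B shown first
    second = "A" if swapped == "B" else "B"              # map the swapped verdict back
    if first == second:
        return 1.0 if first == "A" else -1.0
    return 0.0                                           # inconsistent verdicts carry no signal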

Example text-to-image rubric produced by Auto-Rubric
Text-to-image rubrics make scene composition, object attributes, lighting realism, material quality, and artifact control explicit.
Example image-edit rubric produced by Auto-Rubric
Editing rubrics emphasize instruction fulfillment, local edit quality, preservation of source content, natural blending, and avoidance of unintended changes.
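
For illustration, a verified text-to-image rubric could be stored in a small structured form like the sketch below; the field names and weights are assumptions, and only the five dimensions come from the caption above.

# Illustrative storage format for one verified text-to-image rubric.
T2I_RUBRIC = [
    {"criterion": "scene composition", "what_to_check": "layout and spatial relations match the prompt", "weight": 0.25},
    {"criterion": "object attributes", "what_to_check": "counts, colors, and named attributes are correct", "weight": 0.25},
    {"criterion": "lighting realism",  "what_to_check": "light direction, shadows, and exposure look plausible", "weight": 0.20},
    {"criterion": "material quality",  "what_to_check": "surfaces and textures read as the intended materials", "weight": 0.15},
    {"criterion": "artifact control",  "what_to_check": "no duplicated limbs, warped text, or blending seams", "weight": 0.15},
]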

Benchmarks

Consistent gains across judges, preference benchmarks, and downstream generation quality

The paper shows two complementary effects: rubric-conditioning makes VLM judging more reliable, and ARR-RPO translates that better supervision into stronger text-to-image and image-editing performance.

Pointwise / Pairwise One rubric interface

Auto-Rubric can serve scalar grading, comparison, and reward construction in one workflow.

Custom Dimensions Scale what matters

Researchers can expand rubric dimensions toward fidelity, preservation, composition, artifacts, or domain-specific constraints.

Generation Scope Beyond reasoning

The paper validates Rubric as Reward on multimodal generation with diffusion-based text-to-image and image-editing training.

Benchmark gains over specialist baseline models
ARR-RPO improves FLUX.1-Dev and Qwen-Image-Edit on GenEval, DPG-Bench, TIIF, UniGenBench++, GEdit-Bench, and ImgEdit.
Cross-model rubric transfer chart
Structured rubrics transfer cleanly across judge families and consistently outperform direct judging without explicit criteria.

Reading the results

  • Explicit criteria improve alignment while keeping the reward interpretable and editable.
  • Pointwise and pairwise Auto-Rubric make the same rubric assets reusable across evaluation and training.
  • Custom rubric dimensions reduce reward drift toward generic aesthetic preference.
  • The diffusion RL recipe shows that rubric-conditioned rewards scale beyond multimodal reasoning tasks.

Preference Evaluation

Agreement with human labels across four preference benchmarks

Accuracy denotes how often the judge matches the annotated preference.

Method HPDv3 MM-RewardBench2 (T2I) MM-RewardBench2 (Edit) EditReward-Bench
Trained reward model
PickScore 65.6 58.6 --- ---
ImageReward 58.6 54.0 --- ---
UnifiedReward 66.0 59.8 --- ---
UnifiedReward-Thinking 68.1 66.0 --- ---
HPSv3 76.9 60.2 --- ---
EditReward --- --- 67.2 56.45
VLM-as-Judge (direct)
Qwen3-VL-8B 67.2 57.6 59.2 54.01
GPT-5 72.4 70.5 73.8 57.53
Gemini 3.1 Pro 76.6 75.1 77.4 61.23
ARR (ours)
Qwen3-VL-8B + ARR 70.2 (+3.0) 62.7 (+5.1) 65.5 (+6.3) 57.22 (+3.21)
GPT-5 + ARR 76.1 (+3.7) 74.7 (+4.2) 77.5 (+3.7) 61.01 (+3.48)
Gemini 3.1 Pro + ARR 78.3 (+1.7) 78.9 (+3.8) 79.2 (+1.8) 63.27 (+2.04)

Generative Quality

Text-to-image and image-editing gains under ARR-RPO

The strongest results come from rubric-conditioned judges, especially with Gemini 3.1 Pro.

Method GenEval DPG-Bench TIIF UniGen++ Short UniGen++ Long GEdit-Bench ImgEdit
Specialist model (T2I)
Emu3 0.54 80.60 --- 45.42 50.59 --- ---
JanusFlow 0.63 79.68 --- 47.10 54.80 --- ---
FLUX.1-Dev 0.66 83.84 71.09 60.97 69.42 --- ---
DALL·E 3 0.67 83.50 74.96 68.85 70.82 --- ---
Show-o2 0.76 86.14 --- 61.90 70.33 --- ---
OmniGen2 0.80 83.57 --- 63.09 71.39 --- ---
BAGEL 0.82 85.07 71.50 59.91 71.26 --- ---
ARR-RPO / T2I (ours)
w/ RPO-Qwen3-VL-8B-ARR 0.74 (+0.08) 85.03 (+1.19) 74.92 (+3.83) 64.17 (+3.20) 71.82 (+2.40) --- ---
w/ RPO-GPT-5-ARR 0.78 (+0.12) 85.41 (+1.57) 76.18 (+5.09) 65.36 (+4.39) 72.41 (+2.99) --- ---
w/ RPO-Gemini 3.1 Pro-ARR 0.80 (+0.14) 85.76 (+1.92) 76.85 (+5.76) 65.89 (+4.92) 72.93 (+3.51) --- ---
Specialist model (editing)
Instruct-Pix2Pix --- --- --- --- --- 3.68 1.88
AnyEdit --- --- --- --- --- 3.21 2.45
Step1X-Edit --- --- --- --- --- 6.97 3.06
Qwen-Image-Edit-2509 --- --- --- --- --- 7.54 4.35
UniWorldv2 --- --- --- --- --- 7.76 4.48
ARR-RPO / image editing (ours)
w/ RPO-Qwen3-VL-8B-ARR --- --- --- --- --- 7.66 (+0.12) 4.38 (+0.03)
w/ RPO-GPT-5-ARR --- --- --- --- --- 7.72 (+0.18) 4.40 (+0.05)
w/ RPO-Gemini 3.1 Pro-ARR --- --- --- --- --- 7.85 (+0.31) 4.43 (+0.08)

Qualitative Results

Stronger instruction fidelity, cleaner structures, and more faithful local edits

Combined qualitative results for text-to-image and image editing
ARR-RPO improves prompt satisfaction, structural plausibility, and edit consistency while retaining source-image content when preservation is part of the task.
Text-to-image qualitative examples
Text-to-image outputs show tighter content matching and more deliberate compositional control under rubric-conditioned optimization.
Image-editing qualitative examples
Image-editing outputs preserve identity and global scene structure while executing localized instructions more faithfully.

Citation

BibTeX

If Auto-Rubric as Reward contributes to your work, please cite the project and link back to the official repository.

arXiv link coming soon
@misc{tian2026autorubricrewardimplicitpreferences,
      title={Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria}, 
      author={Juanxi Tian and Fengyuan Liu and Jiaming Han and Yilei Jiang and Yongliang Wu and Yesheng Liu and Haodong Li and Furong Xu and Wanhua Li},
      year={2026},
      eprint={2605.08354},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.08354}, 
}