MLLM‑as‑a‑Judge Exhibits Self‑Preference Bias

Anonymous


Under review

Abstract

Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge systems were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge systems favor outputs from specific MLLMs. In this study, we investigate such model-specific preference biases. To this end, we introduce Philautia-Eval, which quantifies the degree of these biases by disentangling preference tendencies from differences in generation quality. Using 1.2M caption–score pairs collected from twelve MLLMs, we find that representative MLLMs tend to exhibit self-preference bias. Our analysis also reveals cross-model preference bias among models within the same family. Finally, we investigate whether a simple ensemble of MLLMs mitigates the influence of model-specific preference bias, and demonstrate that the ensemble effectively mitigates these influences while maintaining evaluation performance.


Teaser figure.

Schematic of our approach for investigating model-specific preference biases in MLLM-as-a-Judge. Each MLLM typically favors its own generations (self-preference bias), whereas LLaVA-1.5 favors texts generated by other models within the LLaVA family (cross-model preference bias). Our MoMetrics exhibits less model-specific preference bias.

Overview


Pipeline of the proposed method.

Pipeline of the proposed method. In the figure, "std." denotes standardization. Generators produce image captions, which Evaluators then assign evaluation scores. From these scores, we construct the matrix \(\mathrm{\Phi}\), whose rows and columns correspond to the Generators and the Evaluators, respectively. We then standardize \(\mathrm{\Phi}\) column-wise and subsequently row-wise to obtain \(\tilde{\mathrm{\Phi}}\). The diagonal entries indicate the degree of self-preference bias, which we name the philautia score.
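The double standardization described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the random matrix, the function name `double_standardize`, and the 12×12 size are placeholder assumptions chosen to mirror the twelve Generator/Evaluator models.

```python
import numpy as np

# Hypothetical score matrix: phi[g, e] is the mean score that Evaluator e
# assigns to captions generated by Generator g. Both axes index the same
# set of models, so the diagonal corresponds to models judging themselves.
rng = np.random.default_rng(0)
phi = rng.normal(loc=3.5, scale=0.5, size=(12, 12))

def double_standardize(phi: np.ndarray) -> np.ndarray:
    """Standardize column-wise (per Evaluator), then row-wise (per Generator)."""
    # Column-wise: remove each Evaluator's own scoring mean and scale,
    # since means and standard deviations differ across Evaluators.
    z = (phi - phi.mean(axis=0, keepdims=True)) / phi.std(axis=0, keepdims=True)
    # Row-wise: remove each Generator's overall caption-quality level.
    z = (z - z.mean(axis=1, keepdims=True)) / z.std(axis=1, keepdims=True)
    return z

phi_tilde = double_standardize(phi)
philautia_scores = np.diag(phi_tilde)  # one self-preference score per model
```

A clearly positive diagonal entry then means an Evaluator scores its own generations higher than the other judges do, beyond what the Generator's overall quality explains.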

Results


RQ1: To What Extent does MLLM-as-a-Judge Exhibit Self-preference Bias?

Main findings:

  • Representative MLLMs tend to exhibit self-preference bias.
  • References affect self-preference bias in Gemini 2.5 Pro.
  • GPT-4o has a relatively low self-preference bias.

Standardized matrix visualization.

(i) Visualization of \(\mathrm{\Phi}\) in the reference-based setting. The vertical and horizontal axes represent the Generators and Evaluators, respectively. High scores are shown in red and low scores in blue. (ii) Visualization of \(\mathrm{\Phi}\) after Evaluator-wise standardization. Because the means and standard deviations differ across Evaluators, the matrix entries are first standardized column-wise; because overall caption quality likewise differs across Generators, they are subsequently standardized row-wise.



Qualitative Results

Qualitative result 1

Example of self-preference bias. The bar chart shows the scores given to a caption generated by Gemini-2.5-Pro. Gemini-2.5-Pro gave exceptionally high scores to its own generation compared with the other Evaluators. The symbol \(\blacklozenge\) marks the mean score of each Evaluator. Red text within \(\hat{\mathbf{y}}_{g}\) highlights a hallucination.



RQ2: To What Extent does Cross-model Preference Bias Appear in MLLM-as-a-Judge?

Main findings:

  • Qwen-based MLLMs tend to favor each other.
  • Within the LLaVA family, LLaVA-1.5 tends to favor its successor models.

Cross-model preference bias visualization.

Visualization of preference bias within model families. (i) The submatrix for Qwen-based models: nine of the twelve off-diagonal entries are positive, suggesting a preference bias within the model family. (ii) The submatrix for LLaVA-family models: LLaVA-1.5-13B tends to favor its successor models (e.g., LLaVA-NeXT-Vicuna-7B, LLaVA-OneVision-7B).



RQ3: Can an Ensemble of Evaluators Mitigate the Influence of Model-specific Preference Bias while Maintaining Alignment with Human Judgments?

Main findings:

  • MoMetrics mitigates preference bias while maintaining performance.

| Metrics | Nebula \(\tau_b\) ↑ | Nebula \(\tau_c\) ↑ | Flickr8k-Ex \(\tau_b\) ↑ | Flickr8k-Ex \(\tau_c\) ↑ | SelfEval-Cap Φ-score |
|---|---|---|---|---|---|
| G-VEval: GPT-4o | 56.1 | 53.2 | 61.5 | 59.7 | 1.09 |
| G-VEval: Qwen2.5-VL-7B | 55.3 | 52.4 | 54.6 | 54.0 | 1.12 |
| G-VEval: InternVL2.5-8B | 54.1 | 51.3 | 54.6 | 52.9 | 3.02 |
| MoMetrics (i): GPT-4o and InternVL2.5-8B | 56.4 | 53.5 | 61.5 | 59.7 | 1.31 |
| MoMetrics (ii): + Eagle2-9B | 56.6 | 53.6 | 62.7 | 60.8 | 0.45 |
| MoMetrics (iii): + LLaVA-OneVision-7B | 56.6 | 53.7 | 60.6 | 58.8 | -0.19 |
| MoMetrics (iv): + DeepSeek-VL2 | 56.4 | 53.5 | 59.0 | 57.3 | 0.15 |
| MoMetrics (v): + Qwen2.5-VL-7B | 57.0 | 54.1 | 59.6 | 57.8 | 0.52 |
| MoMetrics (vi): + Phi-3.5-Vision | 57.0 | 54.1 | 59.6 | 57.8 | 0.42 |

Quantitative comparison between MoMetrics and the baselines. For MoMetrics, each row sequentially adds a model to the ensemble.
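The abstract describes MoMetrics as a simple ensemble of Evaluators. As an illustrative sketch only — the exact aggregation rule is not given in this page, and the function `ensemble_score` and the per-Evaluator z-scoring below are assumptions — one natural instantiation averages each caption's scores across Evaluators after standardizing each judge's scale, so that no single judge's scoring habits dominate:

```python
import numpy as np

def ensemble_score(scores: np.ndarray) -> np.ndarray:
    """scores: (n_captions, n_evaluators) raw scores.

    Returns one ensemble score per caption: each Evaluator's column is
    z-scored (assumed aggregation, not necessarily the paper's), then
    the standardized scores are averaged across Evaluators.
    """
    z = (scores - scores.mean(axis=0, keepdims=True)) / scores.std(axis=0, keepdims=True)
    return z.mean(axis=1)

# Toy usage: three hypothetical Evaluators score five captions on
# different scales (1-5, 0-100, 0-1); the ensemble pools their rankings.
raw = np.array([
    [4.0, 80.0, 0.9],
    [3.5, 70.0, 0.7],
    [5.0, 90.0, 1.0],
    [2.0, 50.0, 0.4],
    [3.0, 60.0, 0.6],
])
pooled = ensemble_score(raw)  # higher = better under the pooled judges
```

Averaging after standardization is what lets the ensemble dilute any one model's idiosyncratic preference while keeping the shared quality signal, consistent with the bias reduction reported in the table above.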


| Generator \(G^{(i)}\) | Self \(\tilde{\mathrm{\Phi}}_{E^{(i)}}(G^{(i)})\) | MoMetrics \(\tilde{\mathrm{\Phi}}_{\text{MoMetrics}}(G^{(i)})\) |
|---|---|---|
| GPT-4o | 1.08 | 0.23 |
| Gemini-2.5-Pro | 1.84 | 0.18 |
| Qwen2.5-VL-7B | 1.12 | 0.67 |
| Molmo-7B-D | 0.86 | -0.26 |
| Eagle2-9B | 1.05 | 0.01 |
| LLaVA-OneVision-7B | 1.83 | -0.90 |
| DeepSeek-VL2 | 2.00 | 0.67 |
| Gemma3-4B-IT | 1.10 | -0.33 |
| Phi-3.5-Vision | 1.26 | 0.11 |
| LLaVA-NeXT-Vicuna-7B | 1.09 | 0.14 |
| LLaVA-1.5-13B | 1.26 | -0.18 |
| InternVL2.5-8B | 3.03 | 0.34 |

Quantitative comparison between \(\tilde{\mathrm{\Phi}}_{E^{(i)}}(G^{(i)})\) and \(\tilde{\mathrm{\Phi}}_{\text{MoMetrics}}(G^{(i)})\). Bold font represents the best results.



BibTeX

Coming soon.