MLLM‑as‑a‑Judge Exhibits Self‑Preference Bias

Anonymous


Under review

Abstract

Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge systems were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge systems favor outputs from specific MLLMs. In this study, we investigate such model-specific preference biases. To this end, we introduce Philautia-Eval, which quantifies the degree of these biases by disentangling preference tendencies from differences in generation quality. Using 1.2M caption–score pairs collected from twelve MLLMs, we find that representative MLLMs tend to exhibit self-preference bias. Our analysis also reveals cross-model preference bias among models within the same family. Finally, we investigate whether a simple ensemble of MLLMs mitigates the influence of model-specific preference bias, and demonstrate that the ensemble effectively mitigates these influences while maintaining evaluation performance.


Teaser figure.

Schematic of our approach for investigating model-specific preference biases in MLLM-as-a-Judge. Each MLLM typically favors its own generations (self-preference bias), whereas LLaVA-1.5 favors texts generated by other models within the LLaVA family (cross-model preference bias). Our MoMetrics exhibits less model-specific preference bias.

Overview


Pipeline of the proposed method.

Pipeline of the proposed method. In the figure, "std." denotes standardization. Generators produce image captions, which Evaluators then assign evaluation scores. From these scores, we construct the matrix \(\mathrm{\Phi}\), whose rows and columns correspond to the Generators and the Evaluators, respectively. We then standardize \(\mathrm{\Phi}\) column-wise and subsequently row-wise to obtain \(\tilde{\mathrm{\Phi}}\). The diagonal entries indicate the degree of self-preference bias, which we name the philautia score.
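The double standardization described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the random matrix, the function name `double_standardize`, and the 12×12 size are placeholder assumptions chosen to mirror the twelve Generator/Evaluator models.

```python
import numpy as np

# Hypothetical score matrix: phi[g, e] is the mean score that Evaluator e
# assigns to captions generated by Generator g. Both axes index the same
# set of models, so the diagonal corresponds to models judging themselves.
rng = np.random.default_rng(0)
phi = rng.normal(loc=3.5, scale=0.5, size=(12, 12))

def double_standardize(phi: np.ndarray) -> np.ndarray:
    """Standardize column-wise (per Evaluator), then row-wise (per Generator)."""
    # Column-wise: remove each Evaluator's own scoring mean and scale,
    # since means and standard deviations differ across Evaluators.
    z = (phi - phi.mean(axis=0, keepdims=True)) / phi.std(axis=0, keepdims=True)
    # Row-wise: remove each Generator's overall caption-quality level.
    z = (z - z.mean(axis=1, keepdims=True)) / z.std(axis=1, keepdims=True)
    return z

phi_tilde = double_standardize(phi)
philautia_scores = np.diag(phi_tilde)  # one self-preference score per model
```

A clearly positive diagonal entry then means an Evaluator scores its own generations higher than the other judges do, beyond what the Generator's overall quality explains.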

Results


RQ1: To What Extent does MLLM-as-a-Judge Exhibit Self-preference Bias?

Main findings:

  • Representative MLLMs tend to exhibit self-preference bias.
  • References affect self-preference bias in Gemini 2.5 Pro.
  • GPT-4o has a relatively low self-preference bias.

Standardized matrix visualization.

(i) Visualization of \(\mathrm{\Phi}\) in the reference-based setting. The vertical and horizontal axes represent the Generators and Evaluators, respectively. High scores are shown in red and low scores in blue. (ii) Visualization of \(\mathrm{\Phi}\) after Evaluator-wise standardization. Because the means and standard deviations differ across Evaluators, the matrix entries are first standardized column-wise; because overall caption quality likewise differs across Generators, they are subsequently standardized row-wise.



Qualitative Results

Qualitative result 1

Example of self-preference bias. The bar chart shows the scores given to a caption generated by Gemini-2.5-Pro. Gemini-2.5-Pro gave exceptionally high scores to its own generation compared with the other Evaluators. The symbol \(\blacklozenge\) marks the mean score of each Evaluator. Red text within \(\hat{\mathbf{y}}_{g}\) highlights a hallucination.



RQ2: To What Extent does Cross-model Preference Bias Appear in MLLM-as-a-Judge?

Main findings:

  • Qwen-based MLLMs tend to favor each other.
  • Within the LLaVA family, LLaVA-1.5 tends to favor its successor models.

Cross-model preference bias visualization.

Visualization of preference bias within model families. (i) The submatrix for Qwen-based models: nine of the twelve off-diagonal entries are positive, suggesting a preference bias within the model family. (ii) The submatrix for LLaVA-family models: LLaVA-1.5-13B tends to favor its successor models (e.g., LLaVA-NeXT-Vicuna-7B, LLaVA-OneVision-7B).



RQ3: Can an Ensemble of Evaluators Mitigate the Influence of Model-specific Preference Bias while Maintaining Alignment with Human Judgments?

Main findings:

  • MoMetrics mitigates preference bias while maintaining performance.

| Metrics | Nebula \(\tau_b\) ↑ | Nebula \(\tau_c\) ↑ | Flickr8k-Ex \(\tau_b\) ↑ | Flickr8k-Ex \(\tau_c\) ↑ | SelfEval-Cap Φ-score |
|---|---|---|---|---|---|
| G-VEval: GPT-4o | 56.1 | 53.2 | 61.5 | 59.7 | 1.09 |
| G-VEval: Qwen2.5-VL-7B | 55.3 | 52.4 | 54.6 | 54.0 | 1.12 |
| G-VEval: InternVL2.5-8B | 54.1 | 51.3 | 54.6 | 52.9 | 3.02 |
| MoMetrics (i): GPT-4o and InternVL2.5-8B | 56.4 | 53.5 | 61.5 | 59.7 | 1.31 |
| MoMetrics (ii): + Eagle2-9B | 56.6 | 53.6 | 62.7 | 60.8 | 0.45 |
| MoMetrics (iii): + LLaVA-OneVision-7B | 56.6 | 53.7 | 60.6 | 58.8 | -0.19 |
| MoMetrics (iv): + DeepSeek-VL2 | 56.4 | 53.5 | 59.0 | 57.3 | 0.15 |
| MoMetrics (v): + Qwen2.5-VL-7B | 57.0 | 54.1 | 59.6 | 57.8 | 0.52 |
| MoMetrics (vi): + Phi-3.5-Vision | 57.0 | 54.1 | 59.6 | 57.8 | 0.42 |

Quantitative comparison between MoMetrics and the baselines. For MoMetrics, each row sequentially adds a model to the ensemble.
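The abstract describes MoMetrics as a simple ensemble of Evaluators. As an illustrative sketch only — the exact aggregation rule is not given in this page, and the function `ensemble_score` and the per-Evaluator z-scoring below are assumptions — one natural instantiation averages each caption's scores across Evaluators after standardizing each judge's scale, so that no single judge's scoring habits dominate:

```python
import numpy as np

def ensemble_score(scores: np.ndarray) -> np.ndarray:
    """scores: (n_captions, n_evaluators) raw scores.

    Returns one ensemble score per caption: each Evaluator's column is
    z-scored (assumed aggregation, not necessarily the paper's), then
    the standardized scores are averaged across Evaluators.
    """
    z = (scores - scores.mean(axis=0, keepdims=True)) / scores.std(axis=0, keepdims=True)
    return z.mean(axis=1)

# Toy usage: three hypothetical Evaluators score five captions on
# different scales (1-5, 0-100, 0-1); the ensemble pools their rankings.
raw = np.array([
    [4.0, 80.0, 0.9],
    [3.5, 70.0, 0.7],
    [5.0, 90.0, 1.0],
    [2.0, 50.0, 0.4],
    [3.0, 60.0, 0.6],
])
pooled = ensemble_score(raw)  # higher = better under the pooled judges
```

Averaging after standardization is what lets the ensemble dilute any one model's idiosyncratic preference while keeping the shared quality signal, consistent with the bias reduction reported in the table above.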


| Generator \(G^{(i)}\) | Self \(\tilde{\mathrm{\Phi}}_{E^{(i)}}(G^{(i)})\) | MoMetrics \(\tilde{\mathrm{\Phi}}_{\text{MoMetrics}}(G^{(i)})\) |
|---|---|---|
| GPT-4o | 1.08 | 0.23 |
| Gemini-2.5-Pro | 1.84 | 0.18 |
| Qwen2.5-VL-7B | 1.12 | 0.67 |
| Molmo-7B-D | 0.86 | -0.26 |
| Eagle2-9B | 1.05 | 0.01 |
| LLaVA-OneVision-7B | 1.83 | -0.90 |
| DeepSeek-VL2 | 2.00 | 0.67 |
| Gemma3-4B-IT | 1.10 | -0.33 |
| Phi-3.5-Vision | 1.26 | 0.11 |
| LLaVA-NeXT-Vicuna-7B | 1.09 | 0.14 |
| LLaVA-1.5-13B | 1.26 | -0.18 |
| InternVL2.5-8B | 3.03 | 0.34 |

Quantitative comparison between \(\tilde{\mathrm{\Phi}}_{E^{(i)}}(G^{(i)})\) and \(\tilde{\mathrm{\Phi}}_{\text{MoMetrics}}(G^{(i)})\). Bold font represents the best results.



BibTeX

Coming soon.