Can Prompting Enforce Visual Consistency?

The Hypothesis

Vision-language models (VLMs) can produce noticeably different scores when asked to rate the same image repeatedly. Can this inconsistency be mitigated with strategic prompting?

Experiment

  • Queried an OpenAI VLM repeatedly with the same facial image and the same scoring prompt
  • Added progressively stricter format instructions in the prompt
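The exact prompts are not given above, so the following is a hypothetical sketch of what "progressively stricter format instructions" might look like: a bare prompt, a variant with an explicit output format and range, and a variant that also includes exemplar (anchor) outputs. The wording and the anchor scores are assumptions, not the original experiment's text.

```python
# Hypothetical prompt variants, from least to most constrained.
# The task wording and example scores are illustrative assumptions.

BASE_PROMPT = "Rate the symmetry of the face in this image."

FORMAT_PROMPT = (
    BASE_PROMPT
    + "\nRespond with ONLY a single integer from 0 to 100."
)

ANCHORED_PROMPT = (
    FORMAT_PROMPT
    + "\nExample outputs for reference images:"
    + "\n- a slightly asymmetric face -> 62"
    + "\n- a near-perfectly symmetric face -> 95"
)

def build_prompts():
    """Return the three prompt variants, least to most constrained."""
    return [BASE_PROMPT, FORMAT_PROMPT, ANCHORED_PROMPT]
```

Each variant would be sent with the same image multiple times, and the spread of the returned scores compared across variants.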

Observations

  • Without format instructions: ~8% variance in scores across repeated runs
  • With explicit format instructions: variance reduced to ~3%
  • With reference (exemplar) outputs included: variance reduced to <2%
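The section does not say how variance was computed; one plausible measure is the coefficient of variation (standard deviation divided by the mean, as a percentage). A minimal sketch, using made-up score lists purely to illustrate the calculation, not the experiment's real data:

```python
from statistics import mean, pstdev

def relative_spread(scores):
    """Coefficient of variation in percent: population std / mean * 100."""
    return pstdev(scores) / mean(scores) * 100

# Hypothetical repeated scores for the same image (illustrative only).
no_format = [71, 78, 65, 74, 69]    # no format instructions
with_format = [72, 74, 71, 73, 72]  # explicit format instructions

print(round(relative_spread(no_format), 1))    # wider spread
print(round(relative_spread(with_format), 1))  # tighter spread
```

Whatever the exact metric, the qualitative pattern reported above is the same: tighter prompts produce a tighter score distribution.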

Conclusion

Explicit output structure, numeric range constraints, and exemplar outputs substantially improve scoring consistency.
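The three techniques in the conclusion can be combined in a single prompt. The template below is a sketch under assumed wording (the JSON field name, range, and anchor scores are hypothetical), paired with a small parser that enforces the same constraints on the model's reply:

```python
import json

# Hypothetical prompt combining explicit structure, a range
# constraint, and exemplar outputs. Wording is illustrative.
CONSISTENT_PROMPT = """\
Rate the image on the attribute described below.

Output format (explicit structure):
Return ONLY a JSON object of the form {"score": <integer>}.

Range constraint:
"score" must be an integer between 0 and 100.

Exemplar outputs (anchors):
A clearly low-quality image  -> {"score": 15}
A clearly high-quality image -> {"score": 90}
"""

def parse_score(reply: str) -> int:
    """Parse a model reply and enforce the range constraint above."""
    score = json.loads(reply)["score"]
    if not isinstance(score, int) or not (0 <= score <= 100):
        raise ValueError(f"score violates constraints: {score!r}")
    return score
```

Validating replies against the same constraints stated in the prompt also makes out-of-format responses detectable, so they can be retried rather than silently averaged in.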


Prompting is not just about what you say; it is also how you say it, how you format it, and whether the model can trust your anchors.