Week 11 – Prompt Engineering for Visual Scoring & Ground Truth Stability

Dates: August 10 – August 16
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir


Focus

This week concentrated on refining prompt strategies to improve the accuracy of visual fat quantification. Emphasis was placed on output consistency across repeated runs, region-wise interpretability, and building stable mental mappings for float-value scoring.


Goals for the Week

  • Enhance the existing prompt using few-shot and chain-of-thought prompting techniques
  • Investigate inconsistency across repeated image inputs and design stability measures
  • Develop a mental visual scale to standardize fat prominence scores (0.00 to 1.00)
  • Create annotated prompt-ready instructions derived from real visual examples
  • Ensure precise parsing and region-wise separation for each side (left/right)

Tasks Completed

| Task | Status | Notes |
| --- | --- | --- |
| Added CoT and structured reasoning layers to the prompt | ✅ Completed | Boosted VLM interpretability and regional independence |
| Designed and discussed image-level consistency mechanisms | ✅ Completed | Proposed "cache + score threshold + output clamp" ensemble logic |
| Constructed a mental scoring map per region based on bulge prominence | ✅ Completed | Guides both prompting and human validation |
| Rewrote the output parser to enforce a strict float range (0.00–1.00) | ✅ Completed | Removed random float fluctuations via prompt calibration |
| Logged visual inconsistencies across repeated generations | ✅ Completed | Documented variation patterns and their correlation with image ambiguity |
| Coordinated prompt changes with the DevOps team | ✅ Completed | Shipped as prompt version 10 after modifications |

Key Learnings

  • Chain-of-thought improves per-region breakdown and avoids cross-region bias.
  • Few-shot prompting improves formatting fidelity by modeling ideal completions.
  • Visual-to-float mappings require stable mental anchors for human-model alignment.
  • Output formatting needs strict post-filtering to avoid hallucinated or malformed strings.
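The strict post-filtering mentioned in the last point can be sketched as a small validating parser. The `left: 0.42 / right: 0.37` line format is an assumption for illustration; the real prompt's output schema may differ.

```python
import re

# Accepts only lines like "left: 0.42" with exactly two decimals (assumed format)
LINE_RE = re.compile(r"^(left|right)\s*:\s*(\d\.\d{2})$")

def parse_region_scores(text: str) -> dict:
    """Parse model output into per-region floats, rejecting hallucinated
    or malformed strings and any score outside 0.00-1.00."""
    scores = {}
    for line in text.strip().splitlines():
        m = LINE_RE.match(line.strip().lower())
        if not m:
            raise ValueError(f"malformed line: {line!r}")
        region, value = m.group(1), float(m.group(2))
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score out of range: {value}")
        scores[region] = value
    if set(scores) != {"left", "right"}:
        raise ValueError("expected exactly one left and one right score")
    return scores
```

Rejecting (rather than silently repairing) malformed output makes failures visible in logs, which helped correlate formatting errors with ambiguous images.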

Problems Faced & Solutions

| Problem | Solution |
| --- | --- |
| Inconsistent float values for the same image | Mental mapping + prompt enhancement + caching |
| Prompt exceeding token limits after embedding examples | Pruned explanations, shortened examples, and modularized the format section |
| Model output outside the allowed float range (e.g., 1.2, -0.1) | Rewrote scoring boundaries directly in the prompt and added a float clamp |

Goals for Next Week

  • Begin region-specific visual dataset curation for fine-tuning
  • Start work on feedback loop interface between model score and human validation
  • Explore statistical calibration techniques using real score distributions

Screenshots (Optional)

Example of new prompt structure, before/after score comparison, cache logs.


“Week 11 was about making the model think like a human grader—clear, structured, and region-wise grounded.”