Week 11 – Prompt Engineering for Visual Scoring & Ground Truth Stability

Dates: August 10 – August 16
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir


Focus

This week concentrated on refining prompt strategies to improve the accuracy of visual fat quantification. Emphasis was placed on output consistency across repeated runs, region-wise interpretability, and building stable mental mappings for float-value scoring.


Goals for the Week

  • Enhance the existing prompt using few-shot and chain-of-thought prompting techniques
  • Investigate inconsistency across repeated image inputs and design stability measures
  • Develop a mental visual scale to standardize fat prominence scores (0.00 to 1.00)
  • Create annotated prompt-ready instructions derived from real visual examples
  • Ensure precise parsing and region-wise separation for each side (left/right)

Tasks Completed

| Task | Status | Notes |
| --- | --- | --- |
| Added CoT and structured reasoning layers to the prompt | ✅ Completed | Boosted VLM interpretability and regional independence |
| Designed and discussed image-level consistency mechanisms | ✅ Completed | Proposed "cache + score threshold + output clamp" ensemble logic |
| Constructed a mental scoring map per region based on bulge prominence | ✅ Completed | Guides both prompting and human validation |
| Rewrote the output parser to enforce a strict float range (0.00–1.00) | ✅ Completed | Removed random float fluctuations via prompt calibration |
| Logged visual inconsistencies across repeated generations | ✅ Completed | Documented variation patterns and their correlation with image ambiguity |
| Coordinated prompt changes with the DevOps team | ✅ Completed | Shipped as prompt version 10 after modifications |

Key Learnings

  • Chain-of-thought improves per-region breakdown and avoids cross-region bias.
  • Few-shot prompting improves formatting fidelity by modeling ideal completions.
  • Visual-to-float mappings require stable mental anchors for human-model alignment.
  • Output formatting needs strict post-filtering to avoid hallucinated or malformed strings.
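The strict post-filtering mentioned in the last point can be sketched as a small validating parser. The `left: 0.42 / right: 0.37` line format is an assumption for illustration; the real prompt's output schema may differ.

```python
import re

# Accepts only lines like "left: 0.42" with exactly two decimals (assumed format)
LINE_RE = re.compile(r"^(left|right)\s*:\s*(\d\.\d{2})$")

def parse_region_scores(text: str) -> dict:
    """Parse model output into per-region floats, rejecting hallucinated
    or malformed strings and any score outside 0.00-1.00."""
    scores = {}
    for line in text.strip().splitlines():
        m = LINE_RE.match(line.strip().lower())
        if not m:
            raise ValueError(f"malformed line: {line!r}")
        region, value = m.group(1), float(m.group(2))
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score out of range: {value}")
        scores[region] = value
    if set(scores) != {"left", "right"}:
        raise ValueError("expected exactly one left and one right score")
    return scores
```

Rejecting (rather than silently repairing) malformed output makes failures visible in logs, which helped correlate formatting errors with ambiguous images.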

Problems Faced & Solutions

| Problem | Solution |
| --- | --- |
| Inconsistent float values for the same image | Mental mapping + prompt enhancement + caching |
| Prompt exceeding token limits after embedding examples | Pruned explanations, shortened examples, and modularized the format section |
| Model output outside the allowed float range (e.g., 1.2, -0.1) | Rewrote scoring boundaries directly in the prompt and added a float clamp |

Goals for Next Week

  • Begin region-specific visual dataset curation for fine-tuning
  • Start work on feedback loop interface between model score and human validation
  • Explore statistical calibration techniques using real score distributions

Screenshots (Optional)

Example of new prompt structure, before/after score comparison, cache logs.


“Week 11 was about making the model think like a human grader—clear, structured, and region-wise grounded.”