Week 13 – Fine-Tuning Prep & Human-in-the-Loop Validation

Dates: August 24 – August 30
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir


Focus

This week’s focus was on transforming our high-level goals into concrete, executable steps. The main efforts were directed toward curating the fine-tuning dataset, preparing it for the OpenAI API, and implementing the first version of our human-in-the-loop feedback UI.


Goals for the Week

  • Begin region-specific dataset curation for fine-tuning
  • Experiment with OpenAI fine-tuning JSONL preparation ({input, output} pairs)
  • Implement feedback loop UI for human validation against model scores
  • Explore integration of statistical calibration for score distributions

Tasks Completed

| Task | Status | Notes |
|---|---|---|
| Curation of R2A & R9 datasets for fine-tuning | ✅ Completed | Partitioned the dataset into region-specific JSONL files with image_url and score fields |
| Created prepare_finetune_data.py script | ✅ Completed | Transforms curated records into the chat-format messages structure required by the OpenAI fine-tuning API |
| Added feedback loop UI to Gradio app | ✅ Completed | Lets human annotators correct model predictions in real time |
| Explored and integrated Platt Scaling for score calibration | ✅ Completed | Improved the alignment between prediction scores and true probabilities |
| Ran first successful fine-tuning data validation check | ✅ Completed | Confirmed data integrity for the initial fine-tuning job |
| Finalized data validation for all regions | ✅ Completed | Ensured all datasets are clean and ready for fine-tuning |
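
The conversion performed by prepare_finetune_data.py can be sketched roughly as below. This is a minimal illustration rather than the actual script: the prompt wording, field names (image_url, score), file names, and the user-content layout are assumptions based on the curated data described above.

```python
# prepare_finetune_data.py (sketch): wrap each curated record in the
# chat-format "messages" structure used by OpenAI fine-tuning jobs.
# Prompt text and file names below are placeholders.
import json
from pathlib import Path

SYSTEM_PROMPT = "You are a scoring assistant. Return a single numeric score."

def to_chat_example(record: dict) -> dict:
    """Convert one curated record (image_url, score) into a training example."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Score this image."},
                    {"type": "image_url", "image_url": {"url": record["image_url"]}},
                ],
            },
            {"role": "assistant", "content": str(record["score"])},
        ]
    }

def convert(in_path: Path, out_path: Path) -> int:
    """Read curated JSONL and write chat-format training JSONL; return the count."""
    written = 0
    with in_path.open() as src, out_path.open("w") as dst:
        for line in src:
            dst.write(json.dumps(to_chat_example(json.loads(line))) + "\n")
            written += 1
    return written

if __name__ == "__main__":
    count = convert(Path("R2A_curated.jsonl"), Path("R2A_finetune.jsonl"))
    print(f"Wrote {count} training examples")
```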

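The feedback loop UI added to the Gradio app follows a simple pattern: show the model's prediction, let an annotator enter a corrected score, and append the correction to a log for later merging into the ground truth. A minimal sketch, assuming a hypothetical feedback.jsonl log file and numeric scores:

```python
# Minimal human-in-the-loop feedback sketch: annotator corrections are
# appended to a JSONL log (file name is an illustrative assumption).
import json
import gradio as gr

FEEDBACK_PATH = "feedback.jsonl"

def record_feedback(image_url: str, model_score: float, human_score: float) -> str:
    """Append one correction so it can later be merged into the ground-truth set."""
    entry = {
        "image_url": image_url,
        "model_score": model_score,
        "human_score": human_score,
    }
    with open(FEEDBACK_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return f"Saved correction for {image_url}"

with gr.Blocks() as demo:
    url = gr.Textbox(label="Image URL")
    predicted = gr.Number(label="Model score")
    corrected = gr.Number(label="Human-corrected score")
    status = gr.Textbox(label="Status", interactive=False)
    gr.Button("Submit correction").click(
        record_feedback, inputs=[url, predicted, corrected], outputs=status
    )

if __name__ == "__main__":
    demo.launch()
```
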
Key Learnings

  • Data curation for fine-tuning is more than just formatting; it’s a critical process of selecting high-quality, representative examples.
  • A human-in-the-loop feedback process is invaluable for correcting model bias and building a robust ground-truth dataset.
  • Statistical calibration techniques like Platt Scaling can correct overconfident or underconfident model scores, making them more interpretable and reliable.
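
Platt Scaling itself amounts to fitting a logistic (sigmoid) regression on raw model scores against binary ground-truth labels, then using the fitted curve as the calibrated probability. A small illustrative sketch, with placeholder arrays standing in for the actual score/label pairs:

```python
# Platt Scaling sketch: fit a sigmoid mapping from raw scores to calibrated
# probabilities. The arrays below are toy data, not project results.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Raw model scores (single feature) and binary ground-truth labels.
raw_scores = np.array([0.15, 0.40, 0.55, 0.70, 0.82, 0.95]).reshape(-1, 1)
labels = np.array([0, 0, 1, 1, 1, 1])

# Platt Scaling = logistic regression on the score:
# p(y=1 | s) = 1 / (1 + exp(-(A*s + B)))
platt = LogisticRegression()
platt.fit(raw_scores, labels)

calibrated = platt.predict_proba(raw_scores)[:, 1]
print(np.round(calibrated, 3))
```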

Problems Faced & Solutions

| Problem | Solution |
|---|---|
| json.decoder.JSONDecodeError during data preparation | Wrote a validation script that catches and skips malformed JSON lines before API ingestion |
| Incorrect fine-tuning data format (messages vs. prompt) | Revised prepare_finetune_data.py to emit the current Chat Completions fine-tuning format |
| scikit-learn version conflict for the calibration functions | Pinned scikit-learn to a compatible version in a fresh pip install |
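
The defensive parsing pass mentioned in the first row can be sketched as follows; the file name and the specific checks are illustrative assumptions, not the exact validation script:

```python
# Validation sketch: skip lines that fail to parse and flag records missing
# the chat-format "messages" key before uploading the JSONL for fine-tuning.
import json

def validate_jsonl(path: str) -> tuple[list[dict], list[int]]:
    """Return (valid records, line numbers of malformed or invalid lines)."""
    valid, bad_lines = [], []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(lineno)
                continue
            if "messages" not in record:
                bad_lines.append(lineno)
                continue
            valid.append(record)
    return valid, bad_lines

if __name__ == "__main__":
    records, bad = validate_jsonl("R2A_finetune.jsonl")
    print(f"{len(records)} valid examples, {len(bad)} skipped lines: {bad}")
```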

Goals for Next Week

  • Run the first OpenAI fine-tuning experiment on R2A dataset
  • Benchmark the fine-tuned model against the base GPT-4o model
  • Analyze and document performance gains and changes in specific regions
  • Start the fine-tuning process for other key regions (R9, R_1, etc.)

Screenshots (Optional)

  • Side-by-side comparison of model prediction vs. human feedback in the Gradio UI.
  • A plot showing the calibration curve of the model before and after Platt Scaling.


“This week, we moved from building the cockpit to fueling the engine with a high-quality dataset and a human-in-the-loop validation system.”