Week 13 – Fine-Tuning Prep & Human-in-the-Loop Validation

Dates: August 24 – August 30
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir


Focus

This week’s focus was on transforming our high-level goals into concrete, executable steps. The main efforts were directed toward curating the fine-tuning dataset, preparing it for the OpenAI API, and implementing the first version of our human-in-the-loop feedback UI.


Goals for the Week

  • Begin region-specific dataset curation for fine-tuning
  • Experiment with OpenAI fine-tuning JSONL preparation ({input, output} pairs)
  • Implement feedback loop UI for human validation against model scores
  • Explore integration of statistical calibration for score distributions

Tasks Completed

| Task | Status | Notes |
|---|---|---|
| Curation of R2A & R9 datasets for fine-tuning | ✅ Completed | Partitioned the dataset into region-specific JSONL files with image_url and score fields |
| Created prepare_finetune_data.py script | ✅ Completed | Transforms curated records into the chat-format messages structure required by the OpenAI fine-tuning API |
| Added feedback loop UI to Gradio app | ✅ Completed | Lets human annotators correct model predictions in real time |
| Explored and integrated Platt Scaling for score calibration | ✅ Completed | Improved the alignment between prediction scores and true probabilities |
| Ran first successful fine-tuning data validation check | ✅ Completed | Confirmed data integrity for the initial fine-tuning job |
| Finalized data validation for all regions | ✅ Completed | Ensured all datasets are clean and ready for fine-tuning |
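
The conversion performed by prepare_finetune_data.py can be sketched roughly as below. This is a minimal illustration rather than the actual script: the prompt wording, field names (image_url, score), file names, and the user-content layout are assumptions based on the curated data described above.

```python
# prepare_finetune_data.py (sketch): wrap each curated record in the
# chat-format "messages" structure used by OpenAI fine-tuning jobs.
# Prompt text and file names below are placeholders.
import json
from pathlib import Path

SYSTEM_PROMPT = "You are a scoring assistant. Return a single numeric score."

def to_chat_example(record: dict) -> dict:
    """Convert one curated record (image_url, score) into a training example."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Score this image."},
                    {"type": "image_url", "image_url": {"url": record["image_url"]}},
                ],
            },
            {"role": "assistant", "content": str(record["score"])},
        ]
    }

def convert(in_path: Path, out_path: Path) -> int:
    """Read curated JSONL and write chat-format training JSONL; return the count."""
    written = 0
    with in_path.open() as src, out_path.open("w") as dst:
        for line in src:
            dst.write(json.dumps(to_chat_example(json.loads(line))) + "\n")
            written += 1
    return written

if __name__ == "__main__":
    count = convert(Path("R2A_curated.jsonl"), Path("R2A_finetune.jsonl"))
    print(f"Wrote {count} training examples")
```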

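The feedback loop UI added to the Gradio app follows a simple pattern: show the model's prediction, let an annotator enter a corrected score, and append the correction to a log for later merging into the ground truth. A minimal sketch, assuming a hypothetical feedback.jsonl log file and numeric scores:

```python
# Minimal human-in-the-loop feedback sketch: annotator corrections are
# appended to a JSONL log (file name is an illustrative assumption).
import json
import gradio as gr

FEEDBACK_PATH = "feedback.jsonl"

def record_feedback(image_url: str, model_score: float, human_score: float) -> str:
    """Append one correction so it can later be merged into the ground-truth set."""
    entry = {
        "image_url": image_url,
        "model_score": model_score,
        "human_score": human_score,
    }
    with open(FEEDBACK_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return f"Saved correction for {image_url}"

with gr.Blocks() as demo:
    url = gr.Textbox(label="Image URL")
    predicted = gr.Number(label="Model score")
    corrected = gr.Number(label="Human-corrected score")
    status = gr.Textbox(label="Status", interactive=False)
    gr.Button("Submit correction").click(
        record_feedback, inputs=[url, predicted, corrected], outputs=status
    )

if __name__ == "__main__":
    demo.launch()
```
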
Key Learnings

  • Data curation for fine-tuning is more than just formatting; it’s a critical process of selecting high-quality, representative examples.
  • A human-in-the-loop feedback process is invaluable for correcting model bias and building a robust ground-truth dataset.
  • Statistical calibration techniques like Platt Scaling can correct overconfident or underconfident model scores, making them more interpretable and reliable.
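
Platt Scaling itself amounts to fitting a logistic (sigmoid) regression on raw model scores against binary ground-truth labels, then using the fitted curve as the calibrated probability. A small illustrative sketch, with placeholder arrays standing in for the actual score/label pairs:

```python
# Platt Scaling sketch: fit a sigmoid mapping from raw scores to calibrated
# probabilities. The arrays below are toy data, not project results.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Raw model scores (single feature) and binary ground-truth labels.
raw_scores = np.array([0.15, 0.40, 0.55, 0.70, 0.82, 0.95]).reshape(-1, 1)
labels = np.array([0, 0, 1, 1, 1, 1])

# Platt Scaling = logistic regression on the score:
# p(y=1 | s) = 1 / (1 + exp(-(A*s + B)))
platt = LogisticRegression()
platt.fit(raw_scores, labels)

calibrated = platt.predict_proba(raw_scores)[:, 1]
print(np.round(calibrated, 3))
```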

Problems Faced & Solutions

| Problem | Solution |
|---|---|
| json.decoder.JSONDecodeError during data preparation | Wrote a validation script that catches and skips malformed JSON lines before API ingestion |
| Incorrect fine-tuning data format (messages vs. prompt) | Revised prepare_finetune_data.py to emit the current Chat Completions fine-tuning format |
| scikit-learn version conflict for the calibration functions | Pinned scikit-learn to a compatible version in a fresh pip install |
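
The defensive parsing pass mentioned in the first row can be sketched as follows; the file name and the specific checks are illustrative assumptions, not the exact validation script:

```python
# Validation sketch: skip lines that fail to parse and flag records missing
# the chat-format "messages" key before uploading the JSONL for fine-tuning.
import json

def validate_jsonl(path: str) -> tuple[list[dict], list[int]]:
    """Return (valid records, line numbers of malformed or invalid lines)."""
    valid, bad_lines = [], []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(lineno)
                continue
            if "messages" not in record:
                bad_lines.append(lineno)
                continue
            valid.append(record)
    return valid, bad_lines

if __name__ == "__main__":
    records, bad = validate_jsonl("R2A_finetune.jsonl")
    print(f"{len(records)} valid examples, {len(bad)} skipped lines: {bad}")
```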

Goals for Next Week

  • Run the first OpenAI fine-tuning experiment on R2A dataset
  • Benchmark the fine-tuned model against the base GPT-4o model
  • Analyze and document performance gains and changes in specific regions
  • Start the fine-tuning process for other key regions (R9, R_1, etc.)

Screenshots (Optional)

  • Side-by-side comparison of model prediction vs. human feedback in the Gradio UI.
  • A plot showing the calibration curve of the model before and after Platt Scaling.


“This week, we moved from building the cockpit to fueling the engine with a high-quality dataset and a human-in-the-loop validation system.”