Week 13 – Fine-Tuning Prep & Human-in-the-Loop Validation
Dates: August 24 – August 30
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir
Focus
This week’s focus was on transforming our high-level goals into concrete, executable steps. The main efforts were directed toward curating the fine-tuning dataset, preparing it for the OpenAI API, and implementing the first version of our human-in-the-loop feedback UI.
Goals for the Week
- Begin region-specific dataset curation for fine-tuning
- Experiment with OpenAI fine-tuning JSONL preparation (`{input, output}` pairs)
- Implement feedback loop UI for human validation against model scores
- Explore integration of statistical calibration for score distributions
Tasks Completed
| Task | Status | Notes |
| --- | --- | --- |
| Curation of R2A & R9 datasets for fine-tuning | ✅ Completed | Partitioned dataset into region-specific JSONL files with `image_url` and `score` |
| Created `prepare_finetune_data.py` script | ✅ Completed | Transforms data into OpenAI’s required chat `messages` JSONL format |
| Added feedback loop UI to Gradio app | ✅ Completed | Allows human annotators to correct model predictions in real time |
| Explored and integrated Platt Scaling for score calibration | ✅ Completed | Markedly improved the alignment between prediction scores and true probabilities |
| Ran first successful fine-tuning data validation check | ✅ Completed | Confirmed data integrity for the initial fine-tuning job |
| Finalized data validation for all regions | ✅ Completed | Ensured all datasets are clean and ready for fine-tuning |
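The transformation step can be sketched as follows. This is a minimal, hypothetical version of what a script like `prepare_finetune_data.py` might do; the system prompt, field names, and example values are assumptions, not the actual SynerSense schema. Each dataset row becomes one JSONL line in the chat-completions `messages` format OpenAI fine-tuning expects.

```python
import json

def to_chat_record(image_url: str, score: float) -> dict:
    """Convert one (image_url, score) row into a chat-format fine-tuning record.

    Hypothetical prompt wording; the real script's prompts may differ.
    """
    return {
        "messages": [
            {"role": "system", "content": "Score the image region."},
            {"role": "user", "content": image_url},
            {"role": "assistant", "content": str(score)},
        ]
    }

def write_jsonl(rows, path):
    """Write records one-per-line, as the fine-tuning API requires."""
    with open(path, "w") as f:
        for image_url, score in rows:
            f.write(json.dumps(to_chat_record(image_url, score)) + "\n")
```

Each line of the output file is an independent JSON object, which is what makes per-line validation (see below) straightforward.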
Key Learnings
- Data curation for fine-tuning is more than just formatting; it’s a critical process of selecting high-quality, representative examples.
- A human-in-the-loop feedback system is invaluable for correcting model bias and building a robust ground truth dataset.
- Statistical calibration techniques like Platt Scaling can correct overconfident or underconfident model scores, making them more interpretable and reliable.
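Platt Scaling boils down to fitting a logistic regression on the model's raw scores against binary ground-truth labels, then reading the fitted sigmoid's output as a calibrated probability. A minimal sketch with scikit-learn, using made-up scores and labels (the actual SynerSense data is not shown here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical raw model scores and human-validated binary labels.
raw_scores = np.array([0.10, 0.30, 0.35, 0.60, 0.80, 0.90]).reshape(-1, 1)
labels = np.array([0, 0, 1, 0, 1, 1])

# Platt Scaling: fit a sigmoid (1-D logistic regression) to the scores.
calibrator = LogisticRegression()
calibrator.fit(raw_scores, labels)

# Calibrated probabilities replace the raw, over/underconfident scores.
calibrated = calibrator.predict_proba(raw_scores)[:, 1]
```

In practice the calibrator should be fit on a held-out validation split, not on the same data the model was trained on, or the calibration itself will be biased.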
Problems Faced & Solutions
| Problem | Solution |
| --- | --- |
| `json.decoder.JSONDecodeError` during data preparation | Wrote a robust validation script to catch and skip malformed JSON lines before API ingestion |
| Incorrect fine-tuning data format (`messages` vs. `prompt`) | Revised `prepare_finetune_data.py` to match the latest chat-completions fine-tuning format |
| scikit-learn version conflict for calibration functions | Pinned a specific scikit-learn version in a fresh `pip install` to ensure library compatibility |
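The validation approach for the `JSONDecodeError` problem can be sketched like this: parse the JSONL file line by line, keep the records that parse, and count the ones that don't. This is a simplified stand-in for the actual validation script, whose name and reporting details are not given in the source.

```python
import json

def validate_jsonl(path):
    """Parse a JSONL file line by line, skipping malformed lines.

    Returns (valid_records, skipped_count) so bad lines can be
    reported and fixed instead of crashing the whole preparation run.
    """
    valid, skipped = [], 0
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                valid.append(json.loads(line))
            except json.JSONDecodeError:
                skipped += 1
                print(f"Skipping malformed line {lineno}")
    return valid, skipped
```

Running this before upload means a single corrupted line no longer aborts the fine-tuning job submission.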
📎 References
- OpenAI Fine-Tuning Guide: Preparing your Dataset
- A Simple Guide to Platt Scaling for Model Calibration
- Gradio: Interactive GUI for Machine Learning
Goals for Next Week
- Run the first OpenAI fine-tuning experiment on the R2A dataset
- Benchmark the fine-tuned model against the base GPT-4o model
- Analyze and document performance gains and changes in specific regions
- Start the fine-tuning process for other key regions (R9, R_1, etc.)
Screenshots (Optional)
- Side-by-side comparison of model prediction vs. human feedback in the Gradio UI.
- A plot showing the calibration curve of the model before and after Platt Scaling.
“This week, we moved from building the cockpit to fueling the engine with a high-quality dataset and a human-in-the-loop validation system.”