Week 14 – Fine-Tuning & Benchmarking the First Model

Dates: August 31 – September 6
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir


Focus

This week, we reached a major milestone by running our first fine-tuning experiment on a region-specific dataset. The primary focus was on benchmarking the fine-tuned model, analyzing its performance gains over the base model, and preparing for subsequent fine-tuning jobs.


Goals for the Week

  • Run the first OpenAI fine-tuning experiment on R2A dataset
  • Benchmark the fine-tuned model against the base GPT-4o model
  • Analyze and document performance gains and changes in specific regions
  • Start the fine-tuning process for other key regions (R9, R_1, etc.)

Tasks Completed

| Task | Status | Notes |
| --- | --- | --- |
| Ran successful fine-tuning job on R2A dataset | ✅ Completed | The model gpt-4o-finetuned-r2a was successfully created |
| Benchmarked finetuned-r2a vs. base gpt-4o | ✅ Completed | Saw significant accuracy gains on the R2A feature set |
| Documented performance metrics (accuracy, MAE, F1 score) | ✅ Completed | Results are saved in the experiments/benchmarks/ directory |
| Initiated fine-tuning data prep for R9 and R_1 | ✅ Completed | Datasets are now validated and ready for the fine-tuning API |
| Updated the Gradio UI to visualize performance gains | ✅ Completed | Added a new tab to compare the results of different models |
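Validating the R9 and R_1 datasets meant putting every example into the chat-message JSONL format the OpenAI fine-tuning API expects. A minimal sketch of that step (file name, system prompt, and example content are illustrative, not our actual data):

```python
import json

def write_finetune_jsonl(examples, path):
    """Write (prompt, completion) pairs as chat-format JSONL.

    Each line follows the {"messages": [...]} schema used by
    OpenAI's chat fine-tuning API.
    """
    with open(path, "w", encoding="utf-8") as f:
        for prompt, completion in examples:
            record = {
                "messages": [
                    {"role": "system", "content": "You are a region-specific assistant."},
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": completion},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Illustrative examples only -- not the real R2A training data
examples = [
    ("Classify region features: sample A", "R2A"),
    ("Classify region features: sample B", "R2A"),
]
write_finetune_jsonl(examples, "r2a_train.jsonl")
```

The resulting file can be uploaded with `purpose="fine-tune"` and passed as the training file when creating the job.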

Key Learnings

  • Fine-tuning is a powerful technique for improving performance on specific, narrow tasks where the base model struggles.
  • Data quality is paramount. The work from Weeks 12 and 13 on dataset curation paid off with a smooth fine-tuning process.
  • Benchmarking is crucial. Without it, it’s impossible to quantify the value added by fine-tuning. Our pipeline provides clear, reproducible evidence.
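At its core, the benchmarking pipeline reduces to computing a few standard metrics over paired predictions from the base and fine-tuned models. A dependency-free sketch (the labels and predictions below are toy values, not our actual results):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error for numeric predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true, y_pred, positive=1):
    """F1 score for a binary task, treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy side-by-side comparison of base vs. fine-tuned predictions
y_true = [1, 0, 1, 1, 0, 1]
base   = [0, 0, 1, 0, 1, 1]
tuned  = [1, 0, 1, 1, 0, 1]
print(accuracy(y_true, base), accuracy(y_true, tuned))   # 0.5 vs. 1.0
print(f1_binary(y_true, base), f1_binary(y_true, tuned))
```

Running both models over the same held-out set and logging these numbers per region is what makes the gains reproducible rather than anecdotal.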

Problems Faced & Solutions

| Problem | Solution |
| --- | --- |
| InsufficientTrainingData error from OpenAI API | Adjusted the dataset size and ensured it met the minimum requirements for fine-tuning |
| Unexpectedly high fine-tuning cost | Optimized the data token count and used a smaller, more focused dataset to reduce costs |
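Both issues above can be caught before a job is submitted. A hedged pre-flight sketch: the 10-example floor matches OpenAI's documented minimum for fine-tuning datasets, while the chars/4 token estimate is only a rough heuristic for English text, not the tokenizer's exact count:

```python
import json

MIN_EXAMPLES = 10  # OpenAI's documented minimum dataset size for fine-tuning

def preflight(jsonl_path):
    """Validate a fine-tuning JSONL file and roughly estimate its token count."""
    examples = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                examples.append(json.loads(line))
    if len(examples) < MIN_EXAMPLES:
        raise ValueError(
            f"Only {len(examples)} examples; at least {MIN_EXAMPLES} required, "
            "otherwise the API rejects the job for insufficient training data."
        )
    # Rough heuristic: ~4 characters per token for English text
    chars = sum(len(m["content"]) for ex in examples for m in ex["messages"])
    return len(examples), chars // 4
```

Multiplying the estimated token count by the per-token training price (times the number of epochs) gives a cost ceiling before committing to a run.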

Goals for Next Week

  • Run fine-tuning experiments for R9 and R_1
  • Create a consolidated dashboard to compare performance across all models
  • Explore a hybrid model approach for different regions
  • Draft a final report summarizing the project and key findings

Screenshots (Optional)

Screenshot of the Gradio UI showing a side-by-side comparison of the base model’s predictions vs. the fine-tuned model’s predictions on the same image.


“Week 14 marked the transition from theory to tangible results; we now have a fine-tuned model that proves the value of our entire pipeline.”