Week 14 – Fine-Tuning & Benchmarking the First Model
Dates: August 31 – September 6
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir
Focus
This week, we reached a major milestone by running our first fine-tuning experiment on a region-specific dataset. The primary focus was on benchmarking the fine-tuned model, analyzing its performance gains over the base model, and preparing for subsequent fine-tuning jobs.
Goals for the Week
- Run the first OpenAI fine-tuning experiment on the R2A dataset
- Benchmark the fine-tuned model against the base GPT-4o model
- Analyze and document performance gains and changes in specific regions
- Start the fine-tuning process for other key regions (R9, R_1, etc.)
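For context on the first goal, OpenAI's chat fine-tuning API expects training data as JSONL, one message list per line. A hypothetical R2A record might look like the following (the prompt and completion text are illustrative placeholders, not actual dataset content):

```json
{"messages": [{"role": "system", "content": "You are a region-aware prediction assistant."}, {"role": "user", "content": "Estimate the target value for this R2A sample: <features>"}, {"role": "assistant", "content": "<expected output>"}]}
```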
Tasks Completed
| Task | Status | Notes |
|---|---|---|
| Ran successful fine-tuning job on R2A dataset | ✅ Completed | The model gpt-4o-finetuned-r2a was successfully created |
| Benchmarked finetuned-r2a vs. base gpt-4o | ✅ Completed | Saw significant accuracy gains on the R2A feature set |
| Documented performance metrics (accuracy, MAE, F1 score) | ✅ Completed | Results are saved in the experiments/benchmarks/ directory |
| Initiated fine-tuning data prep for R9 and R_1 | ✅ Completed | Datasets are now validated and ready for the fine-tuning API |
| Updated the Gradio UI to visualize performance gains | ✅ Completed | Added a new tab to compare the results of different models |
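The benchmark metrics listed above can be reproduced with a small script. The sketch below computes accuracy, MAE, and F1 from paired predictions; the label arrays are made-up placeholders for illustration, not actual R2A results:

```python
# Minimal benchmarking sketch: score base vs. fine-tuned predictions
# against ground truth. All sample data below is hypothetical.

def accuracy(y_true, y_pred):
    """Fraction of exact matches between predictions and labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error, for numeric regression targets."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Binary F1 score for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0, 1]
    base = [0, 0, 1, 0, 1, 1]    # hypothetical base-model outputs
    tuned = [1, 0, 1, 1, 0, 1]   # hypothetical fine-tuned outputs
    for name, pred in [("base gpt-4o", base), ("finetuned-r2a", tuned)]:
        print(f"{name}: accuracy={accuracy(y_true, pred):.2f}, f1={f1(y_true, pred):.2f}")
```

Writing the resulting numbers to experiments/benchmarks/ keeps every run comparable across models.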
Key Learnings
- Fine-tuning is a powerful technique for improving performance on specific, narrow tasks where the base model struggles.
- Data quality is paramount. The work from Weeks 12 and 13 on dataset curation paid off with a smooth fine-tuning process.
- Benchmarking is crucial. Without it, it’s impossible to quantify the value added by fine-tuning. Our pipeline provides clear, reproducible evidence.
Problems Faced & Solutions
| Problem | Solution |
|---|---|
| InsufficientTrainingData error from OpenAI API | Adjusted the dataset size and ensured it met the minimum requirements for fine-tuning |
| Unexpectedly high cost for fine-tuning | Optimized the data token count and used a smaller, more focused dataset to reduce costs |
Goals for Next Week
- Run fine-tuning experiments for R9 and R_1
- Create a consolidated dashboard to compare performance across all models
- Explore a hybrid model approach for different regions
- Draft a final report summarizing the project and key findings
Screenshots (Optional)
Screenshot of the Gradio UI showing a side-by-side comparison of the base model’s predictions vs. the fine-tuned model’s predictions on the same image.
“Week 14 marked the transition from theory to tangible results; we now have a fine-tuned model that proves the value of our entire pipeline.”