Week 14 – Fine-Tuning & Benchmarking the First Model

Dates: August 31 – September 6
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir


Focus

This week, we reached a major milestone by running our first fine-tuning experiment on a region-specific dataset. The primary focus was on benchmarking the fine-tuned model, analyzing its performance gains over the base model, and preparing for subsequent fine-tuning jobs.


Goals for the Week

  • Run the first OpenAI fine-tuning experiment on R2A dataset
  • Benchmark the fine-tuned model against the base GPT-4o model
  • Analyze and document performance gains and changes in specific regions
  • Start the fine-tuning process for other key regions (R9, R_1, etc.)

Tasks Completed

| Task | Status | Notes |
| --- | --- | --- |
| Ran successful fine-tuning job on R2A dataset | ✅ Completed | The model gpt-4o-finetuned-r2a was successfully created |
| Benchmarked finetuned-r2a vs. base gpt-4o | ✅ Completed | Saw significant accuracy gains on the R2A feature set |
| Documented performance metrics (accuracy, MAE, F1 score) | ✅ Completed | Results are saved in the experiments/benchmarks/ directory |
| Initiated fine-tuning data prep for R9 and R_1 | ✅ Completed | Datasets are now validated and ready for the fine-tuning API |
| Updated the Gradio UI to visualize performance gains | ✅ Completed | Added a new tab to compare the results of different models |
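Validating the R9 and R_1 datasets meant putting every example into the chat-message JSONL format the OpenAI fine-tuning API expects. A minimal sketch of that step (file name, system prompt, and example content are illustrative, not our actual data):

```python
import json

def write_finetune_jsonl(examples, path):
    """Write (prompt, completion) pairs as chat-format JSONL.

    Each line follows the {"messages": [...]} schema used by
    OpenAI's chat fine-tuning API.
    """
    with open(path, "w", encoding="utf-8") as f:
        for prompt, completion in examples:
            record = {
                "messages": [
                    {"role": "system", "content": "You are a region-specific assistant."},
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": completion},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Illustrative examples only -- not the real R2A training data
examples = [
    ("Classify region features: sample A", "R2A"),
    ("Classify region features: sample B", "R2A"),
]
write_finetune_jsonl(examples, "r2a_train.jsonl")
```

The resulting file can be uploaded with `purpose="fine-tune"` and passed as the training file when creating the job.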

Key Learnings

  • Fine-tuning is a powerful technique for improving performance on specific, narrow tasks where the base model struggles.
  • Data quality is paramount. The work from Weeks 12 and 13 on dataset curation paid off with a smooth fine-tuning process.
  • Benchmarking is crucial. Without it, it’s impossible to quantify the value added by fine-tuning. Our pipeline provides clear, reproducible evidence.
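At its core, the benchmarking pipeline reduces to computing a few standard metrics over paired predictions from the base and fine-tuned models. A dependency-free sketch (the labels and predictions below are toy values, not our actual results):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error for numeric predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true, y_pred, positive=1):
    """F1 score for a binary task, treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy side-by-side comparison of base vs. fine-tuned predictions
y_true = [1, 0, 1, 1, 0, 1]
base   = [0, 0, 1, 0, 1, 1]
tuned  = [1, 0, 1, 1, 0, 1]
print(accuracy(y_true, base), accuracy(y_true, tuned))   # 0.5 vs. 1.0
print(f1_binary(y_true, base), f1_binary(y_true, tuned))
```

Running both models over the same held-out set and logging these numbers per region is what makes the gains reproducible rather than anecdotal.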

Problems Faced & Solutions

| Problem | Solution |
| --- | --- |
| InsufficientTrainingData error from OpenAI API | Adjusted the dataset size and ensured it met the minimum requirements for fine-tuning |
| Unexpectedly high fine-tuning cost | Optimized the data token count and used a smaller, more focused dataset to reduce costs |
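Both issues above can be caught before a job is submitted. A hedged pre-flight sketch: the 10-example floor matches OpenAI's documented minimum for fine-tuning datasets, while the chars/4 token estimate is only a rough heuristic for English text, not the tokenizer's exact count:

```python
import json

MIN_EXAMPLES = 10  # OpenAI's documented minimum dataset size for fine-tuning

def preflight(jsonl_path):
    """Validate a fine-tuning JSONL file and roughly estimate its token count."""
    examples = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                examples.append(json.loads(line))
    if len(examples) < MIN_EXAMPLES:
        raise ValueError(
            f"Only {len(examples)} examples; at least {MIN_EXAMPLES} required, "
            "otherwise the API rejects the job for insufficient training data."
        )
    # Rough heuristic: ~4 characters per token for English text
    chars = sum(len(m["content"]) for ex in examples for m in ex["messages"])
    return len(examples), chars // 4
```

Multiplying the estimated token count by the per-token training price (times the number of epochs) gives a cost ceiling before committing to a run.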

Goals for Next Week

  • Run fine-tuning experiments for R9 and R_1
  • Create a consolidated dashboard to compare performance across all models
  • Explore a hybrid model approach for different regions
  • Draft a final report summarizing the project and key findings

Screenshots (Optional)

Screenshot of the Gradio UI showing a side-by-side comparison of the base model’s predictions vs. the fine-tuned model’s predictions on the same image.


“Week 14 marked the transition from theory to tangible results; we now have a fine-tuned model that proves the value of our entire pipeline.”