Week 12 – Benchmarking, Dataset Prep & Multi-Dashboard Tracking

Dates: August 17 – August 23
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir


Focus

This week was centered on setting up benchmarking pipelines, integrating tracking frameworks (W&B, Weave, Trackio), and preparing the evaluation dataset for fine-tuning readiness.


Goals for the Week

  • Build a reproducible benchmarking pipeline for prompt versions
  • Integrate W&B, Weave, and Trackio dashboards for side-by-side tracking
  • Prepare evaluation dataset in JSONL format with {image, label, feature} entries
  • Ensure error handling and consistency in dataset parsing
  • Lay groundwork for region-specific fine-tuning
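The evaluation-dataset goal above can be sketched with a tiny JSONL writer; the file name and field values here are illustrative placeholders, not the actual dataset:

```python
import json

def write_entries(path, entries):
    """Write evaluation entries as JSONL: one {image, label, feature} object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

# Illustrative entries only -- real labels/features come from the curated dataset.
entries = [
    {"image": "images/sample_001.png", "label": "good", "feature": "R2A"},
    {"image": "images/sample_002.png", "label": "poor", "feature": "R9"},
]
write_entries("eval_set.jsonl", entries)
```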

Tasks Completed

| Task | Status | Notes |
| --- | --- | --- |
| Implemented benchmark.py with accuracy, parse-error, and latency metrics | ✅ Completed | Supports per-feature filtering (R2A, R9, etc.) |
| Added W&B logging for accuracy/latency tracking | ✅ Completed | Centralized experiment history maintained |
| Integrated Trackio for Hugging Face-style lightweight metrics | ✅ Completed | Local + HF dashboard visualization |
| Connected Weave for function-level tracing | ✅ Completed | Ready to capture model reasoning traces |
| Built Gradio UI with tabs for analysis & benchmarking | ✅ Completed | Users can analyze single images or run full benchmarks |
| Resolved dataset JSONL parsing errors and empty-line handling | ✅ Completed | Now skips empty lines gracefully |
| Drafted region-specific dataset curation plan | ✅ Completed | To be used in Week 13 fine-tuning stage |
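A minimal sketch of the metrics a benchmark like the one above might compute; the function and field names are assumptions for illustration, not the actual `benchmark.py` internals:

```python
import json
import time

def run_benchmark(entries, predict, feature=None):
    """Compute accuracy, parse-error rate, and mean latency over
    {image, label, feature} entries; optionally filter to one feature (e.g. "R2A")."""
    if feature:
        entries = [e for e in entries if e["feature"] == feature]
    correct = parse_errors = 0
    latencies = []
    for entry in entries:
        start = time.perf_counter()
        raw = predict(entry["image"])          # model call returning a JSON string
        latencies.append(time.perf_counter() - start)
        try:
            pred = json.loads(raw)["label"]
        except (json.JSONDecodeError, KeyError):
            parse_errors += 1                  # count unparseable responses
            continue
        correct += pred == entry["label"]
    n = len(entries) or 1                      # guard against an empty filter
    return {
        "accuracy": correct / n,
        "parse_error_rate": parse_errors / n,
        "mean_latency_s": sum(latencies) / n,
    }
```

The returned dict maps directly onto the dashboard metrics logged to W&B and Trackio.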

Key Learnings

  • Parallel dashboards (W&B + Weave + Trackio) provide a 360° experiment view:
    • W&B → performance metrics
    • Weave → tracing, interpretability
    • Trackio → lightweight local + HF ecosystem integration
  • A structured JSONL dataset acts as the backbone for both evaluation and fine-tuning.
  • Benchmark reproducibility depends on stable prompts + controlled temperature.
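Keeping the three dashboards in sync can be reduced to one fan-out helper; the commented wiring is a hypothetical example that assumes both `wandb` and `trackio` are installed (Trackio is designed as a drop-in for the `wandb` `init`/`log` API):

```python
def fan_out(metrics, sinks):
    """Send one metrics dict to every tracker's log function."""
    for sink in sinks:
        sink(metrics)

# Hypothetical wiring, assuming both libraries are installed and initialised:
# import wandb, trackio
# wandb.init(project="prompt-benchmarks")
# trackio.init(project="prompt-benchmarks")
# fan_out({"accuracy": 0.9, "mean_latency_s": 1.2}, [wandb.log, trackio.log])
```

Because every sink receives the identical dict, the dashboards stay comparable run-for-run.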

Problems Faced & Solutions

| Problem | Solution |
| --- | --- |
| FileNotFoundError during benchmark runs | Fixed image-path consistency and added renaming utilities |
| JSON parsing errors from empty lines | Added line validation; blank lines are now skipped |
| Weave initialization bug with SyncClientSession | Reinstalled weave and aligned it with the W&B entity/project setup |
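The blank-line fix above amounts to validating each line before parsing; a minimal sketch (names are illustrative):

```python
import json

def load_jsonl(path):
    """Load a JSONL file, skipping blank lines instead of raising on them."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            line = line.strip()
            if not line:           # blank line: skip gracefully
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError as err:
                print(f"Skipping malformed line {line_no}: {err}")
    return entries
```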


Goals for Next Week

  • Begin region-specific dataset curation for fine-tuning
  • Experiment with OpenAI fine-tuning JSONL preparation ({input, output} pairs)
  • Implement feedback loop UI for human validation against model scores
  • Explore integration of statistical calibration for score distributions
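The planned {input, output} preparation could start from the existing evaluation entries; a hedged sketch, assuming the prompt template shown here (the real template would come from the prompt-versioning work):

```python
def to_finetune_pairs(eval_entries):
    """Convert {image, label, feature} evaluation entries into {input, output}
    fine-tuning pairs. The prompt template is a placeholder assumption."""
    pairs = []
    for entry in eval_entries:
        pairs.append({
            "input": f"Assess feature {entry['feature']} in image {entry['image']}.",
            "output": entry["label"],
        })
    return pairs
```

Serializing each pair as one JSONL line then mirrors the evaluation-set format, so both files can share the same loader.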

Screenshots (Optional)

Example Gradio UI with tabs, side-by-side dashboards (W&B + Weave + Trackio), benchmark results.


“Week 12 was about building the cockpit: a complete experiment tracking and benchmarking ecosystem.”