Week 12 – Benchmarking, Dataset Prep & Multi-Dashboard Tracking

Dates: August 17 – August 23
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir


Focus

This week was centered on setting up benchmarking pipelines, integrating tracking frameworks (W&B, Weave, Trackio), and preparing the evaluation dataset for fine-tuning readiness.


Goals for the Week

  • Build a reproducible benchmarking pipeline for prompt versions
  • Integrate W&B, Weave, and Trackio dashboards for side-by-side tracking
  • Prepare evaluation dataset in JSONL format with {image, label, feature} entries
  • Ensure error handling and consistency in dataset parsing
  • Lay groundwork for region-specific fine-tuning
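The evaluation-dataset goal above can be sketched with a tiny JSONL writer; the file name and field values here are illustrative placeholders, not the actual dataset:

```python
import json

def write_entries(path, entries):
    """Write evaluation entries as JSONL: one {image, label, feature} object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

# Illustrative entries only -- real labels/features come from the curated dataset.
entries = [
    {"image": "images/sample_001.png", "label": "good", "feature": "R2A"},
    {"image": "images/sample_002.png", "label": "poor", "feature": "R9"},
]
write_entries("eval_set.jsonl", entries)
```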

Tasks Completed

| Task | Status | Notes |
| --- | --- | --- |
| Implemented benchmark.py with accuracy, parse-error, and latency metrics | ✅ Completed | Supports per-feature filtering (R2A, R9, etc.) |
| Added W&B logging for accuracy/latency tracking | ✅ Completed | Centralized experiment history maintained |
| Integrated Trackio for Hugging Face-style lightweight metrics | ✅ Completed | Local + HF dashboard visualization |
| Connected Weave for function-level tracing | ✅ Completed | Ready to capture model reasoning traces |
| Built Gradio UI with tabs for analysis & benchmarking | ✅ Completed | Users can analyze single images or run full benchmarks |
| Resolved dataset JSONL parsing errors and empty-line handling | ✅ Completed | Now skips empty lines gracefully |
| Drafted region-specific dataset curation plan | ✅ Completed | To be used in Week 13 fine-tuning stage |
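A minimal sketch of the metrics a benchmark like the one above might compute; the function and field names are assumptions for illustration, not the actual `benchmark.py` internals:

```python
import json
import time

def run_benchmark(entries, predict, feature=None):
    """Compute accuracy, parse-error rate, and mean latency over
    {image, label, feature} entries; optionally filter to one feature (e.g. "R2A")."""
    if feature:
        entries = [e for e in entries if e["feature"] == feature]
    correct = parse_errors = 0
    latencies = []
    for entry in entries:
        start = time.perf_counter()
        raw = predict(entry["image"])          # model call returning a JSON string
        latencies.append(time.perf_counter() - start)
        try:
            pred = json.loads(raw)["label"]
        except (json.JSONDecodeError, KeyError):
            parse_errors += 1                  # count unparseable responses
            continue
        correct += pred == entry["label"]
    n = len(entries) or 1                      # guard against an empty filter
    return {
        "accuracy": correct / n,
        "parse_error_rate": parse_errors / n,
        "mean_latency_s": sum(latencies) / n,
    }
```

The returned dict maps directly onto the dashboard metrics logged to W&B and Trackio.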

Key Learnings

  • Parallel dashboards (W&B + Weave + Trackio) provide a 360° experiment view:
    • W&B → performance metrics
    • Weave → tracing, interpretability
    • Trackio → lightweight local + HF ecosystem integration
  • A structured JSONL dataset acts as the backbone for both evaluation and fine-tuning.
  • Benchmark reproducibility depends on stable prompts + controlled temperature.
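Keeping the three dashboards in sync can be reduced to one fan-out helper; the commented wiring is a hypothetical example that assumes both `wandb` and `trackio` are installed (Trackio is designed as a drop-in for the `wandb` `init`/`log` API):

```python
def fan_out(metrics, sinks):
    """Send one metrics dict to every tracker's log function."""
    for sink in sinks:
        sink(metrics)

# Hypothetical wiring, assuming both libraries are installed and initialised:
# import wandb, trackio
# wandb.init(project="prompt-benchmarks")
# trackio.init(project="prompt-benchmarks")
# fan_out({"accuracy": 0.9, "mean_latency_s": 1.2}, [wandb.log, trackio.log])
```

Because every sink receives the identical dict, the dashboards stay comparable run-for-run.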

Problems Faced & Solutions

| Problem | Solution |
| --- | --- |
| FileNotFoundError during benchmark runs | Fixed image-path consistency and added renaming utilities |
| JSON parsing errors from empty lines | Added line validation; blank lines are now skipped |
| Weave initialization bug with SyncClientSession | Reinstalled weave and aligned it with the W&B entity/project setup |
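The blank-line fix above amounts to validating each line before parsing; a minimal sketch (names are illustrative):

```python
import json

def load_jsonl(path):
    """Load a JSONL file, skipping blank lines instead of raising on them."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            line = line.strip()
            if not line:           # blank line: skip gracefully
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError as err:
                print(f"Skipping malformed line {line_no}: {err}")
    return entries
```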


Goals for Next Week

  • Begin region-specific dataset curation for fine-tuning
  • Experiment with OpenAI fine-tuning JSONL preparation ({input, output} pairs)
  • Implement feedback loop UI for human validation against model scores
  • Explore integration of statistical calibration for score distributions
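The planned {input, output} preparation could start from the existing evaluation entries; a hedged sketch, assuming the prompt template shown here (the real template would come from the prompt-versioning work):

```python
def to_finetune_pairs(eval_entries):
    """Convert {image, label, feature} evaluation entries into {input, output}
    fine-tuning pairs. The prompt template is a placeholder assumption."""
    pairs = []
    for entry in eval_entries:
        pairs.append({
            "input": f"Assess feature {entry['feature']} in image {entry['image']}.",
            "output": entry["label"],
        })
    return pairs
```

Serializing each pair as one JSONL line then mirrors the evaluation-set format, so both files can share the same loader.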

Screenshots (Optional)

Example Gradio UI with tabs, side-by-side dashboards (W&B + Weave + Trackio), benchmark results.


“Week 12 was about building the cockpit: a complete experiment tracking and benchmarking ecosystem.”