Week 12 – Benchmarking, Dataset Prep & Multi-Dashboard Tracking
Dates: August 17 – August 23
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir
Focus
This week was centered on setting up benchmarking pipelines, integrating tracking frameworks (W&B, Weave, Trackio), and preparing the evaluation dataset for fine-tuning readiness.
Goals for the Week
- Build a reproducible benchmarking pipeline for prompt versions
- Integrate WandB, Weave, and Trackio dashboards for side-by-side tracking
- Prepare evaluation dataset in JSONL format with `{image, label, feature}` entries
- Ensure error handling and consistency in dataset parsing
- Lay groundwork for region-specific fine-tuning
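The `{image, label, feature}` JSONL format above can be sketched as follows. This is a minimal illustration: the field values and the `eval_set.jsonl` file name are assumptions, not the actual dataset.

```python
import json

# Hypothetical records following the {image, label, feature} schema
# described above; paths, labels, and file name are placeholders.
entries = [
    {"image": "images/sample_001.png", "label": "pass", "feature": "R2A"},
    {"image": "images/sample_002.png", "label": "fail", "feature": "R9"},
]

def write_jsonl(path, records):
    """Write one JSON object per line (JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

write_jsonl("eval_set.jsonl", entries)
```

One object per line (rather than a single JSON array) keeps the file streamable and makes a single malformed entry recoverable.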
Tasks Completed
Task | Status | Notes |
---|---|---|
Implemented benchmark.py with accuracy, parse error, and latency metrics | ✅ Completed | Supports per-feature filtering (R2A, R9, etc.) |
Added WandB logging for accuracy/latency tracking | ✅ Completed | Centralized experiment history maintained |
Integrated Trackio for Hugging Face-style lightweight metrics | ✅ Completed | Local + HF dashboard visualization |
Connected Weave for function-level tracing | ✅ Completed | Ready to capture model reasoning traces |
Built Gradio UI with tabs for analysis & benchmarking | ✅ Completed | Users can analyze single images OR run full benchmarks |
Resolved dataset JSONL parsing errors and empty-line handling | ✅ Completed | Now skips empty lines gracefully |
Drafted region-specific dataset curation plan | ✅ Completed | To be used in Week 13 fine-tuning stage |
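A metrics aggregation like the one in `benchmark.py` (accuracy, parse-error rate, latency, with per-feature filtering) could look like this sketch. The record field names (`feature`, `correct`, `parse_error`, `latency_s`) are assumptions, not the actual schema.

```python
from statistics import mean

def summarize(results, feature=None):
    """Aggregate accuracy, parse-error rate, and mean latency,
    optionally filtered to one feature (e.g. "R2A", "R9")."""
    rows = [r for r in results if feature is None or r["feature"] == feature]
    if not rows:
        return None  # no runs match the requested feature
    parsed = [r for r in rows if not r["parse_error"]]
    return {
        "n": len(rows),
        # accuracy is computed over successfully parsed responses only
        "accuracy": (sum(r["correct"] for r in parsed) / len(parsed)) if parsed else 0.0,
        "parse_error_rate": 1 - len(parsed) / len(rows),
        "mean_latency_s": mean(r["latency_s"] for r in rows),
    }

# Illustrative results; real runs would come from the benchmark loop.
runs = [
    {"feature": "R2A", "correct": True, "parse_error": False, "latency_s": 1.2},
    {"feature": "R9", "correct": False, "parse_error": True, "latency_s": 2.0},
]
print(summarize(runs, feature="R2A"))  # accuracy 1.0 over the R2A subset
```

Keeping aggregation as a pure function over result records makes the same numbers easy to log to any of the three dashboards.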
Key Learnings
- Parallel dashboards (W&B + Weave + Trackio) provide a 360° experiment view:
- W&B → performance metrics
- Weave → tracing, interpretability
- Trackio → lightweight local + HF ecosystem integration
- A structured JSONL dataset acts as the backbone for both evaluation and fine-tuning.
- Benchmark reproducibility depends on stable prompts + controlled temperature.
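One way to make the "stable prompts + controlled temperature" dependency concrete is to fingerprint each run configuration, so identical setups can be grouped across dashboards. This helper is an assumption for illustration, not part of `benchmark.py`.

```python
import hashlib
import json

def run_fingerprint(prompt_text, params):
    """Hash the prompt and generation settings into a short run ID.
    Hypothetical helper: runs sharing a fingerprint used the exact
    same prompt version and parameters, so they are comparable."""
    payload = json.dumps({"prompt": prompt_text, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

fp = run_fingerprint("Rate feature R2A in this image.", {"temperature": 0.0})
```

Any change to the prompt or to a sampling parameter yields a different fingerprint, which makes silent configuration drift visible in the experiment history.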
Problems Faced & Solutions
Problem | Solution |
---|---|
FileNotFoundError during benchmark runs | Fixed image path consistency & ensured proper renaming utilities |
JSON parsing errors from empty lines | Added line validation & continue on blanks |
Weave initialization bug with SyncClientSession | Reinstalled weave + aligned with WandB entity/project setup |
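The blank-line fix from the table above amounts to validating each line before parsing. A minimal sketch (the loader name and error message are assumptions):

```python
import json

def load_jsonl(path):
    """Parse a JSONL file, skipping blank lines instead of raising
    on empty strings, as described in the fix above."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, raw in enumerate(f, start=1):
            line = raw.strip()
            if not line:  # skip empty lines gracefully
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as e:
                # surface the offending line number instead of a bare traceback
                raise ValueError(f"Malformed JSON on line {line_no}: {e}") from e
    return records
```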
Goals for Next Week
- Begin region-specific dataset curation for fine-tuning
- Experiment with OpenAI fine-tuning JSONL preparation (`{input, output}` pairs)
- Implement feedback loop UI for human validation against model scores
- Explore integration of statistical calibration for score distributions
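The planned `{input, output}` pair preparation could start from the evaluation entries. This is a sketch under assumptions: the prompt template and field names are placeholders, and the final JSONL would still need to match whatever schema the fine-tuning API expects.

```python
import json

def to_finetune_pairs(eval_records):
    """Convert {image, label, feature} evaluation entries into the
    {input, output} pairs planned for Week 13. The prompt template
    here is a hypothetical example."""
    pairs = []
    for rec in eval_records:
        pairs.append({
            "input": f"Analyze feature {rec['feature']} in image {rec['image']}.",
            "output": rec["label"],
        })
    return pairs

def dump_pairs(pairs, path):
    """Write the pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for p in pairs:
            f.write(json.dumps(p) + "\n")
```

Deriving the fine-tuning file from the same JSONL backbone keeps evaluation and training data in sync, echoing the "backbone" learning above.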
Screenshots (Optional)
Example Gradio UI with tabs, side-by-side dashboards (W&B + Weave + Trackio), benchmark results.
“Week 12 was about building the cockpit: a complete experiment tracking and benchmarking ecosystem.”