Daily Work Report - 2026-03-17

Summary

  • Implemented live-fetching of remote Hugging Face job logs and integrated it into the backend poller.
  • Polished documentation (root, backend, frontend) and updated .env.example.

Files changed

  • Docs: README.md (root, backend, frontend), .env.example
  • Backend: job_runner.py - added _fetch_hf_job_logs() and integrated it into _poll_hf_jobs_loop().
  • Project plan: updated internal todo list to track HF live-log polling.

What I implemented

  • HF live-log polling: _fetch_hf_job_logs() tries huggingface_hub first, then falls back to hf jobs logs <id> CLI. New lines are appended to per-job logs for incremental reads.
  • Docs: rewrote and cross-linked top-level, backend, and frontend READMEs; clarified .env.example placeholders and formats.
  • Tracking: added and completed a todo item for HF live-log polling.
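The incremental-append behavior described above can be sketched as follows. This is a minimal illustration, not the actual code: the helper name append_new_lines and the in-memory store are hypothetical, and the real _fetch_hf_job_logs() also handles fetching via huggingface_hub or the hf jobs logs CLI, which is omitted here.

```python
# Hypothetical sketch of the incremental-append step used by the HF log
# poller. The remote log is treated as append-only, so new content is the
# slice past the number of lines already stored for that job.

job_logs: dict[str, list[str]] = {}  # job_id -> lines already stored

def append_new_lines(job_id: str, fetched_text: str) -> list[str]:
    """Append only lines not yet stored for this job; return the new lines."""
    stored = job_logs.setdefault(job_id, [])
    fetched = fetched_text.splitlines()
    new = fetched[len(stored):]  # append-only log: slice by stored count
    stored.extend(new)
    return new
```

Clients can then read incrementally by remembering how many lines they have already consumed.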

How to verify locally

  1. Ensure HF_TOKEN is set (repo .env or environment).

  2. Start the backend (PowerShell):

cd backend
.\.venv\Scripts\Activate.ps1   # if using venv
uv run uvicorn main:app --reload --port 8000
  3. (Optional) Start the frontend:
cd frontend
npm install
npm run dev -- --port 5173
  4. Submit a Hugging Face job (example):
curl -X POST http://localhost:8000/api/jobs/start \
  -H "Content-Type: application/json" \
  -d '{"command":"tune","is_local":false,"flavor":"cpu-basic","timeout":"3h","script":"train.py","args":{ "n_trials":3 }}'
  5. Check job registry and logs:
  • Poll job list: GET http://localhost:8000/jobs
  • Poll incremental logs: GET http://localhost:8000/jobs/<job_id>/logs?since=0
  • Tail per-job file: backend/.anana-results/logs/<job_id>.log
  6. Observe HF logs: after submission you should see submission output (including the HF job URL). The backend poller will fetch remote logs every HF_POLL_INTERVAL_SECONDS (default 30s) and append new lines to the job log.
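The incremental-logs endpoint from step 5 can be consumed with a small polling client along these lines. The response shape ({"lines": [...], "next_since": N}) is an assumption for illustration; check the backend's actual /jobs/<job_id>/logs handler for the real schema.

```python
# Hypothetical client for GET /jobs/<job_id>/logs?since=N.
# The payload fields "lines" and "next_since" are assumptions.
import json
import time
import urllib.request

def advance_cursor(payload: dict, since: int) -> tuple[list[str], int]:
    """Extract new lines and the next cursor from one poll response."""
    return payload.get("lines", []), payload.get("next_since", since)

def tail_job_logs(base_url: str, job_id: str, interval: float = 5.0) -> None:
    """Print new log lines as they arrive, advancing the since cursor."""
    since = 0
    while True:
        url = f"{base_url}/jobs/{job_id}/logs?since={since}"
        with urllib.request.urlopen(url) as resp:
            payload = json.load(resp)
        lines, since = advance_cursor(payload, since)
        for line in lines:
            print(line)
        time.sleep(interval)
```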

Notes & recommendations

  • Requirements: huggingface_hub Python package or the hf CLI must be available, and HF_TOKEN must be valid.
  • Default HF_POLL_INTERVAL_SECONDS is 30s. Lower to 5–10s for near-real-time logs but be mindful of rate limits.
  • Option: add a dedicated log-only poller with backoff for more frequent updates.
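The dedicated log-only poller with backoff mentioned above could schedule its next poll roughly like this. Names and defaults are illustrative, not existing configuration: poll quickly while lines are arriving, back off geometrically toward a ceiling while idle.

```python
# Hypothetical backoff schedule for a log-only poller: reset to the floor
# whenever new lines arrive, otherwise double the interval up to the cap
# (e.g. the existing HF_POLL_INTERVAL_SECONDS value).

def next_interval(current: float, got_new_lines: bool,
                  floor: float = 5.0, cap: float = 30.0,
                  factor: float = 2.0) -> float:
    if got_new_lines:
        return floor                       # activity: poll quickly again
    return min(current * factor, cap)      # idle: back off up to the cap
```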

Context & diary (concise)

You’ve transitioned from local scripts to a control-plane architecture: the FastAPI backend now orchestrates long-running training jobs and surfaces logs/metadata to the SvelteKit frontend.

Internship - Mar 17, 2026

  • Role: AI Engineer - SynerSense
  • Project: AnanaCare ML Control Plane
  • Hours: 8

Work summary

  • Migrated execution behavior into a backend control plane and added HF live-log polling.
  • Implemented streaming of training stdout into a live UI console.
  • Prepared train.py to accept external config.json inputs for reproducible runs.

Tech stack

  • Backend: FastAPI / Python - async background tasks, subprocess management, WebSockets
  • Frontend: SvelteKit / Tailwind - live console and dashboard
  • MLOps: Hugging Face Hub / Jobs - remote GPU orchestration and metadata
  • Infra: PowerShell / Docker - environment management

Learnings

  • Focus on job lifecycle management rather than single-run scripts.
  • HF Hub can be used as a lightweight experiment store to simplify deployment.

Blockers & risks

  • Zombie processes: child training tasks may outlive killed parent shells - plan to add psutil process-tree termination.
  • HF Job latency: the HF Jobs API can be slow to surface logs and status; the leaderboard should be designed to tolerate eventual consistency (entries may lag until the next poll).
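The planned psutil-based fix for the zombie-process risk could look roughly like this (a sketch assuming psutil is installed; the function name is hypothetical, not current code):

```python
# Hypothetical process-tree termination using psutil: SIGTERM the parent
# and all descendants, wait briefly, then SIGKILL anything still alive.
import psutil

def kill_process_tree(pid: int, timeout: float = 5.0) -> None:
    """Terminate a process and all of its descendants, escalating to kill."""
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return
    procs = parent.children(recursive=True) + [parent]
    for p in procs:
        try:
            p.terminate()                  # polite termination first
        except psutil.NoSuchProcess:
            pass
    _, alive = psutil.wait_procs(procs, timeout=timeout)
    for p in alive:                        # force-kill any stragglers
        p.kill()
```

Collecting children before terminating the parent matters: once the parent dies, its children may be reparented and become unreachable via the tree walk.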


