Week 08 – VLM Integration and Embedding Transfer

Dates: 2025-07-22 – 2025-07-28
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir


Focus

This week focused on slicing a pretrained CLIP model to extract vision embeddings and connecting them to a lightweight MLP classifier for face verification.


Goals for the Week

  • Load a pretrained vision-language model (VLM)
  • Extract image embeddings from CLIP’s visual backbone (see the sketch after this list)
  • Train an MLP on the embeddings to verify face identity
  • Track training loss and evaluate accuracy on a labeled test set
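
A minimal sketch of the first two goals, assuming the HuggingFace `CLIPVisionModel` and `CLIPProcessor` classes; the image path is a placeholder, not an actual file from the project.

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPVisionModel

# Load only the vision tower of CLIP ViT-B/32; the text encoder is not needed here.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "face.jpg" is a hypothetical path used only for illustration.
image = Image.open("face.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_encoder(**inputs)

embedding = outputs.pooler_output  # shape (1, 768): the input to the MLP head
```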

Tasks Completed

| Task | Status | Notes |
| --- | --- | --- |
| Loaded openai/clip-vit-base-patch32 model | ✅ Completed | Used HuggingFace Transformers to pull out the vision model |
| Built MLP classifier with frozen vision encoder | ✅ Completed | Fed the 768-dim pooled output into dense layers (see the training sketch below) |
| Trained model on sample dataset with labels | ✅ Completed | Achieved good class separation; trained with CrossEntropyLoss |
| Visualized embeddings for known identities | ✅ Completed | t-SNE plot showed distinct regions for each identity |
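
The frozen-encoder setup and CrossEntropyLoss training step from the table above can be sketched roughly as follows; the hidden sizes, dropout, learning rate, and `num_identities` are illustrative assumptions, not the exact values used, and the random tensors stand in for real pooled embeddings and labels.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

# Frozen backbone: only the MLP head receives gradient updates.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
vision_encoder.requires_grad_(False)
vision_encoder.eval()

class FaceMLP(nn.Module):
    """Small classifier head on top of frozen CLIP pooled embeddings."""
    def __init__(self, embed_dim: int = 768, num_identities: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, num_identities),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

head = FaceMLP()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# One illustrative training step on a batch of pooled embeddings.
embeddings = torch.randn(32, 768)      # stand-in for a batch of pooler_output vectors
labels = torch.randint(0, 10, (32,))   # stand-in identity labels
loss = criterion(head(embeddings), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```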

Key Learnings

  • Learned how to use CLIP’s visual backbone for downstream tasks
  • Understood freezing strategies when using pretrained vision encoders
  • Learned model slicing and integration between libraries (HF + PyTorch)
  • Visualized VLM-based embeddings and verified their quality (see the t-SNE sketch after this list)
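
A minimal sketch of how the identity embeddings could be projected with t-SNE, assuming scikit-learn and matplotlib are available; the array shapes, number of identities, and output filename are illustrative, with random data standing in for the real embeddings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for a matrix of pooled CLIP embeddings and their identity labels.
embeddings = np.random.randn(200, 768)
labels = np.random.randint(0, 5, size=200)

# Project the 768-dim embeddings down to 2D for visual inspection.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
points = tsne.fit_transform(embeddings)

plt.figure(figsize=(6, 5))
scatter = plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=12)
plt.colorbar(scatter, label="identity")
plt.title("t-SNE of CLIP vision embeddings")
plt.tight_layout()
plt.savefig("tsne_embeddings.png")
```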

Problems Faced & Solutions

| Problem | Solution |
| --- | --- |
| Memory errors during image batch processing | Reduced the batch size and cleared the CUDA cache between batches |
| Shape mismatch between the VLM output and the MLP input | Switched to the vision encoder’s .pooler_output (768-dim) |
| Slow inference on large image folders | Batched inputs with CLIPProcessor inside a loop (see the sketch after this table) |
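
A minimal sketch of the batched-inference workaround from the last two rows, assuming a hypothetical `face_images` folder; the batch size and cache clearing mirror the fixes described above.

```python
import torch
from pathlib import Path
from PIL import Image
from transformers import CLIPProcessor, CLIPVisionModel

device = "cuda" if torch.cuda.is_available() else "cpu"
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image folder; point this at the actual dataset location.
image_paths = sorted(Path("face_images").glob("*.jpg"))
batch_size = 16  # kept small to avoid CUDA out-of-memory errors

all_embeddings = []
with torch.no_grad():
    for i in range(0, len(image_paths), batch_size):
        batch = [Image.open(p).convert("RGB") for p in image_paths[i:i + batch_size]]
        inputs = processor(images=batch, return_tensors="pt").to(device)
        outputs = vision_encoder(**inputs)
        all_embeddings.append(outputs.pooler_output.cpu())
        torch.cuda.empty_cache()  # release cached blocks between batches

embeddings = torch.cat(all_embeddings) if all_embeddings else torch.empty(0, 768)
print(embeddings.shape)  # (num_images, 768)
```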


Goals for Next Week

  • Wrap up the internship with polished documentation and a learning summary
  • Record demo videos for Gradio and CLIP inference pipelines
  • Prepare LinkedIn write-up and GitHub README updates

“Working with a pretrained model like CLIP gave me practical insights into modern transfer learning workflows for vision tasks.”