Week 08 – VLM Integration and Embedding Transfer
Dates: 2025-07-22 – 2025-07-28
Internship: AI/ML Intern at SynerSense Pvt. Ltd.
Mentor: Praveen Kulkarni Sir
Focus
This week focused on slicing a pretrained CLIP model to extract vision embeddings and feeding them into a lightweight MLP classifier for face verification.
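As a minimal sketch of the slicing step, the snippet below loads only CLIP's visual backbone with Hugging Face Transformers and pulls out the pooled embedding for a single image; the image path is a placeholder, not a file from the actual dataset.

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPVisionModel

# Load the preprocessing pipeline and only the vision tower of CLIP
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
vision_encoder.eval()

image = Image.open("face.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_encoder(**inputs)

embedding = outputs.pooler_output  # shape (1, 768) for ViT-B/32
print(embedding.shape)
```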
Goals for the Week
- Load pretrained vision-language model (VLM)
- Extract image embeddings from CLIP’s visual backbone
- Train MLP on embeddings to verify face identity
- Track loss and evaluate accuracy over a labeled test set
Tasks Completed
| Task | Status | Notes |
|---|---|---|
| Loaded the openai/clip-vit-base-patch32 model | ✅ Completed | Used Hugging Face Transformers to extract the vision model |
| Built an MLP classifier on a frozen vision encoder | ✅ Completed | Fed the 768-dim pooled output into dense layers |
| Trained the model on a labelled sample dataset | ✅ Completed | Trained with CrossEntropyLoss; classes showed good separation (see sketch below) |
| Visualized embeddings for known identities | ✅ Completed | t-SNE plot showed distinct regions for each class |
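A rough sketch of the frozen-encoder + MLP setup summarised above; the hidden-layer sizes, class count, learning rate, and dummy batch are assumptions rather than the exact configuration used.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class FaceVerifier(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int = 768):
        super().__init__()
        self.encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        for p in self.encoder.parameters():   # freeze the vision backbone
            p.requires_grad = False
        self.mlp = nn.Sequential(             # lightweight classifier head
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, pixel_values):
        with torch.no_grad():                 # encoder stays frozen
            feats = self.encoder(pixel_values=pixel_values).pooler_output
        return self.mlp(feats)

model = FaceVerifier(num_classes=10)          # assumed number of identities
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.mlp.parameters(), lr=1e-3)

# One training step on a dummy batch (real batches come from the labelled dataset)
pixel_values = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 10, (4,))
loss = criterion(model(pixel_values), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Only the MLP parameters are passed to the optimizer, so the frozen CLIP weights are never updated.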
Key Learnings
- Learned how to use CLIP’s visual backbone for downstream tasks
- Understood freezing strategies when using pretrained vision encoders
- Learned model slicing and integration between libraries (HF + PyTorch)
- Visualized VLM-based embeddings and verified their quality
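To illustrate the embedding-quality check, here is a minimal t-SNE sketch over saved embeddings; the .npy file names and plot settings are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.load("face_embeddings.npy")  # (N, 768) pooled CLIP features, hypothetical file
labels = np.load("face_labels.npy")          # (N,) integer identity labels, hypothetical file

# Project the 768-dim embeddings to 2D for visual inspection
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
points = tsne.fit_transform(embeddings)

plt.figure(figsize=(6, 6))
scatter = plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=10)
plt.colorbar(scatter, label="identity")
plt.title("t-SNE of CLIP face embeddings")
plt.tight_layout()
plt.savefig("tsne_embeddings.png")
```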
Problems Faced & Solutions
| Problem | Solution |
|---|---|
| Memory errors during image batch processing | Reduced the batch size and cleared the CUDA cache between batches |
| Shape mismatch between the VLM output and the MLP input | Used the vision encoder's .pooler_output as the MLP input |
| Slow inference on large image folders | Batched inputs through CLIPProcessor in a loop (see sketch below) |
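A sketch of the batched-inference workaround from the table above; the folder path and batch size are placeholders and were tuned to fit available GPU memory.

```python
import os
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPVisionModel

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()

image_dir = "images/"                 # placeholder folder
paths = sorted(os.listdir(image_dir))
batch_size = 16                       # kept small to avoid CUDA out-of-memory errors

all_embeddings = []
for i in range(0, len(paths), batch_size):
    batch = [Image.open(os.path.join(image_dir, p)).convert("RGB")
             for p in paths[i:i + batch_size]]
    inputs = processor(images=batch, return_tensors="pt").to(device)
    with torch.no_grad():
        feats = encoder(**inputs).pooler_output
    all_embeddings.append(feats.cpu())
    torch.cuda.empty_cache()          # release cached blocks between batches

embeddings = torch.cat(all_embeddings)  # (num_images, 768)
```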
Goals for Next Week
- Wrap up internship with polished documentation and learning summary
- Record demo videos for Gradio and CLIP inference pipelines
- Prepare LinkedIn write-up and GitHub README updates
“Working with a pretrained model like CLIP gave me practical insights into modern transfer learning workflows for vision tasks.”