MASTERCLASS: Vision-Language Modeling with CLIP + Custom Architectures in PyTorch


PHASE 1: FOUNDATION – What are Vision-Language Models?

What is CLIP?

CLIP (Contrastive Language–Image Pretraining) by OpenAI learns to connect images and text using contrastive learning:

  • It encodes images and text separately into the same vector space
  • Trains with a contrastive loss: matching image-text pairs are pulled close in embedding space, while mismatched pairs are pushed apart (a minimal loss sketch follows below)
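
In practice, this objective is usually implemented as a symmetric cross-entropy over an image-text similarity matrix. Here is a minimal sketch of that idea (the function name and temperature value are illustrative, not CLIP's exact training code):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings so the dot product is cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: logits[i, j] = similarity of image i with text j
    logits = image_emb @ text_emb.t() / temperature

    # Matching image-text pairs sit on the diagonal
    targets = torch.arange(len(logits), device=logits.device)

    # Symmetric cross-entropy: images->texts and texts->images
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2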

Real-World Use Cases:

  • Image classification with text prompts
  • Image search engines
  • Zero-shot transfer learning
  • VLM + MLP for custom predictions (just like we’re building!)

PHASE 2: CODE EXPLAINED – Line by Line

1. Load CLIP and Freeze Vision Encoder

import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_encoder = model.vision_model
for param in vision_encoder.parameters():
    param.requires_grad = False

We freeze CLIP’s vision backbone so its weights are not updated during training (a common practice when working with small datasets).

2. Process Image Input

inputs = processor(images=image, return_tensors="pt").to(device)
  • processor handles resizing, normalization, etc., for CLIP
  • Output: a batch of pixel_values (shape [1, 3, 224, 224] for this checkpoint), ready for the vision encoder

3. Extract Features from Vision Encoder

vision_outputs = vision_encoder(**inputs)
pooled_output = vision_outputs.pooler_output
  • pooled_output: a [1, 768] embedding of the image (like BERT’s CLS token); the sketch below ties steps 2 and 3 together
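
Putting steps 2 and 3 together, a minimal sketch of extracting one image's embedding might look like this (the image path is a placeholder; model, processor, vision_encoder, and device come from step 1):

from PIL import Image

image = Image.open("example.jpg").convert("RGB")  # placeholder path

inputs = processor(images=image, return_tensors="pt").to(device)
with torch.no_grad():  # the encoder is frozen, so skip gradient tracking
    vision_outputs = vision_encoder(**inputs)

pooled_output = vision_outputs.pooler_output
print(pooled_output.shape)  # torch.Size([1, 768]) for clip-vit-base-patch32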

PHASE 3: CUSTOM ARCHITECTURE – MLP Head

We defined a lightweight classifier:

class VLMtoMLP(nn.Module):
    def __init__(self, input_dim=768, hidden_dim=256, output_dim=2):

This MLP learns to classify based on CLIP’s visual understanding
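
Only the constructor signature is shown above; a complete version might look like the following sketch (the hidden layer, activation, and dropout choices are assumptions, not a prescribed design). The resulting mlp_head is what the training phases below refer to:

import torch.nn as nn

class VLMtoMLP(nn.Module):
    def __init__(self, input_dim=768, hidden_dim=256, output_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        # x: [batch, 768] pooled CLIP image embeddings -> [batch, output_dim] logits
        return self.net(x)

mlp_head = VLMtoMLP().to(device)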

Use cases for your MLP head:

  • Binary classification (e.g., “cat” vs. “dog”)
  • Custom outputs (e.g., emotion detection from images)
  • Small datasets where fine-tuning the whole CLIP is overkill

PHASE 4: TRAINING LOOP & LOSS

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(mlp_head.parameters(), lr=1e-4)

Clean and minimal setup! Note that only mlp_head.parameters() go to the optimizer, since the vision encoder stays frozen.

Pro Tip: Use scheduler = torch.optim.lr_scheduler.StepLR(...) for better convergence.
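
For example, a step-decay schedule could be wired in like this (step_size and gamma are illustrative values to tune, not recommendations):

# Halve the learning rate every 5 epochs (values are illustrative)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

# ...then call scheduler.step() once per epoch, after the optimizer updates:
# scheduler.step()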


PHASE 5: PRACTICALS – Build & Train a Real Dataset

Project Idea:

Classify medical images (e.g., normal vs pneumonia) using pre-trained CLIP + your MLP head.
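
As a sketch of the data side, a simple folder-per-class dataset could be wired up like this (the paths, class folders, and batch size are placeholders; the collate function keeps images as PIL objects so CLIPProcessor can batch them in the training loop below):

from pathlib import Path
from PIL import Image
import torch
from torch.utils.data import Dataset, DataLoader

class FolderImageDataset(Dataset):
    # Expects one subfolder per class, e.g. data/train/normal/ and data/train/pneumonia/
    def __init__(self, root):
        root = Path(root)
        self.classes = sorted(p.name for p in root.iterdir() if p.is_dir())
        self.samples = [(path, idx)
                        for idx, cls in enumerate(self.classes)
                        for path in sorted((root / cls).glob("*"))]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        path, label = self.samples[i]
        return Image.open(path).convert("RGB"), label

def collate(batch):
    # Keep images as a list of PIL objects (CLIPProcessor handles batching); stack labels into a tensor
    images, labels = zip(*batch)
    return list(images), torch.tensor(labels)

dataloader = DataLoader(FolderImageDataset("data/train"), batch_size=16,
                        shuffle=True, collate_fn=collate)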

Training Loop (Full Dataset)

num_epochs = 10  # example value
mlp_head.train()
for epoch in range(num_epochs):
    for images, labels in dataloader:
        # Preprocess the batch of PIL images and move labels to the same device
        inputs = processor(images=images, return_tensors="pt").to(device)
        labels = labels.to(device)

        # The encoder is frozen, so no gradients are needed for feature extraction
        with torch.no_grad():
            pooled_output = vision_encoder(**inputs).pooler_output

        logits = mlp_head(pooled_output)
        loss = criterion(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

PHASE 6: ADVANCED LEARNING – Expand Your Expertise

Fine-Tune the Entire CLIP Model

Unfreeze later layers:

for name, param in vision_encoder.named_parameters():
    if "encoder.layers.11" in name or "encoder.layers.10" in name:  # last 2 transformer blocks
        param.requires_grad = True
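
When you unfreeze encoder layers, remember to hand them to the optimizer as well. One common pattern (the learning rates here are assumptions to tune) is to use a smaller learning rate for the pretrained layers than for the fresh MLP head:

optimizer = torch.optim.Adam([
    {"params": mlp_head.parameters(), "lr": 1e-4},
    {"params": [p for p in vision_encoder.parameters() if p.requires_grad], "lr": 1e-5},
])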

Add Text Input to Fuse Image + Text

Use CLIPModel.forward(input_ids, pixel_values) with text + image to do multi-modal classification.
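
A hedged sketch of that idea: run both modalities through CLIPModel, then fuse the projected embeddings (here by simple concatenation, which is an assumption, not the only option) before a classification head. images is a list of PIL images from the dataloader, and fusion_head is a hypothetical classifier you would define:

# One caption per image (pairs must align for this simple fusion)
captions = ["frontal chest x-ray, adult patient"] * len(images)  # hypothetical text input
inputs = processor(text=captions, images=images, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# For clip-vit-base-patch32, both projected embeddings are 512-dimensional
fused = torch.cat([outputs.image_embeds, outputs.text_embeds], dim=-1)  # [batch, 1024]
logits = fusion_head(fused)  # e.g. a hypothetical nn.Linear(1024, 2) classification head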

Go Beyond:

  • Use CLIPModel.get_image_features() and get_text_features() (a zero-shot sketch follows this list)
  • Try SigLIP, BLIP, or CLIP-ViT-L models
  • Integrate with gradio or streamlit to build apps
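
For instance, a zero-shot classifier built on the feature helpers could look like this minimal sketch (the prompts are illustrative, and image is a PIL image as in the earlier steps):

prompts = ["a photo of a normal chest x-ray", "a photo of a chest x-ray with pneumonia"]

text_inputs = processor(text=prompts, return_tensors="pt", padding=True).to(device)
image_inputs = processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)     # [2, 512]
    image_features = model.get_image_features(**image_inputs)  # [1, 512]

# Cosine similarity between the image and each prompt, turned into probabilities
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
probs = (image_features @ text_features.T).softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))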

PHASE 7: EXERCISES TO MASTER CONCEPTS

Task                           Description
Investigate CLIPProcessor      Understand how it preprocesses images
Write Custom Dataset Loader    Load and label images from folders
Build Training Dashboard       Use matplotlib or TensorBoard
Zero-Shot Classifier           Use CLIP for classification without any training
Add Text Branch                Combine image and text inputs using CLIP