BLIP: Bootstrapping Language–Image Pre‑training

Paper: BLIP: Bootstrapping Language–Image Pre‑training for Unified Vision–Language Understanding and Generation
Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi — Salesforce Research (Feb 2022)


Summary

BLIP is a unified vision–language pre-training framework that excels at both understanding (e.g., retrieval, VQA) and generation (e.g., captioning) tasks. It combines:

  • A Multimodal Mixture of Encoder–Decoder (MED) backbone that can operate as a unimodal encoder (image–text contrastive learning), an image-grounded text encoder (image–text matching), or an image-grounded text decoder (caption generation)
  • A Captioner + Filter (CapFilt) bootstrapping pipeline: a captioner generates synthetic captions for web images and a filter removes noisy ones, yielding cleaner pre-training data

It achieves state-of-the-art results on image–text retrieval (+2.7% average Recall@1), image captioning (+2.8% CIDEr), and VQA (+1.6% VQA score).


Key Concepts

Multimodal Mixture of Encoder–Decoder (MED)

One architecture supports three modes of operation (a code sketch follows the list):

  1. Unimodal encoding for image–text contrastive alignment (ITC)
  2. Image-grounded text encoding for image–text matching (ITM)
  3. Image-grounded text decoding for caption generation (LM)
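
A minimal PyTorch-style sketch of how one text transformer could serve all three roles. The module names, the mode switch, and the positive-only ITM term are illustrative assumptions, not the authors' released code:

import torch
import torch.nn.functional as F

def med_losses(images, text_ids, text_mask,
               vision_encoder, text_model, itm_head, temp=0.07):
    """Hypothetical sketch: one text transformer reused in three modes."""
    img_feat = F.normalize(vision_encoder(images), dim=-1)              # (B, D)

    # 1) ITC: text model as a unimodal encoder, aligned with the image contrastively
    txt_feat = F.normalize(text_model(text_ids, text_mask, mode="unimodal"), dim=-1)
    sim = img_feat @ txt_feat.T / temp
    labels = torch.arange(sim.size(0), device=sim.device)
    loss_itc = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.T, labels)) / 2

    # 2) ITM: text model as an image-grounded encoder (cross-attention to the image);
    #    a binary head scores match / no-match (only positive pairs shown for brevity)
    fused = text_model(text_ids, text_mask, image=img_feat, mode="encoder")
    loss_itm = F.cross_entropy(itm_head(fused), torch.ones_like(labels))

    # 3) LM: text model as an image-grounded decoder trained to generate the caption
    loss_lm = text_model(text_ids, text_mask, image=img_feat,
                         mode="decoder", labels=text_ids).loss

    return loss_itc + loss_itm + loss_lm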

Bootstrapped Data Cleaning (CapFilt)

A captioner (the image-grounded decoder, fine-tuned on COCO) generates synthetic captions for web images, and a filter (the image-grounded encoder, fine-tuned with the ITC and ITM objectives) removes noisy captions from both the original web text and the synthetic ones; the cleaned pairs, together with human-annotated data, are used to pre-train a fresh model (see the sketch below).
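
A short Python sketch of the bootstrapping loop under stated assumptions: captioner.generate, filter_model.itm_score, and the fixed threshold are hypothetical stand-ins for the captioner's decoding and the filter's ITM head, not the released API:

def capfilt(web_pairs, captioner, filter_model, threshold=0.5):
    """web_pairs: iterable of (image, web_text). Returns a cleaned caption set."""
    cleaned = []
    for image, web_text in web_pairs:
        synthetic = captioner.generate(image)            # image-grounded decoder
        for text in (web_text, synthetic):
            # keep a caption only if the filter's ITM head judges it a match
            if filter_model.itm_score(image, text) > threshold:
                cleaned.append((image, text))
    return cleaned                                       # used to pre-train a new model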

Performance Focus

Aims at strong performance across both retrieval and generation tasks—not favoring one over the other.


Workflow Diagram

graph TD;
  Img[Raw web image] --> Cap[Captioner generates synthetic caption];
  Img & Cap --> Filt[Filter removes noisy captions];
  Filt --> Clean[Filtered captions];
  Img & Clean --> MED[Multimodal Encoder–Decoder];
  MED --> Tasks[Retrieval · Captioning · VQA]

Working Code & Examples

The authors released the official PyTorch implementation on GitHub (salesforce/BLIP), and pretrained checkpoints are available through Hugging Face Transformers:

from PIL import Image
import requests
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the image-captioning checkpoint from the Hugging Face Hub
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

# Fetch the demo image used in the BLIP repository
img = Image.open(requests.get(
    "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg",
    stream=True).raw).convert("RGB")

# Preprocess, generate, and decode a caption
inputs = processor(img, return_tensors="pt")
output = model.generate(**inputs)
print(processor.decode(output[0], skip_special_tokens=True))
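
The understanding side can be exercised the same way. A brief sketch, assuming the Salesforce/blip-itm-base-coco retrieval checkpoint on the Hub and reusing the img loaded above; the use_itm_head flag and itm_score output follow recent Transformers versions, so verify against your installed release:

from transformers import BlipProcessor, BlipForImageTextRetrieval

itm_processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
itm_model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

# Score how well a candidate caption matches the demo image loaded above
inputs = itm_processor(img, "a woman and her dog on the beach", return_tensors="pt")
itm_logits = itm_model(**inputs, use_itm_head=True).itm_score    # 2-way match / no-match logits
cosine_sim = itm_model(**inputs, use_itm_head=False).itm_score   # ITC-style similarity
print(itm_logits.softmax(dim=-1)[:, 1].item(), cosine_sim.item())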

Reflections

“BLIP showed me that clean data matters just as much as scale—in vision-language tasks, synthetic captions + filtering enable both retrieval and captioning strength.”

Takeaways:

  • Unified models excel when architecture adapts to varied tasks
  • Synthetic captioning cleans noisy data effectively
  • Strong open-source implementation supports instant experimentation

References

  • Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. arXiv:2201.12086.
  • Official code and models: https://github.com/salesforce/BLIP