BLIP: Bootstrapping Language–Image Pre‑training
Paper: BLIP: Bootstrapping Language–Image Pre‑training for Unified Vision–Language Understanding and Generation
Authors: Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi — Salesforce Research (Feb 2022)
Summary
BLIP is a unified vision–language pre-training framework that excels at both understanding (e.g., retrieval, VQA) and generation (e.g., captioning) tasks. It combines:
- A Multimodal Mixture of Encoder–Decoder (MED) backbone supporting three modes: contrastive encoding, matching, and image-conditioned decoding
- A Captioner + Filter (CapFilt) bootstrapping pipeline: generates synthetic captions and filters noisy ones to refine pre-training data
BLIP achieves state-of-the-art results on image–text retrieval (+2.7% average Recall@1), image captioning (+2.8% CIDEr), and VQA (+1.6% VQA score).
Key Concepts
Multimodal Encoder–Decoder (MED)
A single architecture, with shared parameters, supports three modes (see the sketch after this list):
- Image–text contrastive learning (unimodal encoders for vision–text alignment)
- Image–text matching (image-grounded text encoder)
- Image-grounded text generation (image-grounded text decoder)
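A minimal conceptual sketch of the three joint pre-training objectives (ITC, ITM, LM). The module names (visual_encoder, text_encoder, fusion_encoder, itm_head, text_decoder, temperature) are hypothetical placeholders, not the released implementation:

# Conceptual sketch of MED's three pre-training losses; all attribute names
# on `model` are hypothetical stand-ins, not the authors' code.
import torch
import torch.nn.functional as F

def med_losses(model, images, texts):
    image_embeds = model.visual_encoder(images)               # (B, N_patches, d)

    # 1) Image-Text Contrastive (ITC): align pooled [CLS] embeddings.
    img_emb = F.normalize(image_embeds[:, 0], dim=-1)
    txt_emb = F.normalize(model.text_encoder(texts)[:, 0], dim=-1)
    sim = img_emb @ txt_emb.T / model.temperature              # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_itc = 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.T, targets))

    # 2) Image-Text Matching (ITM): binary match / no-match classification
    #    through cross-attention (hard-negative mining from `sim` omitted here).
    fused = model.fusion_encoder(texts, image_embeds=image_embeds)
    itm_logits = model.itm_head(fused[:, 0])                   # (B, 2)
    itm_labels = torch.ones(sim.size(0), dtype=torch.long, device=sim.device)
    loss_itm = F.cross_entropy(itm_logits, itm_labels)

    # 3) Language Modeling (LM): autoregressive caption loss conditioned on the image.
    loss_lm = model.text_decoder(texts, image_embeds=image_embeds).loss

    return loss_itc + loss_itm + loss_lm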
Bootstrapped Data Cleaning (CapFilt)
A captioner generates synthetic captions for web images, and a filter removes noisy captions (both web-sourced and synthetic), yielding a cleaner corpus that augments the human-annotated pairs; see the sketch below.
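A hypothetical sketch of the bootstrapping loop, assuming a captioner with a generate method and a filter exposing a match_probability score (the names and threshold are illustrative, not the paper's code):

# Illustrative CapFilt loop; `captioner`, `filter_model`, and the threshold are
# hypothetical stand-ins for the fine-tuned decoder and the ITM-based filter.
def capfilt(web_pairs, captioner, filter_model, threshold=0.5):
    bootstrapped = []
    for image, web_text in web_pairs:
        synthetic_text = captioner.generate(image)             # synthetic caption
        for caption in (web_text, synthetic_text):
            # keep a caption only if the filter judges it consistent with the image
            if filter_model.match_probability(image, caption) >= threshold:
                bootstrapped.append((image, caption))
    return bootstrapped   # later combined with human-annotated pairs for pre-training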
Performance Focus
Aims for strong performance across both understanding (retrieval, VQA) and generation (captioning) tasks, rather than favoring one over the other.
Workflow Diagram
graph TD;
  Img[Raw web image-text pairs] --> Cap[Captioner generates synthetic captions];
  Img --> Filt[Filter removes noisy captions];
  Cap --> Filt;
  Filt --> Clean[Bootstrapped image-caption pairs];
  Clean --> MED[Multimodal Encoder-Decoder];
  MED --> Tasks[Retrieval · Captioning · VQA]
Working Code & Examples
The authors released a PyTorch implementation on GitHub, and pretrained checkpoints are available through Hugging Face Transformers:
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the pretrained captioning checkpoint and its paired processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

# Fetch the demo image and preprocess it into model inputs
url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
img = Image.open(requests.get(url, stream=True).raw).convert("RGB")
inputs = processor(images=img, return_tensors="pt")

# Generate and decode a caption
output = model.generate(**inputs)
print(processor.decode(output[0], skip_special_tokens=True))
Also available: BlipForImageTextRetrieval for image–text matching and retrieval tasks, with models published on Hugging Face and hosted inference on Replicate. A minimal matching example follows.
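A minimal matching sketch with Hugging Face Transformers, assuming the Salesforce/blip-itm-base-coco checkpoint, the itm_score output field, and an illustrative text query:

import requests, torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

# Assumed ITM checkpoint; the ITM head scores how well a text matches an image
processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
img = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=img, text="a woman and a dog on the beach", return_tensors="pt")
with torch.no_grad():
    itm_logits = model(**inputs).itm_score        # shape (1, 2): [no-match, match]
match_prob = torch.softmax(itm_logits, dim=1)[0, 1].item()
print(f"match probability: {match_prob:.3f}")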
Reflections
“BLIP showed me that clean data matters just as much as scale—in vision-language tasks, synthetic captions + filtering enable both retrieval and captioning strength.”
Takeaways:
- Unified models excel when architecture adapts to varied tasks
- Synthetic captioning cleans noisy data effectively
- Strong open-source implementation supports instant experimentation
References
- Official GitHub code & model weights: https://github.com/salesforce/BLIP
- BLIP paper: arXiv:2201.12086
- Hugging Face model cards: image-captioning and image–text matching (ITM) checkpoints
- Replicate: hosted BLIP API