MS‑Celeb‑1M: Large‑Scale Face Recognition Benchmark

Paper: MS‑Celeb‑1M: A Dataset and Benchmark for Large‑Scale Face Recognition
Authors: Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, Jianfeng Gao – ECCV 2016


Summary

  • Introduces a dataset of about 10 million images covering 100K identities, the largest public face recognition corpus at that time.
  • Defines face recognition not just as matching faces but as linking each face to a unique entity key (via Freebase), which supports disambiguation and structured retrieval.
  • Includes aligned face crops, a manually annotated test set, and a benchmarking protocol; the reported baseline reached roughly 44.2% coverage at 95% precision on the hard cases.
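A figure quoted "at 95% precision" is easiest to read as a coverage number: the recognizer may abstain on images it is unsure about, and the score is the share of test images it still answers while keeping precision at the target. Below is a minimal sketch of that computation, assuming each test image yields a confidence score and a correct/incorrect flag. The function name and the input layout are my own illustration, not the official evaluation code.

```python
def coverage_at_precision(predictions, target_precision=0.95):
    """Largest fraction of images that can be answered at the target precision.

    predictions: list of (confidence, is_correct) pairs, one per test image.
    Images are answered from most to least confident; everything below the
    chosen confidence threshold counts as "abstained".
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    total = len(ranked)
    best_coverage = 0.0
    correct = 0
    for i, (confidence, is_correct) in enumerate(ranked):
        correct += int(is_correct)
        answered = i + 1
        # A threshold cannot split images that share the same confidence,
        # so only evaluate once the whole tie group has been counted.
        if answered < total and ranked[i + 1][0] == confidence:
            continue
        if correct / answered >= target_precision:
            best_coverage = answered / total
    return best_coverage


# Toy example: the two most confident answers are right, the third is wrong.
sample = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
print(coverage_at_precision(sample, target_precision=0.95))  # 0.5
```

Lowering the target precision lets the recognizer answer more images, which is why coverage is always reported together with the precision it was measured at.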

Key Insights

Benchmark Design

Advances from verification to identity recognition: the task is to predict who the person is, not merely whether two images match.
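The difference between the two tasks can be made concrete with a toy sketch. It assumes faces have already been turned into embedding vectors and uses nearest-neighbour search over a small gallery. That search is only my illustration of "returning an entity key", not the method used in the paper, and the entity keys and vectors below are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify(emb_a, emb_b, threshold=0.5):
    """Verification: do two face embeddings belong to the same person?"""
    return cosine(emb_a, emb_b) >= threshold


def identify(query, gallery):
    """Identification: which known entity does this face belong to?

    gallery maps an entity key to one reference embedding.
    Returns (entity_key, similarity) for the best match.
    """
    best_key = max(gallery, key=lambda key: cosine(query, gallery[key]))
    return best_key, cosine(query, gallery[best_key])


# Hypothetical entity keys and 2-D embeddings, for illustration only.
gallery = {"m.hypothetical_a": [1.0, 0.0], "m.hypothetical_b": [0.0, 1.0]}
print(verify([1.0, 0.0], [0.9, 0.1]))   # yes/no answer about a pair
print(identify([0.9, 0.1], gallery))    # an entity key plus a score
```

Verification answers a yes/no question about a pair of images, while identification has to pick one key out of every known entity, which is what makes a 100K-identity benchmark so much harder.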

Large‑Scale Dataset

Provides millions of images for 100K celebrities, enabling deep model training at an unprecedented public scale.

Ethical Considerations

The dataset was later retracted amid privacy concerns. A cleaned 6M subset exists, but access to it has been restricted or removed.


Working Code & Tools

MSCELEB1M-GenImage Script

A community-made Python tool to decode Base64 image data from dev‑set TSV files:

# Extract from GitHub: wuyuebupt/MSCELEB1M-GenImage
python msceleb1m_genImage.py MsCelebV1-Faces-Aligned-DevSet1.tsv
  • Saves decoded .jpg images in an images/ directory.
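For reference, the decoding step itself is small. The sketch below is my own minimal version, not the code from the GenImage repository. It assumes each TSV row holds an entity identifier in the first column and the Base64-encoded image in the last column; those column positions are assumptions and should be checked against the actual file before use.

```python
import base64
import os


def decode_faces(tsv_path, out_dir="images", id_col=0, data_col=-1):
    """Decode Base64 image data from a TSV file into .jpg files.

    id_col and data_col are ASSUMED column positions (entity id first,
    image data last); adjust them to match the real file layout.
    Returns the list of file paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(tsv_path, "r", encoding="utf-8") as tsv:
        for row_number, line in enumerate(tsv):
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed rows
            # Keep the entity id usable as part of a file name.
            entity_id = fields[id_col].replace("/", "_")
            image_bytes = base64.b64decode(fields[data_col])
            out_path = os.path.join(out_dir, f"{entity_id}_{row_number}.jpg")
            with open(out_path, "wb") as image_file:
                image_file.write(image_bytes)
            written.append(out_path)
    return written


# Example call (file name taken from the command above):
# decode_faces("MsCelebV1-Faces-Aligned-DevSet1.tsv")
```

Reading the file line by line keeps memory use flat, which matters because the TSV files embed every image inline.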

Official Code

No official downloader or data loaders were released; Microsoft provided the aligned crops and TSV files, but no scripts.


Reflections

“MS‑Celeb‑1M shows ambition at industrial scale—both the technical leap and ethical implications of large web-scraped biometric datasets.”

  • The scale is powerful, but the labels are prone to noise and the data raises consent and privacy concerns.
  • This dataset informed both technical innovation and ethical discourse around face data.

Resources


This analysis has been added to my internship documentation on dataset scale, recognition benchmarks, and responsible AI.


