CelebA: Large-Scale Face Attribute Dataset & Its High‑Quality Variants
Dataset: CelebA (Large-scale CelebFaces Attributes)
Source: MMLab CUHK & PyTorch Vision
Released: 2015 for CelebA; CelebA‑HQ & downstream variants emerged later
Overview
- CelebA offers ~200K celebrity face images annotated with 40 binary attributes and identity labels, which makes it well suited to attribute recognition and generative modeling.
- CelebA-HQ refines this into a high-quality subset of 30K images at 1024×1024 resolution with precise face crops (introduced alongside Progressive GANs).
- CelebAMask-HQ adds rich semantic segmentation masks for detailed facial components (19 classes).
🛠 Working Code that Can be Used
1. PyTorch Loader
from torchvision.datasets import CelebA
dataset = CelebA(root="data/", split="train", target_type="attr", download=True)
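torchvision ships a built-in CelebA dataset class, so no third-party loader is needed (GitHub). Note that download=True pulls the files from Google Drive, which in recent torchvision releases may require the gdown package to be installed.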
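The loader reads its labels from CelebA's annotation files. As a minimal sketch of what that involves, assuming the usual layout of list_attr_celeba.txt (a count line, a header line of attribute names, then one row of -1/1 flags per image), the snippet below parses a small inline sample. The function name parse_celeba_attributes and the three-attribute sample are illustrative only and are not part of torchvision.

```python
from typing import Dict, List, Tuple


def parse_celeba_attributes(text: str) -> Tuple[List[str], Dict[str, Dict[str, bool]]]:
    """Parse attribute annotations into {filename: {attribute: bool}}.

    Assumed layout: line 1 is the image count, line 2 lists attribute
    names, and every following line is '<filename> <flag> <flag> ...'
    where each flag is -1 (absent) or 1 (present).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared_count = int(lines[0])
    names = lines[1].split()

    table: Dict[str, Dict[str, bool]] = {}
    for row in lines[2:]:
        parts = row.split()
        filename, flags = parts[0], parts[1:]
        if len(flags) != len(names):
            raise ValueError(f"{filename}: expected {len(names)} flags, got {len(flags)}")
        # Map the -1/1 encoding to booleans.
        table[filename] = {name: flag == "1" for name, flag in zip(names, flags)}

    if len(table) != declared_count:
        raise ValueError(f"header declares {declared_count} rows, found {len(table)}")
    return names, table


# Hypothetical three-attribute sample; the real file has 40 attribute columns.
SAMPLE = """2
Eyeglasses Male Smiling
000001.jpg -1 -1  1
000002.jpg  1  1 -1
"""

names, table = parse_celeba_attributes(SAMPLE)
smiling = [name for name, attrs in table.items() if attrs["Smiling"]]
```

Here smiling ends up holding only the filenames whose Smiling flag is 1, which is the kind of filter that is handy when building attribute-specific subsets.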
2. Downloader & HQ Converter
The make-CelebA-HQ script can reconstruct CelebA-HQ from the original dataset:
- Downloads CelebA & CelebA‑HQ archives
- Runs make_HQ_images.py to produce high-resolution .npy image files at 1024×1024 (GitHub).
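Once the .npy files exist, they still need to be read back into an image-shaped array. The sketch below is one way to do that with NumPy, under the assumption that each file holds a single image that may be stored channel-first and may carry a leading batch axis of size 1. The on-disk layout should be checked against the script's actual output; load_hq_image is a hypothetical helper, and the demo uses a tiny synthetic array rather than real data.

```python
import os
import tempfile

import numpy as np


def load_hq_image(path: str) -> np.ndarray:
    """Load one .npy image and return it as an (H, W, C) array."""
    arr = np.load(path)
    # Assumption: a file may carry a leading batch axis of size 1.
    if arr.ndim == 4 and arr.shape[0] == 1:
        arr = arr[0]
    if arr.ndim != 3:
        raise ValueError(f"expected a 3-D image, got shape {arr.shape}")
    # Assumption: channel-first arrays have 1 or 3 channels on axis 0.
    if arr.shape[0] in (1, 3) and arr.shape[-1] not in (1, 3):
        arr = np.transpose(arr, (1, 2, 0))
    return arr


# Demo on a small synthetic array standing in for a 1024x1024 image.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.npy")
    fake = np.arange(1 * 3 * 4 * 5, dtype=np.uint8).reshape(1, 3, 4, 5)
    np.save(demo_path, fake)
    image = load_hq_image(demo_path)
```

The returned height-width-channel array is the layout most image libraries expect, so it can be handed to a viewer or an image writer without further reshaping.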
Other Community Tools
- PyTorch loader with identities: includes the CelebA identity labels (identity_CelebA.txt) and a notebook for testing (GitHub).
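For readers who only need the identity labels, a full loader may be more than necessary. The following is a minimal sketch that groups image filenames by identity, assuming identity_CelebA.txt contains one "<filename> <identity id>" pair per line. The helper name group_by_identity and the sample rows are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_identity(text: str) -> Dict[int, List[str]]:
    """Group image filenames by identity id.

    Assumed layout: one '<filename> <identity id>' pair per line.
    """
    groups: Dict[int, List[str]] = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"malformed line: {line!r}")
        filename, identity = parts[0], int(parts[1])
        groups[identity].append(filename)
    return dict(groups)


# Hypothetical sample rows in the assumed format.
SAMPLE_IDS = """000001.jpg 2880
000002.jpg 2937
000003.jpg 2880
"""

groups = group_by_identity(SAMPLE_IDS)
# Identities with more than one image, e.g. for building verification pairs.
multi_image_ids = sorted(i for i, files in groups.items() if len(files) > 1)
```

Grouping by identity first makes it straightforward to create identity-disjoint train and test splits, which matters for any face verification experiment.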
Practical Notes
- Official MMLab page provides dataset info but no full code (mmlab.ie.cuhk.edu.hk).
- Third-party scripts exist but may require manual data placement and external downloads.
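- Community tools are often the most practical starting point, but they are unofficial and should be checked against the current dataset layout before use.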
Reflections
“CelebA is foundational — but preparing high-quality versions (HQ, mask, identity) makes it usable for advanced generation and evaluation tasks.”
- Using high-resolution data with segmentation enables precise inpainting and control networks.
- Attribute-rich annotations allow for strong evaluation on vision and face tasks.
- The PyTorch loader is simple and seamless for everyday use.
Resources
- CelebA Dataset (torchvision) (GitHub)
- make‑CelebA‑HQ GitHub (GitHub)
- CelebAMask‑HQ dataset page (mmlab.ie.cuhk.edu.hk)
- PyTorch loader with identities GitHub (GitHub)
This analysis documents dataset readiness and practical tools for experimentation on CelebA during my internship.
Summary of Findings
- Official dataset provides only data and metadata—no code samples.
- Working code exists in community tools:
  - PyTorch CelebA loader (built-in)
  - make-CelebA-HQ script for preparing the high-resolution dataset
  - Notebooks/helpers for identity attribute loading
- No fully official code for parsing or segmentation—community alternatives recommended.