CelebA: Large-Scale Face Attribute Dataset & Its High‑Quality Variants

Dataset: CelebA (Large-scale CelebFaces Attributes)
Source: MMLab CUHK & PyTorch Vision
Released: 2015 for CelebA; CelebA‑HQ & downstream variants emerged later

Overview

CelebA offers roughly 200K celebrity face images (202,599 images of 10,177 identities), each annotated with 40 binary attributes and an identity label, which makes it well suited to attribute recognition and generative modeling.
CelebA-HQ refines this into a high-quality subset of 30K images at 1024×1024 resolution with precise face crops; it was introduced with Progressive Growing of GANs.
CelebAMask-HQ adds semantic segmentation masks for detailed facial components (19 classes).

🛠 Working Code that Can be Used

1. PyTorch Loader

from torchvision.datasets import CelebA

# Each sample is (PIL image, 40-element attribute tensor); files are stored under data/celeba/.
dataset = CelebA(root="data/", split="train", target_type="attr", download=True)

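torchvision ships built-in CelebA support, so no third-party dataset code is needed (GitHub). Note that download=True fetches the files from Google Drive: torchvision documents the gdown package as a requirement for this, and the download can fail when the Drive quota is exceeded, in which case the files have to be placed manually.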
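When the annotation files are handled without torchvision, the attribute file can be parsed directly. The sketch below assumes the usual list_attr_celeba.txt layout (first line: image count, second line: attribute names, then one row per image with +1/-1 flags) and maps -1 to 0, which mirrors the 0/1 targets the torchvision loader returns. The function names and the three-attribute sample are illustrative only, not part of any official tooling.

```python
from typing import Dict, List, Tuple


def parse_celeba_attributes(text: str) -> Tuple[List[str], Dict[str, List[int]]]:
    """Parse the contents of list_attr_celeba.txt.

    Assumed layout: line 1 = image count, line 2 = attribute names,
    remaining lines = filename followed by one +1/-1 flag per attribute.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    expected_count = int(lines[0])
    attr_names = lines[1].split()

    labels: Dict[str, List[int]] = {}
    for row in lines[2:]:
        filename, *flags = row.split()
        if len(flags) != len(attr_names):
            raise ValueError(
                f"{filename}: expected {len(attr_names)} flags, got {len(flags)}"
            )
        # Map the file's -1/+1 encoding to 0/1.
        labels[filename] = [1 if int(flag) > 0 else 0 for flag in flags]

    if len(labels) != expected_count:
        raise ValueError(
            f"header announces {expected_count} images, found {len(labels)}"
        )
    return attr_names, labels


def active_attributes(attr_names: List[str], flags: List[int]) -> List[str]:
    """Return the names of the attributes that are set for one image."""
    return [name for name, flag in zip(attr_names, flags) if flag == 1]


# Illustrative sample, truncated to three attributes (the real file has 40).
SAMPLE = """2
5_o_Clock_Shadow Smiling Young
000001.jpg -1  1  1
000002.jpg -1 -1  1
"""

names, labels = parse_celeba_attributes(SAMPLE)
print(active_attributes(names, labels["000001.jpg"]))  # ['Smiling', 'Young']
```

For the real file, pass the result of `open("data/celeba/list_attr_celeba.txt").read()`; that path assumes the torchvision directory layout.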

2. Downloader & HQ Converter

The make-CelebA-HQ script can reconstruct CelebA-HQ from the original dataset:

- Downloads CelebA & CelebA‑HQ archives
- Runs make_HQ_images.py to produce high‑res .npy image files at 1024×1024 (GitHub).

Other Community Tools

PyTorch loader with identities: includes the CelebA identity labels (identity_CelebA.txt) and a notebook for testing (GitHub).
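As a rough illustration of what identity loading involves, the sketch below groups image filenames by identity. It assumes identity_CelebA.txt holds one "filename identity_id" pair per line; the helper name and the sample IDs are made up for the example and do not come from the community loader mentioned above.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_identity(text: str) -> Dict[int, List[str]]:
    """Map each identity ID to the list of image filenames showing that person."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        filename, identity = line.split()
        groups[int(identity)].append(filename)
    return dict(groups)


# Illustrative sample; the IDs are placeholders, not real annotations.
SAMPLE_IDENTITIES = """000001.jpg 1
000002.jpg 2
000003.jpg 1
"""

identity_groups = group_by_identity(SAMPLE_IDENTITIES)
print(identity_groups[1])  # ['000001.jpg', '000003.jpg']
```

Grouping this way makes it straightforward to build identity-disjoint splits or to sample same-person pairs for verification experiments.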

Practical Notes

Official MMLab page provides dataset info but no full code (mmlab.ie.cuhk.edu.hk).
Third-party scripts exist but may require manual data placement and external downloads.
Widely used community tools tend to be better exercised than one-off scripts, so prefer them where they cover the task.
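Manual data placement is easier to get right with a quick layout check before constructing the dataset. The sketch below assumes the layout the torchvision loader looks for (a celeba/ folder under the root containing the annotation text files and an extracted img_align_celeba/ directory); the helper name is hypothetical, and the check only tests for presence, not checksums.

```python
from pathlib import Path
from typing import List

# Annotation files the torchvision CelebA loader reads (assumed layout).
EXPECTED_FILES = (
    "list_attr_celeba.txt",
    "identity_CelebA.txt",
    "list_bbox_celeba.txt",
    "list_landmarks_align_celeba.txt",
    "list_eval_partition.txt",
)
IMAGE_DIR = "img_align_celeba"


def missing_celeba_items(root: str) -> List[str]:
    """Return the expected files/directories that are absent under <root>/celeba."""
    base = Path(root) / "celeba"
    missing = [name for name in EXPECTED_FILES if not (base / name).is_file()]
    if not (base / IMAGE_DIR).is_dir():
        missing.append(IMAGE_DIR + "/")
    return missing


# An empty list means the manually placed files are where the loader expects them.
print(missing_celeba_items("data/"))
```

If the list comes back empty, constructing the dataset with download=False should be worth trying; torchvision additionally verifies file checksums, which this sketch does not.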

Reflections

“CelebA is foundational — but preparing high-quality versions (HQ, mask, identity) makes it usable for advanced generation and evaluation tasks.”

Using high-resolution data with segmentation enables precise inpainting and control networks.
Attribute-rich annotations allow for strong evaluation on vision and face tasks.
The PyTorch loader is simple and seamless for everyday use.
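To make the evaluation point concrete, attribute predictions are commonly scored per attribute and then averaged. The following is a minimal sketch under that assumption; the function name and the toy labels are illustrative, and real experiments would feed in the 40-column 0/1 label matrix and model outputs.

```python
from typing import Dict, List, Sequence


def per_attribute_accuracy(
    attr_names: Sequence[str],
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
) -> Dict[str, float]:
    """Fraction of images for which each binary attribute was predicted correctly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    correct: List[int] = [0] * len(attr_names)
    for true_row, pred_row in zip(y_true, y_pred):
        for i, (t, p) in enumerate(zip(true_row, pred_row)):
            correct[i] += int(t == p)

    n_images = len(y_true)
    return {name: count / n_images for name, count in zip(attr_names, correct)}


# Toy example with two attributes and four images (values are made up).
toy_names = ["Smiling", "Young"]
toy_true = [[1, 1], [0, 1], [1, 0], [0, 0]]
toy_pred = [[1, 1], [1, 1], [1, 0], [0, 0]]

accuracy = per_attribute_accuracy(toy_names, toy_true, toy_pred)
mean_accuracy = sum(accuracy.values()) / len(accuracy)
print(accuracy)       # {'Smiling': 0.75, 'Young': 1.0}
print(mean_accuracy)  # 0.875
```

Reporting the per-attribute breakdown alongside the mean helps, because CelebA attributes are imbalanced and a high average can hide weak attributes.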

Resources

This analysis documents dataset readiness and practical tools for experimentation on CelebA during my internship.

Summary of Findings

The official dataset release provides only data and metadata, with no code samples.
Working code exists in community tools:
- torchvision CelebA loader (built-in)
- make-CelebA-HQ script for building the high-resolution CelebA-HQ images
- Notebooks/helpers for identity label loading
There is no fully official code for parsing or segmentation, so community alternatives are recommended.