CelebA: Large-Scale Face Attribute Dataset & Its High‑Quality Variants

Dataset: CelebA (Large-scale CelebFaces Attributes Dataset)
Source: MMLab CUHK & PyTorch Vision
Released: 2015 for CelebA; CelebA‑HQ & downstream variants emerged later


Overview

  • CelebA offers about 200K celebrity face images (202,599 images of 10,177 identities), each annotated with 40 binary attributes and 5 landmark locations, which suits attribute recognition and generative modeling.
  • CelebA-HQ is a high-quality subset of 30K images at 1024×1024 resolution with cleaner face crops; it was introduced alongside Progressive GANs rather than being a separate data collection.
  • CelebAMask-HQ adds semantic segmentation masks for detailed facial components (19 classes).
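To make the attribute annotations concrete, here is a minimal sketch of reading the attribute file by hand. It assumes the usual layout of list_attr_celeba.txt (image count on the first line, attribute names on the second, then one row per image with -1/1 values). The helper name parse_attr_file and the three-attribute sample are illustrative only; the real file has 40 columns.

```python
def parse_attr_file(text):
    """Parse the contents of list_attr_celeba.txt into (names, rows).

    rows maps each image filename to a {attribute_name: bool} dict.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    count = int(lines[0])        # line 1: number of images
    names = lines[1].split()     # line 2: attribute names
    rows = {}
    for ln in lines[2:]:
        parts = ln.split()
        filename, values = parts[0], parts[1:]
        if len(values) != len(names):
            raise ValueError(f"unexpected column count for {filename}")
        # The file stores -1 (absent) or 1 (present); convert to booleans.
        rows[filename] = {n: v == "1" for n, v in zip(names, values)}
    if len(rows) != count:
        raise ValueError("row count does not match the header")
    return names, rows


# Shortened, made-up sample in the same layout (3 attributes instead of 40).
SAMPLE = """2
5_o_Clock_Shadow Smiling Young
000001.jpg -1  1  1
000002.jpg -1 -1  1
"""

names, rows = parse_attr_file(SAMPLE)
print(rows["000001.jpg"])
```

With the real file, the text would come from open(path).read(), where the path depends on where the archive was extracted.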

Working Code

1. PyTorch Loader

from torchvision.datasets import CelebA
dataset = CelebA(root="data/", split="train", target_type="attr", download=True)

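torchvision ships a built-in CelebA dataset class, so no custom loader code is needed. Note that download=True fetches the files from Google Drive, which can fail when the download quota is exceeded; in that case, place the files under data/celeba/ manually and pass download=False.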
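As a rough sketch of how the loader might be used in practice, the snippet below wraps the dataset in a DataLoader and decodes a 0/1 attribute vector back into attribute names. The function names build_loader and active_attributes, the crop and resize sizes, and the batch size are my own choices, not part of torchvision; the torch imports sit inside build_loader so the decoding helper can be tried without torch installed.

```python
def active_attributes(target, attr_names):
    """Return the names of attributes set to 1 in a 0/1 attribute vector."""
    # Accept either a tensor (has .tolist()) or a plain sequence.
    values = target.tolist() if hasattr(target, "tolist") else list(target)
    return [name for name, v in zip(attr_names, values) if name and int(v) == 1]


def build_loader(root="data/", batch_size=64):
    """Build a training DataLoader over already-downloaded CelebA files."""
    # Imported lazily so active_attributes stays usable without torch.
    from torch.utils.data import DataLoader
    from torchvision import transforms
    from torchvision.datasets import CelebA

    # Aligned CelebA images are 178x218; crop to a square, then downsize.
    tfm = transforms.Compose([
        transforms.CenterCrop(178),
        transforms.Resize(128),
        transforms.ToTensor(),
    ])
    dataset = CelebA(root=root, split="train", target_type="attr",
                     transform=tfm, download=False)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    return dataset, loader


# Decoding works on any 0/1 sequence; the names here are placeholders.
print(active_attributes([0, 1, 1], ["Bald", "Smiling", "Young"]))
```

With the data in place, something like dataset, loader = build_loader() followed by active_attributes(dataset[0][1], dataset.attr_names) should list the attributes of the first training image.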

2. Downloader & HQ Converter

The make-CelebA-HQ script can reconstruct CelebA-HQ from the original dataset:

  • Downloads CelebA & CelebA‑HQ archives
  • Runs make_HQ_images.py to produce high-resolution 1024×1024 images stored as .npy files.
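Once the .npy files exist, they usually need converting into a standard image layout before viewing or saving. The sketch below is a guess at a safe conversion: I have not confirmed the exact array layout the script writes, so the helper (to_hwc_uint8, a name of my own) accepts either channel-first or channel-last arrays, with an optional leading batch axis of size 1, and assumes pixel values already lie in the 0 to 255 range.

```python
import numpy as np


def to_hwc_uint8(arr):
    """Convert an image array to height x width x channels, dtype uint8.

    Assumption: values are already in the 0..255 range.
    """
    arr = np.asarray(arr)
    # Drop a leading batch axis of size 1, if present.
    if arr.ndim == 4 and arr.shape[0] == 1:
        arr = arr[0]
    if arr.ndim != 3:
        raise ValueError(f"expected a 3-D image array, got shape {arr.shape}")
    # Channel-first (C, H, W) -> channel-last (H, W, C).
    if arr.shape[0] in (1, 3) and arr.shape[-1] not in (1, 3):
        arr = np.transpose(arr, (1, 2, 0))
    return np.clip(arr, 0, 255).astype(np.uint8)


# Synthetic stand-in for a loaded file; real use would start from np.load(path).
demo = np.zeros((1, 3, 8, 8), dtype=np.float32)
print(to_hwc_uint8(demo).shape)
```

The resulting array can be passed to an image library of choice for saving as PNG or JPEG.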

Other Community Tools

  • PyTorch loader with identities: reads the CelebA identity labels (identity_CelebA.txt) and includes a notebook for testing.
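For reference, the identity file is simple enough to read without a dedicated loader. The sketch below assumes each line of identity_CelebA.txt holds an image filename followed by an integer identity ID; the function name group_by_identity and the sample rows are illustrative, not taken from the community loader.

```python
from collections import defaultdict


def group_by_identity(text):
    """Map each identity ID to the list of image filenames that carry it."""
    groups = defaultdict(list)
    for ln in text.splitlines():
        parts = ln.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        filename, identity = parts
        groups[int(identity)].append(filename)
    return dict(groups)


# Made-up rows in the assumed "filename identity" layout.
SAMPLE = """000001.jpg 2880
000002.jpg 2937
000003.jpg 2880
"""

groups = group_by_identity(SAMPLE)
print(groups)
```

Grouping by identity this way is handy for building identity-disjoint train/test splits or positive pairs for verification experiments.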

Practical Notes

  • Official MMLab page provides dataset info but no full code (mmlab.ie.cuhk.edu.hk).
  • Third-party scripts exist but may require manual data placement and external downloads.
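  • Of the available tools, the built-in torchvision loader is the best tested; smaller community scripts should be checked against the data before relying on them.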

Reflections

“CelebA is foundational — but preparing high-quality versions (HQ, mask, identity) makes it usable for advanced generation and evaluation tasks.”

  • Using high-resolution data with segmentation enables precise inpainting and control networks.
  • Attribute-rich annotations allow for strong evaluation on vision and face tasks.
  • The PyTorch loader is simple and seamless for everyday use.

This analysis documents dataset readiness and practical tools for experimentation on CelebA during my internship.


Summary of Findings

  • The official dataset provides only data and metadata, with no code samples.
  • Working code exists in community tools:
    • PyTorch CelebA loader (built into torchvision)
    • make-CelebA-HQ script for preparing the high-resolution dataset
    • Notebooks/helpers for loading identity labels
  • There is no official code for parsing or segmentation, so community alternatives are recommended.