IMDB Classification using BERT
Developed a text classification model leveraging BERT (Bidirectional Encoder Representations from Transformers).
BERT: An Overview of Architecture, Training, and Bias
Abstract
This paper explores the architecture, training process, and biases associated with BERT (Bidirectional Encoder Representations from Transformers), a transformative model in natural language processing (NLP). We delve into the components of BERT, its training methodologies, and the challenges posed by biases in its predictions.
Table of Contents
- Introduction
- BERT Architecture
- Training BERT
- Transfer Learning with BERT
- Biases in BERT
- Fine-Tuning BERT for Text Classification
- Experimental Setup
- Results
- Conclusion
- References
Introduction
BERT has revolutionized the field of NLP by providing a robust framework for understanding and generating human language. This paper aims to provide a comprehensive overview of BERT’s architecture, training process, and the challenges posed by biases in its predictions.
BERT Architecture
Components
The Transformer architecture on which BERT is based consists of two main components: an encoder, which processes input data, and a decoder, which generates output. Encoder-decoder models are suited to generative tasks like translation and summarization. Encoder-only models, like BERT, excel in tasks requiring understanding of the input, such as text classification and named entity recognition. Decoder-only models, like GPT, are used for text generation.
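As a brief illustration of the encoder-only design, the sketch below loads BERT's encoder and inspects the contextual embeddings it produces. It assumes the Hugging Face `transformers` library; the `bert-base-uncased` checkpoint is an illustrative choice.

```python
# Minimal sketch: load BERT's encoder (no task head) and inspect its contextual embeddings.
# Assumes the Hugging Face `transformers` library; `bert-base-uncased` is an illustrative choice.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # encoder only, no classification head

inputs = tokenizer("BERT is an encoder-only model.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One hidden vector per input token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```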
Tokenization
BERT uses subword tokenization, breaking words into smaller subwords or tokens. For example, “tokenization” is split into “token” and “##ization”. This approach ensures frequently used words are not split into smaller subwords and decomposes rarely used words into meaningful subwords. BERT uncased has a vocabulary of around 30,000 tokens, mapping each subword to a unique numerical ID.
| Model | Vocabulary Size | Tokenization Method |
|---|---|---|
| BERT | ~30,000 | WordPiece |
| GPT-2 | ~50,000 | Byte-Pair Encoding |
| GPT-3 | ~50,000 | Byte-Pair Encoding |
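To see the subword tokenization described above in action, the short sketch below tokenizes the word "tokenization" with BERT's WordPiece tokenizer, assuming the `bert-base-uncased` checkpoint.

```python
# Sketch of WordPiece subword tokenization, assuming the `bert-base-uncased` checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("tokenization")
print(tokens)                                    # typically ['token', '##ization']
print(tokenizer.convert_tokens_to_ids(tokens))   # each subword maps to a unique numerical ID
print(tokenizer.vocab_size)                      # around 30,000 tokens for BERT uncased
```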
Training BERT
Training Datasets
BERT was trained on the English Wikipedia (around 2.5 billion words) and BookCorpus (11,000 books with around 800 million words). The training process did not require labeled data, allowing BERT to be trained on raw text.
Training Tasks
BERT was trained using two key tasks: masked language modeling (predicting masked-out words in a sentence) and next sentence prediction (determining if one sentence logically follows another).
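The masked language modeling objective can be illustrated with the `fill-mask` pipeline, as sketched below; the checkpoint and example sentence are illustrative.

```python
# Sketch of masked language modeling: predict the masked-out word in a sentence.
# Assumes the `bert-base-uncased` checkpoint; the example sentence is illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("The goal of [MASK] learning is to generalize to unseen data.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```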
Compute Resources
Training BERT required significant compute resources: pre-training was carried out on Tensor Processing Units (TPUs), at an estimated cost ranging from tens of thousands to hundreds of thousands of dollars.
Transfer Learning with BERT
Components
Transfer learning with BERT involves two main components: pre-training and fine-tuning. Pre-training involves training a model on a large dataset, while fine-tuning adapts the pre-trained model to a specific task using labeled data.
Resources
Pre-training is resource-intensive, requiring large datasets and significant computational power. Fine-tuning, however, is less resource-demanding and faster.
Benefits
Transfer learning allows for better accuracy and efficiency. Starting with a pre-trained model generally yields better results than training a model from scratch for specific tasks like sentiment analysis.
Biases in BERT
Gender Bias in Predictions
BERT shows a strong gender bias in its predictions, associating certain professions with specific genders (e.g., nurses with females and doctors with males).
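One rough way to probe these associations is with the `fill-mask` pipeline, comparing the model's top predictions for a masked pronoun across professions. The sketch below assumes the `bert-base-uncased` checkpoint; exact scores vary by checkpoint.

```python
# Rough probe of gendered associations; assumes `bert-base-uncased`, scores vary by checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for sentence in ["[MASK] worked as a nurse.", "[MASK] worked as a doctor."]:
    top = fill_mask(sentence)[:3]
    print(sentence, [(p["token_str"], round(p["score"], 3)) for p in top])
```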
Implications of Bias
This bias can have significant implications, such as in job resume filtering, where AI systems might prefer resumes based on gender associations rather than qualifications.
Mitigation Strategies
To mitigate these biases, it’s crucial to have human oversight in AI systems to ensure fair and unbiased outcomes.
Fine-Tuning BERT for Text Classification
IMDB Dataset
The IMDB dataset is used for training, where movie reviews are classified as positive (1) or negative (0).
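The dataset can be loaded directly with the Hugging Face `datasets` library, as sketched below; carving a validation split out of the training data (used later by the Trainer) is an assumption about how the splits were prepared.

```python
# Load the IMDB dataset; label 0 = negative, 1 = positive.
from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb)  # train and test splits of 25,000 labeled reviews each, plus an unlabeled split

# Assumption: the validation split used later is carved out of the training data.
splits = imdb["train"].train_test_split(test_size=0.2, seed=42)
# splits["train"] / splits["test"] can then serve as the train / validation sets.
```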
Adding a Linear Classifier
To fine-tune BERT for text classification, a linear classifier layer is added on top of the final embedding of the [CLS] token. The [CLS] token is a special token added at the beginning of each input sequence and is used for classification tasks.
Dropout Layer
A dropout layer is often added before the linear classifier to reduce overfitting. Dropout randomly sets a fraction of input units to zero during training, which helps the model generalize better.
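The sketch below shows what such a classification head can look like when written by hand: the final [CLS] embedding is passed through a dropout layer and then a linear classifier. The class name and hyperparameters are illustrative; in practice, `AutoModelForSequenceClassification` adds a comparable head automatically.

```python
# Minimal hand-written sketch of a [CLS]-based classification head (illustrative hyperparameters).
import torch.nn as nn
from transformers import AutoModel

class BertBinaryClassifier(nn.Module):
    def __init__(self, checkpoint="bert-base-uncased", num_labels=2, dropout=0.1):
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)           # pre-trained encoder
        self.dropout = nn.Dropout(dropout)                          # helps reduce overfitting
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0]             # final embedding of [CLS]
        return self.classifier(self.dropout(cls_embedding))         # logits for the two classes
```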
Training Process
The model is fine-tuned using the labeled IMDB dataset. The text reviews are tokenized and fed into BERT, and the final embedding of the [CLS] token is passed through the linear classifier to predict the label (positive or negative). The model is trained to minimize the classification error by adjusting the weights of both the pre-trained BERT model and the newly added linear classifier.
Example code for fine-tuning BERT

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize_function(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

# Assuming imdb_encoded is the encoded dataset
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=imdb_encoded["train"],
    eval_dataset=imdb_encoded["validation"],
    tokenizer=tokenizer,
)

trainer.train()
```
Experimental Setup
Hugging Face Libraries
The Hugging Face libraries, including `transformers` and `datasets`, provide pre-trained models and tools for NLP tasks. These libraries allow for easy access to a wide range of datasets and facilitate the conversion of datasets to Pandas DataFrames for visualization and manipulation.
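The sketch below shows one way to move a dataset into Pandas for quick inspection, assuming the IMDB dataset loaded as `imdb`.

```python
# Convert a Hugging Face dataset to a Pandas DataFrame for inspection (assumes `imdb` is loaded).
from datasets import load_dataset

imdb = load_dataset("imdb")
imdb.set_format("pandas")
df = imdb["train"][:]                 # DataFrame with 'text' and 'label' columns
print(df["label"].value_counts())     # class balance of the training split
imdb.reset_format()
```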
Data Preparation
The IMDB dataset is prepared by removing HTML tags using regular expressions and balancing the dataset to ensure an approximately equal number of positive and negative reviews.
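A sketch of that preprocessing step is shown below: HTML tags are stripped with a regular expression applied through `Dataset.map`. The exact regular expression and the balancing procedure are assumptions; the original pipeline may differ.

```python
# Strip HTML tags from the reviews via Dataset.map (the regex is an assumed cleanup rule).
import re
from datasets import load_dataset

def strip_html(batch):
    batch["text"] = [re.sub(r"<[^>]+>", " ", text) for text in batch["text"]]
    return batch

imdb = load_dataset("imdb")
imdb_clean = imdb.map(strip_html, batched=True)
```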
Visualizing Data
A box plot is used to visualize the distribution of review lengths, providing insights into the range and average length of reviews in the dataset.
| Review Length | Count |
|---|---|
| Short (< 50) | 120 |
| Medium (50-150) | 350 |
| Long (> 150) | 50 |
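The box plot described above can be produced roughly as follows, assuming the cleaned dataset `imdb_clean` from the earlier sketch; measuring review length in words (rather than tokens) is an assumption here.

```python
# Box plot of review lengths (in words); assumes `imdb_clean` from the preprocessing sketch above.
import matplotlib.pyplot as plt

lengths = [len(text.split()) for text in imdb_clean["train"]["text"]]
plt.boxplot(lengths, vert=False)
plt.xlabel("Words per review")
plt.title("Distribution of review lengths (IMDB training split)")
plt.show()
```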
Results
Training and Evaluation
The fine-tuned BERT model achieved an accuracy of approximately 89.5% on the validation set after two epochs of training. The model demonstrated effective learning and generalization on the IMDB dataset.
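Validation accuracy can be reported by the Trainer through a `compute_metrics` callback; the sketch below is one plausible way to wire it up, not necessarily how the reported 89.5% was computed.

```python
# One plausible compute_metrics callback for the Trainer (an assumption about the original setup).
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)        # pick the higher-scoring class
    return {"accuracy": accuracy_score(labels, predictions)}

# Passed as Trainer(..., compute_metrics=compute_metrics); trainer.evaluate() then reports accuracy.
```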
Inference
The fine-tuned model can classify new movie reviews as positive or negative based on the learned patterns and context from the training data.
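A sketch of such inference is shown below; it assumes the fine-tuned model was saved with `trainer.save_model("./model")` after training, and the path and example review are illustrative.

```python
# Inference with the fine-tuned model; assumes it was saved with trainer.save_model("./model").
from transformers import pipeline

classifier = pipeline("text-classification", model="./model", tokenizer="./model")
print(classifier("A beautifully shot film with a gripping story."))
# e.g. [{'label': 'LABEL_1', 'score': 0.98}], where LABEL_1 corresponds to a positive review
```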
Conclusion
BERT’s architecture and training methodologies have significantly advanced the field of NLP. However, addressing biases in its predictions remains a critical challenge. Future research should focus on developing more equitable and unbiased AI systems.
References
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.