Text Classification

Text classification is a common task in Natural Language Processing (NLP) that assigns a piece of text to one of several predefined classes. This tutorial will guide you through building a simple text classification model using PyTorch.

Prerequisites

Before beginning, make sure you have a basic understanding of Python and have PyTorch and TorchText installed on your machine.
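If you still need to install them, a typical setup looks like the line below; note that TorchText releases are pinned to specific PyTorch versions, so check the compatibility table in the TorchText README if the two disagree.

pip install torch torchtext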

Dataset

We will use the AG News dataset, a collection of news articles from the web pertaining to four classes: World, Sports, Business, and Science/Technology.

TorchText's datasets module includes the AG News dataset. We can load its predefined training and testing splits:

from torchtext.datasets import AG_NEWS

# Each split is an iterator of (label, text) pairs.
train_iter, test_iter = AG_NEWS()
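Labels in AG News run from 1 to 4. You can peek at the first training example like this (the exact headline depends on the dataset version, and note that this consumes one item from the iterator):

label, text = next(iter(train_iter))
print(label, text[:60])
# e.g. 3  Wall St. Bears Claw Back Into the Black (Reuters) Reuters -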

Preprocessing

Next, we need to preprocess our text. This involves tokenizing the text into words and mapping each token to an integer index in a vocabulary. Because the model below uses nn.EmbeddingBag with offsets, sequences in a batch can simply be concatenated rather than padded to a uniform length. Thankfully, TorchText handles most of this for us.

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    # Stream tokenized texts so the vocabulary can be built in a single pass.
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])  # out-of-vocabulary tokens map to <unk>
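With the vocabulary in place, it is handy to define two small pipelines: one that turns raw text into token indices, and one that shifts the dataset's 1-based labels to the 0-based class indices that nn.CrossEntropyLoss expects. The names text_pipeline and label_pipeline below are our own convention, not part of TorchText:

# Raw string -> list of token indices; unknown tokens fall back to <unk>.
text_pipeline = lambda x: vocab(tokenizer(x))
# AG News labels are 1-4; shift them to 0-3 for CrossEntropyLoss.
label_pipeline = lambda x: int(x) - 1

text_pipeline("here is an example")  # e.g. [475, 21, 30, 5297]
label_pipeline("3")                  # 2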

Model Architecture

We will design a simple model with an embedding layer and a linear layer. Note that there is no explicit softmax layer: nn.CrossEntropyLoss, which we use below, expects raw logits and applies log-softmax internally.

import torch.nn as nn

class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        # EmbeddingBag averages the embeddings of each sequence in the batch,
        # using offsets instead of padding.
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)  # raw logits; softmax happens inside the loss
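To make the text/offsets convention concrete: all sequences in a batch are concatenated into one flat tensor, and offsets records where each sequence begins, letting EmbeddingBag average the embeddings per sequence. A minimal sketch of the expected input format:

import torch

# Two sequences of lengths 3 and 2, concatenated into a single 1-D tensor.
text = torch.tensor([4, 8, 15, 16, 23], dtype=torch.int64)
# offsets[i] marks the starting position of sequence i within `text`.
offsets = torch.tensor([0, 3], dtype=torch.int64)
# model(text, offsets) would return a tensor of shape (2, num_class) of logits.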

Training and Evaluation

Let's define our hyperparameters, instantiate our model, and define our loss function and optimizer:

import torch
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

VOCAB_SIZE = len(vocab)
EMBED_DIM = 32
NUM_CLASS = 4  # AG News has four classes: World, Sports, Business, Sci/Tech
model = TextClassificationModel(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=4.0)

We can define our training and evaluation loops:

from torch.utils.data import random_split

def train(dataloader):
    model.train()
    total_loss = 0
    for (label, text, offsets) in dataloader:
        optimizer.zero_grad()
        text, offsets, label = text.to(device), offsets.to(device), label.to(device)
        output = model(text, offsets)
        loss = criterion(output, label)
        loss.backward()
        # The learning rate is aggressive, so clip gradients to keep training stable.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(dataloader)

def evaluate(dataloader):
    model.eval()
    total_loss = 0
    with torch.no_grad():  # no gradients needed during evaluation
        for (label, text, offsets) in dataloader:
            text, offsets, label = text.to(device), offsets.to(device), label.to(device)
            output = model(text, offsets)
            loss = criterion(output, label)
            total_loss += loss.item()
    return total_loss / len(dataloader)
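One piece is still missing: the DataLoader objects the loops iterate over. Below is one way to wire them up, using the collate-function pattern that pairs naturally with EmbeddingBag. The name collate_batch is our own, and the BATCH_SIZE, EPOCHS, and 95/5 split values are illustrative choices, not requirements:

from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset

def collate_batch(batch):
    # Concatenate all token sequences in the batch and record their offsets.
    labels, texts, offsets = [], [], [0]
    for (label, text) in batch:
        labels.append(label_pipeline(label))
        processed = torch.tensor(text_pipeline(text), dtype=torch.int64)
        texts.append(processed)
        offsets.append(processed.size(0))
    labels = torch.tensor(labels, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    texts = torch.cat(texts)
    return labels, texts, offsets

BATCH_SIZE = 64
EPOCHS = 10

# Re-create the iterators in case building the vocabulary consumed them,
# then materialize them and carve a validation split off the training set.
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train, split_valid = random_split(
    train_dataset, [num_train, len(train_dataset) - num_train])

train_dataloader = DataLoader(split_train, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid, batch_size=BATCH_SIZE,
                              shuffle=False, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=False, collate_fn=collate_batch)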

Now we can train our model:

for epoch in range(1, EPOCHS + 1):
    train_loss = train(train_dataloader)
    valid_loss = evaluate(valid_dataloader)
    print(f'Epoch: {epoch}, Training Loss: {train_loss}, Validation Loss: {valid_loss}')

That's it! You have just trained a text classification model using PyTorch. Experiment with different hyperparameters and model architectures to improve the model's performance.
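As a quick sanity check, you can run the trained model on a single headline. The ag_news_label mapping and the predict helper below are our own conveniences, not part of TorchText:

ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

def predict(text):
    model.eval()
    with torch.no_grad():
        tokens = torch.tensor(text_pipeline(text), dtype=torch.int64).to(device)
        offsets = torch.tensor([0]).to(device)  # a single sequence starts at 0
        output = model(tokens, offsets)
        # Shift the 0-based prediction back to the dataset's 1-based labels.
        return ag_news_label[output.argmax(1).item() + 1]

print(predict("The champions won the final match in overtime."))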


This tutorial is a basic introduction to text classification using PyTorch. More advanced techniques could include using different types of layers like Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), or using pre-trained models for transfer learning. I hope this tutorial helps you in your journey with PyTorch and NLP. Happy Learning!