AG News Corpus — Dataset for Topic Classification

Overview

The AG News Corpus is a large English-language collection of news articles widely used in Natural Language Processing for supervised topic classification. Each example pairs a short text with one of four topic labels. The dataset is a common benchmark for classification models and for experiments on tokenization, embeddings, and model architectures.

Core data structure

Each unit of observation is a news article, represented by its title and description. The headline and the short descriptive text together form the input, and a numeric category is the target. In short: input = title + description; output = numeric category.
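A minimal sketch of this input pairing, assuming the two fields are joined with a single space. The separator and the helper name build_input are illustrative choices, not something the dataset prescribes.

```python
def build_input(title: str, description: str, sep: str = " ") -> str:
    """Join headline and description into one model input string."""
    # Strip stray whitespace and drop empty fields so the separator is never doubled.
    parts = [part.strip() for part in (title, description) if part and part.strip()]
    return sep.join(parts)


# Invented example text, not taken from the corpus.
example = build_input("Stocks rally ", " Markets closed higher on Friday.")
print(example)
```

Some pipelines instead keep the two fields separate or insert a model-specific separator token; the sep parameter leaves that choice open.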

The corpus is distributed as plain-text and structured files, including:

texts.txt (full text of articles)
score.txt (labels for full corpus)
cross-validation files
train.csv and test.csv
JSON/JSONL exports
Text + label files for the full corpus

The commonly distributed classification subset has three fields: Title (headline text), Description (short summary), and Label/Category (numeric label corresponding to one of four topics).
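The three fields can be read with a standard CSV parser. The sketch below assumes a headerless file whose columns are ordered label, title, description. That layout is an assumption here, since the field order and the presence of a header row are not stated above, so check the actual file first. The Article type and read_articles helper are hypothetical names, and the sample rows are invented.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Article(NamedTuple):
    label: int
    title: str
    description: str


def read_articles(handle: TextIO) -> Iterator[Article]:
    """Yield one Article per CSV row (assumed column order: label, title, description)."""
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # skip malformed rows rather than guess at their layout
        label, title, description = row
        yield Article(int(label), title, description)


# Invented sample rows standing in for train.csv.
sample = io.StringIO(
    '"3","Example headline","Example description, with a comma."\n'
    '"2","Another headline","Another description."\n'
)
articles = list(read_articles(sample))
print(articles[0].title)
```

For a real file, pass open("train.csv", newline="", encoding="utf-8") instead of the in-memory sample. If the file does have a header row, skip it before parsing, because int(label) would otherwise fail on the header text.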

Labels and label schema

The task uses a four-class label schema. Classes are encoded as numeric labels and are balanced. The label names are:

World
Sports
Business
Sci/Tech

In summary, the schema has four numeric labels, one per topic, and the classes are balanced.
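A small sketch of mapping numeric labels to topic names. The numeric values themselves are not stated above, so the code assumes labels start at 1 and follow the order in which the names are listed. Some distributions may number from 0 instead, which the offset parameter accommodates. Both function names are hypothetical.

```python
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def label_to_name(label: int, offset: int = 1) -> str:
    """Map a numeric label to its topic name; offset is the lowest label id (assumed 1)."""
    index = label - offset
    if not 0 <= index < len(LABEL_NAMES):
        raise ValueError(f"label {label} is outside the four-class schema")
    return LABEL_NAMES[index]


def to_zero_based(label: int, offset: int = 1) -> int:
    """Shift a label so that classes run 0..3, as many training libraries expect."""
    index = label - offset
    if not 0 <= index < len(LABEL_NAMES):
        raise ValueError(f"label {label} is outside the four-class schema")
    return index


print(label_to_name(1))   # World, under the 1-based assumption
print(to_zero_based(4))   # 3
```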

Scale, splits, and coverage

Two different size figures are reported for the corpus:

Full corpus: ≈496,835 articles.
Classification subset: 127,600 samples in total (120,000 train, 7,600 test).

Reported dataset splits and per-class counts for the classification subset:

Train: 120,000 (30,000 per class)
Test: 7,600 (1,900 per class)
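The per-class counts and the split totals are consistent with each other, as the following arithmetic check shows. It uses only the figures reported above; the variable names are illustrative.

```python
NUM_CLASSES = 4
PER_CLASS = {"train": 30_000, "test": 1_900}

# Each split size is the per-class count times the number of classes.
split_sizes = {split: NUM_CLASSES * count for split, count in PER_CLASS.items()}
total = sum(split_sizes.values())
test_fraction = split_sizes["test"] / total

print(split_sizes)               # {'train': 120000, 'test': 7600}
print(total)                     # 127600
print(round(test_fraction, 4))   # 0.0596
```

The test split is therefore roughly 6% of the classification subset.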

Coverage notes:

Languages: English
Topics/domains covered: World, Sports, Business, Sci/Tech
Class balance: the dataset is described as balanced across classes.

Collection and provenance

Content originates primarily from the ComeToMyHead news search engine and is gathered from more than 2,000 news sources. The collection method is web crawling. No further provenance metadata (such as exact time ranges or geographic coverage) is provided in the available descriptions.

Annotation and labeling process

The provided metadata does not include details about annotators, annotation instructions, labeling workflow, or quality control procedures. The labels appear to be categorical topic assignments, but the available information records no annotation provenance or inter-annotator agreement statistics.

Recommended uses and evaluation

The collection is intended for training and evaluating supervised topic classification systems and for benchmarking text classification performance. Typical use cases include experiments on tokenization, embeddings, model architectures, and supervised fine-tuning for topic prediction.

Common evaluation metrics reported or suggested are accuracy, error rate, precision, recall, and F1 score. These metrics are suitable for measuring per-example classification performance on the balanced four-class task.
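The metrics listed above can be computed directly from paired lists of true and predicted labels. The sketch below is one possible pure-Python implementation, not a prescribed evaluation script. Macro averaging over classes is an assumption made here, since the averaging mode is not specified, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def classification_report(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy, error rate, and macro-averaged precision, recall, and F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num: float, den: float) -> float:
        # Treat an undefined ratio (zero denominator) as 0.0.
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = safe_div(tp[label], tp[label] + fp[label])
        r = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))

    accuracy = sum(tp.values()) / len(y_true)
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "macro_precision": sum(precisions) / len(labels),
        "macro_recall": sum(recalls) / len(labels),
        "macro_f1": sum(f1s) / len(labels),
    }


# Invented toy predictions: one Sci/Tech item is misclassified as Business.
report = classification_report([1, 2, 3, 4], [1, 2, 3, 3])
print(report)
```

When every class has the same number of true examples, as in the balanced test split, macro-averaged recall equals accuracy, which is why accuracy alone is often reported for this task.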

Access, licensing, and permitted use

Licensing descriptions indicate non-commercial research and educational use, with licensing details varying by distribution. Academic use is generally permitted according to the provided summary. Specific license terms depend on the distribution channel and should be consulted before use in production or commercial settings.

Limitations and open questions

No explicit biases, coverage gaps, data quality issues, or common failure modes are documented in the available metadata. Similarly, no annotation quality or provenance information is provided. As a result, the following limitations and open questions remain unaddressed in the available descriptions:

No recorded annotation process or inter-annotator agreement information.
No detailed provenance regarding time ranges or geographic distribution of source articles.
No explicit notes on data cleaning, deduplication, or decontamination procedures.
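Because deduplication and decontamination are undocumented, users may want to check for overlap between splits themselves. The sketch below counts test texts that also appear in the training texts after light normalization. The normalization rule (lowercasing and collapsing whitespace) and both function names are assumptions for illustration. The check detects only exact matches after normalization, not near-duplicates.

```python
import hashlib
from typing import Iterable


def fingerprint(text: str) -> str:
    """Hash a text after lowercasing and collapsing runs of whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def count_overlap(train_texts: Iterable[str], test_texts: Iterable[str]) -> int:
    """Count test texts whose fingerprint also occurs among the training texts."""
    seen = {fingerprint(text) for text in train_texts}
    return sum(1 for text in test_texts if fingerprint(text) in seen)


# Invented toy texts, not taken from the corpus.
overlap = count_overlap(["Hello  World", "foo"], ["hello world", "bar"])
print(overlap)  # 1
```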

Practical notes for fine-tuning

When using the AG News Corpus for supervised model training or fine-tuning, treat each example as a short text classification instance where the model input combines Title and Description and the model predicts the numeric Label/Category. Standard classification metrics such as accuracy and F1 score are appropriate for benchmarking on the balanced four-class task.

Sources

AG News Corpus Dataset General Info