20 Newsgroups
Overview
20 Newsgroups is a classic text corpus used as a benchmark for multiclass text classification and clustering in natural language processing. It comprises Usenet posts collected from 20 distinct newsgroups. It is commonly distributed as archives, notably 20_newsgroups.tar.gz (~16-17 MB) and mini_newsgroups.tar.gz, and is also accessible through standard loaders such as scikit-learn's fetch_20newsgroups. The collection was assembled by Ken Lang for the Newsweeder project and is distributed through the UCI Machine Learning Repository; it is widely used for experiments in topic modeling, clustering, and classification.
Key facts
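- Roughly 18,846 to 20,000 documents, depending on the version: the original collection holds about 20,000 posts, while the deduplicated "bydate" version used by scikit-learn's loader has 18,846.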
- 20 category labels (newsgroups).
- Modality: text.
- Primary tasks: multiclass text classification, text clustering, topic modeling.
Data specification and formats
The unit of observation is an individual document (message). Data are provided as raw text posts that typically include a header, a body, and an optional footer. Standard loader outputs contain:
- data: a list of raw text strings (document text);
- target: numeric labels corresponding to newsgroup indices;
- target_names: category names for each numeric label;
- filenames: original file paths (optional).
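The fields listed above can be inspected with a few lines of Python. This is a minimal sketch, not a reference implementation: `load_split` and `describe` are hypothetical helper names introduced here, and the `toy` object is a hand-made stand-in that only mimics the attribute names so the example runs without downloading anything. The one real library call is scikit-learn's `fetch_20newsgroups(subset=...)`.

```python
from collections import Counter
from types import SimpleNamespace


def load_split(subset="train"):
    """Fetch one split with scikit-learn (downloads the corpus on first use)."""
    from sklearn.datasets import fetch_20newsgroups  # imported lazily on purpose
    return fetch_20newsgroups(subset=subset)


def describe(bunch):
    """Summarise data / target / target_names for any object exposing them."""
    names = list(bunch.target_names)
    # target holds integer indices into target_names.
    per_class = Counter(names[int(t)] for t in bunch.target)
    return {
        "n_documents": len(bunch.data),
        "n_classes": len(names),
        "per_class": dict(per_class),
    }


# Hypothetical stand-in with the same attribute names as the loader output.
toy = SimpleNamespace(
    data=["Shuttle launch delayed again.", "Which GPU for ray tracing?", "Orbit question."],
    target=[1, 0, 1],
    target_names=["comp.graphics", "sci.space"],
)
summary = describe(toy)
print(summary)
```

With network access, `describe(load_split("train"))` should report the same keys for the real training split.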
Archive and auxiliary file formats include 20_newsgroups.tar.gz, mini_newsgroups.tar.gz, and cross-validation split artifacts such as texts.txt, score.txt, and split_k.pkl. Loader outputs are commonly consumed by downstream feature extractors (for example, TF-IDF) to produce vectorized inputs for models.
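As an illustration of the TF-IDF step mentioned above, the following sketch vectorizes three made-up sentences with scikit-learn's `TfidfVectorizer`. The sentences and the parameter choices (`lowercase`, `stop_words="english"`) are assumptions made for the example, not settings prescribed by the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the raw strings a loader would return in `data`.
docs = [
    "The shuttle reached orbit after a clean launch.",
    "My graphics card renders polygons slowly.",
    "Hockey playoffs start next week.",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)  # sparse matrix with one row per document

print(X.shape)
print(sorted(vectorizer.vocabulary_)[:5])  # a few of the learned terms
```

The resulting sparse matrix `X` is the kind of vectorized input that a classifier or clustering model would consume.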
Each record pairs a primary text field with a numeric label in "target". The text field is described in the source as "document_text" (scikit-learn's loader exposes it as data) and contains the header, body, and optional footer. The label schema maps each numeric label to one of 20 category names. Examples include: alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, rec.autos, rec.sport.hockey, sci.crypt, sci.space, soc.religion.christian, talk.politics.mideast, ...
Scale, coverage, and temporal context
Reported size is roughly 18,846 to 20,000 documents, depending on the version. Topic coverage is limited to the 20 included newsgroups, whose names indicate computing, recreation, science, religion, and politics. The dataset originates from Usenet activity in the 1990s.
No explicit language, geographic, or finer topical coverage metadata is provided in the available descriptions.
Collection and preprocessing
Source material consists of posts from 20 Usenet newsgroups. Collection entailed gathering Usenet posts for each newsgroup; labels are derived directly from the newsgroup to which a post was posted.
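Because the label is just the newsgroup a post came from, it can be recovered from the archive layout. The sketch below assumes the archive unpacks to `<root>/<newsgroup>/<message file>`; that layout, the `latin-1` decoding, and the helper names `iter_labeled_posts` and `_build_demo_archive` are assumptions for illustration. It runs against a tiny archive built on the fly rather than the real download.

```python
import io
import os
import tarfile
import tempfile
from pathlib import PurePosixPath


def iter_labeled_posts(archive_path):
    """Yield (newsgroup, text) pairs; the label is the parent directory name."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            parts = PurePosixPath(member.name).parts
            if len(parts) < 2:
                continue  # no enclosing directory, so no label to derive
            raw = tar.extractfile(member).read()
            # latin-1 accepts any byte sequence; the true encoding is an assumption.
            yield parts[-2], raw.decode("latin-1")


def _build_demo_archive(path):
    """Write a two-post archive that mimics the assumed directory layout."""
    posts = {
        "demo/sci.space/1": b"Subject: orbit\n\nHow high is LEO?\n",
        "demo/rec.autos/2": b"Subject: brakes\n\nPads or rotors first?\n",
    }
    with tarfile.open(path, "w:gz") as tar:
        for name, payload in posts.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tar.gz")
    _build_demo_archive(demo_path)
    labels = sorted(label for label, _ in iter_labeled_posts(demo_path))

print(labels)
```

Pointing `iter_labeled_posts` at a local copy of 20_newsgroups.tar.gz would be the natural next step, provided the archive follows the assumed layout.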
Common preprocessing choices applied by practitioners include removal of headers, footers, and quoted text to focus on the substantive content of posts. Archive formats and loader outputs facilitate reproducible splits and downstream processing.
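A rough sketch of this kind of cleanup is shown below. scikit-learn's loader offers a `remove=("headers", "footers", "quotes")` argument for the same purpose; the function here is not that implementation but a simplified heuristic, and `strip_post` is a hypothetical name. It assumes headers end at the first blank line, quoted lines start with ">", and a signature follows a line containing only "--".

```python
def strip_post(raw, headers=True, quotes=True, footer=True):
    """Heuristically drop headers, quoted lines, and a trailing signature."""
    text = raw
    if headers:
        # Assumption: the header block ends at the first blank line.
        _, sep, body = text.partition("\n\n")
        text = body if sep else text
    lines = text.split("\n")
    if footer:
        # Assumption: the last line that is just "--" starts the signature.
        for i in range(len(lines) - 1, -1, -1):
            if lines[i].rstrip() == "--":
                lines = lines[:i]
                break
    if quotes:
        # Assumption: quoted material is marked with a leading ">".
        lines = [ln for ln in lines if not ln.lstrip().startswith(">")]
    return "\n".join(lines).strip()


# Made-up message in the general shape of a Usenet post.
sample = (
    "From: alice@example.com\n"
    "Subject: Re: orbit\n"
    "\n"
    "> How high is LEO?\n"
    "Roughly 400 km for the station.\n"
    "-- \n"
    "Alice\n"
)
cleaned = strip_post(sample)
print(cleaned)
```

Real posts are messier than this sample, so heuristics like these will both miss some metadata and occasionally remove substantive lines.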
Annotation and labeling
Each document carries a single label indicating its newsgroup category; the label is simply the newsgroup to which the post was made. Attribution for the assembled labels goes to Ken Lang (dataset assembler) and the dataset authors. No annotation guidance exists beyond this rule.
No specific quality-control procedures, inter-annotator agreement metrics, or known annotation issues are documented in the available metadata.
Recommended uses and evaluation
Intended use cases include benchmarking multiclass classification, clustering, and topic modeling in NLP research and algorithm comparison. Standard practice is to use loader-provided splits or create custom train/test splits, then apply a common feature extraction pipeline (e.g., tokenization + TF-IDF) before model training. No specific evaluation protocol, metrics, or baseline results are specified in the provided metadata; users typically evaluate classifiers with standard multiclass metrics such as accuracy and macro-averaged F1.
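The workflow described above (split, TF-IDF, train, score) might look like the following sketch. It uses a dozen made-up sentences with two of the dataset's category names instead of the real corpus, and Multinomial Naive Bayes is an arbitrary baseline chosen for illustration, not one the source prescribes. Scores on such a toy set say nothing about performance on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-in corpus; real use would pass the loader's data and target.
texts = [
    "The shuttle docked with the station in low orbit.",
    "NASA delayed the rocket launch because of weather.",
    "The probe sent back images of the moon's surface.",
    "Astronauts repaired the telescope during a spacewalk.",
    "The satellite reached its planned orbit yesterday.",
    "A new rocket engine passed its first launch test.",
    "The goalie made forty saves in the playoff game.",
    "Our team traded a defenseman before the deadline.",
    "The rookie scored a hat trick on home ice.",
    "Overtime ended with a power play goal.",
    "The coach changed the lines after the second period.",
    "Playoff tickets for the hockey final sold out.",
]
labels = ["sci.space"] * 6 + ["rec.sport.hockey"] * 6

# Custom stratified split, so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# TF-IDF features feeding a simple baseline classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```

Macro averaging weights every class equally, which is why it is often reported alongside accuracy when there are many categories.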
Access and license
The dataset is openly available under the CC BY 4.0 license and can be obtained from the UCI Machine Learning Repository and mirrors.
Limitations and open questions
The available metadata report no known biases, coverage gaps, data quality issues, or common failure modes, and specify no formal caveats. Because the dataset reflects Usenet activity from the 1990s, its topics and language may be tied to that era.
Sources
20 Newsgroups Dataset General Info