The Pile — an 825.18 GiB multidomain English text corpus
Overview
The Pile is an English-centric, large-scale text corpus designed for training and benchmarking large language models. It combines large-scale web scrapes with a variety of smaller, higher-quality corpora to improve cross-domain knowledge and downstream generalization. The dataset is presented as an interleaved mixture of 22 constituent sources and emphasizes reproducible preprocessing and documented component choices. The stated raw size of the collection is 825.18 GiB, with an Effective Size reported as 1254.20 GiB and a mean document size of 5.91 KiB across the full corpus.
The Pile aims to address limitations of single-source training sets by increasing diversity, reducing redundancy within components, and including targeted high-quality domains such as academic and programming texts. Key methodological choices include extraction of Common Crawl records via jusText, English filtering via pycld2, document-level deduplication in some components, and a 13-gram overlap decontamination applied to evaluation sets.
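The 13-gram overlap decontamination can be sketched as follows. The whitespace tokenization and function names here are illustrative assumptions for exposition, not the released implementation:

```python
# Sketch of 13-gram overlap decontamination: an evaluation document is
# flagged if it shares any 13-gram with the training data. Whitespace
# tokenization is a simplifying assumption for illustration.
N = 13

def ngrams(text, n=N):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_ngram_index(train_docs):
    index = set()
    for doc in train_docs:
        index |= ngrams(doc)
    return index

def is_contaminated(eval_doc, train_index):
    # Any single shared 13-gram is enough to flag the document.
    return not ngrams(eval_doc).isdisjoint(train_index)
```

In practice the index over a web-scale training set must be approximated (e.g., with hashing), but the flagging criterion is the same.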
Representative constituent datasets
- Pile-CC
- PubMed Central
- Books3
- OpenWebText2
- GitHub
Composition, scale, and splits
The corpus is a mixture of 22 datasets spanning academic articles, books, web text, code repositories, legal opinions, forums, subtitles, and email archives. Component-level size accounting for the major constituents, preserved exactly as reported:
- Pile-CC: Raw Size 227.12 GiB; Weight 18.11%; Epochs 1.0; Effective Size 227.12 GiB; Mean Document Size 4.33 KiB
- PubMed Central: Raw Size 90.27 GiB; Weight 14.40%; Epochs 2.0; Effective Size 180.55 GiB; Mean Document Size 30.55 KiB
- Books3: Raw Size 100.96 GiB; Weight 12.07%; Epochs 1.5; Effective Size 151.44 GiB; Mean Document Size 538.36 KiB
- OpenWebText2: Raw Size 62.77 GiB; Weight 10.01%; Epochs 2.0; Effective Size 125.54 GiB; Mean Document Size 3.85 KiB
- ArXiv: Raw Size 56.21 GiB; Weight 8.96%; Epochs 2.0; Effective Size 112.42 GiB; Mean Document Size 46.61 KiB
- GitHub: Raw Size 95.16 GiB; Weight 7.59%; Epochs 1.0; Effective Size 95.16 GiB; Mean Document Size 5.25 KiB
The remaining components (FreeLaw, Stack Exchange, USPTO Backgrounds, PubMed Abstracts, Gutenberg (PG-19), OpenSubtitles, Wikipedia (en), DM Mathematics, Ubuntu IRC, BookCorpus2, EuroParl, HackerNews, YoutubeSubtitles, PhilPapers, NIH ExPorter, Enron Emails) are reported with the same per-component fields: Raw Size, Weight, Epochs, Effective Size, and Mean Document Size.
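The reported Effective Size figures follow directly from Raw Size × Epochs, which can be spot-checked against the major-constituent accounting (allowing for the 0.01 GiB rounding of the published figures):

```python
# Spot-check: Effective Size = Raw Size (GiB) x Epochs, using the
# component accounting reported for The Pile's major constituents.
components = {
    # name: (raw_gib, epochs, reported_effective_gib)
    "Pile-CC":        (227.12, 1.0, 227.12),
    "PubMed Central": (90.27,  2.0, 180.55),
    "Books3":         (100.96, 1.5, 151.44),
    "OpenWebText2":   (62.77,  2.0, 125.54),
    "ArXiv":          (56.21,  2.0, 112.42),
    "GitHub":         (95.16,  1.0, 95.16),
}
for name, (raw, epochs, reported) in components.items():
    # Published values are rounded to 0.01 GiB, so allow that tolerance.
    assert abs(raw * epochs - reported) <= 0.011, name
```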
Splitting and held-out data: the validation and test splits are each reported as 0.1% of the data; explicit statements indicate 2 GiB for the validation and test splits, an 8 GiB reserve, and 10 GiB held out from the Pile for other uses. For some experiments, datasets are downsampled to approximately 40 GB to enable controlled, size-matched comparisons.
Data specification and formats
Unit of observation ranges from full documents and books to sentences and sub-document text segments depending on component. Typical data formats and inputs include WARC files and raw HTTP responses (for Common Crawl / Pile-CC), WET files, JATS XML (PubMed Central), Markdown outputs (PubMed Central and converted arXiv TEX), EPUB input (BookCorpus), XML subtitle formats (OpenSubtitles), plaintext extractions from database dumps (Stack Exchange), and extracted email bodies (Enron). Document-level fields preserved across components include textual content, structural cues (e.g., chapter breaks, titles), and component-specific metadata such as inventors/assignees for USPTO Backgrounds or comment tree structure for Hacker News.
Conversion and normalization tools explicitly used include jusText for web extraction, pandoc for converting JATS XML and TeX to Markdown, the Newspaper scraper for OpenWebText2, a modified epub-to-text converter for books (with improved table and code rendering), and ftfy.fix_text() for Unicode normalization (e.g., replacing Unicode apostrophes with ASCII ones and expanding the Unicode ellipsis character to '...').
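In spirit, the normalization step resembles the stdlib-only sketch below; the actual pipeline uses ftfy.fix_text(), which is far more comprehensive (mojibake repair, encoding fixes, etc.), and this mapping is illustrative only:

```python
# Minimal stand-in for the normalization described above: map Unicode
# apostrophes to ASCII and expand the ellipsis character. The actual
# pipeline uses ftfy.fix_text(), which handles many more cases.
REPLACEMENTS = {
    "\u2019": "'",    # right single quotation mark -> ASCII apostrophe
    "\u2018": "'",    # left single quotation mark -> ASCII apostrophe
    "\u2026": "...",  # horizontal ellipsis -> three dots
}

def normalize(text: str) -> str:
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    return text

print(normalize("It\u2019s done\u2026"))  # It's done...
```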
Collection, inclusion/exclusion, and preprocessing
Collection combined new crawls, bulk downloads, and reprocessed public corpora. Notable collection rules and limits: OpenWebText2 URLs were deduplicated and URLs with aggregate score < 3 were removed; Pile-CC processed 22 random chunks from 2013–2020 WARC data; GitHub repositories were restricted to repos with >100 stars, <1GB of files, a 300s time limit for cloning and extraction, and a 100kB cap on extracted files; Hacker News entries required at least one comment and exclusion of items flagged for conduct violations. Several datasets were excluded for policy or practical reasons (e.g., US Congressional Record excluded due to racist content; Fanfiction and Literotica excluded for logistical or bias concerns).
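The GitHub collection limits (300 s clone/extract budget, 100 kB per-file cap) lend themselves to a small sketch. The function names are ours and the real pipeline's extraction logic is more involved; this only illustrates how the stated limits might be enforced:

```python
import os
import subprocess

CLONE_TIMEOUT_S = 300        # 300 s limit for cloning and extraction
MAX_FILE_BYTES = 100 * 1024  # 100 kB cap on extracted files

def clone_repo(url: str, dest: str) -> bool:
    """Shallow-clone a repository, abandoning it after the time limit."""
    try:
        subprocess.run(
            ["git", "clone", "--depth", "1", url, dest],
            timeout=CLONE_TIMEOUT_S, check=True,
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False  # over-budget or failed repos yield partial/no data

def extractable_files(root: str):
    """Yield paths under the per-file size cap."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) <= MAX_FILE_BYTES:
                yield path
```

The timeout path is one source of the partial repository extractions noted later under failure modes.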
Cleaning and filtering steps include component-specific procedures: Pile-CC uses a fasttext classifier (n-gram size 2) trained on an OpenWebText2-based training set, with Pareto thresholding (α = 3) for quality filtering; OpenWebText2 applied document-level deduplication via MinHashLSH (DataSketch) and URL-level deduplication in raw form; PhilPapers applied pdf_filter.py and discarded papers with fewer than 1,000 characters or with conversion errors; arXiv and PubMed conversions used pandoc and excluded documents with conversion errors.
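Pareto thresholding can be illustrated with the GPT-3-style stochastic rule, where a Pareto-distributed draw decides whether a document with a given classifier score is kept, so high-scoring documents are kept almost always while low-scoring ones survive occasionally. The exact rule used for Pile-CC may differ in detail:

```python
import random

ALPHA = 3.0  # Pareto shape parameter reported for Pile-CC filtering

def keep_document(quality_score: float, rng: random.Random) -> bool:
    """Stochastic Pareto thresholding (GPT-3-style formulation):
    keep the document when a Pareto(ALPHA) draw exceeds
    1 - quality_score, for quality_score in [0, 1]."""
    # random.paretovariate(ALPHA) has support [1, inf), so shift by 1
    # to get a draw on [0, inf) as in the usual formulation.
    return rng.paretovariate(ALPHA) - 1.0 > 1.0 - quality_score
```

This keeps a long tail of lower-quality web text rather than applying a hard cutoff, which avoids filtering the corpus down to only classifier-favored styles.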
Deduplication and decontamination: deduplication was applied within some components (OpenWebText2 and Pile-CC) using MinHashLSH with 10 MinHash functions and an approximate Jaccard similarity threshold of 0.5. Duplicate rates were reported as 28% for OpenWebText2 and 26% for Common Crawl. There was no Pile-wide deduplication across all components; duplicates may exist across train/validation/test splits. Evaluation splits were decontaminated using a 13-gram overlap filter to reduce leakage into held-out evaluation.
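A minimal MinHash sketch with 10 hash functions and a 0.5 estimated-Jaccard threshold, mirroring the reported configuration; the production pipeline uses DataSketch's MinHashLSH, and the shingle choice here is an illustrative assumption:

```python
import hashlib

NUM_HASHES = 10        # number of MinHash functions, as reported
JACCARD_THRESHOLD = 0.5

def shingles(text, k=5):
    """Word 5-grams as the shingle set (an illustrative choice)."""
    toks = text.split()
    return {" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}

def minhash_signature(items):
    """One minimum per seeded hash function."""
    sig = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for s in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # Fraction of agreeing minima estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

def is_duplicate(sig_a, sig_b):
    return estimated_jaccard(sig_a, sig_b) >= JACCARD_THRESHOLD
```

With only 10 hash functions the Jaccard estimate is coarse, which trades recall for the memory economy that the component-level deduplication required.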
Recommended use, evaluation protocol, and metrics
Intended use cases focus on self-supervised pretraining and benchmarking of general-purpose language models: training large-scale LMs, evaluating cross-domain knowledge and generalization, and enabling research access to NIH grant abstracts and other specialized corpora. Recommended evaluation metrics and practices include Bits Per UTF-8 encoded byte (BPB) for compression-normalized scoring and perplexity-based evaluations on tasks such as WikiText and LAMBADA. Specific recommendations and experimental choices reported: train 1.3B-parameter models under settings identical to Brown et al. (2020) for comparative experiments; decontaminate evaluation sets with 13-gram overlap filtering; downsample comparisons to ~40 GB for dataset-size control; compute aggregated perplexity across The Pile by weighting constituent datasets by dataset size.
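Bits per UTF-8 byte and the size-weighted aggregation can both be computed in a few lines; the function names below are ours, and the conversion is the standard change of base from nats to bits divided by the byte length:

```python
import math

def bits_per_byte(total_loss_nats: float, text: str) -> float:
    """Convert a model's summed negative log-likelihood (in nats)
    over `text` into bits per UTF-8 encoded byte: divide by the
    UTF-8 byte length and change base from e to 2."""
    n_bytes = len(text.encode("utf-8"))
    return total_loss_nats / (n_bytes * math.log(2))

def weighted_mean(values, sizes):
    """Aggregate per-component scores weighted by dataset size,
    as in the Pile-wide aggregation described above."""
    total = sum(sizes)
    return sum(v * s / total for v, s in zip(values, sizes))
```

Normalizing by bytes rather than tokens makes scores comparable across models with different tokenizers, which is the stated motivation for preferring BPB.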
Reported baseline results (preserved exactly): Pile (val) BPB = 0.9281; Pile (test) BPB = 0.9433; CC-100 (en) (val) BPB = 1.3143; (test) BPB = 1.3293; Raw CC (val) BPB = 1.118; (test) BPB = 1.1275. Task results reported include WikiText (PPL) = 5.59, LAMBADA (PPL) = 12.78, and LAMBADA (ACC) = 50.1. Evaluation specifics note that GPT-2 uses a 1024-token maximum context and GPT-3 uses 2048 tokens, with scoring contexts defined so that tokens 0–1023 are used when predicting tokens 1–1024 in the GPT-2 experiments. GPT-2 experiments were implemented via Hugging Face; GPT-3 access used the OpenAI API, with the 'davinci' model (≈175B parameters) used for comparisons.
Additional evaluation analyses include 16-topic LDA models trained on the validation set of each Pile component for topical analysis, profanity percentage analysis using the profanity-check library, and gender and demographic co-occurrence bias analyses (co-occurrence with gendered pronouns and average sentiment of co-occurring words).
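The gendered-pronoun co-occurrence counting can be sketched as follows; the window size, pronoun lists, and token handling are illustrative assumptions rather than the analysis code itself:

```python
from collections import Counter

# Illustrative pronoun sets; the published analysis may use different lists.
MALE = {"he", "him", "his"}
FEMALE = {"she", "her", "hers"}

def pronoun_cooccurrence(docs, window=5):
    """Count words co-occurring within `window` tokens of gendered
    pronouns, separately per pronoun group."""
    counts = {"male": Counter(), "female": Counter()}
    for doc in docs:
        toks = doc.lower().split()
        for i, tok in enumerate(toks):
            group = ("male" if tok in MALE
                     else "female" if tok in FEMALE else None)
            if group is None:
                continue
            lo, hi = max(i - window, 0), min(i + window + 1, len(toks))
            for j in range(lo, hi):
                if j != i:
                    counts[group][toks[j]] += 1
    return counts
```

Ranking the resulting counts (e.g., by pointwise mutual information) surfaces the most gender-skewed adjectives and adverbs, and averaging a sentiment lexicon over the co-occurring words yields the per-group sentiment figures reported in the limitations section.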
Access, licensing, and reproduction
Availability is described as a three-tiered policy: Public data, ToS-compliant data, and authorial-consented data. Reproduction tooling and preprocessing code are provided to allow exclusion of specific components. The legal basis described for noncommercial, not-for-profit use of copyrighted data is framed as fair use with a transformative rationale emphasizing full-text use to capture long-range dependencies. Component-level licensing and copyright status are not uniformly determinable from available metadata, and component-level decisions are required to manage reuse.
Limitations, known biases, and common failure modes
Known biases, coverage gaps, and quality constraints are explicitly acknowledged. The dataset is predominantly English (reported content is 97.4% English as a rough estimate due to language identification issues), with limited multilingual Common Crawl inclusion for the portion described. Coverage gaps reported include underrepresentation of some data modes such as programming, logic, physics, and legal knowledge relative to other web clusters. Data quality challenges include Common Crawl WET files containing boilerplate and noisy extractions, language identification issues for rare languages, conversion errors that lead to discarded papers, partial extractions from large repositories due to time/file-size limits, and insufficient metadata for reliably determining copyright.
Bias analyses reported include gender co-occurrence patterns with top biased adjectives/adverbs (e.g., 'military', 'criminal', 'offensive' toward men; 'little', 'married', 'sexual', 'happy' toward women) and demographic average-sentiment co-occurrence values (White -0.114; Black -0.148; Asian -0.028; Hispanic -0.024). The dataset may contain pejorative, sexually explicit, or otherwise objectionable content. Decontamination does not remove all overlaps between training data and downstream evaluations for every use case; for some analyses, a policy decision was made not to remove such overlaps.
Observed failure modes and operational issues include partial repository extraction due to the 300s cloning limit, database corruption during deduplication attempts (Cassandra), memory constraints influencing LSH approaches, and puzzling performance patterns where larger models (e.g., GPT-3) sometimes outperform GPT-2 trained on The Pile on certain tasks, indicating complex interactions between dataset composition and model scale.
Summary of key contributions and design decisions
The collection emphasizes combining web-scale scraping with targeted high-quality sources, providing reproducible preprocessing code, and applying component-level cleaning and deduplication where feasible. Notable methodological decisions include using jusText for Common Crawl extraction, English filtering via pycld2 to reduce computation (reported as approximately halving computation by language filtering before extraction), document-level deduplication for selected components (MinHashLSH), and evaluation decontamination via 13-gram overlap. The curation and analysis report a mixture of quantitative artifacts (component sizes, BPB/perplexity baselines, profanity and bias analyses) intended to support cross-domain benchmarking and study of dataset effects on model behavior.