BookCorpus and Derived Corpus Variants — Dataset Overview for Language-Model Pretraining and Sentence-Embedding Work

Overview

BookCorpus is a widely used text dataset composed of full-text books originally gathered from Smashwords and authors' websites. It has been used for unsupervised pretraining of language models and for training sentence-embedding models that align dialogue sentences with book sentences. Multiple replications and derived variants exist, including BookCorpusOpen, a community-collected replication that appears in larger mixes such as The Pile under the name BookCorpus2, and community collection tools sometimes called "Homemade BookCorpus."

The collection is sparsely documented. This documentation debt, combined with multiple public replications, has produced divergent versions of the corpus. The dataset is best known for pretraining language models (BERT and GPT are the cited examples) and for producing sentence-level corpora for embedding training.

Key statistics and quick facts

11,038 total entries (reported)
7,185 unique books after deduplication (maximum unique count reported)
74,004,228 total sentences (books_in_sentences files)
984,846,357 total words reported in sentence corpus
mean words per sentence: 13; median words per sentence: 11
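
As a quick consistency check on the figures above, the sketch below recomputes the mean sentence length and the share of unique books from the reported totals. The constant and function names are illustrative and are not taken from any released tooling.

```python
# Reported totals copied from the quick facts above.
TOTAL_ENTRIES = 11_038
UNIQUE_BOOKS = 7_185
TOTAL_SENTENCES = 74_004_228
TOTAL_WORDS = 984_846_357


def ratio(numerator: int, denominator: int) -> float:
    """Plain division with a guard against a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


mean_words_per_sentence = ratio(TOTAL_WORDS, TOTAL_SENTENCES)
unique_share = ratio(UNIQUE_BOOKS, TOTAL_ENTRIES)

print(f"mean words per sentence: {mean_words_per_sentence:.2f}")
print(f"unique share of entries: {unique_share:.1%}")
```

The recomputed mean is about 13.3 words per sentence, which agrees with the reported mean of 13 after rounding, and roughly 65% of the reported entries are unique.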

Data content and file formats

The dataset is distributed as plain text files and as two large sentence-per-line files:

A directory named books_txt_full contains individual plain-text book files (.txt), historically reported as 11,040 text files (note: this file count does not match the 11,038 total entries reported elsewhere).
A directory named books_in_sentences contains two large files: books_large_p1.txt and books_large_p2.txt. Those files contain one sentence per row and together hold 74,004,228 sentences and 984,846,357 words (books_large_p1.txt: 536,517,284 words; books_large_p2.txt: 448,329,073 words).
Some large aggregate files were distributed, for example romance-all.txt (1.12 GB) and adventure-all.txt (150.9 MB); after removing those two files the remaining individual text files were reported to contain 811,601,031 words.
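
The per-file word counts above can be checked against the reported total, and a sentence-per-line file can be tallied with a short script. The sketch below assumes that a "sentence" is any non-blank line and that a "word" is a whitespace-separated token; the original counting method is not documented here, so results on the real files may differ slightly. The function name and the tiny sample file are hypothetical.

```python
import tempfile
from pathlib import Path

# Reported word counts for the two sentence files.
P1_WORDS = 536_517_284  # books_large_p1.txt
P2_WORDS = 448_329_073  # books_large_p2.txt
REPORTED_TOTAL_WORDS = 984_846_357

parts_match_total = (P1_WORDS + P2_WORDS) == REPORTED_TOTAL_WORDS


def count_sentences_and_words(path):
    """Count non-blank lines (sentences) and whitespace tokens (words)."""
    sentences = 0
    words = 0
    # Stream line by line so multi-gigabyte files never load fully in memory.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                sentences += 1
                words += len(tokens)
    return sentences, words


# Tiny stand-in for a sentence-per-line file (not real corpus text).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_sentences.txt"
    sample.write_text(
        "the cat sat on the mat .\nshe did n't reply .\n\n", encoding="utf-8"
    )
    demo_counts = count_sentences_and_words(sample)

print(parts_match_total, demo_counts)
```

The two per-file word counts do sum to the reported total of 984,846,357 words.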

Fields present in the distributed files include:

unstructured full-book text (text_content / book_text), often with preamble and postscript material retained;
genres as string labels supplied by authors on Smashwords;
author_email_addresses present in a small sample (~2%) of files.

Several text files were observed to be empty (98 empty book files reported) or shorter than the expected thresholds (655 files shorter than 20,000 words; 291 shorter than 10,000 words). Some files contain internal licensing/copyright statements or other preamble/postscript noise; tokenization differences (e.g., contractions split) were also observed.
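
A length audit of this kind can be reproduced with a few lines of code. The following is a minimal sketch, assuming one book per .txt file and whitespace-separated word counts; the function names are hypothetical, and because it is unclear whether the reported short-file counts include the empty files, this version simply counts every file under each threshold.

```python
import tempfile
from pathlib import Path


def word_count(path: Path) -> int:
    """Number of whitespace-separated tokens in a text file."""
    return len(path.read_text(encoding="utf-8", errors="replace").split())


def summarize_lengths(book_dir, thresholds=(10_000, 20_000)):
    """Count empty files and files below each word-count threshold.

    Empty files are also counted under every threshold (an assumption).
    """
    counts = [word_count(p) for p in sorted(Path(book_dir).glob("*.txt"))]
    summary = {"files": len(counts), "empty": sum(1 for c in counts if c == 0)}
    for threshold in thresholds:
        summary[f"under_{threshold}"] = sum(1 for c in counts if c < threshold)
    return summary


# Toy directory standing in for books_txt_full (synthetic content only).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "empty.txt").write_text("", encoding="utf-8")
    (root / "short.txt").write_text("only five words in here", encoding="utf-8")
    (root / "long.txt").write_text("word " * 25_000, encoding="utf-8")
    demo_summary = summarize_lengths(root)

print(demo_summary)
```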

Unit of observation and intended labels

The primary unit of observation is the book (single-book text files) and, for sentence-level use, individual sentences (one row per sentence in the large sentence files). The only explicit label/target present in the dataset is genres, provided by authors on Smashwords and aggregated into a 16-genre taxonomy used in summaries (Romance, Fantasy, Science fiction, Vampires, etc.).

Collection, inclusion rules, cleaning, and deduplication

Data sources and collection methods:

Books were collected from smashwords.com via web scraping of free books and historically from a self-hosted authors' website for the original BookCorpus.
Smashwords21 is a metadata-only superset assembled by scraping Smashwords listings; BookCorpusOpen is a collector-maintained replication (collected as of August 2020 by Shawn Presser).
Community repositories and tools exist to automate collection (e.g., "Homemade BookCorpus" GitHub repositories and scraping software).

Inclusion and exclusion rules reported:

Public listings on Smashwords applied a default filter excluding adult erotica.
Books included were typically those available for free on Smashwords and filtered for length (commonly requiring more than 20,000 words).
Some distributed aggregates explicitly excluded particular files (e.g., romance-all.txt and adventure-all.txt were removed in some counts).

Cleaning and deduplication:

Partial cleaning was applied in places (removal of some preamble/postscript material), but many files retained licensing text and other non-content noise.
Deduplication is incomplete across versions: one analysis identified 2,930 duplicated books and 7,185 unique books among the 11,038 total entries. Copies of the same book may sit in multiple genre folders and may differ slightly (e.g., 30-line differences observed for some files).
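
One way to look for both kinds of duplicate is sketched below. This is not the method used in the cited analysis, which is not described here. It is an illustrative approach: exact copies are grouped by a hash of whitespace-normalized text, and near copies are scored by the overlap of their non-blank lines. All names and the toy texts are hypothetical.

```python
import hashlib
from collections import defaultdict


def normalized_digest(text: str) -> str:
    """SHA-256 of the text after collapsing whitespace and folding case."""
    canonical = " ".join(text.split()).casefold()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def exact_duplicate_groups(books: dict) -> list:
    """Group file names whose normalized text is identical."""
    groups = defaultdict(list)
    for name, text in books.items():
        groups[normalized_digest(text)].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


def line_jaccard(text_a: str, text_b: str) -> float:
    """Jaccard overlap of the sets of non-blank lines in two texts."""
    lines_a = {line.strip() for line in text_a.splitlines() if line.strip()}
    lines_b = {line.strip() for line in text_b.splitlines() if line.strip()}
    if not lines_a and not lines_b:
        return 1.0
    return len(lines_a & lines_b) / len(lines_a | lines_b)


# Synthetic stand-ins for the same book filed under two genre folders.
toy_books = {
    "romance/a.txt": "Chapter 1\nIt was late.\n",
    "fantasy/a.txt": "Chapter 1\nIt  was late.\n",  # extra space only
    "fantasy/b.txt": "Chapter 1\nIt was late.\nThe end.\n",  # one added line
}

duplicate_groups = exact_duplicate_groups(toy_books)
near_score = line_jaccard(toy_books["romance/a.txt"], toy_books["fantasy/b.txt"])
print(duplicate_groups, round(near_score, 3))
```

A high line-overlap score between files with different hashes suggests a near copy that deserves manual review; any cutoff for "near duplicate" would need to be chosen and validated on the actual files.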

Scale, coverage, and distributional notes

Coverage highlights:

The dataset is focused on fiction and genre books; author-supplied Smashwords genres indicate substantial representation in Romance (2,880–2,881 books reported in different summaries), Fantasy (1,502), Science Fiction (823), New Adult (766), Young Adult (748), Thriller (646), Mystery (621), Vampires (600), plus several other genre categories (16 total genre labels summarized).
Temporal coverage spans multiple collection dates and versions: the original 2014 release, a BookCorpusOpen version collected as of August 2020, and Smashwords21 metadata sweeps up to April 2021. Smashwords launched in 2008, and early catalog statistics are cited for the platform (140 books, 90 authors); the provenance notes also cite larger platform statistics for 2014 and 2020.

Distributional observations:

The dataset is a subset of Smashwords and therefore biased by the platform’s public listing filters and the “free books” criterion; adult erotica is underrepresented due to default filters.
Religious representation skew is reported: Smashwords metadata over-represents Christianity overall while BookCorpusOpen was reported to over-represent Islam in one analysis.
Author-contribution imbalance was observed (some authors contribute many works).

Intended use cases and evaluation guidance

Primary intended uses:

Pretraining language models (unsupervised language modeling) and fine-tuning language models (commonly cited models trained with BookCorpus-style data include BERT and GPT variants).
Training and evaluating sentence embedding models; the sentence-per-line files were explicitly used to train sentence-embedding models for aligning dialogue sentences with book sentences.

Evaluation and metrics:

Reported evaluation approaches for sentence alignment use vector inner products of sentence embeddings (inner product similarity).
Other suggested analyses include sentiment- or representation-based diagnostics to study religious or demographic skews.
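
Inner-product similarity for sentence alignment can be illustrated with a minimal sketch. The vectors below are made-up three-dimensional placeholders, not outputs of any real embedding model, and the function names are hypothetical; a real pipeline would obtain the vectors from a trained sentence encoder.

```python
def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def best_match(query, candidates):
    """Return (index, score) of the candidate with the largest inner product."""
    scores = [inner_product(query, c) for c in candidates]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Placeholder embeddings: one dialogue sentence and three book sentences.
dialogue_vec = [0.2, 0.9, 0.1]
book_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.3],
    [0.0, 0.2, 0.9],
]

match_index, match_score = best_match(dialogue_vec, book_vecs)
print(match_index, round(match_score, 2))
```

Unlike cosine similarity, the raw inner product is sensitive to vector length, so results depend on whether the embeddings are normalized.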

Recommended caution:

Use with caution for downstream tasks due to potential copyright issues, duplications, sampling skews, and content concerns (see Limitations).

Annotation and metadata quality

Annotation status:

The dataset is effectively unannotated for supervised labels beyond author-supplied genre labels. Genre labels originate from authors themselves rather than a curated annotation process.
No structured annotation instructions, inter-annotator agreement metrics, or standardized quality-control processes are reported.

Known metadata and annotation issues:

Genre labels are author-supplied and may be inconsistent across authors.
Copyright and licensing statements embedded within book text act as annotation noise.
Personal data leakage: author email addresses appear in a small proportion of files (~2% sample).
Duplicates and sampling skew may affect reliability of genre distributions and downstream analyses.
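
Email addresses embedded in book text can be flagged or masked before training. The sketch below uses a deliberately simple regular expression, which is an assumption of this example: it is not a complete address grammar and will miss unusual formats. The function names and the placeholder string are hypothetical.

```python
import re

# Simple pattern: local part, "@", then a dotted domain. Not RFC-complete.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text: str) -> list:
    """Return every substring that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


def redact_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace each email-like substring with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


# Synthetic postscript line of the kind retained in some book files.
sample_line = "Contact the author at jane.doe@example.com for updates."
found = find_emails(sample_line)
redacted = redact_emails(sample_line)
print(found)
print(redacted)
```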

Access, licensing, and redistribution considerations

License and rights summary:

No dataset-wide open license is reported. Books within the corpus frequently carry individual copyright restrictions, and many files contain internal statements indicating redistribution restrictions.
Several books included may have changed availability or pricing since collection; 406 of 2,680 matched Smashwords entries were reported to now require purchase, with a reported total purchase cost of $1,182.21 as of April 2021 for those items.
Because redistribution rights vary by book, the dataset’s legal status for redistribution or commercial use is ambiguous and constrained by the original authors’ rights documented in the text.
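
To put the availability figures in proportion, the short calculation below derives the share of matched entries that now require purchase and their average price. It uses only the numbers reported above; the variable names are illustrative.

```python
# Figures reported for April 2021.
MATCHED_ENTRIES = 2_680
NOW_PAID = 406
TOTAL_COST_USD = 1_182.21

paid_share = NOW_PAID / MATCHED_ENTRIES        # fraction of matched books now paid
mean_price_usd = TOTAL_COST_USD / NOW_PAID     # average price of those books

print(f"now paid: {paid_share:.1%} of matched entries")
print(f"average price: ${mean_price_usd:.2f}")
```

About 15% of the matched books had moved behind a price, at an average of just under $3 each.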

Access pathways:

Historical distribution methods include a self-contained website hosted by the original authors, community mirrors such as BookCorpusOpen, inclusion in compilation datasets like The Pile, and community scripts for re-collecting from Smashwords (e.g., "Homemade BookCorpus").

Limitations, biases, and known failure modes

Major limitations and biases:

Documentation debt: sparse documentation and multiple public replications have produced version ambiguity and replication difficulty.
Sampling bias: selection of free, longer books from Smashwords (and the platform’s default filters) produces a non-representative sample of published books.
Genre skew: Romance is markedly over-represented relative to other genres in the corpus.
Religious and demographic skews: analyses report differing religious over-representation across versions (Christianity on Smashwords overall, Islam over-represented in BookCorpusOpen in one analysis).
Author-contribution imbalance: some “superauthors” contribute disproportionate numbers of titles.
Data quality problems: duplicates (2,930 duplicated books identified), empty files (98 reported), truncated or short files (655 under 20,000 words; 291 under 10,000 words), presence of personally identifiable information (author emails), retained preamble/postscript/copyright text, and inconsistent tokenization.

Common failure modes observed:

Models pretrained on these texts risk reproducing offensive or pornographic content present in the corpus and may amplify harmful gender stereotypes or other biases.
Duplicate content reduces effective dataset diversity and can distort model evaluation or downstream behavior.
Copyright-embedded text and redistribution-restricted content create legal and operational risks for redistribution and reuse.

Key contributions and provenance notes

Notable dataset contributions and provenance points cited in analyses:

A detailed datasheet-style effort documenting the motivation, composition, and collection process of the BookCorpus materials.
Identification and quantification of deficiencies in the original corpus: copyright statements within files, duplicate books, empty or truncated files, and genre/religious skews.
Provision of two large sentence-per-line files intended for sentence-embedding training and alignment tasks, totaling 74,004,228 sentences.
Community reproductions and derivative artifacts: BookCorpusOpen (collected August 2020), inclusion of BookCorpusOpen as BookCorpus2 in The Pile, and third-party tools/repositories enabling re-collection ("Homemade BookCorpus").

Final note: the dataset has been widely used for unsupervised language-model pretraining and sentence-embedding research, but users should account for the documented quality, legal, and distributional caveats when employing the data for model training or evaluation.

Sources

https://arxiv.org/abs/2105.05241v1