SimpleBooks: Word-Level Long-Term Dependency Dataset for Language Modeling
Overview
SimpleBooks is a family of English word-level language modeling datasets released by NVIDIA that emphasize long-term dependencies while keeping vocabulary size and training cost low for rapid experimentation. The collection includes two main tokenized datasets, SimpleBooks-2 and SimpleBooks-92, intended to serve as smaller, representative benchmarks for architecture search, meta-learning, and quick prototyping of language models.
Key characteristics
SimpleBooks is constructed from a large set of Gutenberg books and is positioned to provide fast-training, word-level language modeling benchmarks that approximate properties of larger corpora. Notable characteristics include: a corpus size of roughly 92 million tokens in the largest configuration, a vocabulary under 100K word types, and availability of both raw text for character-level experiments and tokenized word-level text for LM training.
Composition and format
The dataset is provided as a ZIP archive available at https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip. Word-level tokenized text is provided using SpaCy tokenization with explicit number separation (examples: '300 @,@ 000', '1 @.@ 93'). Original case and punctuation are preserved. Raw unprocessed text is also included for character-level language model experiments.
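The @-separator convention above can be undone with a one-line regex when detokenized text is needed. A minimal sketch (the function name is illustrative, not part of the dataset's tooling):

```python
import re

def detokenize_numbers(text: str) -> str:
    """Rejoin digits split by the @-separator convention,
    e.g. '300 @,@ 000' -> '300,000' and '1 @.@ 93' -> '1.93'."""
    return re.sub(r" @([.,])@ ", r"\1", text)

print(detokenize_numbers("It sold 300 @,@ 000 copies at 1 @.@ 93 dollars."))
# It sold 300,000 copies at 1.93 dollars.
```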
Collection and preprocessing
Texts were sourced from Gutenberg US. A selection of 1,573 books was made by choosing titles with the highest ratio of word-level book length to vocabulary size. The pipeline included downloading available Gutenberg books, filtering out badly formatted works, and excluding poems, plays, manuals, recipes, and literary nonsense. Metadata, tables of contents, and illustrations were removed; minimal cleaning otherwise preserved original case and punctuation. Tokenization used SpaCy rules at the word level, with numbers split into explicit separator tokens.
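The length-to-vocabulary selection heuristic described above can be sketched in a few lines. This is a hypothetical re-implementation of the scoring criterion, not the authors' actual pipeline:

```python
def length_to_vocab_ratio(tokens: list[str]) -> float:
    """Score a book by word-level length divided by vocabulary size.
    Higher scores favor long books with simplified vocabulary."""
    return len(tokens) / len(set(tokens))

# A book that reuses a small vocabulary scores higher than one of the
# same length full of rare words.
simple = "the cat sat on the mat and the cat sat".split()
varied = "colorless green ideas sleep furiously beside quixotic zephyrs gleam today".split()
assert length_to_vocab_ratio(simple) > length_to_vocab_ratio(varied)
```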
Scale, splits, and coverage
The collection scale and splits are reported as follows:
SimpleBooks provides two main datasets:
- SimpleBooks-2 (SB-2): train 2,000,000 tokens; val 200,000 tokens; test 200,000 tokens.
- SimpleBooks-92 (SB-92): train 92,000,000 tokens; val 200,000 tokens; test 200,000 tokens.
Aggregate figures reported include 92,000,000 tokens and a vocabulary of roughly 98,000 word types; 39,432 books were initially downloaded, from which the 1,573-book selection was drawn. Language coverage is English and the source geography is the United States.
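Given the ~98K reported vocabulary, a typical preprocessing step is mapping word types to integer ids with an out-of-vocabulary fallback. A minimal sketch under the assumption of whitespace-separated tokenized text (the function and the `<unk>` convention are illustrative, not prescribed by the dataset):

```python
from collections import Counter

def build_vocab(tokens: list[str], max_size: int = 98_000) -> dict[str, int]:
    """Map the most frequent word types to integer ids, reserving id 0
    for <unk>; max_size mirrors SimpleBooks' reported ~98K vocabulary."""
    counts = Counter(tokens)
    itos = ["<unk>"] + [w for w, _ in counts.most_common(max_size - 1)]
    return {w: i for i, w in enumerate(itos)}

tokens = "the cat sat on the mat".split()
stoi = build_vocab(tokens)
ids = [stoi.get(w, stoi["<unk>"]) for w in tokens]
```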
Intended use cases
- benchmarking language modeling architectures on small but representative datasets
- architecture search and meta-learning for language modeling
- training word-level language models on English text (Gutenberg subset)
- word-level language modeling benchmarks on SimpleBooks-2 and SimpleBooks-92
Recommended evaluation protocol and metrics
Evaluation is recommended using perplexity on the validation and test splits, which enables direct comparisons across model families (e.g., AWD-LSTM vs. Transformer variants). Perplexity is the primary metric.
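Perplexity is simply the exponential of the mean per-token negative log-likelihood, so it falls out of a standard cross-entropy training loop. A minimal reference computation (in nats, assuming natural-log losses):

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token (nats)."""
    return math.exp(mean_nll)

# A model whose average loss is ln(100) ~= 4.605 nats/token
# has perplexity 100, i.e. it is as uncertain as a uniform
# choice over 100 equally likely next words.
assert abs(perplexity(math.log(100)) - 100.0) < 1e-9
```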
Baseline results and observations
Baseline perplexity results are reported for both AWD-LSTM and Transformer-XL on SB-2 and SB-92. Reported values are:
- SB-2 AWD-LSTM: valid 17.16; test 16.78
- SB-92 AWD-LSTM: valid 21.45; test 20.64
- SB-2 Transformer-XL: valid 17.27; test 16.41
- SB-92 Transformer-XL: valid 9.3; test 8.92
One highlighted observation from these comparisons is that Transformer-XL can outperform AWD-LSTM even on small datasets and with fewer parameters.
Access and distribution
The dataset is distributed as a ZIP archive at the URL above. Tokenized word-level files and raw unprocessed text are included in the archive. License information is not provided in the dataset metadata.
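Fetching and unpacking the archive needs only the standard library. A sketch (the destination directory name is arbitrary, and the archive is large, so expect a lengthy download):

```python
import io
import urllib.request
import zipfile

URL = "https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip"

def download_simplebooks(dest: str = "simplebooks") -> None:
    """Download the SimpleBooks ZIP archive and extract it into dest."""
    data = urllib.request.urlopen(URL).read()
    zipfile.ZipFile(io.BytesIO(data)).extractall(dest)
```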
Limitations and open questions
No explicit biases, coverage gaps, data quality issues, or common failure modes are listed for the collection. The dataset selection focused on books with high word-length-to-vocabulary ratios and excluded several categories (poems, plays, manuals, recipes, literary nonsense), which shapes topical and stylistic coverage toward prose that fits those selection criteria.