CNN/Daily Mail Reading Comprehension and Related QA Datasets

Overview

The collection comprises large-scale reading comprehension resources derived from news articles and several benchmark RC tasks: CNN/Daily Mail, CBT, bAbI, and MCTest. The central formulation is a cloze-style reading-comprehension task where each example is a passage-question-answer triple: a question contains a single placeholder token and the correct answer is an entity or word drawn from the passage. The datasets were created to address the scarcity of large human-annotated RC corpora by heuristically pairing articles with summary bullet points and by automatic annotation (entity recognition and coreference resolution).

Key published findings include simple systems reaching 73.6% on CNN and 76.6% on Daily Mail, analyses showing that some linguistic pipelines (e.g., frame-semantic parsers) perform worse than heuristics, and reported model improvements (including a reported 7-10% improvement over AttentiveReader for some methods). Architectural contributions cited include replacing tanh-based attention with bilinear attention, removing a non-linear layer before final predictions, and restricting predictions to entities that appear in the passage.

Data and Format

Each example is a passage-question-answer triple (p, q, a). Typical fields and formats are:

  • passage / passage_text: text of the article or passage; average length reported as CNN: ~761.8 tokens, Daily Mail: ~813.1 tokens; average sentences CNN: 32.3, Daily Mail: 28.9.
  • question / question_text: text containing exactly one @placeholder token (cloze-style).
  • answer: an entity or word that appears in the passage; for CNN/Daily Mail the answer is an entity from the passage (a ∈ p ∩ E).
  • entity_markers / entity_marker: anonymized placeholders such as @entity1, @entity2 that replace entity mentions in the text and serve as answer candidates.
  • CBT-specific fields include sentence_index, passage_id, and target_word; CBT passages are 21 sentences where the first 20 form the passage and the 21st contains the target word.

Data is distributed as plain text; CNN/Daily Mail examples are stored one example per file, using a .question extension.
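A minimal parser for one such file can be sketched as follows, assuming the commonly distributed layout (URL, passage, question, answer, then one "@entityN:surface form" mapping per line, with blank lines separating the blocks); the function name and field names are our own:

```python
def parse_question_file(text: str) -> dict:
    """Parse one CNN/Daily Mail .question file (illustrative sketch)."""
    url, passage, question, answer, mapping = text.strip().split("\n\n", 4)
    # each mapping line pairs an anonymized marker with its original string
    entities = dict(line.split(":", 1) for line in mapping.splitlines())
    return {
        "url": url,
        "passage": passage,
        "question": question,   # contains exactly one @placeholder token
        "answer": answer,       # an @entityN marker present in the passage
        "entities": entities,   # marker -> original (de-anonymized) string
    }
```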

Unit of Observation and Labeling

The fundamental unit is the passage-question-answer triple. Labels or targets are:

  • an entity name (string) for CNN/Daily Mail,
  • a missing word for CBT.

Label schema is therefore either an entity string or a single word, depending on the dataset. Examples are set up so the answer is present in the passage and corresponds to the placeholder token in the question.
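The two constraints above (exactly one placeholder in the question, answer present in the passage) can be checked with a short validator; this whitespace-tokenized version is an illustrative simplification, not the official preprocessing:

```python
def validate_example(passage: str, question: str, answer: str) -> bool:
    """Check the cloze constraints: one @placeholder in the question,
    and the answer token occurring in the passage."""
    has_one_placeholder = question.split().count("@placeholder") == 1
    answer_in_passage = answer in passage.split()
    return has_one_placeholder and answer_in_passage
```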

Collection, Preprocessing, and Annotation

Articles and summaries were paired heuristically by matching news articles with bullet-point summaries. Annotation was largely automatic: entity mentions and coreference chains were replaced by entity markers (@entityN) via an automated anonymization pipeline combining entity recognition and coreference resolution. Annotation-related signals used by ranking systems or features include:

  • occurrence in the passage,
  • occurrence in the question,
  • frequency of the entity in the passage,
  • first position of occurrence in the passage,
  • n-gram exact match between placeholder context and entity context,
  • word distance (average minimum distance from non-stop question words to the entity),
  • sentence co-occurrence with another entity or verb in the question,
  • dependency-parse match between placeholder and entity contexts.
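Three of the signals above (frequency, first position, word distance) can be sketched over tokenized inputs; the exact definitions in published systems may differ, so treat this as an approximation:

```python
def entity_features(passage_tokens, question_tokens, entity, stopwords=frozenset()):
    """Approximate versions of three ranking features from the list above."""
    positions = [i for i, tok in enumerate(passage_tokens) if tok == entity]
    freq = len(positions)
    first_pos = positions[0] if positions else -1
    # word distance: for each non-stopword question token found in the
    # passage, take its minimum token distance to any entity mention,
    # then average over those question tokens
    dists = []
    for q in question_tokens:
        if q in stopwords or q == "@placeholder":
            continue
        q_positions = [i for i, tok in enumerate(passage_tokens) if tok == q]
        if q_positions and positions:
            dists.append(min(abs(qi - pi) for qi in q_positions for pi in positions))
    word_dist = sum(dists) / len(dists) if dists else float("inf")
    return {"frequency": freq, "first_position": first_pos, "word_distance": word_dist}
```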

Known annotation issues include failures in NER/coreference and the impact of anonymization on human and model understanding.

Baselines and Modeling Approaches

Benchmarks and published baselines include both conventional and neural approaches. Representative models and reported findings:

  • AttentiveReader (neural baseline; Hermann et al., 2015) — used as a reference baseline on CNN/Daily Mail.
  • end-to-end memory networks (Hill et al., 2016) — reported as another baseline family.
  • window-based memory networks — explored as dataset-specific architectures.
  • Frame-semantic parser baselines were reported to perform poorly relative to simple heuristics.
  • Architectural refinements reported in analyses include bilinear attention replacing tanh-based attention, removal of a non-linear layer before final prediction, and constraining prediction candidates to entities appearing in the passage.
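The bilinear attention refinement above can be sketched in NumPy: the question vector scores each contextual passage embedding through a learned bilinear form, and the attended summary feeds prediction directly (no extra non-linear layer). Shapes and names here are illustrative, not the published implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bilinear_attention(P, q, W):
    """P: (n, d) contextual passage embeddings; q: (d,) question embedding;
    W: (d, d) bilinear weight matrix.
    Returns the attention distribution over tokens and the attended
    passage summary used directly for prediction."""
    scores = P @ (W @ q)          # (n,) unnormalized bilinear scores
    alpha = softmax(scores)       # attention weights over passage tokens
    output = alpha @ P            # (d,) weighted passage representation
    return alpha, output
```

Prediction is then restricted to entity markers occurring in the passage, scoring each candidate against `output`.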

Reported quantitative improvements include simple, carefully designed systems achieving 73.6% on CNN and 76.6% on Daily Mail, and a claimed 7-10% improvement over AttentiveReader for certain methods.

Scale, Splits, and Coverage

The collection is large-scale and English-language, focused on news-article reading comprehension and reasoning.

Reported sizes and splits:

  • Overall scale: over 1,000,000 examples (~1.38M across the CNN and Daily Mail splits combined)
  • CNN_train = 380298
  • CNN_dev = 3924
  • CNN_test = 3198
  • DailyMail_train = 879450
  • DailyMail_dev = 64835
  • DailyMail_test = 53182
  • Additional mention: 100 examples from the CNN dataset (used in specific analyses)

Coverage centers on news articles for reading comprehension and reasoning-focused QA tasks. Other included benchmarks have different characteristics: CBT uses 21-sentence children's-book passages; bAbI provides synthetic tasks focused on 20 reasoning types with a limited vocabulary (100-200 words); MCTest is an open-domain RC challenge dataset.

Primary intended use cases are training and evaluating reading-comprehension and QA models on passage-level cloze tasks and reasoning categories. Recommended evaluation protocols are dataset-specific:

  • CBT: missing-word accuracy.
  • bAbI: per-category reasoning accuracy.
  • CNN/Daily Mail: standard RC QA evaluation with accuracy reported on development and test sets.

The central metric used across evaluations is accuracy, reported as percent accuracy on development and test splits. Baseline performance points cited include the 73.6% and 76.6% figures for CNN and Daily Mail respectively.
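Since every protocol above reduces to percent accuracy over a split, the metric itself is straightforward (a generic helper, not tied to any official scorer):

```python
def accuracy(predictions, gold):
    """Percent accuracy over aligned prediction/gold lists."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)
```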

The following list summarizes the most common intended use cases:

  • Train and evaluate reading-comprehension (RC) models on cloze-style passage tasks.
  • Benchmark neural and conventional RC systems on news-article QA.
  • Evaluate missing-word prediction models (CBT).
  • Assess multi-step reasoning models across reasoning categories (bAbI).
  • Compare model architectures and attention mechanisms on large-scale RC.

Limitations, Biases, and Common Failure Modes

Known limitations and quality issues include:

  • Entity anonymization can reduce realism and make human interpretation more difficult; anonymization may hinder model access to world knowledge.
  • Dependence on automated NER and coreference resolution introduces noisy annotations; coreference or NER failures can degrade dataset utility.
  • Reported data preparation and coreference errors contribute to dataset noisiness.
  • Synthetic benchmarks like bAbI have limited vocabulary (100-200 words) and simplified language, which limits realism.

Common failure modes observed are coreference errors, NER failures, and anonymization artifacts that introduce artificial constraints on understanding.

Practical Notes for Researchers

Design choices in modeling should account for the cloze-style answer constraint (answers restricted to entities or words in the passage) and the anonymization strategy. Feature engineering and model architectures that explicitly model entity occurrences, positional signals, and question–entity interactions have been used successfully. Evaluation should follow dataset-specific accuracy measures and use the provided train/dev/test splits.
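The candidate-restriction design choice can be implemented by masking model scores so that only entities occurring in the passage are predictable; this NumPy sketch assumes `scores` is a vector over the full entity-marker vocabulary:

```python
import numpy as np

def predict_entity(scores, candidate_ids):
    """Restrict prediction to entities present in the passage by masking
    all other entries to -inf before taking the argmax."""
    masked = np.full_like(scores, -np.inf)
    masked[candidate_ids] = scores[candidate_ids]
    return int(np.argmax(masked))
```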

Sources

https://arxiv.org/abs/1606.02858v2