SQuALITY — Long-Context, Question-Focused Summarization Dataset

Overview

SQuALITY is a crowdsourced, multi-reference dataset designed for question-focused abstractive summarization and long-form question answering on narrative texts. It is built primarily from Project Gutenberg science fiction short stories (1930s–1970s) and emphasizes faithful, original summaries collected under a peer-review style workflow. The core release reports 100 stories, 500 questions, and 2000 summaries, with SQuALITY v1.1 listing 127 stories. The dataset is released under a CC BY license and publicly available at https://github.com/nyu-mll/SQuALITY.

SQuALITY was created to address shortcomings of scraped or heuristic summarization datasets (noise, HTML artifacts, unfaithful summaries) by using trained writers, peer review, and incentives to produce high-quality reference summaries. It provides multiple reference summaries per question to support better evaluation of summarization systems and studies of automatic metric correlations with human judgments.

Data composition and format

Each observational unit is a short story paired with a set of questions and multiple reference summaries. Input documents are long: stories are typically 3,000–6,000 words and average ~5,200 tokens per story (excluding punctuation; tokenized with spaCy's en_core_web_sm), with a reported minimum of 3,473 and maximum of 6,165 tokens. Questions average 8.9 tokens (min 6; max 12). Plot summaries average 441.9 tokens (sd 90.9); other question responses average 185.9 tokens.
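The punctuation-excluding token counts above can be approximated as follows. This is a minimal sketch: the reported statistics use spaCy's en_core_web_sm tokenizer, which the regex tokenizer here only approximates, so exact counts will differ slightly.

```python
import re

def count_tokens(text: str) -> int:
    """Count word-like tokens, excluding standalone punctuation.

    Rough stand-in for spaCy-based counting with punctuation
    tokens removed; numbers will not match en_core_web_sm exactly.
    """
    return len(re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text))

story = "The ship drifted, silent, past Mars. Don't look back!"
print(count_tokens(story))  # → 9 (punctuation marks are not counted)
```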

The dataset contains the following primary fields:

  • story_text: plain text (Project Gutenberg stories, predominantly science fiction)
  • question_text: crowd-sourced questions about each story
  • summary_texts / reference: multiple human-written reference summaries per question
  • response_text: writer responses corresponding to questions
  • ratings: human ratings for model outputs (Correctness, Coverage, Overall) on a 1–100 scale

Core structural properties: each story is paired with five questions (one fixed general plot question and four additional story-focused questions). For validation and training, there are four reference summaries per question; each (story, question, reference) tuple corresponds to a training instance, so one (story, question) input may map to four training examples (one per reference).
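Under this convention, one (story, question) input with four references expands into four supervised examples. A minimal sketch of that flattening (the field names here are illustrative, not the release's exact schema):

```python
def flatten_examples(records):
    """Expand each (story, question, references) record into one
    training instance per reference summary."""
    instances = []
    for rec in records:
        for ref in rec["references"]:
            instances.append({
                "input": (rec["story"], rec["question"]),
                "target": ref,
            })
    return instances

records = [{"story": "S1", "question": "Q1",
            "references": ["r1", "r2", "r3", "r4"]}]
print(len(flatten_examples(records)))  # → 4
```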

Collection, annotation, and quality control

Stories were selected from Project Gutenberg (science fiction short stories from roughly 1930–1970). Collection used a crowdsourcing pipeline involving two distinct worker populations (Upwork writers and undergraduate workers). The data collection protocol includes a writing phase and a validation phase:

  • Writing phase: one worker writes four story-specific questions per story (in addition to the fixed general plot question), and four writers each answer all five questions; the answer to the fixed plot question serves as a general story summary.
  • Validation phase: peer reviewers rank other writers' responses, highlight typos and factual errors, and provide written feedback. Reviewers rate responses on three properties: Correctness, Coverage, and Overall (scale 1–100). Rankings and agreement determine bonuses and quality incentives.

Quality-control features include peer-review ranking, monetary bonuses tied to ranking agreement (ranging from $0.50 to $2.50 depending on ranking position), and a validation workflow combining ranking with written feedback. Reported inter-annotator agreement on pairwise ranking comparisons is approximately 76%. Typical task times include 20–40 minutes of reading per story, 40–120 minutes of writing per writer per story, and 20–30 minutes per validation task. Average bonuses are reported as $1.25 per response and $6.25 per story.
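The ~76% pairwise agreement figure can be computed by checking, for every pair of responses, whether two reviewers order them the same way. A minimal sketch, assuming rankings are represented as lists of response IDs from best to worst (a representation detail not specified here):

```python
from itertools import combinations

def pairwise_agreement(ranking_a, ranking_b):
    """Fraction of response pairs that two reviewers order identically.

    Both rankings are lists of the same response IDs, best first.
    """
    pos_a = {r: i for i, r in enumerate(ranking_a)}
    pos_b = {r: i for i, r in enumerate(ranking_b)}
    pairs = list(combinations(ranking_a, 2))
    agree = sum(
        (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs
    )
    return agree / len(pairs)

# Reviewers disagree only on the relative order of w2 and w3.
print(pairwise_agreement(["w1", "w2", "w3", "w4"],
                         ["w1", "w3", "w2", "w4"]))
```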

Intended uses and evaluation protocol

SQuALITY is intended primarily as a research benchmark for long-context abstractive summarization and question-focused summarization of narrative texts. It is suitable for developing and evaluating long-form QFS and LFQA systems, studying the diversity of acceptable summaries, and analyzing correlations between automatic metrics and human judgments.

Recommended evaluation practices emphasize human evaluation as the gold standard. The suggested protocol includes presenting annotators with full stories and candidate outputs, using end-to-end human-in-the-loop ranking, and collecting ratings on Correctness, Coverage, and Overall (1–100 scale) from multiple annotators (e.g., three annotators per task on 20 stories / 100 questions in reported experiments). Standard automatic metrics can be reported but should be combined with human evaluation due to weak correlations in many settings.

The dataset supports the following evaluation metrics: ROUGE-1, ROUGE-2, ROUGE-L, METEOR, BERTScore, and F1. Empirical findings indicate that naive multi-reference ROUGE correlates poorly with human judgments and that automatic metrics often show near-zero Pearson correlation with human ratings when restricted to model-only or human-only subsets; correlations become significantly positive only when considering combined sets.
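Libraries such as rouge-score implement the full metrics; the sketch below hand-rolls unigram ROUGE-1 F1 and takes the maximum over references, one common multi-reference aggregation (the aggregation used in the paper's experiments may differ):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1 between a candidate and one reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def multi_ref_rouge1(candidate, references):
    """Max-over-references aggregation across multiple gold summaries."""
    return max(rouge1_f1(candidate, r) for r in references)

refs = ["the crew escapes the station",
        "the crew flees the doomed station"]
print(multi_ref_rouge1("the crew escapes", refs))
```

Even with this multi-reference aggregation, lexical-overlap scores can stay flat across candidates that humans rate very differently, which is the weak-correlation behavior reported above.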

Data users should note that the dataset is primarily intended for benchmarking and research; training production systems on the full texts requires caution because of historical content and potential biases. The elements most relevant to evaluation setup are:

  • story text and length properties
  • question and response formats
  • multiple reference summaries per question (four in validation)
  • ratings on Correctness, Coverage, and Overall (1–100)

Baseline models and reported performance

Baseline experiments demonstrate that publicly available medium-scale pretrained models struggle compared to human-written summaries on this long-context, question-focused task. Key reported results follow:

  • LED (160M parameters): ROUGE-1: 27.7; ROUGE-2: 5.9; ROUGE-L: 17.7; METEOR: 16.5; BERTScore: 82.7. Observed failure modes include long, repeated sentences and degenerate outputs.
  • PEGASUS (540M parameters): ROUGE-1: 38.2; ROUGE-2: 9.0; ROUGE-L: 20.2; METEOR: 23.4; BERTScore: 84.9.
  • BART (340M parameters): ROUGE-1: 40.2; ROUGE-2: 10.4; ROUGE-L: 20.8; METEOR: 24.5; BERTScore: 85.3. Table 5 reported human-evaluation scores for BART: Corr.: 34.8; Coverage: 15.6; Overall: 18.1.
  • BART+DPR (340M parameters): ROUGE-1: 41.5; ROUGE-2: 11.4; ROUGE-L: 21.0; METEOR: 26.1; BERTScore: 85.5. Table 5 reported scores for BART+DPR: Corr.: 45.4; Coverage: 24.3; Overall: 27.9.
  • Human references: ROUGE-1: 46.6; ROUGE-2: 12.5; ROUGE-L: 22.7; METEOR: 30.6; BERTScore: 86.2. Table 5 human ratings: Corr.: 94.1; Coverage: 88.8; Overall: 91.3.

Baseline analyses conclude that the best-performing approach among those tested is an extract-then-summarize pipeline that uses input questions to retrieve story sentences, but even these methods remain substantially below human performance. Human references have average ratings around or above 90 for all three properties, while BART and BART+DPR have average ratings below 50 for those properties in reported experiments.
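The extraction stage of such a pipeline can be sketched with plain lexical overlap standing in for a learned retriever such as DPR (illustrative only; the actual baselines use dense retrieval):

```python
import re

def extract_sentences(story_sentences, question, k=3):
    """Score story sentences by unigram overlap with the question and
    return the top-k, kept in original story order. A lexical stand-in
    for a learned retriever such as DPR."""
    q_tokens = set(re.findall(r"\w+", question.lower()))
    scored = [
        (len(q_tokens & set(re.findall(r"\w+", s.lower()))), i, s)
        for i, s in enumerate(story_sentences)
    ]
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:k]
    return [s for _, _, s in sorted(top, key=lambda t: t[1])]

sents = ["The captain repaired the engine.",
         "A storm approached the colony.",
         "The engine failed again near Venus."]
# Keeps the two engine-related sentences, in story order.
print(extract_sentences(sents, "What happened to the engine?", k=2))
```

The extracted sentences would then be concatenated with the question and passed to the summarizer (e.g., BART) as a shorter, query-relevant input.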

Limitations, biases, and open questions

SQuALITY has several known limitations and open questions that affect use and interpretation:

  • Temporal and topical coverage is limited: source stories date from 1930–1970. These historical texts can contain dated or potentially harmful stances on topics such as race and gender; models trained on full texts may reproduce such stances.
  • Automatic metrics correlate poorly with human judgments on this dataset, even with multiple references, limiting the effectiveness of automatic evaluation as a proxy for human quality assessments.
  • Dataset scale is modest (100 stories; 127 stories reported for v1.1), which may restrict supervised fine-tuning for approaches requiring extensive parameter adaptation (for example, additional positional embeddings).
  • Crowdsourced collection incurs variability in response quality and expensive validation; quality depends on worker populations and incentive structures.
  • Observed model failure modes include degenerative outputs, such as repeated long sentences from LED, and general difficulty learning the task from medium-scale pretrained models.

Open questions emphasized by the dataset authors include how best to design metrics and evaluation protocols that account for reference diversity, how to scale reliable human evaluation of long inputs, and how to leverage multi-reference data to improve automatic metric correlations.

Access and license

The dataset and associated writings are released under a CC BY license; the writer-contributed responses are publicly released to support AI research and development. The dataset distribution is consistent with QuALITY preprocessing and is available at the project repository: https://github.com/nyu-mll/SQuALITY

Sources

https://arxiv.org/abs/2205.11465v1