WINOGRANDE — Large-scale Winograd Schema-style Benchmark and Bias-Reduction Approach
Overview
WINOGRANDE is a large-scale, Winograd Schema Challenge–inspired dataset aimed at evaluating pronoun resolution and commonsense reasoning in NLP systems. Released by the Allen Institute for Artificial Intelligence and researchers at the University of Washington, the collection targets the long-standing problem that models can exploit incidental dataset artifacts rather than demonstrate true reasoning. The dataset is distributed both as raw text and as pre-computed text embeddings, and is intended to be trivial for humans but hard for AI systems on pronoun-disambiguation and Winograd-style tasks.
Key emphases throughout the construction and evaluation pipeline include robust crowdsourced problem creation, embedding-based representations derived from RoBERTa, and a lightweight bias-reduction filter called AFLITE designed to reduce spurious dataset correlations.
Construction and composition
Problems were authored by crowdworkers on Amazon Mechanical Turk. Workers produced twin-sentence pairs constrained to 15–30 words with at least 70% word overlap between twins; two domains were targeted: social commonsense and physical commonsense. Each problem is formatted as a fill-in-the-blank (the blank corresponds to one of two names in context), with two answer options in most cases and up to five candidate antecedents in other variants.
A specialized embedding pipeline, RoBERTaembed, used a RoBERTa model fine-tuned on 6,000 instances to compute embeddings for the remaining instances; the 6,000 instances used for fine-tuning were then discarded from the final dataset. Validation required majority agreement among three crowd workers per question, ensuring unambiguous answers and discouraging problems solvable by simple word association.
Reported outcomes of the crowdsourcing and quality-control process include 77,000 questions collected, 53,000 deemed valid, and 24,000 invalid questions discarded. The dataset includes twin-sentence pairs and variants intended to reduce annotation artifacts common in large-scale collections.
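As an illustration of the twin-sentence, fill-in-the-blank format described above, a pair might be represented as below. The wording and field names here are illustrative sketches, not taken verbatim from the released files:

```python
# Hypothetical twin-sentence pair in the style described above;
# wording and field names are illustrative, not the official schema.
twin = [
    {"sentence": "The trophy doesn't fit into the suitcase because _ is too large.",
     "options": ["the trophy", "the suitcase"], "label": 1},
    {"sentence": "The trophy doesn't fit into the suitcase because _ is too small.",
     "options": ["the trophy", "the suitcase"], "label": 2},
]

def resolve(problem: dict) -> str:
    """Fill the blank with the gold option (labels are 1-based)."""
    return problem["sentence"].replace("_", problem["options"][problem["label"] - 1])
```

Note how a single trigger word ("large" vs "small") flips the correct antecedent, which is what makes twins useful for detecting word-association shortcuts.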
Data specification
Each unit of observation is an individual Winograd-style problem, typically represented as a twin-sentence pair with a binary label. Data is provided both as raw text and as pre-computed numerical embedding vectors.
Fields and formats observed in the collection include:
- twin_sentences: string pairs of nearly identical questions used for pronoun resolution.
- options / choices: two answer-option strings in most instances (up to five in some variants).
- embedding_vector: numeric floating-point vectors (pre-computed RoBERTa embeddings).
- label: integer in {1, 2} indicating the correct option index.
- context_with_blank / problem_text: text containing the pronoun or a blank corresponding to one of two names.
Each data instance can be represented as (X, y), where X is either the text or the embedding_vector and y is the single correct option. The label schema is binary (1 or 2); for gender-bias diagnostics, instances may additionally be categorized as Gotcha vs non-Gotcha.
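A minimal sketch of the (X, y) mapping, using the field names listed above (the helper itself is an assumption, not part of the dataset tooling):

```python
from typing import List, Tuple

def to_xy(record: dict) -> Tuple[Tuple[str, List[str]], int]:
    """Map a raw record to (X, y): X pairs the blanked context with its
    answer options; y is the zero-based index of the correct option.
    Field names follow the specification above; labels are 1 or 2."""
    x = (record["context_with_blank"], record["options"])
    y = int(record["label"]) - 1
    return x, y

rec = {"context_with_blank": "Ann asked Mary when the library closes, because _ had forgotten.",
       "options": ["Ann", "Mary"],
       "label": 1}
x, y = to_xy(rec)
```

Mapping labels to zero-based indices is a common convenience when feeding multiple-choice classifiers.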
Bias-reduction method: AFLITE
AFLITE is an ensemble, score-based filtering algorithm developed to mitigate dataset biases that models could exploit. The method operates on model-derived features (in WINOGRANDE, the RoBERTaembed embeddings) and removes instances that appear predictable by simple classifiers.
Core elements and parameters reported for AFLITE applied to WINOGRANDE:
- Ensemble of linear classifiers produces per-instance scores (ratio of correct predictions across ensemble members).
- Iterative top-k removal: in the reported run AFLITE used m = 10,000, n = 64, k = 500, and τ = 0.75. The algorithm removes the top-k instances with score ≥ τ and repeats until fewer than k instances qualify for removal or the remaining pool is smaller than m.
- AFLITE filtering reduces distributional artifacts: reported KL-divergence reduction to 0.12 for the debiased WINOGRANDE, versus PMI-based filtering reducing to 2.42 and random reduction to 2.51.
AFLITE was used to produce a debiased subset (WINOGRANDE debiased) and to set aside a larger pool of filtered-out instances for other training and resource uses. The reported effect is a dramatic reduction in measurable bias signals, though not all instances in the debiased subset are twin pairs (approximately one-third are not).
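The iterative ensemble-filtering procedure above can be sketched as follows. This is a simplified illustration, not the reference implementation: the classifier choice, partitioning scheme, and stopping details are assumptions, and the default parameters mirror the reported m, n, k, and τ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite(X, y, n_ensemble=64, train_size=10_000, k=500, tau=0.75, seed=0):
    """Sketch of AFLITE-style adversarial filtering (illustrative only).

    Each round: train an ensemble of linear classifiers on random subsets,
    score every held-out instance by the fraction of ensemble members that
    predict it correctly, then drop the top-k instances with score >= tau.
    Stop when fewer than k instances qualify or the pool shrinks below
    train_size. Returns the indices of the retained instances.
    """
    rng = np.random.default_rng(seed)
    keep = np.arange(len(X))
    while len(keep) > train_size:
        hits = np.zeros(len(keep))
        counts = np.zeros(len(keep))
        for _ in range(n_ensemble):
            perm = rng.permutation(len(keep))
            tr, te = perm[:train_size], perm[train_size:]
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X[keep[tr]], y[keep[tr]])
            hits[te] += clf.predict(X[keep[te]]) == y[keep[te]]
            counts[te] += 1
        # Per-instance score: ratio of correct predictions across members.
        scores = np.divide(hits, counts, out=np.zeros_like(hits), where=counts > 0)
        ranked = np.argsort(scores)[::-1]
        removable = ranked[scores[ranked] >= tau][:k]
        if len(removable) < k:
            break
        keep = np.delete(keep, removable)
    return keep
```

Intuitively, instances that a simple linear probe over the embeddings gets right across most ensemble members carry exploitable surface cues, so removing them leaves a harder, less biased pool.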
Fine-tuning, models, and usage
The embedding pipeline relied on fine-tuning RoBERTa on 6,000 instances (the RoBERTaembed step) to compute embeddings for the remainder of the corpus; those 6,000 instances were discarded from the final public set. Beyond embeddings, the dataset is recommended for fine-tuning models and for transfer learning to Winograd-style tasks.
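At inference time, a fine-tuned model typically scores each candidate completion and picks the higher-scoring one. A minimal, model-agnostic sketch (the `scorer` callable is an assumption standing in for any sequence-plausibility function, such as a fine-tuned LM's log-likelihood):

```python
from typing import Callable, List

def predict_option(scorer: Callable[[str], float],
                   context: str, options: List[str]) -> int:
    """Fill the blank with each option, score the resulting sentences,
    and return the 1-based label of the preferred option."""
    scores = [scorer(context.replace("_", opt)) for opt in options]
    return 1 + max(range(len(scores)), key=lambda i: scores[i])
```

For example, with a toy scorer that prefers shorter sentences, `predict_option(lambda s: -len(s), "x _ y", ["a", "bbbb"])` selects option 1.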
Reported baseline and state-of-the-art model results (development and test where noted) include:
- RoBERTa: dev 79.3%, test 79.1%.
- BERT: dev 65.8%, test 64.9%.
- RoBERTa-DPR: dev 59.4%, test 58.9%.
- BERT (local context): dev 52.5%, test 51.9%.
- RoBERTa (local context): dev 52.1%, test 50.0%.
- Ensemble language models: dev 53.0%, test 50.9%.
- Majority baseline: 65.1%.
Aggregate SOTA ranges on WINOGRANDE were reported as 59.4%–79.1% depending on model and training data size; human performance was reported as 94.0%. Training-size sensitivity was observed (performance ranged from 59% to 79% when training size varied from 800 to 41K instances).
Evaluation protocol and recommended use
WINOGRANDE is positioned as a benchmark for evaluating machine commonsense reasoning and pronoun-resolution capabilities, for transfer learning to other WSC-style benchmarks, and for diagnosing and comparing bias-reduction methods.
Suggested evaluation practices and metrics reported include:
- Accuracy on development and test splits (per-dataset and cross-dataset transfer).
- Per-instance score defined as the ratio of correct predictions across ensemble classifiers (used by AFLITE).
- KL-divergence between distributions p(d1, y=1) and q(d1, y=2) to quantify distributional bias.
- PMI-based filtering as a baseline for bias reduction; comparison to random data reduction.
- Evaluation on related benchmarks after fine-tuning: WSC, PDP, SuperGLUE-WSC, DPR, KnowRef, Winogender.
- Diagnostic metrics for gender bias (Gotcha vs non-Gotcha), and accuracy differences reported as ∆F and ∆M where applicable.
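The KL-divergence metric listed above can be sketched for empirical categorical distributions, e.g. token-feature counts collected separately from y=1 and y=2 instances. This is an illustrative implementation with additive smoothing, not the paper's exact estimator:

```python
import math
from collections import Counter

def kl_divergence(p_counts: Counter, q_counts: Counter, eps: float = 1e-9) -> float:
    """Approximate KL(P || Q) between two empirical categorical
    distributions given as raw counts. A tiny eps smooths zero counts,
    so the result is approximate for sparse vocabularies."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    kl = 0.0
    for w in vocab:
        p = p_counts.get(w, 0) / p_total + eps
        q = q_counts.get(w, 0) / q_total + eps
        kl += p * math.log(p / q)
    return kl
```

Identical count distributions yield a divergence of zero; the more the y=1 and y=2 feature distributions diverge, the larger the value, which is why a drop (e.g. to the reported 0.12 after AFLITE) indicates reduced distributional bias.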
Scale, splits, and key counts
- 44k-question Winograd-inspired dataset; reported variants include WINOGRANDE all and WINOGRANDE debiased.
- WINOGRANDE debiased: 12,282 problems (train=9,248; dev=1,267; test=1,767).
- WINOGRANDE all: 43,972 problems (train=40,938; dev=1,267; test=1,767).
- 77,000 questions collected; 53,000 validated as valid; 24,000 invalid discarded; 38,000 twin sentences reported.
- 6,000 instances used for RoBERTa fine-tuning and discarded; RoBERTaembed used to embed remaining instances.
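The split sizes reported above are internally consistent, as a quick arithmetic check confirms:

```python
# Split sizes as reported for the two WINOGRANDE variants.
splits_all = {"train": 40_938, "dev": 1_267, "test": 1_767}
splits_debiased = {"train": 9_248, "dev": 1_267, "test": 1_767}

assert sum(splits_all.values()) == 43_972       # WINOGRANDE all
assert sum(splits_debiased.values()) == 12_282  # WINOGRANDE debiased
```

Note that the dev and test splits are shared across both variants; only the training pool differs.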
Limitations and open questions
Known limitations and failure modes documented in association with WINOGRANDE and the bias-reduction approach include:
- Residual word-association biases and dataset-specific spurious correlations may persist despite AFLITE filtering; AFLITE reduces but may not eliminate all biases.
- Structural correlations (for example, sentiment correlations between answer options and target pronouns) were detected in some instances.
- Annotation artifacts remain a concern in large-scale datasets, and a reported 24,000 invalid questions were discarded during validation.
- Specific benchmarks retain measurable bias: an example cited is that 13.5% of PDP questions may still exhibit word-association bias.
- Common failure modes include models exploiting word-association shortcuts and other spurious cues.
Resources and availability
The project reported release of resources including datasets, crowdsourcing interfaces, and models, with resources hosted at winogrande.allenai.org.