SuperGLUE Benchmark and Dataset Overview
Overview and Purpose
SuperGLUE is a benchmark designed to provide a more rigorous test of general-purpose language understanding than its predecessor GLUE. It introduces eight challenging natural language understanding tasks with diverse formats and low-data regimes, a modular toolkit for pretraining and multitask learning, and a public leaderboard. The benchmark aims to provide a hard-to-game single-number metric and to maintain attention on human-vs-machine performance gaps by evaluating tasks that expose remaining headroom in model capabilities. The benchmark focuses on English language understanding and includes formats beyond sentence classification, such as coreference resolution and question answering.
Key Tasks and Data Characteristics
The benchmark comprises eight tasks, aggregated from existing public datasets and evaluation suites. Key properties of the suite include text modality, sentence-pair or context-plus-question units of observation, and a mix of binary, multi-class, and multi-label label schemas. Notable tasks and formats include BoolQ (short passage + yes/no question), CB (premise + embedded clause with hypothesis; three-class entailment), COPA (premise + two options for cause or effect), MultiRC (context paragraph + question + multiple true/false answer options), and ReCoRD (CNN/Daily Mail articles + cloze-style questions with candidate entities).
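To make the formats concrete, here are minimal sketches of a BoolQ-style and a COPA-style instance. The field names follow common dataset-loader conventions and the example text is illustrative, not quoted from the released data:

```python
# Illustrative instances only; field names follow common loader conventions.
boolq_example = {
    "passage": "The Amazon River discharges more water than any other river.",
    "question": "is the amazon the largest river by discharge",
    "label": 1,  # binary: 1 = yes, 0 = no
}

copa_example = {
    "premise": "The man broke his toe.",
    "question": "cause",  # asks for the more plausible CAUSE (vs. "effect")
    "choice1": "He got a hole in his sock.",
    "choice2": "He dropped a hammer on his foot.",
    "label": 1,  # index of the correct alternative (here, choice2)
}
```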
Task sizes reported in the benchmark materials include:
- BoolQ: train=9427, dev=3270, test=3245
- CB: train=250, dev=57, test=250
- COPA: train=400, dev=100, test=500
- MultiRC: train=5100, dev=953, test=1800
- ReCoRD: train=101000, dev=10000, test=10000
- RTE: train=2500, dev=278, test=300
- WiC: train=6000, dev=638, test=1400
- WSC: train=554, dev=104, test=146
Nearly half of the tasks have fewer than 1,000 training examples, and all but one have fewer than 10,000. Common label schemas include two-class entailment vs. not_entailment, BoolQ yes/no, COPA two-choice (cause/effect), MultiRC multi-label true/false per answer, and ReCoRD multiple-choice over candidate entities.
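The low-data claim can be checked directly against the training-split counts in the table above; a small Python sketch:

```python
# Training-split sizes copied from the table above.
TRAIN_SIZES = {
    "BoolQ": 9427, "CB": 250, "COPA": 400, "MultiRC": 5100,
    "ReCoRD": 101000, "RTE": 2500, "WiC": 6000, "WSC": 554,
}

def low_data_tasks(sizes, threshold):
    """Return, sorted, the tasks whose training split is below `threshold`."""
    return sorted(name for name, n in sizes.items() if n < threshold)

print(low_data_tasks(TRAIN_SIZES, 1000))   # ['CB', 'COPA', 'WSC']
print(low_data_tasks(TRAIN_SIZES, 10000))  # every task except ReCoRD
```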
Representative Tasks (selected)
- BoolQ: binary yes/no question answering from short passages.
- CB: three-class textual entailment.
- COPA: commonsense causal reasoning with two plausible alternatives.
- MultiRC: multi-answer reading comprehension with F1 and exact match metrics.
- ReCoRD: cloze-style reading comprehension predicting masked entities.
Data Sources, Collection, and Structure
The benchmark collects and reuses existing public data from a variety of sources and prior tasks to avoid creating new proprietary datasets. Sources mentioned include CNN and Daily Mail articles, Wikipedia, RTE datasets (RTE1, RTE2, RTE3, RTE5), plays (including early English snippets for continuation detection), Wikipedia sentence pairs, WordNet, VerbNet, and other curated corpora. Some tasks are framed or reformulated into the GLUE data format and sentence-pair structures (text_pair fields such as snippet_1 and snippet_2). The collection process includes competition-based gathering for certain textual entailment data and reuse/recasting of prior corpora (for example, converting RTE datasets into a merged two-class entailment dataset).
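The recasting step for textual entailment can be sketched as a simple label mapping: three-way labels from the source RTE corpora are collapsed into the merged two-class scheme (the function name is illustrative):

```python
def recast_to_two_class(label):
    """Collapse a three-way entailment label into the merged two-class scheme."""
    return "entailment" if label == "entailment" else "not_entailment"

# "neutral" and "contradiction" both map to "not_entailment".
merged = [recast_to_two_class(l) for l in ["entailment", "neutral", "contradiction"]]
```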
Annotation, Training, and Quality Control
Annotation is performed by human annotators, including crowdworkers recruited via Amazon Mechanical Turk for test-set samples and specific tasks. The annotation process commonly follows a two-step training-then-annotation workflow, with a training phase modeled after prior task instructions and a subsequent annotation phase. Task-specific instructions or FAQs are provided; some tasks include interactive training elements such as a "Check Work" button that reveals true labels during training.
Quality control measures and annotation practices include majority-vote aggregation to compute human performance (used for WSC, COPA, CommitmentBank, and others), validation subsets with inter-annotator agreement thresholds (CB subset: >80% agreement), and exclusion of low inter-annotator agreement examples from benchmark scoring in some cases. Known annotation issues cited include data imbalance (e.g., relatively fewer neutral examples in CB) and the influence of instruction clarity and crowd-worker engagement on human baseline estimates.
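The majority-vote aggregation used to compute human performance can be sketched as follows (a minimal version; the benchmark's own scripts may break ties and aggregate differently):

```python
from collections import Counter

def majority_vote(labels):
    """Most common label among annotators (ties broken by first occurrence)."""
    return Counter(labels).most_common(1)[0][0]

def human_accuracy(annotations_per_example, gold_labels):
    """Accuracy of the majority-vote label against gold, per example."""
    votes = [majority_vote(anns) for anns in annotations_per_example]
    return sum(v == g for v, g in zip(votes, gold_labels)) / len(gold_labels)
```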
Recommended Use and Evaluation Protocol
Intended use cases center on assessing progress toward general-purpose language understanding and evaluating transfer learning and multitask methods for NLP. The benchmark is positioned as a research and evaluation platform for application-agnostic transfer learning in English.
Recommended evaluation practices include task-specific metrics and an aggregate single-number performance metric computed across tasks. Task-level recommended metrics include:
- BoolQ: accuracy
- CB: accuracy and macro-F1 (unweighted average of per-class F1)
- COPA: accuracy
- MultiRC: F1 over answer-options (F1_a) and exact match (EM)
- ReCoRD: token-level F1 max over mentions and EM
- Aggregate: average across tasks to compute an overall benchmark score (as in GLUE)
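The aggregation rule above can be sketched in a few lines: tasks with multiple metrics are averaged internally first so that every task contributes equally to the overall score (function name and example numbers are illustrative):

```python
def benchmark_score(task_metrics):
    """GLUE-style aggregate: average each task's metrics, then average tasks."""
    per_task = [sum(vals) / len(vals) for vals in task_metrics.values()]
    return sum(per_task) / len(per_task)

score = benchmark_score({
    "BoolQ": [80.0],      # accuracy only
    "CB": [90.0, 70.0],   # accuracy and macro-F1, averaged first
})
# -> 80.0
```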
Additional metrics referenced across the benchmark materials include macro-average F1, Matthews' correlation (MCC), Pearson correlation, and exact match variants. Evaluation protocols emphasize comparing machine performance with human baselines and tracking the difference between human and machine results as a primary analytic metric.
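The ReCoRD-style "token-level F1, max over mentions" metric can be sketched as below. This is a simplified version using whitespace tokenization; official evaluation scripts typically also normalize case and punctuation:

```python
from collections import Counter

def token_f1(pred, gold):
    """Bag-of-tokens F1 between a prediction and one gold answer string."""
    pred_toks, gold_toks = pred.split(), gold.split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def record_f1(pred, gold_mentions):
    """ReCoRD-style score: best token-level F1 over all gold mentions."""
    return max(token_f1(pred, m) for m in gold_mentions)
```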
Baselines and Reported Results
Human performance estimates are provided for all benchmark tasks and are used as reference points. Strong model baselines emphasize transformer-based pretraining and fine-tuning, with BERT-based baselines repeatedly cited. Reported baseline and comparative results include:
- BERT-based baselines increased average SuperGLUE score by 25 points.
- RTE accuracy 86.3% (reported).
- Example baseline numbers cited: GPT: 72.8; BERT: 80.2; ELMo: 66.5.
- Task-level examples: PAWS-Wiki BERT accuracy 91.9% vs. human accuracy 84.0%; GAP BERT 91.0 F1 vs. human 94.9 F1 (without span); Ultra-Fine Entity Typing human F1 60.2 vs. machine F1 55.0.
- At GLUE launch, baseline accuracy was near random-chance (~56%); reported gaps between human and machine baselines averaged ~20 points, with the largest gap ~35 points on WSC and the smallest margins ~10 points on BoolQ, CB, RTE, WiC.
These baselines reflect the effectiveness of pretraining and fine-tuning approaches (notably transformer-based), demonstrate measurable transfer learning gains, and motivate continued development of multitask and unsupervised methods to close human–machine gaps.
Limitations, Biases, and Common Failure Modes
Known limitations and biases in the benchmark and constituent datasets include class imbalance (e.g., CB has relatively few neutral examples, and some tasks are dominated by negative examples, with roughly 90% negative labels in certain two-snippet continuation data), sensitivity to pronoun substitutions that indicates gender bias (Winogender and GAP findings), and limited coverage of gender-neutral or non-binary pronouns. The benchmark does not claim comprehensive coverage of all forms of social bias.
Common failure modes observed with baseline models and simple baselines include near-chance performance from most-frequent-class or CBOW baselines on several tasks, degraded performance on small datasets (WSC) without data augmentation, and persistent human–machine performance gaps (avg ~20 points; WSC gap ~35). Annotation- and instruction-related issues such as low crowd-worker engagement or unclear instructions can affect human baselines and thereby change the interpretation of model gaps.
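The most-frequent-class baseline mentioned above is trivial to implement; a minimal sketch, which on a balanced two-class test set sits near the 50% chance level noted earlier:

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test example."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)
```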
Access, Licensing, and Practical Notes
Data in the benchmark is expected to be available under licenses that allow use and redistribution for research purposes. The benchmark provides a public leaderboard and software tools to support pretraining, multitask learning, and transfer learning workflows. Practical deployment or evaluation should follow the task-specific evaluation metrics and adhere to task licensing and usage rules specified for leaderboard inclusion.
Key tasks in the benchmark (all eight):
- BoolQ
- CB
- COPA
- MultiRC
- ReCoRD
- RTE
- WiC
- WSC