CoNLL-2003 / CoNLL++ — Temporal Generalization in Named Entity Recognition

Datasets and data sources

The primary evaluation material centers on the CoNLL-2003 English NER collection and a newly created modern test set called CoNLL++. CoNLL++ is a manually annotated CoNLL-style test set drawn from Reuters news articles published Dec 5–7, 2020 and tokenized with the original CoNLL-2003 tokenizer to match the original test set’s token counts. Source corpora and pre-training datasets referenced in the experiments include the One Billion Word Benchmark (WMT11 English monolingual data, 2007–2011), REALNEWS (news articles, 2016–2019), and the WMT20 English dataset (2007–2021; experiments used 2007–2019 to avoid temporal overlap). The collection process included random sampling (for example, 1B tokens sampled from REALNEWS for Flair/ELMo experiments) and removal of tabular data that inflated entity counts.

Data format follows the CoNLL-2003 convention: sentences are separated by blank lines and token columns by spaces. Labels cover the four standard NER classes: LOC, MISC, ORG, and PER. Units of observation are tokens and Reuters news articles; the CoNLL-2003 test set comprises 231 articles. Aggregate entity counts across the 71 files are ORG 872, PER 889, LOC 657, and MISC 159. Time ranges span 1996–2020 for downstream test data and 2007–2019 / 2016–2019 for pre-training corpora.
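
A minimal reader for this format might look like the following sketch; it assumes the CoNLL-2003 column layout (token first, NER tag last) and is not the original data-loading code:

```python
def read_conll(path):
    """Parse a CoNLL-2003-style file: one token per line, columns
    separated by whitespace, sentences separated by blank lines.
    Returns a list of sentences, each a list of (token, ner_tag) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Blank lines end a sentence; -DOCSTART- marks a document boundary.
            if not line or line.startswith("-DOCSTART-"):
                if current:
                    sentences.append(current)
                    current = []
                continue
            cols = line.split()
            current.append((cols[0], cols[-1]))  # token, NER label
    if current:
        sentences.append(current)
    return sentences
```

Taking the last column as the NER tag keeps the reader agnostic to whether the intermediate POS/chunk columns are present.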

Annotation and quality

Annotation followed a CoNLL-style NER protocol using the BRAT interface. Two annotators contributed: the first author produced a gold standard for many articles, while the second author annotated the remainder; articles annotated only by the second author were reviewed before being accepted as gold. Quality control measured token-level agreement, yielding Cohen's Kappa = 97.42. Manual reannotation produced F1 = 95.46 against the original CoNLL-2003 gold annotations and F1 = 96.23 when the second author's annotations were taken as gold.
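
A token-level Cohen's Kappa like the one reported (97.42, on a 0–100 scale) can be computed from two aligned label sequences; the sketch below is a generic implementation, not the authors' code:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Token-level Cohen's Kappa between two annotators' aligned
    NER tag sequences, on a 0-1 scale."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of tokens with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Multiplying the result by 100 gives the percentage-style scale used in the reported figure.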

Experiments and evaluation protocol

Experiments are designed to disentangle temporal drift from test-reuse effects on NER generalization. The evaluation protocol used in the reported experiments includes:

  • Fine-tune pre-trained checkpoints on the CoNLL-2003 training set for 10 epochs; use the dev set to select the best epoch and hyperparameters; evaluate five times with different random seeds on the CoNLL-2003 test set and on CoNLL++ and report the average F1 (with standard deviations).
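
The seed-averaging step above can be sketched as follows; the five F1 values are hypothetical stand-ins for scores from separate fine-tuning runs:

```python
import statistics

def aggregate_runs(f1_scores):
    """Average F1 over repeated runs and report the sample standard
    deviation, mirroring the five-seed reporting protocol."""
    mean = statistics.mean(f1_scores)
    std = statistics.stdev(f1_scores) if len(f1_scores) > 1 else 0.0
    return mean, std

# e.g. test-set F1 from five fine-tuning runs with different random seeds
mean_f1, std_f1 = aggregate_runs([92.1, 92.4, 91.9, 92.2, 92.0])
```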

Additional experimental controls include avoiding pre-training data that temporally overlaps with CoNLL++ and reporting delta metrics such as ∆F1 (%) between CoNLL++ and CoNLL-2003. Evaluation metrics used across experiments include F1 score (averaged over five runs, with standard deviations), ∆F1 and ranking changes between models, token-level Cohen's Kappa, and perplexity for embedding models.
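
A ∆F1 (%) value can be computed as a relative change between the two test sets; whether the paper reports relative change or absolute percentage-point differences is an assumption in this sketch:

```python
def delta_f1(f1_orig, f1_new):
    """Relative F1 change (%) from the original CoNLL-2003 test set to
    CoNLL++. Negative values mean degradation on the newer data.
    (Relative change is one plausible reading of the reported
    Delta-F1 (%); an absolute point difference is another.)"""
    return 100.0 * (f1_new - f1_orig) / f1_orig
```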

Models, pretraining corpora, and factors affecting generalization

The empirical comparisons evaluate more than 20 NER models and embeddings, including LSTM-based contextualized models (Flair, ELMo) and transformer-based models (RoBERTa, BERT-Large, T5, and larger variants such as LUKE-Large and T5-3B). Experiments vary the pre-training corpora to measure temporal-alignment effects: pre-training on older corpora (e.g., the One Billion Word Benchmark, 2007–2011) versus more recent corpora (e.g., REALNEWS, 2016–2019) produced measurable differences in downstream generalization.

Four factors repeatedly cited as improving transfer to modern test data are: modern transformer-based architecture, a large number of parameters, a large amount of fine-tuning data, and temporally closer pre-training data. Continued pre-training of RoBERTa on temporally closer data (2007–2019) improved performance on CoNLL++, and a positive correlation between pre-training year and F1 delta was reported (correlation = 0.55).

Key contributions

  • Created CoNLL++ test set (2020 Reuters data) modeled after the CoNLL-2003 test set.
  • Empirical study of >20 NER models trained on CoNLL-2003 and evaluated on temporally shifted data.
  • Evidence that performance deterioration on modern data is primarily due to temporal misalignment rather than adaptive overfitting.
  • Demonstrated that pre-training on temporally closer data (e.g., REALNEWS) improves generalization for LSTM-based embeddings (Flair, ELMo) compared to older corpora (e.g., One Billion Word Benchmark).
  • Quantified embedding perplexities and ∆F1 outcomes showing temporal effects (e.g., ELMo-RN ∆F1 = -1.43% vs. BERT-Large ∆F1 = -2.01% in the closest-data scenario).
  • Highlighted four factors that improve generalization and called for more annotated test sets to study modern generalization in NLP.

Findings and reported metrics

Results emphasize temporal drift as a dominant factor in NER generalization. Key reported findings include:

  • No widespread adaptive overfitting to the original CoNLL-2003 test set was found; reported improvements were larger on CoNLL++, indicating temporal misalignment explains much of the degradation.
  • Transformer-based and larger models often generalize better; RoBERTa and T5 showed no evidence of performance degradation when fine-tuned on a 20-year-old public dataset in the reported experiments.
  • Continued pre-training on temporally closer data improves downstream F1; the correlation between pre-training year and F1 delta was measured as 0.55 for RoBERTa experiments.
  • Embedding-level perplexities reported for Flair and ELMo: forward 2.45; backward 2.46; baseline 2.42.
  • Delta F1 and ranking changes were used to quantify generalization shifts; specific comparative numbers include ELMo-RN ∆F1 = -1.43% and BERT-Large ∆F1 = -2.01% in the closest pre-training-data scenario.
  • Inter-annotator agreement: token-level Cohen's Kappa = 97.42. Manual reannotation F1 values reported as 95.46 and 96.23 against two annotation standards.
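
The F1 scores throughout are entity-level: a prediction counts only on an exact span-and-type match. A minimal sketch of span extraction from BIO tags and micro-F1, assuming conlleval-style matching (helper names are hypothetical):

```python
def extract_spans(tags):
    """Extract (type, start, end) entity spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        # A span closes at B-, O, or an I- whose type differs from the open span.
        if tag.startswith("B-") or tag == "O" or \
           (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                spans.append((etype, start, i))
            start, etype = (i, tag[2:]) if tag[0] in "BI" else (None, None)
    return spans

def micro_f1(gold_tags, pred_tags):
    """Entity-level micro F1: a predicted span counts only if both its
    type and exact boundaries match a gold span."""
    gold = set(extract_spans(gold_tags))
    pred = set(extract_spans(pred_tags))
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```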

Scale, coverage, and temporal aspects

Reported scale and coverage figures include dataset sizes of 46,435 and 46,587 tokens, 1B tokens sampled from REALNEWS for the Flair/ELMo experiments, the roughly one billion words of the One Billion Word Benchmark, and 231 articles in the CoNLL-2003 test set. Coverage is English-language, Reuters-based news NER across 1996–2020; pre-training data time ranges are explicitly noted (e.g., 2007–2011, 2016–2019, 2007–2019). Removing tabular data raised the average sentence length to 18.50 tokens and corrected previously inflated entity counts. A reported diminishing-return slope on CoNLL++ of 2.729 (>1) was interpreted as indicating no diminishing returns in the measured regime.

Limitations and open questions

Known limitations and biases highlighted include temporal drift between pre-training corpora and downstream test sets, which can cause degradation in generalization; tabular data in older corpora that inflates entity counts (affecting average sentence length and entity distributions); and test reuse risks leading to overestimation of progress. The analyses emphasize that temporal misalignment of pre-training data is a main driver of performance differences across models and that continuous updates or temporal adaptation do not always substitute for pre-training on temporally aligned corpora.

Sources

https://arxiv.org/abs/2212.09747v2