BLOOM family — multilingual open-access language models
Overview
The BLOOM family was developed by the BigScience workshop, a collaborative open-science project, with the stated aim of democratizing access to large language models and providing broad multilingual capabilities. The work emphasizes human involvement and local expertise in data curation and explores multilingual zero-shot task generalization, machine translation, and summarization. Key public claims include the creation of a 176 billion parameter open-access language model and variant models spanning multiple parameter scales.
Key highlights:
- 176B parameters (BLOOM-176B) and a spectrum of smaller variants
- Trained on a multilingual ROOTS corpus and P3/xP3 prompt mixtures
- Prioritization of curated data and multilingual evaluation
- Use of ALiBi positional embeddings and tied embeddings
- Trained on Jean Zay supercomputer with stated compute and carbon metrics
Variants and language coverage
The family includes a range of model sizes and specialized variants. Reported names and parameter counts include:
- BLOOM-176B: "176B"
- Common smaller sizes: "560M", "1.1B", "1.7B", "3B", "7.1B" (appearing as variants like BLOOM-560M, BLOOM-1.1B, BLOOM-1.7B, BLOOM-3B, BLOOM-7.1B)
- BLOOM-1B7 (also written "BLOOM 1B7"): an alternate name for the "1.7B" variant
- BLOOMZ family: BLOOMZ and size-mirrored variants (e.g., BLOOMZ-560M through BLOOMZ-7.1B)
- SGPT specializations: SGPT-BLOOM-7.1B-msmarco ("7.1 billion") and SGPT-BLOOM-1.7B-nli ("1.7 billion")
Reported language coverage varies by source within the family:
- BLOOMZ: "46 languages"
- Other entries report "104 languages" for BLOOM and "BLOOM 1B7"
- ROOTS corpus and P3/xP3 mixtures underpin multilingual training and evaluation
Max context length for several variants is reported as "2048" tokens.
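To make the variant naming concrete, here is a minimal sketch of loading one of the smaller public checkpoints and generating greedily. It assumes the Hugging Face `transformers` library and the published checkpoint name `bigscience/bloom-560m`; it is illustrative only, not part of the BLOOM release itself.

```python
# Minimal sketch: load a small BLOOM variant and generate greedily until EOS.
# Assumes the Hugging Face `transformers` library and the public checkpoint
# "bigscience/bloom-560m"; prompts must fit the reported 2048-token context.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Translate to French: The cat sleeps on the mat.\nTranslation:"
inputs = tokenizer(prompt, return_tensors="pt")

# do_sample=False gives greedy decoding; generation stops at EOS or max_new_tokens.
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```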
Architecture and tokenization
Architecture details emphasize a decoder-only Transformer design:
- Model type: decoder-only Transformer, described as "Causal decoder-only" and "decoder-only"
- Dense parameterization (not MoE)
- Notable design choices: ALiBi / ALiBi Positional Embeddings, Embedding LayerNorm, tied embeddings, greedy decoding until EOS token, multilingual tokenizer for evaluation, and ablations focusing on zero-shot generalization and hyperparameters
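Since ALiBi is the most distinctive of these choices, the sketch below shows how the additive attention bias can be computed. It uses the standard ALiBi slope schedule (2^(-8i/n_heads), assuming a power-of-two head count) and is an illustration, not the production Megatron-DeepSpeed implementation.

```python
# Minimal sketch of the ALiBi bias used in BLOOM-style attention.
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Return a (n_heads, seq_len, seq_len) additive attention bias."""
    # Per-head slopes: geometric sequence 2^(-8/n), 2^(-16/n), ...
    # (standard schedule when n_heads is a power of two).
    slopes = torch.tensor([2 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    positions = torch.arange(seq_len)
    # distance[i, j] = j - i, so keys far behind the query get a large negative bias.
    distance = positions[None, :] - positions[:, None]
    bias = slopes[:, None, None] * distance[None, :, :]
    # Causal mask: disallow attending to future keys.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return bias.masked_fill(~causal, float("-inf"))

# The bias is added to raw attention scores before softmax; no learned positional
# embeddings are needed, which is what enables extrapolation beyond training length.
```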
Key reported sizing elements (multiple configurations reported, one per model size):
- Layers: "70", "24", "30"
- Hidden sizes: "1024", "1536", "2048", "2560", "4096", "14336"
- Attention heads: "16", "32", "112"
(The largest values, 70 layers, hidden size 14336, and 112 attention heads, describe BLOOM-176B; the smaller values belong to the 560M-7.1B variants.)
Tokenization:
- Tokenizer types reported: "Byte-level BPE", "SentencePiece", "spm-flores-101", "BertTokenizer", "XLMRobertaTokenizer"
- Vocab size: "250,680" (also reported without the thousands separator as "250680")
- Chat / prompt format referenced: "xglm-source+target"
- A truncation rule is noted for over-generation cases where the output repeats the prompt pattern
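As a concrete check on the tokenizer claims, the following sketch loads the published tokenizer and inspects its vocabulary. It assumes the checkpoint name `bigscience/bloom` and the Hugging Face `transformers` library.

```python
# Minimal sketch: inspect BLOOM's byte-level BPE tokenizer.
# Assumes the public checkpoint "bigscience/bloom"; the vocabulary size
# should match the reported 250,680 entries.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/bloom")
print(tok.vocab_size)                      # expected: 250680

ids = tok("Bonjour le monde! नमस्ते दुनिया")["input_ids"]
print(tok.convert_ids_to_tokens(ids))      # byte-level BPE pieces across scripts
```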
Training and compute
Pretraining mixture and dataset:
- The ROOTS corpus (also referred to simply as the "ROOTS dataset") is a composite collection reported as comprising "498 Hugging Face datasets" and is said to include around "11% of code."
- The BigScience Catalogue lists "252 sources identified."
- OSCAR version 21.09 constituted "38% of the corpus."
- P3 includes "2000+ prompts for 170+ datasets"; extension to xP3 for multilingual tasks is reported.
Total tokens reported for pretraining include multiple values: "341B" and "366B"; a separate "13 billion" figure also appears, which seems to refer to finetuning rather than pretraining tokens.
Optimization and hyperparameters:
- ZeRO stage 1 and cosine learning rate decay are reported as parts of the optimizer/schedule.
- Learning rates were determined by doubling the minimum learning rate of the respective pretrained model and rounding; global batch sizes were multiplied by four for small variants.
- Important hyperparameters include bfloat16 for final training, mixed-precision training, weight decay, gradient clipping, and "no dropout".
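For intuition, here is a minimal sketch of a warmup-plus-cosine-decay learning-rate schedule of the kind reported. The peak/minimum rates and step counts are placeholders rather than the published hyperparameters, and ZeRO stage 1 itself would be a DeepSpeed configuration setting not shown here.

```python
# Minimal sketch of a warmup + cosine-decay LR schedule (placeholder values).
import math

def lr_at_step(step, peak_lr=6e-5, min_lr=6e-6, warmup_steps=375, decay_steps=100_000):
    if step < warmup_steps:                 # linear warmup to the peak rate
        return peak_lr * step / max(1, warmup_steps)
    if step >= decay_steps:                 # hold at the floor afterwards
        return min_lr
    # Cosine decay from peak_lr down to min_lr over the decay window.
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```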
Compute and runtime:
- Training provided through a French public grant and leveraged IDRIS' Jean Zay supercomputer.
- Reported hardware: "48 nodes with 384 NVIDIA A100 80GB GPUs".
- Training duration: about 3.5 months (reported both as "Training took about 3.5 months" and "Trained for 3.5 months").
- Compute consumption: "Consumed 1,082,990 compute hours".
- Data volume: "1.61 terabytes of text".
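A quick back-of-the-envelope check shows these compute figures are roughly self-consistent: dividing the stated compute hours by the GPU count (assuming they are GPU-hours) gives a wall-clock duration in the same range as the reported ~3.5 months.

```python
# Consistency check on the reported compute figures (assumes "compute hours"
# means GPU-hours across the 384 A100 GPUs).
gpu_hours = 1_082_990
n_gpus = 384

wall_clock_hours = gpu_hours / n_gpus      # ≈ 2820 hours
wall_clock_days = wall_clock_hours / 24    # ≈ 118 days, i.e. roughly 3.5-4 months
print(round(wall_clock_hours), round(wall_clock_days))
```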
Post-training:
- Multitask finetuning (the BLOOMZ family) is reported; however, one quoted claim states "Multitask finetuned BLOOMZ models do not improve significantly over BLOOM models."
- No explicit supervised fine-tuning (SFT) or preference-alignment procedure details are reported in the available data.
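To illustrate what the multitask prompted-finetuning data looks like, the sketch below streams a few examples. It assumes the public dataset name `bigscience/xP3`, that languages are exposed as configurations (here "en"), and that examples carry `inputs`/`targets` fields; none of these details are confirmed by the material summarized here, and this is not the BLOOMZ training pipeline.

```python
# Minimal sketch: peek at the xP3 multitask-finetuning mixture.
# Assumes the dataset "bigscience/xP3" with a language configuration and
# "inputs"/"targets" fields; illustrative only.
from itertools import islice
from datasets import load_dataset

xp3 = load_dataset("bigscience/xP3", "en", split="train", streaming=True)
for example in islice(xp3, 3):
    print(example["inputs"])    # prompt built from a task template
    print(example["targets"])   # target completion the model is finetuned to produce
```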
Evaluation and benchmark performance
General claims:
- Headline: "Competitive performance on a wide variety of benchmarks" and "Competitive performance after multitask finetuning."
- Reported observation: "performance plateau after 1 – 6 billion tokens of finetuning."
Selected benchmark results and summaries:
- Zero-shot evaluations: EAI-Eval ("29 tasks evaluated"), T0-Eval ("9 tasks evaluated").
- SuperGLUE: Zero-shot performance "Well above random chance for entailment tasks"; one-shot performance "Increased performance compared to zero-shot"; "BLOOM-176B outperforms OPT-175B on Ax-b, CB, WSC, and WiC tasks."
- WMT translation (reported as WMT'14 / WMT14 / WMT): BLEU scores for BLOOM-176B (a sacrebleu scoring sketch follows this results list):
  - en→fr: "34.2 (1-shot), 15.38 (0-shot)"
  - fr→en: "35.4 (1-shot), 14.15 (0-shot)"
  - en→hi: "14.49 (1-shot), 1.90 (0-shot)"
  - hi→en: "24.60 (1-shot), 10.19 (0-shot)"
  - Comparators include "M2M-100".
- DiaBLa translation BLEU: "BLOOM 1-shot context with previous sentence: 38.5 (en→fr)" and "1-shot context with random sentence: 37.6 (en→fr)".
- Flores-101 devtest spBLEU: reported as "BLOOM scores across various language pairs" and often comparable or better than "M2M-100 in 1-shot setting"; comparators include "M2M-100" and "AlexaTM".
- HumanEval: "Performance of pretrained BLOOM models is similar to that of similar-sized GPT models."
- HELM: "BLOOM is roughly on par with previous generation English-only models but behind more recent monolingual models." Also "BLOOM-7.1B achieves state-of-the-art performance on several classification and semantic textual similarity splits" in 5-shot settings.
- MASSIVE, STS22, and other multilingual benchmarks report per-language scores for accuracy and Spearman correlation; selected languages and numeric arrays are reported in source tables (examples include per-language arrays for Arabic, Bengali, English, Spanish, French, Hindi, etc.).
- Probing and morphosyntactic evaluation: "80 morphosyntactic features" and "38 morphosyntactic features in total" are cited; "Probing performance: BLOOM-1B7 performs on par or better than BLOOM, outperforming count-based baselines."
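The BLEU figures above are corpus-level scores; the minimal sacrebleu sketch below shows how such numbers are computed, with placeholder strings standing in for BLOOM outputs and WMT references.

```python
# Minimal sketch: corpus-level BLEU with sacrebleu (placeholder strings, not
# actual BLOOM outputs or WMT references).
import sacrebleu

hypotheses = ["Le chat dort sur le tapis."]
references = [["Le chat dort sur le tapis."]]   # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # comparable in kind to the en→fr / fr→en figures above
```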
Where reported as strengths:
- Better performance after multitask prompted finetuning and in one-shot vs zero-shot settings.
- BLOOM-176B: improved performance on several tasks compared to OPT-175B.
- Noted strengths in multilingual translation (especially some low-resource pairs and Romance languages) and summarization.
Where reported as weaknesses:
- BLOOM-176B underperforms dedicated translation models like M2M-100 on some tasks.
- Very poor results for certain under-represented low-resource languages (examples cited: Swahili and Yoruba).
- Calibration and grammatical generalization issues for some languages.
- Limited assessment of bias across all languages.
Strengths, discriminators, and intended positioning
The BLOOM family positions itself around several coordinated goals: open-access availability, multilingual coverage, human-centered data curation, and systematic evaluation of zero-shot generalization. Key claimed contributions include:
- A large-scale open-access model ("176 billion parameter open-access language model")
- Training on "46 natural languages and 13 programming languages" (reported among contributions)
- Prioritization of human involvement and local expertise in data curation, including use of P3 and xP3 prompt pools
- Exploration across architectures and pretraining objectives (causal, prefix, masked) and evaluation on decoder-only and encoder-decoder designs
Limitations and caveats
Reported limitations, operational issues, and caveats include:
- Architecture search omissions: "Did not consider mixture-of-experts (MoE) architectures" and "Did not consider state-space models."
- Operational instability during development: "1-2 GPU failures per week" and "5-10h downtimes due to bugs and disk space issues."
- Output issues: "Over-generation in translations" and "Incorrect language production in zero-shot settings."
- Data and evaluation caveats: "Validity issues with original CrowS-Pairs corpus" and coverage limited "to situations, languages, and language variants covered by multilingual CrowS-Pairs."
- Performance limitations: "Quality on underrepresented low-resource languages is questionable."
- Behavioral and license: "13 behavioral-use restrictions" are reported; an "Apache 2.0 open source license" is also cited. (The behavioral-use restrictions are part of the BigScience Responsible AI License (RAIL) under which the model weights are released.)
Notable figures and operational metrics
Several resource, performance, and environmental metrics are reported:
- Peak/performance: "Achieved 156 TFLOPs in fastest configuration with A100 GPUs" and "Half of the theoretical peak performance of 312 TFLOPs."
- Compute and runtime: "Trained on 48 nodes with 384 NVIDIA A100 80GB GPUs", "Consumed 1,082,990 compute hours", "Training took about 3.5 months".
- Energy and carbon footprint claims (a quick arithmetic check follows these figures):
  - "carbon emissions from BLOOM training add up to approximately 81 tons of CO2eq"
  - "BLOOM training energy consumption is 433 MWh"
  - "BLOOM emissions are approximately 2/3 less than OPT (25 tons versus 70 tons)"
  - "real-time deployment of the model results in approximately 20 kg of CO2eq emitted per day"
  - "0.83 kg per hour"
- Additional numeric observations and percentages tied to dataset composition and model mix:
  - "ROOTS corpus consists of around 11% of code."
  - Reported model-mix percentages such as "BLOOM 15.52%", "BLOOMZ 12.06%", and per-size values like "BLOOM-560M 0.82%", "BLOOMZ-7.1B 8.06%", among others.