SMS Spam Collection — Dataset and Evaluation Overview
Overview
The SMS Spam Collection is a public corpus of mobile text messages intended for research in SMS spam filtering, text classification, near-duplicate detection, and related information retrieval tasks. It combines multiple subcollections assembled from public sources and is positioned as the largest available real, non-encoded SMS spam dataset for benchmarking and validation. Every message carries an explicit binary label (spam vs. ham), and the corpus has been used to evaluate a wide range of machine learning classifiers as well as near-duplicate analysis and N-gram based similarity methods.
Dataset composition and scale
The collection is assembled from multiple sources and organized into three named subcollections: INIT, ADD, and FINAL. The combined corpus size and notable subcomponents reported are summarized below.
- Spanish dataset: 199 spam, 1,157 ham
- English dataset: 82 spam, 1,119 ham
- NUS SMS Corpus: about 10,000 messages (legitimate)
- Grumbletext: 425 SMS spam messages
- SMS Spam Corpus v0.1 Big (INIT): 1,002 ham + 322 spam (1,324 total)
- New collection / FINAL: 5,574 messages total (4,827 ham, 747 spam)
The following points summarize class distributions and token statistics within the subcollections:
- Class balance: INIT is more balanced than ADD (INIT: 1,002 ham (75.68%), 322 spam (24.32%); ADD: 3,825 ham (90.00%), 425 spam (10.00%)).
- Average tokens per message: INIT 15.01; ADD 14.42.
- Total tokens: INIT 19,874; ADD 61,280.
- Ham tokens: INIT 12,192; ADD 51,419. Spam tokens: INIT 7,682; ADD 9,861.
Data format, fields, and labels
Each observation unit is an SMS message (text message). The dataset is provided as plain text, with a single file format where each line contains the label followed by the raw message. Fields documented include:
- label: string (ham or spam). The first token per line is the class label.
- message_text: string containing the raw SMS message content.
Labels follow a binary schema: spam / ham (legitimate). The corpus is structured into three subcollections (INIT, ADD, FINAL), and for classification experiments the token sequences are transformed into feature vectors using a reported tokenizer variant (tok2).
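As a concrete illustration, a minimal loader for this one-record-per-line format might look as follows. The tab separator between label and message is an assumption (the document only states that the label is the first token on each line); adjust the delimiter if a particular distribution of the file differs.

```python
def parse_sms_lines(lines):
    """Parse lines of the form '<label>\\t<message>' into (label, message) pairs.

    Assumes a tab separates the class label from the raw message text;
    blank lines are skipped.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        label, _, text = line.partition("\t")
        pairs.append((label, text))
    return pairs
```

Keeping the message text verbatim (no stemming or stop word removal) matches the preprocessing choices documented below.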
Collection sources and preprocessing
Data sources include multiple public collections and manual extractions: the NUS SMS Corpus (volunteer-contributed messages, mostly Singapore-based), Grumbletext (messages manually extracted from a website), SMS Spam Corpus v0.1 Big, and contributions from other public/academic sources (including Caroline Tag’s PhD thesis for some ham messages). Collection methods involved combining these sources into unified subcollections and performing near-duplicate analysis to monitor overlap when creating FINAL.
Preprocessing and cleaning choices noted:
- Messages are retained as raw plain text with no stop word removal or word stemming applied.
- Deduplication and near-duplicate detection were explicitly performed to assess and reduce overlap across subcollections; duplicates present in original sources were not duplicated beyond what was already present in the inputs.
Near-duplicate detection and N-gram methodology
Near-duplicate detection is a central quality-control activity for this collection. The approach is based on N-gram matching and normalization, with explicit attention to false positives produced by short N-grams.
Key points of the N-gram strategy:
- N-gram sizes evaluated: N=5, 6, 10.
- Normalization steps include replacing separators with spaces, lowercasing, and replacing digits with the character 'N'.
- Short N-grams (5- and 6-grams) produce many false positives; longer 10-grams reduce false positives and align better with typical message length.
- A baseline "String-of-Text" method is described for near-duplicate detection; n-gram matching with normalization (e.g., N=6) is used for refinement.
- Reported statistics include N-gram occurrence counts per subcollection and measures such as number of unique N-grams with frequency >= 2, total hits, average hits per N-gram, and standard deviation.
- Observed behavior: 52% of 10-grams with frequency 2 (occurring in FINAL in both INIT and ADD) contain runs of the normalized digit symbol 'N' (i.e., short codes or telephone numbers) in spam messages. FINAL reportedly contains more unique 10-grams than the sum of INIT and ADD, and many 10-grams are associated with specific spam campaigns.
The N-gram analysis is used both to estimate an upper bound on near-duplicates and to reveal non-symmetric matching behavior in comparisons across subcollections.
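The normalization and N-gram matching steps described above can be sketched as follows. The exact regular expressions and the definition of "separator" are assumptions; the source describes only the high-level steps (separators to spaces, lowercasing, digits mapped to 'N').

```python
import re
from collections import Counter

def normalize(text):
    """Normalize a message as described: lowercase, digits -> 'N', separators -> spaces."""
    text = text.lower()
    text = re.sub(r"[0-9]", "N", text)      # every digit becomes the symbol 'N'
    text = re.sub(r"[^a-zN]+", " ", text)   # non-letter runs collapse to one space
    return text.strip()

def ngrams(text, n):
    """Word N-grams of the normalized message, joined back into strings."""
    tokens = normalize(text).split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def repeated_ngram_counts(messages, n):
    """N-grams occurring in at least two distinct messages (frequency >= 2),
    a simple proxy for the near-duplicate statistics reported above."""
    counts = Counter(g for m in messages for g in set(ngrams(m, n)))
    return {g: c for g, c in counts.items() if c >= 2}
```

Because digits collapse to 'N', two spam messages differing only in the advertised phone number produce identical normalized N-grams, which is exactly the campaign-chain behavior the analysis exploits.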
Tokenization and features
Two tokenizers are referenced (tok1 and tok2); tok2 is noted as the tokenizer used to transform token sequences into feature vectors for experiments. Feature-level statistics include token counts by class, average tokens per message, and information gain (IG) scores to identify informative keywords and top N-grams (examples of top 5-grams include "sorry i ll call later" and "i cant pick the phone").
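As an illustration of the IG-based keyword scoring mentioned above, here is a minimal sketch assuming binary token-presence features; the exact IG formulation and tokenization (tok2) used in the reported experiments are not specified, so this is a standard textbook variant.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, token):
    """IG of a binary 'token present' feature with respect to the class label."""
    n = len(docs)
    present = [token in set(d.split()) for d in docs]
    h_class = entropy([c / n for c in Counter(labels).values()])
    h_cond = 0.0
    for flag in (True, False):
        branch = [l for l, p in zip(labels, present) if p == flag]
        if branch:
            h_cond += (len(branch) / n) * entropy(
                [c / len(branch) for c in Counter(branch).values()])
    return h_class - h_cond
```

A token that perfectly separates spam from ham scores IG equal to the class entropy (1 bit for a balanced sample), while an uninformative token scores near zero.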
Recommended uses and evaluation protocol
Primary intended use cases are validation, benchmarking, and training of SMS spam classifiers and studies of near-duplicate detection in SMS corpora. Suggested evaluation practices and metrics include:
- A train/test split used in reported experiments: 30% training (1,674 messages) and 70% testing (3,900 messages).
- Token-level statistics and information gain (IG) scores to characterize features and top tokens.
- N-gram based evaluation across N=5, 6, 10 with reporting of unique N-grams, total hits, average hits, and standard deviation; analysis of top frequent N-grams and campaign-related chains.
- Similarity and near-duplicate metrics: Edit Distance, Jaro Distance, Cosine similarity, longest common character sequence, and N-gram matching counts.
- Classification performance metrics: Accuracy (Acc), Matthews Correlation Coefficient (MCC), Spam Caught (SC), Blocked Hams (BH), and class-level precision/recall style measures.
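The classification metrics above can all be computed from one confusion matrix. A minimal sketch, treating spam as the positive class and reading SC as spam recall and BH as the fraction of legitimate messages blocked (a common interpretation of these metric names):

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Acc, MCC, Spam Caught (SC), and Blocked Hams (BH) from a confusion matrix.

    tp/fn count spam messages (caught/missed); fp/tn count ham messages
    (blocked/passed).
    """
    acc = (tp + tn) / (tp + fp + fn + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    sc = tp / (tp + fn) if (tp + fn) else 0.0   # share of spam caught
    bh = fp / (fp + tn) if (fp + tn) else 0.0   # share of hams blocked
    return acc, mcc, sc, bh
```

MCC is favored here because, unlike plain accuracy, it stays informative under the heavy ham/spam imbalance of this corpus.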
Key dataset counts:
- FINAL: 5,574 messages (4,827 ham, 747 spam)
- INIT: 1,324 messages (1,002 ham, 322 spam)
- ADD: 4,250 messages (3,825 ham, 425 spam)
- Train/test split: 1,674 / 3,900
- Grumbletext spam subset: 425 messages
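Two of the similarity measures listed among the evaluation metrics, edit distance and cosine similarity, can be sketched in a few lines. These are standard formulations, not the exact implementations used in the reported experiments.

```python
import math
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance via the classic rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]

def cosine_similarity(a, b):
    """Cosine similarity over whitespace-token count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

For near-duplicate screening, such pairwise measures complement the N-gram matching counts: low edit distance or high cosine similarity between messages from different subcollections flags candidate overlaps.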
Baselines, classifiers evaluated, and reported results
A wide range of classical machine learning classifiers have been evaluated on the corpus, including but not limited to Naive Bayes variants (Multinomial, Bernoulli, Boolean, Gaussian, and others), Logistic Regression, MLP, SVM (including Linear variants), SMO, MDL, KNN, decision trees (C4.5 and boosted variants), PART, Random Forest, boosted classifiers, and an EM clustering baseline in WEKA (max iterations = 20).
Reported baseline outcomes and highlights:
- SVM is reported to outperform other evaluated techniques and serves as the recommended baseline.
- Linear SVM achieved the highest baseline performance among evaluated classifiers (accuracy 97.64%).
- High-performing algorithms reached >97% accuracy in the experiments referenced.
- A trivial rejector (TR) baseline and EM clustering were also used for comparative purposes.
- Reported practical trade-offs: Logistic Regression reportedly blocked >2% of legitimate messages (Blocked Hams >2%), while SVM blocked 0.18% in a cited evaluation.
Evaluation emphasized MCC as a primary performance measure alongside accuracy and spam/ham trade-off metrics (SC and BH).
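A minimal baseline in this spirit could be assembled with scikit-learn; this is an assumption for illustration, since the original experiments used their own tokenizer (tok2) and toolkit-specific classifier settings. Only the 30%/70% split proportions are taken from the reported protocol.

```python
# Sketch of a Linear SVM baseline, assuming scikit-learn is available.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def run_baseline(texts, labels):
    # 30% train / 70% test, matching the reported split proportions.
    x_tr, x_te, y_tr, y_te = train_test_split(
        texts, labels, train_size=0.3, stratify=labels, random_state=0)
    vec = CountVectorizer(binary=True)  # binary term presence is an assumption
    clf = LinearSVC()
    clf.fit(vec.fit_transform(x_tr), y_tr)
    pred = clf.predict(vec.transform(x_te))
    return accuracy_score(y_te, pred), matthews_corrcoef(y_te, pred)
```

Reporting both accuracy and MCC, as this sketch does, follows the evaluation emphasis described above.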
Limitations, biases, and common failure modes
Known limitations and quality considerations for the collection and evaluation include:
- Geographic and source bias: a substantial portion of ham messages originate from the NUS SMS Corpus (contributors mainly Singaporeans/students), Grumbletext content originates in the United Kingdom, and Spanish messages appear in SMS Spam Corpus v0.1 Big. This results in geographic sampling bias.
- Class imbalance across subcollections: INIT is more balanced than ADD; spam proportion in INIT is about three times that in ADD, and the combined dataset is heavily skewed toward legitimate messages (overall 4,827 ham vs 747 spam in FINAL).
- Near-duplicate and duplication risk: potential duplicates across sources were identified and deduplication was applied, but the presence of duplicates and near-duplicates remains a documented concern and was a primary motivation for the N-gram analysis.
- N-gram analysis trade-offs: short N-grams (5- and 6-grams) generate many false positives and non-symmetric matching; 10-grams reduce false positives but are rarer and may miss shorter shared fragments. Reported issues include non-symmetric matching leading to incomplete counting and the fact that many 10-grams are tied to specific spam campaigns or phone-number patterns.
- Data quality impact on classifiers: short message length, idioms, and abbreviations challenge content-based filters; low feature counts per message constrain modeling. Observed classifier failure modes include false positives attributable to feature overlap and classifier thresholds (e.g., Logistic Regression blocking >2% of legitimate messages vs SVM blocking 0.18%).
Practical notes and key claims
Key empirical and procedural claims associated with the collection and its evaluation, as reported: the corpus is public and composed of real, non-encoded SMS messages; near-duplicate detection and N-gram analysis are central to ensuring content uniqueness across subcollections; and Linear SVM is cited as the best-performing baseline, with 97.64% accuracy and a relatively low legitimate-message blocking rate (0.18% Blocked Hams) in the evaluations summarized.
Sources
Towards SMS Spam Filtering: Results under a New Dataset