IMDB Review Sentiment Analysis — Dataset and Modeling Summary
Overview
This document describes an approach to binary sentiment classification on the IMDB movie-review dataset: classifying reviews as positive or negative. It evaluates multiple machine learning and deep learning models, compares their performance on standard metrics, and outlines preprocessing, data specification, training splits, and recommended evaluation protocols.
Dataset and Data Specification
The dataset used is commonly referred to as the IMDB Review Dataset (also written IMDb). Related experiments additionally cite the MR, SST, and YelpNYC datasets, which are distinct corpora rather than alternate names for IMDB. The unit of observation is the individual IMDb review (a text document). Reported data formats include CSV and Excel, and the raw content is unstructured text.
Labels and structure:
- The label schema is binary with classes: positive and negative.
- Texts in the training data are pre-labeled with sentiment classes.
- A frequently reported canonical split is Total reviews: 50,000, with Training set: 25,000 reviews and Testing set: 25,000 reviews (balanced: 12,500 positive and 12,500 negative in each).
- A subset usage is also reported using an 80/20 split on 10,000 reviews (8,000 training, 2,000 testing), preserving balance between classes.
- Cross-validation: experiments report use of 10-fold cross-validation for evaluation.
Collection, Coverage, and Provenance
The dataset sources cited include the IMDb reviews compiled by Andrew Maas and versions procured from Kaggle. The dataset covers movie review content intended for sentiment analysis. No geographic or temporal coverage details are provided beyond the movie review domain. Licensing for the dataset is reported as open access under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Preprocessing and Feature Engineering
Preprocessing steps applied to the textual data are extensive and include the following operations (applied in various combinations across experiments):
- Removal of HTML tags (via regex/BeautifulSoup), removal of URLs, and removal of punctuation and special characters.
- Conversion to lowercase.
- Stop-word removal to reduce dimensionality.
- Tokenization into words.
- Stemming and lemmatization.
- Spell checking and correction (tools mentioned: PySpellChecker, TextBlob).
- Negation handling.
- Removal of rare and overly frequent words via frequency thresholds and curated lists.
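The cleaning steps above can be sketched in a minimal stdlib-only pipeline. The stop-word list here is a tiny illustrative stand-in, and the step ordering is one reasonable choice, not the exact pipeline used in the reported experiments.

```python
# Minimal text-cleaning sketch: HTML/URL removal, lowercasing,
# punctuation stripping, whitespace tokenization, stop-word removal.
# STOP_WORDS is a small illustrative subset, not a full stop-word list.
import re
import string

STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "of"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = text.lower()                        # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                      # tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("<br />This movie is GREAT!"))  # ['movie', 'great']
```

Stemming/lemmatization and spell correction would typically follow as additional stages (e.g. via NLTK or TextBlob) but are omitted here to keep the sketch dependency-free.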
Feature engineering methods reported:
- Bag-of-Words (BoW) features.
- TF-IDF features.
The preprocessing pipeline emphasizes converting unstructured text into numerical features suitable for classical ML models and deep learning models alike.
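The BoW and TF-IDF featurization step can be sketched with scikit-learn's standard vectorizers. The vectorizer settings and toy corpus below are illustrative assumptions, not the configuration of the original experiments.

```python
# Sketch of BoW vs. TF-IDF featurization using scikit-learn defaults.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "a wonderful heartfelt movie",
    "a dull boring movie",
]

bow = CountVectorizer().fit_transform(corpus)    # raw term counts
tfidf = TfidfVectorizer().fit_transform(corpus)  # tf-idf-weighted counts

# Both produce one sparse row per document over the shared vocabulary
# (default tokenization drops single-character tokens such as "a").
print(bow.shape, tfidf.shape)
```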
Models, Training, and Model Selection
Models compared in the reported experiments include classical machine learning classifiers and neural networks. Model classes explicitly cited:
- Naive Bayes
- Logistic Regression
- Linear SVM
- Decision Tree
- LSTM
- BiLSTM
Related baselines additionally mention Random Forest (RF) as an efficient baseline in some IMDB sentiment analyses.
Training details reported:
- Training/testing split commonly 80%/20% or the canonical 25k/25k split as noted above.
- Hyperparameter tuning and preprocessing choices are highlighted as important contributors to achieving high performance.
- Model evaluation used held-out testing and 10-fold cross-validation where specified.
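The reported protocol (TF-IDF features, a classical classifier, 10-fold cross-validation) can be sketched end to end with a scikit-learn pipeline. The toy corpus is an assumption standing in for the 25k/25k IMDB data, and Logistic Regression is shown as one of the compared models.

```python
# Illustrative training/evaluation loop: TF-IDF + Logistic Regression,
# scored with 10-fold cross-validation. Toy data, not the real IMDB corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["great film", "loved it", "terrible movie", "awful plot"] * 25
labels = [1, 1, 0, 0] * 25  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, texts, labels, cv=10)  # 10-fold CV
print(f"mean accuracy: {scores.mean():.3f}")
```

Swapping `LogisticRegression` for `MultinomialNB`, `LinearSVC`, or `DecisionTreeClassifier` reproduces the classical side of the comparison within the same pipeline.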
Evaluation Metrics and Baseline Results
Evaluation emphasizes classification metrics including accuracy, precision, recall, and additional suggested metrics such as F1-score and AUC. Confusion-matrix-based analyses are also recommended for deeper insight into precision/recall tradeoffs.
Reported baseline results (models and metrics reported exactly as given):
- BiLSTM: accuracy 0.91, precision 0.89, recall 0.94
- LSTM: accuracy 0.91, precision 0.89, recall 0.92
- Logistic Regression: accuracy 0.89, precision 0.90, recall 0.88
- Linear SVM: accuracy 0.89, precision 0.90, recall 0.88
- Naive Bayes: accuracy 0.86, precision 0.86, recall 0.87
- Decision Tree: accuracy 0.70, precision 0.70, recall 0.71
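Although F1-scores are not reported directly, they follow from the precision/recall pairs above, since F1 is the harmonic mean of precision and recall:

```python
# Deriving the F1-scores implied by the reported precision/recall pairs.
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

reported = {
    "BiLSTM": (0.89, 0.94),
    "LSTM": (0.89, 0.92),
    "Logistic Regression": (0.90, 0.88),
}
for model, (p, r) in reported.items():
    print(f"{model}: F1 = {f1(p, r):.3f}")
# BiLSTM: F1 = 0.914, LSTM: F1 = 0.905, Logistic Regression: F1 = 0.890
```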
Additional reported figures and observations:
- An SVM experiment using TF-IDF is reported with 79% accuracy, 75% precision, and 87% recall.
- A Random Forest (RF) classifier is noted in related work as the most efficient baseline model for IMDB sentiment analysis.
Recommended Use Cases and Evaluation Protocol
Primary intended use cases:
- Sentiment analysis and binary sentiment classification of IMDb/movie reviews.
- Benchmarking and evaluation of ML and deep learning classifiers on textual sentiment tasks.
Recommended evaluation protocol based on reported practice:
- Use 10-fold cross-validation where possible for robust performance estimates.
- Report core metrics: accuracy, precision, and recall; also report F1-score and AUC when applicable.
- Include confusion matrices to analyze precision/recall tradeoffs and class-specific performance.
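The recommended confusion-matrix analysis can be sketched with scikit-learn; the labels and predictions below are illustrative, not experimental output.

```python
# Confusion matrix and per-class report for a binary sentiment task.
# y_true / y_pred are illustrative toy values.
from sklearn.metrics import confusion_matrix, classification_report

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = positive, 0 = negative
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
# Per-class precision/recall/F1, exposing the tradeoffs discussed above.
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```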
Key Contributions and Findings
Key empirical and methodological contributions reported:
- Demonstration that deep learning models, particularly BiLSTM, can achieve top-tier performance on IMDB sentiment classification (BiLSTM reported to achieve highest recall, precision, and accuracy among the compared models).
- Emphasis on the importance of thorough preprocessing (HTML removal, lowercasing, stop-word removal, tokenization, stemming/lemmatization, spell correction, negation handling) and hyperparameter tuning to reach high accuracy.
- Comparative evaluation across classical and neural models showing a range of performance from Decision Tree (lower performance) to recurrent neural networks (higher performance).
Annotation and Labeling
The dataset is reported as annotated (is_annotated: true). Labels are binary sentiment tags (positive/negative). No information on annotator identity, annotation instructions, inter-annotator agreement, or quality-control procedures is provided in the reported material.
Access and Licensing
The dataset is reported as open access under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0), permitting non-commercial reuse with attribution as specified by that license.
Sources
Sentiment Analysis on IMDB Review Dataset