Movie Review Text Classification Using scikit-learn

Manuel Gil
Python in Plain English
6 min read · Jun 21, 2021


In this article, I will show you how to use the popular Python library scikit-learn to implement a movie review classifier.

First, we will see how to prepare text data to feed a machine learning model; next, how to use scikit-learn to implement a classification model; and finally, we will discuss the model’s performance.

Data overview.

The dataset I will use can be found in the following link. It is a binary dataset for sentiment classification, divided into two folders, positive and negative reviews, each containing 1,000 reviews. Since the movie reviews are plain text files, we need to pre-process the data before feeding it to machine learning models. After loading the text data, we split it to create the training and test datasets.
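This step can be sketched as follows, assuming the reviews were extracted to a folder (here called txt_sentoken, which is illustrative) that contains one subfolder per class, so scikit-learn’s load_files can infer the labels from the subfolder names; a test size of 0.33 matches the 1,340 training reviews reported later:

from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split

# Load the reviews; each subfolder name becomes a class label.
# The path is an assumption -- adjust it to wherever the dataset lives.
reviews = load_files("txt_sentoken", encoding="utf-8")

# Split the raw text and the labels into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    reviews.data, reviews.target, test_size=0.33, random_state=42
)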

Feature extraction.

Since we cannot feed raw text to machine learning models, we need some additional steps before fitting any model. There are several techniques to extract features from text data. One of the most common methods to convert text into numeric data is to use the frequency with which each word appears in the data.
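The idea can be sketched in a few lines of plain Python (the review string below is just an illustration; the real reviews are much longer):

from collections import Counter

review = "the movie was good but the plot of the movie was too simple"

# Map each word to the number of times it appears in the text.
vocabulary = Counter(review.split())
print(vocabulary)
# Counter({'the': 3, 'movie': 2, 'was': 2, 'good': 1, 'but': 1,
#          'plot': 1, 'of': 1, 'too': 1, 'simple': 1})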

The code shown above creates a vocabulary of words: a dictionary that maps each word to its frequency, that is, how many times the word appears in the text. The vocabulary therefore has a length equal to the number of distinct words in the text.

If we look at the counts of every vocabulary word for a single document (movie review), a zero means that the word does not appear in that text at all. This process ends with a sparse vector containing a large number of zeros. If we repeat the process for all the reviews, we obtain a sparse matrix in which most of the entries are zero.

Term frequency.

The sparse matrix that results from the last technique may not be helpful until we consider the concept of “term frequency”. This concept can be used to determine the weight, or importance, of a term in a document (review). The technique can be summarized as follows:

The weight of a term that occurs in a document is simply proportional to the term frequency.

The term frequency is calculated by dividing the number of occurrences of a word by the total number of words in the document:

tf(t, d) = (number of times term t occurs in document d) / (total number of words in d)

Here “d” refers to the document we are dealing with and “t” to the term. One issue with this approach is that reviews with words like “the” or “is” will be over-emphasized, since these words are very frequent; such words are called “stop words”. Conversely, more meaningful words won’t receive enough weight, since they tend to be less frequent or rare.

Inverse document frequency

In order to re-weight the count features into floating-point values suitable for a classifier, and also to give more importance to rare or less frequent words, it is common to use the tf-idf transform, which stands for term frequency times inverse document frequency. In its basic form, the inverse document frequency is idf(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing the term t (scikit-learn uses a smoothed variant of this formula). The transformation is implemented in scikit-learn by the class TfidfTransformer.

Using Scikit-learn to extract features from text data.

Scikit-learn has pre-built classes to convert text data into numeric data. The steps to follow are: tokenize the text and count word occurrences, re-weight the counts with the tf-idf transform, and feed the resulting features to a classifier.

Let’s delve into each step of the feature extraction process.

Term Frequency.

To convert the raw text into a matrix of word counts, scikit-learn provides a class called CountVectorizer, which converts a collection of text documents into a matrix of token counts. In addition, this class can filter out stop words. Let’s see how it works.

First, since each review is quite large, I am going to create a small corpus to use as an example. After instantiating the class CountVectorizer, I fit it and transform the corpus by calling the method fit_transform. To see the words in our vocabulary, I use the method get_feature_names.
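A sketch of that code follows; the corpus is chosen so that it reproduces the outputs shown below (note that in scikit-learn 1.0 and later, get_feature_names was replaced by get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer

# A small example corpus; each string plays the role of one document.
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix of token counts
print(vectorizer.get_feature_names())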

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

We can see the word frequencies by converting the sparse matrix X into a NumPy array, where each row corresponds to a document in the corpus (the strings stored in the list). The numbers in the matrix represent the number of occurrences of each word within each document.

X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 2, 0, 1, 0, 1, 1, 0, 1],
[1, 0, 0, 1, 1, 0, 1, 1, 1],
[0, 1, 1, 1, 0, 0, 1, 0, 1]])

By default, the class offers a set of parameters that are enough in most cases. However, if we want to remove stop words from the vocabulary, we can change the stop_words parameter from its default None to "english". Scikit-learn then uses a predefined list of English stop words to filter them out of the corpus.

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)  # re-fit so the new setting takes effect

Changing this parameter reduces the number of words in the vocabulary; for this particular example, we obtain the following words and their respective frequency matrix.

>>>vectorizer.get_feature_names()
['document', 'second']
>>>X.toarray()
array([[1, 0],
[2, 1],
[0, 0],
[1, 0]])

Let’s apply this method to our dataset of movie reviews.

freq_vector = CountVectorizer(stop_words = "english")
X_train_freq = freq_vector.fit_transform(X_train)
X_train_freq.shape
>>>(1340, 33470)

The method returns a matrix with 1340 rows (reviews) and 33470 features. Now let’s see how to convert this matrix of word occurrences into a floating-point matrix carrying more meaningful information about the review content. To do this, I will use the class TfidfTransformer, which transforms the count matrix into a normalized tf-idf representation.
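A minimal sketch of this step, applied to the count matrix obtained above:

from sklearn.feature_extraction.text import TfidfTransformer

# Re-weight the raw counts into a normalized tf-idf representation.
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_freq)
X_train_tfidf.shape  # same shape as the count matrix
>>>(1340, 33470)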

Building the model

The tf-idf transformation returns the feature matrix that can be used to train a classifier model, so let’s create a basic classifier and see how it performs. The algorithm will be the linear support vector machine, which is able to handle sparse features and large numbers of samples.
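A sketch of this step, assuming the tf-idf matrix X_train_tfidf and the labels y_train from the earlier split (LinearSVC is scikit-learn’s linear support vector classifier):

from sklearn.svm import LinearSVC

# Train a linear support vector classifier on the tf-idf features.
clf = LinearSVC()
clf.fit(X_train_tfidf, y_train)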

That’s all; the process of creating a movie review classifier is complete. It can be quite complex to develop a classifier model for text data. Fortunately, scikit-learn provides a Pipeline class that serves to assemble several steps that can be cross-validated together. In addition, we can use the class TfidfVectorizer, which converts a collection of documents into a matrix of tf-idf features directly. So let’s compress the code to develop the text classifier.
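A sketch of the compressed version, reusing the variable names from the earlier split:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Chain feature extraction and classification into a single estimator.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LinearSVC()),
])

# Fit the whole pipeline directly on the raw text data.
text_clf.fit(X_train, y_train)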

This approach is much more compact than executing each step one by one. Remember that to fit the model we just need the original data, that is, the raw text.

Model Performance.

Using a confusion matrix, we can see at a glance how our model performs. Let’s use the test dataset to make predictions and see how good the model is.
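A sketch of that evaluation, assuming the fitted pipeline text_clf from the previous section:

from sklearn.metrics import confusion_matrix

# Predict labels for the held-out reviews and compare them with the truth.
predictions = text_clf.predict(X_test)
print(confusion_matrix(y_test, predictions))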

The code shown above returns the confusion matrix.

The model performance is pretty good for a basic approach to text classification. It tends to misclassify more negative reviews as positive than the other way around; however, the behavior is quite similar for both labels. The overall accuracy is:

>>>from sklearn.metrics import accuracy_score
>>>print(accuracy_score(y_test,predictions))
0.8257575757575758

Conclusions

We have learned the basic aspects of Natural Language Processing (NLP) tasks in this article. There are many things I have not covered in this little project; however, we can already see how powerful scikit-learn can be for developing NLP projects. The code for this project can be found in my GitHub repository.

I am passionate about data science and like to explain how these concepts can be used to solve problems in a simple way. If you have any questions or just want to connect, you can find me on LinkedIn or email me at manuelgilsitio@gmail.com.
