Real or Fake Disasters

Project Description

This article walks through my approach to a natural language processing Kaggle competition, which can be found here.

What we need to do is create a machine learning model that is able to accurately classify tweets that reference disasters into two categories: real or fake.

For example, one tweet could say:

“The sunset was beautiful today!”

This does not reference a real-life disaster, and thus, it should be classified as fake, which will be represented by 0.

“There is a fire on 5th avenue!”

This tweet, on the other hand, does reference an actual disaster, so it should be classified as real (represented by 1).

Our goal is to create a traditional classification model that can accurately distinguish between these two types of tweets.

Let’s get started!

Dataset Analysis

Like with any machine learning problem, the first step is to examine the datasets and determine which columns would be useful to the model. To do this, let’s look at the training dataset that we are provided.

Note: The dataset and all the code samples can be found in my GitHub Repository.

This is a sample from rows 51-74 of the training dataset

As we can see, there are 5 columns in the dataset: Id, Keyword, Location, Text, and Target.

However, only some of these columns will need to be used as part of our program. Let’s analyze each one.

The Target column is absolutely necessary since we need it to provide the labels for our classifier. The Text column is also needed since it includes the actual tweet that we need to analyze.

However, we should remove the Id column from the dataset since it has no correlation to whether or not the tweet references an actual disaster.

We should also remove the Location column from the dataset because its values are extremely inconsistent: some rows contain countries, while others contain states. Furthermore, a large portion of the dataset has missing Location values, which could undermine the performance of our classifier.

Although the Keyword column contains some ‘NaN’ values, it is worth keeping in the dataset because it could prove beneficial to the model. Keep in mind, however, that this column will have to be categorically encoded, since it consists of strings that must be converted into numbers.
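If you want to verify this for yourself, a quick check with Pandas (which we install in the next section) counts the missing values per column. This snippet assumes the lowercase column names used in the Kaggle files:

import pandas as pd

# Count missing values in each column of the training file.
# Column names are assumed to match the Kaggle CSV: id, keyword, location, text, target.
train = pd.read_csv('data/train.csv')
print(train.isna().sum())
# In this dataset, location should show far more missing values than keyword.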

Library Installation

Before we begin, however, we must install four important libraries: Scikit-Learn, Pandas, Numpy, and Natural Language Toolkit (NLTK).

  • Scikit-Learn is a machine learning library that provides machine learning algorithms to perform regression, classification, clustering, and more.

  • Pandas is a Python library that helps in data manipulation and analysis, and it offers data structures that are needed in machine learning.

  • Numpy is another library that makes it easy to work with arrays. It provides several unique functions that will help in data preprocessing.

  • NLTK is a language processing library which includes several methods and functions that will help us clean the Tweets.

Fortunately, these libraries can be quickly installed using Pip, Python’s default package installer. All we have to do is enter the following lines into the terminal:

pip install scikit-learn
pip install numpy
pip install pandas
pip install nltk

Now we can begin coding our algorithm in Python!

Training Data Preprocessing

Step 1: Importing Datasets

We’ve already decided which columns to include and which to exclude, so all we have to do is import the training dataset and split it into different arrays, one for each variable we want to include.

Let’s first start by importing the main libraries that we will use: Pandas and Numpy.

import numpy as np
import pandas as pd

Now, we need to import the dataset called train.csv since we will be using this to create and train our model. We’ll call this dataset train.

train = pd.read_csv('data/train.csv')

Let’s take a look at our dataset once again.

A sample of the training dataset

As we can see, Keyword is the 2nd column, Text is the 4th column, and Target is the 5th and final column. We need to index into train to create three independent arrays, one for each of these columns.

train_keywords = train.iloc[:, 1].values
train_text = train.iloc[:, 3].values
train_targets = train.iloc[:, -1].values

Notice how we are indexing train with two comma-separated indices inside .iloc. This is because .iloc, like a NumPy array, is indexed in the format [row, column].

Let’s take train_keywords as an example. The first value, the colon, means that we are indexing all the rows. The second value, 1, means that we are indexing the second column (indexes in Python begin from zero).

We do this for all the different arrays we want to create.
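To make the [row, column] convention concrete, here is a tiny, self-contained illustration using a made-up DataFrame (not the competition data):

import pandas as pd

# A toy DataFrame, purely for illustration.
toy = pd.DataFrame({'keyword': ['fire', 'flood', 'storm'],
                    'text': ['tweet one', 'tweet two', 'tweet three']})

print(toy.iloc[:, 0].values)   # all rows, 1st column -> ['fire' 'flood' 'storm']
print(toy.iloc[0, 1])          # 1st row, 2nd column  -> 'tweet one'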

Step 2: Categorically Encoding Keywords

As usual, we will have to clean the data before passing it to a classifier. If we look at our dataset, we’ll notice that we need to clean both train_keywords and train_text since they consist of strings.

Let’s start by categorically encoding the Keywords column. This seems quite simple, but upon closer examination, we will see that Keywords contains ‘NaN’ values. Unfortunately, that means that we can’t directly encode this array.

We must first use Scikit-Learn’s SimpleImputer to eliminate these ‘NaN’ values by replacing them with a constant term. This can be done as shown below:

from sklearn.impute import SimpleImputer

train_keywords = train_keywords.reshape(len(train_keywords), 1)
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='no_keyword')
train_keywords = imputer.fit_transform(train_keywords)

The reason we use the .reshape method is that SimpleImputer expects a two-dimensional array and will throw an error if we pass in a one-dimensional one. Thus, we reshape train_keywords into a two-dimensional array with a single column.

Now, we can finally categorically encode our keywords! There are many methods to encode independent variables, but one of the most popular is OneHotEncoding.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
encoder = OneHotEncoder(drop='first', dtype=int)
ct = ColumnTransformer([('encoding', encoder, [0])], remainder='passthrough')
train_keywords = ct.fit_transform(train_keywords)

Scikit-Learn’s OneHotEncoder has the argument drop, which allows us to circumvent the dummy variable trap without having to manually remove a column. In order to use OneHotEncoder on train_keywords, we must use ColumnTransformer, an object which will apply the encoder of our choosing. The first argument we pass is a tuple in a list. The tuple contains three items:

  • A name for the transformation (this can be any string)

  • The encoder object that we want to use

  • The columns of the dataset that we want to transform

The second argument we pass is remainder. If we set it to ‘passthrough’, all the columns that are not being transformed will be kept alongside the encoded ones. Usually, we want to set remainder to ‘passthrough’ so that the rest of the information in the dataset is kept. In this case, however, it won’t make a difference, since train_keywords only contains the single column we are encoding. For more information on ColumnTransformer, please see Scikit-Learn’s documentation.
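To see what remainder='passthrough' actually does, here is a toy illustration (made-up values, not the competition data) with one categorical column and one numeric column:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data: a categorical column and a numeric column, purely for illustration.
toy = np.array([['fire', 10],
                ['flood', 20],
                ['fire', 30]], dtype=object)

# Only column 0 is encoded; remainder='passthrough' keeps column 1 untouched,
# whereas the default remainder='drop' would discard it.
ct_demo = ColumnTransformer([('encoding', OneHotEncoder(drop='first', dtype=int), [0])],
                            remainder='passthrough')
print(ct_demo.fit_transform(toy))   # roughly: [[0 10] [1 20] [0 30]]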

Now that we’ve one hot encoded train_keywords, it should look something like this:

Small portion of train_keywords

As we can see, train_keywords has been transformed into a large sparse matrix composed solely of 0s and 1s. Keep in mind that the picture above shows only a small portion of train_keywords, so the full matrix has many more rows and columns.

Step 3: Text Cleaning

In order to create a natural language processing model, we need to clean the texts themselves and then transform them into a bag of words model.

To do this, we first need to clean each individual Tweet and append it to a list, which we will call train_corpus.

But how do we clean the Tweets?

We can do this with a combination of NLTK and Python’s built-in re module. We have already installed NLTK, and re ships with Python, so we can get started right away!

To clean the texts, we will iterate through all the rows in the dataset and perform a series of operations on each individual Tweet. The text cleaning operations that we will perform are:

  1. Delete all numbers and symbols from the Tweet. We will only include the letters A-Z.

  2. Transform the Tweet to all lowercase.

  3. Split the Tweet into its individual words to create a list.

  4. Delete all stopwords from the Tweet.

  5. Take the stem of all the remaining words.

  6. Combine all the elements of the list into one string again.

  7. Append the cleaned string to corpus.

Before we complete these steps, however, we need to import all the libraries that we will use for this portion of our code.

First, we will import the entire NLTK package so that we can download stopwords.

import nltk
nltk.download('stopwords')

Now, we hit run so that our program downloads the stopwords from the NLTK library. We don’t need to download them every time the program runs, so once this has finished we can delete (or comment out) these lines to keep the program uncluttered.

After running the above lines of code, we import all the specific methods from NLTK and Re.

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

stopwords contains the common, low-information words that we will delete from each Tweet, and PorterStemmer is a class whose stem method returns the stem of any word we pass it.

For example, the stem of the word loved is love.
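If you’d like to see the stemmer in action before running the full cleaning loop, a quick check (with words of my own choosing) looks like this:

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
print(ps.stem('loved'))     # love
print(ps.stem('running'))   # run
print(ps.stem('fires'))     # fire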

Let’s implement the Tweet cleaning steps that I explained above into code:

train_corpus = []
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')     # keep 'not', since negation can flip a Tweet's meaning
ps = PorterStemmer()
for i in range(0, len(train_text)):
    tweet = re.sub('[^a-zA-Z]', ' ', train_text[i])     # Step 1
    tweet = tweet.lower()     # Step 2
    tweet = tweet.split()     # Step 3
    tweet = [ps.stem(word) for word in tweet if word not in set(all_stopwords)]     # Steps 4 and 5
    tweet = ' '.join(tweet)     # Step 6
    train_corpus.append(tweet)     # Step 7

We’re done! Now let's take a look at train_corpus and see what we have:

A .csv file with each element of train_corpus per line

As we can see, each Tweet in the training dataset has been cleaned and appended to train_corpus (which is a regular Python list). A lot of the Tweets look like they make no sense, and this is because we only took the stem of each word (we did this with PorterStemmer).

Great! Now all we have to do is create a bag of words model. I suggest you read the article (linked above) by Jason Brownlee for more information, but to summarize, a bag of words model has a column for every unique word that appears anywhere in the training text, and for each individual text it stores a count of how many times each of those words appears.

In order to create a bag of words model, we’ll use Scikit-Learn’s CountVectorizer and apply it on train_corpus. This can be done as shown below.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
train_corpus = cv.fit_transform(train_corpus)

Now, train_corpus is a large sparse matrix consisting solely of integers.

Each integer represents the amount of times a certain word appears in each Tweet. Since there are many more unique words in the whole dataset than in each Tweet, a large portion of this matrix is made up of 0’s.
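If you’re curious about what the vectorizer learned, you can peek at the result using the cv and train_corpus objects created above (get_feature_names_out exists in scikit-learn 1.0 and later; older versions call it get_feature_names):

print(train_corpus.shape)               # (number of Tweets, vocabulary size)
print(cv.get_feature_names_out()[:10])  # the first few words in the vocabulary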

Step 4: Reshaping Data

We’re finally done with cleaning the Tweets! However, there is one minor step that we must complete in order to create a single dataset for all our variables: concatenating train_keywords and train_corpus. Both are currently sparse matrices, so we first convert them into regular Numpy arrays using their .toarray() method so that they can be concatenated.

train_keywords = train_keywords.toarray()
train_corpus = train_corpus.toarray()

Now, we can concatenate these two arrays along axis 1 (the column axis) such that train_keywords will be to the left of train_corpus. We will call the resulting array x_train.

x_train = np.concatenate((train_keywords, train_corpus), axis=1)

We don’t need to scale our data since it consists of word counts and 0/1 dummy variables, which are already on a small, comparable scale. This means that we’re finally done with data preprocessing! We can now create and train our classifier on x_train and train_targets (which we created at the beginning of our program).

Creating and Training the Classifier

There are a variety of classifiers we could use, but in order to find the best one, we need to try several classifiers and compare their accuracies. Since we don’t have the labels for the test dataset, we must split our training dataset into its own training and test subsets. The implementation is very similar to what we have already done; the only difference is that we split the training dataset using Scikit-Learn’s train_test_split function.

Since most of the logic is the same, I’m not going to go over how to do this in this article, but the code is in my GitHub Repository. The file where I do this is find_best_model.py. This program uses the test_classifiers() function that I created in another file: test_models.py.
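That said, a minimal sketch of the idea looks roughly like the following. The structure and the two candidate models here are my own illustration, not the contents of find_best_model.py:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Hold out part of the labelled training data so each classifier can be scored.
x_tr, x_val, y_tr, y_val = train_test_split(x_train, train_targets,
                                            test_size=0.2, random_state=0)

# Only two candidates are shown here; the real script compares several more.
candidates = {'RBF SVC': SVC(kernel='rbf', random_state=0),
              'Naive Bayes': GaussianNB()}

for name, model in candidates.items():
    model.fit(x_tr, y_tr)
    print(name, accuracy_score(y_val, model.predict(x_val)))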

After we run this program, we’ll find that the Support Vector Classifier using the Radial Basis Function gives us the highest accuracy.

Note: I’ll be writing an article about the workings behind SVC (Support Vector Classifiers) and the RBF (Radial Basis Function) very soon, so make sure to keep a look out for that.

Since we know that SVCs using the RBF kernel perform the best (out of the classifiers we tested) on our dataset, we don’t have to waste any time testing all the other classifiers again. This classifier can be implemented as shown below:

from sklearn.svm import SVC
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(x_train, train_targets)

We finished training the classifier, but before we use it to predict on the test dataset, we must complete the same preprocessing steps that we did with the training data. Fortunately, it will be much quicker this time around since we already know what to do.

Test Data Preprocessing

Step 1: Importing Datasets

As we did with the training dataset, we can import the test dataset by using the Pandas .read_csv function.

test = pd.read_csv('data/test.csv')

Now, we can separate the dataset into individual Numpy arrays—one for each independent variable. This time, however, we don’t have the labels since we have to submit our predictions to Kaggle to receive our model’s scores. For the submission, we need to have an ids column which contains the Tweet Id for each row in the dataset. Let’s create all our Numpy arrays as shown below.

test_keywords = test.iloc[:, 1].values
test_text = test.iloc[:, -1].values
ids = test.iloc[:, 0].values

Great! Now we can move on to categorically encoding the keywords column.

Step 2: Categorically Encoding Keywords

Once again, we will reshape test_keywords so that we can use SimpleImputer to get rid of all the ‘NaN’ values. Then, we will categorically encode test_keywords with the use of Scikit-Learn’s ColumnTransformer and OneHotEncoder.

Let’s begin by reshaping test_keywords. As you can see, it is identical to what we did with the training dataset.

test_keywords = test_keywords.reshape(len(test_keywords), 1)

Now, we will use SimpleImputer just like we did before to get rid of ‘NaN’ values so that test_keywords can be categorically encoded.

This is where things change slightly.

We already have a SimpleImputer object (imputer) that has been fit to the training dataset, so we shouldn’t re-fit it on our test dataset.

But why is this?

In this case, we don’t want to introduce data leakage. Data leakage is essentially train-test cross-contamination, and it can produce misleading model performance. Here we are only imputing with a constant value, but in other cases we might impute datasets using the mean, median, or mode; in those situations, data leakage could have a real effect on model performance. Thus, we make sure never to introduce cross-contamination into our program.
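To make this concrete, here is a toy example (made-up numbers) using mean imputation. Re-fitting on the test set would fill its missing value with a statistic computed from the test data, which the model was never supposed to see:

import numpy as np
from sklearn.impute import SimpleImputer

train_ages = np.array([[20.0], [30.0], [np.nan]])
test_ages = np.array([[np.nan], [80.0]])

mean_imputer = SimpleImputer(strategy='mean')
mean_imputer.fit(train_ages)

print(mean_imputer.transform(test_ages))      # fills with 25.0, the training mean
print(mean_imputer.fit_transform(test_ages))  # wrong: re-fitting fills with 80.0, a test-set statistic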

To do this, we will just use the .transform method instead of the .fit_transform method.

test_keywords = imputer.transform(test_keywords)

We must also only use the .transform method to categorically encode our dataset. However, this is not because of data leakage.

Let’s say, for example, that our test dataset contains a keyword that is not in our training dataset. If we re-fit an encoder on our test dataset, that keyword would become a new column. This may not seem like a problem at first, but our model would not be able to deal with a set of columns it was not trained on, which would lead to an error in our code. Thus, we must never re-fit a categorical encoder on our test dataset. Once again, we will use the .transform method.
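A toy example (made-up keywords) shows why re-fitting breaks things: fitting on the test data changes the number and meaning of the output columns, so they no longer line up with what the classifier was trained on:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_kw = np.array([['fire'], ['flood'], ['storm']])
test_kw = np.array([['fire'], ['flood'], ['storm'], ['earthquake']])  # one unseen keyword

enc = OneHotEncoder(dtype=int)
print(enc.fit_transform(train_kw).shape)                      # (3, 3): three keyword columns
print(OneHotEncoder(dtype=int).fit_transform(test_kw).shape)  # (4, 4): four columns, a mismatch

Note that with the default handle_unknown='error', calling enc.transform on a truly unseen keyword would raise an error; OneHotEncoder(handle_unknown='ignore') is one way to guard against that, but it isn’t needed here as long as every test keyword also appears in the training data.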

test_keywords = ct.transform(test_keywords)

Now that we’ve done this, we can move onto cleaning the Tweets in our test dataset.

Step 3: Text Cleaning

This process should be identical to the procedure we carried out on the training dataset. If we cleaned our data in a different way, then we would not receive accurate results from the model. We must always preprocess our test data the same way as our training data.

test_corpus = []
for i in range(0, len(test_text)):
    tweet = re.sub('[^a-zA-Z]', ' ', test_text[i])
    tweet = tweet.lower()
    tweet = tweet.split()
    tweet = [ps.stem(word) for word in tweet if word not in set(all_stopwords)]
    tweet = ' '.join(tweet)
    test_corpus.append(tweet)

As you can see, our code is exactly the same as before, except we changed train_text to test_text and train_corpus to test_corpus.

Now, we must create our bag of words model. As with our categorical encoder, we must not re-fit the CountVectorizer on our test dataset, as that could lead to a different number of features being fed into our classifier. Thus, we will use the .transform method instead of the .fit_transform method.

test_corpus = cv.transform(test_corpus)

Step 4: Reshaping the Data

We’re super close to finishing our model, but we need to do one more thing before we create a submission. Like before, we need to transform test_keywords and test_corpus into Numpy arrays so that they can be concatenated for inputting into our classifier. This can be done just like before:

test_corpus = test_corpus.toarray()
test_keywords = test_keywords.toarray()

Once again, we will concatenate these two arrays along axis 1 to create x_test.

x_test = np.concatenate((test_keywords, test_corpus), axis=1)

We’re finally at the moment we’ve all been waiting for! We can run our classifier on x_test and create a submission file.

Predicting and Creating a Submission File

Predicting with the Classifier

As with all Scikit-Learn classifiers, we can input x_test and create a predictions array by using the .predict method of our classifier. Since we’ve already included all our independent variables in x_test, we don’t need to do any more reshaping or preprocessing.

Let’s implement this as shown below:

predictions = classifier.predict(x_test)

Now, we have a Numpy array called predictions!

Creating a Submission File

To submit our predictions to Kaggle, we need to create a .csv file with two columns:

  • The Tweet ID

  • Our prediction (1 or 0)

We actually already have arrays of both the ID and predictions column (we created ids at the beginning of the testing phase), so all we need to do is concatenate these two arrays together.

To do this, however, we’re going to have to reshape the datasets so that they both have the same number of dimensions. We can do this by using the .reshape method as shown below:

ids = ids.reshape(len(ids), 1)
predictions = predictions.reshape(len(predictions), 1)

Now, we can concatenate these two into another Numpy array: submission

submission = np.concatenate((ids, predictions), axis=1)

And finally, we can create submission.csv with Numpy’s savetxt function, making sure to write comma-separated values and the id,target header that Kaggle expects.

np.savetxt('submission.csv', submission, fmt='%d', delimiter=',', header='id,target', comments='')
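An equivalent way to build the submission, which some readers may find more natural, is to let Pandas handle the header and the commas. This sketch reuses the ids and predictions arrays created above:

# Alternative: build the submission with Pandas instead of np.savetxt.
submission_df = pd.DataFrame({'id': ids.ravel(), 'target': predictions.ravel()})
submission_df.to_csv('submission.csv', index=False)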

Conclusion

Results

After submitting submission.csv to Kaggle, I got an accuracy of 80.324 percent! This put me in 506th place out of 1500 competitors. Although this is a pretty good score, there is always room for improvement, and it’s worth trying to push the accuracy higher.

Potential Improvements

If you ran the code from my GitHub, you probably noticed how long the program took to finish. This is due to the high dimensionality of the dataset, which was caused by one-hot encoding the keywords. Fortunately, we could address this by using a different type of categorical encoder. One possibility is target encoding.
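As a rough sketch of how that might look (this uses Scikit-Learn’s TargetEncoder, which requires scikit-learn 1.3 or later, and the variable names for the imputed keyword arrays are illustrative, not code from my repository):

from sklearn.preprocessing import TargetEncoder  # requires scikit-learn 1.3+

# Replace each keyword with a smoothed estimate of its average target value,
# producing a single numeric column instead of hundreds of one-hot columns.
# train_keywords_imputed / test_keywords_imputed stand for the keyword arrays
# from just after the SimpleImputer step, before one-hot encoding.
target_enc = TargetEncoder()
train_keywords_te = target_enc.fit_transform(train_keywords_imputed, train_targets)
test_keywords_te = target_enc.transform(test_keywords_imputed)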

Although I picked the best-performing classifier among those I tested, there are still other classifiers out there that we could consider. For example, XGBoost is a powerful ensemble classifier that may be able to return an accuracy higher than that of the RBF SVM. We could also try creating artificial neural networks using Keras!
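For instance, a drop-in trial of XGBoost would look like the sketch below. This assumes the xgboost package is installed, and it is an illustration rather than something I benchmarked for this article:

# pip install xgboost
from xgboost import XGBClassifier

xgb = XGBClassifier(random_state=0)
xgb.fit(x_train, train_targets)
xgb_predictions = xgb.predict(x_test)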

And that’s it! We’ve successfully created a powerful natural language processing model to detect whether Tweets reference real disasters.

Please feel free to leave feedback in the comments so that I can fix any errors and improve for my next article.

The next project I will write about is the famous Titanic Survivor Classification Challenge. The goal for this project will be to successfully predict whether a passenger survived on the Titanic given some information about them.
