NLP: Predicting Real or Not with Disaster Tweets (Kaggle Competition)

Import required libraries for data manipulation
In [1]:
import pandas as pd
import numpy as np
Load the datasets
In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sample_submission.csv')
Have a look at the given data
In [3]:
train.head()
Out[3]:
id keyword location text target
0 1 NaN NaN Our Deeds are the Reason of this #earthquake M... 1
1 4 NaN NaN Forest fire near La Ronge Sask. Canada 1
2 5 NaN NaN All residents asked to 'shelter in place' are ... 1
3 6 NaN NaN 13,000 people receive #wildfires evacuation or... 1
4 7 NaN NaN Just got sent this photo from Ruby #Alaska as ... 1
In [4]:
test.head()
Out[4]:
id keyword location text
0 0 NaN NaN Just happened a terrible car crash
1 2 NaN NaN Heard about #earthquake is different cities, s...
2 3 NaN NaN there is a forest fire at spot pond, geese are...
3 9 NaN NaN Apocalypse lighting. #Spokane #wildfires
4 11 NaN NaN Typhoon Soudelor kills 28 in China and Taiwan
In [5]:
sample_submission.head()
Out[5]:
id target
0 0 0
1 2 0
2 3 0
3 9 0
4 11 0
In [6]:
train.isnull().sum()
Out[6]:
id             0
keyword       61
location    2533
text           0
target         0
dtype: int64
In [7]:
test.isnull().sum()
Out[7]:
id             0
keyword       26
location    1105
text           0
dtype: int64
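Both keyword and location have missing values, but only the raw text column is used in the models below, so they can be left as-is. If one did want to use them as features, a minimal fillna sketch (not applied anywhere in this notebook) could look like this:

# Hypothetical: only needed if keyword/location were used as features
for df in (train, test):
    df['keyword'] = df['keyword'].fillna('no_keyword')
    df['location'] = df['location'].fillna('no_location')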
In [8]:
train['target'].value_counts()
Out[8]:
0    4342
1    3271
Name: target, dtype: int64

The problem requires us to classify each tweet in the test data into one of two classes, 0 or 1.
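From the value counts above, the positive class makes up roughly 43% of the training data, so the classes are only mildly imbalanced. A quick sanity check (a minimal sketch):

# Proportion of disaster tweets (target == 1) in the training data
print(train['target'].mean())  # ~0.43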

Let's start by splitting our training data into train and validation sets.

In [9]:
from sklearn.model_selection import train_test_split
In [10]:
X = train.text.values
y = train.target.values
In [11]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                     stratify = y,
                                                     random_state = 0,
                                                     test_size = 0.2, shuffle=True)
In [12]:
print(X_train.shape)
print(X_valid.shape)
(6090,)
(1523,)
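Because stratify=y was used, both splits should keep roughly the same ~43% positive rate; a quick check (sketch):

# Positive-class rate in each split; both should be close to the overall ~0.43
print(y_train.mean(), y_valid.mean())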

Let's first build a basic TF-IDF model

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
In [14]:
tfv = TfidfVectorizer(min_df=3, max_features=None,
                      strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}',
                      ngram_range=(1, 3), use_idf=True,
                      smooth_idf=True, sublinear_tf=True,
                      stop_words='english')
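To make these settings concrete, here is a small illustration on a toy sentence (a sketch with min_df=1, since min_df=3 would filter out every term in a single sentence):

# Show the word uni-/bi-/tri-grams produced by the same token pattern and stop words
demo = TfidfVectorizer(strip_accents='unicode', analyzer='word',
                       token_pattern=r'\w{1,}', ngram_range=(1, 3),
                       stop_words='english', min_df=1)
demo.fit(["Forest fire near La Ronge Sask Canada"])
print(sorted(demo.vocabulary_))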
In [15]:
# Fit the TF-IDF vectorizer on both the train and validation text

tfv.fit(list(X_train) + list(X_valid))
X_train_tfv = tfv.transform(X_train)
X_valid_tfv = tfv.transform(X_valid)
In [16]:
# Build a simple logistic regression model

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1.0)
clf.fit(X_train_tfv, y_train)

predictions = clf.predict(X_valid_tfv)
In [17]:
# Check F1 score 
from sklearn.metrics import f1_score

f1_score(y_valid, predictions)
Out[17]:
0.7596638655462186
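F1 is the harmonic mean of precision and recall, so it is worth looking at both components as well; a quick sketch:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 on the validation split
print(classification_report(y_valid, predictions))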

That's a good score for a first simple model. Let's try the same model with a different feature representation: word counts instead of TF-IDF.

In [18]:
ctv = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                     ngram_range=(1, 3), stop_words='english')

ctv.fit(list(X_train) + list(X_valid))
X_train_ctv = ctv.transform(X_train)
X_valid_ctv = ctv.transform(X_valid)
In [19]:
# Fit and predict with Logistic Regression
clf = LogisticRegression(C=1.0)
clf.fit(X_train_ctv, y_train)

predictions = clf.predict(X_valid_ctv)
In [20]:
# Check Score
f1_score(y_valid, predictions)
Out[20]:
0.7299145299145301

Not an improvement over the first model. Next, let's try Multinomial Naive Bayes, first on the TF-IDF features.

In [21]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train_tfv, y_train)
predictions = clf.predict(X_valid_tfv)

f1_score(y_valid, predictions)
Out[21]:
0.7296819787985865

And the same Naive Bayes model on the count features:

In [22]:
clf = MultinomialNB()
clf.fit(X_train_ctv, y_train)
predictions = clf.predict(X_valid_ctv)

f1_score(y_valid, predictions)
Out[22]:
0.7465224111282843

Not an improvement either. Let's try an XGBClassifier.

In [23]:
from xgboost import XGBClassifier

clf = XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(X_train_tfv.tocsc(), y_train)
predictions = clf.predict(X_valid_tfv.tocsc())

f1_score(y_valid, predictions)
Out[23]:
0.7266495287060839
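Note that all of these comparisons rest on a single 80/20 split, which can be noisy. For a more robust comparison one could cross-validate the whole pipeline on the full training text, refitting the vectorizer inside each fold; a sketch (not run here):

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated F1 for the TF-IDF + Logistic Regression pipeline
pipe = make_pipeline(
    TfidfVectorizer(min_df=3, strip_accents='unicode', analyzer='word',
                    token_pattern=r'\w{1,}', ngram_range=(1, 3),
                    sublinear_tf=True, stop_words='english'),
    LogisticRegression(C=1.0))
print(cross_val_score(pipe, X, y, cv=5, scoring='f1').mean())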

So the Logistic Regression model on TF-IDF features has worked best for us. Let's create the submission file with that model.

In [24]:
clf = LogisticRegression(C=1.0)
clf.fit(X_train_tfv, y_train)

predictions = clf.predict(X_valid_tfv)
In [25]:
f1_score(y_valid, predictions)
Out[25]:
0.7596638655462186
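One thing to keep in mind: the classifier above is fit only on the 80% training split. Since tfv was already fit on all of the training text, one could optionally refit the final model on every labelled tweet before predicting the test set; a sketch (not used below):

# Optional: refit on all labelled data before predicting the test set
X_all_tfv = tfv.transform(X)
clf_full = LogisticRegression(C=1.0)
clf_full.fit(X_all_tfv, y)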
In [26]:
sample_submission.head()
Out[26]:
id target
0 0 0
1 2 0
2 3 0
3 9 0
4 11 0

Now transform the test data with the fitted vectorizer and predict.

In [27]:
test.head()
Out[27]:
id keyword location text
0 0 NaN NaN Just happened a terrible car crash
1 2 NaN NaN Heard about #earthquake is different cities, s...
2 3 NaN NaN there is a forest fire at spot pond, geese are...
3 9 NaN NaN Apocalypse lighting. #Spokane #wildfires
4 11 NaN NaN Typhoon Soudelor kills 28 in China and Taiwan
In [28]:
X_test = test.text.values
In [29]:
X_test_tfv = tfv.transform(X_test)
In [30]:
sample_submission['target'] = clf.predict(X_test_tfv)
In [31]:
sample_submission.head()
Out[31]:
id target
0 0 1
1 2 1
2 3 1
3 9 1
4 11 1
In [32]:
sample_submission.to_csv("submission.csv", index=False)
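Before uploading, a quick sanity check on the submission file (a sketch):

# The submission should have one row per test tweet and only 0/1 targets
print(sample_submission.shape, test.shape[0])
print(sample_submission['target'].value_counts())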

And we're done with a basic NLP classification model.