NLP: Predicting Real or Not with Disaster Tweets (Kaggle Competition)

Import required libraries for data manipulation
In [1]:
import pandas as pd
import numpy as np
Load the datasets
In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sample_submission.csv')
Have a look at the given data
In [3]:
train.head()
Out[3]:
id keyword location text target
0 1 NaN NaN Our Deeds are the Reason of this #earthquake M... 1
1 4 NaN NaN Forest fire near La Ronge Sask. Canada 1
2 5 NaN NaN All residents asked to 'shelter in place' are ... 1
3 6 NaN NaN 13,000 people receive #wildfires evacuation or... 1
4 7 NaN NaN Just got sent this photo from Ruby #Alaska as ... 1
In [4]:
test.head()
Out[4]:
id keyword location text
0 0 NaN NaN Just happened a terrible car crash
1 2 NaN NaN Heard about #earthquake is different cities, s...
2 3 NaN NaN there is a forest fire at spot pond, geese are...
3 9 NaN NaN Apocalypse lighting. #Spokane #wildfires
4 11 NaN NaN Typhoon Soudelor kills 28 in China and Taiwan
In [5]:
sample_submission.head()
Out[5]:
id target
0 0 0
1 2 0
2 3 0
3 9 0
4 11 0
In [6]:
train.isnull().sum()
Out[6]:
id             0
keyword       61
location    2533
text           0
target         0
dtype: int64
In [7]:
test.isnull().sum()
Out[7]:
id             0
keyword       26
location    1105
text           0
dtype: int64
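Both keyword and location have missing values, but only the raw text column is used in the models below, so they can be left as-is. If one did want to use them as features, a minimal fillna sketch (not applied anywhere in this notebook) could look like this:

# Hypothetical: only needed if keyword/location were used as features
for df in (train, test):
    df['keyword'] = df['keyword'].fillna('no_keyword')
    df['location'] = df['location'].fillna('no_location')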
In [8]:
train['target'].value_counts()
Out[8]:
0    4342
1    3271
Name: target, dtype: int64

The problem requires us to classify each tweet in the test data into one of two classes, 0 or 1.
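From the value counts above, the positive class makes up roughly 43% of the training data, so the classes are only mildly imbalanced. A quick sanity check (a minimal sketch):

# Proportion of disaster tweets (target == 1) in the training data
print(train['target'].mean())  # ~0.43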

Let's start by splitting our training data into train and validation sets.

In [9]:
from sklearn.model_selection import train_test_split
In [10]:
X = train.text.values
y = train.target.values
In [11]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                     stratify = y,
                                                     random_state = 0,
                                                     test_size = 0.2, shuffle=True)
In [12]:
print(X_train.shape)
print(X_valid.shape)
(6090,)
(1523,)
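Because stratify=y was used, both splits should keep roughly the same ~43% positive rate; a quick check (sketch):

# Positive-class rate in each split; both should be close to the overall ~0.43
print(y_train.mean(), y_valid.mean())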

Let's first build a basic TF-IDF model

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
In [14]:
tfv = TfidfVectorizer(min_df=3, max_features=None,
                      strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}',
                      ngram_range=(1, 3), use_idf=True,
                      smooth_idf=True, sublinear_tf=True,
                      stop_words='english')
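To make these settings concrete, here is a small illustration on a toy sentence (a sketch with min_df=1, since min_df=3 would filter out every term in a single sentence):

# Show the word uni-/bi-/tri-grams produced by the same token pattern and stop words
demo = TfidfVectorizer(strip_accents='unicode', analyzer='word',
                       token_pattern=r'\w{1,}', ngram_range=(1, 3),
                       stop_words='english', min_df=1)
demo.fit(["Forest fire near La Ronge Sask Canada"])
print(sorted(demo.vocabulary_))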
In [15]:
# Fit the TF-IDF vectorizer on both the train and validation text

tfv.fit(list(X_train) + list(X_valid))
X_train_tfv = tfv.transform(X_train)
X_valid_tfv = tfv.transform(X_valid)
In [16]:
# Build a simple logistic regression model

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1.0)
clf.fit(X_train_tfv, y_train)

predictions = clf.predict(X_valid_tfv)
In [17]:
# Check F1 score 
from sklearn.metrics import f1_score

f1_score(y_valid, predictions)
Out[17]:
0.7596638655462186
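F1 is the harmonic mean of precision and recall, so it is worth looking at both components as well; a quick sketch:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 on the validation split
print(classification_report(y_valid, predictions))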

That's a good score for a first simple model. Let's try the same model with a different feature representation: word counts instead of TF-IDF.

In [18]:
ctv = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                     ngram_range=(1, 3), stop_words='english')

ctv.fit(list(X_train) + list(X_valid))
X_train_ctv = ctv.transform(X_train)
X_valid_ctv = ctv.transform(X_valid)
In [19]:
# Fit and predict with Logistic Regression
clf = LogisticRegression(C=1.0)
clf.fit(X_train_ctv, y_train)

predictions = clf.predict(X_valid_ctv)
In [20]:
# Check Score
f1_score(y_valid, predictions)
Out[20]:
0.7299145299145301

Not an improvement over the first model. Next, let's try Multinomial Naive Bayes, first on the TF-IDF features.

In [21]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train_tfv, y_train)
predictions = clf.predict(X_valid_tfv)

f1_score(y_valid, predictions)
Out[21]:
0.7296819787985865

And the same Naive Bayes model on the count features:

In [22]:
clf = MultinomialNB()
clf.fit(X_train_ctv, y_train)
predictions = clf.predict(X_valid_ctv)

f1_score(y_valid, predictions)
Out[22]:
0.7465224111282843

Not an improvement either. Let's try an XGBClassifier.

In [23]:
from xgboost import XGBClassifier

clf = XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(X_train_tfv.tocsc(), y_train)
predictions = clf.predict(X_valid_tfv.tocsc())

f1_score(y_valid, predictions)
Out[23]:
0.7266495287060839
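Note that all of these comparisons rest on a single 80/20 split, which can be noisy. For a more robust comparison one could cross-validate the whole pipeline on the full training text, refitting the vectorizer inside each fold; a sketch (not run here):

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated F1 for the TF-IDF + Logistic Regression pipeline
pipe = make_pipeline(
    TfidfVectorizer(min_df=3, strip_accents='unicode', analyzer='word',
                    token_pattern=r'\w{1,}', ngram_range=(1, 3),
                    sublinear_tf=True, stop_words='english'),
    LogisticRegression(C=1.0))
print(cross_val_score(pipe, X, y, cv=5, scoring='f1').mean())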

So the Logistic Regression model on TF-IDF features has worked best for us. Let's create the submission file with that model.

In [24]:
clf = LogisticRegression(C=1.0)
clf.fit(X_train_tfv, y_train)

predictions = clf.predict(X_valid_tfv)
In [25]:
f1_score(y_valid, predictions)
Out[25]:
0.7596638655462186
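One thing to keep in mind: the classifier above is fit only on the 80% training split. Since tfv was already fit on all of the training text, one could optionally refit the final model on every labelled tweet before predicting the test set; a sketch (not used below):

# Optional: refit on all labelled data before predicting the test set
X_all_tfv = tfv.transform(X)
clf_full = LogisticRegression(C=1.0)
clf_full.fit(X_all_tfv, y)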
In [26]:
sample_submission.head()
Out[26]:
id target
0 0 0
1 2 0
2 3 0
3 9 0
4 11 0

Now transform the test data with the fitted vectorizer and predict.

In [27]:
test.head()
Out[27]:
id keyword location text
0 0 NaN NaN Just happened a terrible car crash
1 2 NaN NaN Heard about #earthquake is different cities, s...
2 3 NaN NaN there is a forest fire at spot pond, geese are...
3 9 NaN NaN Apocalypse lighting. #Spokane #wildfires
4 11 NaN NaN Typhoon Soudelor kills 28 in China and Taiwan
In [28]:
X_test = test.text.values
In [29]:
X_test_tfv = tfv.transform(X_test)
In [30]:
sample_submission['target'] = clf.predict(X_test_tfv)
In [31]:
sample_submission.head()
Out[31]:
id target
0 0 1
1 2 1
2 3 1
3 9 1
4 11 1
In [32]:
sample_submission.to_csv("submission.csv", index=False)
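Before uploading, a quick sanity check on the submission file (a sketch):

# The submission should have one row per test tweet and only 0/1 targets
print(sample_submission.shape, test.shape[0])
print(sample_submission['target'].value_counts())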

And we're done with a basic NLP classification model.