We will be using a subset of the LendingClub DataSet obtained from Kaggle: https://www.kaggle.com/wordsforthewise/lending-club
LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.
Given historical data on loans, with information on whether or not the borrower defaulted (charged off), can we build a model that can predict whether or not a borrower will pay back their loan? That way, when we get a new potential customer in the future, we can assess whether or not they are likely to pay back the loan. Keep classification metrics in mind when evaluating the performance of your model!
The "loan_status" column contains our label.
There are many LendingClub data sets on Kaggle. Here is the information on this particular data set:
 | LoanStatNew | Description
---|---|---
0 | loan_amnt | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value. |
1 | term | The number of payments on the loan. Values are in months and can be either 36 or 60. |
2 | int_rate | Interest Rate on the loan |
3 | installment | The monthly payment owed by the borrower if the loan originates. |
4 | grade | LC assigned loan grade |
5 | sub_grade | LC assigned loan subgrade |
6 | emp_title | The job title supplied by the Borrower when applying for the loan.* |
7 | emp_length | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years. |
8 | home_ownership | The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER |
9 | annual_inc | The self-reported annual income provided by the borrower during registration. |
10 | verification_status | Indicates if income was verified by LC, not verified, or if the income source was verified |
11 | issue_d | The month which the loan was funded |
12 | loan_status | Current status of the loan |
13 | purpose | A category provided by the borrower for the loan request. |
14 | title | The loan title provided by the borrower |
15 | zip_code | The first 3 numbers of the zip code provided by the borrower in the loan application. |
16 | addr_state | The state provided by the borrower in the loan application |
17 | dti | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. |
18 | earliest_cr_line | The month the borrower's earliest reported credit line was opened |
19 | open_acc | The number of open credit lines in the borrower's credit file. |
20 | pub_rec | Number of derogatory public records |
21 | revol_bal | Total credit revolving balance |
22 | revol_util | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit. |
23 | total_acc | The total number of credit lines currently in the borrower's credit file |
24 | initial_list_status | The initial listing status of the loan. Possible values are W, F |
25 | application_type | Indicates whether the loan is an individual application or a joint application with two co-borrowers |
26 | mort_acc | Number of mortgage accounts. |
27 | pub_rec_bankruptcies | Number of public record bankruptcies |
import pandas as pd
data_info = pd.read_csv('lending_club_info.csv',index_col='LoanStatNew')
print(data_info.loc['revol_util']['Description'])
def feat_info(col_name):
    '''Print the description of a feature column from the data dictionary.'''
    print(data_info.loc[col_name]['Description'])
feat_info('mort_acc')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# might be needed depending on your version of Jupyter
%matplotlib inline
df = pd.read_csv('lending_club_loan_two.csv')
df.info()
TASK: Since we will be attempting to predict loan_status, create a countplot as shown below.
sns.countplot(x='loan_status', data=df)
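The plot shows a noticeable class imbalance between "Fully Paid" and "Charged Off". A quick numeric check makes this concrete, and is worth remembering when judging classification metrics later:
# Fraction of loans per class; with an imbalanced label, accuracy alone can be misleading.
df['loan_status'].value_counts(normalize=True)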
TASK: Create a histogram of the loan_amnt column.
plt.figure(figsize=(12,6))
sns.histplot(df['loan_amnt'], bins=50)
TASK: Let's explore correlations between the continuous feature variables. Calculate the correlation between all continuous numeric variables using the .corr() method.
df.corr(numeric_only=True)  # numeric_only restricts to numeric columns (required in newer pandas)
TASK: Visualize this using a heatmap. Depending on your version of matplotlib, you may need to manually adjust the heatmap.
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
TASK: You should have noticed almost perfect correlation with the "installment" feature. Explore this feature further. Print out their descriptions and perform a scatterplot between them. Does this relationship make sense to you? Do you think there is duplicate information here?
feat_info('installment')
feat_info('loan_amnt')
sns.scatterplot(x='installment', y='loan_amnt', data=df)
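To put a number on what the scatterplot suggests, we can compute the pairwise correlation directly:
# A value near 1.0 confirms installment and loan_amnt carry largely duplicate information.
df[['installment', 'loan_amnt']].corr()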
TASK: Create a boxplot showing the relationship between the loan_status and the Loan Amount.
sns.boxplot(x='loan_status', y='loan_amnt', data=df)
TASK: Let's explore the Grade and SubGrade columns that LendingClub attributes to the loans. What are the unique possible grades and subgrades?
df.head(1)
df['grade'].unique()
df['sub_grade'].unique()
TASK: Create a countplot per grade. Set the hue to the loan_status label.
sns.countplot(x='grade', data=df, hue='loan_status', order=sorted(df['grade'].unique()))
TASK: Display a count plot per subgrade. You may need to resize this plot and reorder the x axis. Feel free to edit the color palette. Explore all loans made per subgrade, as well as separating them based on the loan_status. After creating this plot, go ahead and create a similar plot, but set hue='loan_status'.
plt.figure(figsize=(12,6))
sns.countplot(x='sub_grade', data=df, order=sorted(df['sub_grade'].unique()), hue='loan_status')
TASK: It looks like F and G subgrades don't get paid back that often. Isolate those and recreate the countplot just for those subgrades.
f_g = df[(df['grade'] == 'G') | (df['grade'] == 'F')]
plt.figure(figsize=(12,6))
subgrade_order = sorted(f_g['sub_grade'].unique())
sns.countplot(x='sub_grade',data=f_g,order = subgrade_order,hue='loan_status')
TASK: Create a new column called 'loan_repaid' which will contain a 1 if the loan status was "Fully Paid" and a 0 if it was "Charged Off".
df['loan_status'].unique()
df['loan_repaid'] = df['loan_status'].map({'Fully Paid': 1, 'Charged Off': 0})
df[['loan_repaid','loan_status']]
df.corr(numeric_only=True)['loan_repaid'].sort_values().drop('loan_repaid').plot(kind='bar')
Section Goals: Remove or fill any missing data. Remove unnecessary or repetitive features. Convert categorical string features to dummy variables.
df.head()
Let's explore the columns with missing data. We'll use a variety of factors to decide whether they would be useful, and whether we should keep, discard, or fill in the missing data.
TASK: What is the length of the dataframe?
df.shape[0]
TASK: Create a Series that displays the total count of missing values per column.
df.isnull().sum()
TASK: Convert this Series to be in terms of the percentage of the total DataFrame.
df.isnull().sum() / len(df) * 100
TASK: Let's examine emp_title and emp_length to see whether it will be okay to drop them. Print out their feature information using the feat_info() function from the top of this notebook.
feat_info('emp_title')
print('\n')
feat_info('emp_length')
TASK: How many unique employment job titles are there?
df['emp_title'].nunique()
df['emp_title'].value_counts()
TASK: Realistically there are too many unique job titles to try to convert this to a dummy variable feature. Let's remove that emp_title column.
df = df.drop('emp_title',axis=1)
df.columns
TASK: Create a count plot of the emp_length feature column. Challenge: Sort the order of the values.
sorted(df['emp_length'].dropna().unique())
emp_order = ['< 1 year',
'1 year',
'2 years',
'3 years',
'4 years',
'5 years',
'6 years',
'7 years',
'8 years',
'9 years',
'10+ years']
plt.figure(figsize=(12,6))
sns.countplot(x='emp_length', data=df, order=emp_order)
TASK: Plot out the countplot with a hue separating Fully Paid vs Charged Off
plt.figure(figsize=(12,6))
sns.countplot(x='emp_length', data=df, order=emp_order, hue='loan_status')
CHALLENGE TASK: This still doesn't really inform us whether there is a strong relationship between employment length and being charged off; what we want is the percentage of charge-offs per category, essentially telling us what percent of people per employment category didn't pay back their loan. There are a multitude of ways to create this Series. Once you've created it, see if you can visualize it with a bar plot. This may be tricky; refer to the solutions if you get stuck on creating this Series.
emp_co = df[df['loan_status'] == 'Charged Off'].groupby('emp_length').count()['loan_status']
emp_fp = df[df['loan_status'] == 'Fully Paid'].groupby('emp_length').count()['loan_status']
# Fraction of loans charged off per employment length (charged off / total)
emp_len = emp_co / (emp_co + emp_fp)
emp_len
emp_len.plot(kind='bar')
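Since we already created the loan_repaid column, an equivalent one-liner works as a sketch on the same data: the mean of loan_repaid per group is the repayment rate, so one minus that is the charge-off rate.
# Charge-off rate per employment length via the 0/1 loan_repaid label.
(1 - df.groupby('emp_length')['loan_repaid'].mean()).plot(kind='bar')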
TASK: Charge off rates are extremely similar across all employment lengths. Go ahead and drop the emp_length column.
df = df.drop('emp_length', axis=1)
df.columns
TASK: Revisit the DataFrame to see what feature columns still have missing data.
df.isnull().sum()
TASK: Review the title column vs the purpose column. Is this repeated information?
df['purpose'].head(10)
df['title'].head(10)
TASK: The title column is simply a string subcategory/description of the purpose column. Go ahead and drop the title column.
df = df.drop('title', axis=1)
df.columns
NOTE: This is one of the hardest parts of the project! Refer to the solutions video if you need guidance, feel free to fill or drop the missing values of the mort_acc however you see fit! Here we're going with a very specific approach.
TASK: Find out what the mort_acc feature represents
feat_info('mort_acc')
TASK: Create a value_counts of the mort_acc column.
df['mort_acc'].value_counts()
TASK: There are many ways we could deal with this missing data. We could attempt to build a simple model to fill it in, such as a linear model; we could fill it in based on the mean of the other columns; or we could even bin the column into categories and then set NaN as its own category. There is no 100% correct approach! Let's review the other columns to see which most highly correlates to mort_acc.
df.corr(numeric_only=True)['mort_acc'].sort_values()
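For reference, the simpler alternatives mentioned above might look like the sketch below; it's illustrative only (the bin edges are arbitrary assumptions) and we won't run it here.
# Option A: fill with the overall mean of mort_acc.
# df['mort_acc'] = df['mort_acc'].fillna(df['mort_acc'].mean())
# Option B: bin mort_acc and treat NaN as its own category (bin edges are arbitrary).
# df['mort_acc_bin'] = pd.cut(df['mort_acc'], bins=[-1, 0, 2, 5, 50]).cat.add_categories('Missing').fillna('Missing')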
TASK: Looks like the total_acc feature correlates most highly with mort_acc; this makes sense! Let's try this fillna() approach: group the DataFrame by total_acc and calculate the mean mort_acc per total_acc entry, to get the result below:
df.groupby('total_acc').mean(numeric_only=True)['mort_acc']
total_acc_avg = df.groupby('total_acc').mean(numeric_only=True)['mort_acc']
def fill_mort_acc(total_acc,mort_acc):
'''
Accepts the total_acc and mort_acc values for the row.
Checks if the mort_acc is NaN, if so, it returns the avg mort_acc value
for the corresponding total_acc value for that row
total_acc_avg here should be a Series or dictionary containing the mapping of the groupby averages
of mort_acc per total_acc values.
'''
if np.isnan(mort_acc):
return total_acc_avg[total_acc]
else:
return mort_acc
df['mort_acc'] = df.apply(lambda x: fill_mort_acc(x['total_acc'], x['mort_acc']), axis=1)
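A vectorized alternative to the row-wise apply (same result, typically much faster) would map each row's total_acc through the total_acc_avg Series and fill only the missing values:
# Equivalent sketch: map looks up the group average, fillna uses it only where mort_acc is NaN.
# df['mort_acc'] = df['mort_acc'].fillna(df['total_acc'].map(total_acc_avg))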
df.isnull().sum()
TASK: revol_util and the pub_rec_bankruptcies have missing data points, but they account for less than 0.5% of the total data. Go ahead and remove the rows that are missing those values in those columns with dropna().
df = df.dropna()
df.isnull().sum()
We're done working with the missing data! Now we just need to deal with the string values due to the categorical columns.
TASK: List all the columns that are currently non-numeric.
df.select_dtypes(['object']).columns
Let's now go through all the string features to see what we should do with them.
TASK: Convert the term feature into either a 36 or 60 integer numeric data type using .apply() or .map().
df['term'].value_counts()
df['term'] = df['term'].apply(lambda term: int(term[:3]))
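The task also mentions .map(); assuming the raw values are the strings ' 36 months' and ' 60 months' (note the leading space in this data set), an equivalent sketch:
# Equivalent .map() version; the exact key strings are an assumption about the raw values.
# df['term'] = df['term'].map({' 36 months': 36, ' 60 months': 60})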
TASK: We already know grade is part of sub_grade, so just drop the grade feature.
df = df.drop('grade',axis=1)
df.columns
TASK: Convert the subgrade into dummy variables. Then concatenate these new columns to the original dataframe. Remember to drop the original subgrade column and to add drop_first=True to your get_dummies call.
subgrade_dummies = pd.get_dummies(df['sub_grade'], drop_first=True)
df = pd.concat([df.drop('sub_grade', axis=1), subgrade_dummies], axis=1)
df.columns
df.select_dtypes(['object']).columns
TASK: Convert these columns: ['verification_status', 'application_type','initial_list_status','purpose'] into dummy variables and concatenate them with the original dataframe. Remember to set drop_first=True and to drop the original columns.
dummies = pd.get_dummies(df[['verification_status', 'application_type', 'initial_list_status','purpose']],
drop_first=True)
df = df.drop(['verification_status', 'application_type', 'initial_list_status','purpose'], axis=1)
df = pd.concat([df, dummies], axis=1)
df.select_dtypes(['object']).columns
TASK: Review the value_counts for the home_ownership column.
df['home_ownership'].value_counts()
TASK: Convert these to dummy variables, but replace NONE and ANY with OTHER, so that we end up with just 4 categories, MORTGAGE, RENT, OWN, OTHER. Then concatenate them with the original dataframe. Remember to set drop_first=True and to drop the original columns.
df['home_ownership'] = df['home_ownership'].replace(['NONE', 'ANY'], 'OTHER')
dummies = pd.get_dummies(df['home_ownership'], drop_first=True)
df = df.drop('home_ownership', axis=1)
df = pd.concat([df,dummies], axis=1)
df.select_dtypes(['object']).columns
TASK: Let's feature engineer a zip code column from the address in the data set. Create a column called 'zip_code' that extracts the zip code from the address column.
df['address'].head()
df['zip_code'] = df['address'].apply(lambda address: address[-5:])
df['zip_code'].head()
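If you'd rather not assume the zip code is exactly the last five characters, a regex with pandas string methods works too (a sketch assuming a 5-digit zip at the end of each address):
# Extract a trailing 5-digit zip code instead of slicing positionally.
# df['zip_code'] = df['address'].str.extract(r'(\d{5})$')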
TASK: Now make this zip_code column into dummy variables using pandas. Concatenate the result and drop the original zip_code column along with dropping the address column.
dummies = pd.get_dummies(df['zip_code'], drop_first=True)
df = df.drop(['zip_code', 'address'], axis=1)
df = pd.concat([df, dummies], axis=1)
df.select_dtypes(['object']).columns
TASK: The issue_d column would be data leakage: we wouldn't know beforehand whether or not a loan would be issued when using our model, so in theory we wouldn't have an issue date. Drop this feature.
df.drop('issue_d', axis=1, inplace=True)
df.select_dtypes(['object']).columns
TASK: This appears to be a historical timestamp feature. Extract the year from this feature using an .apply() function, then convert it to a numeric feature. Set this new data to a feature column called 'earliest_cr_year'. Then drop the earliest_cr_line feature.
df['earliest_cr_year'] = df['earliest_cr_line'].apply(lambda date: int(date[-4:]))
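If you'd rather parse the date properly, pd.to_datetime works as well (a sketch assuming the 'Mon-YYYY' format, e.g. 'Jun-1990', used in this column):
# Parse the string to a datetime and pull out the year; the format string is an assumption.
# df['earliest_cr_year'] = pd.to_datetime(df['earliest_cr_line'], format='%b-%Y').dt.year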
df.drop('earliest_cr_line', axis=1, inplace=True)
df.select_dtypes(['object']).columns
TASK: Import train_test_split from sklearn.
from sklearn.model_selection import train_test_split
TASK: Drop the loan_status column, since it's a duplicate of the loan_repaid column we created earlier. We'll use the loan_repaid column since it's already in 0s and 1s.
df.drop('loan_status', axis=1, inplace=True)
TASK: Set X and y variables to the .values of the features and label.
y = df['loan_repaid'].values
X = df.drop('loan_repaid', axis=1).values
TASK: Perform a train/test split with test_size=0.2 and a random_state of 101.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
TASK: Use a MinMaxScaler to normalize the feature data X_train and X_test. Recall we don't want data leakage from the test set, so we only fit on the X_train data.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
TASK: Run the cell below to import the necessary Keras functions.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
TASK: Build a sequential model that will be trained on the data. You have unlimited options here, but here is what the solution uses: a model that goes 78 --> 39 --> 19 --> 1 output neuron. OPTIONAL: Explore adding Dropout layers.
model = Sequential()
# input Layer
model.add(Dense(78, activation='relu'))
model.add(Dropout(0.2))
# hidden Layer
model.add(Dense(39, activation='relu'))
model.add(Dropout(0.2))
# hidden Layer
model.add(Dense(19, activation='relu'))
model.add(Dropout(0.2))
# output Layer
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
TASK: Fit the model to the training data for at least 25 epochs. Also add in the validation data for later plotting. Optional: add in a batch_size of 256.
model.fit(x= X_train,
y= y_train,
epochs=25,
batch_size=256,
validation_data=(X_test,y_test))
TASK: Plot out the validation loss versus the training loss.
losses = pd.DataFrame(model.history.history)
losses[['loss', 'val_loss']].plot()
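If the validation loss starts climbing while the training loss keeps falling, the model is overfitting. One common remedy, not used in the solution above and shown here only as an optional sketch, is Keras's EarlyStopping callback:
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once val_loss stops improving, and keep the best weights seen.
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# model.fit(X_train, y_train, epochs=100, batch_size=256,
#           validation_data=(X_test, y_test), callbacks=[early_stop])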
TASK: Create predictions from the X_test set and display a classification report and confusion matrix for the X_test set.
from sklearn.metrics import classification_report, confusion_matrix
preds = (model.predict(X_test)>0.5).astype('int32')
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))
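Since the problem statement asked us to keep classification metrics in mind, a threshold-independent metric like ROC AUC is also worth checking (a quick sketch using the raw predicted probabilities):
from sklearn.metrics import roc_auc_score

# Use the sigmoid outputs (probabilities), not the thresholded 0/1 predictions, for AUC.
print(roc_auc_score(y_test, model.predict(X_test).ravel()))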
TASK: Given the customer below, would you offer this person a loan?
import random
random.seed(101)
random_ind = random.randint(0, len(df) - 1)
new_customer = df.drop('loan_repaid',axis=1).iloc[random_ind]
new_customer
# The model was trained on MinMax-scaled features, so scale this customer's values first.
(model.predict(scaler.transform(new_customer.values.reshape(1, 78))) > 0.5).astype('int32')
TASK: Now check, did this person actually end up paying back their loan?
df.iloc[random_ind]['loan_repaid']