Step 1: Import Data

In [ ]:
# import packages
import pandas as pd
import numpy as np

# suppress warnings so cell outputs stay readable
import warnings
warnings.filterwarnings('ignore')

# read the file; low_memory=False silences the mixed-dtype warning pandas raises on this file
df = pd.read_csv("LoansTrainingSet.csv", low_memory=False)

i) Rename columns for convenience

In [ ]:
# rename columns for typing convenience
df.columns = ['loanid', 'custid', 'loan_status', 'loan_amount', 'term', 'credit_score', 'years_in_job', 'home_ownership', 'income', 'purpose', 'monthly_debt', 'years_credhistory', 'months_since_del', 'nr_accounts', 'nr_problems', 'credit_balance', 'max_credit', 'bankr', 'tax_liens']

ii) Eliminate Duplicate Rows

We eliminate roughly 16k duplicate rows from the dataset (rows that are exactly identical). For each loan ID there should be only one row of data, which we will use to predict the probability of default for that loan.

In [ ]:
# eliminate exactly identical duplicate rows
print(df.shape)  # ~256k rows
df.drop_duplicates(inplace=True)
print(df.shape)  # ~240k rows remain after exact duplicates are dropped
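
Exact-duplicate removal does not by itself guarantee one row per loan ID; a quick check (a sketch using the renamed loanid column) shows how many near-duplicates remain for the per-ID dedup performed later:

In [ ]:
# count loan IDs that still appear on more than one row; these near-duplicates
# (same loan, conflicting field values) are handled in the cleaning steps below
print(df.loanid.duplicated().sum())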

Step 2: Data Cleaning

Target Variable: make loan_status easier to use by converting it to zeros and ones

In [ ]:
# map loan_status to 0/1 instead of 'Charged Off'/'Fully Paid'
z = {'Charged Off': 0, 'Fully Paid': 1}
df.loan_status = df.loan_status.map(z)
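
Given the later focus on recall, it is worth glancing at the class balance right away; a minimal sketch:

In [ ]:
# share of each class in the target; charged-off loans (0) are presumably
# the minority class, which motivates optimizing recall later on
print(df.loan_status.value_counts(normalize=True))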

Data Overview and Missing Values

  • There are numerical and non-numerical variables.
  • ~16k exactly identical duplicate rows were eliminated with drop_duplicates above.
  • ~59k of the remaining entries are missing credit_score and income; there are further missing values in months_since_del, bankr, and tax_liens (verified in the sketch below).
OBSERVATIONS (to-dos for later iterations):
  • We could drop the months_since_del column, or drop the affected rows.
  • Examine partial dependence plots or feature correlations.
  • Plot the training error against the test error.
  • years_in_job could be made numeric, replacing the n/a entries with zero.
  • Use grid search with cross-validation to choose parameters and feature columns, optimizing recall instead of accuracy.
  • Alternatively, tune the recall/precision trade-off via the predicted probabilities.
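
Before dropping anything, the missing-value counts summarized above can be verified directly; a minimal sketch:

In [ ]:
# per-column missing-value counts; credit_score and income are missing together,
# and months_since_del is absent for the majority of rows
print(df.isnull().sum())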
In [ ]:
# eliminate rows with missing values in the key columns
df.dropna(subset=['credit_score', 'income', 'bankr', 'tax_liens'], inplace=True)

Validation and Feature Engineering

I) CATEGORICAL VARIABLES

Drop duplicate loan IDs so that each loan keeps a single row

In [6]:
df = df.drop_duplicates(subset = 'loanid')

Years in Job: create bins for years_in_job to form the new column 'years_in_job2'

In [7]:
w = {'10+ years': 'Over 10 years', '< 1 year': 'Less than 1 Year',
     '1 year': '1-5', '2 years': '1-5', '3 years': '1-5', '4 years': '1-5', '5 years': '1-5',
     '6 years': '6-10', '7 years': '6-10', '8 years': '6-10', '9 years': '6-10',
     'n/a': 'Not Applicable'}
df['years_in_job2'] = df.years_in_job.map(w)

Home Ownership: Replace HaveMortgage with Home Mortgage

In [8]:
df.loc[df.home_ownership == 'HaveMortgage', 'home_ownership'] = 'Home Mortgage'

II) NUMERICAL VARIABLES

loan_amount: drop ~35k invalid placeholder entries of 999999 by keeping only amounts under 200,000

In [9]:
df = df[df.loan_amount < 200000]

Credit Scores:

Correct credit scores over 800 (divide them by 10)

In [10]:
# scores recorded with an extra digit (e.g. 7410 instead of 741) are rescaled
df.credit_score = df.credit_score.apply(lambda x: x / 10 if x > 800 else x)
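
A quick sanity check (a sketch) that the rescaled scores fall back into the conventional range:

In [ ]:
# after rescaling, the maximum score should be back under ~850
print(df.credit_score.describe())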

Income

Drop 14 rows with income over 1,000,000

In [11]:
df = df[df.income < 1000000]

Transform the number of bankruptcies to object dtype (categorical)

In [12]:
#dummies will be created later for this feature
df.bankr = df.bankr.astype(object)

Monthly Debt

  • Clean up monthly_debt: strip the '$' and ',' characters and convert to float
  • Eliminate values over 6000
In [13]:
# strip currency formatting; regex=False treats '$' as a literal character
# rather than a regular-expression anchor
df.monthly_debt = df.monthly_debt.str.replace('$', '', regex=False)
df.monthly_debt = df.monthly_debt.str.replace(',', '', regex=False)
df.monthly_debt = df.monthly_debt.astype(float)
In [14]:
df = df[df.monthly_debt <= 6000]

Step 3: Select Variables for the Model and Get Dummies

In [15]:
df.shape
Out[15]:
(135646, 20)
In [16]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 135646 entries, 0 to 256983
Data columns (total 20 columns):
loanid               135646 non-null object
custid               135646 non-null object
loan_status          135646 non-null int64
loan_amount          135646 non-null int64
term                 135646 non-null object
credit_score         135646 non-null float64
years_in_job         135646 non-null object
home_ownership       135646 non-null object
income               135646 non-null float64
purpose              135646 non-null object
monthly_debt         135646 non-null float64
years_credhistory    135646 non-null float64
months_since_del     61328 non-null float64
nr_accounts          135646 non-null int64
nr_problems          135646 non-null int64
credit_balance       135646 non-null int64
max_credit           135646 non-null object
bankr                135646 non-null object
tax_liens            135646 non-null float64
years_in_job2        135646 non-null object
dtypes: float64(6), int64(5), object(9)
memory usage: 21.7+ MB
In [22]:
Xy = df.copy()
Xy.drop(['loanid', 'custid','years_credhistory', 'months_since_del', 'nr_accounts', 'nr_problems', 'credit_balance', 'max_credit'], axis = 1, inplace = True)
In [23]:
Xy.columns
Out[23]:
Index([u'loan_status', u'loan_amount', u'term', u'credit_score',
       u'years_in_job', u'home_ownership', u'income', u'purpose',
       u'monthly_debt', u'bankr', u'tax_liens', u'years_in_job2'],
      dtype='object')
In [ ]:
# TO DO: consider adding custid back in and engineering a debt-to-income
# feature (sketched below)
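
A minimal sketch of what that debt-to-income feature might look like; the column name dti and the annualization are assumptions, not part of the original notebook:

In [ ]:
# hypothetical engineered feature: annualized monthly debt relative to annual income
Xy['dti'] = (12 * Xy.monthly_debt) / Xy.income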

Step 4: Prepare the Model

In [ ]:
# template for defining the variables that go into the model, kept for reference:
# X = combined[['']]
# input_features = df[['Years of Credit History', 'Monthly Debt']]
# to_predict1 = df[['Annual Income']]
In [24]:
X = Xy.copy()
del X['loan_status']
y = Xy['loan_status']

Get Dummies

In [25]:
X.columns
Out[25]:
Index([u'loan_amount', u'term', u'credit_score', u'years_in_job',
       u'home_ownership', u'income', u'purpose', u'monthly_debt', u'bankr',
       u'tax_liens', u'years_in_job2'],
      dtype='object')
In [26]:
X = pd.get_dummies(X, drop_first=True)
In [27]:
X.head()
Out[27]:
loan_amount credit_score income monthly_debt tax_liens term_Short Term years_in_job_10+ years years_in_job_2 years years_in_job_3 years years_in_job_4 years ... bankr_1.0 bankr_2.0 bankr_3.0 bankr_4.0 bankr_5.0 bankr_6.0 years_in_job2_6-10 years_in_job2_Less than 1 Year years_in_job2_Not Applicable years_in_job2_Over 10 years
0 11520 741.0 33694.0 584.03 0.0 1 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 3441 734.0 42269.0 1106.04 0.0 1 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
2 21029 747.0 90126.0 1321.85 0.0 1 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 18743 747.0 38072.0 751.92 0.0 1 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 11731 746.0 50025.0 355.18 0.0 1 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 38 columns

OBSERVATIONS

  • Calculate the error on the training data at the same time as on the test data; a plot of both would be ideal (see the sketch after the train/test split below).
  • Engineer additional variables, such as debt to income.
In [28]:
from sklearn.model_selection import train_test_split as tts
X_train, X_test, y_train, y_test = tts(X, y, test_size=.2, random_state=33)
In [31]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# recursive feature elimination down to 3 features (slow on the full dataset)
model = GradientBoostingClassifier()
rfe = RFE(model, n_features_to_select=3)
fitting = rfe.fit(X, y)

print("Num features: %d" % fitting.n_features_)
In [ ]:
model2 = GradientBoostingClassifier()
model2.fit(X,y)
In [ ]:
print(model2.feature_importances_)
In [ ]:
# Fit Models: Logistic Reg, Decision Trees, Random Forest, Gradient Boosting Classifier, AdaBoost Classifier
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import AdaBoostClassifier 

log = LogisticRegression()
tree = DecisionTreeClassifier()
forest = RandomForestClassifier()
gradient = GradientBoostingClassifier()
ada = AdaBoostClassifier()
In [ ]:
# Optimize for recall instead of accuracy (sketched below)
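
A sketch of that recall-oriented tuning with GridSearchCV; the parameter grid values are illustrative assumptions, not tuned choices:

In [ ]:
# grid search that scores candidates on recall rather than accuracy
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 200], 'max_depth': [3, 5]}
grid = GridSearchCV(GradientBoostingClassifier(), param_grid, scoring='recall', cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)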
In [ ]:
l = log.fit(X_train, y_train)
t = tree.fit(X_train, y_train)
f = forest.fit(X_train, y_train)
g = gradient.fit(X_train, y_train)
a = ada.fit(X_train, y_train)
In [ ]:
print("The score for Logistic Regression is, ", l.score(X_test, y_test))
print("The score for Decision Trees is ", t.score(X_test, y_test))
print("The score for Random Forest is ", f.score(X_test, y_test))
print("The score for Gradient Boosting is ", g.score(X_test, y_test))
print("The score for AdaBoost is ", a.score(X_test, y_test))
In [ ]:
from sklearn.metrics import recall_score,precision_score,f1_score
In [ ]:
pred=g.predict(X_test)
In [ ]:
print(recall_score(y_test, pred))
print(precision_score(y_test, pred))
print(f1_score(y_test, pred))
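
Per the earlier note about the probability function, the recall/precision trade-off can also be tuned by moving the decision threshold instead of using the default 0.5; a minimal sketch:

In [ ]:
# vary the decision threshold on the predicted probabilities to trade
# precision against recall (predict uses 0.5 by default)
probs = g.predict_proba(X_test)[:, 1]
for thresh in [0.3, 0.5, 0.7]:
    pred_t = (probs >= thresh).astype(int)
    print(thresh, recall_score(y_test, pred_t), precision_score(y_test, pred_t))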
In [ ]:
from xgboost import XGBClassifier
from xgboost import plot_importance

Apply XGBoost

In [ ]:
# instantiate a model with hand-picked parameters
model = XGBClassifier(max_depth=5, learning_rate=.01, n_estimators=2000, nthread=-1,
                      min_child_weight=2, subsample=.6, colsample_bylevel=.5, seed=0)
# TO DO: also try colsample_bytree
In [ ]:
X_train.shape
In [ ]:
# rename the dummy columns: XGBoost rejects feature names containing
# '[', ']' or '<', so 'years_in_job_< 1 year' becomes 'Under 1 year'
X_train.columns = [u'loan_amount', u'credit_score', u'income', u'monthly_debt',
       u'tax_liens', u'term_Short Term', u'years_in_job_10+ years',
       u'years_in_job_2 years', u'years_in_job_3 years',
       u'years_in_job_4 years', u'years_in_job_5 years',
       u'years_in_job_6 years', u'years_in_job_7 years',
       u'years_in_job_8 years', u'years_in_job_9 years',
       u'years_in_job_Under 1 year', u'years_in_job_n/a',
       u'home_ownership_Own Home', u'home_ownership_Rent',
       u'purpose_Buy House', u'purpose_Buy a Car',
       u'purpose_Debt Consolidation', u'purpose_Educational Expenses',
       u'purpose_Home Improvements', u'purpose_Medical Bills',
       u'purpose_Other', u'purpose_Take a Trip', u'purpose_other',
       u'bankr_1.0', u'bankr_2.0', u'bankr_3.0', u'bankr_4.0', u'bankr_5.0',
       u'bankr_6.0', u'years_in_job2_6-10', u'years_in_job2_Less than 1 Year',
       u'years_in_job2_Not Applicable', u'years_in_job2_Over 10 years']
In [ ]:
model.fit(X_train, y_train)
In [ ]:
# apply the same cleaned column names to the test set
X_test.columns = X_train.columns
In [ ]:
print("The score for XGB is ", model.score(X_test, y_test))

Feature Importance w/ Zia

In [ ]:
# importance scores from the fitted gradient boosting model
# (note: this rebinds f, which previously held the fitted random forest)
f = g.feature_importances_
In [ ]:
len(f)
In [ ]:
feat_imp = pd.DataFrame(data={'Feature Name': X.columns, 'Feature Importance': f},
                        columns=['Feature Name', 'Feature Importance'])
In [ ]:
feat_imp.sort_values('Feature Importance', ascending = False, inplace = True)
In [ ]:
feat_imp