Step 1: Import Data

In [ ]:
# import packages
import pandas as pd
import numpy as np

# suppress warnings so cell outputs stay readable
import warnings
warnings.filterwarnings('ignore')

# read the file; low_memory=False silences the mixed-dtype warning pandas raises on this file
df = pd.read_csv("LoansTrainingSet.csv", low_memory=False)

i) Rename columns for convenience

In [ ]:
# rename columns for typing convenience
df.columns = ['loanid', 'custid', 'loan_status', 'loan_amount', 'term', 'credit_score', 'years_in_job', 'home_ownership', 'income', 'purpose', 'monthly_debt', 'years_credhistory', 'months_since_del', 'nr_accounts', 'nr_problems', 'credit_balance', 'max_credit', 'bankr', 'tax_liens']

ii) Eliminate Duplicate Rows

We eliminate roughly 16k duplicate rows from the dataset (rows that are exactly identical). For each loan ID there should be only one row of data, which we will use to predict the probability of default for that loan.

In [ ]:
# eliminate exactly identical duplicate rows
print(df.shape)  # ~256k rows
df.drop_duplicates(inplace=True)
print(df.shape)  # ~240k rows remain after exact duplicates are dropped
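
Exact-duplicate removal does not by itself guarantee one row per loan ID; a quick check (a sketch using the renamed loanid column) shows how many near-duplicates remain for the per-ID dedup performed later:

In [ ]:
# count loan IDs that still appear on more than one row; these near-duplicates
# (same loan, conflicting field values) are handled in the cleaning steps below
print(df.loanid.duplicated().sum())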

Step 2: Data Cleaning

Target Variable: make loan_status easier to use by converting it to zeros and ones

In [ ]:
# map loan_status to 0/1 instead of 'Charged Off'/'Fully Paid'
z = {'Charged Off': 0, 'Fully Paid': 1}
df.loan_status = df.loan_status.map(z)
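
Given the later focus on recall, it is worth glancing at the class balance right away; a minimal sketch:

In [ ]:
# share of each class in the target; charged-off loans (0) are presumably
# the minority class, which motivates optimizing recall later on
print(df.loan_status.value_counts(normalize=True))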

Data Overview and Missing Values

  • There are numerical and non-numerical variables.
  • ~16k exactly identical duplicate rows were eliminated with drop_duplicates above.
  • ~59k of the remaining entries are missing credit_score and income; there are further missing values in months_since_del, bankr, and tax_liens (verified in the sketch below).
OBSERVATIONS (to-dos for later iterations):
  • We could drop the months_since_del column, or drop the affected rows.
  • Examine partial dependence plots or feature correlations.
  • Plot the training error against the test error.
  • years_in_job could be made numeric, replacing the n/a entries with zero.
  • Use grid search with cross-validation to choose parameters and feature columns, optimizing recall instead of accuracy.
  • Alternatively, tune the recall/precision trade-off via the predicted probabilities.
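
Before dropping anything, the missing-value counts summarized above can be verified directly; a minimal sketch:

In [ ]:
# per-column missing-value counts; credit_score and income are missing together,
# and months_since_del is absent for the majority of rows
print(df.isnull().sum())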
In [ ]:
# eliminate rows with missing values in the key columns
df.dropna(subset=['credit_score', 'income', 'bankr', 'tax_liens'], inplace=True)

Validation and Feature Engineering

I) CATEGORICAL VARIABLES

Drop duplicate loan IDs so that each loan keeps a single row

In [6]:
df = df.drop_duplicates(subset = 'loanid')

Years in Job: create bins for years_in_job to form the new column 'years_in_job2'

In [7]:
w = {'10+ years': 'Over 10 years', '< 1 year': 'Less than 1 Year',
     '1 year': '1-5', '2 years': '1-5', '3 years': '1-5', '4 years': '1-5', '5 years': '1-5',
     '6 years': '6-10', '7 years': '6-10', '8 years': '6-10', '9 years': '6-10',
     'n/a': 'Not Applicable'}
df['years_in_job2'] = df.years_in_job.map(w)

Home Ownership: Replace HaveMortgage with Home Mortgage

In [8]:
df.loc[df.home_ownership == 'HaveMortgage', 'home_ownership'] = 'Home Mortgage'

II) NUMERICAL VARIABLES

loan_amount: drop ~35k invalid placeholder entries of 999999 by keeping only amounts under 200,000

In [9]:
df = df[df.loan_amount < 200000]

Credit Scores:

Correct credit scores over 800 (divide them by 10)

In [10]:
# scores recorded with an extra digit (e.g. 7410 instead of 741) are rescaled
df.credit_score = df.credit_score.apply(lambda x: x / 10 if x > 800 else x)
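
A quick sanity check (a sketch) that the rescaled scores fall back into the conventional range:

In [ ]:
# after rescaling, the maximum score should be back under ~850
print(df.credit_score.describe())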

Income

Drop 14 rows with income over 1,000,000

In [11]:
df = df[df.income < 1000000]

Transform the number of bankruptcies to object dtype (categorical)

In [12]:
#dummies will be created later for this feature
df.bankr = df.bankr.astype(object)

Monthly Debt

  • Clean up monthly_debt: strip the '$' and ',' characters and convert to float
  • Eliminate values over 6000
In [13]:
# strip currency formatting; regex=False treats '$' as a literal character
# rather than a regular-expression anchor
df.monthly_debt = df.monthly_debt.str.replace('$', '', regex=False)
df.monthly_debt = df.monthly_debt.str.replace(',', '', regex=False)
df.monthly_debt = df.monthly_debt.astype(float)
In [14]:
df = df[df.monthly_debt <= 6000]

Step 3: Select Variables for the Model and Get Dummies

In [15]:
df.shape
Out[15]:
(135646, 20)
In [16]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 135646 entries, 0 to 256983
Data columns (total 20 columns):
loanid               135646 non-null object
custid               135646 non-null object
loan_status          135646 non-null int64
loan_amount          135646 non-null int64
term                 135646 non-null object
credit_score         135646 non-null float64
years_in_job         135646 non-null object
home_ownership       135646 non-null object
income               135646 non-null float64
purpose              135646 non-null object
monthly_debt         135646 non-null float64
years_credhistory    135646 non-null float64
months_since_del     61328 non-null float64
nr_accounts          135646 non-null int64
nr_problems          135646 non-null int64
credit_balance       135646 non-null int64
max_credit           135646 non-null object
bankr                135646 non-null object
tax_liens            135646 non-null float64
years_in_job2        135646 non-null object
dtypes: float64(6), int64(5), object(9)
memory usage: 21.7+ MB
In [22]:
Xy = df.copy()
Xy.drop(['loanid', 'custid','years_credhistory', 'months_since_del', 'nr_accounts', 'nr_problems', 'credit_balance', 'max_credit'], axis = 1, inplace = True)
In [23]:
Xy.columns
Out[23]:
Index([u'loan_status', u'loan_amount', u'term', u'credit_score',
       u'years_in_job', u'home_ownership', u'income', u'purpose',
       u'monthly_debt', u'bankr', u'tax_liens', u'years_in_job2'],
      dtype='object')
In [ ]:
# TO DO: consider adding custid back in and engineering a debt-to-income
# feature (sketched below)
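
A minimal sketch of what that debt-to-income feature might look like; the column name dti and the annualization are assumptions, not part of the original notebook:

In [ ]:
# hypothetical engineered feature: annualized monthly debt relative to annual income
Xy['dti'] = (12 * Xy.monthly_debt) / Xy.income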

Step 4: Prepare the Model

In [ ]:
# template for defining the variables that go into the model, kept for reference:
# X = combined[['']]
# input_features = df[['Years of Credit History', 'Monthly Debt']]
# to_predict1 = df[['Annual Income']]
In [24]:
X = Xy.copy()
del X['loan_status']
y = Xy['loan_status']

Get Dummies

In [25]:
X.columns
Out[25]:
Index([u'loan_amount', u'term', u'credit_score', u'years_in_job',
       u'home_ownership', u'income', u'purpose', u'monthly_debt', u'bankr',
       u'tax_liens', u'years_in_job2'],
      dtype='object')
In [26]:
X = pd.get_dummies(X, drop_first=True)
In [27]:
X.head()
Out[27]:
loan_amount credit_score income monthly_debt tax_liens term_Short Term years_in_job_10+ years years_in_job_2 years years_in_job_3 years years_in_job_4 years ... bankr_1.0 bankr_2.0 bankr_3.0 bankr_4.0 bankr_5.0 bankr_6.0 years_in_job2_6-10 years_in_job2_Less than 1 Year years_in_job2_Not Applicable years_in_job2_Over 10 years
0 11520 741.0 33694.0 584.03 0.0 1 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 3441 734.0 42269.0 1106.04 0.0 1 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
2 21029 747.0 90126.0 1321.85 0.0 1 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 18743 747.0 38072.0 751.92 0.0 1 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 11731 746.0 50025.0 355.18 0.0 1 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 38 columns

OBSERVATIONS

  • Calculate the error on the training data at the same time as on the test data; a plot of both would be ideal (see the sketch after the train/test split below).
  • Engineer additional variables, such as debt to income.
In [28]:
from sklearn.model_selection import train_test_split as tts
X_train, X_test, y_train, y_test = tts(X, y, test_size=.2, random_state=33)
In [31]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# recursive feature elimination down to 3 features (slow on the full dataset)
model = GradientBoostingClassifier()
rfe = RFE(model, n_features_to_select=3)
fitting = rfe.fit(X, y)

print("Num features: %d" % fitting.n_features_)
In [ ]:
model2 = GradientBoostingClassifier()
model2.fit(X,y)
In [ ]:
print(model2.feature_importances_)
In [ ]:
# Fit Models: Logistic Reg, Decision Trees, Random Forest, Gradient Boosting Classifier, AdaBoost Classifier
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import AdaBoostClassifier 

log = LogisticRegression()
tree = DecisionTreeClassifier()
forest = RandomForestClassifier()
gradient = GradientBoostingClassifier()
ada = AdaBoostClassifier()
In [ ]:
# Optimize for recall instead of accuracy (sketched below)
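
A sketch of that recall-oriented tuning with GridSearchCV; the parameter grid values are illustrative assumptions, not tuned choices:

In [ ]:
# grid search that scores candidates on recall rather than accuracy
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 200], 'max_depth': [3, 5]}
grid = GridSearchCV(GradientBoostingClassifier(), param_grid, scoring='recall', cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)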
In [ ]:
l = log.fit(X_train, y_train)
t = tree.fit(X_train, y_train)
f = forest.fit(X_train, y_train)
g = gradient.fit(X_train, y_train)
a = ada.fit(X_train, y_train)
In [ ]:
print("The score for Logistic Regression is, ", l.score(X_test, y_test))
print("The score for Decision Trees is ", t.score(X_test, y_test))
print("The score for Random Forest is ", f.score(X_test, y_test))
print("The score for Gradient Boosting is ", g.score(X_test, y_test))
print("The score for AdaBoost is ", a.score(X_test, y_test))
In [ ]:
from sklearn.metrics import recall_score,precision_score,f1_score
In [ ]:
pred=g.predict(X_test)
In [ ]:
print(recall_score(y_test, pred))
print(precision_score(y_test, pred))
print(f1_score(y_test, pred))
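
Per the earlier note about the probability function, the recall/precision trade-off can also be tuned by moving the decision threshold instead of using the default 0.5; a minimal sketch:

In [ ]:
# vary the decision threshold on the predicted probabilities to trade
# precision against recall (predict uses 0.5 by default)
probs = g.predict_proba(X_test)[:, 1]
for thresh in [0.3, 0.5, 0.7]:
    pred_t = (probs >= thresh).astype(int)
    print(thresh, recall_score(y_test, pred_t), precision_score(y_test, pred_t))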
In [ ]:
from xgboost import XGBClassifier
from xgboost import plot_importance

Apply XGBoost

In [ ]:
# instantiate a model with hand-picked parameters
model = XGBClassifier(max_depth=5, learning_rate=.01, n_estimators=2000, nthread=-1,
                      min_child_weight=2, subsample=.6, colsample_bylevel=.5, seed=0)
# TO DO: also try colsample_bytree
In [ ]:
X_train.shape
In [ ]:
# rename the dummy columns: XGBoost rejects feature names containing
# '[', ']' or '<', so 'years_in_job_< 1 year' becomes 'Under 1 year'
X_train.columns = [u'loan_amount', u'credit_score', u'income', u'monthly_debt',
       u'tax_liens', u'term_Short Term', u'years_in_job_10+ years',
       u'years_in_job_2 years', u'years_in_job_3 years',
       u'years_in_job_4 years', u'years_in_job_5 years',
       u'years_in_job_6 years', u'years_in_job_7 years',
       u'years_in_job_8 years', u'years_in_job_9 years',
       u'years_in_job_Under 1 year', u'years_in_job_n/a',
       u'home_ownership_Own Home', u'home_ownership_Rent',
       u'purpose_Buy House', u'purpose_Buy a Car',
       u'purpose_Debt Consolidation', u'purpose_Educational Expenses',
       u'purpose_Home Improvements', u'purpose_Medical Bills',
       u'purpose_Other', u'purpose_Take a Trip', u'purpose_other',
       u'bankr_1.0', u'bankr_2.0', u'bankr_3.0', u'bankr_4.0', u'bankr_5.0',
       u'bankr_6.0', u'years_in_job2_6-10', u'years_in_job2_Less than 1 Year',
       u'years_in_job2_Not Applicable', u'years_in_job2_Over 10 years']
In [ ]:
model.fit(X_train, y_train)
In [ ]:
# apply the same cleaned column names to the test set
X_test.columns = X_train.columns
In [ ]:
print("The score for XGB is ", model.score(X_test, y_test))

Feature Importance w/ Zia

In [ ]:
# importance scores from the fitted gradient boosting model
# (note: this rebinds f, which previously held the fitted random forest)
f = g.feature_importances_
In [ ]:
len(f)
In [ ]:
feat_imp = pd.DataFrame(data={'Feature Name': X.columns, 'Feature Importance': f},
                        columns=['Feature Name', 'Feature Importance'])
In [ ]:
feat_imp.sort_values('Feature Importance', ascending = False, inplace = True)
In [ ]:
feat_imp