Decision tree example on Loan dataset
Step#01
Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
Step#02
I am using Google Colab, so mount Google Drive first so that the dataset can be read directly from it
from google.colab import drive
drive.mount('/content/drive')
Step#03
Read the dataset from the Drive path
df = pd.read_csv('/content/drive/MyDrive/dataset/loan_data.csv')
df.head() # shows the first 5 rows of the dataset
Step#04
First, convert the string values in the 'purpose' column to dummy variables, because 'purpose' is a categorical column.
That means we need to transform it so that sklearn will be able to understand it. Let's do this in one clean step using pd.get_dummies.
The approach below can be extended to multiple categorical features if necessary.
1. Create a list with one element, the string 'purpose'. Call this list cat_feats
cat_feats = ['purpose']
Step#05
2. Now use pd.get_dummies(df, columns=cat_feats, drop_first=True) to create a larger dataframe that has new feature columns with dummy variables. Set this dataframe as final_data.
final_data = pd.get_dummies(df, columns = cat_feats, drop_first = True)
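To see what drop_first=True actually does, here is a minimal sketch on a tiny made-up dataframe. The rows and the 'amount' column are invented for illustration only and are not taken from loan_data.csv.

```python
import pandas as pd

# Toy stand-in for the loan data; every value here is made up.
toy = pd.DataFrame({
    "purpose": ["credit_card", "debt_consolidation", "educational", "credit_card"],
    "amount": [100, 250, 80, 120],  # hypothetical numeric column
})

toy_dummies = pd.get_dummies(toy, columns=["purpose"], drop_first=True)

# drop_first=True drops the first category in sorted order ("credit_card"),
# so three categories become two indicator columns.
print(toy_dummies.columns.tolist())
```

A row where both indicator columns are 0/False is a 'credit_card' row, so no information is lost by dropping that column.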
Step#06
Compare the two datasets now
df.head()
final_data.head()
Step#07
Split the data into a training set and a test set
from sklearn.model_selection import train_test_split
X = final_data.drop('not.fully.paid', axis = 1) # features: every column except the target not.fully.paid
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20 , random_state= 101)
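The effect of test_size and random_state is easier to see on a small example. The sketch below uses made-up arrays, and it also passes stratify, which is my own addition and not part of the split above: it keeps the class proportions the same in both parts, which may help when the target is imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, 15 zeros and 5 ones as labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

# test_size=0.20 sends 4 of the 20 rows to the test set.
# stratify=y_toy is an optional extra, assumed here for illustration.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=101, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.sum(), y_te.sum())  # number of positive labels in each part
```

Because random_state is fixed, running the split again returns exactly the same rows in each part.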
Step#08
Let's start by training a single decision tree.
First import DecisionTreeClassifier,
then create an instance of DecisionTreeClassifier() called dtree and fit it to the training data.
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
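A tree created with the default settings keeps growing until the training data is fitted, and it can give slightly different results from run to run. As a rough sketch, here is how a depth limit and a fixed seed could be set. It uses synthetic data from make_classification rather than the loan dataset, and max_depth=3 and random_state=0 are illustrative values, not tuned ones.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the loan dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# max_depth limits how deep the tree can grow; random_state makes runs repeatable.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow_tree.fit(X_demo, y_demo)

print(shallow_tree.get_depth())           # never more than max_depth
print(shallow_tree.feature_importances_)  # one weight per feature
```

The same two arguments could be passed when creating dtree; whether a depth limit helps on the loan data is something to check on the test set.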
Step#09
Create predictions from the test set and create a classification report.
y_predict = dtree.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict))
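The classification report is built from the counts of correct and wrong predictions per class. A confusion matrix shows those raw counts directly. The labels below are made up by hand to keep the example small; on the real data you would pass y_test and y_predict instead.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only.
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

Here class 1 has one correct prediction out of two true cases, so its recall in the report would be 0.5.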