5.2.2 Notebook
The notebook below is an example that uses the models and methods from the suggestion lists. The score tables and the radar charts are the results of applying the evaluation system. The source code of this notebook is as follows:

import h2o
import psutil
from h2o.automl import H2OAutoML
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators import H2OGradientBoostingEstimator
import pandas as pd
import numpy as np
from h2o.estimators.random_forest import H2ORandomForestEstimator
import matplotlib.pyplot as plt

1. Initialize H2O

# Size the H2O JVM from the memory available on this machine
pct_memory = 4
virtual_memory = psutil.virtual_memory()
# multiplier * available bytes, converted to whole gigabytes (1073741824 bytes = 1 GB)
min_mem_size = int(round(int(pct_memory * virtual_memory.available) / 1073741824, 0))
print(min_mem_size)
8
h2o.init(strict_version_check=False,max_mem_size=min_mem_size)
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_152-release"; OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12); OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)
  Starting server from /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/25/079nl8ps1d17m_b9mm9cwnhm0000gn/T/tmpp_c51ipx
  JVM stdout: /var/folders/25/079nl8ps1d17m_b9mm9cwnhm0000gn/T/tmpp_c51ipx/h2o_guruisi_started_from_python.out
  JVM stderr: /var/folders/25/079nl8ps1d17m_b9mm9cwnhm0000gn/T/tmpp_c51ipx/h2o_guruisi_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.

H2O cluster uptime:         02 secs
H2O cluster timezone:       America/New_York
H2O data parsing timezone:  UTC
H2O cluster version:        3.24.0.2
H2O cluster version age:    1 year, 3 months and 23 days !!!
H2O cluster name:           H2O_from_python_guruisi_r1jxcs
H2O cluster total nodes:    1
H2O cluster free memory:    7.111 Gb
H2O cluster total cores:    4
H2O cluster allowed cores:  4
H2O cluster status:         accepting new members, healthy
H2O connection url:         http://127.0.0.1:54321
H2O connection proxy:       None
H2O internal security:      False
H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
Python version:             3.7.4 final

2. Data Preprocessing

# Insert an index column as 'Id' in order to slice the dataset for ICE plot
data_path = "../data/Churn_Train.csv"
df = pd.read_csv(data_path)
df['Id'] = df.index
df.to_csv("../data/Churn.csv", index = False)
# Read the dataset with 'Id' column in H2O
data_path = "../data/Churn.csv"
df = h2o.import_file(data_path)
df.shape
(6499, 22)
df.head(5)

| CustomerID | Gender | Senior Citizen | Partner | Dependents | Tenure | Phone Service | Multiple Lines | Internet Service | Online Security | Online Backup | Device Protection | Tech Support | Streaming TV | Streaming Movies | Contract | Paperless Billing | Payment Method | Monthly Charges | Total Charges | Churn | Id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | 0 |
| 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No | 1 |
| 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 2 |
| 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No | 3 |
| 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes | 4 |

Find the missing data. Missing values account for only 0.14% of the whole dataset, so dropping the affected rows is our solution.

df.isna().sum()
9.0
df = df.na_omit()
df.isna().sum()
0.0
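To see which columns the missing values actually come from before dropping them, H2O can report per-column NA counts. A minimal sketch (nacnt() returns one count per column, in the order of df.columns):

# Per-column NA counts on the H2OFrame
na_counts = dict(zip(df.columns, df.nacnt()))
for col, n in na_counts.items():
    if n > 0:
        print(col, int(n))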

Separate the target column from the feature columns. Target: 'Churn'. Non-features: 'CustomerID' and 'Id'.

What is the churn rate in the business area? The churn rate, also known as the rate of attrition or customer churn, is the rate at which customers stop doing business with an entity. It is most commonly expressed as the percentage of service subscribers who discontinue their subscriptions within a given time period. It is also the rate at which employees leave their jobs within a certain period. For a company to expand its clientele, its growth rate (measured by the number of new customers) must exceed its churn rate.
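As a concrete illustration of the definition, with hypothetical numbers (not taken from this dataset):

# Churn rate = customers lost during the period / customers at the start of the period
customers_at_start = 1000  # hypothetical
customers_lost = 50        # hypothetical
churn_rate = customers_lost / customers_at_start
print('churn rate: {:.1%}'.format(churn_rate))  # churn rate: 5.0%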

target = 'Churn'
X = list(set(df.columns) - set(['Churn']) - set(['CustomerID']) - set(['Id']))
X
['Online Backup',
 'Phone Service',
 'Partner',
 'Senior Citizen',
 'Contract',
 'Device Protection',
 'Tenure',
 'Gender',
 'Internet Service',
 'Online Security',
 'Monthly Charges',
 'Paperless Billing',
 'Streaming Movies',
 'Multiple Lines',
 'Payment Method',
 'Tech Support',
 'Dependents',
 'Total Charges',
 'Streaming TV']

Run H2O AutoML to get the model leaderboard. Since the target 'Churn' is binomial, the problem we are facing here is a binary classification problem.

aml = H2OAutoML(max_runtime_secs=600)
aml.train(x=X,y=target,training_frame=df)
AutoML progress: |████████████████████████████████████████████████████████| 100%
aml_leaderboard_df=aml.leaderboard.as_data_frame()
aml_leaderboard_df

| | model_id | auc | logloss | mean_per_class_error | rmse | mse |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | StackedEnsemble_BestOfFamily_AutoML_20200809_1... | 0.850180 | 0.417073 | 0.226777 | 0.366226 | 0.134121 |
| 1 | GBM_grid_1_AutoML_20200809_184844_model_18 | 0.850156 | 0.409682 | 0.230198 | 0.364476 | 0.132843 |
| 2 | GBM_grid_1_AutoML_20200809_184844_model_20 | 0.849776 | 0.410381 | 0.234533 | 0.364628 | 0.132953 |
| 3 | StackedEnsemble_AllModels_AutoML_20200809_184844 | 0.849452 | 0.417161 | 0.228774 | 0.366490 | 0.134315 |
| 4 | GBM_grid_1_AutoML_20200809_184844_model_16 | 0.849246 | 0.411673 | 0.238235 | 0.365142 | 0.133329 |
| 5 | GBM_grid_1_AutoML_20200809_184844_model_12 | 0.848980 | 0.452791 | 0.234555 | 0.381081 | 0.145222 |
| 6 | GBM_grid_1_AutoML_20200809_184844_model_14 | 0.848974 | 0.412404 | 0.225886 | 0.365270 | 0.133422 |
| 7 | GBM_grid_1_AutoML_20200809_224838_model_8 | 0.848855 | 0.431964 | 0.240782 | 0.372267 | 0.138582 |
| 8 | GBM_grid_1_AutoML_20200809_184844_model_3 | 0.848788 | 0.432952 | 0.237229 | 0.372787 | 0.138970 |
| 9 | GBM_grid_1_AutoML_20200809_184844_model_1 | 0.848785 | 0.434713 | 0.228669 | 0.373369 | 0.139405 |
| 10 | XGBoost_grid_1_AutoML_20200809_224838_model_9 | 0.848772 | 0.411119 | 0.232333 | 0.365420 | 0.133531 |
| 11 | GBM_grid_1_AutoML_20200809_224838_model_12 | 0.848654 | 0.411532 | 0.234209 | 0.365369 | 0.133494 |
| 12 | GBM_grid_1_AutoML_20200809_224838_model_7 | 0.848639 | 0.414071 | 0.236514 | 0.366018 | 0.133970 |
| 13 | GBM_grid_1_AutoML_20200809_224838_model_5 | 0.848601 | 0.431544 | 0.240111 | 0.371907 | 0.138315 |
| 14 | GBM_grid_1_AutoML_20200809_184844_model_13 | 0.848597 | 0.412046 | 0.233466 | 0.365329 | 0.133465 |
| 15 | XGBoost_grid_1_AutoML_20200809_224838_model_5 | 0.848483 | 0.415327 | 0.232322 | 0.365961 | 0.133928 |
| 16 | GBM_5_AutoML_20200809_184844 | 0.848407 | 0.411846 | 0.235760 | 0.365449 | 0.133553 |
| 17 | XGBoost_grid_1_AutoML_20200809_224838_model_2 | 0.848375 | 0.411610 | 0.229764 | 0.365684 | 0.133724 |
| 18 | XGBoost_3_AutoML_20200809_224838 | 0.848356 | 0.412230 | 0.229621 | 0.365567 | 0.133639 |
| 19 | GBM_grid_1_AutoML_20200809_224838_model_11 | 0.848199 | 0.413048 | 0.234907 | 0.365655 | 0.133704 |
| 20 | GBM_grid_1_AutoML_20200809_224838_model_15 | 0.848197 | 0.435540 | 0.229049 | 0.373651 | 0.139615 |
| 21 | GBM_grid_1_AutoML_20200809_184844_model_8 | 0.848169 | 0.441859 | 0.234264 | 0.376513 | 0.141762 |
| 22 | XGBoost_3_AutoML_20200809_184844 | 0.848065 | 0.412465 | 0.237157 | 0.365700 | 0.133736 |
| 23 | XGBoost_2_AutoML_20200809_184844 | 0.847559 | 0.413327 | 0.234434 | 0.366205 | 0.134106 |
| 24 | XGBoost_grid_1_AutoML_20200809_184844_model_10 | 0.847499 | 0.415598 | 0.237966 | 0.366557 | 0.134364 |
| 25 | GBM_grid_1_AutoML_20200809_184844_model_19 | 0.847337 | 0.436593 | 0.235980 | 0.374232 | 0.140050 |
| 26 | GBM_5_AutoML_20200809_224838 | 0.847264 | 0.412953 | 0.233433 | 0.366141 | 0.134059 |
| 27 | XGBoost_2_AutoML_20200809_224838 | 0.847197 | 0.413653 | 0.231425 | 0.366346 | 0.134210 |
| 28 | GBM_grid_1_AutoML_20200809_224838_model_4 | 0.846974 | 0.439954 | 0.240557 | 0.376069 | 0.141428 |
| 29 | XGBoost_grid_1_AutoML_20200809_224838_model_1 | 0.846642 | 0.412812 | 0.232162 | 0.366473 | 0.134303 |
| ... | ... | ... | ... | ... | ... | ... |
| 70 | GBM_grid_1_AutoML_20200809_224838_model_14 | 0.831422 | 0.432501 | 0.243522 | 0.375203 | 0.140777 |
| 71 | GBM_grid_1_AutoML_20200809_224838_model_13 | 0.830757 | 0.435942 | 0.248588 | 0.376414 | 0.141687 |
| 72 | GBM_4_AutoML_20200809_184844 | 0.830160 | 0.438904 | 0.251052 | 0.377504 | 0.142509 |
| 73 | DeepLearning_1_AutoML_20200809_224838 | 0.829825 | 0.436108 | 0.254523 | 0.378159 | 0.143004 |
| 74 | XGBoost_grid_1_AutoML_20200809_184844_model_3 | 0.829003 | 0.447995 | 0.250662 | 0.381721 | 0.145711 |
| 75 | DeepLearning_grid_1_AutoML_20200809_184844_mod... | 0.828073 | 0.505879 | 0.251965 | 0.388789 | 0.151157 |
| 76 | DeepLearning_grid_1_AutoML_20200809_224838_mod... | 0.827414 | 0.485080 | 0.249715 | 0.381528 | 0.145563 |
| 77 | DeepLearning_grid_1_AutoML_20200809_224838_mod... | 0.826863 | 0.463806 | 0.250040 | 0.382114 | 0.146011 |
| 78 | XGBoost_grid_1_AutoML_20200809_184844_model_2 | 0.826246 | 0.447182 | 0.254606 | 0.382063 | 0.145972 |
| 79 | XGBoost_grid_1_AutoML_20200809_224838_model_6 | 0.825827 | 0.453619 | 0.252576 | 0.382699 | 0.146458 |
| 80 | DeepLearning_grid_1_AutoML_20200809_224838_mod... | 0.824306 | 0.460778 | 0.256933 | 0.383795 | 0.147299 |
| 81 | XRT_1_AutoML_20200809_184844 | 0.822605 | 0.512764 | 0.259865 | 0.379563 | 0.144068 |
| 82 | XRT_1_AutoML_20200809_224838 | 0.821975 | 0.498749 | 0.259430 | 0.379496 | 0.144018 |
| 83 | GBM_grid_1_AutoML_20200809_224838_model_17 | 0.821147 | 0.569075 | 0.251833 | 0.437177 | 0.191124 |
| 84 | DRF_1_AutoML_20200809_224838 | 0.820474 | 0.522540 | 0.256855 | 0.380883 | 0.145072 |
| 85 | GBM_grid_1_AutoML_20200809_184844_model_5 | 0.819377 | 0.465534 | 0.260789 | 0.387906 | 0.150471 |
| 86 | DRF_1_AutoML_20200809_184844 | 0.819231 | 0.538513 | 0.255728 | 0.380744 | 0.144966 |
| 87 | GBM_grid_1_AutoML_20200809_184844_model_7 | 0.816504 | 0.570223 | 0.255827 | 0.437693 | 0.191575 |
| 88 | DeepLearning_grid_1_AutoML_20200809_184844_mod... | 0.814457 | 0.464657 | 0.256316 | 0.387104 | 0.149850 |
| 89 | XGBoost_grid_1_AutoML_20200809_184844_model_9 | 0.812922 | 0.497491 | 0.262802 | 0.398574 | 0.158861 |
| 90 | GBM_grid_1_AutoML_20200809_184844_model_11 | 0.810966 | 0.568592 | 0.257857 | 0.436954 | 0.190928 |
| 91 | GBM_grid_1_AutoML_20200809_184844_model_10 | 0.810071 | 0.483470 | 0.262444 | 0.391754 | 0.153472 |
| 92 | DeepLearning_grid_1_AutoML_20200809_184844_mod... | 0.809415 | 0.681625 | 0.257185 | 0.417600 | 0.174389 |
| 93 | GBM_grid_1_AutoML_20200809_184844_model_17 | 0.807155 | 0.671564 | 0.261080 | 0.419363 | 0.175865 |
| 94 | DeepLearning_grid_1_AutoML_20200809_184844_mod... | 0.806609 | 0.665037 | 0.270382 | 0.397929 | 0.158348 |
| 95 | GBM_grid_1_AutoML_20200809_224838_model_6 | 0.805700 | 0.500964 | 0.264710 | 0.399209 | 0.159368 |
| 96 | GBM_grid_1_AutoML_20200809_184844_model_9 | 0.805206 | 0.568771 | 0.268710 | 0.437034 | 0.190999 |
| 97 | DeepLearning_grid_1_AutoML_20200809_184844_mod... | 0.803046 | 0.550576 | 0.271191 | 0.405248 | 0.164226 |
| 98 | XGBoost_grid_1_AutoML_20200809_184844_model_7 | 0.792067 | 0.559789 | 0.277797 | 0.417189 | 0.174046 |
| 99 | GBM_grid_1_AutoML_20200809_224838_model_2 | 0.789137 | 0.606154 | 0.282159 | 0.420276 | 0.176632 |

100 rows × 6 columns

3. Building Models

Following the suggestion list generated by the evaluation system of interpretability, there are three models to build: Logistic Regression, Gradient Boosting Machine, and XGBoost. By taking advantage of the H2O AutoML leaderboard, we are able to get the best models with the most suitable hyperparameters.
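Before hard-coding the model ids below, one way to locate the top-ranked model of each family is to filter the leaderboard dataframe built above. A minimal sketch, assuming the leaderboard is sorted by AUC (H2O AutoML's default leaderboard ordering for binomial problems):

# Best model id per algorithm family, read off the AUC-sorted leaderboard
for family in ['GLM', 'GBM', 'XGBoost']:
    subset = aml_leaderboard_df[aml_leaderboard_df['model_id'].str.startswith(family)]
    if not subset.empty:
        print(family, subset.iloc[0]['model_id'], subset.iloc[0]['auc'])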

# Build the models

logistic_model_local = H2OGeneralizedLinearEstimator(family = "binomial")

gbm_model_local = h2o.get_model('GBM_grid_1_AutoML_20200809_184844_model_18')

xgboost_model_local = h2o.get_model('XGBoost_grid_1_AutoML_20200809_224838_model_9')
# Slice the dataframe into three parts: training df, validation df and testing df with the ratio 6:3:1

# cast to factor
df[1] = df[1].asfactor()

# Random UNIform numbers, one per row
r = df[0].runif()

# 60% for training data
train = df[ r < 0.6 ]

# 30% for validation
valid = df[ (0.6 <= r) & (r < 0.9) ]

# 10% for testing
test  = df[ 0.9 <= r ]
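Note that the runif-based split above is random, so the 6:3:1 proportions are only approximate and change between runs. A hedged alternative is H2O's built-in split_frame, which takes explicit ratios and a seed for reproducibility:

# Reproducible 60/30/10 split; the last fraction is implied by the ratios
train, valid, test = df.split_frame(ratios=[0.6, 0.3], seed=42)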

3.1 Logistic Regression

# Train the Logistic Regression Model

logistic_model_local.train(x=X,
                           y= target,
                           training_frame = train,
                           validation_frame = valid)
# Performance of Logistic Regression Model

logistic_model_local
Model Details
=============
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  GLM_model_python_1597013319592_17144


ModelMetricsBinomialGLM: glm
** Reported on train data. **

MSE: 0.1313121673914865
RMSE: 0.36237020764887184
LogLoss: 0.40633115377777673
Null degrees of freedom: 3863
Residual degrees of freedom: 3836
Null deviance: 4482.797889446194
Residual deviance: 3140.127156394658
AIC: 3196.127156394658
AUC: 0.8558425142502644
pr_auc: 0.6758682461569124
Gini: 0.7116850285005287
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.25859197421042557: 

| | No | Yes | Error | Rate |
| --- | --- | --- | --- | --- |
| No | 2088.0 | 745.0 | 0.263 | (745.0/2833.0) |
| Yes | 181.0 | 850.0 | 0.1756 | (181.0/1031.0) |
| Total | 2269.0 | 1595.0 | 0.2396 | (926.0/3864.0) |

Maximum Metrics: Maximum metrics at their respective thresholds

| metric | threshold | value | idx |
| --- | --- | --- | --- |
| max f1 | 0.2585920 | 0.6473724 | 241.0 |
| max f2 | 0.1563713 | 0.7615521 | 288.0 |
| max f0point5 | 0.5398616 | 0.6524909 | 122.0 |
| max accuracy | 0.5398616 | 0.8115942 | 122.0 |
| max precision | 0.8750044 | 1.0 | 0.0 |
| max recall | 0.0031143 | 1.0 | 397.0 |
| max specificity | 0.8750044 | 1.0 | 0.0 |
| max absolute_mcc | 0.2585920 | 0.5044077 | 241.0 |
| max min_per_class_accuracy | 0.2989662 | 0.7720660 | 223.0 |
| max mean_per_class_accuracy | 0.2585920 | 0.7807351 | 241.0 |

Gains/Lift Table: Avg response rate: 26.68 %, avg score: 26.69 %

| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.0100932 | 0.8072080 | 3.2673282 | 3.2673282 | 0.8717949 | 0.8318661 | 0.8717949 | 0.8318661 | 0.0329777 | 0.0329777 | 226.7328210 | 226.7328210 |
| 2 | 0.0201863 | 0.7867289 | 3.0751324 | 3.1712303 | 0.8205128 | 0.7973382 | 0.8461538 | 0.8146022 | 0.0310378 | 0.0640155 | 207.5132433 | 217.1230322 |
| 3 | 0.0300207 | 0.7708700 | 3.2546838 | 3.1985685 | 0.8684211 | 0.7784711 | 0.8534483 | 0.8027661 | 0.0320078 | 0.0960233 | 225.4683751 | 219.8568514 |
| 4 | 0.0401139 | 0.7558323 | 3.2673282 | 3.2158693 | 0.8717949 | 0.7653157 | 0.8580645 | 0.7933431 | 0.0329777 | 0.1290010 | 226.7328210 | 221.5869341 |
| 5 | 0.0502070 | 0.7380001 | 2.7868388 | 3.1296209 | 0.7435897 | 0.7467696 | 0.8350515 | 0.7839804 | 0.0281280 | 0.1571290 | 178.6838767 | 212.9620927 |
| 6 | 0.1001553 | 0.6729028 | 2.8351367 | 2.9827593 | 0.7564767 | 0.7046722 | 0.7958656 | 0.7444288 | 0.1416101 | 0.2987391 | 183.5136670 | 198.2759269 |
| 7 | 0.1501035 | 0.6071369 | 2.3885056 | 2.7850162 | 0.6373057 | 0.6387067 | 0.7431034 | 0.7092489 | 0.1193016 | 0.4180407 | 138.8505551 | 178.5016221 |
| 8 | 0.2000518 | 0.5381941 | 2.0778056 | 2.6084423 | 0.5544041 | 0.5722402 | 0.6959897 | 0.6750410 | 0.1037827 | 0.5218235 | 107.7805642 | 160.8442299 |
| 9 | 0.2999482 | 0.3962338 | 1.5437902 | 2.2538645 | 0.4119171 | 0.4683588 | 0.6013805 | 0.6062064 | 0.1542192 | 0.6760427 | 54.3790173 | 125.3864456 |
| 10 | 0.4001035 | 0.2677629 | 1.2976940 | 2.0145126 | 0.3462532 | 0.3286738 | 0.5375162 | 0.5367335 | 0.1299709 | 0.8060136 | 29.7693968 | 101.4512593 |
| 11 | 0.5 | 0.1837581 | 0.7961685 | 1.7710960 | 0.2124352 | 0.2207226 | 0.4725673 | 0.4735967 | 0.0795344 | 0.8855480 | -20.3831483 | 77.1096023 |
| 12 | 0.5998965 | 0.1104158 | 0.5922717 | 1.5747948 | 0.1580311 | 0.1484992 | 0.4201898 | 0.4194606 | 0.0591659 | 0.9447139 | -40.7728298 | 57.4794820 |
| 13 | 0.7000518 | 0.0617657 | 0.2808442 | 1.3896714 | 0.0749354 | 0.0847926 | 0.3707948 | 0.3715802 | 0.0281280 | 0.9728419 | -71.9155783 | 38.9671388 |
| 14 | 0.7999482 | 0.0286901 | 0.1262218 | 1.2318935 | 0.0336788 | 0.0437199 | 0.3286962 | 0.3306374 | 0.0126091 | 0.9854510 | -87.3778162 | 23.1893476 |
| 15 | 0.8998447 | 0.0107940 | 0.0970937 | 1.1059134 | 0.0259067 | 0.0186504 | 0.2950820 | 0.2960021 | 0.0096993 | 0.9951503 | -90.2906278 | 10.5913406 |
| 16 | 1.0 | 0.0013001 | 0.0484214 | 1.0 | 0.0129199 | 0.0050078 | 0.2668219 | 0.2668575 | 0.0048497 | 1.0 | -95.1578583 | 0.0 |

ModelMetricsBinomialGLM: glm
** Reported on validation data. **

MSE: 0.13988195347217955
RMSE: 0.3740079591027169
LogLoss: 0.42669574270102295
Null degrees of freedom: 1963
Residual degrees of freedom: 1936
Null deviance: 2254.1867641245126
Residual deviance: 1676.060877329618
AIC: 1732.060877329618
AUC: 0.8334476691632231
pr_auc: 0.6085277427369343
Gini: 0.6668953383264462
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3933891185861035: 

| | No | Yes | Error | Rate |
| --- | --- | --- | --- | --- |
| No | 1189.0 | 263.0 | 0.1811 | (263.0/1452.0) |
| Yes | 163.0 | 349.0 | 0.3184 | (163.0/512.0) |
| Total | 1352.0 | 612.0 | 0.2169 | (426.0/1964.0) |

Maximum Metrics: Maximum metrics at their respective thresholds

| metric | threshold | value | idx |
| --- | --- | --- | --- |
| max f1 | 0.3933891 | 0.6209964 | 177.0 |
| max f2 | 0.1467309 | 0.7441787 | 296.0 |
| max f0point5 | 0.5139588 | 0.6147913 | 124.0 |
| max accuracy | 0.5208544 | 0.7998982 | 121.0 |
| max precision | 0.8712309 | 1.0 | 0.0 |
| max recall | 0.0066205 | 1.0 | 393.0 |
| max specificity | 0.8712309 | 1.0 | 0.0 |
| max absolute_mcc | 0.3933891 | 0.4744255 | 177.0 |
| max min_per_class_accuracy | 0.3241675 | 0.7519531 | 210.0 |
| max mean_per_class_accuracy | 0.2878493 | 0.7594885 | 226.0 |

Gains/Lift Table: Avg response rate: 26.07 %, avg score: 27.45 %

| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.0101833 | 0.8083745 | 3.0687500 | 3.0687500 | 0.8 | 0.8366588 | 0.8 | 0.8366588 | 0.03125 | 0.03125 | 206.8750000 | 206.8750000 |
| 2 | 0.0203666 | 0.7814955 | 2.4933594 | 2.7810547 | 0.65 | 0.7940115 | 0.725 | 0.8153352 | 0.0253906 | 0.0566406 | 149.3359375 | 178.1054688 |
| 3 | 0.0300407 | 0.7625985 | 2.8264803 | 2.7956833 | 0.7368421 | 0.7728231 | 0.7288136 | 0.8016448 | 0.0273438 | 0.0839844 | 182.6480263 | 179.5683263 |
| 4 | 0.0402240 | 0.7433646 | 2.8769531 | 2.8162579 | 0.75 | 0.7519586 | 0.7341772 | 0.7890660 | 0.0292969 | 0.1132812 | 187.6953125 | 181.6257911 |
| 5 | 0.0504073 | 0.7278637 | 2.8769531 | 2.8285196 | 0.75 | 0.7359333 | 0.7373737 | 0.7783321 | 0.0292969 | 0.1425781 | 187.6953125 | 182.8519571 |
| 6 | 0.1003055 | 0.6504495 | 2.5051020 | 2.6676317 | 0.6530612 | 0.6902004 | 0.6954315 | 0.7344900 | 0.125 | 0.2675781 | 150.5102041 | 166.7631662 |
| 7 | 0.1502037 | 0.5958767 | 2.3485332 | 2.5616261 | 0.6122449 | 0.6207043 | 0.6677966 | 0.6966900 | 0.1171875 | 0.3847656 | 134.8533163 | 156.1626059 |
| 8 | 0.2001018 | 0.5365627 | 2.0745376 | 2.4401638 | 0.5408163 | 0.5687102 | 0.6361323 | 0.6647764 | 0.1035156 | 0.4882812 | 107.4537628 | 144.0163804 |
| 9 | 0.2998982 | 0.4021287 | 1.7222577 | 2.2012680 | 0.4489796 | 0.4687213 | 0.5738540 | 0.5995357 | 0.171875 | 0.6601562 | 72.2257653 | 120.1268039 |
| 10 | 0.4002037 | 0.2980178 | 1.2072494 | 1.9521310 | 0.3147208 | 0.3501028 | 0.5089059 | 0.5370188 | 0.1210938 | 0.78125 | 20.7249365 | 95.2131043 |
| 11 | 0.5 | 0.2041451 | 0.7632733 | 1.7148438 | 0.1989796 | 0.2493834 | 0.4470468 | 0.4796089 | 0.0761719 | 0.8574219 | -23.6726722 | 71.4843750 |
| 12 | 0.5997963 | 0.1297163 | 0.7241311 | 1.5500053 | 0.1887755 | 0.1641780 | 0.4040747 | 0.4271263 | 0.0722656 | 0.9296875 | -27.5868941 | 55.0005306 |
| 13 | 0.7001018 | 0.0717965 | 0.3504918 | 1.3781477 | 0.0913706 | 0.1003014 | 0.3592727 | 0.3803012 | 0.0351562 | 0.9648438 | -64.9508249 | 37.8147727 |
| 14 | 0.7998982 | 0.0354904 | 0.1957111 | 1.2306254 | 0.0510204 | 0.0536183 | 0.3208148 | 0.3395438 | 0.0195312 | 0.984375 | -80.4288903 | 23.0625398 |
| 15 | 0.8996945 | 0.0127866 | 0.1369978 | 1.1093175 | 0.0357143 | 0.0222299 | 0.2891907 | 0.3043466 | 0.0136719 | 0.9980469 | -86.3002232 | 10.9317523 |
| 16 | 1.0 | 0.0013697 | 0.0194718 | 1.0 | 0.0050761 | 0.0067909 | 0.2606925 | 0.2745001 | 0.0019531 | 1.0 | -98.0528236 | 0.0 |

Scoring History: 

| timestamp | duration | iterations | negative_log_likelihood | objective |
| --- | --- | --- | --- | --- |
| 2020-08-10 08:10:16 | 0.000 sec | 0 | 2241.3989447 | 0.5800722 |
| 2020-08-10 08:10:16 | 0.017 sec | 1 | 1654.2086472 | 0.4289863 |
| 2020-08-10 08:10:16 | 0.028 sec | 2 | 1585.7817876 | 0.4114282 |
| 2020-08-10 08:10:16 | 0.031 sec | 3 | 1571.8949395 | 0.4081494 |
| 2020-08-10 08:10:16 | 0.035 sec | 4 | 1570.1412044 | 0.4078462 |
| 2020-08-10 08:10:16 | 0.039 sec | 5 | 1570.0635782 | 0.4078434 |

# Performance of Logistic Regression Model on testing data

logistic_model_local.model_performance(test_data=test)
ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.13676664581603143
RMSE: 0.36981974773669324
LogLoss: 0.41905124614571654
Null degrees of freedom: 661
Residual degrees of freedom: 634
Null deviance: 758.6431182843685
Residual deviance: 554.8238498969288
AIC: 610.8238498969288
AUC: 0.8399264356905553
pr_auc: 0.642954585445394
Gini: 0.6798528713811105
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.30243924348259327: 

| | No | Yes | Error | Rate |
| --- | --- | --- | --- | --- |
| No | 369.0 | 121.0 | 0.2469 | (121.0/490.0) |
| Yes | 37.0 | 135.0 | 0.2151 | (37.0/172.0) |
| Total | 406.0 | 256.0 | 0.2387 | (158.0/662.0) |

Maximum Metrics: Maximum metrics at their respective thresholds

| metric | threshold | value | idx |
| --- | --- | --- | --- |
| max f1 | 0.3024392 | 0.6308411 | 193.0 |
| max f2 | 0.1422109 | 0.7490637 | 271.0 |
| max f0point5 | 0.5192027 | 0.6445313 | 110.0 |
| max accuracy | 0.5192027 | 0.8141994 | 110.0 |
| max precision | 0.8750186 | 1.0 | 0.0 |
| max recall | 0.0137343 | 1.0 | 379.0 |
| max specificity | 0.8750186 | 1.0 | 0.0 |
| max absolute_mcc | 0.5110350 | 0.4988469 | 111.0 |
| max min_per_class_accuracy | 0.3130038 | 0.7591837 | 187.0 |
| max mean_per_class_accuracy | 0.3024392 | 0.7689725 | 193.0 |

Gains/Lift Table: Avg response rate: 25.98 %, avg score: 27.20 %

| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.0105740 | 0.8028264 | 3.8488372 | 3.8488372 | 1.0 | 0.8344546 | 1.0 | 0.8344546 | 0.0406977 | 0.0406977 | 284.8837209 | 284.8837209 |
| 2 | 0.0211480 | 0.7855657 | 3.2990033 | 3.5739203 | 0.8571429 | 0.7949672 | 0.9285714 | 0.8147109 | 0.0348837 | 0.0755814 | 229.9003322 | 257.3920266 |
| 3 | 0.0302115 | 0.7630627 | 2.5658915 | 3.2715116 | 0.6666667 | 0.7721759 | 0.85 | 0.8019504 | 0.0232558 | 0.0988372 | 156.5891473 | 227.1511628 |
| 4 | 0.0407855 | 0.7516464 | 3.2990033 | 3.2786391 | 0.8571429 | 0.7580396 | 0.8518519 | 0.7905661 | 0.0348837 | 0.1337209 | 229.9003322 | 227.8639104 |
| 5 | 0.0513595 | 0.7319520 | 3.2990033 | 3.2828317 | 0.8571429 | 0.7393147 | 0.8529412 | 0.7800144 | 0.0348837 | 0.1686047 | 229.9003322 | 228.2831737 |
| 6 | 0.1012085 | 0.6708461 | 2.2159972 | 2.7573759 | 0.5757576 | 0.6986010 | 0.7164179 | 0.7399152 | 0.1104651 | 0.2790698 | 121.5997181 | 175.7375911 |
| 7 | 0.1510574 | 0.6110156 | 2.0993658 | 2.5402326 | 0.5454545 | 0.6368209 | 0.66 | 0.7058941 | 0.1046512 | 0.3837209 | 109.9365751 | 154.0232558 |
| 8 | 0.2009063 | 0.5535949 | 2.6825229 | 2.5755377 | 0.6969697 | 0.5807483 | 0.6691729 | 0.6748429 | 0.1337209 | 0.5174419 | 168.2522903 | 157.5537681 |
| 9 | 0.3006042 | 0.3961013 | 1.3412615 | 2.1661797 | 0.3484848 | 0.4753281 | 0.5628141 | 0.6086721 | 0.1337209 | 0.6511628 | 34.1261452 | 116.6179736 |
| 10 | 0.4003021 | 0.2801609 | 1.3995772 | 1.9752523 | 0.3636364 | 0.3394054 | 0.5132075 | 0.5416095 | 0.1395349 | 0.7906977 | 39.9577167 | 97.5252304 |
| 11 | 0.5 | 0.1838036 | 0.8164200 | 1.7441860 | 0.2121212 | 0.2310811 | 0.4531722 | 0.4796914 | 0.0813953 | 0.8720930 | -18.3579986 | 74.4186047 |
| 12 | 0.5996979 | 0.1218909 | 0.5831572 | 1.5511686 | 0.1515152 | 0.1538460 | 0.4030227 | 0.4255207 | 0.0581395 | 0.9302326 | -41.6842847 | 55.1168649 |
| 13 | 0.6993958 | 0.0686841 | 0.4082100 | 1.3882415 | 0.1060606 | 0.0948750 | 0.3606911 | 0.3783876 | 0.0406977 | 0.9709302 | -59.1789993 | 38.8241499 |
| 14 | 0.7990937 | 0.0313631 | 0.1166314 | 1.2295907 | 0.0303030 | 0.0485601 | 0.3194707 | 0.3372371 | 0.0116279 | 0.9825581 | -88.3368569 | 22.9590715 |
| 15 | 0.8987915 | 0.0107587 | 0.1749471 | 1.1126050 | 0.0454545 | 0.0193192 | 0.2890756 | 0.3019723 | 0.0174419 | 1.0 | -82.5052854 | 11.2605042 |
| 16 | 1.0 | 0.0014169 | 0.0 | 1.0 | 0.0 | 0.0056128 | 0.2598187 | 0.2719782 | 0.0 | 1.0 | -100.0 | 0.0 |
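Individual numbers can also be pulled out of the performance object instead of reading the full report. A minimal sketch using standard accessor methods on H2O binomial metrics:

# Extract a few scalar metrics from the test-set performance object
perf = logistic_model_local.model_performance(test_data=test)
print('AUC:    ', perf.auc())
print('LogLoss:', perf.logloss())
print('RMSE:   ', perf.rmse())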

3.1.1 Variable Importance

'Tenure', the number of months the customer has stayed with the company, is the most important feature in this dataset. The longer a customer has stayed with the company, the smaller the probability that the customer will stop doing business with it.

logistic_model_local.varimp_plot()
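If a table is preferred over the plot, the same ranking can usually be retrieved as a dataframe. A short sketch (varimp is a generic H2O model method, so this assumes the GLM backend supports it in your h2o version):

# Variable importance as a pandas DataFrame instead of a plot
varimp_df = logistic_model_local.varimp(use_pandas=True)
print(varimp_df.head())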

3.1.2 Partial Dependence Plot

Tenure_PDP = logistic_model_local.partial_plot(data=df, cols=['Tenure'], plot=True, plot_stddev=True)

The above PDP is for the column "Tenure", which turned out to be one of the most important columns under variable importance. As observed in the graph, the predicted value of the target "Churn" (whether a customer will stay or leave) decreases as the tenure increases. We can infer that if a customer stays for a longer duration, the chance of churn is reduced, so the curve levels off as tenure increases.
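Formally, what the curve shows is the model's average prediction when the feature of interest is held fixed and all other features vary over the data. For a feature subset $x_S$ with complement $x_C$, the standard partial dependence estimate is

$$\hat{f}_S(x_S) = \frac{1}{n}\sum_{i=1}^{n} \hat{f}\big(x_S,\, x_C^{(i)}\big)$$

where $\hat{f}$ is the fitted model and $x_C^{(i)}$ are the observed values of the other features for row $i$.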

logistic_model_local.partial_plot(data=train, cols=['Total Charges'], plot=True, plot_stddev=True)
PartialDependence: Partial Dependence Plot of model GLM_model_python_1597013319592_17144 on column 'Total Charges'

| total_charges | mean_response | stddev_response | std_error_mean_response |
| --- | --- | --- | --- |
| 18.8 | 0.2137070 | 0.2401651 | 0.0038636 |
| 474.9052632 | 0.2284907 | 0.2505882 | 0.0040313 |
| 931.0105263 | 0.2436702 | 0.2606659 | 0.0041934 |
| 1387.1157895 | 0.2592092 | 0.2703500 | 0.0043492 |
| 1843.2210526 | 0.2750709 | 0.2795973 | 0.0044979 |
| 2299.3263158 | 0.2912182 | 0.2883697 | 0.0046391 |
| 2755.4315789 | 0.3076140 | 0.2966338 | 0.0047720 |
| 3211.5368421 | 0.3242221 | 0.3043611 | 0.0048963 |
| 3667.6421053 | 0.3410069 | 0.3115273 | 0.0050116 |
| 4123.7473684 | 0.3579340 | 0.3181119 | 0.0051175 |
| 4579.8526316 | 0.3749705 | 0.3240982 | 0.0052138 |
| 5035.9578947 | 0.3920849 | 0.3294726 | 0.0053003 |
| 5492.0631579 | 0.4092474 | 0.3342245 | 0.0053767 |
| 5948.1684211 | 0.4264302 | 0.3383460 | 0.0054431 |
| 6404.2736842 | 0.4436072 | 0.3418315 | 0.0054991 |
| 6860.3789474 | 0.4607542 | 0.3446780 | 0.0055449 |
| 7316.4842105 | 0.4778489 | 0.3468842 | 0.0055804 |
| 7772.5894737 | 0.4948709 | 0.3484511 | 0.0056056 |
| 8228.6947368 | 0.5118011 | 0.3493812 | 0.0056206 |
| 8684.8 | 0.5286223 | 0.3496788 | 0.0056254 |

[]

Similar to "Tenure", one of the other important features is "total_charges" and the PDP of "total_charges" helps us understand that as the charges of the customer increases the chances of they getting Churned is high, and it makes a complete sense that the reason for the Churn could be because of high charges that make them drop the subscription.

# manually calculate 2-D partial dependence

def par_dep_2d(xs1, xs2, frame, model, resolution=20):

    """ Creates Pandas dataframe containing partial dependence for two variables.

    Args:
        xs1: First variable for which to calculate partial dependence.
        xs2: Second variable for which to calculate partial dependence.
        frame: Data for which to calculate partial dependence.
        model: Model for which to calculate partial dependence.
        resolution: The number of points across the domain of xs for which to calculate partial dependence.

    Returns:
        Pandas dataframe containing partial dependence values.

    """

    # init empty Pandas frame w/ correct col names
    par_dep_frame = pd.DataFrame(columns=[xs1, xs2, 'partial_dependence'])

    # cache original data 
    col_cache1 = frame[xs1]
    col_cache2 = frame[xs2] 

    # determine values at which to calculate partial dependency
    # for xs1
    min1_ = frame[xs1].min()
    max1_ = frame[xs1].max()
    by1 = float((max1_ - min1_)/resolution)
    print("min1:" + str(min1_))
    print("max1_" + str(max1_))
    print("by1" + str(by1))
    range1 = np.arange(min1_, max1_, by1)

    # determine values at which to calculate partial dependency
    # for xs2
    min2_ = frame[xs2].min()
    max2_ = frame[xs2].max()
    by2 = float((max2_ - min2_)/resolution)
    print("min2:" + str(min2_))
    print("max2_" + str(max2_))
    print("by2" + str(by2))
    range2 = np.arange(min2_, max2_, by2)  

    # calculate partial dependency 
    for j in range1:
        for k in range2:
            frame[xs1] = j
            frame[xs2] = k
            par_dep_i = model.predict(frame)
            par_dep_j = par_dep_i.mean()[0]
            par_dep_frame = par_dep_frame.append({xs1:j,
                                                  xs2:k,
                                                  'partial_dependence': par_dep_j}, 
                                                  ignore_index=True)

    # return input frame to original cached state    
    frame[xs1] = col_cache1
    frame[xs2] = col_cache2

    return par_dep_frame
# calculate 2-D partial dependence
h2o.no_progress()
resolution = 20
par_dep_Tenure_v_TotalCharges = par_dep_2d('Tenure',
                                           'Total Charges',
                                           df,
                                           logistic_model_local,
                                           resolution=resolution)

print(par_dep_Tenure_v_TotalCharges)
min1:1.0
max1_72.0
by13.55
min2:18.8
max2_8684.8
by2433.3
     Tenure  Total Charges  partial_dependence
0      1.00           18.8            0.501233
1      1.00          452.1            0.528659
2      1.00          885.4            0.557627
3      1.00         1318.7            0.592142
4      1.00         1752.0            0.618952
5      1.00         2185.3            0.649615
6      1.00         2618.6            0.675347
7      1.00         3051.9            0.700000
8      1.00         3485.2            0.722804
9      1.00         3918.5            0.769492
10     1.00         4351.8            0.798613
11     1.00         4785.1            0.824037
12     1.00         5218.4            0.841757
13     1.00         5651.7            0.865177
14     1.00         6085.0            0.893837
15     1.00         6518.3            0.921880
16     1.00         6951.6            0.940524
17     1.00         7384.9            0.955778
18     1.00         7818.2            0.976579
19     1.00         8251.5            0.998921
20     4.55           18.8            0.449769
21     4.55          452.1            0.477812
22     4.55          885.4            0.508783
23     4.55         1318.7            0.537596
24     4.55         1752.0            0.567797
25     4.55         2185.3            0.600616
26     4.55         2618.6            0.625886
27     4.55         3051.9            0.657781
28     4.55         3485.2            0.684592
29     4.55         3918.5            0.709553
..      ...            ...                 ...
370   64.90         4351.8            0.000000
371   64.90         4785.1            0.000000
372   64.90         5218.4            0.003390
373   64.90         5651.7            0.010786
374   64.90         6085.0            0.020185
375   64.90         6518.3            0.032049
376   64.90         6951.6            0.054700
377   64.90         7384.9            0.078737
378   64.90         7818.2            0.106317
379   64.90         8251.5            0.141448
380   68.45           18.8            0.000000
381   68.45          452.1            0.000000
382   68.45          885.4            0.000000
383   68.45         1318.7            0.000000
384   68.45         1752.0            0.000000
385   68.45         2185.3            0.000000
386   68.45         2618.6            0.000000
387   68.45         3051.9            0.000000
388   68.45         3485.2            0.000000
389   68.45         3918.5            0.000000
390   68.45         4351.8            0.000000
391   68.45         4785.1            0.000000
392   68.45         5218.4            0.000000
393   68.45         5651.7            0.000000
394   68.45         6085.0            0.005239
395   68.45         6518.3            0.016025
396   68.45         6951.6            0.024653
397   68.45         7384.9            0.037596
398   68.45         7818.2            0.060092
399   68.45         8251.5            0.087057

[400 rows x 3 columns]
# create 2-D partial dependence plot

# imports
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter

# create 3-D grid 
new_shape = (resolution, resolution)
x = np.asarray(par_dep_Tenure_v_TotalCharges['Tenure']).reshape(new_shape)
y = np.asarray(par_dep_Tenure_v_TotalCharges['Total Charges']).reshape(new_shape)
z = np.asarray(par_dep_Tenure_v_TotalCharges['partial_dependence']).reshape(new_shape)


fig = plt.figure(figsize=(8,6))
ax = plt.axes(projection='3d')

# set axes labels
ax.set_title('Partial Dependence for Churn')
ax.set_xlabel('Tenure')
ax.set_ylabel('Total Charges')
ax.set_zlabel('\nChurn')

# axis decorators/details
#ax.zaxis.set_major_locator(LinearLocator(10))
#ax.zaxis.set_major_formatter(FormatStrFormatter('%.1f'))

# surface
surf = ax.plot_surface(x, y, z, 
                       cmap=cm.coolwarm, 
                       linewidth=0.05, 
                       rstride=1, 
                       cstride=1, 
                       antialiased=True)
plt.tight_layout()

_ = plt.show()

3.1.3 Individual Conditional Expectation

# manually calculate 1-D partial dependence
# for educational purposes

def par_dep(xs, frame, model, resolution=20, bins=None):

    """ Creates Pandas dataframe containing partial dependence for a single variable.

    Args:
        xs: Variable for which to calculate partial dependence.
        frame: Data for which to calculate partial dependence.
        model: Model for which to calculate partial dependence.
        resolution: The number of points across the domain of xs for which to calculate partial dependence.
        bins: Optional explicit values of xs at which to calculate partial dependence (overrides resolution).

    Returns:
        Pandas dataframe containing partial dependence values.

    """

    # init empty Pandas frame w/ correct col names
    par_dep_frame = pd.DataFrame(columns=[xs, 'partial_dependence'])

    # cache original data 
    col_cache = h2o.deep_copy(frame[xs], xid='col_cache')

    # determine values at which to calculate partial dependency
    if bins is None:
        min_ = frame[xs].min()
        max_ = frame[xs].max()
        by = (max_ - min_)/resolution
        bins = np.arange(min_, max_, by)

    # calculate partial dependency  
    # by setting column of interest to constant 
    for j in bins:
        frame[xs] = j
        par_dep_i = model.predict(frame)
        par_dep_j = par_dep_i.mean()[0]
        par_dep_frame = par_dep_frame.append({xs:j,
                                              'partial_dependence': par_dep_j}, 
                                              ignore_index=True)

    # return input frame to original cached state    
    frame[xs] = h2o.get_frame('col_cache')

    return par_dep_frame

# show some output
par_dep_Tenure = par_dep('Tenure', df, logistic_model_local)
# par_dep_Tenure.plot.line(x='Tenure', y='partial_dependence')
# print(par_dep_Tenure)
def get_quantile_dict(y, id_, frame):

    """ Returns the percentiles of a column y as the indices for another column id_.

    Args:
        y: Column in which to find percentiles.
        id_: Id column that stores indices for percentiles of y.
        frame: H2OFrame containing y and id_. 

    Returns:
        Dictionary of percentile values and index column values.

    """

    quantiles_df = frame.as_data_frame()
    quantiles_df.sort_values(y, inplace=True)
    quantiles_df.reset_index(inplace=True)

    percentiles_dict = {}
    percentiles_dict[0] = quantiles_df.loc[0, id_]
    percentiles_dict[99] = quantiles_df.loc[quantiles_df.shape[0]-1, id_]
    inc = quantiles_df.shape[0]//10

    for i in range(1, 10):
        percentiles_dict[i * 10] = quantiles_df.loc[i * inc,  id_]

    return percentiles_dict

quantile_dict = get_quantile_dict('Churn', 'Id', df)
bins = list(par_dep_Tenure['Tenure'])
for i in sorted(quantile_dict.keys()):
    col_name = 'Percentile_' + str(i)
    par_dep_Tenure[col_name] = par_dep('Tenure',
                                   df[df['Id'] == int(quantile_dict[i])],
                                   logistic_model_local,
                                   bins=bins)['partial_dependence']
par_dep_Tenure

| | Tenure | partial_dependence | Percentile_0 | Percentile_10 | Percentile_20 | Percentile_30 | Percentile_40 | Percentile_50 | Percentile_60 | Percentile_70 | Percentile_80 | Percentile_90 | Percentile_99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.00 | 0.655008 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 1 | 4.55 | 0.608783 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 8.10 | 0.559630 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 3 | 11.65 | 0.513713 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | 15.20 | 0.469954 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 5 | 18.75 | 0.426502 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 6 | 22.30 | 0.378582 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 25.85 | 0.320493 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 29.40 | 0.269337 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 32.95 | 0.212481 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 36.50 | 0.158243 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 11 | 40.05 | 0.115100 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 12 | 43.60 | 0.082589 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 13 | 47.15 | 0.054083 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 14 | 50.70 | 0.033436 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 15 | 54.25 | 0.021109 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 16 | 57.80 | 0.011864 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 17 | 61.35 | 0.005393 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18 | 64.90 | 0.002311 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 19 | 68.45 | 0.000616 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

fig, ax = plt.subplots()
par_dep_Tenure.drop('partial_dependence', axis=1).plot(x='Tenure', colormap='gnuplot', ax=ax)

par_dep_Tenure.plot(title='Partial Dependence and ICE for Churn',
                         x='Tenure', 
                         y='partial_dependence',
                         style='r-', 
                         linewidth=3, 
                         ax=ax)

_ = plt.legend(bbox_to_anchor=(1.05, 0),
               loc=3, 
               borderaxespad=0.)

The ICE plot for "Tenure" shows, for every instance in the data, how the prediction changes as "Tenure" changes. The average of these curves is the partial dependence curve, shown in red in the graph.
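In the notation used for partial dependence above, each ICE curve fixes one observation's other features and traces the prediction as the feature of interest varies, and the partial dependence curve is their pointwise mean:

$$\hat{f}^{(i)}(x_S) = \hat{f}\big(x_S,\, x_C^{(i)}\big), \qquad \hat{f}_S(x_S) = \frac{1}{n}\sum_{i=1}^{n}\hat{f}^{(i)}(x_S)$$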

3.2 Gradient Boosting Machine

# Training the GBM model

gbm_model_local.train(x=X, y=target, training_frame=train, validation_frame = valid)
# The performance of the GBM model

gbm_model_local.model_performance()
ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.10317340756401237
RMSE: 0.32120617609879853
LogLoss: 0.3305658187390948
Mean Per-Class Error: 0.15369427869427865
AUC: 0.9176863520492865
pr_auc: 0.8046855934562773
Gini: 0.835372704098573
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3760811955332678: 

| | No | Yes | Error | Rate |
| --- | --- | --- | --- | --- |
| No | 2495.0 | 354.0 | 0.1243 | (354.0/2849.0) |
| Yes | 214.0 | 822.0 | 0.2066 | (214.0/1036.0) |
| Total | 2709.0 | 1176.0 | 0.1462 | (568.0/3885.0) |

Maximum Metrics: Maximum metrics at their respective thresholds

| metric | threshold | value | idx |
| --- | --- | --- | --- |
| max f1 | 0.3760812 | 0.7432188 | 201.0 |
| max f2 | 0.2599200 | 0.8228181 | 249.0 |
| max f0point5 | 0.5358469 | 0.7384654 | 137.0 |
| max accuracy | 0.4226923 | 0.8550837 | 183.0 |
| max precision | 0.9288880 | 1.0 | 0.0 |
| max recall | 0.0296397 | 1.0 | 382.0 |
| max specificity | 0.9288880 | 1.0 | 0.0 |
| max absolute_mcc | 0.3760812 | 0.6441133 | 201.0 |
| max min_per_class_accuracy | 0.3300302 | 0.8378378 | 220.0 |
| max mean_per_class_accuracy | 0.2818741 | 0.8463057 | 240.0 |

Gains/Lift Table: Avg response rate: 26.67 %, avg score: 26.69 %

| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.0100386 | 0.8899302 | 3.75 | 3.75 | 1.0 | 0.9065725 | 1.0 | 0.9065725 | 0.0376448 | 0.0376448 | 275.0 | 275.0 |
| 2 | 0.0200772 | 0.8579185 | 3.5576923 | 3.6538462 | 0.9487179 | 0.8778309 | 0.9743590 | 0.8922017 | 0.0357143 | 0.0733591 | 255.7692308 | 265.3846154 |
| 3 | 0.0301158 | 0.8366076 | 3.6538462 | 3.6538462 | 0.9743590 | 0.8478174 | 0.9743590 | 0.8774069 | 0.0366795 | 0.1100386 | 265.3846154 | 265.3846154 |
| 4 | 0.0404118 | 0.8169945 | 3.28125 | 3.5589172 | 0.875 | 0.8268917 | 0.9490446 | 0.8645368 | 0.0337838 | 0.1438224 | 228.125 | 255.8917197 |
| 5 | 0.0501931 | 0.7901430 | 3.3552632 | 3.5192308 | 0.8947368 | 0.8036881 | 0.9384615 | 0.8526791 | 0.0328185 | 0.1766409 | 235.5263158 | 251.9230769 |
| 6 | 0.1001287 | 0.6838086 | 3.3827320 | 3.4511568 | 0.9020619 | 0.7367699 | 0.9203085 | 0.7948735 | 0.1689189 | 0.3455598 | 238.2731959 | 245.1156812 |
| 7 | 0.1500644 | 0.5917246 | 2.5322165 | 3.1453688 | 0.6752577 | 0.6386479 | 0.8387650 | 0.7428876 | 0.1264479 | 0.4720077 | 153.2216495 | 214.5368782 |
| 8 | 0.2 | 0.5229638 | 2.3002577 | 2.9343629 | 0.6134021 | 0.5565991 | 0.7824968 | 0.6963754 | 0.1148649 | 0.5868726 | 130.0257732 | 193.4362934 |
| 9 | 0.3001287 | 0.3784853 | 2.0051414 | 2.6243568 | 0.5347044 | 0.4510177 | 0.6998285 | 0.6145194 | 0.2007722 | 0.7876448 | 100.5141388 | 162.4356775 |
| 10 | 0.4 | 0.2616369 | 1.1501289 | 2.2562741 | 0.3067010 | 0.3181251 | 0.6016731 | 0.5405162 | 0.1148649 | 0.9025097 | 15.0128866 | 125.6274131 |
| 11 | 0.5003861 | 0.1649588 | 0.5 | 1.9039352 | 0.1333333 | 0.2125728 | 0.5077160 | 0.4747251 | 0.0501931 | 0.9527027 | -50.0 | 90.3935185 |
| 12 | 0.6 | 0.0988879 | 0.2422481 | 1.6280566 | 0.0645995 | 0.1315786 | 0.4341484 | 0.4177548 | 0.0241313 | 0.9768340 | -75.7751938 | 62.8056628 |
| 13 | 0.6998713 | 0.0552501 | 0.1449742 | 1.4164215 | 0.0386598 | 0.0741760 | 0.3777124 | 0.3687263 | 0.0144788 | 0.9913127 | -85.5025773 | 41.6421478 |
| 14 | 0.8 | 0.0339601 | 0.0674807 | 1.2475869 | 0.0179949 | 0.0430926 | 0.3326898 | 0.3279697 | 0.0067568 | 0.9980695 | -93.2519280 | 24.7586873 |
| 15 | 0.8998713 | 0.0212780 | 0.0193299 | 1.1112700 | 0.0051546 | 0.0276263 | 0.2963387 | 0.2946364 | 0.0019305 | 1.0 | -98.0670103 | 11.1270023 |
| 16 | 1.0 | 0.0109891 | 0.0 | 1.0 | 0.0 | 0.0176606 | 0.2666667 | 0.2669031 | 0.0 | 1.0 | -100.0 | 0.0 |

3.2.1 Variable Importance

Unlike the variable importance plot we generated from the Logistic Regression model, the most important variable in the Gradient Boosting model is 'Contract'. The change reflects the switch to a tree-based model, in which the 'Contract' column serves as a very important split variable.

gbm_model_local.varimp_plot()

3.2.2 Partial Dependence Plot

gbm_model_local.partial_plot(data=train, cols=['Tenure'], plot=True, plot_stddev=True)
PartialDependence: Partial Dependence Plot of model GBM_model_python_1597013319592_18792 on column 'Tenure'

| tenure | mean_response | stddev_response | std_error_mean_response |
| --- | --- | --- | --- |
| 1.0 | 0.4034221 | 0.2908955 | 0.0046670 |
| 4.7368421 | 0.3325176 | 0.2527814 | 0.0040555 |
| 8.4736842 | 0.3245015 | 0.2471217 | 0.0039647 |
| 12.2105263 | 0.2921240 | 0.2306489 | 0.0037005 |
| 15.9473684 | 0.2706254 | 0.2223178 | 0.0035668 |
| 19.6842105 | 0.2605981 | 0.2229685 | 0.0035772 |
| 23.4210526 | 0.2130585 | 0.1886440 | 0.0030265 |
| 27.1578947 | 0.2157993 | 0.1934324 | 0.0031034 |
| 30.8947368 | 0.2145211 | 0.1928156 | 0.0030935 |
| 34.6315789 | 0.2145211 | 0.1928156 | 0.0030935 |
| 38.3684211 | 0.2105373 | 0.1890392 | 0.0030329 |
| 42.1052632 | 0.2054247 | 0.1833675 | 0.0029419 |
| 45.8421053 | 0.2019875 | 0.1820938 | 0.0029215 |
| 49.5789474 | 0.2037796 | 0.1868446 | 0.0029977 |
| 53.3157895 | 0.2021542 | 0.1739150 | 0.0027902 |
| 57.0526316 | 0.1782865 | 0.1582936 | 0.0025396 |
| 60.7894737 | 0.1482262 | 0.1329597 | 0.0021332 |
| 64.5263158 | 0.1517906 | 0.1362895 | 0.0021866 |
| 68.2631579 | 0.1517977 | 0.1362876 | 0.0021866 |
| 72.0 | 0.1191810 | 0.1090938 | 0.0017503 |

[]

The PDP of 'Tenure' shown above is from the GBM model. It shows a trend similar to the one from the Logistic Regression model, but the curve is more uneven because of the non-linearity of GBM. The overall interpretation of the graph, though, is the same as for the Logistic Regression model.

gbm_model_local.partial_plot(data=train, cols=['Total Charges'], plot=True, plot_stddev=True)
PartialDependence: Partial Dependence Plot of model GBM_model_python_1597013319592_18792 on column 'Total Charges'

| total_charges | mean_response | stddev_response | std_error_mean_response |
| --- | --- | --- | --- |
| 18.8 | 0.2999703 | 0.2774126 | 0.0044507 |
| 467.6605263 | 0.2593645 | 0.2500084 | 0.0040111 |
| 916.5210526 | 0.2756536 | 0.2298681 | 0.0036879 |
| 1365.3815789 | 0.2670324 | 0.2321262 | 0.0037242 |
| 1814.2421053 | 0.2683494 | 0.2377486 | 0.0038144 |
| 2263.1026316 | 0.2771945 | 0.2405347 | 0.0038591 |
| 2711.9631579 | 0.2783816 | 0.2412736 | 0.0038709 |
| 3160.8236842 | 0.2531762 | 0.2227409 | 0.0035736 |
| 3609.6842105 | 0.2422233 | 0.2135560 | 0.0034262 |
| 4058.5447368 | 0.2508845 | 0.2205948 | 0.0035392 |
| 4507.4052632 | 0.2671597 | 0.2359473 | 0.0037855 |
| 4956.2657895 | 0.2468248 | 0.2171484 | 0.0034839 |
| 5405.1263158 | 0.2712422 | 0.2350661 | 0.0037713 |
| 5853.9868421 | 0.2627801 | 0.2348798 | 0.0037683 |
| 6302.8473684 | 0.2701906 | 0.2288810 | 0.0036721 |
| 6751.7078947 | 0.2591206 | 0.2288842 | 0.0036721 |
| 7200.5684211 | 0.2729574 | 0.2385147 | 0.0038267 |
| 7649.4289474 | 0.3157270 | 0.2462378 | 0.0039506 |
| 8098.2894737 | 0.3142197 | 0.2465738 | 0.0039560 |
| 8547.15 | 0.3142197 | 0.2465738 | 0.0039560 |

[]
# calculate 2-D partial dependence
h2o.no_progress()
resolution = 20
par_dep_Tenure_v_TotalCharges_GBM = par_dep_2d('Tenure',
                                           'Total Charges',
                                           df,
                                           gbm_model_local,
                                           resolution=resolution)

print(par_dep_Tenure_v_TotalCharges_GBM)
min1:1.0
max1_72.0
by13.55
min2:18.8
max2_8684.8
by2433.3
     Tenure  Total Charges  partial_dependence
0      1.00           18.8            0.551926
1      1.00          452.1            0.528659
2      1.00          885.4            0.624499
3      1.00         1318.7            0.622958
4      1.00         1752.0            0.635285
5      1.00         2185.3            0.627735
6      1.00         2618.6            0.631433
7      1.00         3051.9            0.630508
8      1.00         3485.2            0.620339
9      1.00         3918.5            0.605701
10     1.00         4351.8            0.618336
11     1.00         4785.1            0.626656
12     1.00         5218.4            0.642373
13     1.00         5651.7            0.620647
14     1.00         6085.0            0.643914
15     1.00         6518.3            0.621109
16     1.00         6951.6            0.636980
17     1.00         7384.9            0.640062
18     1.00         7818.2            0.683513
19     1.00         8251.5            0.681510
20     4.55           18.8            0.514330
21     4.55          452.1            0.475501
22     4.55          885.4            0.531587
23     4.55         1318.7            0.521880
24     4.55         1752.0            0.534977
25     4.55         2185.3            0.530354
26     4.55         2618.6            0.536672
27     4.55         3051.9            0.535285
28     4.55         3485.2            0.534052
29     4.55         3918.5            0.508166
..      ...            ...                 ...
370   64.90         4351.8            0.132512
371   64.90         4785.1            0.274114
372   64.90         5218.4            0.130817
373   64.90         5651.7            0.134206
374   64.90         6085.0            0.133128
375   64.90         6518.3            0.112943
376   64.90         6951.6            0.200462
377   64.90         7384.9            0.199692
378   64.90         7818.2            0.272573
379   64.90         8251.5            0.268105
380   68.45           18.8            0.290447
381   68.45          452.1            0.203236
382   68.45          885.4            0.213097
383   68.45         1318.7            0.172727
384   68.45         1752.0            0.179815
385   68.45         2185.3            0.240524
386   68.45         2618.6            0.234977
387   68.45         3051.9            0.329122
388   68.45         3485.2            0.130971
389   68.45         3918.5            0.132512
390   68.45         4351.8            0.132512
391   68.45         4785.1            0.274114
392   68.45         5218.4            0.130817
393   68.45         5651.7            0.134206
394   68.45         6085.0            0.133128
395   68.45         6518.3            0.112943
396   68.45         6951.6            0.200462
397   68.45         7384.9            0.199692
398   68.45         7818.2            0.272573
399   68.45         8251.5            0.268105

[400 rows x 3 columns]
# create 3-D grid 
new_shape = (resolution, resolution)
x = np.asarray(par_dep_Tenure_v_TotalCharges_GBM['Tenure']).reshape(new_shape)
y = np.asarray(par_dep_Tenure_v_TotalCharges_GBM['Total Charges']).reshape(new_shape)
z = np.asarray(par_dep_Tenure_v_TotalCharges_GBM['partial_dependence']).reshape(new_shape)


fig = plt.figure(figsize=(8,6))
ax = plt.axes(projection='3d')

# set axes labels
ax.set_title('Partial Dependence for Churn')
ax.set_xlabel('Tenure')
ax.set_ylabel('Total Charges')
ax.set_zlabel('\nChurn')

# axis decorators/details
#ax.zaxis.set_major_locator(LinearLocator(10))
#ax.zaxis.set_major_formatter(FormatStrFormatter('%.1f'))

# surface
surf = ax.plot_surface(x, y, z, 
                       cmap=cm.coolwarm, 
                       linewidth=0.05, 
                       rstride=1, 
                       cstride=1, 
                       antialiased=True)
plt.tight_layout()

_ = plt.show()

As an inference, the PDPs derived from the Logistic Regression and GBM models clearly differ, and the difference in the graphs reflects the difference in the algorithms working within.

3.2.3 Individual Conditional Expectation

gbm_par_dep_Tenure = par_dep('Tenure', df, gbm_model_local)
# gbm_par_dep_Tenure.plot.line(x='Tenure', y='partial_dependence')
# print(gbm_par_dep_Tenure)
quantile_dict = get_quantile_dict('Churn', 'Id', df)
bins = list(gbm_par_dep_Tenure['Tenure'])
for i in sorted(quantile_dict.keys()):
    col_name = 'Percentile_' + str(i)
    gbm_par_dep_Tenure[col_name] = par_dep('Tenure',
                                   df[df['Id'] == int(quantile_dict[i])],
                                   gbm_model_local,
                                   bins=bins)['partial_dependence']
gbm_par_dep_Tenure

| | Tenure | partial_dependence | Percentile_0 | Percentile_10 | Percentile_20 | Percentile_30 | Percentile_40 | Percentile_50 | Percentile_60 | Percentile_70 | Percentile_80 | Percentile_90 | Percentile_99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.00 | 0.576888 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 1 | 4.55 | 0.507550 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 2 | 8.10 | 0.500000 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 3 | 11.65 | 0.456086 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | 15.20 | 0.443143 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 5 | 18.75 | 0.420339 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 6 | 22.30 | 0.359322 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 7 | 25.85 | 0.330354 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 8 | 29.40 | 0.337442 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 9 | 32.95 | 0.336826 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 10 | 36.50 | 0.336826 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 11 | 40.05 | 0.322804 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 12 | 43.60 | 0.312481 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 13 | 47.15 | 0.318952 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 14 | 50.70 | 0.310015 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 15 | 54.25 | 0.306163 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 16 | 57.80 | 0.273960 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 17 | 61.35 | 0.192758 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 18 | 64.90 | 0.205239 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 19 | 68.45 | 0.205239 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
fig, ax = plt.subplots()
gbm_par_dep_Tenure.drop('partial_dependence', axis=1).plot(x='Tenure', colormap='gnuplot', ax=ax)

gbm_par_dep_Tenure.plot(title='Partial Dependence and ICE for Churn',
                         x='Tenure', 
                         y='partial_dependence',
                         style='r-', 
                         linewidth=3, 
                         ax=ax)

_ = plt.legend(bbox_to_anchor=(1.05, 0),
               loc=3, 
               borderaxespad=0.)
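To make the comparison between the two models concrete, the averaged partial dependence curves computed above (par_dep_Tenure for the Logistic Regression and gbm_par_dep_Tenure for the GBM) can be drawn on one set of axes. A minimal sketch:

# Overlay the GLM and GBM partial dependence curves for 'Tenure'
fig, ax = plt.subplots()
par_dep_Tenure.plot(x='Tenure', y='partial_dependence', label='Logistic Regression', ax=ax)
gbm_par_dep_Tenure.plot(x='Tenure', y='partial_dependence', label='GBM', ax=ax)
ax.set_title('Partial Dependence for Churn: GLM vs. GBM')
_ = plt.show()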

3.3 XGBoost

# Training the XGBoost model and the performance of the model

xgboost_model_local.train(x=X, y=target, training_frame=train, validation_frame = valid)
xgboost_model_local.model_performance()
ModelMetricsBinomial: xgboost
** Reported on train data. **

MSE: 0.06132453573019107
RMSE: 0.2476379125461024
LogLoss: 0.21775439530372342
Mean Per-Class Error: 0.06897156897156898
AUC: 0.979129370056011
pr_auc: 0.9411672125904095
Gini: 0.958258740112022
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4160831719636917: 

| | No | Yes | Error | Rate |
| --- | --- | --- | --- | --- |
| No | 2682.0 | 167.0 | 0.0586 | (167.0/2849.0) |
| Yes | 91.0 | 945.0 | 0.0878 | (91.0/1036.0) |
| Total | 2773.0 | 1112.0 | 0.0664 | (258.0/3885.0) |

Maximum Metrics: Maximum metrics at their respective thresholds

| metric | threshold | value | idx |
| --- | --- | --- | --- |
| max f1 | 0.4160832 | 0.8798883 | 197.0 |
| max f2 | 0.3003411 | 0.9168793 | 234.0 |
| max f0point5 | 0.5387729 | 0.8942644 | 151.0 |
| max accuracy | 0.4264682 | 0.9341055 | 193.0 |
| max precision | 0.9776292 | 1.0 | 0.0 |
| max recall | 0.0642277 | 1.0 | 342.0 |
| max specificity | 0.9776292 | 1.0 | 0.0 |
| max absolute_mcc | 0.4160832 | 0.8350746 | 197.0 |
| max min_per_class_accuracy | 0.3879922 | 0.9285714 | 206.0 |
| max mean_per_class_accuracy | 0.3585542 | 0.9310284 | 215.0 |

Gains/Lift Table: Avg response rate: 26.67 %, avg score: 26.70 %

| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.0100386 | 0.9424194 | 3.75 | 3.75 | 1.0 | 0.9552457 | 1.0 | 0.9552457 | 0.0376448 | 0.0376448 | 275.0 | 275.0 |
| 2 | 0.0200772 | 0.9212297 | 3.6538462 | 3.7019231 | 0.9743590 | 0.9316404 | 0.9871795 | 0.9434430 | 0.0366795 | 0.0743243 | 265.3846154 | 270.1923077 |
| 3 | 0.0301158 | 0.9050536 | 3.75 | 3.7179487 | 1.0 | 0.9124670 | 0.9914530 | 0.9331177 | 0.0376448 | 0.1119691 | 275.0 | 271.7948718 |
| 4 | 0.0401544 | 0.8860434 | 3.75 | 3.7259615 | 1.0 | 0.8956389 | 0.9935897 | 0.9237480 | 0.0376448 | 0.1496139 | 275.0 | 272.5961538 |
| 5 | 0.0501931 | 0.8648538 | 3.75 | 3.7307692 | 1.0 | 0.8763654 | 0.9948718 | 0.9142715 | 0.0376448 | 0.1872587 | 275.0 | 273.0769231 |
| 6 | 0.1001287 | 0.7814316 | 3.5953608 | 3.6632391 | 0.9587629 | 0.8228929 | 0.9768638 | 0.8686997 | 0.1795367 | 0.3667954 | 259.5360825 | 266.3239075 |
| 7 | 0.1500644 | 0.6925323 | 3.5180412 | 3.6149228 | 0.9381443 | 0.7361416 | 0.9639794 | 0.8245894 | 0.1756757 | 0.5424710 | 251.8041237 | 261.4922813 |
| 8 | 0.2 | 0.5959323 | 3.3247423 | 3.5424710 | 0.8865979 | 0.6410087 | 0.9446589 | 0.7787533 | 0.1660232 | 0.7084942 | 232.4742268 | 254.2471042 |
| 9 | 0.3001287 | 0.3844050 | 2.2172237 | 3.1003431 | 0.5912596 | 0.4889227 | 0.8267581 | 0.6820602 | 0.2220077 | 0.9305019 | 121.7223650 | 210.0343053 |
| 10 | 0.4 | 0.2166304 | 0.5605670 | 2.4662162 | 0.1494845 | 0.2954908 | 0.6576577 | 0.5855423 | 0.0559846 | 0.9864865 | -43.9432990 | 146.6216216 |
| 11 | 0.5001287 | 0.1206617 | 0.0867609 | 1.9898353 | 0.0231362 | 0.1632120 | 0.5306227 | 0.5009893 | 0.0086873 | 0.9951737 | -91.3239075 | 98.9835306 |
| 12 | 0.6 | 0.0609180 | 0.0483247 | 1.6666667 | 0.0128866 | 0.0882648 | 0.4444444 | 0.4322904 | 0.0048263 | 1.0 | -95.1675258 | 66.6666667 |
| 13 | 0.6998713 | 0.0307850 | 0.0 | 1.4288341 | 0.0 | 0.0440276 | 0.3810224 | 0.3768855 | 0.0 | 1.0 | -100.0 | 42.8834130 |
| 14 | 0.8 | 0.0131013 | 0.0 | 1.25 | 0.0 | 0.0209528 | 0.3333333 | 0.3323366 | 0.0 | 1.0 | -100.0 | 25.0 |
| 15 | 0.8998713 | 0.0049039 | 0.0 | 1.1112700 | 0.0 | 0.0084077 | 0.2963387 | 0.2963857 | 0.0 | 1.0 | -100.0 | 11.1270023 |
| 16 | 1.0 | 0.0003411 | 0.0 | 1.0 | 0.0 | 0.0024668 | 0.2666667 | 0.2669560 | 0.0 | 1.0 | -100.0 | 0.0 |
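All of the metrics above are reported on the training frame. Since a validation frame was passed to `train()`, the corresponding validation metrics can be retrieved with `model_performance(valid=True)`. The gains/lift columns are also internally consistent: `cumulative_lift` equals `cumulative_capture_rate / cumulative_data_fraction`, which is easy to verify on, say, group 6:

```python
# Validation-frame performance (valid was passed to train() above).
perf_valid = xgboost_model_local.model_performance(valid=True)
print(perf_valid.auc())

# Gains/lift consistency check on group 6:
# cumulative_lift = cumulative_capture_rate / cumulative_data_fraction
print(0.3667954 / 0.1001287)  # ~3.6632391, matching the table
```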

3.3.1 Variable Importance

xgboost_model_local.varimp_plot()
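Beyond the plot, the underlying importances can be pulled out as a table (useful for filling in the score tables); `varimp(use_pandas=True)` is part of H2O's model API:

```python
# Variable importances as a pandas DataFrame
# (columns: variable, relative_importance, scaled_importance, percentage).
varimp_df = xgboost_model_local.varimp(use_pandas=True)
print(varimp_df.head())
```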

3.3.2 Partial Dependence Plot

xgboost_model_local.partial_plot(data=df, cols=['Tenure'], plot=True, plot_stddev=True)
PartialDependence: Partial Dependence Plot of model XGBoost_model_python_1597013319592_19135 on column 'Tenure'

| tenure | mean_response | stddev_response | std_error_mean_response |
| --- | --- | --- | --- |
| 1.0 | 0.3390722 | 0.3175808 | 0.0039421 |
| 4.7368421 | 0.3265676 | 0.3092209 | 0.0038384 |
| 8.4736842 | 0.3189497 | 0.3025351 | 0.0037554 |
| 12.2105263 | 0.2732323 | 0.2627196 | 0.0032611 |
| 15.9473684 | 0.2541345 | 0.2545033 | 0.0031592 |
| 19.6842105 | 0.2689445 | 0.2609253 | 0.0032389 |
| 23.4210526 | 0.2041085 | 0.2114047 | 0.0026242 |
| 27.1578947 | 0.2120031 | 0.2326531 | 0.0028879 |
| 30.8947368 | 0.2089505 | 0.2294045 | 0.0028476 |
| 34.6315789 | 0.2079895 | 0.2272737 | 0.0028212 |
| 38.3684211 | 0.2144646 | 0.2273624 | 0.0028223 |
| 42.1052632 | 0.2050112 | 0.2244768 | 0.0027864 |
| 45.8421053 | 0.1991063 | 0.2210418 | 0.0027438 |
| 49.5789474 | 0.2045538 | 0.2324444 | 0.0028853 |
| 53.3157895 | 0.2221776 | 0.2186239 | 0.0027138 |
| 57.0526316 | 0.1401536 | 0.1683525 | 0.0020898 |
| 60.7894737 | 0.1114619 | 0.1456679 | 0.0018082 |
| 64.5263158 | 0.1023254 | 0.1369044 | 0.0016994 |
| 68.2631579 | 0.1255760 | 0.1615960 | 0.0020059 |
| 72.0 | 0.0795766 | 0.1187945 | 0.0014746 |

Observed closely, the PDP results for GBM and XGBoost are very similar. This is expected: both algorithms are tree-based ensembles, so they partition the feature space in comparable ways, and models of similar structure tend to produce similar explanations. Overlaying the two curves makes the comparison explicit, as in the sketch below.
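A minimal sketch of such an overlay, assuming the GBM from section 3.2 is available as `gbm_model` (that name is an assumption) and using H2O's `partial_plot(..., plot=False)` to retrieve the tables without drawing; the returned table uses the lowercase column names seen in the printed output above:

```python
# Overlay the Tenure PDPs of the GBM and XGBoost models on one axis.
# Assumes gbm_model is the GBM trained earlier in the notebook.
pdp_gbm = gbm_model.partial_plot(data=df, cols=['Tenure'], plot=False)[0].as_data_frame()
pdp_xgb = xgboost_model_local.partial_plot(data=df, cols=['Tenure'], plot=False)[0].as_data_frame()

fig, ax = plt.subplots()
ax.plot(pdp_gbm['tenure'], pdp_gbm['mean_response'], label='GBM')
ax.plot(pdp_xgb['tenure'], pdp_xgb['mean_response'], label='XGBoost')
ax.set_xlabel('Tenure')
ax.set_ylabel('Mean response')
ax.set_title('PDP comparison: GBM vs. XGBoost')
_ = ax.legend()
```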

xgboost_model_local.partial_plot(data=df, cols=['Total Charges'], plot=True, plot_stddev=True)
PartialDependence: Partial Dependence Plot of model XGBoost_model_python_1597013319592_19135 on column 'Total Charges'

| total_charges | mean_response | stddev_response | std_error_mean_response |
| --- | --- | --- | --- |
| 18.8 | 0.4238615 | 0.3261315 | 0.0040483 |
| 474.9052632 | 0.2464518 | 0.2738153 | 0.0033989 |
| 931.0105263 | 0.2916773 | 0.2639648 | 0.0032766 |
| 1387.1157895 | 0.2316458 | 0.2615037 | 0.0032461 |
| 1843.2210526 | 0.2017253 | 0.2400597 | 0.0029799 |
| 2299.3263158 | 0.2147928 | 0.2491192 | 0.0030923 |
| 2755.4315789 | 0.2225582 | 0.2517619 | 0.0031251 |
| 3211.5368421 | 0.2479061 | 0.2703625 | 0.0033560 |
| 3667.6421053 | 0.1678940 | 0.1976482 | 0.0024534 |
| 4123.7473684 | 0.2350038 | 0.2451407 | 0.0030429 |
| 4579.8526316 | 0.1998015 | 0.2325371 | 0.0028865 |
| 5035.9578947 | 0.2053192 | 0.2169033 | 0.0026924 |
| 5492.0631579 | 0.2459624 | 0.2454963 | 0.0030474 |
| 5948.1684211 | 0.2257868 | 0.2376747 | 0.0029503 |
| 6404.2736842 | 0.2348899 | 0.2398839 | 0.0029777 |
| 6860.3789474 | 0.2202493 | 0.2335130 | 0.0028986 |
| 7316.4842105 | 0.2375569 | 0.2388472 | 0.0029648 |
| 7772.5894737 | 0.2341314 | 0.2315600 | 0.0028744 |
| 8228.6947368 | 0.2336091 | 0.2314393 | 0.0028729 |
| 8684.8 | 0.2336091 | 0.2314393 | 0.0028729 |
# calculate 2-D partial dependence
h2o.no_progress()
resolution = 20
par_dep_Tenure_v_TotalCharges_XGB = par_dep_2d('Tenure',
                                           'Total Charges',
                                           df,
                                           xgboost_model_local,
                                           resolution=resolution)

print(par_dep_Tenure_v_TotalCharges_XGB)
min1: 1.0
max1: 72.0
by1: 3.55
min2: 18.8
max2: 8684.8
by2: 433.3
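These step sizes are just the column ranges divided by the resolution: by1 = (72.0 - 1.0) / 20 = 3.55 for Tenure, and by2 = (8684.8 - 18.8) / 20 = 433.3 for Total Charges.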
     Tenure  Total Charges  partial_dependence
0      1.00           18.8            0.663790
1      1.00          452.1            0.448844
2      1.00          885.4            0.452388
3      1.00         1318.7            0.440986
4      1.00         1752.0            0.448382
5      1.00         2185.3            0.446995
6      1.00         2618.6            0.409245
7      1.00         3051.9            0.461325
8      1.00         3485.2            0.413713
9      1.00         3918.5            0.321572
10     1.00         4351.8            0.547304
11     1.00         4785.1            0.563174
12     1.00         5218.4            0.387827
13     1.00         5651.7            0.385362
14     1.00         6085.0            0.527273
15     1.00         6518.3            0.445609
16     1.00         6951.6            0.467951
17     1.00         7384.9            0.401695
18     1.00         7818.2            0.451002
19     1.00         8251.5            0.449307
20     4.55           18.8            0.665177
21     4.55          452.1            0.417720
22     4.55          885.4            0.438675
23     4.55         1318.7            0.426656
24     4.55         1752.0            0.440370
25     4.55         2185.3            0.437596
26     4.55         2618.6            0.396918
27     4.55         3051.9            0.449461
28     4.55         3485.2            0.393991
29     4.55         3918.5            0.301387
..      ...            ...                 ...
370   64.90         4351.8            0.095223
371   64.90         4785.1            0.120647
372   64.90         5218.4            0.040370
373   64.90         5651.7            0.082743
374   64.90         6085.0            0.097072
375   64.90         6518.3            0.023112
376   64.90         6951.6            0.077812
377   64.90         7384.9            0.048844
378   64.90         7818.2            0.071803
379   64.90         8251.5            0.071341
380   68.45           18.8            0.352542
381   68.45          452.1            0.151926
382   68.45          885.4            0.126965
383   68.45         1318.7            0.097381
384   68.45         1752.0            0.089060
385   68.45         2185.3            0.107242
386   68.45         2618.6            0.094299
387   68.45         3051.9            0.113559
388   68.45         3485.2            0.065331
389   68.45         3918.5            0.051618
390   68.45         4351.8            0.134823
391   68.45         4785.1            0.167334
392   68.45         5218.4            0.067334
393   68.45         5651.7            0.114330
394   68.45         6085.0            0.122958
395   68.45         6518.3            0.053775
396   68.45         6951.6            0.105393
397   68.45         7384.9            0.067797
398   68.45         7818.2            0.088906
399   68.45         8251.5            0.087211

[400 rows x 3 columns]
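The `par_dep_2d` helper is defined earlier in the notebook; the min/max/by values it prints are the two column ranges together with the step by = (max - min) / resolution. As a rough reference, a minimal sketch of such a helper (the name `par_dep_2d_sketch`, the brute-force loop, and reading the positive-class probability from the last column of H2O's binomial prediction frame are all assumptions):

```python
# Sketch only: 2-D partial dependence by brute force.
# For each point on a resolution x resolution grid over (col1, col2),
# hold both columns fixed and average the model's positive-class probability.
def par_dep_2d_sketch(col1, col2, frame, model, resolution=20):
    base = frame.as_data_frame()
    by1 = (base[col1].max() - base[col1].min()) / resolution
    by2 = (base[col2].max() - base[col2].min()) / resolution
    grid1 = base[col1].min() + by1 * np.arange(resolution)
    grid2 = base[col2].min() + by2 * np.arange(resolution)
    records = []
    for v1 in grid1:                      # col2 varies fastest, so the result
        for v2 in grid2:                  # reshapes cleanly into a 2-D grid
            tmp = base.copy()
            tmp[col1] = v1
            tmp[col2] = v2
            preds = model.predict(h2o.H2OFrame(tmp)).as_data_frame()
            records.append((v1, v2, preds.iloc[:, -1].mean()))
    return pd.DataFrame(records, columns=[col1, col2, 'partial_dependence'])
```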
# create 3-D grid: the reshape is valid because par_dep_2d emits rows with
# 'Total Charges' varying fastest, so each reshaped row holds Tenure constant
from matplotlib import cm                  # colormap for the surface below
from mpl_toolkits.mplot3d import Axes3D    # registers the '3d' projection on older matplotlib
new_shape = (resolution, resolution)
x = np.asarray(par_dep_Tenure_v_TotalCharges_XGB['Tenure']).reshape(new_shape)
y = np.asarray(par_dep_Tenure_v_TotalCharges_XGB['Total Charges']).reshape(new_shape)
z = np.asarray(par_dep_Tenure_v_TotalCharges_XGB['partial_dependence']).reshape(new_shape)

fig = plt.figure(figsize=(8,6))
ax = plt.axes(projection='3d')

# set axes labels
ax.set_title('Partial Dependence for Churn')
ax.set_xlabel('Tenure')
ax.set_ylabel('Total Charges')
ax.set_zlabel('\nChurn')

# axis decorators/details
#ax.zaxis.set_major_locator(LinearLocator(10))
#ax.zaxis.set_major_formatter(FormatStrFormatter('%.1f'))

# surface
surf = ax.plot_surface(x, y, z, 
                       cmap=cm.coolwarm, 
                       linewidth=0.05, 
                       rstride=1, 
                       cstride=1, 
                       antialiased=True)
plt.tight_layout()

_ = plt.show()

3.3.3 Individual Conditional Expectation
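The ICE curves below use `quantile_dict`, a mapping from each decile (plus the 99th percentile) of the target to the Id of a representative row; it was built earlier in the notebook via `get_quantile_dict('Churn', 'Id', df)`. A minimal sketch of what such a helper might look like (the implementation details are assumptions):

```python
# Sketch only: map percentiles of column y to the Id of the row whose
# y-value sits at that percentile, so ICE curves can be drawn per row.
def get_quantile_dict_sketch(y, id_col, frame):
    sorted_df = frame.as_data_frame().sort_values(y).reset_index(drop=True)
    percentiles = list(range(0, 100, 10)) + [99]   # 0, 10, ..., 90, 99
    return {p: sorted_df.loc[int(round(p / 100 * (len(sorted_df) - 1))), id_col]
            for p in percentiles}
```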

xgboost_par_dep_TC = par_dep('Total Charges', df, xgboost_model_local)
# xgboost_par_dep_TC.plot.line(x='Total Charges', y='partial_dependence')
# print(xgboost_par_dep_TC)
# quantile_dict (percentile -> row Id) was built earlier in the notebook:
# quantile_dict = get_quantile_dict('Churn', 'Id', df)

bins = list(xgboost_par_dep_TC['Total Charges'])
for i in sorted(quantile_dict.keys()):
    col_name = 'Percentile_' + str(i)
    xgboost_par_dep_TC[col_name] = par_dep('Total Charges',
                                   df[df['Id'] == int(quantile_dict[i])],
                                   xgboost_model_local,
                                   bins=bins)['partial_dependence']
xgboost_par_dep_TC

| | Total Charges | partial_dependence | Percentile_0 | Percentile_10 | Percentile_20 | Percentile_30 | Percentile_40 | Percentile_50 | Percentile_60 | Percentile_70 | Percentile_80 | Percentile_90 | Percentile_99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 18.8 | 0.574422 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 1 | 452.1 | 0.333128 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 2 | 885.4 | 0.338829 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 3 | 1318.7 | 0.310015 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 4 | 1752.0 | 0.310786 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 5 | 2185.3 | 0.326965 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 6 | 2618.6 | 0.306626 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 7 | 3051.9 | 0.359168 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 8 | 3485.2 | 0.290755 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 9 | 3918.5 | 0.200462 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 10 | 4351.8 | 0.452080 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 11 | 4785.1 | 0.441757 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 12 | 5218.4 | 0.252388 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 13 | 5651.7 | 0.271495 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 14 | 6085.0 | 0.417720 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 15 | 6518.3 | 0.323267 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 16 | 6951.6 | 0.354854 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 17 | 7384.9 | 0.274268 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 18 | 7818.2 | 0.329584 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 19 | 8251.5 | 0.328351 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |

fig, ax = plt.subplots()
xgboost_par_dep_TC.drop('partial_dependence', axis=1).plot(x='Total Charges', colormap='gnuplot', ax=ax)

xgboost_par_dep_TC.plot(title='Partial Dependence and ICE for Total Charges',
                         x='Total Charges', 
                         y='partial_dependence',
                         style='r-', 
                         linewidth=3, 
                         ax=ax)

_ = plt.legend(bbox_to_anchor=(1.05, 0),
               loc=3, 
               borderaxespad=0.)
GitHub: Evaluation_System.ipynb
[Figure: ICE plot for Tenure]
[Figure: PDP for Tenure]
[Additional output figures from the notebook: variable importance, PDP, ICE, and 3-D partial dependence surface plots]