3.2.3 Sample Theory

Explains sampling theory and how to implement it with the h2o library.


Sampling Distributions

A sampling distribution describes the distribution of a specific statistic. By drawing subsets (samples) of the full data set, you can explore and simulate statistics such as the mean, variance, and skew, and see how they vary from sample to sample.

Sampling distributions let us draw conclusions about a population from sample statistics: a sample is a statistical representation of the actual population. Before diving into how sampling is done, let's define a few key terms that we will need when calculating sample sizes.
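To see this concretely, here is a minimal sketch that simulates the sampling distribution of the mean (it uses NumPy, which is our assumption and not part of this page's H2O pipeline; all values are illustrative):

import numpy as np

rng = np.random.default_rng(42)
# A synthetic "population" standing in for the full data set.
population = rng.normal(loc=50, scale=10, size=100_000)

# Draw many samples and record each sample's mean; the distribution
# of these means is the sampling distribution of the mean.
sample_means = np.array([
    rng.choice(population, size=100, replace=False).mean()
    for _ in range(2_000)
])

# The means cluster around 50, with spread ~ 10 / sqrt(100) = 1.
print(sample_means.mean(), sample_means.std())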

I. Confidence Interval

A confidence interval is a range constructed so that, with a stated level of confidence, it contains the parameter of interest.

Confidence Interval Width: The distance between the upper and lower bounds of the confidence interval.

Confidence Interval Formula: x̄ ± z · (σ / √n), where x̄ is the sample mean, z the z-score for the chosen confidence level, σ the standard deviation, and n the sample size.
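A minimal sketch of computing a 95% confidence interval for a sample mean, again with NumPy (the data and values here are illustrative, not the page's own code):

import numpy as np

rng = np.random.default_rng(7)
sample = rng.normal(loc=100, scale=15, size=400)

z = 1.96                                        # z-score for 95% confidence
se = sample.std(ddof=1) / np.sqrt(sample.size)  # standard error of the mean
lower, upper = sample.mean() - z * se, sample.mean() + z * se

# The confidence interval width is the distance between the bounds.
print(f"95% CI: ({lower:.2f}, {upper:.2f}), width = {upper - lower:.2f}")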

II. Method & Formula for Sampling

In the real world, working with an entire population's data can be slow and resource-intensive, but we can use sampling distributions to estimate what a population parameter most probably is. The general process is as follows:

  • Get your data

  • Figure out what you want to estimate (e.g., binary classification or regression)

  • Bootstrap that parameter

  • Create confidence intervals (see the bootstrap sketch after this list)
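As an illustration of the last two steps, here is a minimal bootstrap sketch (NumPy-based; the data and the choice of the mean as the parameter are hypothetical examples):

import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1_000)  # stand-in for "your data"

# Bootstrap the parameter of interest (here, the mean): resample the data
# with replacement many times and recompute the statistic each time.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(5_000)
])

# A 95% confidence interval from the bootstrap percentiles.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lower:.3f}, {upper:.3f})")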

Which formula applies depends on the type of problem you want to address, which in turn is determined by the distribution of the target the model is built to predict.

  • Binary: n ≥ p(1 − p) · (z / e)², where p is the observed proportion of one class, z is the z-score, and e is the margin of error.

For the other distributions, we use rules of thumb for the minimum sample size:

  • Multivariate: n ≥ 100 + 50k

  • Regression: n ≥ 104 + k

Note: k is the number of independent variables.
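A quick sanity check of what these rules imply, with illustrative numbers (the values of p and k below are examples, not prescriptions):

# Binary: n = p(1 - p) * (z / e)^2
p, z, e = 0.5, 1.96, 0.01        # worst-case class balance, 95% confidence, 1% error
n_binary = p * (1 - p) * (z / e) ** 2
print(n_binary)                  # 9604.0

k = 10                           # number of independent variables
print(100 + 50 * k)              # multivariate rule of thumb: 600
print(104 + k)                   # regression rule of thumb: 114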

III. Implementation - Python Function

We created our own function with the following assumptions:

  • Margin of error = 1%

  • Confidence level = 95%

  • z = 1.96

IV. Sampling Theorem Script:

class Sampling:

    def __init__(self):
        # Note: the `df` passed to sampling() is expected to be an h2o.H2OFrame.
        self.__error__ = 0.01              # margin of error (1%)
        self.__zscore__ = 1.96             # z-score for a 95% confidence level
        self.__binary__ = 2                # unique target values in a binary problem
        self.__multivariate__ = 9          # upper bound of unique values for multivariate
        self.__addmultivariate__ = 200     # additive constant, multivariate rule
        self.__factormultivariate__ = 100  # per-feature factor, multivariate rule
        self.__addcontinuous__ = 500       # additive constant, continuous rule

    def sampling(self, df, y):
        # The number of distinct target values determines the problem type.
        n_unique = df[y].unique().nrow
        if n_unique == self.__binary__:
            print('binary')
            # values = the two category values of the binary target
            values = list(df[y].unique().as_data_frame()['C1'])
            # p0, p1 = row counts for category 1 and category 2
            p0, p1 = df[df[y] == values[0]].nrow, df[df[y] == values[1]].nrow
            # prob0, prob1 = observed probabilities of the two categories
            prob0, prob1 = p0 / (p0 + p1), p1 / (p0 + p1)
            # minimum sample size: n = p(1 - p) * (z / e)^2
            n_samples = prob0 * prob1 * (self.__zscore__ / self.__error__) ** 2
        elif self.__binary__ < n_unique < self.__multivariate__:
            print('multivariate')
            # features = number of columns in the given dataset
            features = len(df.columns)
            print('features', features)
            n_samples = self.__addmultivariate__ + self.__factormultivariate__ * features
        else:
            print('continuous')
            # features = number of columns in the given dataset
            features = len(df.columns)
            print('features', features)
            n_samples = self.__addcontinuous__ + features  # first try = 104, second = 500
        print('df', df.shape)
        print('n_samples', n_samples)
        # If the total population is smaller than the computed sample size,
        # we take the whole population as the sample.
        if n_samples / df.shape[0] < 1:
            print('returning sampled_df')
            sample_df = df.split_frame(ratios=[n_samples / df.shape[0]])[0]
            return sample_df
        else:
            print('returning df')
            return df
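A minimal usage sketch, assuming H2O is installed and that my_data.csv and its target column label exist (both are hypothetical placeholders):

import h2o

h2o.init()

# Hypothetical file and target column, for illustration only.
df = h2o.import_file("my_data.csv")
sample_df = Sampling().sampling(df, "label")
print(sample_df.shape)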

Example output: binary number-of-samples calculation.

We used the H2O library to run and test this sampling procedure, which supports the interpretability and evaluation of the interpretable models.