3.2.3 Sample Theory

This section explains sampling theory and how we apply it using the H2O library.

Sampling Distributions

A sampling distribution is the distribution of a specific statistic (such as the mean, variance, or skew) computed over many samples drawn from the same population. Each sample is a subset of the full data set; by drawing samples repeatedly, you can explore and simulate how the statistic varies from one sample to the next.
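To make the idea concrete, the following sketch simulates the sampling distribution of the mean using only the Python standard library. The population, the sample size, and the helper name `sampling_distribution_of_mean` are assumptions chosen for illustration; they do not come from our pipeline.

```python
import random
import statistics


def sampling_distribution_of_mean(population, sample_size, n_samples, seed=0):
    """Return the means of n_samples random samples (without replacement)."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population: 10,000 values from a skewed (exponential) distribution.
population_rng = random.Random(42)
population = [population_rng.expovariate(1.0) for _ in range(10_000)]

sample_means = sampling_distribution_of_mean(population, sample_size=50, n_samples=2_000)

print("population mean:     ", round(statistics.fmean(population), 3))
print("mean of sample means:", round(statistics.fmean(sample_means), 3))
print("std of sample means: ", round(statistics.stdev(sample_means), 3))
```

The sample means should cluster around the population mean, with a much smaller spread than the population itself.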

Sampling distributions let us draw conclusions about a population from sample statistics. A sample is meant to be a statistical representation of the actual population. Before we look at how to sample, let's define a few key terms that help us calculate sample sizes.

I. Confidence Interval

A confidence interval is a range, computed from sample data, that is expected to contain the parameter of interest at a stated confidence level (for example, 95%).

Confidence Interval Width: The distance between the upper and lower bounds of the confidence interval.

[Figure: Confidence Interval Formula]
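A common normal-approximation form of the interval for a mean is x̄ ± z · s / √n, where x̄ is the sample mean, s the sample standard deviation, n the sample size, and z the z-score for the chosen confidence level. The sketch below assumes that form; the sample values and the function name `mean_confidence_interval` are hypothetical.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Normal-approximation confidence interval for the mean of `sample`."""
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    margin = z * standard_error  # half of the interval width
    return mean - margin, mean + margin


# Hypothetical measurements, for illustration only.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

lower, upper = mean_confidence_interval(sample)  # z = 1.96 for a 95% level
width = upper - lower                            # confidence interval width

print("interval:", (round(lower, 3), round(upper, 3)))
print("width:   ", round(width, 3))
```

For a fixed confidence level, the width shrinks in proportion to 1/√n, which is why a larger sample gives a tighter estimate.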

II. Method & Formula for Sampling

In the real world, working with an entire population's data can be slow and resource-intensive. Instead, we can use sampling distributions to estimate the most probable value of a population parameter. The general process is as follows:

  • Get your data

  • Figure out what you want to estimate (e.g., binary classification or regression)

  • Bootstrap that parameter

  • Create confidence intervals.
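The steps above can be sketched with a percentile bootstrap. This is a minimal illustration, not the method used in our script: the data, the function name `bootstrap_ci`, and the number of resamples are all assumptions.

```python
import random
import statistics


def bootstrap_ci(data, statistic=statistics.fmean, n_resamples=2_000,
                 confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for `statistic` of `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lower = estimates[int(alpha * n_resamples)]
    upper = estimates[int((1 - alpha) * n_resamples) - 1]
    return lower, upper


# Hypothetical binary target: 300 positives out of 1,000 rows.
data = [1] * 300 + [0] * 700

lower, upper = bootstrap_ci(data)  # the mean of a 0/1 target is the positive rate
print("95% bootstrap interval for the positive rate:", (lower, upper))
```

The percentile method is only one way to build a bootstrap interval; it is used here because it needs no distributional assumptions beyond the resampling itself.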

The required sample size depends on the type of problem, which is determined by the distribution of the target variable that the model is built to predict.

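  • Binary: n = p(1 - p)(z / e)^2, where p is the proportion of one class in the target, z is the z-score for the chosen confidence level, and e is the margin of error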

For the other target types, we use rules of thumb for the minimum sample size:

  • Multivariate: n >= 100 + 50k

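  • Regression: n >= 104 + k

Note: k is the number of independent variables. The script in section IV uses larger constants (200 + 100k for multivariate targets and 500 + k for regression).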
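As a worked example, the sketch below computes the minimum sample size for each problem type. The binary case assumes n = p(1 - p)(z / e)^2, which matches the calculation in the script in section IV; the other two cases use the rule-of-thumb constants listed above. The function name `min_sample_size` and its parameters are hypothetical.

```python
import math


def min_sample_size(problem, p=0.5, k=0, z=1.96, error=0.01):
    """Minimum sample size for a problem type.

    p: proportion of one class (binary only)
    k: number of independent variables (multivariate and regression)
    """
    if problem == "binary":
        n = p * (1 - p) * (z / error) ** 2
        # Round first so floating-point noise does not push ceil() up by one.
        return math.ceil(round(n, 6))
    if problem == "multivariate":
        return 100 + 50 * k
    if problem == "regression":
        return 104 + k
    raise ValueError(f"unknown problem type: {problem!r}")


print(min_sample_size("binary", p=0.5))        # balanced classes, worst case
print(min_sample_size("binary", p=0.3))        # 30% / 70% class split
print(min_sample_size("multivariate", k=10))
print(min_sample_size("regression", k=10))
```

With a 1% margin of error and z = 1.96, a balanced binary target (p = 0.5) needs 9,604 rows; this is the largest value the binary formula can produce, because p(1 - p) peaks at p = 0.5.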

III. Implementation - Python function

We created our own function:

Assumptions:

  • Margin of error = 1%

  • Confidence level = 95%

  • z = 1.96

IV. Sampling Theorem Script:

We use the H2O library to draw and test the sample, which supports the interpretability and evaluation of the interpretable models.

import math


class Sampling:
    """Compute a minimum sample size from the target and sample an H2OFrame."""

    def __init__(self):
        self.__error__ = 0.01                # margin of error (1%)
        self.__zscore__ = 1.96               # z-score for a 95% confidence level
        self.__binary__ = 2                  # a binary target has exactly 2 unique values
        self.__multivariate__ = 9            # more unique values than this: continuous
        self.__addmultivariate__ = 200
        self.__factormultivariate__ = 100
        self.__addcontinous__ = 500

    def sampling(self, df, y):
        # df is an H2OFrame, y is the name of the target column
        n_unique = len(df[y].unique())
        if n_unique < self.__binary__:
            raise ValueError('target must have at least two distinct values')
        # features = number of independent variables (all columns except the target)
        features = len(df.columns) - 1
        if n_unique == self.__binary__:
            print('binary')
            # values = the two categories of the binary target
            values = list(df[y].unique().as_data_frame()['C1'])
            # p0 = number of rows in category 1, p1 = number of rows in category 2
            p0, p1 = len(df[df[y] == values[0]]), len(df[df[y] == values[1]])
            # prob0 = probability of category 1, prob1 = probability of category 2
            prob0, prob1 = p0/(p0+p1), p1/(p0+p1)
            # n_samples = p(1 - p)(z / e)^2, rounded up to a whole row count
            n_samples = math.ceil(prob0*prob1*(self.__zscore__/self.__error__)**2)
        elif n_unique <= self.__multivariate__:
            # elif/else ensures every target falls into exactly one branch,
            # including a target with exactly 9 unique values
            print('multivariate')
            print('features', features)
            n_samples = self.__addmultivariate__ + (self.__factormultivariate__ * features)
        else:
            print('continuous')
            print('features', features)
            n_samples = self.__addcontinous__ + features # first try = 104, second = 500
        print('df', df.shape)
        print('n_samples', n_samples)
        # if the total population of the data is less than the sample calculated,
        # we take the whole population as a sample
        fraction = n_samples/df.shape[0]
        if fraction < 1:
            print('returning sampled_df')
            # a single ratio yields two frames (sample, remainder) and stays valid
            # for any fraction below 1; split_frame is probabilistic, so the
            # sample size is approximate
            sample_df, _ = df.split_frame(ratios=[fraction])
            return sample_df
        else:
            print('returning df')
            return df
