# 3.2.3 Sample Theory

**Sampling Distributions**

Sampling distributions describe the distribution for a specific statistic. That is, sampling distributions are a subset (sample) of the full data set, with which you can play, explore, and simulate statistics like averages, variance, and skew.

***Sampling distributions help us create conclusions using the statistics about a population. A sample population is the statistical representation of the actual population.***\
Before we dive into what and how to do sampling, let's understand a few key terms, that would help us calculate samples.

**I. Confidence Interval**\
Confidence intervals are ‘intervals’ that we can create to guess with a certain degree of accuracy where a parameter of interest lies.

Confidence Interval Width: The distance between the upper and lower bounds of the confidence interval.<br>

![Confidence Interval Formula](https://lh6.googleusercontent.com/ZlH3G5E3Jq_evaEfq0t6vMGigzedTlwVn9Vna7e356zlrmZzRywVFCNa_vo83gCJ5en15Rys859GINskcnjvPjfbDmSX2f0MF2CE7AdLB-hafxd7Xn-9H7g-rrKJs0RYL_rCz5ec-A8)

**II. Method & Formula for Sampling**

In the real world, working with an entire population's data can be slow and heavy, but we can use sampling distributions to estimate what a population parameter most probably is. The general process is as follows:

* Get your data
* Figure out what you want to estimate(ex. Binary Classification/Regression)
* Bootstrap that parameter
* Create Confidence intervals.

It depends on the type of problem you want to address based on the distribution of the target on which the model is being built to predict.

* **Binary:**&#x20;

![Binary: Number of sample calculations](https://lh4.googleusercontent.com/Effbl1wFOc-7YSp0buvZiooxewElYaQZVWhwAR7B9HddL-EyllkDewOru3vOp6pq5HEVZZqjqAiHCKXN8cnTnYvCaAy_VaCOp7e6XYZ8P8ujJkaRkLLNn2zqoS6aPqQozAwQtpRBiZg)

For the other distributions, we use the Rule of thumb for minimum sample size

* **Multivariate:** n >= 100 + 50k
* **Regression:** n >= 104 + k\
  \&#xNAN;*Note: Where k is the number of the independent variable*

**III. Implementation - Python function**\
We created our own function:&#x20;

**Assumptions:**

* Error% = **1%** &#x20;
* Confidence Interval = **95%**
* z = **1.95**

  &#x20;                                               ![](https://lh6.googleusercontent.com/rNNFFSnog61IaJ1n1iMpXJnCUHta30mMjYzIVa0pOHT5DfeGvOm2o0sIG0lFw9RvWjYfkwJyc2q9K47Fv0rGpnpKGcVLtEikaNbLXXaPoinJygF34UTzUQvTF8lrZTQX9E46RU_C1ZM)

**IV. Sampling Theorem Script:**

We have used the [**H2O** ](https://www.h2o.ai/)library to operate and test the sampling to help us with the Interpretability and Evaluations of the Interpretable Models.

```python
class Sampling:
    
    def __init__(self):
        self.__error__ = 0.01
        self.__zscore__ = 1.96
        self.__binary__ = 2
        self.__multivariate__ = 9
        self.__addmultivariate__ = 200
        self.__factormultivariate__ = 100
        self.__addcontinous__ = 500   

    def sampling(self, df, y):
        if len(df[y].unique()) == self.__binary__:
            print('binary')
            # values = have the binary category list from the target
            values = list(df[y].unique().as_data_frame()['C1'])
            # p0 = number of samples for category 1, p1 = number of samples for category 2
            p0, p1 = len(df[df[y] == values[0]]), len(df[df[y] == values[1]])
            # prob0 = probablity of category 1. prob1 = probability of category 2
            prob0, prob1 = p0/(p0+p1), p1/(p0+p1)
            # n_samples = number of samples to be computed
            n_samples = prob0*prob1*(self.__zscore__/self.__error__)**2
        if self.__binary__ < len(df[y].unique()) < self.__multivariate__:
            print('multivariate')
            # features = number of features the given dataset has
            features = len(df.columns)
            print('features', features)
            # n_samples = number of samples to be computed
            n_samples = self.__addmultivariate__ + (self.__factormultivariate__ * features)
        if self.__multivariate__ < len(df[y].unique()):
            print('continuous')
            # features = number of features the given dataset has
            features = len(df.columns)
            print('features', features)
            # n_samples = number of samples to be computed
            n_samples = self.__addcontinous__ + features # first try = 104, second = 500
        print('df', df.shape)
        print('n_samples', n_samples)
        # if the total population of the data is less than the sample calculated,
        # we take the whole population as a sample
        if n_samples/df.shape[0] < 1:
            print('returning sampled_df')
            sample_df, _, _ = df.split_frame(ratios=[n_samples/df.shape[0], 0.8 - n_samples/df.shape[0]])
            return sample_df
        else:
            print('returning df')
            return df
```

<br>
