3.2.3 Sample Theory

Explains sampling theory and an approach for applying it with the H2O library.

Sampling Distributions

A sampling distribution describes how a specific statistic is distributed across samples. In practice, you draw subsets (samples) of the full data set and use them to explore and simulate statistics such as the mean, variance, and skew.

Sampling distributions let us draw conclusions about a population from sample statistics. A sample is a statistical representation of the actual population. Before we dive into what sampling is and how to do it, let's go over a few key terms that help us calculate sample sizes.

I. Confidence Interval

A confidence interval is a range, computed from sample data, that is expected to contain the parameter of interest with a stated level of confidence.

Confidence Interval Width: The distance between the upper and lower bounds of the confidence interval.

Confidence Interval Formula
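
One common form, assuming a large sample and a normal approximation (an assumption here, since other forms exist for proportions and small samples), gives the interval for a mean and its width as:

```latex
\[
  \bar{x} \;\pm\; z \cdot \frac{s}{\sqrt{n}},
  \qquad
  \text{width} = 2\, z \cdot \frac{s}{\sqrt{n}}
\]
```

Here \bar{x} is the sample mean, s the sample standard deviation, n the sample size, and z the critical value of the standard normal distribution for the chosen confidence level (about 1.96 for 95%). The width shrinks in proportion to 1/\sqrt{n}, which is why a larger sample gives a tighter interval.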

II. Method & Formula for Sampling

In the real world, working with an entire population's data can be slow and heavy, but we can use sampling distributions to estimate what a population parameter most probably is. The general process is as follows:

  • Get your data

  • Figure out what you want to estimate (e.g., binary classification or regression)

  • Bootstrap that parameter

  • Create confidence intervals.
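
The steps above can be sketched with a percentile bootstrap. This is a minimal illustration using only the Python standard library, not the function used in this project; the helper name `bootstrap_ci`, the synthetic `target` data, and the choice of 2,000 resamples are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.fmean, n_resamples=2000,
                 confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Cut off alpha/2 of the bootstrap estimates on each side.
    alpha = 1.0 - confidence
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lower_idx], estimates[upper_idx]


# Hypothetical binary target with roughly 30% positives.
data_rng = random.Random(42)
target = [1 if data_rng.random() < 0.3 else 0 for _ in range(1000)]

low, high = bootstrap_ci(target)
print(f"point estimate: {statistics.fmean(target):.3f}")
print(f"95% CI: [{low:.3f}, {high:.3f}]  width: {high - low:.3f}")
```

The parameter being bootstrapped here is the positive rate of a binary target; passing a different `stat` (for example `statistics.median`) bootstraps that statistic instead.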

The sample-size calculation depends on the type of problem you want to address, that is, on the distribution of the target that the model is being built to predict.

  • Binary:

Binary: Number of sample calculations
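
A widely used formula for a binary (proportion) target is Cochran's sample-size formula. It is shown here as a reference sketch under the assumption of a large population; it may not be the exact form used in this project:

```latex
\[
  n = \frac{z^{2}\, p\,(1 - p)}{e^{2}}
\]
```

Here z is the critical value for the chosen confidence level, e is the acceptable margin of error, and p is the expected proportion of the positive class. When p is unknown, p = 0.5 is the conservative choice because it maximizes p(1 - p).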

For the other distributions, we use a rule of thumb for the minimum sample size:

  • Multivariate: n >= 100 + 50k

  • Regression: n >= 104 + k

Note: k is the number of independent variables.
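
The two rules of thumb can be expressed as a small lookup. The function name `min_sample_size` and the string keys are hypothetical, chosen only for this sketch.

```python
def min_sample_size(problem_type: str, k: int) -> int:
    """Rule-of-thumb minimum sample size for k independent variables."""
    if k < 0:
        raise ValueError("k must be non-negative")
    rules = {
        "multivariate": lambda k: 100 + 50 * k,  # n >= 100 + 50k
        "regression": lambda k: 104 + k,         # n >= 104 + k
    }
    if problem_type not in rules:
        raise ValueError(f"unknown problem type: {problem_type!r}")
    return rules[problem_type](k)


# Example with 10 independent variables.
print(min_sample_size("multivariate", 10))  # 600
print(min_sample_size("regression", 10))    # 114
```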

III. Implementation - Python function

We created our own function:

Assumptions:

  • Error% = 1%

  • Confidence level = 95%

  • z = 1.96
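
A minimal sketch of such a function for a binary target is shown below. It assumes Cochran's formula n = z² · p(1 - p) / e² with the conservative p = 0.5; the function name `binary_sample_size` and its parameters are hypothetical, and this is not necessarily the function used in this project. The z-score is derived from the confidence level with the standard library instead of being hard-coded.

```python
import math
from statistics import NormalDist


def binary_sample_size(margin_of_error: float = 0.01,
                       confidence: float = 0.95,
                       p: float = 0.5) -> int:
    """Sample size for estimating a proportion (Cochran's formula)."""
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    # Round up so the requested margin of error is not exceeded.
    return math.ceil(n)


# With the assumptions listed above: 1% error, 95% confidence.
print(binary_sample_size())  # 9604
```

Loosening the margin of error reduces the required sample quickly, since n scales with 1/e²: a 2% margin needs roughly a quarter of the rows that a 1% margin does.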

IV. Sampling Theorem Script:

We used the H2O library to run and test the sampling, in support of the interpretability and evaluation of the interpretable models.
