There is a feature I really like in Apache Spark. **Spark can process data that does not fit in memory on my local machine, even without a cluster.** Good news for those who process data sets bigger than the memory they currently have. From time to time, I run into this issue when I work with hypothesis testing.

**For hypothesis testing I usually use statistical bootstrapping techniques. This method requires almost no statistical background and is very easy to understand.** It is also very simple to implement. There are no normal or Student's t-distributions from your statistics courses, only some basic coding skills. Good news for those who don't like statistics. Spark and bootstrapping is a very powerful combination which can help you check hypotheses at large scale.

## 1. Bootstrap methods

The most common application of bootstrapping is calculating confidence intervals, and you can use these confidence intervals as part of the hypothesis checking process. There is a very simple idea behind bootstrapping: resample your data set of size N hundreds or even thousands of times with replacement (this is important) and calculate the estimated metric for each of the hundreds/thousands of subsamples. This process gives you a histogram which is an **actual (empirical) distribution of your metric**. Then you can use this actual distribution for hypothesis testing.

The beauty of this method is the actual distribution histogram. In the classical statistical approach, you need to approximate the distribution of your estimate by a normal distribution and calculate z-scores or t-scores based on theoretical distributions. With the actual distribution from the first step, it is easy to calculate the 2.5th and 97.5th percentiles, and this is your actual confidence interval. That's it! **A confidence interval with almost no math.**
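The whole procedure fits in a few lines. Here is a minimal plain-Python sketch (standard library only, not the Spark code used later; the data values are made up for illustration):

```python
import random
import statistics

def bootstrap_ci(data, n_resamples=1000, left=0.025, right=0.975, seed=42):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the full dataset with replacement and record each subsample's mean.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # The 2.5th and 97.5th percentiles of the bootstrap means form the 95% CI.
    return means[int(n_resamples * left)], means[int(n_resamples * right)]

# Toy data centred near 30
data = [29.1, 30.4, 31.2, 28.7, 30.9, 29.8, 30.2, 31.5, 29.5, 30.1]
lo, hi = bootstrap_ci(data)
print(lo, hi)  # a narrow interval around the sample mean
```

The same idea scales to Spark by replacing `rng.choices` with `RDD.sample(withReplacement = true, fraction = 1.0)`.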

## 2. Choosing the right hypothesis

Choosing the right hypothesis is the only tricky part of this analytical process. This is a question you ask the data, and you cannot automate that. Hypothesis testing is a part of the analytical process and is unusual for machine learning experts. **In machine learning you ask an algorithm to build a model/structure, which is sometimes called a hypothesis, and you are looking for the best hypothesis which correlates your data and labels.**

**In the analytics process, knowing the correlation is not enough**: you should know the hypothesis from the get-go, and the question is whether the hypothesis is correct and what your level of confidence is.

If you have a plausible hypothesis, it is easy to check it with the bootstrapping approach. For example, let's check the hypothesis that the mean of some feature A in your dataset is equal to 30.0. We start with a null hypothesis H0, which we try to reject, and an alternative hypothesis H1:

H0: mean(A) == 30.0

H1: mean(A) != 30.0

If we fail to reject H0, we will take this hypothesis as ground truth. That's what we need. If we do reject it, then we should come up with a better hypothesis (e.g., mean(A) == 40.0).

## 3. Checking hypotheses

For hypothesis checking we can simply calculate the 95% confidence interval for dataset A by resampling. If the interval does not contain 30.0, then your hypothesis H0 is rejected.

This confidence interval starts at the 2.5th percentile and ends at the 97.5th, which puts 95% of the items inside the interval. In the sorted array of our bootstrap estimates we find the 2.5th and 97.5th percentiles: p1 and p2. If p1 <= 30.0 <= p2, then we were not able to reject H0, so we can suppose that H0 is true.
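The p1 <= 30.0 <= p2 decision rule is easy to express directly. A plain-Python sketch of the whole check (toy data invented for illustration; the Spark version follows below):

```python
import random
import statistics

def fails_to_reject(data, h0_mean, n_resamples=1000, seed=7):
    """True when h0_mean lies inside the 95% percentile bootstrap CI of the mean."""
    rng = random.Random(seed)
    n = len(data)
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    p1 = means[int(n_resamples * 0.025)]  # 2.5th percentile
    p2 = means[int(n_resamples * 0.975)]  # 97.5th percentile
    return p1 <= h0_mean <= p2

data = [29.1, 30.4, 31.2, 28.7, 30.9, 29.8, 30.2, 31.5, 29.5, 30.1]
print(fails_to_reject(data, 30.0))  # H0: mean(A) == 30.0
print(fails_to_reject(data, 40.0))  # H0: mean(A) == 40.0
```

For this toy data the first H0 survives and the second is rejected, which matches intuition: the sample mean is close to 30 and nowhere near 40.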

## 4. Apache Spark code

Implementing bootstrapping for this particular case is straightforward.

```scala
import scala.util.Sorting.quickSort

def getConfInterval(input: org.apache.spark.rdd.RDD[Double], N: Int,
                    left: Double, right: Double): (Double, Double) = {
  // Simulate by sampling with replacement and calculating the mean of each subsample
  val hist = Array.fill(N){0.0}
  for (i <- 0 until N) {
    hist(i) = input.sample(withReplacement = true, fraction = 1.0).mean
  }
  // Sort the means and pick the quantiles
  quickSort(hist)
  val left_quantile = hist((N * left).toInt)
  val right_quantile = hist((N * right).toInt)
  (left_quantile, right_quantile)
}
```

Because I did not find any good open datasets for the large-scale hypothesis testing problem, let's use the skewdata.csv dataset from the book “Statistics: An Introduction Using R”. You can find this dataset in this archive. It is not perfect but will work in a pinch.

```scala
val dataWithHeader = sc.textFile("zipped/skewdata.csv")
val header = dataWithHeader.first
val data = dataWithHeader.filter( _ != header ).map( _.toDouble )

val (left_qt, right_qt) = getConfInterval(data, 1000, 0.025, 0.975)
val H0_mean = 30.0
if (left_qt < H0_mean && H0_mean < right_qt) {
  println("We failed to reject H0. It seems like H0 is correct.")
} else {
  println("We rejected H0")
}
```

**We have to understand the difference between “failed to reject H0” and “proved H0”.** Failing to reject a hypothesis gives you a reasonably strong level of evidence that the hypothesis is correct, and you can use this information in your decision-making process, but it is not an actual proof.

## 5. Equal means code example

Another type of hypothesis checks whether the means of two datasets are different. This leads us to the usual design-of-experiment questions: if you apply some change to your web system (a user interface change, for example), does your click rate change in a positive direction?

Let’s create a hypothesis:

H0: mean(A) == mean(B)

H1: mean(A) > mean(B)

It is not easy to test this H1 directly with a two-sided interval. Let's restate the hypothesis a little bit:

H0′: mean(A-B) == 0

H1: mean(A-B) > 0

Now we can try to reject H0′.
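Before the Spark version, here is the same one-tailed test as a plain-Python sketch (the two datasets are invented for illustration): resample both datasets, build the bootstrap distribution of the difference of means, and reject H0′ only if the 5th percentile of that distribution is above zero.

```python
import random
import statistics

def one_tailed_mean_diff(a, b, n_resamples=1000, seed=3):
    """5th percentile of the bootstrap distribution of mean(A) - mean(B)."""
    rng = random.Random(seed)
    diffs = sorted(
        statistics.fmean(rng.choices(a, k=len(a)))
        - statistics.fmean(rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    return diffs[int(n_resamples * 0.05)]

a = [31.0, 32.4, 30.8, 33.1, 31.9, 32.2, 30.5, 32.8]
b = [29.1, 28.7, 30.2, 29.8, 28.9, 29.5, 30.0, 29.3]
p5 = one_tailed_mean_diff(a, b)
# Reject H0' in favor of H1 when 95% of bootstrap differences lie above zero.
print("reject H0'" if p5 > 0 else "fail to reject H0'")
```

Comparing a dataset with itself (`one_tailed_mean_diff(a, a)`) centers the differences at zero, so the 5th percentile is negative and we fail to reject H0′, as expected.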

```scala
def getConfIntervalTwoMeans(input1: org.apache.spark.rdd.RDD[Double],
                            input2: org.apache.spark.rdd.RDD[Double],
                            N: Int, left: Double, right: Double): (Double, Double) = {
  // Simulate the distribution of the difference of means
  val hist = Array.fill(N){0.0}
  for (i <- 0 until N) {
    val mean1 = input1.sample(withReplacement = true, fraction = 1.0).mean
    val mean2 = input2.sample(withReplacement = true, fraction = 1.0).mean
    hist(i) = mean1 - mean2  // mean(A - B), matching H1: mean(A - B) > 0
  }
  // Sort the differences and pick the quantiles
  quickSort(hist)
  val left_quantile = hist((N * left).toInt)
  val right_quantile = hist((N * right).toInt)
  (left_quantile, right_quantile)
}
```

Because this is a one-sided (one-tailed) test, we replace the 2.5%/97.5% percentile pair with a single 5% percentile on the left side. And the actual code as an example:

```scala
// Let's try to check the same dataset against itself. Ha-ha.
val (left_qt, right_qt) = getConfIntervalTwoMeans(data, data, 1000, 0.05, 0.95)

// One-tailed test: reject H0' only if the whole interval lies above zero.
if (left_qt > 0) {
  println("We rejected H0' in favor of H1")
} else {
  println("We failed to reject H0'. It seems like H0' is correct.")
}
```

## Conclusion

Bootstrapping methods are very simple to understand and implement. They are intuitive, and you don't need any deep knowledge of statistics. Apache Spark can help you apply these methods at large scale.

As I mentioned previously, it is not easy to find a good open large dataset for hypothesis testing. **Please share with our community if you have one or come across one.**

My code is shared in this Scala file.


Regarding “With the actual distribution from the first step it is easy to calculate 2.5% percentile and 97.5% percentiles and this would be your actual confidence interval.” – does it mean you don’t need to know anything about the distribution of your measurement? In addition, can you provide a reference/paper for that statement?


Hi Nancy,

Yes. This histogram (the “hist” variable in the code) contains your actual distribution, not a theoretical one. The distribution should be close to, but not necessarily equal to, a normal distribution.

References:

1) Davison, A. C.; Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

2) Book: “Statistics: An Introduction Using R” 2nd Edition by Michael J. Crawley

3) wiki: https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#Deriving_confidence_intervals_from_the_bootstrap_distribution


Dmitry, thank you for your quick, clear and detailed reply.

Some questions I would like to hear your thoughts on:

1. Can repeated k-fold (stratified) cross validation be used here instead of bootstrap sampling? If not, why? (variance?)

2. I am using this procedure to properly evaluate the **prediction** algorithm we're currently using in production (let's call it algorithm A). I would like to be able to compare algorithm A to some other prediction algorithm B, to be able to say with some confidence that one algorithm is superior to the other for a particular dataset. What is the proper test to use here?


Yes, you can use some ideas from this approach for machine learning model evaluation. There is a terminology issue: usually we use the term “confidence interval” for a parameter and “prediction interval” for a prediction (your data). See http://robjhyndman.com/hyndsight/intervals/ and http://stats.stackexchange.com/questions/154677/confidence-versus-prediction-intervals-using-quantile-regression-quantile-loss

The evaluation depends heavily on your metrics. If you are looking for a new model which is better “with some confidence”, then your prediction intervals should not overlap. I don't think this level of confidence is (usually) needed in ML.



Can this also work for more complicated metrics, say k-means clustering models?

E.g., creating a k-means clustering from each bootstrap sample: how can I then aggregate the clustering results of all the samples into a histogram?


Why have you used fraction = 1.0? Are you taking the whole dataset? Can't we take smaller samples, say 20% of the input? How do we calculate an ideal sample size for this?


Ali,

In the statistical bootstrapping method we use a 100% random sample with replacement (the parameter withReplacement = true). It gives you around 63% unique values, plus duplicates and even triplicates.
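That ~63% figure is 1 − 1/e ≈ 0.632, the expected fraction of distinct items in a full-size sample drawn with replacement. A quick plain-Python sketch to check it:

```python
import random

def unique_fraction(n, trials=100, seed=1):
    """Average fraction of distinct indices in a size-n sample drawn with replacement."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = rng.choices(range(n), k=n)
        total += len(set(sample)) / n
    return total / trials

print(round(unique_fraction(10_000), 3))  # close to 1 - 1/e ≈ 0.632
```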

Yes, the method could be used for k-means clustering if you evaluate a single metric. For example, for a given subset of items, you can evaluate how many different clusters they belong to. An example hypothesis to evaluate: the items are distributed across fewer than 5 clusters (with a 95% confidence level).
