Recently, I discovered an interesting blog post, Big RAM is eating big data – Size of datasets used for analytics, by Szilard Pafka. He says that “Big RAM is eating big data”: the amount of memory available on machines is growing much faster than the datasets that typical data scientists process. So data scientists do not need as much memory as the industry offers them. Would you agree?
I do not agree. This result does not match my intuition, and during my research I found an infrastructure bias in the data behind the blog post. I’ll show that the growth of datasets is approximately the same as the memory growth of rented Amazon AWS machines and Apple MacBook Pro laptops over the last 10 years.
1. The blog post results
According to the “Big RAM is eating big data” blog post, the amount of memory in Amazon AWS machines grows faster (50% per year) than the median dataset size (20% per year) that people use for analytics. This result is based on a KDnuggets survey about data sizes: Poll Results: Where is Big Data? For most, Largest Dataset Analyzed is in laptop-size GB range. You can find the most recent survey dataset on GitHub.
Let’s take a closer look at the data and results. The cumulative distribution of dataset sizes for a few selected years is below:
I could not find the code from the post, so I reproduced this research in R.
Below is my R code to create this graph from the dataset file.
library(ggplot2)
library(dplyr)

file <- "dataset-sizes.csv"
data <- read.csv(file, sep = "\t")

# Keep only the survey years we want to compare
data.slice <- data %>%
  filter(year %in% c(2006, 2009, 2012, 2015))

# Cumulative share of respondents per year, by dataset size
data.slice.cum_freq <- data.slice %>%
  group_by(year, sizeGB) %>%
  summarise(value = sum(freq)) %>%
  mutate(user_prop = value / sum(value),
         cum_freq = cumsum(value) / sum(value))

ggplot(data.slice.cum_freq,
       aes(x = log10(sizeGB), y = cum_freq, color = factor(year))) +
  geom_line(aes(group = factor(year)))
He mentioned that the cumulative distribution function looks linear in the 0.1–0.9 range (roughly 10 megabytes to 10 terabytes). By fitting a linear model over this range you can calculate the difference between these years.
My R code:
data.slice.reg <- data.slice.cum_freq %>%
  filter(log10(sizeGB) >= -2, log10(sizeGB) <= 4)

ggplot(data.slice.reg,
       aes(x = log10(sizeGB), y = cum_freq, color = factor(year))) +
  geom_line(aes(group = factor(year)))

model <- lm(log10(sizeGB) ~ cum_freq + year,
            data = data.slice.reg, na.action = na.exclude)
summary(model)
From the model summary you can read the coefficient on the year variable, which is 0.08821 in my code (0.075 in the blog post). This coefficient is on the log10(sizeGB) scale. Converting from log10(GB) back to GB gives 10^0.088 ≈ 1.22, i.e. 22%, or roughly 20%, annual growth in dataset sizes.
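The conversion from the log10 coefficient to an annual growth rate can be checked directly in R (the 0.08821 value is the coefficient from the model summary above):

```r
# Convert the regression coefficient on `year` from the log10(GB) scale
# to a multiplicative yearly growth factor
coef_year <- 0.08821
growth <- 10^coef_year   # about 1.22, i.e. roughly 22% growth per year
round(growth, 2)
```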
This 20% growth is what he compares to the maximum AWS instance memory size over the same years:
The change from 15GB in 2007 to 244GB in 2014 gives approximately 50% annual AWS memory growth, which is much higher than the dataset growth and, according to the blog post, shows that data scientists do not need that much memory.
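The 50% figure follows from the endpoints in the table (15GB in 2007, 244GB in 2014), as a quick R check shows:

```r
# Implied annual growth of maximum AWS instance memory, 2007 -> 2014
aws_growth <- (244 / 15)^(1 / (2014 - 2007))
round(aws_growth, 2)   # about 1.49, i.e. roughly 50% per year
```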
2. An intuition about memory size
So, we got the same result as the blog post. However, I can’t agree with its conclusion. My intuition tells me that more memory gives me more freedom in data processing and analytics: the ability to work with a large amount of data in memory simplifies the analytics process. I feel the squeeze of memory constraints constantly.
Another aspect of the memory issue is the data preparation step. Today you need two sets of skills – preparing “big data” (usually on-disk processing using Unix grep, awk, Python, Apache Spark in standalone mode, etc.) and in-memory analytics (R, Python SciPy). These two things are very different, and good data scientists have both skills. However, if you have a large amount of memory, you don’t need the first skill, because you can prepare data in R or Python directly. This is especially important for text analytics, where the amount of input data is huge by default. So, data preparation becomes simpler with a large amount of memory in your machine.
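The contrast between the two skill sets can be sketched in base R. This is a toy example – the temporary file and the 10,000-line chunk size are illustrative, not from the original post:

```r
# Build a small throwaway TSV file to stand in for a "big" dataset
tmp <- tempfile(fileext = ".tsv")
writeLines(c("size\tfreq", paste(1:25000, 1, sep = "\t")), tmp)

# In-memory style: the whole table must fit in RAM at once
all_rows <- read.csv(tmp, sep = "\t")

# On-disk style: stream the file in chunks, keeping only a running count
con <- file(tmp, "r")
invisible(readLines(con, n = 1))       # skip the header line
n <- 0
repeat {
  chunk <- readLines(con, n = 10000)   # read 10k lines at a time
  if (length(chunk) == 0) break
  n <- n + length(chunk)
}
close(con)

stopifnot(n == nrow(all_rows))         # both approaches see the same rows
```

With enough memory, the first version is all you need; without it, you are forced into the second, chunked style.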
I can’t imagine saying “Okay, I don’t need any more memory or CPU cores”. Additionally, I can add “…and please stop parallelizing my nice sequential code!”.
3. AWS memory growth
It looks like the maximum amount of memory in a rented AWS instance is not the best proxy for the amount of memory data scientists actually use. There are three reasons for that:
- High performance computing (HPC) machines are relatively new products, introduced around 2010, and the AWS HPC line creates a strong bias in the analytics-memory vs. AWS-memory comparison. The research jumps from regular machines in 2006–2010 to HPC ones in 2010–2015, thereby giving us the 50% growth. In my humble opinion, the real improvement is smaller (perhaps closer to the 20% growth of the median dataset size).
- The price of AWS HPC machines ($2–3K/month) is higher than many companies can afford. A couple of months of using this kind of machine costs more than a brand new shiny MacBook Pro with 16GB of RAM and a 1TB SSD.
- Using remote AWS machines is neither easy nor efficient. Not a big deal, but I believe many data scientists would prefer to use their local machines, especially Apple fans :).
To my mind, HPC machines create a bias in this research, and we should estimate memory usage only from regular AWS machines, excluding HPC and memory-optimized instances. Here is the AWS history for regular machines:
From this table I’d exclude 2006 and m1.small, because 2006 was a limited beta and m1.small is obviously the m1.xlarge machine “sliced” into 8 parts. The blogger did the same – he started from 2007.
Side note: as luck would have it, my own AWS experience started in that same year, 2007. Back then, it was amazing to rent a machine in one minute, as opposed to the days or even weeks hosting companies took previously. During this time frame I mostly worked with regular AWS machines; HPC machines were too specialized and overpriced for my purposes.
So, if we start the AWS regular machine history from 2007 with m1.xlarge, then AWS memory growth is about 35% annually over these years: 15GB × 1.35^8 ≈ 160GB.
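The same check as before, now for regular (non-HPC) machines over 2007–2015:

```r
# Implied annual growth of regular AWS instance memory: 15GB -> ~160GB in 8 years
reg_growth <- (160 / 15)^(1 / 8)
round(reg_growth, 2)   # about 1.34, i.e. roughly 35% per year
```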
This result is much closer to the growth of the datasets: 20% vs. 35%. Consequently, the data cannot be taken as strong evidence for the unimportance of RAM.
Let’s have more fun…
4. Apple MacBook Pro memory growth
I think that many people analyze data on their local machines and laptops. Most people are not ready to switch from a shiny laptop with a cozy local environment to a remote AWS machine for analytics. At least it is not easy for me, and I’ll find a way to process a relatively large amount of data on my laptop (assuming a cluster is not needed).
Let’s try to use the Apple MacBook Pro as a proxy for estimating memory growth. The table below (data based on Wikipedia) shows the MacBook Pro memory history:
| Year | Model | Max RAM (GB) |
|------|-------|--------------|
| 2007 | 1st generation (late 2006 release) | 2 |
| 2012 | 2nd generation (mid 2012 release) | 8 |
| 2015 | 3rd generation (Retina) | 16 |
Surprisingly, the MacBook Pro data gives us the same result as the regular AWS machines – about 35% annual growth: 1GB × 1.36^9 ≈ 16GB. It appears we removed (or at least dramatically reduced) the infrastructure bias.
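And the corresponding check for the MacBook Pro endpoints (1GB to 16GB over 9 years):

```r
# Implied annual growth of MacBook Pro maximum memory over 9 years
mbp_growth <- (16 / 1)^(1 / 9)
round(mbp_growth, 2)   # about 1.36, i.e. roughly 35% per year
```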
This blog post shows that the maximum memory in MacBook Pro laptops and regular AWS machines are much less biased proxies for estimating the amount of memory people, and data scientists in particular, use.
Memory is huge: it gives us the ability to analyze data more efficiently, limited only by the growth of analytical methods and memory size. Given the opportunity, we data scientists will consume all the “affordable” memory and then some – we are memory hogs, in my humble opinion (and a biased one as well 🙂).
An update: Szilard Pafka pointed me to his code on GitHub.