How Much Memory Does A Data Scientist Need?

Recently, I came across an interesting blog post by Szilard Pafka, “Big RAM is eating big data – Size of datasets used for analytics”. The title phrase means that memory size is growing much faster than the datasets that a typical data scientist processes. So, data scientists do not need as much data as the industry offers to them. Would you agree?

I do not agree. This result does not match my intuition. During my research I found an infrastructure bias in the data used in that blog post. I’ll show that, over the last 10 years, the growth of the datasets is much closer to the memory growth of rented Amazon AWS machines and Apple MacBook Pro laptops than the post suggests.

1. The blog post results

According to the “Big RAM is eating big data” blog post, the amount of memory in Amazon AWS machines grows faster (50% per year) than the median dataset size (20% per year) that people use for analytics. This result is based on the KDnuggets survey about data sizes: “Poll Results: Where is Big Data? For most, Largest Dataset Analyzed is in laptop-size GB range”. The most recent survey dataset is available on GitHub.

Let’s take a closer look at the data and results. The cumulative distribution of dataset sizes for a few select years is shown below:

[Figure: memory-size, cumulative distribution of dataset sizes by year]

I did not find the code for the post, so I reproduced the analysis in R.

Below is my R code to create this graph from the dataset file.

library(ggplot2)
library(dplyr)

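# Tab-separated survey data with the columns year, sizeGB and freq used below
file <- "dataset-sizes.cv"
data <- read.csv(file, sep="\t")

# Keep the four survey years shown in the graph
data.slice <- data %>%
        filter(year %in% c(2006, 2009, 2012, 2015))

# Per year: share of respondents at each size and the cumulative share
data.slice.cum_freq <- data.slice %>%
        group_by(year, sizeGB) %>%
        summarise(value = sum(freq)) %>%
        mutate(user_prop = value/sum(value), cum_freq = cumsum(value)/sum(value))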

ggplot(data.slice.cum_freq, aes(x=log10(sizeGB), y=cum_freq, color=factor(year))) + 
        geom_line(aes(group = factor(year)))

He mentioned that the cumulative distribution function looks linear in the 0.1-0.9 range (10 megabytes to 10 terabytes, that is, log10(sizeGB) between -2 and 4). By fitting a linear model over this range you can estimate the yearly shift between these curves.

My R code:

data.slice.reg <- data.slice.cum_freq %>%
        filter(log10(sizeGB) >= -2) %>%
        filter(log10(sizeGB) <= 4)

ggplot(data.slice.reg, aes(x=log10(sizeGB), y=cum_freq, color=factor(year))) + 
        geom_line(aes(group = factor(year)))

# Pass the data frame explicitly rather than attach()-ing it
model <- lm(log10(sizeGB) ~ cum_freq + year, data = data.slice.reg, na.action=na.exclude)
summary(model)

From the model summary you can read the coefficient of the year variable, which is 0.08821 from my code (0.075 in the blog post). This coefficient is on the log10(sizeGB) scale. Converting from log10(GB) back to GB gives 10^0.088 = 1.22, which means 22%, or roughly 20%, annual growth in dataset sizes.
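
As a quick check of that conversion, here is a minimal sketch in base R. It only uses the two coefficients quoted above; the variable names are my own and are not part of the original analysis.

```r
# Year coefficients on the log10(sizeGB) scale, as quoted in the text above
coef_year <- c(reproduction = 0.08821, original_post = 0.075)

# A slope of b log10-units per year multiplies dataset size by 10^b each year
annual_factor <- 10^coef_year
annual_growth_pct <- (annual_factor - 1) * 100

print(round(annual_growth_pct, 1))
```

Both coefficients should land near the "roughly 20%" figure, which is why the two analyses can be treated as agreeing.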

This 20% growth is what he compares to the AWS maximum memory instance size for the same year ranges:

year   type         RAM (GB)
2007   m1.xlarge    15
2009   m2.4xlarge   68
2012   hs1.8xlarge  117
2014   r3.8xlarge   244
2016*  x1           ~2000 (2 TB)

The change from 15 GB in 2007 to 244 GB in 2014 gives approximately 50% annual AWS memory growth, which is much higher than the dataset growth and, as I read the blog post, suggests that data scientists do not need as much memory.
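
The 50% figure can be recovered from the two endpoints of the table above. The sketch below assumes simple compound growth between 2007 and 2014; the helper name annual_growth is hypothetical and introduced here only for illustration.

```r
# Compound annual growth rate between two (year, value) observations
annual_growth <- function(value_start, value_end, year_start, year_end) {
  (value_end / value_start)^(1 / (year_end - year_start)) - 1
}

# Largest-memory AWS instances from the table above: 15 GB (2007) to 244 GB (2014)
aws_max_growth <- annual_growth(15, 244, 2007, 2014)

print(round(aws_max_growth * 100))
```

Note that this uses only the first and last rows, so the result depends heavily on which instance types are picked as endpoints.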

2. An intuition about memory size

So, we got the same result as the blog post. However, I can’t say that I agree with its conclusion. My intuition tells me that more memory gives me more freedom in data processing and analytics. The ability to work with a large amount of data in memory simplifies the analytics process, and I constantly feel squeezed by memory constraints.

Another aspect of the memory issue is the data preparation step. Today you need two sets of skills: preparing “big data” (usually on-disk processing with Unix grep, awk, Python, Apache Spark in standalone mode, etc.) and in-memory analytics (R, Python scipy). These two things are very different. Good data scientists have both skills. However, if you have a large amount of memory you don’t need the first skill, because you can prepare the data directly in R or Python. This is especially important for text analytics, where the amount of input data is huge by default. So, a large amount of memory in your machine simplifies data processing.

I can’t imagine saying “Okay, I don’t need any more memory and more CPU cores”. Additionally, I can add “…and please stop parallelizing my nice sequential code!”.

3. AWS memory growth

It looks like the maximum amount of memory in a rented AWS instance is not the best proxy for estimating the amount of memory that data scientists use. There are three reasons for that:

  1. High performance computing (HPC) machines are a relatively new product, introduced around 2010, and the AWS HPC product creates a strong bias in the correlation between analytics memory and AWS memory. The analysis jumps from regular machines (2006 to 2010) to HPC ones (2010 to 2015), which produces the 50% annual improvement. In my humble opinion, the real improvement is smaller (perhaps closer to the 20% seen in the median data size).
  2. The price of AWS HPC machines ($2-$3K/month) is higher than many companies can afford. A couple of months on this kind of machine costs more than a brand new shiny MacBook Pro with 16 GB of RAM and a 1 TB SSD.
  3. Remote AWS machines are not as easy or efficient to use as a local one. This is not a big deal, but I believe that many data scientists would prefer to use their local machines, especially Apple fans :).

In my mind, HPC machines create a bias in this research, and we should estimate memory usage from regular AWS machines only, excluding HPC and memory-optimized machines. Here is the AWS history for regular machines:

year  type         RAM (GB)
2006  m1.small     1.75
2007  m1.xlarge    15
2009  m2.4xlarge   68
2015  m4.10xlarge  160

From this table I’d exclude 2006 and m1.small, because it was a limited beta and m1.small is evidently the m1.xlarge machine “sliced” into 8 parts. The blogger did the same: he started from 2007.

Side note: As luck would have it, my AWS experience started in that same year, 2007. At the time, it was amazing to rent a machine in one minute, as opposed to the days or even weeks it previously took with hosting companies. During this time frame, I mostly worked with regular AWS machines. HPC machines were specialized and overpriced for my purposes.

So, let’s start the history of regular AWS machines in 2007 with m1.xlarge. The AWS memory growth is then about 35% annually over these years: 15 GB * 1.35^8 ~ 160 GB.
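
To check this arithmetic against the regular-machine table above, here is a small base R sketch. It assumes the 2007 and 2015 rows are the endpoints and that growth compounds yearly; the data frame and variable names are my own.

```r
# Regular (non-HPC) AWS instances from the table above, starting in 2007
aws_regular <- data.frame(
  year   = c(2007, 2009, 2015),
  ram_gb = c(15, 68, 160)
)

first <- aws_regular[1, ]
last  <- aws_regular[nrow(aws_regular), ]
n_years <- last$year - first$year

# Rate implied by the two endpoints, and the size reached by compounding 35% per year
implied_rate <- (last$ram_gb / first$ram_gb)^(1 / n_years) - 1
projected_gb <- first$ram_gb * 1.35^n_years

print(round(c(implied_rate_pct = implied_rate * 100, projected_gb = projected_gb), 1))
```

The projection overshoots 160 GB slightly, so 35% is a rounded figure rather than an exact fit.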

This result is closer to the growth of the datasets: the difference is now 20% vs. 35%. Consequently, it cannot be accepted as strong evidence that RAM is unimportant.

Let’s have more fun…

4. Apple MacBook Pro memory growth

I think that many people analyse data on their local machines and laptops, and that most of them are not ready to switch from a shiny laptop with a cozy local environment to a remote AWS machine for analytics. At least it is not easy for me, and I’ll find a way to process a relatively large amount of data on my laptop (I suppose that a cluster is not needed).

Let’s try to use the Apple MacBook Pro as a proxy for estimating memory growth. The table below shows the MacBook Pro memory history (data based on Wikipedia):

year  type                                RAM (GB)
2006  1st generation                      1
2007  1st generation (late 2006 release)  2
2008  2nd generation                      4
2012  2nd generation (mid 2012 release)   8
2015  3rd generation (Retina)             16

Surprisingly, the MacBook Pro data gives almost the same result as the regular AWS machines, roughly 35% annual growth: 1 GB * 1.36^9 ~ 16 GB. It appears that we removed (or at least dramatically reduced) the infrastructure bias.
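
The estimate above uses only the 2006 and 2015 rows. As a rough cross-check, the sketch below fits a log-linear trend through all five rows of the MacBook Pro table with base R's lm(). This is my own addition and not part of the original comparison; treat it as an illustration under the assumption that RAM grows exponentially over time.

```r
# MacBook Pro maximum RAM by year, from the table above
mbp <- data.frame(
  year   = c(2006, 2007, 2008, 2012, 2015),
  ram_gb = c(1, 2, 4, 8, 16)
)

# Fit log(RAM) as a linear function of year; exp(slope) is the yearly multiplier
fit <- lm(log(ram_gb) ~ year, data = mbp)
yearly_factor <- exp(coef(fit)[["year"]])

# Endpoint-only estimate used in the text, for comparison
endpoint_factor <- (16 / 1)^(1 / (2015 - 2006))

print(round(c(regression = yearly_factor, endpoints = endpoint_factor), 2))
```

If the regression multiplier comes out close to the endpoint one, the growth figure is not just an artifact of the two years chosen as endpoints.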

Conclusion

This blog post suggests that the maximum memory in MacBook Pro laptops and regular AWS machines is a less biased proxy for estimating the amount of memory that people, and data scientists in particular, use.

Memory is huge. It gives us the ability to analyze data more efficiently. We are limited only by the growth of analytical methods and memory size. Given the opportunity, we will consume all the “affordable” memory and then some, as data scientists are memory hogs, in my humble (and biased 🙂) opinion.

 

An update: Szilard Pafka pointed me to his code on GitHub.

6 thoughts on “How Much Memory Does A Data Scientist Need?”

  1. Hi:

    This is Szilard, author of the post you are referencing.

    First of all, thanks for looking at my post and scrutinizing the results. I think we actually agree on more things than your post suggests: there are things we agree on, there are things I never said or that you seem to have misunderstood/misinterpreted, and there are only a few things on which we might disagree.

    Concerning the growth rate of datasets, thanks for reproducing my results (20% growth/year). My code is actually on github as R markdown files https://github.com/szilard/dataset-sizes-kdnuggets/tree/master/analysis but it’s really nice to see an independent verification and that you put in time to do that.

    While it seems we agree on the growth rates of datasets (20%), I actually have concerns/I’m more cautious to say that, and you can read some of that in the original blog post (e.g. the last paragraph).

    For the growth rate of memory I tried to find a proxy for “the largest RAM on one box that’s easily available for a data scientist”. I can see your point against switching types (standard vs HPC), but someone who needs 250GB of RAM today will not stop at the m4.10xlarge 160GB box. It would be interesting (and I was thinking at the time of the post about this, but seemed too hard to find reliable data) to look at the largest RAM one can easily get in a high-end laptop, desktop and server, respectively (by year). Arguably a lot of people like working on their laptops. High-end desktops typically stuff more RAM than laptops and data scientists can still use the same analytics tools and environments, many of them based on GUIs. However, there are nowadays excellent options to work on a server with similar experience thanks to tools like RStudio (server) or the Jupyter (formerly IPython) notebooks. Your analysis of Macbook Pro RAM sizes is a nice step in this direction (for laptops).

    Seems like you are suggesting 35% yearly growth rate for RAM. That’s still significantly higher than the 20% growth rate of datasets. My main point is that it’s not data (“big data”) that’s growing faster, but RAM – at least for analytical tasks. So, it’s not “more and more data scientists need big data analytical tools” (that are often clunky), but rather more and more people are able to use single machine analytical tools for their analytical tasks (the “raw” datasets that companies have to deal with that is in data pipelines/ETL etc. still often requires distributed computing). Another reason for the increased ability to deal with analytics on a single machine is that tools such as Python and R in particular got more efficient in dealing with larger datasets (e.g. making less low-level copies and thus needing less memory and being actually faster too, or getting new packages for fast reading of CSV files, fast data manipulation etc. such as R’s data.table).

    There are a couple of misunderstandings I think:

    1. “So, data scientists do not need as much data as the industry offers to them. Would you agree? – I do not agree. This result does not match my intuition. […]”

    I did not imply this. Most often more data is better and having bigger RAM on one machine makes it easier to deal with bigger data.

    2. “I can’t imagine saying Okay, I don’t need any more memory and more CPU cores.”

    I did not imply that either. Quite the opposite: RAM is cheap, so get as much as possible, and the more cores the better too. There are many analytical tools that can deal efficiently with large datasets if you have many cores and lots of RAM, including several machine learning tools.

    I think the only “disagreement” is our different estimate for the growth rate of RAM. I was trying to find a proxy for “how much RAM someone can easily get in one machine for doing analytics” and came up with a growth rate of 50% (or actually possibly even more, see original post). Your best estimate is 35%. Let’s see if others have any other estimates say based on high-end desktops or physical servers. Either way, it seems we agree that RAM is growing faster than the size of datasets used for analytics, no?

    Cheers,
    Szilard

    • Hi Szilard,

      Thank you so much for your response. I really like your blog post and just wanted to add my two cents to the investigation, along with a people-friendly interpretation of the research result.

      Based on what I read and understand, your question (your interpretation of the results) appears to me to be about “big-data” analytical tools. It seems to me that you are using the numbers to explain why we “don’t need” these analytical tools for predictive analytics. From my experience, “big data” tools primarily focus on preparing data and ETL. They tend to have basic analytical features like regression methods but we cannot effectively use them for analytics because they offer a limited number of analytical methods and lack visualization features.

      Today, I don’t consider these “big data” tools for predictive analytics. They are useful only in limited scenarios. I investigate these scenarios a lot; see my Apache Spark blog posts, such as this one: http://fullstackml.com/2015/10/29/beginners-guide-apache-spark-machine-learning-scenario-with-a-large-input-dataset/. For me, the only question is how much memory we use for regular analytic tools such as R and Python on a single machine. This is my (biased) interpretation of the numbers: “So, data scientists do not need as much data as the industry offers to them”.

      Based on your research, yes, I see that “RAM is growing faster than the size of datasets used for analytics” (35% or 50%). However, it’s my belief that we shouldn’t see this difference. I guess that we have even more biases in our datasets on the user side, and you mentioned this possibility as well: “maybe there is some strong bias and non-representativeness in the KDnuggets survey etc”.

      So, it looks like we agree that the single-machine-tools hold the central role in analytical process but we tried to show that from different points of view.

  2. Yeah, I absolutely agree if the main conclusion is “single-machine-tools hold the central role in analytical process.”

    I also agree with the details:

    – “big data tools primarily focus on preparing data and ETL”
    – “They tend to have basic analytical features like regression methods but we cannot effectively use them for analytics”
    – “They are useful only in limited scenarios” (and nice example in your separate blog post you are referencing)

    Finally, here is (shameless plug) another area I’ve been looking into in case you did not see it:
    – Simple/limited/incomplete benchmark for scalability, speed and accuracy of machine learning libraries for classification: https://github.com/szilard/benchm-ml
    http://www.slideshare.net/0xdata/h2o-world-benchmarking-open-source-ml-platforms-szilard-pafka
    http://library.fora.tv/2015/11/11/benchmarking_open_source_ml_platforms
