How To Find a Simple and Interesting Multi-Gigabyte Dataset

Many folks are very excited about big data. They like to play with, explore, work on, and study this frontier. Most likely these folks either work with or would like to play with large amounts of data (hundreds of gigabytes or even terabytes). But here’s the thing: it’s not easy to find a multi-gigabyte dataset. These kinds of datasets are usually needed for experimenting with new data processing frameworks such as Apache Spark or data streaming tools such as Apache Kafka. In this blog post I will describe, and provide a link to, a simple and powerful multi-gigabyte Stack Overflow dataset.

1. Datasets for machine learning

Lots of sources exist for machine learning problems. Kaggle is the best source for these problems, and it offers lots of datasets presented with code examples. Most of these datasets are clean and ready to use in your machine learning experiments.

In a real data scientist’s life you most likely do not have the luxury of clean data, and the size of the input data creates an additional big problem. University courses as well as online courses offer a limited viewpoint on data science and machine learning, because they teach students to apply statistical and machine learning methods to small amounts of clean data. In reality, a data scientist spends the majority of their time getting data and cleaning it up. According to Hal Varian (Google’s chief economist), “the sexiest job of the 21st century” belongs to statisticians (and, I assume, to data scientists). However, they perform “clean up” work most of the time.

In order to experiment with new data processing or data streaming tools, you need a large (larger than your computer can hold in memory) and uncleaned dataset.

A large and uncleaned dataset will let you practice actual data processing and analytical skills. It turns out that such a dataset is not that easy to find.

2. Datasets for processing

KDnuggets and Quora have pretty good lists of open data repositories:

  1. http://www.kdnuggets.com/datasets/index.html
  2. https://www.quora.com/What-kinds-of-large-datasets-open-to-the-public-do-you-analyze-the-mostly

Most of the datasets on these lists are very small in size, and for the most part you need specific knowledge of the dataset’s business domain, such as physics or healthcare. However, for learning and experimentation purposes, it would be nice to have a dataset from a well-known business domain that everybody is familiar with.

Social network data is the best fit because people understand these datasets and have intuition about the data, which is important in the analytic process. You might use a social network API to extract your own datasets. Unfortunately, such a dataset is not the best for sharing your analytical results with other people. It would be great to find a common social network dataset with an open license. And I’ve found one!

3. Stackoverflow open dataset

The Stack Overflow dataset is the only open social dataset that I was able to find. Stackoverflow.com is a question-and-answer website about programming. This website is especially useful when you have to write code in a language you are not familiar with. This well-known approach is called stackoverflow-driven development, or SDD. I believe everyone in the high-tech industry is familiar with Stack Overflow, and many of them have an account on this website.

Stack Exchange (the company that owns stackoverflow.com) publishes the Stack Exchange dataset under an open Creative Commons license. You can find the freshest dataset on this page:

https://archive.org/details/stackexchange

The dataset contains all Stack Exchange data, including Stack Overflow, and the overall size of the archive is 27 gigabytes. The size of the uncompressed data is more than 1 terabyte.

4. How to download and extract the dataset?

However, this dataset is not easy to get. First, you need to download the archive of the entire dataset. Please note that the download speed is very slow. They recommend using a BitTorrent client to download the archive, but it often has issues. Without BitTorrent, I made 3 attempts and spent 2 days downloading this archive. Next, you need to extract the large archive. Finally, you need to extract the subset of data that you need (like stackoverflow-Posts or travel.stackexchange) using the 7z archiver. If you don’t have 7z, you need to find and install it on your machine.
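If you script the download instead of using a browser or BitTorrent, a minimal Python sketch like the one below can stream one archive to disk without loading it into memory. The direct-download URL pattern (archive.org/download/stackexchange/<file>) is my assumption about how archive.org serves the files in this item; adjust the filename to the subset you actually need.

import requests

# Assumed direct-download URL for one archive inside the "stackexchange" item.
# Replace the filename with the archive you actually want.
URL = "https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z"

def download(url, target, chunk_size=1024 * 1024):
    """Stream a large file to disk chunk by chunk."""
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(target, "wb") as out:
            for chunk in response.iter_content(chunk_size=chunk_size):
                out.write(chunk)

if __name__ == "__main__":
    download(URL, "stackoverflow.com-Posts.7z")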

After you download the archive from https://archive.org/details/stackexchange, extract all Stack Overflow related archives and uncompress each of them (all archives whose names start with stackoverflow.com):

  • stackoverflow.com-Posts.7z
  • stackoverflow.com-PostHistory.7z
  • stackoverflow.com-Comments.7z
  • stackoverflow.com-Badges.7z
  • stackoverflow.com-PostLinks.7z
  • stackoverflow.com-Tags.7z
  • stackoverflow.com-Users.7z
  • stackoverflow.com-Votes.7z

As a result you will see a set of XML files with the same names.
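If you prefer to script the extraction step rather than run 7z by hand on each file, here is a minimal sketch that shells out to the 7z command-line tool for every Stack Overflow archive in the current directory. It assumes 7z is already installed and on your PATH.

import glob
import subprocess

# Extract every Stack Overflow archive in the current directory.
# "7z x -y <archive>" extracts with full paths, answering "yes" to prompts.
for archive in sorted(glob.glob("stackoverflow.com-*.7z")):
    print(f"Extracting {archive} ...")
    subprocess.run(["7z", "x", "-y", archive], check=True)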

5. How to use the dataset?

Let’s experiment with the dataset. The most interesting file is Posts.xml. This file contains 34 GB of uncompressed data; approximately 70% of it is the Body attribute, which holds the text of the questions from the website. This amount of data most likely does not fit in your memory, so we need an on-disk data manipulation or machine learning technology. This is a good chance to use Apache Spark and MLlib, or your own custom solution.
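As a rough sketch of what that might look like (the real analysis comes in the next post), the PySpark snippet below reads Posts.xml as plain text, parses each <row …/> line with the standard XML parser, and counts posts by PostTypeId. The local file path and the Spark session setup are assumptions.

import xml.etree.ElementTree as ET

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stackoverflow-posts").getOrCreate()

def parse_row(line):
    """Each post is a single '<row .../>' line; return its attributes, or None."""
    line = line.strip()
    if not line.startswith("<row"):
        return None  # skips the XML header and the <posts> wrapper lines
    try:
        return ET.fromstring(line).attrib
    except ET.ParseError:
        return None

posts = (spark.sparkContext
         .textFile("Posts.xml")  # assumed path to the extracted file
         .map(parse_row)
         .filter(lambda attrs: attrs is not None))

# PostTypeId 1 = question, 2 = answer
counts = posts.map(lambda attrs: (attrs.get("PostTypeId"), 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())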

Let’s take a look at how a Stack Overflow question looks in the file.

[Figure: Stack Overflow question example]

In the file this post is represented by one single row. Note that because the text is HTML, the opening and closing p tags (<p> and </p>) are written as &lt;p&gt; and &lt;/p&gt; respectively.

<row
Id="4"
PostTypeId="1"
AcceptedAnswerId="7"
CreationDate="2008-07-31T21:42:52.667"
Score="322"
ViewCount="21888"
Body="&lt;p&gt;I want to use a track-bar to change a form's opacity.&lt;/p&gt; &lt;p&gt;This is my code:&lt;/p&gt; &lt;pre&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000; this.Opacity = trans; &lt;/code&gt;&lt;/pre&gt; &lt;p&gt;When I try to build it, I get this error:&lt;/p&gt; &lt;blockquote&gt; &lt;p&gt;Cannot implicitly convert type 'decimal' to 'double'.&lt;/p&gt; &lt;/blockquote&gt; &lt;p&gt;I tried making &lt;code&gt;trans&lt;/code&gt; a &lt;code&gt;double&lt;/code&gt;, but then the control doesn't work. This code has worked fine for me in VB.NET in the past. &lt;/p&gt; "
OwnerUserId="8"
LastEditorUserId="451518"
LastEditorDisplayName="Rich B"
LastEditDate="2014-07-28T10:02:50.557"
LastActivityDate="2014-12-20T17:18:47.807"
Title="When setting a form's opacity should I use a decimal or double?"
Tags="&lt;c#&gt;&lt;winforms&gt;&lt;type-conversion&gt;&lt;opacity&gt;"
AnswerCount="13"
CommentCount="1"
FavoriteCount="27"
CommunityOwnedDate="2012-10-31T16:42:47.213"
/>
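To make that layout concrete, here is a small sketch that parses one such row with the Python standard library: the XML parser turns the attributes into a dictionary and decodes the &lt;/&gt; entities in Body back into HTML, after which the tags can be stripped. The row string below is abbreviated; the attribute names come from the example above.

import re
import xml.etree.ElementTree as ET

# Abbreviated example row; in Posts.xml each post sits on a single line like this.
row_xml = (
    '<row Id="4" PostTypeId="1" Score="322" '
    'Title="When setting a form&#39;s opacity should I use a decimal or double?" '
    'Body="&lt;p&gt;I want to use a track-bar to change a form&#39;s opacity.&lt;/p&gt;" />'
)

attrs = ET.fromstring(row_xml).attrib       # entities are decoded during parsing
body_html = attrs["Body"]                   # now contains real <p> ... </p> tags
body_text = re.sub(r"<[^>]+>", " ", body_html).strip()  # crude tag stripping

print(attrs["Title"])
print(body_text)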

I’ll provide Apache Spark code examples for this dataset in the next blog post. My scenario will include two parts: data preparation (data manipulation) and machine learning. Both parts will use the multi-gigabyte dataset as input.

Conclusion

The Stack Overflow dataset (https://archive.org/details/stackexchange) is probably the simplest and most interesting open multi-gigabyte dataset you can find, and it fits machine learning, data processing, and data streaming scenarios. Please share if you have any information about other simple, open, big dataset resources. It would help the community a lot.

10 thoughts on “How To Find a Simple and Interesting Multi-Gigabyte Dataset”

    • Thank you for the link. I took a look at the BigQuery public data. Most of the datasets are relatively small; I found only Wikipedia with a multi-gigabyte data size. Could you please point me to some other large datasets in BigQuery?


      • I updated the page noting the size of the big ones: GDELT (340 GB and growing every 15 minutes), Wikipedia (380 GB per month), GitHubArchive (87.2 GB per year, and growing every day), Genomics (3.4 TB + 9.8 TB + …), HttpArchive (42 GB per run), Freebase (142 GB), New York City taxi rides (130 GB+), Reddit (546 GB of comments, and growing).

        BigQuery is pretty good for machine learning too: Host the dataset in BigQuery, quickly explore dimensions and choose your features within BigQuery. Then feed that to your ML algs.

        Btw, any other dataset that you have in an easy to ingest CSV or JSON, I’ll happily load them into BigQuery for you.


      • Felipe, thanks for the sizes. With those sizes it looks much more interesting 🙂
        I know how to query data in BigQuery. Is it possible to download an entire dataset or a large subset?

        Let’s say I need all GET requests from HttpArchive. Expected size: 35 GB (out of 42 GB). How can I download this large subset to my local machine?


      • If you want to download the full dataset from BigQuery, you’ll need to run an extract operation that exports a csv, json, or avro to Google Cloud Storage – industry standard cloud storage costs apply here (storage, egress, …).

        On the other hand, if you use BigQuery for feature extraction, you get a free terabyte of querying per month, and results can be extracted straight into your app for free using the REST API.

        This solves a traditional problem for people who want to share data: wherever you host data, you bear the costs of hosting and of every download. BigQuery is better than that: if you share a dataset in BigQuery, you don’t incur any additional cost when other people want to use it.

