Where to find terabyte-size dataset for machine learning

In the previous blog posts we played with a large multi-gigabyte dataset. That 34 GB dataset is based on stackoverflow.com data. A couple of days ago I found another great large dataset: a two-terabyte snapshot of the Reddit website. This dataset is well suited to text mining and NLP experimentation.

1. Two-terabyte dataset

The full dataset contains two terabytes of data in JSON format. Thanks to Stuck_In_the_Matrix, who created this dataset! The compressed version is 250 GB. You can find this dataset here on Reddit. Use a torrent client to download the compressed data.

Additionally, a 32 GB subset of this data is available in SQLite format on the Kaggle website here. You can also play with the data online, using R or Python, in the Kaggle competition.

2. Easy-to-use 16-gigabyte subset

To simplify working with this data, I created a subset in plain-text TSV (tab-separated values) format, available here in my Dropbox folder (updated; the old archive, compatible only with Mac OS, is here). The file contains a copy of the Kaggle subset. It is 16 GB uncompressed (yes, half the size of the Kaggle file, because plain text carries no indexes) and 6.6 GB as an archive.

SQLite commands for converting the Kaggle file to plain text:

 sqlite> .open database.sqlite
 sqlite> .headers off
 sqlite> .mode tabs
 sqlite> .out reddit-May2015.tsv
 sqlite> SELECT created_utc,ups,subreddit_id,link_id,name,score_hidden,replace(replace(author_flair_css_class, X'09', ' '), X'0A', ' ') AS author_flair_css_class,replace(replace(author_flair_text, X'09', ' '), X'0A', ' ') AS author_flair_text,subreddit,id,removal_reason,gilded,downs,archived,author,score,retrieved_on, replace(replace(body,X'09',' '), X'0A', ' ') AS body, distinguished,edited,controversiality,parent_id FROM May2015;
 sqlite> .exit

Note that I replace all tabs (X'09') and newlines (X'0A') with spaces in all text columns. Please let me know if you know how to combine the two character replacements into one operation.
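For what it's worth, if the cleanup is done outside SQLite instead, both characters can be handled in a single pass over the string. Here is a minimal Scala sketch, assuming the field has already been read into a string. The names TsvSanitizer and sanitize are hypothetical and are not part of the export above.

```scala
object TsvSanitizer {
  // Hypothetical helper: replace every tab or newline with a space
  // in one pass, so the field is safe to write into a TSV column.
  def sanitize(field: String): String =
    field.map(c => if (c == '\t' || c == '\n') ' ' else c)
}
```

This mirrors what the nested replace() calls do in the SELECT statement, just expressed as one traversal of the text.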

3. Read data in Spark

import org.apache.spark.sql.catalyst.plans._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// Load the TSV file as an RDD of lines (sc and sqlContext are assumed to come from spark-shell).
val fileName = "reddit-May2015.tsv"
val textFile = sc.textFile(fileName)
// Split on tabs (limit -1 keeps trailing empty fields) and keep only complete 22-column rows.
val rdd = textFile.map(_.split("\t", -1)).filter( _.length == 22 ).map { p =>
            Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9),
                p(10), p(11), p(12), p(13), p(14), p(15), p(16), p(17), p(18), p(19),
                p(20), p(21))
        }
// Column names, in the same order as the SELECT used for the export.
val schemaString = "created_utc,ups,subreddit_id,link_id,name,score_hidden,author_flair_css_class,author_flair_text,subreddit,id,removal_reason,gilded,downs,archived,author,score,retrieved_on,body,distinguished,edited,controversiality,parent_id"
// Every column is read as a nullable string.
val schema = StructType(
      schemaString.split(",").map(fieldName => StructField(fieldName, StringType, true)))
val df = sqlContext.createDataFrame(rdd, schema)
df.show
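
One detail of the split step is worth illustrating. On the JVM, String.split called without a limit argument drops trailing empty strings, so a row whose last columns are empty comes back with fewer than 22 fields and would be filtered out. Passing a limit of -1 keeps them. The sketch below is plain Scala with no Spark dependency. RedditTsv and parseLine are hypothetical helper names, not part of the code above.

```scala
object RedditTsv {
  // Number of columns in the exported TSV file.
  val ExpectedColumns = 22

  // Hypothetical helper: split one TSV line and return the fields only
  // when the row is complete. The limit -1 keeps trailing empty columns.
  def parseLine(line: String): Option[Array[String]] = {
    val fields = line.split("\t", -1)
    if (fields.length == ExpectedColumns) Some(fields) else None
  }
}
```

A row with an empty last column still parses with this helper, while a truncated row yields None and can be skipped.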

Conclusion

Today it is not easy to find great, interesting datasets for testing, training and research. So let’s collect some interesting datasets. Please share your newly found datasets with the community.

P.S.

I looked into the licensing of this dataset. The publisher, Stuck_In_the_Matrix, simply published the dataset and provided a description and links to the torrent directly on the Reddit website. Please note that Reddit sponsors the Kaggle competition with this dataset. It appears that we may play with the dataset for non-business purposes.

3 thoughts on “Where to find terabyte-size dataset for machine learning”

  1. Hi Dmitr,
    Thanks for sharing the data set. I downloaded the sample data set from your Dropbox however I could not unzip it because it looks to be corrupted. Is there any chance that you check it out and upload a new version if it is corrupted? That is very much appreciated.
