How to export data-frame from Apache Spark

Apache Spark is a great tool for working with large amounts of data, terabytes and petabytes, in a cluster. It’s also very useful on a local machine when gigabytes of data do not fit in your memory. Normally we use Spark for preparing data and very basic analytic tasks; it does not offer advanced analytical features or even visualization. So you have to reduce the amount of data to fit your computer’s memory capacity and hand it over to other tools. It turns out that Apache Spark still lacks the ability to export data in a simple format like CSV.

1. spark-csv library

I was really surprised when I realized that Spark does not have CSV exporting features out of the box. It turns out that the CSV library is an external project, spark-csv. This is a must-have library for Spark, and I find it funny that it appears to be more of a marketing plug for Databricks than an Apache Spark project.

Another surprise is that this library does not create one single file. It creates several files based on the data-frame partitioning, so a single data-frame produces several CSV files. I understand that this is good for optimization in a distributed environment, but you don’t need it when extracting data for R or Python scripts.
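
As a quick illustration (a sketch only, with the data-frame name df as a placeholder), you can check how many part files to expect by looking at the partition count:

    // each partition of the data-frame becomes one part file on disk
    val numParts = df.rdd.partitions.length
    println(s"this data-frame would be written as $numParts CSV part files")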

2. Export from data-frame to CSV

Let’s take a closer look at how this library works and export a CSV file from a data-frame.

df.write.
    format("com.databricks.spark.csv").
    option("header", "true").
    save("myfile.csv")

You should include this library in your Spark environment. From spark-shell just add the --packages parameter:

1) for Scala 2.10:
        bin/spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
2) for Scala 2.11:
        bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0
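
If you build a standalone application instead of using the shell, you would presumably pull in the same artifact through your build tool; the sbt line below is a sketch under that assumption, not something from the original post:

    // build.sbt (sketch, assuming sbt and a matching Scala version)
    libraryDependencies += "com.databricks" %% "spark-csv" % "1.3.0"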

This code creates a directory myfile.csv with several CSV files and metadata files. If you need a single CSV file, you have to explicitly repartition the data-frame into one single partition.

df.repartition(1).write.
    format("com.databricks.spark.csv").
    option("header", "true").
    save("myfile.csv")

Even with one partition, the result is still a directory, not a file. We have to export the data to a temporary directory, move the CSV part file to the correct place, and remove the directory with all the remaining files. Let’s automate this process:

import java.io.File
import org.apache.spark.sql.DataFrame

def saveDfToCsv(df: DataFrame, tsvOutput: String,
                sep: String = ",", header: Boolean = false): Unit = {
    val tmpParquetDir = "Posts.tmp.parquet"
    // write the data-frame as a single CSV part file into a temporary directory
    df.repartition(1).write.
        format("com.databricks.spark.csv").
        option("header", header.toString).
        option("delimiter", sep).
        save(tmpParquetDir)
    // move the part file to the requested output path and remove the temporary directory
    val dir = new File(tmpParquetDir)
    val tmpTsvFile = tmpParquetDir + File.separatorChar + "part-00000"
    (new File(tmpTsvFile)).renameTo(new File(tsvOutput))
    dir.listFiles.foreach( f => f.delete )
    dir.delete
}
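
A minimal usage sketch (the data-frame df and the output path are placeholders):

    // write df as a tab-separated file with a header row
    saveDfToCsv(df, "posts.tsv", sep = "\t", header = true)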

Conclusion

Apache Spark has many great aspects, but at this time it cannot be the be-all answer. Usually you have to pair Spark with your analytical tools like R or Python. However, improvements are constantly being made.

5 thoughts on “How to export data-frame from Apache Spark”

  1. Thanks for the info. That was a nice solution for older versions of Spark. Now with version 2.0.x the following solution works as well:

    nameofMyDataFrame.write
      .mode("overwrite")
      .option("header", true)
      .option("sep", ";")
      .csv(path = "/mypath")
