Fast Access To Your Complex Data - Avro, JSON, ORC, and Parquet

Scale

06/12/2018 - 16:30 to 17:10

Kesselhaus

long talk (40 min)

Intermediate

Session abstract:

The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding how you use the data is critical for picking a format, because the formats perform very differently depending on the use case. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.

The use cases that we’ve examined are:

reading all of the columns
reading a few of the columns
filtering using a filter predicate
writing the data
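
As a rough illustration of why these use cases separate the formats, here is a minimal Python sketch. It is not taken from the talk or its benchmarks. It assumes a toy row-oriented layout (in the spirit of JSON or Avro) and a toy column-oriented layout with per-stripe min/max statistics (in the spirit of ORC or Parquet). The sample records, the stripe size, and all function names are hypothetical.

```python
import json

# Hypothetical sample records; not data from the talk's benchmarks.
rows = [{"id": i, "name": f"user{i}", "score": i % 10} for i in range(1000)]
all_cols = ["id", "name", "score"]
few_cols = ["score"]

# Row-oriented layout: each record is stored whole.
row_chunks = [json.dumps(r).encode() for r in rows]

# Column-oriented layout: each column is stored contiguously.
col_chunks = {
    name: json.dumps([r[name] for r in rows]).encode() for name in all_cols
}

def bytes_read_row_layout(wanted):
    # Every record is read in full, no matter which columns are wanted.
    return sum(len(chunk) for chunk in row_chunks)

def bytes_read_col_layout(wanted):
    # Only the chunks of the requested columns are read.
    return sum(len(col_chunks[name]) for name in wanted)

# Toy predicate pushdown: keep min/max of "id" per stripe and skip
# stripes whose range cannot contain a match.
STRIPE = 100  # assumed stripe size, chosen only for illustration
stripes = [rows[i:i + STRIPE] for i in range(0, len(rows), STRIPE)]
stats = [(min(r["id"] for r in s), max(r["id"] for r in s)) for s in stripes]

def stripes_to_scan(lo, hi):
    # Return indexes of stripes whose [min, max] overlaps [lo, hi].
    return [i for i, (mn, mx) in enumerate(stats) if mx >= lo and mn <= hi]

print("row layout, few columns:", bytes_read_row_layout(few_cols))
print("col layout, few columns:", bytes_read_col_layout(few_cols))
print("stripes scanned for 250 <= id <= 260:", stripes_to_scan(250, 260))
```

In this toy model, reading a few columns touches far fewer bytes in the column layout, and the range filter scans one stripe out of ten. Real formats add encoding, compression, and indexes, so measured results depend on the implementation.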

While previous work compared file size and speed using Hive, this presentation covers benchmarks from Spark, including the new work that radically improves the performance of Spark on ORC. It also includes tips and suggestions for optimizing your application’s performance while reading and writing the data.

Finally, open source benchmarks that are available to all interested parties are hugely valuable, and all of the code is available from Apache.

Video:

#bbuzz 2018: Owen O'Malley – Fast Access To Your Complex Data - Avro, JSON, ORC, and Parquet

Slide:

sparkfileformatbenchmark.pdf

Berlin Buzzwords
