Fast Access To Your Complex Data - Avro, JSON, ORC, and Parquet

Scale
06/12/2018 - 16:30 to 17:10
Kesselhaus
long talk (40 min)
Intermediate

Session abstract: 

The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.

The use cases that we’ve examined are:

  • reading all of the columns
  • reading a few of the columns
  • filtering using a filter predicate
  • writing the data
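As a rough illustration of why these four use cases stress formats differently, here is a toy sketch in plain Python. It is not how Avro, JSON, ORC, or Parquet are actually implemented. The sample records, the field names, the chunk size, and every function name are assumptions made up for this example. It contrasts a row-oriented layout (one whole record after another, in the style of JSON and Avro) with a column-oriented layout that keeps min/max statistics per chunk (in the style of ORC and Parquet).

```python
import json

# Hypothetical sample records; the field names are invented for illustration.
rows = [
    {"id": i, "city": "Berlin" if i % 2 else "Paris", "fare": float(i)}
    for i in range(8)
]

# Row-oriented layout: each entry is one complete serialized record.
row_store = [json.dumps(r) for r in rows]


def write_columnar(records, chunk_size=4):
    """Group values by column, in chunks that carry min/max statistics."""
    chunks = []
    for start in range(0, len(records), chunk_size):
        part = records[start:start + chunk_size]
        columns = {name: [r[name] for r in part] for name in part[0]}
        stats = {name: (min(v), max(v)) for name, v in columns.items()}
        chunks.append({"columns": columns, "stats": stats})
    return chunks


col_store = write_columnar(rows)


def read_column_from_rows(store, name):
    # Every record is parsed in full even though only one field is needed.
    return [json.loads(line)[name] for line in store]


def read_column_from_columns(store, name):
    # Only the requested column is touched.
    return [v for chunk in store for v in chunk["columns"][name]]


def filter_with_stats(store, name, lower_bound):
    """Return ids where column `name` >= lower_bound, plus chunks scanned."""
    ids, scanned = [], 0
    for chunk in store:
        _, highest = chunk["stats"][name]
        if highest < lower_bound:
            continue  # the statistics prove no value in this chunk can match
        scanned += 1
        cols = chunk["columns"]
        ids.extend(i for i, v in zip(cols["id"], cols[name]) if v >= lower_bound)
    return ids, scanned


matching_ids, chunks_scanned = filter_with_stats(col_store, "fare", 5.0)
```

In this sketch, reading one column from the row layout parses every field of every record, while the column layout touches only the requested values, and the filter skips any chunk whose statistics rule out a match. Writing is where the column layout pays: records have to be buffered and regrouped before they can be stored.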

While previous work has compared the size and speed of these formats from Hive, this talk presents benchmarks from Spark, including the new work that radically improves the performance of Spark on ORC. It also includes tips and suggestions for optimizing the performance of your application when reading and writing the data.

Finally, having open source benchmarks that are available to all interested parties is hugely important, and all of the code is available from Apache.
