File types
- Spark + Parquet In Depth
- File Format Benchmark: Avro, JSON, ORC and Parquet
- Berlin Buzzwords 18: Owen O'Malley, Fast Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Parquet
- Efficient column selection (only the requested columns are read)
- Columnar format
- Binary format
- Encoded & Compressed
- Supports schema evolution (built into the format)
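To make the columnar and encoding points above more concrete, here is a minimal toy sketch in plain Python. It is not how Parquet is implemented; it only illustrates why selecting a column is cheap in a column-oriented layout and why repeated values compress well with run-length encoding (one of the encodings Parquet uses). The sample rows and the helper names (`to_columns`, `run_length_encode`, `run_length_decode`) are hypothetical.

```python
from itertools import groupby

# Hypothetical sample rows; the field names are made up for illustration.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 12.5},
    {"id": 3, "country": "DE", "amount": 7.0},
    {"id": 4, "country": "FR", "amount": 3.0},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


columns = to_columns(rows)

# Selecting one column touches only that column's values,
# not the other fields of every row.
amounts = columns["amount"]

# A column with long runs of the same value shrinks to a few pairs.
encoded_country = run_length_encode(columns["country"])
```

In this sketch `encoded_country` holds two pairs instead of four strings, and decoding restores the original column exactly.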
Limitations:
- Pushdown filters don't work on String / Binary columns (source)
- Write speed tradeoff (writes are slower in exchange for faster reads)
Workaround(s):
- Immutability
- Write using partitioning
- Combine with a database (e.g. Cassandra) and, after a while, spill the data out to Parquet files
- Write in append mode, which adds new files with an embedded schema
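The "write using partitioning" workaround can be sketched as follows. This is a plain-Python toy model of a Hive-style `column=value` directory layout, assuming that is the layout meant here; it does not write real Parquet files. The function names (`partition_path`, `plan_partitioned_write`, `prune`), the `events.parquet` base path and the sample events are all hypothetical.

```python
from collections import defaultdict


def partition_path(base, row, partition_cols):
    """Build a Hive-style directory path such as base/year=2018/month=6."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([base] + parts)


def plan_partitioned_write(base, rows, partition_cols):
    """Group rows by target directory.

    Partition columns are dropped from the stored rows because
    the directory path already encodes their values.
    """
    plan = defaultdict(list)
    for row in rows:
        data = {k: v for k, v in row.items() if k not in partition_cols}
        plan[partition_path(base, row, partition_cols)].append(data)
    return dict(plan)


def prune(plan, col, value):
    """Keep only the directories whose path contains col=value."""
    wanted = f"{col}={value}"
    return [path for path in plan if wanted in path.split("/")]


# Hypothetical sample data.
events = [
    {"year": 2018, "month": 5, "user": "a"},
    {"year": 2018, "month": 6, "user": "b"},
    {"year": 2018, "month": 6, "user": "c"},
]

plan = plan_partitioned_write("events.parquet", events, ["year", "month"])

# A filter on a partition column only needs to open matching directories.
june = prune(plan, "month", 6)
```

In Spark the corresponding write would look roughly like `df.write.partitionBy("year", "month").parquet(path)`. A filter on a partition column can then skip whole directories without relying on filter pushdown inside the files.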
vs ORC
- indexed
- doesn't handle nested data
ORC
- Nested Data
- Columnar format
- Predicate pushdown (min/max statistics + Bloom filters)
- ACID support / cannot add
- Suggested for streaming (source)
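The predicate pushdown item above (min/max statistics plus Bloom filters) can be illustrated with a small toy model. This is only a sketch of the idea, not ORC's actual index format: each "stripe" here is a plain dict, and the `BloomFilter` class, its size and its hash scheme are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=256, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single Python integer

    def _positions(self, value):
        # Derive several bit positions by hashing the value with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_stripe(values):
    """Store the values together with their min/max and a Bloom filter."""
    bloom = BloomFilter()
    for v in values:
        bloom.add(v)
    return {"values": values, "min": min(values), "max": max(values), "bloom": bloom}


def stripes_to_read(stripes, target):
    """Return indexes of stripes that may hold `target` (equality filter)."""
    selected = []
    for i, stripe in enumerate(stripes):
        if target < stripe["min"] or target > stripe["max"]:
            continue  # min/max statistics rule this stripe out
        if not stripe["bloom"].might_contain(target):
            continue  # Bloom filter says the value is definitely absent
        selected.append(i)
    return selected


# Hypothetical data split into three stripes.
stripes = [build_stripe([1, 5, 9]), build_stripe([20, 25, 40]), build_stripe([41, 60, 99])]
hits = stripes_to_read(stripes, 25)
```

Only the second stripe is selected for the value 25; the other two are skipped from their min/max range alone. The Bloom filter helps in the remaining case, where a value falls inside a stripe's range but is not actually stored there.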