File types
Parquet
- Better column selection (only the requested columns are read)
- Columnar format
- Binary format
- Encoded & Compressed
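
To make the "columnar format" and "column selection" points concrete, here is a minimal pure-Python sketch. It is a toy model, not how Parquet is actually implemented: the helper names `to_columnar` and `select_columns` and the sample data are made up for illustration, and real Parquet files add encoding, compression and metadata on top.

```python
# Toy illustration only: real Parquet also encodes and compresses each column.
rows = [
    {"id": 1, "city": "Oslo", "temp": 4.5},
    {"id": 2, "city": "Rome", "temp": 17.0},
    {"id": 3, "city": "Lima", "temp": 21.5},
]

def to_columnar(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

def select_columns(columns, wanted):
    """Column selection: return only the requested columns, touching no others."""
    return {name: columns[name] for name in wanted}

columns = to_columnar(rows)
selected = select_columns(columns, ["temp"])
print(selected)  # {'temp': [4.5, 17.0, 21.5]}
```

With a row-oriented layout, reading `temp` would mean scanning every full record; with the column layout, the other columns are never touched.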
Limitation:
- Pushdown filters don't work on String / Binary columns (source)
- Write speed tradeoff
Workaround(s):
- Immutability
- Write using partitioning
- Combine with a database (e.g. Cassandra) and, after a while, spill the data out to Parquet files
- Use append write mode, which adds new files with an embedded schema
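
The "write using partitioning" workaround can be sketched as follows. This is a minimal sketch in pure Python that only computes the directory layout; it does not write Parquet. It assumes the Hive-style `key=value` directory convention that is commonly used for partitioned datasets. The function names `partition_path` and `group_by_partition`, the base directory `events` and the sample records are hypothetical.

```python
from collections import defaultdict

def partition_path(base_dir, record, partition_keys):
    """Build a Hive-style path such as base/year=2020/month=1 (assumed convention)."""
    parts = [f"{key}={record[key]}" for key in partition_keys]
    return "/".join([base_dir] + parts)

def group_by_partition(records, base_dir, partition_keys):
    """Group records by target directory, so each directory can be written on its own."""
    groups = defaultdict(list)
    for record in records:
        groups[partition_path(base_dir, record, partition_keys)].append(record)
    return dict(groups)

# Hypothetical sample data
events = [
    {"year": 2020, "month": 1, "value": 10},
    {"year": 2020, "month": 2, "value": 20},
    {"year": 2020, "month": 1, "value": 30},
]

layout = group_by_partition(events, "events", ["year", "month"])
for path in sorted(layout):
    print(path, len(layout[path]))
```

Because each partition is a separate directory, new data can be appended as new files without rewriting existing ones, which fits the immutability point above.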
vs ORC
- indexed
- doesn't handle nested data
ORC
- Nested Data
- Columnar format
- Predicate pushdown (min/max statistics + Bloom filters)
- ACID support / cannot add
- suggested for streaming (source)
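
The predicate pushdown idea (min/max + Bloom filters) can be sketched roughly like this. It is a toy model in pure Python, not ORC's actual index format: "block" is used here as a generic stand-in for a chunk of column values, and the `BloomFilter` class, its sizes, and the helpers `build_block_index` and `can_skip` are assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=256, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single Python int

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

def build_block_index(values):
    """Collect min/max statistics and a Bloom filter for one block of values."""
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    return {"min": min(values), "max": max(values), "bloom": bloom}

def can_skip(index, wanted):
    """Return True when the block certainly does not contain `wanted`."""
    if wanted < index["min"] or wanted > index["max"]:
        return True  # outside the min/max range
    return not index["bloom"].might_contain(wanted)

# Hypothetical column split into three blocks
blocks = [[1, 5, 9], [20, 25, 30], [100, 150, 200]]
indexes = [build_block_index(block) for block in blocks]

# Filter "value == 25": only blocks that cannot be skipped are read.
to_read = [i for i, index in enumerate(indexes) if not can_skip(index, 25)]
print(to_read)  # [1]
```

The min/max check rules out blocks whose range excludes the value; the Bloom filter helps for values that fall inside the range but are probably absent. A block that cannot be skipped still has to be scanned, since the Bloom filter can give false positives.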