
File types

- Spark + Parquet In Depth
- File Format Benchmark: Avro, JSON, ORC and Parquet
- Berlin Buzzwords 18: Owen O'Malley - Fast Access To Your Complex Data - Avro, JSON, ORC, and Parquet

Parquet

Efficient column selection (only the columns a query needs are read)
Columnar format
Binary format
Encoded and compressed
Supports schema evolution
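
A minimal sketch of why a columnar layout helps with column selection and compression. It uses plain Python lists, not the Parquet format itself; the sample records and the helper names `to_columns` and `run_length_encode` are made up for illustration, and run-length encoding stands in as just one simple example of an encoding.

```python
from itertools import groupby

# Hypothetical sample records; the field names are invented for this sketch.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 12.5},
    {"id": 3, "country": "DE", "amount": 7.0},
    {"id": 4, "country": "FR", "amount": 3.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Column selection: a query that only needs "amount" touches one list,
# not every field of every row.
total = sum(columns["amount"])

# Values of one column sit next to each other, so repeats encode compactly.
encoded_country = run_length_encode(columns["country"])

print(total)
print(encoded_country)
```

In a row-oriented layout the same sum would have to walk every record, and the repeated country values would be interleaved with other fields, which makes them harder to encode compactly.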

Limitations:

Pushdown filters don't work on String / Binary columns (source)
Write speed trade-off (writes are slower in exchange for faster reads)

Workaround(s):

Immutability (files cannot be modified in place)
- Write using partitioning
- Combine with a database (e.g. Cassandra) and periodically spill the data out to Parquet files
- Use write mode append, which adds new files with their own embedded schema
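
A rough sketch of the partitioning idea: records are grouped into `key=value` directories (the layout Spark produces for partitioned writes), so new data lands in new files and one partition can be replaced without rewriting the others. The sketch only computes the directory layout and writes no Parquet files; `partition_path`, `group_by_partition` and the sample events are hypothetical names and data.

```python
from collections import defaultdict


def partition_path(record, keys):
    """Build a Hive-style directory path such as 'year=2022/month=1'."""
    return "/".join(f"{key}={record[key]}" for key in keys)


def group_by_partition(records, keys):
    """Group records by the directory they would be written to."""
    groups = defaultdict(list)
    for record in records:
        groups[partition_path(record, keys)].append(record)
    return dict(groups)


# Hypothetical events; only the partition columns matter for the layout.
events = [
    {"year": 2021, "month": 12, "value": 1},
    {"year": 2022, "month": 1, "value": 2},
    {"year": 2022, "month": 1, "value": 3},
]

layout = group_by_partition(events, ["year", "month"])
for path, records in layout.items():
    print(path, len(records))
```

A reader that filters on `year` or `month` can then skip whole directories, and a late correction for one month only means replacing the files under that month's directory.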

vs ORC

Indexed
Doesn't handle nested data

ORC

Nested Data
Columnar format
Predicate pushdown (min/max statistics + Bloom filters)
ACID support / cannot add
Suggested for streaming (source)
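
A simplified sketch of how min/max statistics and Bloom filters let a reader skip data for an equality predicate. It is not the ORC on-disk format: "stripe" is used loosely for a chunk of values, and `BloomFilter`, `build_stripe` and `find_equal` are hypothetical helpers written for this illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=256, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_stripe(values):
    """Keep the values together with statistics a reader can check first."""
    bloom = BloomFilter()
    for v in values:
        bloom.add(v)
    return {"min": min(values), "max": max(values), "bloom": bloom, "values": values}


def find_equal(stripes, target):
    """Return (matches, stripes actually scanned) for `value == target`."""
    matches, scanned = [], 0
    for stripe in stripes:
        if not (stripe["min"] <= target <= stripe["max"]):
            continue  # min/max statistics rule this stripe out
        if not stripe["bloom"].might_contain(target):
            continue  # Bloom filter rules this stripe out
        scanned += 1
        matches.extend(v for v in stripe["values"] if v == target)
    return matches, scanned


stripes = [build_stripe([1, 2, 3]), build_stripe([10, 20, 30]), build_stripe([100, 200])]
matches, scanned = find_equal(stripes, 20)
print(matches, scanned)
```

The min/max check handles values outside a stripe's range; the Bloom filter helps with values that fall inside the range but are probably absent, at the cost of occasional false positives.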

Avro

kb/bigdata/file_types.txt · Last modified: 2022/01/03 16:03 by 127.0.0.1
