====== File types ======

  * [[https://www.youtube.com/watch?v=_0Wpwj_gvzg|Spark + Parquet In Depth]]
  * [[https://www.youtube.com/watch?v=2vOfh064uUM|File Format Benchmark Avro JSON ORC and Parquet]]
  * [[https://www.youtube.com/watch?v=aIcxFIyL6xo|Berlin buzzwords18: Owen O'Malley – Fast Access To Your Complex Data - Avro, JSON, ORC, and Parquet]]

===== Parquet =====


  * Efficient column selection: a reader can load only the columns it needs
  * Columnar format
  * Binary format
  * Encoded & Compressed
  * Supports schema evolution (at the format level)
Limitations:
  * Pushdown filters do not work on String / Binary columns ([[https://www.youtube.com/watch?v=_0Wpwj_gvzg|source]])
  * Write speed tradeoff: encoding and compression make writes slower in exchange for faster reads
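
The columnar layout and column selection listed above can be illustrated with a toy sketch in plain Python. This is not how Parquet is implemented (real files add encoding, compression and metadata); the names ''to_columns'' and ''select'' and the sample records are hypothetical and exist only for this illustration.

```python
# Toy illustration only: real Parquet files add encoding, compression and metadata.
# The sample records and helper names are made up for this sketch.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 12.5},
    {"id": 3, "country": "DE", "amount": 7.5},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column.

    Assumes every record has the same keys.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def select(columns, names):
    """Column pruning: touch only the requested columns."""
    return {name: columns[name] for name in names}


columns = to_columns(rows)

# An aggregate over "amount" never has to look at "id" or "country".
pruned = select(columns, ["amount"])
total = sum(pruned["amount"])
print(total)
```

Storing each column contiguously is also what makes encoding and compression effective, since values of the same type and similar range sit next to each other.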

Workaround(s):
  * Immutability (files cannot be updated in place)
    * Write using partitioning
    * Combine with a database (e.g. Cassandra) and, after a while, spill the data out to Parquet files
    * Write in append mode, which adds new files with their own embedded schema
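
A rough sketch of the partitioning and append workarounds above, using only the Python standard library. JSON files stand in for Parquet files here, and ''write_partitioned'', ''read_partition'' and the ''key=value'' directory names are assumptions made for illustration, not an actual Spark or Parquet API.

```python
import json
import tempfile
from pathlib import Path


def write_partitioned(root, records, partition_key, batch_name):
    """Append one new file per partition value; existing files are never modified."""
    groups = {}
    for record in records:
        groups.setdefault(record[partition_key], []).append(record)
    for value, group in groups.items():
        directory = Path(root) / f"{partition_key}={value}"
        directory.mkdir(parents=True, exist_ok=True)
        # JSON is a stand-in for a Parquet file in this sketch.
        (directory / f"{batch_name}.json").write_text(json.dumps(group))


def read_partition(root, partition_key, value):
    """Partition pruning: open only the directory that matches the filter."""
    directory = Path(root) / f"{partition_key}={value}"
    records = []
    for path in sorted(directory.glob("*.json")):
        records.extend(json.loads(path.read_text()))
    return records


root = tempfile.mkdtemp()

# First batch creates two partitions.
write_partitioned(
    root,
    [{"day": "2020-01-01", "v": 1}, {"day": "2020-01-02", "v": 2}],
    "day",
    "batch-0",
)
# A later "append" adds a new file next to the old one instead of rewriting it.
write_partitioned(root, [{"day": "2020-01-01", "v": 3}], "day", "batch-1")

jan_first = read_partition(root, "day", "2020-01-01")
print(jan_first)
```

Because every write only adds files, immutability of the individual files is not a problem; the cost is a growing number of small files per partition, which is why compacting them later is a common follow-up step.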

=== vs ORC ===
  * Indexed
  * Does not handle nested data

===== ORC =====

  * Nested Data 
  * Columnar format
  * Predicate pushdown (min/max statistics + Bloom filters)
  * ACID support / cannot add 
  * Suggested for streaming ([[https://www.youtube.com/watch?v=NZLrJmjoXw8|source]])
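
The predicate pushdown idea listed above (min/max statistics plus Bloom filters) can be sketched in plain Python. This is a simplified model, not ORC's actual on-disk index format: the ''BloomFilter'' class, ''build_stripe'', ''find_equal'' and the filter sizes are all hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=256, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_stripe(values):
    """Store a chunk of column values together with its statistics."""
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    return {"min": min(values), "max": max(values), "bloom": bloom, "values": values}


def find_equal(stripes, target):
    """Return matching values and the number of stripes actually scanned."""
    hits, scanned = [], 0
    for stripe in stripes:
        if target < stripe["min"] or target > stripe["max"]:
            continue  # min/max statistics rule this stripe out
        if not stripe["bloom"].might_contain(target):
            continue  # Bloom filter says the value is definitely absent
        scanned += 1
        hits.extend(v for v in stripe["values"] if v == target)
    return hits, scanned


stripes = [
    build_stripe([1, 5, 9]),
    build_stripe([20, 25, 30]),
    build_stripe([40, 45, 50]),
]
hits, scanned = find_equal(stripes, 25)
print(hits, scanned)
```

Min/max statistics help most when the data is sorted or clustered on the filtered column; the Bloom filter covers equality lookups where a value falls inside the min/max range but is not actually present.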


===== Avro =====