====== File types ======

  * [[https://www.youtube.com/watch?v=_0Wpwj_gvzg|Spark + Parquet In Depth]]
  * [[https://www.youtube.com/watch?v=2vOfh064uUM|File Format Benchmark Avro JSON ORC and Parquet]]
  * [[https://www.youtube.com/watch?v=aIcxFIyL6xo|Berlin Buzzwords 18: Owen O'Malley – Fast Access To Your Complex Data - Avro, JSON, ORC, and Parquet]]

===== Parquet =====

  * Efficient column selection (only the columns a query needs are read)
  * Columnar format
  * Binary format
  * Encoded & compressed
  * Supports schema evolution (supported by the format itself)

Limitations:

  * Pushdown filters do not work on String / Binary columns ([[https://www.youtube.com/watch?v=_0Wpwj_gvzg|source]])
  * Write speed tradeoff: writes are slower in exchange for faster reads

Workaround(s):

  * Immutability (treat written files as immutable)
  * Write using partitioning
  * Combine with a database (e.g. Cassandra) and, after a while, spill the data out to Parquet files
  * Write mode append, which adds new files with their own embedded schema

=== vs ORC ===

  * Indexed
  * Does not handle nested data

===== ORC =====

  * Nested data
  * Columnar format
  * Predicate pushdown (min/max + Bloom filters)
  * ACID support / cannot add
  * Suggested for streaming ([[https://www.youtube.com/watch?v=NZLrJmjoXw8|source]])

===== Avro =====