====== BigData ======

| [[.:BigData:YARN|YARN]] | [[.:BigData:Sqoop]] | [[.BigData:Spark]] | [[.:BigData:Knox]] | [[.:BigData:Hortonworks|Hortonworks]] |
| [[.:BigData:File types]] | [[.:BigData:NiFi]] | [[.:BigData:Eco-system]] |

  * [[https://www.slideshare.net/mattlieber/parquet-and-impala-overview-external|Parquet and Impala overview (external presentation)]]
  * [[https://www.dremio.com/|Dremio Israeli startup]]
  * [[https://delta.io/|Delta Lake]]
  * [[https://www.youtube.com/watch?v=zx9rFKnk4hU|Delta Lake YouTube]]

===== Performance =====

  * [[http://crazyadmins.com/tune-hadoop-cluster-to-get-maximum-performance-part-1/|Hadoop]]
  * [[https://www.cloudera.com/documentation/enterprise/5-9-x/topics/admin_hos_tuning.html|Cloudera tune]]

==== Tools ====

  * [[https://unraveldata.com/|Unravel]]
  * [[https://github.com/linkedin/dr-elephant|Dr. Elephant]]
  * [[https://www.pepperdata.com/|Dr. Elephant Enterprise]]

===== Ingestion =====

  * [[https://gobblin.apache.org/|Apache Gobblin]]
  * [[https://www.youtube.com/watch?v=BQ7aONetKl4|YouTube: Stream and Batch Data Integration at LinkedIn scale using Apache Gobblin]]
  * [[https://engineering.linkedin.com/blog/2021/data-integration-library|LinkedIn blog: data-integration-library]]
  * [[https://gobblin.readthedocs.io/en/latest/miscellaneous/Exactly-Once-Support/#achieving-exactly-once-delivery-with-commitstepstore|Gobblin Exactly-Once-Support readthedocs.io]]
  * [[https://www.youtube.com/watch?v=fHFNZlWCpKA|YouTube: Gobblin as an ETL framework / Ivan Akhlestin (Rambler&Co)]]
  * [[https://cwiki.apache.org/confluence/display/GOBBLIN/Gobblin+as+a+Service|Gobblin as a Service]]
  * [[https://gobblin.apache.org/docs/user-guide/Gobblin-CLI/|User guide: Gobblin-CLI]]

===== Workflow =====

  * [[https://azkaban.github.io|Azkaban]]

===== MDM =====

  * DataHub: A Generalized Metadata Search & Discovery Tool (ex WhereHows)
  * [[https://github.com/linkedin/datahub|LinkedIn DataHub (ex WhereHows)]]
  * [[https://engineering.linkedin.com/wherehows|LinkedIn WhereHows]]

===== OLAP & OLTP =====

  * Druid
  * Kylin
  * [[https://www.slideshare.net/argonauts007/kylin-and-druid-presentation|Kylin and Druid presentation]]

===== Fast Databases =====

See also: [[https://en.wikipedia.org/wiki/List_of_in-memory_databases|list of in-memory databases]]

  * [[https://www.memsql.com/|MemSQL]]
  * [[http://kylin.apache.org/|Apache Kylin™ Extreme OLAP engine for big data]]
  * https://www.citusdata.com/
  * [[http://druid.io/|Druid]] is a high-performance, column-oriented, distributed data store.
  * [[https://www.rethinkdb.com/|RethinkDB]] - the open-source database for the realtime web.
  * [[https://prestodb.io/overview.html|Presto]] - an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
  * [[https://www.voltdb.com|VoltDB]] - an in-memory operational database with real-time analytics and real-time decisioning, available in several editions: Enterprise, Pro, AWS, and Community.
  * [[https://clickhouse.yandex/|ClickHouse]]

===== Premium Databases =====

  * [[https://www.vertica.com/overview/|Vertica]]
  * [[https://www.teradata.com/|Teradata]]
  * [[https://www.kinetica.com/|Kinetica]]

===== Tech =====

  * [[http://kylin.apache.org/|Kylin]]
  * [[http://druid.io/|Druid]]
  * Redux - http://redux.js.org/
  * Clustrix - http://www.clustrix.com/
  * Aerospike - http://www.aerospike.com/
  * [[https://prestodb.io|Presto]] - Distributed SQL query engine for big data ([[https://github.com/prestodb/presto|github]]) - Apache License 2.0
  * [[https://kudu.apache.org/|Apache Kudu]] - An addition to the Apache Hadoop ecosystem. Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data. Kudu is specifically designed for use cases that require fast analytics on fast (rapidly changing) data. Engineered to take advantage of next-generation hardware and in-memory processing, Kudu lowers query latency significantly for Apache Impala (incubating) and Apache Spark (initially, with other execution engines to come).
  * [[http://www.alluxio.org/|Alluxio (formerly Tachyon)]] - enables any application to interact with any data from any storage system at memory speed.
  * [[http://vespa.ai/|Vespa]] - Big data. Real time. The open big data serving engine: store, search, rank and organize big data at user serving time.

===== Logs & Visual =====

  * Logstash - https://www.elastic.co/products/logstash
  * Graph - https://www.elastic.co/products/x-pack/graph
  * Kibana - https://www.elastic.co/products/kibana
  * Graylog - https://www.graylog.org/

===== D Other =====

  * [[http://dbs.uni-leipzig.de/dedoop|Dedup with Hadoop]]

===== Graph Databases =====

==== Terminology ====

  * **RDF** - Resource Description Framework [[https://en.wikipedia.org/wiki/Resource_Description_Framework|Source]]
  * **OLTP** (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is on very fast query processing, on maintaining data integrity in multi-access environments, and on effectiveness measured by the number of transactions per second. An OLTP database holds detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF). [[http://datawarehouse4u.info/OLTP-vs-OLAP.html|source]]
  * **OLAP** (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the measure of effectiveness. OLAP applications are widely used by data mining techniques. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually a star schema). [[http://datawarehouse4u.info/OLTP-vs-OLAP.html|source]]
  * **OLTP vs OLAP** - We can divide IT systems into transactional (OLTP) and analytical (OLAP).

==== Technology ====

  * [[https://github.com/rayokota/hgraphdb|HGraphDB]] - HBase as a TinkerPop Graph Database
  * [[http://tinkerpop.apache.org/|TinkerPop]] - Graph computing framework for both graph databases (OLTP) and graph analytic systems (OLAP).
  * [[http://giraph.apache.org/|Giraph]] - Iterative graph processing system built for high scalability
  * [[http://s2graph.incubator.apache.org/|S2Graph]] - graph database designed to handle transactional graph processing at scale. Its REST API allows you to store, manage and query relational information using edge and vertex representations in a fully asynchronous and non-blocking manner
  * [[https://graphframes.github.io|GraphFrames]] - package for Apache Spark which provides DataFrame-based Graphs ([[https://databricks.com/blog/2016/03/03/introducing-graphframes.html|Tutorial]])
  * [[http://spark.apache.org/docs/latest/graphx-programming-guide.html|Spark GraphX]] - component in Spark for graphs and graph-parallel computation ([[http://note.yuhc.me/2015/03/data-loading-in-graphx/|Tutorial]])

===== Other =====

Hadoop, Spark, Storm, Samza, Spark Streaming, Kafka, Flume, MapReduce, Scalding, HBase, MongoDB, Cassandra, Elasticsearch, Solr, Spark MLlib, Algebird, Spark GraphX, NiFi, Apex

  * http://sigmajs.org/ - visual graph JS library
  * [[https://databricks.com/blog/2016/03/03/introducing-graphframes.html|Spark and HBase - Example]]
  * [[https://github.com/amplab/graphx/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala|HBase Spark example]]

===== Tech KB =====

  * [[.:bigdata:Apache phoenix|Apache Phoenix]]
  * [[https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_connecting_to_a_mainframe|Sqoop connect mainframe]]

===== Other url =====

  * [[https://streever.atlassian.net/wiki/spaces/HADOOP/pages/9961474/Hive+JDBC+Extended+Connection+URL+Examples|Hadoop]]
  * [[https://cdap.io/|CDAP]]