BigData

YARN	Sqoop	Spark	Knox	Hortonworks
File types	NiFi	Eco-system

Performance

Tools

Ingestion

Apache Gobblin

Workflow

Azkaban

MDM

DataHub: A Generalized Metadata Search & Discovery Tool (ex WhereHows)
- Linkedin Datahub (ex WhereHows)
- Linkedin wherehows

OLAP & OLTP

Fast Databases

Premium Databases

Tech

kylin
Druid
Redix - http://redux.js.org/
Clustrix - http://www.clustrix.com/
Aerospike - http://www.aerospike.com/
Presto - Distributed SQL query engine for big data (GitHub), Apache License 2.0
Apache Kudu - An addition to the Apache Hadoop ecosystem. Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast (rapidly changing) data. Engineered to take advantage of next-generation hardware and in-memory processing, Kudu lowers query latency significantly for Apache Impala (incubating) and Apache Spark (initially, with other execution engines to come).
Alluxio (formerly Tachyon) - enables any application to interact with any data from any storage system at memory speed.
vespa - Big data. Real time. The open big data serving engine: Store, search, rank and organize big data at user serving time.

Logs & Visual

logstash - https://www.elastic.co/products/logstash
graph - https://www.elastic.co/products/x-pack/graph
kibana - https://www.elastic.co/products/kibana
graylog - https://www.graylog.org/

D Other

DeDup with hadoop

Graph Databases

Terminology

RDF - Resource Description Framework Source
OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). OLTP systems emphasize very fast query processing and data integrity in multi-access environments, and their effectiveness is measured in transactions per second. An OLTP database holds detailed, current data, and the schema used to store it is the entity model (usually 3NF). source
OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the measure of effectiveness. OLAP applications are widely used with data mining techniques. An OLAP database holds aggregated, historical data stored in multi-dimensional schemas (usually a star schema). source
OLTP vs OLAP - We can divide IT systems into transactional (OLTP) and analytical (OLAP).
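
To make the OLTP/OLAP distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the helper names (`place_order`, `update_amount`, `revenue_by_region`) are hypothetical and chosen only for illustration. A real deployment would typically use separate systems for the two workloads, which a single in-memory SQLite database does not model.

```python
import sqlite3

# In-memory database; the "orders" table is a made-up example schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL)"
)


def place_order(customer, region, amount):
    """OLTP-style: one short transaction that inserts a single row."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT INTO orders (customer, region, amount) VALUES (?, ?, ?)",
            (customer, region, amount),
        )


def update_amount(order_id, amount):
    """OLTP-style: one short transaction that updates a single row by key."""
    with conn:
        conn.execute(
            "UPDATE orders SET amount = ? WHERE id = ?", (amount, order_id)
        )


def revenue_by_region():
    """OLAP-style: one query that scans the table and aggregates it."""
    cursor = conn.execute(
        "SELECT region, COUNT(*), SUM(amount) "
        "FROM orders GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


# Many small writes (transactional side) ...
place_order("alice", "EU", 10.0)
place_order("bob", "EU", 20.0)
place_order("carol", "US", 5.0)
update_amount(3, 7.5)

# ... followed by one aggregate read (analytical side).
report = revenue_by_region()
print(report)
```

The write helpers each touch one row inside their own transaction, which is the pattern measured in transactions per second. The report query reads every row and groups it, which is the pattern measured by response time.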

Technology

HGraphDB - HBase as a TinkerPop Graph Database
TinkerPop - Graph computing framework for both graph databases (OLTP) and graph analytic systems (OLAP).
Giraph - Iterative graph processing system built for high scalability
s2graph - graph database designed to handle transactional graph processing at scale. Its REST API allows you to store, manage and query relational information using edge and vertex representations in a fully asynchronous and non-blocking manner
GraphFrames - package for Apache Spark which provides DataFrame-based Graphs (Tutorial)
Spark GraphX - component in Spark for graphs and graph-parallel computation (Tutorial)

Other

Hadoop, Spark, Storm, Samza, Spark Streaming, Kafka, Flume, MapReduce, Scalding, HBase, MongoDB, Cassandra, Elasticsearch, Solr, Spark MLlib, Algebird, Spark GraphX

NiFi, Apex

http://sigmajs.org/ - Visual graph JS library

Tech KB

Other url

Hadoop
CDAP

kb/bigdata.txt · Last modified: 2022/01/03 16:03 by 127.0.0.1
