CCC review W8
Elasticsearch (I hate this database)
“Big data” challenges and architectures
Four Vs of big data: Volume, Velocity (how frequently new data is brought into the system), Variety (complexity of the data schema), Veracity (how far the data can be trusted; more data from more diverse and more unstructured sources means lower veracity)
Why not relational DBs and SQL: SQL databases do a great job on consistency, but they do not work well for big data
Data models for distributed systems:
key-value store (fast, structured), BigTable DBMS, document-oriented DBMS (stores data as structured documents, e.g. Elasticsearch)
Database cluster: why we also need to run the DB as a cluster
Distribute the computing load over multiple computers, which improves availability
Store multiple copies of the data
A shard is a partition of your database (one part of the data). Copies of the same shard can be stored on different nodes, and we can choose how many replicas we want to keep
Federated DB architecture: different DBs with different tables run on many nodes, but there is only one entry node
Elasticsearch cluster architecture:
two node types, master and data node; one node can have more than one node type.
More than one node can have the master role, but only one node in the cluster can be the master at any time (the other nodes with the master role are master-eligible)
every node can act as a coordinating node (coordinating query execution)
indexes can have shards and replicas; Elasticsearch also reports a health status: green (all shards and replicas are allocated) or yellow (all primary shards are allocated, but there are not enough nodes to distribute every replica)
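The green/yellow rule above can be sketched in a few lines. This is only a simplified model, assuming the single constraint that two copies of the same shard never share a node; real allocation also depends on disk space, allocation filters and more. The function name `health_status` is made up for illustration, and "red" (some primary shard not allocated) is the third status Elasticsearch reports, added here for completeness.

```python
def health_status(n_data_nodes: int, n_replicas: int) -> str:
    """Rough cluster health for an index with n_replicas replicas per shard.

    Assumption: a replica is never placed on the same node as another
    copy of the same shard, so each shard needs 1 + n_replicas nodes.
    """
    if n_data_nodes < 1:
        # No node can hold even the primary shards.
        return "red"
    if n_data_nodes >= 1 + n_replicas:
        # Every primary and every replica can get its own node.
        return "green"
    # Primaries fit, but some replicas have nowhere to go.
    return "yellow"


if __name__ == "__main__":
    print(health_status(n_data_nodes=1, n_replicas=1))
    print(health_status(n_data_nodes=3, n_replicas=2))
```

A single-node cluster with one replica configured is the classic case where the status stays yellow.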
consistency (a client receives the same answer from every node of the cluster), availability (a client receives a response from at least one node), partition-tolerance (the cluster keeps operating when one or more nodes are offline or cannot reach the rest)
Brewer's CAP theorem: you can only achieve two of the three at the same time
to achieve two of them:
Consistency and availability: two-phase commit
it works great when the cluster is co-located, but it is not good for a distributed cluster
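A minimal in-memory sketch of two-phase commit, to make the two phases concrete. The `Participant` class and `two_phase_commit` function are hypothetical names, and there is no networking, timeout or crash recovery here, which is exactly the part that makes the real protocol hard in a distributed setting.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    can_commit: bool = True   # whether this node will vote "yes"
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: vote on whether the transaction can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list) -> bool:
    # Phase 1 (voting): the coordinator asks every participant to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (completion): commit only if every vote was "yes".
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


if __name__ == "__main__":
    nodes = [Participant("a"), Participant("b", can_commit=False)]
    print(two_phase_commit(nodes), [p.state for p in nodes])
```

One "no" vote aborts the transaction on every node, which is how consistency is kept.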
Availability and partition-tolerance: Multi-Version Concurrency Control (MVCC)
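A small sketch of the MVCC idea, assuming an optimistic, revision-based variant: every write creates a new version instead of overwriting, and a write based on a stale revision is rejected. `MVCCStore` and its methods are invented for illustration and are not any database's API.

```python
class MVCCStore:
    """Toy multi-version store: revision numbers start at 1, 0 means 'no version yet'."""

    def __init__(self):
        self._versions = {}  # key -> list of values; list index + 1 is the revision

    def read(self, key):
        """Return (latest revision, latest value), or (0, None) if the key is unknown."""
        versions = self._versions.get(key, [])
        if not versions:
            return (0, None)
        return (len(versions), versions[-1])

    def read_at(self, key, rev):
        """Old versions stay readable; nothing is overwritten in place."""
        return self._versions[key][rev - 1]

    def write(self, key, value, base_rev):
        """Append a new version only if the writer saw the latest revision."""
        versions = self._versions.setdefault(key, [])
        if base_rev != len(versions):
            return False  # conflict: someone else wrote in the meantime
        versions.append(value)
        return True


if __name__ == "__main__":
    store = MVCCStore()
    store.write("doc", "v1", base_rev=0)
    rev, _ = store.read("doc")
    print(store.write("doc", "v2", base_rev=rev))   # accepted
    print(store.write("doc", "v2b", base_rev=rev))  # rejected, rev is now stale
```

Readers are never blocked by writers, since they can keep reading the version they started with.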
Consistency and partition-tolerance: Paxos; in this algorithm, every node is either a proposer or an acceptor
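To show what proposers and acceptors do, here is a rough single-value Paxos sketch with everything in one process. The names `Acceptor` and `propose` are illustrative only; message loss, retries and learners are left out, so this is a sketch of the two rounds rather than a usable implementation.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0           # highest proposal number promised so far
        self.accepted_n = 0         # number of the accepted proposal (0 = none)
        self.accepted_value = None

    def prepare(self, n):
        # Promise to ignore proposals numbered below n, and report
        # any value this acceptor has already accepted.
        if n > self.promised:
            self.promised = n
            return (True, self.accepted_n, self.accepted_value)
        return (False, 0, None)

    def accept(self, n, value):
        # Accept unless a higher-numbered proposal has been promised.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """Run one proposal; return the chosen value, or None if it failed."""
    majority = len(acceptors) // 2 + 1
    # Round 1: prepare / promise.
    promises = [r for r in (a.prepare(n) for a in acceptors) if r[0]]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, the proposer must adopt
    # the one with the highest proposal number instead of its own.
    prior = max(promises, key=lambda r: r[1])
    if prior[1] > 0:
        value = prior[2]
    # Round 2: accept / accepted.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


if __name__ == "__main__":
    cluster = [Acceptor() for _ in range(3)]
    print(propose(cluster, 1, "A"))
    print(propose(cluster, 2, "B"))
```

Once a majority has accepted a value, later proposals can only re-propose that same value, which is where the consistency comes from.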
Why document-oriented DBs for big data?
sharding: the horizontal partitioning of a DB; database rows or documents are split into subsets that are stored on different nodes
replication: storing the same rows or documents on different nodes to achieve fault tolerance
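Sharding and replication can be sketched together: hash a document id to pick a shard, then place each shard's copies on distinct nodes. The functions `shard_for` and `place_copies` are made-up names, CRC32 is only a stand-in hash, and the round-robin placement is an assumption for illustration rather than how any particular database allocates shards.

```python
import zlib


def shard_for(doc_id: str, n_shards: int) -> int:
    """Pick the shard for a document by hashing its id (CRC32 as a stand-in)."""
    return zlib.crc32(doc_id.encode("utf-8")) % n_shards


def place_copies(n_shards: int, n_replicas: int, nodes: list) -> dict:
    """Assign the primary plus n_replicas copies of each shard to distinct nodes."""
    copies = n_replicas + 1
    if copies > len(nodes):
        raise ValueError("not enough nodes to keep every copy on a different node")
    placement = {}
    for shard in range(n_shards):
        # Round-robin: start at a different node for each shard.
        placement[shard] = [nodes[(shard + i) % len(nodes)] for i in range(copies)]
    return placement


if __name__ == "__main__":
    nodes = ["node-1", "node-2", "node-3"]
    print(place_copies(n_shards=3, n_replicas=1, nodes=nodes))
    print(shard_for("user-42", 3))
```

With one replica per shard, losing any single node still leaves one copy of every shard available.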
Finally ElasticSearch now
pros: full-text search, retrieval of time-based data, storing unstructured data
cons: bad at linked data
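The two main strengths (full-text search and time-based retrieval) combine naturally in one query. Below is a sketch of a search request body in the Elasticsearch query DSL, built as a plain Python dict; the field names `message` and `@timestamp` and the helper `build_query` are assumptions for illustration, and nothing is actually sent to a cluster.

```python
import json


def build_query(text: str, since: str = "now-1d") -> dict:
    """Body for a search request: full-text match plus a time-range filter.

    Assumes documents have a text field `message` and a date field `@timestamp`.
    """
    return {
        "query": {
            "bool": {
                # Full-text search on the analysed text field.
                "must": [{"match": {"message": text}}],
                # Time window; `since` uses date math such as "now-1d".
                "filter": [{"range": {"@timestamp": {"gte": since}}}],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_query("connection timeout"), indent=2))
```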
Concepts:
Index: comparable to a DB in a relational DBMS
Document: a data item of an index
Data stream: a set of indexes that follow the same naming pattern
Shard: a horizontal partition of an index
Replica: a copy of a shard
Node: an instance of ES
Cluster: multiple nodes that cooperate to manage the same indexes
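To tie index, shard and replica together: the number of shards and replicas is set per index when it is created. A sketch of the request body, built as a Python dict and assuming it would be sent with PUT to the index name; the helper `index_settings` is a made-up name and no request is made here.

```python
import json


def index_settings(n_shards: int, n_replicas: int) -> dict:
    """Body for creating an index with a given number of shards and replicas."""
    return {
        "settings": {
            "number_of_shards": n_shards,      # primary shards of the index
            "number_of_replicas": n_replicas,  # copies kept of each primary shard
        }
    }


if __name__ == "__main__":
    print(json.dumps(index_settings(3, 1), indent=2))
```

With 3 shards and 1 replica, the cluster holds 6 shard copies in total for this index.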
Components:
Elasticsearch (ES): the database itself
Filebeat: listens for updates in files and loads the updates into indexes
Metricbeat: monitors the status of the system
Logstash: moves data from data sources into ES
Kibana: front-end user interface
scaling in ES can be either horizontal (adding more nodes to a cluster) or vertical (provisioning a more powerful node)