CCC review W8

Posted on 2024-05-30

Elasticsearch (I hate this database)

“Big data” challenges and architectures

The four Vs of big data: Volume, Velocity (how frequently new data is brought into the system), Variety (the complexity of the data schema), Veracity (how far the data can be trusted; the more diverse and unstructured the sources, the lower the veracity)

Not relational and not SQL: SQL databases do a great job on consistency, but they do not work well for big data

Data models for distributed systems:

key-value stores (fast, structured), BigTable DBMSs, document-oriented DBMSs (store data as structured documents, e.g. Elasticsearch)

Database clusters: why we also need the DB to run as a cluster

Distribute the computing load over multiple computers, which improves availability

Store multiple copies of the data

A shard is a partition of your database (think of it as one part of the whole). Copies of the same shard can be stored on different nodes, and we can choose how many replicas we want to keep

Federated DB architecture: different DBs with different tables run on many nodes, but there is only one entry node

Elasticsearch cluster architecture:

Two node types, master and data; one node can have more than one type.

More than one node can have the master role, but only one node in the cluster can be the master at any one time (the other nodes with the master role stay master-eligible)

every node can act as a coordinating node (coordinating query execution)

Indexes can have shards and replicas. Elasticsearch also reports a cluster status: green means every primary shard and replica is allocated, yellow means all primaries are allocated but there are not enough nodes to distribute all the replicas
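A minimal sketch of how the status colours could be derived from shard allocation. The dictionary keys and the function name are made up for illustration, and the third status, red (a primary shard is unallocated), is not in the notes above but is part of the real health report.

```python
def cluster_health(shards):
    """Return 'green', 'yellow' or 'red' for a list of shard records.

    Each record is a hypothetical dict with:
      primary_assigned  - True if the primary copy lives on some node
      replicas_wanted   - number of replicas configured for the shard
      replicas_assigned - number of replicas actually placed on nodes
    """
    # Red: at least one primary is missing, so some data is unavailable.
    if any(not s["primary_assigned"] for s in shards):
        return "red"
    # Yellow: data is all there, but some replicas could not be placed.
    if any(s["replicas_assigned"] < s["replicas_wanted"] for s in shards):
        return "yellow"
    # Green: every primary and every replica is allocated.
    return "green"


one_node = [{"primary_assigned": True, "replicas_wanted": 1, "replicas_assigned": 0}]
print(cluster_health(one_node))  # yellow
```

A single-node cluster with one replica configured ends up yellow, because a replica is never placed on the same node as its primary.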

consistency (clients receive the same answer from every node of the cluster), availability (clients receive a response from at least one node), partition-tolerance (the cluster keeps operating when one or more nodes are offline)

Brewer's CAP theorem: you can only achieve two of the three at the same time

to achieve two of them:

Consistency and availability: two-phase commit

it works great when the cluster is co-located, but not when the nodes are distributed
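A toy sketch of two-phase commit, to make the two phases concrete. The class and function names are hypothetical, everything runs in one process, and timeouts, logging and crash recovery are left out on purpose.

```python
class Participant:
    """One node taking part in the transaction (illustrative only)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local change could be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (voting): the coordinator asks every participant to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only if every single vote was yes.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


nodes = [Participant("a"), Participant("b", can_commit=False)]
print(two_phase_commit(nodes))  # False: one "no" vote aborts everyone
```

The coordinator has to hear from every participant before deciding, which is why this suits co-located clusters better than ones spread over slow or unreliable links.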

Availability and partition-tolerance: Multi-Version Concurrency Control (MVCC)
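A rough sketch of the idea behind MVCC: writes never overwrite, they add a new version, so a reader holding an older snapshot still gets an answer without blocking anyone. `MVCCStore` and its methods are invented for this example and do not mirror any particular database.

```python
class MVCCStore:
    """Keeps every version of every key (illustrative, single process)."""

    def __init__(self):
        self._versions = {}  # key -> list of (version, value), oldest first
        self._clock = 0      # global, monotonically increasing version number

    def write(self, key, value):
        # A write appends a new version instead of replacing the old one.
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        # A snapshot is just the version number current at this moment.
        return self._clock

    def read(self, key, at_version=None):
        # Return the newest value that is not newer than the snapshot.
        if at_version is None:
            at_version = self._clock
        for version, value in reversed(self._versions.get(key, [])):
            if version <= at_version:
                return value
        return None


store = MVCCStore()
store.write("doc", "v1")
snap = store.snapshot()
store.write("doc", "v2")
print(store.read("doc", snap))  # v1: the old snapshot is still readable
print(store.read("doc"))        # v2: latest version
```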

Consistency and partition-tolerance: Paxos. In this algorithm, every node is either a proposer or an acceptor
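A heavily simplified, single-value sketch of the proposer and acceptor roles, run in one process with no network, failures or learners. All names are hypothetical, and this is only meant to show the prepare/accept rounds and the majority rule, not a usable implementation.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest proposal number promised so far
        self.accepted_n = -1        # number of the proposal accepted, if any
        self.accepted_value = None  # value of the proposal accepted, if any

    def on_prepare(self, n):
        # Promise to ignore proposals numbered lower than n.
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", None, None)

    def on_accept(self, n, value):
        # Accept unless a higher-numbered prepare has been promised.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """Run one proposer round; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    # Phase 1: collect promises.
    promises = [r for r in (a.on_prepare(n) for a in acceptors) if r[0] == "promise"]
    if len(promises) < majority:
        return None
    # If some acceptor already accepted a value, the proposer must adopt
    # the one with the highest proposal number instead of its own.
    previous = max(promises, key=lambda r: r[1])
    if previous[1] >= 0:
        value = previous[2]
    # Phase 2: ask the acceptors to accept the value.
    acks = sum(a.on_accept(n, value) for a in acceptors)
    return value if acks >= majority else None


cluster = [Acceptor(), Acceptor(), Acceptor()]
print(propose(cluster, 1, "x"))  # x
print(propose(cluster, 2, "y"))  # x: the already accepted value wins
```

Once a value has been accepted by a majority, later proposals can only re-propose it, which is where the consistency comes from.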

Why Document-Oriented DB for big data?

sharding: the horizontal partitioning of a DB; database rows or documents are split into subsets that are stored on different nodes

replication: storing the same rows or documents on different nodes to achieve fault tolerance
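Sharding and replication together can be sketched like this. The hash function (CRC32) and the round-robin placement are assumptions picked to keep the example short; they are not how Elasticsearch actually routes documents or allocates shards.

```python
import zlib


def shard_for(doc_id, num_shards):
    # A stable hash, so every node maps the same id to the same shard.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def place_shards(num_shards, num_replicas, nodes):
    """Return {shard: [node, ...]}; the first node holds the primary copy."""
    copies = num_replicas + 1
    if copies > len(nodes):
        # Two copies of a shard on one node would give no fault tolerance.
        raise ValueError("not enough nodes to put every copy on its own node")
    placement = {}
    for shard in range(num_shards):
        # Round-robin: consecutive nodes, wrapping around the node list.
        placement[shard] = [nodes[(shard + i) % len(nodes)] for i in range(copies)]
    return placement


placement = place_shards(num_shards=3, num_replicas=1, nodes=["n1", "n2", "n3"])
print(placement[0])  # ['n1', 'n2']
print(placement[shard_for("tweet-42", 3)])  # the nodes holding that document
```

With one replica per shard, losing any single node still leaves at least one copy of every shard on the remaining nodes.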

Finally, Elasticsearch

pros: full-text search, retrieval of time-based data, storing unstructured data
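Full-text search engines are built around an inverted index (term to documents). The notes above do not go into it, so take this as a toy sketch with made-up function names: the tokenizer is a plain lowercase split, with no stemming or scoring.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's documents, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {1: "big data cluster", 2: "big shard", 3: "data node"}
index = build_inverted_index(docs)
print(search(index, "big data"))  # {1}
```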

cons: bad at linked data

Concepts:

Index: comparable to a DB in a relational DBMS

Document: a data item of an index

Data stream: a set of indexes that follow the same naming pattern

Shard: a horizontal partition of an index

Replica: a copy of a shard

Node: an instance of ES

Cluster: multiple nodes that cooperate to manage the same indexes
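To tie shards and replicas back to an index, here is a small sketch that builds the settings body used when creating one. `number_of_shards` and `number_of_replicas` are real Elasticsearch index settings; the helper names and the numbers are only for illustration, and nothing is sent to a cluster.

```python
import json


def index_body(num_shards, num_replicas):
    # Request body for creating an index with the given layout.
    return {
        "settings": {
            "number_of_shards": num_shards,      # primary shards
            "number_of_replicas": num_replicas,  # copies of each primary
        }
    }


def total_shard_copies(body):
    # Every primary shard exists once, plus once per replica.
    s = body["settings"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])


body = index_body(3, 2)
print(json.dumps(body))
print(total_shard_copies(body))  # 9: 3 primaries + 6 replicas
```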

Components:

ES: the database itself

FileBeat: listens for updates in files and loads the updates into indexes

MetricBeat: monitors the status of the system

Logstash: moves data from data sources into ES

Kibana: frontend user interface

scaling in ES can be either horizontal (adding more nodes to a cluster) or vertical (provisioning a more powerful node)