1. Data-Intensive Applications
Scalability
Percentiles
- tail latency amplification
- Dashboard: forward decay, t-digest, HdrHistogram
- SLOs, SLAs: e.g., the service may be considered up if the median response time is under 200 ms and the 99th percentile under 1 s, and it may be required to be up at least 99.9% of the time.
- Queueing delay, head-of-line blocking. (A quick percentile sketch follows this list.)
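A minimal sketch of what these percentiles mean in practice (nearest-rank style; the data is made up):

```python
# Toy percentile computation over a window of response times.
def percentile(sorted_times, p):
    idx = int(len(sorted_times) * p / 100)
    return sorted_times[min(idx, len(sorted_times) - 1)]

response_times_ms = sorted([32, 45, 47, 51, 60, 88, 90, 120, 450, 1200])
print("p50:", percentile(response_times_ms, 50))  # typical request: 88 ms
print("p99:", percentile(response_times_ms, 99))  # worst 1%: 1200 ms

# Tail latency amplification: a page that fans out to 100 backend calls
# is slower than p99 whenever ANY single call is, i.e. with probability
# 1 - 0.99**100, about 63% of page loads.
```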
Coping with Load
- Scaling up (vertical): moving to a more powerful machine.
- Scaling out (horizontal): distributing the load across multiple smaller machines.
- Elastic: automatically adding computing resources when load increases.
- Stateless services are easy to distribute, but stateful ones are complex.
Maintainability
- fixing bugs
- keeping the system operational, investigating failures
- adapting it to new platforms
- modifying it for new use cases
- repaying technical debt
- adding new features
- Operability - easy to keep the system running smoothly
Simplicity - easy to understand the system
Avoid:
- explosion of the state space
- tight coupling of modules
- tangled dependencies
- inconsistent naming and terminology
Evolvability - easy to change the system, adapt it for new use cases.
2. Data Models and Query Languages
Examples
- Model the real world in terms of objects or data structures, and APIs that manipulate those data structures.
- Express them in terms of a general-purpose data model, such as JSON, XML documents, tables in relational databases, or a graph model.
- Decide a way of representing that JSON/XML/relational/graph data in terms of bytes in memory, on disk, or on a network. Allow the data to be queried, searched, manipulated and processed in various ways.
- Lower layer, hardware engineers represent bytes in terms of electrical currents, pulses of light, magnetic fields, and more.
Relational Model Versus Document Model
- SQL - transaction processing, batch processing
- The relational model dominated its competitors: the network model, the hierarchical model, object databases, XML databases.
NoSQL: the driving forces behind its adoption are:
- A need for greater scalability: very large datasets or very high write throughput.
- preference for free and open source software over commercial database products.
- Specialized query operations
- Frustration with restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model.
Object-Relational Mismatch
A translation layer is needed between the objects in the application code and the database's model of tables, rows, and columns.
- ORM (object-relational mapping) frameworks like ActiveRecord and Hibernate reduce the boilerplate, but can't completely hide the mismatch.
A one-to-many relationship from the user to these items can be represented in various ways:
- Put them in separate tables, with a foreign key reference to the users table.
- Use a database with support for structured datatypes (e.g., XML or JSON columns), with querying and indexing inside those documents.
- Encode them as a JSON or XML document, store it in a text column in the database, and let the application interpret its structure and content.
Many-to-One and Many-to-Many Relationships
There are advantages to having standardized lists of geographic regions and industries, and letting users choose from a drop-down list or auto-completer:
- Consistent style and spelling across profiles
- Avoid ambiguity
- Ease of updating
- Localization support
- Better search
many-to-one relationship
Whether to store an ID or a text string is a question of duplication.
Because an ID has no meaning to humans, it never needs to change. If the information is duplicated, all the redundant copies need to be updated; that incurs write overheads and risks inconsistencies.
Removing such duplication is the key idea behind normalization in databases.
Many-to-Many relationship
Organizations and schools as entities
- Each organization, school, or university could have its own web page (with its logo and other information), and each résumé could link to it.
- Recommendations: a recommendation written by one user for another references both users.
Compare
Document data model:
schema flexibility, better performance due to locality, closer to the data structures used by the application
Relational model:
providing better support for joins, and many-to-one and many-to-many relationships.
Polyglot persistence: relational and document databases are increasingly used side by side.
Query Languages for Data
SQL: declarative - you just specify the pattern of the data you want, and the database system's query optimizer decides how to achieve that goal.
IMS, CODASYL, and general-purpose programming languages: imperative - step through the code line by line, evaluating conditions, updating variables, and deciding whether to go around the loop one more time.
Declarative query language is attractive because:
- Possible for the database system to introduce performance improvements without requiring any changes to queries.
- Have a better chance of getting faster through parallel execution, because they specify only the pattern of the results, not the algorithm used to determine them.
Declarative Queries on the Web
Specify that the selected item "Sharks" should have a blue background:
CSS example:
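A minimal version of the rule (reconstructed; this is the li.selected > p rule referenced below):

```css
li.selected > p {
    background-color: blue;
}
```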
XSL example:
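A rough XSLT equivalent (sketched from memory of the book's example; details may differ):

```xml
<xsl:template match="li[@class='selected']/p">
    <fo:block background-color="blue">
        <xsl:apply-templates/>
    </fo:block>
</xsl:template>
```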
Both are declarative languages for specifying the styling of a document. With an imperative approach in JavaScript, using the core Document Object Model (DOM) API, the result might look like this:
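An illustrative sketch (not verbatim from the book):

```javascript
var liElements = document.getElementsByTagName("li");
for (var i = 0; i < liElements.length; i++) {
    if (liElements[i].className === "selected") {
        var children = liElements[i].childNodes;
        for (var j = 0; j < children.length; j++) {
            var child = children[j];
            if (child.nodeType === Node.ELEMENT_NODE && child.tagName === "P") {
                child.setAttribute("style", "background-color: blue");
            }
        }
    }
}
```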
Not only longer and harder to read, but also:
- If the selected class is removed (e.g., when another item is selected), the blue background won't be removed until the entire page is reloaded. With CSS, the browser automatically detects when the li.selected > p rule no longer applies.
- If you want to take advantage of a new API, say document.getElementsByClassName("selected") or document.evaluate(), you have to rewrite the code. CSS implementations can improve without breaking compatibility.
MapReduce Querying
MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between. It is based on the map(collect) and reduce(fold, inject) functions that exist in many functional programming languages.
An example to report how many sharks you have sighted per month:
In PostgreSQL:
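Reconstructed along the lines of the book's observations example (table and column names assumed):

```sql
SELECT date_trunc('month', observation_timestamp) AS observation_month,
       sum(num_animals)                           AS total_animals
FROM observations
WHERE family = 'Sharks'
GROUP BY observation_month;
```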
The same can be expressed with MongoDB’s MapReduce features as follows:
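Reconstructed sketch (collection and field names assumed):

```javascript
db.observations.mapReduce(
    function map() {
        var year  = this.observationTimestamp.getFullYear();
        var month = this.observationTimestamp.getMonth() + 1;
        emit(year + "-" + month, this.numAnimals);
    },
    function reduce(key, values) {
        return Array.sum(values);
    },
    {
        query: { family: "Sharks" },     // filter applied before map
        out: "monthlySharkReport"        // results written to this collection
    }
);
```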
Result:
map: emit(“1995-12”, 3) and emit(“1995-12”,4)
reduce: reduce(“1995-12”, [3,4]) -> 7
- Restrictions: the map and reduce functions must be pure functions, cannot perform additional database queries, and must not have any side effects.
MapReduce is a fairly low-level programming model for distributed execution on a cluster of machines. Higher-level query languages like SQL can be implemented as a pipeline of MapReduce operations, but there are also many distributed implementations of SQL that don’t use MapReduce.
MongoDB 2.2 added support for a declarative query language called the aggregation pipeline:
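A sketch of the equivalent pipeline (field names as above):

```javascript
db.observations.aggregate([
    { $match: { family: "Sharks" } },
    { $group: {
        _id: {
            year:  { $year:  "$observationTimestamp" },
            month: { $month: "$observationTimestamp" }
        },
        totalAnimals: { $sum: "$numAnimals" }
    } }
]);
```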
Graph-Like Data Models
Many-to-many relationships are an important distinguishing feature between different data models.
A graph consists of two kinds of objects: vertices, edges
Social graphs, The web graph, Road or rail networks.
Property Graphs
Each vertex consists of:
- A unique identifier
- A set of outgoing edges
- A set of incoming edges
- A collection of properties(key-value pairs)
Each edge consists of:
- A unique identifier
- The vertex at which the edge starts(the tail vertex)
- The vertex at which the edge ends(the head vertex)
- A label to describe the kind of relationship between the two vertices
- A collection of properties(key-value pairs)
The Cypher Query Language
Cypher is a declarative query language for property graphs, created for the Neo4j graph database.
The SQL equivalent needs a recursive common table expression (WITH RECURSIVE) and is far more verbose.
Example: find all people born in a place that is within the US, following any number of within edges:
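A sketch of such a query (modeled on the book's emigration example; labels and property names assumed):

```cypher
MATCH
  (person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (us:Location {name:'United States'})
RETURN person.name
```

The WITHIN*0.. pattern follows zero or more within edges, which is exactly the part that requires WITH RECURSIVE in SQL.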
Triple-Stores and SPARQL
simple three-part statements: (subject, predicate, object), e.g., (Jim, likes, bananas).
- Semantic web
Resource Description Framework (RDF) was intended as a mechanism for different websites to publish data in a consistent format, allowing data from different websites to be automatically combined into a web of data.
The subject, predicate, and object of a triple are often URIs, such as http://my-company.com/namespace#within
- RDF data model
SPARQL query language
e.g., query for people born in a particular place:
Cypher: (person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (location)
SPARQL: ?person :bornIn / :within* ?location.
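In full, the query might look like this (a sketch; the example prefix is assumed):

```sparql
PREFIX : <urn:example:>

SELECT ?personName WHERE {
    ?person :name ?personName.
    ?person :bornIn / :within* ?location.
    ?location :name "United States".
}
```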
Datalog
Datalog's data model is similar to the triple-store model: facts are written as predicate(subject, object), and queries are built up from rules, for example:
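Rules can refer to themselves, which is how graph traversal is expressed; a sketch of the recursive rule:

```prolog
% base case: a location "is within" anything bearing that name itself
within_recursive(Location, Name) :- name(Location, Name).

% recursive case: follow one within(...) edge, then recurse
within_recursive(Location, Name) :- within(Location, Via),
                                    within_recursive(Via, Name).
```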
Summary
Moving beyond relational databases, NoSQL datastores have diverged in two main directions:
- Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.
- Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.
3. Storage and Retrieval
Data Structures
- Log - an append-only data file.
- Index - efficiently find the value for a particular key in database
Hash Indexes
- Bitcask(the default storage engine in Riak)
Keep the hash map completely in memory; each key maps to the byte offset in the data file where the value can be found. This gives high-performance reads and writes.
It is feasible when there are many writes per key but not too many distinct keys, so that all keys fit in the available memory.
- What about running out of disk space?
Break the log into segments of a certain size, then perform compaction on these segments: throw away duplicate keys, keeping only the most recent update for each key.
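A toy version of the idea, assuming a single log file and comma-separated records (not Bitcask's real format; keys and values must not contain commas or newlines):

```python
import os

class HashIndexedLog:
    """Toy append-only log with an in-memory hash index (key -> byte offset)."""

    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}                            # key -> offset of latest value

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(f"{key},{value}\n".encode())  # real formats are binary
        self.f.flush()
        self.index[key] = offset                   # newest write wins

    def get(self, key):
        self.f.seek(self.index[key])               # one seek, then one read
        return self.f.readline().decode().rstrip("\n").split(",", 1)[1]
```

Compaction would rewrite a segment keeping only the latest value per key, then rebuild the index.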
Issues in an implementation
- File format - rather than CSV, it is faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string.
- Deleting records - append a special deletion record to the data file (a tombstone); previous values for the key are discarded during merging.
- Crash recovery - speed up restarts by storing a snapshot of each segment's hash map on disk.
- Partially written records - use checksums, allowing corrupted parts of the log to be detected and ignored.
- Concurrency control - have only one writer thread.
Why append-only?
- Appending and segment merging are sequential write operations, generally much faster than random writes.
- Concurrency control and crash recovery are much simpler.
Hash table index limitations:
- The hash table must fit in memory.
- Range queries are not efficient.
SSTables and LSM-Trees
SSTables require that the sequence of key-value pairs is sorted by key.
- Merging segments is simple and efficient, like mergesort.
- To find a particular key you no longer need an index of all keys in memory: jump to the offset of a nearby known key and scan from there. A sparse in-memory index with offsets for some of the keys is enough.
- It is possible to group records into a block and compress it before writing to disk.
The Log-Structured Merge-Tree (LSM-Tree) is based on this principle of merging and compacting sorted files. Lookups can be slow for keys that do not exist (storage engines often use Bloom filters to mitigate this).
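A sketch of merging sorted segments with compaction (pass segments oldest first; heapq.merge yields equal keys in input order, so the newest value wins):

```python
import heapq

def compact_and_merge(*segments):
    """Merge sorted (key, value) segments; the last segment's value wins."""
    merged = {}
    for key, value in heapq.merge(*segments, key=lambda kv: kv[0]):
        merged[key] = value          # later (newer) duplicates overwrite
    return list(merged.items())      # insertion order == sorted key order

old = [("apple", 1), ("banana", 2), ("cherry", 3)]
new = [("banana", 20), ("date", 4)]
assert compact_and_merge(old, new) == [
    ("apple", 1), ("banana", 20), ("cherry", 3), ("date", 4)]
```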
B-Trees
- Crash recovery - WAL(write ahead log) or redo log, every B-tree modification must be written before it can be applied to the pages of the tree itself. The log is used to restore the B-tree back to a consistent state.
- Concurrency control with latches(lightweight locks).
Optimizations
- Use a copy-on-write scheme instead of overwriting pages and maintaining a WAL for crash recovery.
- Abbreviating the sorted key names.
- Lay out the tree so that leaf pages appear in sequential order on disk (LSM-trees find this easier).
- Additional pointers: sibling pointers to the left and right allow scanning keys in order without jumping back to parent pages.
- Fractal trees with log-structured ideas to reduce disk seeks.
Compare B-Trees and LSM-Trees
- LSM-trees are typically faster for writes.
- B-trees are thought to be typically faster for reads.
Secondary index
There might be many rows with the same key; two approaches:
- make each value in the index a list of matching row identifiers.
- make each key unique by appending a row identifier to it.
Storing values within the index
- Clustered index - storing all row data within the index
- Non-clustered index - storing only references to the data within the index
Multi-column indexes
A concatenated index simply combines several fields into one key by appending one column to another.
Spatial Indexes
- Translate a two-dimensional location into a single number using a space-filling curve, and then use a regular B-tree index.
- More commonly, R-trees are used.
Full-text search and fuzzy indexes
- Fuzzy querying requires different techniques: e.g., Lucene uses a finite state automaton over the characters in the keys, similar to a trie, which supports searching for words within a given edit distance.
In-Memory Databases
Keep the entire database in memory, potentially distributed across several machines.
- Performance: faster because they avoid the overhead of encoding in-memory data structures in a form that can be written to disk.
- They can provide data models that are difficult to implement with disk-based indexes.
- Anti-caching approach: evict the least recently used data from memory to disk when there is not enough memory, and load it back when accessed again.
- Non-volatile memory (NVM) may change storage engine design further.
Transaction Processing or Analytics?
- Online transaction processing (OLTP)
- Online analytic processing (OLAP) - a different access pattern:
needs to scan over a huge number of records, reading only a few columns per record, and calculate aggregate statistics (count, sum, average).
Data Warehousing - a separate database that analysts can query without affecting OLTP operations.
Database administrators are usually reluctant to let business analysts run ad hoc analytic queries on an OLTP database:
those queries are expensive, scanning large parts of the dataset, which harms the performance of concurrently executing transactions.
Extract-Transform-Load(ETL)
Data is extracted from OLTP databases(using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse.
The divergence between OLTP databases and data warehouses
On the surface, a data warehouse and a relational OLTP database look similar, because both have a SQL query interface, but the internals can be quite different: each is optimized for either OLTP or OLAP query patterns, not both.
A plethora of open source SQL-on-Hadoop projects:
Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo and Apache Drill.
Stars and Snowflakes: Schemas for Analytics
A fact table sits in the middle, surrounded by its dimension tables; columns in the fact table are foreign key references into the dimension tables.
- Snowflake schema
The dimension tables are further broken down into sub-dimensions.
Column-Oriented Storage
In OLTP databases, storage is laid out in a row-oriented fashion: all the values from one row are stored next to each other.
Column-oriented storage instead stores all the values from each column together, saving a lot of work and space. Each column file contains the values in the same row order, so it is easy to reassemble an entire row.
Column Compression
Bitmap encoding (one bitmap per distinct value), plus run-length encoding:
74 = 000010000000000000 -> "4,1" (4 zeros, 1 one, rest zeros)
- WHERE product_sk IN (30, 68, 69): load the three bitmaps and take the bitwise OR, which can be done very efficiently.
- WHERE product_sk = 31 AND store_sk = 3: load the two bitmaps and take the bitwise AND.
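A sketch of the bitmap idea, using Python ints as arbitrary-length bitmaps (the data is made up; a real engine would also run-length encode):

```python
rows = [69, 69, 69, 74, 31, 31, 31, 74, 55, 74]   # product_sk column

# one bitmap per distinct value: bit i is set if row i has that value
bitmaps = {}
for i, v in enumerate(rows):
    bitmaps[v] = bitmaps.get(v, 0) | (1 << i)

# WHERE product_sk IN (30, 68, 69): OR the bitmaps together
mask = bitmaps.get(30, 0) | bitmaps.get(68, 0) | bitmaps.get(69, 0)
print([i for i in range(len(rows)) if mask >> i & 1])   # [0, 1, 2]
```

An AND query works the same way, combining a second column's bitmap with &.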
Memory bandwidth and vectorized processing
- Reducing the volume of data that needs to be loaded from disk.
- Making efficient use of CPU cycles: operating on chunks of compressed column data that fit in the CPU's L1 cache, in a tight loop, lets the CPU run much faster. This is called vectorized processing.
Sort Order in Column Storage
Several different sort orders can be used:
- C-Store: store the same data sorted in several different ways; the data needs to be replicated to multiple machines anyway, for redundancy.
Writing to Column-Oriented Storage
Update-in-place, as with a B-tree, is not possible with compressed columns: inserting a row in the middle would require rewriting all the column files.
Instead, an LSM-tree-style approach works well: writes first go to an in-memory sorted structure and are merged with the column files on disk in bulk.
Aggregation: Data Cubes and Materialized Views
Materialized aggregates: cache some of the counts or sums that queries use most often.
- Materialized view: an actual copy of the query results, written to disk and updated when the underlying data changes. Writes become more expensive, so this is used to improve read performance.
- Data cube (OLAP cube): an n-dimensional grid of aggregates; each cell holds an aggregate (e.g., sum or count) precomputed along each dimension.
Certain queries become very fast this way, but a cube doesn't have the same flexibility as querying the raw data.
4. Encoding and Evolution
Evolvability: Make it easy to change.
- Server-side applications, perform a rolling upgrade, deploying the new version to a few nodes at a time, gradually working your way through all nodes. Allow new versions to be deployed without service downtime.
- Client-side applications are at the mercy of the user, who may not install the update for some time.
Old and new versions of the code, old and new data formats potentially coexist in the system.
- Compatibility
- Backward compatibility - newer code can read data that was written by older code.
- Forward compatibility - Older code can read data that was written by newer code.
Formats for Encoding Data
Two representations of data:
- In memory, data is kept in objects, structs, lists, arrays, hash tables, trees and so on. Optimized for efficient access, and manipulation by the CPU.
- When you write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes (e.g., JSON).
We need to translate between the two representations: going from in-memory to a byte sequence is called encoding (serialization or marshalling); the other way is called decoding (deserialization, unmarshalling, parsing).
Language-Specific Formats
Many programming languages come with built-in support for encoding in-memory objects into byte sequences:
- Java - java.io.Serializable
- Ruby - Marshal
- Python - pickle
- Kryo for Java (a third-party library)
Problems are:
- It is very difficult to read the data in another language.
- To restore data as the same object types, the decoding process needs to be able to instantiate arbitrary classes. This is a security problem: if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which often allows them to do terrible things such as remotely executing arbitrary code.
- Versioning data is often an afterthought in these libraries.
- Efficiency, Java’s built-in serialization is notorious for its bad performance and bloated encoding.
JSON, XML, and Binary Variants
Problems:
- There is a lot of ambiguity around the encoding of numbers:
when dealing with large numbers, integers greater than 2^53 cannot be exactly represented in an IEEE 754 double-precision floating-point number.
- JSON and XML have good support for Unicode character strings, but they don't support binary strings.
- Applications that don't use XML/JSON schemas potentially need to hardcode the appropriate encoding/decoding logic instead.
- CSV doesn't have any schema: if an application change adds a new row or column, you have to handle the change manually. CSV is also a quite vague format (what happens if a value contains a comma or a newline character?).
Binary encoding
JSON is less verbose than XML, but both use a lot of space compared to binary formats.
Example record:
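The record (reconstructed from the book's running example):

```json
{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}
```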
Example of MessagePack, a binary encoding for JSON (first bytes of the encoded record; the row is truncated):

| object (3 entries) | string (length 8) | u | s | e | r | N | a | m | e | string (length 6) | M | a | r | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 83 | a8 | 75 | 73 | 65 | 72 | 4e | 61 | 6d | 65 | a6 | 4d | 61 | 72 | … |
- The first byte, 0x83, indicates that what follows is an object (top four bits = 0x80) with three fields (bottom four bits = 0x03).
- The second byte, 0xa8, indicates that what follows is a string (top four bits = 0xa0) that is eight bytes long (bottom four bits = 0x08).
- The next eight bytes are the field name userName in ASCII.
- The next seven bytes encode the six-letter string value Martin with a prefix 0xa6.
Not very much reduction in the size.
Thrift and Protocol Buffers
Both Thrift (Facebook) and Protocol Buffers (Google) require a schema for any data that is encoded. In Thrift, you would describe the schema in the Thrift interface definition language (IDL) like this:
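Reconstructed from the book's Person example (formatting may differ):

```thrift
struct Person {
  1: required string       userName,
  2: optional i64          favoriteNumber,
  3: optional list<string> interests
}
```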
The equivalent schema in Protocol Buffers looks similar:
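Again reconstructed (proto2 syntax, as in the book):

```protobuf
message Person {
    required string user_name       = 1;
    optional int64  favorite_number = 2;
    repeated string interests       = 3;
}
```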
- BinaryProtocol
Uses field tags (the numbers 1, 2, 3) in place of field names.
- CompactProtocol
Packs the field type and the field tag number into a single byte, and uses variable-length integers.
The required and optional markers don't affect how the field is encoded; they only enable a runtime check that fails if a required field is not set, which is useful for catching bugs.
Field tags and Schema Evolution
- Adding new fields
For forward compatibility, old code simply ignores fields with unrecognized tag numbers.
For backward compatibility, as long as each field has a unique tag number, new code can always read old data. The only constraint: a new field cannot be required; it must be optional or have a default value.
- Removing fields
The reverse: you may only remove a field that is optional, and you can never use the same tag number again.
Datatypes and Schema Evolution
In Protocol Buffers it is possible to change an optional (single-valued) field into a repeated (multi-valued) field: new code reading old data sees a list with zero or one elements; old code reading new data sees only the last element of the list.
Thrift has a dedicated list datatype instead, which has the advantage of supporting nested lists.
Avro
Avro started as a subproject of Hadoop.
Use a schema to specify the structure of the data being encoded. Two schema languages:
- Avro IDL, intended for human editing (example below)
- a JSON-based representation, more easily machine-readable
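The same Person record in Avro IDL (reconstructed):

```
record Person {
    string               userName;
    union { null, long } favoriteNumber = null;
    array<string>        interests;
}
```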
There is nothing in the encoded bytes to identify fields or their datatypes; the encoding simply consists of values concatenated together.
To parse the binary data, you go through the fields in the order they appear in the schema and use the schema to tell you the datatype of each field.
- Writer's schema - the schema compiled into the application that encodes the data.
- Reader's schema - the schema expected by the application that decodes the data, reading it from a file or database, receiving it from the network, etc.
The writer's schema and the reader's schema don't need to be the same; they only need to be compatible.
Schema resolution matches up the fields by field name.
- If the code reading the data encounters a field that appears in the writer's schema but not in the reader's schema, it is ignored.
- If the code reading the data expects a field that the writer's schema does not contain, it is filled with the default value declared in the reader's schema.
Compatibility
- Forward compatibility: newly added fields are ignored; removed fields are filled with their default values.
- Backward compatibility: you may only add or remove a field that has a default value.
- null: to make a field nullable you must use a union type, e.g., union { null, long, string } field.
- Changing a field's datatype is possible, provided that Avro can convert the type.
- Changing the name of a field is possible by adding an alias for the old name, so it is backward compatible but not forward compatible.
- Adding a branch to a union type is likewise backward compatible but not forward compatible.
Dynamically Generated Schemas
Avro is friendlier to dynamically generated schemas because it doesn't contain any tag numbers: you can generate a new Avro schema from the database schema and export data in it, repeating this whenever the database schema changes.
Code Generation and Dynamically Typed Languages
Thrift and Protocol Buffers rely on code generation: after a schema has been defined, you can generate code that implements this schema in the programming languages of your choice. In statically typed languages (Java, C++, C#) this allows efficient in-memory structures for decoded data, plus type checking and autocompletion in IDEs.
For dynamically typed programming languages such as JavaScript, Ruby, or Python, there is not much point in generating code, since there is no compile-time type checker to satisfy.
Avro provides optional code generation for statically typed programming languages, but it can be used just as well without any code generation. An Avro object container file is self-describing, since it embeds the writer's schema and all the necessary metadata.
The Merits of Schemas
Schema languages:
- ASN.1, a schema definition language from 1984, used to define various network protocols; its binary encoding (DER) is still used to encode SSL certificates. It supports schema evolution using tag numbers, but it is very complex, so not a good choice for new applications.
- Some protocols are specific to a particular database, and the database vendor provides a driver (e.g., via the ODBC or JDBC APIs) that decodes responses into in-memory data structures.
Modes of Dataflow
Who encodes the data, decodes it?
- Via databases
- Via service calls (REST and RPC)
- Via asynchronous message passing
Dataflow through databases
In a database, the process that writes encodes the data, and the process that reads decodes it.
There may be several different applications or services accessing the database, some running newer code and some older code.
- Old data can stay in the database: adding a new field with a null default allows the schema to evolve without rewriting existing rows (data outlives code).
- When you take snapshots for backups or load data into a data warehouse, the dump is typically encoded consistently using the latest schema.
Dataflow Through Services: REST and RPC
Servers expose an API over the network, the clients can connect to the servers to make requests to that API. The API exposed by the server is known as a service.
Web browsers make requests to web servers: GET requests to download HTML, CSS, JavaScript, images, etc., and POST requests to submit data to the server. The API consists of a standardized set of protocols and data formats (HTTP, URLs, SSL/TLS, HTML, etc.).
A client-side JavaScript application running inside a web browser can use XMLHttpRequest to become an HTTP client (a technique known as Ajax).
- SOA, microservices
A server can itself be a client to another service. This approach is often used to decompose a large application into smaller services by area of functionality: service-oriented architecture (SOA), more recently refined as the microservices architecture.
- A service can impose fine-grained restrictions on what clients can and cannot do.
- Goal of microservices
Make the application easier to change and maintain by making services independently deployable and evolvable.
So the data encoding used by servers and clients must be compatible across versions of the service API.
Web Services
- A client application running on a user's device (e.g., a native app on a mobile device, or a JavaScript web app using Ajax) making requests to a service over HTTP. These requests typically go over the public internet.
- One service making requests to another service owned by the same organization, often located within the same datacenter, as part of a service-oriented/microservices architecture. (Software that supports this kind of use case is sometimes called middleware.)
- One service making requests to a service owned by a different organization, usually via the internet. This category includes public APIs provided by online services, such as credit card processing systems, or OAuth for shared access to user data.
Two popular approaches to web services: REST and SOAP.
- REST is not a protocol but a design philosophy built upon the principles of HTTP. It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, and content type negotiation. An API designed according to the principles of REST is called RESTful.
- SOAP, by contrast, is an XML-based protocol for making network API requests, described with the Web Services Description Language (WSDL). It is most useful with statically typed programming languages.
RESTful APIs tend to favor simpler approaches, typically involving less code generation and automated tooling. A definition format such as OpenAPI, also known as Swagger, can be used to describe RESTful APIs and produce documentation.
The problems with remote procedure calls (RPCs)
RPC tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process (an abstraction called location transparency). But the approach is fundamentally flawed:
- A local function call is predictable; a network request is not: it could be lost, or the remote machine may be slow or unavailable.
- A local function call either returns a result, throws an exception, or never returns. A network request might time out and return without a result, leaving you not knowing what happened.
- If you retry a failed network request, it could be that the requests are actually getting through and only the responses are being lost.
- A local function call takes about the same time on every call, but a network request is much slower and its latency is wildly variable: when the network is congested or the remote service is overloaded, it can take very long to complete.
- When you call a local function, you can efficiently pass references (pointers) to objects in local memory; for a network request, all parameters need to be encoded into a sequence of bytes that can be sent over the network.
- The client and service may be implemented in different programming languages, so the RPC framework must translate datatypes from one language into another.
REST appeals because it doesn’t hide the fact that it’s a network protocol, although this doesn’t seem to stop people from building RPC libraries on top of REST.
Current directions for RPC
Various RPC frameworks have been built on top of the encodings mentioned earlier: gRPC is an RPC implementation using Protocol Buffers, Finagle uses Thrift, and Rest.li uses JSON over HTTP.
Finagle and Rest.li use futures (promises) to encapsulate asynchronous actions that may fail. Futures also simplify making requests to multiple services in parallel and combining their results.
- Streams: gRPC supports streams, where a call consists of not just one request and one response, but a series of requests and responses over time.
Custom RPC protocols with a binary encoding format can achieve better performance than something generic like JSON over REST.
But a RESTful API has other advantages:
- Good for experimentation and debugging (you can simply make requests with a web browser or curl).
- Supported by all mainstream programming languages and platforms.
- A vast ecosystem of tools is available (servers, caches, load balancers, proxies, firewalls, monitoring and debugging tools, testing tools, etc.).
For these reasons, REST is the predominant style for public APIs; the main focus of RPC frameworks is on requests between services owned by the same organization, typically within the same datacenter.
Data encoding and evolution for RPC
It is important that RPC clients and servers can be changed and deployed independently.
It is reasonable to assume that servers are updated first and clients second; you therefore need backward compatibility on requests and forward compatibility on responses.
The backward and forward compatibility properties of an RPC scheme are inherited from whatever encoding it uses.
- Service compatibility is made harder by the fact that RPC is often used for communication across organizational boundaries, so the provider of a service has no control over its clients and has to maintain compatibility for a long time.
- There is no agreement on how API versioning should work (i.e., how a client indicates which version of the API it wants to use).
- For RESTful APIs, common approaches are a version number in the URL or in the HTTP Accept header.
- For services that use API keys to identify a particular client, another option is to store the client's requested API version on the server and allow it to be updated through a separate administrative interface.
Message-Passing Dataflow
asynchronous message-passing systems: somewhere between RPC and databases
- As with RPC, a message is delivered to another process with low latency.
- As with databases, the message is not sent via a direct network connection but via an intermediary called a message broker (also known as a message queue or message-oriented middleware), which stores the message temporarily.
Using a message broker has several advantages compared to direct RPC:
- It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.
- It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
- The sender doesn't need to know the IP address and port number of the recipient.
- It allows one message to be sent to several recipients.
- It logically decouples the sender from the recipient (the sender just publishes messages and doesn't care who consumes them).
Message broker: one process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of, or subscribers to, that queue or topic. There can be many producers and many consumers on the same topic.
Message brokers examples: IBM WebSphere, RabbitMQ, ActiveMQ, HornetQ, NATS and Apache Kafka.
A topic provides only one-way dataflow. However, a consumer may itself publish messages to another topic, or to a reply queue that is consumed by the sender of the original message.
Message brokers typically don’t enforce any particular data model. A message is just a sequence of bytes with some metadata, any encoding format is okay.
If the encoding is backward and forward compatible, you have the flexibility to change publishers and consumers independently and deploy them in any order.
Distributed actor frameworks
- Actor model: a programming model for concurrency within a single process.
An actor encapsulates logic (rather than dealing with threads directly), typically representing one client or entity. (A toy sketch follows this list.)
- It may have local state, which is not shared with any other actor.
- It communicates with other actors by sending and receiving asynchronous messages.
- Message delivery is not guaranteed.
- Each actor processes only one message at a time, so it doesn't need to worry about threads.
- Each actor can be scheduled independently by the framework.
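A toy single-process actor using only the standard library (illustrative, not any framework's API):

```python
import queue
import threading

class CounterActor:
    """Toy actor: private state, a mailbox, one message at a time."""

    def __init__(self):
        self.count = 0                       # local state, never shared
        self.mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        self.mailbox.put(message)            # asynchronous: returns at once

    def _run(self):
        while True:
            msg = self.mailbox.get()         # one at a time -> no locks
            self.count += msg

actor = CounterActor()
actor.send(1)
actor.send(2)    # only the actor's own thread ever touches self.count
```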
In a distributed actor framework, this programming model is used to scale an application across multiple nodes:
- Use the same message-passing mechanism.
- Messages between different nodes are transparently encoded into a byte sequence, sent over the network, and decoded on the other side.
Location transparency works better in the actor model than in RPC, because the actor model already assumes that the message may be lost.
Less of a fundamental mismatch between local and remote communication when using actor model(latency still higher than within the same process.)
A distributed actor framework essentially integrates a message broker and the actor programming model into a single framework.
- Backward and forward compatibility still need attention: a message may be sent from a node running the new version to a node running the old version, and vice versa.
Some popular distributed actor frameworks handle message encoding as follow:
- Akka uses Java's built-in serialization by default; it can be replaced with something like Protocol Buffers to gain the ability to do rolling upgrades.
- Orleans by default uses a custom data encoding format that does not support rolling upgrade deployments: to deploy a new version of your application, you set up a new cluster, move traffic from the old cluster to the new one, and shut down the old one. Custom serialization plug-ins can be used.
- In Erlang OTP it is hard to make changes to record schemas, even though the system has many features designed for high availability. Rolling upgrades are possible but need careful planning; the experimental maps datatype (a JSON-like structure) may make this easier in the future.
Summary
Rolling upgrades
- Allow new versions of a service to be released without downtime, encouraging frequent small releases over rare big ones.
- Make deployments less risky: a faulty release can be rolled back before it affects a large number of users.
- Beneficial for evolvability.
Data encoding formats:
- Programming language-specific formats: restricted to a single programming language and often failing to provide forward and backward compatibility.
- Textual formats like JSON, XML, and CSV: widespread; compatibility depends on how you use them.
- Binary schema-driven formats like Thrift, Protocol Buffers, and Avro: compact, efficient encoding with clearly defined forward and backward compatibility semantics, though the data must be decoded before it is human-readable.
Modes of Dataflow
- Databases: the process writing to the database encodes the data, and the process reading from the database decodes it.
- RPC and REST APIs: the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response.
- Asynchronous message passing(using message brokers or actors).