Summary of "The Log" || hariswb

“The Log” is an article about why the log matters to software engineers interested in distributed systems. It was written in 2013 by Jay Kreps, formerly lead architect for online data infrastructure at LinkedIn and a co-creator of Apache Kafka. In the article, he explores the log as a record of events and uses it to tackle the problems of data consistency and data integration in distributed systems.

A distributed system consists of interconnected services and databases. The challenge of running such a system is that its parts need to share state, either globally or locally within a subset of the system. Some parts can fail, so how should the system as a whole reconcile state that may be missing after a failure? This is where the log comes in.

The author defines the log in a specific way. It is an append-only sequence of records, and the order of the records gives a notion of time. In contrast to a log that saves the final result of an operation, this log records the operation itself, which tells humans or other computers what happened and when.
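To make the definition concrete, here is a minimal Python sketch of such a log. The names `Record`, `Log`, `append`, and `read_from`, as well as the tuple format of the operations, are assumptions made for illustration and do not come from the article.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass(frozen=True)
class Record:
    offset: int     # position in the log; doubles as a logical timestamp
    operation: Any  # what happened, not the state that resulted from it


@dataclass
class Log:
    _records: List[Record] = field(default_factory=list)

    def append(self, operation: Any) -> int:
        """Add a record at the end of the log and return its offset."""
        offset = len(self._records)
        self._records.append(Record(offset, operation))
        return offset

    def read_from(self, offset: int) -> List[Record]:
        """Return every record at or after the given offset, in log order."""
        return self._records[offset:]


# Hypothetical operations: the log stores "deposit 50", not "balance is 50".
log = Log()
log.append(("deposit", "alice", 50))
log.append(("withdraw", "alice", 20))
offsets = [record.offset for record in log.read_from(0)]
```

Records are only ever added at the end, so an earlier offset always means an earlier event. That ordering is what lets a reader treat the offset as a clock.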

This matters because a log of this kind lets us build a deterministic system: if two deterministic processes start in the same state and read the same input log in the same order, they produce the same output. Consider the problem of replicating a master database. When replicas are fed from a log that contains only the final state after each operation, two replicas can end up inconsistent after a failure, such as a network failure that makes one replica miss some events. Reconciling them is not a trivial matter. With a log that records the operations themselves, we do not have to depend on a master database and can keep replicas available by having them subscribe to the log. A failed replica can catch up to the current state by replaying the recorded operations it missed, and all replicas end up consistent with each other.
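The catch-up behavior can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the log is a plain in-memory list, and the `Replica` class, its `catch_up` method, and the `(kind, account, amount)` operation format are hypothetical names that do not appear in the article.

```python
from typing import Dict, List, Tuple

Operation = Tuple[str, str, int]  # (kind, account, amount); hypothetical format


class Replica:
    def __init__(self) -> None:
        self.balances: Dict[str, int] = {}
        self.next_offset = 0  # first log position this replica has not applied

    def apply(self, op: Operation) -> None:
        """Apply one operation deterministically to the local state."""
        kind, account, amount = op
        delta = amount if kind == "deposit" else -amount
        self.balances[account] = self.balances.get(account, 0) + delta

    def catch_up(self, log: List[Operation]) -> None:
        """Replay every operation that was appended since the last call."""
        for op in log[self.next_offset:]:
            self.apply(op)
            self.next_offset += 1


log: List[Operation] = [("deposit", "alice", 50), ("withdraw", "alice", 20)]
a, b = Replica(), Replica()
a.catch_up(log)
b.catch_up(log)

# Replica b goes offline while two more operations are appended.
log.append(("deposit", "bob", 10))
log.append(("deposit", "alice", 5))
a.catch_up(log)
stale = dict(b.balances)  # b still holds the old state here

# When b comes back, it replays only what it missed.
b.catch_up(log)
```

Replica `b` never talks to replica `a`. It only needs its own offset and the log, which is why no separate reconciliation step is required.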

This way of moving data between the parts of a distributed system also applies to the core problem of data integration across an organization. Each part of a system has its own way of consuming data, which tends to push the teams involved to build a custom pipeline for each kind of consumption. Such a system is particularly hard to manage once we factor in the need for data consistency across the organization. To resolve this, the author uses the log to unify and synchronize the various ways of providing and consuming data.

The unified log removes the specifics of how data is used across the organization. It only records that something happened at some point in time, while the details of the content are hidden behind the log interface, as if stored in a capsule. Serving data this way lets different consumers subscribe to the events and process them however they like. Any part of the system that fails can recover by rereading the events in the log as if nothing had happened; it does not need to reconcile with any particular data provider or consumer.
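A rough Python sketch of this subscription model follows. It assumes an in-memory list as the log and invents the `Consumer` class, the `poll` method, and the event fields purely for illustration; a real system such as Kafka has its own consumer API, which is not reproduced here.

```python
from typing import Callable, Dict, List, Set

Event = Dict[str, str]  # hypothetical event shape


class Consumer:
    """Reads the shared log at its own pace and keeps its own offset."""

    def __init__(self, name: str, handler: Callable[[Event], None]) -> None:
        self.name = name
        self.handler = handler
        self.offset = 0  # next log position this consumer will read

    def poll(self, log: List[Event]) -> int:
        """Process all events not yet seen and return how many there were."""
        new_events = log[self.offset:]
        for event in new_events:
            self.handler(event)
        self.offset += len(new_events)
        return len(new_events)


log: List[Event] = []
seen_pages: Set[str] = set()
counts: Dict[str, int] = {}


def index_page(event: Event) -> None:
    seen_pages.add(event["page"])


def count_event(event: Event) -> None:
    counts[event["type"]] = counts.get(event["type"], 0) + 1


search = Consumer("search", index_page)
analytics = Consumer("analytics", count_event)

log.append({"type": "page_view", "page": "/home"})
search.poll(log)  # search keeps up in near real time
log.append({"type": "page_view", "page": "/jobs"})
log.append({"type": "click", "page": "/jobs"})
search.poll(log)
processed = analytics.poll(log)  # analytics starts late and reads everything
```

The producer appends events without knowing who reads them, and each consumer decides on its own what to do with them. A consumer that was down simply resumes from its stored offset.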

Hiding details is important for data integration because the author found that the problem with scalable ETL (Extract, Transform, Load) pipelines lies in how data ownership is split across the stages of the data flow, from sourcing through pipelining to consumption. In the pipelining stage, the team has to transform the data and provide it in clean form, while data providers and consumers each need things done in a specific way. With a unified log, these teams do not need to tell each other the details of how they treat the data. They only need to agree on a clearly defined way to transfer data through an API (Application Programming Interface). This saves headaches and speeds up development.

The log is a powerful concept for keeping data consistent in the face of failures in a distributed system, and it eases the burden of managing data across teams in a scalable ETL pipeline. In a distributed system, the log records events in chronological order, so any service or server can subscribe to it and process it deterministically. Using this concept in a scalable ETL pipeline helps keep data integration across the organization smooth. The article is valuable for software engineers because it shows how to turn a fragile, entangled web of services and servers into a fast and reliable set of connections. This is even more relevant today, with distributed systems now common in medium to large organizations.