Shared Logs for Distributed Computing: the case for CORFU

Today I am posting on a more technical subject here at Digital Distributed Asset. This blog usually carries business and FinTech content, but from time to time I will post about important technical or scientific developments in distributed computing in general.

In that spirit, I decided to repost here a new development in distributed computing that I found on The Morning Paper, a blog that is by now familiar, at least to readers of my other blog, now called The Intelligence of Information. The post in question is about CORFU, named after the Greek island, which is a new development in shared logs for distributed computing with consensus. The post, which reviews the paper, is titled Corfu: A distributed shared log:

Why a shared log?

If we can construct a (suitably performant) linearizable shared log, then this primitive can be used as a key building block to solve some hard distributed systems problems: an agreed upon global ordering becomes simply the order of events in the log, and through state machine replication, this can be parlayed into consistent views of system state. Shared logs have been used for failure atomicity and node recovery, for consistent remote mirroring, and for transactional systems…

(…)

Corfu and Paxos are neighbouring Greek islands, which gives a clue as to another use of shared logs: as a consensus engine:

Used in this manner, CORFU provides a fast, fault-tolerant service for imposing and durably storing a total order on events in a distributed system. From this perspective, CORFU can be used as a drop-in replacement for existing Paxos implementations, with far better performance than previous solutions.

[Figure: corfu-fig-1]

This development shows some capabilities in database design that are worth paying attention to. Modern databases are increasingly distributed and shared, spanning multiple jurisdictions and differentiated workloads. This calls for database systems with flexibility, immutability and ease of access, all without compromising data integrity and identity. The CORFU authors demonstrate some of this functionality with two applications built on the log, one of them a State Machine Replication library:

The authors build two applications on top of CORFU to demonstrate the possibilities: CORFU-Store is a key-value store with atomic multikey puts and gets and low-latency geo-distribution, where CORFU acts as a log of data updates, durably storing data versions without overwriting them in place; CORFU-SMR is a State Machine Replication library where replicas propose commands by appending them to the log and execute commands by playing the log.
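To make the CORFU-SMR idea concrete, here is a minimal Python sketch of my own. It is not code from the paper: the InMemoryLog and Replica classes and their method names are hypothetical, and the log is a plain in-process list rather than a cluster of flash units. It only illustrates replicas proposing commands by appending them and executing commands by playing the log in order.

```python
class InMemoryLog:
    """Toy stand-in for a shared log (hypothetical, not the CORFU client API)."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        # The position in the log is the agreed global order.
        self._entries.append(entry)
        return len(self._entries) - 1

    def read(self, position):
        return self._entries[position]

    def tail(self):
        return len(self._entries)


class Replica:
    """Replica that proposes by appending and executes by playing the log."""

    def __init__(self, log):
        self.log = log
        self.state = {}
        self.applied = 0  # next log position this replica has not yet executed

    def propose(self, key, value):
        return self.log.append(("put", key, value))

    def sync(self):
        # Play every entry not yet applied, strictly in log order.
        while self.applied < self.log.tail():
            op, key, value = self.log.read(self.applied)
            if op == "put":
                self.state[key] = value
            self.applied += 1
        return self.state


log = InMemoryLog()
a, b = Replica(log), Replica(log)
a.propose("x", 1)
b.propose("x", 2)
a.propose("y", 3)
print(a.sync() == b.sync())  # both replicas replay the same order
```

Because both replicas replay the same sequence of entries, they end in the same state regardless of which replica proposed which command.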

But obviously this does not come without challenges and difficulties, as The Morning Paper rightly points out:

Challenges in implementing an efficient shared log

A shared log certainly sounds desirable, but of course it needs to present a single strongly consistent view of the log state, offer high performance, durability and fault tolerance. If it was easy, everyone would be doing it!

The performance requirements eliminate a class of designs in which everything must be serialized through an elected leader. Replication helps with durability and fault tolerance, but Corfu still needs to cope with both processes and storage units that may fail at any point in time, allowing reconfiguration of the system with no loss or disruption. And once we have replication, we need to start worrying about consistency again…

From here the post moves on to a more detailed description of CORFU, which I briefly sketch in this blog. Further below is the USENIX link to the video of the formal presentation of CORFU from the Microsoft Research Silicon Valley series:

Introducing CORFU

CORFU provides a very simple API to applications, consisting of append, read, trim, and fill operations. Append adds to the end of the log, read returns the log entry at a given log position. We’ll return to the use of trim and fill shortly.
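As a rough illustration of the four operations, here is a small single-process sketch in Python. The class and method names are my own, the reserve helper is hypothetical, and the semantics of fill and trim follow my reading of the paper: fill plugs an unwritten position with junk so that a hole left by a crashed client does not block readers, and trim marks a position as no longer needed so its space can be reclaimed. Treat it as a sketch under those assumptions, not as the CORFU client library.

```python
JUNK = object()     # marker written by fill()
TRIMMED = object()  # marker left by trim()


class ToyLog:
    """Single-process toy with write-once positions (hypothetical names)."""

    def __init__(self):
        self._slots = {}  # position -> bytes, JUNK or TRIMMED
        self._tail = 0

    def reserve(self):
        # Hypothetical helper: take the next position without writing it,
        # as a client would if it crashed right after getting its position.
        position = self._tail
        self._tail += 1
        return position

    def append(self, data):
        position = self.reserve()
        self._slots[position] = data
        return position

    def read(self, position):
        value = self._slots.get(position)
        if value is None:
            raise LookupError(f"position {position} is unwritten")
        if value is JUNK or value is TRIMMED:
            raise LookupError(f"position {position} holds no data")
        return value

    def fill(self, position):
        # Plug a hole with junk so readers do not wait on it forever.
        if position in self._slots:
            raise ValueError(f"position {position} is already written")
        self._slots[position] = JUNK

    def trim(self, position):
        # Declare the entry no longer needed so its space can be reclaimed.
        self._slots[position] = TRIMMED


log = ToyLog()
first = log.append(b"hello")
hole = log.reserve()          # simulated crashed appender
third = log.append(b"world")
log.fill(hole)                # another client patches the hole
log.trim(first)               # the first entry is no longer needed
```

The point of the write-once check in fill is that a slow appender and a filling client cannot both win the same position: whoever writes first decides its content.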

Our design places most CORFU functionality at the clients, which reduces the complexity, cost, latency, and power consumption of the storage units. In fact, CORFU can operate over SSDs that are attached directly to the network, eliminating general purpose storage servers from the critical path.

(…)

The mapping is maintained at the clients, and Corfu therefore needs a mechanism to consistently update it. To read from the log, a client looks up the required storage page in the map, and then issues a read directly to the storage unit containing it. Appends are also written directly to the storage page for the next available log position. A sequencer node (strictly an optimisation to avoid contention with other appending clients) assigns tokens for log positions.

In this way, the log in its entirety is managed without a leader, and CORFU circumvents the throughput cap of any single storage node. Instead, we can append data to the log at the aggregate bandwidth of the cluster, limited only by the speed at which the sequencer can assign them 64-bit tokens, that is, new positions in the log.

A user-space sequencer is capable of serving 500K tokens/s.
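The division of labour described above can be sketched in a few lines of Python. This is my own toy model, with hypothetical Sequencer, StorageUnit and Client classes living in one process: the sequencer only hands out the next log position, and each client then writes directly to whichever storage unit that position maps to. The round-robin placement is an assumption for illustration.

```python
import itertools
import threading


class Sequencer:
    """Hands out log positions (tokens); it never touches the data itself."""

    def __init__(self):
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def next_token(self):
        with self._lock:
            return next(self._counter)


class StorageUnit:
    """Write-once page store standing in for a network-attached flash unit."""

    def __init__(self):
        self.pages = {}

    def write(self, page, data):
        if page in self.pages:
            raise ValueError(f"page {page} is already written")
        self.pages[page] = data

    def read(self, page):
        return self.pages[page]


class Client:
    """Client-side logic: get a token, map it, write straight to the unit."""

    def __init__(self, sequencer, units):
        self.sequencer = sequencer
        self.units = units

    def _locate(self, position):
        # Assumed round-robin placement of positions across units.
        unit = self.units[position % len(self.units)]
        page = position // len(self.units)
        return unit, page

    def append(self, data):
        position = self.sequencer.next_token()
        unit, page = self._locate(position)
        unit.write(page, data)
        return position

    def read(self, position):
        unit, page = self._locate(position)
        return unit.read(page)


sequencer = Sequencer()
units = [StorageUnit(), StorageUnit()]
alice, bob = Client(sequencer, units), Client(sequencer, units)
p0 = alice.append(b"a0")
p1 = bob.append(b"b0")
p2 = alice.append(b"a1")
```

Note that the data path never goes through the sequencer: consecutive appends land on different units, which is why aggregate append bandwidth can exceed that of any single storage node.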

[Figure: corfu-fig-3]

The projection map

The projection map maintained at clients divides the log into disjoint ranges, each of which is projected to a list of extents within the address spaces of individual storage units.

Within each log range, log positions are mapped to storage pages in the corresponding list of extents via any deterministic function (e.g., round robin). While the map above shows each log position mapped to a single storage page, for replication purposes each extent is actually associated with a replica set of storage units rather than just one unit.
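Here is one way the lookup could be written in Python, as a sketch of my own. The Extent and LogRange types, the unit names and the range sizes are all made up for illustration, and replication is left out, so each extent names a single unit rather than a replica set. The deterministic function used is round robin, as in the example given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    unit: str        # hypothetical storage unit name
    first_page: int  # first page of the extent in that unit's address space


@dataclass(frozen=True)
class LogRange:
    start: int       # first log position in the range (inclusive)
    end: int         # one past the last log position (exclusive)
    extents: tuple   # extents this range is projected onto


def locate(projection, position):
    """Map a log position to (unit, page) using round robin within its range."""
    for log_range in projection:
        if log_range.start <= position < log_range.end:
            offset = position - log_range.start
            count = len(log_range.extents)
            extent = log_range.extents[offset % count]
            return extent.unit, extent.first_page + offset // count
    raise KeyError(f"position {position} is not covered by the projection")


projection = [
    LogRange(0, 40, (Extent("A", 0), Extent("B", 0))),
    LogRange(40, 80, (Extent("C", 0), Extent("D", 0))),
]

print(locate(projection, 2))
print(locate(projection, 45))
```

Since the function is deterministic, every client holding the same projection computes the same unit and page for a position without talking to anyone else.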

(…)

When we change the projection, we invoke a seal request on all relevant storage units, so that clients with obsolete copies of a projection will be prevented from continuing to access them. All messages from clients to storage units are tagged with the epoch number, so messages from sealed epochs can be aborted. In this sense, a projection serves as a view of the current configuration.

Projection changes may link in new ranges, keeping the old ones intact, or may affect the configuration of some past ranges but not others. Over time therefore the log evolves in disjoint ranges, each using its own projection over a set of storage extents.
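The sealing step described above can be sketched as follows. This is a simplified Python model of my own, not the protocol from the paper: the StorageUnit class, the SealedEpochError exception and the method signatures are hypothetical. Each request carries the epoch of the projection the client holds, and a unit that has been sealed rejects anything tagged with an older epoch.

```python
class SealedEpochError(Exception):
    """Raised when a request carries an epoch older than the unit's."""


class StorageUnit:
    def __init__(self):
        self.epoch = 0
        self.pages = {}

    def seal(self, new_epoch):
        # Move to the new epoch; never go backwards.
        self.epoch = max(self.epoch, new_epoch)

    def _check(self, epoch):
        if epoch < self.epoch:
            raise SealedEpochError(
                f"epoch {epoch} is sealed, current epoch is {self.epoch}"
            )

    def write(self, epoch, page, data):
        self._check(epoch)
        self.pages[page] = data

    def read(self, epoch, page):
        self._check(epoch)
        return self.pages[page]


unit = StorageUnit()
unit.write(0, 0, b"written under the old projection")

unit.seal(1)  # the projection changed, so epoch 0 is now obsolete

try:
    unit.write(0, 1, b"from a client with a stale projection")
    stale_rejected = False
except SealedEpochError:
    stale_rejected = True

unit.write(1, 1, b"from a client with the new projection")
```

A client that gets rejected this way knows its copy of the projection is obsolete and has to fetch the current one before retrying.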

I recommend that interested readers read The Morning Paper’s post to the end. For now, here is the USENIX link to the video presentation of CORFU by Microsoft Research Silicon Valley:

CORFU: A Shared Log Design for Flash Clusters

 

Abstract: 

CORFU organizes a cluster of flash devices as a single, shared log that can be accessed concurrently by multiple clients over the network. The CORFU shared log makes it easy to build distributed applications that require strong consistency at high speeds, such as databases, transactional key-value stores, replicated state machines, and metadata services. CORFU can be viewed as a distributed SSD, providing advantages over conventional SSDs such as distributed wear-leveling, network locality, fault tolerance, incremental scalability and geodistribution. A single CORFU instance can support up to 200K appends/sec, while reads scale linearly with cluster size. Importantly, CORFU is designed to work directly over network-attached flash devices, slashing cost, power consumption and latency by eliminating storage servers.

featured image: VMworld US 2016: The Day 2 Buzz
