@hackage legion0.10.0.0

Distributed, stateful, homogeneous microservice framework.

legion

Legion is a framework for writing horizontally scalable stateful applications, particularly microservices.

Motivation

Writing stateful microservices is hard. Typically, the way stateful services are written to make them easy is they are written as stateless services that offload state to a database, making the database the stateful service. This approach has several disadvantages, the most important of which is that it is not always possible in principal to accomplish what you need.

Why is it hard to write stateful microservices without resorting to the DB? Well, for the same reason it is hard to write a distributed database in the first place. If you are storing state, you have to worry about scaling that state by distributing it across a cluster, ensuring the durability by replicating the state, and routing requests to the location where the state is stored. You have to worry about nodes entering an exiting the cluster, and how the state is repaired and rebalanced when the cluster topology changes.

Wouldn't it be nice if you could get all that for free and just focus on logic of your microservice application?

Disadvantages of Offloading State to the DB

  • Data Transfer.

    Transfer costs are only trivial if the size of your state is trivial, and probably not even then if you are dealing with frequently accessed objects, or hot spots. It is difficult to offload state to the DB in this way if the size of your state objects is large.

  • Consistency Is Still a Problem.

    Distributed databases have gotten good at providing eventual consistency for the semantics of database operations, but not for the semantics of your application. Counters are a common example of this. Say a field in your DB object represents some kind of counter keeping track of the instances of some event or other. Two instances of the event happen simultaneously on two different nodes. Node A reads the current value, which is 10. Node B reads the current value, which is 10. Node A adds 1, and stores the new value as 11. Node B adds 1, and stores the new value as 11. Two events happened, but the countered only moved from 10 to 11. Your data is now inconsistent in relation to your application semantics.

    It is true that some databases are starting to provide tools to handle this specific case, and others which are similar, but those tools are not typically generalizable, or else require locking which may lead to substandard performance, or break A or P in the CAP Theorem.

    Another approach some people take to solve this problem is to store CRDTs in the database layer (in fact, Legion relies heavily on CRDTs internally). This approach is limited by the support of your database, and in any case using CRDTs this way is problematic because the growth of most CRDTs is unbounded over time, causing the size of the CRDT to become prohibitively large. It is very difficult to do garbage collection on such CRDTs in a hybrid system. One of the most important things Legion does internally is implement asynchronous CRDT garbage collection.

Solutions

The general philosophy that Legion takes to solving the problems of the application/DB hybrid approach is not new. Instead of moving data to where a request is being handled, we move the request handling to where the data lives. What is interesting is the implementation, which has the following characteristics:

  • Request Routing.

    User's of the Legion framework supply a request handler which is used to service application requests. Requests are routed by the Legion runtime to a node in cluster where the data actually resides and the request is executed by the user-provided request handler.

  • CAP Theorem.

    Legion chooses A and P. In other words, Legion focuses on eventual consistency while maintaining availability and fault tolerance.

    This is a little bit trickier than it seems at first glance. You are probably used to this option being chosen by distributed databases; in fact choosing A and P is basically the whole point of why many distributed databases exist in the first place. However, distributed DBs don't offer eventual consistency over arbitrary user-defined operations. See "Consistency Is Still a Problem" above. Being eventually consistent with arbitrary semantics is a lot harder than with "last write wins".

  • Meet Semilattices.

    Legion stores incoming requests as a set of (user-defined) events* organized into a meet-semilattice, with monotonically increasing event ids, and a monotonically increasing set of peer acknowledgements for each event. This is important for two reasons. The first is because it allows us to rewrite the order of events in the case of conflict while maintaining the user-defined event semantics, giving us Strong Eventual Consistency. The second is because, unlike similar schemes layered on top of an external database, it allows us to compute a Greatest Lower Bound (or infimum) for the user-defined partition value (or "object value", or "state value", as some people think of it) encapsulated implicitly in the event semilattice. This is the same as saying that it allows us to do garbage collections, because it is not possible for a new events to arrive that fall below the infimum. In other words, while it is possible for new events to arrive at a given peer out of their natural order, it is guaranteed that all events arriving in the future must be above the infimum, and that there are no possible events that fall below the infimum which the peer has not already seen. Therefore, we are free to collapse and discard all events below the infimum.

    * "Events" are user-defined pieces of code that accept the current partition value as input, and produce some kind of response along with a new partition value as output.

  • Pure Haskell Interface.

    Comming Soon.

  • Automatic Rebalancing.

    Comming Soon.

  • Replication.

    Comming Soon.

Development Status

The Legion framework is still experimental.

Examples

Check out the legion-discovery project for an example of a stateful web services that takes advantage of Legion's ability to define your own operations on your data. Take a look at Network.Legion.Discovery.App to see where the magic of defining a Legion application happens. The rest of the code is mostly just standard HTTP-interface-written-in-Haskell, and requests sent to the Legion runtime.

FAQ

How do a "partition" in my Legion application and a "partition" as a subset of records in a distributed database relate to one another?

Some people find the term "partition" confusing because of the way it is typically used to describe subsets of a table in distributed relational databases. That's ok. The term "partition" as used here has a more general meaning, primarily because of the more generalized nature of Legion as compared to a distributed database.

In Legion, a partition is an abstract unit of state upon which user requests operate. It is called a "partition" because it "is separate from every other partition", meaning that an individual request can only operate upon a single partition, and can never span multiple partitions. Furthermore, Legion can only guarantee consistency within the partition boundaries.

Another characteristic of a partition is that Legion cannot subdivide it. All of the data on one partition is guaranteed to be located on the same physical node. Legion treats partitions as the smallest unit of data that can be rebalanced across the cluster.

In a relational database partition, it is sometimes the case that the table can be "repartitioned", where rows from one partition move to the other. This has no analog in Legion. In Legion, a partition is an atomic unit of data which cannot be subdivided.

Why Haskell?

Developing correct distributed systems is hard. One reason it is hard is because it comes with a large number of very subtle rules and constraints that are not part of the average development process and require highly specialized knowledge. Typically this knowledge is entirely unrelated to the business problem you are trying to solve. Violating any of those constraints can lead to a nightmare of data corruption, scalability, or availability problems.

Most languages are unable to enforce distributed constraints in the type system, forcing the developer to very carefully tread through a proverbial mine field. Making an error in even one step can have an associated cost that is wildly disproportionate to the subtlety of the error.

Haskell on the other hand, has a type system that can be used to express these constraints. In addition to implementing the distributed runtime, providing a distribution-safe API is a major part of what makes Legion awesome. It fences off the mines so you can run through the mine field full tilt. If you hit one, the cost to your organization is a compile time error, instead of a fundamentally broken and failing project.