
Big Data World, Part 6: PACELC

Pasha Finkelshteyn

This is the sixth installment of our ongoing series on Big Data, how we see it, and how we build products for it. In this episode, we’ll cover the PACELC theorem, an extension of the CAP theorem that also describes the trade-offs a distributed system faces when no partition has occurred.

After reading Big Data World, Part 5: CAP Theorem, you might think that this theorem hardly helps in actual development. If the whole storage layer is, say, CP, the best thing you can achieve in the entire system is CP. Also, network partitioning is not something an engineer can actually change. But what if I told you that nothing is as it seems when positioning a system between C, A, and P?

PACELC theorem

The PACELC theorem is an extension of the CAP theorem. It states that if the network is partitioned (P), you must choose between availability (A) and consistency (C); else (E), you must choose between latency (L) and consistency (C).
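To make the two branches of the definition concrete, here is a minimal sketch in Python. The names `Choice`, `pacelc_options`, and `pacelc_label` are hypothetical and exist only for this illustration. The "PA/EL"-style label is the notation commonly used for PACELC classifications; it is not something defined in this article.

```python
from enum import Enum


class Choice(Enum):
    AVAILABILITY = "A"
    CONSISTENCY = "C"
    LATENCY = "L"


def pacelc_options(partitioned: bool) -> tuple:
    """Return the two properties PACELC says we must choose between."""
    if partitioned:
        # P: during a partition the choice is availability vs. consistency.
        return (Choice.AVAILABILITY, Choice.CONSISTENCY)
    # E: otherwise the choice is latency vs. consistency.
    return (Choice.LATENCY, Choice.CONSISTENCY)


def pacelc_label(on_partition: Choice, otherwise: Choice) -> str:
    """Build a label such as 'PC/EC' from a system's two choices."""
    if on_partition not in pacelc_options(True):
        raise ValueError("under partition, choose A or C")
    if otherwise not in pacelc_options(False):
        raise ValueError("without partition, choose L or C")
    return f"P{on_partition.value}/E{otherwise.value}"


if __name__ == "__main__":
    # A system that always prefers consistency.
    print(pacelc_label(Choice.CONSISTENCY, Choice.CONSISTENCY))
    # A system that prefers availability and low latency.
    print(pacelc_label(Choice.AVAILABILITY, Choice.LATENCY))
```

The sketch only encodes which choices are valid in which situation. Where a real system falls depends on its configuration and is not something this code can determine.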

Do you still remember the example with DB replication from our previous article?

Let’s recap briefly: traditional relational DBs are linearizable (i.e. CAP-consistent) only if they are single-node or replicated synchronously. What does synchronous replication imply when there is no partitioning? It means that every commit takes longer, because every node must acknowledge it. In the simplest case, where the replicas are written one after another, the response time grows linearly with the number of nodes we replicate to synchronously. With too many nodes, the response time can become unacceptably high for a client. This is the price of consistency: stronger consistency costs higher latency.
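The cost of synchronous replication can be sketched with a toy latency model. This is an illustration under stated assumptions, not a measurement: the function names are hypothetical, each replica is assumed to have a fixed round-trip time, and disk and queueing effects are ignored.

```python
from typing import Sequence


def sequential_commit_latency(local_ms: float, replica_rtts_ms: Sequence[float]) -> float:
    """Commit latency when the primary waits for each replica in turn."""
    return local_ms + sum(replica_rtts_ms)


def parallel_commit_latency(local_ms: float, replica_rtts_ms: Sequence[float]) -> float:
    """Commit latency when all replicas are contacted at once.

    The commit still waits for every acknowledgment, so the slowest
    replica determines the total.
    """
    return local_ms + max(replica_rtts_ms, default=0.0)


if __name__ == "__main__":
    # Assumed numbers: 2 ms local commit, 5 ms round trip per replica.
    for replicas in (1, 3, 5):
        rtts = [5.0] * replicas
        print(
            replicas,
            sequential_commit_latency(2.0, rtts),
            parallel_commit_latency(2.0, rtts),
        )
```

In the sequential model the latency grows linearly with the number of replicas. In the parallel model it is bounded by the slowest replica, which still tends to get worse as more nodes are added, because the chance of hitting one slow node increases.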

Of course, the dependency is not always linear. Popular consensus protocols, such as Raft and Paxos, do not require replication to all nodes for consistency to be achieved; an acknowledgment from a majority of nodes is enough. If you’re interested in how the Raft protocol works, there is a beautiful visual description in The Secret Lives of Data. There is also a whole family of Wikipedia articles on consensus.
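A rough way to see why majority-based protocols help: the leader only has to wait for the fastest followers that complete a majority, not for the slowest node. The sketch below is a simplified model, not an implementation of Raft or Paxos. The function name is hypothetical, and it assumes the leader counts as one vote and that follower round-trip times are fixed.

```python
from typing import Sequence


def quorum_commit_latency(local_ms: float, follower_rtts_ms: Sequence[float]) -> float:
    """Commit latency when a majority of the cluster must acknowledge a write."""
    cluster_size = len(follower_rtts_ms) + 1  # followers plus the leader
    majority = cluster_size // 2 + 1
    acks_needed = majority - 1  # the leader counts as its own vote
    if acks_needed == 0:
        return local_ms
    # The commit completes when the acks_needed-th fastest follower responds.
    return local_ms + sorted(follower_rtts_ms)[acks_needed - 1]


if __name__ == "__main__":
    # Assumed round-trip times: two nearby followers, two slow ones.
    followers = [3.0, 4.0, 50.0, 200.0]
    print("majority:", quorum_commit_latency(1.0, followers))
    print("all nodes:", 1.0 + max(followers))
```

With these assumed numbers, a five-node cluster needs only two follower acknowledgments, so the two slow followers do not affect the commit latency at all.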

Why is this important? It looks intuitive when you think about it – there is always some trade-off between consistency and latency. But it’s easy to forget that even a reliable and consistent system won’t respond instantly. And the bigger our system is, the more time it takes to write data to it, and the harder it is to meet our SLOs (service level objectives).

Considering this, it’s hard to overstate how difficult it is to build reliable and consistent storage like Amazon S3. Remember, it stores not only objects, which are more or less stable, but also metadata, which is volatile. This metadata must be consistent, too; otherwise, the inconsistency could lead to a massive security breach of user data.

Conclusion

When we design the architecture for our data storage, it is essential to remember that the system as a whole probably won’t be CAP-consistent, even though any given layer probably will be. It is important to understand our goals and run performance tests to learn what we can guarantee to our users and how those guarantees will change as the amount of data inevitably grows.
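One simple way to turn performance test results into a statement about guarantees is to compare a latency percentile against a target. The sketch below uses the nearest-rank percentile method. The function names and the sample data are hypothetical, and the latencies are assumed to come from your own load tests.

```python
from typing import Sequence


def nearest_rank_percentile(samples_ms: Sequence[float], percentile: int) -> float:
    """Return the given percentile (1..100) using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 1 <= percentile <= 100:
        raise ValueError("percentile must be between 1 and 100")
    ordered = sorted(samples_ms)
    # Integer ceiling of percentile * n / 100, avoiding float rounding.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_slo(samples_ms: Sequence[float], percentile: int, target_ms: float) -> bool:
    """Check whether the chosen latency percentile is within the target."""
    return nearest_rank_percentile(samples_ms, percentile) <= target_ms


if __name__ == "__main__":
    # Made-up write latencies of 1 ms to 100 ms, for illustration only.
    latencies = [float(ms) for ms in range(1, 101)]
    print(nearest_rank_percentile(latencies, 99))
    print(meets_slo(latencies, 99, target_ms=100.0))
```

Repeating such a check at several data volumes gives a rough picture of how the guarantee changes as the data grows.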

As usual, at JetBrains we’re committed to building tools that will work with any data source, regardless of its latency. So far these include:

DataGrip
Big Data Tools plugin for our IDEs

