How to Choose a Shard Key: The Card Game

来源:互联网 发布:c语言求质数 编辑:程序博客网 时间:2024/05/07 22:53

 

教你如何选择一个好的shard key, 作者Kristina, 是MongoDB: The Definitive Guide 和 Scaling MongoDB 的作者, 文章来源:http://www.snailinaturtleneck.com/blog/2011/01/04/how-to-choose-a-shard-key-the-card-game/

 

Choosing the right shard key for a MongoDB cluster is critical: a bad choice will make you and your application miserable. Shard Carks is a cooperative strategy game to help you choose a shard key. You can try out a few pre-baked strategies I set up online (I recommend reading this post first, though). Also, this won’t make much sense unless you understand the basics of sharding.

Mapping from MongoDB to Shard Carks

This game maps the pieces of a MongoDB cluster to the game “pieces:”

  • Shard – a player.
  • Some data – a playing card. In this example, one card is ~12 hours worth of data.
  • Application server – the dealer: passes out cards (data) to the players (shards).
  • Chunk – 0-4 cards defined by a range of cards it can contain, “owned” by a single player. Each player can have multiple chunks and pass chunks to other players.

Instructions

Before play begins, the dealer orders the deck to mimic the application being modeled. For this example, we’ll pretend we’re programming a news application, where users are mostly concerned with the latest few weeks of information. Since the data is “ascending” through time, it can be modeled by sorting the cards in ascending order: two through ace for spades, then two through ace of hearts, then diamonds, then clubs for the first deck. Multiple decks can be used to model longer periods of time.

Once the decks are prepared, the players decide on a shard key: the criteria used for chunk ranges. The shard key can be any deterministic strategy that an independent observer could calculate. Some examples: order dealt, suit, or deck number.

Gameplay

The game begins with Player 1 having a single chunk (chunk1). chunk1 has 0 cards and the shard key range [-∞, ∞).

Each turn, the dealer flips over a card, computes the value for the shard key, figures out which player has a chunk range containing that key, and hands the card to that player. Because the first card’s shard key value will obviously fall in the range [-∞, ∞), it will go to Player 1, who will add it to chunk1. The second and the third cards go to chunk1, too. When the fourth card goes to chunk1, the chunk is full (chunks can only hold up to four cards) so the player splits it into two chunks: one with a range of [-∞,midchunk1), the other with a range of [midchunk1, ∞), where midchunk1 is the midpoint shard key value for the cards in chunk1, such that two cards will end up in one chunk and two cards will end up in the other.

The dealer flips over the next card and computes the shard key’s value. If it’s in the [-∞, midchunk1) range, it will be added to that chunk. If it’s in the [midchunk1, ∞) range, it will be added to that chunk.

Splitting

Whenever a chunk gets four cards, the player splits the chunk into two 2-card chunks. If a chunk has the range [xz), it can be split into two chunks with ranges [xy), [yz), where x < y < z.

Balancing

All of the players should have roughly the same number of chunks at all times. If, after splitting, Player A ends up with more chunks than Player B, Player A should pass one of their chunks to Player B.

Strategy

The goals of the game are for no player to be overwhelmed and for the gameplay to remain easy indefinitely, even if more players and dealers are added. For this to be possible, the players have to choose a good shard key. There are a few different strategies people usually try:

Sample Strategy 1: Let George Do It

The players huddle together and come up with a plan: they’ll choose “order dealt” as the shard key.

The dealer starts dealing out cards: 2 of spades, 3 of spades, 4 of spades, etc. This works for a bit, but all of the cards are going to one player (he has the [x, ∞) chunk, and each card’s shard key value is closer to ∞ than the preceding card’s). He’s filling up chunks and passing them to his friends like mad, but all of the incoming cards are added to this single chunk. Add a few more dealers and the situation becomes completely unmaintainable.

Ascending shard keys are equivalent to this strategy: ObjectIds, dates, timestamps, auto-incrementing primary keys.

Sample Strategy 2: More Scatter, Less Gather

When George falls over dead from exhaustion, the players regroup and realize they have to come up with a different strategy. “What if we go the other direction?” suggests one player. “Let’s have the shard key be the MD5 hash of the order dealt, so it’ll be totally random.” The players agree to give it a try.

The dealer begins calculating MD5 hashes with his calculator watch. This works great at first, at least compared to the last method. The cards are dealt at a pretty much even rate to all of the players. Unfortunately, once each player has a few dozen chunks in front of them, things start to get difficult. The dealer is handing out cards at a swift pace and the players are scrambling to find the right chunk every time the dealer hands them a card. The players realize that this strategy is just going to get more unmanageable as the number of chunks grows.

Sharding keys equivalent to this strategy: MD5 hashes, UUIDs. If you shard on a random key, you lose data locality benefits.

Sample Strategy 3: Combo Plate

What the players really want is something where they can take advantage of the order (like the first strategy) and distribute load across all of the players (like the second strategy). They figure out a trick: couple a coarsely-grained order with the random element. “Let’s say everything in a given deck is ‘equal,’” one player suggests. “If all of the cards in a deck are equivalent, we’ll need a way of splitting chunks, so we’ll also use the MD5 hash as a secondary criteria.”

The dealer passes the first four cards to Player 1. Player 1 splits his chunk and passes the new chunk to Player 2. Now the cards are being evenly distributed between Player 1 and Player 2. When one of them gets a full chunk again, they split it and hand a chunk to Player 3. After a few more cards, the dealer will be evenly distributing cards among all of the players because within a given deck, the order the players are getting the cards is random. Because the cards are being split in roughly ascending order, once a deck has finished, the players can put aside those cards and know that they’ll never have to pick them up again.

This strategy manages to both distribute load evenly and take advantage of the natural order of the data.

Applying Strategy 3 to MongoDB

For many applications where the data is roughly chronological, a good shard key is:

{<coarse timestamp> : 1, <search criteria> : 1}

The coarseness of the timestamp depends on your data: you want a bunch of chunks (a chunk is 200MB) to fit into one “unit” of timestamp. So, if 1 month’s worth of writes is 30GB, 1 month is a good granularity and your shard key could start with {"month" : 1}. If one month’s worth of data is 1 GB you might want to use the year as your timestamp. If you’re getting 500GB/month, a day would work better. If you’re inserting 5000GB/sec, sub-second timestamps would qualify as “coarse.”

If you only use a coarse granularity, you’ll end up with giant unsplittable chunks. For example, say you chose the shard key {"year" : 1}. You’d get one chunk per year, because MongoDB wouldn’t be able to split chunks based on any other criteria. So you need another field to target queries and prevent chunks from getting too big. This field shouldn’t really be random, as in Strategy 3, though. It’s good to group data by the criteria you’ll be looking for it by, so a good choice might be username, log level, or email, depending on your application.

Warning: this pattern is not the only way to choose a shard key and it won’t work well for all applications. Spend some time monitoring, figuring out what your application’s write and read patterns are. Setting up a distributed system is hard and should not be taken lightly.

How to Use Shard Carks

If you’re going to be sharding and you’re not sure what shard key to choose, try running through a few Shark Carks scenarios with coworkers. If a certain player start getting grouchy because they’re having to do twice the work or everyone is flailing around trying to find the right cards, take note and rethink your strategy. Your servers will be just as grumpy and flailing, only at 3am.

If you don’t have anyone easily persuadable around, I made a little web application for people to try out the strategies mentioned above. The source is written in PHP and available on Github, so feel free to modify. (Contribute back if you come up with something cool!)

And that’s how you choose a shard key.