Wednesday, May 5, 2010
Partitioning Data for Fun and Scalability
A while back, TripAdvisor released a Facebook app, Cities I've Visited, that lets people stick pins in a map for places they've been and where they're going. While our CIV app is not quite Farmville yet, the viral nature of Facebook does help things take off. Without any particular effort on our part, we've ended up with almost 1 billion pins, with new pins coming in at almost five million a day.
The pin table is easily the fastest-growing table in our database and one of the largest. With new functionality in the works that will take better advantage of this data, and associated plans for marketing pushes to promote our app, it could easily start growing significantly faster. Currently, we store all the pin data in a table in a PostgreSQL database. This has served us adequately so far, but the size of the table is putting stress on our replication system and makes compiling aggregate statistics and database management difficult. Now is a good time to rethink our design and come up with something that scale better, so we would be ready for any surges down the road.
The obvious way to increase scalability is to partition your data between databases on multiple machines. Thus, you can increase your data capacity by simply buying new hardware. There are some downsides, of course: queries that need to talk to more than one database are on average slightly slower, and enforcing transactional semantics becomes more challenging. But it also has some side bonuses, such as more parallelism for large queries. And there's really only so much you can do if you insist on keeping all your data on every database server. (Database replication can help if read load is your worry, but that just makes things worse when you're worried about write load like we are, and most of the time a proper caching setup works better there anyways.)
Partitioning the data ended up being a great fit here. Our data setup makes this easy: all operations (except for some offline aggregate reporting) specify either a single member ID or a set of member IDs (e.g., for when you're seeing everywhere your friends have gone). Thus, partitioning on member ID makes routing queries easy. Moreover, all modification operations only operate on a single member ID, except in the less-frequent case of merging two users. Finally, we don't need strong guarantees about which pins are visible if you look at someone's map while they're in the middle of modifying it, so we don't need to handle multi-server transactions.
From here, the main question is how to partition the data. Based on our query load, we know we want the partitioning to be based on member ID. The obvious way to do this would be to take the member ID, mod by the number of servers, and put the data on the server with the resulting index. This is quick and easy, and since memberid is effectively random this should result in an even distribution of data. Why wouldn't you go with the simple solution?
Well, one situation that this doesn't handle so well is when you need to repartition your data. Say that you initially partitioned your data onto 5 servers, and now you want to add 3 more. With optimal partitioning, you could move 1/40th of the data from each of the 5 servers to the 3 new ones, and you'd end up 1/8 of the data on each server, without needing to move any data between the existing servers. However, with the simple mod hash function, almost all the data changes which server it's assigned to. This results in unnecessary copying and thus overly long transitions.
For some of you, this problem will look familiar, and you'll want to reach into your toolbox for everyone's favorite way to divide data among cache nodes: consistent hashing. I won't go into the hairy details here, but the basic idea of consistent hashing is that you arrange your data evenly on a circle (say, by taking some rightmost bits of the member ID), and then you assign points on the circle at random to each server. Each piece of data is stored at the server that comes next in the circle. With enough points per server, your data ends up well-distributed, though not perfectly so. Consistent hashing has the useful property that whenever you add or remove a server, only the data for the affected server has to move, and that data is likely to be well-distributed among the other servers. Thus, more servers can be added at any time with minimal disruption.
What's the downside? One issue is that the data's not distributed as evenly as with mod-hashing, which can lead to you not getting the most out of your hardware. You can increase the evenness of your data distribution by using more circle points per server. But the more points you add, the more complex your hash function becomes. One great thing with mod-hashing is that you can easily express your hashing in SQL, whereas with consistent hashing you either have huge and less-efficient SQL expressions or you have to do your hashing in client code. So, while less data needs to be moved under consistent hashing, figuring out what data needs to move can take considerably longer. Finally, consistent hashing is complex and random enough that it's no longer possible to ensure you're hashing properly by inspection, making it more difficult to notice errors.
While consistent hashing seems like a useful idea here, it's possible to get most of its advantages while keeping the simplicity of mod-hashing. First, ideally you've already got a hot backup of your database servers, so you don't need to worry about needing to remove database machines on short notice. Then, you can start by dividing your data between more databases than machines you need to use. For example, start with 12 databases divided between two database machines. If your load or data size becomes large enough that two machines aren't enough, just move 2 databases from each machine onto a new one. From the point of view of your partitioning, nothing's changed, but you're now using more machines. You can easily transition to 4, then 6, then 12 machines with even machine usage.
But what if your app becomes the next Twitter, and 12 machines is no longer enough? We saw before that in the general case adding new databases causes most of your data to need to move. But there's a special case that works better: multiplying the number of databases by an integer. Going from 12 to 24 databases, half of each database's data moves to a new database, with no movement of data between the existing databases at all. Of course, you're still moving half your data, so it's not going to be a quick operation. But it's much better than shuffling your data between all of your databases.
Scalability is a complex area, and obviously what works for us won't necessarily be the right answer in your situation. The mod-hashing system with more databases than machines looks to be working well with us, and it'll be interesting to see how it holds up with the new developments of the next few months. While too much load may be a good problem to have, I think we can all agree that it's a better problem to handle seamlessly.
Labels:
facebook,
federation
Subscribe to:
Post Comments (Atom)

0 comments:
Post a Comment