DISQUS

Bret Taylor's blog: How FriendFeed uses MySQL to store schema-less data - Bret Taylor's blog

  • nawroth · 9 months ago
    Thx a lot for sharing this!

    To me this looks like part of the tendency to move away from RDBMS for some or most of the storage needs to be able to scale well. The next natural step would be a shared nothing architecture using specialized storage engines for different needs. For instance there are different key/value stores with varying characteristics and the document-based ones should get more stable over time. One option I think will get more and more interesting in the future is using a graph database engine like http://neo4j.org/ (which BTW is the reason I joined that project) - it handles interconnected and semistructured data in a very efficient way.
  • malditogeek · 9 months ago
    Bret, did you heard about Tokio Cabinet [1] ?
    Looks like a great alternative for schema-less storage, and seems to be in production at mixi.co.jp [2]

    Do you evaluated this one? or have any opinion?

    Excelent read! Thx!


    [1] http://tokyocabinet.sourceforge.net/index.html
    [2] http://www.igvita.com/2009/02/13/tokyo-cabinet-...
  • Bret Taylor · 9 months ago
    Yes, we are definitely going to check that out. That was the one project that seemed most promising in our evaluation, and we still plan on doing a full evaluation.
  • bobdvirident · 5 months ago
    Bret,

    We're working with the folks at Mixi and Tokyo cabinet is intersting. bobd@virident.com if you'd like to connect. We're deliveirng unparralled MySQL performance there with our Virident appliance. Cheers,
  • Bret Taylor · 9 months ago
    We do shard our indexes. We query all the relevant index shards in parallel and over-fetch. The indexes are stored in sort order, so sorting is not an issue. To paginate, we fetch start + num and truncate in Python.
  • joshua schachter · 9 months ago
    This is in a bunch of ways what I wanted to build for Delicious. Oh well.
  • davedash · 9 months ago
    Now you can update several cells, some xml schemas, and some sapi calls, and some sapi providers and run an indexing process over the weekend and blammo.
  • Buck Nuggets · 9 months ago
    This is like container-managed persistence in java. Which can be wonderful if you don't mind always having to go through a java layer to get to your data. Which is fine for operational (transactional) activities but is extremely slow & expensive for reporting. Especially if you want to go beyond 1980s-style reporting products (crystal reports, birt, etc) and use the more powerful OLAP products (Microstrategy, Cognos, etc) that automatically generate SQL from metadata - generating new reports and views in minutes rather than days.

    Additionally, now you're locked into your development language. Note - SQL tends to outlast development languages. So long after you may decide that you want to replace your app with a new language or connect another app to your database that's written in another language - you'll still have SQL around. Odds are. But with this solution you won't be able to do that.

    So, why make this unnecessary sacrifice? To speed up database maintenance? Ok, here's the thing - most other databases do not have the same maintenance limitations that MySQL has that are the root cause of all of this. So, address the root cause and you no longer need to further limit your options with this solution. For example, most database products I work with allow you to add columns, add/remove indexes, etc - *ON THE FLY*. In other words the product won't rewrite the table and essentially hold your application offline & hostage for hours. Furthermore, some products will even let you completely rearrange your storage (add/remove new disks, raw logical volumes, file systems, etc) on the fly - automatically rebalancing your data for you. While your app is fully online and your users are blissfully unaware.

    Nice design, but I suggest that by addressing the root rather than the symptom you'll end up with a better solution at the end of the day.
  • Too Many Hackers · 9 months ago
    Agreed. This is a horrible hack on top of a toy RDBMS.

    Get a real RDBMS next time.
  • Not enough hackers · 9 months ago
    Make something yourself to the scale of FriendFeed. Then you can talk.
  • Buck Nuggets · 9 months ago
    Note the third paragraph of the story above. Scalability isn't the issue here - manageability of the database product is.

    Also, while this design dramatically reduces the need for DDL changes - it doesn't not completely eliminate them. Inevitably, you'll eventually need to add a column, change the storage around a bit, etc. And when that happens you'll still be down for hours. And wait until you've got 200 million rows - and are down for 24 hours. Ugh.
  • Not enough hackers · 9 months ago
    I wonder if it's a deeper issue -- that an RDBMS is simply the wrong tool for the job -- its tables and schemas and set operations are simply the wrong abstraction layer when what you really need are big, fast, persistent lists and hashtables with some basic atomicity.

    I don't disagree with any of your observations, either. MySQL is a fine database but not the ideal tool for every job. One of the nice things about MySQL, though, is that with its lack of advanced features (like being able to modify storage and tables on the fly, optimistic locking, etc) comes a certain simplicity that makes it a very malleable tool for someone who needs a fast and reliable storage engine.

    One thing that I have found, though, is that many sacred cows need to be slaughtered in the name of performance, stability and scale.
  • banane · 9 months ago
    You can't really compare this situation to OLAP systems- that's my major experience so that's where I'm coming from, and it's vastly different. They're not dealing with high levels of transactional, live data, they're meant for massive reporting 10X the scale of FF, and it's really just so functionally different, it's beyond comparison.
    And OLAP was alive and well side-by-side with OLTP systems, and I also disagree iwth your buckets of reporting systems. Crystal / Business Obj have their uses, and aren't on some technical evolution scale. The big reporting systems are good for just that- big sets of data.
  • Buck Nuggets · 9 months ago
    Well, I'd say that enabling OLAP isn't really the point (and BTW, works fine on live little databases as well).

    What is the point is that there is a considerable amount of functionality dependent upon a data model that is lost with this kind of approach. And if the objective is to address database manageability there are far better approaches that won't cost the loss of this functionality.

    Not to say that this isn't creative and fun. Just seems to be a solution for some other problem.
  • banane · 9 months ago
    Buck you were making the point to " use the more powerful OLAP products " and avoid the thick Java layer. I'm saying that FF is purely transactional, very minorly analytical, so OLAP-style design and products aren't of much use. I think they are somewhat useful in consideration since they have similar scale that Bret's dealing with.

    I agree that there are more robust, bug-free apps out there that mirror what he's doing, but they're not open source (to my knowledge), and they're willing to rebuild them, so it's all good.
  • Buck Nuggets · 9 months ago
    Banane,

    Support for OLAP is only one of many reasons to avoid abandoning the relational or dimensional model. Even if an organization were to decide that their core business activity did not warrant the kind of analysis made possible by OLAP then there is still a basket full of reasons not to go this route, including but not limited to:
    1. the fact that this design does not completely eliminate the need for table alterations and outages will still be required
    2. no reporting products of any kind are supported
    3. no tools that leverage database metadata are supported
    4. any reports, queries or views not anticipated by the application will have to be built leveraging the application layer rather than SQL. A vast number of potential requests or requirements will face additional costs or may not be feasible to implement with this kind of design.
    5. although I'm a huge fan of Python I have to admit that languages come & go and this team may find it desirable to replace or augment this application with another language. That won't talk easily to their python APIs. Sure, web services, etc are options - but are also far more costly than native sql in terms of performance.
    6. no database constraints are possible. I realize that some people don't use them anyway - but the simple declarative constraint language common to all database products more reliably protects against data corruption that application validation - which is more subject to coding defects and changes over time.

    So, take away the argument that OLAP capability is lost and you are still left with 6 huge reasons not to go this route. And that is for a product which is not a 100% solution and is only required because the database is being hosted by a product which fails to support common dbms functionality.
  • banane · 9 months ago
    #5 is the issue- and they're proving that SQL will be more of a time suck than what they're doing. By removing Joins, they are saving quite a bit of time. They're taking the connecting tissue out of the Database and putting it in the code. It's funny, I used to be a Java/Perl developer, and moved to databases. Listening/reading this thread is reminding me of the split between DB and programming, and this is such a programmer's solution. But I can't quibble with the fact, that, as a user of FF, it totally works.

    One big aspect that makes this make sense for FF (and not other companies, software, or apps) is that FF is really, really simple schematically. They simply have two basic entities, the user, and the post.

    What's the motivation for them to do something that works with other systems? There's no *reason* to be generic. Except, perhaps, that he can never fire his devlopers!
  • BuggyFunBunny · 9 months ago
    >> By removing Joins, they are saving quite a bit of time. They're taking the connecting tissue out of the Database and putting it in the code.

    Only if the engine sucks.
  • Kamil Szot · 6 months ago
    >> 1. the fact that this design does not completely eliminate the need for table alterations and outages will still be required
    Only if they want to extend their datastore. When they will want to add maybe some kind of reporting or aggregation of data. Most likely they will manage this by adding tables to their schema not by modifying existing ones. Business changes don't require any changes to db schema. It was one of the major pains they wanted to alleviate and they did.

    >> 5. although I'm a huge fan of Python I have to admit that languages come & go and this team may find it desirable to replace or augment this application with another language. That won't talk easily to their python APIs. Sure, web services, etc are options - but are also far more costly than native sql in terms of performance.

    Their db schema is so simple that you could port their data store interface to any language effortlessly (maybe they will just have to convert python pickled objects in db to something more generally readable like JSON).
  • daveg · 9 months ago
    Silly goose... the data entries are self-describing and do not have a fixed schema (just like xml).

    So a "schema change" to add fields takes no time at all.

    But yes, a schema change that transforms existing data either takes a long time, or is done on-the fly as records of a prior version are loaded and then later persisted.
  • Buh? · 9 months ago
    Funny... Google uses MySQL and FriendFeed has done something similar to what Google has done (without patching the engine). I would say it's not a toy. http://google-code-updates.blogspot.com/2007/04...

    In addition, DAOC (Dark Ages of Camelot - MMORPG for you non gamers), started out using MySQL... And they still use it! They find that this little "toy" handles huge amounts of traffic (think about it).

    So what RDBMS do you consider better? Oracle, which requires a team of people frantically digesting a novel sized book to apply a simple patch? One that still lives in the stone age of Java and is the most expensive of all? (I think not) Or how about the M$ version, which in order to run it, you must be running at least one seat of M$ Server? (I think not) All others aren't even worth mentioning here, unless you are speaking of PostgreSQL (which is ok).

    Kudos Bret for finding a workaround to your issue. Have you considered the JCR? With JackRabbit, I'm not sure about connectivity from Python, but it certainly works with Ruby. Sounds a little like you all have implemented a version of the JCR w/o a tree structure and with a MySQL backend.

    Don't listen to these negative yay-hoos... the bottom line is that your system is running faster and cheaper.
  • Otto · 9 months ago
    I'm forced to agree with Buck on this one. While I see the benefits of a schemaless system for certain applications, it seems to me that this was extreme overkill for the problem.

    The main problem is the use of MySQL in the first place. If schema changes require hours of downtime, then something else is seriously wrong there. I've used a lot of database products, and changed a lot of schemas on the fly, and I've never experienced any downtime. I've done ALTER TABLE on live production systems that were actually currently in use (as part of live upgrades) on tables with multiple billion rows, and I've never considered for a second that it might require taking the system down. That's just not done in modern RDBMS systems, nor should it ever really be necessary.
  • Kamil Szot · 6 months ago
    Replacing one RDBMS with another in live, large successful project is very hard. You have db servers optimised for given rdbms, optimised rdbms software, schema optimised for given rdbms, tools for monitoring performance, admins, ORM in your app. It's easier to change ORM and schema for better performing, easier to extend, than to change rdbms and all the optimised hardware, software, maintenance procedures and people.

    So actually 'just switching to postgress' could be overkill in their case.
  • Otto · 5 months ago
    If you have to optimize your software to fit your RDBMS, then you're doing it wrong. You optimize the RDBMS and the database to fit the kinds of things you're doing, not the other way around.
  • Andrey Shchekin · 9 months ago
    So, why make this unnecessary sacrifice? To speed up database maintenance? Ok, here's the thing - most other databases do not have the same maintenance limitations that MySQL has that are the root cause of all of this.

    Yes! The MySQL is the single worst performing DB I have ever worked with, especially considering index changes. Finding the workarounds for this can not be considered a breakthrough -- it would be much better if it was just fixed.

    I do not yet see a value in having a schema-less DB2 or Sql Server (but it might be I just had no real use case). But with 20 million index changes are a breeze, especially compared to MySQL.
  • null · 9 months ago
    what "free" products can do all of this?
    i liked the design myself, and i dont consider the language lock an issue - python in this case. actually, it's better to maintain your app in a single language, as well as maintaining a team that speaks and code that same language. no matter if it's python, php, etc. sql cannot be that language, because it's not its goal.
    i don't really believe that it's possible to have that much of optimization and stability using other "products" without spending "a lot" of money for licenses and support. furthermore, now they seem to have total control of their system.
  • Ben O'Steen · 9 months ago
    One thing that might be of interest, is that the new simplejson with its shiny C _speedup outperformed both Pickle and cPickle for me when storing JSON-like data. So, you might be able to squeeze a little more performance out of it yet ;) I really appreciate the description of how you are handling this!
  • Bret Taylor · 9 months ago
    For our last re-shard, we basically set up a parallel instance of our DB and wrote to both in parallel while we copied data over, then switched off the old system. Not optimal, certainly, but it worked for us.
  • Evan · 9 months ago
    Minor code correction:

    entity = user_id_index.get_all(datastore, user_id=binascii.a2b_hex("f48b0440ca0c4f66991c4d5f6a078eaf"))

    should probably be:

    entities = user_id_index.get_all(datastore, user_id=binascii.a2b_hex("f48b0440ca0c4f66991c4d5f6a078eaf"))

    Since user_id_index.get_all appears to be returning a list of entities rather than a single entity.
  • zellyn · 9 months ago
    Just curious — did you consider any of the BigTable equivalents?
  • Mark Renouf · 9 months ago
    This is a reamarkably similar in function to Amazon SimpleDB service. I wouldn't be surprised if it was implemented in much the same way. This is sure to make RDBMS purist scream in horror, but it works better and that's what really matters. Way to go!
  • Charlie Wood · 9 months ago
    Very interesting and educational. Thanks for writing this. More like this, please. :-)

    Regards,
    Charlie
  • Eric Kerr · 9 months ago
    Extremely interesting article. Requires more writing and updating to make sure data is consistent everywhere, but this makes a lot of sense when horizontal partitioning is needed and normalized table JOINs are abandoned.

    The main problem with modulo based sharding is adding more shards and then moving data around accordingly. What do you recommend here for adding nodes to the datastore and keeping the service available - especially if you're trying to keep the setup cheap with minimal nodes? (obviously every setup has trade-offs, but i guess tricks where you swap slaves as temporary masters, etc..)
  • micahwittman · 9 months ago
    A great read, Bret! Thanks.
  • banane · 9 months ago
    I've worked on terabyte-record size databases, both Oracle & MS SQL (more recently MySql, but not at this scale). Granted, they're not as operationally alive as Friendfeed. The solution you have here seems like something that frequently occurs in OLAP systems, that is the manual management of skinny/thin index tables instead of relying on the database to manage it. Interesting, it's a really programmatic solution, in a way, where the pressure is on the Python app to write and manage the data. But, I can see that the optimal feature here is keeping everything keyed on the UID and alive 24x7. BTW big fan of FriendFeed!
  • Jesse Stay · 9 months ago
    I can see cloud DB services like Amazon's SimpleDB working well with this type of infrastructure. Great read, Brett!
  • Edwin Khodabakchian · 9 months ago
    This is very interesting. Thanks for sharing. Do you use memcached at all in the new system?
  • Bret Taylor · 9 months ago
    Yes, we use memcache for a few of the primary indexes and to cache lookups for some types of entities.
  • Ryan Christensen · 9 months ago
    Nice write up. Also I have been a big fan of the UUID approach and getting away from RDBMS. As we have moved larger and larger projects with lots of endpoints and syncs both UUIDs and document based databases or databases like BigTable and CouchDB have been helpful. UUIDs allow your database to be spread across anything and even if using a relational database, architect for scalability like this is the future. Think about in 10 years when we have 1000 times the data, JOINs might be over.
  • capobecchino · 9 months ago
    great! now I optimize engine of meemi.com
  • Prune · 9 months ago
    Ever thought of LDAP (openDS) ?
  • acdha · 9 months ago
    LDAP is a neat idea but it'd be a big step back: the tools are far less mature and the servers aren't optimized for frequent updates (or storing custom data). They also tend to be opaque and hazardous: e.g. OpenLDAP has very hard to diagnose hangs, server performance frequently falls off of a with seemingly minor differences in query syntax, etc.

    Basically you'd be adding complexity, and switching from a known platform with a wide range of tools, operational experience, etc. to something which is narrowly used and not particularly reliable anywhere. It'd be better to consider something like CouchDB, the persistent memcache services, etc. which at least have the advantage of being considerably more modern and designed to do exactly this - something like Tokyo Cabinet is going to be a lot easier to work with than a hairy codebase like any of the extant LDAP servers.
  • Jauder Ho · 9 months ago
    It would be interesting to see LDAP used as the user and shard lookup service.

    Does anyone know if any large sites are using LDAP in their architecture?
  • Josh in California · 9 months ago
    LDAP servers are optimized primarily for fast reads. Even more extensive sharding would be required to keep up with a busy site's write load. You would also end up with the same index problem, where adding or modifying an index will take anywhere from hours to days. (Depends on how much you shard, of course.)
  • Jauder Ho · 9 months ago
    I was looking in the context of user lookups and not as a db replacement but then you run into the problem of having 2 sources to go to. Tradeoffs tradeoffs...
  • gianouts · 9 months ago
    Very interesting. Thanks for writing this up.
  • kmeyer · 9 months ago
    very interesting to read! thanks.
  • Dr. Azrael Tod · 9 months ago
    nice... at work we discussed a similar problem, to be solved pretty much the same way... but since our project is much smaller we thought the efford to keep als this workarounds to mysql stable and performant much bigger then the effort to glue all possible functions into tables.

    at the moment it will work and it should work for some hundreds or even thousands of users more... but if we ever run into scaling-issues (i dont think so) then this will probably be a topic again.
  • Sergey Shepelev · 9 months ago
    > However, as we grew, scaling our existing features to accomodate more traffic turned out to be much less of an issue than adding new features
  • dwight_10gen · 9 months ago
    Interesting. For very similar reasons we began the mongodb project -- which is a schemaless, json-style database. You are right these datastores (including couch) are quite new, but over time, should offer a lot of flexibility, such as the ability for the database to "reach into" these blobs to perform various operations.
  • Geir Magnusson Jr · 9 months ago
    After thinking about it, I'm puzzled by :

    "Since our databases are all heavily sharded,
    the relational features of MySQL like JOIN have
    never been useful to us"

    Doesn't this solution depend upon JOIN to find anything? (Maybe to be fair, "... based on more than a single property"?) Does the sharding preclude ALL multi-property operations?

    geir

    (F.D. - I'm from 10gen and intensely fascinated by document-ish persistence problems... we have a solution called MongoDB...)
  • Bret Taylor · 9 months ago
    We don't do any JOINs in SQL. Looking up entry IDs from the indexes happens in one step, and then we do a second set of queries to look up the bodies of the entities on all the shards. It is a conceptual "join", but not a SQL join. You can't do JOINs across databases in MySQL, and since we have multiple shards, we haven't used JOIN since we launched.
  • Mo · 9 months ago
    Your statement may well be ambiguous rather than incorrect, (I'm assuming the former), but for the benefit of anybody else: you can do joins across databases in MySQL, provided they're accessible from the same connection (i.e., same user/pass/hostname, and privileges allow it).
  • Hingo · 9 months ago
    Mo: Bret is using "databases" in the sense: Separate MySQL process running on separate physical server.

    ...which is the point of sharding: You have more than one server to cater for the high performance need. (Technically speaking, you can use the Federated engine to do JOINs even across 2 separate MySQL engines, but then you would completely defeat the point of getting better performance, and probably it wouldn't even be applicable for most real-world sharded schemas anyway.)
  • Antony · 9 months ago
    Also, the Federated engine which ships "as standard" is broken, which is why on recent binary releases, it is not enabled.
    There is an open-source 'fork' of Federated which fix a couple of the major issues called FederatedX and it is also transaction aware. In addition to that, it may be fairly trivial to patch MySQL to enable combining Federated with Partition which would make the task of sharding almost transparent.
  • Timandes · 5 months ago
    I'm puzzled by this part.....how to get newest records?
    For example, i can get them from table `entities` by:
    Select `body` From `entities` Order By `added_id` Desc Limit 10
    or
    Select `body` From `entities` Order By `updated` Desc Limit 10

    But how to use `index_user_id` to do the same thing?
    Select `entity_id` From `index_user_id` Where `user_id`='asdfasdfasdf' Order By ?????? Desc Limit 10

    I'm bad in English....sorry about that.....
  • scottwilleke · 9 months ago
    Sounds like more features will be coming to friendfeed then. Great!
  • @JoeHobot · 9 months ago
    I like's :) (click)
  • Philipp Lenssen · 9 months ago
    How does this compare to Google's datastore system?
  • Toby DiPasquale · 8 months ago
    Its very similar, save that Google uses BigTable's SSTable for storage and FriendFeed is using MySQL InnoDB tables for storage. This is probably not an accident, since some of the FF folks worked on Google App Engine before leaving to do FriendFeed ;-)
  • Jud Valeski · 9 months ago
    Thanks for sharing. Gnip's storage volume characteristics are vastly different as we don't store long term data, just a window of "real-time" streams, but we've spent a ton of energy around RDBMS avoidance (e.g. we don't have one at all). Our "in transit" data volume requirements have grown to the point that we're having to adjust our thinking around persisting data on disk. Our thinking maps coincidentally well to how you're using MySQL. Basically, managing the "relationships" and model outside the "DB" (in your case across tables), and instead using the DB as a dumb storage facility that knows how to be acidic, and transactional around data access, and that's about it. Those are great qualities, but we're in the same camp WRT needing to manage the schema/model in a more efficient manner. Always great to see under someone else's covers.
  • Jauder Ho · 9 months ago
    Jud, how about something like Prevayler for your needs?
  • Jud Valeski · 9 months ago
    Interesting pointer. I hadn't heard of Prevayler before. At first blush, I
    wonder how it plays in clustered environments. We get the serialization
    aspect of it via TerraCotta snapshots, but they're not transactional in the
    journal rollback sense of the word.

    Thanks for the suggestion.
  • Jauder Ho · 9 months ago
    I would be interested in seeing what your results are if you do test it out. But it sounded like a possible fit to me as soon as you mentioned your requirements.

    Have you spoken to Ari Zilka re TerraCotta? It might be possible to build something thin that sits on top of TC if you are already using it.
  • Jud Valeski · 9 months ago
    we're in good conversations w/ Ari at TC. exploring lots of angles; all
    good.
  • Jauder Ho · 9 months ago
    Bret, what are you using for the graphs?
  • Bret Taylor · 9 months ago
    We use the Google Chart API: http://code.google.com/apis/chart/
  • Jauder Ho · 9 months ago
    Cool. Thanks!
  • Demetri · 9 months ago
    How does this relate to actual user page loads? We run a bit of java that measures the page load times that actual users are experiencing.

    I borrowed the code from: http://www.die.net/musings/page_load_time/
  • chanezon · 9 months ago
    Way to go:-)
    Cool post.

    P@
  • Bitsaver · 9 months ago
    Seems like a waste to be rendering that image dynamically every time...
  • Foobio · 9 months ago
    This type of thing is not a problem for MS SQL Server or Oracle. I guess users of mysql are finally figuring out that even though mysql is good for small php based sites it just doesn't have the power nor the features which are needed in large scale systems. Of course users of mysql don't want to admit that so they keep on using it and inventing ways to get around its limitations.

    In corporate environment that is not possible. It must be possible to access the DB using standard query operations, tables must be normalized etc. Also normally there aren't that many db admins around so you can't use hacks and non standard ways because the db must be easy to maintain. So when you are deadling with db with tens or hundreds of millions of rows ms sql server and oracle are the way to go (there are few others but those are used by most).

    I have seen some attempts to use mysql/postgre in large scale systems and they all have failed. Sure they can get it up and running but there are too many problems. You can get around runtime problems by writing scripts and making hacks into the db itself. That would be actually pretty easy but once you have actually maintain the db (constant changes because of business requirements) it becomes nightmare.

    Eventually those systems (running mostly on linux and unix) decided to use oracle (some even ms sql server even though they didn't have any windows related software). Mysql is still in use in test/dev environments but in production it is always oracle or ms sql server. Some unix based solutions actually use only ms sql server because the dev tools and admin tools are so good. The performance is not an issue. In some cases oracle is faster and in some cases ms sql server is faster. The difference is not huge. I personally think that ms sql server (especially 2008 and 2005) are best products from MS.
  • banane · 9 months ago
    I concur- my experience in corp. DB environments has been the same
  • jj · 9 months ago
    "it just doesn't have the power nor the features which are needed in large scale systems" - and yet google uses it to power some of it's systems ?

    I think the issue is how the system and used and if you have someone that actually understands databases... Maybe that is why "normally there aren't that many db admins around so you can't use hacks and non standard ways" in the corp environments to ensure that the databases are used correctly....
  • eas · 9 months ago
    "This type of thing is not a problem for MS SQL Server or Oracle. I guess users of mysql are finally figuring out that even though mysql is good for small php based sites it just doesn't have the power nor the features which are needed in large scale systems. Of course users of mysql don't want to admit that so they keep on using it and inventing ways to get around its limitations."

    You know, right, that many or all of the page-views in 3 of the top 10 most visited sites in the US involve PHP, and that most of those pageviews also depend on mySQL in some way? You could add a fourth, Google, the #1 site, YouTube, one of their biggest non-search properties, uses mySQL as its database, and so does AdWords which means that most of their revenue depends on mySQL.

    I've seen lots of people in discussions like this assert that everything would be so much better with a real/commercial RDBMS, but I pretty much never see anyone give real examples of real/commercial RDBMS's that have scaled to the levels of concurrent reads and writes that big websites need.
  • Jauder Ho · 9 months ago
    Well, what it really boils down to is the classic argument of build vs. buy. A lot of organizations do not have the programming skillset nor the time to implement non-commercial solutions (mostly because it is viewed as an operational expense rather than a revenue generation source).

    So when there is say a hypothetical product that cost $25k that nets millions in ROI and can be rolled out in a month or two, that's what would tend to be used.

    That is not to say there is no role or use for MySQL and friends. Free goes a very long way when you are bootstrapped. It all depends on the use case and organization.
  • Marty Nickel · 9 months ago
    exactly! the only thing comparable would be MySpace (on MS SQL Server), and you've probably seen the presentations by Dan Farino and others on the challenges they've had.

    it's annoying the noise generated by people who've seem to have not worked with anything of this scope.

    kudos to you Bret! I appreciate your spirit of openness and sharing.
  • joeldg · 9 months ago
    or, you could have used memcachedb like Digg does
    See: http://highscalability.com/scaling-digg-and-oth...
    I think 64,000 reads/sec is worth a look, and if you are already using memcached you can use this out of the box.
  • eonarts · 9 months ago
    Brett -
    Interesting article. You should check out the SF mysql meetup this Monday (3/2/09) -- Brian Aker from Drizzle is speaking. Many call Drizzle MySQL Lite or Fat Free MySQL. It's a style of implementing MySQL for FAST usages. The reason folks use MySQL is not just cause it's opensource (hey there's postgres & it's not used in the same volume as MySQL). Oracle, while being quite $$ is also too slow. All those triggers, stored procs, views, & FKs are expensive.

    What concerns me, as a MySQL DBA is your datatypes. You have a PK that starts with a varchar (725) which is over 244 which is not good. Yet you say you use auto-incrememnt to get the smallest PK you can (in some tables). I find too many developers use bad datatypes and datatypes are your building blocks.

    I think your original schema may not have been that great if yr using large varchars as PKs.

    Also -- you might wanna check into using FNVs to see if they may be faster than binaries. Just google FNV.

    Interesting article & perhaps you should come to the SF meetup & present??

    erin
  • jorgetown · 9 months ago
    Hi Erin,

    Why is a 'PK varchar over 244' not good?

    Thanks
    Jorge
  • qu1j0t3 · 3 months ago
    Long keys are inefficient to store and handle. A primary key over 8 bytes is usually hard to justify; a 725-character key (under normal circumstances) defies common sense. Its use here indicates a denormalised schema.

    Furthermore with InnoDB, a long PK is a liability where secondary keys are used, as in such keys the PK value is used to reference the row.
  • brendano · 9 months ago
    You should make the last two graphs have the same y-axis scale. It'll be easier to visually compare, plus make the new one look way better than the old!
  • Tom Dean · 9 months ago
    Do you shard your indexes as well as your data (shard_on="user_id")?

    If so, how are you dealing with those situations when you need data from multiple users on different shards, but that data needs to be sorted & paginated?

    If not sharding indexes, how are you dealing with indexes getting too large? Replication? Or has there not been a situation where this has yet to be a problem?

    Thanks for a great post!
  • Joseph Smarr · 9 months ago
    Bret-great article. We use a lot of the same tricks at Plaxo these days--most of our tables these days have a binary blob where we store compressed JSON for schema-less data, for exactly the same reasons you mention (some of our tables literally have billions of rows, and we faced all the same paralysis or ops-slave-swap-craziness to touch indexes, so we've tried as much as possible to move away from that). One insight that led us here, which you didn't mention explicitly in your post: SQL is designed for flexibility and arbitrary selects and joins across any column. But in practice, most apps like ours won't ever need to query or join on most columns. So there aren't really that many advantages to using *columns* for those attributes in the first place, at least not in DB terms. While we haven't gone as far as 100% separate tables for indexes, we do try to only use separate columns for fields we expect to index and query on, and shove as much as possible into these JSON blobs. The only downside is sometimes for internal stats purposes or fixing old/bad data, it would be nice if SQL could natively work with JSON-data in columns (seems like a reasonable thing to add in this day and age), but a little clever regexing in our queries handles most of our needs there, and it's wonderful to just be able to easily read and write arbitrary attributes into the JSON from PHP code, since the native data structures match up so perfectly.

    Thanks again for sharing--we should all be doing more of this! js
  • mikelerch · 9 months ago
    Joe Celko would have a heart attack if he caught a whiff of this. ;)
  • joeldg · 9 months ago
    I am having a heat attack even thinking about someone thinking this is a good idea, I really hope people do not follow what they are doing here.. it is like using a database to solve the problem where the database is the problem.. this is plenty solved.. couchdb and those are all fine, memcachedb right now is the best of the bunch in my tests.
  • Sergey Shepelev · 9 months ago
    Do you really use memcachedb as your datastore in real project?

    I was wondering how to get over 1MB max value size for mecachedb user_index key. Splitting?
  • Max · 9 months ago
    Hi Bret,

    very enlightening post which confirms my assumption that more and more people are moving away from RDBMSs with increasing data to store and manage.

    Just curious: Have you every considered using a content repository such as JCR?

    Cheers!
    Max
  • paul_b · 9 months ago
    Very interesting solution to the indexing RDBMS for high volume - and impressive performance gains! Would have been interesting to see how this compares to the Digg solution (memcache) with same data types :)
  • Adam · 9 months ago
    See EventSourcing and some SoC stuff through the eyes of Domain Driven Design. One of Martin Fowler's latest blog posts talks about it. You touched a bit on it when you were considering the transaction log. If you embrace that, you can mitigate some of the concerns you site.
  • Dathan Pattishall · 9 months ago
    Your going to chew through disk space using this method. Innodb adds significant overhead to filespace when using blobs.
  • David Medinets · 9 months ago
    Thanks for writing this article. Nicely Done. Did you consider MonetDb?
  • chetan · 9 months ago
    Hi Bret,

    Great little article and smart idea.

    Just curious as to how you folks deal with post-creation of new indicies on existing data.
    (attempt to create an index on over a million entities)

    Would definitely assume that the process is parallel. Is hadoop being put to use for this task? :)
  • chetan · 9 months ago
    Also, one more question,

    What are you thoughts on read-through/write-through cache for a scaling strategy.
    Have you checked these guys out.
    http://tokyocabinet.sourceforge.net/spex-en.html
    http://tokyocabinet.sourceforge.net/tyrantdoc/
    Seeded by the engineers from mixi (japanese facebook)

    Spent a few hours playing around and it is definitely interesting.
  • Bret Taylor · 9 months ago
    We create new indexes in parallel by running one "cleaner" per shard. We don't want to parallelize much more than that because we don't want to disrupt other processes that are already using the DB. We don't use hadoop. It turns out it is so easy to parallelize, we haven't needed any specialized infrastructure to do so - we just parallelize by splitting into entry ID groups.
  • Alex Popescu · 6 months ago
    Bret, firstly thanks a lot for your post: it offered me THE argument to move forward with some of my old proposals.

    It looks to me that the "cleaner" is a very smart piece in your solution. As far as you explained it, it is an external process that populates/updates the indexes. And that means it has to read sequentially through your main tables to unpack the BLOBs extracting the field values to be placed in the index_tables and I'm wondering how have you implemented this without impacting the performance (in this case sequential access being almost the same issue as adding new indexes). I assume the sharded table is small enough (probably somewhere around 1mil records) in which case adding the index would probably perform decently enough. Or is there some secret sauce in the cleaner that I'm missing?

    I really hope you'll find the time to give a bit more details about it. Thanks in advance.
  • kellan · 9 months ago
    Mo, I think you've misunderstood what Bret means by different databases. Distinct database servers (real or virtual) with distinct IPs, and interfaces, etc. If you can join across those I'd be fascinated to see it. (though I doubt it would be much use in practice)
  • Gavin Towey · 9 months ago
    When you say 250 million entries, at the top, do you mean 250 million entities? On average how many entities per user are there and how many users per shard are you getting?
  • Bret Taylor · 9 months ago
    250 million entries. We store many more entities (the biggest other chunk of entities are comments). We aren't giving out our user numbers at this point, sorry.
  • Sergey Shepelev · 9 months ago
    But can you say how many shards involved?
  • adir1 · 9 months ago
    Interesting indeed, but why not use Hadoop HDB or Hypertable?
  • walterheck · 9 months ago
    The biggest concern I have about this whole thing is that it has been running for just a few days. I would hardly expect the weaknesses of a new system to show up in days. I would love to see you blog in about 3-4 months to see what your experiences are by then.
    Don't get me wrong: it's an interesting approach and I enjoyed reading the article (and especially the comments :) ). Just please do a follow up to this in a couple of months and tell us you experiences then.
  • wuman · 9 months ago
    Hey Bret, thanks for writing this post and sharing your experience with other developers. It's very generous of you and I believe many, including myself, benefit from it.

    What's the size of your typical shard (e.g. number of servers per shard)?

    I'm also wondering what you're doing to offload simultaneous writes (while maintaining consistency) other than sharding. Are you using a MySQL Cluster?
  • Lite · 9 months ago
    Thanks for interesting information!

    I think that distributes key-value storages must be very interesting for you as some people already said. Especially ones that combine persistent storage with distributed in-memory cache.
  • jomagam · 9 months ago
    Hello Bret,

    Thanks for this super interesting write-up. Few questions came to my mind; some of them admittedly low level and boring ;-)

    o Your technique of re-reading the entities after the index lookup gets rid of false positives, but how about false negatives ? What percent of the rows in the index table would you say is out of sync on average ?

    o How do you assign the "id" key in the "entities" table ? Why not just use "added_id" (after making it BIGINT UNSIGNED) ?

    o Why limit "body" to MEDIUMBLOB ? That essentially means that no object can be bigger than 16 megs compressed.

    o Do you really run two instances of MySQL on the same host ? Your sharding example makes it look so.

    /Buzz
  • Bret Taylor · 9 months ago
    1. We are actually ok with false negatives. As I said, the "Cleaner" runs continuously, so that false negatives are fixed within a couple of seconds. From a UI perspective, the user wouldn't see an entry in their feed for a couple of seconds if something crashes until the cleaner runs. For our system, that experience is acceptable (and the loosened constraints mean we can do a bunch of write optimizations that we wouldn't otherwise be able to do).

    2. We have multiple databases, so the auto_increment can't work as a global ID. Furthermore, for practical reasons, we want to know the ID before we insert into the database. We use standard UUIDs for this purpose.

    3. The choice of MEDIUMBLOB was simply based on our current needs.

    4. We run many more than two instances of MySQL on a variety of different hosts. The example above is extremely simplified for the purposes of this blog post. We have run multiple instances on separate disks on the same machine. As long as they have enough memory so you are not going to disk for reads, this method works to a point. In particular, it helps if you want to, e.g., make more shards than you actually need right now so you can accommodate future growth. Put a couple of shards on each machine at first, then expand to 2X the number of machines when growth warrants it.
  • Timandes · 5 months ago
    About q2,i use this to assign it:
    Insert Into `entities`(`id`) Values(UNHEX(REPLACE(UUID(), '-', '')));
  • Alen · 9 months ago
    You only cover the case in which the out of date index matches and is rechecked, but not where index is out of date and is skipped but the real entity matches the criteria. You don't really have ACID semantics.
  • Bret Taylor · 9 months ago
    Yes, we explicitly were not trying to achieve ACID semantics. See http://bret.appspot.com/entry/how-friendfeed-us... for a more detailed explanation.
  • Blaine Cook · 9 months ago
    This is really fantastic, and very similar to a layer I built for Twitter (never got deployed), though much more nicely abstracted. Do you have any plans to open source the management/client layer that you use to access the DB?

    Also, any thoughts on extending the sharding approach to avoid the need to rebuild the indexes when you add (or remove) a shard? I did a lot of thinking in the Twitter context about how to allow for seamless allocation and de-allocation of the index shards, mostly centered around providing user (i.e., shard_on) + shard lookup tables, which is really just one level of indirection, but on the assumption that user data grows much faster than users and will need to be re-sharded much more frequently.
  • Bret Taylor · 9 months ago
    We are definitely open to open sourcing it. It is not quite abstracted enough, and it is a bit intertwined with our code's FriendFeed-isms, but Ben Golub and I talked about it a bit recently, and I am committed to make it happen at some point.
  • Kinlan · 8 months ago
    I am definatly keen on seeing the db abstraction open sourced, because your methods for achieving schema-less db is something I really like and probably need for a future project I am planning.

    Thanks,
    Paul Kinlan
  • David Strauss · 9 months ago
    I've developed a module for Drupal 6 that provides such external, indexed tables as materialized views:

    http://fourkitchens.com/tags/materialized-views

    We're running it on Drupal.org right now.
  • dan · 9 months ago
    You're smarter than me, that's for sure
  • zafarali · 9 months ago
    this is really cool.
    Great work!
    i use mysql and now its become really slow. but im going to try this soon.
  • Vic · 9 months ago
    Thanks for sharing this, it would really help in developing large scale Mysql DBs
  • Andi · 9 months ago
    Thank you for sharing!

    I like the idea to put all (unstructured) data into a BLOB and to use an own table for indexing. And espcially I like that you use MySQL not in the "normal" way but to create an own system on top of it.

    To make it OS would be great for all websites that want to set up their own system instead of being forced to use e.g. the Google App Engine stack.
  • LaptopHeaven · 9 months ago
    We implemented the same concept for storing and indexing XML documents. It works quite well.

    Have you looked into an XML Database, like exist, http://exist.sourceforge.net/
  • Peter Eisentraut · 9 months ago
    PostgreSQL allows adding and removing indexes on the fly without significant blocking. So it is still possible to keep your original schema and optimize at run time.
  • Matthias · 9 months ago
    How do you get a sensible communication pattern using blobs? At least with JDBC I always frowned upon small to moderately sized blobs because you cannot load multiple of them in batch. You only get back blob locators in your entity fetch query and then you need to open, read, close each one of them. 3n+1 roundtrips. Likewise trying to write n at a time.
  • Don MacAskill · 9 months ago
    Great minds...

    We've been slowly moving this way for the last few years ourselves, storing JSON and/or serialized PHP instead of columns. And likewise for the same reasons you chose - MySQL just works and we understand it.

    We haven't yet begun transitioning our biggest tables, though - but now that's food for thought. We should compare notes. :)
  • François Schiettecatte · 9 months ago
    Very interesting, one comment I had, if you don't create a primary key for an innodb table, innodb assigns one for you automatically so you could drop the added_id column from the entities table with no ill effects, see:

    http://books.google.com/books?id=9c-pkLaNmqoC&p...
  • coder · 9 months ago
    I started a Java project based on ideas found here. The project name is schemafree.

    http://code.google.com/p/schemafree/
  • Raul · 9 months ago
    Thanks Bret, well done. To all people who prefer leaving the indexing to mysql and storing everything else on the BLOB field, notice that one of the main purpose of Bret was to avoid downtimes everytime you need to add a new index to a big table. So you either do what was explained in this article, or you need to start changing masters and slaves which is very error prone. I'm the developer of a facebook application with 300K daily active users and I'm already feeling the pain. (although it can be minimized if you spend some time developing the tools to automate the task).

    Bret, one question, you say you store the data compressed using zlib, do you do the compression on the client before sending it to mysql, or do you leave that task to mysql using the gzcompress/decompress mysql functions?
  • Bret Taylor · 9 months ago
    We do compression in the client as a part of the serialization process.
  • Kris Jordan · 1 month ago
    What level of gz compression did you all settle on as the appropriate size/speed trade-off?
  • coder · 9 months ago
    Bret, do you ever use more than one index column on the same index table? If you index on first, last, email for instance, your architecture forces the developer to have three seperate tables? If yes, than I guess your lookups containing all three columns join all three index tables.
  • Bret Taylor · 9 months ago
    We do include multiple columns in the same table. We treat them just like we would treat keys on other tables. Here is one of our standard many-to-one indexes, which indexes two properties (stream_ids[] and sort_time):

    CREATE TABLE index_stream_id_sort_time (
    stream_id BINARY(16) NOT NULL,
    sort_time INT NOT NULL,
    entry_id BINARY(16) NOT NULL,
    PRIMARY KEY (stream_id, sort_time, entry_id),
    UNIQUE KEY (entry_id, stream_id)
    ) ENGINE=InnoDB;

    We don't use any SQL joins in our current system.
  • coder · 9 months ago
    Thanks for the info.
  • zennon · 7 months ago
    Hi Bret,

    Thanks for sharing this. I have a question and I hope that you have a time to answer me.
    Is the sort_time value here reversed (f.e future date minus actual time) so the PK index is ordered exactly as the user stream visually, the newest activity on the top?

    Thanks in advance for your reply
  • Bret Taylor · 7 months ago
    Indexes are stored as btrees on disk, so you can iterate over them both backwards and forwards equally efficiently.
  • Amit C · 9 months ago
    Hi Bret,

    Nifty stuff.

    Talking of latency, the load time for this page is pretty high. Probably want to make the disqus stuff expandable on demand.

    Amit
  • Dilip Krishnan · 9 months ago
    Very useful, Thank you!!
  • lurker86 · 9 months ago
    Really interesting article!
    The topic discussed here is mostly common in large scale and fast changing web 2.0 companies,
    but there is not that cool open-source recipe yet and your way looks promising way!
    and I concerned about this issue using XML support of postgreSQL[1], did you look on this so far?

    [1] http://developer.postgresql.org/index.php/XML_S...
  • Charlie Cheever · 9 months ago
    Is there a reason that you use pickle instead of marshal?

    In a previous project that I worked on, I found that marshaling was faster in both directions (marshaling to string, unmarshaling to dict) and also takes less space than using pickle.

    Ex. For one of the example entities above, the zlb-compressed pickled dict is about 20% larger than the marshaled version.
    In [21]: len(zlib.compress(marshal.dumps(new_entity)))
    Out[21]: 228
    In [22]: len(zlib.compress(pickle.dumps(new_entity)))
    Out[22]: 273

    Do you store anything fancy in your entities that can't be marshaled?
  • Jack Stahl · 8 months ago
    from pydocs:

    "Python has a more primitive serialization module called marshal, but in general pickle should always be the preferred way to serialize Python objects. marshal exists primarily to support Python’s .pyc files."

    That's not to say it's always better, but unless it's a bottlneck of some type you're better off going with the community standard.
  • filesystemman · 8 months ago
    When I was an Engineer at Yahoo! in the late 90's, many of the senior engineers were from RDBMS companies - Oracle, Sybase, Informix.
    How did they deal with (never before seen) scaling? By not using database servers (well, of course we had them - some big ones - but read on).

    I walked into the VP of Engineering's office (a former principal at Oracle), and told him I needed to provision a production SQL database and whats-our-procedure-for-that. Without even looking up, he replied 'you don't need one'.

    That was the day that I learned about the power of filesystem databases. Ever since that day, whenever I approach a data model, I think 'how can I NOT use an SQL server for this?' - and go on from there.

    Over the past 10 years, that fundamental change in mindset has saved me more money, time, maintenance, and frustration than I can possibly ever quantify, and allowed me to survive traffic events that would have required 5x the hardware in a traditional SQL setup (and still fallen down).

    It amazes me that there are no popular books specifically addressing the subject, yet a bazillion SQL books.
  • Pedram · 8 months ago
    Finally some ancient wisdom ;)
  • qu1j0t3 · 3 months ago
    Mind defining "filesystem databases"? Do you mean the traditional key-value stores such as BDB that might have been contemporary with your anecdote? Running a web site on BDB would have been radical, not to mention blog-worthy, in 1997.

    It should be remembered that end of 90's/early 00's was the period that saw MySQL gain ground in this very niche, and retain hold of it to the point that it now powers all major web properties (including, according to this article, FriendFeed).
  • Peter · 8 months ago
    Hi,

    Great stuff.

    Regards,

    Peter
  • sasamat · 8 months ago
    It's really fascinating how variable-record length architecture (schema-less or property bag or what have you) has come back into vogue. Ray Ozzie chose this for Lotus Notes in the late 80's--not for performance, but for extreme flexibility and it's coming back in a big way with CouchDB which should eventually als give excellent performance through parallelization.

    Curiously the prime motivator for doing it here, though, is primarily for performance/scale issues which is counterintuitive given the expense of reading and writing variable-record length data. But what's truly bizarre is to see it implemented ON TOP of a relational DB! Why not a flat file system iike Berkely DB etc. MySQL is fast, but every I/O wastes time in the query optimizer/processor that doesn't exist in a flat file system. Ted Codd is rolling in his grave or ROFLing... not sure which.
  • qu1j0t3 · 3 months ago
    As others have pointed out, MySQL stands up pretty well as a key value store; not only in raw performance, but in maturity and supporting infrastructure. Makes more sense the more one thinks about it.
  • Jon · 8 months ago
    Wow. It's really funny to see how you try to argue for using something "stable". You're argument for not using couchdb is "It's not battle tested." and We like MySQL for storage, just not RDBMS usage patterns." What you are really saying is "Our engineers were bored and wanted to create something cool. So, after doing research on one schema-less DB they decided they could do it better and wrote this Rube Goldberg type mechanism on top of an ordinary RDBMS." Congrats.

    Essentially you've got a gigantic hash table. You might as well use a clustering file system and map/reduce at this point. Gee, is there anything out there that could solve that problem? http://hadoop.apache.org/core/

    Well, I'm sure that isn't "battle tested" enough for your engineers.
  • wow · 7 months ago
    God, can we get anymore arrogant?
  • Matthew Lefevre · 8 months ago
    "The datastore code does the "join" between the index_user_id table and the entities table in Python, by first querying the index_user_id tables on all database shards to get a list of entity IDs and then fetching those entity IDs from the entities table."

    What kind of SQL syntax is this behind the scenes? What if you fetched a 1000 entity IDs, does that then produce a really large WHERE clause that looks like:

    SELECT body FROM entities
    WHERE id IN ('fb46c4ab-9537-4bd9-a7f1-5d7655ab1483',
    '87dd9c65-2b3c-4956-8b1e-6754f647ac41',
    '99bf8a44-6acd-4119-8bc3-5208652f2b8f',
    'e85703ea-b074-4f0f-9eee-e410523b53c1', and so on... )

    ...with 1000 items in the list?
  • Josef · 8 months ago
    Interesting. Years ago the RDF/Semantic Web proponents where talking about triple-stores, graph databases. Now I see that slowly but surely these are being rediscovered in someways. Ultimately, are we just aiming at quad attribute tables, named graphs, (or whatever they are called now)? Or do they just seem similar?

    -- Josef
  • Erik Allik · 7 months ago
    Any plans on open sourcing the code?
  • qu1j0t3 · 3 months ago
    Answered above ^^
  • Zumbala · 6 months ago
    Guys,
    You have money!
    Why don't you go and buy vertica? Check it out.
  • Mark Essel · 4 months ago
    This solution sounds storage intensive but gains so much in agility it's scary. Thanks much forthright architectural share, is any of this knowledge proprietary, or can it be replicated without concern?
  • Bret Taylor · 4 months ago
    Replicate away - that's why we published it. FYI, the entire site is licensed under Creative Commons (http://creativecommons.org/licenses/by/2.5/), and all the source code under Apache 2. I am not sure which type of license is appropriate for a description of a scheme like this, so if you want an official license, you can consider it licensed under both and choose whichever seems most appropriate.
  • Mark Essel · 4 months ago
    Rockin!
    I'll try and run this by database folks far smarter than myself and see if we can use it.
    Thanks again Bret.
  • Dan T · 4 months ago
    Do you wish you'd started out like this or did trying to do it the old fashioned way help you better define the problem that you eventually fixed?
  • supersan · 4 months ago
    Hi Bret,

    it's been about 5 months you posted this article..

    i'm interested to know how it is performing for you since and what kind of tweaks, and problems (if any) you've faced since then..

    thanks in advance for taking the time to reply because this was a really interesting read.

    kind regards,
    san
  • asdonair · 3 months ago
    Bret, have you any plans to open friendfeed.datastore (maybe as a part of tornado)? That's all about writing a code with the same functions by myself. :)
  • twiggy79 · 1 month ago
    I think using a something like a double buffer(like in games) is a better idea.

    If you got tableA then create tableANew with the new columns and the new index will get populated.

    Copy tableA in to tableB...lock tableA..copy the rest of the data in...rename tableANew to tableA. Down for a few seconds without having to give up all the years and years and years of data integrity that are built into RDMS systems.
  • Travell · 1 month ago
    Tornado is open source. Any chance that this gem is going open source? In particular I'm thinking about porting this an implementation of this sharding strategy to PHP. More at http://stonedb.com
  • Ray Wang · 1 month ago
    And here comes my 2 cents on how you can better.

    UUID as index of entity table is a waste. Entity table may want to use more meaningful keys to as index. Clustering and space efficiency are a few benefits to do it. For example, feed entity may have a clustered/primary key of $publisher_id+$publish_time. Having a cluster key enables database store all feeds belong to same publisher as part of the index, making certain queries much more efficient. On top of that it saves at least on index table per entity types.

    Blog post: http://ray2cents.blogspot.com/2009/11/cheating-...

    Given everything else the same, the simpler the better. But what about compound keys? There're situations where compound keys are necessary e.g. feeds of all fans published last month. They can either break the query into queries, each using a single index, and then merge them in code; or it can implement composite keys by adding multiple attributes to index tables. None of the two solutions seems satisfactory.
  • Wojciech Kruszewski · 5 days ago
    Interesting implementation of this idea at:
    http://jamesgolick.com/2009/12/16/introducing-f...
  • shopyop · 2 days ago
    Very interesting. Thanks for writing this up.
  • CyberSkull · 9 months ago
    Interesting, I think I get the gist of it for now. I will have to really take a look at this tomorrow.
  • Joe Dawson · 9 months ago
    Very interesting stuff, I am just getting into MySQL tables at the moment. Great read :)
  • John Rawelins · 9 months ago
    Wow, very enlightening indeed.

    www.be-anonymous.us.tc
  • Jesse McNelis · 9 months ago
    has it occurred to you that you just built a file system on top of Mysql?
    Wouldn't it be easier just to save the data to the file system?
    File systems are very mature, and have great replication, snapshoting and backup facilities, they are also very flexible.
  • Neil Kandalgaonkar · 9 months ago
    Ditto. In their world, as far as I can tell, MySQL's role has been reduced to:

    - sorting items for display,
    - auto incrementing ids,
    - providing a way to create, retrieve, update and delete "files" over the network

    The first two problems seem trivial, but the last is hard to solve with great scale and lots of concurrent users. MySQL works, so why not use it.

    Still, I wonder if there really is no other way to do this, other than MySQL.
  • joshua schachter · 9 months ago
    Filesystems do key/value storing (filename -> file contents.) They don't do the indexing part at all, though.

    I've noticed that when people use the word "just" they're usually about to be condescending.
  • Marcus Breese · 9 months ago
    Using a DB to mimic could be faster than just using the filesystem in some instances. You don't have to open() each file for each access, assuming an appropriate level of caching. Also, retrieving multiple entries could in theory be pulled from disk in only one read as opposed to multiple.

    Also as mentioned below, you don't get indexing if you just use the filesystem.

    The biggest problem is that this isn't strictly a relational / non-relational problem. It's a mix, and so this could be an appropriate solution.
    I might have gone for a different solution, but it seems to work for them.
  • Sleep deprived. · 9 months ago
    I'm a bit sleep deprived right now, but Jesse's comment seems spot on. How is the scheme described in this post any different? (W/the exception that it probably has quirks a mature FS like ZFS doesn't.)
  • claymation · 7 months ago
    ZFS is mature? Hardly :)
  • BuggyFunBunny · 9 months ago
    MySql, minus innoDB (which is to say from the beginning) has never been anything more than a SQL parser fronting the native filesystem; it provides none of the services of a RDBMS. KoderKiddies (whether Mr. Taylor is, I don't know) are ignorant of what RDBMS are good for, and like to write Kode rather than let the database do the heavy lifting which is the only reason to bother using one, so MySql fits quite nicely. That doesn't mean that they've implemented a database application. This is yet another case where the developers should have just written bespoke I/O and been done with it.

    Writing a 'transaction manager' on top of MySql in application code???

    A schema-less RDBMS isn't a RDBMS. Using MySql in order to put 'database developer' on one's CV isn't a good reason to use MySql.

    I considered giving up and not posting, but then I saw that others had been even snarkier, so here it is. The point is: learn what RDBMS is *for* before blindly going off on a Brute Force Development rampage. Then figure out how much strength you'll need in your RDBMS, and get one that meets the criteria. MySql likely won't qualify. Tell your VC the truth. Buy DB2/Oracle/Postgres/SQLServer and the hardware you need. You'll write lots less code, have a spiffier system, and more time for your Significant Other.
  • Clay · 9 months ago
    Speaking from experience, I can safely say that buying a big, expensive, enterprise-class database does not mean you won't have to worry about scalability or will have any more time for your significant other. You will still have problems, they will just be different problems.
  • eas · 9 months ago
    I don't think entrepreneurs choose any technology so they can put it on their resume, they choose it because they think it will allow them to successfully execute their vision for a new business. Giving up more of their company to a VC so they can buy a "real RDBMS" can be incompatible with that goal, since every dollar they take means giving up a little control, though in this case, it might not be an issue since a significant part of their funding has been provided by the founders.

    Perhaps it was a mistake for them to start with MySQL, but perhaps it wasn't. It sounds like they knew its limitations and had an idea of how to get around them. You say they should "learn what a RDBMS is for" before going off on a "brute force development rampage," but remember, again, they are entrepreneurs. The goal is almost certainly not to be 67th-level DB2-nijas, the goal is more likely, "deliver an awesome service that people lost of people dig and make a nice chunk of money along the way, or have fun trying" Diving in with mySQL, the devil they knew, meant they could get down to work. A real DBMS would have required negotiating licensing and then learning all of its quirks because, lets face it, the documentation for commercial databases is never completely accurate. The docs may tell you what is supposed to happen, it doesn't always tell you what actually happens, or adequately convey when performance will go to hell. All that specialist knowledge takes time that might be better spent on the end product

    You seem to be suggesting that if they'd chosen right in the beginning, they'd have done less work on their data layer now and have cleaner and more maintainable code. You might be right, at least in hindsight, but they were starting something new, the requirements were no-doubt less clear in the beginning than they are now. But maybe DB2, Oracle, Postgres or MS-SQL could have carried them to their current traffic levels and database size without looking like such a kludge. The thing is, if Friendfeed keeps growing at a healthy clip none of the real RDBMS are likely to see Friendfeed all the way through without starting loose more and more of their flawless 3rd-normal relational beauty.

    Real RDBMS's just don't work at the scale of a highly successful community or even ecommerce website without resorting to all sorts of perversions to the relational model. People start partitioning them up the wazzo, joins and transactions end up floating into the application layer. Consistency comes eventually. I'd be genuinely interested if you can name one compelling counterexample. It won't be Amazon. It won't be EBay, who have relegated their relational DBs to not much more than dumb data stores, at least for transaction processing. It won't be MySpace, who have partitioned the hell out of MSSQL. It won't be Facebook, which has a highly partitioned bunch of mySQL instances and various custom services to provide eventual consistency between them. It won't be Linked-in, who've built a special service to hold their social graph, and another to serve as a robust key-value store. The only thing I've heard of recently that might fit is Salesforce.com, but they have much higher ARPUs to work with, and the ability to cleanly partition by customer. I'd certainly sleep better knowing that our dev-team has a clear path forward if our application needs to scale to meet our business goals.

    Sure, Friendfeed may not be big enough for that yet, and they may never get there, but I'm pretty sure that's what they aspire to, and if they aspire to that, the right thing to do with their data layer is probably to start iterating on an architecture that will be able to scale all the way up.

    I'm not defending mySQL here at all -- I don't want to touch the thing if I don't have to -- but I find it pretty difficult to accept the premise of your criticism.
  • banane · 9 months ago
    I believe Bret really clearly states that he didn't want to buy into a software package like MS/SQL/Oracle, and also didn't have the skills on staff to manage those beasts. There is a vein here of rebuilding the wheel, but my response is, hey, it works. They got it to work, and it scales. It may be re-inventing a database app, but whatever. I'm interested in how web2.0 systems like FF work with what they're given, not necessarily what is "best" by convention.
    But I am interested in the differences between the homemade stop-gap measure here and how enterprise databases handle it. I've seen some very fast processing done on Oracle,in live environments, I'm just not sure how it's done.
    I work with rather large static databases, and it's all query optimization. We flip the DB to read/write depending on the batch manner. The tables are highly indexed, but I know in general building a "running index" like he suggests is more practical than managing an index on a 250M record table.
    One comment- someone here suggested LDAP, which is a great lightweight (ha) method of storing simple user-identifying information. I've also worked in large messaging systems that used the file server instead of the DB.
  • BuggyFunBunny · 9 months ago
    >> didn't have the skills on staff to manage those beasts

    That sort of like my neighbor. He knows how to cut up a chicken for dinner. He also thinks that qualifies him to be a neurosurgeon.

    >> But I am interested in the differences between the homemade stop-gap measure here and how enterprise databases handle it

    In the case of DB2, anyway, they have almost everything but the code itself available on their website explaining how the engine works. It helps if you've Elmasri or the like first.
  • Kamil Szot · 6 months ago
    >> >> didn't have the skills on staff to manage those beasts

    >> That sort of like my neighbor. He knows how to cut up a chicken for dinner. He also thinks that qualifies him to be a neurosurgeon.

    You'd still be better of alive with your neighbour drilling hole in you skull to alleviate the pressure and not dead because there wasn't neurosurgeon in vicinity to do the same thing more professionally.

    Also you've missed the point. There's actually no skill or reason in throwing you organisation upside down on the quest of spending loads of cash for replacing what you mostly successfully use with REAL rdbms and hire rare and valued professionals that can cope better or worse with these enterprise behemoths.

    You present opinions that can probably be associated with archetypical 'IT worker with no business skills'. Sorry for being nasty, it's just what came to my mind.

    Commercial rdbms are only for stagnant organisations that are made of cash and don't need to be nearly as cost effective as web 2.0 companies usually are.
  • John · 9 months ago
    They used a hammer to drive screws into wood but found that it didn't work well. Then they invented a machine to strip the threads from the screws so that they could continue to use the hammer.
  • seemsArtless · 9 months ago
    I work for one of those companies ( hint, we developed SQL back in the day ) and wish we could use "You'll write lots less code, have a spiffier system, and more time for your Significant Other" as the tag-line for our product!
  • BuggyFunBunny · 9 months ago
    If it looks like HAL, then I can understand that. SQL was created by Chamberlin, not Codd, and it shows. Moreover, HAL has habitually catered to the COBOL/VSAM folk from the mainframe product, which is to say, having a SQL parser in front of the filesystem! I just spent 8 years with one of HAL's clients, who took on DB666 (to spiff up a 30 year old COBOL application, need to look modern, you know) by writing a boatload of COBOL to intercept all read/writes to DB666 and convert both the data to/from VSAM as well as implement ACID in their own COBOL. All of this extra runtime overhead (not to mention creating and maintaining the COBOL) in addition to what the DB666 engine did. But they could then claim that they'd "implemented an RDBMS system for the future". A lot of that goes on in the HAL world of databases. And, to a lesser degree, with FF; as this thread has demonstrated.

    Do it Once, in One Place. Kind of like DRY.

    On the other hand, until they went crazy with xml in v9, the Linux/Windows versions were spiffy. If you take the time to understand what that engine will do, you will understand that the closer to the data you take your ACID, the more efficient you'll be. DB666, for example, is highly optimized C (on Linux/windows). So your choice: layer on your own version of highly optimized C (or something nicely glacial) while somehow short-circuiting the engine's routines, or just letting the engine do the work. The advantage to vanilla MySql is that it makes no pretence, if you look closely, of being anything more than a SQL parser. So you can create an application which is a modern day monolithic COBOL/VSAM application, but with OO trappings. You've still wasted time and money.
  • Mohit Soni · 9 months ago
    Nice read Bret!
    Do you use memcached in combination with Zend or do you use XCache, APC, eAccelerator ?
    I was wondering which combo among the above will give optimum performance.
  • Jauder Ho · 9 months ago
    I believe FF uses Python not PHP.
  • Mohit Soni · 9 months ago
    Hi Jauder Ho,
    I didn't know FF runs on Python. Still can you suggest a combination between memcached + (Zend or XCache or eAccelerator or APC) for optimum performance ?
  • David Cramer · 9 months ago
    APC is a very common choice. They all more or less do the same thing.
  • Mohit Soni · 9 months ago
    Hi David,
    I agree APC is a very common choice. But does APC gives the similar performance benefits like Zend or eAccelerator or XCache) ? Actually, I would like to know which solution is best.

    Thanks
    Cheers
  • David Cramer · 9 months ago
    Well this probably isn't the place to find that opinion. They're all close enough that it's not going to matter that much unless you're scaling on the level of Facebook.
  • eas · 9 months ago
    Its a huge win just to avoid the cost of reparsing the PHP files on every request. I did some testing with Wordpress. Without APC or eAccelerator, startup times on each page view were something like 200-300ms, which was at least half of the total request procesing time. With an opcode cache it dropped ~10x making it a minor part of request processing and ~halving overall page generation time.

    But really, if its really important to get the very best, you should be doing your own tests with your own application. It shouldn't take too long.
  • Jauder Ho · 9 months ago
    The answer is YMMV. The best way is to test all 3 with your app and pick whatever works the best.

    I've been working on Magento (www.magentocommerce.com) over the last couple of days and it appears that XCache and APC return about the same level of performance. Surprisingly, memcache does not improve performance in this case but I am sure this is more a reflection on how the app works and not memcache itself. Unfortunately, it looks like this is just a really slow app for now (so slow it makes Rails look like a rocket).

    PS. Magento uses Zend under the hood so that may be a factor.
  • Mohit Soni · 9 months ago
    I think you guys are right, I should test my application using all the solution and then choose the one that suits me best. Anyways the application my team is working on will change the way the world see the web. It will be the next big thing in the Web Computing Platform.
    Our application is shortlisted in Google Product Prodigy finals. For more info check :
    www.google.co.in/intl/en/productprodigy/finalis...

    Our product name is Flare and our team name is 2 Smart 2 Curious.

    Thanks for your views
    Cheers
  • MC.Spring · 9 months ago
    Great article!

    Thanks for your shared.
  • David Cramer · 9 months ago
    As some have stated this really takes the use of an RDBMS out of the picture, but until there are things like Big Table which are actually on the open market, and are reliable, this seems to be a very common approach to handling large scale database applications. It's what we do, and will continue to do for any application which expects to see a large amount of traffic. However, rather than creating tables for each index (although this is a decent approach to handling parts of the denormalization set), we try to keep each attribute which needs indexed outside of the JSON BLOB, and simply leave it up to MySQL to index.
  • Somebody · 9 months ago
    The similar idea was implemented in this project: http://rubernate.rubyforge.org/index.html
  • Term Paper · 6 months ago
    but until there are things like Big Table which are actually on the open market, and are reliable, this seems to be a very common approach to handling large scale database applications. It's what we do, and will continue to do for any application which expects to see a large amount of traffic. However, rather than creating tables for each index (although this is a decent approach to handling parts of the denormalization set), we try to keep each attribute which needs indexed outside of the JSON BLOB, and simply leave it up to MySQL to index.
  • Paco · 9 months ago
    When you already have performance problems with only 250M records, your probably doing something else wrong. The solution is a stupid idea an will definitly cause performance problems when your database is getting really big. You should have asked yourself why the performance was bad and optimized that.
  • Kinlan · 8 months ago
    you sir, are a fool!
  • Otto · 5 months ago
    Sorry, but he's correct. I mean, if you are scaling badly at a million records, what do you do when you get a lot more?

    I deal regularly with a single table that has (last I checked), 6 billion rows in it. That's a B there, as in 6,000,000,000 rows. I've had to modify it several times. And I never have had to take down the thing to do it. Queries for any given record run more or less instantly (under a tenth of a second for our most common query, about a quarter of a second for a least common one). One time I did have a long running query on it. I had to JOIN it to a couple other tables to get the data I needed. That run took about 3 minutes, I think.

    Databases are designed to hold large amounts of data. At least, good database systems are. If you're running a major enterprise, buck up for a commercial RDBMS, because continuing to program around limitations in your free one is a waste of your valuable time. Don't get me wrong, mySQL is fantastic, but it's not truly large-scale enterprise level software quite yet. It's perfect for smaller databases doing work in a distributed fashion. But if you need to run a multi-terabyte DB, then mySQL will simply get crushed under the load.

    Plus, you end up having to write all your own code in order to scale across multiple database servers. Look at WordPress.com. Those guys are literally running PHP code on all their systems that talks to different database servers depending on what data they're looking for. I shit you not: http://codex.wordpress.org/HyperDB . It's as if Load Balancing and Database Replication were not already solved problems... which they are not, if you're using mySQL. You end up having to use hacks like that. Yes, the hacks clearly work, but would you rather spend your time writing and supporting all that hacked up code, or just paying a few thousand bucks and plugging in something that simply works right off the bat? What is your development time worth anyway?
  • Kinlan · 5 months ago
    That made me laugh... These guys are developing applications that don't sit that well with traditional databases, enterprise or not. The whole article is describing a situation where they say that they can't stop access to the system to build a new index, try not locking tables in oracle or sql server whilst adding an index.

    A single table with 6 billion rows, that is pretty cool, but what happens if you need to index a column that you previously didn't have indexed. Do you have to stop access to updates and selects from the table? If you get any interruption then the service is down, friendfeed are saying that they can't have the down time that is associated with potentially hours creating an index on a table.

    The idea is quite innovative and looks likes that it performs well for the needs that they have - have you seen how fast the service runs - searches are near instant. Like you said a join on another table in your DB took 3 minutes, Friendfeed can't have that on their system - all queries have to be near instant and I presume not intensive on the system; imagine 2000 people running the same 3 minute query.

    The principles used in this article work well in Enterprise Database's too. Trust me I am using this method now. The cool thing about it is that each table(pseudo-index) is really skinny in that there is not too much data in each row.

    I am not saying that there is no place for Enterprise Databases nor am I saying that this method is suited for every problem. But it does solve a set of problems that Friendfeed are facing, that traditional ways of deal ways of doing RDBS don't scale to what they need.

    I have been around the circuit long enough to know that this is a good idea and works well, but is not a solution for every problem. I don't see anyone cussing out Google for their BigTable system, which when you think about what Friendfeed and Google do they aren't too dissimilar, indexing massive amounts of data and allowing you to query it.
  • Otto · 5 months ago
    I'm sorry, but you are mistaken. Oracle and Informix can both build new indexes without locking tables. I don't know about SQL Server, but then I don't use Microsoft products a lot. Admittedly, the tables become read-only, unless you use some form of transactional locking (meaning that selects into the table will take longer, but updates will still happen temporarily, and then permanently when the index finishes building). Depending on the particular index. updates might still work as long as you're not updating a column being indexed.

    As for the three minute query, that was a one time special case. If I had to run it a lot, I'd add the columns I joined into the table themselves, or I'd make a continually updated view or something similar.
  • Kinlan · 5 months ago
    I am not that mistaken you agreed with me "the tables become read-only". Friendfeeds context is that they are probably inserting 10's of thousands of items per minute, and updating 1000's of times a minute and are updating columns that are being indexed at the same time

    The other thing as well is that the couple droping a couple of "thousand" bucks on a DB you mention is normally for 1 CPU for enterprise DB's, scaling these systems using RAQ or what-ever other systems cost and they cost a lot per node. Then you add things like full text search, which when you look at Oracles offering IS just a hack on the database and has a whole host of its own problems.

    All I am saying is that this method works and it works well regardless of the Enterprise DB or not. It suits Friendfeed, it is cheap, it scales well and it easy to create new indexable elements drop other indexes without affecting the live system.
  • Pedram · 8 months ago
    This solution works for this site because it doesn't do much, that solution is something sites like ours will have to do, which are 10x more robust.