
Power tools: Sorting through the crowded specialized database toolbox

With so many choices today, matching database to need isn’t getting any easier.

Lisa Vaas – Apr 27, 2016 8:00 am

Choosing a database is pretty similar—it's all about the right fit. Credit: Flickr user: Sven Slootweg

When you think of game development, the first thing that comes to mind probably isn’t a database. But in the world of Jamaa, the setting for WildWorks’ massively multiplayer online kids’ game Animal Jam, a database keeps millions of cartoon animal characters frolicking and the cartoon trees from crashing down. The database chosen for this job was a specialized, non-relational database from Basho called Riak, one among the herd of new databases that have risen to handle Web-generated gluts of unstructured data.

The database landscape is increasingly complicated. As of April, Solid IT’s DB-Engines initiative was tracking 303 separate relational and non-relational databases. In the golden years of relational databases, benchmarks such as TPC could theoretically give you some sort of way to compare databases directly. But today, it’s difficult to assign a one-size-fits-all measurement to the world of non-relational databases such as Riak and Apache Cassandra (the distributed database project originally developed at Facebook). WildWorks ran its benchmarks and decided on Riak for Animal Jam, and Uber did the same for its dispatch platform. IoT car tech company VCARO decided the exact opposite: Cassandra beat Riak at handling vehicle-generated sensor data. Software company Nuance Communications opted for something else entirely, choosing Couchbase for handling speech and imaging apps.

The “why” behind decisions like these is as complicated as the database technologies themselves. It may hinge on which two of the three CAP theorem guarantees (consistency, availability, and partition tolerance) a business values most. The tipping point could alternatively be which database handles software containers or what skills you already have on hand. The list of factors is seemingly infinite.

Why WildWorks runs on Riak

The name creation page for WildWorks’ Animal Jam, an MMO targeted at kids to teach them about wildlife.

WildWorks (originally called Smart Bomb Interactive) launched Animal Jam in 2010 through a partnership with the National Geographic Society. Since then, it has become the fastest growing gaming site in the US. Beau Brewer, Web director and software architect, says they’re supporting 50,000 concurrent players at any given time.

WildWorks’ tale is like that of many online businesses: it started out on the desktop, but it got to the point where the company had to have a mobile presence. “That’s a duh,” Brewer says. “Everybody was going mobile.”

WildWorks developed a version of Animal Jam called Play Wild for Android and iOS mobile operating systems. While developing, the company realized it had a captive audience of millions. WildWorks didn’t want to lose any of the herd by forcing it to cross bridges while migrating to mobile. It would be better to see players who already had accounts just pick up a device and seamlessly start playing.

In other words, WildWorks needed a single login for both the online and mobile games. This setup also needed to scale, big time. “We’d never played in the mobile space,” Brewer says. “I knew it could start a wildfire and just go off.”

Besides single sign-on and the potential to scale like mad, Brewer says that the company chose Riak because it’s…

Fault tolerant.
Written in Erlang, a general-purpose language that supports concurrency, distribution, and fault tolerance. Arguably, it requires more discipline to learn and hence requires more experienced developers, Brewer says. That’s a plus in his book. “This helps preserve the quality of code and stability of Riak, in my opinion,” he says.
HTTP API and native Erlang interface. Creating an API-centric app helps build functionality that can be used in any device, be it browser, mobile, tablet, or even desktop.
The option of using Solr to search/index data. Solr, pronounced “solar,” is an open source enterprise search platform written in Java that is said to be “blazing-fast.”

By contrast, this is why WildWorks passed on DataStax Enterprise’s Apache Cassandra:

Cassandra splits a hash ring across a cluster, but it’s still a single ring, Brewer says. With Riak, you have independent clusters that act as separate individuals connected by a kind of pipeline. With Riak, WildWorks could slam one replication with searches and it wouldn’t affect the actual operating cluster. Imagine a player trying to log into Animal Jam, hitting on the primary cluster, while a business analyst queries how many active users there are. She could run on a separate cluster with Riak and not affect the live player cluster. That’s opposed to Cassandra, which in WildWorks’ testing…
Allowed writes when it experienced a failure but potentially caused reads to fail. Riak always performs reads and writes, Brewer says. The results might not be the most up to date, but WildWorks can live with that.
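
To make the “hash ring” idea concrete, here is a minimal Python sketch of consistent hashing, the general technique of splitting a ring of hash values across the nodes of a cluster. It is not Cassandra's or Riak's actual partitioner: the `HashRing` class, the node names, the sample key, and the choice of MD5 are all illustrative assumptions.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (an integer derived from MD5)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: each node owns the arc that ends at its position."""

    def __init__(self, nodes):
        # Sort nodes by ring position so lookups can use binary search.
        self._ring = sorted((ring_position(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the first node at or after the key's position, wrapping around."""
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical node names and key, for illustration only.
ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("player:12345")
```

The appeal of the ring is that adding a node only moves the keys that fall on the new node's arc; every other key stays where it was. In this toy version the whole cluster shares one ring, which is the property Brewer contrasts with Riak's separate, replicated clusters.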

Why VCARO opted for Cassandra

VCARO is on the car side of the Internet of Things (IoT). The company deals with tons of sensor data, such as engine diagnostics, teens’ driving behavior, and real-time vehicle location.

It never had to migrate from a traditional database; the company started out on DataStax Enterprise Cassandra. VCARO was lucky to already have people with Cassandra skills in place, which helped push the decision. It’s important, after all—you have to have the resources to implement Cassandra correctly, and most don’t know how to properly set up a data schema, according to VCARO CIO Zach Altneu.

He says that VCARO went through an extensive selection process; Riak was the closest competitor. The company also looked at the more “obvious” solutions (i.e., relational databases) including Oracle, VoltDB, and Amazon Redshift. VCARO even evaluated specialized data management platforms like Google Cloud Bigtable and Amazon DynamoDB.

Ultimately, much of the decision centered on cost. Cassandra is open source, so it costs very little. VCARO uses Digital Ocean, a cloud hosting company for lightweight Linux boxes. “You can set up a Cassandra cluster for virtually nothing if you wanted to,” Altneu says. The company pays less than $1,000/month. A similar system through Amazon would have cost $10,000/month, he said, and “we have faster performance.”

Of course, your mileage may vary. VCARO benchmarked performance with one specific system: a straight Cassandra cluster. In this particular use case, the company saw performance that was five to six times faster.

The database also had to handle big data volume and VCARO’s time-sensitive, unstructured, voluminous data flowing from sensors. That pointed to a nontraditional database that could scale linearly and could house billions of rows of sensor data.

One of the things VCARO particularly liked about Cassandra was integration of tools like the Apache Spark open source cluster computing framework and Apache Solr search. DataStax even started adding support for Docker, a software container system that packages applications into sandboxed environments that share the host's Linux kernel. You can run as many containers as you want on bare metal, Altneu says, and you can ship them anywhere that supports Docker.

VCARO didn’t want a database-as-a-service (DBaaS) model, such as Google’s or Amazon’s. DBaaS vendors set up the database, configure the software, and maintain it, including all operational activities. “For us, that was a negative,” Altneu says. “We wanted to have full control over the entire operation.”

Other options didn’t hold up under examination for other reasons:

VoltDB is a distributed, in-memory, massively parallel “NewSQL” relational database that makes SQL scream. But while it’s fast, it has memory limitations, Altneu said. And VCARO found it didn’t have a lot of options when it came to getting data to persist on the disk. He said you can do it, but that’s not what it’s meant for. That can be a problem when you really do want to keep data around.
VCARO liked what Basho Riak had to offer: a “pure implementation of Amazon’s Dynamo design,” Altneu says. But the company preferred DataStax’s Cassandra implementation. It had all you got with Riak, plus community support.
VCARO does use MongoDB to store things like user profiles and things that don’t have high volume. That boils down to convenience and MongoDB’s famous ease of use. VCARO could store such things in Cassandra, but some of their contract developers just don’t have the skills to work with the database.

Why Nuance Communications runs Couchbase

Nuance Communications makes speech and imaging technology, including server and embedded speech recognition, telephone call steering systems, and medical transcription systems.

Tom Coates, senior principal architect, started to look at the new databases while at his former gig at a company called Varolii (which was acquired by Nuance in 2013). Varolii was in a similar line of work as Nuance. It made products that enabled businesses to automate notifications via text messaging, fax, or e-mail.

Prior to the acquisition, Varolii was considering jumping out of the all-Oracle environment it had used for 10 years. The impetus came from a few things, Coates said. For one, the company had developed large, server-side, “monolithic” platforms on Oracle. It was looking at moving into smaller, more easily deployed and managed areas, “where a developer could look at one piece without having to understand the entirety of a big system,” he says.

On top of that, “Oracle is very hard to deal with” from an engineering perspective, Coates says. It’s hard to install and use. It requires a lot of work and know-how.

Without the due diligence “that we should have done,” Varolii picked up MongoDB for a small project. People were using it and seemed to like it, Coates noticed. That meshes with the database’s reputation, as MongoDB is known for being easy to use. This solution worked fine, at least for this small project. Varolii needed to design a way to represent what the company called “events,” which were essentially aggregate histories of communications with end consumers.

It’s not that Varolii didn’t already save those events in larger legacy databases, but it wanted to do something different. The company wanted to look back at all customer interactions and figure out how each consumer liked to communicate. Did they press 2 for Spanish, for example?

The sticking point about MongoDB was that Varolii wanted to expand to a second data center. That’s where it ran into trouble with MongoDB, which didn’t offer bidirectional replication. The database had some unidirectional functionality—it would let you read from multiple sites but not write to them. Soon, Varolii stepped back and thought about the situation. The company “liked the NoSQL scene,” Coates says, but it hadn’t picked the right product. Varolii drew up requirements, a list of the must-haves and the nice-to-haves.

Must-haves included:

A multi-master database replication setup that would enable data to flow both ways, enabling data storage on a group of computers with any member of the group able to update it.
Something that supported horizontal scaling (which most nontraditional database providers support) where you can scale out a cluster by adding more machines to it instead of having to buy yet more pricey hardware.
Vendor support. Varolii liked open source, but it needed support, given that this was a live production system.
The company intended to, but eventually did not, run the system on virtual machines. Hence, VM support was initially a must-have.

Varolii did consider MongoDB, but lack of multi-master replication immediately ruled it out. It also considered Cassandra, Couchbase, HBase, Riak, and Redis. Couchbase and Cassandra survived the must-haves gauntlet. That left nice-to-haves, including a given database candidate’s:

Ease to set up.
Ability to be easily managed by developers on their own.
Good documentation.
Community support: preferably from the open source community.
Something with a query language. A pure key-value database would have been OK, Coates says, but Varolii needed something to query into the value end of key value.

Cassandra and Couchbase again came out neck and neck—each offered all those things. So not knowing enough about either database at a deep level, Varolii took the project initially written in MongoDB and told the developers to go write it in both databases to see what they could find out.

That’s where the decision was made. Couchbase was much easier to set up and get running, Coates says. The ease of use was “leaps and bounds better.”

Another nice-to-have, Couchbase let the developers write MapReduce jobs. MapReduce, as IBM describes it, is “the heart of Hadoop.” In fact, relational database stalwarts Microsoft, Oracle, and Teradata are among those selling some form of Hadoop integration to connect SQL-based analysis to the world of unstructured big data.
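
For readers who haven't met the pattern, here is a small sketch of MapReduce in plain Python. It is not Couchbase's view API or Hadoop code: the event records, the field names, and the `run_mapreduce` helper are made up for illustration, loosely modeled on the communication “events” described earlier.

```python
from collections import defaultdict

# Hypothetical event records; the fields are assumptions, not Varolii's schema.
events = [
    {"consumer": "c1", "channel": "sms"},
    {"consumer": "c2", "channel": "email"},
    {"consumer": "c1", "channel": "sms"},
    {"consumer": "c3", "channel": "voice"},
]


def map_event(event):
    """Map step: emit one (key, value) pair per record."""
    yield event["channel"], 1


def reduce_counts(channel, counts):
    """Reduce step: collapse all values emitted for one key into a single result."""
    return channel, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Group mapped values by key, then reduce each group independently."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


channel_counts = run_mapreduce(events, map_event, reduce_counts)
```

Because each record is mapped on its own and each key is reduced on its own, both steps can be spread across many machines, which is what makes the pattern a fit for large, unstructured data sets.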

Other issues Varolii had with Cassandra included the fact that Cassandra isn’t really a document database. It proved “more difficult than that,” Coates says, given that it was a columnar database. Storing data in columns instead of rows means the database can more precisely access the data it needs to answer a query. Not having to scan and discard unwanted data in rows can increase query performance, particularly in big data sets. But Varolii found that you had to pre-create indexes to query in Cassandra. That made it tough to do certain things. For example, the database didn’t support scanning ranges of data: you couldn’t query for all interactions between the first and last of a month.
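
The range-scan complaint is easier to see with a toy model. The Python sketch below is not Cassandra code and says nothing about how Cassandra stores data. It only shows the general point that a “between two dates” query is cheap when records are kept sorted by date, which is the kind of index Varolii says it had to set up in advance. The interaction log and the `between` helper are hypothetical.

```python
import bisect
from datetime import date

# Hypothetical interaction log, kept sorted by date.
interactions = sorted([
    (date(2016, 3, 28), "sms sent"),
    (date(2016, 4, 2), "call placed"),
    (date(2016, 4, 15), "email sent"),
    (date(2016, 4, 30), "pressed 2 for Spanish"),
    (date(2016, 5, 1), "fax sent"),
])
dates = [d for d, _ in interactions]


def between(start: date, end: date):
    """Return interactions with start <= date <= end, using binary search."""
    lo = bisect.bisect_left(dates, start)
    hi = bisect.bisect_right(dates, end)
    return interactions[lo:hi]


# Everything between the first and last day of April 2016.
april = between(date(2016, 4, 1), date(2016, 4, 30))
```

Without the sorted `dates` list, the only way to answer the same question would be to look at every record, which is the kind of query that gets painful at scale.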

What happens when someone cancels a ride? Credit: Uber

Why Riak drives Uber

Uber’s choice of Riak for its e-hailing dispatch circles us back to the same rationale that drove WildWorks: a crazy rate of growth coupled with a need for scalability and availability. Without availability, riders are left standing in the rain while drivers are stuck in empty cars.

Senior Staff Engineer Matt Ranney had worked with Riak before Uber. When Uber was rebuilding its dispatch platform, his team wanted to build a scalable, highly available architecture that was also easy to maintain by people who are all relatively new to the company.

As it is, Uber now has five times the engineering staff it had when Ranney started 1.5 years ago. There’s not a lot of institutional knowledge to be had. The company had to build something that anybody could figure out and jump right into. That meant a dispatch system with no special nodes, something you didn’t have to treat with kid gloves. Uber needed a setup it could just throw computers at to get the necessary scalability.

“We can’t migrate, and we can’t redesign as we handle the growth,” Ranney says. “We just want to buy more computers.”

Uber also wanted something that would let it take nodes down any time in production to tweak the system or to fix it when it breaks. That led to Riak, and it also ruled out HBase, a wide-column store based on Apache Hadoop and on BigTable concepts. That’s because:

HBase wasn’t suitable for running regular failure tests against. Some people will notice when a node goes down.
Since HBase is part of the Hadoop ecosystem, you need to run name nodes, which store metadata.
Hadoop isn’t a homogeneous, masterless architecture like Riak or Cassandra. Specifically, there’s an availability penalty in terms of the CAP tradeoff.

Cassandra was a no-go since Uber needed causal consistency. Imagine dispatch: both rider and driver have a state that Uber needs to match up, and this entails constant read-modify-write cycles. If you’re a driver looking to pick up a fare and that rider cancels, two states are competing. You want to pick up the fare and change the state to “accepted,” but the rider wants to move the state to “canceled.” Who wins? With Riak, Uber can detect such conflicts and piece the states back together. Cassandra doesn’t have causal consistency to the extent that Uber needed, Ranney says.
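
How does a database notice that “accepted” and “canceled” are competing rather than one simply following the other? One common answer is a vector clock, a per-writer counter attached to each version of a value. The Python sketch below is a generic illustration of that idea, not Riak's implementation and not Uber's code; the actor names and the `bump` and `compare` helpers are assumptions.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen every update that clock `b` has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify two versions as equal, ordered, or concurrent (a conflict)."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a_newer"
    if descends(b, a):
        return "b_newer"
    return "conflict"


def bump(clock: dict, actor: str) -> dict:
    """Return a copy of `clock` with `actor`'s counter incremented."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated


# Hypothetical trip: dispatch offers it, then driver and rider both act on
# the same starting version without seeing each other's write.
offered = {"dispatch": 1}
accepted = bump(offered, "driver")   # driver moves the state to "accepted"
canceled = bump(offered, "rider")    # rider moves the state to "canceled"

outcome = compare(accepted, canceled)
```

Here `outcome` is "conflict": neither clock contains the other, so the two writes were concurrent and the application has to decide how to merge them. Comparing `accepted` with `offered` instead gives "a_newer", an ordinary update with nothing to resolve.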

And the oversimplified conclusion is…

The tale of Uber getting to Riak is just one example. Scores of organizations are making the same journey to figure out how to handle all the unstructured data pouring out of the Web. It’s clear there’s no one map.

Ask another dozen organizations and you’ll get convincing use cases for the databases rejected by companies like WildWorks and Uber, et al. There are compelling reasons why HBase, Amazon Dynamo, or MongoDB may be the best thing since sliced bread.

So who do you believe? The only sure-fire strategy is to start by figuring out who looks back when your business peeks in the mirror.

Lisa Vaas is a freelance technology journalist and blogger based in Boston who writes about database technology, information security, careers, and all things pancreatic.

