Light a FUSE under your Riak cluster
As an experiment, we hacked together a FUSE driver in Ruby that lets you mount your Riak cluster as a file system and browse around. We're not sure how useful it really is, but it was fun to do nonetheless.
If your Riak keys look like /foo/bar/logo.png, you will be able to ls and cd around the (simulated) directory structure, and cat files (keys).
Things get even more interesting if you hook this up with jsawk. If your keys contain JSON values, you can do things like this:
cat /mnt/riak/users/* | jsawk 'if (this.city != "Paris") return null'
The current implementation performs a list-keys operation when you access a bucket, which of course is slow. We put all the keys into a tree structure and cache that for performance. Another fun project would be to rewrite it using Riak's Links to simulate the directory structure, and then run interactive map/reduce jobs to navigate the hierarchy. That will have to wait for another time...
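To make the key-tree idea concrete, here is a minimal sketch of how slash-separated keys could be turned into a cached directory tree. It is not the driver's actual code: the names build_key_tree, list_dir and BucketTreeCache are made up for illustration, and the block passed to the cache stands in for whatever performs the slow list-keys call.

```ruby
# Build a nested-hash "directory tree" from slash-separated keys.
# Directories are Hashes; files (leaf entries) map to the full Riak key.
# Assumption: no key is both a file and a directory prefix of another key.
def build_key_tree(keys)
  tree = {}
  keys.each do |key|
    parts = key.split('/').reject(&:empty?)
    next if parts.empty?
    *dirs, file = parts
    # Walk down the tree, creating directories as needed
    node = dirs.inject(tree) { |n, dir| n[dir] ||= {} }
    node[file] = key
  end
  tree
end

# Return the sorted entry names under a path, or nil if the path
# does not exist or is a file rather than a directory.
def list_dir(tree, path)
  node = path.split('/').reject(&:empty?).inject(tree) do |n, dir|
    return nil unless n.is_a?(Hash) && n.key?(dir)
    n[dir]
  end
  return nil unless node.is_a?(Hash)
  node.keys.sort
end

# Cache one tree per bucket so the slow key listing runs only once.
class BucketTreeCache
  def initialize(&fetch_keys)
    @fetch_keys = fetch_keys # hypothetical: block that lists a bucket's keys
    @trees = {}
  end

  def tree_for(bucket)
    @trees[bucket] ||= build_key_tree(@fetch_keys.call(bucket))
  end
end
```

With a tree like this, an ls on a directory becomes a hash lookup instead of another trip to the cluster. The obvious trade-off is staleness: keys added after the tree is cached will not show up until the cache is rebuilt.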
Anyway, check it out yourself at Github, just keep in mind it is a toy and should not be pointed at a cluster you care about.
Hydra attacks Riak cluster
The great thing about querying a Riak cluster is that it's all just HTTP, so the huge universe of tools available for working with HTTP resources is at your disposal. For example, let's say we want to quickly request 20 keys from a Riak cluster. We could use curl or curb and request them one at a time, or we could use the multi-headed hydra that is Typhoeus to get them concurrently! Let's see this monster in action with a little Ruby code:
require 'rubygems'
require 'typhoeus'
require 'yajl'
require 'yaml' # needed for YAML::load below

HOST = "riak.cluster.com"
PORT = "8098"

# bucket is the name of a Riak bucket
# keys is an array of keys to get
# returns an array of response hashes with the keys :key, :code, :headers, :body
def multi_get(bucket, keys)
  hydra = Typhoeus::Hydra.new
  responses = []

  keys.each do |key|
    url = "http://#{HOST}:#{PORT}/riak/#{bucket}/#{key}"
    request = Typhoeus::Request.new(url)

    # When the request completes, this block gets run
    request.on_complete do |response|
      result = {}
      result[:key] = key
      result[:code] = response.code
      result[:headers] = response.headers
      result[:body] = response.body
      responses << result
    end

    # queue up the request to run later
    hydra.queue request
  end

  # This is a blocking call that executes all queued requests concurrently,
  # and returns when all requests have completed
  hydra.run
  responses
end

# OK, let's try this baby out now,
# first, set up the list of keys we want to get
keys = YAML::load(<<EOT)
---
- 1244idS1NaricUO2RtXJrjcfzr8
- 12ktMhh8KZOCYzMIRTLlqf5JeGA
- 129Izkjd6Fh2i1zqxCE2acT6iju
- 129AomedyZIa3gjudjCxpke5kU9
- 12BEfyiKWPqwZqiqcKGmizVN34i
- 12EZagKQHnakkIChE3ruLUu4TrA
- 12RBVtZV0EwQyTixXFLwHHqwLuK
EOT

# Now let the Hydra out!
rs = multi_get("my_bucket", keys)

# Or, if you are storing JSON objects, get the results as JSON using the speedy Yajl gem...
rs = multi_get("my_bucket", keys).map { |r| Yajl::Parser.parse(r[:body]) }
As you can see, it's pretty darn easy to leverage all the great work by the Typhoeus folks to speed up your queries.
Map/Reduce job to select specific keys
Riak will politely tell you about all the keys in a specific bucket. All you need to do is ask, like this:
curl http://localhost:8098/riak/my_bucket
But what if you have a million keys? You can tell Riak to stream the keys to you, but what if you only want certain keys, such as all the keys that start with foo? In that case, MapReduce is your friend. In Ruby, it looks like this:
results = Riak::MapReduce.new(client)
  .add("my_bucket")
  .map("function(value,keyData,arg) { var re = new RegExp(arg); return value.key.match(re) ? [value.key] : []; }",
       :keep => true, :arg => "^foo")
  .run
You can pass in any regular expression in the :arg parameter. Since keys have to be unique within a Riak bucket, you will never get duplicates and don't need a reduce phase.
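Under the hood, a job like this is just a JSON document posted over HTTP. The sketch below builds what such a document might look like using only Ruby's standard library. It assumes Riak's HTTP map/reduce interface accepts a job with "inputs" and "query" fields at the /mapred endpoint; the helper name key_filter_job is made up for illustration, so check the field names against the documentation for your Riak version.

```ruby
require 'json'

# JavaScript map function: emit the key only when it matches the
# regular expression passed in as the phase argument.
KEY_FILTER_JS = 'function(value, keyData, arg) { ' \
                'var re = new RegExp(arg); ' \
                'return value.key.match(re) ? [value.key] : []; }'

# Build the JSON job description for a single map phase.
# (Hypothetical helper; the layout is assumed from Riak's HTTP interface.)
def key_filter_job(bucket, pattern)
  {
    "inputs" => bucket,
    "query"  => [
      { "map" => { "language" => "javascript",
                   "source"   => KEY_FILTER_JS,
                   "arg"      => pattern,
                   "keep"     => true } }
    ]
  }.to_json
end

puts key_filter_job("my_bucket", "^foo")
```

Assuming a node is listening on localhost:8098, the printed document could then be sent with something like curl -X POST -H "Content-Type: application/json" -d @job.json http://localhost:8098/mapred, which is handy for debugging a job without any client library in the way.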
Update: Note that this code is pretty slow to execute on a bucket with many keys, so it is best used in background jobs, not for interactive queries. For example, on a single-node, small EC2 instance with 10,000 JSON objects (3K each in size) in a bucket, running the above map/reduce code takes 60 seconds.
To see how much of that time is spent marshaling the JSON objects, we removed the JSON body of each object and left only the key, and then ran the code again, which took 30 seconds, still not even in the right ballpark for interactive use. Of course, YMMV.
Riak is for Ops, but Ops don’t build Apps
You can't go anywhere on the interwebs without bumping into a NoSQL post somewhere. At least in the Rails community, MongoDB seems to be gaining the most mindshare. Mongo has the whole NoSQL thing going for it, as well as the two most important things from a developer's perspective -- a full-featured ORM or two (MongoMapper and dm-adapter-mongo) and the ability to (easily) index and query your data. So even though there is no SQL, you can still do things like this, which feels very comforting and familiar to a Rails developer:
Person.find_by_email("starbuck@galactica.mil")
However, MongoDB was designed as a single-node database, and achieves scalability in the same way a MySQL db would, by using things like masters, slaves, and shards. So while Mongo buys you the flexibility of a schema-less data store, you are still stuck with the same old scaling problems of the SQL databases. Why not just use a SQL db as a schema-less store in the first place (a la Friendly)?
Riak is a data store built from the ground up for scaling. The scaling story can be summed up in three words: "Add a node." That's it. No "shards", "masters", "slaves", etc. It's an incredibly compelling story. The problem is that as much as the ops folks love it, ops folks don't build apps. Developers do. And in its current form, Riak makes app developers work harder to build their app, in exchange for a much easier time scaling and maintaining the app down the road. But developers don't care, because they don't usually have to worry about the scaling issue; it's someone else's job. So it's easy for them to ignore the long-term benefits and go with something familiar and easier to get into, like Mongo.
Basho seems like a great company with a refreshing attitude towards making money from open source software, and I really want to see them succeed. But they need to move fast to gain mindshare in the developer community, and that means investing in the tools that make developers' lives easier. It seems that they have taken the first step and hired Sean Cribbs, the developer of the awesome Ripple gem, which is a Rails-compatible Object Mapper for Riak. I hope they continue to focus on developer tools and make Riak the first choice when building a new Rails app.
(We are doing our small part for the community by hacking on Briak, which is a data browser for Riak clusters based on Sinatra.)
Ring around the Ring file
If Riak hangs on you, and you have been mucking about with the name of your node in the vm.args file, make sure you delete all the files in the ring directory, which in the standard Debian/Ubuntu distribution is at /var/lib/riak/ring.
Simulating directories in Riak
In the course of trying to store a directory structure of files in Riak using the Ripple gem, we ran into some problems with the way Ripple handled the escaping of slashes. Basically, Ripple calls URI.escape on the URL it submits to the Riak server, and the URI lib escapes everything except the "/" character, since it is a valid character in URLs. So we end up sending a request to Riak that looks like this:
GET http://host.com/riak/bucket/key/with/slashes%20and%20spaces
The Riak server doesn't like slashes in keys; they need to be escaped to %2f, so the above request will fail. To work around this, and make sure that the slashes (and "+" characters) get escaped, we came up with this monkey patch:
It is definitely a quick hack, as it requires the HTTPBackend class to know about the structure of the Riak URLs, but I am sure the author of Ripple, Sean Cribbs, will come up with a much better long-term solution to the problem. Until then, monkey patch away!
Update: The latest release on Github has this baked in already. Those guys move fast.
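For anyone building Riak URLs by hand, the escaping issue itself is easy to demonstrate. The following is not the monkey patch mentioned above and says nothing about Ripple's internals; it is only a sketch of one way to percent-encode a key so that slashes, spaces and "+" characters all survive the trip. The helper names escape_riak_key and riak_object_url are invented for this example.

```ruby
require 'uri'

# Percent-encode a bucket or key name so it is safe as a single URL
# path segment. encode_www_form_component escapes "/" and "+" but
# turns spaces into "+", so those are rewritten to "%20" afterwards.
def escape_riak_key(key)
  URI.encode_www_form_component(key).gsub('+', '%20')
end

# Build the object URL from separately escaped bucket and key parts.
def riak_object_url(host, port, bucket, key)
  "http://#{host}:#{port}/riak/#{escape_riak_key(bucket)}/#{escape_riak_key(key)}"
end

puts riak_object_url("host.com", 8098, "bucket", "key/with/slashes and spaces")
# prints http://host.com:8098/riak/bucket/key%2Fwith%2Fslashes%20and%20spaces
```

The important detail is that the bucket and the key are escaped individually before being joined, rather than escaping the finished URL, where a "/" can no longer be told apart from a path separator.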
Vagrant for your VMs
Getting new developers up to speed quickly on a project is critical, and thanks to the magic of virtual machines this is now much easier for us to do. We have a base machine image for each project, which is configured with the same OS and software packages as the production environment. Now, 99% of the time, if it works on a dev box, it works on the production box.
But, managing all these different VMs is now getting to be somewhat of an issue, so we have started using Vagrant, which is really good at making VM use as seamless as possible. The whole point of what we do is to write code, and all the time spent on configuration to get there is wasted. Vagrant and VMs help us spend less time shaving yaks.
SDRuby Talk
If anyone is in San Diego on March 4th, head down to the SDRuby meetup at UCSD and catch a talk we are giving on Riak, the newest kid on the NoSQL block.
Using Nginx as a front-end for Riak
One of the very cool things about Riak is that you don't need any type of connection pooling, and that any Riak node can answer any request. This config sets up Nginx to act as a load balancer between all of your Riak nodes, so you send your queries to Nginx, and it will distribute the requests evenly across all of your nodes. It will also only allow simple GET requests, so no one can change or add data, or use the map/reduce functionality (to prevent DoS attacks on the Riak cluster).
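As a rough illustration of the idea, a configuration along the following lines would do the job. This is a sketch written for this explanation rather than a tested production config: the node addresses and the upstream name riak_cluster are placeholders, and both blocks are assumed to live inside the http section of nginx.conf.

```nginx
# Pool of Riak nodes; nginx distributes requests round-robin by default.
# The addresses below are placeholders for your own nodes.
upstream riak_cluster {
    server 10.0.0.1:8098;
    server 10.0.0.2:8098;
    server 10.0.0.3:8098;
}

server {
    listen 80;

    # Read-only access to stored objects.
    location /riak/ {
        # Anything other than GET (and HEAD, which GET implies) is refused.
        limit_except GET {
            deny all;
        }
        proxy_pass http://riak_cluster;
    }

    # Everything else, including the map/reduce endpoint, is blocked.
    location / {
        return 403;
    }
}
```

Because every Riak node can answer every request, the upstream block needs no special routing logic; adding a node to the cluster just means adding one more server line.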
Introducing Briak
In the course of evaluating Basho's cool new distributed database Riak, we whipped up a web-based tool called Briak to allow browsing of data stored in the Riak cluster. We used Sinatra, a lightweight Ruby framework, to build it. We love Sinatra for tools like these: it has just enough structure to get small jobs done, but not enough to get in the way of a quick hack session.
Anyway, head on over to Github to check it out. Also check out riak-admin, another cool tool that is pure JavaScript and runs directly out of the Riak cluster.