subnormal numbers: December 2010

I just got home from attending Cloudstock and thought I would write up a brief review. Since the event was free and sponsored by a large number of companies, I was a little concerned that it would be a bunch of companies hawking their crap. There was a fair amount of selling and overall the conference was light on technical information, but a few of the presentations were interesting nonetheless. A full list of sessions is available on the Cloudstock page, though I don't see any way to get access to the slides that were used. I think they were recording the sessions, so maybe they'll be made available at some point.

Building a Scalable Geospatial Database on top of Apache Cassandra - Mike Malone (SimpleGeo)

This talk will explore the real world technical challenges we overcame at SimpleGeo while building a spatial database on top of Apache Cassandra. Cassandra offers simple decentralized operations, no single point of failure, and near-linear horizontal scalability. But Cassandra fell far short of providing the sort of sophisticated spatial queries we need. Our challenge was to bridge that gap.

This was the most interesting talk I attended. The main part of the talk was on how to use a distributed hash table, in particular Cassandra, as a spatial database. The key problem is how to support the needed types of queries including:

Exact match: find a particular key
Range: find all keys in some interval
Proximity: find the nearest neighbors to a key
Misc others: reasonable expectation of being able to adapt to new use cases

A typical distributed hash table works well for exact match on a given key, but it is not particularly well suited to the other query types. Cassandra uses a partitioning scheme similar to Amazon's Dynamo, with the keys positioned along a ring. The partition function can also be customized so that the keys are ordered. SimpleGeo used a partition function that ordered the keys along a Z-order curve. This approach allowed for simple range queries and preserved locality for some points. They experienced two big problems though:

Poor locality for some points. This is a general problem with space-filling curves: some points that are close together in the n-D space end up much further apart along the curve. In practice, this means that some searches will be much more expensive than they should be.
Non-random distribution of data. The default partition function randomly spreads out the data, which avoids hotspots where many keys fall in the same bucket. Customizing the partition function to preserve order exposed the skew inherent in the dataset, and that skew became a problem. In the presentation he showed a photo of the distribution of lights and the clusters around cities. A similar photo of Egypt shows obvious clustering around the Nile river.
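To make the locality problem concrete, here is a minimal sketch of a Z-order key in Python. The integer grid coordinates, the function name, and the 16-bit width are my own assumptions for illustration; the talk did not show SimpleGeo's actual partition function.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits land on even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits land on odd positions
    return key


# Good locality: the four cells of a 2x2 block get consecutive keys.
print([z_order_key(x, y) for y in (0, 1) for x in (0, 1)])  # [0, 1, 2, 3]

# Poor locality: (7, 7) and (8, 7) are adjacent cells, but their keys
# are far apart because x crosses a power-of-two boundary.
print(z_order_key(7, 7), z_order_key(8, 7))  # 63 106
```

Sorting by such a key makes a range query a simple scan over an interval of keys, but the second example shows why a search around one cell can end up touching keys far away on the curve.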

To solve this problem they moved away from space-filling curves to something that looked like a kd-tree stored in the distributed hash. If I understood correctly, each node in the tree is stored as an entry in Cassandra using the standard partitioning scheme. Exact match and range queries can be performed by standard tree searching and traversal, with some caching to avoid problems with hot spots, in particular the root node of the tree. Data skew can be accommodated by splitting a node when it gets too full. The nodes are stored using standard Cassandra, so it avoids the customization that caused tricky problems with the ordering. Proximity queries are handled by first searching for the exact match and then checking the node to restrict the bound further. If the nearest neighbor is found within the same node and the radius is such that neighboring spaces could not have a closer neighbor, then we are done. If it is possible that a neighboring space has points that could match the query, then we go to the parent node and then to siblings until satisfied.
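Here is a rough sketch of how I picture that design, with a plain Python dict standing in for Cassandra. Everything in it is my own assumption rather than SimpleGeo's implementation: the `KVTree` class, the node id scheme ("root", "rootL", "rootR", ...), the leaf capacity of 4, and the 2-D points. The proximity search uses the textbook kd-tree backtracking rule, which is how I understood the "go to the parent and then to siblings" step.

```python
import math


class KVTree:
    """kd-tree-like index whose nodes live as entries in a key-value store."""

    def __init__(self, capacity=4):
        self.capacity = capacity  # hypothetical split threshold
        # dict stands in for the distributed hash: node id -> node record
        self.store = {"root": {"depth": 0, "points": [], "split": None}}

    def _leaf_for(self, point):
        """Walk from the root to the leaf whose region contains `point`."""
        node_id = "root"
        while self.store[node_id]["split"] is not None:
            axis, value = self.store[node_id]["split"]
            node_id += "L" if point[axis] < value else "R"
        return node_id

    def contains(self, point):
        """Exact match: one descent, then a lookup inside the leaf."""
        return point in self.store[self._leaf_for(point)]["points"]

    def insert(self, point):
        node_id = self._leaf_for(point)
        node = self.store[node_id]
        node["points"].append(point)
        if len(node["points"]) > self.capacity:
            self._split(node_id)  # skewed regions simply split more often

    def _split(self, node_id):
        node = self.store[node_id]
        axis = node["depth"] % 2  # alternate between x and y
        values = sorted(p[axis] for p in node["points"])
        value = values[len(values) // 2]
        left = [p for p in node["points"] if p[axis] < value]
        right = [p for p in node["points"] if p[axis] >= value]
        if not left:
            return  # too many equal coordinates; leave the leaf oversized
        for suffix, pts in (("L", left), ("R", right)):
            self.store[node_id + suffix] = {
                "depth": node["depth"] + 1, "points": pts, "split": None}
        node["points"], node["split"] = [], (axis, value)

    def nearest(self, query, node_id="root", best=None):
        """Nearest neighbor; returns None if the tree is empty."""
        node = self.store[node_id]
        if node["split"] is None:
            for p in node["points"]:
                if best is None or math.dist(p, query) < math.dist(best, query):
                    best = p
            return best
        axis, value = node["split"]
        near, far = ("L", "R") if query[axis] < value else ("R", "L")
        best = self.nearest(query, node_id + near, best)
        # Only visit the sibling if it could hold something closer.
        if best is None or abs(query[axis] - value) < math.dist(best, query):
            best = self.nearest(query, node_id + far, best)
        return best
```

Usage would look something like `tree = KVTree(); tree.insert((3, 4)); tree.nearest((2.5, 4.2))`. Since every node is just a value under an ordinary key, the store can keep its default random partitioning, and the root (which every query touches) is the obvious candidate for caching.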

Overall, a nice presentation with a good progression through their various attempts and a clear explanation of the issues they encountered.

Teach a Dog to REST - Brian Mulloy (apigee)

It's been 10 years since Fielding first defined REST. So, where are all the elegant REST APIs? While many claim REST has arrived, many APIs in the wild exhibit arbitrary, productivity-killing deviations from true REST. We'll start with a typical poorly-designed API and iterate it into a well-behaved RESTful API.

Nothing spectacular, but he did have some reasonable advice for constructing APIs and some of the common problems they have seen. This presentation also had more of the sales element with the speaker frequently mentioning the apigee console for learning and playing around with APIs for popular services such as Twitter and LinkedIn. I personally found the speaker to be annoying, e.g. he had a schtick about not knowing how to pronounce idempotent methods that I'm pretty sure was an attempt at self-deprecation to help make the talk more appealing to a non-technical audience. The brief summary is:

Be RESTful. The speaker seems to prefer RESTful interfaces over traditional RPC interfaces such as SOAP or JSON-RPC. The primary reasoning is that it leads to greater simplicity and fewer endpoints for the developer. His preferred interface is two URLs per resource: one for a collection, such as /dogs; and one for a specific element, such as /dogs/cujo. I liked his focus on APIs that are easy for developers to understand and to push for conventions that make it easier to reason about how APIs should work. If done right you can guess what the API will be without ever having to look at the documentation.
Verbs are bad. Nouns are good. At first you might think he is a subject of Evil King Java, but it is not quite the same. The RESTful model is about managing resources, and the argument is that the verbs are already provided as part of the HTTP protocol. So really it is verbs as part of the URL that are bad. URLs should refer to a noun.
Plurals are better. Here he is referring to the name for collections, and he clearly stated that this point was just his opinion. I don't really have a strong preference, but I do agree with him that if a widely used convention were present, it would be much easier to guess what the URL should be for a given API. Plurals also do seem to make it clearer that the response would be a collection instead of a single item.
Move complexity after the question mark. The basic idea here was that the messy parts of the API should be made query parameters to the URL. The justification is that there will be some mess and that other locations, such as HTTP headers, are more obscure and difficult to quickly hack together in a browser. Another good point I think he had is that you should try to make the API trivial to start using. The easier it is to play around with an API the more likely it is to get used.
Borrow from leading APIs. This goes back to his theme about convention. By following other popular APIs it is more likely your API will be familiar to new developers looking at your system. He also mentioned that in his opinion LinkedIn was currently doing the best at designing clean, easy-to-use APIs for their offerings.
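As a rough illustration of the points above, here is a small in-memory sketch of the two-URLs-per-resource shape, with the HTTP method as the verb and filtering pushed after the question mark. The `handle` function, the `color` field, and the choice of status codes are all my own assumptions; the talk did not cover status codes or error handling.

```python
from urllib.parse import urlsplit, parse_qs


def handle(dogs, method, url, body=None):
    """Dispatch a request against /dogs and /dogs/<name>.

    `dogs` is a dict of name -> record standing in for real storage.
    Returns a (status, payload) tuple.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    query = parse_qs(parts.query)  # e.g. {"color": ["brown"]}

    if not segments or segments[0] != "dogs" or len(segments) > 2:
        return 404, None

    if len(segments) == 1:  # the collection: /dogs
        if method == "GET":
            result = list(dogs.values())
            for field, wanted in query.items():
                result = [d for d in result if str(d.get(field)) in wanted]
            return 200, result
        if method == "POST":
            dogs[body["name"]] = body
            return 201, body
        return 405, None

    name = segments[1]  # a specific element: /dogs/<name>
    if method == "GET":
        return (200, dogs[name]) if name in dogs else (404, None)
    if method == "PUT":
        dogs[name] = body
        return 200, body
    if method == "DELETE":
        return (204, None) if dogs.pop(name, None) is not None else (404, None)
    return 405, None


db = {"cujo": {"name": "cujo", "color": "brown"}}
print(handle(db, "GET", "/dogs/cujo"))
print(handle(db, "GET", "/dogs?color=brown"))
```

Note that no URL contains a verb: creating, reading, replacing, and deleting a dog all go through the same two paths.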

One shortcoming that was brought out and emphasized during the questions at the end was that he made no mention of error handling. Overall, OK, but a 45 minute session was too long for this talk.

Your API Sucks - Marsh Gardiner (apigee)

We've learned the hard way that websites need great user experiences to survive. So why aren't we being this aggressive with API design? What are the deeper reasons behind why REST killed SOAP? And why aren't all API providers thinking about the truly important issues, making APIs that will be used by people? Come for the hall of shame and stay for the wake-up call.

Boring series of "don't do this" examples. At least the previous speaker bothered to explain why he was pushing for APIs to be a certain way. The speaker reminded me of John Hodgman, but without the humor. Waste of time.

Lunch

They had some pre-made sandwiches for the lunch. I don't make it into San Francisco that often so I decided to eat out instead.

Scaling Your Web App - Sebastian Stadil (Scalr)

Got app? Learn to scale it, with tricks for creating and managing scalable infrastructure on EC2 or elsewhere.

I came in late to this talk. The part I saw was him showing off their UI. Complete waste of time; I might as well have flipped through the tour on their website.

Inside MongoDB - Alvin Richards (mongoDB)

In this talk we'll describe and discuss MongoDB's data format (BSON), the insert path, the query optimizer, auto-sharding, replication, and more. The talk will be of interest to developers interested in MongoDB and looking to learn more about what's going on "under the hood", as well as anyone interested in distributed systems and the design decisions that go into creating a system like MongoDB.

Not a bad introductory overview. You could probably get the same information by spending an hour reading through the mongoDB documentation, but you wouldn't have easy access to someone for questions.

AWS Feedback Session - Jeffrey Barr (Amazon Web Services)

If you are an AWS user and want to ask questions or provide feedback, here's your chance. Senior AWS Evangelist Jeff Barr will be conducting an interactive feedback session on EC2, S3, RDS, and the other services. All of the feedback will be routed directly to the product teams.

This session was only really useful as a more direct way to communicate issues to Amazon. The speaker was quite knowledgeable about the Amazon stack and it's good to see they are eager to get customer feedback. One aspect that came up several times was the poor support for Windows. The two issues I remember were the long delay until new versions of Windows are available to use and, one I found quite amusing, that if you create a snapshot of a Windows VM then apparently the admin password is changed in the original VM.

Hackathon

I skipped the hackathon.

Summary

Not bad for a free event. I heard from others that some of the sessions were worse about just being sales pitches than the ones I attended. Very little technical depth in most of the presentations.

Monday, December 6, 2010

Cloudstock Review