
Tuesday, January 31, 2012

Hadoop 1.0

There is finally a 1.0 release of Hadoop. One of my biggest complaints about using Hadoop since version 0.10 is that something breaks with every release, and the usual retort has been that compatibility will come with 1.0, which seemed to be just around the corner for years. I haven't been using Hadoop as much for the last six months, so I didn't notice that there was finally a 1.0 release until today, even though it was announced towards the end of last year. As a user, this news inspires some hope that there might be a Hadoop upgrade that just works. We'll have to see as new versions come out.

That said, it is a little disappointing to see some of the cruft that is still around in the APIs. In particular, the org.apache.hadoop.mapred and org.apache.hadoop.mapreduce packages are both still present. So if I want to write a MapReduce job, which API should I use? Even worse, the javadocs still don't provide much clarity on which of these APIs should be preferred, and neither appears to be deprecated. Maybe there are good reasons, but this adds a lot of confusion for users and, in my selfish opinion, should have been cleaned up before a 1.0 release.

The other big problem I've had with Hadoop is trying to keep a stack of Hadoop, Pig, Oozie, HBase, etc. all working together flawlessly, and being able to update individual components without too much worry about whether the rest of the stack will play nice. This is much easier to do if Hadoop provides clean, simple, and well documented APIs that these other tools can build on. At first glance, the 1.0 apidocs look like they just slapped a 1.0 label on the current pile of crud and did not remove any of the old garbage and deprecated APIs.

If they actually maintain compatibility between 1.x releases it will still be a win for users, but hopefully for 2.0 they will focus on a clean, simple API and get rid of a bunch of the old cruft. It would also be nice if we didn't have to wait six years for 2.0.

Tuesday, November 29, 2011

Damn auto-completion

I really wish people would be more careful when using the auto-completion features of IDEs. Though auto-completion does provide some time savings, it is also a frequent contributor to sloppy coding. After updating some packages, our system broke because someone had incorrectly imported com.google.inject.internal.Sets instead of com.google.common.collect.Sets. Is it too much to ask for people to at least look at the code that is getting generated?

Wednesday, September 21, 2011

Inexcusable laziness

Below is the text of a warning email I received from an internal intrusion detection system:
Subject: IMPORTANT: security violations found for cluster

Security violations found for instances of cluster: foobar

To see the full report go to:
http://ids.mycompany.com/reports/cluster/<clustername>
There are many things I think could be improved in this email, but the primary thing that annoyed me was the link to the full report. Why not insert the actual cluster name so I can just click on the link? As it is, I have to copy the prefix of the URL and then type or paste in the name of the cluster it is complaining about. Are functional links too much to ask for?

Saturday, September 3, 2011

Happy New Year!

Well, not really. A week ago I posted about a bug in the dates shown on the overall stats view of Blogger. The graph shows the transition from 2009 to 2010 as happening on September first. I was curious if that meant that 2012 would show up as starting in September as well. The answer is no:


It looks like they always consider the current year to consist of the current month plus the previous 11 months. Notice in the figure that, now that September has started, the transition from 2009 to 2010 occurs on October first. As a side note, this is not fixed in the new Blogger interface:


Saturday, August 27, 2011

Google can't tell time

I was looking at the stats for this blog and noticed something strange: the time scale started with 2009 May, but the blog didn't exist until 2010 March. It seems Google, and more specifically Blogger, uses a really odd calendar for their stats page. Looking at the all-time overview, the graph starts with 2009 May and then transitions as expected to 2009 August. At this point it shows the next month as 2010 September. See the screen capture below to see what I'm talking about:
So everyone get ready, New Year's Day is this coming Thursday. I'm curious to see if the mistake is consistent and it shows next month as 2012 September. The 2012 apocalypse could be sooner than we thought.

Sunday, August 14, 2011

Basic coding question

When interviewing candidates I always like to include a basic coding question. The goal is to have something simple that can be done in just about any language to see if the candidate can actually write code. When I used to interview C programmers, the question I used was to implement the strtok function. This function is useful because you can then follow up with questions about memory management, modifying the input parameters, thread safety, etc.

These days, I mostly interview Java programmers, but I still use string tokenization as the basic question and follow up with questions about topics such as regular expressions and Unicode. One of the things that has always surprised me is the amount of variety in the answers. It seems everyone can find different ways of tokenizing strings. So when a colleague said that just about all candidates fail trying the same approach to his basic coding question, I couldn't help but wonder why he was seeing such consistency. The question is to write a function that will print a diamond to the console. The function should take a single integer parameter that will be the width of the diamond. The width must be odd or else the function should print an error.

I thought about it for a bit and had a general sketch of a program within about five minutes. Within another five minutes I had a working implementation in python:
#!/usr/bin/env python

import sys

def printDiamond(width):
    '''
    Print a diamond to the console. The diamond must have an odd width; if the
    width is even, an error will be printed to stderr and the program will exit.
    '''
    if width % 2 == 0:
        sys.stderr.write("ERROR: width must be odd\n")
        sys.exit(1)

    # Print top half of diamond
    numSpaces = width / 2
    numAsterisks = 1
    while numAsterisks <= width:
        sys.stdout.write(" " * numSpaces)
        sys.stdout.write("*" * numAsterisks)
        sys.stdout.write("\n")
        numSpaces -= 1
        numAsterisks += 2

    # Print bottom half of diamond
    numSpaces = 1
    numAsterisks = width - 2
    while numAsterisks >= 1:
        sys.stdout.write(" " * numSpaces)
        sys.stdout.write("*" * numAsterisks)
        sys.stdout.write("\n")
        numSpaces += 1
        numAsterisks -= 2

if len(sys.argv) < 2:
    print "Usage: %s <width>" % sys.argv[0]
    sys.exit(1)
else:
    printDiamond(int(sys.argv[1]))
So what was the common mistake he was seeing? Apparently most people start out with nested for loops and try to figure out some equations to indicate whether or not there should be an asterisk at position (i, j). At this point a lot of candidates just get bogged down trying to figure out the math and never step back to think about whether there is an easier way. I suppose the consistency is just that the basic trap is so easy to fall into for this question.
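For what it's worth, here is a rough sketch (my own, not anything from the actual interviews) of the nested-loop formulation candidates tend to reach for. The condition at position (i, j) does have a compact form, but deriving it on a whiteboard under pressure is exactly where people get bogged down:
def printDiamondGrid(width):
    '''Nested-loop version: decide for each position (i, j) whether to print an asterisk.'''
    mid = width // 2
    for i in range(width):
        row = ""
        for j in range(width):
            # A cell is part of the diamond when its horizontal distance from the
            # center column is within the half-width of the current row.
            row += "*" if abs(j - mid) <= mid - abs(i - mid) else " "
        print(row)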

Sunday, May 8, 2011

Secure delete: why is more than one pass needed?

My mom's new computer came with some software to perform a secure delete, and after reading the advertising she asked me why it was necessary. The advertising brags about 7-pass and 35-pass options to make sure your data does not fall into the wrong hands. However, my mom just didn't get it: she thought that the data should be gone if you delete the file. I was able to answer the first question, explaining that delete just removes the index entry that refers to a given file. The data will still be there until it gets overwritten, and with the right software it can be recovered.

I wasn't as prepared to answer the follow-up question: why is more than one pass needed? OK, so we overwrite the file once; how can it then be recovered? I didn't have a good answer, but bumbled through a guess that it was probably like a notepad, where writing on the top sheet leaves traces on the pad even after the sheet is removed. To get a better idea of how people recover data, a friend pointed me to an excellent article by Peter Gutmann called Secure Deletion of Data from Magnetic and Solid-State Memory. He gives a nice summary of the basic idea:
In conventional terms, when a one is written to disk the media records a one, and when a zero is written the media records a zero. However the actual effect is closer to obtaining a 0.95 when a zero is overwritten with a one, and a 1.05 when a one is overwritten with a one. Normal disk circuitry is set up so that both these values are read as ones, but using specialised circuitry it is possible to work out what previous "layers" contained. The recovery of at least one or two layers of overwritten data isn't too hard to perform by reading the signal from the analog head electronics with a high-quality digital sampling oscilloscope, downloading the sampled waveform to a PC, and analysing it in software to recover the previously recorded signal.
Sometimes an oscilloscope may not be enough and you might need to use magnetic force microscopy or other techniques that require very expensive equipment. It should also be pointed out that the article was written 15 years ago, and hard drive densities have increased a lot in that time period. Microscopy techniques have no doubt improved as well, but it is still going to be much more difficult to recover data from modern drives. In the 2006 NIST Guidelines for Media Sanitization, they suggest that a single pass is enough to clear data:
For some media, clearing media would not suffice for purging. However, for ATA disk drives manufactured after 2001 (over 15 GB) the terms clearing and purging have converged. Studies have shown that most of today’s media can be effectively cleared and purged by one overwrite using current available sanitization technologies.
In short, it appears to be cost-prohibitive to recover data that has been wiped with a single pass. Let's face it: for most of the data on your computer, it would probably cost more to recover than the attacker could ever get back by stealing that information, and most likely there are much faster and easier ways to steal your data.

Saturday, May 7, 2011

God's body count in perspective

I recently finished reading Drunk with Blood: God's killings in the Bible, and I was curious how God would stack up against some of the more recent mass murderers. In particular, I chose some of the names that come up frequently, including Adolf Hitler, Joseph Stalin, Mao Zedong, and Pol Pot. I also included two that are more contemporary and recently in the news: Osama bin Laden and Saddam Hussein. Satan's number was too small to make the cut.

Estimating the number of people killed by these individuals is difficult, and it is impossible to get a precise number that is agreed on by all historians. Instead of trying, I just looked around quickly and included a low and a high estimate. This approach is similar to the book's, which includes one count for killings where the Bible provides actual numbers and another that estimates the number killed where actual numbers are not provided. One difference, however: the estimate from the book of God's killings is probably on the low side, whereas the high estimates I'm using are probably higher than most would fairly assign to these individuals. For more nuanced estimates, try Who was the Bloodiest Tyrant of the 20th Century? and 1900-2000: A century of genocides. So let's get on with it; here are the numbers:

God

Let's start with God. Steve Wells helpfully has a post with an overview of all of God's killings in the Bible. The count where the Bible provides the number is 2,476,636. The estimated count for other killings, where the Bible is vague, is 24,634,205. Read the blog or the book if you want more information. It should be pointed out that this only includes killings mentioned in the Bible. Some may think God deserves credit for later killings as well, but they are not included in this tally.

Adolf Hitler

For the low count I used the estimated number of people killed in the Holocaust. There are various numbers that get mentioned, but 14 million seems like a reasonable estimate. The high estimate blames Hitler for all of the deaths associated with World War II, and the extreme seems to be around 78 million.

Joseph Stalin

According to Wikipedia, Stalin's death count falls somewhere between 3 million and 60 million. Other sources place the actual number between 20 and 25 million. I used the estimates from Wikipedia.

Mao Zedong

Mao Zedong killed somewhere between 10 million and 70 million people. The discrepancy is in part over whether you include deaths due to famine from policies such as the Great Leap Forward. Basically, are we counting democide or genocide?

Pol Pot

The high end estimate for Pol Pot was only around 2.5 million. Given his competition, I didn't bother with a low estimate.

Osama bin Laden

Osama bin Laden was included because he was recently killed and has been in the news a lot lately. If you look at killings he planned or ordered, the number is probably around 3,500 (from 1900-2000: A century of genocides). The Wikipedia article estimates the deaths from the global war on terror at 80,000 to 1.2 million. For my purposes, Osama represents the deaths from the war on terror, with an estimate of 1.2 million.

Saddam Hussein

The estimate for Saddam Hussein seems to be around 600,000.

Global Deaths per Year

In addition to various tyrants, I wanted to have some kind of baseline for the comparison. I chose to use the estimated number of people who died in 2010. This number is calculated using the crude death rate of 8.37 deaths per 1,000 people per year. If the estimated population size is 6.92 billion, then the estimated number of people who die per year is 57.9 million.
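The arithmetic, as a quick sketch (my own snippet, just multiplying the figures above):
# Rough estimate of global deaths in 2010 from the figures quoted above.
crude_death_rate = 8.37 / 1000.0    # deaths per person per year
population = 6.92e9                 # estimated 2010 world population
deaths_per_year = crude_death_rate * population
print("%.1f million deaths per year" % (deaths_per_year / 1e6))   # ~57.9 million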

Infographic

So with those estimates, here is a quick graphic to try and put the number of deaths attributed to God into context with the others:

Wednesday, May 4, 2011

Comcast Live Chat

I hate talking to sales people. These days I expect that for most activities I should be able to accomplish everything via a website, with no interaction with an actual person. Unlike some people, I prefer this lack of interaction and am filled with dread when some step mentions snail mail or having to call the company. Comcast has found a new annoyance: the Live Chat. I went to the Comcast website, filled out a form, and then the only option was to enter a live chat with a Comcast representative. The first part was an infuriating series of questions asking me to give them the information I had already entered on the form. This was followed by the representative trying to sell me a bunch of crap (sorry, inform me of exciting deals). At the end of the chat a survey was provided to rate the experience. Unfortunately I didn't save the survey page, but to the best of my recollection the four questions were:
Was your problem solved?
  • yes
  • no
I had to answer yes; however, my problem could have easily been solved if they had just processed the web form in a reasonable way. I gave them all of the information they needed on the form; there was no reason to do the live chat.
Would you use this service again?
  • yes
  • no
Well, for nice high-speed internet in my area Comcast is really the only choice, and there was no way to avoid the live chat on the website. So yes, I would use it again.
How helpful was the Comcast representative?
  • not helpful
  • helpful
The representative was as helpful as she could be, given that the whole service, especially for my issue, was a complete waste. The sales pitches were annoying, but I'm sure the representatives are required to nag customers with that garbage.
Was this service more or less work than you expected?
  • less
  • about what I expected
  • a little more than I expected
  • way more than I expected
I said way more than expected. In reality, when I first saw the text on the form saying I would have to do the live chat to finish, I was expecting a complete pain in the ass and a waste of my time. So I suppose it was about what I expected. However, since there was no place on the survey for free text and the questions are not designed to get useful feedback, this last question seemed like the best option for ranking them poorly.

Saturday, March 5, 2011

What is Life?

ASU Origins Project ASU Gammage Auditorium Feb. 12, 2011

Richard Dawkins, J. Craig Venter, Nobel laureates Sidney Altman and Leland Hartwell, Chris McKay, Paul Davies, Lawrence Krauss, and The Science Network’s Roger Bingham discuss the origins of life, the possibility of finding life elsewhere, and the latest development in synthetic biology.

Sunday, February 27, 2011

SIGMOD Experimental Repeatability Requirements

I was recently pleased to hear about the SIGMOD Experimental Repeatability Requirements. The stated goal:
The goal of the repeatability/workability effort is to ensure that SIGMOD papers stand as reliable, referenceable works for future research. The premise is that experimental papers will be most useful when their results have been tested and generalized by objective third parties.
Apparently it was first done for SIGMOD 2008 and has been getting refined since then. This is something that I think has been lacking in computer science research for quite a while. The obvious benefit is that it makes it much harder to fake results or to report something in a paper that is not really what was implemented and tested. The other benefit is to researchers trying to build on or provide a better alternative to the method presented in a given paper. Anyone who has gone through a graduate computer science program will know the pain of trying to figure out exactly what was done in some paper using only the written description. If you are lucky there is clear pseudocode that can be translated into a working program. But then you still have to worry about things such as magic numbers used to tune the system, or whether the differences you are seeing in the results could be due to other factors such as the machine architecture. Being able to grab the code used for the original paper provides a much faster and more accurate basis for comparing it to a new method.

Unfortunately, though it seemed to be a promising idea, I think their implementation is a crock. The first problem is that participation in the repeatability program is optional:
The repeatability & workability process tests that the experiments published in SIGMOD 2011 can be reproduced (repeatability) and possibly extended by modifying some aspects of the experiment design (workability). Authors participate on a voluntary basis, but authors benefit as well:
  • Mention on the repeatability website
  • The ability to run their software on other sites
  • Often, far better documentation for new members of research teams
The second problem is that it does not mean the code will be made available to everyone. I didn't see any mention of archiving on the 2011 description or how I would be able to get the code for a given paper. The SIGMOD 2010 requirements say the following about code archiving:
Participating in the repeatability/workability evaluation does not imply that the code will be archived for everyone to use subsequently. Pending authors' agreement, the code could be uploaded in the SIGMOD PubZone.
If I understand correctly, it means that if an author chooses to participate, then a committee for the conference will attempt to reproduce the experiments for that paper. After that is done, the code will be archived and made available only if the author chooses. What the hell were they thinking? I would much rather get rid of the committee that tries to reproduce the results and make it a requirement that the full source code and data sets be made available as supplemental material for all papers. The code needs to be out in the open so it can be scrutinized along with the paper by the broader community of researchers. Furthermore, this supplemental material should be available to those doing the initial reviews of the paper when deciding whether or not to accept it.

Monday, February 7, 2011

The United Kingdom Explained

C. G. P. Grey created this helpful video and blog post explaining the differences between the terms United Kingdom, Great Britain, and England.



It's too bad he didn't cover the ISO codes. I tend to think of England, Northern Ireland, Scotland, and Wales as being Great Britain because I usually see the ISO code we use for that market, which is en-gb. The first part is the ISO 639-1 code for the English language. The second part is the ISO 3166 code for the United Kingdom. I'm not sure why gb was chosen instead of uk. The current set of codes does not use uk for any other country, though when I'm trying to guess the code for Ukraine I almost always think it should be uk until I look it up and find that it is ua.

Saturday, February 5, 2011

Quicksort

I recently came across C.A.R. Hoare's 1962 paper on Quicksort. The paper is divided into two parts: theory and implementation. Most of the theory portion will be familiar to anyone who has taken an algorithms course. There are a few differences in terminology; the most noticeable to me was that Hoare uses the term bound for what is usually called the pivot in modern presentations. I also learned a new phrase, mutatis mutandis, which according to Wikipedia means:
Mutatis mutandis is a Latin phrase meaning "by changing those things which need to be changed" or more simply "the necessary changes having been made".
What is really interesting are the implementation details and what they reveal about the state of computers at the time. The first example that comes up is the use of an alternative sort when a partition gets small enough. This is still a common quicksort optimization, but what is interesting is the size at which he recommends it:
Furthermore, if a segment consists of less than three or four items (depending on the characteristics of the computer), then it will be advantageous to sort it by the use of a program specially written for sorting a particular small number of items.
I remember having to compare insertion sort with quicksort in college to find an appropriate cutoff point, and as I recall it was around 50 items. Of course, this will vary with the computer, the nuances of the sort implementations, and the data set being tested. I just would have expected this number to be at least 10 or so. Another example is the data provided in Table 1:

Number of Items    Merge Sort       Quicksort
500                2 min 8 sec      1 min 21 sec
1,000              4 min 48 sec     3 min 8 sec
1,500              8 min 15 sec*    5 min 6 sec
2,000              11 min 0 sec*    6 min 47 sec

The asterisks by the results for the last two rows of merge sort indicate that those figures had to be estimated because they could not be computed. The reason provided is the limited store size available on the computer:
These figures were computed by formula, since they cannot be achieved on the 405 owing to limited store size.
The description of the machine and its capabilities:
The National-Elliott 405 computer has a delay-line working store of 512 locations, and a magnetic-disc backing store of 16,384 words. The average access time for the working store is 0.8 msec and the average access time for a block of 64 words in the backing store is 32 msec. There are 19 words of immediate-access storage, which are used to contain instructions and working space of the inner loops: the time taken by such loops is about 0.15 msec per instruction.
To get a better idea of the kind of computer he is talking about, see this video from the Yorkshire film archive showing the installation of a 405 computer at Dansom Lane. It won't help for understanding the memory limitations, but it is still interesting to get an idea of what the physical machine they were testing on looked like. The paper is not too clear on exactly what the data set looked like other than the single sentence:
The figures relate to six-word items with a single-word key.
Based on this description it is clear that the problem with merge sort is that it does not sort in place. Duplicating the data set would mean that merge sort on the 405 could support a maximum data set size of 16,384 words / 2 copies / 6 words per item = 1,365 items. Unfortunately, there is no description of the formula used to calculate the last two times for merge sort. I was curious what the times would be for something like insertion sort. Insertion sort is in-place, so it should be able to handle the larger data sets, but what would the expected running time be? An extremely rough approximation would be to assume exactly N² instructions. If we use the figure from the paper of 0.15 msec per instruction and assume that each instruction needs at least one read from the working store at a cost of 0.8 msec, then we can use a cost of 0.95 msec per operation. This is almost certainly an underestimate of the actual time it would take; if nothing else, I'm completely ignoring access to the magnetic disc. Crunch the numbers, and here is what we wind up with:

Number of Items    N lg(N)          N²
500                0 min 4 sec      3 min 57 sec
1,000              0 min 9 sec      15 min 50 sec
1,500              0 min 15 sec     35 min 37 sec
2,000              0 min 20 sec     1 hour 3 min 20 sec

Over an hour to sort 2000 items compared to less than 7 minutes for quicksort.
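For reference, here is a quick script (my own, not from the paper) that reproduces the figures in the table above using the assumed cost of 0.95 msec per operation:
import math

# 0.15 msec per instruction plus 0.8 msec for a working-store read,
# as assumed in the rough approximation above.
COST_MSEC = 0.15 + 0.8

def formatTime(msec):
    '''Format a duration in milliseconds as hours/minutes/seconds (truncated).'''
    sec = int(msec / 1000.0)
    hours, rem = divmod(sec, 3600)
    minutes, seconds = divmod(rem, 60)
    if hours:
        return "%d hour %d min %d sec" % (hours, minutes, seconds)
    return "%d min %d sec" % (minutes, seconds)

for n in (500, 1000, 1500, 2000):
    nLogN = n * math.log(n, 2) * COST_MSEC   # ~N lg(N) operations
    nSquared = n * n * COST_MSEC             # ~N^2 operations
    print("%5d  %-13s  %s" % (n, formatTime(nLogN), formatTime(nSquared)))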

Wednesday, January 19, 2011

Yahoo! is a moving company from Mumbai

I was a bit surprised when looking at Alexa's list of top sites to see that the site in fourth place was anshikapackersmovers. It seems Alexa has some confusion between Yahoo! and the Mumbai company Anshika Packers & Movers. Look at the site info for yahoo.com and it says:
About anshikapackersmovers (yahoo.com): providing you the best possible solutions for all kind of relocation such as home relocation, car carrier service, office or corporate relocation mumbai providing you the best possible solutions for all kind of relocation such as home relocation, car carrier service, office or corporate relocation .

Yahoo.com is ranked #4 in the world according to the three-month Alexa traffic rankings, and roughly 29% of visits to the site consist of only one pageview (i.e., are bounces). While we estimate that 32% of visitors to the site come from the US, where it is ranked #3, it is also popular in Taiwan, where it is ranked #1. Compared with internet averages, the site's audience tends to be aged under 25 and over 45; they are also disproportionately women browsing from school and home. The fraction of visits to Yahoo.com referred by search engines is approximately 7%.
I'm assuming this will get fixed relatively soon, so I included a screen capture of the Alexa site info page below.

Friday, January 14, 2011

Two spaces after a period

Do you still use two spaces after a period when typing documents? I remember learning this rule when I first started working with computers in the early 90s, but somewhere along the way I stopped doing it and never gave it much thought. I think a large part of the reason is that I typically use some form of markup, such as LaTeX, for doing any sort of moderately complex typography. Who really wants to learn and manually try to adhere to all of the silly rules for how to prepare a manuscript? Just use some markup with a typesetting engine to take care of all the mundane stuff like conforming to whitespace conventions. Nothing to get all worked up about.

So I was a bit surprised to see how worked up Farhad Manjoo gets over people using two spaces after a period in his article Space Invaders: Why you should never, ever use two spaces after a period. What really struck me was that he mentions the antiquated typewriters that gave rise to the double-space convention, but he doesn't seem to take issue with users having to carry the burden of adhering to this style. One of the reasons I hate WYSIWYG editors is that they force the user to worry about the style as well as the content. Try to write an article having to worry about rivers, widows, and orphans that will change and need to be corrected every time you edit the text. Add to that all of the consistency issues of sentence spacing, fonts, page numbering, references to figures or sources, etc. WYSIWYG was first available, at least on a computer, in 1974 with the Bravo editor, and by 1978 it was obsolete for any serious writing tasks with the release of TeX.

Thomas Thwaites: How I built a toaster -- from scratch

Wednesday, January 5, 2011

Integer representation of dates

For a system at work, I needed to represent dates and times as integers. An integer representation is required because we need to import the data into a search index so we can efficiently perform point and range queries. The search system only supports range queries for integer types, and one of our main use cases is querying based on a date range. Further, the system does not have a specific time data type; it just supports signed integers of 1, 2, 4, and 8 bytes. So the problem is: how should we represent a date or time as an integer?

First, let's make the requirements more explicit. The input data is a JSON feed with ISO 8601 formatted strings for date and timestamp fields. Whatever representation we choose should be able to represent the full range of dates from the input, including %Y, %Y-%m, %Y-%m-%d, and %Y-%m-%dT%H:%M:%S (formats specified using the notation for strftime). In addition, it is preferable if the representation is easy for a person to use, i.e., I would like to be able to encode a known date without needing a program.

Problems with Unix Time

The obvious first question is can we just use Unix time? It is well known and widely supported, but there are a few problems trying to use it for our application.

The first problem is that it has a limited range. If we use a 32-bit integer we run into the year 2038 problem. Right now we don't really care about dates that far in the future, but we do care about historical dates so the year 1901 problem is an immediate concern. This problem is easily solved by using a 64-bit integer giving a range of 1970 plus or minus around 290 billion years. We can afford to be that short-sighted.

The second problem is for some fields we do not know the precise timestamp. We only know a particular year, month, or day. Somehow we need to represent these dates as being distinct from a precise date. For example if all we know is that something occurred in 1970 we don't want to encode it as 1970-01-01, because now we can't distinguish it from something that we knew happened on January 1, 1970. Unix time uses all of the bits already so there is no room to represent these vague dates.

A third problem is that it is quite cumbersome for a person to figure out the unix timestamps for a given range. Say I want to find all entries within the year 1980, I would need to compute the unix timestamp values for the beginning (1980-01-01T00:00:00 is 315532800) and end of the year (1980-12-31T23:59:59 is 347155199). This is trivial to code, but for a person doing it manually it is a pain.
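For instance, the "trivial to code" version might look something like this (my own snippet, which reproduces the 1980 timestamps above using UTC):
import calendar
import datetime

# Unix timestamps (UTC) for the first and last second of 1980.
start = calendar.timegm(datetime.datetime(1980, 1, 1, 0, 0, 0).timetuple())
end = calendar.timegm(datetime.datetime(1980, 12, 31, 23, 59, 59).timetuple())
print("%d %d" % (start, end))   # 315532800 347155199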

The second problem is the killer: Unix time is not an option. The other issues are annoying, but we could live with them, especially considering the benefits of using a representation that is as common and widely supported as Unix time. So, what is the alternative?

Concatenated Fields Representation

This is a representation I've seen numerous times, but I wasn't able to find an official name or spec for it. The idea is pretty simple: given a date represented as %Y-%m-%dT%H:%M:%S, we just remove the punctuation and use the values as digits, giving the numeric value %Y%m%d%H%M%S. The first thing that becomes apparent is that we need an integer type that can support at least 14 decimal digits (4 year + 2 month of year + 2 day of month + 2 hour of day + 2 minute of hour + 2 second of minute = 14 digits). If we only care about the date we need at least 8 digits, and for just the time we need 6 digits. A 16-bit signed integer has a range of -32768 to 32767, so we need at least a 32-bit integer. The table below shows the max value for 32- and 64-bit signed integers using this representation for a time, date, and datetime.

          32-bit max                64-bit max
Time      23:59:59                  23:59:59
Date      214748-12-31              922337203685477-12-31
DateTime  too small                 922337203-12-31T23:59:59

For a time or date, a 32-bit signed integer is more than enough. Time in particular is limited, so the max value will be 23:59:59 and using 64 bits provides no value. For date and datetime types the additional room can be used to expand the range of years supported. Even with 32-bit integers the date type works through the year 214748, which is more than enough for our use cases, but if needed 64 bits will get you past the year 922 trillion. For datetime, a 32-bit integer is too small for 14 decimal digits. A 64-bit value will provide support past the year 922 million, and more importantly it covers the full range of the 32-bit date.

How does this fit our requirements? It covers the full range of time values that we expect to get in the input. Vague dates that are missing the day and/or month can be encoded as %Y0000. The zeros are not valid values for the month or the day of the month, so we can clearly distinguish these from more specific dates. This format is also pretty easy for a human to use. So it fixes the problems with Unix time, but have we added any new problems?
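As an illustration, here is a rough sketch of the encoding (my own code, not the actual system; the function name and regular expression are just for illustration). It accepts the partial ISO 8601 forms listed in the requirements and pads missing fields with zeros:
import re

def encodeDate(iso):
    '''
    Encode an ISO 8601 date or timestamp string as a concatenated-fields integer.

    "1970"                -> 19700000
    "1970-03"             -> 19700300
    "1970-03-25"          -> 19700325
    "1970-03-25T13:45:10" -> 19700325134510
    '''
    m = re.match(r'^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?(?:T(\d{2}):(\d{2}):(\d{2}))?$', iso)
    if m is None:
        raise ValueError("unsupported date format: %s" % iso)
    year, month, day, hour, minute, second = [int(g) if g else 0 for g in m.groups()]
    if m.group(4) is None:
        # Date only (possibly vague): 8 decimal digits, fits in a 32-bit integer.
        return (year * 100 + month) * 100 + day
    # Full timestamp: 14 decimal digits, needs a 64-bit integer.
    return ((year * 100 + month) * 100 + day) * 1000000 + (hour * 100 + minute) * 100 + second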

Yes. Probably the most annoying problem is that not all values are valid. For example the date 19701332 is nonsense. There is no 13th month or 32nd day of a month. This representation makes it really easy to have values that are complete garbage. In fact, most of the distinct values for the integer representation will not be valid dates. The table below shows the percentage of the distinct values that are valid:

          32-bit                                 64-bit
Time      0.002% (86,400 of 4,294,967,296)       0.000000000000468% (86,400 of 18,446,744,073,709,551,616)
Date      2.08% (89,335,584 of 4,294,967,296)    2.08% (383,692,276,733,158,848 of 18,446,744,073,709,551,616)
DateTime  too small                              12.00% (2,213,609,288,860,213,248 of 18,446,744,073,709,551,616)

The second problem is closely related to the first. This representation wastes a lot of space and doesn't make good use of all the bits. For my purposes this is not a big problem, but if you are looking for a compact representation then this is a bad choice.

Finally, we cannot make use of the negative values. In the ISO 8601 representation the year can be negative to specify BCE dates. This approach cannot be used with this representation because the sign would apply to the entire value, not just the year, and that breaks the ordering. For example -0004-01-04 is earlier than -0004-12-30, but when represented as an integer -40104 is greater than -41230.

Summary

The concatenated fields representation is a user friendly way to represent common era dates as an integer. It is particularly useful if you need to support dates where only partial information is present, e.g., just the year.

Wednesday, December 22, 2010

Winter Solstice Lunar Eclipse

It was too cloudy where I was to get a decent view of the recent lunar eclipse. Luckily William Castleman put up a nice time-lapse video of the event:

Winter Solstice Lunar Eclipse from William Castleman on Vimeo.

Monday, December 6, 2010

Cloudstock Review

I just got home from attending Cloudstock and thought I would write up a brief review. Since the event was free and sponsored by a large number of companies, I was a little concerned that it would be a bunch of companies hawking their crap. There was a fair amount of selling, and overall the conference was light on technical information, but a few of the presentations were interesting nonetheless. A full list of sessions is available on the Cloudstock page, though I don't see any way to get access to the slides that were used. I think they were recording the sessions, so maybe they'll be made available at some point.

Building a Scalable Geospatial Database on top of Apache Cassandra - Mike Malone (SimpleGeo)
This talk will explore the real world technical challenges we overcame at SimpleGeo while building a spatial database on top of Apache Cassandra. Cassandra offers simple decentralized operations, no single point of failure, and near-linear horizontal scalability. But Cassandra fell far short of providing the sort of sophisticated spatial queries we need. Our challenge was to bridge that gap.
This was the most interesting talk I attended. The main part of the talk was on how to use a distributed hash table, in particular Cassandra, as a spatial database. The key problem is how to support the needed types of queries including:
  • Exact match: find a particular key
  • Range: find all keys in some interval
  • Proximity: find the nearest neighbors to a key
  • Misc others: reasonable expectation of being able to adapt to new use cases
A typical distributed hash table works well for exact matches on a given key; however, it is not particularly well suited to the other use cases. Cassandra uses a partitioning scheme similar to Amazon's Dynamo, with the keys positioned along a ring. Furthermore, the partition function can be customized so that the keys will be ordered. For SimpleGeo they used a partition function that provided a Z-order curve (a rough sketch of such an interleaved key follows the list below). This approach allowed for simple range queries and preserves locality for some points. They experienced two big problems though:
  • Poor locality for some points. This is a general problem with space-filling curves: some points that are close in the n-dimensional space will be much further apart when following the curve. In practice, this means that some searches will be much more expensive than they should be.
  • Non-random distribution of data. The default partition function randomly spreads out the data, which avoids hotspots where many keys fall in the same bucket. Customizing the partition function to provide order meant that the skew inherent in the dataset became a problem. In the presentation he showed a photo of the distribution of lights at night and the clusters around cities. A similar photo of Egypt is shown below, with obvious clustering around the Nile river.
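Here is a rough sketch of what such an interleaved, Z-order style key might look like (my own illustration, not SimpleGeo's actual partitioner); the coordinates are assumed to already be scaled to unsigned integers:
def mortonKey(x, y, bits=16):
    '''
    Interleave the bits of two scaled coordinates to produce a Z-order curve key.
    Nearby (x, y) pairs usually, but not always, end up with nearby keys, which
    is the locality problem described above.
    '''
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # even bit positions come from x
        key |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions come from y
    return key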
To solve this problem they moved away from space-filling curves to something that looked like a kd-tree stored in the distributed hash. If I understood correctly, each node in the tree is stored as an entry in Cassandra using the standard partitioning scheme. Exact match and range queries can be performed by standard tree searching and traversal, with some caching to avoid problems with hot spots, in particular the root node of the tree. Data skew can be accommodated by splitting a node when it gets too full. The nodes are stored using standard Cassandra, so it avoids the customization that caused tricky problems with the ordering. Proximity queries are handled by first searching for the exact match and then checking the node to restrict the bound further. If the nearest neighbor is found within the same node and the radius is such that neighboring spaces could not have a closer neighbor, then we are done. If it is possible that a neighboring space has points that could match the query, then we go to the parent node and then to siblings until satisfied.

Overall, a nice presentation, with a good progression through their various attempts and an explanation of the issues they encountered.

Teach a Dog to REST - Brian Mulloy (apigee)
It's been 10 years since Fielding first defined REST. So, where are all the elegant REST APIs? While many claim REST has arrived, many APIs in the wild exhibit arbitrary, productivity-killing deviations from true REST. We'll start with a typical poorly-designed API and iterate it into a well-behaved RESTful API.
Nothing spectacular, but he did have some reasonable advice for constructing APIs and covered some of the common problems they have seen. This presentation also had more of a sales element, with the speaker frequently mentioning the apigee console for learning and playing around with APIs for popular services such as Twitter and LinkedIn. I personally found the speaker to be annoying, e.g., he had a schtick about not knowing how to pronounce 'idempotent' that I'm pretty sure was an attempt at self-deprecation to help make the talk more appealing to a non-technical audience. The brief summary is:
  • Be RESTful. The speaker seems to prefer RESTful interfaces over traditional RPC interfaces such as SOAP or JSON-RPC. The primary reasoning is that it leads to greater simplicity and fewer endpoints for the developer. His preferred interface is two URLs per resource: one for a collection, such as /dogs; and one for a specific element, such as /dogs/cujo. I liked his focus on APIs that are easy for developers to understand and to push for conventions that make it easier to reason about how APIs should work. If done right you can guess what the API will be without ever having to look at the documentation.
  • Verbs are bad. Nouns are good. At first you might think he is a subject of Evil King Java, but it is not quite the same. The RESTful model is about managing resources, and the argument is that the verbs are already provided as part of the HTTP protocol. So really it is verbs as part of the URL that are bad. URLs should refer to a noun.
  • Plurals are better. Here he is referring to the name for collections and clearly stated that this point was just his opinion. I don't really have a strong preference, but I do agree with him that if a widely used convention was present, it would be much easier to guess what the URL should be for a given API. Plurals also do seem to make it clearer that the response would be a collection instead of a single item.
  • Move complexity after the question mark. The basic idea here was that the messy parts of the API should be made query parameters to the URL. The justification is that there will be some mess and that other locations, such as HTTP headers, are more obscure and difficult to quickly hack together in a browser. Another good point I think he had is that you should try to make the API trivial to start using. The easier it is to play around with an API the more likely it is to get used.
  • Borrow from leading APIs. This goes back to his theme about convention. By following other popular APIs it is more likely your API will be familiar to new developers looking at your system. He also mentioned that in his opinion LinkedIn was currently doing the best at designing clean easy to use APIs for their offerings.
One shortcoming that was brought out and emphasized during the questions at the end was that he made no mention of error handling. Overall, OK, but a 45-minute session was too long for this talk.

Your API Sucks - Marsh Gardiner (apigee)
We've learned the hard way that websites need great user experiences to survive. So why aren't we being this aggressive with API design? What are the deeper reasons behind why REST killed SOAP? And why aren't all API providers thinking about the truly important issues, making APIs that will be used by people? Come for the hall of shame and stay for the wake-up call.
Boring series of "don't do this" examples. At least the previous speaker bothered to explain why he was pushing for APIs to be a certain way. The speaker reminded me of John Hodgman, but without the humor. Waste of time.

Lunch

They had some pre-made sandwiches for the lunch. I don't make it into San Francisco that often so I decided to eat out instead.

Scaling Your Web App - Sebastian Stadil (Scalr)
Got app? Learn to scale it, with tricks for creating and managing scalable infrastructure on EC2 or elsewhere.
I came in late to this talk. The part I saw was him showing off their UI. Complete waste of time, I might as well have flipped through the tour on their website.

Inside MongoDB - Alvin Richards (mongoDB)
In this talk we'll describe and discuss MongoDB's data format (BSON), the insert path, the query optimizer, auto-sharding, replication, and more. The talk will be of interest to developers interested in MongoDB and looking to learn more about what's going on "under the hood", as well as anyone interested in distributed systems and the design decisions that go into creating a system like MongoDB.
Not a bad introductory overview. You could probably get the same information by spending an hour reading through the mongoDB documentation, but you wouldn't have easy access to someone for questions.

AWS Feedback Session - Jeffrey Barr (Amazon Web Services)
If you are an AWS user and want to ask questions or provide feedback, here's your chance. Senior AWS Evangelist Jeff Barr will be conducting an interactive feedback session on EC2, S3, RDS, and the other services. All of the feedback will be routed directly to the product teams.
This session was only really useful as a more direct way to communicate issues to Amazon. The speaker was quite knowledgeable about the Amazon stack, and it's good to see they are eager to get customer feedback. One aspect that came up several times was the poor support for Windows. The two issues I remember were the long delay until new versions of Windows are available to use and, one I found quite amusing, that if you create a snapshot of a Windows VM then apparently the admin password is changed in the original VM.

Hackathon

I skipped the hackathon.

Summary

Not bad for a free event. I heard from others that some of the sessions were worse about just being sales pitches than the ones I attended. Very little technical depth in most of the presentations.

Sunday, November 28, 2010