Tuesday, November 29, 2011

Damn auto-completion

I really wish people would be more careful when using the auto-completion features of IDEs. Though auto-completion does provide some time savings it is also a frequent contributor to sloppy coding. After updating some packages our system broke because someone had incorrectly imported com.google.inject.internal.Sets instead of com.google.common.collect.Sets. Is it too much to ask for people to at least look at the code that is getting generated?

Saturday, November 12, 2011

Top-K Selection

Ok, so I need to select the top-K values from a list of size N where K is much smaller than N. Two approaches that immediately come to mind are:

  1. Sort the list and select the first K elements. Running time O(N*lg(N)).
  2. Insert the elements of the list into a heap and pop the top K elements. Running time O(N + K*lg(N)).

As a variant on option 2 a colleague proposed using a small heap that would have at most K elements at any given time. When the heap is full an element would be popped off before a new element could be added. So if I wanted the top-K smallest values I would use a max-heap and if the next value from the input is smaller than the largest value on the heap I would pop off the largest value and insert the smaller value. The running time for this approach is O(N*lg(K)).
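
As a rough illustration of the small-heap variant, here is a minimal Python sketch (the benchmark itself is the C++ program at the bottom of this post). Python's heapq module only provides a min-heap, so the sketch stores negated values to get max-heap behavior. The function name top_k_smallest is made up for this example and it assumes numeric input.

```python
import heapq


def top_k_smallest(values, k):
    """Return the k smallest numbers in values, in ascending order."""
    heap = []  # negated values, so -heap[0] is the largest value kept so far
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif heap and v < -heap[0]:
            # v beats the largest value kept so far: pop it and push v.
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)
```

For example, top_k_smallest([5, 1, 4, 2, 3], 2) returns [1, 2]. The heap never holds more than k items, which is where the O(N*lg(K)) bound comes from.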

For my use case both N and K are fairly small. The size of N will be approximately 10,000 elements and K will typically be 10, but can be set anywhere from 1 to 100. A C++ test program that implements all three approaches and tests them for various sizes of K is provided at the bottom of this post. The table below shows the results, in milliseconds, for K equal to 10, 50, and 100. You can see that increasing the size of K has little impact on the running time of the sort and full heap approaches, while the small heap slows down as K grows but is still the fastest of the three.

           K=10      K=50      K=100
sort       0.468977  0.466918  0.467177
fullheap   0.099325  0.102839  0.106735
smallheap  0.018063  0.028948  0.040435

Here is the chart for all values of K:


So clearly the third approach with the small heap is the winner. With a small K it is essentially linear time and an order of magnitude faster than the naive sort. The graph shows both the average time and the 99th-percentile so you can see the variation in times is fairly small. This first test covers the sizes for my use case, but out of curiosity, I also tested the more interesting case with a fixed size K and varying the size of N. The graph for N from 100,000 to 1,000,000 in increments of 100,000 tells the whole story:


Source code:

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

using namespace std;

typedef void (*topk_func)(vector<int> &, vector<int> &, int);

bool greater_than(const int &v1, const int &v2) {
    return v1 > v2;
}

void topk_sort(vector<int> &data, vector<int> &top, int k) {
    sort(data.begin(), data.end());
    vector<int>::iterator i = data.begin();
    int j = 0;
    for (; j < k; ++i, ++j) {
        top.push_back(*i);
    }
}

void topk_fullheap(vector<int> &data, vector<int> &top, int k) {
    make_heap(data.begin(), data.end(), greater_than);
    for (int j = 0; j < k; ++j) {
        top.push_back(*data.begin());
        pop_heap(data.begin(), data.end(), greater_than);
        data.pop_back();
    }
}

void topk_smallheap(vector<int> &data, vector<int> &top, int k) {
    for (vector<int>::iterator i = data.begin(); i != data.end(); ++i) {
        if (top.size() < static_cast<vector<int>::size_type>(k)) {
            top.push_back(*i);
            push_heap(top.begin(), top.end());
        } else if (*i < *top.begin()) {
            pop_heap(top.begin(), top.end());
            top.pop_back();
            top.push_back(*i);
            push_heap(top.begin(), top.end());
        }
    }
    sort_heap(top.begin(), top.end());
}

void run_test(const string &name, topk_func f, int trials, int n, int k) {
    vector<double> times;

    for (int t = 0; t < trials; ++t) {
        vector<int> data;
        for (int i = 0; i < n; ++i) {
            data.push_back(rand());
        }

        clock_t start = clock();
        vector<int> top;
        f(data, top, k);
        clock_t end = clock();
        times.push_back(1000.0 * (end - start) / CLOCKS_PER_SEC);
    }

    sort(times.begin(), times.end());
    double sum = accumulate(times.begin(), times.end(), 0.0);
    double avg = sum / times.size();
    double pct90 = times[static_cast<int>(trials * 0.90)];
    double pct99 = times[static_cast<int>(trials * 0.99)];
    cout << name << " "
         << k << " "
         << avg << " "
         << pct90 << " "
         << pct99 << endl;
}

int main(int argc, char **argv) {

    cout << "method k avg 90-%tile 99-%tile" << endl;

    int trials = 1000;
    int n = 10000;
    for (int k = 1; k <= 100; ++k) {
        run_test("sort", topk_sort, trials, n, k);
        run_test("fullheap", topk_fullheap, trials, n, k);
        run_test("smallheap", topk_smallheap, trials, n, k);
    }

    return 0;
}

Friday, November 11, 2011

Broken pipe

If you are using python to write a script, please properly handle the broken pipe exception. We have a set of command line tools at work that are extremely annoying to use because if you pipe the output through other standard tools, e.g., head, they spit out a worthless exception about a broken pipe. Consider the following test script:

#!/usr/bin/env python
import sys
buffer = ""
for i in range(100000):
    buffer += "%d\n" % i
sys.stdout.write(buffer)

Pipe it into head and look at the output:

$ ./test.py | head -n1
0
ERROR:root:damn
Traceback (most recent call last):
  File "./test.py", line 6, in <module>
    sys.stdout.write(buffer)
IOError: [Errno 32] Broken pipe

I know there was a broken pipe and I don't care. Just swallow the worthless exception so I can see the meaningful output. This is probably the number one reason I often mutter "damn python crap" when using some of these tools. So if you are writing scripts in python, please be considerate and handle the broken pipe exception. Here is an example for quick reference:

#!/usr/bin/env python
import errno
import sys
try:
    buffer = ""
    for i in range(100000):
        buffer += "%d\n" % i
    sys.stdout.write(buffer)
except IOError as e:
    # Swallow only the broken pipe; re-raise anything else untouched.
    if e.errno != errno.EPIPE:
        raise
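
The example above was written for Python 2. For Python 3, a minimal sketch of the same idea follows. Here write_all is a hypothetical helper name, and the sketch assumes a POSIX-like system where a closed pipe surfaces as an OSError with errno set to EPIPE (Python 3 raises this as BrokenPipeError).

```python
import errno
import sys


def write_all(lines, out=sys.stdout):
    """Write each line to out; return False if the reader went away."""
    try:
        for line in lines:
            out.write(line + "\n")
        out.flush()
    except OSError as e:
        # Only a broken pipe is swallowed; any other error is re-raised.
        if e.errno != errno.EPIPE:
            raise
        return False
    return True
```

A script could call write_all(str(i) for i in range(100000)) and simply exit quietly when it returns False. Python 3 may still attempt a final flush of stdout at interpreter exit, so a real script may need a little more care than this sketch shows.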

Scala, regex, and null

One of the perks of using scala is that I hardly ever see a NullPointerException unless I'm working with java libraries. The primary reason is because most scala libraries tend to use Option rather than null. However, while using a regex with pattern matching I was surprised by a NullPointerException when trying to look at the result of an optional capture group. Consider the following example:

scala> val Pattern = "(a)(b)?".r
Pattern: scala.util.matching.Regex = (a)(b)?

scala> "a" match { case Pattern(a, b) => printf("[%s][%s]%n", a, b) }
[a][null]

scala> "ab" match { case Pattern(a, b) => printf("[%s][%s]%n", a, b) }
[a][b]

I just assumed that b would be of type Option[String]. There is probably a good reason for this travesty, my guess would be something about making it work with the type system, but after using scala for a while it just seems wrong to be getting a null value.

Monday, October 31, 2011

Crumbly Cable

It is Halloween so I figured I should post some scary pictures. For some reason the plastic coating on the mini-usb cable that came with my Kindle-DX started crumbling as shown in the pictures below. On the bright side it is a generic cable and I have plenty of spares, but you would think Amazon would provide better cables.





Monday, September 26, 2011

Man in the middle

You typically hear the expression man in the middle in the context of an attack where someone is actively eavesdropping on communications that are intended to be private. However, it is also an invaluable tool for debugging network programs. One of my favorite tools is netcat and this tool makes it trivial to implement a simple eavesdropping script. This can also be done with tools such as tcpdump, but I find that netcat is a bit simpler for most tasks and it is more likely to be available on the machine.

The script I typically use is shown below. Essentially it has an endless loop that listens on a port, tees the data that comes in to a request log file, sends the input to a remote server, tees the response from the remote server to a response log, and then writes the data back to a named pipe that is connected to stdin on the netcat process that was listening.

#!/bin/bash

function logFile {
   echo "$(date +%Y-%m-%d-%H-%M-%S).${1}.log"
}

function serveRequests {
   port=$1
   remoteHost=$2
   remotePort=$3
   while true; do
       rm -f backpipe
       mkfifo backpipe
       cat backpipe |
           nc -l $port |
           tee -a $(logFile request) |
           nc $remoteHost $remotePort |
           tee -a $(logFile response) >backpipe
   done
}

port=${1:-12345}
remoteHost=${2:-localhost}
remotePort=${3:-80}
serveRequests $port $remoteHost $remotePort
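
For machines where netcat is not available, the core of one direction of that pipeline (read from a socket, log a copy, forward to the other side) can be sketched in Python. This is only an illustration: pump is a hypothetical helper, src and dst are assumed to be connected socket objects, and a usable proxy would still need the listening and connecting code plus one pump per direction.

```python
def pump(src, dst, log):
    """Copy bytes from src to dst until EOF, writing a copy to log.

    src needs recv(), dst needs sendall(), and log needs write(), for
    example two sockets and a file opened in binary append mode.
    Returns the number of bytes forwarded.
    """
    total = 0
    while True:
        chunk = src.recv(4096)
        if not chunk:  # an empty read means the peer closed the connection
            break
        log.write(chunk)
        dst.sendall(chunk)
        total += len(chunk)
    return total
```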

Wednesday, September 21, 2011

Inexcusable laziness

Below is the text of a warning email I received from an internal intrusion detection system:
Subject: IMPORTANT: security violations found for cluster

Security violations found for instances of cluster: foobar

To see the full report go to:
http://ids.mycompany.com/reports/cluster/<clustername>
There are many things I think could be improved with this email, but the primary thing that annoyed me was the link to the full report. Why not insert the actual cluster name so I can just click on the link? The way it is I have to copy the prefix of the url and then type or copy in the cluster it is complaining about. Are functional links too much to ask for?

Sunday, September 4, 2011

U.S. Drought Monitor

The University of Nebraska-Lincoln has a nice image that sums up the current drought quite nicely:


You can browse their Drought Monitor site for more information.

Saturday, September 3, 2011

Happy New Year!

Well, not really. A week ago I posted about a bug in the dates shown on the overall stats view of Blogger. The graph shows the transition from 2009 to 2010 as happening on September first. I was curious if that meant that 2012 would show up as starting in September as well. The answer is no:


It looks like they are always considering the current year to consist of the current month plus the previous 11 months. Notice in the figure that after September started the transition from 2009 to 2010 now occurs on October first. As a side note this is not fixed in the new Blogger interface:


Saturday, August 27, 2011

Google can't tell time

I was looking at the stats for this blog and noticed something strange, the time scale started with 2009 May, but the blog didn't exist until 2010 March. It seems Google, and more specifically Blogger, uses a really odd calendar for their stats page. Looking at the all time overview the graph starts with 2009 May and then transitions as expected to 2009 August. At this point it shows the next month is 2010 September. See the screen capture below to see what I'm talking about:
So everyone get ready, New Year's Day is this coming Thursday. I'm curious to see if the mistake is consistent and it shows next month as 2012 September. The 2012 apocalypse could be sooner than we thought.

Changes in sea level

One of the downsides to having a PhD is that my relatives have no idea what it is I do. For some it seems that having a PhD means that I am a "scientist" with expertise in anything that could be lumped under the term science. My degree is actually in computer science and then I specialized in a narrow subset of that field. From there I went into industry to work on projects that are only vaguely related to my dissertation. My general point though is that I don't have any particular expertise in other scientific fields. I do have an interest and read some books intended for laymen, but that doesn't mean I'm an expert on those topics.

A number of my relatives have recently taken an interest in global warming and wanted my opinion on the matter because I was the "scientist" in the room. My general answer is to just point to the scientific consensus and state that I provisionally accept it without being aware of all of the details. The EPA provides a summary of evidence for climate change and in particular a page describing the current state of knowledge. This put me at odds with them suggesting that global warming is some sort of conspiracy and that there is no real evidence. I couldn't really tell whether they disagree that the earth is warming or whether they just disagree that human activity is causing or at least contributing to it. The only thing that really seemed consistent in their arguments was the certainty that they were right and that nothing needed to be done to rectify the situation. Of course, I lost the debate because my typical retort was "I don't know." I haven't spent much time looking into global warming so I'm not that well versed on the evidence to support it.

Anyway, there were a number of questions that came up and for my own curiosity I wanted to know the answer. For this particular post, the question I want to look at is: what do we know about changes in the sea level? This question comes up in the context of global warming because melting ice sheets lead to an increase in sea level. My relatives were interested in the doom and gloom reports, but the numbers that caught my eye were the rate of increase since the 1960s:
Global average sea level rose at an average rate of around 1.8 mm per year over 1961 to 2003 and at an average rate of about 3.1 mm per year from 1993 to 2003.
That is a pretty accurate measurement. Go to the ocean and look at the waves and tides. How would you accurately measure the average sea level? Now consider less obvious sources of problems such as evaporation and how much water gets stuck on land vs returned to the oceans from year to year. These days we use measurements from satellites to help improve the accuracy, but what is the tried and true technique for measuring the sea level?

The answer is the tide gauge. Tide gauges are cool because of the simplicity of the basic mechanism. It is essentially a big pipe with a hole below the sea level to allow water in. The pipe protects the water inside from all of the normal disturbances on the surface such as waves. As the name suggests it will still vary with the tides, but it allows fairly accurate measurement of the high and low tides. If you record this for a long enough time you could work out the average sea level for the location of the gauge as well as how this has changed over time. Place enough of these devices around the world and keep track of the measurements and you can figure out the average global sea level and if there are any discrepancies across the world.

The next obvious question is: with such a simple device how much tide history do we have? Humans have had the technology to make such a device for a long time, especially if you allow for manual measurement instead of various automated schemes that seem to have been developed in the mid to late 1800s with Kelvin's tide gauge. The Permanent Service for Mean Sea Level (PSMSL) has a page listing various long term records with the oldest being from Amsterdam starting in the year 1700. They also provide the data with sea level values relative to the NAP (a fixed reference level frequently used in Europe):

From this data set there is a clear rise in mean sea level starting in the 1800s. The PSMSL site also lists data from Stockholm over the period 1774 to 1984. Does it show the same trend? At first glance the answer is no:

What is going on? How could this record be so different? The explanation I found was in the paper Swedish Sea Level Series - A Climate Indicator: the land is rising. Specifically this is known as post-glacial rebound (although glacial isostatic adjustment seems to be the preferred term now) and the land is slowly rising after being depressed by the weight of huge ice sheets that covered the region in a previous ice age. To figure out what the sea level has done we need to remove the trend caused by the rising land. Post-glacial rebound can vary from place to place, e.g., see the paper Measuring Postglacial Rebound with GPS and Absolute Gravity that looks at four sites in North America. That doesn't help me for determining the rate for Stockholm, Sweden though. I couldn't find a good data set for showing what the rate of the land rise is, and the sources I read suggest it varies over time. The Swedish Sea Level Series paper mentioned earlier shows a trend, but it isn't clear to me what source they used for the rate. Various sources I have seen suggest the rate is less than 1 cm per year now; one example is the description of the lake Mälaren. If I assume a rate of 5 mm per year and generate a trend the output looks close to what the Swedish report is showing:
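
One way to apply such a correction is to add the assumed land rise back onto each reading. The sketch below is only an illustration of that arithmetic: remove_land_uplift is a made-up helper, the constant 5 mm per year rate is the assumption stated above, and the sample readings are invented for the example (they are not PSMSL data).

```python
def remove_land_uplift(readings, rate_mm_per_year=5.0):
    """Add assumed land uplift back to (year, level_mm) readings.

    Levels are measured relative to the land, so rising land makes the
    sea appear to fall. The first reading's year is the reference year.
    """
    if not readings:
        return []
    ref_year = readings[0][0]
    return [(year, level + rate_mm_per_year * (year - ref_year))
            for year, level in readings]


# Invented readings in which the sea appears to fall 4 mm per year.
sample = [(1800, 0.0), (1850, -200.0), (1900, -400.0)]
corrected = remove_land_uplift(sample)
# corrected is [(1800, 0.0), (1850, 50.0), (1900, 100.0)], a 1 mm per year rise
```

With a real record the uplift rate would not be constant, so this straight-line correction is a first approximation at best.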


There is also data provided for Liverpool. For this data set there are two versions: an annual MHW and an adjusted MHW. For those who are not aware, MHW is the mean high water measurement. The adjustment is supposed to be described in the associated paper, but I could not find a version of the paper for free online so I don't know exactly what was done. I included both in the graph:



Both the annual and adjusted MHW show that the MHW level is rising. The adjusted value rises at a slower rate than the raw annual reading, in particular for older measurements. A 2008 paper uses these sources to show a trend in mean sea level from 1700 to 2000. The general trend is that mean sea level has been increasing for the last 200 years.

So what does this tell us? The three data sets being checked here are all from Europe and it is pretty clear that the sea level in that region has been rising for the last 200 years. It would be interesting to see what the longer term sea level has looked like. In particular, are there any long term cyclical trends that take place over thousands of years? Such data may exist; I only looked at a handful of the top sources that came up when searching for information and, as stated in the opening, I am not an expert in this field. There are some interesting complications such as the post-glacial rebound that can make it difficult to discern what is really happening. I should also point out that I did not look at the evidence for what was causing the rise in sea level. Of course, the general explanation is melting ice sheets as well as just the expansion of water due to warmer oceans. At some point I'll have to look at the evidence for the actual temperature changes, but most importantly (to me at least), it looks like climate change will be an interesting topic to explore.

Saturday, August 20, 2011

Calendar woes

The first bit of Calendar nonsense I encountered today was some spam that a family member felt compelled to forward. The message was:
Money bags

This year, July has 5 Fridays, 5 Saturdays and 5 Sundays. This happens once every 823 years. This is called money bags. So, forward this to your friends and money will arrive within 4 days. Based on Chinese Feng Shui. The one who does not forward.....will be without money.

Kind of interesting - read on!!!

This year we're going to experience four unusual dates.

1/1/11, 1/11/11, 11/1/11, 11/11/11 and that's not all...

Take the last two digits of the year in which you were born - now add the age you will be this year,

The results will be 111 for everyone in whole world. This is the year of the Money!!!

The proverb goes that if you send this to eight good friends, money will appear in next four days as it is explained in Chinese Feng Shui.

Those who don't continue the chain won't receive.

Its a mystery, but it’s worth a try. Good luck
Okay, so we are now well into August so why am I getting this crap touting how special July was? Well, I have given up trying to explain to certain relatives that any mail with the phrase "send to everyone you know" is worthless garbage, but then again I'm enough of a loser to actually read through some of this tripe so I guess I can't complain too much. The first claim is about how special the so-called "Money Bags" month is and that it only occurs every 823 years. This immediately strikes me as being wrong. A year is usually 365 days and 365 mod 7 is 1. So if we ignored leap years, then the first of July would return to the same day of week every 7 years. I was too lazy to do the math to see when this would occur once leap years are factored in; that is what computers are good for:
$ gseq 2011 2025 | xargs -I'{}' cal 7 '{}' | grep -B2 -A5 '1  2$' 
     July 2011
Su Mo Tu We Th Fr Sa
                1  2
 3  4  5  6  7  8  9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
--
     July 2016
Su Mo Tu We Th Fr Sa
                1  2
 3  4  5  6  7  8  9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
--
     July 2022
Su Mo Tu We Th Fr Sa
                1  2
 3  4  5  6  7  8  9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
So clearly the July "Money Bags" month is nothing special. Hopefully posting it on the internet and mocking the message will also bring money my way in 4 days via the awesome power of Chinese scented bullshit. Their next claim is a rigged math test that is supposed to be 111 for everyone in the world. Unfortunately, they seem to have forgotten that a few people have been born after the year 1999, and for those individuals the result will be significantly less than 111. Also, according to wikipedia there are a few people still alive that were born in the 1890s and they will get a number considerably higher than 111.
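
Both checks are easy to reproduce without cal. The Python sketch below is just an illustration; the two function names are invented for this example, and the second one ignores whether the birthday has already happened this year, as the chain mail does.

```python
import datetime


def years_with_friday_july_first(start, end):
    """Years in [start, end] where July 1 is a Friday.

    A 31 day month that starts on a Friday has 5 Fridays, 5 Saturdays
    and 5 Sundays.
    """
    return [y for y in range(start, end + 1)
            if datetime.date(y, 7, 1).weekday() == 4]  # Monday is 0


def last_two_digits_plus_age(birth_year, current_year=2011):
    """The chain mail's sum: last two digits of birth year plus age."""
    return birth_year % 100 + (current_year - birth_year)
```

years_with_friday_july_first(2011, 2025) returns [2011, 2016, 2022], matching the cal output above. last_two_digits_plus_age gives 111 for 1950, but 11 for 2005 and 211 for 1895.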

Well that's enough criticism of the spam email. The second bit of calendar fun was from the reference site I used for the end of the world post. It has a number of errors in the description of common mistakes made by those predicting the end of the world. The description provided is:
An untold number of people have tried to predict the Lord's return by using elaborate time tables. Most date setters do not realize mankind has not kept an unwavering record of time. Anyone wanting to chart for example 100 BC to 2000 AD would have contend with the fact 46 BC was 445 days long, there was no year 0 BC, and in 1582 we switched from Julian Years (360 days) to Gregorian (365 days). Because most prognosticators are not aware of all these errors, from the get go their math is already off by several years.
The basic idea is correct: there have been many calendars over time and the nuances of those calendars make it very difficult if not impossible to determine exactly when recorded events happened. So it is true that 46 BC was 445 days long because of accumulated errors in the Roman calendar. However, it fails to mention that 45 BC was the first year that the Julian calendar was used. It is also true that there is no year 0 in either the Julian or Gregorian calendars and that some Catholic countries adopted the Gregorian calendar in 1582, but adoption was a long process that took hundreds of years. For example, Greece did not adopt the Gregorian calendar until 1923.

This brings us to the most glaring mistake, that is the claim that the Julian year was 360 days. The Julian calendar has 365 days and calls for a leap year every 4 years making the average length 365.25 days. In fact, the primary change with the Gregorian calendar is to fix some of the long term drift that occurs because the actual number of days in a solar year is about 365.25 days - 11 minutes. Do the math: 24 hours / 11 minutes is approximately 131. That means that after 131 years the Julian calendar will be off by a full day. After 393 years the calendar would be off by 3 days. In the Gregorian calendar the rules for leap years were changed to be years divisible by 4 unless the year is divisible by 100, but if the year is divisible by 400 it is still a leap year. This correction is still not exact, but it does a better job than the Julian calendar. I'm just guessing, but maybe they were thinking of the Egyptian calendar, which had 12 months of 30 days (360 days) followed by 5 extra days.
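
The arithmetic and the two leap year rules are short enough to write down. This is a minimal Python sketch with names of my own choosing; it treats years as plain AD year numbers and takes the 11 minute figure above as given.

```python
def is_julian_leap_year(year):
    """Julian rule: every fourth year is a leap year."""
    return year % 4 == 0


def is_gregorian_leap_year(year):
    """Gregorian rule: century years must also be divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Years for an 11 minute annual error to add up to a full day.
years_per_day_of_drift = 24 * 60 / 11  # about 130.9

# Average Gregorian year length over one 400 year cycle.
gregorian_leap_years = sum(is_gregorian_leap_year(y) for y in range(1, 401))
average_gregorian_year = 365 + gregorian_leap_years / 400
```

The cycle contains 97 leap years, so the average Gregorian year is 365.2425 days compared with 365.25 for the Julian calendar. Under these rules 1900 is a leap year in the Julian calendar but not in the Gregorian one, while 2000 is a leap year in both.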

It should be pointed out that these errors don't change their premise. If anything their position is reinforced as there are many more complexities than they indicated. History is messy, and this includes the history of how we measure and record the time.

webkit2png

If you use Mac OS and ever need to create an image of an html page, then take a look at webkit2png. It is a simple python script that uses webkit to generate a png image of the whole webpage as it is rendered in a browser. You can see an example of the results by looking at my recent scaladoc example post.

The world is going to end!

Though probably not anytime soon. I found a site listing 242 dates for the end of the world. Not sure how accurate the list is, if nothing else I'm sure they missed a few predictions. For my purposes I was curious how the number of predictions varied over time. The site mentioned above was the best list I found in an easy to parse format with a large number of predictions. The authors of the site state that making end of the world predictions is ludicrous because of Mark 13:32. I prefer the simpler explanation that there is no evidence for the ridiculous claims being made.
It looks like a fair number of the predictions put the end of the world on 500 year boundaries. In particular there are spikes at year 1000 and year 2000. Below are close up graphs for those periods:
Maybe we will get lucky and after the 2012 nonsense is over doomsday predictions will take a rest until we get closer to the year 2500. However I'm predicting that with technology making it easier to both record and disseminate these types of wacky claims, we will see a steady stream going forward.

Monday, August 15, 2011

Scaladoc wiki syntax

I was having some trouble getting the scaladoc wiki syntax to work properly so I finally spent some time and read through the code to learn the quirks. Since I prefer examples to lengthy explanations, I'm posting the reference example I used for testing along with a screen shot showing the actual rendering. So here is the example markup:
/**
 * Example of using scaladoc wiki syntax. I use this example to make sure
 * [[https://wiki.scala-lang.org/display/SW/Syntax scaladoc syntax page]]
 * examples actually work. In particular, I could not get the wiki syntax lists
 * to work based on the documentation.
 *
 * This is another paragraph (note the empty line above) containing '''bold''',
 * ''italic'', `monospace`, __underline__, ^superscript^, and ,,subscript,,
 * words. This sentence uses the inline elements specified in the section
 * "Inline elements" on the syntax page that are sometimes different from the
 * example for _italic_, *bold*, +underline+, {{monspace}}, ^superscript^, and
 * ~subscript~. Why are there multiple ways of specifying the same format?
 * Apparently there aren't, the ones from the inline elements section do not
 * work.
 *
 * == Inline elements ==
 * This section contains a correct listing of inline elements. It is also a
 * handy example of an unordered list as well as escaping.
 *
 *  - '''Italic''': `''text''` becomes ''text''.
 *  - '''Bold''': `'''text'''` becomes '''text'''.
 *  - '''Underline''': `__text__` becomes __text__.
 *  - '''Monospace''': use backticks, I couldn't figure out how to escape
 *    other than sticking something in an inline monospace section, `text`.
 *  - '''Superscript''': `^text^` becomes ^text^.
 *  - '''Subscript''': `,,text,,` becomes ,,text,,.
 *  - '''Entity links''': `[[scala.collection.Seq]]` becomes
 *    [[scala.collection.Seq]]. As far as I know there is no way to link to
 *    external scaladoc so this is useless except for linking to other classes
 *    in the same build.
 *  - '''External links''': `[[http://scala-lang.org Scala web site]]` becomes
 *    [[http://scala-lang.org Scala web site]].
 *
 * == Block elements ==
 * Paragraphs should be obvious by now, just include a blank line. So lets move
 * to code blocks with a simple fibonacci example:
 *
 * {{{
 * def fib(n: Int) = if (n < 2) n else fib(n - 1) + fib(n - 2)
 * }}}
 *
 * Headings are pretty straightforward, lets show some examples:
 * =h1=
 * Note that the default style for h1 makes it some white color with a drop
 * shadow that is difficult to see in the main body of the documentation.
 * ==h2==
 * ===h3===
 * ====h4====
 * =====h5=====
 * ======h6======
 *
 * == Lists ==
 * There is an example unordered list for the inline elements. This example will
 * be more gratuitous and try the various list types that are supported. I must
 * be an idiot because I couldn't figure out how to make unordered lists work
 * without looking at the scaladoc source code. Now it seems rather obvious
 * from the instructions. The problem I had was the "`$` is the left margin"
 * bit. I kept trying to include a `$` in the code to no avail. The other
 * problem is that the first whitespace after the `*` is ignored. However,
 * I still contend that with a simple example it would have been obvious right
 * away, so here are some list examples that have been tested and actually
 * generate a list:
 *
 *  1. item one
 *  1. item two
 *    - sublist
 *    - next item
 *  1. now for broken sub-numbered list, the leading item must be one of
 *     `-`, `1.`, `I.`, `i.`, `A.`, or `a.`. And it must be followed by a space.
 *    1. one
 *    2. two
 *    3. three
 *  1. list types
 *    I. one
 *      i. one
 *      i. two
 *    I. two
 *      A. one
 *      A. two
 *    I. three
 *      a. one
 *      a. two
 *
 * I didn't see it mentioned on the document but you can also add a horizontal
 * rule with 4 dashes. See hr below:
 *
 * ----
 *
 * Ok now a brief look at supported javadoc tags. `@code` gets mapped to
 * inline monospace, e.g., {@code testing}. `@docRoot` and `@inheritDoc` are
 * mapped to empty strings. `@link`, `@linkplain`, and `@value` are also mapped
 * to inline monospace, e.g., {@link link}, {@linkplain linkplain},
 * {@value value}. Note it seems linkplain is confused with link. `@literal`
 * just dumps the value in without modification, e.g., {@literal some value
 * '''in a literal''' that __will__ get wiki formatting}.
 *
 * @author subnormal numbers
 * @see scala.collection.Seq
 */
object Example {
   /**
    * Adds two integers.
    * @param v1  actual parameter
    * @param v2  actual second parameter
    * @param v3  garbage, but no warning :(
    * @return  sum of two integers
    * @throws java.io.Exception also garbage, but no warning
    * @since 1.5
    * @todo do something useful
    * @deprecated
    * @note a profound note
    * @example add(2, 2)
    */
   def add(v1: Int, v2: Int): Int = v1 + v2
}
The generated output with 2.9.0.final looks like:

Sunday, August 14, 2011

Basic coding question

When interviewing candidates I always like to include a basic coding question. The goal is to have something simple that can be done in just about any language to see if the candidate can actually write code. When I used to interview C programmers the question I used was to implement the strtok function. This function is useful because you can then follow up with questions about memory management, modifying the input parameters, thread safety, etc.

These days, I mostly interview Java programmers, but I still use string tokenization as the basic question and follow up with questions about topics such as regular expressions and unicode. One of the things that has always surprised me is the amount of variety in the answers. It seems everyone can find different ways of tokenizing strings. So when a colleague said that just about all candidates fail trying the same approach to his basic coding question, I couldn't help but wonder why he was seeing such consistency. The question is to write a function that will print a diamond to the console. The function should take a single integer parameter that will be the width of the diamond. The width must be odd or else the function should print an error.

I thought about it for a bit and had a general sketch of a program within about five minutes. Within another five minutes I had a working implementation in Python:
#!/usr/bin/env python

import sys

def printDiamond(width):
    '''
    Print a diamond to the console. The diamond must have an odd width; if the
    width is even, an error will be printed to stderr and the program will exit.
    '''
    if width % 2 == 0:
        sys.stderr.write("ERROR: width must be odd\n")
        sys.exit(1)

    # Print top half of diamond
    numSpaces = width // 2  # integer division, so the count stays an int
    numAsterisks = 1
    while numAsterisks <= width:
        sys.stdout.write(" " * numSpaces)
        sys.stdout.write("*" * numAsterisks)
        sys.stdout.write("\n")
        numSpaces -= 1
        numAsterisks += 2

    # Print bottom half of diamond
    numSpaces = 1
    numAsterisks = width - 2
    while numAsterisks >= 1:
        sys.stdout.write(" " * numSpaces)
        sys.stdout.write("*" * numAsterisks)
        sys.stdout.write("\n")
        numSpaces += 1
        numAsterisks -= 2

if len(sys.argv) < 2:
    print("Usage: %s <width>" % sys.argv[0])
    sys.exit(1)
else:
    printDiamond(int(sys.argv[1]))
So what was the common mistake he was seeing? Apparently most people start out with nested for loops and try to figure out some equations to indicate whether or not there should be an asterisk at position (i, j). At this point a lot of candidates just get bogged down trying to figure out the math and never step back to think about whether there is an easier way. I suppose the consistency is just that the basic trap is so easy to fall into for this question.
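For comparison, the nested-loop approach can be made to work once the right equation is found. The sketch below is only one possibility and is my own assumption about what such an equation could look like, not what any candidate actually wrote: cell (i, j) gets an asterisk when its Manhattan distance from the center cell is at most half the width. The name diamond_rows is hypothetical.

```python
def diamond_rows(width):
    """Return the rows of a diamond, deciding each cell (i, j) by formula."""
    if width % 2 == 0:
        raise ValueError("width must be odd")
    mid = width // 2
    rows = []
    for i in range(width):
        row = ""
        for j in range(width):
            # A cell is inside the diamond when its Manhattan distance
            # from the center cell (mid, mid) is at most mid.
            if abs(i - mid) + abs(j - mid) <= mid:
                row += "*"
            else:
                row += " "
        # Drop trailing spaces so the rows match the two-loop version.
        rows.append(row.rstrip())
    return rows


for line in diamond_rows(5):
    print(line)
```

It is short once written down, but finding the distance condition under interview pressure is exactly the math that candidates reportedly get stuck on.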

Sunday, May 8, 2011

Secure delete: why is more than one pass needed?

My mom's new computer came with some software to perform a secure delete, and after reading the advertising she asked me why it was necessary. The advertising brags about 7-pass and 35-pass options to make sure your data does not fall into the wrong hands. However, my mom just didn't get it; she thought that the data should be gone if you delete the file. I was able to answer the first question, explaining that delete just removes the index entry that refers to a given file. The data will still be there until it gets overwritten, and with the right software it can be recovered.

I wasn't as prepared to answer the follow-up question: why is more than one pass needed? Ok, so we overwrite the file once, how can it then be recovered? I didn't have a good answer, but bumbled through a guess that it was probably like a notepad where writing on the top sheet leaves traces on the pad even after the sheet is removed. To get a better idea of how people recover data, a friend pointed me to an excellent article by Peter Gutmann called Secure Deletion of Data from Magnetic and Solid-State Memory. He gives a nice summary of the basic idea:
In conventional terms, when a one is written to disk the media records a one, and when a zero is written the media records a zero. However the actual effect is closer to obtaining a 0.95 when a zero is overwritten with a one, and a 1.05 when a one is overwritten with a one. Normal disk circuitry is set up so that both these values are read as ones, but using specialised circuitry it is possible to work out what previous "layers" contained. The recovery of at least one or two layers of overwritten data isn't too hard to perform by reading the signal from the analog head electronics with a high-quality digital sampling oscilloscope, downloading the sampled waveform to a PC, and analysing it in software to recover the previously recorded signal.
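To make the idea in that quote concrete, here is a toy model in Python. It is only a sketch under loud assumptions: the 0.05 residue is taken from the 0.95 and 1.05 figures in the quote, the behavior when writing a zero is my own symmetric guess, and the function names are hypothetical. Real drives are far noisier than this.

```python
RESIDUE = 0.05  # assumed trace left behind by the previously stored bit


def analog_level(previous_bit, new_bit):
    """Toy model: the new bit dominates, the old bit nudges the level."""
    # Overwriting a 0 with a 1 gives about 0.95, a 1 with a 1 about 1.05.
    nudge = RESIDUE if previous_bit == 1 else -RESIDUE
    return new_bit + nudge


def normal_read(level):
    """Ordinary disk circuitry: anything near 1 is read as a 1."""
    return 1 if level >= 0.5 else 0


def forensic_read(level):
    """Specialised read: also guess the bit that was there before."""
    current = normal_read(level)
    previous = 1 if level > current else 0
    return current, previous


for previous_bit in (0, 1):
    for new_bit in (0, 1):
        level = analog_level(previous_bit, new_bit)
        print(previous_bit, new_bit, normal_read(level), forensic_read(level))
```

In this model the normal read always returns the new bit, while the forensic read also recovers the old one, which is the "layer" the quote talks about.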
Sometimes an oscilloscope may not be enough and you might need to use magnetic force microscopy or other techniques that require very expensive equipment. It should also be pointed out that the article was written 15 years ago, and hard drive densities have increased a lot in that time period. Microscopy techniques have no doubt improved as well, but it is still going to be much more difficult to recover data from modern drives. In the 2006 NIST Guidelines for Media Sanitization, they suggest that a single pass is enough to clear data:
For some media, clearing media would not suffice for purging. However, for ATA disk drives manufactured after 2001 (over 15 GB) the terms clearing and purging have converged. Studies have shown that most of today’s media can be effectively cleared and purged by one overwrite using current available sanitization technologies.
In short, it appears to be cost-prohibitive to recover data that has been wiped with a single pass. Let's face it, for most of the data on your computer it would probably cost more to recover than the attacker could ever get back by stealing that information, and most likely there are much faster and easier ways to steal your data.

Saturday, May 7, 2011

God's body count in perspective

I recently finished reading Drunk with Blood: God's killings in the Bible, and I was curious how God would stack up with some of the more recent mass murderers. In particular, I chose some of the names that come up frequently including Adolf Hitler, Joseph Stalin, Mao Zedong, and Pol Pot. I also included two that are more contemporary and recently in the news: Osama bin Laden and Saddam Hussein. Satan's number was too small to make the cut.

Estimating the number of people killed by these individuals is difficult and it is impossible to get a precise number that is agreed on by all historians. Instead of trying I just looked around quickly and included a low and high estimate. This approach is similar to the book, which includes one count where the Bible provides actual numbers and another count that estimates the number killed when actual numbers are not provided. One difference, however, is that the book's estimate of God's killings is probably on the low side, whereas the high estimate I'm using is probably higher than most would fairly assign to these individuals. For more nuanced estimates try Who was the Bloodiest Tyrant of the 20th Century? and 1900-2000: A century of genocides. So let's get on with it, here are the numbers:

God

Let's start with God. Steve Wells helpfully has a post with an overview of all God's killings in the Bible. The count where the Bible provides the number is 2,476,636. The estimated count for other killings where the Bible is vague is 24,634,205. Read the blog or the book if you want more information. It should be pointed out that this only includes killings mentioned in the Bible. Some may think God deserves credit for later killings as well, but they are not included in this tally.

Adolf Hitler

For the low count I used the estimated number of people killed in the Holocaust. There are various numbers that get mentioned, but 14 million seems like a reasonable estimate. The high estimate blames Hitler for all of the deaths associated with World War II, and the extreme seems to be around 78 million.

Joseph Stalin

According to Wikipedia, Stalin's death count falls somewhere between 3 million and 60 million. Other sources place the actual number between 20 and 25 million. I used the estimates from Wikipedia.

Mao Zedong

Mao Zedong killed somewhere between 10 million and 70 million people. The discrepancy is in part whether you include deaths due to famine from policies such as the Great Leap Forward. Basically, are we counting democide or genocide?

Pol Pot

The high end estimate for Pol Pot was only around 2.5 million. Given his competition, I didn't bother with a low estimate.

Osama bin Laden

Osama bin Laden was included because he was recently killed and has been in the news a lot lately. If you look at killings he planned or ordered the number is probably around 3,500 (from 1900-2000: A century of genocides). The Wikipedia article estimates the deaths from the global war on terror at 80,000 to 1.2 million. For my purposes, Osama represents the deaths from the war on terror with an estimate of 1.2 million.

Saddam Hussein

The estimate for Saddam Hussein seems to be around 600,000.

Global Deaths per Year

In addition to various tyrants, I wanted to have some kind of baseline for the comparison. I chose to use the estimated number of people that died in 2010. This number is calculated using the crude death rate of 8.37 deaths per 1000 people over a 1 year period. If the estimated population size is 6.92 billion, then the estimate for the number of people to die per year is 57.9 million.
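The arithmetic behind that baseline is simple enough to check in a few lines of Python. This is just a sketch using the two figures quoted above; the function name is hypothetical.

```python
def deaths_per_year(crude_death_rate_per_1000, population):
    """Estimate annual deaths from a crude death rate and a population."""
    return crude_death_rate_per_1000 / 1000.0 * population


# Figures from the paragraph above: 8.37 deaths per 1000, 6.92 billion people.
estimate = deaths_per_year(8.37, 6.92e9)
print("%.1f million" % (estimate / 1e6))  # prints "57.9 million"
```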

Infographic

So with those estimates, here is a quick graphic to try and put the number of deaths attributed to God into context with the others:

Wednesday, May 4, 2011

Comcast Live Chat

I hate talking to sales people. These days I expect that for most activities I should be able to accomplish everything via a website and have no interaction with an actual person. Unlike some people, I prefer this lack of interaction and fill with dread when some step mentions snail mail or having to call the company. Comcast has found a new annoyance, the Live Chat. I went to the Comcast website, filled out a form, and then the only option was to enter a live chat with a Comcast representative. The first part was an infuriating series of questions asking me to give them the information I had already entered on the form. This was followed by the representative trying to sell me a bunch of crap, er, inform me of exciting deals. At the end of this chat a survey was provided to rate the experience. Unfortunately I didn't save the survey page, but to the best of my recollection the four questions were:
Was your problem solved?
  • yes
  • no
I had to answer yes. However, my problem could have easily been solved if they had just processed the web form in a reasonable way. I gave them all of the information they needed on the form, so there was no reason to do the live chat.
Would you use this service again?
  • yes
  • no
Well, for nice high speed internet in my area Comcast is really the only choice and there was no way to avoid the live chat on the website. So yes I would use it again.
How helpful was the Comcast representative?
  • not helpful
  • helpful
The representative was as helpful as she could be given that the whole service, especially for my issue, was a complete waste. The sales pitches were annoying, but I'm sure the representatives are required to nag the customers with that garbage.
Was this service more or less work than you expected?
  • less
  • about what I expected
  • a little more than I expected
  • way more than I expected
I said way more than expected. In reality when I first saw the text on the form saying I would have to do the live chat to finish I was expecting a complete pain in the ass and waste of my time. So I suppose it was about what I expected. However, since there was no place on the survey for free text and the questions are not designed to get useful feedback, this last question seemed like the best option for ranking them poorly.

Monday, April 25, 2011

Damn Data::UUID

The Data::UUID module has the annoying behavior that it will not fail if you provide an invalid namespace when creating a v3 UUID. It will generate a UUID, but depending on the circumstances it may generate a different UUID in subsequent calls. Consider the following example:
#!/usr/bin/perl

use strict;
use warnings;

use Data::UUID;
use UUID::Tiny;

sub v3_data_uuid {
    my $namespace = shift;
    my $name = shift;
    my $ug = shift;
    $ug = Data::UUID->new unless defined $ug;
    return lc($ug->create_from_name_str($namespace, $name));
}

sub v3_uuid_tiny {
    my $namespace = shift;
    my $name = shift;
    return UUID_to_string(create_UUID(UUID_V3, $namespace, $name));
}

# Generate a v3 UUID using Data::UUID
print "Data::UUID\n";
my $ug = Data::UUID->new;
print '  1. ', lc($ug->create_from_name_str(UUID_NS_DNS, 'abc')), "\n";
print '  2. ', lc($ug->create_from_name_str(UUID_NS_DNS, 'abc')), "\n";
print '  3. ', v3_data_uuid(UUID_NS_DNS, 'abc'), "\n";
print '  4. ', v3_data_uuid(UUID_NS_DNS, 'abc'), "\n";
print '  5. ', v3_data_uuid(UUID_NS_DNS, 'abc', $ug), "\n";

# Generate a v3 UUID using UUID::Tiny
print "UUID::Tiny\n";
print '  1. ', UUID_to_string(create_UUID(UUID_V3, UUID_NS_DNS, 'abc')), "\n";
print '  2. ', UUID_to_string(create_UUID(UUID_V3, UUID_NS_DNS, 'abc')), "\n";
print '  3. ', v3_uuid_tiny(UUID_NS_DNS, 'abc'), "\n";
print '  4. ', v3_uuid_tiny(UUID_NS_DNS, 'abc'), "\n";

# Generate a v3 UUID using Data::UUID with an invalid namespace
print "Data::UUID - bad namespace\n";
my $namespace = 'namespace';
print '  1. ', lc($ug->create_from_name_str($namespace, 'abc')), "\n";
print '  2. ', lc($ug->create_from_name_str($namespace, 'abc')), "\n";
print '  3. ', v3_data_uuid($namespace, 'abc'), "\n";
print '  4. ', v3_data_uuid($namespace, 'abc'), "\n";
print '  5. ', v3_data_uuid($namespace, 'abc', $ug), "\n";

# Generate a v3 UUID using UUID::Tiny with an invalid namespace
print "UUID::Tiny - bad namespace\n";
print '  1. ', UUID_to_string(create_UUID(UUID_V3, $namespace, 'abc')), "\n";
This example also uses UUID::Tiny as a point of comparison. The output that I get when running this example is:
$ ./uuid.pl 
Data::UUID
  1. 1fc38bb9-aae5-3dba-9894-38925088c9c0
  2. 1fc38bb9-aae5-3dba-9894-38925088c9c0
  3. 1fc38bb9-aae5-3dba-9894-38925088c9c0
  4. 1fc38bb9-aae5-3dba-9894-38925088c9c0
  5. 1fc38bb9-aae5-3dba-9894-38925088c9c0
UUID::Tiny
  1. 5bd670ce-29c8-3369-a8a1-10ce44c7259e
  2. 5bd670ce-29c8-3369-a8a1-10ce44c7259e
  3. 5bd670ce-29c8-3369-a8a1-10ce44c7259e
  4. 5bd670ce-29c8-3369-a8a1-10ce44c7259e
Data::UUID - bad namespace
  1. 15da4c6e-29ae-3f6c-a7ed-ec36373d4e5d
  2. 15da4c6e-29ae-3f6c-a7ed-ec36373d4e5d
  3. 8fa130be-2517-3e29-a087-c7d75545a62c
  4. 8fa130be-2517-3e29-a087-c7d75545a62c
  5. 8fa130be-2517-3e29-a087-c7d75545a62c
UUID::Tiny - bad namespace
UUID::Tiny::string_to_uuid(): 'namespace' is no UUID string! at ./uuid.pl line 50
When a valid UUID is used for the namespace, Data::UUID consistently generates the same value. However, if I pass in a bad value such as the string literal namespace, then I get different values depending on how/where/when it is called. UUID::Tiny on the other hand dies with a nice error message telling me exactly what is wrong.
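For what it's worth, the fail-fast behavior is easy to get in other languages too. Here is a minimal Python sketch using the standard uuid module; the wrapper name v3_uuid is hypothetical, and this is an analogue of the UUID::Tiny behavior rather than a port of either Perl module.

```python
import uuid


def v3_uuid(namespace, name):
    """Return a v3 UUID string, rejecting namespaces that are not UUIDs."""
    if not isinstance(namespace, uuid.UUID):
        # uuid.UUID() raises ValueError for strings such as 'namespace'.
        namespace = uuid.UUID(namespace)
    return str(uuid.uuid3(namespace, name))


# The same inputs give the same UUID on every call.
print(v3_uuid(uuid.NAMESPACE_DNS, "abc"))
print(v3_uuid(uuid.NAMESPACE_DNS, "abc"))

# A bad namespace fails loudly instead of producing a bogus UUID.
try:
    v3_uuid("namespace", "abc")
except ValueError as err:
    print("bad namespace: %s" % err)
```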

Polls and Research on Public Acceptance of Evolution

It was brought to my attention that my post on the acceptance of evolution by state was cited in an ASA article Polls and Research on Public Acceptance of Evolution. The article provides a summary of around 50 sites discussing the public acceptance of evolution. The ASA, American Scientific Affiliation, is an organization of Christian scientists so I would guess they have some bias towards creationism. That said, this is the first time I have heard of the ASA and their article seems to provide a fair summary of the findings presented.

Saturday, March 5, 2011

What is Life?

ASU Origins Project ASU Gammage Auditorium Feb. 12, 2011

Richard Dawkins, J. Craig Venter, Nobel laureates Sidney Altman and Leland Hartwell, Chris McKay, Paul Davies, Lawrence Krauss, and The Science Network’s Roger Bingham discuss the origins of life, the possibility of finding life elsewhere, and the latest development in synthetic biology.

Sunday, February 27, 2011

SIGMOD Experimental Repeatability Requirements

I was recently pleased to hear about the SIGMOD Experimental Repeatability Requirements. The stated goal is:
The goal of the repeatability/workability effort is to ensure that SIGMOD papers stand as reliable, referenceable works for future research. The premise is that experimental papers will be most useful when their results have been tested and generalized by objective third parties.
Apparently it was first done for SIGMOD 2008 and has been getting refined since then. This is something that I think has been lacking in computer science research for quite a while. The obvious benefit is that it makes it much harder to fake results or to report something in a paper that is not really what was implemented and tested. The other benefit is to researchers trying to build on or provide a better alternative to the method presented in a given paper. Anyone who has gone through a graduate computer science program will know the pain of trying to figure out exactly what was done in some paper using only the written description. If you are lucky there is clear pseudocode that can be translated into a working program. But then you still have to worry about things such as magic numbers used to tune the system, or whether the differences you are seeing in the results could be due to other factors such as the machine architecture. Being able to grab the code used for the original paper provides a much faster and more accurate basis for comparing it to a new method.

Unfortunately, though it seemed to be a promising idea, I think their implementation is a crock. The first problem is that participation in the repeatability program is optional:
The repeatability & workability process tests that the experiments published in SIGMOD 2011 can be reproduced (repeatability) and possibly extended by modifying some aspects of the experiment design (workability). Authors participate on a voluntary basis, but authors benefit as well:
  • Mention on the repeatability website
  • The ability to run their software on other sites
  • Often, far better documentation for new members of research teams
The second problem is that it does not mean the code will be made available to everyone. I didn't see any mention of archiving on the 2011 description or how I would be able to get the code for a given paper. The SIGMOD 2010 requirements say the following about code archiving:
Participating in the repeatability/workability evaluation does not imply that the code will be archived for everyone to use subsequently. Pending authors' agreement, the code could be uploaded in the SIGMOD PubZone.
If I understand correctly, it means that if an author chooses to participate, then a committee for the conference will attempt to reproduce the experiments for that paper. After that is done, the code will be archived and made available only if the author chooses. What the hell were they thinking? I would much rather get rid of the committee that tries to reproduce the results and make it a requirement that the full source code and data sets be made available as supplemental material for all papers. The code needs to be out in the open so it can be scrutinized along with the paper by the broader community of researchers. Furthermore, this supplemental material should be available to those doing the initial reviews of the paper to decide whether or not to accept it.