Sunday, February 27, 2011

SIGMOD Experimental Repeatability Requirements

I was recently pleased to hear about the SIGMOD Experimental Repeatability Requirements. The stated goal:
The goal of the repeatability/workability effort is to ensure that SIGMOD papers stand as reliable, referenceable works for future research. The premise is that experimental papers will be most useful when their results have been tested and generalized by objective third parties.
Apparently it was first done for SIGMOD 2008 and has been getting refined since then. This is something that I think has been lacking in computer science research for quite a while. The obvious benefit is that it makes it much harder to fake results or to report something in a paper that is not really what was implemented and tested. The other benefit is to researchers trying to build on or provide a better alternative to the method presented in a given paper. Anyone who has gone through a graduate computer science program will know the pain of trying to figure out exactly what was done in some paper using only the written description. If you are lucky there is clear pseudocode that can be translated into a working program. But then you still have to worry about things such as magic numbers used to tune the system, or whether the differences you are seeing in the results could be due to other factors such as the machine architecture. Being able to grab the code used for the original paper provides a much faster and more accurate basis for comparing it to a new method.

Unfortunately, though it seemed to be a promising idea, I think their implementation is a crock. The first problem is that participation in the repeatability program is optional:
The repeatability & workability process tests that the experiments published in SIGMOD 2011 can be reproduced (repeatability) and possibly extended by modifying some aspects of the experiment design (workability). Authors participate on a voluntary basis, but authors benefit as well:
  • Mention on the repeatability website
  • The ability to run their software on other sites
  • Often, far better documentation for new members of research teams
The second problem is that participating does not mean the code will be made available to everyone. I didn't see any mention of archiving in the 2011 description, or of how I would be able to get the code for a given paper. The SIGMOD 2010 requirements say the following about code archiving:
Participating in the repeatability/workability evaluation does not imply that the code will be archived for everyone to use subsequently. Pending authors' agreement, the code could be uploaded in the SIGMOD PubZone.
If I understand correctly, it means that if an author chooses to participate, then a committee for the conference will attempt to reproduce the experiments for that paper. After that is done, the code will only be archived and made available if the author chooses. What the hell were they thinking? I would much rather get rid of the committee that tries to reproduce the results and make it a requirement that the full source code and data sets be made available as supplemental material for all papers. The code needs to be out in the open so it can be scrutinized along with the paper by the broader community of researchers. Furthermore, this supplemental material should be available to those doing the initial reviews of the paper when deciding whether or not to accept it.

Monday, February 21, 2011

zfcat

While messing with Java, I have frequently wanted an easy way to cat a file inside of a zip archive such as a jar, war, ear, or whatever other names they have cooked up for a zip file with a manifest. As one example, when trying to set up Maven to generate an OSGi bundle, I want to generate the jar file and then look at the generated manifest to see if it is correct. There is probably an existing command line tool that provides this functionality, but I couldn't find one with the little bit of searching I tried, and it is trivial to write such a tool. So I wrote a quick "zip file cat" tool called zfcat in Scala:
#!/bin/sh
exec scala "$0" "$@"
!#

import java.io._
import java.util.zip._

object zfcat {
    // Copy the contents of the input stream to the output stream in 4 KB chunks.
    def copy(in: InputStream, out: OutputStream): Unit = {
        val buffer = new Array[Byte](4096)
        var length = in.read(buffer)
        while (length != -1) {
            out.write(buffer, 0, length)
            length = in.read(buffer)
        }
    }

    def main(args: Array[String]): Unit = {
        if (args.length < 2) {
            System.err.printf("Usage: zfcat <archive> [file ...]%n")
            exit(1)
        }

        // The first argument is the archive; the remaining arguments are entries to print.
        val zfile = new ZipFile(new File(args(0)))
        args.tail.foreach(file => {
            val entry = zfile.getEntry(file)
            if (entry != null) {
                val in = zfile.getInputStream(entry)
                copy(in, System.out)
                in.close
            } else {
                System.err.printf("Warning: zfcat: %s: no such file%n", file)
            }
        })
    }
}

zfcat.main(args)
This tool worked OK, but it just felt really slow compared to other command line tools. Java in general, and languages that run on the JVM such as Scala, seem to be a terrible choice for writing command line tools. The problem is that the JVM is designed for long running tasks. For command line tools, which are usually very short-lived jobs, programs based on the JVM are just too slow. So I wrote a new version of zfcat in Go:
package main

import (
    "archive/zip"
    "fmt"
    "io"
    "os"
)

// die prints the error and exits with a non-zero status if err is non-nil.
func die(err os.Error) {
    if err != nil {
        fmt.Fprintf(os.Stderr, "Error: %s\n", err)
        os.Exit(1)
    }
}

// copy drains r into w in 4 KB chunks and returns the first error encountered.
func copy(r io.Reader, w io.Writer) os.Error {
    const BUFFER_SIZE = 4096
    var buffer [BUFFER_SIZE]byte
    for {
        switch nr, er := r.Read(buffer[:]); true {
        case nr < 0:
            return er
        case nr == 0:
            // A zero-length read means end of input; report any error other than EOF.
            if er == os.EOF {
                return nil
            }
            return er
        case nr > 0:
            if nw, ew := w.Write(buffer[0:nr]); nw != nr {
                return ew
            }
        }
    }
    return nil
}

func main() {
    if len(os.Args) < 3 {
        fmt.Fprintf(os.Stderr, "Usage: %s <archive> [file ...]\n", os.Args[0])
        os.Exit(1)
    }

    reader, err := zip.OpenReader(os.Args[1])
    die(err)

    // Index the archive entries by name for quick lookup.
    var zfiles = make(map[string]*zip.File)
    for i := range reader.File {
        file := reader.File[i]
        zfiles[file.FileHeader.Name] = file
    }

    files := os.Args[2:]
    for i := range files {
        file := zfiles[files[i]]
        if file != nil {
            r, err := file.Open()
            die(err)
            die(copy(r, os.Stdout))
            die(r.Close())
        } else {
            fmt.Fprintf(os.Stderr, "Warning: %s: %s: no such file\n",
                os.Args[0], files[i])
        }
    }
}
The Go version is fast for short-lived jobs and seems to work great. I compared the times for three versions: 1) Scala run as a script, 2) Scala pre-compiled, and 3) the compiled Go version. Here are the times to cat a small manifest file:
$ time ./zfcat.scala test.jar META-INF/MANIFEST.MF
Manifest-Version: 1.0
Created-By: 1.6.0_22 (Apple Inc.)


real    0m1.588s
user    0m1.076s
sys     0m0.083s
$ time scala zfcat test.jar META-INF/MANIFEST.MF
Manifest-Version: 1.0
Created-By: 1.6.0_22 (Apple Inc.)


real    0m0.635s
user    0m0.822s
sys     0m0.070s
$ time ./zfcat test.jar META-INF/MANIFEST.MF
Manifest-Version: 1.0
Created-By: 1.6.0_22 (Apple Inc.)


real    0m0.028s
user    0m0.003s
sys     0m0.018s
Pre-compiling more than doubles the speed of the Scala version, but it still takes over half a second. The Go version takes less than 30 ms. I'm not sure I like error return values in Go and having to explicitly check them all the time. The decision to avoid exceptions as a control structure was deliberate, but I haven't bothered to read through the reasoning behind it, and for now it is a minor annoyance. I was also a little disappointed that I couldn't run valgrind; apparently the 6g compiler generates an executable that valgrind doesn't understand. Valgrind might work with executables generated by gccgo, but I didn't try it.
$ valgrind ./zfcat test.jar META-INF/MANIFEST.MF
bad executable (__PAGEZERO is not 4 GB)
valgrind: ./zfcat: cannot execute binary file

Monday, February 7, 2011

The United Kingdom Explained

C. G. P. Grey created this helpful video and blog post explaining the differences between the terms United Kingdom, Great Britain, and England.



It's too bad he didn't cover the ISO codes. I tend to think of England, Northern Ireland, Scotland, and Wales as being Great Britain because I usually see the ISO code we use for that market, which is en-gb. The first part is the ISO 639-1 code for the English language. The second part is the ISO 3166 code for the United Kingdom. I'm not sure why gb was chosen instead of uk. The current set of codes does not use uk for any other country, though when I'm trying to guess the code for Ukraine I almost always think it should be uk until I look it up and find that it is ua.

Saturday, February 5, 2011

Quicksort

I recently came across C.A.R. Hoare's 1962 paper on Quicksort. The paper is divided into two parts: theory and implementation. Most of the theory portion will be familiar to anyone who has taken an algorithms course. There are a few differences in terminology; the most noticeable to me was that Hoare uses the term bound for what is usually called the pivot in modern presentations. I also learned a new phrase, mutatis mutandis, which according to Wikipedia means:
Mutatis mutandis is a Latin phrase meaning "by changing those things which need to be changed" or more simply "the necessary changes having been made".
The really interesting part is the implementation details and what they reveal about the state of computers at the time. The first example is the use of an alternative sort when a partition gets small enough. This is still a common quicksort optimization, but the size at which he recommends making the switch is surprising:
Furthermore, if a segment consists of less than three or four items (depending on the characteristics of the computer), then it will be advantageous to sort it by the use of a program specially written for sorting a particular small number of items.
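To make the optimization concrete, here is a small sketch in Go (mine, not Hoare's; the cutoff value and the middle-element pivot choice are arbitrary) of a quicksort that finishes small segments with insertion sort:
// Cutoff below which quicksort hands the segment to insertion sort.  Hoare
// suggests three or four items on the 405; modern cutoffs are usually larger.
package main

import "fmt"

const cutoff = 16

// insertionSort sorts a[lo..hi] in place; it is cheap for tiny segments.
func insertionSort(a []int, lo, hi int) {
    for i := lo + 1; i <= hi; i++ {
        for j := i; j > lo && a[j] < a[j-1]; j-- {
            a[j], a[j-1] = a[j-1], a[j]
        }
    }
}

// quicksort partitions a[lo..hi] around a pivot (Hoare's "bound") and recurses,
// switching to insertion sort once a segment is small enough.
func quicksort(a []int, lo, hi int) {
    if hi-lo+1 <= cutoff {
        insertionSort(a, lo, hi)
        return
    }
    pivot := a[(lo+hi)/2]
    i, j := lo, hi
    for i <= j {
        for a[i] < pivot {
            i++
        }
        for a[j] > pivot {
            j--
        }
        if i <= j {
            a[i], a[j] = a[j], a[i]
            i++
            j--
        }
    }
    quicksort(a, lo, j)
    quicksort(a, i, hi)
}

func main() {
    a := []int{9, 4, 7, 1, 8, 2, 6, 3, 5, 0, 15, 12, 11, 14, 10, 13, 19, 16, 18, 17}
    quicksort(a, 0, len(a)-1)
    fmt.Println(a)
}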
I remember having to compare insertion sort with quicksort in college to find an appropriate cutoff point, and as I recall it was around 50 items. Of course, this will vary with the computer, the nuances of the sort implementations, and the data set being tested; I just would have expected the number to be at least 10 or so. Another example is the data provided in Table 1:

Number of Items   Merge Sort      Quicksort
500               2 min 8 sec     1 min 21 sec
1,000             4 min 48 sec    3 min 8 sec
1,500             8 min 15 sec*   5 min 6 sec
2,000             11 min 0 sec*   6 min 47 sec

The asterisks by the last two merge sort results indicate that those figures had to be estimated because the runs could not actually be performed. The reason given is the limited store size of the computer:
These figures were computed by formula, since they cannot be achieved on the 405 owing to limited store size.
The description of the machine and its capabilities:
The National-Elliott 405 computer has a delay-line working store of 512 locations, and a magnetic-disc backing store of 16,384 words. The average access time for the working store is 0.8 msec and the average access time for a block of 64 words in the backing store is 32 msec. There are 19 words of immediate-access storage, which are used to contain instructions and working space of the inner loops: the time taken by such loops is about 0.15 msec per instruction.
To get a better idea of the kind of computer he is talking about, see this video from the Yorkshire Film Archive showing the installation of a 405 computer at Dansom Lane. It won't help with understanding the memory limitations, but it does give an idea of what the physical machine they were testing on looked like. The paper is not very clear about what the data set looked like, beyond a single sentence:
The figures relate to six-word items with a single-word key.
Based on this description it is clear that the problem with merge sort is that it does not sort in place. Duplicating the data set means that merge sort on the 405 could support a maximum data set size of 16,384 words / 2 copies / 6 words per item ≈ 1,365 items. Unfortunately, there is no description of the formula used to calculate the last two times for merge sort. I was curious what the times would be for something like insertion sort. Insertion sort is in place, so it should be able to handle the larger data sets, but what would the expected running time be? An extremely rough approximation is to assume exactly N² instructions. If we use the figure from the paper of 0.15 msec per instruction and assume that each instruction also needs at least one read from the working store at a cost of 0.8 msec, we get a cost of 0.95 msec per operation. This is almost certainly an underestimate of the actual time; if nothing else, it completely ignores access to the magnetic disc. Crunch the numbers and this is what we end up with:

Number of Items   N lg(N)         N²
500               0 min 4 sec     3 min 57 sec
1,000             0 min 9 sec     15 min 50 sec
1,500             0 min 15 sec    35 min 37 sec
2,000             0 min 20 sec    1 hour 3 min 20 sec

Over an hour to sort 2,000 items, compared to less than 7 minutes for quicksort.
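For anyone who wants to check the arithmetic, here is a small Go sketch of the rough model above. It assumes exactly N lg(N) or N² operations at 0.95 msec each, which is my simplification rather than anything from the paper:
package main

import (
    "fmt"
    "math"
)

func main() {
    // 0.15 msec per instruction plus one 0.8 msec working-store read per operation.
    const msecPerOp = 0.15 + 0.8

    for _, n := range []float64{500, 1000, 1500, 2000} {
        nlgn := n * math.Log2(n) * msecPerOp / 1000 // seconds, assuming N lg(N) operations
        nsq := n * n * msecPerOp / 1000             // seconds, assuming N^2 operations
        fmt.Printf("%5.0f items: N lg(N) = %7.1f sec, N^2 = %7.1f sec\n", n, nlgn, nsq)
    }
}
For 2,000 items the N² estimate works out to 4,000,000 × 0.95 msec = 3,800 seconds, which is the 1 hour 3 min 20 sec in the table.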