Saturday, January 29, 2011

Parallel and Bar

I recently discovered two UNIX command line tools that are really useful. For both tools, I had simple solutions that I've been using for a long time to provide similar functionality. The first tool is GNU Parallel. As the name implies, this tool provides a way to run many commands in parallel. Whats more is that it works in a manner that is consistent with xargs. To get an idea of all the cool things you can do with this tool, I highly recommend looking through the examples in the man page.

Before finding GNU Parallel, I used a simple shell script to accomplish similar tasks. Of course, my version is much more limited in what it can do, it reads one command per line from stdin and runs it in the background. The number of parallel jobs is controlled by using the bash jobs builtin to determine how many tasks are already running. The complete script is shown below.
#!/bin/bash

# Print usage
function usage {
    local prg=`basename $0`
    cat <<USAGE

Usage: $prg [options]

    Options:
    -h                   Print this help message
    -v                   Be verbose with log messages
    -n <tasks>           The number of tasks to run at a time.

USAGE
    exit 1
}

# Check arguments
numProcs=4
verbose="false"
while getopts "n:vh" option; do
    case $option in
        (n) numProcs="$OPTARG" ;;
        (v) verbose="true" ;;
        (h) usage ;;
    esac
done
shift $(($OPTIND-1))

# Read commands from stdin
while read line; do

    # Can we run another?
    while [ "$(jobs -r | wc -l)" -ge "$numProcs" ]; do
        sleep 1
    done

    # Run task
    sh -c "$line" &
done

# Wait for jobs to finish
wait
The second command is the command line progress bar. This command writes progress information to stderr while copying data from stdin to stdout. If you ever need to copy a big file and want some basic stats so you can see that things are still working, this is the tool for you. The output and options are certainly much nicer than the crude C program I was using:
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <unistd.h>

#define PAGESIZE 4096
#define STDIN    0
#define STDOUT   1

int
main() {
    char buffer[PAGESIZE];
    ssize_t length;
    int counter;
    uint64_t bytesWritten = 0;

    while ((length = read(STDIN, buffer, PAGESIZE)) > 0) {
        write(STDOUT, buffer, length);
        bytesWritten += length;
        ++counter;
        if (counter % 10 == 0) {
            fprintf(stderr, "%"PRIu64" bytes\r", bytesWritten);
        }
    }
    fprintf(stderr, "%"PRIu64" bytes\n", bytesWritten);

    return EXIT_SUCCESS;
}

Wednesday, January 19, 2011

Yahoo! is a moving company from Mumbai

I was a bit surprised when looking at Alexa's list of top sites and the site in fourth place was anshikapackersmovers. It seems Alexa has some confusion between Yahoo! and the Mumbai company Anshika Packers & Movers. Look at the site info for yahoo.com and it says:
About anshikapackersmovers (yahoo.com): providing you the best possible solutions for all kind of relocation such as home relocation, car carrier service, office or corporate relocation mumbai providing you the best possible solutions for all kind of relocation such as home relocation, car carrier service, office or corporate relocation .

Yahoo.com is ranked #4 in the world according to the three-month Alexa traffic rankings, and roughly 29% of visits to the site consist of only one pageview (i.e., are bounces). While we estimate that 32% of visitors to the site come from the US, where it is ranked #3, it is also popular in Taiwan, where it is ranked #1. Compared with internet averages, the site's audience tends to be aged under 25 and over 45; they are also disproportionately women browsing from school and home. The fraction of visits to Yahoo.com referred by search engines is approximately 7%.
I'm assuming this will get fixed relatively soon, so I included a screen capture of the Alexa site info page below.

Friday, January 14, 2011

Damn TCLLIB htmlparse

I recently needed to write a quick script to extract some information from HTML documents. Poking around I found the machine already had TCLLIB installed so I thought it would be a good opportunity to try out the htmlparse library. The library is extremely easy to use except for one annoying pitfall, the handling of attributes. The parser will invoke a callback function with the tag name, text, and attributes. A default function is provided called ::htmlparse::debugCallback. From the documentation of the param argument:
The fourth argument, param, contains the un-interpreted list of parameters to the tag.
I thought for sure they must be kidding. Do they really expect me to parse the attribute data myself? The reason I'm using a library is so I don't have to worry about all the intricacies of HTML parsing. I decided to give it a try with the example below:
#!/usr/bin/env tclsh

package require htmlparse 1.2

set html {
<html>
  <head><title>Test HTML Page</title></head>
  <body>
    <p>This is some test <a target = "_blank"
    href="http://w3.org/html"         >HTML</a> content.</p>
  </body>
</html>
}

::htmlparse::parse $html
The output of running this example:
$ ./htmlparse.tcl 
==> hmstart {} {} {
}
==> html {} {} {
  }
==> head {} {} {}
==> title {} {} {Test HTML Page}
==> title / {} {}
==> head / {} {
  }
==> body {} {} {
    }
==> p {} {} {This is some test }
==> a {} {target = "_blank"
    href="http://w3.org/html"         } HTML
==> a / {} { content.}
==> p / {} {
  }
==> body / {} {
}
==> html / {} {
}
==> hmstart / {} {}
Sure enough, the attributes are all in one big string just as the documentation stated. This is one of those times I was hoping the documentation was wrong.

Two spaces after a period

Do you still use two spaces after a period when typing documents? I remember learning this rule when I first started working with computers in the early 90s, but somewhere along the way I stopped doing it and never gave it much thought. I think a large part of the reason is that I typically use some form of markup, such as LaTeX, for doing any sort of moderately complex typography. Who really wants to learn and manually try to adhere to all of the silly rules for how to prepare a manuscript? Just use some markup with a typesetting engine to take care of all the mundane stuff like conforming to whitespace conventions. Nothing to get all worked up about.

So I was a bit surprised to see how worked up Farhad Manjoo gets over people using two spaces after a period in his article Space Invaders: Why you should never, ever use two spaces after a period. What really struck me was that he mentions the antiquated typewriters that gave rise to the double space convention, but he doesn't seem to take issue with users having to carry the burden of adhering to this style. One of the reasons I hate WYSIWYG editors is that they force the user to worry about the style as well as the content. Try to write an article having to worry about rivers, widows, and orphans that will change and need to be corrected every time you edit the text. Add to that all of the consistency issues of sentence spacing, fonts, page numbering, references to figures or sources, etc. WYSIWYG was first available, at least on a computer, in 1974 with the Bravo editor and by 1978 and it was obsolete for any serious writing tasks with the release of TeX.

Thomas Thwaites: How I built a toaster -- from scratch

Wednesday, January 5, 2011

Integer representation of dates

For a system at work, I needed to represent dates and times as integers. An integer representation is required because we need to import the data into a search index so we can efficiently perform point and range queries. The search system only supports range queries for integer types and one of our main use cases is querying based on a date range. Further, the system does not have a specific time data type, it just supports signed integers of 1, 2, 4, and 8 bytes. So the problem is how should we represent a date or time as an integer?

First, lets make the requirements more explicit. The input data comes is a JSON feed with ISO 8601 formatted strings for date and timestamp fields. Whatever representation we choose should be able to represent the full range of dates from the input including %Y, %Y-%m, %Y-%m-%d, and %Y-%m-%dT%H:%M:%S (formats specified using the notation for strftime). In addition, it is preferable if the representation is easy to use for a person, i.e., I would like to be able to encode a known date without needing a program.

Problems with Unix Time

The obvious first question is can we just use Unix time? It is well known and widely supported, but there are a few problems trying to use it for our application.

The first problem is that it has a limited range. If we use a 32-bit integer we run into the year 2038 problem. Right now we don't really care about dates that far in the future, but we do care about historical dates so the year 1901 problem is an immediate concern. This problem is easily solved by using a 64-bit integer giving a range of 1970 plus or minus around 290 billion years. We can afford to be that short-sighted.

The second problem is for some fields we do not know the precise timestamp. We only know a particular year, month, or day. Somehow we need to represent these dates as being distinct from a precise date. For example if all we know is that something occurred in 1970 we don't want to encode it as 1970-01-01, because now we can't distinguish it from something that we knew happened on January 1, 1970. Unix time uses all of the bits already so there is no room to represent these vague dates.

A third problem is that it is quite cumbersome for a person to figure out the unix timestamps for a given range. Say I want to find all entries within the year 1980, I would need to compute the unix timestamp values for the beginning (1980-01-01T00:00:00 is 315532800) and end of the year (1980-12-31T23:59:59 is 347155199). This is trivial to code, but for a person doing it manually it is a pain.

The second problem is the killer, Unix time is not an option. The other issues are annoying, but we could live with them, especially considering the benefits of using a representation that is common and widely supported like unix time. So, what is the alternative?

Concatenated Fields Representation

This is a representation I've seen numerous times, but I wasn't able to find an official name or spec for it. The idea is pretty simple, we have a date represented as %Y-%m-%dT%H:%M:%S, we just remove the punctuation symbols and use the values as digits so we get the numeric value %Y%m%d%H%M%S. The first thing that becomes apparent is that we need an integer type that can support at least 14 decimal digits (4 year + 2 month of year + 2 day of month + 2 hour of day + 2 minute of hour + 2 second of minute = 14 digits). If we only care about the date we need at least 8 digits and for just the time we need 6 digits. A 16-bit signed integer as a range of -32768 to 32767 so we need at least a 32-bit integer. The table below shows the max value for 32 and 64 bit signed integers using this representation for a time, date, and datetime.

32-bit max64-bit max
Time23:59:5923:59:59
Date214748-12-31922337203685477-12-31
DateTimetoo small922337203-12-31T23:59:59

For a time or date a 32-bit signed integer is more than enough. Time in particular is limited so the max value will be 23:59:59 and using 64-bits provides no value. For date and datetime types the additional room can be used to expand the range of years supported. However, even with 32-bit integers the date type can works through the year 214748 which is more than enough for our use cases, but if needed 64-bits will get you past the year 922 trillion. For datetime a 32-bit integer is too small for 14 decimal digits. A 64-bit value will provide support past the year 922 million, and more importantly it covers the full range of the 32-bit date.

How does this fit our requirements? It covers the full range of time values that we expect to get in the input. Vague dates that are missing the day and/or month can be encoded as %Y0000. The zeros are not valid for the month or the day of month so we can clearly distinguish from more specific dates. This format is also pretty easy for a human to use. So it fixes the problems with unix time, but have we added any new problems?

Yes. Probably the most annoying problem is that not all values are valid. For example the date 19701332 is nonsense. There is no 13th month or 32nd day of a month. This representation makes it really easy to have values that are complete garbage. In fact, most of the distinct values for the integer representation will not be valid dates. The table below shows the percentage of the distinct values that are valid:

32-bit max64-bit max
Time0.002% (86,400 of 4,294,967,296)0.000000000000468% (86,400 of 18,446,744,073,709,551,616)
Date2.08% (89,335,584 of 4,294,967,296)2.08% (383,692,276,733,158,848 of 18,446,744,073,709,551,616)
DateTimetoo small12.00% (2,213,609,288,860,213,248 of 18,446,744,073,709,551,616)

The second problem is closely related to the first. This representation wastes a lot of space and doesn't make good use of all the bits. For my purposes this is not a big problem, but if you are looking for a compact representation then this is a bad choice.

Finally, we cannot make use of the negative values. In the ISO 8601 representation the year can be negative to specify BCE dates. This approach cannot be used with this representation because the sign would apply to the entire value, not just the year, and that breaks the ordering. For example -0004-01-04 is earlier than -0004-12-30, but when represented as an integer -40104 is greater than -41230.

Summary

The concatenated fields representation is a user friendly way to represent common era dates as an integer. It is particularly useful if you need to support dates where only partial information is present, e.g., just the year.