Wednesday, January 5, 2011

Integer representation of dates

For a system at work, I needed to represent dates and times as integers. An integer representation is required because we need to import the data into a search index so we can efficiently perform point and range queries. The search system only supports range queries for integer types and one of our main use cases is querying based on a date range. Further, the system does not have a specific time data type, it just supports signed integers of 1, 2, 4, and 8 bytes. So the problem is how should we represent a date or time as an integer?

First, lets make the requirements more explicit. The input data comes is a JSON feed with ISO 8601 formatted strings for date and timestamp fields. Whatever representation we choose should be able to represent the full range of dates from the input including %Y, %Y-%m, %Y-%m-%d, and %Y-%m-%dT%H:%M:%S (formats specified using the notation for strftime). In addition, it is preferable if the representation is easy to use for a person, i.e., I would like to be able to encode a known date without needing a program.

Problems with Unix Time

The obvious first question is can we just use Unix time? It is well known and widely supported, but there are a few problems trying to use it for our application.

The first problem is that it has a limited range. If we use a 32-bit integer we run into the year 2038 problem. Right now we don't really care about dates that far in the future, but we do care about historical dates so the year 1901 problem is an immediate concern. This problem is easily solved by using a 64-bit integer giving a range of 1970 plus or minus around 290 billion years. We can afford to be that short-sighted.

The second problem is for some fields we do not know the precise timestamp. We only know a particular year, month, or day. Somehow we need to represent these dates as being distinct from a precise date. For example if all we know is that something occurred in 1970 we don't want to encode it as 1970-01-01, because now we can't distinguish it from something that we knew happened on January 1, 1970. Unix time uses all of the bits already so there is no room to represent these vague dates.

A third problem is that it is quite cumbersome for a person to figure out the unix timestamps for a given range. Say I want to find all entries within the year 1980, I would need to compute the unix timestamp values for the beginning (1980-01-01T00:00:00 is 315532800) and end of the year (1980-12-31T23:59:59 is 347155199). This is trivial to code, but for a person doing it manually it is a pain.

The second problem is the killer, Unix time is not an option. The other issues are annoying, but we could live with them, especially considering the benefits of using a representation that is common and widely supported like unix time. So, what is the alternative?

Concatenated Fields Representation

This is a representation I've seen numerous times, but I wasn't able to find an official name or spec for it. The idea is pretty simple, we have a date represented as %Y-%m-%dT%H:%M:%S, we just remove the punctuation symbols and use the values as digits so we get the numeric value %Y%m%d%H%M%S. The first thing that becomes apparent is that we need an integer type that can support at least 14 decimal digits (4 year + 2 month of year + 2 day of month + 2 hour of day + 2 minute of hour + 2 second of minute = 14 digits). If we only care about the date we need at least 8 digits and for just the time we need 6 digits. A 16-bit signed integer as a range of -32768 to 32767 so we need at least a 32-bit integer. The table below shows the max value for 32 and 64 bit signed integers using this representation for a time, date, and datetime.

32-bit max64-bit max
Time23:59:5923:59:59
Date214748-12-31922337203685477-12-31
DateTimetoo small922337203-12-31T23:59:59

For a time or date a 32-bit signed integer is more than enough. Time in particular is limited so the max value will be 23:59:59 and using 64-bits provides no value. For date and datetime types the additional room can be used to expand the range of years supported. However, even with 32-bit integers the date type can works through the year 214748 which is more than enough for our use cases, but if needed 64-bits will get you past the year 922 trillion. For datetime a 32-bit integer is too small for 14 decimal digits. A 64-bit value will provide support past the year 922 million, and more importantly it covers the full range of the 32-bit date.

How does this fit our requirements? It covers the full range of time values that we expect to get in the input. Vague dates that are missing the day and/or month can be encoded as %Y0000. The zeros are not valid for the month or the day of month so we can clearly distinguish from more specific dates. This format is also pretty easy for a human to use. So it fixes the problems with unix time, but have we added any new problems?

Yes. Probably the most annoying problem is that not all values are valid. For example the date 19701332 is nonsense. There is no 13th month or 32nd day of a month. This representation makes it really easy to have values that are complete garbage. In fact, most of the distinct values for the integer representation will not be valid dates. The table below shows the percentage of the distinct values that are valid:

32-bit max64-bit max
Time0.002% (86,400 of 4,294,967,296)0.000000000000468% (86,400 of 18,446,744,073,709,551,616)
Date2.08% (89,335,584 of 4,294,967,296)2.08% (383,692,276,733,158,848 of 18,446,744,073,709,551,616)
DateTimetoo small12.00% (2,213,609,288,860,213,248 of 18,446,744,073,709,551,616)

The second problem is closely related to the first. This representation wastes a lot of space and doesn't make good use of all the bits. For my purposes this is not a big problem, but if you are looking for a compact representation then this is a bad choice.

Finally, we cannot make use of the negative values. In the ISO 8601 representation the year can be negative to specify BCE dates. This approach cannot be used with this representation because the sign would apply to the entire value, not just the year, and that breaks the ordering. For example -0004-01-04 is earlier than -0004-12-30, but when represented as an integer -40104 is greater than -41230.

Summary

The concatenated fields representation is a user friendly way to represent common era dates as an integer. It is particularly useful if you need to support dates where only partial information is present, e.g., just the year.

3 comments:

  1. Where else have you seen this "digits concatenated without punctuation representation" for date or time or both?

    I know it wasn't in the Wikipedia article http://en.wikipedia.org/w/index.php?title=Calendar_date&diff=527552185&oldid=526137406 because I wrote that *after* reading this blog post. :-).

    Today I'm calling that representation the "RFC 5545 (iCalendar) date format", but I suspect that it has other names.

    ReplyDelete
  2. Is there any difference between your "concatenated fields representation" and what Wikipedia calls "GeneralizedTime", as defined in ITU-T X.680 ?

    http://en.wikipedia.org/wiki/GeneralizedTime

    ReplyDelete
  3. http://microformats.org/wiki/abbr-design-pattern-alternatives briefly mentions "ISO 8601 date without dashes".

    ReplyDelete