Time-based (version 1) UUIDs ordering in PostgreSQL

March 28, 2016

Page content

The problem

Not so long ago I wrote about a little strange problem with time-based UUIDs I faced (Retrospective time-based UUID generation (with Spark)). This time I needed to do something more usual. As you surely know, UUID standard has several versions. Version 4 is purely random numbers, version 1 relies on the identity of a machine and the timestamp with the counter, etc.

Let’s consider version 1 UUIDs. They include timestamps with counters, so they naturally can be ordered by it. In other words, having two time-based UUIDs, I wanted to say if the first is lower, greater or equal to the second. No big deal, say, in Java: u1.timestamp().compare(u2.timestamp()). But I wanted to do this in PostgreSQL, inside SQL query. Postgresql does have uuid data type, but it provides no functions for comparing them by version 1 timestamps. That is why I decided to write such a function myself.

Version 1 UUID structure

You can find the full UUID specification in RFC 4122. Let’s consider here only the part of version 1 which we are interested in. UUIDs have length of 128 bits, i.e. 16 bytes, which have different meaning in different versions of the standard. Version 1 layout is shown on the picture:

UUID version 1 structure

UUID timestamp is a count of 100-nanoseconds intervals since 00:00:00.000 15 Oct 1582. This looks strange, but it is for the sake of rational usage of bits and to give room for the counter (which is used to prevent collisions). The timestamp is 60-bit long, which requires 8 full bytes to store - the first half of the UUID. Actually, we are only interested in this first half. The timestamp is cut into three pieces: hi - 12 bits | mid - 16 bits | low - 32 bits. First 4 bytes of an UUID is low, next two - mid, and the next two is hi. But the higher 4 bits of hi’s byte are used to store the version number. The counter is a part of low.

UUID timestamps ordering

As we know the structure of timestamps, it is easy to order them. The algorithm is the following:

Having two time-based UUIDs u1 and u2:
1. Compare hi(u1) and hi(u2):
1.1. If hi(u1) > hi(u2), return "u1 > u2";
1.2. If hi(u1) < hi(u2), return "u1 < u2".

2. Compare mid(u1) and mid(u2):
2.1. If mid(u1) > mid(u2), return "u1 > u2";
2.2. If mid(u1) < mid(u2), return "u1 < u2".

3. Compare low(u1) and low(u2):
3.1. If low(u1) > low(u2), return "u1 > u2";
3.2. If low(u1) < low(u2), return "u1 < u2";
3.3. Otherwise, return "u1 = u2".

Now, let’s write this in a form of PL/pgSQL function:

-- Compare two version 1 UUIDs by timestamps and counters.
-- Returns 1 IF u1 > u2; -1 IF u1 < u2; 0 IF u1 = u2.
CREATE OR REPLACE FUNCTION uuids_v1_compare(u1 uuid, u2 uuid)
    RETURNS SMALLINT AS $$
DECLARE
    u1_bytes BYTEA := decode(REPLACE(u1::TEXT, '-', ''), 'hex');
    u2_bytes BYTEA := decode(REPLACE(u2::TEXT, '-', ''), 'hex');
    version1 INTEGER := get_byte(u1_bytes, 6) >> 4;
    version2 INTEGER := get_byte(u2_bytes, 6) >> 4;
    time_hi1 INTEGER;
    time_hi2 INTEGER;
    time_mid1 INTEGER;
    time_mid2 INTEGER;
    time_low1 BIGINT;
    time_low2 BIGINT;
BEGIN
    -- Version must be 1.
    IF version1 <> 1 THEN
        RAISE EXCEPTION 'u1 version is %. Only version 1 must be passed.',
            version1;
    END IF;
    IF version2 <> 1 THEN
        RAISE EXCEPTION 'u2 version is %. Only version 1 must be passed.',
            version2;
    END IF;

    -- Compare time hi.
    -- 15 = 0xF = 1111
    -- The highest 4 bits is for version.
    time_hi1 = (get_byte(u1_bytes, 6) & 15) << 8 | get_byte(u1_bytes, 7);
    time_hi2 = (get_byte(u2_bytes, 6) & 15) << 8 | get_byte(u2_bytes, 7);


    IF time_hi1 > time_hi2 THEN
        RETURN 1;
    END IF;
    IF time_hi1 < time_hi2 THEN
        RETURN -1;
    END IF;

    -- Compare time mid
    time_mid1 = get_byte(u1_bytes, 4) << 8 | get_byte(u1_bytes, 5);
    time_mid2 = get_byte(u2_bytes, 4) << 8 | get_byte(u2_bytes, 5);

    IF time_mid1 > time_mid2 THEN
        RETURN 1;
    END IF;
    IF time_mid1 < time_mid2 THEN
        RETURN -1;
    END IF;

    -- Compare time low
    time_low1 = get_byte(u1_bytes, 0)::bigint << 24 |
                get_byte(u1_bytes, 1)::bigint << 16 |
                get_byte(u1_bytes, 2)::bigint <<  8 |
                get_byte(u1_bytes, 3)::bigint <<  0;
    time_low2 = get_byte(u2_bytes, 0)::bigint << 24 |
                get_byte(u2_bytes, 1)::bigint << 16 |
                get_byte(u2_bytes, 2)::bigint <<  8 |
                get_byte(u2_bytes, 3)::bigint <<  0;

    IF time_low1 > time_low2 THEN
        RETURN 1;
    END IF;
    IF time_low1 < time_low2 THEN
        RETURN -1;
    END IF;

    RETURN 0;
END;
$$ LANGUAGE plpgsql;

I know no standard way in PostgreSQL for transforming uuid type into a byte array or getting a single byte from it. That is why I used the conversion into a string with the following decoding as a hex number.

The function is quite transparent, but I tested it by comparing its result to Java’s timestamp comparison. I tested it on equal UUIDs, on UUIDs with the same time but different counters, on UUIDs with different times, sequentially and randomly, etc. You can find the tests (in Scala) in the repository. As for the performance, I have not measured it.

So, if you need to order time-based (version 1) UUIDs in PostgreSQL in this way, that is one of possible solutions. Maybe one day we will have something out-of-the-box for doing this.

All relevant code can be found in the repository.