Retrospective time-based UUID generation (with Spark)

I faced a problem: having about 50 GB of data in one database, export the records to another database with slight modifications, which included generating UUIDs based on the records' timestamps (collisions were intolerable). We have a Spark cluster, and with it the problem did not seem even a little tricky: create an RDD, map it, and send it to the target DB.

(In this post I talk about Spark. I have not written about it in this blog so far, but I will. Generally, it is a framework for large-scale data processing, like Hadoop. It operates on abstract distributed data collections called Resilient Distributed Datasets (RDDs). Their interface is quite similar to that of functional collections (map, flatMap, etc.), and it also has some operations specific to Spark's distributed, large-scale nature.)

However, I was too optimistic – there were difficulties.

UUID

The idea of a UUID, or universally unique identifier, is quite simple: it is a 128-bit value that can be generated locally and is (practically) guaranteed to be unique across the human civilization. It has so-called versions – different ways to generate it. The simplest is version 4 (random generation), but there are more sophisticated ones. I needed version 1, which is generated from a MAC address and a timestamp.
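
For illustration, the version is encoded in the value itself and can be read right off it (using the standard java.util.UUID class):

```scala
import java.util.UUID

// version 4 UUIDs are generated from random bits
val u: UUID = UUID.randomUUID()
println(u.version()) // prints 4: the version is stored inside the value
```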

This version supports up to ten thousand unique IDs on one host (one MAC address) at one moment of time (timestamp), which is achieved by using a counter. In a normal situation, generating UUIDs of this version is pretty simple: store the previous timestamp; if the next timestamp is equal to it, increment the counter; if the next one is newer, reset the counter to 0 and save the new timestamp as the previous one. (For instance, see the code from the Cassandra driver.) But this relies on an invariant: the next timestamp is equal to or greater than the previous one.
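
In Scala-like pseudocode, the usual scheme is roughly the following (a simplified sketch, not the actual driver code; synchronization is deliberately crude):

```scala
// Simplified sketch of the usual version 1 scheme: one counter,
// valid only while timestamps arrive in non-decreasing order.
class MonotonicUuidState {
  private var lastTimestamp: Long = Long.MinValue
  private var counter: Int = 0

  // Returns the (timestamp, counter) pair to build the UUID from.
  def next(timestamp: Long): (Long, Int) = synchronized {
    if (timestamp == lastTimestamp) {
      counter += 1              // same moment of time: bump the counter
    } else {
      lastTimestamp = timestamp // relies on timestamp > lastTimestamp
      counter = 0
    }
    (lastTimestamp, counter)
  }
}
```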

The problem

The problem starts when the invariant is violated, i.e. when we generate UUIDs from timestamps that do not arrive in order. This can happen when we generate them retrospectively, for a set of records that already exists and is not ordered by timestamp. This was my case.

Ordering the records using the DBMS or Spark was theoretically possible but practically quite hard due to the amount of data.

I had an idea to store a counter for each timestamp instead of only the current one (using e.g. a ConcurrentHashMap), but I estimated that it would take approximately 6 GB of additional memory per Spark worker. In our setup that was feasible, but I opted to avoid it if possible.

My solution

In fact, it is not necessary to store the counters for all timestamps at once. We can divide the records into groups by timestamp and use an individual UUID generator for each group; then we do not have to spend gigabytes of memory on all the timestamps, only on the timestamps of a particular group. The crucial requirement: all records with the same timestamp must end up in the same group (to actually make use of the counters and avoid collisions).

The UUID generator

Let us start with the generator:

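A minimal sketch of what such a generator looks like (the class name, the node parameter and the buildV1Uuid helper are illustrative; the bit packing follows the RFC 4122 version 1 layout, with the clock sequence left out for brevity):

```scala
import java.util.UUID
import java.util.concurrent.ConcurrentHashMap

// Sketch of a generator with a counter per timestamp;
// `node` stands for the MAC-derived 48-bit node value.
class TimestampUuidGenerator(node: Long) {
  // timestamp (ms) -> last used counter value for that timestamp
  private val counters = new ConcurrentHashMap[java.lang.Long, java.lang.Integer]()

  def generate(timestamp: Long): UUID = {
    var counter = 0
    var swapped = false
    // a compare-and-swap loop over the ConcurrentHashMap
    do {
      val prev = counters.get(timestamp)
      if (prev == null) {
        counter = 0
        swapped = counters.putIfAbsent(timestamp, 0) == null
      } else {
        counter = prev + 1
        swapped = counters.replace(timestamp, prev, counter)
      }
    } while (!swapped)
    buildV1Uuid(timestamp, counter)
  }

  // Packs the fields into the version 1 layout: the counter fills the
  // 100-ns ticks within a millisecond (hence up to 10,000 IDs per ms).
  private def buildV1Uuid(timestampMs: Long, counter: Int): UUID = {
    // 100-ns intervals since the UUID epoch (1582-10-15)
    val ts = timestampMs * 10000 + counter + 0x01B21DD213814000L
    val timeLow  = ts & 0xFFFFFFFFL
    val timeMid  = (ts >>> 32) & 0xFFFFL
    val timeHigh = ((ts >>> 48) & 0x0FFFL) | 0x1000L // version 1 bits
    val msb = (timeLow << 32) | (timeMid << 16) | timeHigh
    val lsb = 0x8000000000000000L | (node & 0x0000FFFFFFFFFFFFL)
    new UUID(msb, lsb)
  }
}
```
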
The generator is basically the same as the one I would write for the solution in which all the counters are stored. As you can see, the counters are kept in a ConcurrentHashMap with timestamps as keys. A counter starts at 0 and is incremented each time the same timestamp arrives again. The do-while construction is a compare-and-swap loop over the ConcurrentHashMap.

Record grouping

As I mentioned earlier, all records with the same timestamp must go to the same group, so grouping by the timestamp itself is an obvious solution. There is a small inconvenience: the records are distributed non-uniformly along the timeline – their density grows roughly linearly with time (a fair assumption for this task).

Spark supports custom partitioners, so I coded one.

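The partitioner looked roughly like this (a sketch; here the region boundaries are constructor parameters, while in reality they were derived from the actual data range):

```scala
import org.apache.spark.Partitioner

// Sketch of a partitioner over record timestamps (milliseconds).
// The timeline is split into three unequal regions that get
// 1500, 3000 and 6000 partitions: later regions are denser.
class TimelinePartitioner(start: Long, boundary1: Long, boundary2: Long, end: Long)
    extends Partitioner {

  private val regions = Seq(
    (start, boundary1, 1500),     // sparse early region
    (boundary1, boundary2, 3000), // middle region
    (boundary2, end, 6000)        // dense recent region
  )

  override val numPartitions: Int = regions.map(_._3).sum

  override def getPartition(key: Any): Int = {
    val timestamp = key.asInstanceOf[Long]
    var offset = 0
    for ((from, to, parts) <- regions) {
      if (timestamp >= from && timestamp < to) {
        // a pure function of the timestamp, so all records with
        // equal timestamps land in the same partition
        val width = (to - from).toDouble / parts
        return offset + math.min(((timestamp - from) / width).toInt, parts - 1)
      }
      offset += parts
    }
    numPartitions - 1 // timestamps past `end` go to the last partition
  }
}
```
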
As you can see, I divided the timeline into unequal regions with 1500, 3000 and 6000 partitions in them. In Spark's terms, a partition is a piece of data that is processed entirely on a single worker. We will use an individual UUID generator per partition.

Putting it all together

The essential Spark code is fairly simple:

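Modulo the actual I/O, it looked like this (a sketch: loadRecords, saveToTargetDb, the Record case class, the node value and the boundary values are placeholders for the real code):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch of the pipeline; loadRecords, saveToTargetDb, Record and
// the boundary values stand in for the real I/O code.
def exportWithUuids(sc: SparkContext, node: Long): Unit = {
  val records: RDD[(Long, Record)] = loadRecords(sc).keyBy(_.timestamp)

  records
    .partitionBy(new TimelinePartitioner(start, boundary1, boundary2, end))
    .mapPartitions { iter =>
      // one generator (and one set of counters) per partition:
      // all records with equal timestamps are already co-located here
      val generator = new TimestampUuidGenerator(node)
      iter.map { case (timestamp, record) =>
        record.copy(id = generator.generate(timestamp))
      }
    }
    .foreachPartition(saveToTargetDb)
}
```
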
I believe the piece of code is concise enough.

That is it: correct UUIDs are generated for all records.
No big deal :)