
Retrospective time-based UUID generation (with Spark)

Update 26.03.2023: 8 years ago, 50 GB sounded more serious than it does now, but even then we could and should have done this easily on one beefy machine, without Spark or any other then-fancy tool.

I faced a problem: given about 50 GB of data in one database, export the records to another database with slight modifications, including generating a UUID for each record from its timestamp (collisions were intolerable). We had a Spark cluster, and with it the problem did not seem tricky at all: create an RDD, map it, and send the results to the target DB.
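To make the timestamp-to-UUID part concrete, here is a minimal sketch of packing a Unix timestamp into a version-1 (time-based) UUID following the RFC 4122 layout. This is my illustration, not the code from the actual job; the clockSeq and node parameters are left to the caller, and choosing them is precisely where collision avoidance lives:

```scala
import java.util.UUID

// Number of 100-nanosecond intervals between the UUID epoch (1582-10-15)
// and the Unix epoch (1970-01-01).
val UuidEpochOffset = 122192928000000000L

// Pack a Unix timestamp (milliseconds) into a version-1, RFC 4122 UUID.
// Records sharing a timestamp must differ in clockSeq or node bits,
// otherwise they collide.
def timeBasedUuid(unixMillis: Long, clockSeq: Int, node: Long): UUID = {
  val ts = unixMillis * 10000L + UuidEpochOffset // 100-ns ticks since UUID epoch
  val timeLow = ts & 0xFFFFFFFFL
  val timeMid = (ts >>> 32) & 0xFFFFL
  val timeHi  = (ts >>> 48) & 0x0FFFL            // top 12 bits of the 60-bit timestamp
  val msb = (timeLow << 32) | (timeMid << 16) | 0x1000L | timeHi // 0x1000 = version 1
  val lsb = 0x8000000000000000L |                // RFC 4122 variant bits
    ((clockSeq & 0x3FFFL) << 48) | (node & 0xFFFFFFFFFFFFL)
  new UUID(msb, lsb)
}
```

Note that millisecond-resolution timestamps leave the low 10,000 tick values of each millisecond unused, which is one more degree of freedom for disambiguating records with equal timestamps.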

(In this post I talk about Spark. I have not covered it on this blog before, but I will. In short, it is a framework for large-scale data processing, like Hadoop. It operates on abstract distributed data collections called Resilient Distributed Datasets (RDDs). Their interface is quite similar to that of functional collections (map, flatMap, etc.), plus some operations specific to Spark's distributed, large-scale nature.)
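With that, the naive plan looks roughly like the sketch below. Record, the in-memory sample data, and the commented-out writeRecord are hypothetical stand-ins for the real source and target databases, and timeBasedUuid is the helper from the previous sketch, repeated so the job compiles on its own. Passing constant clockSeq and node values, as here, would collide for records sharing a timestamp, which already hints at the difficulties ahead:

```scala
import java.util.UUID
import org.apache.spark.{SparkConf, SparkContext}

object ExportJob {
  // Hypothetical record type standing in for rows of the source database.
  case class Record(id: Long, timestamp: Long, payload: String)

  // Timestamp-to-UUID helper from the previous sketch.
  val UuidEpochOffset = 122192928000000000L
  def timeBasedUuid(unixMillis: Long, clockSeq: Int, node: Long): UUID = {
    val ts  = unixMillis * 10000L + UuidEpochOffset
    val msb = ((ts & 0xFFFFFFFFL) << 32) | (((ts >>> 32) & 0xFFFFL) << 16) |
      0x1000L | ((ts >>> 48) & 0x0FFFL)
    val lsb = 0x8000000000000000L | ((clockSeq & 0x3FFFL) << 48) |
      (node & 0xFFFFFFFFFFFFL)
    new UUID(msb, lsb)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("uuid-export").setMaster("local[*]"))

    // In the real job the RDD would be loaded from the source database;
    // a tiny in-memory sample stands in here.
    val records = sc.parallelize(Seq(
      Record(1L, 1425000000000L, "a"),
      Record(2L, 1425000000000L, "b") // same timestamp: a collision waiting to happen
    ))

    records
      .map(r => (timeBasedUuid(r.timestamp, clockSeq = 0, node = 0L), r))
      .foreachPartition { part =>
        // One connection per partition in the real job; here we only print.
        part.foreach { case (uuid, r) =>
          // writeRecord(uuid, r)  <- placeholder for the target DB client
          println(s"$uuid -> ${r.id}")
        }
      }

    sc.stop()
  }
}
```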

However, I was too optimistic: there were difficulties.