About Akka Streams

Excellent post about Akka Streams — Akka Team

Really nice introduction to #Akka Streams! FEEL THE POWER! ^^ — Viktor Klang

A must-read on #Akka #Streams!!! — Ellan Vannin CA

In many computer programs, the whole logic (or a vast part of it) is essentially step-by-step processing of data. This includes the simple case where we iterate over the data and execute the processing logic on every piece of it. However, there are a couple of complications:

  • the processing logic may be quite complex, with various aggregations, merging, routing, error recoveries, etc.;
  • we might want to execute steps asynchronously, for instance, to take advantage of multi-processor machines or use I/O;
  • asynchronous execution of data processing steps inherently involves buffering, queues, congestion, and other matters that are really difficult to handle properly (see “Handling Overload” by Fred Hébert).

Therefore, it is sometimes a good idea to express this logic at a high level using some kind of framework, without the need to implement the (possibly complex) mechanics of asynchronous pipeline processing. This was one of the rationales behind frameworks like Apache Camel and Apache Storm.

Actor systems like Erlang or Akka are fairly good for building robust asynchronous data pipelines. However, they are quite low-level by themselves, so writing such pipelines can be tiresome. Newer versions of Akka include the possibility of doing pipeline processing at quite a high level, in a library called Akka Streams. Akka Streams grew out of the Reactive Streams initiative and implements a streaming interface on top of the Akka actor system. In this post, I would like to give a short introduction to this library.

You might be interested in an introductory post about Akka in this blog.

We will need a Scala project with two dependencies:
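The dependency list itself is not shown in this excerpt; presumably the two dependencies are akka-actor and akka-stream, so a typical build.sbt would contain something like this (the version number is an assumption, check the current Akka release):

```scala
// build.sbt — hypothetical version numbers
libraryDependencies ++= Seq(
  "com.typesafe.akka" %% "akka-actor"  % "2.6.20",
  "com.typesafe.akka" %% "akka-stream" % "2.6.20"
)
```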

Akka Streams basics

In Akka Streams, we represent data processing in the form of data flowing through an arbitrarily complex graph of processing stages. Stages have zero or more inputs and zero or more outputs. The basic building blocks are Sources (one output), Sinks (one input) and Flows (one input and one output). Using them, we can build arbitrarily long linear pipelines. Here is an example of a stream with just a Source, one Flow and a Sink:
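The code itself is elided in this excerpt; a minimal sketch of such a pipeline might look like the following (the Akka 2.6-style setup, where an implicit ActorSystem is enough to run a stream, is my assumption; earlier versions also required an ActorMaterializer):

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

object HelloStreams extends App {
  implicit val system: ActorSystem = ActorSystem("hello-streams")
  import system.dispatcher

  val source = Source(1 to 10)               // Source: one output
  val double = Flow[Int].map(_ * 2)          // Flow: one input, one output
  val sink   = Sink.fold[Int, Int](0)(_ + _) // Sink: one input

  // Connect the stages into a runnable graph and run it
  source.via(double).runWith(sink).foreach { sum =>
    println(s"sum = $sum") // 2 + 4 + ... + 20 = 110
    system.terminate()
  }
}
```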
Continue reading

Time-based (version 1) UUIDs ordering in PostgreSQL

The problem

Not so long ago I wrote about a slightly strange problem with time-based UUIDs that I faced (Retrospective time-based UUID generation (with Spark)). This time I needed to do something more usual. As you surely know, the UUID standard has several versions: version 4 consists of purely random numbers, version 1 relies on the identity of a machine and a timestamp with a counter, etc.

Let’s consider version 1 UUIDs. They include a timestamp with a counter, so they can naturally be ordered by it. In other words, given two time-based UUIDs, I wanted to tell whether the first is lower than, greater than or equal to the second. No big deal in Java, say: Long.compare(u1.timestamp(), u2.timestamp()). But I wanted to do this in PostgreSQL, inside an SQL query. PostgreSQL does have a uuid data type, but it provides no functions for comparing values by their version 1 timestamps. That is why I decided to write such a function myself.

Version 1 UUID structure

You can find the full UUID specification in RFC 4122. Let’s consider here only the part of version 1 that we are interested in. UUIDs are 128 bits (i.e. 16 bytes) long, and the bits have different meanings in different versions of the standard. The version 1 layout is shown in the picture:

UUID version 1 structure
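To make the layout concrete, here is a small Scala sketch (not from the post, which does this in SQL) that reassembles the 60-bit timestamp from the UUID's most significant 64 bits, mirroring what the PostgreSQL function has to do:

```scala
import java.util.UUID

// Reassemble a version-1 UUID's 60-bit timestamp from its fields:
// time_low (32 bits), time_mid (16 bits) and time_hi (12 bits),
// all packed into the most significant 64 bits of the UUID.
def v1Timestamp(u: UUID): Long = {
  val msb     = u.getMostSignificantBits
  val timeLow = (msb >>> 32) & 0xFFFFFFFFL
  val timeMid = (msb >>> 16) & 0xFFFFL
  val timeHi  = msb & 0x0FFFL
  (timeHi << 48) | (timeMid << 32) | timeLow
}

// Sanity check against the JDK's own implementation, using the
// well-known RFC 4122 DNS namespace UUID (which is version 1):
val ns = UUID.fromString("6ba7b810-9dad-11d1-80b4-00c04fd430c8")
assert(v1Timestamp(ns) == ns.timestamp())
```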

Continue reading

Type-safe query builders in Scala revisited: shapeless

Not so long ago, I wrote a post about creating type-safe query builders in Scala from scratch. In it, I also suggested using the shapeless library to do what was described. Now I have decided to write about how that could be done. The code is here.

Problem reminder

Without going into much detail, the problem was to provide a type-safe way to build queries (to an abstract database, for instance) with parameters of different types. Something like

Continue reading

Retrospective time-based UUID generation (with Spark)

I faced a problem: having about 50 GB of data in one database, export the records to another database with slight modifications, which included UUID generation based on the timestamps of the records (collisions were intolerable). We have a Spark cluster, and with it the problem did not seem even a little tricky: create an RDD, map it and send the results to the target DB.

(In this post I am talking about Spark. I have not covered it in this blog so far, but I will. Generally, it is a framework for large-scale data processing, like Hadoop. It operates on abstract distributed data collections called Resilient Distributed Datasets (RDDs). Their interface is quite similar to that of functional collections (map, flatMap, etc.), and it also has some operations specific to Spark’s distributed and large-scale nature.)

However, I was too optimistic – there were difficulties.

Continue reading

Type-safe query builders in Scala

Recently, I was hacking on phantom, a Scala library for queries to the Cassandra database. At the time, it was not able to build prepared statements, which I needed, so I added this functionality. However, the phantom developers implemented prepared statements themselves soon after :) Nevertheless, I decided to write this post about the core idea of my implementation.

Caution: Don’t do this at home, use shapeless :)

My plan was to be able to prepare queries like this:

and after that to execute the query: query.execute("string", 1, false). Moreover, I wanted this execution to be type-safe, so that it would be impossible to execute something like query.execute(1, "string", 0).
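As a rough illustration of what type-safe means here (a simplified sketch, not the post's actual phantom-based code), the query value can carry its parameter types, so a mismatched execute call simply does not compile:

```scala
// The query remembers the types of its three parameters;
// execute only accepts arguments of exactly those types.
final case class Query3[A, B, C](sql: String) {
  def execute(a: A, b: B, c: C): String =
    s"$sql bound to ($a, $b, $c)"
}

val query = Query3[String, Int, Boolean](
  "SELECT * FROM t WHERE a = ? AND b = ? AND c = ?")

query.execute("string", 1, false)  // compiles and runs
// query.execute(1, "string", 0)   // rejected by the compiler
```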

I will abstract from phantom queries and will focus on the general idea. The code is here.

Continue reading

Linkblog #6



  • Futures for C++11 at Facebook – Facebook added futures to their Folly C++ library. Futures are a great abstraction for asynchronous programming (and in fact, this is not the first implementation of them in C++). And the article is an interesting read.
  • Realtime Webcam Sudoku Solver – this project is old (well, not very old – 2011), but very impressive. You point a webcam at a sudoku grid, and the program solves it and draws the result directly on the screen. The full process is explained in the article. A great thing to re-implement if you have a couple of weeks.



  • Papers We Love – a repository of academic computer science papers and a community who loves reading them. They even have meetups for discussion in different cities across the world.
  • DB-Engines – a knowledge base of relational and NoSQL database management systems.
  • How I Start – a collection of personal stories / tutorials from people who took up a new programming language and did something with it.


And a bit of humour for you →

Linkblog #5



  • Testing Distributed Systems w/ Deterministic Simulation by Will Wilson. I was impressed by this talk. The speaker is a Senior Engineer at FoundationDB (now Apple). He talks about how difficult it is to develop a distributed system, how much more difficult it is to test it and to reproduce nondeterministic bugs, and how they contended with all this. Spoiler: before writing FoundationDB, they wrote a deterministic simulator of FoundationDB and developed and tested the algorithms on it.


And a bit of humour for you →

Linkblog #4



  • One Hacker Way by Erik Meijer – an interesting, expressive and somewhat controversial talk about the organization of the software development process and the software business.


  • Orleans. It was bound to happen: Microsoft has released their actor framework for distributed systems, which includes many interesting features such as cluster membership. If you want more meat, here is the paper.
  • Electron. You probably use the Atom text editor, and you probably know that it is in fact Node.js + WebKit. GitHub has now released Electron, the framework that Atom is built with. You can now easily write almost-native applications with JS.
  • Hubot – another GitHub project (not as new as Electron). It lets you create a bot that can join your work chat (in HipChat, Slack, etc.; there is a list of adapters), react to messages, fulfil your commands and do other things. In one of the projects I take part in, we use it for status checking, metrics monitoring and fun.
  • Motion sensing using the Doppler effect – an amazing idea: using the Doppler effect to detect the user’s hand movements, in the browser. However, I failed to get it running :)

And a bit of humour for you →

Linkblog #3


  • NOSQL Patterns – an article about the various patterns you can meet in NoSQL systems: data partitioning, replication, cluster membership, consistency models, etc.
  • NoSQL Data Modeling Techniques. If the previous article is about the internals of NoSQL systems, this one is closer to usage. It covers data modeling techniques useful for storing your data in a NoSQL storage, such as denormalization, aggregation, hierarchy storing, etc.
  • Facebook’s Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services – an overview of a paper that describes Facebook’s approach to analyzing (internal) service performance, and some interesting findings. In short, they parse the logs of internal services, which tells them how long each part of a request takes. The approach requires no additional instrumentation, it is said.
  • A series of articles about implementing monads in C# with a bit of theory. The series shows that there is no mystery in monads and that the monad pattern can easily be implemented in a non-functional language like C#. That might be useful for understanding monads without diving into a language like Haskell. Follow the links at the end of each article to get to the next part.
  • Java Garbage Collection Distilled – an explanation of the garbage collectors used in the HotSpot JVM and OpenJDK, GC tradeoffs, and their monitoring and tuning.
  • Akka Cluster Load Balancing – an approach to adaptive load balancing in an Akka cluster based on free heap space metrics. The author has implemented a custom actor router that directs workload to the actor on the node with the lowest heap usage.



  • Bazel – Google has open-sourced their build tool Bazel, which is used to build most of their projects. (Marked as alpha.)
  • facebook-tunnel – have free Facebook traffic? You can get to your internets through Facebook chat :) An amusing proof of concept.


  • IoT Podcast – a new podcast about the Internet of Things. There are two episodes so far.


  • Principles of Reactive Programming – the new iteration of Martin Odersky, Erik Meijer and Roland Kuhn’s course about reactive programming, reactive streams, actors, Akka, etc. The material is promised to be updated and improved. It starts on 13 April. I have already finished the course, but it will be interesting to see how it has changed.
    If you are going to take the course, drop me a couple of lines; it will be interesting to discuss things.

And a funny picture for you →

The Bloom filter

In many software engineering problems, we have a set and need to determine whether some value belongs to it. If the maximum possible set cardinality (size; maximum size = the total count of elements we consider) is small, the solution is straightforward: just store the set explicitly (for instance, in the form of a red-black tree), update it when necessary, and check whether it contains the elements we are interested in. But what if the maximum set cardinality is large, or we need many such sets operating simultaneously? Or if the set membership test is an expensive operation?

Suppose we want to know whether an element belongs to a set. We have decided that it is acceptable to get false positive answers (the answer is “yes”, but the element is not actually in the set) with probability p, and not acceptable to get false negatives (the answer is “no”, but the element actually is in the set). The data structure that can help us in this situation is called the Bloom filter.

A Bloom filter (proposed by Burton Howard Bloom in 1970) is a bit array of m bits (initially all set to 0) together with k different hash functions, each of which maps a value to a single integer.
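A minimal Scala sketch of this structure (the double-hashing trick used below to derive the k indices from two base hashes is a common implementation shortcut, not part of the original definition):

```scala
import scala.collection.mutable.BitSet
import scala.util.hashing.byteswap32

// m bits, all initially 0, and k hash functions derived by
// double hashing: index_i(x) = (h1(x) + i * h2(x)) mod m.
final class BloomFilter(m: Int, k: Int) {
  private val bits = new BitSet(m)

  private def indices(x: Any): Seq[Int] = {
    val h1 = x.hashCode
    val h2 = byteswap32(h1)
    (0 until k).map(i => math.floorMod(h1 + i * h2, m))
  }

  // Adding an element sets its k bits.
  def add(x: Any): Unit = indices(x).foreach(bits += _)

  // Membership checks whether all k bits are set: false positives
  // are possible (other elements may have set the same bits), but
  // false negatives are not.
  def mightContain(x: Any): Boolean = indices(x).forall(bits)
}
```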

Look at this picture from Wikipedia:
Bloom filter

Continue reading