Ivan Yurchenko

You probably heard about property testing (AKA property-based testing), a testing technique where test inputs are randomly generated in great quantity and particular desired properties of the code under test are checked. It’s very practical and I’m a big proponent of using it where it makes sense.

In this post, I will tell you how I used property testing with the Proptest library in Rust to ensure the correctness of a bunch of generated serialization/deserialization code for the Apache Kafka protocol.

In this post:

A real-life example of using property testing in Rust with Proptest.
A step-by-step explanation.

The post assumes you’re familiar with Rust. However, to follow it you don’t need any knowledge about Kafka or its protocol. I will explain the problem I was working on briefly, but will not go into much detail.

Introduction

Apache Kafka has a sizeable custom binary protocol that is versioned, with various data types, optional fields, etc. Unfortunately, it doesn’t use a well-known serialization format like Protobuf. The protocol message schema is described in JSON. The actual Java code that does serialization and deserialization is generated from this description¹.

I created a library kafka_wire_protocol. It’s a generated de-/serialization code for the Kafka protocol, but for Rust (and in the future, for Go). The correctness of the generated code is paramount. I applied many testing techniques to it: quick unit tests, integration tests against a real Kafka instance, fuzzying. And also kafka_wire_protocol is a good example of software that can benefit from property testing. About this is going to be this post.

You don’t need to familiarize yourself with kafka_wire_protocol to understand what will be discussed. I will give examples from the code, but they will be self-contained. However, if you’re interested, feel free to check out this repo and follow the instruction in README to run the code generator, compile, and run the tests.

I worked with the Apache Kafka protocol on the low level quite a bit. It wasn’t easy to start doing this following the official guide only and I read the code a lot. With this post, I want to give you a head start by guiding you step by step from primitive values to meaningful requests.

In this post:

Explore the Kafka protocol code and the protocol in action with Wireshark.
Learn how to read and write primitive values.
Combine primitives to perform meaningful requests.

We will use Python as the programming language. However, the code will be zero-dependency and easily portable to the language of your choice.

Java Virtual Machine (JVM) is a really great platform, mature, and well-established. Apart from lots of normal features used by all developers, there are some that are more low-level, designed to serve more “system” or “tooling” purposes. One example is sun.misc.Unsafe, which gives e.g. low-level access to memory. Another such feature is agents. Agents are the JVM mechanism that enables external code to integrate with a running JVM, including access to the bytecode before it’s loaded and executed.

In this post:

The basics of Java agents and Instrumentation API.
An example agent for metrics collection.
Libraries for bytecode manipulation: Javassist and Byte Buddy.

Agents were introduced in Java 1.5, but programming them is rarely a part of JVM programmer’s everyday job. However, having JVM code in production means a high probability of using some agents. I’d like to mention several widely used classes of them:

JMX-HTTP bridges, e.g. Jolokia, which gives access to JXM MBeans over HTTP (very useful for monitoring);
profilers, e.g. YourKit or JProfiler;
debuggers, namely Java Debug Wire Protocol (JDWP) agent;
aspect-oriented programming toolkits, AspectJ in particular;
hot code reloading tools like JRebel, which are especially useful in Java EE environment.

There are two types of agents in JVM: Java agents and native agents. Java agents are in JVM bytecode (i.e. written in one of JVM languages, most commonly in Java) packed in JARs and follow special code organisation convention so as JVM could use them. Native agents are different: they’re in native code (most commonly compiled from C++) packed in dynamic libraries, use JVM Tool Interface, and operate on a more low level than Java agents. Particularly, they can affect the garbage collector, thread management, locking and synchronisation, etc. Profilers and debuggers use native agents.

In this post, we’re focusing solely on Java agents.

Update 26.03.2023: In February 2021, Bintray was shut down. This post still may be useful from the point of view of artifact publishing in Maven and SBT.

Many developers are interested in making their JVM artifacts (e.g. libraries) available for others. Probably, the simplest way to do this is to publish an artifact to Bintray, which gives free hosting for publicly available artifacts. Also, some teams would like to have private repositories for their shared JARs. In this case, Bintray can help as well.

In this post:

Setting up a Bintray repository.
Publishing JARs of a Maven project to Bintray, including sources, Javadoc and tests.
Publishing JARs of an SBT project to Bintray + sbt-bintray plugin.

All the code used in this project is available in bintray-hello-world Github repository.

Excellent post about Akka Streams — Akka Team

Really nice introduction to #Akka Streams! FEEL THE POWER! ^^ — Viktor Klang

A must-read on #Akka #Streams!!! — Ellan Vannin CA

In many computer programs, the whole logic (or a vast part of it) is essentially step-by-step processing of data. Of course, this includes the situation when we iterate over the data and just execute the processing logic on every piece of it. However, there are a couple of complications here:

the processing logic may be quite complex, with various aggregations, merging, routing, error recoveries, etc.;
we might want to execute steps asynchronously, for instance, to take advantage of multi-processor machines or use I/O;
asynchronous execution of data processing steps inherently involves buffering, queues, congestion, and other matter, which are really difficult to handle properly (please, read “Handling Overload” by Fred Hébert).

Therefore, sometimes it is a good idea to express this logic on a high level using some kind of a framework, without a need to implement (possibly complex) mechanics of asynchronous pipeline processing. This was one of the rationales behind frameworks like Apache Camel or Apache Storm.

Actor systems like Erlang or Akka are fairly good for building robust asynchronous data pipelines. However, they are quite low-level by themselves, so writing such pipelines might be tiresome. Newer versions of Akka include the possibility for doing pipeline processing on a quite high level, which is called Akka Streams. Akka Streams grows from Reactive Streams initiative. It implements a streaming interface on top of Akka actor system. In this post I would like to give a short introduction to this library.

The problem

Not so long ago I wrote about a little strange problem with time-based UUIDs I faced (Retrospective time-based UUID generation (with Spark)). This time I needed to do something more usual. As you surely know, UUID standard has several versions. Version 4 is purely random numbers, version 1 relies on the identity of a machine and the timestamp with the counter, etc.

Let’s consider version 1 UUIDs. They include timestamps with counters, so they naturally can be ordered by it. In other words, having two time-based UUIDs, I wanted to say if the first is lower, greater or equal to the second. No big deal, say, in Java: u1.timestamp().compare(u2.timestamp()). But I wanted to do this in PostgreSQL, inside SQL query. Postgresql does have uuid data type, but it provides no functions for comparing them by version 1 timestamps. That is why I decided to write such a function myself.

Version 1 UUID structure

You can find the full UUID specification in RFC 4122. Let’s consider here only the part of version 1 which we are interested in. UUIDs have length of 128 bits, i.e. 16 bytes, which have different meaning in different versions of the standard. Version 1 layout is shown on the picture:

UUID version 1 structure

Not so long ago, I wrote a post about creating type-safe query builders in Scala from scratch. In it, I also suggested using shapeless library to do what was described. Now, I decided to write how it could be done. The code is in this repository.

Problem reminder

Without going into much details, the problem was to provide a type-safe way to build queries (to an abstract database, for instance) with parameters of different types. Something like

val query = beginQuery()
  .addParameter[String]("param1")
  .addParameter[Int]("param2")
  .addParameter[Boolean]("param3")
  .build()

// Compile:
query.execute("some string", 1, false)
// Won't compile:
query.execute(42, true, "another string")

Update 26.03.2023: 8 years ago 50 Gb sounded more serious that it is now, but even then we could and should have done this easily with one beefy machine without Spark or any other then-fancy tool.

I have faced a problem: having about 50 Gb of data in one database, export records to another database with slight modifications, which include UUID generation based on timestamps of records (collisions were intolerable). We have Spark cluster, and with it the problem did not seem even a little tricky: create RDD, map it and send to the target DB.

(In this post I am telling about Spark. I have not told about it in this blog so far, but I will. Generally, it is a framework for large-scale data processing, like Hadoop. It operates on abstract distributed data collections called Resilient Distributed Datasets (RDDs). Their interface is quite similar to functional collections (map, flatMap, etc.), and also has some operations specific for Spark's distributed and large-scale nature.)

However, I was too optimistic - there were difficulties.

Update 11.01.2016: the next post about type-safe query builders using shapeless.

Recently, I was hacking on a Scala library for queries to Cassandra database, phantom. At the moment, it was not able to build prepared statements, which I needed, so I added this functionality. However, phantom developers also implemented prepared statements quickly :) Nevertheless, I decided to write this post about the core idea of my implementation.

Caution: Don’t do this at home, use shapeless :)

My plan was to be able to prepare queries like this:

val query = select
  .where(typedParamPlaceholder1)
  .and(typedParamPlaceholder2)
  .and(typedParamPlaceholder3)
  .build()

and after that to execute this query: query.execute("string", 1, false). Moreover, I wanted this execution to be type-safe, so it was not possible to execute something like query.execute(1, "string", 0).

I will abstract from phantom queries and will focus on the general idea. The code is in this repository.

In many software engineering problems, we have a set and need to determine if some value belongs to this set. If the possible maximum set cardinality (size; maximum size = total count of elements we consider) is small, the solution is straightforward: just store the set explicitly (for instance, in form of a RB-tree), update it when necessary and check if the set contains elements that we are interested in. But what if maximum set cardinality is large or we need many such sets to operate simultaneously? Or if the set membership test is an expensive operation?

Suppose we want to know if an element belongs to a set. We have decided that it is acceptable to get false positive answers (the answer is “yes”, but the element is not actually in the set) with probability p and not acceptable to get false negative (the answer is “no”, but the element in actually in the set). The data structure that could help us in this situation is called the Bloom filter.

A Bloom filter (proposed by Burton Howard Bloom in 1970) is a bit array of m bits (initially set to 0) and k different hash functions. Each hash function maps a value into a single integer number.

Look at this picture from Wikipedia:

Bloom filter

Proptest: property testing in Rust

Introduction

Kafka protocol practical guide

Java agents, Javassist and Byte Buddy

Publishing JARs to Bintray with Maven and SBT

About Akka Streams

Time-based (version 1) UUIDs ordering in PostgreSQL

The problem

Version 1 UUID structure

Type-safe query builders in Scala revisited: shapeless

Problem reminder

Retrospective time-based UUID generation (with Spark)

Type-safe query builders in Scala

The Bloom filter