GNU Parallel

How many CPU cores does your computer have? 2-8, I think. It’s about time to use them all, isn’t it? But there are plenty of Unix utils such as grep, find, wc, etc. which have no idea about parallel data processing. They can’t split their input into 8 pieces and spawn the corresponding number of threads or processes to process it using all the power of your modern CPU.

This problem is too interesting and practical to remain unsolved. According to the Unix philosophy, 1) it is good for a program to do one thing well; 2) it is a good idea to combine simple programs to do complex things. grep does pattern matching well. How about parallelization?

There is a utility known as GNU Parallel whose main purpose is to execute arbitrary jobs in parallel on one or even multiple machines. The program is quite complex and multifunctional; take a look at its man page and tutorial. Here I just want to give a little flavor of it.


Suppose we have a file in which we want to find some pattern. Exactly the task for good old grep:

$ grep pattern_i_want_to_find hugetextfile.txt

But there is a problem: scanning can take very long if the file is huge, a complex regular expression is used, etc. Let’s speed it up and utilize our 8 cores. We can manually split the file into 8 chunks (splitting by lines so we don’t cut any line in half):

$ split -n l/8 -d hugetextfile.txt

We get 8 files, x00..x07. Then we run 8 grep processes on these files in the background and let the OS spread them across the CPU cores:

$ for i in x0*; do grep pattern_i_want_to_find $i > ${i}_out & done

Then, once all the background greps have finished, we combine the results:

$ wait; for i in x0*_out; do cat $i >> out; rm $i; done

Pretty simple. But what if we don’t know the number of cores in advance (say, the script has to run on different machines), want to cap the CPU load or set the niceness, control swap activity, do various manipulations with input and output, control subprocesses, display progress, run remote jobs or resume failed ones, etc.? We would have to write a lot of code. Code that has already been written. Meet Parallel:

$ cat hugetextfile.txt | \
  parallel --block 10M --pipe grep pattern_i_want_to_find

Here the idea is simple: Parallel takes the standard output of cat, splits it into chunks of 10 megabytes, runs as many grep processes as the computer has CPU cores and feeds the chunks to the greps’ stdin. Then each grep sends its output to stdout as usual. Feel the power.
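
Most of the extras mentioned above are just command-line options. A rough sketch (the values here are purely illustrative): limit the job slots to 75% of the cores, hold back when the load average is high, lower the niceness and show a progress indicator:

$ cat hugetextfile.txt | \
  parallel --pipe --block 10M --jobs 75% --load 80% --nice 10 --progress \
  grep pattern_i_want_to_find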

‘Hey!’ you might say. ‘What about this --block 10M? If we split the file into 10 MB chunks, we will very likely bisect some lines and the semantics of grep will be violated.’ A reasonable objection, but it is OK: Parallel deals with lines correctly. It treats the input as a set of records which cannot be split. By default records are lines, but it is possible to set the record separator.
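
For instance, FASTA files consist of multi-line records that each start with a ‘>’ header, so splitting on line boundaries alone would tear records apart. A small sketch (sequences.fasta is a made-up file name) using --recstart to keep whole records together while counting sequences per chunk:

$ cat sequences.fasta | \
  parallel --pipe --block 10M --recstart '>' grep -c '^>'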

For example, when working with binary files we might not want to use any separator at all: --recend ''. Using this to compress with bzip2:

$ cat big_binary_file | \
  parallel --pipe --recend '' -k bzip2 --best > big_binary_file.bz2

It is OK to concatenate bzipped parts one after another, but what about the order? What if, e.g., the second bzip2 finishes earlier than the first? The -k parameter (or --keep-order) helps here: the output is emitted in input order, not in completion order.
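
A toy example to see the difference (it assumes at least three job slots, hence -j3):

$ parallel -j3 'sleep {}; echo {}' ::: 3 2 1      # prints 1 2 3 (completion order)
$ parallel -j3 -k 'sleep {}; echo {}' ::: 3 2 1   # prints 3 2 1 (input order)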

Another example: suppose we want to generate small previews for a large number of JPEG images. There is a very good tool for this, ImageMagick, with its command-line utility convert:

$ find . -name '*.jpg' | \
  parallel convert -geometry 120 {} {.}_thumb.jpg

First of all, we find all JPEGs recursively starting from the current directory. Then parallel splits the list of found files into chunks and spreads them among the convert processes. {} is a replacement string substituted with the full line read from the input source; {.} is the same but with the file extension removed.
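
There are more replacement strings than these two; an easy way to explore them (the path below is made up) is to let echo expand them:

$ parallel echo {} {.} {/} {/.} ::: photos/cat.jpg
  # prints: photos/cat.jpg photos/cat cat.jpg cat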

Applications are really numerous, but that’s not all. Suppose you have half a dozen computers and want to share a really heavy task among them. Parallel can help here too; all you need is SSH access to the machines.

Consider an example: you’re doing some machine learning and want to build an SVM classifier. You need to choose the parameters C and γ. Usually this is done by a grid search: a set of C and γ values is selected, and classifiers for all the combinations are trained. Then the classifiers are compared by their cross-validation results. But there is a problem: training an SVM classifier can take some time, and you might get tired of waiting for, e.g., 100 classifiers to be trained.

We could distribute the job among multiple computers. Assume that the ML environment is already set up on the target machines (libraries and programs are installed, etc.). The algorithm is simple:

  1. Transfer the data set file to each target machine.
  2. Run cross-validation script with different SVM parameters on the target machines.
  3. Collect the result and choose the best parameter set.

$ parallel --gnu --sshlogin $REMOTE_COMPUTERS --basefile data_set \
  ./cv.sh -C {1} -g {2} data_set ::: `seq -5 2 15` ::: `seq -15 2 3` | choose_best.sh

--gnu enforces GNU Parallel behaviour mode. --sshlogin with the following list of remote computers (see the format in the man page) tells Parallel to execute jobs on the listed machines via SSH. --basefile data_set says that the data_set file is required on the remote computers and must be transferred there. The script cv.sh will be executed (we assume it is already present on the remote computers, but it could be transferred just like data_set). cv.sh’s parameters -C and -g (for C and γ respectively) are taken from the sequences generated by the seq utility. The results are accumulated on the master computer (where the command is executed) and piped to the choose_best.sh script, which chooses the best SVM parameter set. A poor man’s map-reduce :)
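
For reference, the list of remote computers might look something like this (the hostnames are made up; the optional N/ prefix tells Parallel how many CPU cores to use on that machine, and ‘:’ adds the local computer to the pool):

$ REMOTE_COMPUTERS="8/user@server1.example.net,4/user@server2.example.net,:"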

You’d better set up authentication keys for the remote computers, or you will have to enter a password for each SSH connection (see this HOWTO).
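
Setting the keys up takes two commands (the hostname is again hypothetical):

$ ssh-keygen -t ed25519                  # generate a key pair if you don't have one yet
$ ssh-copy-id user@server1.example.net   # install the public key on each remote machine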

In my opinion, GNU Parallel is a very useful utility for all sorts of tasks. Tell me how you use it, if you do.