Decompressing bzipped files with Julia

I’m currently working with Wikipedia dumps, and to save space, it’s useful to write scripts that read content directly from (and write results to) bzipped files.

Setup

Tests were executed on my personal computer:

i7
16GB of RAM

The input was a small Wikipedia dump of 407MB. All timings are in seconds.

bzcat alone

To have a point of comparison, I decompressed the dump using bzcat alone. The timing is 64 seconds.

$ time 1>/dev/null bzcat wikidump.xml.bz2

Using Python

It’s easy enough with Python thanks to the bz2 module, which lets you transparently manipulate a compressed file as if it were a normal open file. Before jumping to Julia, let’s see how it is done in Python:

from __future__ import print_function
import sys
import bz2


def main():
    in_stream = bz2.BZ2File(sys.argv[1])
    for line in in_stream:
        # Each line keeps its trailing newline, so don't let print add another.
        print(line, end="")
    in_stream.close()


if __name__ == "__main__":
    main()

Nothing could be easier, and it takes 85 seconds to run.
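The introduction also mentions writing results to bzipped files, which the script above does not show. As a rough sketch, assuming Python 3 (where bz2.open accepts text modes), the following copies only the lines containing a given string from one compressed file to another. The function name copy_matching_lines and its parameters are made up for illustration.

```python
import bz2


def copy_matching_lines(src_path, dst_path, needle):
    """Copy lines containing `needle` from one .bz2 file to another.

    Hypothetical helper for illustration. Returns the number of lines written.
    """
    written = 0
    # "rt" and "wt" open the compressed files in text mode, so lines are str
    # and both files are (de)compressed incrementally, never fully in memory.
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
         bz2.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if needle in line:
                dst.write(line)
                written += 1
    return written
```

For example, copy_matching_lines("wikidump.xml.bz2", "titles.xml.bz2", "<title>") would keep only the title lines of the dump; the output file name is an assumption.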

What about Julia?

Since I really love the Julia language, I was tempted to do the same with Julia. Here are the different solutions I went through, with their respective timings.

Using bz2 Python module through PyCall

The first naive option is to use the original module from Python. It’s easy enough using the PyCall module. We can install it like so:

julia> Pkg.add("PyCall")
julia> Pkg.update()

The script:

using PyCall
@pyimport bz2

function main()
    in_stream = bz2.BZ2File(ARGS[1])
    for line in in_stream
        println(line)
    end
    in_stream[:close]()
end

main()

But then we hit a wall: the timing is 1352 seconds. This is likely due to the conversion between Python and Julia datatypes, so it is not the best option for data-intensive use.

Piping result of bzcat to Julia

The second option that came to my mind was: “why not use bzcat?”. It’s easy enough, we just have to read from STDIN:

function main()
    for line in eachline(STDIN)
        print(line)
    end
end

main()

Here is the invocation:

$ bzcat wikidump.xml.bz2 | julia bz2_bench.jl

Timing is now a more reasonable 72 seconds, which is less than the Python version shown above. But this is not satisfactory enough. Why not use the wonderful capabilities of Julia to run external commands and pipe results?

Invoking bzcat from Julia

It is easy to invoke commands from inside Julia using backquotes: run(`cmd`) and |> to pipe between commands and streams. Let’s do it:

function main()
    file = ARGS[1]
    run(`bzcat $(file)` |> STDOUT)
end

main()

The script is equivalent to bzcat wikidump.xml.bz2, but it’s quite impressive to see how easy it is to do this inside a Julia script. The timing is about 66 seconds, more or less the same as with the external piping from bzcat.

But it would be useful to get lines of content from the stream, as in the original Python script. For this task, the Julia standard library offers a multitude of handy functions. The one we will use is readsfrom, which returns two things: the stdout of the given process, and the process itself. Here it is in action:

function main()
    file = ARGS[1]
    stdout, p = readsfrom(`bzcat $(file)`)
    for line in eachline(stdout)
        print(line)
    end
end

main()

Timing is now about 74 seconds, which is roughly 10 seconds faster than the Python version. And we don’t rely on a module. Instead, we make use of Julia’s ability to play with command invocations, stream piping, and the like.
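For comparison, the same idea (spawn the decompressor and iterate over its standard output) can be sketched in Python with the standard subprocess module. This is only an illustration, assuming Python 3; stream_lines is a hypothetical helper name, and this variant was not part of the timings reported here.

```python
import subprocess


def stream_lines(command):
    """Run `command` and yield its standard output line by line, as bytes.

    Hypothetical helper for illustration. Raises CalledProcessError if the
    command exits with a non-zero status after its output is exhausted.
    """
    # Leaving the `with` block closes the pipe and waits for the process.
    with subprocess.Popen(command, stdout=subprocess.PIPE) as process:
        for line in process.stdout:
            yield line
    if process.returncode != 0:
        raise subprocess.CalledProcessError(process.returncode, command)
```

Looping over stream_lines(["bzcat", "wikidump.xml.bz2"]) would then play the role of eachline(stdout) in the Julia script, with each line arriving as bytes that can be written to sys.stdout.buffer.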

Timings

The PyCall version aside, timings are relatively close because most of the work is the decompression itself, which is why there isn’t much difference between Julia and Python.
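To make the comparison concrete, the timings reported in the sections above can be expressed relative to bzcat alone. The short script below is only a convenience for that arithmetic; the labels are mine, and the numbers are the ones quoted in this post. It shows the PyCall version at roughly 21 times the baseline, while every other variant stays within about a third of it.

```python
# Timings in seconds, as reported in the sections above.
TIMINGS = {
    "bzcat alone": 64,
    "Python bz2 module": 85,
    "Julia + PyCall bz2": 1352,
    "bzcat piped to Julia": 72,
    "Julia run(bzcat)": 66,
    "Julia readsfrom(bzcat)": 74,
}


def overhead_vs_baseline(timings, baseline="bzcat alone"):
    """Return each timing divided by the baseline timing."""
    base = timings[baseline]
    return {name: seconds / base for name, seconds in timings.items()}


# Print the variants from fastest to slowest, as a multiple of bzcat alone.
ratios = overhead_vs_baseline(TIMINGS)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print("{:<24} {:>6.2f}x".format(name, ratio))
```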

Conclusion

I was first tempted to implement a Julia wrapper over bzlib, but what for, when it’s so easy to invoke external commands and manipulate their input and output streams? Julia is a young language, but it’s so flexible and extensible that I often forget how young it is!