Simple Python and Vala XML parsers


As some of you might know, I'm an OpenStreetMapper. Over the last month, in whatever spare time I had, I wrote a set of Python scripts that compute some statistics over an OSM dump. They parse the whole XML tree, and national dumps are pretty huge (italy.osm is ~4 GB nowadays, and Italy is not as well mapped as Germany!), so I needed a way to do this without creating a memory-hungry beast.

With Python, I could successfully do this:

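#!/usr/bin/python
# -*- coding: utf-8 -*-

import xml.etree.cElementTree as etree

def parse(filename):
    source = open(filename)
    # iterparse yields (event, element) pairs while reading,
    # instead of loading the whole document into memory first.
    context = etree.iterparse(source, events=("start", "end"))
    context = iter(context)

    # The first event is the start of the root element: keep a handle on it.
    event, root = context.next()

    for event, elem in context:
        if event == "end":
            # Drop the elements parsed so far, so memory use stays flat.
            root.clear()
            continue

        # "start" event: print the tag of the element just opened.
        print elem.tag

if __name__ == '__main__':
    import sys
    parse(sys.argv[1])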

And here are the results:

$ /usr/bin/time ./parsexml.py ~/osmstats/dumps/italy.osm.bz2.20100912.out 1>/dev/null
413.22user 6.15system 8:57.80elapsed 77%CPU (0avgtext+0avgdata 21200maxresident)k
7739552inputs+0outputs (10major+1470minor)pagefaults 0swaps
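The script above targets Python 2, and `cElementTree` was removed in Python 3.9. For anyone on Python 3, here is a minimal sketch of the same streaming technique. It is not the code that was timed above: `count_tags` is a hypothetical helper, it counts tags rather than printing them (and, unlike the script above, includes the root tag), and the tiny in-memory document only stands in for a real dump.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count element tags in an XML stream without keeping the whole tree.

    `source` is a filename or a binary file object. This helper is
    hypothetical and not part of the original scripts.
    """
    counts = Counter()
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element: keep a handle on it.
    _, root = next(context)
    counts[root.tag] += 1
    for event, elem in context:
        if event == "start":
            counts[elem.tag] += 1
        else:
            # Drop the elements parsed so far, so memory use stays flat.
            root.clear()
    return counts


# Tiny OSM-like document standing in for a real dump.
sample = io.BytesIO(
    b"<osm><node id='1'/><node id='2'/>"
    b"<way id='3'><nd ref='1'/><nd ref='2'/></way></osm>"
)
tag_counts = count_tags(sample)
print(tag_counts)
```

Passing a path such as `sys.argv[1]` instead of the `BytesIO` object should work the same way, since `iterparse` accepts either a filename or a file object.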

However, I had wanted to learn Vala for some time, so I tried to do the very same task with it. Here's the code:

using Xml;

class XmlParser {
    public void parse_file(string path) {
        var handler = SAXHandler();
        void* user_data = null;

        handler.startElement = start_element;
        handler.user_parse_file(user_data, path);
    }

    public void start_element(string name, string[] attr) {
        stdout.printf("%s\n", name);
    }
}

int main(string[] args) {
    Parser.init();
    var parser = new XmlParser();
    parser.parse_file(args[1]);
    Parser.cleanup();
    return 0;
}

You need to compile it with:

valac --pkg libxml-2.0 xml.vala

And here's the result:

$ /usr/bin/time ./xml ~/osmstats/dumps/italy.osm.bz2.20100912.out 1>/dev/null
122.01user 4.03system 3:14.61elapsed 64%CPU (0avgtext+0avgdata 6352maxresident)k
7738984inputs+0outputs (0major+461minor)pagefaults 0swaps
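To put the two runs side by side, the figures printed by `/usr/bin/time` can be compared with a few lines of Python. This is only a quick sketch: the numbers are copied by hand from the two outputs above, and `to_seconds` is a hypothetical helper for the `m:ss.ss` elapsed format.

```python
def to_seconds(elapsed):
    """Convert an "m:ss.ss" elapsed string from /usr/bin/time to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)


# Figures copied from the two /usr/bin/time runs above.
python_run = {"user": 413.22, "elapsed": to_seconds("8:57.80"), "maxresident": 21200}
vala_run = {"user": 122.01, "elapsed": to_seconds("3:14.61"), "maxresident": 6352}

# Ratio of the Python figure to the Vala figure for each measurement.
ratios = {key: python_run[key] / vala_run[key] for key in python_run}
for key, value in ratios.items():
    print(f"{key}: Python/Vala = {value:.2f}")
```

On these two runs, the Python script needs roughly 3.4 times the user CPU time, 2.8 times the wall-clock time and 3.3 times the peak resident memory of the Vala program.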

Both programs are, however, CPU-hungry. But at least they don't swap :-)

Needless to say, I'm planning to switch my scripts to Vala for version 2.0.

Contents © 2013 David Paleino - Powered by Nikola