Computing Text: July 2010

This is rather tired ground that has been well trodden by other feet, but in the aftermath of a disagreement with one of the happy chappies from Ontotext I thought I'd reiterate a couple of home truths about open source projects and how you go about joining in. Along the way I'll also ask what "hack" means, for the benefit of those software people who've been locked in a small room without access to books or networks for the last couple of decades :-)

To start with something that should be obvious, all engineering projects of whatever type are social processes in which human factors are at least as important as technical ones. In open source this is often more important than in other areas because the people involved often give their time and expertise for free, and even when they're being paid specifically to participate there is almost always a discretionary element of their contribution (should I bother to answer this email from a complete beginner who obviously hasn't managed to find the user manual, or shall I finish work ten minutes early today?). This means that when you want to join an open source project (i.e. to become a developer, contribute code etc.) you need to show a little sensitivity and think about the needs of the project and its participants as a whole, not just your own take on the thing. I remember a particularly clear case of the opposite approach on the JavaCC project a few years ago (JavaCC is an excellent parser generator that was one of the first available for Java and is used in GATE for analysing JAPE transduction rules). Along came a new developer with some good ideas and some useful code -- which in principle was great news for everyone in the project. Unfortunately said developer jumped in with both feet, screaming abusive nonsense at the project administrators and demanding his own way at every juncture. The result? His useful code was useless and unused.

Why can't we use code from people with whom it is impossible to communicate and collaborate? Because, to paraphrase Stuart Brand, software is not something you finish but something you start -- if it is good and useful then it has a long life span, and during that span it changes and mutates and needs active support and maintenance. If we accept code into our projects from sources whose long-term commitment is questionable (and angry young men with poor collaboration skills are unpredictable in that respect) then we compromise the evolution of our systems (and sooner or later alienate our users).

On a more positive note, if you want to join an open source project here are some steps to help you start off on the right foot:

Talk to the developers. Communicate, communicate, communicate! I agree absolutely with Sussman et al. that "A computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute", but this doesn't mean that the only thing you need to write is code. Get in touch with the developers as early in the process as possible, tell them what you're working on or plan to look at, and ask for their advice. Very often you'll get not just advice but active support, and the flip side is that when you produce your contribution they'll know where it comes from and be better able to judge its quality and its implications for the project as a whole.
Get to know the mechanisms in place for quality assurance (and quality control) and adopt them. If the project has a test suite you must at minimum ensure that your code doesn't break tests, and you should think seriously about writing new tests to cover all the things that you work on. Look at the documentation and write patches to cover the stuff you do. Contribute to discussions on the mailing list or forum in your area. Think hard about backwards compatibility -- has that interface you just added a method to been linked by a thousand other jars out there, and is it really worth forcing recompilation of all those systems? Don't just think about your own little patch, think about the knock-on effects on the whole system and on the ecosystem of users and developers around it.
Be humble. The reason I can write this post on a fantastic Ubuntu Debian GNU Linux system is because lots of people cleverer than me worked hard and contributed their work for the good of humanity. Even if I've invented something useful in my own little corner of computing that certainly doesn't mean that I have the right to sneer at others, even if their knowledge is less than mine in some area. Who knows what greatness their hearts and heads contain? (We are all geniuses, it is part of being human. If you don't believe me, try getting a computer to do image recognition like my 1-year old daughter!)
Be prepared to help developers when you want something integrated into the project. Most of the time your work will not be top of the todo-lists of the people you're joining; most often you're adding work to their already full plates, and you should be patient and helpful while they look at your work and figure out whether to include it, or to work towards making you a committer on the project.

So: not rocket science, just basic collaboration skills.

One of the obvious things not to do is to start throwing around pejorative terms. One of these that I find annoying is 'hack'.

"Your work is just a hack! My approach is state of the art! Thou shalt do it my way!"

Oh dear.

First, the users of software don't care. They choose one tool over another because of what they can do with it, not because of the way it conforms or otherwise to visions of elegance or correctness. Of course elegance and correctness can be factors in software performance and maintainability and so on -- but most often these qualities are subjective, particularly when applied by newcomers unfamiliar with the big picture -- which is my next point...

Second, such visions are personal, and especially so to outsiders. If you've sweated over a specific problem (in this case transducing graphs with FSAs) for years at a time then I'll listen to you tell me what is the most elegant solution with interest -- but if you haven't, I'm inclined to assume that there is likely to be stuff you don't know that may well compromise your view.

Third, if we define 'hack' as a quick or heuristic route to get something done, then why would that be a bad thing? And if you start down this route where will you end? For example, the case that lead to this post was in relation to a system that does finite state pattern matching and transduction over annotation graphs. Those of you with a background in formal languages may have already spotted the strangeness here: graphs can describe languages whose expressive power is greater than regular, which would seem to invalidate the whole idea of applying finite state techniques to them. It turns out that the data structures we're working with here (while doing information extraction over human language) have a lot of regular features, and that the indeterminacy that arises from the mismatch between regular FSAs and a graph-structured input stream are not an obstacle to our work (in fact they can sometimes be a good way to ignore ambiguities that we're not currently interested in). But: doesn't that fit the definition of a hack? From that point-of-view the whole subsystem under discussion is one big hack, which makes it even more ridiculous to criticise one approach or another as a "hack".

The moral of the story? Technology is never as important as it seems in our commodity-driven age. Better communication and collaboration skills win every time. The good news is that this is one of the best things about open source :-)

Permalink. On Blogspot.

Computing Text

Friday, July 9, 2010

How to Join Open Source Projects

Share

Hamish Cunningham

Blog Archive