Computing Text: April 2010

Last weekend I had the pleasure of attending this year's Open Knowledge Conference. The list of good reasons for making government (and other) data open and for breaking down barriers to finding information on-line is longer than would fit in this post; one nice one that I hadn't heard before was Glyn Moody's point that Turing equivalence implies that there can be only one digital revolution, and that this in turn can prove the impossibility of preserving analogue bad habits like 'rights management'. At this point I should probably mention my employer's lawyers and what they'll do to you if you imply that I'm in favour of file sharing, but perhaps I'll just make do with a tired but accurate simile between the RIAA and those loveable old dinosaurs, dodos and other casualties of unsustainable lifestyle choices.

There was a lot of other interesting stuff being presented, including a talk by Jeni Tennison on large-scale open data from government and her work at data.gov.uk. Jeni ended her talk by saying that we shouldn't worry about proliferation of redundant and (potentially) contradictory material -- after all, this is what has happened with the web and no animals were harmed in the making, etc.

I like this point, and it chimes nicely with a move from "neat" to "scruffy" that we can observe around semantic technology in general and the semantic web in particular. The original vision published by Berners-Lee and others around a decade ago was very much inspired by Artificial Intelligence: your computer was going to book your dentist appointment on the right day to coincide with picking your mother up from the station, make sure the fridge was stocked with her favourite orange juice for later, and blah blah blah. Good stuff if you're a professor of logic computation looking for your next funding opportunity, but not really any nearer the horizon now than it was 10 (or 20 or 30) years ago.

Thankfully we've mostly woken up again, and now things are boiling down to a more practical residue, which, to paraphrase a more recent comment by Berners-Lee, is "all about the data, stupid". And this brings us back to Jeni's talk -- if we can get all those public data silos opened up and usable in the right way, this will be a huge leap forward, and the fact that it will not be universally nice and neat and dressed in a shiny new bow tie is neither here nor there. Scruffy is neat in its own way.

A second thing that was interesting for me at this talk (and at OKCon more generally) was the question of data vs. content. The focus of the discussion today was very much on data in spreadsheets, relational databases and so on, and this seems to be where current success is happening as more and more databases are being exported to variants of RDF. This must be good news for text analysis: looked at from an information extraction point-of-view, linked open data is a rich source of domain terminology (seeds for our gazetteers) and conceptual backbones (seeds for our result templates, taxonomies and ontologies). The next wave, it seems to me, is to link the linked data to all that text that's lurking in the databases telling all sorts of interesting stories -- if only we could find them.
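To make the gazetteer idea a little more concrete, here is a rough sketch in Python of how labels from an RDF export might seed a term list and then be matched against free text. Everything specific in it is my own assumption for illustration: the example.org URIs, the two department labels, and the helper names build_gazetteer and annotate are invented, and the regular expression only copes with the simplest N-Triples lines. A real pipeline would use a proper RDF parser and a proper text analysis toolkit.

```python
import re

# Hypothetical sample of exported linked data in N-Triples form.
# The example.org URIs and the labels are invented for illustration.
TRIPLES = """\
<http://example.org/id/dept/1> <http://www.w3.org/2000/01/rdf-schema#label> "Department for Transport" .
<http://example.org/id/dept/2> <http://www.w3.org/2000/01/rdf-schema#label> "Home Office" .
"""

# Deliberately naive: matches only plain rdfs:label triples with simple literals.
LABEL_LINE = re.compile(
    r'<([^>]+)>\s+<http://www\.w3\.org/2000/01/rdf-schema#label>\s+"([^"]+)"'
)


def build_gazetteer(ntriples):
    """Map each lower-cased rdfs:label to the URI of the resource it names."""
    return {label.lower(): uri for uri, label in LABEL_LINE.findall(ntriples)}


def annotate(text, gazetteer):
    """Return (start, end, uri) for every gazetteer label found in the text."""
    if not gazetteer:
        return []
    # Try longer labels first so that a short label cannot pre-empt a longer one.
    labels = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(label) for label in labels) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), gazetteer[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    gazetteer = build_gazetteer(TRIPLES)
    for start, end, uri in annotate("The Home Office published new figures.", gazetteer):
        print(start, end, uri)
```

Each match comes back with the URI of the resource it names, which is the "link the linked data to the text" step in miniature: the text span now points back into the data.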
