Monday, February 22, 2010

Google stole my ngrams

A while ago Dave Schubmehl of Fairview pointed me to a paper by several prominent Googlers which does a nice and clear job of summarising some important lessons from the last decade of web analysis research. The upshot is that if you've a billion examples of human behaviour that pertains to your particular problem it will be a good bet to use a simple non-parametric word count model to try and generalise from that behaviour.

Absolutely true. This is, in fact, the main reason why Google was so successful to start with: they realised that hyperlinks represent neatly codified human knowledge and that learning search ranking from the links in web pages is a great way to improve accuracy.

What do we do with the cases where we can't find a billion examples? Probably we end up lashing together a model of the domain in a convenient schema language (sorry, I mean "build an ontology"), grubbing up whatever domain terminologies and so on that come to hand, and writing some regular expression graph transducers to answer the question.

So: we're not trying to replace Google. We're not applicable to every problem everyone has ever had with text ("Not every problem someone has with his girlfriend is necessarily due to the capitalist mode of production" -- Herbert Marcuse). But neither is Google going to pop round to your office next Tuesday and help you build an ngram model of a couple of billion user queries from their search logs to help you figure out why your customers hate the latest product release.

There's not really a competition here: the approaches are orthogonal.
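
For anyone who hasn't met one, the kind of word count model mentioned above really is simple. What follows is a minimal sketch in Python of a bigram counter, not anything taken from the paper: the function names and the three toy queries are invented for illustration, and a real system would be fed vastly more data and would add smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each word is immediately followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Pair every token with the one that follows it.
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequently observed follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical stand-in for a (much, much larger) query log.
queries = [
    "why does the new release crash",
    "why does the update fail",
    "the new release is slow",
]
model = train_bigram_counts(queries)
```

With three examples the model knows almost nothing, which is rather the point: the approach only starts to generalise usefully once the counts come from an enormous pile of real behaviour.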


Friday, February 12, 2010

Cloud Computing, GATE and Text Processing

When a new thing comes along in computing the first thing that happens is that a small and exclusive set of nerds like me get all excited. If the excitement seems likely to relate to the real world in any fashion that might actually generate someone somewhere some money (or can be spun as something that might do so) then the next thing that happens is that the marketing departments of 1001 IT corporations jump in with both feet and start generating acre after acre of turgid prose about how their aged and creaking product line is actually a prime example of Phenomenon X, the Bright New Thing of Computing.

So it has been with software "in the cloud", which is, it turns out, actually quite a good idea in various ways (setting it apart from most new trends in IT). What does Cloud Computing commonly refer to (now that the sound and fury of the marketing teams has had a chance to settle a little)? Three things:

  • software as a service (SaaS), for example Google Docs
  • platform as a service (PaaS), for example Google App Engine
  • infrastructure as a service (IaaS), for example Amazon Web Services and most famously their Elastic Compute Cloud (EC2 -- which probably did most to popularise the term in the recent period)

These three now constitute the new wave: they are one of the main tracks that Google is betting on (SaaS and PaaS), what Amazon continues to succeed with (IaaS), and the grist for a hundred new startup mills (from specific applications like searching US campus sites to infrastructural help for cloud developers).

What does it have to do with GATE? IaaS is particularly well-suited to hosting text processing, which is typically bursty in its computational cost and therefore ill-suited to fixed infrastructure. SaaS is great for the provision of large web applications that are complex to install and maintain (like GATE Teamware). Hopefully this and other cloud offerings will be available in the not too distant... so watch this space!


Tuesday, February 9, 2010

Certifiable GATE gurus wanted.

In my previous post I described how we came to start taking our user community more seriously again; in the first part of 2010 the effect of this turn has been that the world and her dog seem to be beating a path to our door with requests for technical support, training, bespoke development and/or access to our latest prototypes. In fact it is proving difficult to keep up with demand, so: if you're a GATE expert how about getting certified and taking on some of the work with us? If you have a good knowledge of one or more parts of GATE (and/or related application domains), please get in touch. (We promise not to tell anyone that you're certifiable :-) .)

I love GATE users (though I couldn't eat a whole one).

Users. A bit of a nuisance. They insist on asking questions, testing limits, finding bugs. Around 5 years ago, after something like a decade of giving away software, the GATE team felt very like our old systems administrator, who had a habit of saying "the only secure network is one without any computers attached": we knew that our user community was a good idea in principle, but we really rather wished they'd all leave us alone. In fact we did our best to discourage GATE users: we stopped doing regular releases, we ignored the mailing list, and if we could have figured out how to take the thing out in the woods and bury it under a tree we probably would have.

We failed: GATE refused to die, people obstinately continued to use it, and, as we used it ourselves for all sorts of projects, more and more features were added, quality and functionality improved, and every time we decided it was all over someone would turn up with a pile of cash and a novel problem. So we conceded defeat and resolved to succeed. I think.

This is all a long-winded way of explaining our shift in emphasis over the past year or so: we are introverts no longer, but happy and well-adjusted user-friendly liveware. Text processing for ever! Forwards to world domination, comrades! Oops, wrong blog.

So now we're back to actively supporting our users and growing our community. We've upgraded the documentation, we're running regular training weeks and developer sprints, and we've built up several new products and services around the core GATE code to cater for more of the cases we've seen of people trying to deploy text processing over the years (15 of which, incredibly, have passed under the bridge since we first set metaphorical pen to digital paper for GATE version 0.1). We've also revamped the website and no longer look like something that might have been produced at CERN circa 1995.

So far the response has been quite astonishingly positive... so perhaps users aren't such a bad thing after all.