Computing Text: Google stole my ngrams

A while ago Dave Schubmehl of Fairview pointed me to a paper by several prominent Googlers which does a nice, clear job of summarising some important lessons from the last decade of web analysis research. The upshot is that if you have a billion examples of human behaviour that pertain to your particular problem, it's a good bet to use a simple non-parametric word count model to try and generalise from that behaviour.
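To make "simple word count model" concrete, here is a minimal sketch of a bigram counter in Python. It is only an illustration of the general idea, not the method from the paper: the function names and the three toy queries are made up for this post, and a real model would be trained on vastly more data and would need smoothing for unseen words.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad with start/end markers so sentence boundaries are counted too.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts

def most_likely_next(counts, word):
    """Return the most frequently observed successor of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy stand-in for "a billion examples" (hypothetical data).
queries = [
    "cheap flights to paris",
    "cheap flights to rome",
    "cheap hotels in paris",
]
model = train_bigrams(queries)
print(most_likely_next(model, "cheap"))  # prints "flights"
```

There is no model of the domain in there at all, just counting. That is exactly why it needs so many examples before it says anything useful.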

Absolutely true. This is, in fact, the main reason why Google was so successful to start with: they realised that hyperlinks represent neatly codified human knowledge and that learning search ranking from the links in web pages is a great way to improve accuracy.

What do we do in the cases where we can't find a billion examples? Probably we end up lashing together a model of the domain in a convenient schema language (sorry, I mean "build an ontology"), grubbing up whatever domain terminologies and the like come to hand, and writing some regular expression graph transducers to answer the question.
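As a very rough sketch of that second style, the following Python maps a hand-built terminology onto concept labels and uses a hand-written pattern to spot complaints. Everything here is an assumption for illustration: the term list, the concept names, the complaint cues and the sample comments are invented, and plain regular expressions over strings are a much cruder tool than real graph transducers over annotations.

```python
import re

# Hypothetical domain terminology: surface forms mapped to a concept label.
TERMS = {
    "login": "Authentication",
    "sign in": "Authentication",
    "export": "DataExport",
    "dashboard": "Dashboard",
}

# Hypothetical hand-written cues that a comment is a complaint.
COMPLAINT = re.compile(
    r"\b(hate|broken|crash(?:es|ed)?|slow|can't|cannot)\b", re.IGNORECASE
)

# Match any known term; longer terms first so multi-word terms win.
TERM_RE = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(TERMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

def complaints_by_concept(comments):
    """Group complaining comments under the domain concepts they mention."""
    found = {}
    for comment in comments:
        if not COMPLAINT.search(comment):
            continue
        # Use a set so a comment is listed once per concept.
        concepts = {TERMS[m.group(1).lower()] for m in TERM_RE.finditer(comment)}
        for concept in concepts:
            found.setdefault(concept, []).append(comment)
    return found

comments = [
    "The new dashboard is so slow",
    "I can't sign in since the update",
    "Love the export feature",
]
result = complaints_by_concept(comments)
```

With a few hundred comments rather than a few billion, the knowledge has to come from the person writing the term list and the patterns, not from the counts.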

So: we're not trying to replace Google. We're not applicable to every problem everyone has ever had with text ("Not every problem someone has with his girlfriend is necessarily due to the capitalist mode of production" -- Herbert Marcuse). But neither is Google going to pop round to your office next Tuesday and help you build an ngram model of a couple of billion user queries from their search logs to help you figure out why your customers hate the latest product release.

There's not really a competition here: the approaches are orthogonal.

Monday, February 22, 2010

Hamish Cunningham
