Tuesday, May 17, 2011

The Infernal Beauty of Text

We talk, we write, we listen or read, and we have such a miraculous facility in all these skills that we rarely remember how hard they are. It is natural, therefore, that a large proportion of what we know of the world is externalised exclusively in textual form. That fraction of our science, technology and art that is codified in databases, taxonomies, ontologies and the like (let's call this structured data) is relatively small. Structured data is, of course, machine tractable in ways that text can never be (at least in advance of a true artificial intelligence, something that recedes as fast as ever over the long-term horizon). Structure is also inflexible and expensive to produce in ways that text is not.

Language is the quintessential product of human cooperation, and this cooperation has shaped our capabilities and our culture more than any other single factor. Text is a beautiful gift that projects the language of particular moments in time and place into our collective futures. It is also infernally difficult to process by computer (a measure, perhaps, of the extent of our ignorance regarding human intelligence).

When scientific results are delivered exclusively via textual publication, the process of reproducing these results is often inefficient as a consequence. Although advances in computational platforms raise exciting possibilities for increased sharing and reuse of experimental setups and research results, still there is little sign that scientific publication will cease its relentless growth any time soon.

Similarly, although clinical recording continues to make progress away from paper and towards on-line systems with structured data models, still the primacy of text as a persistent communication mechanism within and between medical teams and between medics and their patients means that medical records will contain a wealth of textual, unstructured material for the foreseeable future.

It occurred to me recently that the research programme that resulted in GATE (http://gate.ac.uk/) is now 20 years old. It seems that we're at a tipping point at which heaps and heaps of incremental advances create a qualitative leap forward. In recent years GATE has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem with a strong claim to cover a uniquely wide range of the lifecycle of text-related systems. GATE is a focal point for the integration and reuse of advances that have been made by many people (the majority outside our own group) who work in text processing for diverse purposes. The upshot of all this effort is that text analysis has matured to become a predictable and robust engineering process. The benefits of deriving structured data from textual sources are now much easier to obtain as a result.

This change has been brought home to me recently in several contexts. Whereas in the past we thought of our tools as the exclusive preserve of specialists in language and information, it is becoming common that GATE users are text analysis neophytes who nonetheless develop sophisticated systems to solve problems in their own (non-language) specialisms. From medics in hospital research centers to archivists of the governmental web estate, from environment researchers to BBC staff building web presence for sporting events, our new breed of users are able to use our tools as a springboard to get up and running very quickly.

I think there are a number of reasons for this.

First, we stopped thinking of our role as suppliers of software systems and started thinking about all the problems that arise in the processes of describing, commissioning, developing, deploying and maintaining text analysis in diverse applied contexts. We thought about the social factors in successful text analysis projects, and the processes that underlie those factors, and built software specifically to support these processes.

Second, we spent a huge amount of effort on producing training materials and delivering training courses. (And as usual, the people who learned most from that teaching process were the teachers themselves!)

Third, in line with the trends towards openness in science and in publishing, GATE is 100% open source. This brings the usual benefits that have been frequently recognised (vendor independence; security; flexibility; minimisation of costs). Less often remarked upon but particularly significant in contexts like science or medicine are traceability and transparency. Findings that are explicable and fully open are often worth much more than results that appear magically (but mysteriously) from black boxes.

To conclude, it seems that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. GATE has grown to cover all the stages of this process. This is how it all hangs together:

  1. Take one large pile of text (documents, emails, tweets, patient records, papers, transcripts, blogs, comments, acts of parliament, and so on and so forth).
  2. Pick a structured description of interesting things in the text (perhaps as simple as a telephone directory, or a chemical taxonomy, or something from the Linked Data cloud) -- call this your ontology.
  3. Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to the ontology (2.).
  4. Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and measure performance against the gold standard. (If you have enough training data from (3.) or elsewhere you can use GATE's machine learning facilities.)
  5. Take the pipeline from (4.) and apply it to your text pile using GATE Cloud (or embed it in your own systems using GATE Embedded). Use it to bootstrap more manual (now semi-automatic) work in Teamware.
  6. Use GATE Mímir to store the annotations relative to the ontology in a multiparadigm index server.
  7. (Probably) write a UI to go on top of Mímir, or integrate it in your existing front-end systems via our RESTful web APIs.
  8. Hey presto, you have search that applies your annotations and your ontology to your corpus (and a sustainable process for coping with changing information needs and/or changing text).
  9. Your users are happy (and GATE.ac.uk has a "donate" button ;-) ).
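
The "measure performance against the gold standard" part of step 4 can be sketched in a few lines. What follows is a minimal illustration in Python, not GATE's own API or evaluation tooling: the `Annotation` type, the `score` function and the sample spans are all hypothetical, and it assumes strict matching (an automatic annotation only counts if its offsets and its ontology class both agree exactly with a gold standard one).

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical annotation: character offsets plus an ontology class."""
    start: int
    end: int
    label: str


def score(gold: set[Annotation], predicted: set[Annotation]) -> dict[str, float]:
    """Strict-match precision, recall and F1 of predicted against gold."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up gold standard annotations, as might come out of manual mark-up (step 3).
gold = {
    Annotation(0, 5, "Gene"),
    Annotation(10, 21, "Disease"),
    Annotation(30, 37, "Gene"),
}

# Made-up output of an automatic pipeline (step 4) over the same document.
predicted = {
    Annotation(0, 5, "Gene"),       # exact match
    Annotation(10, 21, "Gene"),     # right span, wrong class
    Annotation(30, 37, "Gene"),     # exact match
    Annotation(40, 44, "Disease"),  # spurious
}

result = score(gold, predicted)
print(result)
```

Strict matching is the harshest choice; a more lenient scheme that gives credit for overlapping spans would need a different comparison than set intersection. Re-running a measure like this after every change to the pipeline is what makes the process repeatable rather than a matter of guesswork.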


Monday, February 28, 2011


(Summarising some of David Servan-Schreiber's book Anticancer, 2nd edition, 2011.)

Why are cancer rates increasing so quickly? What can we do to stop the spread, both as individuals and as a society?

Curing cancer: the long route

The World Health Organisation runs the world's biggest cancer epidemiology lab (IARC, in Lyon, France). They publish the standard reference work on carcinogenesis, and in recent years have done a lot of work on genetic and epigenetic factors in cancer. They found the first gene that increases the risk of lung cancer in smokers (see Nature, April 2008), for example.

Their work involves finding tiny needles in huge haystacks. The new generation of sequencing machines can process an entire genome within a week. Then billions of data points from populations of cancer sufferers must be correlated with environmental factors like smoking, diet, pollution, etc.

In their latest experiments IARC scientists are using GATE to adjust their statistical models based on previously published research results. In 2010 they found a gene that associates with head and neck cancer this way. The method can boost the productivity of cancer research by exploiting the gigabytes of published research, government studies and patent applications that cover cancer and its causes.

One of the best things for me about working with GATE in recent years has been this stuff on carcinogenesis with the genetic epidemiology team in Lyon. As part of the LarKC project we developed a simple text processing system to boost probabilities in their models of gene-disease association. It is quite a buzz to be able to say that we've contributed to finding a new proven susceptibility that associates with a particular genetic marker. Knowledge of this type of susceptibility promises to contribute significantly to the development of targeted pharmaceuticals for cancer treatments in the long term.

This is, though, only part of the story -- just as one of the main consequences of our increased understanding of genetics has been to demonstrate the need to understand the biological and environmental context of gene expression (or epigenetics), so an understanding of cancer must also lead beyond the purely pharmaceutical. Let's look at the other side of the coin: what is it that causes cancer in the first place?

Polar bears

As pure as the driven snow? Unfortunately not:

Polar bears live far from civilization... Yet of all the animals in the world, the polar bear is the most contaminated by toxic chemicals, to the point where its immune system and its reproductive capacities are threatened... The pollutants we pour into our rivers and streams all end up in the sea... The polar bear is at the top of the food chain that is contaminated from one end to the other... There is another mammal that reigns at the top of its food chain and its habitat is, moreover, distinctly less protected than the polar bear's: the human being.

And it is here that we find two huge causes of cancer: first, the food we eat, and second the artificial pollutants that permeate both our environment and our food. As many as 85% of all cancers are caused by environmental factors (the food we eat, the air we breathe, the stresses and strains of modern life). For example, a large Danish study

... found that the genes of biological parents who died of cancer before fifty had no influence on an adoptee’s risk of developing cancer. On the other hand, death from cancer before the age of fifty of an adoptive parent (who passes on habits but not genes) increased the rate of mortality from cancer fivefold among the adoptees. This study shows that lifestyle is fundamentally involved in vulnerability to cancer. All research on cancer concurs: genetic factors contribute to at most 15% of mortalities from cancer.

In other words, there is a clear potential for preventing cancer by adjusting our lifestyles, and the same comment applies to increasing our chances of surviving the disease once diagnosed. In the short term there is an increasing body of work that can guide individuals, families and communities to methods for decreasing their cancer risks and for improving the prognosis once diagnosed, and Servan-Schreiber's book is an excellent summary.

In the longer term, we must change our modes of transport, our agriculture and our industrial processes, if we are serious about making a lasting difference to cancer rates.

Curing cancer 2: short cuts

It is in this sense that cures for cancer already exist: we don't need to wait for scientific miracles or technological breakthroughs (that may or may not come) -- we can prevent many cancers and remit many existing cancers by changing practices that we already understand very well (perhaps starting with your next meal!). It seems that

... by upsetting the balance in our diets we have created optimal conditions in our bodies for the development of cancer. If we accept that cancer growth is stimulated to a large extent by toxins from the environment, then in order to combat cancer, we have to begin by detoxifying what we eat. Facing this overwhelming body of evidence, here are simple recommendations to slow the spread of cancer:

  1. Eat sugar and white flour sparingly: replace them with agave nectar, acacia honey or coconut sugar for sweetening; multigrain flour for pastas and breads, or sourdough.
  2. Reduce consumption of red meat and avoid processed pork products. The World Cancer Research Fund recommends limiting consumption to no more than 500 g (18 oz) of red meat and pork products every week – in other words, at most four or five steaks. Their ideal recommended goal is 300 g (11 oz) or less.
  3. Avoid all hydrogenated vegetable fats – ‘trans fats’ – (also found in croissants and pastries that are not made with butter) and all animal fats loaded with omega-6s. Olive oil and canola oil are excellent vegetable fats that don’t promote inflammation. Butter (not margarine) and cheese that are well-balanced in omega-3s may not contribute to inflammation either.

(Lower cost options are also discussed.)

Servan-Schreiber's book is one of those rare texts that combines a rigorous and detailed apprehension of the scientific literature with clear, simple and practical messages about how we can live better. If you read one book this year....!


Thursday, February 24, 2011

Talk or Technology?

Talk or technology -- which is most expensive? Talk, it seems.

I spent the weekend with a friend of mine who runs one of the bigger semantics companies (he's a pedlar of used meanings, I like to tell him -- a kind of wholesale supplier of double entendres). He's very active in the community, and he follows the fate of all the other startups and joint ventures that have sprung up over the last decade or so, and the machinations of their customers, the tech savvy media and the analyst firms and so on. Several months ago he told me a story about Corporation X (let's call them Turnip for no particularly good reason), Startup Y (let's call them Cabbage) and a certain popular text processing framework, which, if you're sitting comfortably, I shall relay for your general delectation and personal improvement.

Now, Turnip are a megacorp, biggest publisher of one sort or another, and supplier of diverse databases and data streams to the jobbing information worker. In common with pretty much every other publisher out there (except Cory Doctorow) Turnip can see the writing on the digitally revolutionary wall, and are casting around for ways to make their offerings more exciting than the competition. (Whether they can make their expensive, closed and stuffy stuff seem more attractive than the new free and open world is not something I'd bet the house on, but there you go.) One obvious route is to use text analysis to hook their text corpora up to conceptual models and bung the results into a semantic repository. Hey presto, all sorts of new and nifty search and browsing behaviours suddenly become possible. So publishers have been pretty keen customers of both the GATE team and my friend's company in recent years.

Turnip realised the importance of text processing in their collective future some time ago, and, after reporting work based on GATE up until a few years back, decided to take the function in-house. They bought Cabbage, one of the most active text analysis startups of the time. We assumed that they were going to use Cabbage tech to replace the stuff they'd done with GATE...

Fast forward to the present, and my friend was chatting to one of the people who run the publishing side of things at Turnip. Surprise surprise: the Cabbage stuff is nowhere to be seen, and they're still using the old and trusty Volkswagen Beetle of text processing.

Well who'd have thought it.

So, coming back to the question of talk vs. technology, we can conclude that the good people at Cabbage, who were big enough talkers to see themselves bought for a large chunk of readies by Turnip, had the right approach. Sad old technologists like me and mine just don't cut the mustard in the self-promotion stakes.

In fact, I've seen this pattern in a number of contexts. The people best at telling you about why you need them are generally too busy doing just that to really get to grips with all that inconvenient science and engineering that needs doing to actually make a practical difference. Therefore I have formulated Cunningham's Law: the quality of the work varies in inverse proportion to the quality of the slideware. (Next time you're unlucky enough to be bored to tears by one of my talks, please bear this in mind.)

To finish, a hint, free of charge, for those who have text processing problems to solve but would prefer not to spend large sums of money on cabbage and the like. You need open systems, you need to measure from the word go, and you need a process that incorporates robust mechanisms for task definition, quality assurance and control and system evolution. And you need a pool of available users and developers, training materials, etc. etc. So mosey on over to http://gate.ac.uk/ :-)


Monday, January 17, 2011

Private Frazer was Right

Frazer was right!

(We're all doomed!)

(A note about climate change, the media and open science, January 2011.)

Last week NASA and others announced that 2010 was the joint hottest year on record. The announcement was almost universally ignored by the UK media. In wondering why that might be, several reasons come to mind:

First, the long-term subordination of media output to supporting the status quo (see much of Chomsky's work since Manufacturing Consent in the late 1980s; also more locally Edwards & Cromwell's Newspeak in the 21st Century). The status quo is, of course, dominated by oil (the 10 biggest companies in the world are often listed as 9 oil and car companies plus Walmart, owner of some of the world's largest car parks). Even more so we are dominated by profit: if it doesn't make a profit it isn't worth doing, no matter that this results in idiocy on a massive scale (from a market point-of-view, for example, it makes sense to ship all our manufactured goods from China, or to pay the bankers who caused our most recent crisis huge sums for their unproductive work, etc. etc. etc.).

This is, however, a general reason for the media to ignore climate change, and the NASA announcement about 2010 was actually quite widely reported around the world -- but not in the UK. A more specific and local reason can be found in Nick Davies' book Flat Earth News, which documents the severe reduction in the quantity of journalism (and of journalists) in the UK over the last 20 years (since Murdoch's relocation of his print operation from Fleet Street to Wapping). The majority of reporting is now supplied by organisations that aim for neutrality, not objectivity. (What's the difference? If two people report the progress of mowing a meadow and one says "we're finished" and the other "we haven't started", neutral reporting simply quotes both sides. Objective reporting goes and looks at how much grass is left. Clearly the latter is expensive and harder to make a profit at -- but the former is not journalism.)

Worse, more than 80% of the stories in our press have no journalistic oversight at all, let alone an objective appraisal. This is because they are the unmediated creations of Public Relations staff, either direct to the paper or via a press agency like PA, AP or Reuters -- and note that press agencies explicitly define themselves as neutral, not as objective investigators. The old role of investigative journalist has retrenched so far that it is now a rare exception.

So far, so depressing, but there's another reason that UK media sources are using to ignore climate change at present, and that is the aftermath of the Climategate scandal that began in November 2009. It was sad to see the outpouring of unqualified censure and obfuscation that greeted the selective publication of a few emails between a few climate scientists that had been stolen from their hard drives by hostile critics. Several enquiries have since exonerated the scientists concerned and restated the underlying strength of their argument, but nonetheless a good deal of damage has been done and our chances of avoiding the worst of the risks that face us are lessened as a result.

The attack was, of course, disingenuous (and most reminiscent of Big Tobacco's tactics with respect to lung cancer research) but the ammunition was also too freely available, and that brings us to the connection between climate change and the subject of this blog -- which is at least loosely focussed on information management, text processing and the like.

One of the contributing factors to the Climategate fiasco is a mismatch between technological capabilities and research practice. Scientists are habituated to a model where artefacts such as their intermediate results, computational tools and data sets are both transient and private. Repeatability, the cornerstone of empirical methods, is most often addressed by publication in peer-reviewed journals, but not by reuse of open data and resources. It is this culture that has proved vulnerable to vexatious freedom of information requests from climate change deniers. It is also a culture which is non-optimal with respect to openness and the efficient dissemination of scientific knowledge equally across the globe.

This is not to say that all experimental and modelling data can become open overnight -- but information management systems that support sharing between scientists can be built in ways that facilitate greater openness, traceability and data longevity.

To cut a long story short, open science is an idea whose time has come, and the question now is not if but when: how rapidly we will shift, how efficient the results will be, and what the experiences of individual scientists will be. The battle isn't over, of course; last year I went to a talk by James Boyle, one of the founders of Creative Commons and now Science Commons, and he showed very clearly how "to make the web work for science is illegal" -- the mechanisms that work so well for on-line shopping or social networking are prevented from working for scientists by restrictive publishing contracts and so on. But, as Glyn Moody points out, Turing's results imply the atomicity of the digital revolution, and its consequences are that the genie is now so far out of the bottle that all our human achievements will follow into the realm of openness and cooperative enterprise sooner or later.

How can we encourage openness in climate science, and reduce exposure to climate change deniers?

The technology we need falls into three categories:

  • Cloud computing and virtualisation. Server usage optimisation and scalability via cloud computing now make it possible to address problems at the scale of every scientific research department in the UK (for example) within existing infrastructure budgets. Virtualisation makes it possible to store not just data but the entire compute platform operable for particular experiments or analyses.
  • Distributed version control repositories. Server-side repository systems that are commonplace for software engineers (e.g. Bazaar, Git or Subversion) have a large part of the answer for storing and versioning the data sets generated during collaborative research. They need to be integrated with on-line collaboration tools to make their use easier and more intuitive.
  • Open search infrastructure. Findability is a key criterion upon which scientists base their evaluation of computational tools. Open source search engines are mature enough to perform very well when properly configured, and techniques exist for adaptation to non-textual data. The desktop metadata gathering facilities now available in e.g. KDE add exciting new possibilities, for example to make queries like "show me all the email I wrote around the time I edited the journal paper on tree rings".

Of course technology is only part of the picture, and has to be coupled with intervention at the cultural and organisational levels. The message of open knowledge and open data is becoming a powerful meme which can be exploited to promote new technologies and help change culture (and in doing so increase the effectiveness of climate scientists and decrease the power of climate change deniers).

Scientists are most often motivated by desire to do their work, and not very often by ticking the boxes that research assessment exercises demand, so if we can show a route to replacing "publish or perish" with "research and flourish" we can gain a lot of mindshare.

To conclude, the best hope for our collective future lies in cooperation, and after all that is the great strength of our species. Ursula Le Guin makes this point very clearly:

"The law of evolution is that the strongest survives!"
"Yes; and the strongest, in the existence of any social species, are those who are most social. In human terms, most ethical. ... There is no strength to be gained from hurting one another. Only weakness."

Mechanisms for open discussion and consensus building in science can translate into mechanisms for promoting democracy and cooperation, and help light the path to a better world.