Computing Text: The Infernal Beauty of Text

We talk, we write, we listen or read, and we have such a miraculous facility in all these skills that we rarely remember how hard they are. It is natural, therefore, that a large proportion of what we know of the world is externalised exclusively in textual form. That fraction of our science, technology and art that is codified in databases, taxonomies, ontologies and the like (let's call this structured data) is relatively small. Structured data is, of course, machine tractable in ways that text can never be (at least in advance of a true artificial intelligence, something that recedes as fast as ever over the long-term horizon). Structure is also inflexible and expensive to produce in ways that text is not.

Language is the quintessential product of human cooperation, and this cooperation has shaped our capabilities and our culture more than any other single factor. Text is a beautiful gift that projects the language of particular moments in time and place into our collective futures. It is also infernally difficult to process by computer (a measure, perhaps, of the extent of our ignorance regarding human intelligence).

When scientific results are delivered exclusively via textual publication, the process of reproducing these results is often inefficient as a consequence. Although advances in computational platforms raise exciting possibilities for increased sharing and reuse of experimental setups and research results, still there is little sign that scientific publication will cease its relentless growth any time soon.

Similarly, although clinical recording continues to make progress away from paper and towards on-line systems with structured data models, still the primacy of text as a persistent communication mechanism within and between medical teams and between medics and their patients means that medical records will contain a wealth of textual, unstructured material for the foreseeable future.

It occurred to me recently that the research programme that resulted in GATE (http://gate.ac.uk/) is now 20 years old. It seems that we're at a tipping point at which heaps and heaps of incremental advances create a qualitative leap forward. In recent years GATE has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem with a strong claim to cover a uniquely wide range of the lifecycle of text-related systems. GATE is a focal point for the integration and reuse of advances that have been made by many people (the majority outside our own group) who work in text processing for diverse purposes. The upshot of all this effort is that text analysis has matured to become a predictable and robust engineering process. The benefits of deriving structured data from textual sources are now much easier to obtain as a result.

This change has been brought home to me recently in several contexts. Whereas in the past we thought of our tools as the exclusive preserve of specialists in language and information, it is becoming common that GATE users are text analysis neophytes who nonetheless develop sophisticated systems to solve problems in their own (non-language) specialisms. From medics in hospital research centres to archivists of the governmental web estate, from environment researchers to BBC staff building web presence for sporting events, our new breed of users are able to use our tools as a springboard to get up and running very quickly.

I think there are a number of reasons for this.

First, we stopped thinking of our role as suppliers of software systems and started thinking about all the problems that arise in the processes of describing, commissioning, developing, deploying and maintaining text analysis in diverse applied contexts. We thought about the social factors in successful text analysis projects, and the processes that underlie those factors, and built software specifically to support these processes.

Second, we spent a huge amount of effort on producing training materials and delivering training courses. (And as usual, the people who learned most from that teaching process were the teachers themselves!)

Third, in line with the trends towards openness in science and in publishing, GATE is 100% open source. This brings the usual benefits that have been frequently recognised (vendor independence; security; flexibility; minimisation of costs). Less often remarked upon but particularly significant in contexts like science or medicine are traceability and transparency. Findings that are explicable and fully open are often worth much more than results that appear magically (but mysteriously) from black boxes.

To conclude, it seems that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made well defined and repeatable. GATE has grown to cover all the stages of this process. This is how it all hangs together:

1. Take one large pile of text (documents, emails, tweets, patient records, papers, transcripts, blogs, comments, acts of parliament, and so on and so forth).
2. Pick a structured description of interesting things in the text (perhaps as simple as a telephone directory, or a chemical taxonomy, or something from the Linked Data cloud) -- call this your ontology.
3. Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to the ontology (2.).
4. Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and measure performance against the gold standard. (If you have enough training data from (3.) or elsewhere you can use GATE's machine learning facilities.)
5. Take the pipeline from (4.) and apply it to your text pile using GATE Cloud (or embed it in your own systems using GATE Embedded). Use it to bootstrap more manual (now semi-automatic) work in Teamware.
6. Use GATE Mímir to store the annotations relative to the ontology in a multiparadigm index server.
7. (Probably) write a UI to go on top of Mímir, or integrate it in your existing front-end systems via our RESTful web APIs.
8. Hey presto, you have search that applies your annotations and your ontology to your corpus (and a sustainable process for coping with changing information needs and/or changing text).
9. Your users are happy (and GATE.ac.uk has a "donate" button ;-) ).
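
To make the "measure performance against the gold standard" step a little more concrete, here is a minimal sketch in Python of what such a measurement boils down to. It is not GATE's API or GATE's evaluation tooling: the `Annotation` and `Scores` types, the `evaluate` function and the sample data are all hypothetical, and the sketch assumes strict matching only (an automatic annotation counts as correct when its span and its type both agree exactly with a gold standard annotation).

```python
from typing import NamedTuple, Set


class Annotation(NamedTuple):
    """Hypothetical annotation: character offsets plus an ontology type."""
    start: int
    end: int
    label: str


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def evaluate(gold: Set[Annotation], predicted: Set[Annotation]) -> Scores:
    """Strict precision, recall and F1 of predicted annotations against gold."""
    # Strict match: same start, same end, same label.
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f1)


# Made-up data: three gold annotations, four produced by a pipeline.
gold = {
    Annotation(0, 9, "Person"),
    Annotation(20, 29, "Drug"),
    Annotation(40, 47, "Disease"),
}
predicted = {
    Annotation(0, 9, "Person"),     # correct
    Annotation(20, 29, "Disease"),  # right span, wrong type
    Annotation(40, 47, "Disease"),  # correct
    Annotation(60, 66, "Drug"),     # spurious
}

scores = evaluate(gold, predicted)
print(f"P={scores.precision:.2f} R={scores.recall:.2f} F1={scores.f1:.2f}")
```

Two of the four automatic annotations match the gold standard, and two of the three gold annotations are found. A real evaluation would usually also want a lenient mode that gives credit for overlapping spans, which this sketch leaves out.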