<h1 class="cow-title-heading">Computing Text</h1>
<p><em>A blog about GATE (<a href="http://gate.ac.uk/">http://gate.ac.uk/</a>) and text analysis.</em></p>

<h1 class="cow-title-heading">Moving house (reMaker unCompany)</h1>
<p><em>(9th December 2013.)</em></p>
The bags are packed, the electric is turned off and the van is rammed to the gills...<br />
<br />
Metaphorically speaking, that is. I've revamped my personal pages, and set up a blog there -- <a href="http://hamish.gate.ac.uk/">hamish.gate.ac.uk</a> -- so head on over. There's coffee on the stove. <br />
<br />
The new house? A bijou residence on the slopes of Mount Physical, with the catchy title of reMaker unCompany, to put a roof over some stuff I've been working on in relation to the new world of digital manufacture, <a href="http://pi.gate.ac.uk/">single-board computers</a> and the like.<br />
<br />
See you soon.<br />
<br />
<a href="http://gate.ac.uk/blogs/hamish/moving-house.html">Permalink.</a>Unknownnoreply@blogger.com19tag:blogger.com,1999:blog-7946406017081923418.post-84805564598058395052012-04-17T16:39:00.000+01:002012-04-17T16:39:18.182+01:00Open access journals<p>It has long been obvious that the days of closed scientific publishing are
just as numbered as those of all restrictive practices. In the age of the free
flow of bits, sharing information will only ever get easier (as Cory Doctorow
is fond of pointing out). </p>
<p>As workers in text mining it is, of course, frustrating that we often can't
apply our algorithms as widely as would be useful for scientific users of our
systems because of journal access restrictions. (The results are real; see for
example our recent contribution to a PLoS One paper about oral cancer,
<em>Incorporation of prior information from the medical literature in GWAS of
oral cancer identifies novel susceptibility variant on chromosome 4 - the
AdAPT method</em>, in press April 2012.)</p>
<p>A recent report suggests that the losses associated with these restrictions
are more than £100 million per year:</p>
<p><blockquote>
Text mining, for example, is a relatively new research method where computer
programmes hunt through databases of plain-text research articles, looking for
associations and connections – between drugs and side effects, for example, or
between genes and disease – that a person scouring through papers one by one
may never notice.</p>
<p>In March, JISC, a government-funded agency that champions the use of digital
technology in UK universities for research and teaching, published
<a class="cow-url" href="http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx">
a report</a>. This said that if text mining enabled just a 2% increase in
productivity for scientists, it would be worth £123m-£157m in working time per
year.</p>
<p>But the process requires research articles to be accessed, copied, analysed
and annotated – all of which could be illegal under current copyright laws.</p>
<p>(<a class="cow-url" href="http://www.guardian.co.uk/science/2012/apr/09/frustrated-blogpost-boycott-scientific-journals?cat=science&type=article">
The Guardian, 9th April 2012</a>.)
</blockquote></p>
<p>It is time to open up!</p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/open-access-journals.html">Permalink.</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7946406017081923418.post-59603845796755205232012-03-04T15:02:00.001+00:002012-03-04T15:07:52.074+00:00Thanks for the memories<p>When Kevin Humphreys and I wrote the first version of
GATE back in the 1990s we used Tcl/Tk, a nice clean
scripting language with an extensible C API
underneath. One of the innovative things that Tcl
provided was a dynamic loading mechanism, and I used
it to allow CREOLE plugins to be reloaded at run-time.
A year or two after we'd released the system I could
often be heard cursing my stupidity — the reloading
system worked well when it was configured, but getting
it to run cross-platform at user sites with diverse
collections of underlying shared C libraries was a
huge pain in the bum.</p>
<p>Fast forward 15 years or so and the class loader code that I put into GATE
version 2 (the first Java version) also has some pain associated with it, and
it is a real pleasure to see
<a class="cow-url" href="http://englishjavadrinker.blogspot.com/2012/03/disposable-memories_04.html">this post</a> with all its carefull study and presentation. Even better, a new
chunk of code to take away one of the gotchas with classloading and memory
consumption in long-running server processes. Nice one Mark!</p>
<p>One other thing springs to mind — the design choices
that we took for GATE 2 (around the turn of the
millennium, with a first release in 2002) turned out to
be pretty good, by and large (more luck than judgement
on my part, of course). GATE has mushroomed orders of
magnitude beyond our original plan in the intervening
period, but despite a few creaking joints it still
holds its own. That's a credit to several of the
long-term inmates of the GATE team, and also to Java
(and its later offshoots like Spring, Groovy and
Grails). It's easy to get blinded by the Next Big
Thing in computing, but if you stand on solid
foundations (and keep working on reusability and
refactoring) you <em>can</em> have your cake and eat it!</p>
<p>(And sorry for the cheesy title.)</p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/thanks-for-the-memories.html">Permalink.</a></p>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-7946406017081923418.post-90823804553652345362011-05-17T12:36:00.004+01:002011-05-17T14:27:22.804+01:00The Infernal Beauty of Text<p>We talk, we write, we listen or read, and we have such a miraculous facility
in all these skills that we rarely remember how hard they are. It is natural,
therefore, that a large proportion of what we know of the world is
externalised exclusively in textual form. That fraction of our science,
technology and art that is codified in databases, taxonomies, ontologies and
the like (let's call this <em>structured data</em>) is relatively small. Structured
data is, of course, machine tractable in ways that text can never be (at least
in advance of a true artificial intelligence, something that recedes as fast
as ever over the long-term horizon). Structure is also inflexible and
expensive to produce in ways that text is not.</p>
<p>Language is the quintessential product of human cooperation, and this
cooperation has shaped our capabilities and our culture more than any other
single factor. Text is a beautiful gift that projects the language of
particular moments in time and place into our collective futures. It is also
infernally difficult to process by computer (a measure, perhaps, of the extent
of our ignorance regarding human intelligence).</p>
<p>When scientific results are delivered exclusively via textual publication, the
process of reproducing these results is often inefficient as a consequence.
Although advances in computational platforms raise exciting possibilities for
increased sharing and reuse of experimental setups and research results, still
there is little sign that scientific publication will cease its relentless
growth any time soon.</p>
<p>Similarly, although clinical recording continues to make progress away from
paper and towards on-line systems with structured data models, still the
primacy of text as a persistent communication mechanism within and between
medical teams and between medics and their patients means that medical records
will contain a wealth of textual, unstructured material for the foreseeable
future.</p>
<p>It occurred to me recently that the research programme that resulted in GATE
(<a class="cow-url" href="http://gate.ac.uk/">http://gate.ac.uk/</a>) is now 20 years old. It seems that we're at a tipping
point at which heaps and heaps of incremental advances create a qualitative
leap forward. In recent years GATE has grown from its roots as a specialist
development tool for text processing to become a rather comprehensive
ecosystem with a strong claim to cover a uniquely wide range of the lifecycle
of text-related systems. GATE is a focal point for the integration and reuse
of advances that have been made by many people (the majority outside
our own group) who work in text processing for diverse purposes. The upshot of
all this effort is that text analysis has matured to become a predictable and
robust engineering process. The benefits of deriving structured data from
textual sources are now much easier to obtain as a result.</p>
<p>This change has been brought home to me recently in several contexts. Whereas
in the past we thought of our tools as the exclusive preserve of specialists
in language and information, it is becoming common that GATE users are text
analysis neophytes who nonetheless develop sophisticated systems to solve
problems in their own (non-language) specialisms. From medics in hospital
research centres to archivists of the governmental web estate, from
environment researchers to BBC staff building web presence for sporting
events, our new breed of users are able to use our tools as a springboard to
get up and running very quickly.</p>
<p>I think there are a number of reasons for this.</p>
<p>First, we stopped thinking of our role as suppliers of software systems and
started thinking about all the problems that arise in the processes of
describing, commissioning, developing, deploying and maintaining text analysis
in diverse applied contexts. We thought about the social factors in successful
text analysis projects, and the processes that underlie those factors, and
built software specifically to support these processes.</p>
<p>Second, we spend a huge amount of effort on producing training materials and
delivering training courses. (And as usual, the people who learned most from
that teaching process were the teachers themselves!)</p>
<p>Third, in line with the trends towards openness in science and in publishing,
GATE is 100% open source. This brings the usual benefits that have been
frequently recognised (vendor independence; security; flexibility;
minimisation of costs). Less often remarked upon but particularly significant
in contexts like science or medicine are traceability and transparency.
Findings that are explicable and fully open are often worth much more than
results that appear magically (but mysteriously) from black boxes.</p>
<p>To conclude, it seems that the deployment of text mining for document
abstraction or rich search and navigation is best thought of as a process, and
that with the right computational tools and data collection strategies this
process can be made defined and repeatable. GATE has grown to cover all the
stages of this process. This is how it all hangs together:</p>
<ol>
<li>Take one large pile of text (documents, emails, tweets, patient records,
papers, transcripts, blogs, comments, acts of parliament, and so on and so
forth).</li>
<li>Pick a structured description of interesting things in the text (perhaps as
simple as a telephone directory, or a chemical taxonomy, or something from the
<a class="cow-url" href="http://linkeddata.org/">Linked Data</a> cloud) -- call this your <em>ontology</em>. </li>
<li>Use <a class="cow-url" href="http://gate.ac.uk/teamware/">GATE Teamware</a> to mark up a <em>gold standard</em> example set of
annotations of the corpus (1.) relative to the ontology (2.).</li>
<li>Use <a class="cow-url" href="http://gate.ac.uk/family/developer.html">GATE Developer</a> to build a <em>semantic annotation
pipeline</em> to do the annotation job automatically and measure performance
against the gold standard. (If you have enough training data from (3.) or
elsewhere you can use GATE's machine learning facilities.)</li>
<li>Take the pipeline from (4.) and apply it to your text pile using
<a class="cow-url" href="http://gatecloud.net/">GATE Cloud</a> (or embed it in your own systems
using <a class="cow-url" href="http://gate.ac.uk/family/embedded.html">GATE Embedded</a>). Use it to bootstrap more manual
(now semi-automatic) work in Teamware.</li>
<li>Use <a class="cow-url" href="http://gate.ac.uk/family/mimir.html">GATE Mímir</a> to store the annotations relative to the
ontology in a <em>multiparadigm index server</em>.</li>
<li>(Probably) write a UI to go on top of Mímir, or integrate it in
your existing front-end systems via our RESTful web APIs.</li>
<li>Hey presto, you have search that applies your annotations and your
ontology to your corpus (and a <a class="cow-url" href="http://gate.ac.uk/family/process.html">sustainable process</a>
for coping with changing information needs and/or changing text).</li>
<li>Your users are happy (and <a class="cow-url" href="http://gate.ac.uk/">GATE.ac.uk</a> has a "donate"
button ;-) ).</li>
</ol>
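<p>The measurement step in (4.) can be sketched in a few lines of Python. This is a toy illustration of strict span-based scoring, not GATE's actual evaluation API; the annotation tuples and names below are made up:</p>

```python
# Toy sketch of gold-standard evaluation (step 4): compare automatic
# annotations against manually created ones and report precision/recall.
# Annotations are (start, end, type) spans; all values are illustrative.

def evaluate(gold, predicted):
    """Strict matching: a predicted annotation counts as correct only if
    its start, end and type all agree with a gold-standard annotation."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 5, "Person"), (10, 18, "Disease"), (25, 30, "Gene")]
predicted = [(0, 5, "Person"), (10, 18, "Disease"), (40, 45, "Gene")]

p, r, f = evaluate(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.67 R=0.67 F1=0.67
```

<p>GATE's own tools report these figures (and lenient variants that give partial credit for overlapping spans) automatically, which is what makes the annotate-measure-improve loop repeatable.</p>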
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/the-infernal-beauty-of-text.html">Permalink.</a>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-7946406017081923418.post-75532614261502600322011-02-28T17:00:00.001+00:002011-03-01T16:13:05.255+00:00Anticancer<p><em>(Summarising some of David Servan-Schreiber's book
<a class="cow-url" href="http://tinyurl.com/anticancerbook">Anticancer</a>, 2nd edition, 2011.)</em></p>
<p>Why are cancer rates increasing so quickly? What can we do to stop the spread,
both as individuals and as a society?</p>
<!--%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%-->
<h1 class="cow-heading">Curing cancer: the long route</h1>
<p>The World Health Organisation runs the world's biggest
<a class="cow-url" href="http://www.iarc.fr">cancer epidemiology lab</a> (IARC, in Lyon, France). They
publish the standard <a class="cow-url" href="http://monographs.iarc.fr/">reference work on
carcinogenesis</a>, and in recent years have done a lot of work on genetic and
epigenetic factors in cancer. They found the first gene that increases the
risk of lung cancer in smokers (see <em>Nature</em>, April 2008), for example.</p>
<p>Their work involves finding tiny needles in huge haystacks. The new generation
of sequencing machines can process an entire genome within a week. Then
billions of data points from populations of cancer sufferers must be
correlated with environmental factors like smoking, diet, pollution, etc.</p>
<p>In their latest experiments IARC scientists are using <a class="cow-url" href="http://gate.ac.uk/">GATE</a> to adjust their statistical models based on previously published
research results. In 2010 they found a gene that associates with head and neck
cancer this way. The method can boost the productivity of cancer research by
exploiting the gigabytes of published research, government studies and patent
applications that cover cancer and its causes.</p>
<p>One of the best things for me about working with GATE in recent years has been
this stuff on carcinogenesis with the
<a class="cow-url" href="http://www.iarc.fr/en/research-groups/GEP/index.php">genetic epidemiology
team</a> in Lyon. As part of the <a class="cow-url" href="http://gate.ac.uk/projects/larkc/">LarKC
project</a> we developed a simple text processing system to boost probabilities
in their models of gene-disease association. It is quite a buzz to be able to
say that we've contributed to finding a new proven susceptibility that
associates with a particular genetic marker. Knowledge of this type of
susceptibility promises to contribute significantly to the development of
targeted pharmaceuticals for cancer treatments in the long term.</p>
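<p>To give a flavour of the general idea (this is emphatically not the actual AdAPT method, just a hypothetical sketch): genes whose names co-occur with the disease in published literature get a boosted prior, which is then renormalised:</p>

```python
# Toy illustration of using text-mined evidence to re-weight priors in a
# gene-disease association model. NOT the actual AdAPT method -- just the
# general shape: genes mentioned alongside the disease in the literature
# receive a boosted prior probability. All names and numbers are made up.

def boosted_priors(base_prior, cooccurrence_counts, boost=2.0):
    """Multiply the base prior by a boost factor for each gene that
    co-occurs with the disease in the literature, then renormalise."""
    weights = {
        gene: base_prior * (boost if count > 0 else 1.0)
        for gene, count in cooccurrence_counts.items()
    }
    total = sum(weights.values())
    return {gene: w / total for gene, w in weights.items()}

# Co-occurrence counts a text-mining pipeline might extract (invented):
counts = {"GENE_A": 12, "GENE_B": 0, "GENE_C": 3}
priors = boosted_priors(base_prior=0.01, cooccurrence_counts=counts)
print(priors)  # GENE_A and GENE_C boosted relative to GENE_B
```

<p>The real method is of course far more careful statistically; the point is only that literature evidence enters the model as a prior rather than replacing the genome-wide data.</p>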
<p>This is, though, only part of the story -- just as one of the main
consequences of our increased understanding of genetics has demonstrated the
need to understand the biological and environmental context of gene expression
(or <em>epigenetics</em>), so an understanding of cancer must also lead beyond the
purely pharmaceutical. Let's look at the other side of the coin: what is it
that causes cancer in the first place?</p>
<!--%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%-->
<h1 class="cow-heading">Polar bears</h1>
<p>As pure as the driven snow? Unfortunately not:</p>
<p><blockquote> Polar bears live far from civilization... Yet of all the animals in the
world, the polar bear is the most contaminated by toxic chemicals, to the
point where its immune system and its reproductive capacities are
threatened... The pollutants we pour into our rivers and streams all end up in
the sea... The polar bear is at the top of the food chain that is contaminated
from one end to the other... There is another mammal that reigns at the top of
its food chain and its habitat is, moreover, distinctly less protected than
the polar bear's: the human being. </blockquote></p>
<p>And it is here that we find two huge causes of cancer: first, the food we eat,
and second the artificial pollutants that permeate both our environment and
our food. As many as 85% of all cancers are caused by environmental factors
(the food we eat, the air we breathe, the stresses and strains of modern life).
For example, a large Danish study</p>
<p><blockquote>... found that the genes of biological parents who died of cancer before
fifty had no influence on an adoptee’s risk of developing cancer. On the other
hand, death from cancer before the age of fifty of an adoptive parent (who
passes on habits but not genes) increased the rate of mortality from cancer
fivefold among the adoptees. This study shows that lifestyle is fundamentally
involved in vulnerability to cancer. All research on cancer concurs: genetic
factors contribute to at most 15% of mortalities from cancer.
</blockquote></p>
<p>In other words, there is a clear potential for preventing cancer by adjusting
our lifestyles, and the same comment applies to increasing our chances of
surviving the disease once diagnosed. In the short term there is an increasing
body of work that can guide individuals, families and communities to methods
for decreasing their cancer risks and for improving the prognosis once
diagnosed, and Servan-Schreiber's book is an excellent summary.</p>
<p>In the longer term, we must change our modes of transport, our agriculture and
our industrial processes, if we are serious about making a lasting difference
to cancer rates.</p>
<!--%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%-->
<h1 class="cow-heading">Curing cancer 2: short cuts</h1>
<p>It is in this sense that <b>cures for cancer already exist</b>: we don't need to
wait for scientific miracles or technological breakthroughs (that may or may
not come) -- we can prevent many cancers and remit many existing cancers by
changing practices that we already understand very well (perhaps starting with
your next meal!). It seems that</p>
<p><blockquote> ... by upsetting the balance in our diets we have created optimal
conditions in our bodies for the development of cancer. If we accept that
cancer growth is stimulated to a large extent by toxins from the environment,
then in order to combat cancer, we have to begin by detoxifying what we eat.
Facing this overwhelming body of evidence, here are simple recommendations to
slow the spread of cancer: </p>
<ol>
<li>Eat sugar and white flour sparingly: replace them with agave nectar, acacia
honey or coconut sugar for sweetening; multigrain flour for pastas and
breads, or sourdough. </li>
<li>Reduce consumption of red meat and avoid processed pork products. The World
Cancer Research Fund recommends limiting consumption to no more than 500 g
(18 oz) of red meat and pork products every week – in other words, at most
four or five steaks. Their ideal recommended goal is 300 g (11 oz) or less. </li>
<li>Avoid all hydrogenated vegetable fats – ‘trans fats’ – (also found in
croissants and pastries that are not made with butter) and all animal fats
loaded with omega-6s. Olive oil and canola oil are excellent vegetable fats
that doesn’t promote inflammation. Butter (not margarine) and cheese that
are well-balanced in omega-3s may not contribute to inflammation either.</li>
</ol>
<p></blockquote></p>
<p>(Lower cost options are also discussed.)</p>
<p>Servan-Schreiber's book is one of those rare texts that combines a rigorous
and detailed apprehension of the scientific literature with clear, simple
and practical messages about how we can live better. If you read one book this
year....!</p>
<p><hr></p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/anticancer.html">Permalink.</a>
<a class="cow-url" href="http://computingtext.blogspot.com/2011/02/anticancer.html">On Blogspot.</a></p>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-7946406017081923418.post-41726382830726112942011-02-24T11:06:00.000+00:002011-02-24T11:06:25.618+00:00Talk or Technology?<p>Talk or technology -- which is most expensive? Talk, it seems.</p>
<p>I spent the weekend with a friend of mine who runs one of the bigger semantics
companies (he's a peddlar of used meanings, I like to tell him -- a kind of
wholesale supplier of <em>double entendre</em>s). He's very active in the community,
and he follows the fate of all the other startups and joint ventures that have
sprung up over the last decade or so, and the machinations of their customers,
the tech savvy media and the analyst firms and so on. Several months ago he
told me a story about Corporation X (let's call them <em>Turnip</em> for no
particularly good reason), Startup Y (let's call them <em>Cabbage</em>) and a certain
popular text processing framework, which, if you're sitting comfortably, I
shall relay for your general delectation and personal improvement.</p>
<p>Now, Turnip are a megacorp, biggest publisher of one sort or another, and
supplier of diverse databases and data streams to the jobbing information
worker. In common with pretty much every other publisher out there (except
<a class="cow-url" href="http://craphound.com/">Cory Doctorow</a>) Turnip can see the writing on the
digitally revolutionary wall, and are casting around for ways to make their
offerings more exciting than the competition. (Whether they can make their
expensive, closed and stuffy stuff seem more attractive than the new free and
open world is not something I'd bet the house on, but there you go.) One
obvious route is to use text analysis to hook their text corpora up to
conceptual models and bung the results into a semantic repository. Hey presto,
all sorts of new and nifty search and browsing behaviours suddenly become
possible. So publishers have been pretty keen customers of both the GATE team
and my friend's company in recent years.</p>
<p>Turnip realised the importance of text processing in their collective future
some time ago, and, after reporting work based on GATE up until a few years
back, decided to take the function in-house. They bought Cabbage, one of the
most active text analysis startups of the time. We assumed that they were
going to use Cabbage tech to replace the stuff they'd done with GATE...</p>
<p>Fast forward to the present, and my friend was chatting to one of the people
who run the publishing side of things at Turnip. Surprise surprise: the
Cabbage stuff is nowhere to be seen, and they're still using the old and
trusty <a class="cow-url" href="http://gate.ac.uk/g8/page/print/2/sale/talks/tal/fig-intro-slidy.html">Volkswagen Beetle of text processing</a>.</p>
<p>Well who'd have thought it.</p>
<p>So, coming back to the question of talk vs. technology, we can conclude that
the good people at Cabbage, who were big enough talkers to see themselves
bought for a large chunk of readies by Turnip, had the right approach. Sad old
technologists like me and mine just don't cut the mustard in the
self-promotion stakes. </p>
<p>In fact, I've seen this pattern in a number of contexts. The people best at
telling you about why you need them are generally too busy doing just that to
really get to grips with all that inconvenient science and engineering that
needs doing to actually make a practical difference. Therefore I have
formulated <u>Cunningham's Law</u>: the quality of the work varies in inverse
proportion to the quality of the slideware. (Next time you're unlucky enough
to be bored to tears by one of my talks, please bear this in mind.)</p>
<p>To finish, a hint, free of charge, for those who have text processing problems
to solve but would prefer not to spend large sums of money on cabbage and the
like. You need <em><a class="cow-url" href="http://gate.ac.uk/biz/usps.html">open systems</a></em>, you need to
<em>measure</em> from the word go, and you need a process that incorporates robust
mechanisms for task definition, quality assurance and control and system
evolution. And you need a pool of available users and developers, training
materials, etc. etc. So mosey on over to <a class="cow-url" href="http://gate.ac.uk/">http://gate.ac.uk/</a> :-)</p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/talk-or-technology.html">Permalink.</a>
<a class="cow-url" href="http://computingtext.blogspot.com/2011/02/talk-or-technology.html">On Blogspot.</a></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7946406017081923418.post-54792774586025632462011-01-17T14:05:00.000+00:002011-01-17T14:05:23.635+00:00Private Frazer was Right<h1 class="cow-title-heading">Frazer was right! <br> <br>
<img src="http://www.bbc.co.uk/archive/imageArchive/images/5200_gall_004.jpg" alt=""James Frazer from Dad's Army"" width="200">
<br> <br>
(<em>We're all doomed!</em>)</h1>
<p><em>(A note about climate change, the media and <b>open science</b>, January 2011.)</em></p>
<p>Last week
<a class="cow-url" href="http://www.nasa.gov/home/hqnews/2011/jan/HQ_11-014_Warmest_Year.html">NASA
and others</a> announced that 2010 was the joint hottest year on record. The
announcement was almost universally ignored by the UK media. In wondering why
that might be, several reasons come to mind:</p>
<p><b>First</b>, the long-term subordination of media output to supporting the status
quo (see much of Chomsky's work since <em>Manufacturing Consent</em> in the late
1980s; also more locally Edwards & Cromwell's <em>Newspeak in the 21st Century</em>).
The status quo is, of course, dominated by oil (the 10 biggest companies in
the world are often listed as 9 oil and car companies plus Walmart, owner of
some of the world's largest car parks). Even more so we are dominated by
profit: if it doesn't make a profit it isn't worth doing, no matter that this
results in idiocy on a massive scale (from a market point-of-view, for
example, it makes sense to ship all our manufactured goods from China, or to
pay the bankers who caused our most recent crisis huge sums for their
unproductive work, etc. etc. etc.).</p>
<p>This is, however, a general reason for the media to ignore climate change, and
the NASA announcement about 2010 was actually quite widely reported around the
world -- but not in the UK. A more specific and <b>local reason</b> can be found in
Nick Davies' book <em>Flat Earth News</em>, which documents the severe reduction in
the quantity of journalism (and of journalists) in the UK over the last 20
years (since Murdoch's relocation of his print operation from Fleet Street to
Wapping). The majority of reporting is now supplied by organisations that aim
for <em>neutrality</em>, not <em>objectivity</em>. (What's the difference? If two people
report the progress of mowing a meadow and one says "we're finished" and the
other "we haven't started", neutral reporting simply quotes both sides.
Objective reporting goes and looks at how much grass is left. Clearly the
latter is expensive and harder to make a profit at -- but the former is not
journalism.)</p>
<p>Worse, more than 80% of the stories in our press have no journalistic
oversight at all, let alone an objective appraisal. This is because they are
the unmediated creations of Public Relations staff, either direct to the paper
or via a press agency like PA, AP or Reuters -- and note that press agencies
explicitly define themselves as neutral, not as objective investigators. The
old role of investigative journalist has retrenched so far that it is now a
rare exception.</p>
<p>So far, so depressing, but there's <b>another reason</b> that UK media sources are
using to ignore climate change at present, and that is the aftermath of the
<a class="cow-url" href="http://www.realclimate.org/index.php/archives/2010/11/one-year-later/">Climategate</a> scandal that began in November 2009. It was sad to see the
outpouring of unqualified censure and obfuscation that greeted the selective
publication of a few emails between a few climate scientists that had been
stolen from their hard drives by hostile critics. Several enquiries have since
exonerated the scientists concerned and restated the underlying strength of
their argument, but nonetheless a good deal of damage has been done and our
chances of avoiding the worst of the risks that face us are lessened as a
result.</p>
<p>The attack was, of course, disingenuous (and most reminiscent of Big Tobacco's
tactics with respect to lung cancer research) but the ammunition was also too
freely available, and that brings us to the connection between climate change
and the subject of this blog -- which is at least loosely focussed on
information management, text processing and the like.</p>
<p>One of the contributing factors to the Climategate fiasco is a mismatch
between technological capabilities and research practice. Scientists are
habituated to a model where artefacts such as their intermediate results,
computational tools and data sets are both transient and private.
Repeatability, the cornerstone of empirical methods, is most often addressed
by publication in peer-reviewed journals, but not by reuse of open data and
resources. It is this culture that has proved vulnerable to vexatious freedom
of information requests from climate change deniers. It is also a culture
which is non-optimal with respect to openness and the efficient dissemination
of scientific knowledge equally across the globe.</p>
<p>This is not to say that all experimental and modelling data can become open
overnight -- but information management systems that support sharing between
scientists can be built in ways that facilitate greater openness, traceability
and data longevity.</p>
<p>To cut a long story short, <em>open science</em> is an idea whose time has come, and
the question now is not if but when: how rapidly we will shift, how efficient
the results will be, and what the experiences of individual scientists will
be. The battle isn't over, of course; last year I went to a talk by James
Boyle, one of the founders of Creative Commons and now Science Commons, and he
showed very clearly how "to make the web work for science is illegal" -- the
mechanisms that work so well for on-line shopping or social networking are
prevented from working for scientists by restrictive publishing contracts and
so on. But, as Glyn Moody points out, Turing's results imply the atomicity of
the digital revolution, and its consequences are that the genie is now so far
out of the bottle that all our human achievements will follow into the realm
of openness and cooperative enterprise sooner or later.</p>
<p>How can we encourage openness in climate science, and reduce exposure to
climate change deniers?</p>
<p>The technology we need falls into three categories:</p>
<ul>
<li>Cloud computing and virtualisation. Server usage optimisation and
scalability via cloud computing now make it possible to address problems at
the scale of every scientific research department in the UK (for example)
within existing infrastructure budgets. Virtualisation makes it possible to
store not just the data but the entire compute platform used for particular
experiments or analyses.</li>
<li>Distributed version control repositories. Server-side repository systems
that are commonplace for software engineers (e.g. Bazaar, Git or Subversion)
have a large part of the answer for storing and versioning the data sets
generated during collaborative research. They need to be integrated with
on-line collaboration tools to make their use easier and more intuitive.</li>
<li>Open search infrastructure. Findability is a key criterion upon which
scientists base their evaluation of computational tools. Open source search
engines are mature enough to perform very well when properly configured, and
techniques exist for adaptation to non-textual data. The desktop metadata
gathering facilities now available in e.g. KDE add exciting new
possibilities, for example to make queries like "show me all the email I
wrote around the time I edited the journal paper on tree rings".</li>
</ul>
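<p>The kind of cross-source temporal query mentioned above is simple to express once the metadata is gathered. A minimal sketch (hypothetical data and function names, not any real desktop search API):</p>

```python
from datetime import datetime, timedelta

def emails_near_edits(emails, edit_times, window_days=3):
    """Return (sorted) emails sent within window_days of any edit to
    the target document -- the kind of query a desktop metadata store
    could answer."""
    window = timedelta(days=window_days)
    return sorted({msg for msg, sent in emails
                   for edited in edit_times
                   if abs(sent - edited) <= window})

# Hypothetical metadata records: (message name, time sent)
emails = [
    ("to-coauthor.eml", datetime(2010, 3, 14, 9, 30)),
    ("tax-return.eml", datetime(2010, 6, 1, 12, 0)),
]
# Times the (hypothetical) tree-rings paper was edited
edit_times = [datetime(2010, 3, 15, 17, 0)]

print(emails_near_edits(emails, edit_times))  # ['to-coauthor.eml']
```

<p>The interesting engineering is in harvesting the timestamps, not in the query itself, which is why desktop metadata frameworks make this kind of thing suddenly cheap.</p>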
<p>Of course technology is only part of the picture, and has to be coupled with
intervention at the cultural and organisational levels. The message of open
knowledge and open data is becoming a powerful meme which can be exploited to
promote new technologies and help change culture (and in doing so increase the
effectiveness of climate scientists and decrease the power of climate change
deniers).</p>
<p>Scientists are most often motivated by the desire to do their work, and not very
often by ticking the boxes that research assessment exercises demand, so if we
can show a route to replacing "publish or perish" with "research and flourish"
we can gain a lot of mindshare.</p>
<p>To conclude, the best hope for our collective future lies in cooperation, and
after all that is the great strength of our species. Ursula K. Le Guin makes this
point very clearly:</p>
<p><blockquote>
"The law of evolution is that the strongest survives!" <br>
"Yes; and the strongest, in the existence of any social species, are those who
are most social. In human terms, most ethical. ... There is no strength to be
gained from hurting one another. Only weakness."
</blockquote></p>
<p>Mechanisms for open discussion and consensus building in science can translate
into mechanisms for promoting democracy and cooperation, and help light the
path to a better world.</p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/private-frazer-was-right.html">Permalink.</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7946406017081923418.post-2315526857304618252010-12-03T13:17:00.000+00:002010-12-03T13:17:49.885+00:00Harvard's Selection Process and UK Research "Careers"<p>One of the points that Malcolm Gladwell makes in his beautiful book
<em><a class="cow-url" href="http://www.gladwell.com/outliers/index.html">Outliers</a></em> is that selections
made in face of over-abundance are likely to be random. He cites a study of
Harvard undergraduate admissions which shows that there is a large element of
chance involved -- the point being that where there is a surfeit of excellence
on offer (and the queue of young hopefuls at Harvard's door probably
qualifies) it is pretty meaningless to try and select the "best" using
anything more high-tech than the toss of a coin.</p>
<p>There's an analogous randomness in the fate of UK research staff (that is,
those staff employed only to do research, as opposed to academic faculty
members whose remit includes research, teaching and related administrative
tasks). These staff are most often known as <em>research assistants</em> (RAs -- a
term that gives a clue as to their general status within our Universities).</p>
<p>The custom and practice of RA employment arose in a time when the ratio of
research volume to faculty sizes was a lot lower, and it made perfect sense in
that context for the position to be a staging post between postgraduate
research and faculty jobs. Indeed a common longer form of the term is
<em>postdoctoral RA</em>, and in previous periods there was a reasonably strong
expectation that this was the final fence to jump on the way to the academic
finishing line. There was, in other words, a built-in assumption that being an
RA was just as temporary as the state of being a PhD student, for example.</p>
<p>Fast forward to the present (or even to 10 years ago, in fact), and there is
now a significant problem with this picture: there are too many RAs for them
to ever make the conversion to faculty. In my own department, for example, it
is not uncommon for the number of RAs to be double that of faculty, and unless
the rate of retirement of the latter leaps into the stratosphere (not
impossible, I admit, given that our
<a class="cow-url" href="http://www.ucu.org.uk/index.cfm?articleid=4598">pensions</a> and the HEFCE
funding backbone are <a class="cow-url" href="http://www.ucu.org.uk/index.cfm?articleid=3787">currently ConDemned</a>) then most research staff can hold little hope of ever
joining the grownups.</p>
<p>Why does this matter? Isn't this even a good thing, given that we want to
select only the most committed to take responsibility for the future of
research and of degree-level teaching? Mr. Gladwell's Harvard tale would tend
to indicate otherwise. UK research is certainly in the world elite,
consistently over-achieving relative to its size across a good range of metrics.
We are not separating off the cream as much as taking a random sample, and, as
modern employment law makes clear, any practice which leads to comparable
employees getting different treatment for no good reason is illegitimate.
Further, there is no need to claim that we are all of Harvard class to make
the point: it is sufficient that a researcher is productive in their field
(with all the specialist knowledge and long years of training that this
entails) for the waste involved in treating them as casual staff to be
clear.</p>
<p>Above and beyond this point there are several other negative outcomes. The
current system is:</p>
<ul>
<li>Divisive. In my experience RAs don't usually feel an integral part of the
departments and universities in which they work, and as a consequence their
commitment to those organisations is often low.</li>
<li>Inefficient. Over a three year project an RA often spends the first year
learning the job, the second year being productive, and the third year
looking for another job.</li>
<li>Discouraging. In common with many of my colleagues, I spent 20 years on
short-term contracts. If I hadn't been lucky enough to graduate to an
open-ended contract I probably would be planning a move out of academia, and
given that even now my funding is contingent on continued success within the
shifting sands of the research funding agencies, I still don't feel secure.</li>
</ul>
<p>The key point is that RAs are not temporary: as long as the volume of research
being done is greater than the capacity of academic faculty we're here to
stay, and in large numbers. This means that a system predicated on employment
insecurity is no longer appropriate, and indeed commentators of all shapes and
sizes (including former Sheffield VC Gareth Roberts) have advocated radical
change of one sort or another.</p>
<p>What are the options? Changing the big picture of research careers requires
intervention at a national level, but there are several local measures that
can start to change the culture and help make Universities more attractive for
contract research staff:</p>
<ul>
<li>Making employment contracts open-ended. This doesn't magically improve job
security but it does send out positive signals about our support for
research as a career (and also means that responsibility for triggering
redundancy moves from HR to the departments, increasing the likelihood of
departments taking the issue seriously).</li>
<li>Setting up a buffer fund for bridging between research projects. This will
necessarily be small-scale to begin with, but can serve as part of our
arguments for wider changes in funding structures.</li>
<li>Shifting terminology away from "assistant" or "postdoc" and towards
"professional researcher" and encouraging funding applications and other
career development steps for contract staff.</li>
</ul>
<p>For a longer version of this list see
<a class="cow-url" href="http://ucu.group.shef.ac.uk/documents/sucu-ftc-paper-2008.html">this discussion paper from Sheffield UCU</a> (which also has links to related
documents including the Roberts report). A good summary of the issues from a
principal investigator perspective is
<a class="cow-url" href="http://www.ucu.org.uk/media/pdf/o/0/stampout_pi_leaflet.pdf">available on the national UCU site</a>. Time for a change?</p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/research-careers.html">Permalink.</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7946406017081923418.post-56589951461866912182010-10-20T12:44:00.000+01:002010-10-20T12:44:13.886+01:00More Clouding<p>When you plug your fridge into the mains electricity supply you don't worry
about all the technology sitting behind the wall socket -- it just works.
<a class="cow-url" href="http://gate.ac.uk/blogs/hamish/cloud-computing.html">Cloud
computing</a> is starting to supply IT in a similar fashion. No more worrying
about backups, no more wasted hours configuring a new or repaired machine --
just plug into the network, fire up your web browser and away you go.</p>
<p>Researchers have tougher and more specialised IT needs than most, so to
realise the same ease of use that the cloud now provides for email or word
processing requires work in several areas. One of these areas is to adapt
existing established research tools to the cloud, and that is what we plan to
do with GATE in the next period. Over the last decade GATE has become a
world leader in the research and development of text mining algorithms.</p>
<p>Text has become a more and more important communication method in recent
decades. Our children's thumbs often spend half the day typing on their tiny
phone keypads; our evenings often include sessions on Facebook or writing
email to distant friends and relatives. When we interact with the corporations
and governmental organisations whose infrastructure and services underpin our
daily lives we fill in forms or write emails. When we want to publicise our
work for our employer or share details of our leisure activities with a wider
audience we create websites, post Twitter messages or make blog entries.
Scientists also now use these channels in their work, in addition to
publishing in peer-reviewed journals -- a process which has also seen a huge
expansion in recent years.</p>
<p>This avalanche of the written word has changed many things, not least the way
that scientists gather information from the experiences of their peers. For
example, a team at the World Health Organisation's cancer research agency
recently found the first evidence of a link between a particular genetic
mutation and the risk of lung cancer in smokers. Their experiments require
large amounts of costly laboratory time to verify or falsify hypotheses based
on samples of mutations in gene sequences from their test subjects. Text
mining from previous publications makes it possible for them to reduce this
lab time by factoring in probabilities based on association strengths between
mutations, environmental factors and active chemicals.</p>
<p>A second area that has been revolutionised by the new world of text concerns a
core function that commercial concerns must implement in order to stay in
business. Customer relations and market research are no longer just about
monitoring the goings on of the corporate call center. Keeping up to date with
the public image of your products or services now means coping with the
Twitter firehose (45 million posts per day), the comment sections of consumer
review sites, or the point-and-click 'contact us' forms from the company
website. To do this by hand is now impossible in the general case: the data
volume long ago outstripped the possibility of cost-effective manual
monitoring. Text mining provides alternative, automatic methods for dealing
with this data.</p>
<p>GATE provides four core systems to support scientists experimenting with new
text mining algorithms and developers using text mining in their applications:</p>
<ul>
<li>GATE Developer: an integrated development environment for language
processing components</li>
<li>GATE Embedded: an object library optimised for inclusion in diverse
applications</li>
<li>GATE Teamware: a collaborative annotation environment for high volume
factory-style semantic annotation projects built around a workflow engine</li>
<li>GATE Mímir (Multi-paradigm Information Management Index and Repository):
a massively scalable multi-paradigm index</li>
</ul>
<p>Our plan for the next period is to work towards making use of these systems
more like electric sockets and fridges!</p>
<p>A caveat: it is important to note that current commercial cloud offerings are
not yet appropriate as a drop-in replacement for all academic computing
facilities. For example, the cost of running a virtual machine on Amazon's EC2
continuously for 1 year is roughly equivalent to the cost of buying a similar
machine. In the latter case the hardware may be expected to perform reliably
for at least 3 years, which means that the Amazon option is only
cost-effective if the cost of hosting a server in your organisation is on the
order of 3 times the cost of the server hardware. Careful quantification of the
costs is important when moving to the cloud.</p>
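<p>The back-of-the-envelope version of that comparison, with illustrative figures (the post gives ratios, not prices; the numbers below are assumptions):</p>

```python
# Illustrative figures only (assumed, not real EC2 prices):
hardware_cost = 3000.0        # one-off server purchase
cloud_cost_per_year = 3000.0  # the post: ~equal to the purchase price
lifetime_years = 3            # expected reliable hardware lifetime

cloud_total = cloud_cost_per_year * lifetime_years  # 9000.0

# The cloud wins only when in-house hosting (power, cooling, admin)
# over the machine's lifetime exceeds this much...
break_even_hosting = cloud_total - hardware_cost    # 6000.0

# ...i.e. when total in-house spend (hardware + hosting) is on the
# order of 3x the hardware cost, as argued above.
in_house_total = hardware_cost + break_even_hosting
print(in_house_total / hardware_cost)  # 3.0
```

<p>Plugging in your own organisation's hosting costs is the whole exercise; the arithmetic itself is trivial.</p>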
<p>See also this <a class="cow-url" href="http://gate.ac.uk/blogs/hamish/cloud-computing.html">previous
post on cloud computing.</a></p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/more-clouding.html">Permalink.</a>
<a class="cow-url" href="http://computingtext.blogspot.com/2010/10/more-clouding.html">On Blogspot.</a></p>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-7946406017081923418.post-59093933223884278482010-07-09T15:37:00.000+01:002010-07-09T15:37:02.813+01:00How to Join Open Source Projects<p>This is rather tired ground that has been well trodden by other feet, but in
the aftermath of a disagreement with one of the happy chappies from
<a class="cow-url" href="http://www.ontotext.com/">Ontotext</a> I thought I'd reiterate a couple of home
truths about open source projects and how you go about joining in. Along the
way I'll also ask what "hack" means, for the benefit of those software people
who've been locked in a small room without access to books or networks for the
last couple of decades :-)</p>
<p>To start with something that should be obvious, all engineering projects of
whatever type are social processes in which human factors are at least as
important as technical ones. In open source this is often more important than
in other areas because the people involved often give their time and expertise
for free, and even when they're being paid specifically to participate there
is almost always a discretionary element of their contribution (should I
bother to answer this email from a complete beginner who obviously hasn't
managed to find the user manual, or shall I finish work ten minutes early
today?). This means that when you want to join an open source project (i.e. to
become a developer, contribute code etc.) you need to show a little
sensitivity and think about the needs of the project and its participants as a
whole, not just your own take on the thing. I remember a particularly clear
case of the opposite approach on the JavaCC project a few years ago (JavaCC is
an excellent parser generator that was one of the first available for Java and
is used in <a class="cow-url" href="http://gate.ac.uk/">GATE</a> for analysing JAPE transduction rules).
Along came a new developer with some good ideas and some useful code -- which
in principle was great news for everyone in the project. Unfortunately said
developer jumped in with both feet, screaming abusive nonsense at the project
administrators and demanding his own way at every juncture. The result? His
useful code was useless and unused. </p>
<p>Why can't we use code from people with whom it is impossible to communicate
and collaborate? Because, to paraphrase Stuart Brand, software is not
something you finish but something you start -- if it is good and useful then
it has a long life span, and during that span it changes and mutates and needs
active support and maintenance. If we accept code into our projects from
sources whose long-term commitment is questionable (and angry young men with
poor collaboration skills are unpredictable in that respect) then we
compromise the evolution of our systems (and sooner or later alienate our
users).</p>
<p>On a more positive note, if you want to join an open source project here are
some steps to help you start off on the right foot:</p>
<ul>
<li><b>Talk to the developers</b>. Communicate, communicate, communicate! I agree
absolutely with Sussman et al. that "A computer language is not just a way
of getting a computer to perform operations but rather that it is a novel
formal medium for expressing ideas about methodology. Thus, programs must be
written for people to read, and only incidentally for machines to execute",
but this doesn't mean that the only thing you need to write is code. Get in
touch with the developers as early in the process as possible, tell them
what you're working on or plan to look at, and ask for their advice. Very
often you'll get not just advice but active support, and the flip side is
that when you produce your contribution they'll know where it comes from and
be better able to judge its quality and its implications for the project as
a whole.</li>
<li>Get to know the mechanisms in place for <b>quality assurance</b> (and quality
control) and adopt them. If the project has a test suite you must at minimum
ensure that your code doesn't break tests, and you should think seriously
about writing new tests to cover all the things that you work on. Look at
the documentation and write patches to cover the stuff you do. Contribute to
discussions on the mailing list or forum in your area. Think hard about
backwards compatibility -- has that interface you just added a method to
been linked against by a thousand other jars out there, and is it really worth
forcing recompilation of all those systems? Don't just think about your own
little patch, think about the knock-on effects on the whole system and on
the ecosystem of users and developers around it.</li>
<li><b>Be humble</b>. The reason I can write this post on a fantastic Ubuntu
GNU/Linux system is that lots of people cleverer than me worked hard and
contributed their work for the good of humanity. Even if I've invented
something useful in my own little corner of computing that certainly doesn't
mean that I have the right to sneer at others, even if their knowledge is
less than mine in some area. Who knows what greatness their hearts and heads
contain? (We are all geniuses, it is part of being human. If you don't
believe me, try getting a computer to do image recognition like my
1-year-old daughter!)</li>
<li>Be prepared to <b>help developers</b> when you want something integrated into the
project. Most of the time your work will not be top of the todo-lists of the
people you're joining; most often you're adding work to their already full
plates, and you should be patient and helpful while they look at your work
and figure out whether to include it, or to work towards making you a
committer on the project.</li>
</ul>
<p>So: not rocket science, just basic collaboration skills. </p>
<p>One of the obvious things not to do is to start throwing around pejorative
terms. One of these that I find annoying is 'hack'.</p>
<p><blockquote>"Your work is just a hack! My approach is state of the art! Thou shalt
do it my way!"</blockquote></p>
<p>Oh dear.</p>
<p><b>First</b>, the users of software don't care. They choose one tool over another
because of what they can do with it, not because of the way it conforms or
otherwise to visions of elegance or correctness. Of course elegance and
correctness can be factors in software performance and maintainability and so
on -- but most often these qualities are subjective, particularly when applied
by newcomers unfamiliar with the big picture -- which is my next point...</p>
<p><b>Second</b>, such visions are personal, and especially so to outsiders. If you've
sweated over a specific problem (in this case transducing graphs with FSAs)
for years at a time then I'll listen to you tell me what is the most elegant
solution with interest -- but if you haven't, I'm inclined to assume that
there is likely to be stuff you don't know that may well compromise your view.</p>
<p><b>Third</b>, if we define 'hack' as a quick or heuristic route to get something
done, then why would that be a bad thing? And if you start down this route
where will you end? For example, the case that led to this post was in
relation to a system that does finite state pattern matching and transduction
over annotation graphs. Those of you with a background in formal languages may
have already spotted the strangeness here: graphs can describe languages whose
expressive power is greater than regular, which would seem to invalidate the
whole idea of applying finite state techniques to them. It turns out that the
data structures we're working with here (while doing information extraction
over human language) have a lot of regular features, and that the
indeterminacy that arises from the mismatch between regular FSAs and a
graph-structured input stream is not an obstacle to our work (in fact it
can sometimes be a good way to ignore ambiguities that we're not currently
interested in). But: doesn't that fit the definition of a hack? From that
point-of-view the whole subsystem under discussion is one big hack, which
makes it even more ridiculous to criticise one approach or another as a
"hack".</p>
<p>The <b>moral of the story</b>? Technology is never as important as it seems in our
commodity-driven age. Better communication and collaboration skills win every
time. The good news is that this is one of the best things about open source
:-)</p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/how-to-join-open-source-projects.html">Permalink.</a>
<a class="cow-url" href="http://computingtext.blogspot.com/2010/07/how-to-join-open-source-projects.html">On Blogspot.</a></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7946406017081923418.post-13137364764394387512010-06-05T10:19:00.000+01:002010-06-05T10:19:20.562+01:00Data mining won't make you safer<p>(Wednesday 31st December 2008)</p>
<p>This holiday I read a fantastic novel called <em>Little Brother</em> by
<a class="cow-url" href="http://craphound.com/">Cory Doctorow</a>. It reminded me of how UK and US moves
to collect more data about their citizens and give more powers to 'security'
staff are in fact <a class="cow-url" href="http://tinyurl.com/yd4moga">worse than useless</a>. As
someone who <a class="cow-url" href="http://gate.ac.uk/">works in language processing</a> I note with
dismay the tendency of technologists to happily provide mining of personal
data for state purposes, while cheerfully ignoring the fact that it <b>won't
make anyone any safer</b>.</p>
<p>There are many reasons why
<a class="cow-url" href="http://www.schneier.com/blog/archives/2008/01/security_vs_pri.html">invading
privacy is counter-productive</a>. Two important ones are:</p>
<ul>
<li>the information isn't useful</li>
<li>state power is almost always abused</li>
</ul>
<p>Why isn't the information useful? Imagine a method which is 99% successful at
detecting anomalous behaviour and suggesting further investigation. Let's
apply that method to 50 million adults in the UK, for example: the 1% error
rate alone flags 500,000 people who you now have to regard as suspects. In
fact the accuracy of data mining in this type of case is much more likely to
be around 50%, so if you collect all the data you can you'll still only know
that tens of millions of people might be suspect. Useless.</p>
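<p>The base-rate arithmetic behind that paragraph, with an assumed (and generously large) number of genuine suspects:</p>

```python
# Base-rate arithmetic: population from the post; the number of real
# suspects is an assumption for illustration.
population = 50_000_000
real_suspects = 500           # assumed
false_positive_rate = 0.01    # the "99% successful" detector

# Innocent people wrongly flagged, vs. real suspects flagged even if
# the detector catches every single one of them:
false_alarms = (population - real_suspects) * false_positive_rate
true_hits = real_suspects

print(int(false_alarms))  # 499995
# Fraction of flagged people who are actually suspects:
print(true_hits / (true_hits + false_alarms))
```

<p>Under one in a thousand of the people flagged is a real suspect; every investigation budget drowns in the false alarms long before it reaches them.</p>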
<p>Second, security service personnel are just like everyone else: some are
conscientious and some are unscrupulous. While you might just about consider
it acceptable for all your personal data to be in the hands of a
conscientious, competent, well-trained and well-provisioned state employee,
are we really naive enough to imagine that this covers everyone in every
police force, army barracks or 'intelligence' office? Of course not; and if we
were, the recent history of appalling
<a class="cow-url" href="http://www.unitedagainstinjustice.org.uk/">miscarriages of justice</a> should
soon convince us otherwise.</p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/data-mining-wont-make-you-safer.html">Permalink.</a>
</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7946406017081923418.post-90136809668678776622010-05-21T12:37:00.000+01:002010-05-21T12:37:27.851+01:00Open Data at the National Archives<p>The GATE team, <a class="cow-url" href="http://www.ontotext.com">Ontotext</a> and
<a class="cow-url" href="http://www.ssl.co.uk/">SSL</a> have won a contract to help open up the UK
<a class="cow-url" href="http://www.nationalarchives.gov.uk/">National Archives</a>' records of
<tt>.gov.uk</tt> websites (going back to 1997 and comprising some 340 million
pages).</p>
<p>I've been quite ignorant about this stuff until recently, and it has been a
pleasure to discover that the archives and related organisations are actively
pursuing the vision of open data and open knowledge. This project has taken a
big step forward in the UK recently, with government funding allocated to
publishing more and more material on <tt>data.gov.uk</tt> in open and accessible
forms. The battle is by no means over, but I'm really looking forward to
contributing in a small way to this work, and, hopefully, showing how GATE can
help improve access to large volumes of government data.</p>
<p>We're going to use GATE and Ontotext's open data systems (which hold the
largest subsets of Linked Open Data currently available with full reasoning
capabilities) to:</p>
<ol>
<li>import/store/index structured data in a scalable semantic repository
<ul>
<li>data relevant for the web archive</li>
<li>in an easy to manipulate form</li>
<li>using linked data principles</li>
<li>in the range of 10s of billions of facts</li>
</ul></li>
<li>make links from web archive documents into the structured data
<ul>
<li>over 100s of millions of documents and terabytes of plain text</li>
</ul></li>
<li>allow browsing/search/navigation
<ul>
<li>from the document space into the structured data space via semantic
annotation and vice versa</li>
<li>via a SPARQL endpoint</li>
<li>as linguistic annotation structures</li>
<li>as fulltext </li>
</ul></li>
<li>create scenarios with usage examples, stored queries</li>
<li>show TNA how to DIY more scenarios</li>
</ol>
<p>Quoting from the proposal,</p>
<p><blockquote> "Sophisticated and complex semantics can transform this archive... but the
real trick is to show how simple and straightforward mechanisms (delivered on
modest and fixed budgets) can add value and increase usage in the short and
medium terms. ... We aim to bring new methods of search, navigation and
information modelling to TNA and in doing so make the web archive a more
valuable and popular resource.</p>
<p>Our experience is that faceted and conceptual search over spaces such as
concept hierarchies, specialist terminologies, geography or time can
substantially increase the access routes into textual data and increase usage
accordingly."
</blockquote></p>
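<p>Faceted search of this kind is conceptually simple. A toy sketch (the facet fields and records are hypothetical, nothing from the actual TNA schema):</p>

```python
from collections import Counter

# Hypothetical archive records: id plus a few facet fields
docs = [
    {"id": 1, "concept": "health", "place": "Sheffield", "decade": "2000s"},
    {"id": 2, "concept": "health", "place": "London",    "decade": "1990s"},
    {"id": 3, "concept": "tax",    "place": "London",    "decade": "2000s"},
]

def facet_counts(docs, field):
    """The counts shown beside each facet value in the UI."""
    return Counter(d[field] for d in docs)

def drill_down(docs, **selected):
    """Restrict the result set to documents matching every chosen facet."""
    return [d for d in docs
            if all(d[f] == v for f, v in selected.items())]

print(facet_counts(docs, "place"))  # Counter({'London': 2, 'Sheffield': 1})
print([d["id"] for d in drill_down(docs, place="London", decade="2000s")])  # [3]
```

<p>The hard work is, of course, producing those facet values accurately from 340 million pages of text, which is where annotation quality and provenance come in.</p>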
<p>Text processing technology is inherently inaccurate (think of how often you
mis-hear or mis-understand part of a conversation, and then multiply that by
the number of times you've seen a computer do something stupid!); what can we
do to make this type of access trustworthy?</p>
<p><blockquote>
"Any archive of government publications is inherently a tool of democracy,
and any technology that we apply to such a tool must consider issues relating
to reliability of the information that users will be led to as a result, for
example:</p>
<ul>
<li>what is the provenance of the information used for structured search and
navigation? have there been commercial interests involved? have those
interests skewed the distribution of data, and if so how can we make this
explicit to the user?</li>
<li>what is the quality of the annotation? these methods are often less accurate
than human performance, and again we must make such inaccuracy a matter of
obvious record lest we negatively influence the fidelity of our navigational
idioms</li>
</ul>
<p>Therefore we will take pains to measure accuracy and record provenance, and
make these explicit for all new mechanisms added to the archive."
</blockquote></p>
<p>So open science (and our open source implementations of measurement tools in
GATE) will contribute to open data and open government.</p>
<p><a class="cow-url" href="http://computingtext.blogspot.com/2010/04/open-knowledge-linked-data-scruffy-vs.html">More open stuff</a>.</p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/open-data-at-the-national-archives.html">Permalink.</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7946406017081923418.post-79811689216023370922010-05-04T16:39:00.002+01:002010-05-04T16:39:52.120+01:00More GATE Products Coming<p>Several years ago we (<a class="cow-url" href="http://gate.ac.uk/">the GATE project</a>, that is, not
the royal "we" -- my knighthood seems to have got lost in the post for some
reason) reached the conclusion that the tools that we've built for developing
language processing components (GATE Developer) and deploying them as parts of
other applications (GATE Embedded) were only one part of the story of
successful semantic annotation projects. We like to think that our specialist
open source software and our user community are the best in the world in many
respects, but when helping people who are not specialists we encountered a
bunch of other perspectives and problems. We also came across some hard
problems of scalability and efficiency which led us to implement a completely
new system for annotation indexing (with thanks to Sebastiano Vigna and
<a class="cow-url" href="http://mg4j.dsi.unimi.it/">MG4J</a>).</p>
<p>So, cutting to the chase, we developed a bunch of new systems and tools,
partly with our <a class="cow-url" href="http://gate.ac.uk/partners.html">commercial partners</a>. We
did this largely behind closed doors (although we did run
<a class="cow-url" href="http://gate.ac.uk/sale/talks/sam/repositories-workshop-agenda.html">a
workshop on multiparadigm indexing</a> at which we got a lot of useful input),
partly because of our partners' requirements and partly because we wanted to
minimise our support load while we ironed out the bugs in the initial
versions.... which process has now run its course, and we're pleased to
announce the imminent availability of <b><a class="cow-url" href="http://gate.ac.uk/family/coming-soon">lots of new stuff</a></b>. Keep an eye on <a class="cow-url" href="http://gate.ac.uk/">GATE.ac.uk</a> over
the summer, as we'll be moving it all into our source repositories in advance
of our 6.0 release <a class="cow-url" href="http://gate.ac.uk/roadmap.html">in the autumn</a>.</p>
<p>Enjoy...</p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/more-gate-products-coming.html">Permalink.</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7946406017081923418.post-53990652361957343232010-04-27T13:50:00.001+01:002010-05-21T12:38:10.574+01:00Open Knowledge, Linked Data, Scruffy vs. Neat<p>
Last weekend I had the pleasure of attending this year's <a class="cow-url" href="http://okfn.org/okcon/">Open Knowledge Conference</a>. The list of good reasons for making government (and other) data open and for breaking down barriers to finding information on-line is longer than would fit in this post; one nice one that I hadn't heard before was <a class="cow-url" href="http://opendotdotdot.blogspot.com/">Glyn Moody</a>'s point that <a class="cow-url" href="http://en.wikipedia.org/wiki/Church_Turing_thesis">Turing equivalence</a> implies that there can be only one digital revolution, and that this in turn can prove the impossibility of preserving analogue bad habits like 'rights management'. At this point I should probably mention my employer's lawyers and what they'll do to you if you imply that I'm in favour of file sharing, but perhaps I'll just make do with a tired but accurate simile between the RIAA and those loveable old dinosaurs, dodos and other casualties of unsustainable lifestyle choices.</p>
<p>
There was a lot of other interesting stuff being presented, including a talk by <a class="cow-url" href="http://www.jenitennison.com/blog/">Jeni Tennison</a> on large-scale open data from government and her work at <a class="cow-url" href="http://data.gov.uk">data.gov.uk</a>. Jeni ended her talk by saying that we shouldn't worry about proliferation of redundant and (potentially) contradictory material -- after all, this is what has happened with the web and no animals were harmed in the making, etc.</p>
<p>
I like this point, and it chimes nicely with a move from "neat" to "scruffy" that we can observe around semantic technology in general and the semantic web in particular. The original vision published by Berners-Lee and others around a decade ago was very much inspired by Artificial Intelligence: your computer was going to book your dentist appointment on the right day to coincide with picking your mother up from the station, make sure the fridge was stocked with her favourite orange juice for later, and blah blah blah. Good stuff if you're a professor of computational logic looking for your next funding opportunity, but not really any nearer the horizon now than it was 10 (or 20 or 30) years ago.</p>
<p>
Thankfully we've mostly woken up again, and now things are boiling down to a more practical residue, which, to paraphrase a more recent comment by Berners-Lee, is "all about the data, stupid". And this brings us back to Jeni's talk -- if we can get all those public data silos opened up and usable in the right way this will be a huge leap forward, and the fact that it will not be universally nice and neat and dressed in a shiny new bow tie is neither here nor there. Scruffy is neat in its own way.</p>
<p>
A second thing that was interesting for me at this talk (and at <a class="cow-url" href="http://okfn.org/okcon/">OKCon more generally</a>) was the question of data vs. content. The focus of the discussion today was very much about data in spreadsheets, relational databases and so on, and this seems to be where current success is happening as more and more databases are being exported to variants of RDF. This must be good news for text analysis: looked at from an <a class="cow-url" href="http://gate.ac.uk/ie/">information extraction</a> point-of-view, linked open data is a rich source of domain terminology (seeds for our gazetteers) and conceptual backbones (seeds for our result templates, taxonomies and ontologies). The next wave, it seems to me, is to link the linked data to all that text that's lurking in the databases telling all sorts of interesting stories -- if only we could find them.</p>
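<p>To make the gazetteer-seeding idea concrete, here's a minimal sketch (in Python, over made-up example triples rather than real linked-data URIs) of harvesting labels from typed RDF-style triples to seed a gazetteer list:</p>

```python
# Toy (subject, predicate, object) triples in the spirit of linked open data.
# The ex: URIs below are illustrative, not entries from any real dataset.
triples = [
    ("ex:Sheffield", "rdf:type",   "ex:City"),
    ("ex:Sheffield", "rdfs:label", "Sheffield"),
    ("ex:Leeds",     "rdf:type",   "ex:City"),
    ("ex:Leeds",     "rdfs:label", "Leeds"),
    ("ex:Don",       "rdf:type",   "ex:River"),
    ("ex:Don",       "rdfs:label", "Don"),
]

def gazetteer_from_triples(triples, wanted_type):
    """Collect the labels of everything typed wanted_type: seed terms for a gazetteer."""
    typed = {s for s, p, o in triples if p == "rdf:type" and o == wanted_type}
    return sorted(o for s, p, o in triples if s in typed and p == "rdfs:label")

print(gazetteer_from_triples(triples, "ex:City"))  # ['Leeds', 'Sheffield']
```

<p>In practice the triples would come from a SPARQL endpoint or an RDF dump, but the principle stands: the types give you a conceptual backbone, and the labels give you lookup lists.</p>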
<p>
<a class="cow-url" href="http://gate.ac.uk/blogs/hamish/okcon.html">Permalink.</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7946406017081923418.post-68319771831213442002010-02-22T21:52:00.000+00:002010-02-22T21:53:35.871+00:00Google stole my ngrams<p>A while ago Dave Schubmehl of Fairview pointed me to
<a class="cow-url" href="http://www2.computer.org/portal/web/csdl/doi/10.1109/MIS.2009.36">a paper by
several prominent Googlers</a> which does a nice and clear job of summarising
some important lessons from the last decade of web analysis research. The
upshot is that if you have a billion examples of human behaviour that pertain
to your particular problem, a simple non-parametric word-count model is a good
bet for generalising from that behaviour.</p>
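<p>For the curious, the kind of model meant here really is as simple as it sounds. A toy sketch in Python, over an invented three-sentence "corpus" (real systems just do the same counting at web scale):</p>

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """The 'model' is nothing but bigram counts over the data."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Most frequent continuation seen in the data; None if the word is unseen."""
    continuations = counts.get(word.lower())
    return continuations.most_common(1)[0][0] if continuations else None

corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigrams(corpus)
print(predict_next(model, "the"))    # 'cat' -- seen most often after 'the'
print(predict_next(model, "zebra"))  # None -- no data, no opinion
```

<p>With a billion examples instead of three, the same counting yields surprisingly strong predictions -- which is exactly the paper's point.</p>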
<p>Absolutely true. This is, in fact, the main reason why Google was so
successful to start with: they realised that hyperlinks represent neatly
codified human knowledge and that learning search ranking from the links in
web pages is a great way to improve accuracy.</p>
<p>What do we do with the cases where we can't find a billion examples? Probably
we end up lashing together a model of the domain in a convenient schema
language (sorry, I mean "build an ontology"), grubbing up whatever domain
terminologies and so on that come to hand, and writing some regular expression
graph transducers to answer the question.</p>
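<p>By way of illustration (a hypothetical Python analogue, not GATE's actual JAPE transducer language), the gazetteer-plus-patterns recipe might look like this in miniature:</p>

```python
import re

# Hypothetical mini-gazetteer: whatever domain term lists came to hand.
GAZETTEER = {"aspirin": "Drug", "ibuprofen": "Drug", "headache": "Symptom"}

def annotate(text):
    """Tag gazetteer terms in the text as (start, end, label) annotations."""
    annotations = []
    for term, label in GAZETTEER.items():
        for m in re.finditer(r"\b%s\b" % re.escape(term), text, re.IGNORECASE):
            annotations.append((m.start(), m.end(), label))
    return sorted(annotations)

def drug_symptom_pairs(annotations):
    """A crude 'transducer' pattern: a Drug annotation followed later by a Symptom."""
    return [(a, b) for a in annotations for b in annotations
            if a[2] == "Drug" and b[2] == "Symptom" and b[0] > a[1]]

anns = annotate("Aspirin is often taken for a headache.")
print(drug_symptom_pairs(anns))  # one candidate Drug -> Symptom pair
```

<p>GATE itself would express the pattern as a JAPE rule over annotations rather than Python tuples, but the division of labour -- lookup lists first, patterns over the lookups second -- is the same.</p>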
<p>So: we're not trying to replace Google. We're not applicable to every problem
everyone has ever had with text ("Not every problem someone has with his
girlfriend is necessarily due to the capitalist mode of production" -- Herbert
Marcuse). But neither is Google going to pop round to your office next Tuesday
and help you build an ngram model of a couple of billion user queries from
their search logs to help you figure out why your customers hate the latest
product release.</p>
<p>There's not really a competition here: the approaches are orthogonal.</p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/google-stole-my-ngrams.html">Permalink.</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7946406017081923418.post-57062791782613656012010-02-12T16:19:00.001+00:002010-02-12T16:22:32.234+00:00Cloud Computing, GATE and Text Processing<p>When a new thing comes along in computing the first thing that happens is that
a small and exclusive set of nerds like me get all excited. If the excitement
seems likely to relate to the real world in any fashion that might actually
generate someone somewhere some money (or can be spun as something that might
do so) then the next thing that happens is that the marketing departments of
1001 IT corporations jump in with both feet and start generating acre after
acre of turgid prose about how their aged and creaking product line is
actually a prime example of Phenomenon X, the Bright New Thing of Computing.</p>
<p>So it has been with software "in the cloud", which is, it turns out, actually
quite a good idea in various ways (setting it apart from most new trends in
IT). What does the Cloud Computing commonly refer to (now that the sound and
fury of the marketing teams has had a chance to settle a little)? Three
things:</p>
<ul>
<li>software as a service (SaaS), for example Google Docs or SalesForce.com</li>
<li>platform as a service (PaaS), for example Google App Engine</li>
<li>infrastructure as a service (IaaS), for example Amazon Web Services and most
famously their Elastic Compute Cloud (EC2 -- which probably did most to
popularise the term in the recent period)</li>
</ul>
<p>These three now constitute the new wave: they are one of the main tracks that
Google is betting on (SaaS and PaaS), what Amazon continues to succeed with
(IaaS), and the grist for a hundred new startup mills (from specific
applications like searching US campus sites to infrastructural help for cloud
developers).</p>
<p>What does it have to do with GATE? IaaS is particularly well-suited to hosting
text processing, which is typically bursty in its computational cost and
therefore ill-suited to fixed infrastructure. SaaS is great for the provision
of large web applications that are complex to install and maintain (like
<a class="cow-url" href="http://gate.ac.uk/teamware/">GATE Teamware</a>). Hopefully this and other cloud
offerings will be available on <u>GATECloud.net</u> in the not-too-distant future... so
watch this space!</p>
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/cloud-computing.html">Permalink.</a></p>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-7946406017081923418.post-39642296745192423332010-02-09T20:05:00.000+00:002010-02-09T20:20:29.931+00:00Certifiable GATE gurus wanted.In <a class="cow-url" href="http://computingtext.blogspot.com/2010/02/i-love-gate-users-though-i-couldnt-eat.html">my previous post</a> I described how we came to start taking our user community more seriously again; in the first part of 2010 the effect of this turn has been that the world and her dog seem to be beating a path to our door with requests for technical support, training, bespoke development and/or access to our latest prototypes.
In fact it is proving difficult to keep up with demand, so: <b>if you're a GATE expert</b> how about <a href="http://gate.ac.uk/family/training.html">getting certified</a> and taking on some of the work with us? If you have a good knowledge of one or more parts of GATE (and/or related application domains), please <a href="http://gate.ac.uk/g8/contact/">get in touch</a>.
(We promise not to tell anyone that you're <a class="cow-url" href="http://tinyurl.com/yz9qphp">certifiable</a> :-) .)
<a class="cow-url" href="http://gate.ac.uk/blogs/hamish/certifiable-gurus.html">Permalink.</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-7946406017081923418.post-66435083239924650282010-02-09T19:57:00.001+00:002010-02-10T09:53:47.483+00:00I love GATE users (though I couldn't eat a whole one).<p>Users. A bit of a nuisance. They insist on asking questions, testing
limits, finding bugs. Around 5 years ago, after something like a
decade of giving away software, the GATE team felt much like our old
systems administrator, who had a habit of saying "the only secure
network is one without any computers attached": we knew that our user
community was a good idea in principle, but we really rather wished
they'd all leave us alone. In fact we did our best to discourage GATE
users: we stopped doing regular releases, we ignored the mailing list,
and if we could have figured out how to take the thing out
in the woods and bury it under a tree we probably would have.
<p>We failed: GATE refused to die, people obstinately continued to use
it, and, as we used it ourselves for all sorts of projects, more and
more features were added, quality and functionality improved, and
every time we decided it was all over someone would turn up with a
pile of cash and a novel problem. So we conceded defeat and resolved
to succeed. I think.
<p>This is all a long-winded way of explaining our shift in emphasis
over the past year or so: we are introverts no longer, but happy
and well-adjusted user-friendly liveware. Text processing for ever!
Forwards to world domination comrades! Oops, wrong blog.
<p>So now we're back to actively supporting our users and growing our
community. We've upgraded the documentation, we're running regular
training weeks and developer sprints, and we've built up
several new products and services around the core GATE code to cater
for more of the cases we've seen of people trying to deploy text
processing over the years (15 of which, incredibly, have passed under
the bridge since we first set metaphorical pen to digital paper for GATE
version 0.1). We've also revamped the website and no longer look like
something that might have been produced at CERN circa 1995.
<p>So far the response has been quite astonishingly positive... so perhaps
users aren't such a bad thing after all.
<p><a class="cow-url" href="http://gate.ac.uk/blogs/hamish/getting-used-to-users.html">Permalink.</a>Unknownnoreply@blogger.com0