Monday, February 28, 2011

Anticancer

(Summarising some of David Servan-Schreiber's book Anticancer, 2nd edition, 2011.)

Why are cancer rates increasing so quickly? What can we do to stop the spread, both as individuals and as a society?

Curing cancer: the long route

The World Health Organisation runs the world's biggest cancer epidemiology lab (IARC, in Lyon, France). They publish the standard reference work on carcinogenesis, and in recent years have done a lot of work on genetic and epigenetic factors in cancer. They found the first gene that increases the risk of lung cancer in smokers (see Nature, April 2008), for example.

Their work involves finding tiny needles in huge haystacks. The new generation of sequencing machines can process an entire genome within a week. Then billions of data points from populations of cancer sufferers must be correlated with environmental factors like smoking, diet, pollution, etc.

In their latest experiments IARC scientists are using GATE to adjust their statistical models based on previously published research results. In 2010 they found a gene that associates with head and neck cancer this way. The method can boost the productivity of cancer research by exploiting the gigabytes of published research, government studies and patent applications that cover cancer and its causes.
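The idea of using text-mined evidence to adjust a statistical model can be sketched very simply. The following is an illustrative toy, not IARC's actual method: a boost factor derived from literature co-occurrence scales the prior odds of a gene-disease association, which is then combined with (hypothetical) experimental evidence via Bayes' rule in odds form. All numbers and the boost function itself are invented for illustration.

```python
# Toy sketch: text-mined co-occurrence evidence boosts the prior odds of a
# gene-disease association before experimental data is factored in.
# The boost function and every number here are hypothetical.

def text_mining_boost(cooccurrence_count, total_papers):
    """Crude evidence boost: more co-mentions in the literature -> higher prior."""
    rate = cooccurrence_count / total_papers
    return 1.0 + 9.0 * min(rate * 100, 1.0)  # boost capped between 1x and 10x

def posterior_odds(prior_odds, likelihood_ratio, boost):
    """Bayes' rule in odds form, with the prior scaled by the text-mined boost."""
    return prior_odds * boost * likelihood_ratio

base_prior = 1 / 10000          # hypothetical base rate of true associations
boost = text_mining_boost(cooccurrence_count=50, total_papers=100000)
lr = 200.0                      # hypothetical likelihood ratio from lab data

odds = posterior_odds(base_prior, lr, boost)
probability = odds / (1 + odds)
print(f"boost={boost:.2f}, posterior probability={probability:.4f}")
```

The point of the boost is only to re-rank hypotheses so that scarce laboratory time is spent on the most promising candidates first.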

One of the best things for me about working with GATE in recent years has been this stuff on carcinogenesis with the genetic epidemiology team in Lyon. As part of the LarKC project we developed a simple text processing system to boost probabilities in their models of gene-disease association. It is quite a buzz to be able to say that we've contributed to finding a new proven susceptibility that associates with a particular genetic marker. Knowledge of this type of susceptibility promises to contribute significantly to the development of targeted pharmaceuticals for cancer treatments in the long term.

This is, though, only part of the story -- just as one of the main consequences of our increased understanding of genetics has demonstrated the need to understand the biological and environmental context of gene expression (or epigenetics), so an understanding of cancer must also lead beyond the purely pharmaceutical. Let's look at the other side of the coin: what is it that causes cancer in the first place?

Polar bears

As pure as the driven snow? Unfortunately not:

Polar bears live far from civilization... Yet of all the animals in the world, the polar bear is the most contaminated by toxic chemicals, to the point where its immune system and its reproductive capacities are threatened... The pollutants we pour into our rivers and streams all end up in the sea... The polar bear is at the top of the food chain that is contaminated from one end to the other... There is another mammal that reigns at the top of its food chain and its habitat is, moreover, distinctly less protected than the polar bear's: the human being.

And it is here that we find two huge causes of cancer: first, the food we eat, and second, the artificial pollutants that permeate both our environment and our food. As many as 85% of all cancers are caused by environmental factors (the food we eat, the air we breathe, the stresses and strains of modern life). For example, a large Danish study

... found that the genes of biological parents who died of cancer before fifty had no influence on an adoptee’s risk of developing cancer. On the other hand, death from cancer before the age of fifty of an adoptive parent (who passes on habits but not genes) increased the rate of mortality from cancer fivefold among the adoptees. This study shows that lifestyle is fundamentally involved in vulnerability to cancer. All research on cancer concurs: genetic factors contribute to at most 15% of mortalities from cancer.

In other words, there is clear potential for preventing cancer by adjusting our lifestyles, and the same applies to increasing our chances of surviving the disease once it is diagnosed. In the short term there is a growing body of work that can guide individuals, families and communities towards methods for decreasing their cancer risks and improving the prognosis, and Servan-Schreiber's book is an excellent summary.

In the longer term, we must change our modes of transport, our agriculture and our industrial processes, if we are serious about making a lasting difference to cancer rates.

Curing cancer 2: short cuts

It is in this sense that cures for cancer already exist: we don't need to wait for scientific miracles or technological breakthroughs (that may or may not come) -- we can prevent many cancers and remit many existing cancers by changing practices that we already understand very well (perhaps starting with your next meal!). It seems that

... by upsetting the balance in our diets we have created optimal conditions in our bodies for the development of cancer. If we accept that cancer growth is stimulated to a large extent by toxins from the environment, then in order to combat cancer, we have to begin by detoxifying what we eat. Facing this overwhelming body of evidence, here are simple recommendations to slow the spread of cancer:

  1. Eat sugar and white flour sparingly: replace them with agave nectar, acacia honey or coconut sugar for sweetening; multigrain flour for pastas and breads, or sourdough.
  2. Reduce consumption of red meat and avoid processed pork products. The World Cancer Research Fund recommends limiting consumption to no more than 500 g (18 oz) of red meat and pork products every week – in other words, at most four or five steaks. Their ideal recommended goal is 300 g (11 oz) or less.
  3. Avoid all hydrogenated vegetable fats – ‘trans fats’ – (also found in croissants and pastries that are not made with butter) and all animal fats loaded with omega-6s. Olive oil and canola oil are excellent vegetable fats that don’t promote inflammation. Butter (not margarine) and cheese that are well balanced in omega-3s may not contribute to inflammation either.

(Lower cost options are also discussed.)

Servan-Schreiber's book is one of those rare texts that combine a rigorous and detailed apprehension of the scientific literature with clear, simple and practical messages about how we can live better. If you read one book this year....!


Permalink. On Blogspot.

Thursday, February 24, 2011

Talk or Technology?

Talk or technology -- which is most expensive? Talk, it seems.

I spent the weekend with a friend of mine who runs one of the bigger semantics companies (he's a pedlar of used meanings, I like to tell him -- a kind of wholesale supplier of double entendres). He's very active in the community, and he follows the fate of all the other startups and joint ventures that have sprung up over the last decade or so, and the machinations of their customers, the tech-savvy media, the analyst firms and so on. Several months ago he told me a story about Corporation X (let's call them Turnip for no particularly good reason), Startup Y (let's call them Cabbage) and a certain popular text processing framework, which, if you're sitting comfortably, I shall relay for your general delectation and personal improvement.

Now, Turnip are a megacorp, biggest publisher of one sort or another, and supplier of diverse databases and data streams to the jobbing information worker. In common with pretty much every other publisher out there (except Cory Doctorow) Turnip can see the writing on the digitally revolutionary wall, and are casting around for ways to make their offerings more exciting than the competition. (Whether they can make their expensive, closed and stuffy stuff seem more attractive than the new free and open world is not something I'd bet the house on, but there you go.) One obvious route is to use text analysis to hook their text corpora up to conceptual models and bung the results into a semantic repository. Hey presto, all sorts of new and nifty search and browsing behaviours suddenly become possible. So publishers have been pretty keen customers of both the GATE team and my friend's company in recent years.

Turnip realised the importance of text processing in their collective future some time ago, and, after reporting work based on GATE up until a few years back, decided to take the function in-house. They bought Cabbage, one of the most active text analysis startups of the time. We assumed that they were going to use Cabbage tech to replace the stuff they'd done with GATE...

Fast forward to the present, and my friend was chatting to one of the people who run the publishing side of things at Turnip. Surprise surprise: the Cabbage stuff is nowhere to be seen, and they're still using the old and trusty Volkswagen Beetle of text processing.

Well who'd have thought it.

So, coming back to the question of talk vs. technology, we can conclude that the good people at Cabbage, who were big enough talkers to see themselves bought for a large chunk of readies by Turnip, had the right approach. Sad old technologists like me and mine just don't cut the mustard in the self-promotion stakes.

In fact, I've seen this pattern in a number of contexts. The people best at telling you about why you need them are generally too busy doing just that to really get to grips with all that inconvenient science and engineering that needs doing to actually make a practical difference. Therefore I have formulated Cunningham's Law: the quality of the work varies in inverse proportion to the quality of the slideware. (Next time you're unlucky enough to be bored to tears by one of my talks, please bear this in mind.)

To finish, a hint, free of charge, for those who have text processing problems to solve but would prefer not to spend large sums of money on cabbage and the like. You need open systems, you need to measure from the word go, and you need a process that incorporates robust mechanisms for task definition, quality assurance and control and system evolution. And you need a pool of available users and developers, training materials, etc. etc. So mosey on over to http://gate.ac.uk/ :-)

Permalink. On Blogspot.

Monday, January 17, 2011

Private Frazer was Right

Frazer was right!



(We're all doomed!)

(A note about climate change, the media and open science, January 2011.)

Last week NASA and others announced that 2010 was the joint hottest year on record. The announcement was almost universally ignored by the UK media. In wondering why that might be, several reasons come to mind:

First, the long-term subordination of media output to supporting the status quo (see much of Chomsky's work since Manufacturing Consent in the late 1980s; also more locally Edwards & Cromwell's Newspeak in the 21st Century). The status quo is, of course, dominated by oil (the 10 biggest companies in the world are often listed as 9 oil and car companies plus Walmart, owner of some of the world's largest car parks). Even more so we are dominated by profit: if it doesn't make a profit it isn't worth doing, no matter that this results in idiocy on a massive scale (from a market point-of-view, for example, it makes sense to ship all our manufactured goods from China, or to pay the bankers who caused our most recent crisis huge sums for their unproductive work, etc. etc. etc.).

This is, however, a general reason for the media to ignore climate change, and the NASA announcement about 2010 was actually quite widely reported around the world -- but not in the UK. A more specific and local reason can be found in Nick Davies' book Flat Earth News, which documents the severe reduction in the quantity of journalism (and of journalists) in the UK over the last 20 years (since Murdoch's relocation of his print operation from Fleet Street to Wapping). The majority of reporting is now supplied by organisations that aim for neutrality, not objectivity. (What's the difference? If two people report the progress of mowing a meadow and one says "we're finished" and the other "we haven't started", neutral reporting simply quotes both sides. Objective reporting goes and looks at how much grass is left. Clearly the latter is expensive and harder to make a profit at -- but the former is not journalism.)

Worse, more than 80% of the stories in our press have no journalistic oversight at all, let alone an objective appraisal. This is because they are the unmediated creations of Public Relations staff, either direct to the paper or via a press agency like PA, AP or Reuters -- and note that press agencies explicitly define themselves as neutral, not as objective investigators. The old role of investigative journalist has retrenched so far that it is now a rare exception.

So far, so depressing, but there's another reason that UK media sources are using to ignore climate change at present, and that is the aftermath of the Climategate scandal that began in November 2009. It was sad to see the outpouring of unqualified censure and obfuscation that greeted the selective publication of a few emails between a few climate scientists that had been stolen from their hard drives by hostile critics. Several enquiries have since exonerated the scientists concerned and restated the underlying strength of their argument, but nonetheless a good deal of damage has been done and our chances of avoiding the worst of the risks that face us are lessened as a result.

The attack was, of course, disingenuous (and most reminiscent of Big Tobacco's tactics with respect to lung cancer research) but the ammunition was also too freely available, and that brings us to the connection between climate change and the subject of this blog -- which is at least loosely focussed on information management, text processing and the like.

One of the contributing factors to the Climategate fiasco is a mismatch between technological capabilities and research practice. Scientists are habituated to a model where artefacts such as their intermediate results, computational tools and data sets are both transient and private. Repeatability, the cornerstone of empirical methods, is most often addressed by publication in peer-reviewed journals, but not by reuse of open data and resources. It is this culture that has proved vulnerable to vexatious freedom of information requests from climate change deniers. It is also a culture which is non-optimal with respect to openness and the efficient dissemination of scientific knowledge equally across the globe.

This is not to say that all experimental and modelling data can become open overnight -- but information management systems that support sharing between scientists can be built in ways that facilitate greater openness, traceability and data longevity.

To cut a long story short, open science is an idea whose time has come, and the question now is not if but when: how rapidly we will shift, how efficient the results will be, and what the experiences of individual scientists will be. The battle isn't over, of course; last year I went to a talk by James Boyle, one of the founders of Creative Commons and now Science Commons, and he showed very clearly how "to make the web work for science is illegal" -- the mechanisms that work so well for on-line shopping or social networking are prevented from working for scientists by restrictive publishing contracts and so on. But, as Glyn Moody points out, Turing's results imply the atomicity of the digital revolution, and its consequences are that the genie is now so far out of the bottle that all our human achievements will follow into the realm of openness and cooperative enterprise sooner or later.

How can we encourage openness in climate science, and reduce exposure to climate change deniers?

The technology we need falls into three categories:

  • Cloud computing and virtualisation. Server usage optimisation and scalability via cloud computing now make it possible to address problems at the scale of every scientific research department in the UK (for example) within existing infrastructure budgets. Virtualisation makes it possible to store not just data but the entire compute platform used for particular experiments or analyses.
  • Distributed version control repositories. Repository systems that are commonplace among software engineers (e.g. Bazaar, Git or Mercurial) provide a large part of the answer for storing and versioning the data sets generated during collaborative research. They need to be integrated with on-line collaboration tools to make their use easier and more intuitive.
  • Open search infrastructure. Findability is a key criterion upon which scientists base their evaluation of computational tools. Open source search engines are mature enough to perform very well when properly configured, and techniques exist for adapting them to non-textual data. The desktop metadata gathering facilities now available in e.g. KDE add exciting new possibilities, for example queries like "show me all the email I wrote around the time I edited the journal paper on tree rings".
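The last of those queries is easy to sketch. The toy below filters a (hypothetical, in-memory) list of emails by a time window around a reference file's timestamp; a real desktop search system such as KDE's metadata store would gather and index these timestamps automatically.

```python
# Toy version of "show me all the email I wrote around the time I edited the
# journal paper": filter items by a time window around a reference timestamp.
# The email list and dates are invented for illustration.

from datetime import datetime, timedelta

emails = [
    {"subject": "Re: tree ring data", "sent": datetime(2011, 1, 10)},
    {"subject": "Seminar schedule",   "sent": datetime(2010, 6, 2)},
    {"subject": "Draft figures",      "sent": datetime(2011, 1, 14)},
]

paper_edited = datetime(2011, 1, 12)   # the journal paper's mtime, say

def around(reference, items, key, days=7):
    """Return items whose timestamp falls within +/- `days` of `reference`."""
    window = timedelta(days=days)
    return [i for i in items if abs(i[key] - reference) <= window]

for email in around(paper_edited, emails, key="sent"):
    print(email["subject"])   # prints the two January emails
```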

Of course technology is only part of the picture, and has to be coupled with intervention at the cultural and organisational levels. The message of open knowledge and open data is becoming a powerful meme which can be exploited to promote new technologies and help change culture (and in doing so increase the effectiveness of climate scientists and decrease the power of climate change deniers).

Scientists are most often motivated by desire to do their work, and not very often by ticking the boxes that research assessment exercises demand, so if we can show a route to replacing "publish or perish" with "research and flourish" we can gain a lot of mindshare.

To conclude, the best hope for our collective future lies in cooperation, and after all that is the great strength of our species. Ursula le Guin makes this point very clearly:

"The law of evolution is that the strongest survives!"
"Yes; and the strongest, in the existence of any social species, are those who are most social. In human terms, most ethical. ... There is no strength to be gained from hurting one another. Only weakness."

Mechanisms for open discussion and consensus building in science can translate into mechanisms for promoting democracy and cooperation, and help light the path to a better world.

Permalink.

Friday, December 3, 2010

Harvard's Selection Process and UK Research "Careers"

One of the points that Malcolm Gladwell makes in his beautiful book Outliers is that selections made in face of over-abundance are likely to be random. He cites a study of Harvard undergraduate admissions which shows that there is a large element of chance involved -- the point being that where there is a surfeit of excellence on offer (and the queue of young hopefuls at Harvard's door probably qualifies) it is pretty meaningless to try and select the "best" using anything more high-tech than the toss of a coin.

There's an analogous randomness in the fate of UK research staff (that is, those staff employed only to do research, as opposed to academic faculty members whose remit includes research, teaching and related administrative tasks). These staff are most often known as research assistants (RAs -- a term that gives a clue as to their general status within our Universities).

The custom and practice of RA employment arose in a time when the ratio of research volume to faculty sizes was a lot lower, and it made perfect sense in that context for the position to be a staging post between postgraduate research and faculty jobs. Indeed a common longer form of the term is postdoctoral RA, and in previous periods there was a reasonably strong expectation that this was the final fence to jump on the way to the academic finishing line. There was, in other words, a built-in assumption that being an RA was just as temporary as the state of being a PhD student, for example.

Fast forward to the present (or even to 10 years ago, in fact), and there is now a significant problem with this picture: there are too many RAs for them to ever make the conversion to faculty. In my own department, for example, it is not uncommon for the number of RAs to be double that of faculty, and unless the rate of retirement of the latter leaps into the stratosphere (not impossible, I admit, given that our pensions and the HEFCE funding backbone are currently ConDemned) then most research staff can hold little hope of ever joining the grownups.

Why does this matter? Isn't this even a good thing, given that we want to select only the most committed to take responsibility for the future of research and of degree-level teaching? Mr. Gladwell's Harvard tale would tend to indicate otherwise. UK research is certainly in the world elite, consistently over-achieving relative to its size over a good range of metrics. We are not separating off the cream as much as taking a random sample, and, as modern employment law makes clear, any practice which leads to comparable employees getting different treatment for no good reason is illegitimate. Further, there is no need to claim that we are all of Harvard class to make the point: it is sufficient that a researcher is productive in their field (with all the implications of specialist knowledge and long years of training that this implies) for the waste involved in treating them as casual staff to be clear.

Above and beyond this point there are several other negative outcomes. The current system is:

  • Divisive. In my experience RAs don't usually feel an integral part of the departments and universities in which they work, and as a consequence their commitment to those organisations is often low.
  • Inefficient. Over a three year project an RA often spends the first year learning the job, the second year being productive, and the third year looking for another job.
  • Discouraging. In common with many of my colleagues, I spent 20 years on short-term contracts. If I hadn't been lucky enough to graduate to an open-ended contract I probably would be planning a move out of academia, and given that even now my funding is contingent on continued success within the shifting sands of the research funding agencies, I still don't feel secure.

The key point is that RAs are not temporary: as long as the volume of research being done is greater than the capacity of academic faculty we're here to stay, and in large numbers. This means that a system predicated on employment insecurity is no longer appropriate, and indeed commentators of all shapes and sizes (including former Sheffield VC Gareth Roberts) have advocated radical change of one sort or another.

What are the options? Changing the big picture of research careers requires intervention at a national level, but there are several local measures that can start to change the culture and help make Universities more attractive for contract research staff:

  • Making employment contracts open-ended. This doesn't magically improve job security but it does send out positive signals about our support for research as a career (and also means that responsibility for triggering redundancy moves from HR to the departments, increasing the likelihood of departments taking the issue seriously).
  • Setting up a buffer fund for bridging between research projects. This will necessarily be small-scale to begin with, but can serve as part of our arguments for wider changes in funding structures.
  • Shifting terminology away from "assistant" or "postdoc" and towards "professional researcher" and encouraging funding applications and other career development steps for contract staff.

For a longer version of this list see this discussion paper from Sheffield UCU (which also has links to related documents including the Roberts report). A good summary of the issues from a principal investigator perspective is available on the national UCU site. Time for a change?

Permalink.

Wednesday, October 20, 2010

More Clouding

When you plug your fridge into the mains electricity supply you don't worry about all the technology sitting behind the wall socket -- it just works. Cloud computing is starting to supply IT in a similar fashion. No more worrying about backups, no more wasted hours configuring a new or repaired machine -- just plug into the network, fire up your web browser and away you go.

Researchers have tougher and more specialised IT needs than most, so to realise the same ease of use that the cloud now provides for email or word processing requires work in several areas. One of these areas is to adapt existing established research tools to the cloud, and that is what we plan to do with GATE in the next period. Over the last decade GATE has become a world leader for research and development of text mining algorithms.

Text has become a more and more important communication method in recent decades. Our children's thumbs often spend half the day typing on their tiny phone keypads; our evenings often include sessions on Facebook or writing email to distant friends and relatives. When we interact with the corporations and governmental organisations whose infrastructure and services underpin our daily lives we fill in forms or write emails. When we want to publicise our work for our employer or share details of our leisure activities with a wider audience we create websites, post Twitter messages or make blog entries. Scientists also now use these channels in their work, in addition to publishing in peer-reviewed journals -- a process which has also seen a huge expansion in recent years.

This avalanche of the written word has changed many things, not least the way that scientists gather information from the experiences of their peers. For example, a team at the World Health Organisation's cancer research agency recently found the first evidence of a link between a particular genetic mutation and the risk of lung cancer in smokers. Their experiments require large amounts of costly laboratory time to verify or falsify hypotheses based on samples of mutations in gene sequences from their test subjects. Text mining from previous publications makes it possible for them to reduce this lab time by factoring in probabilities based on association strengths between mutations, environmental factors and active chemicals.

A second area that has been revolutionised by the new world of text concerns a core function that commercial concerns must implement in order to stay in business. Customer relations and market research are no longer just about monitoring the goings-on of the corporate call centre. Keeping up to date with the public image of your products or services now means coping with the Twitter firehose (45 million posts per day), the comment sections of consumer review sites, or the point-and-click 'contact us' forms from the company website. To do this by hand is now impossible in the general case: the data volume long ago outstripped the possibility of cost-effective manual monitoring. Text mining provides alternative, automatic methods for dealing with this data.
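A minimal sketch shows why automation wins at this volume. Real systems (GATE pipelines among them) use proper linguistic analysis; this hypothetical keyword matcher only illustrates the shape of the task: spot the mentions, triage the complaints.

```python
# Toy brand-monitoring triage: count product mentions in a stream of posts and
# flag those containing complaint vocabulary. The word list, posts and product
# name are all invented; a real pipeline would use linguistic analysis, not
# bag-of-words matching.

COMPLAINT_WORDS = {"broken", "refund", "terrible", "waste"}

def triage(posts, product):
    mentions, complaints = 0, []
    for post in posts:
        words = set(post.lower().split())
        if product in words:
            mentions += 1
            if words & COMPLAINT_WORDS:   # any complaint vocabulary present?
                complaints.append(post)
    return mentions, complaints

posts = [
    "My new widget is terrible and broken",
    "Loving the widget so far!",
    "Anyone tried the gadget?",
]
mentions, complaints = triage(posts, "widget")
print(f"{mentions} mentions, {len(complaints)} complaints")  # 2 mentions, 1 complaint
```

Scaled to millions of posts per day, even this crude filter beats any manual process; the interesting research is in making the analysis linguistically accurate rather than merely fast.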

GATE provides four core systems to support scientists experimenting with new text mining algorithms and developers using text mining in their applications:

  • GATE Developer: an integrated development environment for language processing components
  • GATE Embedded: an object library optimised for inclusion in diverse applications
  • GATE Teamware: a collaborative annotation environment for high volume factory-style semantic annotation projects built around a workflow engine
  • GATE Mímir: (Multi-paradigm Information Management Index and Repository) a massively scalable multi-paradigm index

Our plan for the next period is to work towards making use of these systems more like electric sockets and fridges!

A caveat: it is important to note that current commercial cloud offerings are not yet appropriate as a drop-in replacement for all academic computing facilities. For example, the cost of running a virtual machine on Amazon's EC2 continuously for 1 year is roughly equivalent to the cost of buying a similar machine. In the latter case the hardware may be expected to perform reliably for at least 3 years, which means that the Amazon option is only cost effective if the cost of hosting a server in your organisation is on the order of 3 times the cost of the server hardware. Careful quantification of the costs is important when moving to the cloud.

See also this previous post on cloud computing.

Permalink. On Blogspot.

Friday, July 9, 2010

How to Join Open Source Projects

This is rather tired ground that has been well trodden by other feet, but in the aftermath of a disagreement with one of the happy chappies from Ontotext I thought I'd reiterate a couple of home truths about open source projects and how you go about joining in. Along the way I'll also ask what "hack" means, for the benefit of those software people who've been locked in a small room without access to books or networks for the last couple of decades :-)

To start with something that should be obvious, all engineering projects of whatever type are social processes in which human factors are at least as important as technical ones. In open source this is often more important than in other areas because the people involved often give their time and expertise for free, and even when they're being paid specifically to participate there is almost always a discretionary element of their contribution (should I bother to answer this email from a complete beginner who obviously hasn't managed to find the user manual, or shall I finish work ten minutes early today?). This means that when you want to join an open source project (i.e. to become a developer, contribute code etc.) you need to show a little sensitivity and think about the needs of the project and its participants as a whole, not just your own take on the thing. I remember a particularly clear case of the opposite approach on the JavaCC project a few years ago (JavaCC is an excellent parser generator that was one of the first available for Java and is used in GATE for analysing JAPE transduction rules). Along came a new developer with some good ideas and some useful code -- which in principle was great news for everyone in the project. Unfortunately said developer jumped in with both feet, screaming abusive nonsense at the project administrators and demanding his own way at every juncture. The result? His useful code was useless and unused.

Why can't we use code from people with whom it is impossible to communicate and collaborate? Because, to paraphrase Stuart Brand, software is not something you finish but something you start -- if it is good and useful then it has a long life span, and during that span it changes and mutates and needs active support and maintenance. If we accept code into our projects from sources whose long-term commitment is questionable (and angry young men with poor collaboration skills are unpredictable in that respect) then we compromise the evolution of our systems (and sooner or later alienate our users).

On a more positive note, if you want to join an open source project here are some steps to help you start off on the right foot:

  • Talk to the developers. Communicate, communicate, communicate! I agree absolutely with Sussman et al. that "A computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute", but this doesn't mean that the only thing you need to write is code. Get in touch with the developers as early in the process as possible, tell them what you're working on or plan to look at, and ask for their advice. Very often you'll get not just advice but active support, and the flip side is that when you produce your contribution they'll know where it comes from and be better able to judge its quality and its implications for the project as a whole.
  • Get to know the mechanisms in place for quality assurance (and quality control) and adopt them. If the project has a test suite you must at a minimum ensure that your code doesn't break tests, and you should think seriously about writing new tests to cover all the things that you work on. Look at the documentation and write patches to cover the stuff you do. Contribute to discussions on the mailing list or forum in your area. Think hard about backwards compatibility -- has that interface you just added a method to been linked against by a thousand other jars out there, and is it really worth forcing recompilation of all those systems? Don't just think about your own little patch, think about the knock-on effects on the whole system and on the ecosystem of users and developers around it.
  • Be humble. The reason I can write this post on a fantastic Ubuntu (a Debian-based GNU/Linux) system is because lots of people cleverer than me worked hard and contributed their work for the good of humanity. Even if I've invented something useful in my own little corner of computing that certainly doesn't mean that I have the right to sneer at others, even if their knowledge is less than mine in some area. Who knows what greatness their hearts and heads contain? (We are all geniuses, it is part of being human. If you don't believe me, try getting a computer to do image recognition like my 1-year-old daughter!)
  • Be prepared to help developers when you want something integrated into the project. Most of the time your work will not be top of the to-do lists of the people you're joining; most often you're adding work to their already full plates, and you should be patient and helpful while they look at your work and figure out whether to include it, or to work towards making you a committer on the project.
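
The test-suite advice above can be made concrete with a small, entirely hypothetical example -- `normalise_whitespace` is an invented utility, not from GATE or any real project. The point is simply that a contributed patch and its regression tests should travel together:

```python
# Hypothetical contribution plus the tests that should accompany it.
# None of these names come from a real project.

import unittest

def normalise_whitespace(text):
    """The (imaginary) contributed utility: collapse runs of
    whitespace to single spaces and trim both ends."""
    return " ".join(text.split())

class NormaliseWhitespaceTest(unittest.TestCase):
    def test_collapses_runs(self):
        self.assertEqual(normalise_whitespace("a  b\t c"), "a b c")

    def test_trims_ends(self):
        self.assertEqual(normalise_whitespace("  hello  "), "hello")

    def test_empty_input(self):
        # Edge case: existing behaviour must not regress.
        self.assertEqual(normalise_whitespace(""), "")
```

Run with `python -m unittest` in the usual way; the maintainers then get evidence of correctness along with the code itself.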

So: not rocket science, just basic collaboration skills.

One of the obvious things not to do is to start throwing around pejorative terms. One of these that I find annoying is 'hack'.

"Your work is just a hack! My approach is state of the art! Thou shalt do it my way!"

Oh dear.

First, the users of software don't care. They choose one tool over another because of what they can do with it, not because of the way it conforms or otherwise to visions of elegance or correctness. Of course elegance and correctness can be factors in software performance and maintainability and so on -- but most often these qualities are subjective, particularly when applied by newcomers unfamiliar with the big picture -- which is my next point...

Second, such visions are personal, and especially so to outsiders. If you've sweated over a specific problem (in this case transducing graphs with FSAs) for years at a time then I'll listen to you tell me what is the most elegant solution with interest -- but if you haven't, I'm inclined to assume that there is likely to be stuff you don't know that may well compromise your view.

Third, if we define 'hack' as a quick or heuristic route to get something done, then why would that be a bad thing? And if you start down this route where will you end? For example, the case that led to this post was in relation to a system that does finite state pattern matching and transduction over annotation graphs. Those of you with a background in formal languages may have already spotted the strangeness here: graphs can describe languages whose expressive power is greater than regular, which would seem to invalidate the whole idea of applying finite state techniques to them. It turns out that the data structures we're working with here (while doing information extraction over human language) have a lot of regular features, and that the indeterminacy that arises from the mismatch between regular FSAs and a graph-structured input stream is not an obstacle to our work (in fact it can sometimes be a good way to ignore ambiguities that we're not currently interested in). But: doesn't that fit the definition of a hack? From that point of view the whole subsystem under discussion is one big hack, which makes it even more ridiculous to criticise one approach or another as a "hack".
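
The idea of running a finite state matcher over an annotation graph can be sketched in a few lines. This is a toy illustration, not GATE's JAPE engine; the annotation types and offsets are invented. Nodes are text offsets, edges are typed annotations, and because several annotations can span the same offsets the "input stream" is a graph rather than a string -- the matcher simply explores all outgoing edges, which is exactly where the indeterminacy discussed above comes from:

```python
# Toy finite state matching over an annotation graph (illustrative only).
# graph: {node_offset: [(annotation_type, end_offset), ...]}

def match(graph, start, pattern):
    """Return the set of end nodes reachable from `start` by consuming
    `pattern` (a sequence of annotation types) one edge at a time."""
    states = {start}
    for wanted in pattern:
        # Follow every outgoing edge whose type matches the next
        # pattern element; multiple edges may match, so `states`
        # can grow -- that is the nondeterminism.
        states = {end
                  for node in states
                  for (ann_type, end) in graph.get(node, [])
                  if ann_type == wanted}
        if not states:
            return set()
    return states

# Annotations over "Dr Smith": both a Token and a Title cover 0-2,
# both a Token and a Person cover 2-8.
graph = {
    0: [("Token", 2), ("Title", 2)],
    2: [("Token", 8), ("Person", 8)],
}

print(match(graph, 0, ["Title", "Person"]))  # {8}
print(match(graph, 0, ["Title", "Token"]))   # {8} -- ambiguity tolerated
```

The overlapping `Token`/`Title` edges are precisely the kind of ambiguity the prose mentions: a pattern that doesn't care about the distinction simply ignores it.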

The moral of the story? Technology is never as important as it seems in our commodity-driven age. Better communication and collaboration skills win every time. The good news is that this is one of the best things about open source :-)

Permalink. On Blogspot.

Saturday, June 5, 2010

Data mining won't make you safer

(Wednesday 31st December 2008)

This holiday I read a fantastic novel called Little Brother by Cory Doctorow. It reminded me of how UK and US moves to collect more data about their citizens and give more powers to 'security' staff are in fact worse than useless. As someone who works in language processing I note with dismay the tendency of technologists to happily provide mining of personal data for state purposes, while cheerfully ignoring the fact that it won't make anyone any safer.

There are many reasons why invading privacy is counter-productive. Two important ones are:

  • the information isn't useful
  • state power is almost always abused

Why isn't the information useful? Imagine a screening method that is 99% accurate -- that is, it wrongly flags only 1% of innocent people as suspicious. Apply that method to the roughly 50 million adults in the UK and you get around 500,000 false positives: half a million people you now have to regard as suspects in order to find a handful of genuine targets. In practice the accuracy of data mining in this type of case is much more likely to be around 50%, so even if you collect all the data you can, you'll still only know that tens of millions of people might be suspect. Useless.
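
The arithmetic behind that claim is worth making explicit. The figures below are illustrative assumptions, not official statistics -- in particular the number of genuine targets is an invented round number:

```python
# Base rate arithmetic for population-wide screening (illustrative
# numbers only): even a detector with a 1% false-positive rate,
# applied to everyone, swamps investigators with innocent "suspects".

population = 50_000_000   # roughly the UK adult population
targets = 100             # invented figure -- genuine targets are rare

def flagged(false_positive_rate, true_positive_rate=1.0):
    """Return (innocent people flagged, genuine targets flagged)."""
    innocents = (population - targets) * false_positive_rate
    caught = targets * true_positive_rate
    return int(innocents), int(caught)

print(flagged(0.01))  # roughly 500,000 innocents flagged vs 100 targets
print(flagged(0.50))  # roughly 25,000,000 flagged -- utterly useless
```

Notice that even the optimistic 99% case buries each real target under about 5,000 false positives; the realistic 50% case flags half the country.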

Second, security service personnel are just like everyone else: some are conscientious and some are unscrupulous. While you might just about consider it acceptable for all your personal data to be in the hands of a conscientious, competent, well-trained and well-provisioned state employee, are we really naive enough to imagine that this covers everyone in every police force, army barracks or 'intelligence' office? Of course not; and if we were, the recent history of appalling miscarriages of justice should soon convince us otherwise.

Permalink.
