Not too long ago, scientists could analyse their results with a few charts and graphs; more recently, standard commercial databases have been used. But new processes in biological research now deliver such a vast amount of data that a whole discipline – bioinformatics – has evolved to handle the information.
Much of this flood of data is coming from researchers using DNA microarrays for gene expression profiling experiments in which the expression levels of thousands of genes are simultaneously monitored to observe changes caused by disease or the effects of treatments.
A single array may produce 20,000 data objects, each updated over a period of time. Then the whole experiment might be repeated with a different cell, a different organism or under changed conditions. A large commercial database may also have 20,000 fields, one for each product the company stocks, but it is not expected to update them all in seconds, nor to compare them with similarly huge data sets acquired in parallel experiments to derive new information.
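To give a feel for the scale, here is a minimal sketch of a first-pass analysis on data of this shape. The expression values are synthetic (randomly generated, not real microarray output), and the two-fold cut-off is just a common rule of thumb, not a method from the article:

```python
import math
import random

random.seed(0)

# Synthetic expression matrix: 20,000 genes measured under two
# conditions (control vs. treated), one intensity value per gene.
genes = [f"gene_{i:05d}" for i in range(20_000)]
control = {g: random.lognormvariate(5, 1) for g in genes}
treated = {g: random.lognormvariate(5, 1) for g in genes}

def differential(control, treated, threshold=1.0):
    """Flag genes whose log2 fold change exceeds the threshold
    (threshold=1.0 means a two-fold change in either direction)."""
    hits = {}
    for g in control:
        lfc = math.log2(treated[g] / control[g])
        if abs(lfc) >= threshold:
            hits[g] = lfc
    return hits

hits = differential(control, treated)
print(f"{len(hits)} of {len(genes)} genes changed more than two-fold")
```

Even this toy version makes the point: every repeat of the experiment produces another 20,000-row table that has to be compared against all the others.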
Bioinformatics departments in universities mostly work at the interface between biology and computer science, using techniques developed by computer scientists to handle the data generated by biologists. The traffic isn't all one way, though: biology is now making demands that IT researchers never expected, forcing computer science to develop in new directions too.
Developing new software in order to carry out research is time consuming and expensive, so at the level of basic research it makes sense to share techniques. Open source software is useful here, since it is available free to any researcher, who can develop it as necessary to carry out the work and make the new version available to other researchers. Even Microsoft, not known for its love of the open source philosophy, has joined in by creating the Microsoft Biology Foundation (MBF), which provides free open-source tools to bioinformatics researchers.
MBF is a programming-language-neutral bioinformatics toolkit built as an extension to the Microsoft .NET Framework, serving as a library of commonly used bioinformatics functions. Speaking at the annual eScience Workshop at Berkeley in October last year, where version 1 of MBF was released, Tony Hey, corporate vice president of Microsoft External Research, said:
“Biologists face a number of issues today, such as detecting correlations between human genome sequencing or identifying the likelihood for a patient to develop a certain disease. The MBF aims to provide healthcare research facilities with the tools needed to help scientists advance their research and ensure data accuracy.”
The Informatics Group at Johnson & Johnson Pharmaceutical Research and Development has used MBF to extend its Advanced Biological & Chemical Discovery informatics platform to seamlessly integrate small and large molecule discovery data.
“The bioinformatics features and functionality within the MBF equipped us with pre-existing functions so we didn’t have to re-invent the wheel,” said Jeremy Kolpak, senior analyst at Johnson & Johnson Pharmaceutical Research and Development. “Ultimately, it saved us a tremendous amount of time, allowing us to focus on the development of higher-level analysis and visualisation capabilities, and delivering them faster to our scientists, thus improving their ability to make data-driven discoveries and critical diagnoses.”
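Kolpak's point about not re-inventing the wheel is easiest to see with an example of the kind of routine sequence utility such a toolkit bundles. The sketch below is illustrative Python, not MBF's actual .NET API – the function names and behaviour are this article's assumptions about a typical library, nothing more:

```python
# Two of the small "wheels" a bioinformatics toolkit saves every team
# from re-inventing: reverse-complementing DNA and measuring GC content.
# (Illustrative sketch only - not MBF's actual API.)

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def gc_content(seq):
    """Fraction of bases that are G or C - a routine sequence metric."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

print(reverse_complement("ATGCGT"))       # ACGCAT
print(round(gc_content("ATGCGT"), 2))     # 0.5
```

Trivial individually, but a validated, shared implementation of hundreds of such functions is exactly what lets a team move straight to higher-level analysis and visualisation.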
Stitching a web of data
Andrew Brass, professor of bioinformatics at the University of Manchester, one of the UK's leading centres for this work, explains that the techniques are being developed in conjunction with researchers around the world. At this stage they are refining their techniques whilst carrying out valuable research into animal health; when the research methods are fully proven they will be able to apply them to human health issues.
A current project on sleeping sickness (trypanosomiasis) in African cattle brought together Kenyan experts in cattle genetics and cattle health, geneticists from The Roslin Institute (University of Edinburgh) and from Australia, and cattle breeders from Ireland. All of these groups produced a vast amount of data.
Carole Goble, Professor of Computer Science at Manchester, explains that researchers have around a thousand databases of biological research and hundreds of different analysis and visualisation tools to work with. In her words, bioinformatics is about ‘stitching them all together’. Doing that by hand is very time consuming and can lead to errors.
Given an overwhelming amount of data, it is human nature to look for a means of simplifying it. A common response is to adopt a ‘hypothesis driven’ approach, focusing on the parts of the data already cited in previous research or associated with similar conditions, when a thorough analysis of all the data might reveal something completely new. There is also the issue that much of this data comes from other active projects, and is subject to change.
Various web-based tools and informal data-sharing links between researchers had been set up in the past, but the Taverna project, led by Prof. Goble, now provides a much more structured way of comparing disparate information. Taverna allows researchers to define ‘workflows’ – the methodologies for carrying out the research – which can be saved, re-run and edited as necessary.
In the case of the Kenyan cattle research, Paul Fisher at Manchester used data from two genetic data sets to determine which genes might impart resistance to trypanosomiasis.
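The general idea of cross-referencing two genetic data sets can be sketched very simply. The gene names and sets below are invented for illustration, and this is not Fisher's actual analysis – just the shape of it:

```python
# Hypothetical candidates from a genetic mapping study of resistant
# versus susceptible cattle (invented names, illustrative only).
mapping_candidates = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}

# Hypothetical genes found differentially expressed in a second,
# independent data set from infected animals.
expression_hits = {"GENE_B", "GENE_D", "GENE_E", "GENE_F"}

# Genes supported by both lines of evidence make the strongest
# candidates for imparting resistance.
shortlist = sorted(mapping_candidates & expression_hits)
print(shortlist)  # ['GENE_B', 'GENE_D']
```

The value of a workflow system is that this intersection – and every filtering and formatting step either side of it – is recorded and repeatable, rather than done by hand each time the source databases change.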
Taverna comes with some pre-defined workflow services to carry out common tasks; others can be created or imported. The import feature is crucial: it allows researchers to publish their workflow along with the research findings, making it easier for peers to review the processes used. It also means that useful workflow services are made available to other scientists, sparing them the effort and time needed to define every service from scratch.
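The workflow idea itself can be shown in a few lines. This toy Python sketch is not Taverna's actual workflow format or engine – the services and data are invented – but it captures why a workflow that is "just data" can be saved, published and re-run:

```python
import json

# A "service" here is any function from one value to the next.
# (Invented examples - real Taverna services wrap databases and tools.)
SERVICES = {
    "uppercase_ids": lambda genes: [g.upper() for g in genes],
    "drop_duplicates": lambda genes: sorted(set(genes)),
    "take_top": lambda genes: genes[:3],
}

def run_workflow(steps, data):
    """Apply each named service in order - one 'run' of the workflow."""
    for step in steps:
        data = SERVICES[step](data)
    return data

# The workflow is plain data, so it can be saved, edited and shared
# alongside the published findings, then re-run by other researchers.
workflow = ["uppercase_ids", "drop_duplicates", "take_top"]
saved = json.dumps(workflow)
result = run_workflow(json.loads(saved),
                      ["tlr4", "cd14", "tlr4", "inhba", "myo1f"])
print(result)  # ['CD14', 'INHBA', 'MYO1F']
```

Because the saved list names services rather than containing code, a peer reviewing the work can see every processing step, swap one service for another, or re-run the whole chain against updated data.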
The Taverna team quickly spotted that workflow sharing via email, social networking and so on was becoming widespread but haphazard, so they created the ‘myExperiment’ site, where scientists can share workflows. Less time spent hacking together disparate databases means more time for efficient research.