The absence of universal data standards is a notorious problem within life sciences. In an industry where each company uses its own data formats, information from other sources can be extremely hard to decipher. This serves to reinforce silos between companies and raise the barriers to collaboration.
According to research by the data mining company CrowdFlower, 60% of data scientists spend the majority of their day cleaning up messy data, a mundane exercise also known as ‘data wrangling’. A 2014 New York Times article makes the case that ‘data janitor work’ often stands in the way of what data scientists actually want to be doing – namely, mining that ‘Wild West of data’ for insights.
Within life sciences specifically, the problem directly curbs innovation. Unable to easily share their pre-competitive data (i.e., data that adds little competitive advantage to research projects), life sciences companies frequently duplicate findings that would otherwise be at their fingertips. The lack of standards also causes issues during mergers and acquisitions, when the two companies struggle to combine their data sets.
“In the life science industry today, productivity, and therefore research outputs, are significantly reduced by a lack of well-defined data definitions,” says Tim Hoctor, vice president of professional services at Elsevier. “The need to integrate both internal and external data sources consistently causes problems for researchers around the world because of the complexity and time involved.”
He adds that the problem is gathering urgency. With R&D costs soaring, the industry can no longer simply struggle on and hope to find the answer eventually.
“This approach is radically reducing outcomes and hindering the quest to solve even the most common of diseases,” he says.
A unified data model
In October, Elsevier announced it had donated its Unified Data Model (UDM) to The Pistoia Alliance, a global, not-for-profit alliance that works to lower barriers to innovation in life sciences R&D. The goal is to create a common tool that allows data to be shared between parties.
“The decision to collaborate with The Pistoia Alliance was an easy one to make,” says Hoctor. “Elsevier is a member of The Pistoia Alliance and strongly supports what it is trying to achieve in lowering the barriers to pre-competitive collaboration within the industry. By donating the UDM, Elsevier wanted to extend its partnership with the not-for-profit group, with the aim of publishing an open and freely available format for the storage and exchange of drug discovery data.”
The Pistoia Alliance was established in 2009 by representatives of AstraZeneca, Novartis, GSK and Pfizer. It now comprises a diverse group of experts, with publishers, academic groups and pharma companies among its members.
With a number of projects underway, covering everything from the Internet of Things to ‘ontologies mapping’, the group places a strong emphasis on transparency. It pools resources from a number of companies and openly publishes its work for the benefit of the worldwide research community. The UDM project should go a long way towards furthering its objectives.
“Elsevier wanted the opportunity to bring their UDM into the public domain. Many Pistoia Alliance members were excited with this development, as it now gives The Alliance a chance to develop a common data standard for our members and the industry at large,” says Steve Arlington, president of The Pistoia Alliance.
Accelerating drug discovery
The UDM was originally developed alongside Roche in 2013, when Roche sought to integrate proprietary reaction information into Elsevier’s chemistry database. The idea was to free up time for companies like Roche to spend on drug discovery rather than data management.
“The model is an XML file format, designed as a starting point for informatics systems that are developed by both life sciences and technology companies,” says Hoctor. “The data model was designed to achieve a standard platform for implementing experimental business processes, which would eventually accelerate the entire drug discovery process.”
Data transfer can be particularly tricky when one company uses a horizontal system (such as an in-house Electronic Lab Notebook) and the other, commonly an academic or contract research organisation, uses a vertical system. Elsevier’s UDM helps to upload data sets into horizontal systems, which should significantly reduce the costs of research projects.
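To make the idea of an XML exchange format more concrete, here is a minimal Python sketch of how a receiving system might read an experiment record. It is an illustration only: the element and attribute names (`experiment`, `reaction`, `reactant`, `product`, `yield_percent`) and the sample values are hypothetical and are not taken from the actual UDM schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names, attribute names and values are
# invented for illustration and are NOT part of the real UDM schema.
SAMPLE = """\
<experiment id="EXP-001">
  <reaction>
    <reactant name="benzaldehyde"/>
    <reactant name="acetone"/>
    <product name="dibenzylideneacetone" yield_percent="78.5"/>
  </reaction>
</experiment>
"""


def parse_experiment(xml_text):
    """Turn one XML experiment record into a plain dictionary."""
    root = ET.fromstring(xml_text)
    reaction = root.find("reaction")
    product = reaction.find("product")
    return {
        "id": root.get("id"),
        "reactants": [r.get("name") for r in reaction.findall("reactant")],
        "product": product.get("name"),
        "yield_percent": float(product.get("yield_percent")),
    }


record = parse_experiment(SAMPLE)
print(record)
```

Because both parties agree on the structure in advance, the receiving side needs only one parser like this rather than a separate converter for each partner's in-house format.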
Under the stewardship of The Pistoia Alliance, the model will be developed and extended, with a view to making it more generic and vendor-neutral.
“Our members’ experiences will be critical to how we enhance the UDM,” says Arlington. “In consultation with them, we are looking to expand the types of experimental data covered by the UDM and enable extensibility of the model to support specialised data.”
As a member of The Pistoia Alliance, Elsevier will contribute to these ongoing efforts, alongside other members of the project committee and the broader chemistry community.
A common research language
Once the UDM is completed, it will be used to store and exchange experimental information about compound synthesis and biological testing. Although no single organisation is equipped to create an ‘industry standard’, the model should serve as a useful starting point for companies developing their own informatics systems.
“Our ultimate aim is to publish an open and freely available format for the storage and exchange of drug discovery experimental data that can be adopted by all,” says Arlington. “This new standard can contribute to reducing the time it takes the industry to develop new therapeutics. This will also mean that during M&A activity, valuable research data can be readily transferred and utilised, instead of being lost in the reorganisation process.”
He adds that The Pistoia Alliance, directed by the project committee, will publish the first version of the extended UDM in Q1 2018. After that, the group intends to make further tweaks in response to members’ feedback.
“For Elsevier, the ultimate goal of this project is to ensure that data is a ‘common language’ in the research community. By achieving this goal, the industry will experience fewer bottlenecks in their research; and the lifting of barriers to collaboration, innovation and discovery, often caused by a lack of data standards,” explains Hoctor.