Best Practices For Building Scalable R&D Data Platforms

Dec 20, 2017

Pharmaceutical organizations are starting to turn the corner on using the vast amounts of data they’ve collected to answer the wide array of questions that their internal teams have, whether scientific, clinical, or operational in nature. Companies are increasingly able to answer important questions ranging from “Are we on track to complete this clinical trial on time?” to “Do patients with this biomarker fare better or worse with this treatment?”

Much of this is due to a change in the approach to building R&D clinical data warehouses or translational data platforms that connect a wide variety of clinical and non-clinical data. Traditionally, it took pharmaceutical companies multiple years and a significant amount of money to build these data platforms due, in large part, to the desire to combine clinical data with other sources of insight—like biomarker or specimen data—using top-down, manual approaches to integrating and serving data. Now new practices are surfacing that have delivered significant, tangible results.

Having worked with a variety of large pharmaceutical companies on their quest to build scalable data platforms, we have found that the best practices below serve as cornerstones for transforming legacy approaches to managing data into modern ones capable of achieving these organizations' ambitious visions.


Best Practice #1: Identify a small collection of high-priority use cases before building the data platform

Often when planning to construct a clinical data warehouse or translational data platform within R&D, CDOs/CIOs will focus on building out a central, curated repository of all available enterprise information, clinical and non-clinical, that’s easily accessible to the business. While that is certainly the ultimate goal, it’s important to start by identifying a small collection of high-priority use cases gathered from the data consumers. Ask questions like “What are the most important analytics to each business group?” Naturally, there will be a wide variety of use cases to tackle depending on each group’s focus, but the best practice is to build a data platform that quickly answers the most critical questions and achieves ROI. Once the initial use cases are successfully completed and the business trusts the results, it’s much easier to scale.

Moreover, it's important to understand how consumers will use the data. Often the most compelling analytics combine data from several sources, for example clinical and translational science. Also, the same data may be used by different groups with very different needs: clinical trial operations wants to know whether a vendor's results are on schedule, while biostatistics is looking to generate a p-value for publication. How the data is used has major implications for the data model, the technologies chosen, and so on, and making the wrong choices up front can greatly limit adoption of the system.
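To make the idea of combining sources concrete, here is a minimal sketch of the biomarker question from the introduction, answered by joining two datasets on a shared patient identifier. The records, field names (`patient_id`, `responded`, `marker_positive`), and the function itself are hypothetical and chosen only for illustration; a real platform would work from curated study data and proper statistical methods.

```python
from collections import defaultdict

# Hypothetical clinical outcomes and biomarker results (illustrative only).
clinical = [
    {"patient_id": "P1", "responded": True},
    {"patient_id": "P2", "responded": False},
    {"patient_id": "P3", "responded": True},
    {"patient_id": "P4", "responded": False},
]
biomarker = [
    {"patient_id": "P1", "marker_positive": True},
    {"patient_id": "P2", "marker_positive": True},
    {"patient_id": "P3", "marker_positive": False},
    # P4 has no biomarker result and is excluded from the comparison.
]


def response_rate_by_marker(clinical_rows, biomarker_rows):
    """Join the two sources on patient_id and compute response rate per marker status."""
    marker_by_patient = {r["patient_id"]: r["marker_positive"] for r in biomarker_rows}
    counts = defaultdict(lambda: [0, 0])  # marker status -> [responders, total]
    for row in clinical_rows:
        status = marker_by_patient.get(row["patient_id"])
        if status is None:
            continue  # no biomarker data for this patient
        counts[status][1] += 1
        if row["responded"]:
            counts[status][0] += 1
    return {status: responders / total for status, (responders, total) in counts.items()}


rates = response_rate_by_marker(clinical, biomarker)
```

Even in this toy form, the join depends on both sources agreeing on an identifier and on what "responded" means, which is exactly the kind of modeling decision that is worth settling with the data consumers up front.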


Best Practice #2: Plan for change

It’s inevitable that use cases, models, and datasets will change within the business over time, so plan for this and create an agile IT stack. More and more, analysts want to bring in data from third-party sources and work with new vendors and systems, and corporate M&A and in-licensing bring in still more, which means more changes are always in store. This all results in a very fluid data environment that IT must adapt to. IT must figure out how to deal with change in both the sources and the consumption of data in an agile way to continue to serve end users.

Data standards have often been thought of as an answer to this issue of variety, and they are obviously very important to data curation efforts, but you can't solve all problems by trying to standardize data collection at the source. Look for technologies that let you plug in new data sources and reduce the cost of doing so.

Most people approach companies like Tamr because they look for agility in their IT stack. They want to leverage automation to deal with the significant variety in data sources and end data models. Employing machine learning as a means to easily map source data attributes to target attributes will deliver the flexibility and efficiency desired when creating an agile IT stack.
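As a rough illustration of automated attribute mapping, the sketch below suggests a target attribute for each source attribute using simple string similarity from Python's standard library. This is a stand-in chosen for brevity, not Tamr's method and not machine learning; the attribute names and the `suggest_mapping` helper are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and drop punctuation/whitespace so 'Subject_ID' matches 'subject id'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suggest_mapping(source_attrs, target_attrs):
    """For each source attribute, return the most similar target and its score (0 to 1)."""
    suggestions = {}
    for src in source_attrs:
        scored = [
            (SequenceMatcher(None, normalize(src), normalize(tgt)).ratio(), tgt)
            for tgt in target_attrs
        ]
        score, best = max(scored)
        suggestions[src] = (best, round(score, 2))
    return suggestions


# Hypothetical attribute names from a new vendor feed and a target model.
source = ["Subject_ID", "visit date", "HGB_value"]
target = ["subject_id", "visit_date", "hemoglobin"]
mapping = suggest_mapping(source, target)
```

Name similarity handles `Subject_ID` and `visit date` cleanly but gives only a weak score for an abbreviation like `HGB_value`, which is one reason learned models that also consider the data values, plus expert review, are attractive for this task.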


Best Practice #3: Centralize the curation stack while collaborating with end users

Curation of pharmaceutical data is sometimes a centralized function and sometimes distributed in nature. As a best practice, companies should centralize the technology stack but collaborate on curation in a distributed fashion (i.e., with each end-user group).

Collaboration with the end users in curating data is paramount. They want to be involved to ensure mappings are done correctly so that they can trust the results. Moreover, they want to work with IT to iterate on the final models if they believe something needs to be modified. Finding technology that is not only agile but also collaborative, and building process around it, is a key to success and one of the reasons why approaches like Tamr, which fuse machine learning with human expertise in curating pharmaceutical data, are increasingly desired.
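One way to picture this kind of collaboration is a simple triage step: suggested mappings with high confidence are accepted automatically, and the rest are routed to the relevant end-user group for a decision. The sketch below assumes suggestions arrive as (target, score) pairs; the threshold, function names, and example values are all hypothetical.

```python
def triage(suggestions, threshold=0.9):
    """Split suggested mappings into auto-accepted and needs-review by confidence score."""
    accepted, pending = {}, {}
    for src, (tgt, score) in suggestions.items():
        if score >= threshold:
            accepted[src] = tgt
        else:
            pending[src] = tgt
    return accepted, pending


def apply_review(accepted, pending, decisions):
    """Merge expert decisions into the accepted set; unreviewed items stay pending."""
    final = dict(accepted)
    still_pending = {}
    for src, proposed in pending.items():
        if src in decisions:
            final[src] = decisions[src]  # the expert confirms or corrects the proposal
        else:
            still_pending[src] = proposed
    return final, still_pending


# Hypothetical machine-suggested mappings with confidence scores.
suggestions = {
    "Subject_ID": ("subject_id", 0.98),
    "HGB_value": ("visit_date", 0.41),
    "SITE": ("site_id", 0.62),
}
accepted, pending = triage(suggestions)
# A reviewer from the biomarker group corrects one low-confidence suggestion.
final, still_pending = apply_review(accepted, pending, {"HGB_value": "hemoglobin"})
```

Keeping the tooling central while the decisions come from each group means every correction lands in one place and can be reused the next time a similar source arrives.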

Pervasive data-driven decision making is the ambition of most enterprises and is fueling the desire to build data platforms, but the pharmaceutical world is in a unique position to capitalize on its R&D data to improve both internal operations and life-saving offerings. However, undergoing this business transformation requires rethinking the processes and technologies used in managing data at a large organization. If mastered, though, it can drive significant growth, create a sustainable competitive advantage, and provide patient outcomes that have never been realized.


Ted Snyder, Sr. is Solution Architect at Tamr, Inc.

