July 26, 2022

The Past, Present, and Future of Transforming Biological Data Into Insight

The ability to acquire omics data has overtaken most labs’ capacity to analyze and gain insight from that data. At best, there is a bottleneck to analysis, and at worst, the data processing and analytics effort is paralyzed. R&D teams must accept: 

In today's time-compressed and budget-constrained drug discovery environments, these scenarios are simply not tenable. And yet, these costly problems define the current state of affairs for many life sciences R&D organizations. How did we get here – and more importantly, can life sciences organizations learn from the mistakes of the past to leapfrog the competition?

The start: academia and government explore the genome

The terms -ome and -omics exploded in PubMed in the 1990s, nearly thirty years ago. The multivariate and high-throughput nature of this data, when it could be acquired, posed an immediate challenge. Dedicated bioinformatics groups with access to exceptional local computing power, command-line scripting gurus, and leading-edge scientific domain expertise saved the day. These groups thrived in academic and government settings, and in pharma, niche groups were established, often by acquisition, to explore the power of computational analysis to advance discovery.

As time passed, access to omics data, not to mention the breadth and depth of the data itself, exploded. Access came first through public datasets and later through the exponential decrease in sequencing cost, which democratized data generation, and the appetite for accessible “user-friendly” analytics expanded with it. With the announcement of the initial draft of the human euchromatic genome in June 2000, the race to transform data to insight was on.

The beginning: no-code forward offerings

In the mid-2000s, the solution appeared in the form of “no-code” offerings for genomic analysis. These products were available in familiar desktop or browser-based environments and required no scripting expertise. The now familiar names Galaxy, DNAnexus, and Seven Bridges, as well as more recent entrants such as Rosalind, Latch.Bio, and BasePair, attempted to democratize genomics analysis by enabling biologists to access the computing and tooling they needed to start interpreting their data. Biologists were, and continue to be, wowed by the seemingly effortless, fast, and visually appealing initial results these tools can provide. When data needs to be processed with a standard, static pipeline, as in bulk RNA-seq workflows where gene abundance is understood and the fold-change parameters are more or less locked in, no-code environments can be very suitable. In fact, they might better be described as “pre-code,” as the analysis, premeditated in the experimental design, is baked in. Once those biologists were ready for the next set of computational experiments, however, the costs of the “no-code” approach quickly became apparent.

Scientists ask questions of their data, and the best scientists ask innumerable questions. The ease of data analysis in no-code offerings is a consequence of their user-friendliness – expertise in R, Bash, and Python is not required to produce helpful results – but fine-grained, fully customized control remained, and remains, beyond the reach of these tools. Biologists can start, but rarely finish, interpreting their data. It turned out that some level of coding, or at least parameterization, is required to fully interrogate data beyond the most predictable analyses. In data-first, modeling-centric settings, parameterization is king. This principle applies to both data acquisition and analysis. In fact, it is the sine qua non of a fully contextualized “data-as-an-asset” approach. Richly parameterized data permits bioinformaticians to wring maximum value out of every data acquisition without constantly refactoring their data. But even more importantly, when presented in a low-code environment, this capability can be made available to a far broader audience. More on low-code solutions in a minute.
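To make “parameterization” concrete, here is a minimal Python sketch of a toy fold-change filter whose thresholds are exposed as parameters rather than baked into the pipeline. Everything in it is an illustrative assumption: the names `DEParams` and `changed_genes`, the gene labels, and the counts are hypothetical, and a real differential expression analysis would use a proper statistical model rather than a bare threshold.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DEParams:
    """Hypothetical knobs for a toy differential-expression filter."""
    min_abs_log2_fc: float = 1.0  # minimum |log2 fold change| to report a gene
    pseudocount: float = 1.0      # added to both counts to avoid dividing by zero


def log2_fold_change(control: float, treated: float, pseudocount: float) -> float:
    """Return log2((treated + pseudocount) / (control + pseudocount))."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


def changed_genes(counts: dict[str, tuple[float, float]], params: DEParams) -> list[str]:
    """Return sorted gene names whose |log2 fold change| meets the threshold."""
    hits = []
    for gene, (control, treated) in counts.items():
        lfc = log2_fold_change(control, treated, params.pseudocount)
        if abs(lfc) >= params.min_abs_log2_fc:
            hits.append(gene)
    return sorted(hits)


# Made-up (control, treated) counts for three placeholder genes.
counts = {
    "GENE_A": (10.0, 45.0),
    "GENE_B": (100.0, 110.0),
    "GENE_C": (80.0, 30.0),
}

# Same data, two questions: only the parameter object changes.
default_hits = changed_genes(counts, DEParams())
strict_hits = changed_genes(counts, DEParams(min_abs_log2_fc=2.0))
print(default_hits, strict_hits)
```

In a “pre-code” tool, the equivalent of `DEParams()` is fixed at design time. Asking the stricter question would mean changing the pipeline itself, not just one value.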

No-code solutions (which provide everything within the green box) are fantastic - until you need to change any component of your workflow, which can require a full-stack engineering team (red box). 

Are high-code environments in the cloud the answer?

Their limitations notwithstanding, some segments of the market continue to be adequately served by no-code products. For others, the need for flexible, parameterized analyses and scalable compute has driven many life science organizations to adopt highly flexible, high-code COTS (commercial-off-the-shelf) cloud computing products such as AWS and Databricks (overlaid with tools like Nextflow), or the Broad Institute’s Terra collaboration with Verily and Microsoft. These solutions lower the barrier to entry compared with the earliest days of command-line omics analysis, which took place at the supercomputing clusters of academic medical centers and government institutes, and they shed the flexibility limitations of no-code solutions.

More accessible than their command-line forebears, this generation of tools also embraced modularity and interoperability, evidenced, for example, by their adoption of standards offered by the Global Alliance for Genomics and Health (GA4GH), an international alliance of over 600 organizations founded in 2013. This effort provides frameworks and standards required for the secure and responsible sharing of biomedical data via APIs supporting authentication, clinical connectivity, semantic tagging, and workflow execution [4].

However, high-code remains… high-code. The specialty skill sets that make bioinformaticians a rate-limiting resource persist as a bottleneck when using these solutions, and preclude biologists from making even minor modifications to a given analysis. Accessing the systems administration and software engineering expertise required to securely and scalably configure COTS clouds is another, often underappreciated, challenge that lies outside the toolkit of most life sciences organizations. These concerns should give decision-makers pause: misconfiguration can come with potentially catastrophic consequences, as seen with Pfizer’s accidental release of patient-provided data due to a simple mistake in their cloud storage setup [5].

High-code solutions provide scalability and flexibility, but require users to assemble their own storage/compute infrastructure and overlay fragile, fragmented bioinformatics tooling prior to beginning a single analysis. 

The “hero” approach: consultancies

The above challenges have driven some organizations to simply outsource their bioinformatics analyses. There is one indisputable advantage to the use of a third party to perform omics analysis – the analysis will be completed, and in the context of a trusted relationship the quality of the results can be reliable. Unfortunately, the operational cost of such an approach is prohibitive, the local fund of knowledge is not expanded, the timeframes can be lengthy, and additional iterative analyses are typically undertaken from scratch. You can’t get started in genomics without doing genomics, and you can’t do genomics without getting started in genomics. Leading thinkers acknowledge that an “omics” culture must be nurtured within an organization, an opportunity that is lost when the analysis is outsourced.

More analysis means more pain (in money and time) - with little institutional knowledge gain.

Low-code: striking a balance

Parametrically driven (which is just another way of saying low-code) solutions acknowledge that these analyses are actually dry-lab experiments: rather than testing the same, static hypothesis with every dataset, scientists need to be able to iteratively adjust their computational experiment based on an evolving understanding of their experimental system. In non-pipeline scenarios such as these, parametrically driven analysis is essential: it makes swapping out different aligners [6], testing various dimensionality reduction techniques [7], or even making minor modifications to a standard tool as easy as it always should have been.
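One way to picture “swapping out” a method is a registry keyed by a parameter value. The Python sketch below is only an illustration under stated assumptions: the names `REDUCERS`, `reduce_dimensions`, `keep_first_k`, and `keep_top_variance` are hypothetical. The two reducers are deliberately trivial feature-selection stand-ins, not real dimensionality reduction techniques such as PCA.

```python
from statistics import pvariance
from typing import Callable

Matrix = list[list[float]]  # rows are samples, columns are features


def keep_first_k(data: Matrix, k: int) -> Matrix:
    """Toy baseline: keep the first k feature columns."""
    return [row[:k] for row in data]


def keep_top_variance(data: Matrix, k: int) -> Matrix:
    """Toy stand-in: keep the k columns with the highest population variance."""
    n_features = len(data[0])
    variances = [pvariance([row[j] for row in data]) for j in range(n_features)]
    top = sorted(range(n_features), key=lambda j: variances[j], reverse=True)[:k]
    keep = sorted(top)  # preserve the original column order
    return [[row[j] for j in keep] for row in data]


# Hypothetical registry: the analysis step is chosen by name, not by editing code.
REDUCERS: dict[str, Callable[[Matrix, int], Matrix]] = {
    "first_k": keep_first_k,
    "top_variance": keep_top_variance,
}


def reduce_dimensions(data: Matrix, method: str, k: int) -> Matrix:
    """Dispatch to the reducer named by the `method` parameter."""
    if method not in REDUCERS:
        raise ValueError(f"unknown method {method!r}; choose from {sorted(REDUCERS)}")
    return REDUCERS[method](data, k)


# Made-up data: three samples, three features.
samples = [[1.0, 5.0, 0.0], [1.1, 9.0, 4.0], [0.9, 1.0, 10.0]]
by_position = reduce_dimensions(samples, "first_k", 2)
by_variance = reduce_dimensions(samples, "top_variance", 2)
print(by_position)
print(by_variance)
```

The design point is that the caller changes a string and an integer, while the surrounding workflow stays untouched. That is roughly what a low-code interface would expose as a dropdown and a number field.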

Ample precedent exists across a wide array of verticals for enablement via low-code solutions – Zapier and Microsoft Power Automate for productivity, IFTTT for home automation, and DataRobot for AI, to name a few. These solutions share the notion of unit operations that “componentize” thousands of lines of code into a single analysis step, allowing the exploration of data outside of any predetermined schemes.
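As a rough illustration of the “unit operation” idea, the Python sketch below wraps each step in a small, parameterized function and chains the steps in a list. It does not describe how any of the products named above work internally. The step names (`drop_low_counts`, `scale_to_total`, `round_values`) and the numbers are hypothetical placeholders, and in practice a single unit operation would hide far more code than these one-liners.

```python
from functools import reduce
from typing import Callable

Step = Callable[[list[float]], list[float]]


def drop_low_counts(min_count: float) -> Step:
    """Unit operation: discard values below min_count."""
    def step(values: list[float]) -> list[float]:
        return [v for v in values if v >= min_count]
    return step


def scale_to_total(total: float) -> Step:
    """Unit operation: rescale values so they sum to `total`."""
    def step(values: list[float]) -> list[float]:
        current = sum(values)
        if current == 0:
            return list(values)  # nothing to scale
        return [v * total / current for v in values]
    return step


def round_values(digits: int) -> Step:
    """Unit operation: round each value to `digits` decimal places."""
    def step(values: list[float]) -> list[float]:
        return [round(v, digits) for v in values]
    return step


def run_pipeline(values: list[float], steps: list[Step]) -> list[float]:
    """Apply each unit operation in order, feeding each output to the next step."""
    return reduce(lambda acc, step: step(acc), steps, values)


raw = [0.0, 2.0, 6.0, 12.0]  # made-up measurements

baseline = run_pipeline(
    raw, [drop_low_counts(1.0), scale_to_total(100.0), round_values(1)]
)
# Change one parameter of one step; the rest of the workflow is reused as-is.
stricter = run_pipeline(
    raw, [drop_low_counts(5.0), scale_to_total(100.0), round_values(1)]
)
print(baseline, stricter)
```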

Today’s omics solutions must reach across the entire R&D ecosystem and meet the needs of both the wet and dry laboratory environments. Biotech leaders need persuasive results that animate investors to accelerate company growth. Bench scientists want tools that enable them to generate statistically robust insights with publication-quality figures without waiting weeks for a bioinformatics colleague to change a single variable. Bioinformaticians, with their exceptionally valuable biology-specific data science skills, want to focus on high-value bioinformatics problems without being forced to be their own systems administrator and software engineer – or they’ll leave for somewhere else where they can.

Meet Watershed Informatics 

Watershed’s Cloud Data Lab is the first fully verticalized, low-code/no-code bio-IT cloud solution to meet the needs of wet and dry lab scientists alike. Purpose-built for life sciences discovery, our ultra-scalable computational infrastructure, solved informatics tooling environments, and instantly accessible PhD-level bioinformatics expertise enable order-of-magnitude improvements in both the speed and flexibility of generating insights from biomedical data. For a free, in-depth 1:1 (or bring your friends!) demonstration, don’t hesitate to contact us.

  1. “Data Preparation Overview.” IBM, 2021. https://www.ibm.com/docs/en/spss-modeler/SaaS?topic=preparation-data-overview
  2. Wilkinson et al. “The FAIR Guiding Principles for scientific data management and stewardship.” Scientific Data, 2016. https://www.nature.com/articles/sdata201618
  3. John Conway. “Multi-omics Data Management White Paper.” 20/15 Visioneers, 2020. https://www.20visioneers15.com/post/multi-omics-data-management-white-paper
  4. Global Alliance for Genomics and Health: https://www.ga4gh.org/about-us/
  5. Duncan Riley. “Pharma giant Pfizer exposes patient data on unsecured cloud storage.” siliconANGLE, 2020. https://siliconangle.com/2020/10/20/pharma-giant-pfizer-exposes-patient-data-unsecured-cloud-storage/
  6. Musich et al. “Comparison of short-read sequence aligners indicates strengths and weaknesses for biologists to consider.” Frontiers in Plant Science, 2021. https://www.frontiersin.org/articles/10.3389/fpls.2021.657240/full
  7. Meng et al. “Dimension reduction techniques for the integrative analysis of multi-omics data.” Briefings in Bioinformatics, 2016. https://academic.oup.com/bib/article/17/4/628/2240645
Mark Kalinich, MD, PhD
Co-founder and CSO