June 3, 2022

What Does It Mean to Be an End-to-End Bio-IT Platform?

A Century of Genomics 

It may not feel like it, but the idea of genomics is not new. In 1920, the botanist Hans Winkler coined the word genome [1], a hybrid of the words gene and chromosome, to represent the genetic makeup of an organism. Thirty-two years later, Martha Chase and Alfred Hershey demonstrated that DNA, rather than protein, encodes that genomic information, and in 1986 Tom Roderick, while discussing the feasibility of sequencing an entire human genome, came up with the term genomics.


A century from its birth, the study of the genome is finally coming of age. Next generation sequencing (NGS) [2] has enabled the sequencing of hundreds of thousands of human genomes [3], with similar advances in generating transcriptomic, epigenomic, and proteomic datasets for understanding the molecular-level underpinnings of human health and disease. While this explosion of multi-omic data [4] holds great promise for accelerating the drug discovery process, the exponential growth in the size and variety of biological data has rapidly outpaced existing bio-IT infrastructure’s ability to generate insights from it.

The Exponential Growth of Big Data in Biology  

Recognizing that exponential growth is not one of humanity's strong suits explains a lot of the problem. From 1982 to the present, GenBank, the NIH’s genetic sequence database [5], has approximately doubled in size every 18 months. Put another way: roughly 65% of all the public genomic data ever generated has been produced since the start of the COVID-19 pandemic (Figure 1).
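
That 65% figure follows directly from the doubling time. A quick back-of-the-envelope check: the 18-month doubling time comes from the GenBank statistics above, while the ~27-month window from March 2020 to June 2022 is our assumption.

```python
# With a constant doubling time T, the cumulative amount of data grows as
# 2^(t/T), so the fraction generated in the most recent window of length w
# is 1 - 2^(-w/T).

DOUBLING_TIME_MONTHS = 18   # GenBank's approximate doubling time
WINDOW_MONTHS = 27          # ~March 2020 to June 2022 (our assumption)

recent_fraction = 1 - 2 ** (-WINDOW_MONTHS / DOUBLING_TIME_MONTHS)
print(f"{recent_fraction:.0%}")  # → 65%
```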

Figure 1: The Exponential Growth in NGS Data 

There’s no sign of slowing down, either. The National Human Genome Research Institute [6] estimates that up to 40 exabytes of genomics data will be generated by 2025. Extending this trend to the growing set of other high-dimensional data sources (e.g., metabolomics, lipidomics, proteomics, high-content imaging) highlights the daunting problem of finding a scalable, secure, and accessible storage solution for the life sciences [7].

The Challenges of Building and Maintaining Computational Infrastructure

Managing the compute resources needed to process raw data into interpretable results has become increasingly challenging as well. Configuring and provisioning the terabytes of RAM and hundreds of CPUs/GPUs needed for -omics analyses often requires teams of dedicated systems administrators and software engineers, skill sets that lie well outside the expertise of most life sciences organizations. Even for large pharmaceutical companies with experienced bioinformaticians and established IT teams, setting up efficient, performant storage and compute infrastructure, whether by building on-premises systems or using third-party cloud service providers (AWS, GCP, Azure, Databricks, etc.), has proven extraordinarily challenging; notably, even in clinical trials, a significant proportion of costs is driven simply by IT and dedicated infrastructure setup [8]. For younger biotechnology startups lacking this expertise, the challenges can seem insurmountable.

Once these resources are in place, R&D teams face yet another roadblock: installing the patchwork of poorly documented, often unstable, and sometimes incompatible bioinformatics tools necessary to perform their analysis of interest. Mission-critical genomics software commonly originates in the laboratories of academic bioinformaticians, who often lack the funding, skill sets, or career incentives to create and maintain high-quality code [9]. The absence of a harmonized code base means that each life sciences group must start from scratch to install and integrate a tangled web of tooling for its specific use case, then face continued headaches when it inevitably needs to modify or upgrade a given analysis workflow. Finally, enforcing robust data provenance and analysis reproducibility [10] remains a struggle for most organizations. Tools like Nextflow and Docker can help groups mitigate some of these concerns, after, of course, they’ve figured out their own storage and compute solutions.
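
To make the container-pinning idea concrete, the sketch below assembles a Docker invocation that locks a bioinformatics tool to an exact, versioned image so every machine runs identical software. The image name, tag, and file paths are illustrative placeholders, not a recommendation.

```python
# Sketch: run a version-pinned bioinformatics tool inside a container so the
# same binary and dependencies are used everywhere. Image/tag/paths below are
# hypothetical examples.

def build_docker_cmd(image: str, tag: str, tool_args: list[str],
                     data_dir: str = "/data") -> list[str]:
    """Assemble a `docker run` command that mounts a data directory and
    executes a tool from an exact image:tag."""
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:{data_dir}",  # expose input/output files
        f"{image}:{tag}",                # pin the exact tool version
        *tool_args,
    ]

cmd = build_docker_cmd(
    "quay.io/biocontainers/samtools", "1.15.1--h1170115_0",
    ["samtools", "view", "-b", "/data/sample.sam"],
)
# subprocess.run(cmd, check=True)  # would execute if Docker is available
```

Workflow managers like Nextflow apply the same principle declaratively, attaching a container image to each pipeline step.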

Flexibility versus Usability: A Biologist’s Conundrum 

After the computational infrastructure is in place and the bioinformatics tooling is up and running, building and customizing bioinformatics pipelines can be laborious and error-prone, even for seasoned bioinformaticians. What, then, of biologists? Extensive programming training has not yet made it into the standard curriculum for wet lab scientists. Although most trainees are comfortable with basic coding concepts, few have the bandwidth to learn the intricacies of Python, Bash, R, and/or Perl required to use the same systems as their bioinformatics colleagues. Making even the most routine modifications to a bioinformatics analysis (switching the colors on a heatmap, adding metadata to a plot, looking at a different gene set) can require detailed coding know-how. Biologists often must wait days, weeks, or even months for their heavily oversubscribed bioinformatics collaborators to make these basic changes, grinding progress to a halt.

Several established (DNAnexus, BaseSpace Sequence Hub, Galaxy) and newer (Basepair Genomics, LatchBio, BioBox Analytics) no-code bioinformatics platforms exist to empower biologists to analyze their datasets without writing any code at all. Such tools are fantastic, assuming the platform has already constructed the exact analyses a user is interested in. Their usability comes at the cost of flexibility: running a pre-existing pipeline is easy, but building a new workflow or customizing an existing one can be extremely challenging. This rings especially true for integrative analyses of multi-omics datasets, where the number of potential analysis approaches explodes relative to single-omics analysis.

End-to-End Bio-IT Platforms


Simply put, an end-to-end bio-IT managed service provider simultaneously solves all of the above problems, enabling both biologists and bioinformaticians to rapidly build, run, and customize bioinformatics workflows. Watershed’s interdisciplinary team of PhD-level biologists, bioinformaticians, and veteran software engineers gives us the perspective and expertise to do exactly that: empowering all of our users to generate bioinformatics insights in minutes instead of months. The Watershed Cloud Data Lab® (CDL) offers secure, scalable, and accessible storage and computation resources, all without having to figure out S3 buckets or EC2 instances. Our simple yet powerful package management system makes it easy to use industry-standard bioinformatics tooling without spending weeks dealing with version incompatibilities or inexplicably failed environment builds.

Built from the ground up for reproducible research, CDL data objects instantly reveal the set of operations used to generate them; users always know which tools and parameters were used to make every result. Finally, our templated no-code workflows, intuitive user interface, extensive documentation, and patent-pending no-code/low-code transition technology make all of the above accessible to everyone - even researchers who have never coded in their life. 
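
The provenance idea can be sketched in a few lines of Python. This is a toy illustration of operation-logging data objects, not the CDL's actual implementation; the `TrackedData` class and `normalize` function are hypothetical examples.

```python
from dataclasses import dataclass, field

@dataclass
class TrackedData:
    """A data object that records every operation applied to it."""
    value: object
    history: list = field(default_factory=list)

    def apply(self, func, **params):
        """Apply `func` and log its name and parameters in the provenance trail."""
        new_value = func(self.value, **params)
        return TrackedData(new_value, self.history + [(func.__name__, params)])

def normalize(counts, scale=1.0):
    """Example operation: scale counts to a fixed total."""
    return [c * scale / sum(counts) for c in counts]

result = TrackedData([10, 20, 70]).apply(normalize, scale=100.0)
print(result.history)  # → [('normalize', {'scale': 100.0})]
```

Because each `apply` returns a new object carrying its full history, any result can report exactly which tools and parameters produced it.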

Of course, no analysis is ever exactly the same, and our team of world-class bioinformaticians is always here to lend a helping hand. Whether it’s a quick question about a standard workflow (our average response time is 4.5 minutes) or a completely bespoke bioinformatics build-out, you’re never alone with Watershed: we have the expertise and bandwidth to enable your science.

If any of these challenges resonate with you, let’s chat! Book a free live demo or get in touch with us at contact@watershed.ai – our team is eager to help massively accelerate your life sciences research. 

References

  1. Weissenbach J. The rise of genomics. Comptes Rendus Biologies 2016;339(7–8):231–239. ISSN 1631-0691. https://doi.org/10.1016/j.crvi.2016.05.002.
  2. Behjati S, Tarpey PS. What is Next Generation Sequencing? Arch Dis Child Educ Pract Ed 2013;98:236–238. doi:10.1136/archdischild-2013-304340.
  3. United States Department of Health and Human Services. National Institutes of Health: All of Us Research Program. https://allofus.nih.gov/ (last accessed 1/26/2022).
  4. Krassowski M, Das V, Sahu SK, Misra BB. State of the Field in Multi-Omics Research: From Computational Needs to Data Mining and Sharing. Front Genet, 10 December 2020. https://doi.org/10.3389/fgene.2020.610798.
  5. National Center for Biotechnology Information, U.S. National Library of Medicine. GenBank and WGS Statistics. https://www.ncbi.nlm.nih.gov/genbank/statistics/ (last accessed 1/26/2022).
  6. National Institutes of Health, National Human Genome Research Institute. Genomic Data Science Fact Sheet. https://www.genome.gov/about-genomics/fact-sheets/Genomic-Data-Science (last accessed 1/26/2022).
  7. Koppad S, B A, Gkoutos GV, Acharjee A. Cloud Computing Enabled Big Multi-Omics Data Analytics. Bioinformatics and Biology Insights, January 2021. doi:10.1177/11779322211035921.
  8. Mestre-Ferrandiz J, Sussex J, Towse A. The R&D Cost of a New Medicine. OHE Monograph, 2012.
  9. Siepel A. Challenges in funding and developing genomic software: roots and remedies. Genome Biol 2019;20:147. https://doi.org/10.1186/s13059-019-1763-7.
  10. Papin JA, Mac Gabhann F, Sauro HM, Nickerson D, Rampadarath A. Improving reproducibility in computational biology research. PLoS Comput Biol 2020;16(5):e1007881. https://doi.org/10.1371/journal.pcbi.1007881.

Mark Kalinich, MD, PhD
Co-founder and CSO