It may not feel like it, but the idea of genomics is not new. In 1920, botanist Hans Winkler coined the word genome1, a hybrid of the words gene and chromosome, to represent the genetic makeup of an organism. Thirty-two years later, Martha Chase and Alfred Hershey demonstrated that DNA, rather than proteins, encoded that genomic information, and in 1986 Tom Roderick, while discussing the feasibility of sequencing an entire human genome, came up with the term genomics.
A century after its birth, the study of the genome is finally coming of age. Next-generation sequencing (NGS)2 has enabled the sequencing of hundreds of thousands of human genomes3, with similar advances in generating transcriptomic, epigenomic, and proteomic datasets for understanding the molecular-level underpinnings of human health and disease. While this explosion of multi-omic data4 holds great promise for accelerating the drug discovery process, the exponential growth in the size and variety of biological data has rapidly outpaced the ability of existing bio-IT infrastructure to generate insights from it.
Understanding that exponential growth is not one of humanity's strong suits explains a lot of the problem. Since 1982, GenBank, the NIH's genetic sequence database5, has approximately doubled in size every 18 months. Put another way: 65% of all the public genomic data ever generated has been produced since the start of the COVID-19 pandemic (Figure 1).
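To see how quickly an 18-month doubling time compounds, consider a back-of-the-envelope sketch (the cutoff and "today" dates below are illustrative assumptions, not GenBank release statistics):

```python
from datetime import date

DOUBLING_MONTHS = 18  # GenBank's approximate historical doubling time

def fraction_since(cutoff: date, today: date) -> float:
    """Fraction of all cumulative data that postdates `cutoff`, assuming
    constant exponential growth with an 18-month doubling time."""
    months = (today - cutoff).days / 30.44    # average month length
    growth = 2 ** (months / DOUBLING_MONTHS)  # total growth factor since cutoff
    return 1 - 1 / growth

# Share of all public genomic data generated since the pandemic began
# (illustrative dates; prints 0.69 for this ~30-month window)
print(round(fraction_since(date(2020, 3, 1), date(2022, 9, 1)), 2))
```

Under these assumptions, roughly two-thirds of all data ever generated postdates the cutoff, within a few points of the 65% figure above.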
There’s no sign of slowing down, either. The National Human Genome Research Institute6 estimates that up to 40 exabytes of genomics data will be generated by 2025. Extending this trend to the growing set of other high-dimensional data sources (e.g., metabolomics, lipidomics, proteomics, high-content imaging) highlights the daunting problem of finding a scalable, secure, and accessible storage solution for the life sciences7.
Managing the compute resources necessary to process raw data into interpretable results has become increasingly challenging as well. Configuring and provisioning the terabytes of RAM and hundreds of CPUs/GPUs needed for -omics analyses often requires teams of dedicated systems administrators and software engineers - skill sets that lie well outside the expertise of most life sciences organizations. Even for large pharmaceutical companies with experienced bioinformaticians and pre-existing IT teams, setting up efficient and performant storage/compute infrastructure, whether by building on-premises systems or using third-party cloud service providers (AWS, GCP, Azure, Databricks, etc.), has proven extraordinarily challenging; notably, even in clinical trials, a significant proportion of costs is driven simply by IT and dedicated infrastructure setup8. For younger biotechnology startups lacking the above expertise, these challenges can seem insurmountable.
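To make the provisioning burden concrete, here is a minimal sketch of what requesting a single large analysis node looks like through the AWS SDK for Python (boto3). The machine image and subnet IDs are hypothetical placeholders, and this omits the IAM roles, networking, security hardening, and cost controls a real deployment needs:

```python
import boto3  # AWS SDK for Python

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request one memory-optimized node (96 vCPUs, 768 GiB RAM) with 2 TB of
# scratch storage. All resource IDs below are hypothetical placeholders.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # hypothetical machine image
    InstanceType="r5.24xlarge",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",  # hypothetical VPC subnet
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",
        "Ebs": {"VolumeSize": 2000, "VolumeType": "gp3"},
    }],
)
print(response["Instances"][0]["InstanceId"])
```

And launching the machine is the easy part: keeping fleets of such instances patched, monitored, and cost-efficient is where the dedicated-IT-team requirement really bites.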
Once these resources are in place, R&D teams face yet another roadblock: installing the patchwork of poorly documented, often unstable, and sometimes incompatible bioinformatics tools necessary to perform their analysis of interest. Mission-critical genomics software commonly originates from the laboratories of academic bioinformaticians, who lack the funding, skill sets, or career incentives to create and maintain high-quality code9. The absence of a harmonized code base means that each life sciences group must ‘start from scratch’, installing and integrating a tangled web of tooling for their specific use case, then facing continued headaches when they inevitably need to modify or upgrade a given analysis workflow. Finally, enforcing robust data provenance and analysis reproducibility10 remains a problem that plagues most organizations. Tools like Nextflow and Docker can help groups mitigate some of these concerns - after, of course, they’ve figured out their own storage/compute solutions.
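Absent a platform that tracks provenance automatically, teams often end up hand-rolling manifests like the sketch below - a minimal, illustrative example (the bwa invocation and file paths are assumptions) that records a tool's version banner, the exact command line, and input checksums alongside a result:

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone

def sha256(path: str) -> str:
    """Checksum an input file so results can be tied to exact data."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Illustrative alignment command; bwa prints its usage/version to stderr
cmd = ["bwa", "mem", "-t", "16", "ref.fa", "reads.fastq"]
banner = subprocess.run(["bwa"], capture_output=True, text=True).stderr

manifest = {
    "command": " ".join(cmd),
    "tool_banner": banner.splitlines()[:3],  # includes the version line
    "inputs_sha256": {"reads.fastq": sha256("reads.fastq")},
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}
with open("provenance.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

Multiply this bookkeeping across every tool in a multi-step workflow and the appeal of built-in provenance becomes obvious.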
After the computational infrastructure is in place and the bioinformatics tooling is up and running, building and customizing bioinformatics pipelines can still be laborious and error-prone, even for seasoned bioinformaticians. What, then, of biologists? Extensive programming training has not yet made it into the standard curriculum of wet-lab scientists. Although most trainees are comfortable with basic coding concepts, few have the bandwidth to learn the intricacies of Python, Bash, R, and/or Perl required to use the same systems as their bioinformatics colleagues. Making even the most routine modifications to a bioinformatics analysis (switching the colors on a heatmap, adding metadata to a plot, looking at a different gene set) can require detailed coding know-how. Biologists must often wait days, weeks, or even months for their highly oversubscribed bioinformatics collaborators to make these basic changes - grinding progress to a halt.
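To illustrate how much know-how even a "simple" tweak assumes, here is what switching a heatmap's color scheme looks like in a typical Python plotting stack (the file name and styling choices are illustrative):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Genes-by-samples expression matrix (file name is illustrative)
expr = pd.read_csv("normalized_counts.csv", index_col=0)

sns.heatmap(
    expr,
    cmap="viridis",  # <- the one-word change a biologist may wait weeks for
    cbar_kws={"label": "normalized expression"},
)
plt.tight_layout()
plt.savefig("heatmap.png", dpi=300)
```

The change itself is one word - but knowing which word, in which argument, of which library's function, is exactly the barrier described above.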
Several established (DNAnexus, BaseSpace Sequence Hub, Galaxy) and newer (Basepair Genomics, LatchBio, BioBox Analytics) no-code bioinformatics platforms exist to empower biologists to analyze their datasets without writing any code at all. Such tools are fantastic, provided the platform has already constructed the exact analysis a user is interested in. Their usability comes at the cost of flexibility: running a pre-existing pipeline is easy, but building a new workflow or customizing an existing one can be extremely challenging. This challenge is especially acute for integrative analyses of multi-omics datasets, where the space of potential analysis approaches explodes relative to single-omics work.
Simply put, an end-to-end bio-IT managed service provider simultaneously solves all of the above problems, enabling both biologists and bioinformaticians to rapidly build, run, and customize bioinformatics workflows. Watershed’s interdisciplinary team of PhD-level biologists, bioinformaticians, and veteran software engineers gives us the perspective and expertise to do exactly that - empowering all of our users to generate bioinformatics insights in minutes instead of months. The Watershed Cloud Data Lab® (CDL) offers secure, scalable, and accessible storage and computation resources, all without having to figure out S3 buckets or EC2 instances. Our simple yet powerful package management system makes it easy to use industry-standard bioinformatics tooling without spending weeks dealing with version incompatibilities or inexplicably failed environment builds.
Built from the ground up for reproducible research, CDL data objects instantly reveal the set of operations used to generate them; users always know which tools and parameters were used to produce every result. Finally, our templated no-code workflows, intuitive user interface, extensive documentation, and patent-pending no-code/low-code transition technology make all of the above accessible to everyone - even researchers who have never coded in their lives.
Of course, no two analyses are ever exactly alike, and our team of world-class bioinformaticians is always here to lend a helping hand. Whether it’s a quick question about a standard workflow (our average response time is 4.5 minutes) or a completely bespoke bioinformatics build-out, with Watershed you’re never alone: we have the expertise and bandwidth to enable your science.
If any of these challenges resonate with you, let’s chat! Book a free live demo or get in touch with us at contact@watershed.ai – our team is eager to help massively accelerate your life sciences research.
References