August 29, 2022

Addressing the Bioinformatician Shortage

We’ve entered a new paradigm of data-intensive drug discovery, but a critical shortage of bioinformaticians paired with an antiquated tech stack threatens to stymie therapeutics development. To better understand and address the current informatics crisis, biotechnology leaders can look to a surprising source - the late Jim Gray, a pioneer in scalable database and transaction processing systems. In the latter half of his career, Gray studied how scalable computing systems could be leveraged for answering key scientific questions, and while doing so, segmented scientific discovery into four distinct paradigms1.

The first paradigm was descriptive empirical science: the earliest scientists observed and recorded natural phenomena, such as Aristotle’s zoological reports2 and Galileo’s discovery of Jupiter’s moons3. The second paradigm, emerging with the Scientific Revolution and the Enlightenment, explored how individual observations could be integrated into cohesive, generalizable theoretical models: Kepler and planetary motion4; Coulomb and electrostatics5; Lavoisier and conservation of mass6. After generating a model, researchers could test whether new observations were consistent with its predictions and, based on these results, update the model to account for new information.

Eventually, these models (think molecular dynamics7 and metabolic flux analysis8) became too complex to solve and explore manually. The invention of digital electronic computers9 in the 20th century enabled scientists to simulate solutions to these problems computationally, unlocking their predictive power and ushering in the 3rd paradigm - computational simulation10.

The present-day scientific community now marches toward the 4th paradigm1. Rather than data analysis answering a pre-defined question, the marriage of data, theory, and simulation drives both the generation and interrogation of key questions. Multi-modal unsummarized datasets, computational simulations, and theoretical frameworks are as much laboratory reagents as polymerase and primers.

Figure 1. Jim Gray’s Four Paradigms as a progression toward data-intensive computational science.

Drug discovery R&D teams have borne witness to the radical advancements brought forward within the 4th paradigm. For example, in the early months of the COVID-19 pandemic, a Vanderbilt-led team developed an integrated pipeline for the accelerated discovery of antiviral antibody therapeutics, combining single-cell mRNA-sequence analysis, bioinformatics, synthetic biology, and high-throughput functional analysis to enable the rapid discovery of highly potent antiviral human monoclonal antibodies11. The resulting antibodies were promptly licensed by AstraZeneca and received emergency use authorization in December 2021 as Evusheld. How can this supercharged data-intensive and collaborative effort be the rule, rather than the exception, for modern drug discovery?

Headwinds: technical debt, recruitment, and more

Today, unlocking these data-intensive analyses requires deep expertise across systems administration, cybersecurity, DevOps, HPC architecture, and production-level software engineering - skill sets most bioinformaticians have no interest in acquiring. This ballooning problem is reflected in the rapid growth of the bioinformatics services market, which is projected to reach USD 5.3 billion by 2026, up from USD 2.5 billion in 2021 (15.8% CAGR)12. The critical shortage of bio-IT capabilities has forced bioinformaticians to take up the superhero role of an "integrated DevOps/sysadmin/SWE/data scientist" instead of focusing on their core skill set – using computation as a lens through which to better understand biological systems.

In software development, there is a concept called "technical debt13." Organizations incur technical debt when development teams, under pressure to deliver, reduce scope or take shortcuts to finish a project – trading long-term pain for short-term gain. Beyond the code itself, the broader bio-IT computing ecosystem is vulnerable to technical debt in its complex configuration, data collection, verification, machine resource management, serving infrastructure, monitoring, and compliance. This impediment needs to be addressed not only in terms of employees or technology, but also in the processes and culture of the organization14. Finally, lack of consensus among internal stakeholders, such as IT organizations, bioinformaticians, administrators, and bench scientists, can grind progress to a halt15.

Technical debt is a difficult problem to solve even under ideal circumstances, and is further compounded by a massive shortfall of software engineering and systems administration expertise in biotech and biopharma. This results in already overworked bioinformatics teams being forced to manage user permissions, ensure regulatory compliance, architect scalable and secure HPC systems, handle package management and versioning, refactor code, improve unit tests, delete dead code, reduce dependencies, tighten APIs, improve internal documentation16… all in addition to the actual responsibilities of a bioinformatician. 

Accumulating technical debt isn’t the only problem. The widening gap between the supply and the demand for bioinformatics talent17 has rapidly become a rate-limiting step for many drug discovery efforts (doubly true for earlier-stage R&D organizations that lack the resources of large biopharma companies). The skills required to excel in bioinformatics are highly sought after; excellent bioinformaticians make similarly excellent data scientists for high-tech, fin-tech, and other deep-pocketed industries, which can offer much higher compensation relative to the life sciences18. Combined with the continuously growing size and variety of biological datasets and the fragmented, fragile bioinformatics tooling ecosystem, this lack of available bioinformatics expertise is poised to stymie discovery pipelines across therapeutic areas19,20.

What are we going to do?

To summarize: biotech and biopharma R&D efforts are bottlenecked by the lack of available bioinformaticians, which is driven by:

  1. Massive technical debt in the form of outdated, inaccessible, and often insecure computational infrastructure, fragmented bioinformatics tooling, and unFAIR data practices that force bioinformaticians to spend time on commodity IT tasks instead of analyzing data
  2. A growing number of business-critical R&D efforts requiring bioinformatics expertise 
  3. Competitive recruitment for limited bioinformatics talent, which is being siphoned by high-tech, fintech, and other deep-pocketed industries 
  4. Uninformed “vetoes” within IT organizations that do not understand the nature and urgency of the problem

To date, efforts to combat the bioinformatics shortage have focused on addressing (2) and (3) by expanding the current analytical workforce. STEM initiatives are building interest in data-intensive professional paths21,22, but these programs will not meet the current need with sufficient scale or urgency. Since we can’t produce more bioinformaticians on demand today, we need to empower the ones we have to work more effectively.

Our experience tells us there are two ways technology can increase the output from a fixed pool of talent: increasing throughput by abstracting away unneeded complexity, or reducing the level of expertise needed to perform basic tasks, freeing up experts to operate at their highest level of training (problem 1, above). Unfortunately, the available hardware and software have widened, rather than closed, this talent gap. For instance, currently available open-source packages can jumpstart an analysis for an experienced bioinformatician, but only after successful configuration of computing infrastructure and installation of a slew of required dependencies. The lack of easy-to-use, cross-tool-compatible data analysis and visualization workflows prevents the broader population of less programming-proficient scientists from even beginning to access the basic functionality of today’s bio-IT ecosystem. For the typical bench biologist, these enormously powerful tools and repositories are a bewildering alphabet soup that they lack the time, skills, or resources to operationalize.
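To make that dependency friction concrete, here is a minimal sketch (in Python, with illustrative tool names - not any particular pipeline's real manifest) of the kind of pre-flight environment check bioinformaticians routinely write by hand before a single read is analyzed, plumbing that has nothing to do with biology:

```python
import shutil

# Illustrative list of command-line dependencies a sequencing pipeline
# might require; real pipelines each carry their own (often much longer) list.
REQUIRED_TOOLS = ["samtools", "bwa", "bcftools"]

def missing_tools(tools):
    """Return the subset of `tools` not found on the current PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    gaps = missing_tools(REQUIRED_TOOLS)
    if gaps:
        print("Install before running the pipeline:", ", ".join(gaps))
    else:
        print("All dependencies found.")
```

Checks like this are the easy part; versioning, configuration, and compute provisioning multiply the burden from there.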

To stress, each individual tool may be perfectly capable of delivering its promised analysis for a trained bioinformatician. However, the deployment requirements of these tools and their inaccessibility to computationally unsophisticated users prevent existing solutions from addressing the bioinformatics talent gap. To unleash the latent potential of the current bioinformatics workforce, we need both tools that augment bioinformaticians and tools that help them collaborate with biologists:

Watershed’s mission is to eliminate the bio-IT obstacles separating scientists from the insights they need to build better drugs faster, delivering on the promise of the 4th paradigm and revolutionizing data-intensive research for discovery teams. Our fully verticalized hardware and software stack dramatically accelerates bioinformaticians’ analyses, addressing head-on the critical gap between the supply and demand of skilled bioinformaticians and providing a driving force for innovation and discovery. To learn more about how Watershed can kick-start the development of a data science culture within your organization, please reach out - we are always happy to chat!

  1. Hey AJG, ed. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research; 2009.
  2. Leroi AM, MacPherson S, Koutsogiannopoulos D. The Lagoon: How Aristotle Invented Science. Bloomsbury; 2015.
  3. Galilei G. Sidereus Nuncius or, The Sidereal Messenger. 2nd ed. The University of Chicago Press; 2015.
  4. Caspar M, Hellman CD. Kepler. Dover Publications; 1993.
  5. Coulomb CA. Premier Mémoire sur l’Électricité et le Magnétisme. Histoire de l’Académie Royale des Sciences; 1785.
  6. Lavoisier AL, MacKie D. Elements of Chemistry: In a New Systematic Order, Containing All the Modern Discoveries. Repr. of the 1790 ed. Dover Publications; 1965.
  7. Hollingsworth SA, Dror RO. Molecular Dynamics Simulation for All. Neuron. 2018;99(6):1129-1143. doi:10.1016/j.neuron.2018.08.011
  8. Antoniewicz MR. Methods and advances in metabolic flux analysis: a mini-review. Journal of Industrial Microbiology and Biotechnology. 2015;42(3):317-325. doi:10.1007/s10295-015-1585-x
  9. Dickinson AH. Accounting Apparatus. U.S. Patent 2,580,740; filed Jan. 20, 1940; granted Jan. 1, 1952.
  10. Pugh EW. Building IBM: Shaping an Industry and Its Technology. MIT Press; 2009.
  11. Gilchuk P, Bombardi RG, Erasmus JH, et al. Integrated pipeline for the accelerated discovery of antiviral antibody therapeutics. Nat Biomed Eng. 2020;4(11):1030-1043. doi:10.1038/s41551-020-0594-x
  13. Sculley D, Holt G, Golovin D, et al. Hidden Technical Debt in Machine Learning Systems. In: Advances in Neural Information Processing Systems 28 (NIPS 2015).
  15. Eisenhardt KM. Making Fast Strategic Decisions in High-Velocity Environments. Academy of Management Journal. 1989;32(3):543-576. doi:10.5465/256434
  19. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018. doi:10.1038/sdata.2016.18
  20. Abdurakhmonov IY. Bioinformatics: Basics, Development, and Future. In: Abdurakhmonov IY, ed. Bioinformatics - Updated Features and Applications. InTech; 2016. doi:10.5772/63817
  21. Waite AM, McDonald KS. Exploring Challenges and Solutions Facing STEM Careers in the 21st Century: A Human Resource Development Perspective. Advances in Developing Human Resources. 2019;21(1):3-15. doi:10.1177/1523422318814482
Mark Kalinich, MD, PhD
Co-founder and CSO