Code to Cure is a newsletter co-hosted by the Wyss Institute at Harvard University and Milad Alucozai, exploring how the intersection of AI, biology, and healthcare is transforming medicine.
Imagine a world where algorithms predict the next pandemic before it strikes, where you receive custom-designed treatments tailored to your individual genetic makeup, lifestyle, and medical history, and where scientists test new drugs and therapies on digital twins in simulated clinical trials, speeding up the development of life-saving treatments and reducing our reliance on traditional, time-consuming methods. We live in an age where the boundaries between the digital and the physical are dissolving. We’re no longer just passively observing life through a microscope; we’re actively engineering it. This isn’t science fiction; it’s the groundbreaking reality of biology today, where we’re rewriting the very code of life itself. Powerful technologies are converging, generating a tsunami of biological data—petabytes upon petabytes of intricate information. Universities, hospitals, pharmaceutical companies, startups, and technology giants are racing to harness this data, equipping a new generation of researchers with the tools to decipher life’s deepest secrets and conquer its most daunting challenges.
From In Silico to In Vitro to In Vivo
From the moment we step into a classroom, we’re subtly conditioned to see the world in distinct categories. Biology, engineering, public health – each subject occupies its own neatly labeled box, seemingly separate and distinct. We learn about cells and ecosystems in one class, circuits and algorithms in another, and epidemiology and healthcare systems in another. This compartmentalization, while perhaps helpful for organizing curricula, can obscure the deep and intricate connections that weave these disciplines together.
Traditionally, the realms of computational biology and informatics, often referred to as the “in silico” domain, have operated as distinct disciplines from the myriad wet lab fields within the life sciences, which primarily revolve around experimentation using molecular biology tools, cell lines, microbial cultures, and biochemical assays, known as “in vitro” studies. These, in turn, stand apart from fields concentrating on research within animal models and clinical settings, termed “in vivo” investigations. This segregation reflects the distinct methodologies and emphases of each domain: computational approaches center on data analysis, modeling, and simulation, while wet lab experiments probe biological phenomena through physical experimentation and observation.
Nevertheless, contemporary trends are blurring these traditional boundaries as interdisciplinary collaborations become increasingly prevalent. In the modern landscape of scientific inquiry, the delineation between in silico, in vitro, and in vivo is growing increasingly porous, with researchers recognizing the value of integrating diverse methodologies to tackle complex biological questions comprehensively. The research process is evolving into a more dynamic, iterative loop where hypothesis generation, hypothesis testing, results interpretation, and hypothesis refinement intertwine cyclically.
This iterative nature underscores the fluidity of scientific inquiry as researchers navigate between computational analyses, experimental validations, and clinical observations to refine their understanding of biological phenomena. As such, the once rigid divisions between computational and experimental approaches give way to a more holistic and synergistic approach to scientific discovery.
Building Software for Science
The evolution of biological research is not solely confined to advancements within the laboratory setting. Parallel to the technological progress in experimental methodologies, significant strides have been made in software development, fundamentally altering the landscape of computational biology. These advancements encompass a wide range of areas, including:
- Code generation — Modern Integrated Development Environments (IDEs) offer features such as code linting for automated error checking and even code generation facilitated by large language models (LLMs), significantly enhancing programming efficiency.
- Source code management — Platforms like GitHub, GitLab, and Bitbucket promote collaborative software development and version control, streamlining the development process.
- Software packaging and delivery — Containerization and orchestration technologies like Docker and Kubernetes simplify software deployment and management, ensuring reproducibility and scalability.
- Cloud computing — Cloud platforms such as AWS, Google Cloud, and Azure provide access to vast computational resources, enabling researchers to perform complex analyses without substantial hardware investments.
- Testing — Continuous Integration and Continuous Deployment (CI/CD) pipelines automate software testing and deployment, ensuring software quality and reliability (a minimal example follows this list).
- Project management — Agile and Scrum methodologies, often facilitated by tools like Jira, enhance project organization and efficiency.
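To make the testing item above concrete, here is a minimal sketch of the kind of unit test a CI/CD pipeline could run automatically on every commit. The gc_content function and its expected values are hypothetical illustrations, not code from any particular platform, and the example assumes pytest is available.

```python
# Minimal sketch of an automated test suite written for pytest (assumed installed).
# A CI/CD pipeline would execute these checks on every commit, catching
# regressions before code reaches production analyses.
import pytest


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence (hypothetical helper)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def test_gc_content_half():
    # Two of the four bases are G or C.
    assert gc_content("ATGC") == 0.5


def test_gc_content_is_case_insensitive():
    assert gc_content("atgc") == gc_content("ATGC")


def test_gc_content_rejects_empty_sequence():
    with pytest.raises(ValueError):
        gc_content("")
```

Wiring a command such as pytest into a CI job turns these checks into an automatic quality gate for every proposed change.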
These advancements have profound implications for the scientific discovery process. The development of specialized software tools tailored for scientific applications reduces the barrier to entry for researchers, enabling effective utilization of computational methods without requiring extensive programming expertise. This democratization of computational resources fosters a convergence of traditionally distinct research domains.
Consequently, the conventional boundaries delineating in silico, in vitro, and in vivo studies are becoming increasingly porous. Laboratories are transitioning from isolated entities to interdisciplinary hubs, seamlessly integrating various methodologies. This collaborative environment, facilitated by accessible computational tools, accelerates scientific progress. Furthermore, incorporating ML algorithms within this framework empowers researchers to optimize their workflows, expedite data analysis, and accelerate the extraction of meaningful insights.
The Role of ML in Discovery: Hypothesis Generation vs. Hypothesis Testing
Machine learning has emerged as a powerful tool for generating hypotheses in biological research. By identifying patterns and trends within vast datasets, ML algorithms can predict outcomes and suggest potential avenues for investigation. This predictive capability is especially valuable in complex fields like genomics, proteomics, and drug discovery, where researchers grapple with massive amounts of data. However, the inherent variability and complexity of biological systems often pose challenges for ML models. Even with extensive data, accurately predicting biological phenomena can be difficult because of the multitude of interacting factors and the stochastic nature of many biological processes.
This is where traditional laboratory-based hypothesis testing plays a crucial role. Experimental validation provides a critical reality check, grounding the predictions generated by ML models in empirical evidence. By carefully designing and conducting experiments, researchers can test the validity of ML-generated hypotheses and refine the models based on real-world observations. This iterative process of prediction and validation is essential for ensuring the accuracy and reliability of ML models in biological research.
Ultimately, the synergy between ML and experimental validation drives scientific discovery. The iterative feedback loop, where ML models generate hypotheses and experiments provide validating evidence, allows for continuous refinement and a deeper understanding of complex biological systems. This dynamic interplay between computational prediction and empirical observation holds immense promise for accelerating breakthroughs in fields ranging from disease modeling and drug development to personalized medicine and synthetic biology.
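To give a concrete flavor of this loop, the sketch below uses scikit-learn to rank hypothetical candidate genes by predicted disease association, which is the point at which bench validation would take over. The expression matrix, labels, and gene names are synthetic placeholders invented for illustration, not data or methods from any real study.

```python
# A minimal sketch of the ML-driven hypothesis-generation loop described above.
# All data here are synthetic placeholders; a real analysis would use curated
# experimental measurements and a carefully validated model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical training set: feature vectors for genes with a known disease
# association label (1 = associated, 0 = not associated).
X_known = rng.normal(size=(200, 30))
y_known = rng.integers(0, 2, size=200)

# Unlabeled candidate genes to prioritize.
X_candidates = rng.normal(size=(50, 30))
candidate_names = [f"gene_{i}" for i in range(50)]

# Hypothesis generation: the model ranks candidates by predicted association.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_known, y_known)
scores = model.predict_proba(X_candidates)[:, 1]

top_candidates = sorted(zip(candidate_names, scores), key=lambda t: t[1], reverse=True)[:5]
print("Candidates to validate at the bench:", top_candidates)

# Hypothesis testing happens in the lab: the top-ranked candidates are validated
# experimentally, and the confirmed or refuted labels are appended to the
# training set so the model can be retrained in the next iteration.
```

The value of such a sketch is the loop itself: each round of experimental results becomes new training data, so the model’s hypotheses should become progressively better grounded.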
Horizontal & Vertical: Bridging the AI Gap
The world of AI in the life sciences is often divided into two distinct approaches: horizontal and vertical. Horizontal AI platforms excel at tackling the universal challenges of data management and engineering across various biological disciplines. These platforms streamline workflows by optimizing computing resources, automating data cleaning, and transforming raw, unstructured data into easily accessible formats. Think of them as the foundational infrastructure, providing a robust and efficient base for managing the ever-growing flood of biological data.
Vertical AI platforms, on the other hand, delve into the specifics. They are designed for hypothesis-driven analysis, offering tailored solutions for research questions or problems within a specific domain. These platforms provide specialized tools and workflows optimized for analyzing genomic sequences, predicting protein structures, or simulating drug interactions. They act as expert systems, providing in-depth analysis and insights within a focused area of research.
However, the true power of AI in the life sciences lies in bridging the gap between these two approaches to form a new technology stack. By integrating horizontal and vertical AI capabilities within a unified technological framework, we can unlock a new era of computational life sciences. This hybrid approach combines the strengths of both, enabling efficient and scalable analysis of diverse data types across multiple domains. Imagine a platform that not only manages and processes vast datasets but also provides specialized tools for genomics, proteomics, and drug discovery, all within a single, interoperable environment. This fosters collaboration between researchers with varying levels of computational expertise, accelerating innovation and driving the development of even more powerful tools and applications.
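To sketch what such an integrated stack might look like in code, here is a purely hypothetical Python outline in which a shared horizontal data layer feeds interchangeable vertical analysis modules. None of the class or method names correspond to an existing product; they only illustrate the separation of concerns described above.

```python
# Hypothetical outline of a hybrid stack: one horizontal data layer shared by
# many vertical, domain-specific analysis modules. Names are illustrative only.
from abc import ABC, abstractmethod


class HorizontalDataLayer:
    """Shared infrastructure: ingests raw records and serves cleaned datasets."""

    def __init__(self):
        self._datasets: dict[str, list[dict]] = {}

    def ingest(self, name: str, raw_records: list[dict]) -> None:
        # Stand-in for cleaning and normalizing raw, unstructured data.
        self._datasets[name] = [record for record in raw_records if record]

    def get(self, name: str) -> list[dict]:
        return self._datasets[name]


class VerticalModule(ABC):
    """A domain-specific analysis that runs on top of the shared data layer."""

    @abstractmethod
    def analyze(self, data: list[dict]) -> dict:
        ...


class VariantCallingModule(VerticalModule):
    def analyze(self, data: list[dict]) -> dict:
        # Placeholder "analysis": count records flagged as variants.
        return {"n_variants": sum(1 for record in data if record.get("is_variant"))}


# Usage: one data layer can feed many vertical analyses side by side.
layer = HorizontalDataLayer()
layer.ingest("cohort_A", [{"is_variant": True}, {"is_variant": False}, {}])
print(VariantCallingModule().analyze(layer.get("cohort_A")))
```

The design point is the interface boundary: vertical modules never worry about ingestion or storage, and the horizontal layer never needs domain knowledge, which is what lets new specialized tools plug into the same environment.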
The Next Generation of Platforms
The landscape of computational biology tools is a testament to the ingenuity of both biologists who ventured into coding and computer scientists who embraced biological challenges. This fusion of expertise has given rise to a rich ecosystem of open-source tools, with repositories like Bioconda offering nearly 10,000 packages. These tools, often tailored to specific research needs, have been instrumental in advancing various areas of biological inquiry.
Many biotech companies built their own proprietary software platforms as the field matured. This approach allowed them to address the unique challenges within their specific sub-domains. However, these legacy platforms, often developed before the latest advancements in software engineering, now face limitations. They often lack the user-friendly interfaces, collaborative features, and seamless integration capabilities that modern research demands.
A new wave of platforms is emerging to address these limitations. Designed with the modern scientist in mind, these platforms prioritize intuitive interfaces, enabling researchers with diverse computational backgrounds to easily navigate and analyze data. They emphasize collaboration, allowing teams to share data and insights seamlessly. And they increasingly incorporate artificial intelligence, offering powerful tools for accelerating analysis and discovery. This shift marks a move towards more user-centric, efficient, and collaborative computational biology, empowering researchers to tackle increasingly complex biological questions.
Emerging Platforms:
- Seqera Labs: Spearheading a movement towards efficient and reproducible research, Seqera Labs provides a suite of tools built around Nextflow, the popular open-source workflow language. Their platform empowers researchers to design scalable and reproducible data analysis pipelines, particularly for cloud environments. By emphasizing automation and flexibility, Seqera streamlines complex computational workflows across diverse biological disciplines, making data-intensive research more scalable and collaborative.
- Form Bio: Aimed at democratizing access to computational biology, Form Bio provides a comprehensive tech suite built to enable accelerated cell and gene therapy development and computational biology at scale. Its emphasis on collaboration and intuitive design fosters a more inclusive research environment to help organizations streamline therapeutic development and reduce time-to-market.
- Code Ocean: Addressing the critical need for reproducibility in research, Code Ocean provides a unique platform for sharing and executing research code, data, and computational environments. By encapsulating these elements in a portable and reproducible format, Code Ocean promotes transparency and facilitates the reuse of research methods, ultimately accelerating scientific discovery.
- Pluto Biosciences: Championing a collaborative approach to biological discovery, Pluto Biosciences offers an interactive platform for visualizing and analyzing complex biological data. Its intuitive tools empower researchers to explore data, generate insights, and seamlessly share findings with collaborators. This fosters a more dynamic and interactive research process, facilitating knowledge sharing and accelerating breakthroughs.
Open Source Platforms:
- Galaxy: A widely used open-source platform for bioinformatics analysis. It provides a user-friendly web interface and a vast collection of tools for various tasks, from sequence analysis to data visualization. Its open-source nature fosters community development and customization, making it a versatile tool for diverse research needs.
- Bioconductor: A prominent open-source platform for bioinformatics analysis that shares Galaxy’s commitment to accessibility and community-driven development. It leverages the power of the R programming language, providing a wealth of packages for tasks ranging from genomic data analysis to statistical modeling. Its open-source nature fosters a collaborative environment where researchers can freely access, utilize, and contribute to a growing collection of tools.
Pushing Boundaries Together
The life sciences are undergoing a revolution fueled by the rapid advancement and adoption of ML. This transformation is driven by a convergence of factors, including the urgent demands of the COVID-19 pandemic, which showcased the power of ML in addressing complex problems like disease modeling and drug discovery. Beyond the pandemic, the rise of user-friendly platforms like Seqera Labs and Form Bio, along with open-source tools like Galaxy and Bioconductor, is democratizing access to powerful AI technologies, empowering a broader range of researchers to leverage ML for groundbreaking discoveries.
A growing emphasis on diversity within developer teams further amplifies this new era of biological research. By bringing together a wider range of perspectives and experiences, these teams are better equipped to tackle complex challenges, reduce algorithmic bias, and drive innovation across the field. The competitive landscape, spurred by successes like ChatGPT, is also accelerating advancements in large language models and AI, pushing the boundaries of what’s possible in the life sciences.
This convergence of AI, accessible platforms, and collaborative tools is ushering in a future where scientific discovery is faster, more efficient, and more inclusive. Imagine a world with readily available personalized cures, diseases predicted and prevented before they occur, and the dream of a cure within reach. This is the future that bioinformatics is actively shaping – a future where breakthroughs transform our understanding of life and improve human well-being.