
From Data to Drugs: The Role of Artificial Intelligence in Drug Discovery

Code to Cure is a newsletter, co-hosted by the Wyss Institute at Harvard University and Milad Alucozai, on how the intersection of AI, biology, and healthcare is transforming medicine.

By Milad Alucozai, Will Fondrie, and Megan Sperry

The Challenge of Traditional Drug Discovery

This issue was written by (left to right) Milad Alucozai, MSc, MPH, DrPH, Will Fondrie, Ph.D., and Megan Sperry, Ph.D. Credit: Wyss Institute at Harvard University

Traditional drug discovery is slow, expensive, and prone to high clinical failure rates due to poor safety profiles, lack of efficacy, or both. Developing a new drug takes 13–15 years, and fewer than 10% of Phase I candidates receive FDA approval. The average R&D investment for a new product exceeds $2.5 billion when out-of-pocket expenses and abandoned trials are taken into account. Pharmaceutical companies, along with academic and translational researchers, are broadly interested in approaches that would accelerate this process. The goal is better patient outcomes and reduced costs, which in turn encourage innovation and investment in new drugs. Emerging techniques in computational biology, and specifically artificial intelligence, have shown early potential to provide that acceleration.

Lessons from the Frontlines of AI in Drug Discovery

  1. Pass the ball between the wet and dry lab. We mentioned this point in the first article. Robust iteration between wet lab and dry lab is essential, and on the best teams, it’s hard to even tell where the line between these groups is. This collaboration is especially critical because the data underlying a model is always limited and biased by the experiments that generated it. By iterating early in the development process, the team can more quickly identify underlying issues and tune the model appropriately. It is often better to start iterating when a model is “imperfect” than to spend years optimizing toward the wrong target. This is another opportunity to apply reinforcement learning and active learning approaches, which can be powerful even in a low-data regime.
  2. The power of relevant data to empower experts. While the adage “garbage in, garbage out” highlights the importance of data quality, the reality is far more nuanced. Even outside of biology, the success of companies like OpenAI underscores this point: their strategic focus on generating high-quality datasets has been crucial in refining their models and achieving breakthroughs in AI. Those on the frontlines in biology understand that no magical, perfect dataset exists. That said, the data underlying a model is essential because it defines the possible solution space for the model and the boundaries of what is likely to be predicted or generated. While several high-quality, publicly available databases have broad applicability, any given one may be appropriate only for specific questions or products. For example, a compound library may be missing chemicals with particular substructures, or a phenotypic screen may have been performed on a cell line quite remote from the target tissues. In response, some companies invest in generating their own proprietary datasets, which can be a time-consuming and expensive aspect of computational drug discovery.
  3. There are new chemical spaces to be explored. In silico approaches to drug discovery vary widely, with some predicting existing drugs for repurposing and others identifying potential drugs from large compound libraries. A third category of algorithms builds novel compounds using generative AI approaches. Across methods, we can see that new chemical spaces are waiting to be explored; there are thought to be more than 10^60 pharmacologically active compounds, which outnumber the stars in the nighttime sky. The resulting structural classes can be novel for a drug category, such as antibiotics, where deep learning has contributed to identifying novel active compounds. The design of de novo structures also opens up the possibility of compounds at the edge of what is currently known across all drug classes; however, these compounds carry the most risk regarding synthesizability, druggability, and safety profile.
  4. It’s not just about small molecules. In addition to small-molecule development, in silico-based discovery is being used for medicinal macromolecules, such as designing antimicrobial peptides (AMPs), therapeutic proteins, and CRISPR-Cas9 systems, which are well reviewed here. Larger molecules like antibodies, proteins, gene therapies, and RNA-based treatments accounted for 40% of FDA approvals in 2022 and represent an essential focus for in silico development.
  5. Compound identification is only part of the equation. To mitigate the myriad other risks involved in drug discovery, especially with completely de novo designs, there has been an explosion in approaches for predicting potency, toxicity, and other drug characteristics. Recent work has used reinforcement learning to fine-tune target properties, such as synthesizability and drug-like properties. Because reinforcement learning optimizes towards a user-defined target, it can be used to narrow outcomes within an enormous possible solution space.
  6. Let everyone in on the discovery. Related to our first point on establishing robust iteration between wet and dry lab teams, integrating tools that enhance accessibility to AI and computation opens up discovery across functional areas of a research team or business. In addition to this being a critical mindset, many platforms support this type of work with user-friendly software that non-computational team members can use, and within which computational scientists are permitted to build custom software. With this approach, we can share data and capture knowledge from team members for incorporation into a model and to avoid fruitless paths. More broadly, there is an opportunity to combine and share knowledge across entities in the computational drug discovery space, including academics, pharmaceutical companies, software developers, and non-profits. Organizations like the Open Molecular Software Foundation are developing open-source software and community connections within the molecular sciences to accelerate discovery and innovation.
There is a broad interest in using AI models to accelerate the drug discovery process. Credit: Will Fondrie
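
The first lesson above, iterating early with an imperfect model, can be illustrated with a toy active learning loop. This is a minimal sketch under stated assumptions, not a method described in this article: `run_assay` is a hypothetical stand-in for a wet lab measurement, each "compound" is reduced to a single made-up descriptor value, and the nearest-neighbor surrogate model was chosen only for brevity.

```python
import random


def run_assay(x):
    """Hypothetical stand-in for a wet lab measurement (toy activity curve)."""
    return -(x - 0.7) ** 2


def predict(x, labeled):
    """1-nearest-neighbor surrogate: (predicted activity, distance to known data)."""
    nearest = min(labeled, key=lambda item: abs(item[0] - x))
    return nearest[1], abs(nearest[0] - x)


def active_learning(pool, n_rounds=8, explore_weight=1.0, seed=0):
    """Alternate between model-guided selection and new measurements."""
    rng = random.Random(seed)
    pool = list(pool)  # copy so the caller's list is left untouched
    labeled = []

    # Start from two randomly chosen measurements (the "imperfect" model).
    for _ in range(2):
        x = pool.pop(rng.randrange(len(pool)))
        labeled.append((x, run_assay(x)))

    for _ in range(n_rounds):
        # Acquisition score: predicted activity plus a bonus for candidates
        # that are far from anything measured so far.
        def score(x):
            pred, dist = predict(x, labeled)
            return pred + explore_weight * dist

        best = max(pool, key=score)
        pool.remove(best)
        labeled.append((best, run_assay(best)))  # "wet lab" step

    return labeled


candidates = [i / 50 for i in range(51)]
measured = active_learning(candidates)
top_x, top_y = max(measured, key=lambda item: item[1])
print(f"measured {len(measured)} of {len(candidates)} candidates; best x = {top_x}")
```

The `explore_weight` parameter is the knob a team would tune in this sketch: a larger value favors untested regions of the candidate pool, while a smaller value favors candidates the current model already rates highly.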

Leveling the Playing Field: How AI Can Drive Drug Development for All

In silico drug discovery provides an opportunity to impact spaces beyond traditional small-molecule drugs in the largest disease markets. As mentioned, AI is used to design larger molecules, which account for an increasing proportion of FDA-approved medicines and a larger market share than small molecules. Further, biologics are particularly well suited to AI-augmented development due to the many physicochemical properties that must be optimized, creating a highly complex and multidimensional search space. The critical technical considerations for machine learning in biologics are well-reviewed here.

In sectors where funding is challenging, return on investment may be meager, and patient need is high, computational approaches can provide a lower-cost way to narrow the set of compounds to be tested in vitro and in vivo. This has been explored academically for antibiotic drugs and through a related non-profit, Phare Bio, which was co-founded by Wyss Core Faculty member James J. Collins, Ph.D. This work comes amid challenges in the broader infectious disease market, including Phase 2 failures, investor disengagement, and layoffs.

Diseases that primarily affect women, and even healthy female biology, are receiving a long-overdue focus: the area has recently received increased funding through ARPA-H and has been recognized as an essential area of research and development at the Wyss Institute. Despite the need to understand how drugs differentially impact disease in females and to develop therapeutics for health issues specific to women, large pharma has deprioritized R&D investment in women’s health. Broadly, women’s health has failed to attract industry funding, accounting for only 1% of R&D investments outside of oncology.

The Power of Foundation Models in AI Drug Discovery

In recent years, the rise of large language models (LLMs) designed specifically for biological sequences and chemical structures has transformed AI-driven drug discovery. DeepMind’s AlphaFold breakthrough in protein structure prediction marked the dawn of a new era, where powerful models could predict complex protein structures with remarkable accuracy. AlphaFold’s success was followed by a wave of new models, including David Baker’s RoseTTAFold and Meta’s Evolutionary Scale Modeling (ESM) family, which provide invaluable tools for understanding protein structure and function on an unprecedented scale.

These foundation models are also fueling a surge in biotech innovation. Companies like Isomorphic Labs—a DeepMind spinout that has recently secured partnerships with Eli Lilly and Novartis—and EvolutionaryScale, which launched this year with $142 million in funding, are at the forefront of applying these capabilities in the real world. Additionally, the University of Washington’s Institute for Protein Design has spun out a growing ecosystem of companies leveraging these technologies for protein design and drug discovery.

Another striking entrant into the biological foundation model space is Amgen, which recently introduced its open-source protein language model (pLM), AMPLIFY. Amgen has made all of AMPLIFY’s code, data, and model weights publicly available, effectively democratizing access to a high-quality pLM for diverse drug discovery applications. Amgen’s Chief Technology Officer, Dr. David Reese, said, “AMPLIFY has the potential to transform medicine through the acceleration of protein sequence prediction. It proves that data quality can surpass sheer model size, offering faster, more accurate, and cost-effective solutions for protein science.” AMPLIFY stands out not only for its accessibility but also for its efficiency: it achieves impressive performance with only a fraction of the parameters found in competing models. This move by Amgen is a significant win for the scientific community, as it provides researchers and startups with an adaptable, fully open-source model to integrate into their AI-driven research.

The true power of foundation models for drug discovery lies in their ability to act as launchpads for more specialized models that incorporate proprietary data. Foundation models significantly lower computational costs by offering a pre-trained base for various biology tasks. Much like fine-tuning LLMs in natural language (e.g., GPT or Llama), building atop these models enables companies to maximize the impact of their proprietary data. Because the foundation model already carries prior knowledge of the biological domain, companies can tune a much smaller number of parameters on their unique datasets, allowing for faster, more cost-effective model development and rapid iteration.
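
The pattern described above, a frozen pre-trained base with a small number of trainable parameters on top, can be sketched in a few lines. This is an illustrative toy under stated assumptions: `frozen_embed` is a hypothetical stand-in for a pre-trained protein language model (here it only computes amino acid composition), and the sequences and labels are made up rather than taken from any real dataset.

```python
import math

HYDROPHOBIC = set("AVILMFWY")
CHARGED = set("DEKR")


def frozen_embed(seq):
    """Hypothetical stand-in for a frozen foundation model: fixed features only."""
    n = len(seq)
    return [
        sum(c in HYDROPHOBIC for c in seq) / n,  # hydrophobic fraction
        sum(c in CHARGED for c in seq) / n,      # charged fraction
        1.0,                                     # bias term
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(seqs, labels, lr=0.5, epochs=500):
    """Fit a small logistic-regression head; the encoder is never updated."""
    feats = [frozen_embed(s) for s in seqs]  # computed once, encoder is frozen
    w = [0.0] * len(feats[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(feats, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        # Plain gradient descent on the mean log-loss gradient.
        w = [wi - lr * g / len(feats) for wi, g in zip(w, grad)]
    return w


def predict(w, seq):
    """Probability of the positive class for a new sequence."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, frozen_embed(seq))))


# Made-up "proprietary" data: label 1 for hydrophobic peptides, 0 for charged.
toy_seqs = ["AVILMF", "WYAVIL", "DEKRDE", "KRDEKR"]
toy_labels = [1, 1, 0, 0]
head = train_head(toy_seqs, toy_labels)
print(f"trainable parameters: {len(head)}")
print(f"P(positive | 'AVILAV') = {predict(head, 'AVILAV'):.3f}")
```

Only the three weights in the head are trained here. In practice the frozen encoder would hold most of the parameters, which is where the cost savings described above would come from.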

Talus Bio is one company that is already effectively leveraging these advantages. This startup has developed a platform focused on designing drugs for transcription factors, a challenging class of targets traditionally considered undruggable. Talus’s approach combines two ingredients: custom AI models that are iteratively improved with new data generated from its platform, and the foundational knowledge in pre-trained models for proteins and chemistry. This blend enables Talus to make the most of its proprietary data, ultimately increasing the speed and efficiency of drug candidate selection and development.

AI’s Promise for Drug Discovery: A Measured Approach

Although AI for drug discovery has been talked about for some time, sophisticated AI is relatively recent. Compared to other industries, which have already seen the early impacts of AI tools, pharmaceuticals will see the effect over a longer time horizon because of the industry’s longer development timelines. With several AI-identified and AI-designed drugs in clinical trials and more in the pipeline, the in silico drug discovery hypothesis will soon be tested. Reflection on the successes and failures offers additional insights, with recent readouts suggesting a mix of wins and losses thus far. Because AI has the potential to accelerate drug discovery dramatically, those who wait for AI to have an impact may find themselves behind and struggling to catch up.

Dr. Reese said, “With the dawn of these new technologies, we are poised to chart new territories that help get medicines into the hands of patients faster than ever before. In protein drug development, generative biology has helped cut antibody discovery times in half and when we look at things like clinical trial recruitment, where it can take up to 18 months to enroll a mid-stage trial, machine learning models have the potential to similarly reduce those times by half. These are significant improvements, especially when you start compounding them across the journey a medicine takes from idea to patient.”

With the dawn of these new technologies, we are poised to chart new territories that help get medicines into the hands of patients faster than ever before.

David Reese, M.D., CTO Amgen

On the technical front, although more appropriate training datasets and reinforcement learning can help, generative AI often suggests compounds that are challenging or impossible to synthesize or lack drug-like properties. New computational approaches and improved iteration between dry lab and wet lab teams may lead to improvements in this area. To enhance the efficiency of in silico methods, research teams and companies are still working toward solutions for efficiently collecting, storing, and retrieving relevant data to support discovery.

Furthermore, assessing the effectiveness of AI chemistry models is fraught with challenges—it has even been suggested that current models provide little more information than simply compressing the text representation of a molecule for many downstream tasks. These challenges, in part, arise from inadequate and inconsistent benchmarking of models for drug discovery. MoleculeNet contains perhaps the most common benchmark datasets in the field, yet it has many known flaws, leading to inconclusive results. However, there has been an exciting new entrant into benchmarking for drug discovery: Polaris is a platform to host benchmarking datasets for drug discovery and consistently evaluate and compare proposed methods. One exciting aspect of Polaris is the investment from major pharma players; the current steering committee consists of folks from Relay Therapeutics, Merck, Pfizer, Blueprint Medicines, Nimbus Therapeutics, AstraZeneca, Johnson & Johnson, Bayer, Novartis, and Valence Labs.
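
To make the compression comparison above concrete, here is a minimal sketch of what a compression-only baseline can look like: a normalized compression distance (NCD) between SMILES strings, used for a nearest-neighbor lookup. This illustrates the general idea and is not the method or code of the work referred to above; the tiny reference set, its labels, and the helper names are all assumptions made for the example.

```python
import zlib


def csize(text):
    """Size in bytes of the zlib-compressed text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(a, b):
    """Normalized compression distance: smaller means more shared structure."""
    ca, cb = csize(a), csize(b)
    cab = csize(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def nearest_label(query, labeled):
    """Label of the reference string closest to the query under NCD."""
    return min(labeled, key=lambda item: ncd(query, item[0]))[1]


# Toy reference set of (SMILES, name) pairs, chosen only for illustration.
reference = [
    ("CC(=O)OC1=CC=CC=C1C(=O)O", "aspirin"),
    ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "caffeine"),
]

query = "OC(=O)C1=CC=CC=C1O"  # salicylic acid
for smiles, name in reference:
    print(f"NCD(query, {name}) = {ncd(query, smiles):.3f}")
print("nearest:", nearest_label(query, reference))
```

A baseline like this has no learned parameters at all, so a benchmark on which a trained chemistry model fails to beat it clearly is one worth treating with caution.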

Big pharma is eagerly watching this space, but most are waiting to see who the first winners are before making substantial bets. A few have made efforts of their own, e.g., Genentech. Since many big pharma companies are flush with cash and M&A deals have rebounded, there is a significant acquisition opportunity for AI drug discovery startups and biotech companies with promising early clinical data.

How Startups Can Win with Large Language Models

The next generation of LLMs will be defined by specialization and accessibility, not just hyper-scale models. While massive LLMs like those developed by Google and OpenAI will continue to advance, the future lies in a diverse ecosystem of LLMs tailored to specific tasks and industries. This will allow for a “mixture of expertise” where multiple models collaborate to achieve superior results. Imagine a biotech company leveraging a specialized LLM for drug discovery alongside another for clinical trial analysis, combining their strengths for a more comprehensive approach.

This shift opens doors for startups to thrive in the LLM space. Rather than engaging in a costly arms race to build the largest model, they can focus on developing targeted LLMs that excel in niche applications. This could involve creating models optimized for analyzing genomic data, predicting protein structures, or generating personalized treatment plans. By leveraging existing hyper-scale models for general tasks and augmenting them with their own specialized LLMs, startups can deliver unique value and drive innovation in the biotech industry without excessive spending on training and infrastructure.

Future of AI-Driven Drug Discovery

The FDA Modernization Act 2.0 opened the door for non-animal-based testing in preclinical trials. Technologies like organoids and organs-on-chips hold promise for studying the response of an organ to novel therapeutics. Credit: Wyss Institute at Harvard University

Walking the tightrope of AI development in biotech requires a careful balancing act. We envision agentic AI designing groundbreaking therapies, yet we must ensure these powerful tools act ethically, transparently, and without bias, especially given the profound implications for human health. This balancing act extends to the foundation of AI drug discovery: the experimental models that generate our data. Animal models are still the norm today, and in addition to raising animal welfare concerns, their utility is limited by the differences in physiology between humans and other species.

In December 2022, the FDA Modernization Act 2.0 opened the door for non-animal-based testing in preclinical trials. Moving beyond animal testing, sophisticated human cell culture techniques like organoids and organs-on-chips offer a more ethical and automated path to preclinical testing. Ex vivo perfusion of transplant-declined human organs has also emerged as a promising platform for studying an organ’s response to novel therapeutic strategies in a way that closely mirrors human biology. These techniques and approaches generate vast amounts of complex biological data. AI and machine learning are well suited to analyzing this data, identifying patterns, and predicting drug responses or disease progression.

AI and advanced human-relevant models will enable a future in which drug discovery is more accurate, efficient, and humane. By embracing this approach, we can accelerate the discovery and development of life-saving therapies, reduce cost, unlock the mysteries of human biology, and ultimately improve patient outcomes and access to critical treatments.
