Through a Glass Darkly: Mechanistic Interpretability as the Bridge to End-to-End Biology

The year is 2024 and machine learning for biology (“bio-ML”) has burst onto the scene. From Nobel prizes for structural biology prediction to venture-funded GPT-3-scale protein sequence models, things feel a bit frenzied right now.

But, for all the excitement, there are storm clouds on the horizon:

  • If we extend the compute-optimal protein sequence modeling curves (i.e. draw lines on log-log plots, a surprisingly useful exercise), it looks like we’ll run out of existing protein sequence data pretty soon. And the structural data has been all but exhausted, to the point that we now train our protein-sequence models on synthetic structural data (and the models might be stalling out in terms of performance).
  • Protein-folding models might be more of a complement to traditional docking approaches than a substitute for them, and it’s unclear how much of the recent generative protein modeling hype is just glorified interpolation (which isn’t to say it’s useless). Recent failures of AI on a binder prediction contest have received mixed interpretations.
  • There are foundation models for everything now—ranging from mass spectrometry to single-cell gene expression counts to fMRI—but surprisingly few attempts at multimodal or multiscale models.
  • There’s an ongoing crisis in the world of DNA modeling: models do pretty well when trained on efficient, coding-sequence-heavy genomes (hence the largest DNA model to date being trained on prokaryotic genomes) but often do no better than naive baselines on mammalian regulatory genomics tasks (the highly noisy, low-conservation regions which comprise the vast majority of mammalian genomes). DNA is not all you need to learn context-specific gene regulatory logic, obviously; non-scaling-pilled functional genomics advocates are being vindicated.
  • And in the pipeline, perhaps it’s a bit too soon to say, but it appears the most recent crop of AI-discovered drugs (whose discovery admittedly precedes the most recent bio-ML boom) is shooting around par in early clinical trials.

Thus, with all this recent progress, it’s a good moment to take stock of where we’re at, what trajectory we’re headed on, and perhaps reorient to reach that endpoint faster. In particular, we know we’re headed toward some hazy vision of accelerated biomedical progress, controlling biology end-to-end, cure all the diseases, etc., but how much closer are we to actually getting there, and what, concretely, will this world look like?

We will take a winding journey to answer this question, and in doing so tie together a few loose threads latent in the bio-ML space. After all the twists and turns, we will have a better sense of why AI for biology is indeed poised for an inflection point, one which, perhaps surprisingly, will return it to its intellectual roots as an experimental, mechanistically understandable science—at least temporarily.

Outline

In part 1, we’ll make a journey into the world of mechanistic interpretability, a promising new (or really quite old) field of AI research that aims to explain and control the behavior of neural networks. We’ll then explore the surprising role of cellular biology in its intellectual origins, and bring it full circle, discussing our preliminary results using sparse auto-encoders to peer inside, probe and perturb virtual cell models, with exciting potential applications for early-stage therapeutics discovery.

In part 2, we’ll do a retrospective on the single-cell foundation model field a little over a year after its real inception, and show how a hyperbolic, perverse academic system has led to it being severely underrated. We’ll then make an extended comparison to the state of the protein structure prediction field, explaining how the single-cell field is on the precipice of its own AlphaFold moment.

Finally, in part 3, we’ll argue that mechanistic interpretability will be necessary for this AlphaFold moment. Then we’ll work our way toward an empirical framework for how mechanistically interpretable simulators will accelerate drug discovery over the next five or so years, a period of “liminal legibility”. We’ll then explain how the complexity of these simulators will eventually outstrip human cognition, necessitating the handoff from humans to AI experimental agents as we near end-to-end biology takeoff.

Pt 1. Mechanistic Interpretability Comes Full Circle: Virtual Cell as Missing Model Organism of Biology

Origins of the Mechanistic Interpretability Field

We take it for granted that the microscope is an important scientific instrument. It’s practically a symbol of science. But this wasn’t always the case, and microscopes didn’t initially take off as a scientific tool. In fact, they seem to have languished for around fifty years. The turning point was when Robert Hooke published Micrographia, a collection of drawings of things he’d seen using a microscope, including the first picture of a cell.

…Our impression is that there is some anxiety in the interpretability community that we aren’t taken very seriously. That this research is too qualitative. That it isn’t scientific. But the lesson of the microscope and cellular biology is that perhaps this is expected. The discovery of cells was a qualitative research result. That didn’t stop it from changing the world.

– OpenAI Mechanistic Interpretability Team, Zoom In

Machine learning for biology lags the general machine learning field. A couple years ago this lag was roughly 3-5 years, but due to compression in timelines of general AI progress (there are occasional spurts of progress over months that would have taken years previously) and increasing awareness of AI in the computational biology field, the lag time is now down to more like 18-24 months. So, a useful starting place for forecasting the future of machine learning for biology (“bio-ML”) might be to take our cues from the general AI field and see where they’ve been allocating their efforts lately, since this is where the bio-ML field might be at soon.

We begin our journey around 4 years ago with Chris Olah, an AI researcher then leading the interpretability team at OpenAI. Chris’s team was trying to figure out what was going on inside the mind of vision models—for instance, when a ConvNet processes a picture of a dog, what and how does it see? In what terms does it think about goldendoodles or schnauzers? (This is not just meant to be a funny example; the article actually focuses on dogs as a stimulus class that the model detects.) At that point the machine learning field already had decent image classification models, but the interpretability team at OpenAI was more concerned with how the classification model predicted whether an image was of a dog or not, and in what terms it understood this world of images.

The issue they ran into was that neurons inside the neural network would often fire on varying, seemingly unrelated stimuli. That is, rather than a neuron firing specifically on a particular dog breed or even dogs in general, many neurons would fire apparently indiscriminately on multiple inputs, like cat faces and fronts of cars—an issue they called “polysemanticity”. To make a long story short, later work done by Chris and his team at Anthropic convincingly suggested that polysemanticity is one of the major drivers of the opaqueness of neural networks’ internal representations, and that it is a byproduct of the neural network trying to compress a lot of information about the world into a comparatively small set of weights:

Why would it do such a thing? We believe superposition allows the model to use fewer neurons, conserving them for more important tasks. As long as cars and dogs don’t co-occur, the model can accurately retrieve the dog feature in a later layer, allowing it to store the feature without dedicating a neuron.

There were many false starts in trying to solve the polysemanticity issue, but real progress was finally made in late 2023, first by Cunningham et al. and then by Olah’s group at Anthropic, in disentangling the activations of these neural networks into human-interpretable sparse “features”. These advances relied on a tool called sparse autoencoders (SAEs), which are often analogized as a kind of “fMRI for LLMs”. These SAEs allowed the researchers to peer inside the LLM and parse its internal “thoughts” as combinations of these features (for instance, you might have heard of “Golden Gate Claude”, which references a feature Anthropic researchers found inside their model which fires specifically on text and images of the Golden Gate Bridge).

SAEs are but a small part of the growing armamentarium of mechanistic interpretability techniques that give us a new toolkit for understanding and controlling the behavior of AI models, with researchers currently hard at work building the tracer dyes, patch clamps, and optogenetic probes for experimenting on these virtual minds. The analogy runs more than skin deep, though: many computational neuroscientists have recently entered the mechanistic interpretability field, since they now have a viable in silico model organism (LLMs) and the tools to quickly and cheaply experiment on it (microscopes, probes, scalpels)—much nicer than dealing with the hassle of IRBs and motion artifacts in the scanner.

This recent transfusion of talent is part of an ongoing dialogue between neuroscience and mechanistic interpretability—for instance, Olshausen and Field’s work on sparse coding in the primary visual cortex in the late 90s eventually led to sparse autoencoders (and of course the original wet-ware inspiration for the perceptron, one of the basic building blocks of today’s deep learning models, came from biological neurons).

But what is perhaps lesser known is that the modern mechanistic interpretability field can trace some of its intellectual roots to cellular biology.

In an essay from 2020, Chris Olah explicitly motivates the development of the mechanistic interpretability field by making a comparison to the development of the microscope in biology and how it revolutionized the way we understand biological systems, going so far as to call their work a kind of “cellular biology of deep learning”:

Many important transition points in the history of science have been moments when science “zoomed in.” At these points, we develop a visualization or tool that allows us to see the world in a new level of detail, and a new field of science develops to study the world through this lens.

For example, microscopes let us see cells, leading to cellular biology. Science zoomed in. Several techniques including x-ray crystallography let us see DNA, leading to the molecular revolution. Science zoomed in. Atomic theory. Subatomic particles. Neuroscience. Science zoomed in.

These transitions weren’t just a change in precision: they were qualitative changes in what the objects of scientific inquiry are. For example, cellular biology isn’t just more careful zoology. It’s a new kind of inquiry that dramatically shifts what we can understand.

…Just as the early microscope hinted at a new world of cells and microorganisms, visualizations of artificial neural networks have revealed tantalizing hints and glimpses of a rich inner world within our models. This has led us to wonder: Is it possible that deep learning is at a similar, albeit more modest, transition point?

It is hard to fathom the counterfactual histories in which we never developed optics and the microscope, never discovered the cell as an atomic unit of biological organization, and therefore missed out on (or delayed) all the biological experimentation and discovery that followed. Likewise, we can imagine how our understanding of LLMs would be comparatively impoverished had we not developed SAEs and all the mechanistic interpretability tools that will be created in the coming years.

But what if we come full circle and turn modern mechanistic interpretability research back on the subject that originally motivated the field’s development? That is, if LLMs are the model system of this new virtual computational neuroscience and SAEs provide the microscope, what would be the canonical model organism and microscope of a new kind of virtual biology?

A Prototypical Virtual Cell

While our goal is safety, we also believe there is something deeply beautiful hidden inside neural networks, something that would make our investigations worthwhile even in worlds with less pressing safety concerns. With progress in deep learning, interpretability is the research question which is just crying out to be answered!

…The beauty of deep learning and scale is a kind of biological beauty. Just as the simplicity of evolution gives rise to incredible beauty and complexity in the natural world, so too does the simplicity of gradient descent on neural networks give rise to incredible beauty in machine learning.

Neural networks are full of beautiful structure, if only we care to look for it.

– OpenAI Mechanistic Interpretability Team, Zoom In

Just as an fMRI is useless without a brain to scan, and an SAE useless without an LLM to introspect, so too would a virtual microscope be useless without an object of study. So the first task before us is constructing a model system, and there seems no more appropriate a place to begin than with a single cell.

The machine learning field had a more difficult challenge: it had to develop an entire framework for growing/training virtual minds, and only then could it develop the tools to introspect and control them. Luckily for us, due to historical contingency, the development of machine learning preceded the development of virtual cells, so we can rely on these existing ML frameworks.

By a “virtual cell”, I mean a machine learning model trained on data emitted from real biological cells. That is, just as an LLM trained on a corpus of Internet text and books learns a world model of the data-generating process of that text (i.e., human minds and their description of and interaction with the world around them), so too will a model trained on a corpus of internal cellular states come to learn a world model of the cell’s “mind” and its interaction with its biological niche—or so we hope (this agentic framing of cells will be elaborated on later).

As a proof of concept, we trained a rather modest virtual cell model, though it is far more sophisticated than any other virtual cell to date. The model is a continuous diffusion transformer trained on single-cell gene count data—that is, static snapshots of the counts of genes expressed inside individual cells within human tissue, which are a proxy for those cells’ functional states.

This “Gene Diffusion” model is trained via self-supervised learning to generate sets of gene tokens (from a vocabulary of around 60,000 human genes, both protein-coding and non-protein-coding). The model is trained primarily with a denoising diffusion objective: the set of gene tokens is represented as vectors in a continuous embedding space, Gaussian noise is added to these embeddings, and the model is tasked with predicting the clean, un-noised embeddings, similar to how text-to-image models are trained. By learning which genes tend to co-express across a variety of cell types and tissues, our model learns rich internal representations of cell state which allow it to predict gene expression.
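To make the objective concrete, here is a minimal sketch of a denoising loss over gene-token embeddings; the `model` and `gene_embedder` objects, the cosine noise schedule, and the tensor shapes are illustrative assumptions rather than our exact configuration.

```python
# Minimal sketch of the denoising objective described above (illustrative only;
# `model`, `gene_embedder`, and the cosine noise schedule are assumptions).
import torch
import torch.nn.functional as F

def denoising_loss(model, gene_embedder, gene_tokens, max_t=1000):
    """gene_tokens: (batch, set_size) integer IDs drawn from a ~60k-gene vocabulary."""
    x0 = gene_embedder(gene_tokens)                         # clean embeddings, (B, S, D)
    t = torch.randint(0, max_t, (x0.shape[0],))             # one noise level per example
    alpha_bar = torch.cos(t.float() / max_t * torch.pi / 2) ** 2
    alpha_bar = alpha_bar.view(-1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise  # add Gaussian noise
    x0_hat = model(x_t, t)                                  # transformer predicts clean embeddings
    return F.mse_loss(x0_hat, x0)                           # "predict the un-noised embeddings"
```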

Constructing the Microscope

Now that we have our virtual cell, the next step is to build a microscope to peer inside it. That is, rather than try to understand any given cellular state directly in terms of the 60,000 or so genes (protein-coding and non-coding) that can be expressed within any cell, we can instead exploit the world model learned by our virtual cell by looking at its internal representations as it processes this gene expression data.

Why is this necessary, and why can’t we just look at gene expression directly? For one thing, genes often play multiple context-specific biological roles, even within the same cell. Furthermore, there are many biological processes that are distributed among sets of genes. But most importantly, genes are not necessarily the right level of description for many biological processes, and we believe our model has actually developed a world model of more than just RNA. That is, even though it trains on RNA, to develop a compressed world model that predicts gene expression well, it must learn a model of the deeper, previously obscured forms driving cellular behavior.

But it is only through the lens of our learned machine learning model that we are able to see through the ambient gene data matrix and glimpse the computational primitives which form the basis of the virtual cell’s world model.

As with LLMs, these internal computational states are tangled up and largely inscrutable as is, so we must train a sparse autoencoder to disentangle them—a kind of prism that refracts them into separable, potentially human-interpretable concepts or “features”. In brief, building the microscope goes like this:

  1. Run forward passes with real single-cell gene count data and capture the internal activations (i.e., the brain scan of the virtual cell) across a variety of cell types and tissues.
  2. Train a sparse autoencoder to reconstruct these activations as a sparse sum of learnable features (i.e., directions in the activation space), which encourages it to learn a decomposable dictionary of concepts.
  3. Take the features/concepts learned by the sparse autoencoder, and apply auto-interpretation techniques to label them, making them human-interpretable.
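For readers who think in code, here is a minimal sketch of what steps 1 and 2 amount to, using the standard ReLU-dictionary SAE recipe (the dimensions and the L1 coefficient are placeholders rather than our actual settings).

```python
# Sketch of steps 1-2: a standard sparse autoencoder over captured activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))        # sparse, non-negative feature activations
        recon = self.decoder(feats)                   # reconstruction as a sum of feature directions
        mse = ((recon - acts) ** 2).mean()            # reconstruction term
        sparsity = feats.abs().mean()                 # L1 term encourages a decomposable dictionary
        return feats, recon, mse + self.l1_coeff * sparsity
```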

Virtual Micrographia

We've released a dashboard with a curated selection of some of the most interesting features we found inside our virtual cell via this process of unsupervised concept discovery.

A brief explanation of the feature dashboard and our auto-interpretation approach seems in order (for a more exhaustive explanation, refer to Anthropic’s recent Scaling Monosemanticity, which explains the basic framework we are relying on and extending).

On the left-hand side is a selection of the sparse autoencoder features, or concepts, we discovered inside the virtual cell. Initially these are just vectors; text labels for these features are generated post hoc by feeding a variety of information about each feature into a large language model for analysis, such as:

  • which cell types and tissue types this feature tends to activate in
  • the genes which the feature maximally activates on, and properties of these genes (like Gene Ontology attributes)

All this information for a single feature is fed into an LLM which, drawing on its own knowledge of biology plus the external biological context we supply, generates a label for what it believes the feature represents.
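As a rough illustration of this auto-interpretation step, the labeling prompt is essentially a bundle of the enrichment evidence described above; the field names and the downstream LLM call are hypothetical.

```python
# Hypothetical sketch of assembling the auto-interpretation prompt for one feature.
def build_feature_label_prompt(feature: dict) -> str:
    return (
        "You are labeling a feature from a sparse autoencoder trained on a virtual cell model.\n"
        f"Cell/tissue types it activates in: {feature['enriched_cell_types']}\n"
        f"Genes it maximally activates on: {feature['top_genes']}\n"
        f"Enriched Gene Ontology terms: {feature['go_terms']}\n"
        "In one sentence, state the biological process or state this feature most likely represents."
    )

# label = llm(build_feature_label_prompt(feature))  # any LLM API call; not shown here
```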

In the middle panel are some of the enriched cell types, gene tokens, gene ontology terms, and so on that we find are associated with this feature.

On the right panel are the top-activating cells, the cells on which this feature fires most strongly (computed by pooling the feature’s activation across all the genes within the cell). We then list the most highly activating tokens/genes for that feature inside each cell in descending order. This begins to paint a picture of what the feature might be doing in that particular cellular context.

For instance, we found a feature related to plasma cell terminal differentiation and the osmotic stressors these cells face in producing and secreting antibodies, as indicated by the enrichment of sorbitol response pathways alongside key plasma cell regulators like IRF4 and FAM46C. This feature reflects how B cells must dramatically expand in size and manage osmotic stress while transforming their cellular machinery for antibody production – the cell increases its volume to accommodate an enlarged endoplasmic reticulum while maintaining osmotic balance through stress-response pathways, enabling the production of thousands of antibodies per second without bursting from internal pressure.

But often it can be hard to tell what a feature does simply from what it activates on. Luckily, we can run an experiment and turn up the feature’s activation inside the cell (a kind of “overexpression” experiment, in the language of biology), and see which genes it makes more or less likely to be expressed (as encoded by the change in the logits for each gene token). These logit attribution scores begin to get at the sort of causal effect this feature has inside the mind of the virtual cell.
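In code, such an “overexpression” experiment might look roughly like the following sketch; the hook target (`model.target_layer`), the `.logits` output field, and the boost magnitude are assumptions for illustration.

```python
# Sketch of a feature "overexpression" experiment: clamp one feature's activation
# higher during the forward pass and measure the shift in per-gene logits.
import torch

@torch.no_grad()
def feature_logit_attribution(model, sae, cell_tokens, feature_idx, boost=5.0):
    base_logits = model(cell_tokens).logits               # baseline prediction

    def steer(module, inputs, output):
        feats = torch.relu(sae.encoder(output))           # decompose activations into features
        feats[..., feature_idx] += boost                  # "turn up" the chosen feature
        return sae.decoder(feats)                         # write the steered activations back

    handle = model.target_layer.register_forward_hook(steer)  # hypothetical layer handle
    steered_logits = model(cell_tokens).logits
    handle.remove()
    return steered_logits - base_logits                   # per-gene logit attribution scores
```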

For instance, we found a feature related to tight junctions in epithelial barrier formation that showed coordinated regulation of barrier defense through multiple mechanisms. It positively regulated barrier-forming genes like CLDN4/7 and PKP1, while also regulating immune-modulatory factors both positively (CLCA2, which can promote inflammatory responses) and negatively (CX3CR1, which mediates immune cell trafficking through barriers). The feature positively regulated protective factors like SERPINB5 (which can limit immune-mediated tissue damage) while suppressing immune cell receptors like SIGLEC7. This balanced program appears particularly relevant in tissues like the intestine where careful immune regulation is crucial, suggesting a coordinated system linking physical barrier formation with immunological homeostasis.

Feature Families and Internal Belief Geometry

Up until now, our focus has been on individual features in isolation. But if we zoom out and look at how features co-activate (similar to how genes co-express), both on single tokens and within cells, we begin to identify some interesting patterns. By doing a kind of cell-type specific internal clustering of features, we get a context-specific hierarchy of cellular functions, or nested “feature families”.

These feature families provide a coarse-grained, human-interpretable representation of the internal regulatory wiring of any given cellular state, abstracting up from individual features toward something more amenable to human understanding. That is, just as many gene regulatory programs are best understood by looking at the coordinated activity of multiple genes, so too can we better understand many features by looking at their coordinated activity.
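A minimal sketch of the co-activation clustering behind these feature families (correlation plus average-linkage hierarchical clustering; the pooling, distance measure, and number of families are illustrative choices rather than our exact recipe):

```python
# Sketch of grouping features into "feature families" by co-activation.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def feature_families(feature_acts: np.ndarray, n_families: int = 50) -> np.ndarray:
    """feature_acts: (n_cells, n_features) per-cell pooled feature activations."""
    corr = np.nan_to_num(np.corrcoef(feature_acts.T))       # feature-feature co-activation
    dist = np.clip(1.0 - corr, 0.0, None)                   # co-activation -> distance
    condensed = dist[np.triu_indices_from(dist, k=1)]       # condensed form for linkage()
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=n_families, criterion="maxclust")  # family ID for each feature
```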

Causal Discovery and Reverse-Engineering Drivers of Behavior

In biology, a circuit motif is a recurring pattern in complex graphs like transcription networks or biological neural networks. Motifs are helpful because understanding one motif can give researchers leverage on all graphs where it occurs.

We think it’s quite likely that studying motifs will be important in understanding the circuits of artificial neural networks. In the long run, it may be more important than the study of individual circuits. At the same time, we expect investigations of motifs to be well served by us first building up a solid foundation of well understood circuits first.

– OpenAI Mechanistic Interpretability Team, Zoom In

To be clear, these feature families are based purely on co-activation of features at this point. But as the earlier logit-attribution suggests, we can do more with our virtual cell than just passively observe which features tend to co-activate: we can actively run experiments on our virtual cell and see how it changes feature activations.

Earlier we looked at changes in predicted gene expression with respect to feature activation, but we can additionally look at what happens to downstream feature activations if we experimentally upregulate a feature earlier in the computational graph.

By running two forward passes for every feature in the dictionary and measuring the change in downstream feature activations, we get an $n_{\text{feat}} \times n_{\text{feat}}$ matrix of directed “attribution effects” between $feat_{i}$ and $feat_{j}$: if I turn up $feat_{4938}$, how much more can I expect $feat_{1029}$ to activate, compared to baseline?
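In pseudocode, building this matrix is just a loop of steered forward passes; `get_feature_activations` is a hypothetical helper that pools SAE feature activations over the cell, optionally boosting one feature along the way (mirroring the steering sketch earlier).

```python
# Sketch of the n_feat x n_feat "feature regulatory network": boost each feature in
# turn, rerun the forward pass, and record how every downstream feature shifts.
import torch

@torch.no_grad()
def feature_attribution_matrix(model, sae, cell_tokens, n_features, boost=5.0):
    # `get_feature_activations` is a hypothetical helper (see text).
    base = get_feature_activations(model, sae, cell_tokens)               # (n_features,)
    A = torch.zeros(n_features, n_features)
    for i in range(n_features):
        steered = get_feature_activations(model, sae, cell_tokens,
                                          steer_feature=i, boost=boost)   # (n_features,)
        A[i] = steered - base      # row i: effect of boosting feature i on every feature j
    return A
```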

With this “feature regulatory network” (borrowing the idea of a “gene regulatory network” from cellular biology), we can then begin deriving powerful insights that are useful for reverse-engineering the cell’s behavior.

Unfortunately, however, real gene regulatory networks contain feedback loops (this is necessary for homeostasis), and likely so too does our feature regulatory network, which means that our directed graph has cycles. This means our first-order feature attribution matrix will likely miss the higher-order evolution of regulatory dynamics encoded within our model.

However, we can use some simple spectral graph theory to get around this problem. For instance, if we apply a variant of the PageRank algorithm to this feature-feature graph for a single virtual cell state, we get a stationary distribution that reveals the most causally influential features in that particular cell wiring. We might call these highly influential features the “driver features”, in that they appear to be controlling the cell in that region of state space.
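One concrete way this could be implemented (a plausible variant, not necessarily our exact one): keep the positive, activating edges, reverse them so that features whose boosts propagate widely accumulate score, and run PageRank over the resulting graph.

```python
# Sketch of scoring "driver features" via PageRank over the attribution graph.
import numpy as np
import networkx as nx

def driver_features(A: np.ndarray, top_k: int = 10) -> list:
    """A: (n_feat, n_feat) attribution matrix; A[i, j] = effect of boosting feature i on feature j."""
    W = np.clip(A, 0, None).T                  # keep activating edges; reverse so influence flows back to sources
    G = nx.from_numpy_array(W, create_using=nx.DiGraph)
    scores = nx.pagerank(G, weight="weight")   # stationary distribution of a weighted random walk
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```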

A fascinating example is applying this to differentiation trajectories. For instance, we investigated a thymus dataset by computing feature-feature graphs for a sampling of cells along these differentiation trajectories. We then ran our PageRank-like algorithm over the resulting graphs and looked at the features it predicted would be most influential.

This approach identified a feature which fired on a known marker (CD25) and master transcription factor (FOXP3) of T-cell regulatory fate commitment, a good confirmation that the model’s feature regulatory networks are picking up on something real.

But additionally, our algorithm also surfaced what appears to be a previously unknown factor in reinforcing this T-regulatory cell fate, related to the shift from protein translation and growth to centriole assembly and preparation for mitosis, driven primarily by the CEP63 gene. This feature appears to fire in other immune cell types as well, perhaps being a general differentiation and fate commitment program.

(We also theorize that we could look at the entropy of the top N eigenvalues of the PageRank transition matrix (the “spectral entropy”) as cells approach a branching point, to get a coarse statistical description of how the internal wiring is changing. Our hypothesis is that as the cell approaches the branching point, it must break internal symmetry, which will likely be preceded by a phase transition in the spectral entropy of the feature-feature regulatory graph. We further predict that this phase transition will identify a small subset of highly influential features that tip the balance in favor of one or the other branching fate, which would likely make for good experimental targets to better control cell fate. This could first be tested in the virtual cell by overexpressing or ablating these features and seeing how doing so alters predicted cell fate.)
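For concreteness, one reasonable formulation of this diagnostic, assuming the attribution matrix from before, would normalize rows into a transition matrix and take the Shannon entropy of the top-N eigenvalue magnitudes:

```python
# Sketch of the proposed "spectral entropy" of the feature-feature regulatory graph.
import numpy as np

def spectral_entropy(A: np.ndarray, top_n: int = 20) -> float:
    W = np.clip(A, 0, None)
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)     # row-stochastic transition matrix
    mags = np.sort(np.abs(np.linalg.eigvals(P)))[::-1][:top_n]  # top-N eigenvalue magnitudes
    p = mags / max(mags.sum(), 1e-12)                           # normalize into a distribution
    return float(-(p * np.log(p + 1e-12)).sum())                # Shannon entropy
```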

Spatial Investigation

We then wondered if our sparse autoencoder might have picked up on higher-order spatial patterns in gene expression, despite only being trained on single-cell data. In a 2D slice of intestine we found rather striking patterns of feature activation that identified unique populations of enterocytes, the absorptive cells which line the inner surface of the intestine and are known to be implicated in autoimmune diseases of the gut like Crohn’s disease and inflammatory bowel disease.

First, we found broadly expressed features across most enterocytes. For instance, this feature showed high activation on FECH and other heme synthesis genes, and was enriched for red blood cells, suggesting a role related to iron processing. The broad activation of this feature in enterocytes aligns with their location near blood vessels and their need for heme-containing proteins to support iron absorption.

But we then identified two much more sparsely activated features that appeared in only a handful of enterocytes and were quite intriguing.

The first revealed what appeared to be a RAB21-mediated immune sentinel program in select enterocytes, marked by immune mediators like IL6. These cells appear to act as strategic monitors along the intestinal barrier, sampling particles from the environment via endocytosis (cell ingestion) to detect threats and coordinate tissue responses. Intriguingly, we found the same program active in oligodendrocytes, the decidua (the first immune barrier between the fetus and mother during pregnancy), and centrilobular hepatocytes (the cells that surround and patrol the major veins of the liver). This suggests the feature might have to do with RAB-mediated protein trafficking and sorting in barrier tissues rather than immune surveillance per se, and warrants deeper investigation.

The second, and sparsest, feature we found centered on NSF-mediated vesicle recycling, which drives the disassembly of SNARE complexes after membrane fusion events. This exact molecular machinery is known to be involved in neurotransmitter recycling in neurons, so it was striking to find it re-purposed inside this select group of enterocytes, where we think it supports nutrient absorption through improved vesicle recycling.

Obvious Therapeutics Discovery Applications

SAEs give us a new language to describe biology, an interface to concepts inside our virtual cell that are currently beyond our comprehension. This language is useful not only for explaining the cell’s behavior, but also for specifying desired cellular behavior (“push it in the direction of feature_459” rather than “increase expression of gene_X”)—and perhaps before long it will be the description language for actions on cells, whether those of a therapy or of an experimental perturbation.

Our virtual cell is rather simple at this point, and we’re barely scratching the surface of what is possible with these mechanistic interpretability techniques. We will later give a framework for thinking about what is possible with these tools in a grand sense, but for now, a couple obvious therapeutics discovery-relevant applications of these tools include:

  • Target discovery (this is by far the most obvious and boring)
  • Reverse-engineering the mechanism of action of existing therapeutics by tracing the computational paths they take through the model (our virtual cell can be conditioned with perturbations)
  • Identifying better combinations of transcription factors for directed differentiation and cellular reprogramming (again, with perturbation conditioning; now extend this to any kind of perturbation)
  • In general, using attribution and ablation techniques to reverse-engineer which sub-circuits of the model are causally necessary or sufficient to cause some desired delta in behavior

Pt 2. The AlphaFold Moment for Cellular Biology

Let’s address the elephant in the room: if you have any passing familiarity with single-cell foundation models or know colleagues who work on them, you might be extremely surprised by all these interesting results, which contrast so strongly with the pessimistic zeitgeist. You might be thinking “don’t single-cell foundation models do no better than naive linear baselines on perturbation prediction?” or “doesn’t predicting mean gene expression do just as well as using a fancy foundation model?”

As someone who has followed the field for around three years, before and during the boom and bust, and actively built and dug into the guts of these models, I’m probably one of those best positioned to answer these questions and give an accurate (but likely biased) assessment of the state of the field. Doing so will require a little sociological detour, but it will be very worth our while, since diagnosing the pathologies of the field is the first step to treating them. This will then clarify what direction the field should head in and what broader project it is contributing to.

The Single-Cell Foundation Model Field Is a Prime Example of Hyperbolic Science

To give the punch-line up front: the incompetence and perverse incentives of academics in the single-cell field (both those training the single-cell foundation models and those creating the benchmarks to evaluate them) conspire to make the field and the broader scientific public, which is taking cues from this group, unduly pessimistic about the potential of these models. Therefore, single-cell foundation models have been systematically underrated for the past year or so.

The brief history of the field for those not keeping up:

  • For the prehistory of the single-cell computational biology field before the foundation model era, see this earlier essay.
  • Around Q1 2023, The Chan Zuckerberg Initiative (CZI) really started making an effort to standardize existing single-cell transcriptomic atlas data and store it in a single, easy-to-access location called CELLxGENE (prior to this cellxgene was still a bit difficult to use). They later even harmonized the metadata of these datasets, which was a huge improvement. This made it much easier to train single-cell models.
  • There was the first efflorescence of single-cell foundation models in Q2 and Q3 2023, all with masked language modeling or autoregressive architectures: Geneformer, scGPT, the lesser-known ANDRE. Many of these papers made rather ambitious claims about the potential of these models for e.g. gene regulatory network prediction.
  • In Q4 2023, a couple bearish benchmarking papers came out claiming that these models did no better than naive baselines on zero-shot representation learning evaluations like cell type clustering and batch integration. But the bombshell claim was that single-cell foundation models performed no better than just predicting the mean when it came to gene expression prediction tasks.
  • These bearish publications appeared to have a chilling effect on the field throughout 2024, though there were a few foundation models in the interim, such as Universal Cell Embeddings.
  • The field was largely quiet for most of 2024, until a couple more bearish publications came out in late Q3 of this year, further claiming that these single-cell foundation models were no better than simple methods on tasks like gene regulatory network prediction or expression forecasting under perturbations. Everyone then aped into this and the pessimism was reinforced.

We don’t have time to systematically critique the field, but the high-level analysis is this:

  • The number of non-academics who actually read these papers in full and have interrogated the code, both on the foundation model side and on the benchmarking side, probably numbers in the single digits. So people are largely taking their cues from those inside the field and the hot takes they see from people vocal online. This puts a strong sociological prior on something being awry here.
  • The academics training the foundation models tend to be biologists, not machine learning people. To put it charitably, they’re a bit behind the times and not the best at training these models, which: are severely underparametrized (the largest recent model is 130 million parameters with a 1000:1 token-to-parameter ratio); use old architectures, suboptimal hyperparameter configurations, and naive losses; and don’t do smart data curation or filtering.
    • To give but a couple of indicators of how suboptimal the training of these models is, I’ll focus on the popular Geneformer model. To be extra charitable, I’ll focus on the more recent, larger version of the model released a couple months ago in August 2024.
    • The largest model they train has a little over 130 million parameters, and they train it on 128 billion tokens from 95 million cells, a tokens-to-parameters ratio of ~1000x. Either gene count data is information-poor or these models are severely underparametrized, yet the leading lights of computational biology either fail to see this or deliberately ignore it (it appears no one in the field has done a proper scaling analysis into the billion-parameter range, unlike in the protein language modeling world, where such an analysis was published earlier this year).
    • Their largest model uses about $2,700 worth of compute (around 1,000 H100-hours), which they had to get from Argonne National Laboratory.
    • They’re still using the basic BERT encoder architecture, not even changing the default masking ratio of 15%, and certainly not incorporating recent improvements like QK-norm or rotary positional embeddings.
    • They choose the strangest combination of hyperparameters, like a hidden_dim of 896, 14 attention heads, and 20 layers, apparently believing that depth is more important than width. They don’t appear to understand what optimal hyperparameter scaling means and confuse it with something related to images: “To match the increase in pretraining data, we also increased the depth of the model, maintaining the width-to-depth aspect ratio, and compared the pretraining loss per computation and tokens observed by the model.”
  • Those on the benchmarking side are no better, putting out code that is painful to work with and often not even using the most recent foundation models in their analyses.
    • The most recent high-profile evaluation publication, which was most recently updated in October 2024, is still using the older, smaller version of Geneformer, released in May of 2023, rather than the more recent, larger version trained with a larger context length (4096) released in August 2024. For context, this older version of Geneformer had a whopping 6 layers and hidden dimension of 256.
    • None of these benchmarking papers appear to do obvious things like sweep hyperparameters, or at least their code doesn’t suggest this.
    • They don’t do obvious things like condition the model on metadata, as we do with our gene diffusion model, which will help mitigate spurious batch effects (granted, this is harder with the models they evaluate).
    • As mentioned, the benchmarking code is painful to work with, typically a chimera of R and Python that is hard to get working out of the box (e.g. it requires you to install multiple different conda environments).
    • Perhaps this is imputing motivations, but there appears to be a feeling of disdain from the academics doing the benchmarking toward the industry, particularly startups applying AI (and rightfully so, as the bio-ML space is full of nonsensical hype).
  • Perhaps the worst issue of all is a technical one and the follow-on sociological effects: single-cell gene count data is the most cursed machine learning data format in existence (second only to tabular data).
    • For context, here’s where this data comes from: when you fish out RNA transcripts from a single cell, what you really get are thousands of sequences composed of “A”, “U”, “C”, and “G”. These RNA sequences are then reverse transcribed (I’m skipping many intermediate steps) into complementary DNA, which is then read by a sequencer to produce thousands of DNA sequencing reads. The standard practice is then to do some kind of mapping from these reads to the predicted regions of the genome from which they were transcribed, called “alignment” or “pseudo-alignment” or something similar; you then count up (“quantification”) how many reads are predicted to have come from each of these regions of the genome, of which there are 20,000-60,000, resulting in a list of genes and their integer “counts”: gene A has 1, gene B has 3, etc. Do this for a bunch of cells and you get a “cell-by-gene” matrix, hence the name.
    • One major issue is that the distribution of RNA expression inside any given cell is extremely sparse. The vast majority of genes are not transcribed (and/or not picked up on by the library prep) in any given cell, and hence have a count value of 0 (for this reason gene count data is often stored in sparse matrix formats). Of the genes that are expressed, their counts appear to be Poisson-distributed or similar (more on that later).
    • Trying to train machine learning models on count data forces all sorts of awkward contortions. For starters, you need to find a way to represent the expression counts associated with each gene/token inside the model. Typically this is done by rank-ordering the genes from most to least expressed, injecting the count information as some kind of learned positional embedding (where each integer is used as an index into a learned position embedding table; alternatively, you can convert the counts to normalized floats and inject these via RoPE or similar), and then simply truncating all genes that fall outside the context window (that is, it is not typical to include the genes with zero expression inside the context window); see the tokenization sketch after this list. Your objective function can then be either to predict only the gene token at a given position (the way our current gene diffusion model works), or to additionally predict the counts of the associated gene.
      • The rank truncation scheme creates a major issue: due to the limits of context length, your model will necessarily only see and train on the most expressed set of genes in the cell. Yes, you may catch glimpses of the long-tail of infrequently expressed genes occasionally, but by and large your model is learning to model the distribution of highly expressed, often biologically uninteresting genes, like those coding for ribosomal or mitochondrial proteins, which dominate expression data in terms of total counts. The August 2024 version of Geneformer made strides here, extending the context window to 4096, but these issues still largely remain.
    • But suppose you’ve figured out a way to train your machine learning model using count information, gotten around the data issues created by rank-encoding, and now you want to evaluate it on some task that involves comparing predicted counts vs. true counts. You run into a major statistical issue that is known within the field: Poisson-distributed data.
      • As mentioned before, in any given cell, only a minority of all 20,000-60,000 genes (protein-coding only vs. all genes) are expressed, with most genes having zero expression. The distribution of the counts across all genes tends to follow a Poisson distribution (or a zero-inflated negative binomial distribution). This has a few consequences when trying to evaluate your model on count prediction tasks.
      • First, the losses of the more highly expressed genes are going to dominate your reconstruction loss, since the squared errors are larger in absolute terms for higher count values. Hence a weighted mean squared error or an L1 loss should be used (though even that won’t solve the whole issue). But supposing you do decide on a loss function, you must then choose from among around five different metrics for computing it: do you calculate it over the top 20/100/200 genes? Or do you instead look at a Spearman rank correlation? Or how about just the direction of expression change?
      • Secondly, and far more perniciously, genes with higher expression levels naturally exhibit higher variance because, in Poisson-distributed data, the variance equals the mean, leading to a well-known issue called “heteroskedasticity”. This intrinsic property means that even if we accurately predict the true underlying expression level of a gene, the observed counts will vary more widely for highly expressed genes, and thus the losses will necessarily be higher. (One way around this is to train a generalized linear model with a Poisson likelihood and log link, which maps the model’s continuous outputs to expected counts against which the loss is computed, accounting for the heteroskedasticity; see the short sketch after this list.)
      • Therefore, when you evaluate your model’s predictions, not only are you over-indexing on the losses of the more highly expressed genes, but these genes are precisely those that will be harder to predict due to higher variance (though there is the countervailing effect from rank-truncation that these are probably the genes your model has seen most often). (And even if you process these integer counts into floats, via some combination of normalization and taking the log, as is typical in the single-cell space, you are still going to run into these systematic distortions.)
  • The previous nested bullet point illustrates an important sociological issue. This sort of inside baseball puts most people to sleep (I’ll just assume you didn’t read it, dear reader), and this provides cover for deliberate obscurantism and negligence by those inside the field.
    • The number of people who: understand how to train cutting-edge machine learning models and keep up with the literatures; have enough of a statistics background (e.g. from econometrics or hierarchical linear modeling in psychology, which is where my familiarity comes from) to understand a generalized linear model, Poisson-distributed data, and the issue of heteroskedasticity; and who either have a firm grasp of single-cell data or have the ability and desire to pick it up is quite small. Most of the people within this intersection of skills and interests are doing more remunerative things with their time than working on single-cell biology, like being paid ungodly sums of money to do time-series modeling at quantitative trading firms or building demand forecasting models at ride-sharing companies—and the remaining few work in the single-cell modeling field (and probably don’t actually understand or have incentive to train cutting-edge machine learning models, as evidenced by the current state of the single-cell foundation modeling field).
    • This enables those in the field to engage in what appears to be either deliberate obscurantism or simply negligence. For instance, a co-author of one of the most recent pessimistic single-cell benchmarking papers making the rounds online actually wrote a paper precisely on the issue of heteroskedasticity in single-cell RNA sequencing data, and so is therefore aware of it. And yet the most recent benchmarking paper has zero mentions of the word “heteroskedastic” or “Poisson”; they simply calculate L2 losses like everything is normal. To be clear, this isn’t to say addressing these issues would fix current problems with single-cell foundation models, which as mentioned before are extremely suboptimal. But it does evidence that the field has serious issues on both sides which outsiders are completely unaware of, and it would therefore be a mistake to dismiss the entire field out of hand, as so many have done.
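As referenced in the tokenization bullet above, here is a minimal sketch of the rank-value encoding idea: order genes by expression, keep the top `context_len`, and use the integer counts as indices into a learned count-embedding table. This is a common recipe in the field, not any specific model’s exact code, and the sizes are placeholders.

```python
# Sketch of rank-value encoding for single-cell count data (sizes are illustrative).
import torch
import torch.nn as nn

class RankValueEncoder(nn.Module):
    def __init__(self, n_genes=60_000, max_count=512, d_model=256, context_len=2048):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes, d_model)
        self.count_emb = nn.Embedding(max_count, d_model)   # counts injected as learned embeddings
        self.context_len = context_len
        self.max_count = max_count

    def forward(self, gene_ids, counts):
        """gene_ids, counts: (n_expressed,) for one cell, zero-count genes already removed."""
        order = torch.argsort(counts, descending=True)[: self.context_len]  # rank, then truncate
        genes = gene_ids[order]
        cts = counts[order].clamp(max=self.max_count - 1)
        return self.gene_emb(genes) + self.count_emb(cts)   # (<= context_len, d_model)
```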
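And as referenced in the heteroskedasticity bullet, here is a minimal sketch of the Poisson-likelihood alternative to a plain L2 loss: the model predicts a per-gene log-rate and is scored with the Poisson negative log-likelihood, so the higher variance of highly expressed genes is absorbed by the likelihood rather than counted as error.

```python
# Minimal sketch of a Poisson-GLM-style loss for count predictions (contrast with raw MSE).
import torch
import torch.nn.functional as F

def poisson_count_loss(pred_log_rate: torch.Tensor, observed_counts: torch.Tensor) -> torch.Tensor:
    """pred_log_rate: model outputs on the log scale (the GLM's log link);
    observed_counts: raw integer counts for the same genes."""
    return F.poisson_nll_loss(pred_log_rate, observed_counts.float(),
                              log_input=True, full=True)
```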

So, given all those preceding facts, when viewing the single-cell foundation model field as an outsider, you have to believe one of two things, either:

  • Machine learning works across all other domains except for this particular one, perhaps because of “batch effects” or some other boogeyman concocted by the field like “not enough data” (despite having at least 50B tokens easily accessible to everyone, and 100B if you actually try, like the most recent Geneformer paper did), and so the field is doomed to fail.
  • Or you have to believe that the field is less than two years old, is an absolute mess, and because it’s insulated from the rigors of the real world there has been no selection pressure for things that work—only selection for things that receive grant money. Therefore, the field has not attracted our best and brightest (and those who might enter the field have been cowed into working on other things since working on single-cell foundation models is currently so low-status/taboo and has such low academic returns.)

If you believe the latter, that the field is broken and pre-paradigmatic, then this presents an amazing opportunity for intellectual arbitrage. What might it look like for the field to mature and how far away are we from this? The evolution of the protein structure field over the past 50+ years provides an interesting case study from which we can draw parallels in answering these questions.

A Techno-Sociological Case Study of the Protein Structure Prediction Field

A Brief History of the Protein Structure Field

Perhaps surprisingly, researchers have been working on computational protein structure prediction for quite a long time. What precipitated the emergence of the field was Christian Anfinsen’s hypothesis that, under standard physiological conditions, there exists a single, unique mapping from a protein’s amino acid sequence to a thermodynamically stable structure (its “native state”), and that this mapping does not depend on the particular path the protein took to fold into that structure:

Anfinsen’s hypothesis was a direct result of experiments he ran showing that when a ribonuclease protein was purposefully denatured (i.e. unfolded) in solution, it would refold into its native structure under the proper conditions.

This hypothesis had two important implications for the field:

First, it enabled the large research enterprise of in vitro protein folding that has come to understand native structures by experiments inside test tubes rather than inside cells. Second, the Anfinsen principle implies a sort of division of labor: Evolution can act to change an amino acid sequence, but the folding equilibrium and kinetics of a given sequence are then matters of physical chemistry.

Mind you, this work was done in the early 1960s. With this dogma in place, the field had become paradigmatic. Eventually the field coalesced around three major problems:

  1. the folding code: the thermodynamic question of what balance of interatomic forces dictates the structure of the protein, for a given amino acid sequence

  2. protein structure prediction: the computational problem of how to predict a protein’s native structure from its amino acid sequence

  3. the folding process: the kinetics question of what routes or pathways some proteins use to fold so quickly.

An important step toward solving these problems was the establishment in 1971 of the Protein Data Bank (PDB), a database of experimentally collected 3D protein and nucleic acid structures (which AlphaFold and other protein structure models still heavily rely on; it currently houses around 200,000 structures). Over the 70’s and 80’s, experimentalists added more and more structures to the database, and eventually computationalists began trying to use these structural data to solve the three major problems.

On the question of the folding code and structure prediction, after plodding along for a while, in the 80’s there was a major shift in focus from modeling microscopic forces to macroscopic forces, in particular hydrophobic interactions:

Prior to the mid-1980s, the protein folding code was seen as a sum of many different small interactions—such as hydrogen bonds, ion pairs, van der Waals attractions, and water-mediated hydrophobic interactions. A key idea was that the primary sequence encoded secondary structures, which then encoded tertiary structures. However, through statistical mechanical modeling, a different view emerged in the 1980s, namely, that there is a dominant component to the folding code, that it is the hydrophobic interaction, that the folding code is distributed both locally and nonlocally in the sequence, and that a protein’s secondary structure is as much a consequence of the tertiary structure as a cause of it.

Further strides on the structure prediction problem were made in the late 80’s and 90’s by incorporating advances in coarse-grained computational physics modeling and machine learning over evolutionary sequence data, which made structure search and prediction more efficient. That is, the challenge was to search the space of possible conformations that a protein could fold into, which is astronomically large, and find those conformations which are predicted to have the lowest energy (i.e. be the most stable) without needing to predict the energy of every single possible structure, and these methods gave faster, more efficient ways of performing that search.

But the major inflection point in the protein structure prediction field was the creation of the CASP competition in 1994, a biennial competition which galvanized the academic community. According to the founder of the competition, John Moult, he started CASP to “try and speed up the solution to the protein-folding problem,” but in retrospect he was “certainly naive about how hard this was going to be.”

CASP went like this: the computationalists were given a set of 100 sequences, corresponding to 100 non-public protein structures collected by experimentalists. Their task was to predict the 3D structures from the sequences; the predicted structures were then compared against the ground-truth structures and evaluated with a single metric, the Global Distance Test (GDT), a summary metric of the deviations between the actual and predicted positions of alpha-carbons in the structure (where the deviations are bucketed into a few distance thresholds). This summary score (GDT-TS) ranges from 0 to 100 for each predicted protein, and the per-target scores are then aggregated across all 100 predictions to rank the groups.
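For concreteness, a rough sketch of the per-target GDT-TS calculation (real CASP scoring also searches over superpositions for each distance threshold, which is omitted here):

```python
# Rough sketch of per-target GDT_TS: the average, over a few distance thresholds, of the
# fraction of alpha-carbons within that threshold of the experimental structure.
import numpy as np

def gdt_ts(pred_ca: np.ndarray, true_ca: np.ndarray,
           thresholds=(1.0, 2.0, 4.0, 8.0)) -> float:
    """pred_ca, true_ca: (n_residues, 3) alpha-carbon coordinates, already superposed."""
    dists = np.linalg.norm(pred_ca - true_ca, axis=1)
    fractions = [(dists <= t).mean() for t in thresholds]
    return 100.0 * float(np.mean(fractions))
```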

The first breakthrough in the competition came with Rosetta, out of David Baker’s lab, on CASP III in 1998. Rosetta exploited a hybrid approach to more efficiently build up and search the energy landscape of possible protein conformations. At its core, it was essentially doing a kind of early retrieval-augmented prediction, retrieving fragments of known protein structures from PDB that matched parts of the target sequence, and then trying to assemble these pieces into a structure guided by an energy function that combined physics and statistics. (The canon surrounding the creation of Rosetta is, shall we say, interesting.)

There were then almost two decades of what in retrospect seems like middling progress. And in 2018, AlphaFold burst onto the scene at CASP 13:

DeepMind’s entry, AlphaFold, placed first in the Free Modeling (FM) category, which assesses methods on their ability to predict novel protein folds (the Zhang group placed first in the Template-Based Modeling (TBM) category, which assess methods on predicting proteins whose folds are related to ones already in the Protein Data Bank.) DeepMind’s success generated significant public interest. Their approach builds on two ideas developed in the academic community during the preceding decade: (i) the use of co-evolutionary analysis to map residue co-variation in protein sequence to physical contact in protein structure, and (ii) the application of deep neural networks to robustly identify patterns in protein sequence and co-evolutionary couplings and convert them into contact maps.

Unlike Rosetta, AlphaFold learned to predict the distances and angles between amino acids end-to-end. Crucially, however, AlphaFold did not dispense with the use of evolutionary similarity information—but unlike Rosetta, it actually learned to incorporate this evolutionary information into its model’s predictions.

(This isn’t to say that AlphaFold didn’t rely on domain knowledge: the core of their model exploited an invariant point attention (IPA) operation, which respects symmetries under rotation and translation in 3D space, a domain-specific inductive bias. Their model also used a kind of recurrence called “recycling”, whereby the output of one forward pass of the model would be fed back in as input, progressively refining the predicted structure through iterative updates—something conceptually similar to the iterative refinement of energy minimization, but learned from data rather than using an explicit energy function. In AlphaFold 3, this connection to energy landscapes becomes more direct through the use of diffusion models, which learn to predict the score function (gradient of the log of the unnormalized density). This effectively lets the model learn an implicit energy landscape and use it to find low-energy conformational states.)

In CASP 14 (2020), the most recent CASP in which DeepMind officially competed, AlphaFold2 absolutely dominated regular targets and interdomain prediction, with the Baker lab coming in 2nd place to them in both. AlphaFold3 was released in mid-2024, to much excitement. We will know in a couple months how they fare on CASP 16.

But what does any of this have to do with single-cell foundation models?

Spelling Out The Analogies

We can use the protein structure field’s history as an analogical domain to map to and from (pun intended). On a sociological level, a couple observations are worth noting:

Firstly, there were already insiders in the protein structure field doing ostensibly useful computational work in the 80’s who had some insights and success; but it took a motley group of outsiders with computational skills to bust the doors open and truly revolutionize the field. Though the team had a few people with structural biology backgrounds, it was mostly made up of CS people: for instance, lead author John Jumper was a theoretical chemistry PhD but had a background in CS.

Secondly, these outsiders only came to the field because there was a legible challenge with cut-and-dried evaluation criteria; solving it would bring status and generate obvious utility beyond the academic field (it was not only a grand challenge, but a useful one); and there was a large stock of publicly accessible data to train their models on.

Thirdly, the problem had been correctly specified such that they weren’t encumbered by unnecessary dogma or theoretical baggage. The task was to predict structure from sequence, not to develop the most physically realistic model of protein folding or explain why a protein folded how it did.

On a technical level, there are a few observations to make:

First, AlphaFold was not the first publication to use deep learning (e.g. earlier work used convolutional neural networks), but it was the first to rely on deep learning end-to-end. That said, deep learning was not the entire solution: due to the limited number of protein structures available, they didn’t dispense with the use of multiple sequence alignments, which have been a mainstay of all three versions of AlphaFold (were there 200 million structures in PDB rather than 200,000, this might be a different story). But the progression across versions of AlphaFold has been one of jettisoning knowledge-based inductive biases in favor of a more learning-based approach.

Secondly, AlphaFold didn’t actually solve protein structure prediction. They solved the problem of predicting small, static pieces of larger proteins from their sequences, which has some utility but doesn’t immediately solve e.g. the more applied binder design problem or the problem of predicting the dynamics of proteins switching between multiple conformations. Yet this has not stopped it from generating lots of excitement about the field and spurring hundreds of millions in investment and billions in partnership deals, which undoubtedly will drive progress toward solving these more difficult problems.

Thirdly, the field was operating based on dogma which has later been shown not to be completely correct: many proteins switch between multiple conformations, and some don’t even have a single stable conformation. And yet these useful abstractions still worked well enough to get real traction on the problem. (Or at least the dogma had to be stretched: “Intrinsically disordered proteins (IDPs) are proteins that lack rigid 3D structure. Hence, they are often misconceived to present a challenge to Anfinsen’s dogma. However, IDPs exist as ensembles that sample a quasi-continuum of rapidly interconverting conformations and, as such, may represent proteins at the extreme limit of the Anfinsen postulate.”)

With all that in place, let’s now take a swing at analogizing what might lead to an AlphaFold moment in the single-cell field and why we haven’t seen it yet.

As technical table-setting, let’s motivate our analysis with the following analogy: the sequence-to-structure task is to the protein field as gene regulatory network prediction is to the single-cell transcriptomics field.

That is, just as we can map a sequence of amino acids to a 3D molecular conformation as encoded by a contact/adjacency matrix, so too can we map a multiset of expressed genes to an information theoretic gene regulatory network (conformation), as encoded by a gene x gene directed graph of regulatory effects/contacts (i.e. does gene A increase or decrease the expression of gene B).
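To make the input/output types of this proposed task concrete, here is a toy sketch in Python (the genes, counts, and regulatory signs are purely illustrative):

```python
import numpy as np

# Input: a multiset of expressed genes, here a counts vector over a tiny gene vocabulary.
genes = ["GATA1", "KLF1", "SPI1", "CEBPA"]
expression = np.array([120, 45, 0, 3])

# Output: a signed, directed gene-by-gene "contact map" of regulatory effects,
# where grn[i, j] encodes whether gene i increases or decreases expression of gene j.
grn = np.zeros((len(genes), len(genes)))
grn[genes.index("GATA1"), genes.index("KLF1")] = +1.0   # illustrative activation
grn[genes.index("GATA1"), genes.index("SPI1")] = -1.0   # illustrative repression

# The proposed task: learn a map f(expression) -> grn, analogous to
# learning sequence -> contact map in the protein structure world.
```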

First, let’s do the obvious sociological analogies which will track cleanly:

  • The cellular biology field already has insiders doing ostensibly useful computational work, many are applying deep learning to various sub-fields like functional genomics, and some are even starting to apply deep learning to single-cell transcriptomics modeling, as we’ve discussed. But to truly revolutionize the field, it will likely take a computational group of outsiders with serious machine learning chops.
  • The outsiders who solve the problem will only come to the field when there’s a legible grand challenge with clear evaluation criteria and a big, useful prize for solving it, none of which the field currently has. Protein folding simply has an intellectual appeal that gene regulatory prediction does not, perhaps because the protein field has historically been more physics-adjacent than cellular biology, and hence higher status. The field does, however, have an immense amount of public data, though the single-cell count format isn’t the most accessible and creates evaluation nightmares, as mentioned before.
  • Those who solve the problem won’t be encumbered by the field’s theoretical baggage, and will rely on existing knowledge only insofar as it helps solve the task of gene multiset to gene regulatory network prediction. Currently the field has a lot of baggage and is rather impenetrable to outsiders.

When we try to analogize at the technical level, things begin to diverge. Firstly, the analogy of the gene regulatory network (GRN) as the “conformation” of a gene expression state is a bit weak.

For instance, what might something like Anfinsen’s dogma look like in the single-cell transcriptomics field? As a first pass, it might be: for small gene regulatory networks in a standard cellular environment, the native gene regulatory network structure is determined only by the expressed genes in the cell.

On its face, this analogy immediately appears to break down: what about epigenetic modification, which plays a crucial role in controlling transcription and is in turn shaped by it? And what of the downstream proteins which too control gene expression? And are there not examples where one cell’s gene regulatory structure depends on the activity of another cell?

Further, none of Anfinsen’s three criteria seem to hold when ported to single-cell transcriptomics:

  • There are likely multiple gene regulatory conformations with comparable free energy for a given gene expression state, thus the uniqueness condition fails.
  • The energy landscape of cells doesn’t have deep, isolated minima like protein energy landscapes do.
  • The path along the energy landscape to the stable gene regulatory conformation isn’t going to be smooth but will involve many discontinuous jumps, with often violent phase transitions and symmetry-breaking, the equivalent of knots and tangles in gene regulatory folding dynamics.

Yet let us remember that Anfinsen’s dogma not being completely universal did not stop the protein structure field from making immense progress on the problem. Counterfactually, had Anfinsen not forwarded such a strong hypothesis, the protein structure field might not have made its eventual computational turn and ended up more like—well, what the non-bioinformatics parts of the cellular biology field looked like for most of its history, a field enamored with building specific models of particulars.

Thus, perhaps we should indulge these useful fictions, at least temporarily, if it would be a way of galvanizing interest in the field. For too long the cellular biology field has been focused on the questions of “what is the gene regulatory code?” and “what is the mechanism of gene expression regulation?”, both of which it has made stunning progress in answering, but has paid little attention to the third and more useful question of predicting gene regulatory network structure from gene multiset.

What might it look like for this problem to be “solved”, for the single-cell transcriptomics field to have its AlphaFold moment in the next year or two?

Taking our cues from the protein structure field, solving this problem will likely involve shedding dearly held beliefs and upsetting those who have worked on small pieces of the broader problem for decades. The field will develop its equivalent of CASP, someone will engineer the right PR campaign to make the problem high status or profitable, and then the computational outsiders will come in and finally solve the problem in one fell swoop.

On a technical level, critically, the single-cell gene regulatory prediction problem will involve massive amounts of self-supervised pretraining, which the protein structure prediction field doesn’t quite have (though it tries to hack it with synthetic data), relying instead largely on supervised training on the structure prediction task; this is because the single-cell field has around 100 million cells to train with currently whereas the protein structure world only has 200,000 structures in PDB. For this same reason, the AlphaFold of gene regulatory network prediction will likely involve an almost purely deep learning based approach (which will almost certainly involve diffusion, as AlphaFold3 does). But then what? What comes after the single-cell field’s AlphaFold moment?

Well, just as the protein structure problem hasn’t actually been completely solved and the structure prediction field has now moved on to the more interesting, useful applied problems, like predicting multimer dynamics (not statics) and the problem of design, so too will the cellular biology field move beyond the rather boring, not-so-useful problem of single-cell gene regulatory network prediction to the more exciting problems of multicellular dynamics prediction and perturbation/therapeutic design. But to get there, it first needs to solve the spherical cow problem of gene set to regulatory structure prediction, and perhaps simple perturbation prediction, too.

But what really are these single-cell models we’re training and what are they good for?

Learned Simulators and Energy Landscapes

On the surface, it may look like we are just learning statistical correlations in text. But it turns out that to just learn the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world; there is a world out there, and it has a projection on this text.

Ilya Sutskever

While a protein folding model seems immediately useful for biomedicine, it’s not clear what we’d do with a “gene regulatory network folding” model. Target discovery is an obvious application, or maybe better transcription factor selection for directed differentiation and reprogramming. But beyond that, these just seem like toy models.

But what the analogy to protein structure prediction might obscure is that we’re not directly training our model on the supervised task of GRN prediction; rather, we first pretrain it via self-supervised learning on the massively multi-task problem of predicting the noised/masked genes. Through doing this, the model develops rich internal representations that can be used for GRN prediction—and much more.
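To make “predicting the noised/masked genes” concrete, here is a toy masked-token version of that objective in PyTorch; the real models discussed here are diffusion-based and fancier, and `model`, the tokenization, and the masking scheme below are stand-in assumptions:

```python
import torch
import torch.nn.functional as F

def masked_gene_pretrain_step(model, gene_tokens, mask_frac=0.15, mask_id=0):
    # gene_tokens: (batch, n_genes) integer ids for a cell's expressed genes
    mask = torch.rand(gene_tokens.shape, device=gene_tokens.device) < mask_frac
    corrupted = gene_tokens.masked_fill(mask, mask_id)   # hide a random subset of genes
    logits = model(corrupted)                            # (batch, n_genes, vocab_size)
    # Score the model only on the positions it had to reconstruct.
    return F.cross_entropy(logits[mask], gene_tokens[mask])
```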

The bigger vision of what we’re actually building here is, in the limit, a learned cell simulator (or, more accurately, a learned cell dynamics model).

Just as the objective of predicting the next token of internet text teaches a neural net not merely the statistical patterns of grammar and syntax but a compressed model of the world that generated that text (embedded in which there is quite literally a world model), so too will training a neural net to emulate the internal states of cells teach it more than just how to model gene expression patterns.

This is much clearer in the case of protein folding: AlphaFold3 is a kind of diffusion-based learned simulator that predicts the wiggling motions of proteins as they descend down the energy landscape into a stable conformation. Rather than using a 1-for-1 direct molecular dynamics simulation, which is quite expensive, it instead learns to approximate the energy landscape which such simulations encode by learning to estimate its score (i.e., the gradient of the log density of the approximated energy landscape).

Similarly, though it may not be obvious (largely because the computational cellular biology field suffers from an immense lack of ambition), our single-cell model is also a kind of learned simulator in gene expression space. The score function of our diffusion model encodes the directions of steepest descent on this imaginary energy landscape of cellular states, and through iterative denoising the gene regulatory network does its own kind of wiggling and annealing as it falls into a stable conformation.
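To spell out the identity being leaned on here (standard score-based diffusion bookkeeping, not anything specific to our model):

```latex
s_\theta(x, t) \;\approx\; \nabla_x \log p_t(x) \;=\; -\nabla_x E_t(x),
\qquad \text{where } p_t(x) \propto e^{-E_t(x)}.
```

Each denoising update of the form $x \leftarrow x + \eta\, s_\theta(x, t) + \text{noise}$ is then approximately a noisy gradient step downhill on this implicit energy landscape, which is the precise sense in which iterative denoising resembles annealing into a stable conformation.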

And just as protein folding models are now being extended from modeling static structures to modeling the dynamics of folding trajectories (e.g. of conformation-switching proteins), so too will we soon extend our single-cell simulator to modeling multi-frame trajectories or “videos” in cell state space. Yes, even though our model is currently trained on static snapshots of cellular state, this sort of pretraining provides the model with a manifold of cellular states, a prior over which it can learn to interpolate continuous trajectories through a bit of fine-tuning on multi-frame single-cell state “videos”, just as text-to-image models are used to initialize text-to-video models.

The Mechanistic Mind’s Retort: Why Black Box Simulators Aren’t Enough

But for a moment let’s assume it’s 2-3 years hence and AlphaFold 5 has solved binder design such that we can perfectly drug any target we choose; the single-cell field has finally had its AlphaFold moment, and we now have a more advanced, video-like version of our gene diffusion model, which can predict gene expression trajectories in response to perturbations: you feed in a starting gene expression state, $s_0$, and an action, $a$, and it outputs a series of predicted future cellular states $s_1, \dots, s_n$. I hand you this black box and ask you to go accelerate drug discovery, in particular to find the biological target (which we can now drug perfectly) $a^*$ that moves the cell from some given state $s_0$ to a desired state $s^*$. How would you do it?

There are two stipulations of the problem:

  • The simulator takes a small amount of time to run, say 1 second, so you can’t simply brute force search.
  • You can look inside the black box model and read out its internal activations as vectors, but you don’t yet have tools like sparse autoencoders to make sense of them. Similarly, you’re also allowed to perturb the activations inside the model, but you’re only able to do this in the language of genes.

This is really an iterative search problem. First you might try doing a random search over the action space and seeing what the results are. You can then look at the resulting gene expression states, compare them to your target states, and try to reverse-engineer a model of what is going on inside, perhaps by relying on your gene regulatory network prediction module. You then might run some in silico experiments by knocking out or overexpressing genes inside your virtual cell. You’ll then think really hard about how to update your mental model of the cell’s behavior given these new in silico experimental results.

With these results in hand you will have a hazy guess about why the cell behaved the way it did, but it’s rather hard to understand a dynamical system like that, so you’ll probably latch onto a small subnetwork that you feel really explains the entire dynamics and focus your efforts on that. You’ll then go into the lab to run experiments and see if your hunch was correct. Then, with your newly gained knowledge about this small part of the overall system in hand, you’ll return to the single-cell simulator and try running what amount to basically another few random guesses, and cross your fingers that things start to make more sense.

Just as you can’t brute-force search chess, you certainly can’t brute-force biology. Even when trying to test combinations of just two or three gene perturbations at a time, the combinatorics become astronomical pretty quickly. Therefore, controlling biological systems—which is the goal of drug discovery—becomes an iterative search problem through drug space, which in turn is a search problem through the space of models of these biological systems used to predict the effect of drugs on them, which in turn is really an iterative experiment selection problem to choose the experiment that maximizes the amount of knowledge gained about that biological system (insofar as it’s useful for building a model of how to better control it). This iterative search problem is a dance between the coupled system of experimental agent (currently a human) and biological system of investigation (currently a physical cell).

The core loop of this iterative navigation process is: build a model about how the system works; formulate a hypothesis to test this model; run an experiment to test the hypothesis; receive data back and update your model.

It’s like a game of twenty questions, where the experimental “question” elicits information from the cell that helps you determine the next best question to ask to figure out what it is and how to control it.
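A minimal sketch of that core loop, with every callable a hypothetical placeholder for the pieces described above:

```python
def iterative_search(world, propose_experiment, update_model, model, budget=20):
    """Twenty-questions-style loop: each round, pick the experiment expected to be
    most informative under the current model, run it, and update the model."""
    for _ in range(budget):
        hypothesis, experiment = propose_experiment(model)
        observation = world.run(experiment)   # a wet-lab assay or a simulator query
        model = update_model(model, hypothesis, experiment, observation)
    return model
```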

Hypothesis generation and understanding are currently the rate-limiting steps of this iterative search process, and they will continue to be if we end up in this counterfactual world of black box single-cell simulators without mechanistic interpretability. We’ll revert to using the exact same sort of overly compressed models of biological systems, and therefore continue to ask the wrong sorts of questions, not fully exploiting the power of these simulators.

In other words, we will be in relation to our virtual cell as cellular biologists currently are to real cells: dumbly staring at a largely impenetrable black box, barely able to begin describing the true complexity inside it.

Yet this seems absurd: don’t we have extremely powerful tools to see inside the cell and therefore immense amounts of data? How has that not led to corresponding understanding of these cells—why do they remain black boxes to us? And what might it look like to peer inside their minds and truly understand them, not in the language of genes, but in some deeper language?

Pt 3. The Path From Liminal Legibility to Takeoff

Agentic Disposition: Mechanistic Interpretability as Intentional Stance

And so what the neural network is learning is more and more aspects of the world, of people, of the human conditions, their hopes, dreams, motivations, and their interactions, and the situations that we are in right now. And the neural network learns a compressed, abstract, and usable representation of that. This is what’s been learned from accurately predicting the next word. Further, the more accurate you are at predicting the next word, the higher with fidelity, the more resolution you get in this process.

Ilya Sutskever

Imagine you had to construct a theory of human behavior primarily through observation but without theory of mind. That is, imagine you applied a kind of mechanical philosophy and believed that other humans were simply black-box stimulus-response machines. Maybe you could run experiments by trying to push different buttons on them and see how they responded, but when trying to model things like the complex patterns of compressed airwaves coming out of their mouths, you’d be at a loss.

But then suppose I decided to give you a futuristic fMRI machine that allowed you to read out the neural states (or more accurately, the blood flow oxygenation proxy for neural activity) of people’s brains exactly. Would this help at all? Likely not—a 1-for-1 map of the territory isn’t all that useful and you’d be overwhelmed with all the data.

But then suppose you started believing in theory of mind and realized that humans had internal mental states like beliefs, desires, and intentions. How might this change your approach? You might start providing targeted stimuli to elicit these internal mental states, which you could then use to build up a model of their internal world to predict their behavior with: for instance, giving them a present might create joy, or surprising them with a scary costume might trigger fear. Once you had a sufficiently good model of their internal mental states, you might start even asking them targeted questions (in the form of compressed airwaves) to elicit their beliefs and intentions, since you now have a map from their utterances to the internal mental states that generated them.

Crucially, however, your map of this internal world wouldn’t be directly isomorphic to the patterns of neural firings in that person’s brain; rather, to efficiently and robustly model them, you’d probably arrive at something looking more like folk psychology, which is built in terms of humans’ internal concepts of how the mind works. (And with this folk psychology in place, your futuristic fMRI starts to become much more useful.)

In cellular biology, our epistemic stance toward the cell is confusing: on the one hand, we’ve developed incredibly sophisticated tools for reading out the equivalent of cells’ fine-grained neural firings, and this has yielded some progress in modeling extremely specific parts of their behavior (the equivalent of modeling isolated, small neural sub-circuits). On the other hand, when it comes to modeling the entirety of the cell, we throw away all the richness of these data and treat cells like stimulus-response machines, a kind of cogwheel-and-hydraulic-tubes automaton.

What would it look like to apply the intentional stance to cells, to attribute propositional attitudes to them? That is, just as there is a middleground between treating a human as a black-box stimulus-response machine and an inscrutable mass of neural firings, might there be a middleground between treating cells as black-box stimulus-response machines and an inscrutable mass of gene expression or proteins? And might this actually produce a more efficient map of the cell’s internal world model, one which enables better understanding and control of it?

Luckily, we now have a tool for peering inside the cell’s mind and beginning to answer these questions: mechanistic interpretability. As our earlier sparse autoencoder results suggest, the cell does not think purely in terms of genes but instead models its world in terms of higher-order concepts, like gene regulatory modules or sub-cellular locations, or perhaps even relative positions of nearby cells. These might form the basis of a “folk psychology of cellular behavior”, or even an integrative psychotherapeutic model of the cell’s inner life.

Of course, on one level this is absolutely absurd: the cell doesn’t think in the sense that we humans do, so why are we trying to be its therapist? But thinking in these terms might be epistemically advantageous compared to the way we currently think about the cell.

With this new language to describe the internal model of the cell, we immediately cut down the space of possible hypotheses to test and can begin to ask better experimental questions of it, eliciting the exact information we need and fast-tracking our way through the game of twenty questions.

For instance, there are roughly 20,000 protein-coding genes in the human genome. Rather than trying to brute-force search the roughly 1.3 trillion possible combinations of knocking out three of these at a time (which would be like trying to ablate sets of individual neurons inside a human’s brain), and rather than trying to explicitly model these genes as a gene regulatory network (which, as mentioned before, is not actually a useful tool for navigating the internal complexity of the cell, since the cell isn’t really computing at the level of genes), through the language of SAE features and feature families we can see through the ambient data space matrix and glimpse the deeper, sparser forms really driving the cell’s behavior—forms which encode these sorts of gene regulatory networks but at a more compressed, useful level of description, one that allows us to navigate the experimental maze with thousands of tests, not trillions.

Let’s now paint a picture of the counterfactual world where we get not just learned cell simulators but mechanistic interpretability for them, and develop an empirical model for how this will accelerate early-stage drug discovery by making the iterative experimental search process more efficient.

Mechanistic Interpretability Beyond SAEs: A Cellular Biologist’s Dream

Mechanistic interpretability is not just about building better tools to peer inside our simulator, but also better tools to probe and perturb it in order to reverse-engineer its behavior. This will be the second great revolution in mechanistic interpretability for biology after sparse autoencoders.

Researchers working with a real cell are limited in the amount of information they can extract from a single experimental query of it—essentially the output state compared to the input state, whatever signposts they might have put up to flag interesting things going on inside the system, and whatever else they’re able to deduce from this information—whereas when working with a virtual cell we can exploit its computational graph to derive all sorts of useful information from as little as one forward pass.

For instance, the integrated gradients method is an older method that has already been used on functional genomics models to determine which input nucleotides are most important in predicting an epigenetic readout like binding of a particular transcription factor to that region of DNA. The idea of integrated gradients is to exploit the gradients from the model’s backward pass to “attribute” the activations at a later layer (typically the output) to the activations at an earlier layer (typically the input), by integrating the gradient along a path from a baseline to the actual input. (Similarly, you could exploit forward-mode gradients computed during a forward pass to do the same sort of attribution.)
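For concreteness, here is a bare-bones Riemann-sum version of integrated gradients in PyTorch, attributing one scalar output of a differentiable model to its continuous inputs; `model`, the input shapes, and the baseline choice are placeholder assumptions, and the model is assumed to treat the leading dimension as a batch:

```python
import torch

def integrated_gradients(model, x, baseline, target_idx, steps=64):
    # Interpolate along the straight-line path from the baseline to the actual input.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline.unsqueeze(0) + alphas * (x - baseline).unsqueeze(0)).detach()
    path.requires_grad_(True)

    # One batched forward/backward pass over all interpolation steps.
    model(path)[..., target_idx].sum().backward()

    avg_grad = path.grad.mean(dim=0)    # average gradient along the path
    return (x - baseline) * avg_grad    # per-element attribution of the target output
```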

In the functional genomics case, integrated gradients and similar methods can indeed tell you which parts of a nucleotide sequence are most predictive of transcription factor binding scores; this is an amazing step forward, attributing changes in the output to changes in the input. But, crucially, what this does not give you is a useful explanation of why that region of the input has such a large effect, or an explanation of what the computational pathways mediating this effect actually mean.

Sparse autoencoders offer a solution here, allowing us to read out the intermediate computational pathways in terms of our SAE’s concept dictionary. They make methods like integrated gradients much more useful for understanding model behavior.
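For reference, the core of a sparse autoencoder is only a few lines. This is a generic sketch with arbitrary dimensions and sparsity penalty, not a description of our specific training recipe:

```python
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_dict=16384):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)             # overcomplete feature dictionary
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, acts):
        feats = F.relu(self.enc(acts))   # sparse, non-negative feature activations
        recon = self.dec(feats)          # reconstruction back in activation space
        return recon, feats

def sae_loss(recon, acts, feats, l1_coeff=1e-3):
    # Trade reconstruction fidelity against sparsity of the feature code.
    return F.mse_loss(recon, acts) + l1_coeff * feats.abs().mean()
```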

But we can do better than just seeing how changes in the input affect the output; we can directly intervene on the intermediate activations of the model, over-expressing or ablating different features, to see which have an important effect on the predicted output, as we did with our earlier logit attribution method.

Or we can even “patch” in activations from another forward pass of the model, finding the minimally sufficient changes in activations needed to cause a desired model output, to isolate the circuits which implement a particular function (e.g., predicting a particular indirect object in a natural language model).

Neel Nanda’s thoughts on the patching technique may sound familiar to the biologists in the audience:

The conceptual framework I use when thinking about patching is to think of a model as an enormous mass of different circuits. On any given input, many circuits will be used. Further, while some circuits are likely very common and important (eg words after full stops start with capitals, or “this text is in English”), likely many are very rare and niche and will not matter on the vast majority of inputs (eg “this is token X in the Etsy terms and conditions footer” - a real thing that GPT-2 Small has neurons for!)

…As a mech interp researcher, this is really annoying! I can get traction on a circuit in isolation, and there’s a range of tools with ablations, direct logit attribution, etc to unpick what’s going on. And hopefully any given circuit will be clean and sparse, such that I can ignore most of a model’s parameters and most activation dimensions, and focus on what’s actually going on. But when any given input triggers many different circuits, it’s really hard to know what’s happening.

The core point of patching is to solve this. In IOI [the indirect object identification task], most of the circuits will fire the same way regardless of which name is the indirect object/repeated subject. So by formulating a clean and corrupted input that are as close as possible except for the key detail of this name, we can control for as many of the shared circuits as possible. Then, by patching in activations from one run to another, we will not affect the many shared circuits, but will let us isolate out the circuit we care about. Taking the logit difference (ie difference between the log prob of the correct and incorrect answer) also helps achieve this, by controlling for the circuits that decide whether to output a name at all.

Importantly, patching should be robust to some of our conceptual frameworks being wrong, eg a model not having a linear representation, circuits mattering according to paths that go through every single layer, etc. Though it’s much less informative when a circuit is diffuse across many heads and neurons than when it’s sparse.

That is, through the extended toolkit of mechanistic interpretability, we can begin running the same sorts of causal mechanistic experiments we do on real biological systems to determine necessity and sufficiency, and through doing so begin to reverse-engineer the circuits driving cellular behavior.
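As a sketch of what activation patching looks like in code (the hook and module path are hypothetical, and real transformer blocks often return tuples, which would need extra handling):

```python
import torch

def run_with_patch(model, corrupted_input, clean_cache, layer, positions):
    """Re-run the model on the corrupted input, but overwrite the chosen layer's
    output at `positions` with activations cached from the clean run."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, positions] = clean_cache[layer][:, positions]
        return patched   # returning a value from a forward hook replaces the output

    handle = model.blocks[layer].register_forward_hook(hook)   # hypothetical module path
    try:
        with torch.no_grad():
            return model(corrupted_input)
    finally:
        handle.remove()
```

The quantity of interest is then something like the logit difference between the patched run and the unpatched corrupted run, as in the framing above.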

As a sketch, suppose we are trying to solve the earlier problem of finding the intervention $a^*$ that moves your cell from starting state $s_0$ to a desired end state $s^*$. Suppose you have a bunch of single-cell perturbation experiments you ran in the lab when trying to answer this question: the starting expression state, an embedding of the perturbation, and the resulting expression state. How might your virtual cell combined with mechanistic interpretability techniques allow you to squeeze more bits of information out of these wet-lab data, such that you can more quickly and efficiently navigate intervention space and find the intervention $a^*$ that will move your cell to the desired state?

First, you format the existing experimental data to feed into your virtual cell model as a two-frame video (or as a single sequence with an additional positional embedding to represent the time dimension), where the perturbation is fed in as conditioning. You then noise the data sufficiently, run a forward pass, and look at the major SAE features that are active across tokens in the cell. This gives you a general idea of where you’re at in cellular state space and what the major biological modules at play are.

Then, you might start doing some cheap attribution methods like the forward-gradient method or DeepLIFT, which require O(1) forward and backward passes, just to get an idea of which input genes matter for the output.

Then you might try a more expensive sparse feature attribution method, either computing attribution of output logits (particular genes you’re interested in) or the activation of other features with respect to upstream features. The cheapest way of doing this requires 2 forward passes for each of the features you want to run attribution on, but conceivably you could run many more forward passes to get a more precise estimate (and do this at varying diffusion noise levels to get an even more precise one). With such a feature-feature or feature-logit attribution graph, you can begin to get an idea of which features are most causally central to the cell’s computation.
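One naive way such a feature-to-logit attribution could be computed, sketched in PyTorch; the `run_to_layer`/`run_from_layer` handles are hypothetical stand-ins for splicing the SAE into a particular layer of the model:

```python
import torch

@torch.no_grad()
def feature_logit_attribution(model, sae, x, layer, feature_ids, target_logit):
    acts = model.run_to_layer(x, layer)          # hypothetical: activations at the SAE's layer
    recon, feats = sae(acts)
    base = model.run_from_layer(recon, layer)[..., target_logit].mean()

    effects = {}
    for f in feature_ids:
        ablated = feats.clone()
        ablated[..., f] = 0.0                    # zero out one dictionary feature
        out = model.run_from_layer(sae.dec(ablated), layer)[..., target_logit].mean()
        effects[f] = (base - out).item()         # > 0: the feature pushes the target logit up
    return effects
```

Run feature-by-feature this is the two-passes-per-feature accounting described above; the sketch simply amortizes the clean pass across features.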

Then you might try to see which of these features is most important in mediating the effect of the perturbations on the cell’s transition from the starting expression state to the end expression state, by ablating them from the model one-by-one. Supposing you identified a small group of candidate mediator features, you might then further ablate features downstream of them to determine the sufficient feature pathways through which they exert their effect on the outcome state, a kind of extended description of the perturbation’s mechanism of action.

Eventually, using a combination of attribution, ablation, patching, and other methods, you might be able to discover and reverse-engineer circuits inside your virtual cell that implement some behavior, or identify the minimally sufficient perturbation needed to cause some desired change while minimizing undesirable effects, like activation of the “toxicity feature”.

If your mechanistic interpretability techniques were truly advanced and you had an SAE trained in cell space and an SAE trained in your perturbation embedding space, this might give you interpretable axes of variation in perturbation space which accelerate your search through that space even further—operating in the space of higher-order protein or small molecule concepts, rather than directly in the embedding space of amino acid sequences or molecular structures, an interplay between the computational primitives of the world models of two classes of agents that have been evolving together for eons.

And rather than manually reasoning about these perturbation space features, you might be able to simply compute feature-feature attribution effects between the perturbation features and your cell state space features, telling you which perturbation features have the largest effect on which features inside your cell. Then, when you run a forward pass with some perturbation $a$, you can measure the delta between your desired and actual cell feature state, and backpropagate this delta all the way back into your perturbation embedding space to then take an optimization step and choose the next best perturbation embedding to test.
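A sketch of that last step, the gradient-based search over perturbation embeddings; every interface here (`return_activations`, the SAE, the feature-space target) is a hypothetical placeholder for the machinery described above:

```python
import torch

def optimize_perturbation(model, sae, s0, target_feats, init_pert, steps=200, lr=0.05):
    pert = init_pert.clone().requires_grad_(True)
    opt = torch.optim.Adam([pert], lr=lr)
    for _ in range(steps):
        acts = model(s0, perturbation=pert, return_activations=True)   # hypothetical interface
        _, feats = sae(acts)
        loss = (feats - target_feats).pow(2).mean()   # delta between desired and predicted features
        opt.zero_grad()
        loss.backward()        # push the delta back into perturbation-embedding space
        opt.step()
    return pert.detach()       # candidate embedding for the next test
```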

You’ll then go into the wet-lab, test your model against biological reality, and come back with the data in-hand, ready to do more virtual experiments on it.

This is the opposite of brute-forcing: this is a cellular biologist’s dream.

The Economics of Trading bioFLOPs for FLOPs

Though we may think of experiments on our virtual cell as effectively costless, they are not. They cost FLOPs: both the inference-time compute of running experiments on the virtual cell, and the amortized cost of training the virtual cell and the supporting mechanistic interpretability toolkit up-front.

For instance, the feature-feature attribution method we run on our model takes two times the number of features in our dictionary (so ~32,000) forward passes per cell. For a relatively small model (on the order of 300-million parameters) with a sequence length of 1024, and a feature dictionary of size ~16k, without any optimization this takes more than 15 minutes on a 4090 GPU for a single cell. (This is largely due to memory constraints caused by the need to do a softmax with our continuous diffusion model for categorical/discrete data, which could be engineered around via approximation. Additionally one could do attribution over clusters of feature families rather than individual features, which would reduce the quadratic computation cost of the feature-by-feature matrix.)

So, realizing that virtual cell experiments are not actually free, we can then ask an interesting question: given trends in the cost of bioFLOPs (i.e. wet-lab experiments) and FLOPs (i.e. in silico accelerated computing like GPUs and TPUs), how good must our virtual cell simulator be to break even with running actual wet-lab experiments, and when can we expect this for a variety of drug-discovery related use-cases?

Napkin Math of Inferencing Real Cells vs. Virtual Cells

The cost of bioFLOPs (or more accurately, the cost of scopes to capture information emitted by bioFLOPs) has come down precipitously over the past 10 years. The primary inputs here are the cost of DNA sequencing and the cost of single-cell library prep kits:

  • The variable costs of sequencing (i.e. reagents) are now as low as ~\$2 per Gb (Gigabase, i.e. 1 billion basepairs) when using the highest throughput, costliest instruments, but cost more like $50/Gb when using Illumina’s newest low-throughput instrument, the MiSeq i100, which was announced less than a month ago. Over the past couple years many competitors have entered the space, from Ultima Genomics to Element Biosciences, so it’s not unreasonable to expect these costs to continue declining.

  • There have been major cost declines in the single-cell library prep space over the past few months, too, also likely due to competitors entering the space (Parse Biosciences, Fluent, etc.). Whereas even just a year ago single-cell prep cost around $0.10 per cell, now the claimed costs are nearing closer to \$0.01 per cell.

    Through a series of new products and configurations expected to launch this quarter, 10x Genomics intends to deliver mega-scale single cell analysis at a cost as low as $0.01 per cell. 10x Genomics believes its upcoming launches, which enable 2.5 million cells per run and 5 million cells per kit, will be the most cost effective single cell products available for CRISPR screens, cell atlassing projects and other high-throughput applications.

  • Factoring in both of these cost declines (you can mess with a cost calculator here, though some parameters like sequencing cost don’t go low enough; but you can modify sequencing depth, how many samples you multiplex per droplet, etc.), we get a consumables cost of roughly \$0.02 per cell: ~\$0.01 per cell in library prep, and ~\$0.01 per cell in sequencing reads.

    • For instance, the NovaSeq X 25B flow cell (the 300-cycle version, for 2x150 bp reads) costs $16k and generates 8 Tb of data. At a target sequencing depth of ~10,000 reads per cell after quality control (assume 25% of reads fall below quality threshold), and using paired-end reads of length 150, you can sequence around 2 million cells.
    • The library prep costs would be roughly similar for 2 million cells at $0.01 per cell.
    • So your total overall cost would be on the order of $35-40k in consumables to profile 2 million cells.
    • If sequencing costs and library prep costs both halve over the next year to \$1/Gb and \$0.005/cell respectively, then you’re looking at a pretty clean number of $10k in consumables per 1 million cells.

So, if you wanted to run a massive perturbational CRISPR screen testing the effect of knocking out every protein-coding gene in the genome, and do it across 100 replicates each, you could do it for around $35k sometime early next year.
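The napkin math behind that figure, spelled out (all inputs are the assumed near-term prices from above):

```python
genes = 20_000                 # protein-coding genes to knock out
replicates = 100               # cells profiled per knockout
cells = genes * replicates     # 2,000,000 cells

library_prep_per_cell = 0.01   # $/cell, claimed near-term pricing
sequencing_per_cell = 0.008    # $/cell (~$16k NovaSeq X 25B run / ~2M cells)

total = cells * (library_prep_per_cell + sequencing_per_cell)
print(f"{cells:,} cells -> ~${total:,.0f} in consumables")   # ~$36,000
```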

Conversely, suppose you wanted to do a similar massive perturbation screen on your virtual cell. How much would it cost?

We can do some extremely rough back-of-the-envelope math here on the current costs of inferencing our virtual cell:

  • To start, let’s pretend that your virtual cell is a 100 billion parameter dense diffusion transformer model (we’ll get to how to train this model shortly). Just to store the weights (fp16) in memory will take 200 GB of VRAM (2 bytes per parameter). Let’s assume we’re using an 8x H100 SXM5 node ($25/hr) with model parallelism; this means the weights alone take up 25 GB of VRAM per GPU.
  • For now, assume we’re doing denoising diffusion inference over gene tokens of the form: [$s_0$, $s_1$], where $s_0$ is your given starting gene expression state (the prefix) and $s_1$ are the predicted gene expression tokens to be denoised in parallel. For instance, consider a context length of 16384 gene tokens per cell state, for a total context length of 32768, where the first 16384 tokens are given as conditioning (and not noised) and the second 16384 are noise tokens that are to be denoised in parallel; though technically there is global bidirectional attention, only the $s_1$ tokens are being updated with every forward pass, which you can think of as (though it technically is not) a kind of prefix-bidirectional setup (you might mistakenly think you can therefore KV cache $s_0$’s activations, but this would be an error—though perhaps they will be similar enough across denoising steps that this is a good enough approximation).
  • Assume you do 100 denoising steps in total (the nice thing about diffusion models is they decouple your context length from the depth of your unrolled computational graph). Your perturbation is fed in via conditioning cross-attention during these denoising steps, conditioning the prediction of $s_1$ given $s_0$. The entire denoising trajectory amounts to a single virtual CRISPR knockout experiment.
  • To make a long story short (and making a ton of assumptions), your worst case scenario using nothing fancy to make the attention calculations go faster or allow for larger context lengths results in around 1 virtual cell experiment per second when using batch size of 1, or 3600 cells per hour, giving a cost of \$0.007 per experiment, which we’ll just round up to \$0.01.

So, using this napkin math and ignoring the overhead of lab space, personnel and sample prep for the wet-lab experiment, and ignoring the cost of training the virtual cell model for the virtual cell experiment, the costs per cell experiment are essentially the same in vitro and in silico.

Obviously the wet-lab experiment is going to give you more bits of real information; the virtual cell is only an approximation at this point, after all. But suppose that through obvious engineering improvements—flash attention, improved adaptive step-size samplers for our diffusion denoising, model distillation, KV-cache approximation for our $s_0$, caching the perturbation cross attentions between timesteps, and other improvements from the image modeling literature we can arbitrage—we were able to make the virtual cell experiment 3 orders of magnitude faster and cheaper. Then we have a more interesting question: how good must our virtual cell be for 1000 virtual experiments to be equivalent to 1 real wet-lab experiment? What’s our break-even point?

Furthermore, then suppose we start factoring in the cost of time and how virtual experiments accelerate research cycles. If I can run a virtual cell experiment and get the predicted data out in less than a second, that allows me to make a quicker decision about the next experiment to run; compare this with having to wait 24-48 hours for my sequencing run to finish (which is how long the highest throughput, and therefore cheapest, sequencers currently take). Suddenly this conversation starts becoming a lot more interesting.

Perhaps it’s best to think about a combined FLOPs and bioFLOP system, where the former is used for fast, cheap iterative exploration with a lossy simulator and the latter is for doing higher fidelity, full-resolution experiments based on the navigation done with the lossy simulator. For instance, as mentioned before, running a full feature-feature attribution graph on a single cell can still be quite expensive computationally, taking on the order of minutes, precisely the sort of virtual experiments you could be running on previous experimental data while your newest wet-lab experiment is running.

Solve For the Equilibrium

How will this all play out in terms of demand for bioFLOPs vs FLOPs over the coming years?

On the one hand, virtual cell FLOPs appear to be a substitute for in vitro bioFLOPs, and perhaps cheaper virtual cell FLOPs lead to lower demand for high-quality biological tokens.

But virtual cell FLOPs are also a complement to bioFLOPs, in that running virtual experiments makes experiments on real cells all the more valuable, since these wet-lab experiments can be used more efficiently to understand the system of study and confirm the efficacy of a drug candidate to send down the pipeline.

One might think that this increasing efficiency of bioFLOP use would lead to an overall decline in use, but maybe the increased efficiency will induce a net increase in bioFLOP usage a la Jevons paradox. Therefore, the existence of a useful virtual cell might actually on net increase the demand for high-quality biological tokens, which is further reinforced by more capital flooding into a more efficient biopharma industry. And as we collect more high-quality biological tokens, we’ll train even better virtual cell simulators, further reinforcing this positive feedback.

Our analysis shouldn’t be restricted to the demand side. What of supply?

The prior estimates for cost of virtual vs. real cell experiments are based on current input costs. How might these costs fall over time, and how should this affect our forecast for the demand of bioFLOPs vs virtual cell FLOPs?

Consider that there are exogenous trends driving down the cost of virtual cell FLOPs: namely, the declining cost of hardware ($/FLOP halves a little over every two years) and improved algorithmic efficiency of training and inferencing deep learning models (which ~doubles a little under every year or so in domains like natural language and computer vision). So in total, we should expect our virtual cells to become about 2.8x cheaper to inference every year, if these trends continue. For instance, just consider all the inference-time algorithmic improvements being made in the diffusion for image modeling field, like better sampling regimes and distillation, which can be arbitraged to more efficiently inference our virtual cell.

Assume that virtual cell experiments continue to become much cheaper, and at a faster rate than real biological experiments. What’s the equilibrium? In the long-run, it seems reasonable that total usage of virtual cell FLOPs will completely dwarf the usage of bioFLOPs, the latter being reserved purely for confirming the results of in silico experiments, just to be extra sure.

But of course, unlike with a real cell, with a virtual cell we must grow or train the model before we can run experiments on it. How much might that cost?

Let’s be completely honest: we’re letting our imagination run away with us. Yes, our current virtual cell might be a good tool for running rough simulations that later need to be confirmed in the lab, but it’s unclear how a virtual cell trained in the language of gene counts could ever come to fully model the richness of cellular biology such that it would fully replace experiments on real cells. But what if there were an even better data source out there for training our virtual cell simulator, one which, if trained on, might produce a high-fidelity virtual cell that is almost as useful as a real one?

Readformer: Sequence Read Archive as PDB-sized Opportunity

I’ll get to the punchline: we can do better than single-cell gene expression data. The most realistic virtual cell model of the next couple years is going to be trained in the language of nucleotides, specifically on sequencing reads from different -omics modalities, but primarily RNA (both bulk and single-cell).

Nucleotides will be the universal data format for training and inferencing our virtual cell, at least to begin with, since we can represent everything from chromatin accessibility to transcriptomics to the genome in the language of reads (delimited with special tokens to indicate the different modalities).
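A toy version of what that universal reads-based format might look like; the special tokens and base vocabulary here are invented for illustration:

```python
SPECIALS = {"<RNA>": 0, "<ATAC>": 1, "<SEP>": 2}
BASES = {"A": 3, "C": 4, "G": 5, "T": 6, "N": 7}

def tokenize_read(read: str, modality: str) -> list[int]:
    # One modality tag, then the raw nucleotides, then a separator.
    return [SPECIALS[modality]] + [BASES[b] for b in read] + [SPECIALS["<SEP>"]]

# A (tiny) cell context mixing modalities in a single token stream:
cell_context = tokenize_read("ACGTTGCA", "<RNA>") + tokenize_read("TTGACGTA", "<ATAC>")
```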

There are almost 90 petabases of nucleotide data stored publicly on Sequence Read Archive, and around 10-20 of these petabases are human or mouse and are of a modality worth training our virtual cell model on, like ATAC-seq or RNA-seq (DNA is of lesser importance, and you can probably even reconstruct the important parts of the genome from RNA data alone).

To understand how rich this data is and why it will be an immense source of untapped dynamics modeling knowledge to train our virtual cell on, I’ll provide an analogy: imagine you’re trying to train an extremely high-resolution image generation model, and you have a petabyte of rich, detailed pictures in full 4K resolution. But because these are too hard to handle, you instead decide to break each massive 4K image down into patches, each of which gets mapped to a single token within a fixed patch vocabulary of 60,000 (these mappings are static and not learned), resulting in a much smaller 64x64 canvas of visual quanta. Though you’ve made the problem more tractable by both shrinking the canvas and compressing the visual information into this limited language, you’ve also thrown away much of the rich variation in hue, texture, and subject matter that gave these photos life. (To make the analogy more realistic, it’d be better to think about taking a 100 x 100 multichannel image, where each pixel has 150 channels instead of 3, and mapping it to a 64x64 grid where each patch could take on one of 60,000 possible “channel patterns”.)

Or, alternatively, imagine you took the 200,000 or so experimentally determined structures on PDB, most of which have resolutions of 1-4 angstroms, and you decided not only to throw away half of the atoms per structure, but you also added Gaussian noise to the position of each of the remaining atoms, such that the new resolution was more like 10-40 angstroms. This deliberate hobbling would likely get in the way of learning a good structure prediction model.

And yet this sort of hobbling is precisely what we do with many of our omics-based modalities, and understandably so. In the case of RNA-seq data, a single cell might have on the order of 10,000 reads (sequences of “A”, “T”, “C”, “G”) after quality control, each of length 150, for instance. Rather than trying to model the set of nucleotide sequences directly, we instead individually map/align them to the genome to get a single predicted gene “quanta” per read. A good portion of these reads don’t map to any genes in the genome we know of, so we simply throw them out. And then we count up how many of each quanta we have in our set of mapped reads, called “gene counts”, losing all the fine-grained information about the nucleotide sequences.

This data compression is an understandable trade-off. But there are deep patterns in the distribution of reads far beyond our comprehension that predict important biology—obviously transcriptional splicing events, which we have simplified mechanistic models of, but likely far more, especially if you believe that our virtual cell is learning a world model of something deeper than just RNA or any other single modality.

If we already discover useful concepts inside the mind of our quantized cell model, what new biological concepts might we discover when we truly plumb the depths of our cell’s mind running at full resolution?

Estimating the Value of SRA

Since the protein sequence stocks have been all but exhausted, nucleotide data is the last remaining large public stock of data to train on.

For reference, the largest public protein language model to date has 100B parameters and was trained on 7.7 x 10^11 (770 billion) unique tokens (they also train on structure tokens, which complicates this), using 10^24 FLOPs (within a factor of ~2 of the 6NP FLOPs estimate rule for dense transformer architectures).

Conversely, in SRA, there are probably on the order of ~1-5 high-quality petabases worth of RNA-sequencing reads alone, which equates to 1-5 x 10^15 nucleotides (1-5 quadrillion nucleotides). Depending on how you choose to tokenize this, you’re still looking at 100 trillion to 1 quadrillion tokens. To train a 100 billion parameter standard dense transformer architecture on this (which would likely be severely under-parametrized) would take anywhere from 5 x 10^25 to 2.5 x 10^26 FLOPs, comparable to the compute used for the largest LLM training runs of 2023. In other words, more than 1.5-2 orders of magnitude greater than the amount of compute used for the largest biological model training runs to date.

How valuable is this stock of data compared to PDB?

Well, to return to our earlier analogy to the protein structure prediction field, according to one 2017 analysis, it would cost around $12 billion to replicate the PDB archive. It’s likely the current estimate isn’t too far off. The main issue is that the costs of cryo-EM and X-ray crystallography are not dropping precipitously: startups are still buying cryo-EM microscopes for nearly $4 million a pop, and as of 2020 Thermo Fisher had only sold 130 of their high-end version (but perhaps they’ve sold more since then). Around 5000 new cryo-EM structures were added to PDB in each of 2023 and 2024, forming a substantial proportion of the total stock of structures, but existing stocks are still quite scarce and therefore valuable.

Conversely, how much would it cost to replicate SRA currently, or the interesting parts of it?

Unlike in the protein structure world, in the cellular biology world, in response to market forces the costs of sequencing and single-cell library prep are precipitously declining, making existing data stocks less valuable with each passing year. Using the not-too-distant figures of \$0.01 in consumables per cell (half sequencing, half library prep), one could profile 250 million cells at a sequencing depth of 10k usable reads per cell (generating a whole petabase), for around \$2.5 million in reagents. To reconstruct the entire stock of ~10 Pb of RNA-seq data (both bulk and single-cell) on SRA would therefore only take around \$25m in consumables (it would require a lot of NovaSeq X runs, so double it), but you could probably do it for far less since single-cell library prep and sequencing have improved so much, from read quality to transcript capture rates. (For context, a few years ago this might have cost 10x the price; and at the time the data were deposited, the cost was even higher yet. The declines in sequencing and single-cell library prep costs would be like if the cost of a high-end cryo-EM machine dropped from \$4m to \$400k, or even lower: it would immediately make PDB far less valuable.)

Training Readformer and the Three Desiderata

If you decide to try to train a Readformer-like model on SRA, the issue you run into is that it’s incredibly difficult. The machine learning engineering is difficult—from the data-loading to finding the appropriate training objectives to the model architecture (consider the architecture you’d use to efficiently model data that is thousands of 150-nucleotide-length sequences in a single context window, for instance)—not to mention the bioinformatics needed to prepare the data (you’re going to want to filter and curate the samples you train on, as well as the reads within them, many of which are uninteresting and redundant; moving around 100s of terabytes of .fastq files and demultiplexing them into millions of single cells also turns out to present its own set of challenges).

But in trying to solve these challenges, you might stumble upon more efficient architectures and training methods that conceivably allow you to train and inference such models faster and more compute-efficiently, all while respecting the richness of the underlying reads-based data.

Such architectures must obey these three desiderata to make for scalable, performant simulators. Perhaps unsurprisingly, each of these is motivated by an inductive bias found in real biological and agentic systems:

  1. Hierarchical/multi-scale: The model must operate at multiple data scales while not computing everything at full resolution, relying instead on coarse-graining. This compression is a means of both compute efficiency and learning efficiency (a toy sketch of this follows the list).
  2. Adaptive compute: Computational power must be allocated dynamically, automatically adjusting to the complexity of the phenomenon being simulated, just as real agents’ internal models do. Coarse-graining should be done adaptively, not statically.
  3. Locality: Information exchange should be localized, focusing computational resources where they are most needed.
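To make the first desideratum slightly less abstract, here is a toy two-scale encoder in PyTorch: reads are encoded at base resolution, coarse-grained to one token each, and only the coarse tokens interact globally. This is a sketch under invented dimensions, not a claim about any particular architecture, and the adaptive-compute and locality desiderata would need machinery beyond what is shown:

```python
import torch
import torch.nn as nn

class ReadToCellEncoder(nn.Module):
    def __init__(self, vocab=8, d=256, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        read_layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        cell_layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.read_encoder = nn.TransformerEncoder(read_layer, num_layers=2)  # fine scale: within a read
        self.cell_encoder = nn.TransformerEncoder(cell_layer, num_layers=4)  # coarse scale: across reads

    def forward(self, reads):                               # reads: (n_reads, read_len) token ids
        x = self.read_encoder(self.embed(reads))            # (n_reads, read_len, d)
        read_tokens = x.mean(dim=1)                         # coarse-grain each read to one token
        cell = self.cell_encoder(read_tokens.unsqueeze(0))  # global attention over the read tokens
        return cell.squeeze(0)                              # (n_reads, d)
```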

Critically, the experimental toolkit of mechanistic interpretability will need to be built in conjunction with the simulators. For instance, hierarchical SAE architectures are needed to best disentangle truly hierarchical model activations. In the limit this might involve a kind of “virtual cell line engineering”, where we engineer our simulated cells to be easier to introspect and perturb (perhaps even building specific simulation-based workflows around different virtual cell lines and drug-discovery applications). But the technical specifics of this are a conversation for another day.

So with all these moving pieces in place, let’s ask: what might it look like for all this to play out over the next 3-5 years?

From Liminal Legibility to Takeoff

A brief sketch of a possible future:

Some unnamed startup has trained a virtual cell model: not the fancy sequencing reads-based kind, but the rinky-dink gene expression count kind which only took ~10^22 FLOPs or so to train. The startup gets a 10x Chromium X and a MiSeq i100 and some lab space, and starts showing that their simulator can actually predict the outcomes of single-cell transcription factor over-expression and CRISPR gene knockout experiments. This attracts the attention of some of their first customers. With this new revenue and existing venture funding, they begin training the first version of Readformer, the reads-based virtual cell model, on a curated selection of data from SRA. Even after all the architectural innovations to save compute they still throw 10^24 FLOPs at this training run, which at the time cost in the low single-digit millions, but the results are stunning.

Not only does their reads-based model better predict gene counts when the predicted reads are mapped back to the transcriptome, dominating existing gene expression forecasting and gene regulatory network prediction tasks, but it is also useful for multi-modality prediction. This was somewhat unexpected, since existing public paired ATAC+RNA (chromatin accessibility and gene expression) data is quite scarce, but it appears that through learning deep representations the model has developed great imputation abilities—perhaps due to some kind of platonic representation hypothesis whereby these two modalities are projections of the same deep underlying cellular computational forms (which also would explain why the features their SAE finds inside the model often appear to be modality-agnostic and activate on reads from both ATAC and RNA). This allows them to predict noisy, sparse ATAC signal from cheaper, more abundant RNA data, which already begins to provide huge savings for their customers.
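
As a toy illustration of what such cross-modal imputation could look like mechanically (this is not the startup’s actual method; the encoder, head, and dimensions are invented stand-ins), one might train a light ATAC head on top of a shared latent space using the scarce paired data, then apply it to RNA-only cells:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 2000 genes (RNA) and 5000 accessibility peaks (ATAC).
N_GENES, N_PEAKS, D = 2000, 5000, 256

class SharedEncoder(nn.Module):
    """Map an RNA profile into the shared latent space the virtual cell model is assumed to learn."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_GENES, D), nn.GELU(), nn.Linear(D, D))
    def forward(self, rna):
        return self.net(rna)

encoder = SharedEncoder()
atac_head = nn.Linear(D, N_PEAKS)   # light decoder head from the shared latent to ATAC peaks

# Train on the scarce paired RNA+ATAC data (random tensors stand in for real measurements).
paired_rna = torch.randn(64, N_GENES)
paired_atac = torch.rand(64, N_PEAKS)          # accessibility treated as per-peak probabilities
opt = torch.optim.Adam(list(encoder.parameters()) + list(atac_head.parameters()), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(atac_head(encoder(paired_rna)), paired_atac)
    loss.backward()
    opt.step()

# At inference time, impute ATAC for cells where only (cheap, abundant) RNA was measured.
rna_only = torch.randn(8, N_GENES)
imputed_atac = torch.sigmoid(atac_head(encoder(rna_only)))
```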

Through down-sampling their experimental data, they find that their virtual cell can not only impute one modality from the other, but can also impute much of the full, rich transcriptome from a smaller number of reads, a kind of compressed sensing. This lets them drop their sequencing depth per cell by a factor of 10, which immediately allows them to run more experiments at lower cost.
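
A sanity check of this kind could be as simple as binomially thinning the reads of deeply sequenced cells and asking how well the full-depth profile is recovered. The sketch below uses a naive rescaling imputer as a placeholder where the virtual cell model would actually sit:

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample_counts(counts, fraction, rng):
    """Binomially thin a gene-count vector, mimicking sequencing at lower depth."""
    return rng.binomial(counts, fraction)

# Stand-in for a real deeply sequenced cell: per-gene UMI counts drawn from a gamma-Poisson.
deep_counts = rng.poisson(lam=rng.gamma(2.0, 1.0, size=2000))

def impute(shallow_counts, fraction):
    """Placeholder imputer: simple depth rescaling (the virtual cell model would go here)."""
    return shallow_counts / fraction

for fraction in [1.0, 0.5, 0.1]:
    shallow = downsample_counts(deep_counts, fraction, rng)
    recon = impute(shallow, fraction)
    # Correlation of log-expression between the full-depth profile and the imputed one.
    corr = np.corrcoef(np.log1p(deep_counts), np.log1p(recon))[0, 1]
    print(f"{int(fraction * 100):3d}% of reads -> corr with full depth: {corr:.3f}")
```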

On top of this, they even start using custom probe-based single-cell prep kits to selectively read out particularly high-signal genes from which they can best impute the others. At first others were skeptical, thinking that such a lossy imputation wouldn’t suffice, but the startup realized that state prediction only mattered insofar as it was instrumental to control, and therefore you didn’t need perfect state information if your main goal was to reverse-engineer the control structure of the cell—in other words, focusing on the knobs and levers of cellular action rather than all the precise mechanics that follow.
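
One way to picture the panel-design problem (again a toy sketch, with an off-the-shelf ridge regression from scikit-learn standing in for the virtual cell’s imputation) is greedy forward selection: repeatedly add the gene whose inclusion most improves imputation of all the others:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy expression matrix: 500 cells x 200 genes with low-rank correlated structure,
# standing in for the reference atlas the panel would actually be designed against.
latent = rng.normal(size=(500, 10))
loadings = rng.normal(size=(10, 200))
X = latent @ loadings + 0.1 * rng.normal(size=(500, 200))

def panel_r2(panel, X):
    """How well does a ridge model predict all genes from just the panel genes?"""
    model = Ridge(alpha=1.0).fit(X[:, panel], X)
    resid = X - model.predict(X[:, panel])
    return 1 - resid.var() / X.var()

# Greedy forward selection: add the gene that most improves imputation of the rest.
panel = []
for _ in range(10):
    best_gene = max((g for g in range(X.shape[1]) if g not in panel),
                    key=lambda g: panel_r2(panel + [g], X))
    panel.append(best_gene)
    print(f"panel size {len(panel):2d}: R^2 = {panel_r2(panel, X):.3f}")
```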

They run into unexpected positive feedback loops. For instance, a customer has a particular primary tissue cell type they want modeling results for, but these cells are quite scarce. The startup realizes their virtual cell model actually understands the dynamics of the differentiation trajectory leading to this cell state quite well, and they can use virtual experimental roll-outs in a loop with wet lab experiments on iPSCs to quickly navigate the space of transcription factor cocktails, such that they engineer a more efficient directed differentiation recipe in a matter of weeks, not months or years. Consequently, this allows them to generate much more data on this cell type, which previously had poor coverage in public single-cell atlases, yielding improved understanding of it, which in turn allows them to improve their differentiation recipe further, increasing data velocity. Thus, their initial virtual cell allows them to better build a rare type of biocomputer, from which they get more data that further improves their recipe to grow this biocomputer.
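
A single round of that model-in-the-loop search might look something like the sketch below, where `virtual_cell_score` and `wet_lab_efficiency` are invented placeholders for the simulator and the actual iPSC experiment:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

TFS = ["GATA1", "SPI1", "SOX2", "KLF4", "MYC", "FOXA2", "NEUROD1", "PAX6"]

def virtual_cell_score(cocktail):
    """Stand-in for the simulator: predicted similarity of the end state to the target cell type."""
    return float(rng.random())          # placeholder scoring, not a real model

def wet_lab_efficiency(cocktail):
    """Stand-in for the real experiment: measured differentiation efficiency for a cocktail."""
    return float(rng.random())

# One round of the loop: enumerate 3-TF cocktails, rank them in silico, test the top 5 in the lab.
candidates = list(itertools.combinations(TFS, 3))
ranked = sorted(candidates, key=virtual_cell_score, reverse=True)
top_candidates = ranked[:5]

results = {c: wet_lab_efficiency(c) for c in top_candidates}
best = max(results, key=results.get)
print(f"best cocktail this round: {best} (efficiency {results[best]:.2f})")
# The measured results would then be folded back into the simulator's training data,
# and the next round of candidates proposed against the updated model.
```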

The startup wonders how they can take the “control, not understanding” line of thinking to its limit to save on sequencing costs and extend their virtual cell model to truly video-like modeling (up until this point relying mostly on synthetic data derived from single-cell snapshots). They reason that imaging is a non-destructive readout from which they can perhaps impute enough -omics signal, at least enough to get a handle on a cell’s dynamics and find the optimal next experiment to run. So they go about creating a massive paired dataset: they get an array of microscopes to record video of cells under different perturbations, using a few fluorescent channels to track plausibly important genes (per their earlier compressed-sensing approach), and then at the end of the experiment they fix the cells and do destructive RNA sequencing to read out their terminal state. With this paired video-omics data in hand, they go about fine-tuning their previously reads-only model to also take in image patches and do extended temporal modeling, imputing -omics from video along the way. This results in immediate improvements in the model’s long-range coherence, which in turn shortens their cycle times for finding new directed differentiation recipes.

They discover even more cost and time savings when they realize that by continuously collecting video of ongoing experiments and running transcriptomic imputation against it in real time, they can not only stop and discard experiments that have gone off track or down known uninteresting paths (thereby saving on sequencing costs), but can also run virtual experimental roll-outs in parallel with the real experiments, forecasting how they should perturb the experiment on the fly, a kind of in-the-loop continuous RL—increase the nutrient density in this well, shake this flask a bit more, add this small molecule to that other well.
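
In sketch form, the monitoring side of that loop is just: impute a transcriptomic embedding from each incoming frame, compare it against the target trajectory and known dead ends, and stop early when warranted. The thresholds and the linear “imputer” below are invented placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins: a linear "imputer" from per-frame image features to a transcriptomic embedding,
# a target end-state embedding, and a few known uninteresting attractor states.
W = rng.normal(size=(64, 50)) / np.sqrt(64)
target_state = rng.normal(size=50)
dead_ends = [rng.normal(size=50) for _ in range(3)]

def impute_state(frame_features):
    """Map one video frame's features to an imputed transcriptomic embedding (placeholder model)."""
    return frame_features @ W

def decide(imputed):
    """Continue, or stop early, based on distances in the imputed embedding space."""
    if min(np.linalg.norm(imputed - d) for d in dead_ends) < 2.0:
        return "discard: drifted into a known uninteresting state"
    if np.linalg.norm(imputed - target_state) > 25.0:
        return "discard: off track relative to the target trajectory"
    return "continue"

# Monitoring loop over incoming frames (random features stand in for the real video stream).
for t in range(10):
    frame_features = rng.normal(size=64)
    verdict = decide(impute_state(frame_features))
    print(f"frame {t:2d}: {verdict}")
    if verdict != "continue":
        break   # stop imaging/sequencing this well and reallocate the budget elsewhere
```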

But when they train SAEs on this multi-modality image-omics model, they happen upon something striking: not only does their virtual cell appear to have features related to viral infection—of course it is already well known that most sequencing runs are contaminated with large amounts of biological material from microbes, and viral and bacterial features had already been found inside their Readformer model—but through multi-modality they see that some of these features also have salient signatures in image space.

The feature that sticks out like a sore thumb is the one that fires on adeno-associated viral sequences. In read space, it fires on the reads you’d expect. But when they look at the image patches on which this feature fires, they are flabbergasted.

They played a particular experimental video where this feature was shown to activate strongly. At first, the feature fired near a segment of the cellular membrane, where it strongly co-activated with the cell surface glycan and endocytosis family of features. It then fired with lower intensity in a more distributed fashion in the cytoplasm. Many frames later a different feature belonging to the same family of AAV-associated features also flared up, but this time near the Golgi apparatus. The feature then appeared to activate in the perinuclear space for a few frames before they saw extremely strong activation of it on the nuclear membrane, with high magnitude co-activation of the “facilitated transport via nuclear pore complex” feature. And only a frame or two later they saw yet another related AAV feature firing on patches inside the nucleus.

It seemed their model, trained only on a combination of low-information fluorescent videos and static snapshots of RNA-seq and ATAC-seq from single cells, had somehow learned a spatial map of the process of gene therapy vector transduction and nuclear payload delivery.

They wondered, if they looked even deeper into feature space, might they discover features controlling off-target effects when integrating the genetic payload? They analyzed data from one of their partners and indeed found a few features in AAV embedding space and genetic payload sequence space that appeared to control the rate of off-target editing effects when experimentally tested in the lab.

This raised the question: what other intracellular processes might their virtual cell have an internal spatial model of purely from training on single-cell read data? And what inter-cellular processes might their model pick up on if they trained on real spatial data?

They set about building lab automation to collect tissue slices and generate spatial data. Luckily, lab robotics had advanced tremendously in the mid-2020s, and the cost of spatial transcriptomics had finally started to fall by a couple of orders of magnitude as biotech became the hot technology sector du jour, so they were able to collect quite a lot of it. To begin, though, they simply tried to model co-culture in a dish: for instance, the interactions between tumor and immune cells, or between neurons and glial cells. They then increased the complexity a bit by looking at small multi-layer organoids meant to model a semi-permeable membrane inside the body, like the blood-brain barrier or the multiple layers of cells that form our skin. The results were preliminary, but it appeared their spatial model was beginning to develop a “cellular social science”, or at least a vocabulary of concepts describing recurring social motifs between neighboring cells, like a friendly jab, or passing along a message followed by a question in response.

Or at least these were the concepts they were able to grasp, since interpreting these models was becoming more difficult.

With a single cell, they had decades of cellular biology knowledge built up that gave them a vocabulary to describe the features they found inside their model. When it came to modeling systems of multiple cells, however, they were largely in terra incognita: though the developmental biology field had given them useful concepts like morphogen gradients and leader-follower behavior in cell migration, it was conceptually atrophied compared to cellular biology, yielding mostly macroscopic laws about how groups of cells behave, not microscopic laws about which cells in particular to intervene on, or a way of talking about how their influence propagates through the cellular network.

Computational issues compounded the problem. Whereas for a single virtual cell it was feasible to run fairly comprehensive screens to reverse-engineer its function, once you tried doing this for even a handful of virtual cells talking to each other, the combinatorics exploded. Their spatial model used coarse-graining to improve computational efficiency (e.g. rather than modeling each cell at full read-level resolution, it dynamically modeled each cell’s state as a series of coarsened representations in a cell-level latent space), which helped somewhat, but this did not change the fact that the hypothesis space of possible intervention points they had to search was massive. Thus, they were increasingly without good maps when trying to navigate experimental space for multicellular systems.

Much of their experimental selection up until then had been done by humans augmented with their interpretability copilot, primarily because customers wanted to be told a good story about how a discovery happened, and because they needed to monitor the models for obvious hallucinations (even though these were, by then, rarely an issue). But internally they had been playing around with training simple RL policies for experimental selection, and on their internal experimental control benchmarks (e.g. how many experiments it takes the policy to figure out the control structure such that it finds the intervention $a$ that pushes the cell from state $s_0$ to a target state $s$), their simple RL policies were starting to perform better than most humans. (Their human baselines came from a competition they ran for publicity, where PhD students interacted with cloud labs to run experiments via an API and were given the suite of virtual cell and mechanistic interpretability tools, with the task of analyzing the experimental data to choose the next experiment to run, after which they’d receive the results of their chosen experiment from the cloud lab, and so on.)

They then wondered: if our human understanding of the virtual multicellular system is currently the bottleneck to experimental selection and control of it, might our simulators now be good enough to train much better experimental selection agents inside them? Then perhaps we can make these agents interpretable and reverse-engineer the thought process by which they pick experiments.

To start, they focused just on agents for single-cell control tasks. Since the price of FLOPs now basically equaled the price of electricity, and since beneficial deregulation had pushed electricity as low as 5 cents per kWh, training the agent on a massive number of virtual experiments was feasible. This required a unique compute topology, since you needed to run inference on the frozen virtual cell simulator (which at this point required petaflops to run) and feed its outputs to the experimental agent model for training. Of course you could get the agent off the ground with offline learning on all the experimental trajectories you’d already collected in the lab, but learning from suboptimal demonstrations can only get you so far, so eventually they decided to start from scratch and train fully online.

The agent, which at this point was just a simple actor-critic model, would first collect experimental wet-lab data by querying the lab API, which at first amounted to choosing random perturbations to run; these data would be used to update the virtual cell simulator it trained in; and then training inside the simulator would begin: the agent would be given a starting virtual cell state, it would choose an action, and this action would be applied to the virtual cell, which would output the predicted next cell state along with other internal information about the cell, like its sparse feature activations. The agent would receive this information and then choose another action, repeating the loop until a termination condition was met or it exhausted the number of rollout steps per batch, after which it would get a reward depending on how close the virtual cell’s end state was to the target end state. This reward would be used to update the agent’s model, and then another training step in the virtual cell world would commence. After billions of these training steps, the agent would come back to the real world and be evaluated in the wet lab, choosing experiments to run and seeing if it got any closer to pushing the cell toward the desired state. With this feedback from the real world, the process of training inside the virtual cell simulator would then begin again.
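
For concreteness, here is a heavily simplified sketch of that inner training loop, with a frozen toy function standing in for the virtual cell simulator and a tiny actor-critic network. All sizes, rewards, and names are invented, and a real system would use proper advantage estimation rather than crediting a single terminal reward to every step:

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, HORIZON = 64, 32, 10   # invented sizes: state embedding, perturbation library, rollout length

class ActorCritic(nn.Module):
    """Tiny policy + value network over virtual-cell state embeddings."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.Tanh())
        self.policy = nn.Linear(128, N_ACTIONS)   # which perturbation to try next
        self.value = nn.Linear(128, 1)            # expected closeness to the target state

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Categorical(logits=self.policy(h)), self.value(h).squeeze(-1)

# Frozen stand-in for the virtual cell simulator: each action nudges the state embedding a bit.
action_effects = torch.randn(N_ACTIONS, STATE_DIM) * 0.1
def virtual_cell_step(state, action):
    return state + action_effects[action] + 0.01 * torch.randn(STATE_DIM)

target_state = torch.randn(STATE_DIM)
agent = ActorCritic()
opt = torch.optim.Adam(agent.parameters(), lr=3e-4)

for episode in range(200):                                 # billions, in the story; 200 here
    state, log_probs, values = torch.randn(STATE_DIM), [], []
    for t in range(HORIZON):                               # rollout entirely inside the simulator
        dist, value = agent(state)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        values.append(value)
        state = virtual_cell_step(state, action.item())
    reward = -torch.norm(state - target_state)             # terminal reward: distance to target state
    advantage = reward - torch.stack(values)               # same terminal reward credited to every step
    policy_loss = (-torch.stack(log_probs) * advantage.detach()).mean()
    value_loss = advantage.pow(2).mean()
    opt.zero_grad(); (policy_loss + value_loss).backward(); opt.step()
```

The wet-lab steps bracket this loop: real perturbation data refresh the simulator before training, and the trained policy is periodically evaluated against the real cells.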

Surprisingly, this very simple, greedy model was decent at choosing experiments, but still worse than humans on their benchmark. They then tried a more sophisticated architecture for the agent: a MuZero-like model which did tree search over possible virtual experimental trajectories to select the optimal next action. The key unlock here was giving this model the full interpretability toolkit to run its own virtual experiments on the cell, just as a human might, significantly expanding its action space. Now, rather than immediately choosing the next transcription factor combination or antibody design or CRISPR knockout to test on the virtual cell, the model could instead choose to ruminate, spending intermediate cycles re-analyzing its virtual experiments by running targeted virtual interpretability experiments on them, trying to come to its own understanding of the virtual cell’s internal dynamics.

This did even better, and eventually the sophisticated agent got quite good at short-horizon tasks, needing only a few dozen experimental steps to solve their single-cell wet-lab control benchmark tasks, versus the hundreds taken by PhD students.
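
A cartoon of the “ruminate before acting” search (not MuZero itself, and certainly not the startup’s actual system) is a depth-limited lookahead: score each candidate perturbation by rolling it out a few steps in the frozen simulator, consulting a cheap interpretability probe at the leaves, and only then commit to a wet-lab action. `simulate` and `probe_features` below are invented stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, STATE_DIM = 16, 32
action_effects = rng.normal(size=(N_ACTIONS, STATE_DIM)) * 0.1
target = rng.normal(size=STATE_DIM)

def simulate(state, action):
    """Frozen virtual cell stand-in: apply a perturbation embedding plus noise."""
    return state + action_effects[action] + 0.01 * rng.normal(size=STATE_DIM)

def probe_features(state):
    """Hypothetical interpretability probe: a cheap summary the agent can 'think' about."""
    return float(np.linalg.norm(state - target))          # here: just distance to the target state

def lookahead_value(state, depth=2, n_rollouts=4):
    """Score a state by short random rollouts in the simulator (a simplified stand-in for tree search)."""
    if depth == 0:
        return -probe_features(state)
    scores = []
    for _ in range(n_rollouts):
        a = rng.integers(N_ACTIONS)
        scores.append(lookahead_value(simulate(state, a), depth - 1, n_rollouts))
    return max(scores)

def choose_next_wet_lab_action(state):
    """Ruminate: evaluate every candidate perturbation in silico before acting in the real lab."""
    return max(range(N_ACTIONS), key=lambda a: lookahead_value(simulate(state, a)))

state = rng.normal(size=STATE_DIM)
print("next action to run in the wet lab:", choose_next_wet_lab_action(state))
```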

Next, they tried fine-tuning their experimental agent inside this new virtual environment on more difficult developmental biology control tasks. Running these virtual simulations in the loop required yet more compute, with their spatial simulators using an even more bespoke compute topology to fully exploit the locality and hierarchy of the simulator’s architecture. On the first wet lab evaluations the models did quite poorly, spastically moving the probes and turning the temperature and flow valves up and down with no apparent rhyme or reason. But after a few hundred more cycles of training in simulation and collecting real lab data, the model started to get quite good, solving developmental biology control tasks, like growing properly vascularized tissue, that still eluded humans.

Seeing this, the startup decided to go all in on developmental biology, realizing that they could not only solve a major preclinical bottleneck by building better model systems—for which there was insatiable demand from pharma, as evidenced by the revenue the startup’s early spatial simulation platform was pulling in, which had been unlocked by the recent FDA Modernization Act 4.0—but also eventually grow better, more complex cell therapies, and perhaps one day even grow full organs in the lab.

But this required relinquishing much control to the experimental selection agents, which were simply able to pick better experiments and therefore control these multicellular systems faster than humans were.

They then gave the experimental agents longer time horizons, and the results were even better, with the agents showing scientific foresight and planning far beyond anything human researchers displayed. At least, that’s what the results suggested: despite attempts to do interpretability on the agents, the humans had a hard time understanding their decision-making. There were concerns about these agents going rogue, since AI agents hooked up to automated lab systems seemed like a rather dangerous combination, and the startup made further efforts to make the agents more interpretable, but largely to no avail.

There were calls from regulators to rein things in, but these concerns largely fell by the wayside as everyone became swept up in the biotech mania of the late 2020s.

And the startup couldn’t dispute the results: the pre-clinical model systems their agent developed were increasingly physiologically realistic and being bought in droves by pharma. The agent had even figured out how to more efficiently train multicellular simulators on these data, further accelerating things. There was no telling what it might figure out next.

Perhaps mechanistic interpretability had always been a stopgap, and it was only a matter of time before the sheer complexity of agentic systems outstripped our comprehension.
