Unlocking Cells' Secrets: Diffusion, Deconvolution, & Discovery with Siyu He of Squidiff & CORAL


In this episode of the Cognitive Revolution, we hear from Siyu He, a postdoc at Stanford specializing in biomedical data science. Siyu discusses the implications and methods behind their recent AI-driven biological research papers, Squidiff and CORAL. The conversation explores the use of AI models to analyze complex cellular systems and disease mechanisms, focusing on transcriptome and tissue sample analyses. Squidiff aims to simulate cellular transcriptomes to predict outcomes of various conditions, significantly expediting traditionally lengthy and expensive biological experiments. The CORAL project extends this by integrating different levels of biological data, enabling a more comprehensive understanding of tissue structures and cellular interactions. The discussion also delves into the challenges of using synthetic data for validating AI models and the potential acceleration of scientific discoveries through AI in biomedical research. The episode encapsulates the interplay between AI and biology, highlighting the future possibilities and current limitations of this innovative research front.

Check out the papers here:
Squidiff: https://www.biorxiv.org/conten...
CORAL: https://www.biorxiv.org/conten...

SPONSORS:
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive

Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive

NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive


PRODUCED BY:
https://aipodcast.ing

CHAPTERS:
(00:00) About the Episode
(03:37) Introduction and Guest Welcome
(04:00) Setting the Big Picture Context
(04:37) Exploring the Squidiff and CORAL Papers
(08:31) Understanding Transcriptomes
(11:17) Single Cell RNA Sequencing Technology
(15:32) Motivation Behind Squidiff (Part 1)
(17:14) Sponsors: Oracle Cloud Infrastructure (OCI) | Shopify
(19:41) Motivation Behind Squidiff (Part 2)
(25:56) Training Data and Model Architecture (Part 1)
(31:38) Sponsors: NetSuite
(33:11) Training Data and Model Architecture (Part 2)
(37:18) Diffusion Models in Biology
(46:07) In Silico Experiments and Applications
(54:25) Clarifying the Validation Process
(55:36) Validation Strategies and Real Data
(58:26) Challenges in Modeling and Predictions
(01:02:14) Accelerating Research with AI Models
(01:07:31) Future Directions and Collaboration
(01:10:46) Introduction to CORAL Paper
(01:13:09) Spatial Transcriptomics and Proteomics
(01:17:10) Challenges in Integrating Spatial Data
(01:31:53) Synthetic Data and Model Validation
(01:36:42) The Future of AI in Healthcare
(01:43:31) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolutio...
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathan...
Youtube: https://youtube.com/@Cognitive...
Apple: https://podcasts.apple.com/de/...
Spotify: https://open.spotify.com/show/...


TRANSCRIPT

Nathan Labenz: Siyu He, postdoc at Stanford in biomedical data science, and lead author of the recent AI for Biology papers, Squidiff and CORAL. Welcome to The Cognitive Revolution.

Siyu He: Thank you, Nathan. It's great to be here and very excited to share my research with a broader audience.

Nathan Labenz: Cool. Well, we've got a lot to unpack. So, let's maybe just start with a little bit of big picture context setting. I think the audience, and I can also just speak for myself, I've been obsessed with AI for the last few years, like studying it intensively. I feel like I have a pretty good general understanding of the landscape. On the biology side, the landscape is even bigger and more complicated, and I haven't spent nearly as much time in it. So, I always like to start off by trying to get a sense for where you think we are in the big picture, and where you think this work is in the big picture. And I guess, maybe, just to tee you up a little bit more by saying the Squidiff paper is about the transcriptome state of the cell, operating at the level of a cell, which is quite interesting. And then the CORAL paper is even zooming out to a little bit bigger scope of analysis than that and looking at samples of tissue. And I have kind of had this general sense that the grand challenge in biology is maybe a couple different things. One big one is just figuring out the super intricate web of what causes what, and what upregulates what, what downregulates what. How does an intervention ultimately lead to a change in outcome? And unfortunately for all of us, that's not only super complicated, but also seems to work at multiple levels of scale. So, we have the ability at this point to do a pretty good job of saying, "Well, here's a sequence. What does that translate to in terms of the shape of a protein?" And we're even starting to get decent at how these proteins potentially fit together. But it still seems like we've got a long way to go when it comes to how does all that stuff add up to the function and evolution over time of a cell. And then even probably more work to go when it comes to understanding how that aggregates up to the tissue, the system, the whole organism level.
So, I've been watching this space where you're operating because it seems like that's just a really critical frontier, and you've taken a bite out of each of those problems with these two papers. But with that kind of preface, tell me more about how you see the big picture landscape, how you understand the grand challenge, and how you have decided to orient yourself with this work toward solving them.

Siyu He: Yeah. I think the major motivation behind these two projects is whether we can apply AI models to the current technologies of biology. The goal is to understand cellular systems and disease mechanisms, and further, to suggest treatment strategies for disease, and to figure out how to really use AI models across the different sorts of datasets each technology produces. For example, single-cell RNA transcriptomics, which I mostly address in the first paper, and then spatial transcriptomics, which I mostly address in the second paper. So these are two different studies, but they work toward very similar goals: can we use AI models to accelerate our understanding of cellular activity, figure out the disease-related changes, and then predict responses to drug treatments?

Nathan Labenz: Yeah. Perfect. Okay. Well, let's get into the first one then. The first one that caught my eye was Squidiff. Am I saying that right? Squidiff? I'm actually interested in where the name came from, just as a bit of trivia before getting into the science itself.

Siyu He: Yes. Basically, my first paper was a model we called Starfish, and I was thinking that if I make more models, I want to keep naming them after sea animals. So we have squid, and it's a diffusion model, so that's why we call it Squidiff. But you can also read it as simulation- and stimuli-related quantitative inference of transcriptomics using diffusion models. So the name actually carries the whole idea of what the model is.

Nathan Labenz: Sure.

Siyu He: And CORAL is the second one. It's also a name from the sea.

Nathan Labenz: Gotcha. You got a theme. You've got a whole cinematic universe in the making of various models.

Siyu He: Yeah. Hopefully I'll get more.

Nathan Labenz: Tell me about transcriptomes in terms of what we need to know that's not obvious. What I know coming into this is the transcriptome is basically a measure of what genes in a cell are being actively transcribed at any given time. So, my loose understanding is, and it's amazing we can do this now at the level of a single cell. The technical achievements of the resolution of some of these techniques are really super amazing. So, we can take literally a single cell, I assume, but correct me if I'm wrong, that this is a totally destructive process. So, this is the end of this cell. We can then pull out the RNA that has been transcribed from the DNA at any given time, and we can just look at how much of these different RNA segments are there. And that basically tells us the sort of state of the cell at that moment in time. Like, what are the active processes that are happening in that cell? But that's a rookie explanation, so tell me what I'm missing from that, or what more I need to know.

Siyu He: Yeah. I would suggest a good way to start this story is to think of the central dogma of biology. The cell is the smallest functional unit of living systems, and the key information in our cells is carried on the chromosomes, which contain the DNA. In something like the human body, all the cells, or almost all the cells, share the same DNA. What makes them different is the transcription process: DNA is transcribed into RNA, RNA is further translated into protein, and proteins fold into specific structures, which is what AlphaFold tries to predict, and then carry out various functions. The process related to transcription is what we also call gene expression, and it accounts for a lot of the variation among cells in the human body. One of my postdoc advisors, Stephen Quake, has said that the cell is a bag of RNA, which indicates that the complexity of cells can be distinguished through their gene expression. Protein expression can distinguish that complexity too, but with current technology it's still more practical at the transcription level, because protein measurements are much noisier. Even as protein technologies become more and more advanced, people mostly start with the transcriptome first, which helps us understand cellular activity and disease mechanisms. The second point I want to mention is the development of a new technology, single-cell RNA sequencing. I won't go into detail about what the whole process looks like, but the major idea is that we dissociate the tissue into individual cells and then measure the molecular expression, the gene expression, via the RNA in each cell. These assays can profile around 30,000 to 60,000 genes, which is very high dimensional, and it's also a high-throughput technology, which makes AI models a very good way to address this kind of data.
We usually end up with a matrix, with rows and columns corresponding to cells and genes, that captures a great deal about each individual cell. So it's really a phenomenal technology that opened up quantitative biology.
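To make the matrix structure Siyu describes concrete, here is a minimal sketch in Python. The data are synthetic and the cell and gene names are placeholders; real single-cell datasets profile tens of thousands of genes per cell, with mostly-zero counts.

```python
import numpy as np

# Toy single-cell expression matrix: rows are cells, columns are genes.
# Real datasets profile ~30,000-60,000 genes per cell; counts are sparse
# (mostly zeros). IDs below are hypothetical placeholders.
rng = np.random.default_rng(0)
n_cells, n_genes = 5, 8
counts = rng.poisson(lam=0.5, size=(n_cells, n_genes))  # sparse-ish counts

cell_ids = [f"cell_{i}" for i in range(n_cells)]
gene_ids = [f"gene_{j}" for j in range(n_genes)]

# Each row is one cell's transcriptome: a vector of per-gene expression.
print(counts.shape)  # (5, 8)
print(counts[0])     # expression vector for cell_0
```

In practice each row would also carry metadata (tissue, disease stage, annotation), which is the conditioning information discussed later in the conversation.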

Nathan Labenz: So 30 to 60,000 different genes can be measured in just a single cell. I mean, that is an amazing level of resolution. And I think it's always helpful just to consider the dimensionality of the inputs. This is true in the AI architectures that we're gonna get into. So just to be very clear, the starting point for this whole process, the transcriptome, is 30 to 60,000 numbers that represent the intensity of the expression of the individual genes. And that's measured through the RNA that's pulled out of the cell when it's processed in this way. Is there any reason to think that the transcriptome is a complete picture, or are there things that are known to be missing from the transcriptome? Is there a way in which the transcriptome doesn't tell you the whole story of what's going on with a cell?

Siyu He: Of course. Although transcriptomes show the heterogeneity of cells, there's more information out there. For example, we can look at the epigenomics of cells, the chromatin-level changes that differ across individual cells. And there are also proteins, as I said, which are distinct from gene expression, though sometimes they share similarity. Beyond that, there are also mitochondria-related or ribosomal-related processes. Cells are very complicated; that's why people haven't fully figured out what's going on. The transcriptome is one of the few data types we can get at a whole-genome level, so it's a very good starting point for understanding where a cell is going. But I would say there is other information that could also be considered in the model design, and in the overall study design as well.

Nathan Labenz: Cool. Okay. So now getting into the Squidiff architecture, let's first maybe just describe what it is that we're trying to do here. One thing that will become clear as we go through this is both of these papers are pretty carefully designed architectures, actually more complicated than what we see even from frontier language models today, which is an interesting fact. I want to get into as well why the bitter lesson may not have fully come for this sort of work yet. But let's just start off with what it is that you hope to create a model to be able to do with Squidiff, and then we'll unpack the strategy and the parts of it.

Siyu He: Okay. Yeah. Sounds good. The motivation for this project is that I'm usually the lab person trying to get single-cell sequencing through the whole experimental process, which is very painful, because you need to wait for the cells to grow, and then the whole preparation usually means waiting for months. It's also very expensive to do the experiments. So I was thinking: can I use generative AI models to create a virtual or digital version of the transcriptome? That was the first motivation. Then I realized there was further potential: maybe we can manipulate the major conditions. Because we have a lot of conditional generative AI models, what if we just manipulate those conditions? We generally know what they are, and there might be a way to try it. That's why we designed Squidiff. So it's not just creating transcriptomics, it also addresses some interesting questions in biology: whether a perturbation, a chemical perturbation or a gene perturbation, makes the cells different, and what they would then look like. In most cases, experiments can't give us that immediately, or at very large scale. So that's how it can help the whole field.

Nathan Labenz: So you said it takes potentially months to culture cell type and condition of interest, and then you said it's expensive to do the sequencing. Can you cash that out? Is there an established dollar amount, or is it still a relatively bespoke process where it's just your time is the expense of doing those single cell sequencing processes?

Siyu He: Yeah. I think it depends on the exact tissues or samples you're dealing with. I was usually working on organoids, which are engineered tissues derived from iPSCs, induced pluripotent stem cells. We were making engineered brain tissues or vessel tissues and doing disease modeling. The whole process usually takes months; I know some brain organoids need a year or even more to create. And there will inevitably be some mistakes during the experiments, and you'll need to start over again. That's why it's very painful, and it's the reason I really wanted models that could take over the task. But there's also another consideration: in some cases, the experiment simply cannot be performed at all. So that's why I think such a model would be very useful.

Nathan Labenz: Okay. So let's do just a little bit more on the actual runtime, how this thing gets used.

Nathan Labenz: You might have a particular cell type and you might have a question of interest like, "What happens if I apply this drug or I apply some radiation?" Or up-regulate or down-regulate a particular gene. Something like that. And you basically wanna know what happens, and the form of the answer to that question is ultimately a transcriptome. What the system ultimately outputs is a transcriptome that's not measured but that is generated or predicted by the AI model. Correct anything that I'm missing there. But then what does that do for you? Like, when you now have a transcriptome put out from the model, what do you do with that?

Siyu He: That's a good question. Usually, once we have the transcriptomes from the data, we use standard single-cell RNA sequencing analysis to understand what's going on. The first step is usually to annotate the cells. People apply dimension reduction, and manifold methods to see how the cells are laid out, and maybe clustering as well, to classify the cells into different cell types. And because we also have the gene expression in individual cells, we know which specific genes they express, so we know the related signaling pathways, and from there we understand what's going on. That's usually how people make sense of the whole dataset, though there are multiple approaches in the field for understanding it quantitatively.
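The reduce-then-cluster workflow Siyu outlines can be sketched generically. This is not the actual analysis pipeline from the paper, just a minimal illustration on synthetic data using PCA and k-means (real single-cell pipelines typically add normalization, log-transform, and graph-based clustering):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic data: two fake "cell types", each with its own set of
# highly expressed genes, mimicking co-expression within a cluster.
rng = np.random.default_rng(1)
group_a = rng.normal(size=(50, 200))
group_a[:, :20] += 5.0           # group A over-expresses genes 0-19
group_b = rng.normal(size=(50, 200))
group_b[:, 20:40] += 5.0         # group B over-expresses genes 20-39
X = np.vstack([group_a, group_b])  # 100 cells x 200 genes

# Step 1: dimension reduction on the cell-by-gene matrix.
X_pca = PCA(n_components=10).fit_transform(X)

# Step 2: cluster cells in the reduced space; each cluster would then
# be annotated as a cell type via its marker / highly variable genes.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_pca)
```

Cells in the same cluster should mostly come from the same synthetic group, which is the property annotation relies on.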

Nathan Labenz: And how well do we understand that sort of thing? If you have a transcriptome regardless of its source, it could have been measured in an actual lab experiment, or it could have been generated now by a model, with what level of granularity or specificity or confidence can we say what's actually going on in that cell? We go, okay, sure, you've got a number now for 30 to 60,000 genes, but could you say, "Oh, that's definitely cancer," or, "The cancer is not cancer anymore," or, "It's healthy," or, "It's not healthy," or far more. I mean, obviously you can do a lot more specific things because you can get down to the gene by gene level analysis, but what is our ability to aggregate that up into the things that really matter using all those techniques that you just listed?

Siyu He: Usually, if the cells are similar, if they are in a similar state, they will express similar groups of genes, what we call co-expression. In that case, we can classify or group these cells into clusters, and each cluster will usually correspond to a cell type. From there, we can identify the highly variable genes in each group. Another approach people use is marker genes to really distinguish what's going on. Marker genes are genes the biological community has identified as major hallmarks of particular cell types. So, for example, there are cancer-related markers; if those are identified, people will consider that there is probably a cancerous state, similar to how pathologists use markers, like proteins and antibodies, to figure out what's going on in a tissue.

Nathan Labenz: Okay. Cool. So, there's clustering, there's these kind of indicator genes that are reasonably well understood, and now I think we can get into how it is that you're going about training a model to come up with these things. So, I guess maybe for starters there, what does the dataset look like in terms of where does it come from, how much of this stuff is out there in the community for you to just grab that other people have published? How much, if any, do you have to actually gather yourself? And just literally in terms of megabytes, gigabytes, whatever, how big is the dataset that you were working with in this project?

Siyu He: Yeah, that's a good question. It really depends on the project. As I said, the data come in a matrix format, cells by genes. We have 30,000 to 60,000 genes, and then thousands, or even millions or billions, of cells, depending on the project. In my models, I use small, specific datasets rather than everything together, partly to manage the complexity of the models, but I'm also thinking about feeding larger-scale datasets into the models and making them more foundation-model-like. In terms of data resources, the community shares: people publish their datasets to public repositories, and we use that publicly available data. At the same time, we collaborate with wet labs, and they help us create novel datasets, or provide validation for the model. So it's very flexible, and I would say the data source is one of the major challenges in the field right now, because we want high-quality data. Most datasets are good, but sometimes they're missing information, which makes the models hard to develop and keep consistent.

Nathan Labenz: So, and by the way, is it annotated at all? I understand when you describe the grid of cells and transcriptomes, what I'm understanding there is basically one vector for every cell that indicates the strength of activity for each corresponding gene in the long list of 30 to 60,000. Is there additional metadata around, like, what kind of tissue it was or the patient's condition or anything along those lines that you also tap into here, or is it literally just pure raw information?

Siyu He: Yeah, we do have metadata. It covers the cell types, which are annotated maybe by myself or by the other authors, or are already annotated publicly. And it includes meta-information, for example, what kind of tissue the cells belong to and what disease stage they're at. So there are multiple ways these datasets are already labeled, and that's one of the requirements of the model.

Nathan Labenz: Gotcha. Okay. So, did you say, sorry if I missed it, but is it a sensible question to ask how many individual cell transcriptomes constitute the training data for this project?

Siyu He: Yeah. Right now, the model can be trained on datasets of almost any size. I'm not providing a pretrained model; people are expected to provide their own training datasets, so it really depends on the dataset you have. I did test how long training takes: with around 5,000 cells in this kind of dataset, it requires only about 15 minutes to train.

Nathan Labenz: Wow. So, you can get this working with as few as 5,000 different cells?

Siyu He: Yeah, though I think too few cells can make the model underfit, or maybe overfit too.

Nathan Labenz: Yeah. Okay, wow. I mean, that's a small number. So, do I understand correctly, that's kind of where it starts to work, and in practice you used more, and in fullness of time or maybe future work, you could use a lot more still?

Siyu He: Of course. I think the future direction of this project is to build a bigger model that can cover all kinds of data. For example, we worked on organoids as case studies in the paper, but we're thinking of bringing all kinds of organoids together, so we'd have a foundation model for organoids, and we could also make a foundation model for real human tissues. It really depends on the design of the exact datasets we want to use. But the model is very capable of taking in any kind of data, I would say, and flexible enough to address researchers' needs.

Nathan Labenz: Okay, so then let's talk about the architecture a little bit. I guess one super brief summary of the structure of the architecture is there's basically two core parts to it. There's the VAE, the variational autoencoder, and then there's the diffusion model portion. The question occurred to me, before even getting into how those work, 'cause certainly folks in our audience will be even more familiar with the transformer, as probably the whole world is starting to become quite familiar with the transformer. Why not use a transformer for this? If somebody just came to me out of nowhere and said, "Let's say you want to predict the next transcriptome state for a cell, given a current state and a perturbation or whatever," that would be my first instinct, would be like, "Yeah, we could probably do it with a transformer. Transformer seems to work for everything." So why did you end up going a different way with this project?

Siyu He: Well, the first thing I want to do is apologize: there is a mistake in the first manuscript. It's not actually a VAE; it's a semantic encoder that connects to the conditional diffusion model. So we have two major parts: one is the DDIM model, and the other is the semantic encoder. The generative process takes z_sem, the semantic variable from the encoder, together with Gaussian noise, x_T, and uses the denoising process of the diffusion model to produce a final transcriptome.
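The generative process Siyu describes, start from Gaussian noise x_T and denoise under the guidance of a semantic code, can be sketched numerically with the deterministic DDIM update. The noise predictor below is a hypothetical stand-in (in the real model it is a learned network conditioned on the timestep and z_sem); here it is an oracle, chosen so the math of the sampling loop can be checked end to end:

```python
import numpy as np

# Deterministic DDIM sampling (eta = 0) for a 1-D expression vector.
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def eps_model(x_t, t, z_sem):
    # Stand-in for the learned noise predictor: pretend the clean signal
    # is z_sem itself, so the implied noise is whatever must be removed
    # from x_t to reach it. A real model learns this from data.
    return (x_t - np.sqrt(alphas_bar[t]) * z_sem) / np.sqrt(1.0 - alphas_bar[t])

def ddim_sample(z_sem, n_genes, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_genes)                 # x_T: pure Gaussian noise
    for t in range(T - 1, 0, -1):
        eps = eps_model(x, t, z_sem)
        # Predict x_0 from (x_t, eps), then step deterministically to t-1.
        x0_pred = (x - np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas_bar[t])
        x = np.sqrt(alphas_bar[t - 1]) * x0_pred + np.sqrt(1.0 - alphas_bar[t - 1]) * eps
    return x0_pred

z = np.full(8, 2.0)        # pretend semantic code / target expression
sample = ddim_sample(z, n_genes=8)
# With this oracle eps_model, sampling recovers the target exactly.
```

The point of the sketch is the shape of the loop, not the stand-in model: conditioning enters only through the noise predictor, which is where z_sem steers generation.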

Nathan Labenz: So I'll dig into that and unpack it a little more, but just for intuition's sake, why not a transformer architecture? Why this more complicated and less familiar architecture?

Siyu He: Transformers and diffusion models are both very popular generative AI models, and there are already a lot of applications of both in biomedicine. The difference is that the diffusion model perhaps captures more of the stochasticity of the data, because generation is more flexible in that case, and it can model more complex distributions. Transformers are powerful too, and there are a lot of similar works trying to generate related things. But the input to the model is a little different, because a transformer usually takes a sequence of data. For gene sequences in our area, people usually rank the genes first, so the input is not the actual count values but a ranking of the genes, and the output is also a ranking of the genes, which is quite different from what the original data look like. Also, the transformer-based models, a lot of the current foundation models, try to get an embedding of the data and then use those embeddings for downstream tasks like classification and prediction. So I think they are different in that respect. Another reason I wanted to use diffusion is that when I started the project, diffusion models had not really been used for generating single-cell transcriptomics. That's why I wanted to explore the capability of diffusion models, and of conditional models that can really manipulate the conditions to generate virtual cells.

Nathan Labenz: Gotcha. Okay. So on the architecture itself, I mean, you described it briefly a minute ago, but I'll try to take a shot at describing it. I'm actually not quite sure of the distinction between the VAE and the semantic encoder. You can clear that up for me in a second. But either way, and we've seen this on the feed. I've looked at lots of different architectures over time, and actually, maybe the one that this reminds me of most might be MindEye, which was a project partially out of Stability AI, where they trained a system, and it was a carefully designed system to reconstruct the images that somebody was looking at based on the fMRI scan of their brain at the time they were looking at it. And that also involved a diffusion model, and it involved basically creating a diffusion prior for the diffusion model to work from. And there's something pretty similar going on here, where the first step is to try to understand what really matters in terms of the transcriptome of a cell. In general, the VAEs, as I understand, are basically trying to pass the raw data through some narrow bottleneck such that it can then be recreated out the other end, capturing as much of the original data as possible. So typically, they're trained on a reconstruction loss, where the point is to say, how can we compress, abstract, get this data into its most semantic form, still preserving all of the information that really matters so we can get back what we put in, but then use that highly semantic form for other things? So that's part one. Anything I missed or anything you would complicate on that description?

Siyu He: There are different kinds of architectures for these models. If you think of combining a VAE with diffusion, that's like a latent diffusion model, and there are related models that try to do this for transcriptomes as well. The difference is that those models are usually trained separately: they use the VAE first to do dimension reduction and get a latent representation, making the whole generation process more efficient. What Squidiff does instead is train a unified model, so the parts are trained together. The whole process includes both learning the noise and encoding the semantic information. So I would say it has the power to extract semantic information from the data while keeping the stochasticity that diffusion models provide, ultimately producing more realistic expression data for the cells.

Nathan Labenz: Okay. That's an interesting note that this system is trained entirely end-to-end under one loss function, and the big distinction there is typically, in other contexts, the VAE is trained standalone, separately, and then can be mashed up with other things downstream as needed. Okay. So, well, that brings us then to part two, which is the diffusion model, which I think folks will be fairly familiar with from an image generation context. I always like to get procedural with these sorts of things, and so the way I do that in the image context is say, okay, let's say the goal is to train a model to, given some starting point, what would this image look like if it was a little less noisy? And then, with the conditioning, what would it look like if it was a little less noisy and a little bit more like this thing that we're trying to create? And what I think is really genius about that, of course, is it allows for mass production of training data, because you can just take all the images on the internet and gradually add noise to them and then train the model to denoise them, and then eventually you get to something that can start with pure noise and just go step, step, step, step, step all the way to finally a high resolution image. And with conditioning you can get a high resolution image of something specific that you want. So, a lot of same mechanics going on here with the big difference being that the conditioning is coming out of this semantic encoder that is saying, "This is the kind of cell that we want in the lower dimension, most semantic way possible." Now it's up to the diffusion model to map that back out into the fully detailed picture of a particular transcriptome. Again, what am I missing there? Or what more do I need to understand?
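The "gradually add noise to manufacture training data" mechanic Nathan describes is the standard forward diffusion process, which can be written in closed form. A minimal sketch on a standardized synthetic expression vector (the schedule constants are common defaults, not the paper's exact settings):

```python
import numpy as np

# Forward (noising) process: interpolate a clean vector x0 toward pure
# Gaussian noise. Training pairs (x_t, eps) are manufactured this way,
# and the denoiser is trained to recover eps from (x_t, t).
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # common default schedule
alphas_bar = np.cumprod(1.0 - betas)

def add_noise(x0, t, rng):
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(30000)           # a standardized transcriptome vector
x_early, _ = add_noise(x0, t=10, rng=rng)   # still close to the data
x_late, _ = add_noise(x0, t=999, rng=rng)   # essentially pure noise

print(np.corrcoef(x0, x_early)[0, 1] > 0.9)     # True
print(abs(np.corrcoef(x0, x_late)[0, 1]) < 0.1)  # True
```

The two correlations show why this gives a training curriculum: early steps are easy denoising problems, late steps force the model to generate structure almost from scratch, guided only by the conditioning.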

Siyu He: When I was creating the model, and thinking about the differences between this model and other image-based diffusion models, I would say the data structure is the first big difference. Most imaging data is two-dimensional, with x and y directions, but with single-cell transcriptomic data we are dealing with one-dimensional data: the object is the cell, and then its gene expression. Because of that, we can't use most of the ways diffusion models handle images, like a U-Net or other image-based networks, to learn the noise. We just use an MLP to learn the noise, but we also include residual connections that incorporate the embedding of the timestep and also the semantic features. And in terms of the semantic features, I would say they can carry implicit information. It's not just the cell types included in the data; there is also other related information, for example the disease stage and the related conditions. So it's like unifying the conditions, which we thought could be manipulated in this latent space rather than in the gene space.
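A toy version of the architecture idea here, an MLP noise predictor for a 1-D expression vector, conditioned on a timestep embedding and a semantic vector through a residual connection, might look like the following. Layer sizes, the sinusoidal embedding, and the single hidden layer are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch: a plain MLP denoiser for 1-D gene expression vectors,
# with timestep + semantic conditioning added via a residual connection.
import numpy as np

def timestep_embedding(t, dim=32):
    """Standard sinusoidal embedding of the diffusion timestep."""
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def mlp_denoiser(x_t, t, z_sem, params):
    """Predict the noise that was added to one cell's expression vector x_t."""
    W_in, W_cond, W_out = params
    h = np.maximum(0.0, W_in @ x_t)                   # genes -> hidden (ReLU)
    cond = np.concatenate([timestep_embedding(t), z_sem])
    h = h + W_cond @ cond                             # residual conditioning
    return W_out @ h                                  # hidden -> predicted noise

rng = np.random.default_rng(0)
n_genes, sem_dim, t_dim, hidden = 2000, 64, 32, 256   # illustrative sizes
params = (rng.standard_normal((hidden, n_genes)) * 0.01,
          rng.standard_normal((hidden, t_dim + sem_dim)) * 0.01,
          rng.standard_normal((n_genes, hidden)) * 0.01)
eps_hat = mlp_denoiser(rng.standard_normal(n_genes), 42,
                       rng.standard_normal(sem_dim), params)
```

The point is only that nothing about the denoiser requires image structure: a vector in, a same-sized noise estimate out, with the conditioning injected additively.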

Nathan Labenz: Is there anything special about the noising process? In images, it's pretty simple. I've seen that in contexts like protein folding there is a sort of highly specialized noising process that's required because the naive approach of just adding noise to a protein structure just takes you to something totally incoherent. So there's a more sophisticated way of doing the noising. Does that apply here too, or are you able to just apply a relatively simple noising process?

Siyu He: Yeah. I think the way we deal with the noise is actually quite simple; we didn't really have to consider special noising issues. But one difference we do consider is that most of the expression data is at the single-cell level, and it's very sparse, with a lot of zeros, so we need to take care of that. So what we do is not study all the genes: we filter out genes that probably aren't informative, and we look only at the highly variable genes. Another advantage of the diffusion model is that it's not limited to reconstructing Gaussian or other simple distributions. It can create the very complex distributions that most gene expression data has, so the model itself has the power to capture these things.
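The highly-variable-gene filtering step mentioned above can be illustrated as below. Real pipelines (e.g. scanpy's `highly_variable_genes`) use dispersion-based criteria; plain variance on log counts is used here just to show the idea, and the synthetic count matrix is invented.

```python
# Illustrative sketch: keep only the most variable genes before training.
import numpy as np

def select_hvg(counts, n_top=2000):
    """counts: cells x genes matrix; returns indices of the most variable genes."""
    log_counts = np.log1p(counts)             # tame the heavy-tailed counts
    variances = log_counts.var(axis=0)
    return np.argsort(variances)[::-1][:n_top]

rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(500, 5000)).astype(float)   # sparse, many zeros
counts[:, :100] *= rng.integers(1, 20, size=500)[:, None]   # make 100 genes variable
hvg_idx = select_hvg(counts, n_top=100)
```

On this toy matrix the selected set recovers the deliberately variable genes, which is the behavior the filtering relies on.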

Nathan Labenz: So, then in terms of how we actually run a sort of in silico experiment on this, we've got the setup of this semantic encoder that takes in a raw transcriptome and converts it into a more compact, more semantic representation that abstracts away those details and tries to get at what really matters. The key thing to understand, I found as I was reading the paper, is that to do an in silico experiment, to be able to ask the question, "What would happen to this cell type if I applied this sort of stimulus to it?", you need to be able to encode what that stimulus is in this semantic latent space. And what I understand is happening to create that sort of direction is you're basically creating a delta, doing subtraction in the semantic space between other pairs that you may have. So I think people probably have seen examples of this from language models or even image models. There's the really famous example: if you have the embeddings for man and woman, for example, and you subtract those, you can create the direction from man to woman or the direction from woman to man. Obviously those are gonna be the negative of one another in this latent space. Then you can do interesting things: if you have the embedding of king and you add the delta of man to woman, you find what, at least in many experiments, looks like the embedding for queen. So you have these delta vectors that can be synthesized from contrasting pairs and then applied to other starting points to get to some other place. So I understand that that's kind of the core mechanism. If I'm understanding correctly, as always correct me, but what I think you're doing, in a lot of cases, is saying, "Okay, I have cell type X, and I want to know what would happen if I apply stimulus Y to that. 
But it would take me months to culture that cell type and then do all these experiments, but I do have cell type Z, and I do have the ability to apply the same stimulus to cell type Z, and then I can..." And maybe some of this stuff is just also out there as published data. You don't even necessarily have to run the experiment. You can tell me about that. But if it's easier for me to do this experiment in cell type Z, then I can get that delta vector, and then I can go apply that delta vector to the cell type X, and then I can run this diffusion process, and I can see what the transcriptome would look like if I had hypothetically applied this stimulus to that cell type X without even necessarily ever having to touch the cell type X. So again, how'd I do, and what'd I miss?
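The latent-space arithmetic Nathan describes can be shown with a toy example. All vectors here are made up; the point is only the mechanics: estimate a stimulus direction from treated/untreated pairs of one cell type, then add it to another cell type's semantic code.

```python
# Toy illustration of delta-vector arithmetic in a semantic latent space.
import numpy as np

rng = np.random.default_rng(0)
true_shift = np.array([2.0, -1.0, 0.5, 0.0])     # hypothetical stimulus effect

# Pretend these are semantic-encoder outputs for cell type Z, with/without stimulus.
z_control = rng.normal(size=(50, 4))
z_treated = z_control + true_shift + rng.normal(scale=0.1, size=(50, 4))

delta = z_treated.mean(axis=0) - z_control.mean(axis=0)   # stimulus direction

# Apply the direction to an unseen cell type X; in the full model, the
# diffusion decoder would map z_predicted back to a full transcriptome.
z_cell_type_x = np.array([0.3, 0.7, -0.2, 1.1])
z_predicted = z_cell_type_x + delta
```

Averaging over pairs washes out per-cell noise, so the estimated delta closely matches the underlying shift; everything downstream rests on the assumption that this addition is meaningful in the latent space.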

Siyu He: Yeah, thank you so much for the introduction. I would say that's exactly what it is: we're using vector operations to do those manipulations with the semantic variable. But I would say this is an assumption, actually. It's not exactly the real case, because biological processes are usually complex and not really linear. But the latent variables are probably more linear than the gene expression space, so we're using a linear approximation to resemble the real case, and it can deal with certain simple conditions. For example, in my project we provide some scenarios, like cell differentiation: at day zero we have iPSCs, which are the stem cells, and then by day three we have a more mature type, mesenchymal stem cells, and we want to know what the individual days look like. In that case, if we have the semantic variables at day zero and day three, we can actually use linear interpolation to get day one and day two and create what the transcriptomes look like on those days. That's one thing we did. And the second application, I would say, is the perturbation task. In some cases we have gene perturbations, like making one gene upregulated or downregulated, and then we ask what's going on if we perturb genes A and B together. Even though I think the result will be non-linear, and the diffusion model is able to model these non-linear processes, the latent variables can still be manipulated by these vector operations, because I would suppose the latent space is smoother and more structured. That's also an approximation, I would say. So it's risky to rely directly on the results from the model, and we need to validate as well. 
But so far, we've applied it to the organoid cases, and we see very exciting results, like showing that some cells are very transient cells during development, as the iPSCs differentiate into endothelial cells, into fibroblasts, into the blood vessel structures. And then, to address this issue, we're also considering a variant of the model as our second project: instead of just a simple semantic encoder with simple semantic variables, we consider a more time-series analysis. That's related to how video is generated: like video diffusion models, we're training on a more complicated dataset that has time-series information, and we're learning precisely about the developmental process and capturing more of the non-linearity of the datasets.
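The interpolation just described reduces to a one-liner in the latent space. The vectors below are invented stand-ins; the full method would decode each interpolated code back to a transcriptome with the diffusion model.

```python
# Sketch: linear interpolation between day-0 and day-3 semantic codes
# to approximate the intermediate days of a differentiation.
import numpy as np

def interpolate_latents(z_start, z_end, n_days):
    """Linear path from z_start (day 0) to z_end (day n_days), one code per day."""
    alphas = np.linspace(0.0, 1.0, n_days + 1)
    return [(1 - a) * z_start + a * z_end for a in alphas]

z_day0 = np.array([0.0, 1.0, -2.0])   # stand-in for the iPSC semantic code
z_day3 = np.array([3.0, -2.0, 1.0])   # stand-in for the mature-cell code
path = interpolate_latents(z_day0, z_day3, n_days=3)  # codes for days 0, 1, 2, 3
```

Day one is literally one third of the way along the delta vector, which is exactly the approximation being made, and why it can fail when the true trajectory is curved.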

Nathan Labenz: That's interesting. It's funny how some of these techniques were developed for what were initially such whimsical purposes. I was following these things back in 2021, when the legendary Twitter account Rivers Have Wings was training these diffusion models, and they were just generating very bizarre but sometimes quite intriguing art. And it was like, "Is this useful for anything?" And it turns out, it's actually got direct application to biology, and it seems like we're maybe just a couple years behind. Obviously, image generation and video generation have gone extremely fast, to the point where you would now in many cases have a hard time distinguishing the image outputs from models from real photographs. We're not quite there on video; sometimes it's hard to distinguish, and sometimes it's still a little weird, but we've certainly come a long way. So, to project that that could happen in the biological domain as well over the next two to three years is nothing short of a paradigm shift, right? I mean, to be able to get that kind of accuracy out of a simulation, it seems like it would just totally upend the field. And it seems like it's almost certainly gonna happen, right? I mean, what do you think is the... I mean, there is this issue of the non-linearity, which is a fundamental one, so maybe I should just ask a little clarifying follow-up question on exactly to what degree this has been validated. I saw that there was this three-day cell transformation process, and with this model, because you were able to essentially scale the delta, you could say, okay, I have a cell at day zero. I can measure the transcriptome now, and now I can apply some perturbation. Now I can come back and measure it at day three after it's differentiated into a different cell type or whatever. Now I have that direction in space. Are you calculating the two intervening days as literally just a third of that vector and two thirds of that vector? Is it as simple as that? 
And if it is that simple, do I understand correctly that you also have gone and actually taken cells at those day-one and day-two time steps and compared to see how well they line up? And if that's all right, then how well, in fact, do they line up once you actually took the real measurement?

Siyu He: That's a good question. I think we have different types of validation strategies for the model. For example, we have some scenario cases. In the differentiation application, we definitely have the real experimental data from day zero to day three. In the training process, we only take day zero and day three, and we hold out days one and two as a test set. So once the model is trained, we get the results for days one and two through manipulation of the semantic variable, as you said, and we get those predicted values for the gene expression, and then use the real cases to compare and compute scores. For other cases, like the gene perturbation and drug perturbation, we do similar things. For example, we have the whole experimental dataset about perturbations of gene one and gene two, and also one plus two, but we hold out the one-plus-two condition and just test whether the model can accurately predict it. And then the third strategy, I would say, is the real case that we show at the end of the paper, the blood vessel organoids. In that case, because we need to culture the organoids from day zero to day 11, if I remember correctly, we cannot stop it. We cannot collect them day by day, because collecting this data for single-cell sequencing will actually destroy the whole sample. So, in that case, we only take day zero and the final day, and we use the interpolation process to generate the intermediate stages. But we have some publicly available data with time-series measurements as well, and we compare whether we can identify similar biological findings. We discovered some novel cell states that we hadn't really thought about. In that case, we also study which specific genes are being triggered by these differentiation steps, and then compare our findings to the experimental data from other publications. 
We do see consistent findings, which makes the model stronger and indicates its power to find these new things.
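The hold-out comparison described above, score a predicted expression profile against the measured one for a held-out day, can be sketched with a simple correlation metric. The data here are synthetic; the paper's actual scoring may use different metrics.

```python
# Sketch: score a predicted mean expression profile against the held-out
# measured profile for an intermediate day.
import numpy as np

def score_prediction(pred_profile, real_profile):
    """Pearson correlation between predicted and measured mean expression."""
    return np.corrcoef(pred_profile, real_profile)[0, 1]

rng = np.random.default_rng(0)
real_day1 = rng.gamma(2.0, 1.0, size=2000)             # held-out measurement
good_pred = real_day1 + rng.normal(scale=0.3, size=2000)  # close prediction
bad_pred = rng.gamma(2.0, 1.0, size=2000)              # unrelated prediction
```

A model whose interpolated day-one profile correlates strongly with the held-out measurement passes this check; an unrelated profile scores near zero.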

Nathan Labenz: Okay, cool. So, you mentioned, too, it's an important point that the manipulations that you can do in the semantic space are these vector operations which are inherently linear, and that feels like a problem 'cause we know how much non-linearity there is in biology. But you do have the diffusion process, which can capture non-linearities. So you can hope that that'll work, and at least in some number of cases, it seems like it does. I suppose it just remains to be seen in what cases might those sort of linear vector operations just fundamentally break down in the semantic space. Have you seen anything like that where you say, "Yeah, well, we tried one..." You described the validation successes, but were there any validation failures where you're like, "I guess this is just fundamentally not linear," or, "Somehow we're not capturing it in the semantic space?" Any instances like that?

Siyu He: Well, I think so far, the cases that we've tested have always worked well, I would say; we haven't seen that. But we do see that the intermediate-stage predictions are not as good as we'd like. That's probably because we are modeling the semantic variables to resemble a single vector. For example, going from iPSCs to mesenchymal cells, we have different types of growth factors used to culture the cells, so there are really different vectors, different directions, that make up the final stage of the variable. So the point is that we're just doing an approximation and making it linear. I would say, if we had more information, for example if we learned what the semantic variables for the individual growth factors look like, the whole model would be better. But for now we don't have that information, so this is the way we try to address it. Another point I should probably mention, maybe not directly related to this question, is that we also realize there are limitations to the model. For example, if we have some data about drug A and drug B, and we want to know what the result looks like, we train on A and B, and then we can apply those vectors for A and B to other cell types, right? So we want to apply A and B to any cell type and know what the responses are. But what about the case where we have drug C? We don't know the vector, the semantic vector, of drug C, so we cannot model drug C. To address this issue, because I was asked about it by the reviewers of the paper, we are considering another variant of the model. We have an adapter that can encode the drug information, using the molecular structure of the drug and also the dosage information. So we can embed this drug information about structure and dosage, and concatenate it with the semantic variable, giving us a new version of the semantic variable. 
And in that case, even if we haven't trained on drug C, we can still predict the results if we have the molecular components of drug C.
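A rough sketch of the adapter idea: project a drug's structure representation (here a made-up binary fingerprint) plus dosage into an embedding and concatenate it with the semantic variable. All names, sizes, and the fingerprint encoding are illustrative assumptions, not the paper's design.

```python
# Hedged sketch of a drug adapter: (structure, dose) -> embedding,
# concatenated with the semantic code so unseen drugs can be represented.
import numpy as np

def drug_adapter(fingerprint, dose, W_drug):
    """Project (structure fingerprint, log dose) into a small drug embedding."""
    features = np.concatenate([fingerprint, [np.log1p(dose)]])
    return np.tanh(W_drug @ features)

def condition_with_drug(z_sem, fingerprint, dose, W_drug):
    """New conditioning vector: semantic code plus drug embedding."""
    return np.concatenate([z_sem, drug_adapter(fingerprint, dose, W_drug)])

rng = np.random.default_rng(0)
W_drug = rng.normal(scale=0.1, size=(16, 129))          # 128-bit fingerprint + dose
z_sem = rng.normal(size=64)
fp_drug_c = rng.integers(0, 2, size=128).astype(float)  # hypothetical unseen "drug C"
z_cond = condition_with_drug(z_sem, fp_drug_c, dose=10.0, W_drug=W_drug)
```

Because the conditioning now depends on drug structure and dosage rather than a per-drug learned vector, a never-seen drug C still maps to a usable conditioning vector.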

Nathan Labenz: Interesting. Okay. So do you think that this is, as it stands today, something that can effectively accelerate science? I understand how it would if it's accurate enough. The way I always think about these things is, long term, this can change. But short term, we have a finite amount of wet work experiments that we can do. We've got finite lab space. We've got finite grad students and post-docs to be hands-on and actually do the cell culturing, and there's all this know-how. So for a short period of time, that throughput is relatively fixed, and what determines how much actual scientific progress we make is, do we run the right experiments that actually yield the interesting results that we can learn from? And so the great hope for these sort of in silico experiment models is obviously they run orders of magnitude faster. I assume that maybe you could tell me how long it takes to do one diffusion process, but I would assume it's less than 1/1000th, maybe less than 1/100,000th of the time that it would take to actually do the physical experiments. Is there a single answer to how long does it take to do one of these?

Siyu He: I would say this model will really help researchers accelerate their research. The way we get this single-cell RNA data probably takes at least one week to go from dissociation to getting the actual files with the sequences. But if you want datasets under other conditions, you can just use this generative model and spend maybe an hour to see what's going on under those conditions. And in that case, people can get some intuition about the direction of their projects. I would not say to trust the model totally, but it gives you some information. In some projects, for example, people develop those organoid systems, because an organoid is a good platform for drug screening, so they study different ways to create those organoids. I guess if we have the vectors for the individual growth factors, we can predict what's going on with the development of the organoids, and also provide information about the organoids' responses to drugs, like different drug components. That's one example. But in a real case, for patients with cancer, they want to know what kinds of drugs will be most useful for them. It's risky for them to just randomly take some drugs, right? So we need to test, and I think the model shows powerful applications in predicting how these cells respond and whether a drug will be useful for that patient, since it may have side effects as well. I guess it can provide efficient outcomes and predictions without necessarily waiting a long time to see what the outcome looks like.

Nathan Labenz: Yeah, the context of a single patient is really interesting, too. I hadn't even really considered that. I was just kind of thinking, like, in general scientific inquiry, you can sort of run a bunch of these simulations and identify the ones that seem most promising and then prioritize your wet lab work accordingly. And that, in and of itself, seems like it could really accelerate the pace of discovery-making that you could do. But then it becomes even potentially more of a game-changing technology at the point where you're applying it to an individual who has potentially their own idiosyncratic situation going on in their own selves. That's cool. So I guess for next steps, scale-up is obviously always one good candidate, and you mentioned separating out the interventions to have a distinct semantic encoder for them, and then you also mentioned the time-series analog, going from single before-and-after snapshots to something more like video generation. I mean, I guess that's enough, right? You'll have plenty to work on to tackle those. One thing I'm always struck by when I talk to people who are doing this sort of work is the pace of papers is just crazy. And you're evidence of that 'cause we're not just talking about one today. We've got a whole other one that we're gonna spend a little time on very soon as well. When you look ahead to all those next steps, do you feel like it's just a matter of work? At my software company, the software engineers used to say, "It's a simple matter of programming." And what they would mean by that is it's gonna take us some work. It's gonna take us some time. We're gonna probably find some things that are gonna challenge us a little bit along the way, but we'll definitely overcome those. And in the end, if it's a simple matter of programming, we'll definitely be able to get it done, and it's just kind of a question of putting in the work and getting over the little stumbling points. Is that how you kind of think about these next steps? 
Are you that confident that these are gonna work? Not to say that there won't be little surprises along the way, but overall, would you say you feel very confident that these next steps will all kind of lead to better models and better predictions and all that good stuff that we want?

Siyu He: Yeah. I think it really depends on the whole community, on collaboration between scientists, and not just on our own projects, because we need a lot of data, and people have their own expertise in their domain areas. That will be very helpful for accelerating the whole effort into the next step together. So we definitely need biologists to understand the whole picture, the whole mechanisms of disease; we need machine learning people to create those models; and we also need more people like statisticians, and even artists, to really make our models and our work more polished. I would say it's really a working-together thing, and the collaboration is very important. Another point I remember is that because we are applying AI models to healthcare and biomedicine, we really care about how accurate they are, and we are responsible to the patients. It's serious: we need to figure out exactly what the models can do, and not the opposite. So I think a lot of processes are required before this can become clinically valuable and testable. We need experimental tests and clinical tests, a lot of pre-testing, before the whole model can be published and applied to real cases with patients, I would say. But in the lab setting, I think there is more flexibility for testing, because we have ways to run more experiments to test the whole process. So it's all about teamwork.

Nathan Labenz: Any last thoughts on Squidiff before we move on to talk about CORAL a little bit?

Siyu He: Yeah, I think that's everything. I really appreciate that you asked so many questions; I think they really captured the key points of the model. Thank you so much.

Nathan Labenz: Yeah. My pleasure. All right. Well, then let's do the CORAL paper a little bit. This one, honestly, is even a little more challenging, I would say, for a non-biology background person like myself to understand. But what intrigued me about it, I wrote down several things that we can go through that sort of intrigue me about it. But maybe the most fundamental one is this challenge of connecting different levels of resolution or different levels of scale together into an integrated understanding. That in so many ways, that's sort of the fundamental challenge of science. It's like going down to the very lowest level, I don't know exactly what we wanna call the lowest level, but maybe it's string theory, maybe it's quantum mechanics or whatever, right? We have pretty good theories certainly at the quantum mechanics level at this point of how a very small number of particles will interact, and then we can go up and look at atoms, and we can look at molecules, and we can look at proteins, and we can look at cells, and we can look at all the way up, right, to the economy. But there's often a gap between those where it's like, we don't know how to necessarily aggregate up these smaller units of analysis into the bigger thing that we care about. That's even in economics, the micro and macroeconomics divide, right? We don't have a way of aggregating all the individual economic actors into an economy. So we sort of have this other top-down approach that's measuring these aggregate statistics and trying to predict where they're going. And it seems like we have a lot of, I don't know how many layers you would identify, but it seems like we have that problem in biology at maybe the highest level of difficulty. And so I'm always intrigued by anything that I see that kind of tries to take multiple levels and integrate them and create an understanding that sort of bridges these scale gaps. 
So with that prompt, maybe tell me how you came to this, what the motivation is behind this work and what you're trying to do with it.

Siyu He: Yeah, sure. I would say this one is more technical; it addresses a technical issue we face when dealing with spatial data. So maybe I'll first introduce spatial transcriptomics, which is the main subject of this project. We discussed single-cell RNA sequencing data with the first paper, but as you know, one very obvious shortcoming of single-cell RNA data is that we dissociate the tissues, which means we lose the location information for individual cells. This is very important, because cells are not really individuals: they interact with each other, and it's very important to know who a cell's neighbors are. So we lose this information, and that makes studying the architecture of the tissue, or the communication between cells, very difficult. Because of that, people created new technologies that we call spatial transcriptomics, and spatial proteomics as well. Altogether, spatial omics data only came out maybe seven years ago, and now these technologies are emerging, more and more data is coming out, and it requires computational methods to understand it. And one very significant issue with current approaches is that the spatial omics data don't all have the same resolution, and they aren't the same size either. You can think of it like this: you have a photo of yourself that's grayscale but high resolution, and you have another photo of yourself at a different time point or a different position, which is colorful but at a mosaic level. Can you actually combine them into something better, for example a colorful, high-resolution image? That is the goal of this project. 
But I think the photo analogy differs from what we're dealing with, because a photo always has three channels, the RGB channels, while in our case we have very high-dimensional channels. For example, spatial transcriptomics, as I said, covers the whole genome, 30,000 to 60,000 genes, but because of that it loses spatial resolution, so it's mosaic-like and low resolution. In other cases, we have high spatial resolution in the proteomics: for spatial proteomics, the resolution in space is very high, but the limitation is that we can only measure a limited number of proteins. So neither dataset is perfect on its own, and we are limited in really understanding or dissecting the spatial structures with the low spatial resolution, and it's very difficult to distinguish cellular heterogeneity with the low number of proteins. Can we actually combine them together so that we can have a more comprehensive understanding of the tissue architecture, how the cells coordinate with each other, how they respond to disease, and how we can provide new treatment strategies? That's the overall goal of this project. So we wanted to address this issue across the two technologies. I would say there are new technologies coming out that have high resolution in both protein and RNA, but there is still a strong need for a method that addresses these resolution differences. Oh, sorry, I forgot something. In other cases, when we measure these spatial proteomics and transcriptomics, the difficulty is that we cannot actually use the same tissue slice. Usually the slices are from different locations: they are very adjacent, but they are still shifted relative to each other. 
Like I said with the photos, we're taking photos at different locations, and that creates more difficulties in building the data.

Nathan Labenz: It's just a good primer on the difficulty of doing all this sort of stuff, especially if it has to come out of actually a living person. I mean, it's a little easier, I guess, if it's coming out of an organoid or whatever that's developed in a lab, but again, in a living person, you've got a procedure to actually get the tissue sample. There's different times that those measurements might be happening at. They might be actually just distinct parts of the tissue nearby but not the same. Then you have the challenge of the different kinds of measurements, and so this basically creates a very confusing picture that is not integrated into a single understanding. This kind of reminds me also a little bit of, and I'm not by any means an expert on this, but in an episode maybe a year ago where we looked at a bunch of different applications of what was then the new Mamba architecture to biomedical imaging problems, there was a similar thing that came up where it was like, we want to be able to integrate different kinds of scans. There's all kinds of different scans. You've got your MRIs, you've got your ultrasounds, whatever. You've got all these kind of different ways of trying to get a picture of the body, and they have their different strengths and weaknesses, and if you have two of them side by side taken at different times when the person was in a little bit of a different bodily position, now you can have a real challenge to try to figure out, how do I integrate these pictures? So one of the interesting applications that we saw in that Mamba use cases episode was taking different scans of different types and figuring out how they needed to be deformed in space so as to create one coherent single view that could allow a clinician to look at one thing and kind of see these different signals in an integrated way. So it's sort of a similar problem here where you've just got these different angles of trying to see what's going on, and it's very hard for us. 
This is another example of just where AI is so... And there's other techniques, and not all the techniques here are learned techniques. It's quite a mix of things. But it's another indication... This has been a huge theme of this podcast experience for me, is just seeing how all the different types of problems for which humans just don't have good intuitions, and at best we can get a few experts to, with a lot of time and effort, develop the intuition, but then they can only process so much. And what we see just over and over again in so many different domains is that AIs are able to learn the sort of intuition that is needed to make sense of these things. And then obviously, once they do, they become dramatically more scalable. So, those are just big themes that I think very much are echoed in this work. Another... I guess, maybe you could... Oh, I do want to go to the synthetic data question in just a second, but before we do that, because that is another big theme, distinct from the ones I was just waxing on about, what are the big sort of contributions here? One big one that I noticed was simply taking one of these higher scale things and deconvolving it, which is a word you don't hear too often. We hear a lot about convolving and less about deconvolving, but taking something that is higher dimensional but lower resolution, and actually going down and saying, "Okay, this is what we now think each of the individual cells from this larger sample actually looked like." So, maybe give us a little bit more on what the practical upshot and value is that comes out of this whole setup.
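The deconvolution Nathan asks about can be illustrated with a deliberately simplified toy: express each low-resolution spot's profile as a non-negative mixture of cell-type signatures and recover the proportions. CORAL's actual model is far richer (probabilistic, graph-based, multimodal); this uses plain clipped least squares on invented data just to make the term concrete.

```python
# Simplified sketch of spot deconvolution: spot profile ~ signatures @ proportions.
import numpy as np

def deconvolve_spot(spot_profile, signatures):
    """Least-squares mixture weights, clipped and renormalized to proportions."""
    weights, *_ = np.linalg.lstsq(signatures, spot_profile, rcond=None)
    weights = np.clip(weights, 0.0, None)
    return weights / weights.sum()

# Toy signatures: 6 genes x 3 cell types (columns are cell-type profiles).
signatures = np.array([[5., 0., 0.],
                       [4., 1., 0.],
                       [0., 6., 0.],
                       [0., 5., 1.],
                       [0., 0., 7.],
                       [1., 0., 6.]])
true_props = np.array([0.5, 0.3, 0.2])
spot = signatures @ true_props          # noiseless mixed measurement
est = deconvolve_spot(spot, signatures)
```

With clean, well-separated signatures the proportions are recovered exactly; the hard part in real tissue is that signatures overlap, measurements are noisy, and the modalities are misaligned, which is what the learned model has to absorb.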

Siyu He: I would say the deconvolution is definitely the major task the model is trying to address. But the model also tries to provide a comprehensive analysis across the integration of the different modalities, and transcriptomics with proteomics is a good example; we can also transfer it to other types of modalities. So, beyond the deconvolution to high resolution, I would say, in terms of the comprehensive analysis, as I said, it's very important to understand how the cells communicate with each other. So the model also captures the interaction level of individual cells: we create a graph neural network-based model to infer the interactions between the cells. Another point is that once we have high-resolution data, we can explore the single-cell-level modalities and their latent features, organize the spatial regions, figure out the structures present in the tissue, and study the tissue architecture; here we identify what we call functional domains. I think these are the major things in the model. But maybe I forgot something: we can also investigate the spatial variability happening inside the tissues.
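The graph-based modeling of neighboring cells can be sketched as one round of message passing over a k-nearest-neighbor cell graph. This is a generic mean-aggregation step, not CORAL's architecture; the coordinates, features, and weights are invented.

```python
# Sketch: one mean-aggregation message-passing step over a cell-neighbor graph.
import numpy as np

def knn_adjacency(coords, k=3):
    """Boolean adjacency connecting each cell to its k nearest neighbors."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a cell is not its own neighbor
    nbrs = np.argsort(d, axis=1)[:, :k]
    adj = np.zeros((len(coords), len(coords)), dtype=bool)
    for i, row in enumerate(nbrs):
        adj[i, row] = True
    return adj

def message_pass(features, adj, W_self, W_nbr):
    """h_i' = relu(W_self h_i + W_nbr * mean of neighbor features)."""
    nbr_mean = adj @ features / adj.sum(axis=1, keepdims=True)
    return np.maximum(0.0, features @ W_self.T + nbr_mean @ W_nbr.T)

rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(50, 2))      # cell positions in the tissue
feats = rng.normal(size=(50, 8))                # per-cell input features
W_self = rng.normal(scale=0.3, size=(8, 8))
W_nbr = rng.normal(scale=0.3, size=(8, 8))
h1 = message_pass(feats, knn_adjacency(coords), W_self, W_nbr)
```

Each cell's updated representation mixes its own features with its spatial neighbors', which is the mechanism by which local cell-cell interaction structure enters the model.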

Nathan Labenz: And maybe just for people like me who are not well-schooled in all this, I don't have a good intuition for how... Obviously the body is highly differentiated. It's a wonder of nature that we start off as even just one cell at one point, and then a blob of not super differentiated cells, and then all this differentiation happens, and then we've got in the end just many, many, many different types of cells, and some of them seem to have sort of gradual transitions, but then other times they have more abrupt transitions. And so, from my rudimentary study of biology in years past, I do have some sense of how segmentation sort of happens based on gradients of certain signaling factors. And it seems like a key assumption, if I understand this work all correctly, is that this is going to be fairly gradual. There's a couple of ways, I guess, that that happens. One is that the graph neural network is sort of modeling cells interacting directly on their neighbors. And if there is a way that they're also modeling sort of longer distance interactions, I didn't see that. So, that was interesting that it seemed to be a very sort of local causal model. And then there seemed to be also a part of the loss function that was meant to try to keep things smooth so that as you go through a tissue and you see, "Okay, well, over here, it's like, the tissue is this way, and over here it's this way," there's some sort of an assumption that this is going to happen gradually through space. So, I maybe need to be de-confused about some of that. But the thing that also I'm having a little bit of a hard time sort of reconciling in my own mind is, for certain parts of the body, it seems like the boundary, let's say, boundary is a better word than barrier, is pretty clear. When I think of a bone and then the tissue away from the bone, it seems like I know right where that bone stops and where the other tissue starts, and it doesn't seem like there's a smooth transition. 
Maybe there is, and I'm just not zooming in close enough to see it. But yeah, I mean, that's more of a prompt than a question, but what am I missing there? What sort of additional information would help me have a better intuition going forward for all this stuff?

Siyu He: Of course. It really depends on the type of tissue we are studying. For example, bone has clearer boundaries, and blood vessels have boundaries as well, but in some cases, for example tumors, there may not be clear boundaries; tumor cells may infiltrate into the normal tissue. So it very much depends on the tissue. In some cases it's not easy to identify the structures directly from the morphology, but if you look at the molecular level, there are actually interesting patterns, and we are only able to see them by measuring the molecular information. Regarding how the model captures these boundaries and addresses these interactions: given the quality of the spatial data we can get, at both high spatial resolution and high gene resolution, the model is able to capture the interactions between neighboring cells. One detail of the model is that we are not only taking the most adjacent cells; we take the k nearest neighbor cells, and we can adjust the k value to make the graph bigger. But I would say it's not necessary to make the range too long, because in these cases we are focusing on communication between neighboring cells. There are so many complicated things happening in the tissue, and the way we try to simplify it is really to understand the cell itself and what its neighbors look like. As for the graph Laplacian loss I added to the model, it does make things smoother, but it's a regularization term meant to make the model more realistic. We have other losses as well, related to reconstructing the data and to the KL divergence, so there is a balance between smoothness and the sharpness of boundaries.
Usually, if we have single-cell-level data, there will be a clear boundary, because the high resolution of single cells lets us distinguish things at the cellular level. So we do see boundaries of the cells related to bone and to blood vessels, which gives us even more detailed structure indicating the inner organization of organs and tissues.
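The two pieces described above, a k-nearest-neighbor graph over cell positions and a graph Laplacian smoothness penalty, can be sketched in a few lines. This is an illustrative toy (assumed details, not the paper's exact code): it builds the kNN graph from 2-D coordinates and computes the penalty sum over edges of ||z_i - z_j||^2 on per-cell feature vectors, which is small when neighboring cells have similar features.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_graph_laplacian_loss(coords, features, k=6):
    """Mean graph-Laplacian smoothness penalty over a kNN graph of cells."""
    tree = cKDTree(coords)
    # Query k+1 neighbors because each point's nearest neighbor is itself.
    _, idx = tree.query(coords, k=k + 1)
    loss = 0.0
    for i, neighbors in enumerate(idx):
        for j in neighbors[1:]:  # skip self
            diff = features[i] - features[j]
            loss += float(diff @ diff)
    return loss / len(coords)

rng = np.random.default_rng(1)
coords = rng.random((100, 2))         # 2-D spatial positions of 100 cells
smooth = coords @ np.ones((2, 3))     # features varying smoothly with position
noisy = rng.random((100, 3))          # spatially unstructured features

# Spatially smooth features incur a much smaller penalty than random ones.
print(knn_graph_laplacian_loss(coords, smooth) < knn_graph_laplacian_loss(coords, noisy))
```

In a training loop this term would be added, with a small weight, to the reconstruction and KL terms, which is exactly the balance between smoothness and boundary sharpness described above.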

Nathan Labenz: Gotcha. So the reconstruction loss incentivizes accuracy, and sometimes there is a sharp boundary, so that incentivizes capturing it, while the smoothness term tries to smooth things out in general and acts as a regularizer. We've seen this with plenty of different setups: a clever compound loss function can take you far, so you hope to get the best of both worlds there. And did I understand correctly that the k value you mentioned is a hyperparameter for how many nearest neighbors each cell will be connected to? So you could scale up or down with that single hyperparameter: look at only a very small number of the most adjacent cells, or expand the radius a bit or even more. Okay. Cool. How about on the synthetic data side? This is something where, in the context of language modeling, the community's been on a bit of a roller coaster, where it's been like, "Oh, we have a data wall. Oh, don't worry, synthetic data will take us there." Then it was like, "Oh, synthetic data creates bad models that sort of have mode collapse or whatever." That idea got a lot of traction for a little while. I would say at this point we're pretty well past that notion, and it's pretty clear that having models solve problems and doing reinforcement learning on that is working very well. But in the biology context, I've had a lot less exposure and I have a lot less intuition for what sort of synthetic data exists and how much we should trust it. Can we train models on it and be confident? Also, even just the mix here: I don't think it was all synthetic data that was used, but there was some of it. So I'd love to understand the role that the synthetic data plays, what the limits are today on the quality of that data, and then, as a result, how much of it we can really use in a project like this.

Siyu He: Yeah. In my projects, I used synthetic data as well as experimental data to validate that the model works as we expect. For the simulation part, there are well-known models in the field that people have created, for both single-cell and spatial data, that are realistic in resembling experimental observations. Probably similar to other fields, we are mostly interested in what the distributions look like, and those simulation models resemble the relevant types of distributions and sample from a designed, structured distribution. In my case I used Splatter, a well-known simulation tool for single-cell RNA-seq data. For the CORAL model, which deals with a lot of spatial data, we worked with a collaborator, who is now an assistant professor at the University of Connecticut. He designed a model called scDesign, published in Nature Biotechnology, I think two years ago. That model is able to sample from designed distributions that carry spatial information and produce observations similar to real spatial gene expression data. So that is the synthetic side of what we work with. We have real data as well, which is a little different, because we have prior knowledge about what the tissue looks like, so we can check whether the model identifies interesting patterns we know exist. We always use domain knowledge to validate the model.

Nathan Labenz: So safe to assume that the reason we need the synthetic data is that it's just hard to gather the real data. Sometimes these things sort of feel like a little bit of a hall of mirrors and you hope it's not a house of cards. There feels like there's something that can kind of be circular about it, I guess, where you're saying, "Well, we've got this ability to generate this synthetic data. Can we learn from it?" But when I see that pattern, I also sometimes think, well jeez, if we can generate that synthetic data, don't we in some sense already know what we need to know? So again, I feel like I'm just constantly asking you to help me sort of get less confused, but what is it that you are learning from the synthetic data that is not already in some sense learned by the process that generates the synthetic data in the first place?

Siyu He: That's a very interesting question. I would say we use the synthetic data to validate that our model really captures the results we want. We do not use the same generation process as our model; the synthetic data is generated in a well-known, commonly accepted way. The good thing about this simulated data is that we have the ground truth for it. For example, if we are interested in interactions, or in domains, we know what the regions or domains look like, so we can validate whether the model has the ability to identify those domains and deconvolve them to high resolution. In real cases, we don't have that ground truth. Once we have validated that the model performs well on synthetic data, we can transfer it to real cases and trust the results that come out for the real data, where we don't have any further information. So that is the workflow: first use synthetic data to validate the model, then transfer the model to real data. They are trained separately; they're totally different things. And, to emphasize again, the way the synthetic data is generated is different from the model we designed, so we make sure there is no loop in the whole project.
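The validate-on-synthetic workflow described above can be illustrated with a toy example (all names and numbers are made up, and simple k-means stands in for the real model): simulate cells from known "domains", run the method, and score recovery against the ground-truth labels before trusting the same method on real data, where no such labels exist.

```python
import itertools
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(2)

# Simulated ground truth: 150 cells drawn from three well-separated domains.
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
true_labels = np.repeat([0, 1, 2], 50)
cells = centers[true_labels] + rng.normal(scale=0.3, size=(150, 2))

# "The model": cluster the cells without seeing the labels.
_, pred = kmeans2(cells, 3, seed=0, minit="++")

# Score recovery with the best label permutation (cluster ids are arbitrary).
best_acc = max(
    np.mean(np.array(perm)[pred] == true_labels)
    for perm in itertools.permutations(range(3))
)
print(best_acc)  # well-separated domains should be recovered almost perfectly
```

The key point, echoing the answer above, is that the simulator and the model being validated are independent, so a high recovery score is evidence about the model rather than a circular artifact.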

Nathan Labenz: When ground truth is hard to come by, things can certainly get really tricky. There's talk these days about a Manhattan Project for super intelligence and whatnot. And sometimes when it's very fuzzy, it's like, well, what is that super intelligence supposed to look like, what's it supposed to do, and are we gonna be able to keep control of it and so on. I sometimes feel like when I talk to folks like you who are doing such detailed work, especially in biology, I feel like maybe a worthy Manhattan Project would be just to get all the data that we might really need to scale these things up and figure them out. I suppose that could come about in multiple different ways. I have to imagine a lot of the data that we need is locked up in medical record systems, and it's hard to share because of privacy rules, and obviously those have their place. Not to suggest that they don't, but it seems like liberating some of that data in some responsible way could be a huge unlock. And then maybe even just a major investment in the form of potentially just government subsidy if we're gonna subsidize AI development. I didn't finish the thought on the superintelligence, but what's it gonna do, whatever, can we keep control of it? The number one most compelling answer to me generally when people get excited about superintelligence is, it's going to cure all the diseases, we're gonna live longer, healthier lives, it's gonna be amazing. Maybe double your lifespan, whatever. Okay, so I'm ready, I want that for sure. And I'm just wondering, maybe instead of building a trillion dollar data center and trying to create a superintelligence that can solve all our problems, maybe what we should do is get all the data so that people like you have a lot easier time making the discoveries directly, and we don't necessarily have to go through this undefined sort of amorphous superintelligence. Maybe we end up with both. 
But I guess the question there is, how different could your life be in a year or two if we had a strategic nation-level project to make the biological data available that would facilitate the next couple of levels of scale-up and improvement in the sort of work that you're doing?

Siyu He: This is a very interesting question. I would say there are more and more projects coming out trying to build big models that closely resemble the human body. For example, I think Google DeepMind has created projects like virtual cells, where they want a unifying model that brings together all these different types of cells and organs at different scales to really have a digital copy of all the cells. I also know that CZI, the Chan Zuckerberg Initiative, launched a project they call Billion Cells, which will bring a lot of cells together to build a similar model as well. So I think this is a very exciting period, with people really using foundation models and large-scale models to address these questions. We have GPT, and people are also creating biological versions of GPT to answer any question related to cells. In other cases, people are interested in clinical or health questions, so we can have virtual doctors. There are many directions and a lot of applications; it's a very exciting time to be working on this. People are also working on digital twin projects: maybe we can each have a copy of ourselves, and then we would know what happens if we make different choices at some point and what we would look like afterwards. That is also very interesting. There are a lot of exciting questions coming out, and I am very excited to see where the models will go in one to two years, and I look forward to how the field will develop.

Nathan Labenz: It's an exciting time for sure, and a little scary. But this sort of work is much more pure upside, it feels like, to me. I mean, obviously anything in some possible scenario could be abused, but I don't think we have to worry about Squidiff or CORAL getting out of control. It is focused work that is answering specific problems, and it's a good reminder that such work still happens and still has value, and it's not all just about embracing the bitter lesson and maximizing the cluster size that you can throw at something. So I really appreciate that about both of these projects. I think that's about all the questions that I had for you. Is there any other next step you want to tease or any other thoughts you want to leave us with?

Siyu He: I would say I'm very optimistic about how AI can help with healthcare and biomedicine. I know people may be scared about how fast it can grow, but at least for now, I don't think there's a reason to be scared that AI will destroy the world. Hopefully the researchers working at the intersection of AI and healthcare will make the development of healthcare quicker, and people will have better lives built on the research that comes out of projects like ours. That's all I want to share. Thank you so much for this great opportunity and for a very exciting and delightful discussion. Thanks so much.

Nathan Labenz: My pleasure. Siyu He, thank you for being part of the cognitive revolution.

Siyu He: Thank you.
