AI to help researchers see the bigger picture in cell biology
Source: MIT News - AI
Studying Gene Expression in Cancer Cells
Studying gene expression in a cancer patient’s cells can help clinical biologists understand the cancer’s origin and predict the success of different treatments. But cells are complex and contain many layers, so how the biologist conducts measurements affects which data they can obtain. For instance, measuring proteins in a cell could yield different information about the effects of cancer than measuring gene expression or cell morphology.
Where in the cell the information comes from matters. To capture a complete picture of the cell’s state, scientists often must conduct many measurements using different techniques and analyze them one at a time. Machine‑learning methods can speed up the process, but existing approaches lump together all the information from each measurement modality, making it difficult to determine which data came from which part of the cell.
A New AI‑Driven Framework
To overcome this problem, researchers at the Broad Institute of MIT and Harvard and ETH Zurich/Paul Scherrer Institute (PSI) developed an artificial‑intelligence‑driven framework that learns:
- Which information about a cell’s state is shared across different measurement modalities, and
- Which information is unique to a particular measurement type.
By pinpointing the origin of each piece of information, the approach provides a more holistic view of the cell’s state, making it easier for a biologist to see the complete picture of cellular interactions. This could help scientists understand disease mechanisms and track the progression of:
- Cancer
- Neurodegenerative disorders such as Alzheimer’s
- Metabolic diseases like diabetes
“When we study cells, one measurement is often not sufficient, so scientists develop new technologies to measure different aspects of cells. While we have many ways of looking at a cell, at the end of the day we only have one underlying cell state. By putting the information from all these measurement modalities together in a smarter way, we could have a fuller picture of the state of the cell.”
— Xinyi Zhang, SM ’22, PhD ’25 (lead author)
Zhang is joined on the paper by G.V. Shivashankar, professor in the Department of Health Sciences and Technology at ETH Zurich and head of the Laboratory of Multiscale Bioimaging at PSI; and senior author Caroline Uhler, professor in EECS and the Institute for Data, Systems, and Society (IDSS) at MIT, member of MIT’s Laboratory for Information and Decision Systems (LIDS), and director of the Eric and Wendy Schmidt Center at the Broad Institute. The research appears today in Nature Computational Science.
Manipulating Multiple Measurements
There are many tools scientists can use to capture information about a cell’s state. For example:
- RNA sequencing – reveals whether the cell is actively growing.
- Chromatin‑morphology imaging – shows how the cell responds to external physical or chemical signals.
“When scientists perform multimodal analysis, they gather information using multiple measurement modalities and integrate it to better understand the underlying state of the cell. Some information is captured by one modality only, while other information is shared across modalities. To fully understand what is happening inside the cell, it is important to know where the information came from.”
— G.V. Shivashankar
Traditionally, researchers must conduct multiple individual experiments and compare the results—a slow and cumbersome process that limits the amount of information they can gather.
The New Machine‑Learning Framework
The researchers built a framework that automatically distinguishes:
- Shared information across modalities, and
- Modality‑specific information unique to a single measurement type.
“As a user, you can simply input your cell data and it automatically tells you which data are shared and which data are modality‑specific.”
— Xinyi Zhang
How It Works
- Rethinking Autoencoders – Conventional multimodal autoencoders use one model per modality, each producing its own latent representation.
- Shared + Private Spaces – The new method introduces a shared representation space for overlapping data and separate private spaces for modality‑specific data.
- Two‑Step Training – A special training procedure helps the model decide which data belong to the shared space versus the private spaces.
The result is analogous to a Venn diagram of cellular data, where the intersection represents shared information and the non‑overlapping regions represent modality‑specific signals.
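The exact architecture from the paper is not reproduced here, but the shared‑plus‑private latent layout described above can be sketched in miniature: each modality gets its own encoder and decoder, every encoder emits both a shared component and a private component, and reconstruction combines the two. All class and variable names are illustrative, and the random linear maps stand in for trained neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Random linear map standing in for a trained encoder/decoder network.
    return rng.normal(scale=0.1, size=(out_dim, in_dim))

class SharedPrivateAutoencoder:
    """Toy two-modality autoencoder: each modality has a private latent
    space, and both modalities contribute to one shared latent space."""

    def __init__(self, dim1, dim2, shared_dim=4, private_dim=2):
        # Encoders: modality -> (shared estimate, private part)
        self.enc1_s = linear(dim1, shared_dim)
        self.enc1_p = linear(dim1, private_dim)
        self.enc2_s = linear(dim2, shared_dim)
        self.enc2_p = linear(dim2, private_dim)
        # Decoders: (shared + private) -> reconstructed modality
        self.dec1 = linear(shared_dim + private_dim, dim1)
        self.dec2 = linear(shared_dim + private_dim, dim2)

    def encode(self, x1, x2):
        # During training, a penalty would push the two shared estimates
        # to agree; here we simply average them.
        shared = 0.5 * (self.enc1_s @ x1 + self.enc2_s @ x2)
        return shared, self.enc1_p @ x1, self.enc2_p @ x2

    def decode(self, shared, p1, p2):
        # Each modality is reconstructed from the shared state plus
        # its own private state only.
        x1_hat = self.dec1 @ np.concatenate([shared, p1])
        x2_hat = self.dec2 @ np.concatenate([shared, p2])
        return x1_hat, x2_hat

model = SharedPrivateAutoencoder(dim1=10, dim2=8)
x1, x2 = rng.normal(size=10), rng.normal(size=8)
shared, p1, p2 = model.encode(x1, x2)
x1_hat, x2_hat = model.decode(shared, p1, p2)
print(shared.shape, p1.shape, x1_hat.shape)  # (4,) (2,) (10,)
```

The key design point mirrored here is that the private latents never cross modalities: modality 1 is reconstructed without ever seeing modality 2's private space, which is what forces genuinely shared information into the shared representation.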
Distinguishing Data in Practice
- Synthetic datasets – The framework correctly recovered known shared and modality‑specific information.
- Real‑world single‑cell datasets – It automatically distinguished gene activity captured jointly by two modalities (e.g., transcriptomics and chromatin accessibility) while correctly identifying signals present in only one modality.
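The paper's synthetic benchmarks are not reproduced here, but the idea behind them can be illustrated with a toy example: generate two "modalities" from a common shared factor plus per‑modality private factors, and observe that only the shared factor is correlated across modalities. The data‑generating setup below is an assumption for illustration, not the study's actual benchmark.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Ground-truth latent factors: one shared, one private per modality.
shared = rng.normal(size=n)
priv1 = rng.normal(size=n)
priv2 = rng.normal(size=n)

# Two "measurement modalities", each mixing the shared signal (plus a
# little measurement noise) with its own private signal.
x1 = np.stack([shared + 0.1 * rng.normal(size=n), priv1], axis=1)
x2 = np.stack([shared + 0.1 * rng.normal(size=n), priv2], axis=1)

def corr(a, b):
    # Pearson correlation between two feature vectors.
    return float(np.corrcoef(a, b)[0, 1])

# Shared channels correlate strongly across modalities;
# private channels do not.
print(corr(x1[:, 0], x2[:, 0]))  # close to 1: shared information
print(corr(x1[:, 1], x2[:, 1]))  # close to 0: modality-specific information
```

A disentangling framework like the one described above should recover the first channel as shared and the second as modality‑specific, because cross‑modality agreement is exactly the signal that distinguishes the two.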
The researchers also used the method to pinpoint which measurement modality captured a protein marker indicating DNA damage in cancer patients. Knowing the origin of this marker helps clinical scientists choose the most appropriate technique for measuring it.
“There are many modalities in a cell, and we can’t possibly measure them all, so we need a prediction tool. But then the question is: Which modalities should we measure and which modalities should we predict? Our method can answer that question,” Uhler says.
“It is not sufficient to just integrate the information from all these modalities,” she adds. “We can learn a lot about the state of a cell if we carefully compare the different modalities to understand how different components of cells regulate each other.”
In the future, the researchers want to enable the model to provide more interpretable information about the state of the cell, conduct additional experiments to ensure it correctly disentangles cellular information, and apply the model to a wider range of clinical questions.
Funding
This research is funded, in part, by:
- the Eric and Wendy Schmidt Center at the Broad Institute
- the Swiss National Science Foundation
- the U.S. National Institutes of Health
- the U.S. Office of Naval Research
- AstraZeneca
- the MIT‑IBM Watson AI Lab
- the MIT J‑Clinic for Machine Learning and Health
- a Simons Investigator Award