Caltech researchers uncover previously hidden cell types and gene expression through improved analysis of sequencing data

Yuki Oka

In 2018, a group of researchers in Yuki Oka's laboratory at Caltech made a groundbreaking discovery: they identified a specific type of neuron that mediates thirst satiation. They then ran into a problem, however. A state-of-the-art technique called single-cell RNA sequencing (scRNA-seq) could not detect the thirst-related neurons in samples of brain tissue from the median preoptic nucleus, the region where they were expected to be present.

"We knew that the gene labeling we added to our characterized neurons was being expressed in the median preoptic nucleus of the brain, but we didn't see the gene when we profiled that region of the brain with scRNA-seq," says Oka. "We heard this from many colleagues—scRNA-seq was missing cell types and gene expression that they knew should be there. We started wondering why that is."

Identifying different cell types is crucial to understanding the many functions our bodies carry out, from healthy processes such as sensing thirst to cellular malfunction in disease states. For instance, many researchers are currently searching for cell types that could be associated with specific diseases, such as Parkinson's disease. Determining the precise cell types involved is essential for all of these studies. Recently, the Oka laboratory at Caltech and the laboratory of Allan-Hermann Pool at the University of Texas Southwestern Medical Center joined forces to demonstrate how to optimize a crucial step in scRNA-seq analysis to recover missing cell types and gene expression data that are usually discarded.

"We've improved the analysis of existing state-of-the-art single-cell RNA sequencing data, revealing the expression of hundreds or sometimes thousands of genes for individual data sets," says Oka. "It is important to enable this type of precision because biological processes are rich and complicated. Recent research has identified over 5,000 distinct neuron types in the mouse brain, and the human brain is presumably more complex. We need our techniques to be as sensitive and comprehensive as possible."

Understanding Gene Expression

The human body is composed of trillions of cells, each with a specific function that enables us to carry out our daily activities. These cells are distinct from one another and handle different tasks: the immune system's killer T cells detect and destroy disease-causing pathogens, neurons transmit the electrical signals that govern brain function, and skin cells form a barrier against the external environment. Researchers have so far identified thousands of unique cell types, but many more remain undiscovered.

Most cells in an organism have the same genetic information in their genome. The genome contains instructions for all cellular tasks and is made up of genes written in DNA, located in the cell's nucleus. These genes are expressed by being copied into RNA, which is then transported out of the nucleus to carry out functions in the rest of the cell.

In each cell type, only a certain subset of genes is expressed, or turned on, at any given time. These variations in gene expression lead to the differences between cell types.

To better understand this concept, imagine a massive library with books sorted into different sections. If you want to build a plane, you would only check out books about aviation and mechanics. Similarly, in cells, only those genes that pertain to a specialized cell's unique functions are activated, while the rest remain dormant.

Improving Techniques for Gene Expression Estimation

scRNA-seq is a powerful technique to identify cell types. With this method, a cell is broken open and the genetic information expressed inside is labeled with a molecular tag that serves as a barcode. scRNA-seq can quickly do this for thousands of cells in a single tissue sample, with each cell receiving its unique barcode. Computational analysis can then be performed to determine which sets of genes are expressed in individual cells, and supercomputer models can evaluate that data to look for patterns and identify distinct cell types.
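The barcode step described above can be illustrated with a minimal sketch in Python. It is a toy example, not the software used in the study: it assumes each sequenced RNA fragment has already been matched to a gene, and the barcodes, gene names, and the count_by_barcode helper are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy reads: (cell_barcode, gene) pairs, as if each sequenced
# RNA fragment had already been matched to a gene.
reads = [
    ("AAAC", "GeneA"), ("AAAC", "GeneA"), ("AAAC", "GeneB"),
    ("TTTG", "GeneB"), ("TTTG", "GeneC"),
]

def count_by_barcode(reads):
    """Group reads by cell barcode and count reads per gene in each cell."""
    counts = defaultdict(Counter)
    for barcode, gene in reads:
        counts[barcode][gene] += 1
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {barcode: dict(genes) for barcode, genes in counts.items()}

expression = count_by_barcode(reads)
for barcode, genes in expression.items():
    print(barcode, genes)
```

Each barcode ends up with its own gene-count profile, and profiles like these are what later analysis compares in order to group cells into types.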

One problem with the technique, however, was that certain RNAs were commonly left out of gene-expression estimates, even though they represented expressed genes.

The reason, Oka and colleagues found, is related to an issue with the so-called reference transcriptome to which researchers map sequencing data. For example, researchers have extensively studied the mouse genome, and have labeled or annotated it in great detail, creating a digital reference, or "transcriptome," that maps out DNA sequences and their corresponding genes.

This annotation, the researchers found, must be optimized for scRNA-seq to prevent the loss of gene expression information—which can arise if the genes located at the tail ends of a DNA strand are poorly annotated, for example, or if there is extensive overlap between neighboring gene transcripts. Such complications can prevent the detection of thousands of genes. (These issues are particularly pronounced when using high-throughput forms of scRNA-seq that, to reduce cost, examine only the very tail end of genes; most of the atlases that have been created to describe the cellular complexity of our tissues rely on these methods.)
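To see how an annotation problem can hide gene expression, consider a simplified sketch in Python. It is an illustration under stated assumptions, not the researchers' framework: genes are modeled as single intervals on one coordinate axis, each read as a single position, and any read that falls in no annotated gene or in more than one is discarded. The gene names, coordinates, and the assign_reads helper are hypothetical.

```python
def assign_reads(read_positions, annotation):
    """Count reads per gene; discard reads that hit zero or several genes.

    annotation maps gene name -> (start, end), inclusive coordinates.
    """
    counts = {gene: 0 for gene in annotation}
    discarded = 0
    for pos in read_positions:
        hits = [g for g, (start, end) in annotation.items() if start <= pos <= end]
        if len(hits) == 1:
            counts[hits[0]] += 1
        else:
            discarded += 1  # unannotated region or ambiguous overlap
    return counts, discarded

# Hypothetical toy data: reads pile up near the true tail end of GeneE,
# which the original annotation cuts short at position 1000.
reads = [950, 1020, 1050, 1080, 2100]
original = {"GeneE": (100, 1000), "GeneF": (2000, 3000)}
extended = {"GeneE": (100, 1100), "GeneF": (2000, 3000)}

counts_before, lost_before = assign_reads(reads, original)
counts_after, lost_after = assign_reads(reads, extended)
print(counts_before, lost_before)
print(counts_after, lost_after)
```

With the shortened annotation, three of the four reads from GeneE are thrown away. Extending the annotated tail end recovers them without any new sequencing, which is the general idea behind optimizing the reference.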

Precision and high resolution are incredibly important when identifying distinct cell types. For example, say that two cells each express genes "A," "B," "C," and "D," but only one cell expresses gene "E" while the other does not. If a sequencing technique does not capture the expression of "E," then the data would suggest that the two cells are identical when in fact they are not.
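The same example can be written as a short Python sketch. The two cells and the detected helper are hypothetical, and gene expression is reduced to a simple on/off set for illustration.

```python
# Genes each hypothetical cell truly expresses.
cell_1 = {"A", "B", "C", "D", "E"}
cell_2 = {"A", "B", "C", "D"}

def detected(expressed, detectable):
    """Return only the expressed genes that the analysis is able to report."""
    return expressed & detectable

all_genes = {"A", "B", "C", "D", "E"}
without_e = all_genes - {"E"}  # an analysis that cannot capture gene E

# When E is invisible, the two cells look identical.
same_when_e_missing = detected(cell_1, without_e) == detected(cell_2, without_e)

# When E is captured, the difference between the cells appears.
same_when_e_captured = detected(cell_1, all_genes) == detected(cell_2, all_genes)

print(same_when_e_missing, same_when_e_captured)
```

The cells themselves never change in this sketch. Only the set of genes the analysis can report changes, and that alone decides whether the two cells can be told apart.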

Led by Pool, a former Caltech postdoctoral scholar and the study's first author, the team optimized the reference transcriptomes for the mouse and human genomes and, over several years, built a computational framework to fix the reference transcriptomes of other organisms.

"Optimizing reference transcriptomes enables us to see cell types and states that otherwise we would be oblivious to," says Pool. "For example, with our optimized reference transcriptomes we are now able to observe the full repertoire of thirst-, satiety-, and temperature-sensing neural populations in our brain regions that we suspected would be there but were unable to detect. We expect our approach to also be highly useful in revealing new cellular and genetic diversity in existing and upcoming cell-type atlases for the brain and other organs."

Advances in sequencing data analysis such as these allow researchers to uncover previously unknown cell types and gene expression patterns, opening new possibilities for understanding the complexity of the human body and its functions. A deeper knowledge of these intricacies can clarify the mechanisms of disease and guide the development of more effective treatments. With continued research and development, this work has the potential to change how medical care is approached and to improve health outcomes.

Funding was provided by the Eugene McDermott Scholar Funds, the Peter O'Donnell Jr. Brain Institute at UT Southwestern, Caltech, the Searle Scholars Program, the Mallinckrodt Foundation, the McKnight Foundation, the Klingenstein-Simons Foundation, the New York Stem Cell Foundation, and the National Institutes of Health.