The grammar of cell development branching time
One of the greatest achievements of science in recent years is the technology for obtaining information about thousands of individual cells extracted from an organism. This technology includes the so-called "omics" for individual cells (genomics, epigenomics, transcriptomics, proteomics), which give us the data about the genomes of thousands of single cells, the state and activity of various genes in them, as well as the presence of various proteins in these cells.
It is convenient to present the data about each cell as a point in a highly multidimensional space. As a result, using the new technology, scientists around the world obtain thousands of points (cells) in a space of enormous dimension.
Data analysis methods, including topological and geometric analysis, "topological grammars", the "principal graphs method" and "data approximation", are an important element of this new technology for acquiring data about living organisms, a field that is huge in terms of both investment and number of players. Single-cell "omics" data open up colossal and not yet fully realized opportunities for the development of biology and personalized medicine.
The idea of branching development time allows one to convert the resulting mountains of data into a more understandable, readable and interpretable form. We can imagine that each cell lies on some development trajectory. These trajectories can branch at the point where the cell in its development makes the choice of one future variant from several possible ones. Geometrically, these development trajectories with bifurcation points represent the branching development time.
A new technology for extracting this branching time from the data was developed by a large international team of researchers, including 15 scientists from 6 countries: the USA, China, France, Italy, the UK and Russia.
Complex trees are constructed using elementary transformation grammars. At each step of the basic algorithm, the elementary transformation that gives the greatest gain in the quality of data approximation is chosen.
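The greedy step described above can be illustrated with a small toy sketch. This is not the STREAM or ElPiGraph implementation: the two rules ("bisect an edge", "add a node to a node"), the placement of new nodes, and the quality measure (mean squared distance from each point to its nearest node) are simplifying assumptions, and all function names are hypothetical. The real method also optimizes node positions and penalizes stretching and bending of the graph, which is omitted here.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def approximation_error(points, nodes):
    """Mean squared distance from each data point to its nearest tree node."""
    return sum(min(sq_dist(p, n) for n in nodes) for p in points) / len(points)


def candidates(points, nodes, edges):
    """Yield (rule, nodes, edges) for every elementary transformation of the tree."""
    new = len(nodes)  # index the added node will receive
    # Rule "bisect an edge": replace edge (i, j) by (i, new) and (new, j).
    for k, (i, j) in enumerate(edges):
        mid = tuple((a + b) / 2 for a, b in zip(nodes[i], nodes[j]))
        yield "bisect", nodes + [mid], edges[:k] + edges[k + 1:] + [(i, new), (new, j)]
    # Rule "add a node to a node" (assumed placement): attach a new leaf to
    # node i at the farthest data point among those whose nearest node is i.
    owned = {}
    for p in points:
        i = min(range(len(nodes)), key=lambda n: sq_dist(p, nodes[n]))
        owned.setdefault(i, []).append(p)
    for i, pts in owned.items():
        far = max(pts, key=lambda p: sq_dist(p, nodes[i]))
        if sq_dist(far, nodes[i]) > 0:
            yield "add_leaf", nodes + [tuple(far)], edges + [(i, new)]


def grow_tree(points, steps):
    """At each step, apply the transformation that gives the lowest error."""
    dim = len(points[0])
    center = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    start = max(points, key=lambda p: sq_dist(p, center))
    # Initial tree: one edge from the data centroid to the farthest point.
    nodes, edges = [center, tuple(start)], [(0, 1)]
    errors = [approximation_error(points, nodes)]
    for _ in range(steps):
        _, nodes, edges = min(
            candidates(points, nodes, edges),
            key=lambda c: approximation_error(points, c[1]),
        )
        errors.append(approximation_error(points, nodes))
    return nodes, edges, errors


# Y-shaped toy data: a stem and two arms.
points = ([(0.0, -float(t)) for t in range(6)]
          + [(float(t), float(t)) for t in range(1, 6)]
          + [(-float(t), float(t)) for t in range(1, 6)])
nodes, edges, errors = grow_tree(points, steps=4)
```

Because nodes are only added and never moved in this sketch, the recorded error cannot increase from one step to the next, and the graph remains a tree with one edge fewer than it has nodes.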
The method of topological grammars for processing complex data of a general nature was proposed as early as 2007 by Professor Alexander Gorban (Great Britain, currently supervising the implementation of a megagrant at the Lobachevsky State University of Nizhny Novgorod) and his former student Andrei Zinovyev (France, currently collaborating with the Lobachevsky University in the implementation of the megagrant).
"The concept of branching time (or, as it is often called, pseudo-time) arises in biology in the following way: cells and events that occur to them are placed along a certain graph (or, more formally, a one-dimensional continuum, since a graph is a discrete object). This branching continuum plays the same role in the analysis of developmental and differentiation events as linear time in other areas (a scale for event placement). No mystery or modification of physical time is involved. People have introduced this concept and many of them use it. It is convenient. The topology of this scale is extracted from data analysis. Next, the data is mapped on this scale," explains Alexander Gorban.
This method was developed further as part of a broad international cooperation organized by L. Pinello from Harvard University and was used to create a specialized software product, STREAM, which builds the branching time of cell development from single-cell "omics" data.
"Just imagine, fairly recently, we learned with great delight and a sense of wonder about the decoding of the human genome. The proposed new technology makes it possible to determine the status and activity of genes and other important data simultaneously for tens of thousands of cells taken from the body. This information will be determined for each of them individually rather than by giving some sort of average values. Thus, extremely important information will be provided about the development of an individual organism and the origination of various diseases such as cancer. However, one must be able to read, decipher and extract useful information from such data. We provide a tool for working with these data and extracting important information from them," continues Alexander Gorban.
"Both the graph and its embedding in the data space are built simultaneously, and then this embedding is used to map all the data. That is, a highly multidimensional space (with the dimension of hundreds of thousands) is reduced to a branching one-dimensional continuum," concludes Alexander Gorban.
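The mapping step mentioned in the quote can be sketched as follows, under stated assumptions: the tree (node coordinates and edges) is already given, each cell is projected onto its nearest edge, and the branching time is taken as the path length along the tree from a chosen root node to that projection. The function names and the toy tree are hypothetical and are not taken from STREAM.

```python
import math
from collections import deque


def project_on_segment(p, a, b):
    """Return (squared distance, fraction t in [0, 1]) for p projected on segment a-b."""
    ab = [y - x for x, y in zip(a, b)]
    ap = [y - x for x, y in zip(a, p)]
    denom = sum(c * c for c in ab)
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(u * v for u, v in zip(ap, ab)) / denom))
    proj = [x + t * c for x, c in zip(a, ab)]
    return sum((u - v) ** 2 for u, v in zip(p, proj)), t


def node_times(nodes, edges, root=0):
    """Path length along the tree from the root to every node (breadth-first walk)."""
    adj = {i: [] for i in range(len(nodes))}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    times = {root: 0.0}
    queue = deque([root])
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in times:
                times[j] = times[i] + math.dist(nodes[i], nodes[j])
                queue.append(j)
    return times


def map_cell(p, nodes, edges, times):
    """Return (edge, branching time) for the tree edge closest to point p."""
    best = None
    for i, j in edges:
        if times[i] > times[j]:  # orient the edge away from the root
            i, j = j, i
        d2, t = project_on_segment(p, nodes[i], nodes[j])
        if best is None or d2 < best[0]:
            best = (d2, (i, j), times[i] + t * math.dist(nodes[i], nodes[j]))
    return best[1], best[2]


# Toy Y-shaped tree in 2D: root 0, branching point 1, two leaves 2 and 3.
nodes = [(0.0, 0.0), (0.0, 2.0), (-1.0, 3.0), (1.0, 3.0)]
edges = [(0, 1), (1, 2), (1, 3)]
times = node_times(nodes, edges, root=0)
edge, pseudotime = map_cell((0.1, 1.0), nodes, edges, times)
```

In this toy example the cell at (0.1, 1.0) lands halfway along the stem, so it is assigned to the edge between nodes 0 and 1. The same code works in any dimension, since every cell ends up described by only a branch and a position along it.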
An article describing the method and the first results of its application has been published in a new issue of the journal Nature Communications:
H Chen, L Albergante, JY Hsu, CA Lareau, G Lo Bosco, J Guan, S Zhou, AN Gorban, DE Bauer, MJ Aryee, DM Langenau, A Zinovyev, JD Buenrostro, GC Yuan, and L Pinello, Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM, Nature Communications, volume 10, Article number: 1903 (2019), https:/
STREAM software, its ElPiGraph compute core, and other project-related programs are freely available online: https:/