In this document, we will provide an overview of RNA velocity analysis based on single-cell RNA-seq data. We will cover the underlying mathematical model and assumptions, how the raw data can be quantified in a suitable way, and say a few words about the different velocity estimation models implemented in current software packages.
The gene expression profile of a cell often gives a good indication of the current state of the cell. However, being a single snapshot in time, it does not tell us in which direction (in gene expression space) the cell is heading, or in what state it will be at some point in the near future, and thus it may not be ideal for studying dynamical aspects of biological systems. RNA velocity (introduced by La Manno et al. (2018)) is one approach to addressing this gap, still using a single snapshot of a population of cells and thus without the need to be able to follow the transcriptome of a single cell over time, which is not possible with current standard scRNA-seq protocols. In practice, RNA velocity analyses are often summarized by plots such as those shown below (from the scVelo tutorial): on the left, a vector field overlaid on a low-dimensional embedding, visualizing the ‘flow’ encoded by the velocities, and on the right, phase plots illustrating the spliced and unspliced expression levels for individual genes.
The RNA velocity is defined as the rate of change of the mature RNA abundance in a cell, and can be estimated from scRNA-seq data by joint modeling of estimated unspliced (pre-mRNA) and spliced (mature mRNA) abundances. This exploitation of the underlying molecular dynamics of the process sets it apart from other approaches for trajectory analysis, which typically use the similarity of the estimated gene expression profiles among cells to construct a path through the observed data. A thorough review of such trajectory inference methods was recently performed by Saelens et al. (2019).
The feasibility of the RNA velocity approach is based on the observation by La Manno et al. (2018) that, with several commonly used scRNA-seq library preparation protocols, not only exonic, but also intronic and exon/intron boundary-spanning reads are observed, and the insight that considering these in combination with the exonic reads would allow for inference of developmental relationships among cells. Similar observations, coupled with a simple differential equation model of transcriptional dynamics, have previously been used for investigation of pre-mRNA dynamics and transcriptional and post-transcriptional regulation of gene expression in exon arrays (Zeisel et al. 2011) and bulk RNA-seq (Gaidatzis et al. 2015), as well as estimation of transcription, processing and degradation rates in bulk RNA-seq (Gray et al. 2014).
Let
for a given gene in a given cell at time \(t\). The system of differential equations underlying the RNA velocity calculations, as formulated by Zeisel et al. (2011) and used in scVelo (Bergen et al. 2020), is as follows: \[\frac{du(t)}{dt}=\alpha_k(t)-\beta u(t)\\\frac{ds(t)}{dt}=\beta u(t)-\gamma s(t)\]
Let’s first break this down a bit: the first equation states that the amount of unspliced RNA increases with a rate \(\alpha_k(t)\); this is the rate of transcription, which depends on the state that the gene is in in this particular cell at time \(t\) (represented by the index \(k\)). At the same time, due to splicing the unspliced RNA abundance decreases with a rate that is proportional to the current abundance, with a proportionality constant \(\beta\), the splicing rate.
The second equation similarly states that the spliced RNA abundance increases via splicing, with the same amount as the unspliced abundance decreases, and further decreases via degradation, with a proportionality constant \(\gamma\). The RNA velocity of this gene at a given time point \(t\), in a given cell, is defined as the value of \(ds(t)/dt\), that is, the instantaneous rate of change of the mature mRNA abundance at time \(t\).
We can obtain an analytical solution to this set of equations, that is, we can derive an expression for the value of \(u(t)\) and \(s(t)\) at each point in time (see e.g., Bergen et al. (2020)). To get a better understanding for the model, we will inspect this solution for a few different values of \(\alpha_k(t)\), \(\beta\) and \(\gamma\). Note that this system of equations is solved separately for each gene, so the solutions visualized below represent the abundances for a single gene over time. It is also worth keeping in mind that for the scRNA-seq data, we can not follow a single cell over time, and we also do not know which “time point” along the trajectory that each cell corresponds to. Instead, this information must be inferred from the data as well.
First, we consider the case where \(\alpha\) is constant over the considered time span. Also \(\beta\) and \(\gamma\) are considered to be constant rates. We also assume that \(u_0=u(0)=0\) and \(s_0=s(0)=0\), so that at time 0, this gene is not expressed. The figures below show three different aspects of the model:
The first thing we notice is that as the gene starts being transcribed (small values of \(t\)), both the unspliced and the spliced abundances start increasing. As expected, the unspliced abundance increases first, followed a bit later by the spliced abundance (since the latter has to go via the unspliced state).
Next, we note that as time passes, both \(u(t)\) and \(s(t)\) reach a steady state, or equilibrium. This indicates that for both the unspliced and spliced pool of molecules, the “influx” is equal to the “outflux,” and as a consequence, both \(du(t)/dt\) and \(ds(t)/dt\) are 0. In other words, in this stage, the RNA velocity (\(ds(t)/dt\)) is zero. The exact levels of \(u(t)\) and \(s(t)\) in the equilibrium phase depend on the values of the parameters \(\alpha\), \(\beta\) and \(\gamma\), but the overall pattern is the same.
Let’s see what this looks like in the phase plot on the right hand side. The equilibrium point is represented in the top right corner of this plot, and the times leading up to this equilibrium are represented by the points above the diagonal line. We can derive the slope of the diagonal line by considering the system of equations above. As we saw, in the equilibrium, \[\frac{ds(t)}{dt}=0,\]and thus \[\beta u(t)=\gamma s(t).\] Hence, \[u(t)=(\gamma/\beta) s(t),\]and the slope of the line (assuming that it passes through (0,0), which represents the initial condition) is \(\gamma/\beta\).
How can we use this observation in order to estimate RNA velocities when we don’t know the values of \(\gamma\) and \(\beta\) in advance, and need to estimate them from the data? Let’s assume that the system we are studying is such that we will eventually reach the steady state (we’ll see below an example of when this assumption may fail). Of course, in practice we don’t have a time course of abundances of this gene for a single cell, but we have multiple cells, and at our particular snapshot in time, these cells will likely be at different points along the trajectory (corresponding to different values of the underlying time \(t\)). Thus, if we assume that the model parameters are the same for all cells (shared splicing and degradation rates), we will still get a phase plot similar to the one above, but each point would correspond to a distinct cell. If we then restrict our attention to the cells in the top right corner and bottom left corner of the plot (those assumed to be in steady state), we can estimate the value of \(\gamma/\beta\) as the slope of a straight line fitted to these “extreme quantile” data points.
This is, in essence, the idea behind the approach taken by La Manno et al. (2018) and implemented in the velocyto software, and also in the ‘steady-state’ model in scVelo. If we fix one of the parameter values (e.g., setting \(\beta=1\) as in La Manno et al. (2018), corresponding to an assumption of a shared splicing rate between genes) we can estimate the other one (\(\gamma\)), and consequently obtain an estimate of the RNA velocity \(v\), since \[v=\frac{ds(t)}{dt}=\beta u(t)-\gamma s(t).\] Notably, these velocities can be derived directly from the phase plot:
Consider any point along the trajectory. By construction, the y coordinate of this point is equal to \(u(t)\). At the same time, the y coordinate of the diagonal line for the corresponding value of \(s(t)\) is \((\gamma/\beta)s(t)\), and thus the difference between the two (the vertical displacement of the point on the trajectory from the diagonal line defined by the steady state) is \[u(t)-\frac{\gamma}{\beta}s(t),\]which is proportional to the velocity as defined above (and, in fact, equal if \(\beta=1\)).
Once velocities have been estimated, approximate abundances at a future time point \(t+\Delta t\) can be estimated e.g. by assuming that the velocity stays constant for this time period: \[s(t+\Delta t) = s(t) + v\Delta t.\]
Next, let’s consider another example, where instead of a constant transcription we consider a transient pattern, where \(\alpha>0\) for some time before going back to zero.
Here, we observe the same type of increase in \(u(t)\) and \(s(t)\) in the beginning of the time course (of course, the system doesn’t know that \(\alpha\) will go down to zero in the future). When transcription is turned off, both \(u(t)\) and \(s(t)\) decrease in an exponential fashion. Again, the unspliced abundance decreases first, followed by the spliced abundance.
In the phase plot to the right, we get the same type of pattern as previously for the induction phase (in green). Recall that the diagonal line corresponds to the steady-state situation, where \(ds(t)/dt = 0\). Points above this line represent an induction of gene expression, where the unspliced abundance is higher than the spliced abundance (relative to what we expect under equilibrium). Similarly, points below the line (here, in blue), represent a decrease of gene expression, where there is relatively low amounts of unspliced RNA compared to what would be expected under steady state.
Just as in the case above, if we now assume that each point represents a single cell, we can estimate the ratio \(\gamma/\beta\) based on the cells in the top-right and bottom-left parts of the phase plot. We can also estimate velocities in the same way as described previously; points above the diagonal line will have positive velocities (induction) and points below the diagonal line will have negative velocities (repression).
Finally, let’s consider a case with the same type of transient transcription as above, but with a much shorter duration.
Now, there is not enough time to reach the steady state. This complicates the estimation of \(\gamma/\beta\) based on the phase plot - if we use the points in the upper right part of the plot we will overestimate the ratio. In velocyto, special approaches are implemented to estimate velocities for genes where all cells are far from steady state. scVelo, instead, implements also a dynamical model, where the full dynamics of the splicing kinetic are solved, thus avoiding the reliance on cells in the steady state to base the estimation of \(\gamma/\beta\) on. We will discuss the dynamical model in a bit more detail later in this document
To solve the system of differential equations, or estimate the velocities, we need to input both spliced (mature mRNA) and unspliced (pre-mRNA) abundances for the genes across all our cells. Note that this is not what is typically output from gene expression quantification pipelines, and thus we need a dedicated quantification approach for RNA velocity applications.
Several software packages are able to provide quantifications suitable for RNA velocity analyses. Just as for gene expression estimation, the software of choice depends on the scRNA-seq protocol. For protocols including UMIs, the software of course needs to be able to handle UMI counting. For full-length protocols without UMIs, on the other hand, it is common to use software developed for bulk RNA-seq data. Here, we focus on droplet-based protocols with UMIs, such as those provided by 10x Genomics. For such data, the most widely used software tools for estimating spliced and unspliced abundances are
Each of these have their own characteristic features, and the quantifications (and subsequent velocities) will depend on the chosen quantification method (see e.g. Soneson et al. (2021)).
One important characteristic that differ between the software tools just mentioned is the type of reference sequence they operate with. velocyto and STARsolo align the reads (or assume that the reads are aligned) to the reference genome, while alevin, alevin-fry and kallisto|bustools instead map the reads to an ‘expanded’ transcriptome. The ‘expansion’ here refers to the inclusion of some form of unspliced or intronic sequences, which are used to represent the pre-mRNA features.
It is worth noting that from a conceptual point of view, we would like the unspliced abundances to represent the amount of pre-mRNA in the cells. However, this comes with some challenges. First, with 3’ tag data such as those obtained with 10x Genomics protocols, we may not have enough information available to reliably assign reads falling in exons to the spliced or unspliced variant of the gene (see e.g. Soneson et al. (2021) for some observations in this direction). Second, even if we do have more even coverage along the entire transcript, we have to acknowledge that there may be partially spliced molecules present, and possibly include a large number of these as potential reference sequences for the quantification to be accurate. For these and other reasons, and since for the majority of the annotated pre-mRNA sequences the largest part of their length is in fact composed of introns, we often resort to representing the unspliced abundance by the number of reads assigned to introns, rather than to pre-mRNA molecules.
The genome alignment-based methods (velocyto and STARsolo) consider the position where reads are mapped to the genome, and compare those to the positions of annotated exons and introns. Reads aligning completely within unambiguously exonic regions are considered ‘exonic’ (or ‘spliced’), and those aligning completely within unambiguously intronic regions are considered ‘intronic’ (or ‘unspliced’). Reads overlapping multiple genes are typically discarded, but the most recent version of STARsolo implements several approaches for retaining them and distributing them appropriately among the possible targets. It is worth noting that in general, considering also intronic sequences as features of interest considerably increases the overlap among genes, and thus makes quantification more challenging.
Reads which align in regions that can be either exonic or intronic depending on which isoform one considers, are assigned or discarded as ambiguous according to a (sometimes intricate) set of rules. The velocyto software, for example, provides several such sets of rules, differing in their ‘permissiveness’ in allowing ambiguous reads.
For the transcriptome alignment-based methods (alevin, alevin-fry and kallisto|bustools), the user has to choose or construct the desired set of reference sequences before creating the quantification index to which the reads will be aligned. As mentioned, introns are often chosen as the representatives of the unspliced features, but the downstream workflow would be unaltered if instead the full pre-mRNA sequences were indexed.
Even if we decide to use introns to represent the unspliced features, there are still challenges remaining. Let’s consider the gene in the figure below.
It has two transcript isoforms, one with two exons and one with three exons. The isoforms are partly overlapping. While the transcript sequences are straightforward to determine (and also directly provided in the reference transcriptome fasta file), the introns are more ambiguous. Essentially we have two choices: either we define the introns separately for each isoform, or we first collapse the isoforms into the union of their exonic regions, and define the introns as the regions that are within the gene locus, but not included in the exon union.
As you can see in the two illustrations above, with the “separate” approach the introns of a gene can overlap with each other and with the exonic regions, while this is not allowed in the “collapse” approach. Another way to see this is that with the “separate” approach we are considering exons and introns on an equal footing, whereas with the “collapse” approach reads in ambiguous regions will be considered exonic.
Since ‘intronic’ reads can be either fully intronic or partly exonic and partly intronic (overlapping an exon/intron boundary), we add a flanking sequence to each extracted intron to allow for the latter type of reads. The length of this flanking sequence can be anywhere up to the length of the RNA read minus 1, depending on how much intronic overlap we want to require in order for a read to be considered potentially intronic.
In the lower part of the figure above, we summarize the features (transcripts + introns) that will be indexed for each of the two approaches. In each case, the ‘spliced’ abundance of the gene is defined as the sum of the abundances of the individual transcripts, and the ‘unspliced’ gene abundance is defined as the sum of the abundances of the introns.
Functionality for extracting introns according to the rules described above is available e.g. in the eisaR Bioconductor package. For example:
## Extract a GRangesList with spliced and intronic feature coordinates
<- eisaR::getFeatureRanges(
grl gtf = file.path(datadir, "GRCm38.gencode.vM21.chr18.chr19.gtf"),
featureType = c("spliced", "intron"),
intronType = "separate",
flankLength = 90L,
joinOverlappingIntrons = FALSE,
collapseIntronsByGene = TRUE,
keepIntronInFeature = TRUE,
verbose = TRUE
)#> Import genomic features from the file as a GRanges object ... OK
#> Prepare the 'metadata' data frame ... OK
#> Make the TxDb object ...
#> Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type
#> stop_codon. This information was ignored.
#> OK
#> 'select()' returned 1:1 mapping between keys and columns
#> Extracting spliced transcript features
#> Extracting introns using the separate approach
head(grl, 2)
#> GRangesList object of length 2:
#> $ENSMUST00000234132.1
#> GRanges object with 1 range and 5 metadata columns:
#> seqnames ranges strand | exon_id exon_rank
#> <Rle> <IRanges> <Rle> | <character> <integer>
#> [1] chr18 3015908-3016159 + | ENSMUSE00001453483.1 1
#> transcript_id gene_id type
#> <character> <character> <character>
#> [1] ENSMUST00000234132.1 ENSMUSG00000117547.1 exon
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
#>
#> $ENSMUST00000181558.2
#> GRanges object with 2 ranges and 5 metadata columns:
#> seqnames ranges strand | exon_id exon_rank
#> <Rle> <IRanges> <Rle> | <character> <integer>
#> [1] chr18 3336680-3337178 + | ENSMUSE00001167652.2 1
#> [2] chr18 3365927-3366863 + | ENSMUSE00001111527.1 2
#> transcript_id gene_id type
#> <character> <character> <character>
#> [1] ENSMUST00000181558.2 ENSMUSG00000097746.2 exon
#> [2] ENSMUST00000181558.2 ENSMUSG00000097746.2 exon
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
tail(grl, 2)
#> GRangesList object of length 2:
#> $`ENSMUSG00000118392.1-I`
#> GRanges object with 1 range and 5 metadata columns:
#> seqnames ranges strand | exon_id exon_rank
#> <Rle> <IRanges> <Rle> | <character> <integer>
#> [1] chr19 10645866-10646923 - | ENSMUSG00000118392.1-I 1
#> transcript_id gene_id type
#> <character> <character> <character>
#> [1] ENSMUSG00000118392.1-I ENSMUSG00000118392.1-I exon
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
#>
#> $`ENSMUSG00000118392.1-I1`
#> GRanges object with 1 range and 5 metadata columns:
#> seqnames ranges strand | exon_id exon_rank
#> <Rle> <IRanges> <Rle> | <character> <integer>
#> [1] chr19 10646956-10648025 - | ENSMUSG00000118392.1.. 1
#> transcript_id gene_id type
#> <character> <character> <character>
#> [1] ENSMUSG00000118392.1.. ENSMUSG00000118392.1-I exon
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
## Load the genome sequence
<- Biostrings::readDNAStringSet(
genome file.path(datadir, "GRCm38.gencode.vM21.chr18.chr19.genome.fa")
)names(genome) <- sapply(strsplit(names(genome), " "), .subset, 1)
genome#> DNAStringSet object of length 2:
#> width seq names
#> [1] 90702639 NNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr18
#> [2] 61431566 NNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr19
## Extract the sequences of the spliced and intronic features
<- GenomicFeatures::extractTranscriptSeqs(
seqs x = genome,
transcripts = grl
)
seqs#> DNAStringSet object of length 11425:
#> width seq names
#> [1] 252 CCTTAACCATAGGTACAGGT...CTTCTCCTGCCCACGTAGCC ENSMUST00000234132.1
#> [2] 1436 GATCGGCCCCCGTTTCACAC...TTCAGCGATTACTTCAAACT ENSMUST00000181558.2
#> [3] 6745 ACACCGGCACCAGGGCTCCA...TTTGAGATTCTATAGCCAAG ENSMUST00000234647.1
#> [4] 3727 ATTGCATCCTGCGGGTGTGT...TATTAAATTAAGTCATAACC ENSMUST0000002507...
#> [5] 2799 GGTGGCCGGCCGGGGAGGCG...ACATGATGTTGCTGCTTGTC ENSMUST0000008008...
#> ... ... ...
#> [11421] 397 AATTGGGACACACTTACAGA...AGAAAATCTTTGTTGGTGGT ENSMUSG0000011838...
#> [11422] 363 AGGCAGAGTGGGAAAAAGAG...GAGGTCGTGGAGGTGGAGCT ENSMUSG0000011838...
#> [11423] 103679 AACCTGAAGTTAACTGCCAG...AATGGAGTAGCTGAGTTCTG ENSMUSG0000011839...
#> [11424] 1058 GGTTCCAGCGGCTGCCTGAT...CTCAGAGGTTAAGGGCACAG ENSMUSG0000011839...
#> [11425] 1070 TGAAAGGATCCAAAATCTCG...CAGCCTGGGGACTTCACACC ENSMUSG0000011839...
The set of sequences obtained in the last step above can then be indexed using either Salmon or kallisto for further quantification using alevin, alevin-fry or kallisto|bustools.
There are also differences between the quantification tools listed beyond their choice of reference sequence to align or map the reads to. For example:
In order to better understand some of these differences, we show below a few examples where different methods provide different quantifications (from Soneson et al. (2021)):
These differences between counts obtained by different methods propagate also to the estimated velocities, and can affect the biological interpretation of the final results.
As mentioned already above, there are several tools available for estimating velocities from spliced and unspliced abundance estimates. The widely used scVelo python package extends the original capabilities provided by velocyto (which has both an R and a python interface), and implements three different models for estimating velocities. These are described in much more detail in Bergen et al. (2020), but recapped here.
In the practical exercises, we will see how to use the velociraptor Bioconductor package to estimate and interpret RNA velocity based on a pre-quantified data set. The velociraptor package provides a wrapper around scVelo that allows us to run the RNA velocity analysis directly in an R session.