Introduction and structure

In this document, we will provide an overview of RNA velocity analysis based on single-cell RNA-seq data. We will cover the underlying mathematical model and assumptions, how the raw data can be quantified in a suitable way, and say a few words about the different velocity estimation models implemented in current software packages.

RNA velocity

The gene expression profile of a cell often gives a good indication of the current state of the cell. However, being a single snapshot in time, it does not tell us in which direction (in gene expression space) the cell is heading, or in what state it will be at some point in the near future, and thus it may not be ideal for studying dynamical aspects of biological systems. RNA velocity (introduced by La Manno et al. (2018)) is one approach to addressing this gap, still using a single snapshot of a population of cells and thus without the need to be able to follow the transcriptome of a single cell over time, which is not possible with current standard scRNA-seq protocols. In practice, RNA velocity analyses are often summarized by plots such as those shown below (from the scVelo tutorial): on the left, a vector field overlaid on a low-dimensional embedding, visualizing the ‘flow’ encoded by the velocities, and on the right, phase plots illustrating the spliced and unspliced expression levels for individual genes.

The RNA velocity is defined as the rate of change of the mature RNA abundance in a cell, and can be estimated from scRNA-seq data by joint modeling of estimated unspliced (pre-mRNA) and spliced (mature mRNA) abundances. This exploitation of the underlying molecular dynamics of the process sets it apart from other approaches for trajectory analysis, which typically use the similarity of the estimated gene expression profiles among cells to construct a path through the observed data. A thorough review of such trajectory inference methods was recently performed by Saelens et al. (2019).

The feasibility of the RNA velocity approach is based on the observation by La Manno et al. (2018) that, with several commonly used scRNA-seq library preparation protocols, not only exonic, but also intronic and exon/intron boundary-spanning reads are observed, and the insight that considering these in combination with the exonic reads would allow for inference of developmental relationships among cells. Similar observations, coupled with a simple differential equation model of transcriptional dynamics, have previously been used for investigation of pre-mRNA dynamics and transcriptional and post-transcriptional regulation of gene expression in exon arrays (Zeisel et al. 2011) and bulk RNA-seq (Gaidatzis et al. 2015), as well as estimation of transcription, processing and degradation rates in bulk RNA-seq (Gray et al. 2014).

Mathematical model - formulation

Let

\(u(t)\) denote the abundance of unspliced pre-mRNA
\(s(t)\) denote the abundance of mature mRNA

for a given gene in a given cell at time \(t\). The system of differential equations underlying the RNA velocity calculations, as formulated by Zeisel et al. (2011) and used in scVelo (Bergen et al. 2020), is as follows: \[\frac{du(t)}{dt}=\alpha_k(t)-\beta u(t)\\\frac{ds(t)}{dt}=\beta u(t)-\gamma s(t)\]

Let’s first break this down a bit: the first equation states that the amount of unspliced RNA increases with a rate \(\alpha_k(t)\); this is the rate of transcription, which depends on the state that the gene is in in this particular cell at time \(t\) (represented by the index \(k\)). At the same time, due to splicing the unspliced RNA abundance decreases with a rate that is proportional to the current abundance, with a proportionality constant \(\beta\), the splicing rate.

The second equation similarly states that the spliced RNA abundance increases via splicing, with the same amount as the unspliced abundance decreases, and further decreases via degradation, with a proportionality constant \(\gamma\). The RNA velocity of this gene at a given time point \(t\), in a given cell, is defined as the value of \(ds(t)/dt\), that is, the instantaneous rate of change of the mature mRNA abundance at time \(t\).

Mathematical model - solution

We can obtain an analytical solution to this set of equations, that is, we can derive an expression for the value of \(u(t)\) and \(s(t)\) at each point in time (see e.g., Bergen et al. (2020)). To get a better understanding for the model, we will inspect this solution for a few different values of \(\alpha_k(t)\), \(\beta\) and \(\gamma\). Note that this system of equations is solved separately for each gene, so the solutions visualized below represent the abundances for a single gene over time. It is also worth keeping in mind that for the scRNA-seq data, we can not follow a single cell over time, and we also do not know which “time point” along the trajectory that each cell corresponds to. Instead, this information must be inferred from the data as well.

Constant transcription

First, we consider the case where \(\alpha\) is constant over the considered time span. Also \(\beta\) and \(\gamma\) are considered to be constant rates. We also assume that \(u_0=u(0)=0\) and \(s_0=s(0)=0\), so that at time 0, this gene is not expressed. The figures below show three different aspects of the model:

The lower left panel illustrates the value of \(\alpha\) as a function of time.
The upper left panel shows the value of \(u(t)\) and \(s(t)\) as functions of time.
The right panel shows a phase plot, with \(s(t)\) and \(u(t)\) plotted against each other. Each dot corresponds to one time point, and the points have been slightly jittered for better visibility.

The first thing we notice is that as the gene starts being transcribed (small values of \(t\)), both the unspliced and the spliced abundances start increasing. As expected, the unspliced abundance increases first, followed a bit later by the spliced abundance (since the latter has to go via the unspliced state).

Next, we note that as time passes, both \(u(t)\) and \(s(t)\) reach a steady state, or equilibrium. This indicates that for both the unspliced and spliced pool of molecules, the “influx” is equal to the “outflux,” and as a consequence, both \(du(t)/dt\) and \(ds(t)/dt\) are 0. In other words, in this stage, the RNA velocity (\(ds(t)/dt\)) is zero. The exact levels of \(u(t)\) and \(s(t)\) in the equilibrium phase depend on the values of the parameters \(\alpha\), \(\beta\) and \(\gamma\), but the overall pattern is the same.

Let’s see what this looks like in the phase plot on the right hand side. The equilibrium point is represented in the top right corner of this plot, and the times leading up to this equilibrium are represented by the points above the diagonal line. We can derive the slope of the diagonal line by considering the system of equations above. As we saw, in the equilibrium, \[\frac{ds(t)}{dt}=0,\]and thus \[\beta u(t)=\gamma s(t).\] Hence, \[u(t)=(\gamma/\beta) s(t),\]and the slope of the line (assuming that it passes through (0,0), which represents the initial condition) is \(\gamma/\beta\).

How can we use this observation in order to estimate RNA velocities when we don’t know the values of \(\gamma\) and \(\beta\) in advance, and need to estimate them from the data? Let’s assume that the system we are studying is such that we will eventually reach the steady state (we’ll see below an example of when this assumption may fail). Of course, in practice we don’t have a time course of abundances of this gene for a single cell, but we have multiple cells, and at our particular snapshot in time, these cells will likely be at different points along the trajectory (corresponding to different values of the underlying time \(t\)). Thus, if we assume that the model parameters are the same for all cells (shared splicing and degradation rates), we will still get a phase plot similar to the one above, but each point would correspond to a distinct cell. If we then restrict our attention to the cells in the top right corner and bottom left corner of the plot (those assumed to be in steady state), we can estimate the value of \(\gamma/\beta\) as the slope of a straight line fitted to these “extreme quantile” data points.

This is, in essence, the idea behind the approach taken by La Manno et al. (2018) and implemented in the velocyto software, and also in the ‘steady-state’ model in scVelo. If we fix one of the parameter values (e.g., setting \(\beta=1\) as in La Manno et al. (2018), corresponding to an assumption of a shared splicing rate between genes) we can estimate the other one (\(\gamma\)), and consequently obtain an estimate of the RNA velocity \(v\), since \[v=\frac{ds(t)}{dt}=\beta u(t)-\gamma s(t).\] Notably, these velocities can be derived directly from the phase plot:

Consider any point along the trajectory. By construction, the y coordinate of this point is equal to \(u(t)\). At the same time, the y coordinate of the diagonal line for the corresponding value of \(s(t)\) is \((\gamma/\beta)s(t)\), and thus the difference between the two (the vertical displacement of the point on the trajectory from the diagonal line defined by the steady state) is \[u(t)-\frac{\gamma}{\beta}s(t),\]which is proportional to the velocity as defined above (and, in fact, equal if \(\beta=1\)).

Once velocities have been estimated, approximate abundances at a future time point \(t+\Delta t\) can be estimated e.g. by assuming that the velocity stays constant for this time period: \[s(t+\Delta t) = s(t) + v\Delta t.\]

Transient transcription

Next, let’s consider another example, where instead of a constant transcription we consider a transient pattern, where \(\alpha>0\) for some time before going back to zero.

Here, we observe the same type of increase in \(u(t)\) and \(s(t)\) in the beginning of the time course (of course, the system doesn’t know that \(\alpha\) will go down to zero in the future). When transcription is turned off, both \(u(t)\) and \(s(t)\) decrease in an exponential fashion. Again, the unspliced abundance decreases first, followed by the spliced abundance.

In the phase plot to the right, we get the same type of pattern as previously for the induction phase (in green). Recall that the diagonal line corresponds to the steady-state situation, where \(ds(t)/dt = 0\). Points above this line represent an induction of gene expression, where the unspliced abundance is higher than the spliced abundance (relative to what we expect under equilibrium). Similarly, points below the line (here, in blue), represent a decrease of gene expression, where there is relatively low amounts of unspliced RNA compared to what would be expected under steady state.

Just as in the case above, if we now assume that each point represents a single cell, we can estimate the ratio \(\gamma/\beta\) based on the cells in the top-right and bottom-left parts of the phase plot. We can also estimate velocities in the same way as described previously; points above the diagonal line will have positive velocities (induction) and points below the diagonal line will have negative velocities (repression).

Short transient transcription

Finally, let’s consider a case with the same type of transient transcription as above, but with a much shorter duration.

Now, there is not enough time to reach the steady state. This complicates the estimation of \(\gamma/\beta\) based on the phase plot - if we use the points in the upper right part of the plot we will overestimate the ratio. In velocyto, special approaches are implemented to estimate velocities for genes where all cells are far from steady state. scVelo, instead, implements also a dynamical model, where the full dynamics of the splicing kinetic are solved, thus avoiding the reliance on cells in the steady state to base the estimation of \(\gamma/\beta\) on. We will discuss the dynamical model in a bit more detail later in this document

Input data

To solve the system of differential equations, or estimate the velocities, we need to input both spliced (mature mRNA) and unspliced (pre-mRNA) abundances for the genes across all our cells. Note that this is not what is typically output from gene expression quantification pipelines, and thus we need a dedicated quantification approach for RNA velocity applications.

Software tools - overview

Several software packages are able to provide quantifications suitable for RNA velocity analyses. Just as for gene expression estimation, the software of choice depends on the scRNA-seq protocol. For protocols including UMIs, the software of course needs to be able to handle UMI counting. For full-length protocols without UMIs, on the other hand, it is common to use software developed for bulk RNA-seq data. Here, we focus on droplet-based protocols with UMIs, such as those provided by 10x Genomics. For such data, the most widely used software tools for estimating spliced and unspliced abundances are

velocyto (La Manno et al. 2018), the original implementation of the RNA velocity pipeline. Accepts bam files with reads aligned to the genome, e.g., from CellRanger (Zheng et al. 2017).
STARsolo (Kaminow, Yunusov, and Dobin 2021), the ‘single-cell mode’ of the STAR aligner. Operates directly on the raw FASTQ files.
alevin (Srivastava et al. 2019), which is part of Salmon (Patro et al. 2017). Also operates directly on the FASTQ files.
alevin-fry (He et al. 2021), the successor of alevin, which operates directly on the FASTQ files, and is more modular than alevin.
kallisto|bustools, or the kb-python wrapper (Melsted, Ntranos, and Pachter 2019; Melsted et al. 2021), which leverages kallisto (Bray et al. 2016) for quantification of droplet single-cell data. Also operates directly on the FASTQ files.

Each of these have their own characteristic features, and the quantifications (and subsequent velocities) will depend on the chosen quantification method (see e.g. Soneson et al. (2021)).

Software tools - reference sequence

One important characteristic that differ between the software tools just mentioned is the type of reference sequence they operate with. velocyto and STARsolo align the reads (or assume that the reads are aligned) to the reference genome, while alevin, alevin-fry and kallisto|bustools instead map the reads to an ‘expanded’ transcriptome. The ‘expansion’ here refers to the inclusion of some form of unspliced or intronic sequences, which are used to represent the pre-mRNA features.

It is worth noting that from a conceptual point of view, we would like the unspliced abundances to represent the amount of pre-mRNA in the cells. However, this comes with some challenges. First, with 3’ tag data such as those obtained with 10x Genomics protocols, we may not have enough information available to reliably assign reads falling in exons to the spliced or unspliced variant of the gene (see e.g. Soneson et al. (2021) for some observations in this direction). Second, even if we do have more even coverage along the entire transcript, we have to acknowledge that there may be partially spliced molecules present, and possibly include a large number of these as potential reference sequences for the quantification to be accurate. For these and other reasons, and since for the majority of the annotated pre-mRNA sequences the largest part of their length is in fact composed of introns, we often resort to representing the unspliced abundance by the number of reads assigned to introns, rather than to pre-mRNA molecules.

The genome alignment-based methods (velocyto and STARsolo) consider the position where reads are mapped to the genome, and compare those to the positions of annotated exons and introns. Reads aligning completely within unambiguously exonic regions are considered ‘exonic’ (or ‘spliced’), and those aligning completely within unambiguously intronic regions are considered ‘intronic’ (or ‘unspliced’). Reads overlapping multiple genes are typically discarded, but the most recent version of STARsolo implements several approaches for retaining them and distributing them appropriately among the possible targets. It is worth noting that in general, considering also intronic sequences as features of interest considerably increases the overlap among genes, and thus makes quantification more challenging.

Reads which align in regions that can be either exonic or intronic depending on which isoform one considers, are assigned or discarded as ambiguous according to a (sometimes intricate) set of rules. The velocyto software, for example, provides several such sets of rules, differing in their ‘permissiveness’ in allowing ambiguous reads.

For the transcriptome alignment-based methods (alevin, alevin-fry and kallisto|bustools), the user has to choose or construct the desired set of reference sequences before creating the quantification index to which the reads will be aligned. As mentioned, introns are often chosen as the representatives of the unspliced features, but the downstream workflow would be unaltered if instead the full pre-mRNA sequences were indexed.

Definition of intronic regions

Even if we decide to use introns to represent the unspliced features, there are still challenges remaining. Let’s consider the gene in the figure below.

It has two transcript isoforms, one with two exons and one with three exons. The isoforms are partly overlapping. While the transcript sequences are straightforward to determine (and also directly provided in the reference transcriptome fasta file), the introns are more ambiguous. Essentially we have two choices: either we define the introns separately for each isoform, or we first collapse the isoforms into the union of their exonic regions, and define the introns as the regions that are within the gene locus, but not included in the exon union.

As you can see in the two illustrations above, with the “separate” approach the introns of a gene can overlap with each other and with the exonic regions, while this is not allowed in the “collapse” approach. Another way to see this is that with the “separate” approach we are considering exons and introns on an equal footing, whereas with the “collapse” approach reads in ambiguous regions will be considered exonic.

Since ‘intronic’ reads can be either fully intronic or partly exonic and partly intronic (overlapping an exon/intron boundary), we add a flanking sequence to each extracted intron to allow for the latter type of reads. The length of this flanking sequence can be anywhere up to the length of the RNA read minus 1, depending on how much intronic overlap we want to require in order for a read to be considered potentially intronic.

In the lower part of the figure above, we summarize the features (transcripts + introns) that will be indexed for each of the two approaches. In each case, the ‘spliced’ abundance of the gene is defined as the sum of the abundances of the individual transcripts, and the ‘unspliced’ gene abundance is defined as the sum of the abundances of the introns.

Functionality for extracting introns according to the rules described above is available e.g. in the eisaR Bioconductor package. For example:

## Extract a GRangesList with spliced and intronic feature coordinates
grl <- eisaR::getFeatureRanges(
    gtf = file.path(datadir, "GRCm38.gencode.vM21.chr18.chr19.gtf"),
    featureType = c("spliced", "intron"), 
    intronType = "separate", 
    flankLength = 90L, 
    joinOverlappingIntrons = FALSE, 
    collapseIntronsByGene = TRUE,
    keepIntronInFeature = TRUE,
    verbose = TRUE
)
#> Import genomic features from the file as a GRanges object ... OK
#> Prepare the 'metadata' data frame ... OK
#> Make the TxDb object ...
#> Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type
#>   stop_codon. This information was ignored.
#> OK
#> 'select()' returned 1:1 mapping between keys and columns
#> Extracting spliced transcript features
#> Extracting introns using the separate approach
head(grl, 2)
#> GRangesList object of length 2:
#> $ENSMUST00000234132.1
#> GRanges object with 1 range and 5 metadata columns:
#>       seqnames          ranges strand |              exon_id exon_rank
#>          <Rle>       <IRanges>  <Rle> |          <character> <integer>
#>   [1]    chr18 3015908-3016159      + | ENSMUSE00001453483.1         1
#>              transcript_id              gene_id        type
#>                <character>          <character> <character>
#>   [1] ENSMUST00000234132.1 ENSMUSG00000117547.1        exon
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $ENSMUST00000181558.2
#> GRanges object with 2 ranges and 5 metadata columns:
#>       seqnames          ranges strand |              exon_id exon_rank
#>          <Rle>       <IRanges>  <Rle> |          <character> <integer>
#>   [1]    chr18 3336680-3337178      + | ENSMUSE00001167652.2         1
#>   [2]    chr18 3365927-3366863      + | ENSMUSE00001111527.1         2
#>              transcript_id              gene_id        type
#>                <character>          <character> <character>
#>   [1] ENSMUST00000181558.2 ENSMUSG00000097746.2        exon
#>   [2] ENSMUST00000181558.2 ENSMUSG00000097746.2        exon
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
tail(grl, 2)
#> GRangesList object of length 2:
#> $`ENSMUSG00000118392.1-I`
#> GRanges object with 1 range and 5 metadata columns:
#>       seqnames            ranges strand |                exon_id exon_rank
#>          <Rle>         <IRanges>  <Rle> |            <character> <integer>
#>   [1]    chr19 10645866-10646923      - | ENSMUSG00000118392.1-I         1
#>                transcript_id                gene_id        type
#>                  <character>            <character> <character>
#>   [1] ENSMUSG00000118392.1-I ENSMUSG00000118392.1-I        exon
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $`ENSMUSG00000118392.1-I1`
#> GRanges object with 1 range and 5 metadata columns:
#>       seqnames            ranges strand |                exon_id exon_rank
#>          <Rle>         <IRanges>  <Rle> |            <character> <integer>
#>   [1]    chr19 10646956-10648025      - | ENSMUSG00000118392.1..         1
#>                transcript_id                gene_id        type
#>                  <character>            <character> <character>
#>   [1] ENSMUSG00000118392.1.. ENSMUSG00000118392.1-I        exon
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

## Load the genome sequence
genome <- Biostrings::readDNAStringSet(
  file.path(datadir, "GRCm38.gencode.vM21.chr18.chr19.genome.fa")
)
names(genome) <- sapply(strsplit(names(genome), " "), .subset, 1)
genome
#> DNAStringSet object of length 2:
#>        width seq                                            names               
#> [1] 90702639 NNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr18
#> [2] 61431566 NNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNN chr19

## Extract the sequences of the spliced and intronic features
seqs <- GenomicFeatures::extractTranscriptSeqs(
  x = genome, 
  transcripts = grl
)
seqs
#> DNAStringSet object of length 11425:
#>           width seq                                         names               
#>     [1]     252 CCTTAACCATAGGTACAGGT...CTTCTCCTGCCCACGTAGCC ENSMUST00000234132.1
#>     [2]    1436 GATCGGCCCCCGTTTCACAC...TTCAGCGATTACTTCAAACT ENSMUST00000181558.2
#>     [3]    6745 ACACCGGCACCAGGGCTCCA...TTTGAGATTCTATAGCCAAG ENSMUST00000234647.1
#>     [4]    3727 ATTGCATCCTGCGGGTGTGT...TATTAAATTAAGTCATAACC ENSMUST0000002507...
#>     [5]    2799 GGTGGCCGGCCGGGGAGGCG...ACATGATGTTGCTGCTTGTC ENSMUST0000008008...
#>     ...     ... ...
#> [11421]     397 AATTGGGACACACTTACAGA...AGAAAATCTTTGTTGGTGGT ENSMUSG0000011838...
#> [11422]     363 AGGCAGAGTGGGAAAAAGAG...GAGGTCGTGGAGGTGGAGCT ENSMUSG0000011838...
#> [11423]  103679 AACCTGAAGTTAACTGCCAG...AATGGAGTAGCTGAGTTCTG ENSMUSG0000011839...
#> [11424]    1058 GGTTCCAGCGGCTGCCTGAT...CTCAGAGGTTAAGGGCACAG ENSMUSG0000011839...
#> [11425]    1070 TGAAAGGATCCAAAATCTCG...CAGCCTGGGGACTTCACACC ENSMUSG0000011839...

The set of sequences obtained in the last step above can then be indexed using either Salmon or kallisto for further quantification using alevin, alevin-fry or kallisto|bustools.

Software tools - other characteristics

There are also differences between the quantification tools listed beyond their choice of reference sequence to align or map the reads to. For example:

alevin and alevin-fry will consider reads that could stem from more than one gene in the quantification. Recent versions of STARsolo also allows inclusion of such reads, while the other tools will discard them.
kallisto|bustools does not take into account the strandedness of the reads (e.g., from 10x Genomics), and thus quantifies the abundances as if the library was unstranded. This has several consequences for the quantification. First, consider the stituation of a gene overlapping another gene on the opposite strand. Reads falling in the overlapping region will be considered ambiguous, and discarded, although the fact that the 10x Genomics libraries are stranded would have allowed them to be unambiguously assigned. On the other hand, counting reads from both strands allow the inclusion of intronic reads stemming from discordant ‘internal priming’ on polyT sites. This phenomenon was observed by La Manno et al. (2018), and suggests that in addition to the expected peaks around intronic polyA sites, there are also peaks with reads on the opposite strand, around polyT sites. If only reads aligning to the same strand as the introns are counted, all these peaks are discarded.

In order to better understand some of these differences, we show below a few examples where different methods provide different quantifications (from Soneson et al. (2021)):

Chkb - overlapping features on the same strand. In this case, among the evaluated methods only alevin assigns a non-zero UMI count (and STARsolo-diff, which defines the intronic count as the difference between a “gene body count” and the regular gene expression).

Rassf1 - overlapping features on different strands. Whether or not the tool accounts for the strandedness of the reads makes a difference.

Tspan3 - many ambiguous regions. The way that the introns are defined makes a substantial difference. The intronic count is much higher with the ‘separate’ intron definition approach.

These differences between counts obtained by different methods propagate also to the estimated velocities, and can affect the biological interpretation of the final results.

Velocity estimation

As mentioned already above, there are several tools available for estimating velocities from spliced and unspliced abundance estimates. The widely used scVelo python package extends the original capabilities provided by velocyto (which has both an R and a python interface), and implements three different models for estimating velocities. These are described in much more detail in Bergen et al. (2020), but recapped here.

Steady-state. Essentially the same model that was originally proposed by La Manno et al. (2018). Cells in steady state are inferred from the phase plot, and the value of \(\gamma\) is calculated as the slope of the straight line fitted to these cells. This model assumes that all genes share a common splicing rate, and that the induction and repression periods are long enough to reach steady state.
Stochastic. An extension of the steady-state model, treating transcription, splicing and degradation as probabilistic events. This model was shown to improve the performance of the steady-state model, without significantly increasing the computational time.
Dynamical. The most complex (and most time-consuming) model, with the best performance. In this likelihood-based framework, the reaction rate parameters (\(\alpha_k\), \(\beta\), \(\gamma\)) as well as latent variables (time and transcriptional state) are estimated for each gene using an EM algorithm. The model assumes the existence of four different transcriptional states - two steady-state (high and low), induction and repression. The EM algorithm iterates between estimating the latent time of a cell (the ‘position’ of the cell along the phase space trajectory) and assigning it a transcriptional state, and optimizing the values of the parameters (see Figure below from Bergen et al. (2020)). The likelihood is obtained by assuming that the observations follow a normal distribution:\[x_i^{obs}\sim N((\hat{u}(t), \hat{s}(t)), \sigma^2).\]

In the practical exercises, we will see how to use the velociraptor Bioconductor package to estimate and interpret RNA velocity based on a pre-quantified data set. The velociraptor package provides a wrapper around scVelo that allows us to run the RNA velocity analysis directly in an R session.

Other resources

scVelo documentation and tutorials
velocyto documentation
alevin documentation
alevin-fry documentation
alevin-fry velocity tutorial
STARsolo documentation
kallisto|bustools documentation
Overview papers: Weiler et al. (2021), Bergen et al. (2021)

References

Bergen, Volker, Marius Lange, Stefan Peidli, F Alexander Wolf, and Fabian J Theis. 2020. “Generalizing RNA Velocity to Transient Cell States Through Dynamical Modeling.” Nature Biotechnology 38: 1408–14.

Bergen, Volker, Ruslan A Soldatov, Peter V Kharchenko, and Fabian J Theis. 2021. “RNA Velocity—Current Challenges and Future Perspectives.” Mol. Syst. Biol. 17 (8): e10282.

Bray, Nicolas L, Harold Pimentel, Páll Melsted, and Lior Pachter. 2016. “Near-Optimal Probabilistic RNA-seq Quantification.” Nat. Biotechnol. 34: 525.

Gaidatzis, Dimos, Lukas Burger, Maria Florescu, and Michael B Stadler. 2015. “Analysis of Intronic and Exonic Reads in RNA-seq Data Characterizes Transcriptional and Post-Transcriptional Regulation.” Nat. Biotechnol. 33: 722–29.

Gray, Jesse M, David A Harmin, Sarah A Boswell, Nicole Cloonan, Thomas E Mullen, Joseph J Ling, Nimrod Miller, et al. 2014. “SnapShot-Seq: A Method for Extracting Genome-Wide, in Vivo mRNA Dynamics from a Single Total RNA Sample.” PLoS One 9: e89673.

He, Dongze, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, and Rob Patro. 2021. “Alevin-Fry Unlocks Rapid, Accurate, and Memory-Frugal Quantification of Single-Cell RNA-seq Data.” bioRxiv doi:2021.06.29.450377.

Kaminow, Benjamin, Dinar Yunusov, and Alexander Dobin. 2021. “STARsolo: Accurate, Fast and Versatile Mapping/Quantification of Single-Cell and Single-Nucleus RNA-seq Data.” bioRxiv, doi: 2021.05.05.442755.

La Manno, Gioele, Ruslan Soldatov, Amit Zeisel, Emelie Braun, Hannah Hochgerner, Viktor Petukhov, Katja Lidschreiber, et al. 2018. “RNA Velocity of Single Cells.” Nature 560: 494–98.

Melsted, Páll, Vasilis Ntranos, and Lior Pachter. 2019. “The Barcode, UMI, Set Format and BUStools.” Bioinformatics 35: 4472–73.

Melsted, Páll, A Sina Booeshaghi, Lauren Liu, Fan Gao, Lambda Lu, Kyung Hoi (joseph) Min, Eduardo da Veiga Beltrame, Kristján Eldjárn Hjörleifsson, Jase Gehring, and Lior Pachter. 2021. “Modular, Efficient and Constant-Memory Single-Cell RNA-seq Preprocessing.” Nat. Biotechnol. 39: 813–18.

Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nat. Methods 14: 417–19.

Saelens, Wouter, Robrecht Cannoodt, Helena Todorov, and Yvan Saeys. 2019. “A Comparison of Single-Cell Trajectory Inference Methods.” Nat. Biotechnol. 37: 547–54.

Soneson, Charlotte, Avi Srivastava, Rob Patro, and Michael B Stadler. 2021. “Preprocessing Choices Affect RNA Velocity Results for Droplet scRNA-seq Data.” PLoS Comput. Biol. 17 (1): e1008585.

Srivastava, Avi, Laraib Malik, Tom Smith, Ian Sudbery, and Rob Patro. 2019. “Alevin Efficiently Estimates Accurate Gene Abundances from dscRNA-seq Data.” Genome Biol. 20: 65.

Weiler, Philipp, Koen Van den Berge, Kelly Street, and Simone Tiberi. 2021. “A Guide to Trajectory Inference and RNA Velocity.” bioRxiv doi:2021.12.22.473434.

Zeisel, Amit, Wolfgang J Köstler, Natali Molotski, Jonathan M Tsai, Rita Krauthgamer, Jasmine Jacob-Hirsch, Gideon Rechavi, et al. 2011. “Coupled pre-mRNA and mRNA Dynamics Unveil Operational Strategies Underlying Transcriptional Responses to Stimuli.” Mol. Syst. Biol. 7: 529.

Zheng, Grace X Y, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, et al. 2017. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nat. Commun. 8: 14049.

Estimating spliced and unspliced abundances for RNA velocity

Charlotte Soneson

January 14, 2022