Introduction

This vignette illustrates the basic principles of performing RNA velocity analysis in R, using the velociraptor package, which provides a wrapper around the scVelo python package.

Data preparation

We will use an example data set from Hermann et al. (2018), for which spliced and unspliced counts are available via the scRNAseq package. See here for a record of how the data was processed to get to the SingleCellExperiment object from the raw FASTQ files. Of course, in practice you will typically need to assemble the SingleCellExperiment object based on the output from a quantification tool. It is worth noting that there are many R packages that are useful for reading raw data from various quantification tools directly into a SummarizedExperiment or SingleCellExperiment object; e.g. fishpond, tximeta, BUSpaRse and DropletUtils.

We start by loading the data set and performing some initial “standard” analysis steps, such as normalizing the data, finding highly variable genes, and performing dimensionality reduction.

suppressPackageStartupMessages({
    library(scRNAseq)
})

## Load data set
hermann <- HermannSpermatogenesisData()
#> snapshotDate(): 2021-10-19
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache

## Add a 'counts' assay for easier downstream analysis 
## (here equal to the 'spliced' assay)
assay(hermann, "counts") <- assay(hermann, "spliced")
hermann
#> class: SingleCellExperiment 
#> dim: 54448 2325 
#> metadata(0):
#> assays(3): spliced unspliced counts
#> rownames(54448): ENSMUSG00000102693.1 ENSMUSG00000064842.1 ...
#>   ENSMUSG00000064369.1 ENSMUSG00000064372.1
#> rowData names(0):
#> colnames(2325): CCCATACTCCGAAGAG AATCCAGTCATCTGCC ... ATCCACCCACCACCAG
#>   ATTGGTGGTTACCGAT
#> colData names(1): celltype
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

suppressPackageStartupMessages({
    library(scuttle)
    library(scater)
    library(scran)
})

## Calculate log-normalized counts
hermann <- scuttle::logNormCounts(hermann)

## Find highly variable genes
dec <- scran::modelGeneVar(hermann)
#> Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
#> collapsing to unique 'x' values
top.hvgs <- scran::getTopHVGs(dec, n = 2000)

## Perform dimensionality reduction
set.seed(100)
hermann <- scater::runPCA(hermann, subset_row = top.hvgs)
hermann <- scater::runTSNE(hermann, dimred = "PCA")

## Plot tSNE representation
scater::plotTSNE(hermann, colour_by = "celltype")

Interlude: R/python interoperability

In this tutorial, we will use the velociraptor R/Bioconductor package to run the RNA velocity analysis. This package provides a wrapper around the scVelo python package, and lets you run most of the essential analysis steps directly from your R session. However, in other situations you may want to use scVelo or other python packages directly, despite performing the initial analysis in R. A large number of tools from the python-based single-cell ecosystem are based around the so callled AnnData format, which is a type of python object that can hold largely similar types of information as SingleCellExperiment or Seurat objects. In order to use our preprocessed data in such python tools, we first need to convert our SingleCellExperiment to AnnData format. The zellkonverter package provides a convenient way to convert between these two formats. To illustrate how it works, let us convert the SingleCellExperiment object to AnnData format and save it to a file that can be directly read into python.

suppressPackageStartupMessages({
    library(zellkonverter)
})
zellkonverter::writeH5AD(hermann, file = "Hermann_AnnData.h5ad",
                         X_name = "counts")

In a python session, we can now read this object back (assuming that the anndata python module is installed), and continue with any downstream analysis. Note that the ‘main’ assay (specified by the X_name argument to writeH5AD() is stored in a slot named X in the AnnData object, and all other assays are added as additional layers. Reduced dimension representations (obsm) and cell annotations (obs) are also converted. More information about the AnnData format can be found in the documentation.

## This chunk is run in a python session
import anndata
adata = anndata.read("Hermann_AnnData.h5ad")
adata
#> AnnData object with n_obs × n_vars = 2325 × 54448
#>     obs: 'celltype', 'sizeFactor'
#>     uns: 'X_name'
#>     obsm: 'PCA', 'TSNE'
#>     layers: 'logcounts', 'spliced', 'unspliced'

zellkonverter can also be used to read an .h5ad file (e.g., that was saved from python) into an R session:

sce <- zellkonverter::readH5AD("Hermann_AnnData.h5ad")
#> ℹ Using stored X_name value 'counts'
sce
#> class: SingleCellExperiment 
#> dim: 54448 2325 
#> metadata(0):
#> assays(4): counts logcounts spliced unspliced
#> rownames(54448): ENSMUSG00000102693.1 ENSMUSG00000064842.1 ...
#>   ENSMUSG00000064369.1 ENSMUSG00000064372.1
#> rowData names(0):
#> colnames(2325): CCCATACTCCGAAGAG AATCCAGTCATCTGCC ... ATCCACCCACCACCAG
#>   ATTGGTGGTTACCGAT
#> colData names(2): celltype sizeFactor
#> reducedDimNames(2): PCA TSNE
#> mainExpName: NULL
#> altExpNames(0):

RNA velocity estimation

Next, we will perform the RNA velocity estimation. In velociraptor, this is done using the scvelo() function, which provides a wrapper around the functionality of scVelo. Take a moment to study the documentation of scvelo() - you will see for example that all the modes of scVelo (steady-state, stochastic, dynamical) can be used. In addition, the argument use.theirs (default FALSE) can be set to TRUE to perform gene filtering and normalization using scVelo. If set to FALSE, these steps will be performed in R, which can be useful in order to increase consistency with the rest of an R-based pipeline.

For our example, we will use the dynamical mode. It takes a bit longer to run than the steady state option, but typically gives more reliable results.

suppressPackageStartupMessages({
    library(velociraptor)
})
velo.out <- velociraptor::scvelo(hermann, subset.row = top.hvgs,
                                 assay.X = "counts",
                                 mode = "dynamical")

The output of scvelo() is another SingleCellExperiment object.

velo.out
#> class: SingleCellExperiment 
#> dim: 2000 2325 
#> metadata(5): neighbors recover_dynamics velocity_params velocity_graph
#>   velocity_graph_neg
#> assays(10): X spliced ... velocity velocity_u
#> rownames(2000): ENSMUSG00000038015.6 ENSMUSG00000022501.6 ...
#>   ENSMUSG00000048787.13 ENSMUSG00000032232.14
#> rowData names(18): fit_r2 fit_alpha ... velocity_genes varm
#> colnames(2325): CCCATACTCCGAAGAG AATCCAGTCATCTGCC ... ATCCACCCACCACCAG
#>   ATTGGTGGTTACCGAT
#> colData names(8): velocity_self_transition root_cells ...
#>   velocity_confidence velocity_confidence_transition
#> reducedDimNames(1): X_pca
#> mainExpName: NULL
#> altExpNames(0):

The scvelo() function performs the full RNA velocity estimation analysis, and in the process calls several functions from scVelo. Wd can follow the executed steps via the messages printed in the console.

Compute neighbors and moments

The first step is to estimate moments (mean and uncentered variance) of the abundances for each cell. The moments are calculated across a small set of neighboring cells (in the PCA space), for increased stability (think of it as a kind of smoothing, or imputation). Two layers, Ms and Mu, which are the first order moments (means) for spliced and unspliced abundances, are added to the data object.

## 30 nearest neighbors (note that the nearest neighbor to each cell is the 
## cell itself)
head(metadata(velo.out)$neighbors$indices)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,]    0  108  162   57  207   53  145  124  109   678   297    45   349     1
#> [2,]    1  101    3   49  186  119   76   58   53    86    68   742    50   297
#> [3,]    2  387   50   15    3  285   78  187  259   470    34   154   119     1
#> [4,]    3   50   34   15   84  101    9    1  259     2   285    14    53    78
#> [5,]    4   11  267  226  114   42  172  249  442   355    51    94   317    97
#> [6,]    5  173   83   95    8  326  205   55  318   223   381   176   295   388
#>      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
#> [1,]    58    68    86     9    84    37    93     3   531   365   187   101
#> [2,]   365    91   600   124    93    45   109   207     2    15   108   162
#> [3,]   990   410   390   109   681    91   548    68   359   186   426   182
#> [4,]    93    91   387    49   123    86    45   109   496   119   154    76
#> [5,]   719   616   423   321    16   543   126   112   128   465    48   276
#> [6,]    21   253   418   304   227   375   369    88   272   762   431   773
#>      [,27] [,28] [,29] [,30]
#> [1,]    91   135   475    49
#> [2,]   154   145   393   475
#> [3,]   496     9    76    84
#> [4,]    58   186    37   990
#> [5,]    13   685   706   417
#> [6,]   342   415   453   339

## Note Ms and Mu assays
assayNames(velo.out)
#>  [1] "X"          "spliced"    "unspliced"  "Ms"         "Mu"        
#>  [6] "fit_t"      "fit_tau"    "fit_tau_"   "velocity"   "velocity_u"

Recover dynamics

This step fits the model and infers transcription rates, splicing rates and degradation rates for each gene, as well as cell-specific latent times and transcriptional states. scVelo uses an EM algorithm for the estimation, iterating between the E-step (where each observation is assigned a time and a state (induction, repression, or steady state)) and the M-step (where the rate parameters are estimated), as outlined in the figure below from Bergen et al. (2020).

This step is only required for the dynamical model. It adds several columns to rowData(hermann) (see https://scvelo.readthedocs.io/DynamicalModeling.html), including:

\(R^2\) of the linear fit to the steady-state cells in the phase plot (fit_r2). Note that this can be negative, if the obtained fit is worse than just using a straight line at the mean. This is used to determine which genes are used for the downstream analysis and projection of velocities into a low-dimensional space.
transcription rate estimates (fit_alpha)
splicing rate estimates (fit_beta)
degradation rate estimates (fit_gamma)
estimates of switching time points (fit_t_)
the likelihood value of the fit (fit_likelihood), averaged across all cells. The likelihood value for a gene and a cell indicates how well the cell is described by the learned phase trajectory.

head(rowData(velo.out)[rowData(velo.out)$velocity_genes, ])
#> DataFrame with 6 rows and 18 columns
#>                          fit_r2 fit_alpha  fit_beta fit_gamma    fit_t_
#>                       <numeric> <numeric> <numeric> <numeric> <numeric>
#> ENSMUSG00000028693.15  0.100405  60.95574 73.767529  1.142003  17.94801
#> ENSMUSG00000087279.10  0.268690   9.27810 14.083324  0.180749  14.30887
#> ENSMUSG00000118396.2   0.696183  28.32358  0.801604  0.527865   5.17467
#> ENSMUSG00000051113.9   0.279714  24.88695 14.699699  0.333210  11.93588
#> ENSMUSG00000032601.13  0.573506   5.83376 11.455458  0.210135  14.25576
#> ENSMUSG00000024532.13  0.351638  14.16604 21.884510  0.402229   6.60905
#>                       fit_scaling fit_std_u fit_std_s fit_likelihood    fit_u0
#>                         <numeric> <numeric> <numeric>      <numeric> <numeric>
#> ENSMUSG00000028693.15   0.0183809  0.189800  17.46369       0.594941         0
#> ENSMUSG00000087279.10   0.0199706  0.181182  12.86018       0.236457         0
#> ENSMUSG00000118396.2    0.4247320 10.735452  11.75326       0.295935         0
#> ENSMUSG00000051113.9    0.0317420  0.412822  22.53213       0.404600         0
#> ENSMUSG00000032601.13   0.0351809  0.193684   8.38194       0.307726         0
#> ENSMUSG00000024532.13   0.0298769  0.193273  11.20748       0.366759         0
#>                          fit_s0 fit_pval_steady fit_steady_u fit_steady_s
#>                       <numeric>       <numeric>    <numeric>    <numeric>
#> ENSMUSG00000028693.15         0        0.498334     0.520035      49.0103
#> ENSMUSG00000087279.10         0        0.499027     0.611843      27.5735
#> ENSMUSG00000118396.2          0        0.498511    27.093951      32.0842
#> ENSMUSG00000051113.9          0        0.499705     1.237777      65.3373
#> ENSMUSG00000032601.13         0        0.496976     0.482520      18.6365
#> ENSMUSG00000024532.13         0        0.498405     0.421512      27.8004
#>                       fit_variance fit_alignment_scaling velocity_genes
#>                          <numeric>             <numeric>      <logical>
#> ENSMUSG00000028693.15     0.155117               1.22703           TRUE
#> ENSMUSG00000087279.10     0.826343               4.53596           TRUE
#> ENSMUSG00000118396.2      0.330992               1.25563           TRUE
#> ENSMUSG00000051113.9      0.266072               3.18262           TRUE
#> ENSMUSG00000032601.13     0.554023               3.59790           TRUE
#> ENSMUSG00000024532.13     0.296963               2.15763           TRUE
#>                                              varm
#>                                       <DataFrame>
#> ENSMUSG00000028693.15 757.029:656.900:459.097:...
#> ENSMUSG00000087279.10 869.516:882.060:878.076:...
#> ENSMUSG00000118396.2  669.123:520.053:524.619:...
#> ENSMUSG00000051113.9  760.118:749.991:676.861:...
#> ENSMUSG00000032601.13 516.835:519.185:520.944:...
#> ENSMUSG00000024532.13 554.709:556.350:481.511:...

Compute velocities

Once the kinetic rate parameters are estimated, the actual velocities are estimated based on these. This adds a velocity layer to the data object, and the velocity_genes column in the row data. This column indicates whether the fit for a gene is considered ‘good enough’ for downstream use. Specifically, it requires fit_r2, fit_likelihood and fit_gamma to exceed certain thresholds, fit_scaling to be within a certain range, and that both the unspliced and spliced mean values are nonzero (see the scVelo code for more details).

rd <- rowData(velo.out)
perc <- quantile(rd$fit_scaling[!is.na(rd$fit_scaling)], probs = c(0.1, 0.9))
table(crit = rd$fit_r2 > 0.01 & 
          rd$fit_gamma > 0.01 & 
          rd$fit_likelihood > 0.001 & 
          rd$fit_scaling > min(perc[1], 0.03) & 
          rd$fit_scaling < max(perc[2], 3) & 
          rowSums(assay(velo.out, "Ms")) > 0 & 
          rowSums(assay(velo.out, "Mu")) > 0, 
      vel = rd$velocity_genes, useNA = "ifany")
#>        vel
#> crit    FALSE TRUE
#>   FALSE  1167    0
#>   TRUE      0  561
#>   <NA>    272    0

Compute the velocity graph

At this point, we have estimated the velocities - these are vectors in a \(K\)-dimensional space, where \(K\) is the number of retained genes. In order to use these velocities for downstream applications, such as estimating the future state of an individual cell or generating low-dimensional visualizations, we next estimate a so called velocity graph. To this end, scVelo calculates cosine similarities between the velocity vector for each cell and the displacement vector from that cell to each other (neighboring) cell: \[\pi_{ij}=cos\angle(\delta_{ij}, v_i),\]where \(\delta_{ij}=s_j-s_i\) is the displacement vector from cell \(i\) to cell \(j\) in gene expression space, and \(v_i\) is the velocity vector of cell \(i\). The cosine similarity takes values between -1 and +1, and a large value indicates that cell \(i\) has a high probability of transitioning towards cell \(j\) (since its velocity vector points towards cell \(j\)). The velocity graph is stored in the uns slot of the data object (and subsequently propagated to the metadata of the returned SingleCellExperiment), and is represented by a sparse \(N\times N\) matrix (where \(N\) is the number of cells). There is also a velocity_graph_neg, which is a matrix of the same size as velocity_graph, containing the negative cosine similarities.

## Velocity graph
dim(metadata(velo.out)$velocity_graph)
#> [1] 2325 2325
metadata(velo.out)$velocity_graph[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgTMatrix"
#>                                                   
#> [1,] .          .          .          .          .
#> [2,] 0.11193331 .          .          0.01055382 .
#> [3,] 0.05770175 0.01050411 .          .          .
#> [4,] 0.10750668 0.04704920 0.03657299 .          .
#> [5,] .          .          .          .          .

Compute latent time and pseudotime

It is often of interest to obtain an ordering of the cells along a trajectory. scVelo provides two different approaches for this: pseudotime and latent time. The velocity pseudotime is obtained via a diffusion random walk on the velocity graph, and measures how many steps (on average) it takes to reach a given cell from one of the root cells. The root cells are obtained from the transition matrix. The latent time is obtained from the transcriptional dynamics fit, by relating gene-specific times (position along the phase curve) to a “universal” latent time, shared across genes. Usually, the latent time captures the actual time better than pseudotime.

head(colData(velo.out))
#> DataFrame with 6 rows and 8 columns
#>                  velocity_self_transition  root_cells  end_points
#>                                 <numeric>   <numeric>   <numeric>
#> CCCATACTCCGAAGAG                 0.431352 9.43462e-05 1.38203e-07
#> AATCCAGTCATCTGCC                 0.356828 1.06404e-04 1.29249e-07
#> GACTGCGGTTTGACTG                 0.283698 1.45943e-04 6.18349e-08
#> ACACCAATCTTCGGTC                 0.392824 1.37776e-04 7.88960e-08
#> TGACAACAGGACAGAA                 0.213529 1.14266e-05 4.23266e-05
#> TTGGAACAGGCGTACA                 0.387442 7.31517e-05 1.43643e-04
#>                  velocity_pseudotime latent_time velocity_length
#>                            <numeric>   <numeric>       <numeric>
#> CCCATACTCCGAAGAG          0.01160549    0.507783           36.38
#> AATCCAGTCATCTGCC          0.02222421    0.513260           36.40
#> GACTGCGGTTTGACTG          0.02293368    0.522630           35.59
#> ACACCAATCTTCGGTC          0.00721252    0.504956           34.84
#> TGACAACAGGACAGAA          0.55268830    0.306849           41.24
#> TTGGAACAGGCGTACA          0.59105021    0.480577           41.25
#>                  velocity_confidence velocity_confidence_transition
#>                            <numeric>                      <numeric>
#> CCCATACTCCGAAGAG            0.946991                      0.0712617
#> AATCCAGTCATCTGCC            0.945846                      0.1497353
#> GACTGCGGTTTGACTG            0.951023                      0.2038010
#> ACACCAATCTTCGGTC            0.946307                      0.0961184
#> TGACAACAGGACAGAA            0.964340                      0.3313537
#> TTGGAACAGGCGTACA            0.972804                      0.1511034

Combine input and output

Note that the results of scvelo() are not added to the input SingleCellExperiment object, since the two objects can have different dimensions (in particular note the difference in the number of genes). Thus, in order to use information from the velocity calculations together with information from the original object, we need to copy slots from one object to the other. Here, we copy the reduced dimension representations from the original SingleCellExperiment object to the output of scvelo(), and use the latter to construct tSNE representations colored by various parameters.

stopifnot(all(colnames(hermann) == colnames(velo.out)))
reducedDims(velo.out) <- reducedDims(hermann)
velo.out$celltype <- hermann$celltype
cowplot::plot_grid(
    scater::plotTSNE(velo.out, colour_by = "celltype"),
    scater::plotTSNE(velo.out, colour_by = "velocity_pseudotime"),
    scater::plotTSNE(velo.out, colour_by = "latent_time"),
    ncol = 1, align = "v"
)

As we already saw above, the rowData and colData of velo.out contain the values of various parameters estimated by scvelo():

head(rowData(velo.out))
#> DataFrame with 6 rows and 18 columns
#>                           fit_r2 fit_alpha  fit_beta fit_gamma    fit_t_
#>                        <numeric> <numeric> <numeric> <numeric> <numeric>
#> ENSMUSG00000038015.6   0.7308033       NaN       NaN       NaN       NaN
#> ENSMUSG00000022501.6  -0.0100850       NaN       NaN       NaN       NaN
#> ENSMUSG00000043050.8   0.2894763       NaN       NaN       NaN       NaN
#> ENSMUSG00000026182.6  -0.0453711       NaN       NaN       NaN       NaN
#> ENSMUSG00000090137.8   0.0746567       NaN       NaN       NaN       NaN
#> ENSMUSG00000074435.10 -0.6170902       NaN       NaN       NaN       NaN
#>                       fit_scaling fit_std_u fit_std_s fit_likelihood    fit_u0
#>                         <numeric> <numeric> <numeric>      <numeric> <numeric>
#> ENSMUSG00000038015.6          NaN       NaN       NaN            NaN       NaN
#> ENSMUSG00000022501.6          NaN       NaN       NaN            NaN       NaN
#> ENSMUSG00000043050.8          NaN       NaN       NaN            NaN       NaN
#> ENSMUSG00000026182.6          NaN       NaN       NaN            NaN       NaN
#> ENSMUSG00000090137.8          NaN       NaN       NaN            NaN       NaN
#> ENSMUSG00000074435.10         NaN       NaN       NaN            NaN       NaN
#>                          fit_s0 fit_pval_steady fit_steady_u fit_steady_s
#>                       <numeric>       <numeric>    <numeric>    <numeric>
#> ENSMUSG00000038015.6        NaN             NaN          NaN          NaN
#> ENSMUSG00000022501.6        NaN             NaN          NaN          NaN
#> ENSMUSG00000043050.8        NaN             NaN          NaN          NaN
#> ENSMUSG00000026182.6        NaN             NaN          NaN          NaN
#> ENSMUSG00000090137.8        NaN             NaN          NaN          NaN
#> ENSMUSG00000074435.10       NaN             NaN          NaN          NaN
#>                       fit_variance fit_alignment_scaling velocity_genes
#>                          <numeric>             <numeric>      <logical>
#> ENSMUSG00000038015.6           NaN                   NaN          FALSE
#> ENSMUSG00000022501.6           NaN                   NaN          FALSE
#> ENSMUSG00000043050.8           NaN                   NaN          FALSE
#> ENSMUSG00000026182.6           NaN                   NaN          FALSE
#> ENSMUSG00000090137.8           NaN                   NaN          FALSE
#> ENSMUSG00000074435.10          NaN                   NaN          FALSE
#>                                  varm
#>                           <DataFrame>
#> ENSMUSG00000038015.6  NaN:NaN:NaN:...
#> ENSMUSG00000022501.6  NaN:NaN:NaN:...
#> ENSMUSG00000043050.8  NaN:NaN:NaN:...
#> ENSMUSG00000026182.6  NaN:NaN:NaN:...
#> ENSMUSG00000090137.8  NaN:NaN:NaN:...
#> ENSMUSG00000074435.10 NaN:NaN:NaN:...

head(colData(velo.out))
#> DataFrame with 6 rows and 9 columns
#>                  velocity_self_transition  root_cells  end_points
#>                                 <numeric>   <numeric>   <numeric>
#> CCCATACTCCGAAGAG                 0.431352 9.43462e-05 1.38203e-07
#> AATCCAGTCATCTGCC                 0.356828 1.06404e-04 1.29249e-07
#> GACTGCGGTTTGACTG                 0.283698 1.45943e-04 6.18349e-08
#> ACACCAATCTTCGGTC                 0.392824 1.37776e-04 7.88960e-08
#> TGACAACAGGACAGAA                 0.213529 1.14266e-05 4.23266e-05
#> TTGGAACAGGCGTACA                 0.387442 7.31517e-05 1.43643e-04
#>                  velocity_pseudotime latent_time velocity_length
#>                            <numeric>   <numeric>       <numeric>
#> CCCATACTCCGAAGAG          0.01160549    0.507783           36.38
#> AATCCAGTCATCTGCC          0.02222421    0.513260           36.40
#> GACTGCGGTTTGACTG          0.02293368    0.522630           35.59
#> ACACCAATCTTCGGTC          0.00721252    0.504956           34.84
#> TGACAACAGGACAGAA          0.55268830    0.306849           41.24
#> TTGGAACAGGCGTACA          0.59105021    0.480577           41.25
#>                  velocity_confidence velocity_confidence_transition
#>                            <numeric>                      <numeric>
#> CCCATACTCCGAAGAG            0.946991                      0.0712617
#> AATCCAGTCATCTGCC            0.945846                      0.1497353
#> GACTGCGGTTTGACTG            0.951023                      0.2038010
#> ACACCAATCTTCGGTC            0.946307                      0.0961184
#> TGACAACAGGACAGAA            0.964340                      0.3313537
#> TTGGAACAGGCGTACA            0.972804                      0.1511034
#>                                celltype
#>                             <character>
#> CCCATACTCCGAAGAG DIplotene/Secondary ..
#> AATCCAGTCATCTGCC DIplotene/Secondary ..
#> GACTGCGGTTTGACTG DIplotene/Secondary ..
#> ACACCAATCTTCGGTC DIplotene/Secondary ..
#> TGACAACAGGACAGAA   Mid Round spermatids
#> TTGGAACAGGCGTACA   Mid Round spermatids

More information about the interpretation of these parameters can be found in the scVelo documentation. Let us, for example, plot the distribution of transcription, splicing and degradation rates for all velocity genes with fit likelihood larger than 0.1:

suppressPackageStartupMessages({
    library(dplyr)
})
plotdf <- as.data.frame(rowData(velo.out)) %>%
    dplyr::filter(fit_likelihood > 0.1 & velocity_genes)
dim(plotdf)
#> [1] 521  38
cowplot::plot_grid(
    ggplot(plotdf, aes(x = fit_alpha)) + geom_histogram(bins = 50) + 
        theme_minimal() + scale_x_log10(),
    ggplot(plotdf, aes(x = fit_beta)) + geom_histogram(bins = 50) + 
        theme_minimal() + scale_x_log10(),
    ggplot(plotdf, aes(x = fit_gamma)) + geom_histogram(bins = 50) + 
        theme_minimal() + scale_x_log10(),
    nrow = 1
)

Velocity plots

The velociraptor package also provides plotting functions to visualize the RNA velocity results. For example, we can embed the estimated velocities into a provided low-dimensional space. In order to visualize the velocities in a lower-dimensional embedding, we convert the cosine similarities in the velocity graph to transition probabilities of cell-to-cell transitions by applying an exponential kernel: \[\tilde{\pi}_{ij}=\frac{1}{z_i}exp(\frac{\pi_{ij}}{\sigma_i^2}).\] The \(z_i\) are normalization factors and \(\sigma_i\) is an adaptive kernel width parameter. These transition probabilities are used to project the velocities into a low-dimensional embedding. This is achieved by weighting the normalized displacement vectors from a cell to all other cells in the low-dimensional space by the transition probabilities for cell, and taking the resulting weighted average as the low-dimensional velocity vector. More precisely, if \[\tilde{\delta}_{ij}=\frac{\tilde{s}_j-\tilde{s}_i}{\|\tilde{s}_j-\tilde{s}_i\|}\]are the normalized displacement vectors in the low-dimensional embedding, the embedded velocity is estimated by \[\tilde{v}_i=\sum_{j\neq i}\left(\tilde{\pi}_{ij} - \frac{1}{n}\right)\tilde{\delta}_{ij}.\] The \(1/n\) term is included to make the projected velocity zero when the transition probabilities represent a uniform distribution.

suppressPackageStartupMessages({
    library(ggplot2)
})
embedded <- velociraptor::embedVelocity(reducedDim(velo.out, "TSNE"), velo.out)
grid.df <- velociraptor::gridVectors(reducedDim(velo.out, "TSNE"), embedded)
plotTSNE(velo.out, colour_by = "velocity_pseudotime") +
    geom_segment(data = grid.df,
                 mapping = aes(x = start.1, y = start.2,
                               xend = end.1, yend = end.2),
                 arrow = arrow(length = unit(0.05, "inches")))

Next, we will look at individual genes via their phase portraits. This is often useful in order to understand how the velocities are affected/supported by specific genes. We can of course plot genes that we are already familiar with and that we know are related to the process of interest. We can also extract genes with particularly strong influence on the velocity results. ‘Driver genes,’ which display a strong dynamic behavior, can for example be detected via their high likelihoods in the dynamic model. Here, we select the six genes with the largest likelihood value for the fit.

toplh <- rownames(as.data.frame(rowData(velo.out)) %>%
    dplyr::arrange(desc(fit_likelihood)) %>%
    head())
plotVelocity(velo.out, use.dimred = "TSNE", genes = toplh[1:3],
             color_by = "celltype")

plotVelocity(velo.out, use.dimred = "TSNE", genes = toplh[4:6],
             color_by = "celltype")

Finally, we will illustrate two summary statistics returned by scVelo:

The length of the velocity vector encodes the speed or rate of differentiation.
The ‘velocity confidence’ provides a measure of the coherence of the velocity vector field, that is, how well the velocity vector for a cell correlates with those of its neighbors.

cowplot::plot_grid(
    scater::plotTSNE(velo.out, colour_by = "velocity_length"),
    scater::plotTSNE(velo.out, colour_by = "velocity_confidence"),
    ncol = 1, align = "v"
)

References

Bergen, Volker, Marius Lange, Stefan Peidli, F Alexander Wolf, and Fabian J Theis. 2020. “Generalizing RNA Velocity to Transient Cell States Through Dynamical Modeling.” Nature Biotechnology 38: 1408–14.

Hermann, Brian P, Keren Cheng, Anukriti Singh, Lorena Roa-De La Cruz, Kazadi N Mutoji, I-Chung Chen, Heidi Gildersleeve, et al. 2018. “The Mammalian Spermatogenesis Single-Cell Transcriptome, from Spermatogonial Stem Cells to Spermatids.” Cell Rep. 25: 1650–1667.e8.

RNA velocity analysis with velociraptor

Charlotte Soneson

January 14, 2022