Preparation of example scRNAseq data

In this workshop, we use example data from the TENxPBMCData package. This package provides an R / Bioconductor resource for representing and manipulating different single-cell RNA-seq data sets profiling peripheral blood mononuclear cells (PBMC) generated by 10x Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets).

library(TENxPBMCData)

The man page for the TENxPBMCData() function gives an idea of the datasets that are available from this package. It can be opened with the following command.

help(TENxPBMCData)

Here, we use the pbmc3k dataset, which contains gene expression profiles for 2,700 single peripheral blood mononuclear cells. The first time this dataset is loaded, this command downloads the dataset to a local cache, which takes some time, depending on the speed of your internet connection. Subsequent times, the same command loads the dataset directly from the local cache.

sce <- TENxPBMCData(dataset = "pbmc3k")

At this point we can inspect the dataset in the console.

sce
#> class: SingleCellExperiment 
#> dim: 32738 2700 
#> metadata(0):
#> assays(1): counts
#> rownames(32738): ENSG00000243485 ENSG00000237613 ... ENSG00000215616
#>   ENSG00000215611
#> rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
#> colnames: NULL
#> colData names(11): Sample Barcode ... Individual Date_published
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

The dataset is provided as an object of the SingleCellExperiment class. In particular, this summary view indicates that the following pieces of information are available:

  • An assay matrix named counts
  • Row (i.e., gene) names are Ensembl gene IDs
  • Row metadata include for each gene the official gene symbol, and the gene symbol used by the 10x CellRanger quantification pipeline
  • Column (i.e., cell) IDs are not initialized and left to NULL
  • Column metadata include diverse information for each cell, including the cell barcode (Barcode) and the donor identifier (Individual).

Note that a SingleCellExperiment object (or, more generally, any SummarizedExperiment object) like this one already contains sufficient information to launch an interactive application instance to visualize the available data and metadata, using the iSEE() function.

For the purpose of this workshop, we first apply some preprocessing to the SingleCellExperiment object, in order to populate it with more information that can be visualized with iSEE.

We start by adding column names to the object, and use gene symbols instead of Ensembl IDs as row names. In the case where multiple Ensembl identifiers correspond to the same gene symbol, the scuttle::uniquifyFeatureNames function concatenates the Ensembl ID and the gene symbol in order to generate unique feature names.

library(scuttle)
colnames(sce) <- paste0("Cell", seq_len(ncol(sce)))
rownames(sce) <- scuttle::uniquifyFeatureNames(
    ID = rowData(sce)$ENSEMBL_ID,
    names = rowData(sce)$Symbol_TENx
)
head(rownames(sce))
#> [1] "MIR1302-10"   "FAM138A"      "OR4F5"        "RP11-34P13.7" "RP11-34P13.8"
#> [6] "AL627309.1"

Next, we use the scuttle package to calculate gene- and cell-level quality metrics. These metrics are added as columns to the rowData and colData slots of the SingleCellExperiment object, respectively. We also add some additional metrics that are not automatically computed by the scuttle package.

MT <- rownames(sce)[grep("^MT-", rownames(sce))]
sce <- scuttle::addPerCellQC(sce, subsets = list(MT = MT))
sce <- scuttle::addPerFeatureQC(sce)
sce$log10_total <- log10(sce$total)
rowData(sce)$n_cells <- as.integer(rowData(sce)$detected / 100 * ncol(sce))
rowData(sce)$log10_total <- log10(rowSums(assay(sce, "counts")) + 1)
sce
#> class: SingleCellExperiment 
#> dim: 32738 2700 
#> metadata(0):
#> assays(1): counts
#> rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
#> rowData names(7): ENSEMBL_ID Symbol_TENx ... n_cells log10_total
#> colnames(2700): Cell1 Cell2 ... Cell2699 Cell2700
#> colData names(18): Sample Barcode ... total log10_total
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

We filter out a few cells with a large fraction of the counts coming from mitochondrial genes, since these may be damaged cells. Notice the reduced number of columns in the dataset below.

(sce <- sce[, sce$subsets_MT_percent < 5])
#> class: SingleCellExperiment 
#> dim: 32738 2643 
#> metadata(0):
#> assays(1): counts
#> rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
#> rowData names(7): ENSEMBL_ID Symbol_TENx ... n_cells log10_total
#> colnames(2643): Cell1 Cell2 ... Cell2699 Cell2700
#> colData names(18): Sample Barcode ... total log10_total
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

Next, we calculate size factors and normalized and log-transformed expression values, using the scran and scuttle packages. Note that it is typically recommended to pre-cluster the cells before computing the size factors, as follows:

# set.seed(1000)
# clusters <- scran::quickCluster(sce, BSPARAM = IrlbaParam())
# sce <- scran::computeSumFactors(sce, cluster = clusters, min.mean = 0.1)

However, for time reasons, we will skip the pre-clustering step in this workshop.

library(scran)
assay(sce, "counts") <- as(assay(sce, "counts"), "sparseMatrix")
sce <- scran::computeSumFactors(sce, min.mean = 0.1)
summary(sizeFactors(sce))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.1164  0.7309  0.9004  1.0000  1.1635  9.8580
sce <- scuttle::logNormCounts(sce)

In order to extract the most informative genes, we first model the mean-variance trend and decompose the variance into biological and technical components.

dec <- scran::modelGeneVar(sce)
top.dec <- dec[order(dec$bio, decreasing = TRUE), ] 
head(top.dec)
#> DataFrame with 6 rows and 6 columns
#>              mean     total      tech       bio      p.value          FDR
#>         <numeric> <numeric> <numeric> <numeric>    <numeric>    <numeric>
#> LYZ       1.63950   3.92209  0.884955   3.03713 4.59557e-117 2.54472e-113
#> S100A9    1.07323   3.36899  0.778205   2.59079 2.72222e-110 9.04431e-107
#> HLA-DRA   1.56937   3.16577  0.880146   2.28562  5.73028e-68  7.93262e-65
#> FTL       3.67846   2.88499  0.708226   2.17676  2.46520e-94  5.85028e-91
#> CD74      2.23653   2.75839  0.854474   1.90392  1.34728e-50  1.01732e-47
#> CST3      1.10995   2.59172  0.790231   1.80149  7.24671e-53  6.01912e-50

Next, we apply Principal Components Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) to generate low-dimensional representations of the cells in our data set. These low-dimensional representations are added to the reducedDim slot of the SingleCellExperiment object.

library(BiocSingular)
set.seed(1000)
sce <- scran::denoisePCA(sce, technical = dec, subset.row=NULL)
ncol(reducedDim(sce, "PCA"))
#> [1] 5
set.seed(1000)
sce <- scater::runTSNE(sce, dimred = "PCA", perplexity = 30)
sce <- scater::runUMAP(sce, dimred = "PCA")
sce
#> class: SingleCellExperiment 
#> dim: 32738 2643 
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
#> rowData names(7): ENSEMBL_ID Symbol_TENx ... n_cells log10_total
#> colnames(2643): Cell1 Cell2 ... Cell2699 Cell2700
#> colData names(19): Sample Barcode ... log10_total sizeFactor
#> reducedDimNames(3): PCA TSNE UMAP
#> mainExpName: NULL
#> altExpNames(0):

After this, we cluster the cells using a graph-based algorithm, and find ‘marker genes’ for each cluster as the genes that are significantly upregulated in the cluster compared to each of the other inferred clusters. The adjusted p-values from this test, for each cluster, are added to the rowData slot of the object.

snn.gr <- scran::buildSNNGraph(sce, use.dimred = "PCA")
clusters <- igraph::cluster_walktrap(snn.gr)
sce$Cluster <- factor(clusters$membership)
table(sce$Cluster)
#> 
#>   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
#> 143 330 212 362 224 389 338 144 142 129  79 115  25  11
markers <- scran::findMarkers(sce, groups = sce$Cluster,
                              test.type = "t",
                              direction = "up", pval.type = "all")
for (i in names(markers)) {
    rowData(sce)[, paste0("FDR_cluster", i)] <- 
        markers[[i]]$FDR[match(rownames(sce), 
                               rownames(markers[[i]]))]
}
sce
#> class: SingleCellExperiment 
#> dim: 32738 2643 
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
#> rowData names(21): ENSEMBL_ID Symbol_TENx ... FDR_cluster13
#>   FDR_cluster14
#> colnames(2643): Cell1 Cell2 ... Cell2699 Cell2700
#> colData names(20): Sample Barcode ... sizeFactor Cluster
#> reducedDimNames(3): PCA TSNE UMAP
#> mainExpName: NULL
#> altExpNames(0):

Finally, we assign a label to each cell, based on their individual transcriptome profiles, using the SingleR package and the Monaco immune data as a reference. For each prediction, we store the labels in a specific colData column.

library(SingleR)
library(celldex)
ref_monaco <- MonacoImmuneData()

Here we assign the cell type according to the main classification scheme (this includes B cells, Basophils, CD4+ T cells, CD8+ T cells, Dendritic cells, Monocytes, Neutrophils, NK cells, Progenitors, and T cells).

pred_monaco_main <- SingleR(test = sce, ref = ref_monaco, labels = ref_monaco$label.main)
table(pred_monaco_main$labels)
#> 
#>         B cells    CD4+ T cells    CD8+ T cells Dendritic cells       Monocytes 
#>             348             894             294              42             632 
#>        NK cells     Progenitors         T cells 
#>             160              15             258
sce$labels_main <- pred_monaco_main$labels

We do something similar with a more fine-grained classification, this time including the cell subtypes (e.g., for B cells, the subtypes would include Exhausted B cells, Naive B cells, Non-switched memory B cells, Plasmablasts, and Switched memory B cells).

pred_monaco_fine <- SingleR(test = sce, ref = ref_monaco, labels = ref_monaco$label.fine)
table(pred_monaco_fine$labels)
#> 
#>    Central memory CD8 T cells           Classical monocytes 
#>                            63                           320 
#>   Effector memory CD8 T cells             Exhausted B cells 
#>                            40                            41 
#>     Follicular helper T cells        Intermediate monocytes 
#>                           130                           243 
#>                    MAIT cells       Myeloid dendritic cells 
#>                            79                            42 
#>                 Naive B cells             Naive CD4 T cells 
#>                           202                           250 
#>             Naive CD8 T cells          Natural killer cells 
#>                            73                           150 
#>       Non classical monocytes   Non-switched memory B cells 
#>                            64                            70 
#>            Non-Vd2 gd T cells                  Plasmablasts 
#>                            16                             6 
#>  Plasmacytoid dendritic cells              Progenitor cells 
#>                             4                            14 
#>       Switched memory B cells            T regulatory cells 
#>                            29                            95 
#> Terminal effector CD4 T cells Terminal effector CD8 T cells 
#>                             4                            65 
#>                     Th1 cells                Th1/Th17 cells 
#>                           144                           125 
#>                    Th17 cells                     Th2 cells 
#>                            65                           179 
#>                Vd2 gd T cells 
#>                           130
sce$labels_fine <- pred_monaco_fine$labels

Similarly, we use the information contained in the cell ontology labels.

pred_monaco_ont <- SingleR(test = sce, ref = ref_monaco, labels = ref_monaco$label.ont)
table(pred_monaco_ont$labels)
#> 
#> CL:0000236 CL:0000545 CL:0000546 CL:0000623 CL:0000782 CL:0000784 CL:0000788 
#>         43        130        185        154         42          4        201 
#> CL:0000798 CL:0000815 CL:0000860 CL:0000875 CL:0000895 CL:0000899 CL:0000900 
#>        108         99        319         64        235         58         80 
#> CL:0000907 CL:0000912 CL:0000913 CL:0000940 CL:0000970 CL:0000972 CL:0000980 
#>         70        131         62         86         69         28          6 
#> CL:0001044 CL:0001062 CL:0002038 CL:0002043 CL:0002393 
#>          3         63        143         13        247
sce$labels_ont <- pred_monaco_ont$labels

The next table shows the relationship between the coarse and fine grained assignments in the data at hand.

table(sce$labels_fine, sce$labels_main)
#>                                
#>                                 B cells CD4+ T cells CD8+ T cells
#>   Central memory CD8 T cells          0            5           49
#>   Classical monocytes                 0            0            0
#>   Effector memory CD8 T cells         0            0           29
#>   Exhausted B cells                  41            0            0
#>   Follicular helper T cells           0          125            4
#>   Intermediate monocytes              0            0            0
#>   MAIT cells                          0            1            2
#>   Myeloid dendritic cells             0            0            0
#>   Naive B cells                     200            0            0
#>   Naive CD4 T cells                   0          200           46
#>   Naive CD8 T cells                   0            9           63
#>   Natural killer cells                0            0            0
#>   Non classical monocytes             0            0            0
#>   Non-switched memory B cells        70            0            0
#>   Non-Vd2 gd T cells                  0            0            3
#>   Plasmablasts                        6            0            0
#>   Plasmacytoid dendritic cells        0            0            0
#>   Progenitor cells                    0            0            0
#>   Switched memory B cells            29            0            0
#>   T regulatory cells                  0           90            4
#>   Terminal effector CD4 T cells       0            0            2
#>   Terminal effector CD8 T cells       0            0           37
#>   Th1 cells                           0          122           15
#>   Th1/Th17 cells                      1          104            3
#>   Th17 cells                          0           62            1
#>   Th2 cells                           0          172            7
#>   Vd2 gd T cells                      1            4           29
#>                                
#>                                 Dendritic cells Monocytes NK cells Progenitors
#>   Central memory CD8 T cells                  0         0        0           0
#>   Classical monocytes                         4       315        0           1
#>   Effector memory CD8 T cells                 0         0        0           0
#>   Exhausted B cells                           0         0        0           0
#>   Follicular helper T cells                   0         0        0           0
#>   Intermediate monocytes                      1       242        0           0
#>   MAIT cells                                  0         0        0           0
#>   Myeloid dendritic cells                    33         9        0           0
#>   Naive B cells                               0         1        0           0
#>   Naive CD4 T cells                           0         0        0           0
#>   Naive CD8 T cells                           0         0        0           0
#>   Natural killer cells                        0         0      148           0
#>   Non classical monocytes                     0        64        0           0
#>   Non-switched memory B cells                 0         0        0           0
#>   Non-Vd2 gd T cells                          0         0        4           0
#>   Plasmablasts                                0         0        0           0
#>   Plasmacytoid dendritic cells                4         0        0           0
#>   Progenitor cells                            0         0        0          14
#>   Switched memory B cells                     0         0        0           0
#>   T regulatory cells                          0         0        0           0
#>   Terminal effector CD4 T cells               0         0        0           0
#>   Terminal effector CD8 T cells               0         0        5           0
#>   Th1 cells                                   0         0        0           0
#>   Th1/Th17 cells                              0         0        0           0
#>   Th17 cells                                  0         0        0           0
#>   Th2 cells                                   0         0        0           0
#>   Vd2 gd T cells                              0         1        3           0
#>                                
#>                                 T cells
#>   Central memory CD8 T cells          9
#>   Classical monocytes                 0
#>   Effector memory CD8 T cells        11
#>   Exhausted B cells                   0
#>   Follicular helper T cells           1
#>   Intermediate monocytes              0
#>   MAIT cells                         76
#>   Myeloid dendritic cells             0
#>   Naive B cells                       1
#>   Naive CD4 T cells                   4
#>   Naive CD8 T cells                   1
#>   Natural killer cells                2
#>   Non classical monocytes             0
#>   Non-switched memory B cells         0
#>   Non-Vd2 gd T cells                  9
#>   Plasmablasts                        0
#>   Plasmacytoid dendritic cells        0
#>   Progenitor cells                    0
#>   Switched memory B cells             0
#>   T regulatory cells                  1
#>   Terminal effector CD4 T cells       2
#>   Terminal effector CD8 T cells      23
#>   Th1 cells                           7
#>   Th1/Th17 cells                     17
#>   Th17 cells                          2
#>   Th2 cells                           0
#>   Vd2 gd T cells                     92

This concludes the preparation of the data. We now have a SingleCellExperiment object that contains different types of abundance values, representations in reduced dimensions, as well as a range of row (feature) and column (cell) metadata.

Overview plots

scater::plotReducedDim(sce, dimred = "TSNE", colour_by = "labels_main")

scater::plotHighestExprs(sce, n = 20, colour_cells_by = "labels_main")

Save data set

saveRDS(sce, "pbmc3k.rds")

Session info

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] celldex_1.4.0               SingleR_1.8.0              
#>  [3] BiocSingular_1.10.0         scran_1.22.1               
#>  [5] scuttle_1.4.0               TENxPBMCData_1.12.0        
#>  [7] HDF5Array_1.22.1            rhdf5_2.38.0               
#>  [9] DelayedArray_0.20.0         Matrix_1.4-0               
#> [11] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
#> [13] Biobase_2.54.0              GenomicRanges_1.46.1       
#> [15] GenomeInfoDb_1.30.0         IRanges_2.28.0             
#> [17] S4Vectors_0.32.3            BiocGenerics_0.40.0        
#> [19] MatrixGenerics_1.6.0        matrixStats_0.61.0         
#> [21] BiocStyle_2.22.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] Rtsne_0.15                    colorspace_2.0-2             
#>   [3] ggbeeswarm_0.6.0              ellipsis_0.3.2               
#>   [5] bluster_1.4.0                 XVector_0.34.0               
#>   [7] BiocNeighbors_1.12.0          farver_2.1.0                 
#>   [9] ggrepel_0.9.1                 bit64_4.0.5                  
#>  [11] RSpectra_0.16-0               interactiveDisplayBase_1.32.0
#>  [13] AnnotationDbi_1.56.2          fansi_0.5.0                  
#>  [15] sparseMatrixStats_1.6.0       cachem_1.0.6                 
#>  [17] knitr_1.37                    scater_1.22.0                
#>  [19] jsonlite_1.7.2                cluster_2.1.2                
#>  [21] dbplyr_2.1.1                  png_0.1-7                    
#>  [23] uwot_0.1.11                   shiny_1.7.1                  
#>  [25] BiocManager_1.30.16           compiler_4.1.2               
#>  [27] httr_1.4.2                    dqrng_0.3.0                  
#>  [29] assertthat_0.2.1              fastmap_1.1.0                
#>  [31] limma_3.50.0                  later_1.3.0                  
#>  [33] htmltools_0.5.2               tools_4.1.2                  
#>  [35] rsvd_1.0.5                    igraph_1.2.9                 
#>  [37] gtable_0.3.0                  glue_1.5.1                   
#>  [39] GenomeInfoDbData_1.2.7        dplyr_1.0.7                  
#>  [41] rappdirs_0.3.3                Rcpp_1.0.7                   
#>  [43] jquerylib_0.1.4               vctrs_0.3.8                  
#>  [45] Biostrings_2.62.0             rhdf5filters_1.6.0           
#>  [47] ExperimentHub_2.2.0           DelayedMatrixStats_1.16.0    
#>  [49] xfun_0.28                     stringr_1.4.0                
#>  [51] beachmat_2.10.0               mime_0.12                    
#>  [53] lifecycle_1.0.1               irlba_2.3.5                  
#>  [55] statmod_1.4.36                AnnotationHub_3.2.0          
#>  [57] edgeR_3.36.0                  zlibbioc_1.40.0              
#>  [59] scales_1.1.1                  promises_1.2.0.1             
#>  [61] parallel_4.1.2                yaml_2.2.1                   
#>  [63] curl_4.3.2                    gridExtra_2.3                
#>  [65] memoise_2.0.1                 ggplot2_3.3.5.9000           
#>  [67] sass_0.4.0                    stringi_1.7.6                
#>  [69] RSQLite_2.2.9                 highr_0.9                    
#>  [71] BiocVersion_3.14.0            ScaledMatrix_1.2.0           
#>  [73] filelock_1.0.2                BiocParallel_1.28.3          
#>  [75] rlang_0.4.12                  pkgconfig_2.0.3              
#>  [77] bitops_1.0-7                  evaluate_0.14                
#>  [79] lattice_0.20-45               purrr_0.3.4                  
#>  [81] Rhdf5lib_1.16.0               labeling_0.4.2               
#>  [83] cowplot_1.1.1                 bit_4.0.4                    
#>  [85] tidyselect_1.1.1              magrittr_2.0.1               
#>  [87] R6_2.5.1                      generics_0.1.1               
#>  [89] metapod_1.2.0                 DBI_1.1.1                    
#>  [91] pillar_1.6.4                  withr_2.4.3                  
#>  [93] KEGGREST_1.34.0               RCurl_1.98-1.5               
#>  [95] tibble_3.1.6                  crayon_1.4.2                 
#>  [97] utf8_1.2.2                    BiocFileCache_2.2.0          
#>  [99] rmarkdown_2.11                viridis_0.6.2                
#> [101] locfit_1.5-9.4                grid_4.1.2                   
#> [103] FNN_1.1.3                     blob_1.2.2                   
#> [105] digest_0.6.29                 xtable_1.8-4                 
#> [107] httpuv_1.6.3                  munsell_0.5.0                
#> [109] viridisLite_0.4.0             beeswarm_0.4.0               
#> [111] vipor_0.4.5                   bslib_0.3.1