Gene counts for scRNA-seq data sets from Zheng et al. (2017), consisting of pre-sorted cell types combined into three artificial data sets with different cell proportions.

sce_full_Zhengmix4eq(metadata = FALSE)
sce_filteredExpr10_Zhengmix4eq(metadata = FALSE)
sce_filteredHVG10_Zhengmix4eq(metadata = FALSE)
sce_filteredM3Drop10_Zhengmix4eq(metadata = FALSE)
sce_full_Zhengmix4uneq(metadata = FALSE)
sce_filteredExpr10_Zhengmix4uneq(metadata = FALSE)
sce_filteredHVG10_Zhengmix4uneq(metadata = FALSE)
sce_filteredM3Drop10_Zhengmix4uneq(metadata = FALSE)
sce_full_Zhengmix8eq(metadata = FALSE)
sce_filteredExpr10_Zhengmix8eq(metadata = FALSE)
sce_filteredHVG10_Zhengmix8eq(metadata = FALSE)
sce_filteredM3Drop10_Zhengmix8eq(metadata = FALSE)

Arguments

metadata

Logical, whether only metadata should be returned

Details

This is a scRNA-seq data set originally from Zheng et al. (2017). The data set consists of eight pre-sorted cell types (B-cells, naive cytotoxic T-cells, CD14 monocytes, regulatory T-cells, CD56 NK cells, memory T-cells, CD4 T-helper cells and naive T-cells) from Homo sapiens combined into three artificial data sets with different cell proportions. The annotated cell type (obtained by pre-sorting of the cells) is used as the true cell label. The data sets have been used to evaluate the performance of clustering algorithms in Duò et al. (2018).

For the Zhengmix4eq data set, randomly selected B-cells, CD14 monocytes, naive cytotoxic T-cells and regulatory T-cells were combined in equal proportions (1,000 cells per subpopulation). The Zhengmix4uneq data set consists of four cell types, combined in unequal proportions (1,000 B-cells, 500 naive cytotoxic T-cells, 2,000 CD14 monocytes and 3,000 regulatory T-cells). For the Zhengmix8eq data set, all eight populations were combined in approximately equal proportions (400–600 cells per population).

For the sce_full_Zhengmix4eq, sce_full_Zhengmix4uneq and sce_full_Zhengmix8eq data set, all genes except those with zero counts across all cells are retained. The gene counts are unique molecular identifiers (UMIs) counts. The scater package was used to perform quality control of the data (McCarthy et al. (2017)). Features with zero counts across all cells, as well as all cells with total count or total number of detected features more than 3 median absolute deviations (MADs) below the median across all cells (on the log scale), were excluded.

The sce_full_Zhengmix4eq data set consists of 3,994 cells and 15,568 features, the sce_full_Zhengmix4uneq data set of 6,498 cells and 16,443 features and the sce_full_Zhengmix8eq of 3,994 cells and 16,443 features, respectively. The filteredExpr, filteredHVG and filteredM3Drop10 are further reduced data sets. For each of the filtering method, we retained 10 percent of the original number of genes (with a non-zero count in at least one cell) in the original data sets.

For the filteredExpr data sets, only the genes with the highest average expression (log-normalized count) value across all cells were retained. Using the Seurat package, the filteredHVG data sets were filtered on the variability of the features and only the most highly variable ones were retained (Satija et al. (2015)). Finally, the M3Drop package was used to model the dropout rate of the genes as a function of the mean expression level using the Michaelis-Menten equation and select variables to retain for the filteredM3Drop10 data sets (Andrews and Hemberg (2018)).

The scater package was used to normalize the count values, based on normalization factors calculated by the deconvolution method from the scran package (Lun et al. (2016)). This data set is provided as a SingleCellExperiment object (Lun and Risso (2017)). For further information on the SingleCellExperiment class, see the corresponding manual. Raw data files or the original data sets are available from https://support.10xgenomics.com/single-cell-gene-expression/datasets.

Format

SingleCellExperiment

Value

Returns a SingleCellExperiment object.

References

Andrews, T.S., and Hemberg, M. (2018). Dropout-based feature selection for scRNASeq. bioRxiv doi:https://doi.org/10.1101/065094.

Duò, A., Robinson, M.D., and Soneson, C. (2018). A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7:1141.

Lun, A.T.L., Bach, K., and Marioni, J.C. (2016) Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17(1): 75.

Lun, A.T.L., and Risso, D. (2017). SingleCellExperiment: S4 Classes for Single Cell Data. R package version 1.0.0.

McCarthy, D.J., Campbell, K.R., Lun, A.T.L., and Wills, Q.F. (2017): Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33(8): 1179-1186.

Satija, R., Farrell, J.A., Gennert, D., Schier, A.F., and Regev, A. (2015). Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33(5): 495–502.

Zheng, G.X., Terry, J.M., Belgrader P., Ryvkin, Pl, Bent, Z.W., Wilson, R., Ziraldo, S.B., Wheeler, T.D., McDermott, G.P., Zhu, J., Gregory, M.T., Shuga, J., Montesclaros, L., Underwood, J.G., Masquelier, D.A., Nishimura, S.Y., Schnall-Levin, M., Wyatt, P.W., Hindson, C.M., Bharadwaj, R., Wong, A., Ness, K.D., Beppu, L.W., Deeg, H.J., McFarland, C., Loeb, K.R., Valente, W.J., Ericson, N.G., Stevens, E.A., Radich, J.P., Mikkelsen, T.S., Hindson, B.J., and Bielas, J.H. (2017). Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8:14049.

Examples

sce_filteredExpr10_Zhengmix4eq()
#> see ?DuoClustering2018 and browseVignettes('DuoClustering2018') for documentation
#> loading from cache
#> class: SingleCellExperiment 
#> dim: 1556 3555 
#> metadata(1): log.exprs.offset
#> assays(3): counts logcounts normcounts
#> rownames(1556): ENSG00000167526 ENSG00000140988 ... ENSG00000011009
#>   ENSG00000168522
#> rowData names(10): id symbol ... total_counts log10_total_counts
#> colnames(3555): b.cells6276 b.cells6144 ... regulatory.t1084
#>   regulatory.t9696
#> colData names(15): dataset barcode ... is_cell_control sizeFactor
#> reducedDimNames(2): PCA TSNE
#> mainExpName: NULL
#> altExpNames(0):