Gene counts for scRNA-seq data sets from Zheng et al. (2017), consisting of pre-sorted cell types combined into three artificial data sets with different cell proportions.
sce_full_Zhengmix4eq(metadata = FALSE)
sce_filteredExpr10_Zhengmix4eq(metadata = FALSE)
sce_filteredHVG10_Zhengmix4eq(metadata = FALSE)
sce_filteredM3Drop10_Zhengmix4eq(metadata = FALSE)
sce_full_Zhengmix4uneq(metadata = FALSE)
sce_filteredExpr10_Zhengmix4uneq(metadata = FALSE)
sce_filteredHVG10_Zhengmix4uneq(metadata = FALSE)
sce_filteredM3Drop10_Zhengmix4uneq(metadata = FALSE)
sce_full_Zhengmix8eq(metadata = FALSE)
sce_filteredExpr10_Zhengmix8eq(metadata = FALSE)
sce_filteredHVG10_Zhengmix8eq(metadata = FALSE)
sce_filteredM3Drop10_Zhengmix8eq(metadata = FALSE)
This is a scRNA-seq data set originally from Zheng et al. (2017). The data set consists of eight pre-sorted cell types (B-cells, naive cytotoxic T-cells, CD14 monocytes, regulatory T-cells, CD56 NK cells, memory T-cells, CD4 T-helper cells and naive T-cells) from Homo sapiens combined into three artificial data sets with different cell proportions. The annotated cell type (obtained by pre-sorting of the cells) is used as the true cell label. The data sets have been used to evaluate the performance of clustering algorithms in Duò et al. (2018).
For the Zhengmix4eq data set, randomly selected B-cells, CD14 monocytes, naive cytotoxic T-cells and regulatory T-cells were combined in equal proportions (1,000 cells per subpopulation). The Zhengmix4uneq data set consists of four cell types, combined in unequal proportions (1,000 B-cells, 500 naive cytotoxic T-cells, 2,000 CD14 monocytes and 3,000 regulatory T-cells). For the Zhengmix8eq data set, all eight populations were combined in approximately equal proportions (400–600 cells per population).
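The way such a mixture is put together can be sketched in base R. This is an illustration only: the pools, the pool size and the label names below are hypothetical stand-ins, not the barcodes or the sampling code used to build the data sets. Only the per-type cell numbers are taken from the Zhengmix4uneq description above.

```r
set.seed(42)

# Hypothetical pools of pre-sorted cell barcodes, one pool per cell type
# (assumption: 5,000 cells available per type; label names are made up)
pool_size <- 5000
cell_types <- c("b.cells", "naive.cytotoxic", "cd14.monocytes", "regulatory.t")
pools <- lapply(cell_types, function(ct) paste0(ct, seq_len(pool_size)))
names(pools) <- cell_types

# Number of cells to draw per type, as described for Zhengmix4uneq
n_draw <- c(b.cells = 1000, naive.cytotoxic = 500,
            cd14.monocytes = 2000, regulatory.t = 3000)

# Draw cells without replacement from each pool
sampled <- lapply(cell_types, function(ct) sample(pools[[ct]], n_draw[[ct]]))
names(sampled) <- cell_types

# The pre-sorted cell type serves as the true label of each sampled cell
true_label <- rep(cell_types, times = lengths(sampled))
mix_table <- table(true_label)
print(mix_table)
```

Because the label comes from the pool a cell was drawn from, the mixture has a known ground truth for evaluating clustering results.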
For the sce_full_Zhengmix4eq, sce_full_Zhengmix4uneq and sce_full_Zhengmix8eq data sets, all genes except those with zero counts across all cells are retained. The gene counts are unique molecular identifier (UMI) counts.
The scater package was used to perform quality control of the data (McCarthy et al. (2017)). Features with zero counts across all cells, as well as all cells with total count or total number of detected features more than 3 median absolute deviations (MADs) below the median across all cells (on the log scale), were excluded.
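A minimal base R sketch of this kind of filter, run on simulated counts, is given here. It assumes the threshold is the median minus 3 MADs of the log10-transformed metric and relies on the default scaling constant of stats::mad(). The helper low_outlier() is hypothetical and is not a scater function.

```r
set.seed(1)
n_genes <- 200
n_cells <- 100

# Simulated UMI counts; the last five cells mimic low-quality cells
counts <- matrix(rpois(n_genes * n_cells, lambda = 2), nrow = n_genes)
counts[, 96:100] <- rpois(n_genes * 5, lambda = 0.05)
counts[1, ] <- 0  # a feature with zero counts across all cells

# Hypothetical helper: TRUE where a metric lies more than `nmads` MADs
# below the median on the log10 scale
low_outlier <- function(x, nmads = 3) {
  lx <- log10(x)
  lx < median(lx) - nmads * mad(lx)
}

total_counts <- colSums(counts)        # total count per cell
total_features <- colSums(counts > 0)  # detected features per cell

# Drop cells flagged on either metric, and features that are zero everywhere
keep_cells <- !(low_outlier(total_counts) | low_outlier(total_features))
keep_genes <- rowSums(counts) > 0
filtered <- counts[keep_genes, keep_cells, drop = FALSE]
dim(filtered)
```

Only the lower tail is filtered in this sketch, matching the description above; cells with unusually high counts are kept.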
The sce_full_Zhengmix4eq data set consists of 3,994 cells and 15,568 features, the sce_full_Zhengmix4uneq data set of 6,498 cells and 16,443 features, and the sce_full_Zhengmix8eq data set of 3,994 cells and 16,443 features.
The filteredExpr, filteredHVG and filteredM3Drop10 data sets are further reduced versions of the full data sets. For each of the filtering methods, we retained 10 percent of the original number of genes (with a non-zero count in at least one cell) in the original data sets. For the filteredExpr data sets, only the genes with the highest average expression (log-normalized count) value across all cells were retained.
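The expression-based filter can be sketched in base R as follows. The simulated counts, the library-size factors used for normalization and the rounding down of the 10 percent cutoff are assumptions made for illustration; they are not taken from the code that produced the filteredExpr data sets.

```r
set.seed(2)
n_genes <- 500
n_cells <- 50

# Simulated counts with gene-specific mean expression
gene_means <- rgamma(n_genes, shape = 0.5, rate = 0.5)
counts <- matrix(rpois(n_genes * n_cells, lambda = rep(gene_means, n_cells)),
                 nrow = n_genes)
rownames(counts) <- paste0("gene", seq_len(n_genes))

# Log-normalized counts (assumption: library-size factors, pseudocount of 1)
size_factors <- colSums(counts) / mean(colSums(counts))
logcounts <- log2(sweep(counts, 2, size_factors, "/") + 1)

# 10 percent of the genes with a non-zero count in at least one cell
expressed <- rowSums(counts) > 0
n_keep <- floor(0.10 * sum(expressed))

# Retain the genes with the highest average log-normalized expression
avg_expr <- rowMeans(logcounts)
keep <- names(sort(avg_expr, decreasing = TRUE))[seq_len(n_keep)]
filtered <- counts[keep, , drop = FALSE]
dim(filtered)
```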
Using the Seurat package, the filteredHVG data sets were filtered on the variability of the features and only the most highly variable ones were retained (Satija et al. (2015)). Finally, the M3Drop package was used to model the dropout rate of the genes as a function of the mean expression level using the Michaelis-Menten equation and select variables to retain for the data sets (Andrews and Hemberg (2018)).
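The idea behind the Michaelis-Menten step can be sketched in base R. Under this model the expected dropout rate of a gene with mean expression S is 1 - S / (K + S). The sketch below does not call M3Drop: it fits a single K by least squares on raw simulated counts and ranks genes by how far their observed dropout rate exceeds the fitted curve, which is a simplification swapped in for the package's own fitting and testing procedure. All names and simulation settings are hypothetical.

```r
set.seed(3)
n_genes <- 300
n_cells <- 200
K_true <- 2

# Background genes whose dropout rate follows the Michaelis-Menten curve
S_true <- rlnorm(n_genes, meanlog = 0, sdlog = 1.5)
p_drop <- K_true / (K_true + S_true)  # equals 1 - S / (K + S)
counts <- t(vapply(seq_len(n_genes), function(g) {
  expressed <- rbinom(n_cells, 1, 1 - p_drop[g])
  expressed * (1 + rpois(n_cells, K_true + S_true[g] - 1))
}, numeric(n_cells)))

# Ten planted marker-like genes: high expression in 10 percent of the cells
marker_cells <- seq_len(n_cells / 10)
counts[1:10, ] <- 0
counts[1:10, marker_cells] <- 1 + rpois(10 * length(marker_cells), 49)

# Observed mean expression and dropout rate per gene
S_obs <- rowMeans(counts)
dropout_obs <- rowMeans(counts == 0)

# Michaelis-Menten curve, with K fitted by least squares on the log scale
mm_dropout <- function(S, K) 1 - S / (K + S)
sse <- function(log_K) sum((dropout_obs - mm_dropout(S_obs, exp(log_K)))^2)
K_fit <- exp(optimize(sse, interval = log(c(0.01, 100)))$minimum)

# Rank genes by excess dropout over the curve and keep the top 10 percent
excess <- dropout_obs - mm_dropout(S_obs, K_fit)
n_keep <- floor(0.10 * n_genes)
keep <- order(excess, decreasing = TRUE)[seq_len(n_keep)]
```

Genes expressed in only a subset of cells have more zeros than their mean expression predicts, so they sit above the curve and are ranked first.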
The scater package was used to normalize the count values, based on normalization factors calculated by the deconvolution method from the scran package (Lun et al. (2016)).
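How size factors turn counts into normalized and log-normalized values can be sketched in base R. The sketch assumes simple library-size factors as a stand-in for the deconvolution factors, factors centered to a mean of 1, and a log2 transform with an offset of 1. These choices are assumptions for illustration and were not checked against the scater version used to build the data sets.

```r
set.seed(4)

# Small simulated count matrix; cell 6 is sequenced twice as deeply
counts <- matrix(rpois(60, lambda = 5), nrow = 10,
                 dimnames = list(paste0("gene", 1:10), paste0("cell", 1:6)))
counts[, 6] <- counts[, 6] * 2L

# Stand-in size factors (assumption: library size, centered to mean 1)
size_factors <- colSums(counts)
size_factors <- size_factors / mean(size_factors)

# Normalized counts: divide each cell (column) by its size factor
normcounts <- sweep(counts, 2, size_factors, "/")

# Log-normalized counts (assumption: log2 with an offset of 1)
log_offset <- 1
logcounts <- log2(normcounts + log_offset)

round(size_factors, 2)
```

After the division every cell has the same total, so differences in sequencing depth no longer dominate comparisons between cells.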
This data set is provided as a SingleCellExperiment object (Lun and Risso (2017)). For further information on the SingleCellExperiment class, see the corresponding manual.
Raw data files or the original data sets are available from
Returns a SingleCellExperiment object.
Andrews, T.S., and Hemberg, M. (2018). Dropout-based feature selection for scRNASeq. bioRxiv doi:
Duò, A., Robinson, M.D., and Soneson, C. (2018). A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7:1141.
Lun, A.T.L., Bach, K., and Marioni, J.C. (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17(1): 75.
Lun, A.T.L., and Risso, D. (2017). SingleCellExperiment: S4 Classes for Single Cell Data. R package version 1.0.0.
McCarthy, D.J., Campbell, K.R., Lun, A.T.L., and Wills, Q.F. (2017). Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33(8): 1179–1186.
Satija, R., Farrell, J.A., Gennert, D., Schier, A.F., and Regev, A. (2015). Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33(5): 495–502.
Zheng, G.X., Terry, J.M., Belgrader, P., Ryvkin, P., Bent, Z.W., Wilson, R., Ziraldo, S.B., Wheeler, T.D., McDermott, G.P., Zhu, J., Gregory, M.T., Shuga, J., Montesclaros, L., Underwood, J.G., Masquelier, D.A., Nishimura, S.Y., Schnall-Levin, M., Wyatt, P.W., Hindson, C.M., Bharadwaj, R., Wong, A., Ness, K.D., Beppu, L.W., Deeg, H.J., McFarland, C., Loeb, K.R., Valente, W.J., Ericson, N.G., Stevens, E.A., Radich, J.P., Mikkelsen, T.S., Hindson, B.J., and Bielas, J.H. (2017). Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8:14049.
#> see ?DuoClustering2018 and browseVignettes('DuoClustering2018') for documentation
#> loading from cache
#> class: SingleCellExperiment
#> dim: 1556 3555
#> metadata(1): log.exprs.offset
#> assays(3): counts logcounts normcounts
#> rownames(1556): ENSG00000167526 ENSG00000140988 ... ENSG00000011009
#> ENSG00000168522
#> rowData names(10): id symbol ... total_counts log10_total_counts
#> colnames(3555): b.cells6276 b.cells6144 ... regulatory.t1084
#> regulatory.t9696
#> colData names(15): dataset barcode ... is_cell_control sizeFactor
#> reducedDimNames(2): PCA TSNE
#> mainExpName: NULL
#> altExpNames(0):