run_clustering.Rmd
This vignette describes how each of the included clustering methods was applied to the collection of data sets in order to generate the clustering result summaries provided with the package. It also shows how to apply a new clustering method to the included data sets, to generate results that can be compared to those already included.
The code below describes how we applied each of the included
clustering methods to the data sets for our paper (Duò, Robinson, and Soneson 2018). The
apply_*()
functions, describing how the respective
clustering methods were run, are available from the GitHub
repository corresponding to the publication. In order to apply a new
clustering algorithm to one of the data sets using the same framework,
it is necessary to generate a function with the same format. The input
arguments to this function should be:
SingleCellExperiment
objectThe function should return a list with three elements:
st
- a vector with the timing information. Should have
five elements, named user.self
, sys.self
,
user.child
, sys.child
and
elapsed
.cluster
- a named vector of cluster assignments for all
cells.est_k
- the number of clusters estimated by the method
(if available, otherwise NA
).If the method does not allow specification of the desired number of
clusters, but has another parameter affecting the resolution, this can
be accommodated as well (see the solution for Seurat
in the
code below).
First, load the package and define the data set and clustering method
to use (note that in order to apply a method named
<method>
, there has to be a function named
apply_<method>()
, with the above specifications,
available in the workspace).
suppressPackageStartupMessages({
library(DuoClustering2018)
})
## Error in get(paste0(generic, ".", class), envir = get_method_env()) :
## object 'type_sum.accel' not found
scename <- "sce_filteredExpr10_Koh"
sce <- sce_filteredExpr10_Koh()
## see ?DuoClustering2018 and browseVignettes('DuoClustering2018') for documentation
## loading from cache
## require("SingleCellExperiment")
method <- "PCAHC"
Next, define the list of hyperparameter values. The package contains the hyperparameter values for the methods included in our paper.
## Load parameter files. General dataset and method parameters as well as
## dataset/method-specific parameters
params <- duo_clustering_all_parameter_settings_v2()[[paste0(scename, "_",
method)]]
## see ?DuoClustering2018 and browseVignettes('DuoClustering2018') for documentation
## loading from cache
params
## $nPC
## [1] 30
##
## $range_clusters
## [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Finally, define the number of times to apply the clustering method (for each value of the number of clusters), and run the clustering across a range of imposed numbers of clusters (defined in the parameter list).
## Set number of times to run clustering for each k
n_rep <- 5
## Run clustering
set.seed(1234)
L <- lapply(seq_len(n_rep), function(i) { ## For each run
cat(paste0("run = ", i, "\n"))
if (method == "Seurat") {
tmp <- lapply(params$range_resolutions, function(resolution) {
## For each resolution
cat(paste0("resolution = ", resolution, "\n"))
## Run clustering
res <- get(paste0("apply_", method))(sce = sce, params = params,
resolution = resolution)
## Put output in data frame
df <- data.frame(dataset = scename,
method = method,
cell = names(res$cluster),
run = i,
k = length(unique(res$cluster)),
resolution = resolution,
cluster = res$cluster,
stringsAsFactors = FALSE, row.names = NULL)
tm <- data.frame(dataset = scename,
method = method,
run = i,
k = length(unique(res$cluster)),
resolution = resolution,
user.self = res$st[["user.self"]],
sys.self = res$st[["sys.self"]],
user.child = res$st[["user.child"]],
sys.child = res$st[["sys.child"]],
elapsed = res$st[["elapsed"]],
stringsAsFactors = FALSE, row.names = NULL)
kest <- data.frame(dataset = scename,
method = method,
run = i,
k = length(unique(res$cluster)),
resolution = resolution,
est_k = res$est_k,
stringsAsFactors = FALSE, row.names = NULL)
list(clusters = df, timing = tm, kest = kest)
}) ## End for each resolution
} else {
tmp <- lapply(params$range_clusters, function(k) { ## For each k
cat(paste0("k = ", k, "\n"))
## Run clustering
res <- get(paste0("apply_", method))(sce = sce, params = params, k = k)
## Put output in data frame
df <- data.frame(dataset = scename,
method = method,
cell = names(res$cluster),
run = i,
k = k,
resolution = NA,
cluster = res$cluster,
stringsAsFactors = FALSE, row.names = NULL)
tm <- data.frame(dataset = scename,
method = method,
run = i,
k = k,
resolution = NA,
user.self = res$st[["user.self"]],
sys.self = res$st[["sys.self"]],
user.child = res$st[["user.child"]],
sys.child = res$st[["sys.child"]],
elapsed = res$st[["elapsed"]],
stringsAsFactors = FALSE, row.names = NULL)
kest <- data.frame(dataset = scename,
method = method,
run = i,
k = k,
resolution = NA,
est_k = res$est_k,
stringsAsFactors = FALSE, row.names = NULL)
list(clusters = df, timing = tm, kest = kest)
}) ## End for each k
}
## Summarize across different values of k
assignments <- do.call(rbind, lapply(tmp, function(w) w$clusters))
timings <- do.call(rbind, lapply(tmp, function(w) w$timing))
k_estimates <- do.call(rbind, lapply(tmp, function(w) w$kest))
list(assignments = assignments, timings = timings, k_estimates = k_estimates)
}) ## End for each run
## Summarize across different runs
assignments <- do.call(rbind, lapply(L, function(w) w$assignments))
timings <- do.call(rbind, lapply(L, function(w) w$timings))
k_estimates <- do.call(rbind, lapply(L, function(w) w$k_estimates))
## Add true group for each cell
truth <- data.frame(cell = as.character(rownames(colData(sce))),
trueclass = as.character(colData(sce)$phenoid),
stringsAsFactors = FALSE)
assignments$trueclass <- truth$trueclass[match(assignments$cell, truth$cell)]
## Combine results
res <- list(assignments = assignments, timings = timings,
k_estimates = k_estimates)
df <- dplyr::full_join(res$assignments %>%
dplyr::select(dataset, method, cell, run, k,
resolution, cluster, trueclass),
res$k_estimates %>%
dplyr::select(dataset, method, run, k,
resolution, est_k)
) %>% dplyr::full_join(res$timings %>% dplyr::select(dataset, method, run, k,
resolution, elapsed))
The resulting df
data frames can then be combined across
data sets, filterings and methods and used as input to the provided
plotting functions.
## R Under development (unstable) (2024-12-20 r87452)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.7.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: UTC
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [3] Biobase_2.67.0 GenomicRanges_1.59.1
## [5] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [7] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [9] generics_0.1.3 MatrixGenerics_1.19.0
## [11] matrixStats_1.4.1 DuoClustering2018_1.7.1
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 dplyr_1.1.4
## [4] blob_1.2.4 filelock_1.0.3 viridis_0.6.5
## [7] Biostrings_2.75.3 fastmap_1.2.0 BiocFileCache_2.15.0
## [10] digest_0.6.37 mime_0.12 lifecycle_1.0.4
## [13] KEGGREST_1.47.0 RSQLite_2.3.9 magrittr_2.0.3
## [16] compiler_4.5.0 rlang_1.1.4 sass_0.4.9
## [19] tools_4.5.0 yaml_2.3.10 knitr_1.49
## [22] S4Arrays_1.7.1 htmlwidgets_1.6.4 bit_4.5.0.1
## [25] mclust_6.1.1 curl_6.0.1 DelayedArray_0.33.3
## [28] plyr_1.8.9 abind_1.4-8 withr_3.0.2
## [31] purrr_1.0.2 desc_1.4.3 grid_4.5.0
## [34] ExperimentHub_2.15.0 colorspace_2.1-1 ggplot2_3.5.1
## [37] scales_1.3.0 cli_3.6.3 rmarkdown_2.29
## [40] crayon_1.5.3 ragg_1.3.3 httr_1.4.7
## [43] reshape2_1.4.4 DBI_1.2.3 cachem_1.1.0
## [46] stringr_1.5.1 ggthemes_5.1.0 AnnotationDbi_1.69.0
## [49] BiocManager_1.30.25 XVector_0.47.1 vctrs_0.6.5
## [52] Matrix_1.7-1 jsonlite_1.8.9 bookdown_0.41
## [55] bit64_4.5.2 systemfonts_1.1.0 tidyr_1.3.1
## [58] jquerylib_0.1.4 glue_1.8.0 pkgdown_2.1.1.9000
## [61] stringi_1.8.4 gtable_0.3.6 BiocVersion_3.21.1
## [64] UCSC.utils_1.3.0 munsell_0.5.1 tibble_3.2.1
## [67] pillar_1.10.0 rappdirs_0.3.3 htmltools_0.5.8.1
## [70] GenomeInfoDbData_1.2.13 R6_2.5.1 dbplyr_2.5.0
## [73] textshaping_0.4.1 lattice_0.22-6 evaluate_1.0.1
## [76] AnnotationHub_3.15.0 png_0.1-8 memoise_2.0.1
## [79] bslib_0.8.0 Rcpp_1.0.13-1 SparseArray_1.7.2
## [82] gridExtra_2.3 xfun_0.49 fs_1.6.5
## [85] pkgconfig_2.0.3