Tree versions of diffcyt functions — buildTree • treeclimbR

A collection of functions from the diffcyt package have been generalized to work with data provided in a tree structure. The tree represents increasingly coarse clustering of the cells, and the leaves are the clusters from an initial, high-resolution clustering generated by diffcyt. Note that diffcyt represents data using SummarizedExperiment objects with cells in rows and features in columns.

Usage

buildTree(d_se, dist_method = "euclidean", hclust_method = "average")

calcMediansByTreeMarker(d_se, tree)

calcTreeCounts(d_se, tree)

calcTreeMedians(d_se, tree, message = FALSE)

Arguments

d_se: A SummarizedExperiment object, with cells as rows and features as columns. This should be the output from generateClusters. The colData is assumed to contain a factor named marker_class.
dist_method: The distance measure to be used. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". Any unambiguous substring can be given. Please refer to method in dist for more information.
hclust_method: The agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). Please refer to method in hclust for more information.
tree: A phylo object from buildTree.
message: A logical scalar indicating whether progress messages should be printed.

Value

For buildTree, a phylo object representing the hierarchical clustering of the initial high-resolution clusters.
For calcTreeCounts, a TreeSummarizedExperiment object, with clusters (nodes of the tree) in rows, samples in columns and abundances (counts) in an assay.
For calcMediansByTreeMarker, a TreeSummarizedExperiment object with clusters (nodes of the tree) in rows and markers in columns. The marker expression values are in the assay.
For calcTreeMedians, a TreeSummarizedExperiment object with median marker expression for each cluster (each node of the tree) and each sample for each cluster (node of the tree). Each node is represented as a separate assay, with the number of columns equal to the number of samples. The metadata slot contains variables id_type_markers and id_state_markers.

Details

The data object is assumed to contain a factor marker_class in the column meta-data (see prepareData), which indicates the protein marker class for each column of data ("type", "state", or "none").

Variables id_type_markers and id_state_markers are saved in the metadata slot of the output object. These can be used to identify the 'cell type' and 'cell state' markers in the list of assays in the output TreeSummarizedExperiment object, which is useful in later steps of the 'diffcyt' pipeline.

buildTree applies hierarchical clustering to build a tree starting from the high-resolution clustering created by generateClusters. The function calculates the median abundance for each (ID type) marker and cluster, and uses this data to further aggregate the initial clusters using hierarchical clustering.
calcTreeCounts calculates the number of cells per cluster-sample combination (referred to as cluster cell 'counts', 'abundances', or 'frequencies'. This is a tree version of calcCounts.
calcMediansByTreeMarker calculates the median value for each cluster-marker combination. A cluster is represented by a node on the tree. This is a tree version of calcMediansByClusterMarker.
calcTreeMedians calculates the median expression for each cluster-sample-marker combination. This is a tree version of calcMedians.

Author

Ruizhu Huang

Examples

## For a complete workflow example demonstrating each step in the 'diffcyt'
## pipeline, please see the diffcyt vignette.
suppressPackageStartupMessages({
    library(diffcyt)
    library(TreeSummarizedExperiment)
})

## Helper function to create random data (one sample)
d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) {
    d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor
    colnames(d) <- paste0("marker", sprintf("%02d", seq_len(ncol)))
    d
}

## Create random data (without differential signal)
set.seed(123)
d_input <- list(
    sample1 = d_random(), sample2 = d_random(),
    sample3 = d_random(), sample4 = d_random()
)

experiment_info <- data.frame(
    sample_id = factor(paste0("sample", seq_len(4))),
    group_id = factor(c("group1", "group1", "group2", "group2"))
)

marker_info <- data.frame(
    channel_name = paste0("channel", sprintf("%03d", seq_len(20))),
    marker_name = paste0("marker", sprintf("%02d", seq_len(20))),
    marker_class = factor(c(rep("type", 10), rep("state", 10)),
                          levels = c("type", "state", "none"))
)

# Prepare data
d_se <- diffcyt::prepareData(d_input, experiment_info, marker_info)

# Transform data
d_se <- diffcyt::transformData(d_se)

# Generate clusters
d_se <- diffcyt::generateClusters(d_se)
#> FlowSOM clustering completed in 0.3 seconds

# Build a tree
tr <- buildTree(d_se)

## Calculate abundances for nodes in each sample
d_counts_tree <- calcTreeCounts(d_se = d_se, tree = tr)

## Calculate medians (by cluster and marker)
d_medians_by_cluster_marker <-
    calcMediansByTreeMarker(d_se = d_se, tree = tr)
#> Warning: Multiple nodes are found to have the same label.
#> Warning: Multiple nodes are found to have the same label.
#> Warning: Multiple nodes are found to have the same label.

## Calculate medians (by cluster, sample and marker)
d_medians_tree <- calcTreeMedians(d_se = d_se, tree = tr)
#> Warning: Multiple nodes are found to have the same label.
#> Warning: Multiple nodes are found to have the same label.
#> Warning: Multiple nodes are found to have the same label.