Calculates a collection of metrics comparing one or more reduced dimension representations to a reference representation. The function takes a SingleCellExperiment object as input. The reference representation can be either one of the included assays or one of the reduced dimension representations. If an assay is used, reference distances can be calculated based on all or a subset of the features (rows). These distances are then compared to distances calculated from the specified reduced dimension representations, and several scores are returned. The execution time of the function depends strongly on both the number of retained variables (which affects the distance calculation in the reference space) and the number of samples that are randomly selected to use as the basis for the comparison. Since subsampling of the columns (via the nSamples argument) is random, setting the random seed is recommended to obtain reproducible results.

dreval(
  sce,
  dimReds = NULL,
  refType = "assay",
  refAssay = "logcounts",
  refDimRed = NULL,
  features = NULL,
  nSamples = NULL,
  distNorm = "none",
  refDistMethod = "euclidean",
  kTM = c(10, 100),
  labelColumn = NULL,
  verbose = FALSE
)

Arguments

sce

A SingleCellExperiment object.

dimReds

A character vector with the names of the reduced dimension representations from sce to include in the evaluation. If NULL, all reduced dimension representations are included.

refType

A character scalar, either "assay" or "dimred", specifying whether to use an assay or a reduced dimension representation of sce as the reference data source.

refAssay

A character scalar giving the name of the assay from sce to use as the basis for the distance calculations in the reference space, if refType is "assay".

refDimRed

A character scalar specifying the reduced dimension representation to use as the reference data representation if refType is "dimred".

features

A character vector giving the IDs of the features to use for distance calculations from the chosen assay. Will be matched to the row names of sce.

nSamples

A numeric scalar, giving the number of columns to subsample (randomly) from sce.

distNorm

A character scalar indicating how the distance vectors in the reference and low-dimensional spaces should be normalized before they are compared. If set to "l2", the vectors are L2 normalized. If set to "median", they are divided by the median value times the square root of their length. If set to any other value, they are divided by the square root of their length, to avoid metrics scaling with the number of retained samples.
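The three normalization options described for distNorm can be sketched in base R as follows. This is a minimal illustration of the description above, not the package's internal code. The helper name normalize_dists and the toy data are assumptions made for the example.

```r
# Hypothetical helper (not part of dreval) mirroring the distNorm description.
normalize_dists <- function(d, distNorm = "none") {
  d <- as.numeric(d)
  if (distNorm == "l2") {
    # L2 normalization: the result has unit Euclidean norm
    d / sqrt(sum(d^2))
  } else if (distNorm == "median") {
    # Divide by the median value times the square root of the length
    d / (stats::median(d) * sqrt(length(d)))
  } else {
    # Any other value: divide by the square root of the length only
    d / sqrt(length(d))
  }
}

set.seed(1)
x <- matrix(rnorm(50 * 5), nrow = 50)  # toy data: 50 samples, 5 variables
d <- as.numeric(dist(x))               # 50 * 49 / 2 pairwise distances

l2 <- normalize_dists(d, "l2")
med <- normalize_dists(d, "median")
none <- normalize_dists(d, "none")
```

In all three cases the normalized vector is a positive multiple of the original one, so rank-based and correlation-based comparisons of the distances are unaffected by the choice.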

refDistMethod

A character scalar defining the distance measure to use in the reference space. Must be one of "euclidean", "manhattan", "maximum", "canberra" or "cosine". The distance in the low-dimensional representation will always be Euclidean.

kTM

An integer vector giving the number of neighbors to use for trustworthiness, continuity and Jaccard index calculations.
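To illustrate how a vector of neighborhood sizes such as kTM drives the neighbor-based scores, the sketch below computes the mean Jaccard index, the trustworthiness and the continuity for each value of k on toy data. It uses the textbook definitions (Venna & Kaski 2001) in base R and is not the package's implementation. The helper names knn_index, mean_jaccard and trustworthiness are hypothetical, and the normalization constant assumes k is smaller than half the number of samples.

```r
# Hypothetical helpers (not part of dreval), base R only.

# Indices of the k nearest neighbors of every sample (one row per sample).
knn_index <- function(distObj, k) {
  distMat <- as.matrix(distObj)
  diag(distMat) <- Inf  # a sample is not its own neighbor
  do.call(rbind, lapply(seq_len(nrow(distMat)), function(i) {
    order(distMat[i, ])[seq_len(k)]
  }))
}

# Mean (over samples) Jaccard index between the two k-nearest-neighbor sets.
mean_jaccard <- function(refDist, lowDist, k) {
  nnRef <- knn_index(refDist, k)
  nnLow <- knn_index(lowDist, k)
  mean(vapply(seq_len(nrow(nnRef)), function(i) {
    length(intersect(nnRef[i, ], nnLow[i, ])) /
      length(union(nnRef[i, ], nnLow[i, ]))
  }, numeric(1)))
}

# Textbook trustworthiness: penalize samples that are among the k nearest
# neighbors in the low-dimensional space but not in the reference space,
# weighted by how far down the reference ranking they are.
trustworthiness <- function(refDist, lowDist, k) {
  refMat <- as.matrix(refDist)
  diag(refMat) <- Inf
  n <- nrow(refMat)
  nnRef <- knn_index(refDist, k)
  nnLow <- knn_index(lowDist, k)
  penalty <- 0
  for (i in seq_len(n)) {
    refRank <- rank(refMat[i, ], ties.method = "first")
    intruders <- setdiff(nnLow[i, ], nnRef[i, ])
    penalty <- penalty + sum(refRank[intruders] - k)
  }
  # Normalization valid for k < n / 2
  1 - 2 / (n * k * (2 * n - 3 * k - 1)) * penalty
}

set.seed(1)
x <- matrix(rnorm(60 * 10), nrow = 60)  # toy reference space: 60 samples, 10 variables
refDist <- dist(x)
lowDist <- dist(x[, 1:2])               # crude 2-dim "reduction": first two columns

kTM <- c(5, 10)
res <- sapply(kTM, function(k) c(
  k = k,
  jaccard = mean_jaccard(refDist, lowDist, k),
  trust = trustworthiness(refDist, lowDist, k),
  # Continuity is trustworthiness with the roles of the two spaces swapped
  continuity = trustworthiness(lowDist, refDist, k)
))
res
```

Each column of res corresponds to one value of k, so supplying several values gives one set of neighbor-based scores per neighborhood size.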

labelColumn

A character scalar defining a column of colData(sce) to use as the group assignments in the silhouette width calculations. If not provided, the silhouette widths are not calculated.

verbose

A logical scalar, indicating whether to print out progress messages.

Value

A list with two elements:

  • scores - A data.frame with values of all evaluation metrics, across the dimension reduction methods. In addition to the metrics, it contains the dimensionality of the respective reduced dimension representations, and the value of K giving the highest value of LCMC (used for the calculations of Qlocal and Qglobal, see Kraemer et al 2018, Lee and Verleysen 2009, Chen and Buja 2009).

  • plots - A list of ggplot objects, representing diagnostic plots.

Details

The following metrics are calculated:

  • SpearmanCorrDist - The Spearman correlation between the reference distances and the Euclidean distances in the low-dimensional representation. Ranges from -1 to 1, higher values are better.

  • PearsonCorrDist - The Pearson correlation between the reference distances and the Euclidean distances in the low-dimensional representation. Ranges from -1 to 1, higher values are better.

  • KSstatDist - The Kolmogorov-Smirnov statistic comparing the distribution of distances in the reference space and in the low-dimensional representation. Ranges from 0 to 1, lower values are better.

  • EuclDistBetweenDists - The Euclidean distance between the vector of distances in the reference space and those in the low-dimensional representation. Depending on the value of distNorm, distances are scaled before they are compared. Lower values are better.

  • SammonStress - The Sammon stress (Sammon 1969). Depending on the value of distNorm, distances are scaled before they are compared. Lower values are better.

  • Trustworthiness_kNN - The trustworthiness score (Venna & Kaski 2001), using NN nearest neighbors, where NN is one of the values supplied in kTM. The trustworthiness indicates to which degree we can trust that the points placed closest to a given sample in the low-dimensional representation are also close to the sample in the reference space. Ranges from 0 to 1, higher values are better.

  • Continuity_kNN - The continuity score (Venna & Kaski 2001), using NN nearest neighbors, where NN is one of the values supplied in kTM. The continuity indicates to which degree we can trust that the points closest to a given sample in the reference space are also placed close to the sample in the low-dimensional representation. Ranges from 0 to 1, higher values are better.

  • MeanJaccard_kNN - The mean Jaccard index (over all samples), comparing the set of NN nearest neighbors in the reference space and those in the low-dimensional representation, where NN is one of the values supplied in kTM. Ranges from 0 to 1, higher values are better.

  • MeanSilhouette_X - If a labelColumn X is supplied, the mean silhouette score (Rousseeuw 1987) across all samples, with the grouping given by this column and the distances obtained from the low-dimensional representation. Ranges from -1 to 1, higher values are better.

  • coRankingQlocal - Q_local, defined as the average LCMC over the values to the left of the maximum, following the dimRed/coRanking package implementations (Kraemer et al 2018, Lee and Verleysen 2009, Chen and Buja 2009). Measures the preservation of local distances, higher values are better.

  • coRankingQglobal - Q_global, defined as the average LCMC over the values to the right of the maximum, following the dimRed/coRanking package implementations (Kraemer et al 2018, Lee and Verleysen 2009, Chen and Buja 2009). Measures the preservation of global distances, higher values are better.
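The distance-based metrics in the list above can be sketched in base R on toy data as follows. This is an illustration of the textbook definitions under stated assumptions, not the package's code. The variable names are hypothetical, the Sammon stress is computed on unscaled distances, and the Euclidean distance between the distance vectors uses the square-root-of-length scaling described for distNorm.

```r
set.seed(1)
x <- matrix(rnorm(60 * 10), nrow = 60)  # toy reference data: 60 samples, 10 variables
dRef <- as.numeric(dist(x))             # reference distances
dLow <- as.numeric(dist(x[, 1:2]))      # Euclidean distances in a crude 2-dim representation

# Correlations between the two distance vectors
spearman <- cor(dRef, dLow, method = "spearman")
pearson <- cor(dRef, dLow, method = "pearson")

# Two-sample Kolmogorov-Smirnov statistic between the distance distributions
ks <- unname(suppressWarnings(stats::ks.test(dRef, dLow))$statistic)

# Euclidean distance between the distance vectors, each divided by the
# square root of its length so the value does not grow with the sample count
n <- length(dRef)
euclBetween <- sqrt(sum((dRef / sqrt(n) - dLow / sqrt(n))^2))

# Sammon stress in its textbook form (Sammon 1969), on unscaled distances
sammon <- sum((dRef - dLow)^2 / dRef) / sum(dRef)

c(SpearmanCorrDist = spearman, PearsonCorrDist = pearson,
  KSstatDist = ks, EuclDistBetweenDists = euclBetween, SammonStress = sammon)
```

With the square-root-of-length scaling, the Euclidean distance between the distance vectors equals the root mean squared difference between the paired distances, which is why it stays comparable across different values of nSamples.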

References

Venna J., Kaski S. (2001). Neighborhood preservation in nonlinear projection methods: An experimental study. In Dorffner G., Bischof H., Hornik K., editors, Proceedings of ICANN 2001, pp 485–491. Springer, Berlin.

Lee J.A., Verleysen M. (2009). Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing 72 (7-9):1431-1443.

Chen L., Buja A. (2009). Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. Journal of the American Statistical Association 104:209-219.

Kraemer G., Reichstein M., Mahecha M.D. (2018). dimRed and coRanking - Unifying dimensionality reduction in R. The R Journal 10 (1):342-358.

Sammon J.W. Jr (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers C-18(5):401-409.

Rousseeuw P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20:53-65.

Author

Charlotte Soneson

Examples

data(pbmc3ksub)
# The columns are subsampled at random, so set a seed for reproducible results
set.seed(1)
dre <- dreval(sce = pbmc3ksub, nSamples = 150)