Calculates a collection of metrics comparing one or more reduced dimension
representations to a reference representation. The function takes a
SingleCellExperiment object as input. The reference representation can
be either one of the included assays or one of the reduced dimension
representations. If an assay is used, reference distances can be calculated
based on all or a subset of the features (rows). These distances are then
compared to distances calculated from the specified reduced dimension
representations, and several scores are returned. The execution time of the
function depends strongly on both the number of retained variables (which
affects the distance calculation in the reference space) and the number of
samples that are randomly selected to use as the basis for the comparison.
Since subsampling of the columns (via the nSamples argument) is
random, setting the random seed is recommended to obtain reproducible
results.
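As a minimal base R sketch of why the seed matters, the snippet below draws a random subset of column indices twice with the same seed and once with a different seed. The use of sample() and the column count are assumptions made for illustration; this documentation does not show how the function selects columns internally.

```r
nCols <- 2700      # hypothetical number of columns (cells) in the object
nSamples <- 150    # number of columns to retain

# Same seed, same draw
set.seed(123)
keep1 <- sample(seq_len(nCols), nSamples)
set.seed(123)
keep2 <- sample(seq_len(nCols), nSamples)

# Different seed, different draw
set.seed(456)
keep3 <- sample(seq_len(nCols), nSamples)

sameSeedAgrees <- identical(keep1, keep2)
differentSeedAgrees <- identical(keep1, keep3)
```

Because every score is computed on the retained columns only, two runs without a fixed seed would generally be based on different subsets.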
dreval(
  sce,
  dimReds = NULL,
  refType = "assay",
  refAssay = "logcounts",
  refDimRed = NULL,
  features = NULL,
  nSamples = NULL,
  distNorm = "none",
  refDistMethod = "euclidean",
  kTM = c(10, 100),
  labelColumn = NULL,
  verbose = FALSE
)
sce - A SingleCellExperiment object.
dimReds - A character vector with the names of the reduced dimension representations from sce to include in the evaluation. If NULL, all reduced dimension representations are included.
refType - A character scalar, either "assay" or "dimred", specifying whether to use an assay or a reduced dimension representation of sce as the reference data source.
refAssay - A character scalar giving the name of the assay from sce to use as the basis for the distance calculations in the reference space, if refType is "assay".
refDimRed - A character scalar specifying the reduced dimension representation to use as the reference data representation if refType is "dimred".
features - A character vector giving the IDs of the features to use for distance calculations from the chosen assay. Will be matched to the row names of sce.
nSamples - A numeric scalar giving the number of columns to subsample (randomly) from sce.
distNorm - A character scalar indicating how the distance vectors in the reference and low-dimensional spaces should be normalized before they are compared. If set to "l2", the vectors are L2 normalized. If set to "median", they are divided by the median value times the square root of their length. For any other value, they are divided by the square root of their length, to avoid metrics scaling with the number of retained samples.
refDistMethod - A character scalar defining the distance measure to use in the reference space. Must be one of "euclidean", "manhattan", "maximum", "canberra" or "cosine". The distance in the low-dimensional representation will always be Euclidean.
kTM - An integer vector giving the numbers of neighbors to use for trustworthiness, continuity and Jaccard index calculations.
labelColumn - A character scalar defining a column of colData(sce) to use as the group assignments in the silhouette width calculations. If not provided, the silhouette widths are not calculated.
verbose - A logical scalar indicating whether to print progress messages.
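The three cases described for distNorm can be sketched in base R as follows. normalizeDist is a hypothetical helper written only to mirror the wording above; it is not taken from the package source, and the actual implementation may differ in detail.

```r
# Hypothetical helper mirroring the three cases described for distNorm
normalizeDist <- function(d, distNorm = "none") {
  if (distNorm == "l2") {
    # L2 normalization: the result has unit Euclidean norm
    d / sqrt(sum(d^2))
  } else if (distNorm == "median") {
    # Divide by the median value times the square root of the length
    d / (stats::median(d) * sqrt(length(d)))
  } else {
    # Any other value: divide by the square root of the length
    d / sqrt(length(d))
  }
}

d <- c(3, 4)
l2 <- normalizeDist(d, "l2")       # 0.6 0.8
med <- normalizeDist(d, "median")
none <- normalizeDist(d, "none")
```

Dividing by the square root of the length keeps a Euclidean distance between two such vectors from growing simply because more samples (and therefore more pairwise distances) are retained.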
A list with two elements:
scores - A data.frame with values of all evaluation metrics,
across the dimension reduction methods. In addition to the metrics, it
contains the dimensionality of the respective reduced dimension
representations, and the value of K giving the highest value of LCMC (used
for the calculations of Qlocal and Qglobal, see Kraemer et al 2018, Lee and
Verleysen 2009, Chen and Buja 2009).
plots - A list of ggplot objects, representing diagnostic plots.
The following metrics are calculated:
SpearmanCorrDist - The Spearman correlation between the reference distances and the Euclidean distances in the low-dimensional representation. Ranges from -1 to 1; higher values are better.
PearsonCorrDist - The Pearson correlation between the reference distances and the Euclidean distances in the low-dimensional representation. Ranges from -1 to 1; higher values are better.
KSstatDist - The Kolmogorov-Smirnov statistic comparing the distribution of distances in the reference space and in the low-dimensional representation. Ranges from 0 to 1; lower values are better.
EuclDistBetweenDists - The Euclidean distance between the vector of distances in the reference space and those in the low-dimensional representation. Depending on the value of distNorm, distances are scaled before they are compared. Lower values are better.
SammonStress - The Sammon stress (Sammon 1969). Depending on the value of distNorm, distances are scaled before they are compared. Lower values are better.
Trustworthiness_kNN - The trustworthiness score (Venna & Kaski 2001), using NN nearest neighbors, where NN is one of the values given in kTM. The trustworthiness indicates the degree to which we can trust that the points placed closest to a given sample in the low-dimensional representation are also close to the sample in the reference space. Ranges from 0 to 1; higher values are better.
Continuity_kNN - The continuity score (Venna & Kaski 2001), using NN nearest neighbors, where NN is one of the values given in kTM. The continuity indicates the degree to which we can trust that the points closest to a given sample in the reference space are also placed close to the sample in the low-dimensional representation. Ranges from 0 to 1; higher values are better.
MeanJaccard_kNN - The mean Jaccard index (over all samples), comparing the set of NN nearest neighbors in the reference space with the corresponding set in the low-dimensional representation, where NN is one of the values given in kTM. Ranges from 0 to 1; higher values are better.
MeanSilhouette_X - If a labelColumn X is supplied, the mean silhouette score (Rousseeuw 1987) across all samples, with the grouping given by this column and the distances obtained from the low-dimensional representation. Ranges from -1 to 1; higher values are better.
coRankingQlocal - Q_local, defined as the average LCMC over the values to the left of the maximum, following the dimRed/coRanking package implementations (Kraemer et al 2018, Lee and Verleysen 2009, Chen and Buja 2009). Measures the preservation of local distances; higher values are better.
coRankingQglobal - Q_global, defined as the average LCMC over the values to the right of the maximum, following the dimRed/coRanking package implementations (Kraemer et al 2018, Lee and Verleysen 2009, Chen and Buja 2009). Measures the preservation of global distances; higher values are better.
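To make a few of the metrics above concrete, the following base R sketch computes the distance correlations, the Kolmogorov-Smirnov statistic, the Sammon stress and the mean Jaccard index on simulated data. The data, the variable names and the helper knn are all assumptions made for illustration, no distNorm-style scaling is applied, and the code is not the package's implementation.

```r
# Simulated data: 50 samples with 20 features in the reference space, and a
# 2-dimensional representation obtained with PCA (illustrative only)
set.seed(42)
ref <- matrix(rnorm(50 * 20), nrow = 50)
lowdim <- prcomp(ref)$x[, 1:2]

# Pairwise Euclidean distances, flattened to vectors
dRef <- as.vector(dist(ref))
dLow <- as.vector(dist(lowdim))

# Distance-based metrics
spearman <- cor(dRef, dLow, method = "spearman")
pearson <- cor(dRef, dLow, method = "pearson")
ksStat <- unname(ks.test(dRef, dLow)$statistic)

# Sammon stress in its standard form (Sammon 1969), on unscaled distances
sammon <- sum((dRef - dLow)^2 / dRef) / sum(dRef)

# Hypothetical helper: indices of the k nearest neighbors of each sample
knn <- function(x, k) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf  # exclude each sample from its own neighbor set
  t(apply(d, 1, function(row) order(row)[seq_len(k)]))
}

# Mean Jaccard index between the neighbor sets in the two spaces
k <- 10
nnRef <- knn(ref, k)
nnLow <- knn(lowdim, k)
jaccard <- vapply(seq_len(nrow(ref)), function(i) {
  length(intersect(nnRef[i, ], nnLow[i, ])) /
    length(union(nnRef[i, ], nnLow[i, ]))
}, numeric(1))
meanJaccard <- mean(jaccard)
```

Here k plays the role of a single entry of kTM; repeating the neighbor-based calculation for each entry would give one value per neighborhood size.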
Venna J., Kaski S. (2001). Neighborhood preservation in nonlinear projection methods: An experimental study. In Dorffner G., Bischof H., Hornik K., editors, Proceedings of ICANN 2001, pp 485–491. Springer, Berlin.
Lee J.A., Verleysen M. (2009). Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing 72 (7-9):1431-1443.
Chen L., Buja A. (2009). Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. Journal of the American Statistical Association 104:209-219.
Kraemer G., Reichstein M., Mahecha M.D. (2018). dimRed and coRanking - Unifying dimensionality reduction in R. The R Journal 10 (1):342-358.
Sammon J.W. Jr (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers C18(5):401-409.
Rousseeuw P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20:53-65.
data(pbmc3ksub)
set.seed(1)  # subsampling via nSamples is random; fix the seed for reproducibility
dre <- dreval(sce = pbmc3ksub, nSamples = 150)