A function to generate code that can be run to perform differential expression analysis of RNAseq data (comparing two conditions) using the phylolm package. The code is written to a .Rmd file. This function is generally not called by the user, the main interface for performing differential expression analysis is the runDiffExp function.

phylolm.createRmd(
  data.path,
  result.path,
  codefile,
  norm.method,
  model = "BM",
  measurement_error = TRUE,
  extra.design.covariates = NULL,
  length.normalization = "RPKM",
  data.transformation = "log2",
  ...
)

Arguments

data.path

The path to a .rds file containing the phyloCompData object that will be used for the differential expression analysis.

result.path

The path to the file where the result object will be saved.

codefile

The path to the file where the code will be written.

norm.method

The between-sample normalization method used to compensate for varying library sizes and composition in the differential expression analysis. The normalization factors are calculated using the calcNormFactors of the edgeR package. Possible values are "TMM", "RLE", "upperquartile" and "none"

model

The model for trait evolution on the tree. Default to "BM".

measurement_error

A logical value indicating whether there is measurement error. Default to TRUE.

extra.design.covariates

A vector containing the names of extra control variables to be passed to the design matrix of phyolm. All the covariates need to be a column of the sample.annotations data frame from the phyloCompData object, with a matching column name. The covariates can be a numeric vector, or a factor. Note that "condition" factor column is always included, and should not be added here. See Details.

length.normalization

one of "none" (no correction), "TPM" or "RPKM" (default). See details.

data.transformation

one of "log2", "asin(sqrt)" or "sqrt". Data transformation to apply to the normalized data.

...

Further arguments to be passed to function phylolm.

Value

The function generates a .Rmd file containing the code for performing the differential expression analysis. This file can be executed using e.g. the knitr package.

Details

For more information about the methods and the interpretation of the parameters, see the phylolm package and the corresponding publications.

The length.matrix field of the phyloCompData object is used to normalize the counts, using one of the following formulas: * length.normalization="none" : \(CPM_{gi} = \frac{N_{gi} + 0.5}{NF_i \times \sum_{g} N_{gi} + 1} \times 10^6\) * length.normalization="TPM" : \(TPM_{gi} = \frac{(N_{gi} + 0.5) / L_{gi}}{NF_i \times \sum_{g} N_{gi}/L_{gi} + 1} \times 10^6\) * length.normalization="RPKM" : \(RPKM_{gi} = \frac{(N_{gi} + 0.5) / L_{gi}}{NF_i \times \sum_{g} N_{gi} + 1} \times 10^9\)

where \(N_{gi}\) is the count for gene g and sample i, where \(L_{gi}\) is the length of gene g in sample i, and \(NF_i\) is the normalization for sample i, normalized using calcNormFactors of the edgeR package.

The function specified by the data.transformation is then applied to the normalized count matrix.

The "\(+0.5\)" and "\(+1\)" are taken from Law et al 2014, and dropped from the normalization when the transformation is something else than log2.

The "\(\times 10^6\)" and "\(\times 10^9\)" factors are omitted when the asin(sqrt) transformation is taken, as \(asin\) can only be applied to real numbers smaller than 1.

The design model used in the phylolm uses the "condition" column of the sample.annotations data frame from the phyloCompData object as well as all the covariates named in extra.design.covariates. For example, if extra.design.covariates = c("var1", "var2"), then sample.annotations must have two columns named "var1" and "var2", and the design formula in the phylolm function will be: ~ condition + var1 + var2.

References

Ho, L. S. T. and Ane, C. 2014. "A linear-time algorithm for Gaussian and non-Gaussian trait evolution models". Systematic Biology 63(3):397-408.

Law, C.W., Chen, Y., Shi, W. et al. (2014) voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15, R29.

Musser, JM, Wagner, GP. (2015): Character trees from transcriptome data: Origin and individuation of morphological characters and the so‐called “species signal”. J. Exp. Zool. (Mol. Dev. Evol.) 324B: 588– 604.

Author

Charlotte Soneson, Paul Bastide, Mélina Gallopin

Examples

try(
if (require(ape) && require(phylolm)) {
tmpdir <- normalizePath(tempdir(), winslash = "/")
set.seed(20200317)
tree <- rphylo(10, 0.1, 0)
mydata.obj <- generateSyntheticData(dataset = "mydata", n.vars = 1000, 
                                    samples.per.cond = 5, n.diffexp = 100, 
                                    tree = tree,
                                    id.species = 1:10,
                                    lengths.relmeans = rpois(1000, 1000),
                                    lengths.dispersions = rgamma(1000, 1, 1),
                                    output.file = file.path(tmpdir, "mydata.rds"))
## Add covariates
## Model fitted is count.matrix ~ condition + test_factor + test_reg
sample.annotations(mydata.obj)$test_factor <- factor(rep(1:2, each = 5))
sample.annotations(mydata.obj)$test_reg <- rnorm(10, 0, 1)
saveRDS(mydata.obj, file.path(tmpdir, "mydata.rds"))
## Diff Exp
runDiffExp(data.file = file.path(tmpdir, "mydata.rds"), result.extent = "DESeq2", 
           Rmdfunction = "phylolm.createRmd", 
           output.directory = tmpdir,
           norm.method = "TMM",
           extra.design.covariates = c("test_factor", "test_reg"),
           length.normalization = "RPKM")
})
#> Loading required package: phylolm
#> Warning: Vector 'id.species' must be a factor. Transforming.
#> Warning: `id.species` is not named. I'm naming them, assuming they are in the same order as the tree.
#> Error in sample.annotations(mydata.obj)$test_factor <- factor(rep(1:2,  : 
#>   could not find function "sample.annotations"