Calculating JCC scores

Introduction

This vignette describes how to use the jcc package to calculate junction coverage compatibility (JCC_ scores (Soneson et al. 2018) for a given set of genes. In order to run the entire workflow, the following files are required, and will be used as described below:

A BAM file with reads aligned to a reference genome. The package has been tested with BAM files generated by the STAR aligner (Dobin et al. 2013), but files generated by other genome aligners should also work. The reads in the BAM file will be used to fit the fragment bias model with r Biocpkg("alpine") (Love, Hogenesch, and Irizarry 2016) as well as to obtain the predicted transcript coverage profiles.
A data.frame with the number of uniquely mapping and multimapping reads aligned across each annotated splice junction. Below we show how to generate a correctly formatted data frame from the SJ.out.tab text file generated by STAR.
A BSgenome object for the genome that the reads in the BAM file were aligned to.
A gtf file corresponding to the BSgenome and the BAM file.
A transcript-to-gene conversion data.frame with at least two columns, named tx and gene. Additional columns providing, e.g., gene symbols can be included.
A data.frame with estimated transcript abundances, with three columns named transcript, TPM and count.
If you want to create gene summary plots, including an illustration of the observed coverage in a genomic region, you also need to convert the BAM file into a bigWig file. This can be done e.g. as follows:

bedtools genomecov -split -ibam <file.bam> -bg > <file.bedGraph> 
bedGraphToBigWig <file.bedGraph> <chrNameLength.txt> <file.bw>

where you should replace the text in <> by the appropriate file names. The <chrNameLength.txt> is a text file with the length of each chromosome. If you created a genome index with STAR, the index directory contains such a file. The information can also be extracted from the header of the BAM file.

The workflow has mainly been tested with reference files from Ensembl, and may need to be adjusted to work with reference files in other formats.

The general procedure for estimating junction coverage compatibility (JCC) scores for a set of genes is outlined in the figure below. Please refer to the paper (Soneson et al. 2018) for a more detailed description and motivation of the different steps.

drawing

In the rest of this vignette, we will illustrate each step using a small example data set provided with the package.

Fit fragment bias model

The first step in the process is to fit a fragment bias model using the alpine package. The fitAlpineBiasModel() function provides a wrapper containing the necessary steps. Note that the parameters of this function may need to be adapted to be appropriate for a specific data set. In particular, since the example data set only contains a small number of reads and genes, we set minCount and minLength to small values to make sure that some genes will be retained and used to fit the fragment bias model. However, in a real data set these arguments can often be set to higher values. Also, the BSgenome object (here, Hsapiens from the BSgenome.Hsapiens.NCBI.GRCh38 package) should be replaced with the appropriate object for your organism. We refer to the help file and the alpine package for more detailed explanations of each parameter.

## Load necessary packages
suppressPackageStartupMessages({
  library(jcc)
  library(dplyr)
  library(BSgenome.Hsapiens.NCBI.GRCh38)
})
#> Error in get(paste0(generic, ".", class), envir = get_method_env()) : 
#>   object 'type_sum.accel' not found

## Load gtf file and bam file with aligned reads
gtf <- system.file("extdata/Homo_sapiens.GRCh38.90.chr22.gtf.gz", 
                   package = "jcc")
bam <- system.file("extdata/reads.chr22.bam", package = "jcc")

## Fit fragment bias model
biasMod <- fitAlpineBiasModel(gtf = gtf, bam = bam, organism = "Homo_sapiens",
                              genome = Hsapiens, genomeVersion = "GRCh38",
                              version = 90, minLength = 230, maxLength = 7000,
                              minCount = 10, maxCount = 10000, 
                              subsample = TRUE, nbrSubsample = 30, seed = 1,
                              minSize = NULL, maxSize = 220, 
                              verbose = TRUE)
#> Creating TxDb object...
#> Importing GTF file ... OK
#> Processing genes ...
#> Warning in ensDbFromGRanges(GTF, outfile = outfile, path = path, organism =
#> organism, : I'm missing column(s): 'entrezid'. The corresponding database
#> column(s) will be empty!
#>  Attribute availability:
#>   o gene_id ... OK
#>   o gene_name ... OK
#>   o entrezid ... Nope
#>   o gene_biotype ... OK
#> OK
#> Processing transcripts ... 
#>  Attribute availability:
#>   o transcript_id ... OK
#>   o gene_id ... OK
#>   o transcript_biotype ... OK
#>   o transcript_name ... OK
#> OK
#> Processing exons ... OK
#> Processing chromosomes ... Fetch seqlengths from ensembl ... OK
#> Processing metadata ... OK
#> Generating index ... OK
#>   -------------
#> Verifying validity of the information in the database:
#> Checking transcripts ... OK
#> Checking exons ... OK
#> Import genomic features from the file as a GRanges object ... OK
#> Prepare the 'metadata' data frame ... OK
#> Make the TxDb object ...
#> Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type
#>   stop_codon. This information was ignored.
#> OK
#> Getting list of transcripts...
#> Generating list of exons by transcript...
#> Selecting genes with a single isoform...
#> 149 single-isoform genes found.
#> Filtering out too short and too long transcripts...
#> 125 transcripts remaining.
#> Reading bam file...
#> Subsetting to medium-to-highly expressed genes...
#> 28 genes retained.
#> Subsampling genes for fitting bias model...
#> 28 genes retained.
#> Getting fragment widths and read lengths...
#> Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, :   3 alignments with ambiguous pairing were dumped.
#>     Use 'getDumpedAlignments()' to retrieve them from the dump environment.
#> Building fragment types...
#> Fitting bias model...
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored

The output of fitAlpineBiasModel is a list with three objects:

biasModel: the fitted fragment bias model from alpine.
exonsByTx: a GRangesList, grouping exons by transcript.
transcripts: a GRanges object with all reference transcripts.

names(biasMod)
#> [1] "biasModel"   "exonsByTx"   "transcripts"

biasMod$exonsByTx
#> GRangesList object of length 510:
#> $ENST00000624155
#> GRanges object with 2 ranges and 3 metadata columns:
#>       seqnames            ranges strand |   exon_id       exon_name exon_rank
#>          <Rle>         <IRanges>  <Rle> | <integer>     <character> <integer>
#>   [1]       22 11066418-11066515      + |         1 ENSE00003758096         1
#>   [2]       22 11067985-11068174      + |         2 ENSE00003758861         2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $ENST00000422332
#> GRanges object with 2 ranges and 3 metadata columns:
#>       seqnames            ranges strand |   exon_id       exon_name exon_rank
#>          <Rle>         <IRanges>  <Rle> | <integer>     <character> <integer>
#>   [1]       22 11124337-11124379      + |         3 ENSE00001616995         1
#>   [2]       22 11124508-11125705      + |         4 ENSE00001776473         2
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $ENST00000634617
#> GRanges object with 9 ranges and 3 metadata columns:
#>       seqnames            ranges strand |   exon_id       exon_name exon_rank
#>          <Rle>         <IRanges>  <Rle> | <integer>     <character> <integer>
#>   [1]       22 12602466-12602527      + |         5 ENSE00003784219         1
#>   [2]       22 12604670-12604739      + |         6 ENSE00003784834         2
#>   [3]       22 12615409-12615534      + |         7 ENSE00003785913         3
#>   [4]       22 12616314-12616371      + |         8 ENSE00003790031         4
#>   [5]       22 12618171-12618289      + |         9 ENSE00003786552         5
#>   [6]       22 12620838-12620938      + |        10 ENSE00003788654         6
#>   [7]       22 12624242-12624331      + |        11 ENSE00003791453         7
#>   [8]       22 12625317-12625428      + |        12 ENSE00003788301         8
#>   [9]       22 12626607-12626642      + |        13 ENSE00003788969         9
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> ...
#> <507 more elements>

biasMod$transcripts
#> GRanges object with 510 ranges and 2 metadata columns:
#>                   seqnames            ranges strand |         tx_name
#>                      <Rle>         <IRanges>  <Rle> |     <character>
#>   ENST00000624155       22 11066418-11068174      + | ENST00000624155
#>   ENST00000422332       22 11124337-11125705      + | ENST00000422332
#>   ENST00000634617       22 12602466-12626642      + | ENST00000634617
#>   ENST00000613107       22 15273855-15273961      + | ENST00000613107
#>   ENST00000448473       22 15290718-15297196      + | ENST00000448473
#>               ...      ...               ...    ... .             ...
#>   ENST00000473551       22 19979997-19990792      - | ENST00000473551
#>   ENST00000487793       22 19981152-19987190      - | ENST00000487793
#>   ENST00000492625       22 19981382-19987155      - | ENST00000492625
#>   ENST00000462319       22 19981545-19983753      - | ENST00000462319
#>   ENST00000467828       22 19981991-20016745      - | ENST00000467828
#>                           gene_id
#>                       <character>
#>   ENST00000624155 ENSG00000279973
#>   ENST00000422332 ENSG00000226444
#>   ENST00000634617 ENSG00000283023
#>   ENST00000613107 ENSG00000276138
#>   ENST00000448473 ENSG00000236235
#>               ...             ...
#>   ENST00000473551 ENSG00000099889
#>   ENST00000487793 ENSG00000099889
#>   ENST00000492625 ENSG00000099889
#>   ENST00000462319 ENSG00000099889
#>   ENST00000467828 ENSG00000099889
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Predict transcript coverage profiles

Next, we use the fitted fragment bias model to predict the coverage profile for each transcript in a set of specified genes. The genes argument can be set to NULL to predict the coverage profiles for all transcripts in the annotation catalog. Note, however, that this is computationally demanding if the number of transcripts is large. The nCores argument can be increased to predict coverage profiles for several transcripts in parallel.

## Load transcript-to-gene conversion table
tx2gene <- readRDS(system.file("extdata/tx2gene.sub.rds", package = "jcc"))
head(tx2gene)
#>                   tx            gene symbol   gene_biotype      tx_biotype
#> 4524 ENST00000621271 ENSG00000070371 CLTCL1 protein_coding  protein_coding
#> 4525 ENST00000427926 ENSG00000070371 CLTCL1 protein_coding  protein_coding
#> 4526 ENST00000412649 ENSG00000070371 CLTCL1 protein_coding retained_intron
#> 4527 ENST00000622493 ENSG00000070371 CLTCL1 protein_coding  protein_coding
#> 4528 ENST00000617926 ENSG00000070371 CLTCL1 protein_coding  protein_coding
#> 4529 ENST00000615606 ENSG00000070371 CLTCL1 protein_coding retained_intron

## Predict transcript coverage profiles
predCovProfiles <- predictTxCoverage(biasModel = biasMod$biasModel, 
                                     exonsByTx = biasMod$exonsByTx, 
                                     bam = bam, tx2gene = tx2gene, 
                                     genome = Hsapiens,
                                     genes = c("ENSG00000070371",
                                               "ENSG00000093010"), 
                                     nCores = 1, verbose = TRUE)
#> Predicting coverage for 23 transcripts...
#> ENST00000361682
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000403184
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000403710
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000407537
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000467943
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000412786
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000207636
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000406520
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000449653
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000493893
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000428707
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000621271
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000427926
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000412649
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000622493
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000617926
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000615606
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000617103
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000538828
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000611723
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000458188
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000540896
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...
#> ENST00000449918
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Predicting junction coverages...

length(predCovProfiles)
#> [1] 23

The output of predictTxCoverage is a list with one element per reference transcript annotated to any of the selected genes. Each element, in turn, is a list with five components:

predCov: an Rle with the predicted coverage at each position in the transcript.
strand: the strand of the transcript.
junctionCov: a GRanges object with the location and predicted coverage of each junction in the transcript.
aveFragLength: the estimated average fragment length.
note: either ‘covOK’, ‘covError’ or ‘covNA’. If ‘covOK’, alpine was able to predict a coverage profile. If ‘covError’ or ‘covNA’, the coverage profile could not be predicted, most likely because the transcript is shorter than the fragment length or because no reads in the BAM file overlapped the transcript. In both these cases, we impose a uniform coverage profile for the transcript.

predCovProfiles[2]
#> $ENST00000403184
#> $ENST00000403184$predCov
#> numeric-Rle of length 2216 with 2216 runs
#>   Lengths:           1           1           1 ...           1           1
#>   Values :   0.0000000   0.0614766   0.7640371 ... 0.002547412 0.000580614
#> 
#> $ENST00000403184$strand
#> [1] "+"
#> 
#> $ENST00000403184$junctionCov
#> GRanges object with 4 ranges and 1 metadata column:
#>       seqnames            ranges strand |  pred.cov
#>          <Rle>         <IRanges>  <Rle> | <numeric>
#>   [1]       22 19941898-19961198      + |  2.794797
#>   [2]       22 19961290-19962526      + |  0.435794
#>   [3]       22 19962816-19963565      + |  0.814605
#>   [4]       22 19963760-19964167      + |  0.426161
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $ENST00000403184$aveFragLength
#> [1] 187.9594
#> 
#> $ENST00000403184$note
#> [1] "covOK"

Scale transcript coverage profiles

Based on the predicted transcript coverage profiles obtained above, and a set of estimated transcript abundances, we next determine the number of reads predicted to map across each annotated splice junction.

## Read transcript abundance estimates
txQuants <- readRDS(system.file("extdata/quant.sub.rds", package = "jcc"))

## Scale predicted coverages based on the estimated abundances
txsc <- scaleTxCoverages(txCoverageProfiles = predCovProfiles, 
                         txQuants = txQuants, tx2gene = tx2gene,
                         strandSpecific = TRUE, methodName = "Salmon", 
                         verbose = TRUE)

scaleTxCoverages returns a list with two elements:

junctionPredCovs: A data.frame with the predicted coverage for each junction, scaled by the estimated transcript abundances and summarized across all transcripts.
txQuants: A data.frame with estimated transcript abundances, including information about whether or not the coverage profile could be predicted (‘covOK’) or if a uniform coverage was imposed (‘covError’ or ‘covNA’).

names(txsc)
#> [1] "junctionPredCovs" "txQuants"

head(txsc$junctionPredCovs)
#>   seqnames    start      end width strand   predCov
#> 1       22 19941898 19961198 19301      +  75.58930
#> 2       22 19961290 19962526  1237      +  10.66403
#> 3       22 19962816 19963565   750      + 339.82310
#> 4       22 19963760 19964167   408      + 181.68201
#> 5       22 19964300 19968535  4236      + 159.39402
#> 6       22 19961290 19962232   943      +  50.41439
#>                                                                                                                                        transcript
#> 1                                                                 ENST00000361682,ENST00000403184,ENST00000403710,ENST00000407537,ENST00000467943
#> 2                                                 ENST00000361682,ENST00000403184,ENST00000467943,ENST00000412786,ENST00000207636,ENST00000406520
#> 3                 ENST00000361682,ENST00000403184,ENST00000403710,ENST00000407537,ENST00000412786,ENST00000207636,ENST00000406520,ENST00000449653
#> 4 ENST00000361682,ENST00000403184,ENST00000403710,ENST00000407537,ENST00000412786,ENST00000406520,ENST00000449653,ENST00000493893,ENST00000428707
#> 5                                 ENST00000361682,ENST00000403710,ENST00000407537,ENST00000412786,ENST00000207636,ENST00000406520,ENST00000449653
#> 6                                                                                                                                 ENST00000403710
#>              gene method
#> 1 ENSG00000093010 Salmon
#> 2 ENSG00000093010 Salmon
#> 3 ENSG00000093010 Salmon
#> 4 ENSG00000093010 Salmon
#> 5 ENSG00000093010 Salmon
#> 6 ENSG00000093010 Salmon

head(txsc$txQuants)
#>        transcript            gene   count      TPM method covNote
#> 1 ENST00000621271 ENSG00000070371 263.734 0.748398 Salmon   covOK
#> 2 ENST00000427926 ENSG00000070371 170.231 0.462868 Salmon   covOK
#> 3 ENST00000412649 ENSG00000070371  59.429 0.259538 Salmon   covOK
#> 4 ENST00000622493 ENSG00000070371  39.974 0.415456 Salmon   covOK
#> 5 ENST00000617926 ENSG00000070371   0.000 0.000000 Salmon   covOK
#> 6 ENST00000615606 ENSG00000070371   0.000 0.000000 Salmon   covOK

Combine observed and predicted junction coverages

In order to calculate JCC scores, we need to match the predicted junction coverages obtained above with the observed coverages obtained by aligning the reads to the genome. Assuming the reads were aligned to the genome with STAR, we have a text file (whose name ends with SJ.out.tab) containing the number of reads spanning each annotated junction. The code below shows how to read this file, add column names and combine these observed junction coverages with the predicted ones obtained above. If the reads were aligned to the genome with another aligner than STAR, the observed junction coverages may need to be determined separately and formatted in the appropriate way.

## Read the SJ.out.tab file from STAR, add column names and encode the strand
## information as +/-
jcov <- read.delim(system.file("extdata/sub.SJ.out.tab", package = "jcc"),
                   header = FALSE, as.is = TRUE) %>%
  setNames(c("seqnames", "start", "end", "strand", "motif", "annot",
             "uniqreads", "mmreads", "maxoverhang")) %>% 
  dplyr::mutate(strand = replace(strand, strand == 1, "+")) %>%
  dplyr::mutate(strand = replace(strand, strand == 2, "-")) %>%
  dplyr::select(seqnames, start, end, strand, uniqreads, mmreads) %>%
  dplyr::mutate(seqnames = as.character(seqnames))

head(jcov)
#>   seqnames    start      end strand uniqreads mmreads
#> 1       22 10848028 10870192      +         0       1
#> 2       22 10941781 10944966      -         3       1
#> 3       22 10947419 10949211      -         4       0
#> 4       22 10949270 10950048      -         1       0
#> 5       22 10950136 10959066      -         3       0
#> 6       22 10950175 10959066      -         2      11

## Combine the observed and predicted junction coverages
combCov <- combineCoverages(junctionCounts = jcov, 
                            junctionPredCovs = txsc$junctionPredCovs,
                            txQuants = txsc$txQuants)
#> Joining with `by = join_by(seqnames, start, end, gene)`

The output of combineCoverages is a list with three elements:

junctionCovs: A data.frame with the observed and predicted coverages for each junction, aggregated across all transcripts. The predCov column contains the predicted number of spanning reads, and the uniqreads column will be used to represent the observed number of junction-spanning reads.
txQuants: A data.frame with estimated transcript abundances, including information about whether or not the coverage profile could be predicted (‘covOK’) or if a uniform coverage was imposed (‘covError’ or ‘covNA’).
geneQuants: A data.frame with estimated gene abundances as well as the total number uniquely mapping and multi-mapping reads spanning any of the junctions in the gene.

names(combCov)
#> [1] "junctionCovs" "txQuants"     "geneQuants"

head(combCov$junctionCovs)
#>   junctionid seqnames    start      end width strand   predCov
#> 1         J1       22 19941898 19961198 19301      +  75.58930
#> 2         J4       22 19961290 19962526  1237      +  10.66403
#> 3         J8       22 19962816 19963565   750      + 339.82310
#> 4        J10       22 19963760 19964167   408      + 181.68201
#> 5        J12       22 19964300 19968535  4236      + 159.39402
#> 6         J5       22 19961290 19962232   943      +  50.41439
#>                                                                                                                                        transcript
#> 1                                                                 ENST00000361682,ENST00000403184,ENST00000403710,ENST00000407537,ENST00000467943
#> 2                                                 ENST00000361682,ENST00000403184,ENST00000467943,ENST00000412786,ENST00000207636,ENST00000406520
#> 3                 ENST00000361682,ENST00000403184,ENST00000403710,ENST00000407537,ENST00000412786,ENST00000207636,ENST00000406520,ENST00000449653
#> 4 ENST00000361682,ENST00000403184,ENST00000403710,ENST00000407537,ENST00000412786,ENST00000406520,ENST00000449653,ENST00000493893,ENST00000428707
#> 5                                 ENST00000361682,ENST00000403710,ENST00000407537,ENST00000412786,ENST00000207636,ENST00000406520,ENST00000449653
#> 6                                                                                                                                 ENST00000403710
#>              gene method uniqreads mmreads fracunique
#> 1 ENSG00000093010 Salmon        70       2  0.9722222
#> 2 ENSG00000093010 Salmon        74       0  1.0000000
#> 3 ENSG00000093010 Salmon       160       0  1.0000000
#> 4 ENSG00000093010 Salmon       365       0  1.0000000
#> 5 ENSG00000093010 Salmon       402       0  1.0000000
#> 6 ENSG00000093010 Salmon         0       0  1.0000000

head(combCov$txQuants)
#>        transcript            gene   count      TPM method covNote
#> 1 ENST00000621271 ENSG00000070371 263.734 0.748398 Salmon   covOK
#> 2 ENST00000427926 ENSG00000070371 170.231 0.462868 Salmon   covOK
#> 3 ENST00000412649 ENSG00000070371  59.429 0.259538 Salmon   covOK
#> 4 ENST00000622493 ENSG00000070371  39.974 0.415456 Salmon   covOK
#> 5 ENST00000617926 ENSG00000070371   0.000 0.000000 Salmon   covOK
#> 6 ENST00000615606 ENSG00000070371   0.000 0.000000 Salmon   covOK

head(combCov$geneQuants)
#>              gene method    count       TPM nbr_expressed_transcripts
#> 1 ENSG00000070371 Salmon  547.001  2.103382                         1
#> 2 ENSG00000093010 Salmon 1697.000 45.966095                         1
#>   nbr_expressed_transcripts_5p covOKfraction uniqjuncreads mmjuncreads
#> 1                            5             1           469           0
#> 2                            3             1          1073           4
#>   uniqjuncfraction
#> 1         1.000000
#> 2         0.996286

Calculate JCC scores

Finally, we calculate scaled predicted junction coverages and summarize the differences between the observed and predicted junction coverages for each gene by the JCC (junction coverage compatibility) scores.

## Calculate a JCC score for each gene
jcc <- calculateJCCScores(junctionCovs = combCov$junctionCovs, 
                          geneQuants = combCov$geneQuants)

head(jcc$junctionCovs)
#>   junctionid seqnames    start      end width strand   predCov
#> 1         J1       22 19941898 19961198 19301      +  75.58930
#> 2         J4       22 19961290 19962526  1237      +  10.66403
#> 3         J8       22 19962816 19963565   750      + 339.82310
#> 4        J10       22 19963760 19964167   408      + 181.68201
#> 5        J12       22 19964300 19968535  4236      + 159.39402
#> 6         J5       22 19961290 19962232   943      +  50.41439
#>                                                                                                                                        transcript
#> 1                                                                 ENST00000361682,ENST00000403184,ENST00000403710,ENST00000407537,ENST00000467943
#> 2                                                 ENST00000361682,ENST00000403184,ENST00000467943,ENST00000412786,ENST00000207636,ENST00000406520
#> 3                 ENST00000361682,ENST00000403184,ENST00000403710,ENST00000407537,ENST00000412786,ENST00000207636,ENST00000406520,ENST00000449653
#> 4 ENST00000361682,ENST00000403184,ENST00000403710,ENST00000407537,ENST00000412786,ENST00000406520,ENST00000449653,ENST00000493893,ENST00000428707
#> 5                                 ENST00000361682,ENST00000403710,ENST00000407537,ENST00000412786,ENST00000207636,ENST00000406520,ENST00000449653
#> 6                                                                                                                                 ENST00000403710
#>              gene method uniqreads mmreads fracunique jccscore scaledCov
#> 1 ENSG00000093010 Salmon        70       2  0.9722222     0.73  96.15727
#> 2 ENSG00000093010 Salmon        74       0  1.0000000     0.73  13.56573
#> 3 ENSG00000093010 Salmon       160       0  1.0000000     0.73 432.28947
#> 4 ENSG00000093010 Salmon       365       0  1.0000000     0.73 231.11796
#> 5 ENSG00000093010 Salmon       402       0  1.0000000     0.73 202.76537
#> 6 ENSG00000093010 Salmon         0       0  1.0000000     0.73  64.13222
#>     methodscore
#> 1 Salmon (0.73)
#> 2 Salmon (0.73)
#> 3 Salmon (0.73)
#> 4 Salmon (0.73)
#> 5 Salmon (0.73)
#> 6 Salmon (0.73)

head(jcc$geneScores)
#>              gene method    count       TPM nbr_expressed_transcripts
#> 1 ENSG00000070371 Salmon  547.001  2.103382                         1
#> 2 ENSG00000093010 Salmon 1697.000 45.966095                         1
#>   nbr_expressed_transcripts_5p covOKfraction uniqjuncreads mmjuncreads
#> 1                            5             1           469           0
#> 2                            3             1          1073           4
#>   uniqjuncfraction jccscore
#> 1         1.000000     0.52
#> 2         0.996286     0.73

calculateJCCScores returns a list with two elements:

junctionCovs: A data.frame with observed and predicted junction coverages, as well as scaled predicted coverages (scaledCov column) and JCC scores for the corresponding gene (jccscore).
geneScores: A data.frame with gene-level abundances, extended with the calculated JCC scores (jccscore column).

Plot observed and predicted junction coverages

To investigate the agreement between the observed and predicted junction coverages for a given gene, we can plot the uniqreads and scaledCov columns from the junction coverage data frame generated in the previous step:

selGene <- "ENSG00000070371"
ggplot2::ggplot(as.data.frame(jcc$junctionCovs) %>%
                  dplyr::filter(gene == selGene), 
                ggplot2::aes(x = uniqreads, y = scaledCov)) + 
  ggplot2::geom_abline(intercept = 0, slope = 1) + 
  ggplot2::geom_point(size = 5) + 
  ggplot2::theme_bw() + 
  ggplot2::xlab("Number of uniquely mapped reads spanning junction") + 
  ggplot2::ylab("Scaled predicted junction coverage") + 
  ggplot2::ggtitle(unique(jcc$junctionCovs %>% 
                            dplyr::filter(gene == selGene) %>%
                            dplyr::pull(methodscore)))

The package also contains a function to generate a multi-panel plot, showing (i) the observed genome coverage profile in the region of the gene, (ii) the estimated transcript abundances, and (iii) the association between the observed and predicted junction coverages.

bwFile <- system.file("extdata/reads.chr22.bw", package = "jcc")
plotGeneSummary(gtf = gtf, junctionCovs = jcc$junctionCovs, 
                useGene = "ENSG00000070371", bwFile = bwFile,
                txQuants = combCov$txQuants, txCol = "transcript_id",
                geneCol = "gene_id", exonCol = "exon_id")
#> Warning: Package `magick` is required to draw images. Image not drawn.
#> Warning in get_plot_component(plot, "guide-box"): Multiple components found;
#> returning the first one. To return all, use `return_all = TRUE`.

Session info

sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.7.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000
#>  [2] BSgenome_1.68.0                       
#>  [3] rtracklayer_1.60.1                    
#>  [4] Biostrings_2.68.1                     
#>  [5] XVector_0.40.0                        
#>  [6] GenomicRanges_1.52.1                  
#>  [7] GenomeInfoDb_1.36.4                   
#>  [8] IRanges_2.34.1                        
#>  [9] S4Vectors_0.38.2                      
#> [10] BiocGenerics_0.46.0                   
#> [11] dplyr_1.1.4                           
#> [12] jcc_0.1.0                             
#> [13] BiocStyle_2.28.1                      
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          rstudioapi_0.17.1          
#>   [3] jsonlite_1.8.9              magrittr_2.0.3             
#>   [5] GenomicFeatures_1.52.2      farver_2.1.2               
#>   [7] rmarkdown_2.29              fs_1.6.5                   
#>   [9] BiocIO_1.10.0               zlibbioc_1.46.0            
#>  [11] ragg_1.3.3                  vctrs_0.6.5                
#>  [13] memoise_2.0.1               Rsamtools_2.16.0           
#>  [15] RCurl_1.98-1.16             base64enc_0.1-3            
#>  [17] htmltools_0.5.8.1           S4Arrays_1.0.6             
#>  [19] progress_1.2.3              curl_6.0.1                 
#>  [21] Formula_1.2-5               sass_0.4.9                 
#>  [23] bslib_0.8.0                 htmlwidgets_1.6.4          
#>  [25] desc_1.4.3                  Gviz_1.44.2                
#>  [27] cachem_1.1.0                GenomicAlignments_1.36.0   
#>  [29] lifecycle_1.0.4             pkgconfig_2.0.3            
#>  [31] Matrix_1.6-5                R6_2.5.1                   
#>  [33] fastmap_1.2.0               GenomeInfoDbData_1.2.10    
#>  [35] MatrixGenerics_1.12.3       digest_0.6.37              
#>  [37] colorspace_2.1-1            AnnotationDbi_1.62.2       
#>  [39] textshaping_0.4.1           Hmisc_5.2-1                
#>  [41] RSQLite_2.3.9               labeling_0.4.3             
#>  [43] filelock_1.0.3              polyclip_1.10-7            
#>  [45] httr_1.4.7                  abind_1.4-8                
#>  [47] compiler_4.3.3              withr_3.0.2                
#>  [49] bit64_4.5.2                 speedglm_0.3-5             
#>  [51] htmlTable_2.4.3             backports_1.5.0            
#>  [53] BiocParallel_1.34.2         DBI_1.2.3                  
#>  [55] biglm_0.9-3                 ggforce_0.4.2              
#>  [57] biomaRt_2.56.1              MASS_7.3-60.0.1            
#>  [59] rappdirs_0.3.3              DelayedArray_0.26.7        
#>  [61] rjson_0.2.23                tools_4.3.3                
#>  [63] foreign_0.8-87              scatterpie_0.2.4           
#>  [65] nnet_7.3-19                 glue_1.8.0                 
#>  [67] restfulr_0.0.15             grid_4.3.3                 
#>  [69] checkmate_2.3.2             cluster_2.1.8              
#>  [71] generics_0.1.3              gtable_0.3.6               
#>  [73] tidyr_1.3.1                 ensembldb_2.24.1           
#>  [75] data.table_1.16.4           hms_1.1.3                  
#>  [77] xml2_1.3.6                  pillar_1.10.0              
#>  [79] stringr_1.5.1               yulab.utils_0.1.8          
#>  [81] splines_4.3.3               tweenr_2.0.3               
#>  [83] BiocFileCache_2.8.0         lattice_0.22-6             
#>  [85] bit_4.5.0.1                 deldir_2.0-4               
#>  [87] biovizBase_1.48.0           tidyselect_1.2.1           
#>  [89] RBGL_1.76.0                 knitr_1.49                 
#>  [91] gridExtra_2.3               bookdown_0.41              
#>  [93] ProtGenerics_1.32.0         SummarizedExperiment_1.30.2
#>  [95] xfun_0.49                   Biobase_2.60.0             
#>  [97] matrixStats_1.4.1           stringi_1.8.4              
#>  [99] ggfun_0.1.8                 lazyeval_0.2.2             
#> [101] yaml_2.3.10                 evaluate_1.0.1             
#> [103] codetools_0.2-20            interp_1.1-6               
#> [105] tibble_3.2.1                BiocManager_1.30.25        
#> [107] graph_1.78.0                cli_3.6.3                  
#> [109] rpart_4.1.23                systemfonts_1.1.0          
#> [111] munsell_0.5.1               jquerylib_0.1.4            
#> [113] dichromat_2.0-0.1           Rcpp_1.0.13-1              
#> [115] dbplyr_2.5.0                png_0.1-8                  
#> [117] XML_3.99-0.17               parallel_4.3.3             
#> [119] alpine_1.26.0               pkgdown_2.1.1.9000         
#> [121] ggplot2_3.5.1               blob_1.2.4                 
#> [123] prettyunits_1.2.0           latticeExtra_0.6-30        
#> [125] jpeg_0.1-10                 AnnotationFilter_1.24.0    
#> [127] bitops_1.0-9                VariantAnnotation_1.46.0   
#> [129] scales_1.3.0                purrr_1.0.2                
#> [131] crayon_1.5.3                rlang_1.1.4                
#> [133] cowplot_1.1.3               KEGGREST_1.40.1

References

Dobin, Alexander, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. 2013. “STAR: Ultrafast Universal RNA-seq Aligner.” Bioinformatics 29 (1): 15–21.

Love, Michael I, John B Hogenesch, and Rafael A Irizarry. 2016. “Modeling of RNA-seq Fragment Sequence Bias Reduces Systematic Errors in Transcript Abundance Estimation.” Nat. Biotechnol. 34 (12): 1287–91.

Soneson, Charlotte, Michael I Love, Rob Patro, Shobbir Hussain, Dheeraj Malhotra, and Mark D Robinson. 2018. “A Junction Coverage Compatibility Score to Quantify the Reliability of Transcript Abundance Estimates and Annotation Catalogs.” bioRxiv 10.1101/378539.

Charlotte Soneson

Michael I Love

Rob Patro

Shobbir Hussain

Dheeraj Malhotra

Mark D Robinson

2024-12-21