library(devil)
library(scRNAseq)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#>     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#>     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
#> Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
#> 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
#> Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
#> 'DelayedArray::makeNindexFromArrayViewport' when loading 'HDF5Array'

This vignette provides a step-by-step guide to using the devil package for differential expression analysis of single-cell RNA sequencing (scRNA-seq) data. We’ll start by loading a dataset, preparing the data, fitting the model, and then testing for differential expression between conditions. Finally, we’ll visualize the results using a volcano plot.

Input data

First, we load a single-cell RNA dataset from the scRNAseq package. This dataset contains RNA counts for individual cells and metadata describing the features of each cell.

data <- scRNAseq::ReprocessedFluidigmData()

Extracting Counts and Metadata

To fit a model using devil, we need to extract the RNA counts and the associated metadata.

counts <- data@assays@data[[1]]
metadata <- data@colData

print(dim(counts))  # Dimensions of the counts matrix
#> [1] 26255   130
print(dim(metadata))  # Dimensions of the metadata
#> [1] 130  28
print(head(metadata))  # Display the first few rows of metadata
#> DataFrame with 6 rows and 28 columns
#>               NREADS  NALIGNED    RALIGN TOTAL_DUP    PRIMER INSERT_SZ
#>            <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> SRR1275356  10554900   7555880   71.5862   58.4931 0.0217638       208
#> SRR1274090    196162    182494   93.0323   14.5122 0.0366826       247
#> SRR1275251   8524470   5858130   68.7213   65.0428 0.0351827       230
#> SRR1275287   7229920   5891540   81.4884   49.7609 0.0208685       222
#> SRR1275364   5403640   4482910   82.9609   66.5788 0.0298284       228
#> SRR1275269  10729700   7806230   72.7536   50.4285 0.0204349       245
#>            INSERT_SZ_STD COMPLEXITY     NDUPR PCT_RIBOSOMAL_BASES
#>                <numeric>  <numeric> <numeric>           <numeric>
#> SRR1275356            63   0.868928  0.343113               2e-06
#> SRR1274090           133   0.997655  0.935730               0e+00
#> SRR1275251            89   0.789252  0.201082               0e+00
#> SRR1275287            78   0.898100  0.538191               0e+00
#> SRR1275364            76   0.890693  0.391660               0e+00
#> SRR1275269            99   0.879414  0.431169               0e+00
#>            PCT_CODING_BASES PCT_UTR_BASES PCT_INTRONIC_BASES
#>                   <numeric>     <numeric>          <numeric>
#> SRR1275356         0.125806      0.180954           0.613229
#> SRR1274090         0.309822      0.412917           0.205185
#> SRR1275251         0.398461      0.473884           0.039886
#> SRR1275287         0.196420      0.227592           0.498944
#> SRR1275364         0.138617      0.210406           0.543941
#> SRR1275269         0.333077      0.354635           0.248331
#>            PCT_INTERGENIC_BASES PCT_MRNA_BASES MEDIAN_CV_COVERAGE
#>                       <numeric>      <numeric>          <numeric>
#> SRR1275356             0.080008       0.306760           1.495770
#> SRR1274090             0.072076       0.722739           1.007580
#> SRR1275251             0.087770       0.872345           1.242990
#> SRR1275287             0.077044       0.424013           0.775981
#> SRR1275364             0.107035       0.349024           1.441370
#> SRR1275269             0.063957       0.687712           0.617100
#>            MEDIAN_5PRIME_BIAS MEDIAN_3PRIME_BIAS MEDIAN_5PRIME_TO_3PRIME_BIAS
#>                     <numeric>          <numeric>                    <numeric>
#> SRR1275356           0.000000           0.166122                     1.036250
#> SRR1274090           0.181742           0.698991                     0.293510
#> SRR1275251           0.000000           0.340046                     0.201518
#> SRR1275287           0.010251           0.350915                     0.292838
#> SRR1275364           0.000000           0.204074                     0.619863
#> SRR1275269           0.057960           0.345502                     0.284480
#>            sample_id.x           Lane_ID LibraryName avgLength     spots
#>            <character>       <character> <character> <integer> <integer>
#> SRR1275356   SRX534610 D24VYACXX130502:4      GW16_2       202   9818076
#> SRR1274090   SRX534823                 1       NPC_9        60     95454
#> SRR1275251   SRX534623 D24VYACXX130502:4      GW16_8       202   7935952
#> SRR1275287   SRX534641 D24VYACXX130502:1    GW21+3_2       202   6531944
#> SRR1275364   SRX534614 D24VYACXX130502:7     GW16_23       202   4919561
#> SRR1275269   SRX534632 D24VYACXX130502:4      GW21_8       202   9969377
#>            Biological_Condition Coverage_Type Cluster1 Cluster2
#>                     <character>   <character> <factor> <factor>
#> SRR1275356                 GW16          High     IIIb      III
#> SRR1274090                  NPC           Low     1a        I  
#> SRR1275251                 GW16          High     NA        III
#> SRR1275287               GW21+3          High     1c        I  
#> SRR1275364                 GW16          High     IIIb      III
#> SRR1275269                 GW21          High     NA        I
colnames(metadata)  # Display the column names of metadata
#>  [1] "NREADS"                       "NALIGNED"                    
#>  [3] "RALIGN"                       "TOTAL_DUP"                   
#>  [5] "PRIMER"                       "INSERT_SZ"                   
#>  [7] "INSERT_SZ_STD"                "COMPLEXITY"                  
#>  [9] "NDUPR"                        "PCT_RIBOSOMAL_BASES"         
#> [11] "PCT_CODING_BASES"             "PCT_UTR_BASES"               
#> [13] "PCT_INTRONIC_BASES"           "PCT_INTERGENIC_BASES"        
#> [15] "PCT_MRNA_BASES"               "MEDIAN_CV_COVERAGE"          
#> [17] "MEDIAN_5PRIME_BIAS"           "MEDIAN_3PRIME_BIAS"          
#> [19] "MEDIAN_5PRIME_TO_3PRIME_BIAS" "sample_id.x"                 
#> [21] "Lane_ID"                      "LibraryName"                 
#> [23] "avgLength"                    "spots"                       
#> [25] "Biological_Condition"         "Coverage_Type"               
#> [27] "Cluster1"                     "Cluster2"

The output shows that we have 130 cells described by 28 features, along with RNA counts for 26,255 genes.

Preprocessing: Filtering Non-Expressed Genes

Before proceeding, let’s remove genes that are not expressed in any of the cells. This helps in reducing the noise and computational load.

counts <- counts[rowSums(counts) > 0,]
dim(counts)
#> [1] 17085   130

After filtering, we retain approximately 17,000 genes.

Creating the Design Matrix

Next, we create a design_matrix, which is a matrix that includes the features of each cell. The model.matrix function is an easy way to generate this matrix, requiring a formula and the metadata as input.

design_matrix <- model.matrix(~Biological_Condition, data = metadata)
print(unique(metadata$Biological_Condition))  # Display unique biological conditions
#> [1] "GW16"   "NPC"    "GW21+3" "GW21"
head(design_matrix)  # Display the first few rows of the design matrix
#>            (Intercept) Biological_ConditionGW21 Biological_ConditionGW21+3
#> SRR1275356           1                        0                          0
#> SRR1274090           1                        0                          0
#> SRR1275251           1                        0                          0
#> SRR1275287           1                        0                          1
#> SRR1275364           1                        0                          0
#> SRR1275269           1                        1                          0
#>            Biological_ConditionNPC
#> SRR1275356                       0
#> SRR1274090                       1
#> SRR1275251                       0
#> SRR1275287                       0
#> SRR1275364                       0
#> SRR1275269                       0

Here, the design matrix models each cell based on its Biological_Condition. The intercept in this model represents the “GW16” condition.

Fitting the model

With the data prepared, we can now fit the model using the fit_devil function. This function estimates the coefficients (beta) and overdispersion for each gene across different conditions.

fit <- devil::fit_devil(
  as.matrix(counts), 
  design_matrix, 
  overdispersion = T, 
  size_factors = T, 
  verbose = T, 
  parallel.cores = 1
)
#> Compute size factors
#> Initialize beta estimate
#> Fit beta coefficients
#> Fit overdispersion

The fit object contains several important components, including:

  • beta : matrix of coefficients (n_genes x n_features)
  • overdispersion : vector of coefficients (n_genes)

Testing for Differential Expression

In order to test the data you need to specify your null hypothesis using a contrast vector cc. Considering a gene gg along with its inferred coefficient βg\beta_g, the null hypothesis H0H_0 is usually defined as

H0:iciβg,i=0 H_0 : \sum_i c_i \beta_{g,i} = 0 For example, if you are interested in the genes that are differentially expressed between the “GW21” and the “NPC” condition, you need to find the genes for which we strongly reject the null hypothesis

$$ \beta_{GW21} = \beta_{NPC} \hspace{5mm} \rightarrow \hspace{5mm} \beta_{GW21} - \beta_{NPC} = 0$$ which is equivalent to defining the contrast vector c=(0,1,0,1)c = (0,1,0,-1). Once the contrast vector is defined, you can test the null hypothesis using the test_de function.

contrast <- c(0, 1, 0, -1)
test_res <- devil::test_de(fit, contrast, max_lfc = 20)
if (!('name' %in% colnames(test_res))) {
  test_res$name <- as.character( 1:nrow(test_res) )
}
head(test_res)
#> # A tibble: 6 × 4
#>     pval adj_pval    lfc name 
#>    <dbl>    <dbl>  <dbl> <chr>
#> 1 0.774     1     -18.8  1    
#> 2 1.00      1       1.66 2    
#> 3 0.997     1       1.66 3    
#> 4 0.0115    0.137   6.70 4    
#> 5 0.817     1     -19.4  5    
#> 6 0.999     1       1.66 6

The results contains, for each gene

  • pval : p-value associated with the statistical test
  • adj_pval : p-value corrected considering multiple testing
  • lfc : log2 fold change of gene expression between the tested conditions

Visualizing the Results with a Volcano Plot

Finally, we can visualize the results using a volcano plot, which highlights genes based on their log-fold change (lfc) and adjusted p-values (adj_pval). The plot_volcano function from devil makes this easy.

devil::plot_volcano(test_res, lfc_cut = 1, pval_cut = .05, labels = TRUE, point_size = 2)

This plot provides a clear view of the most significant genes, making it easier to identify those that are likely to be biologically relevant.

Summary

In this vignette, we’ve demonstrated how to:

1.  Load and preprocess scRNA-seq data.
2.  Fit a model using the devil package.
3.  Perform differential expression testing between conditions.
4.  Visualize the results using a volcano plot.

The devil package offers a powerful and flexible framework for analyzing single-cell RNA sequencing data, helping researchers uncover meaningful biological insights.