Fits a statistical model to count data, particularly designed for RNA-seq data analysis. The function estimates regression coefficients (beta), gene-wise overdispersion parameters, and normalizes data using size factors. It supports both CPU and (optionally) GPU-based computation with parallel processing capabilities.
Usage
fit_devil(
x,
design_matrix,
assay.type = "counts",
clusters = NULL,
overdispersion = "MOM",
init_overdispersion = NULL,
init_beta_rough = FALSE,
offset = 0,
size_factors = NULL,
verbose = FALSE,
max_iter = 200,
tolerance = 0.001,
CUDA = FALSE,
batch_size = 1024L,
BPPARAM = BiocParallel::SerialParam()
)Arguments
- x
A numeric matrix of count data (genes x samples), a
SummarizedExperiment, or aSingleCellExperimentobject. Rows represent genes/features, columns represent samples/cells. Whenxis an SE/SCE object, the assay named byassay.typeis used.- design_matrix
A numeric matrix of predictor variables (samples x predictors). Each row corresponds to a sample, each column to a predictor variable. Must have
nrow(design_matrix) == ncol(x).- assay.type
Character. Name of the assay to extract when
xis aSummarizedExperimentorSingleCellExperiment. Default:"counts".- clusters
Vector of cluster or patient identifiers (one element per cell/sample). Cells belonging to the same cluster must be contiguous — use
group_data()to sort them first. When provided, enables clustered sandwich covariance estimation. Default:NULL.- overdispersion
Character or logical. Strategy for estimating overdispersion: one of
"new","I","old","MLE","MOM", orFALSEto disable overdispersion fitting (Poisson model). Default:"MOM".- init_overdispersion
Numeric scalar or
NULL. Initial value for the overdispersion parameter used as a starting point for the iterative procedures. IfNULL, an initial value is estimated from the data viaestimate_dispersion(). Recommended value if specified:100. Default:NULL.- init_beta_rough
Logical. Whether to initialize betas in a rough but extremely fast way. Default:
FALSE.- offset
Numeric scalar. Value used when computing the offset vector to avoid numerical issues with zero counts. Default:
0.- size_factors
Character string, numeric vector, or
NULL. Controls normalization for sequencing depth. Options are:NULL(default): No normalization (all size factors set to 1)"normed_sum": Geometric mean normalization"psinorm": Psi-normalization"edgeR": edgeR TMM methodA numeric vector of length
ncol(x): precomputed size factors used directly without further normalization
- verbose
Logical. Whether to print progress messages during execution. Default:
FALSE.- max_iter
Integer. Maximum number of iterations for parameter optimization (both beta and overdispersion routines). Default:
200.- tolerance
Numeric. Convergence criterion for parameter optimization. Default:
1e-3.- CUDA
Logical. Whether to use GPU acceleration (requires CUDA support and a compiled
beta_fit_gpu()implementation). Default:FALSE.- batch_size
Integer. Number of genes to process per batch in GPU mode. Only relevant if
CUDA = TRUE. Default:1024.- BPPARAM
A
BiocParallelParamobject controlling parallel evaluation. Default:BiocParallel::SerialParam().
Value
A list containing:
- beta
Matrix of fitted coefficients (genes x predictors).
- beta_sandwiches_null
List of per-gene Hessian inverse matrices (\(H^{-1}\)).
- beta_sandwiches
List of per-gene clustered sandwich matrices (\(H^{-1} M H^{-1} \times n\_samples\));
NULLentries when noclustersare provided.- overdispersion
Vector of fitted overdispersion parameters (one per gene).
- iterations
List with elements
beta_itersandtheta_itersgiving the number of iterations used for each gene.- size_factors
Vector of computed size factors (one per sample).
- offset_vector
Vector of offset values used in the model (length = number of samples).
- design_matrix
Input design matrix (as provided, possibly coerced to numeric matrix).
- input_matrix
Input count matrix (as provided, possibly coerced to numeric matrix).
- input_parameters
List of used parameter values (
max_iter,tolerance).
Details
The function implements a negative binomial regression model with the following steps:
Computes size factors for data normalization (if requested)
Initializes model parameters including beta coefficients and overdispersion
Fits the regression coefficients using either CPU (parallel) or GPU computation
Optionally fits/updates overdispersion parameters using one of several strategies
The model fitting process uses iterative optimization with configurable convergence criteria and maximum iterations. For large datasets, the GPU implementation processes genes in batches for improved memory efficiency.
Size Factor Methods
Three normalization methods are available when size_factors is a character string:
"normed_sum"(default): Geometric mean normalization based on library sizes. Fast and works well for most datasets."psinorm": Psi-normalization using Pareto distribution MLE. More robust to highly variable genes."edgeR": edgeR's TMM with singleton pairing method. Requires the edgeR package from Bioconductor.
If size_factors = NULL, no normalization is performed and all size factors are set to 1.
If size_factors is a numeric vector of precomputed size factors (one per sample), they will be used directly.
Overdispersion Strategies
The overdispersion argument controls how gene-wise overdispersion is handled:
"old"or"MLE": Overdispersion is fit via the original (legacy) NB MLE-based procedure (with Cox-Reid adjustment insidefit_dispersion())."new"or"I": Overdispersion is fit via the new iterative NB routine implemented infit_overdispersion_cppp(), typically faster and more stable for large single-cell datasets."MOM": Overdispersion is estimated using a method-of-moments approach viaestimate_mom_dispersion_cpp(), which is cheap and provides a rough dispersion estimate.FALSE: Disable overdispersion fitting and use a Poisson model (overdispersion fixed to 0).
Examples
## Example: fit a simple two-group model
set.seed(1)
# Simulate a small counts matrix (genes x cells)
counts <- matrix(
rnbinom(1000, mu = 0.2, size = 1),
nrow = 100, ncol = 10
)
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))
# Two-group design (no intercept)
group <- factor(rep(c("A", "B"), each = 5))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
# Fit from a raw matrix
fit <- fit_devil(
x = counts,
design_matrix = design,
size_factors = "normed_sum"
)
## Example: fit from a SingleCellExperiment object
library(SingleCellExperiment)
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: ‘MatrixGenerics’
#> The following objects are masked from ‘package:matrixStats’:
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: ‘generics’
#> The following objects are masked from ‘package:base’:
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: ‘S4Vectors’
#> The following object is masked from ‘package:utils’:
#>
#> findMatches
#> The following objects are masked from ‘package:base’:
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: Seqinfo
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: ‘Biobase’
#> The following object is masked from ‘package:MatrixGenerics’:
#>
#> rowMedians
#> The following objects are masked from ‘package:matrixStats’:
#>
#> anyMissing, rowMedians
sce <- SingleCellExperiment(assays = list(counts = counts))
fit_sce <- fit_devil(
x = sce,
design_matrix = design,
assay.type = "counts"
)
