Fit Statistical Model for Count Data

Fits a statistical model to count data, particularly designed for RNA sequencing data analysis. The function estimates multiple parameters including regression coefficients (beta), overdispersion parameters, and normalizes data using size factors. It supports both CPU and GPU-based computation with parallel processing capabilities.

fit_devil(
  input_matrix,
  design_matrix,
  overdispersion = TRUE,
  init_overdispersion = NULL,
  do_cox_reid_adjustment = TRUE,
  offset = 1e-06,
  size_factors = TRUE,
  verbose = FALSE,
  max_iter = 100,
  tolerance = 0.001,
  CUDA = FALSE,
  batch_size = 1024L,
  parallel.cores = NULL
)

Arguments

input_matrix: A numeric matrix of count data (genes × samples). Rows represent genes/features, columns represent samples/cells.
design_matrix: A numeric matrix of predictor variables (samples × predictors). Each row corresponds to a sample, each column to a predictor variable.
overdispersion: Logical. Whether to estimate the overdispersion parameter. Set to FALSE for Poisson regression. Default: TRUE
init_overdispersion: Numeric or NULL. Initial value for overdispersion parameter. If NULL, estimates initial value from data. Recommended value if specified: 100. Default: NULL
do_cox_reid_adjustment: Logical. Whether to apply Cox-Reid adjustment in overdispersion estimation. Default: TRUE
offset: Numeric. Value added to counts to avoid numerical issues with zero counts. Default: 1e-6
size_factors: Logical. Whether to compute normalization factors for different sequencing depths. Default: TRUE
verbose: Logical. Whether to print progress messages during execution. Default: FALSE
max_iter: Integer. Maximum number of iterations for parameter optimization. Default: 100
tolerance: Numeric. Convergence criterion for parameter optimization. Default: 1e-3
CUDA: Logical. Whether to use GPU acceleration (requires CUDA support). Default: FALSE
batch_size: Integer. Number of genes to process per batch in GPU mode. Only relevant if CUDA = TRUE. Default: 1024
parallel.cores: Integer or NULL. Number of CPU cores for parallel processing. If NULL, uses all available cores. Default: NULL

Value

A list containing:

beta: Matrix of fitted coefficients (genes × predictors)
overdispersion: Vector of fitted overdispersion parameters (one per gene)
iterations: Vector of iteration counts for convergence (one per gene)
size_factors: Vector of computed size factors (one per sample)
offset_vector: Vector of offset values used in the model
design_matrix: Input design matrix (as provided)
input_matrix: Input count matrix (as provided)
input_parameters: List of used parameter values (max_iter, tolerance, parallel.cores)

Details

The function implements a negative binomial regression model with the following steps:

Computes size factors for data normalization (if requested)
Initializes model parameters including beta coefficients and overdispersion
Fits the model using either CPU (parallel) or GPU computation
Optionally estimates overdispersion parameters

The model fitting process uses iterative optimization with configurable convergence criteria and maximum iterations. For large datasets, the GPU implementation processes genes in batches for improved memory efficiency.

Examples

if (FALSE) { # \dontrun{
# Basic usage with default parameters
fit <- fit_devil(counts, design)

# Using GPU acceleration with custom batch size
fit <- fit_devil(counts, design, CUDA = TRUE, batch_size = 2048)

# Disable overdispersion estimation (Poisson model)
fit <- fit_devil(counts, design, overdispersion = FALSE)
} # }