Power Analysis with rescueSim
October 08, 2025
Source:vignettes/rescueSimPowerVignette.Rmd
rescueSimPowerVignette.RmdIntroduction
One application of the rescueSim is to evaluate the
power to detect within-cell type differential gene expression in paired
and longitudinal scRNA-sequencing studies, as well as other complex
repeated measures designs. These data are inherently high-dimensional
and hierarchical, with multiple sources of variability, making standard
closed-form power or sample size calculations infeasible. Instead,
simulating data and assessing power across different design scenarios
(e.g., varying numbers of samples or cells) offers a practical and
informative approach to study planning.
This vignette focuses on using rescueSim to perform
power analysis for detecting differential gene expression in
longitudinal or repeated-measures single-cell RNA-seq studies. For an
overview of the rescueSim simulation framework and basic
usage, see the Intro to rescueSim vignette.
Running a Power Analysis
Power can be evaluated using the runRescueSimPower
function. This function simulates multiple data sets under different
user-defined scenarios, applies a user-supplied differential expression
(DE) method, and calculates power as the proportion of true DE genes
correctly identified at a specified significance threshold. The table
below outlines key function arguments.
| Argument | Description | Default |
|---|---|---|
baseParams |
A RescueSimParams object containing the base simulation
parameters. These will be updated according to each scenario. |
— |
scenarios |
A data.frame specifying the scenarios to simulate. Each
row corresponds to one scenario, with columns for parameter values to
update in baseParams. Columns must match fields in
RescueSimParams. |
— |
deFunction |
A user-supplied function that takes a simulated
SingleCellExperiment object and returns a
data.frame with columns gene and
padj. |
— |
nSim |
Number of simulations to run for each scenario. | 1 |
padjThresh |
Significance threshold (e.g., FDR cutoff) to define true positive DE calls. | 0.05 |
returnFDR |
Logical indicating whether to calculate and return the false discovery rate (FDR). | FALSE |
conditions |
A character vector of length 2 specifying the conditions to compare
(e.g., c("time1", "time0") or
c(time0_group0, time1_group1) for two group designs). If
NULL, the comparison is inferred based on available
conditions in rowData. |
NULL |
saveSimPath |
Path to directory where simulated data sets will be saved as
.rds files (if desired). If NULL, simulated
data sets are not saved. |
NULL |
saveDePath |
Path to directory where DE result files will be saved as
.rds files (if desired). If NULL, DE results
are not saved. |
NULL |
verbose |
Logical indicating whether to print progress messages during simulation. | TRUE |
... |
Additional arguments passed to simRescueSimDat,
deFunction, or other internal functions. |
— |
The function returns a data.frame containing power (and
FDR if requested) for each simulation along with the scenario settings
and comparison conditions.
Setting up the data
For this example, we will use a data set included with the
rescueSim package. The data set contains gene expression
data for recruited airspace macrophage (RAM) cells from bronchoalveolar
lavage samples collected from healthy adults. Data were collected from
five subjects at two time points per subject. the data set was subset to
include 1,940 genes and 970 cells. The genes included were assessed to
be invariant across time points.
To reduce computational cost, particularly because the differential expression analysis in this example uses MAST with random effects, which is computationally intensive, we further filter the data to include only 100 genes. Note that MAST does offer a parallelization option that can speed up the analysis. Because the data set is so small, the power estimates in this example will not be meaningful for study planning/method benchmarking. Rather, this example is used solely to illustrate the package functionality.
## Load packages
library(rescueSim)
library(SingleCellExperiment)
## Load data
data("RecAM_sce")
## Set seed and randomly select 100 cells
set.seed(24)
RecAM_sce <- RecAM_sce[sample(1:nrow(RecAM_sce), 100),]
## Estimate params from filtered data
simParams<-estRescueSimParams(RecAM_sce, sampleVariable = "sampleID",
subjectVariable = "subjectID",
timepointVariable = "time")Differential expression function for the power analysis
To run a power analysis using runRescueSimPower, you
must supply a custom function that performs differential expression (DE)
analysis on each simulated data set. this function should return a data
frame containing at least:
-
gene: a gene identifier -
padj: adjusted p-value (e.g. FDR adjustment)
This design allows users to select a DE method appropriate for their study design.
In this example, we use MAST with random effects to
model the hierarchical structure of our paired scRNA-seq data.
MAST allows the inclusion of random effects for sample and
subject, making it a reasonable choice for accounting for correlation in
our data. However, it is important to note that the performance of
MAST (and most other DE methods) on paired/longitudinal
single-cell data has not been fully evaluated. The results of this power
analysis should be interpreted with this limitation in mind.
Below is the custom DE function we will use. This function filters data to include only genes with expression in at least 10 of cells, applies log2(CPM+1) normalization, computes cellular detection rate, fits a MAST model with random effects for sample and subject, and returns a table with adjusted p-values for the time effect.
deFun <- function(sce) {
# Include only genes expressed in at least 10% of cells
sce <- sce[rowMeans(counts(sce) != 0) >= 0.1, ]
# Log2(CPM + 1) normalization
normcounts(sce) <- apply(counts(sce), 2, function(x) {
log2((1e6 * x / sum(x)) + 1)
})
# Convert to SingleCellAssay
sca <- suppressMessages(MAST::FromMatrix(as.matrix(normcounts(sce)),
colData(sce)))
# Calculate cellular detection rate
cdr <- colSums(SummarizedExperiment::assay(sca) > 0)
colData(sca)$cngeneson <- scale(cdr)
# Fit model
suppressMessages(
suppressWarnings({
zlmCond <- MAST::zlm(
form = ~ cngeneson + time + (1 | sampleID) + (1 | subjectID),
sca,
method = "glmer",
ebayes = FALSE,
strictConvergence = TRUE, silent = T
)}
)
)
# Summarize LRT for time
suppressMessages(
suppressWarnings({
raw_res <- MAST::summary(zlmCond, doLRT = "timetime1")
})
)
sum_tab <- data.frame(raw_res$datatable)
# Keep only the H (hurdle) component rows
sum_tab <- sum_tab[sum_tab$component == "H", ]
# Add padj and gene columns
sum_tab$padj <- p.adjust(sum_tab$`Pr..Chisq.`, method = "BH")
sum_tab$gene <- sum_tab$primerid
return(sum_tab)
}Running the power analysis
We will run runRescueSimPower to estimate power across a
couple of illustrative scenarios. We will simulate data sets where
20
of genes exhibit differential expression between two time points with a
log
fold-change of
.
We will evaluate two scenarios:
- Scenario 1: 3 subjects, 200 cells per sample
- Scenario 2: 6 subjects, 100 cells per sample
For simplicity, we will set the minCellsPerSamp and
maxCellsPerSamp parameters to the same value in each
scenario so that the number of cells per sample is fixed.
In practice, simulations like these could be used to explore
trade-offs between sequencing depth and subject recruitment. For
illustration purposes here, we will simulate only 2 data sets per
scenario and are using a very small number of genes, so the results
produced are not meaningful and cannot be used to reflect method
performance or guide study design. The goal is simply to demonstrate how
the runRescueSimPower function can be applied.
## Update differential expression parameters in parameter object
simParams <- updateRescueSimParams(simParams,
paramValues = list(propDE=.2, deLog2FC=.5))
## Define scenarios
scenarios <- data.frame(minCellsPerSamp = c(200, 100),
maxCellsPerSamp = c(200, 100),
nSubjsPerGroup = c(3, 6))
scenarios
#> minCellsPerSamp maxCellsPerSamp nSubjsPerGroup
#> 1 200 200 3
#> 2 100 100 6
## Set seed for reproducibility and run
set.seed(24)
power_res=runRescueSimPower(baseParams = simParams, scenarios = scenarios,
deFunction = deFun, nSim = 2, returnFDR = F)
#> Running scenario 1 of 2
#> Sim 1
#> Sim 2
#> Running scenario 2 of 2
#> Sim 1
#> Sim 2
#> Warning in runRescueSimPower(baseParams = simParams, scenarios = scenarios, :
#> Some padj values are NA and will be removed before calculating power.The output of runRescueSimPower is a data frame where
each row corresponds to one simulated data set, showing the scenario
settings, simulation number, comparison conditions, power, and (if
requested) FDR. Again, with so few genes and simulations here, the
values in this example are purely illustrative and not reliable.
power_res
#> minCellsPerSamp maxCellsPerSamp nSubjsPerGroup sim condition1 condition2
#> 1 200 200 3 1 time0 time1
#> 2 200 200 3 2 time0 time1
#> 3 100 100 6 1 time0 time1
#> 4 100 100 6 2 time0 time1
#> power
#> 1 0.5555556
#> 2 1.0000000
#> 3 0.4827586
#> 4 0.7000000When run with a full data set and more simulations, the output can be summarized with power curves or other visuals to explore design trade-offs.
Session Information
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SingleCellExperiment_1.30.1 SummarizedExperiment_1.38.1
#> [3] Biobase_2.68.0 GenomicRanges_1.60.0
#> [5] GenomeInfoDb_1.44.3 IRanges_2.42.0
#> [7] S4Vectors_0.46.0 BiocGenerics_0.54.0
#> [9] generics_0.1.4 MatrixGenerics_1.20.0
#> [11] matrixStats_1.5.0 rescueSim_0.99.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rdpack_2.6.4 MAST_1.33.0 gridExtra_2.3
#> [4] rlang_1.1.6 magrittr_2.0.4 scater_1.36.0
#> [7] compiler_4.5.1 systemfonts_1.3.1 vctrs_0.6.5
#> [10] reshape2_1.4.4 stringr_1.5.2 pkgconfig_2.0.3
#> [13] crayon_1.5.3 fastmap_1.2.0 backports_1.5.0
#> [16] XVector_0.48.0 scuttle_1.18.0 rmarkdown_2.30
#> [19] nloptr_2.2.1 UCSC.utils_1.4.0 ggbeeswarm_0.7.2
#> [22] ragg_1.5.0 xfun_0.53 bluster_1.18.0
#> [25] cachem_1.1.0 beachmat_2.24.0 jsonlite_2.0.0
#> [28] progress_1.2.3 DelayedArray_0.34.1 BiocParallel_1.42.2
#> [31] prettyunits_1.2.0 irlba_2.3.5.1 parallel_4.5.1
#> [34] cluster_2.1.8.1 R6_2.6.1 bslib_0.9.0
#> [37] stringi_1.8.7 RColorBrewer_1.1-3 limma_3.64.3
#> [40] boot_1.3-31 jquerylib_0.1.4 Rcpp_1.1.0
#> [43] knitr_1.50 Matrix_1.7-3 splines_4.5.1
#> [46] igraph_2.1.4 tidyselect_1.2.1 abind_1.4-8
#> [49] yaml_2.3.10 viridis_0.6.5 codetools_0.2-20
#> [52] lattice_0.22-7 tibble_3.3.0 plyr_1.8.9
#> [55] S7_0.2.0 evaluate_1.0.5 desc_1.4.3
#> [58] pillar_1.11.1 checkmate_2.3.3 reformulas_0.4.1
#> [61] hms_1.1.3 ggplot2_4.0.0 scales_1.4.0
#> [64] minqa_1.2.8 gtools_3.9.5 glue_1.8.0
#> [67] metapod_1.16.0 tools_4.5.1 BiocNeighbors_2.2.0
#> [70] data.table_1.17.8 ScaledMatrix_1.16.0 lme4_1.1-37
#> [73] locfit_1.5-9.12 fs_1.6.6 scran_1.36.0
#> [76] grid_4.5.1 rbibutils_2.3 edgeR_4.6.3
#> [79] nlme_3.1-168 GenomeInfoDbData_1.2.14 beeswarm_0.4.0
#> [82] BiocSingular_1.24.0 vipor_0.4.7 cli_3.6.5
#> [85] rsvd_1.0.5 textshaping_1.0.3 S4Arrays_1.8.1
#> [88] viridisLite_0.4.2 dplyr_1.1.4 gtable_0.3.6
#> [91] sass_0.4.10 digest_0.6.37 SparseArray_1.8.1
#> [94] ggrepel_0.9.6 dqrng_0.4.1 farver_2.1.2
#> [97] htmltools_0.5.8.1 pkgdown_2.1.3 lifecycle_1.0.4
#> [100] httr_1.4.7 statmod_1.5.0 MASS_7.3-65