Skip to contents

Introduction

rescueSim is an R package for simulating single-cell RNA-sequencing (scRNA-seq) data from repeated measures designs, such as longitudinal studies. One benefit of these types of studies is they allow researchers to investigate changes in gene expression over time for a specific cell type. Realistic simulated data data can be used to evaluate analysis methods and assess power under a variety of settings.

The rescueSim package uses a gamma-poisson framework to simulate data from individual cell-types while mimicking important attributes of empirical data such as between-sample variability (i.e. data from cells from the same sample are correlated) and between-subject variability (i.e. data from samples from the same subject are correlated).

In this vignette, we outline:

  • An overview of the simulation method
  • The basic workflow for simulating data
  • How to generate and update an object for storing simulation parameters
  • How to estimate simulation parameters from a data set
  • How to simulate data using simulation parameters
  • The effect of adjusting some of the simulation parameters on the simulated data

Installation

rescueSim is available on github at: https://github.com/ewynn610/rescueSim.

It can be installed using:

if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("https://github.com/ewynn610/rescueSim")

Simulation Overview

rescueSim simulates a data set for a single cell type with gene expression counts for cells from several samples, with multiple samples coming from individual subjects. If a user wants data from several cell types, this process can be repeated to generate data for each cell type of interest.

The simulation process can be broken up into three steps (Figure 1):

  1. Simulate sample-specific mean expression values for each gene, with values for samples from the same subject having an extra level of correlation.
  2. Draw cell-level “true” expression values for each gene using the sample-specific values.
  3. Simulate final count values for each cell with added variability representing technical effects.
Figure 1 Basic simulation framework
Figure 1 Basic simulation framework

As seen in the left hand side of Figure 1, the simulation framework relies on a series of parameters. These can be estimated from a data set provided by the user or supplied manually by the user. Alternatively, predefined parameter values which were estimated from a longitudinal scRNA-Seq data set of bronchoalveolar lavage samples from healthy adult subjects are included in the package. Users can use a combination of predefined or supplied parameters and parameters estimated from empirical data. This is particularly useful in situations where multi-sample/multi-time point data is not available in which case some, but not all, simulation parameters can be estimated from the empirical data.

Workflow Basics

The simulation workflow includes two broad steps:

  1. Generating simulation parameters (using RescueSimParams() and estimateRescueSimParams())
  2. Using the simulation parameters to simulate the data (using simulateRescueData())

The end result is a singleCellExperiment object with simulated data for an individual cell-type. If you want data from several cell types, you can re-run this process, preferably using simulation parameters estimated from data from the cell type(s) of interest.

Throughout the following example we will use gene expression data for recruited airspace macrophage cells from bronchoalveolar lavage samples samples collected from healthy adults. The data has been subset to contain gene expression for 1940 genes from 976 cells. Data was collected from five subjects at two time points per subject.

## Load necessary packages
library(rescueSim)
library(SingleCellExperiment)

## set seed for reproducibility
set.seed(24)

## Load data
data("RecAM_sce")

## Data is held in singleCellExperiment object
class(RecAM_sce)
#> [1] "SingleCellExperiment"
#> attr(,"package")
#> [1] "SingleCellExperiment"

## Data dimensions
dim(RecAM_sce)
#> [1] 1940  976

## Cell level metadata
head(colData(RecAM_sce))
#> DataFrame with 6 rows and 3 columns
#>                       sampleID   subjectID        time
#>                    <character> <character> <character>
#> ACTCCCAAGTCATTGC_2     sample1    subject1       time0
#> TATTGCTTCCTAGCCT_2     sample1    subject1       time0
#> ACGGGTCTCGTCCTTG_2     sample1    subject1       time0
#> AACGAAACATGGGCAA_2     sample1    subject1       time0
#> GAGACCCCACGGTGTC_2     sample1    subject1       time0
#> GGGAAGTTCTCAACGA_2     sample1    subject1       time0

## Five subjects
unique(RecAM_sce$subjectID)
#> [1] "subject1" "subject2" "subject3" "subject4" "subject5"

## Ten samples (two per subject)
unique(RecAM_sce$sampleID)
#>  [1] "sample1"  "sample2"  "sample3"  "sample4"  "sample5"  "sample6" 
#>  [7] "sample7"  "sample8"  "sample9"  "sample10"

## Two different time points
unique(RecAM_sce$time)
#> [1] "time0" "time1"

We’ll begin with a smaller data set of just 40 genes to show the basic structure of the package.

## Subset data to 40 genes
RecAM_sce_small <- RecAM_sce[1:40, ]

Generating simulation parameters

Objects of the class RescueSimParams are used to hold simulation parameters and settings. A parameter object can be generated using the code below.

## Create params object
myParams <- RescueSimParams()

## Param object class
class(myParams)
#> [1] "RescueSimParams"
#> attr(,"package")
#> [1] "rescueSim"

## Examine param object contents
myParams
#> An object of class "RescueSimParams"
#> Slot "nTimepoints":
#> numeric(0)
#> 
#> Slot "twoGroupDesign":
#> logical(0)
#> 
#> Slot "nSubjsPerGroup":
#> numeric(0)
#> 
#> Slot "maxCellsPerSamp":
#> numeric(0)
#> 
#> Slot "minCellsPerSamp":
#> numeric(0)
#> 
#> Slot "logLibMean":
#> numeric(0)
#> 
#> Slot "logLibSD":
#> numeric(0)
#> 
#> Slot "logLibFacVar":
#> numeric(0)
#> 
#> Slot "customLibSizes":
#> numeric(0)
#> 
#> Slot "exprsMean":
#> numeric(0)
#> 
#> Slot "dispersion":
#> numeric(0)
#> 
#> Slot "sampleFacVarMean":
#> numeric(0)
#> 
#> Slot "sampleFacVarSD":
#> numeric(0)
#> 
#> Slot "subjectFacVarMean":
#> numeric(0)
#> 
#> Slot "subjectFacVarSD":
#> numeric(0)
#> 
#> Slot "propDE":
#> [1] 0
#> 
#> Slot "deLog2FC":
#> [1] 0

We can see that the object contains slots for a variety of parameters which are empty by default. These slots represent the parameters in the left hand side of figure 1 used to simulate the data as well as other data properties such as the number of time points or subjects.

The table below gives an details about the simulation parameters.

Parameter Symbol in Fig. 1 (if applicable) Description Acceptable Values
nTimepoints - Number of time points (i.e. samples) per subject Single numeric value >0 representing the number of time points for all subjects.
twoGroupDesign - Logical value indicating whether to simulate two groups (ex. treatment and control group) Logical value with TRUE indicating that two groups should be simulated.
nSubjsPerGroup - Number of subjects per group (if using two group design) or number of total subjects (if only single group) A single numeric value >0 indicating the number of subjects per group.
maxCellsPerSamp - Maximum parameter used when drawing number of cells per sample from a discrete uniform distribution A single numeric value >0 indicating the maximum number of cells per sample, or a vector with length equal to the number of conditions (group/time point combinations) with each value representing the maximum for a single condition, or a vector with length equal to the number of total samples with each value representing the maximum for a single sample.
minCellsPerSamp - Minimum parameter used when drawing number of cells per sample from a discrete uniform distribution A single numeric value >0 indicating the minimum number of cells per sample, or a vector with length equal to the number of conditions (group/time point combinations) with each value representing the minimum for a single condition, or a vector with length equal to the number of total samples with each value representing the minimum for a single sample.
logLibMean μL\mu_L Mean library size (log scale) parameter. Used to draw library size from a log-normal distribution. A single value >0 indicating the mean library size on a log scale.
logLibSD σL\sigma_L Library size standard deviation (log scale) parameter. Used to draw library size from a log-normal distribution. A single value >=0 indicating the standard deviation of library size on a log scale.
customLibSizes - Optional user-provided library sizes on the original (non-log) scale. If provided, library sizes will be drawn from this vector (with replacement) rather than simulated from a log-normal distribution. A numeric vector of positive values representing library sizes on the original scale. If NULL (default), library sizes are drawn from the specified log-normal distribution using logLibMean and logLibSD.
logLibFacVar αd\alpha_d Variance used for drawing sample-level multiplicative factors which give different library size distributions for each sample. Larger values result in larger variation in library size distributions by sample. A single value >=0 with 0 indicating no difference in the library size distributions by sample.
exprsMean μg\mu_g Gene-specific mean expression value representing the average expression of each gene in the data set. If a named vector is given, these names will be used as the gene names in the simulated SingleCellExperiment object. Vector of numeric values >=0 with length equal to desired number of genes in the simulated data where each value indicates the average expression for a single gene.
dispersion ϕg\phi_g Gene-specific dispersion value representing the variation in expression for each gene in the data set. Vector of numeric values >0 with length equal to desired number of genes in the simulated data where each value indicates the dispersion for a single gene.
sampleFacVarMean μb\mu_b Mean used for drawing variance (log-scale) of sample-level multiplicative factors. Larger values result in more between-sample variation. A single numeric value.
sampleFacVarSD σb\sigma_b Standard deviation used for drawing variance (log-scale) of sample-level multiplicative factors. Larger values result in more variation in amount of between-sample variation across different genes. A single numeric value >=0.
subjectVarMean μa\mu_a Mean used for drawing variance (log-scale) of subject-level multiplicative factors. Larger values result in larger between-subject variation. A single numeric value.
subjectFacVarSD σa\sigma_a Standard deviation used for drawing variance (log-scale) of subject-level multiplicative factors. Larger values result in more variation in amount of between-subject variation across different genes. A single numeric value >=0.
propDE - Proportion of genes differentially expressed between time points/groups. Numeric value between 0 and 1.
deLog2FC - Fold change values used for differentially expressed genes. A single numeric value >=0 indicating the positive log-fold change for differentially expressed genes (assigned positive or negative at random); a vector of values (positive and negative) to draw DE log2_2 fold change values from; a named list of log2_2 fold change values for all genes in each condition relative to the reference condition (e.g., time0, group0, time0_group0). The reference condition should not be included in the list.

Individual parameter values can be extracted using the getRescueSimParam function.

## Extract nTimepoints parameter
getRescueSimParam(myParams, "nTimepoints")
#> numeric(0)

Manually adding parameter values

Parameter values can be added manually when initially creating the parameter object by using the parameter name(s) as an argument in the RescueSimParams function followed by the desired value(s).

## Create parameter object with nTimepoints and nSubjsPerGroup pre-set
myParams <- RescueSimParams(nTimepoints = 2, nSubjsPerGroup = 5)
getRescueSimParam(myParams, "nTimepoints")
#> [1] 2
getRescueSimParam(myParams, "nSubjsPerGroup")
#> [1] 5

After a parameter object is created, the values can be updated using updateRescueSimParams with the parameter object and then a list of parameters and associated values.

## Update nTimepoints and nSubjsPerGroup
myParams <- updateRescueSimParams(
  paramObj = myParams,
  paramValues = list(
    nTimepoints = 3,
    nSubjsPerGroup = 10
  )
)
getRescueSimParam(myParams, "nTimepoints")
#> [1] 3
getRescueSimParam(myParams, "nSubjsPerGroup")
#> [1] 10

Estimating parameter values from empirical data

Parameters can be estimated from empirical data sets using the estRescueSimParams function. Users provide a data set in the form of a SingleCellExperiment object with a counts matrix in the counts slot and sample, subject, time point, and group meta data (where applicable) in colData. The table below outlines other function arguments.

Argument Description Default
sce SingleCellExperiment object containing empirical data with a counts matrix in the counts slot. -
paramObj RescueSimParams object with empty slots for parameters which need to be estimated. If NULL, all parameter values will be estimated. NULL
sampleVariable String denoting name of sample identifier variable in the colData of SingleCellExperiment object. If NULL, data is assumed to contain only one sample and parameters sampleFacVarMean and sampleFacVarSD parameters cannot be estimated. NULL
subjectVariable String denoting name of subject identifier variable in the colData of SingleCellExperiment object. If NULL, the data is assumed to contain only one subject and parameters nSubjsPerGroup, subjectFacVarMean, and subjectFacVarSD parameters cannot be estimated. NULL
timepointVariable String denoting name of time point identifier variable in the colData of SingleCellExperiment object. If NULL, the number of time points for the simulated data is set to 1. NULL
groupVariable String denoting name of group identifier variable in the colData of SingleCellExperiment object. If NULL, a single group design is assumed. NULL
nonDEGs Vector containing names or row indices of genes to be used to estimate sample and subject factor parameters. Using genes that are differentially expressed across time points may lead to inaccurate estimates. NULL
cellParamsByCondition Logical value indicating whether maxCellsPerSamp and minCellsPerSamp should be estimated for each condition (group/time point) or overall. FALSE

The below example shows how we can simulate data set using the subsetted recruited macrophage data. We estimate all of the parameters from the data and the empirical data has a single group, so providing a paramObj orgroupVariable is unnecessary. Using genes that are differentially expressed across time points may lead to inaccurate estimates to inaccurate estimates of subject and sample variance parameters (sampleFacVarMean, sampleFacVarSD, subjectFacVarMean, and subjectFacVarSD). The recruited macrophage data set only includes genes that appeared invariant across time points. In data sets where differential expression is believed to be present, a rough list invariant genes can be identified using differential expression methods such as edgeR or DESeq2 and the invariant genes can be passed to the nonDEGs argument.

## Estimate all parameters from the data
myParams_estimated <- estRescueSimParams(
  sce = RecAM_sce_small, paramObj = NULL,
  sampleVariable = "sampleID",
  subjectVariable = "subjectID",
  timepointVariable = "time",
  groupVariable = NULL, nonDEGs = NULL,
  cellParamsByCondition = F
)

myParams_estimated
#> An object of class "RescueSimParams"
#> Slot "nTimepoints":
#> [1] 2
#> 
#> Slot "twoGroupDesign":
#> [1] FALSE
#> 
#> Slot "nSubjsPerGroup":
#> [1] 5
#> 
#> Slot "maxCellsPerSamp":
#> [1] 136
#> 
#> Slot "minCellsPerSamp":
#> [1] 67
#> 
#> Slot "logLibMean":
#> [1] 3.178947
#> 
#> Slot "logLibSD":
#> [1] 0.522717
#> 
#> Slot "logLibFacVar":
#> [1] 0.004475137
#> 
#> Slot "customLibSizes":
#> numeric(0)
#> 
#> Slot "exprsMean":
#>       NOC2L      SCNN1D  AL391244.3  AL645728.1  AL691432.2        NADK 
#> 0.572778264 0.008741053 0.009527545 0.071148282 0.072171575 0.980568794 
#>        GNB1      LRRC47        DFFB    C1orf174      RNF207   CAMTA1-DT 
#> 4.626133728 0.388941905 0.093361548 0.494518440 0.003207091 0.001810623 
#>      SLC2A5  AL928921.2     TMEM201  PIK3CD-AS2    CTNNBIP1      UBIAD1 
#> 0.015460749 0.020178445 0.052752156 0.008627416 0.535365356 0.319103416 
#>        MFN2    SLC25A34      UQCRHL      ZBTB17    ARHGEF19      FBXO42 
#> 0.510133962 0.039608167 0.030590020 0.115291183 0.015493547 0.599248311 
#>   ARHGEF10L  AL035413.1        EMC1       CAPZB       TMCO4       PINK1 
#> 1.497391194 0.011564381 0.443020738 4.303634967 0.233431917 0.579322545 
#>      HNRNPR      ZNF436         ID3        ELOA      LYPLA2       FUCA1 
#> 3.324729001 0.049855385 1.503006201 0.722096065 0.623905753 2.399000540 
#>      SRSF10       SRRM1         RHD     LDLRAP1 
#> 2.088398063 5.044819599 0.004161722 0.134532354 
#> 
#> Slot "dispersion":
#>        NOC2L       SCNN1D   AL391244.3   AL645728.1   AL691432.2         NADK 
#> 9.765625e-05 1.024000e+02 9.765625e-05 1.276298e+01 9.765625e-05 1.835820e-01 
#>         GNB1       LRRC47         DFFB     C1orf174       RNF207    CAMTA1-DT 
#> 1.047582e-01 9.765625e-05 2.531921e+00 9.765625e-05 4.125051e+01 9.765625e-05 
#>       SLC2A5   AL928921.2      TMEM201   PIK3CD-AS2     CTNNBIP1       UBIAD1 
#> 9.765625e-05 9.765625e-05 5.431577e+00 1.024000e+02 9.765625e-05 9.765625e-05 
#>         MFN2     SLC25A34       UQCRHL       ZBTB17     ARHGEF19       FBXO42 
#> 3.701863e-01 1.611721e+01 9.765625e-05 9.765625e-05 9.765625e-05 2.641163e-01 
#>    ARHGEF10L   AL035413.1         EMC1        CAPZB        TMCO4        PINK1 
#> 9.765625e-05 9.765625e-05 9.765625e-05 9.373476e-02 5.244223e-01 2.413057e-01 
#>       HNRNPR       ZNF436          ID3         ELOA       LYPLA2        FUCA1 
#> 8.447181e-02 9.765625e-05 2.564846e+00 9.765625e-05 9.765625e-05 2.701521e-01 
#>       SRSF10        SRRM1          RHD      LDLRAP1 
#> 1.251372e-01 1.211732e-01 9.765625e-05 2.088006e-01 
#> 
#> Slot "sampleFacVarMean":
#> [1] -4.105898
#> 
#> Slot "sampleFacVarSD":
#> [1] 0.6932764
#> 
#> Slot "subjectFacVarMean":
#> [1] -3.853583
#> 
#> Slot "subjectFacVarSD":
#> [1] 0.991917
#> 
#> Slot "propDE":
#> [1] 0
#> 
#> Slot "deLog2FC":
#> [1] 0

If you want to manually set some parameters before estimating others, a RescueSimParams object with blank slots for the parameters you would like estimated can be provided. Any parameters that already have values in the RescueSimParams object will not be estimated. This is particularly useful if using single-sample or single-time point empirical data. Below is an example of how data can be simulated using single-time point empirical data after manually setting some parameters.

## Subset the data to a single time point
RecAM_sce_single_time <- RecAM_sce_small[, RecAM_sce_small$time == "time0"]

## Create a params object with necessary parameters filled
myParams <- RescueSimParams(subjectFacVarMean = -4, subjectFacVarSD = 1, nSubjsPerGroup = 5)

## Estimate remaining parameters from the data
myParams_estimated <- estRescueSimParams(
  sce = RecAM_sce_single_time,
  paramObj = myParams,
  sampleVariable = "sampleID"
)

All of the parameters are estimated from the data except for the parameters that we specified when creating the RescueSimParams object.

myParams_estimated
#> An object of class "RescueSimParams"
#> Slot "nTimepoints":
#> [1] 1
#> 
#> Slot "twoGroupDesign":
#> [1] FALSE
#> 
#> Slot "nSubjsPerGroup":
#> [1] 5
#> 
#> Slot "maxCellsPerSamp":
#> [1] 138
#> 
#> Slot "minCellsPerSamp":
#> [1] 74
#> 
#> Slot "logLibMean":
#> [1] 3.121858
#> 
#> Slot "logLibSD":
#> [1] 0.5599401
#> 
#> Slot "logLibFacVar":
#> [1] 0.004115623
#> 
#> Slot "customLibSizes":
#> numeric(0)
#> 
#> Slot "exprsMean":
#>       NOC2L      SCNN1D  AL391244.3  AL645728.1  AL691432.2        NADK 
#> 0.686218318 0.006215246 0.007633068 0.070614455 0.100593001 0.792710639 
#>        GNB1      LRRC47        DFFB    C1orf174      RNF207   CAMTA1-DT 
#> 4.288298440 0.383281997 0.108778810 0.488922467 0.001990346 0.001987183 
#>      SLC2A5  AL928921.2     TMEM201  PIK3CD-AS2    CTNNBIP1      UBIAD1 
#> 0.014821378 0.011496427 0.043983228 0.004346802 0.485034231 0.240097930 
#>        MFN2    SLC25A34      UQCRHL      ZBTB17    ARHGEF19      FBXO42 
#> 0.415416269 0.054199585 0.051226562 0.111460850 0.017990270 0.610301394 
#>   ARHGEF10L  AL035413.1        EMC1       CAPZB       TMCO4       PINK1 
#> 1.180161571 0.014261528 0.441152623 4.349938165 0.259431948 0.580329663 
#>      HNRNPR      ZNF436         ID3        ELOA      LYPLA2       FUCA1 
#> 2.831232806 0.071010670 1.289427830 0.662611225 0.736945307 2.088311882 
#>      SRSF10       SRRM1         RHD     LDLRAP1 
#> 2.023526287 5.106106212 0.006105399 0.128508501 
#> 
#> Slot "dispersion":
#>        NOC2L       SCNN1D   AL391244.3   AL645728.1   AL691432.2         NADK 
#> 1.146676e-01 1.024000e+02 9.765625e-05 1.320901e+01 9.765625e-05 2.842339e-01 
#>         GNB1       LRRC47         DFFB     C1orf174       RNF207    CAMTA1-DT 
#> 1.295745e-01 9.765625e-05 2.178611e+00 9.765625e-05 9.765625e-05 9.765625e-05 
#>       SLC2A5   AL928921.2      TMEM201   PIK3CD-AS2     CTNNBIP1       UBIAD1 
#> 9.765625e-05 9.765625e-05 6.837777e+00 1.024000e+02 9.765625e-05 9.765625e-05 
#>         MFN2     SLC25A34       UQCRHL       ZBTB17     ARHGEF19       FBXO42 
#> 3.073198e-01 1.592748e+01 9.765625e-05 9.765625e-05 9.765625e-05 3.012721e-01 
#>    ARHGEF10L   AL035413.1         EMC1        CAPZB        TMCO4        PINK1 
#> 9.765625e-05 9.765625e-05 9.765625e-05 1.343976e-01 4.026675e-01 2.689336e-01 
#>       HNRNPR       ZNF436          ID3         ELOA       LYPLA2        FUCA1 
#> 8.084418e-02 9.765625e-05 2.418655e+00 9.765625e-05 9.765625e-05 3.516693e-01 
#>       SRSF10        SRRM1          RHD      LDLRAP1 
#> 1.393887e-01 1.060177e-01 9.765625e-05 4.842573e-02 
#> 
#> Slot "sampleFacVarMean":
#> [1] -3.846468
#> 
#> Slot "sampleFacVarSD":
#> [1] 0.7435008
#> 
#> Slot "subjectFacVarMean":
#> [1] -4
#> 
#> Slot "subjectFacVarSD":
#> [1] 1
#> 
#> Slot "propDE":
#> [1] 0
#> 
#> Slot "deLog2FC":
#> [1] 0

If propDE and deLog2FC are not manually set, they are both set to 0 (no differential expression).

getRescueSimParam(myParams_estimated, "propDE")
#> [1] 0
getRescueSimParam(myParams_estimated, "deLog2FC")
#> [1] 0

Simulating Data

After setting simulation parameters, the simRescueData function can be used to simulate a data set. This function takes as an argument a RescueSimParams object. It returns a SingleCellExperiment.

## Simulate data
mySim <- simRescueData(myParams_estimated)

## Look at class of simulated data
class(mySim)
#> [1] "SingleCellExperiment"
#> attr(,"package")
#> [1] "SingleCellExperiment"

A matrix of counts is held the counts slot. The rownames (genes) take on the names of the exprsMean vector if provided. If exprsMean was estimated from empirical data, these are the gene names from the empirical data.

counts(mySim)[1:5, 1:5]
#>            cell_1 cell_2 cell_3 cell_4 cell_5
#> NOC2L           0      0      2      0      1
#> SCNN1D          0      0      0      0      0
#> AL391244.3      0      0      0      0      0
#> AL645728.1      0      0      0      0      0
#> AL691432.2      0      0      0      0      0

Cell level meta data is held in the object colData.

head(colData(mySim))
#> DataFrame with 6 rows and 4 columns
#>           sampleID   subjectID        time       group
#>        <character> <character> <character> <character>
#> cell_1     sample1    subject1       time0      group0
#> cell_2     sample1    subject1       time0      group0
#> cell_3     sample1    subject1       time0      group0
#> cell_4     sample1    subject1       time0      group0
#> cell_5     sample1    subject1       time0      group0
#> cell_6     sample1    subject1       time0      group0

Log-fold change values for each gene are held in the object rowData.

head(rowData(mySim))
#> DataFrame with 6 rows and 0 columns

Adjusting Specific Parameter Values

Now we’ll give more detail on the effect of adjusting some of the parameter values. We first estimate parameters using the unfiltered recruited macrophage data. We use the resulting object throughout the remainder of the document.

## Estimate parameters using full RecAM_sce object
myParams <- estRescueSimParams(RecAM_sce,
  sampleVariable = "sampleID",
  subjectVariable = "subjectID",
  timepointVariable = "time",
  groupVariable = NULL, nonDEGs = NULL,
  cellParamsByCondition = FALSE
)

Sample Size Adjustments

A question likely to arise in repeated measures scRNA-seq analysis and study planning is how the sample size will impact results. In these experiments, there are different aspects of sample size, all of which can be controlled in using the rescueSim simulation method:

  • Number of time points (samples) per subject (nTimepoints)

  • Number of subjects (nSubjsPerGroup)

  • Number of cells per sample (maxCellsPerSamp, maxCellsPerSamp)

A situation that sometimes arises in scRNA-seq experiments is one in which the number of cells of a particular cell type in a sample is impacted by a group variable (i.e. treatment or control) or a time variable (ex. pre or post-treatment). Thus, we allow the option to specify or estimate one maxCellsPerSampand maxCellsPerSamp parameter to be used for all samples, one maxCellsPerSampand maxCellsPerSamp parameter for each condition (group/time), or one maxCellsPerSampand maxCellsPerSamp parameter for each individual sample. Below is an example of estimating or manually specifying maxCellsPerSampand maxCellsPerSamp parameters for the two time points.

A user can estimate individual parameters for each condition by using cellParamsByCondition=TRUE as a function argument.

## Set cell parameter values to numeric 0 so we can re-estimate
myParams_cellsByCondition <- updateRescueSimParams(
  myParams,
  list(
    maxCellsPerSamp = numeric(0),
    minCellsPerSamp = numeric(0)
  )
)
## Estimate with cellParamsByCondition = T
myParams_cellsByCondition <- estRescueSimParams(RecAM_sce,
  myParams_cellsByCondition,
  sampleVariable = "sampleID",
  subjectVariable = "subjectID",
  timepointVariable = "time",
  cellParamsByCondition = T
)

## Now cell parameters are estimated separately by time point
getRescueSimParam(myParams_cellsByCondition, "maxCellsPerSamp")
#> time0 time1 
#>   138   104
getRescueSimParam(myParams_cellsByCondition, "minCellsPerSamp")
#> time0 time1 
#>    74    67

Alternatively, a user can manually set the parameters with vector values equal to the number of conditions.

## Manually setting parameters
myParams_cellsByCondition <- updateRescueSimParams(
  myParams,
  list(
    maxCellsPerSamp = c(100, 500),
    minCellsPerSamp = c(50, 250)
  )
)

When we simulate data, we can see that the number of cells for time0 samples are all between 50 and 100 while the number of cells for time1 samples are between 250 and 500.

## Simulate data
simDat <- simRescueData(myParams_cellsByCondition)

## Look at number of cells per sample
table(simDat$sampleID, simDat$time)
#>           
#>            time0 time1
#>   sample1     69     0
#>   sample10     0   368
#>   sample2     75     0
#>   sample3     72     0
#>   sample4     72     0
#>   sample5     77     0
#>   sample6      0   361
#>   sample7      0   310
#>   sample8      0   355
#>   sample9      0   257

Library Size Adjustments

The library sizes, or number of counts per cell, can be adjusted in two ways. By default, library sizes are drawn from a log-normal distribution where logLibMean the logLibSD control the overall mean and variability of library sizes across cells. Alternatively, users can supply specific library sizes through the customLibSizes parameter from which library sizes are drawn directly rather than simulated from a distribution. In either case, logLibFacVar introduces variation in library size distributions between samples, mimicking technical effects such as batch variation.

The logLibMean parameter controls the overall average library size (log scale). Below is an example of using a relatively large vs. small value for logLibMean.

## Create two parameter objects, one with large logLibMean, one with small logLibMean
myParams_bigLogLibMean <- updateRescueSimParams(myParams, list(logLibMean = log(5000)))

myParams_smallLogLibMean <- updateRescueSimParams(myParams, list(logLibMean = log(500)))

## Simulate data with both scenarios
simDat_bigLogLibMean <- simRescueData(myParams_bigLogLibMean)
simDat_smallLogLibMean <- simRescueData(myParams_smallLogLibMean)

## Calculate library sizes by summing up the columns of the count data
bigLibSizes <- data.frame(
  libSizes = colSums(counts(simDat_bigLogLibMean)),
  type = "logLibMean = log(5,000)"
)
smallLibSizes <- data.frame(
  libSizes = colSums(counts(simDat_smallLogLibMean)),
  type = "logLibMean = log(500)"
)
libSizes <- rbind(bigLibSizes, smallLibSizes)


## Output boxplot comparing library sizes
boxplot(log(libSizes) ~ type, data = libSizes, xlab = NULL, ylab = "log(Library Size)")

The logLibSD parameter controls the amount of variation in library size across cells. Below is an example of using a relatively large vs. small value for logLibSD.

## Create two parameter objects, one with large logLibSD, one with small logLibSD
myParams_bigLogLibSD <- updateRescueSimParams(myParams, list(logLibSD = log(2)))

myParams_smallLogLibSD <- updateRescueSimParams(myParams, list(logLibSD = log(1.25)))

## Simulate data with both scenarios
simDat_bigLogLibSD <- simRescueData(myParams_bigLogLibSD)
simDat_smallLogLibSD <- simRescueData(myParams_smallLogLibSD)

## Calculate library sizes by summing up the columns of the count data
bigLibSizeSD <- data.frame(
  libSizes = colSums(counts(simDat_bigLogLibSD)),
  type = "logLibSD = log(2)"
)
smallLibSizeSD <- data.frame(
  libSizes = colSums(counts(simDat_smallLogLibSD)),
  type = "logLibSD = log(1.25)"
)
libSizes <- rbind(bigLibSizeSD, smallLibSizeSD)


## Output boxplot comparing library sizes
boxplot(log(libSizes) ~ type, data = libSizes, xlab = NULL, ylab = "log(Library Size)")

Alternatively, users can provide custom library sizes via the customLibSizes parameter. When specified, these values are used directly (sampled with replacement) instead of drawing library sizes from a log-normal distribution. This can be useful in cases where the log-normal distribution does not reflect the empirical distribution of library sizes.

Below we illustrate how you could create custom library sizes based of an empirical distribution by identifying the sample in the data set with a mean library size closest to the overall mean across samples, and use that sample’s library sizes for simulation.

## Calculate library sizes for each cell
empLibSizes <- data.frame(
  libSizes = colSums(counts(RecAM_sce)),
  sample = RecAM_sce$sampleID,
  type="Empirical"
)

## Calculate mean library size for each sample
samples <- unique(empLibSizes$sample)
lib_size_means <- sapply(samples, function(s) {
  mean(empLibSizes$libSizes[empLibSizes$sample == s])
})

## Find the sample closest to the overall mean
overall_mean <- mean(lib_size_means)
idx_min <- which.min(abs(lib_size_means - overall_mean))
samp_min <- samples[idx_min]

## Get custom library sizes from that sample
customLibSizes <- empLibSizes$libSizes[empLibSizes$sample == samp_min]

## Update parameters and simulate
myParams_customLib <- updateRescueSimParams(myParams, list(customLibSizes = customLibSizes))
simDat_customLib <- simRescueData(myParams_customLib)

## Compare empirical and simulated library sizes

simLibSizes <- data.frame(
  libSizes = colSums(counts(simDat_customLib)),
  sample=simDat_customLib$sampleID,
  type = "Simulated (customLibSizes)"
)

libSizes <- rbind(empLibSizes, simLibSizes)

## Boxplot comparing library sizes
boxplot(log(libSizes) ~ type,
        data=libSizes,
        ylab = "log(Library Size)", xlab = NULL)

In multi-sample scRNA-Seq data, there are often differences in the distributions of library size across samples caused by technical batch effects since cells from individual samples are sequenced together in a batch. To account for this, we draw multiplicative factors, centered around one, which we multiply with logLibMean so that the mean used to draw the library sizes varies by sample. The parameter logLibFacVar represents the estimated variance between these multiplicative factors. A value of 0 means there is no variance in the library size means, and thus the distribution of library sizes is approximately the same for each sample. Lager values of logLibFacVar indicate larger between-sample differences in library size distribution. Below is an example of using a big and small value for logLibFacVar and how it affects the library size distribution by sample.

## Create two parameter objects, one with large logLibFacVar, one with small logLibFacVar
myParams_smalllogLibFacVar <- updateRescueSimParams(myParams, list(logLibFacVar = 0))

myParams_biglogLibFacVar <- updateRescueSimParams(myParams, list(logLibFacVar = .005))



## Simulate data with both scenarios
simDat_smalllogLibFacVar <- simRescueData(myParams_smalllogLibFacVar)
simDat_biglogLibFacVar <- simRescueData(myParams_biglogLibFacVar)

## Calculate library sizes by summing up the columns of the count data
smalllogLibFacVar <- data.frame(
  libSizes = colSums(counts(simDat_smalllogLibFacVar)),
  type = "Small logLibFacVar",
  sample = simDat_smalllogLibFacVar$sampleID
)
biglogLibFacVar <- data.frame(
  libSizes = colSums(counts(simDat_biglogLibFacVar)),
  type = "Big logLibFacVar",
  sample = simDat_biglogLibFacVar$sampleID
)



## Output boxplot comparing library sizes by samples for both scenarios
boxplot(log(libSizes) ~ sample,
  data = smalllogLibFacVar, main = "logLibFacVar = 0",
  xlab = NULL, ylab = "log(Library Size)"
)

boxplot(log(libSizes) ~ sample,
  data = biglogLibFacVar, main = "logLibFacVar = .005",
  xlab = NULL, ylab = "log(Library Size)"
)

Between Subject/Sample Variation

To simulate between-sample and between-subject variation, rescueSim uses gene specific sample- and subject- level multiplicative factors which are multiplied with the mean gene expression for each gene, with the resulting value being used later in simulation (see Figure 1, step 1). The multiplicative factors are drawn from a distribution with a mean of 1. The variance of this distribution is randomly generated. The parameters sampleFacVarMean and sampleFacVarSD represent the mean and standard deviation (log scale) of the variances for the sample-level multiplicative factors while subjectFacVarMean and subjectFacVarSD are the parameters for the sampling distribution of subject-level variances. Because larger variances result in larger between-sample or -subject variation, increasing the sampleFacVarMean or subjectFacVarMean also leads to more between-sample or -subject variation. Increasing the values of sampleFacVarSD and subjectFacVarSD lead to in more variation in amount of between-sample or -subject variation across genes.

Typically, we are more interested in adjusting the overall amount of between-sample or -subject variation, so we will concentrate on the effect of changing the sampleFacVarMean and subjectFacVarMean parameters.

Using a low value (-10) for sampleFacVarMean leads to little between-sample variation, and so we see in the TSNE plot that the samples don’t cluster together

## Set sampleFacVarMean to -10
myParams_smallsampleFacVarMean <- updateRescueSimParams(myParams, list(sampleFacVarMean = -10))

## Simulate data
simDat_smallsampleFacVarMean <- simRescueData(myParams_smallsampleFacVarMean)

## Log-normalize counts to run TSNE
simDat_smallsampleFacVarMean <- scater::logNormCounts(simDat_smallsampleFacVarMean)

## Run and plot TSNE
simDat_smallsampleFacVarMean <- scater::runTSNE(simDat_smallsampleFacVarMean)

scater::plotTSNE(simDat_smallsampleFacVarMean, colour_by = "sampleID")

If we increase the value to -3, the samples cluster very tightly by sample.

## Set sampleFacVarMean to -3
myParams_bigsampleFacVarMean <- updateRescueSimParams(myParams, list(sampleFacVarMean = -3))

## Simulate data
simDat_bigsampleFacVarMean <- simRescueData(myParams_bigsampleFacVarMean)

## Log-normalize counts to run TSNE
simDat_bigsampleFacVarMean <- scater::logNormCounts(simDat_bigsampleFacVarMean)

## Run and plot TSNE
simDat_bigsampleFacVarMean <- scater::runTSNE(simDat_bigsampleFacVarMean)

scater::plotTSNE(simDat_bigsampleFacVarMean, colour_by = "sampleID")

The same is true if we look at the effect of increasing the subjectFacVarMean on subject-level clustering.

## Set subjectFacVarMean to -3
myParams_smallsubjectFacVarMean <- updateRescueSimParams(myParams, list(subjectFacVarMean = -10))

## Simulate data
simDat_smallsubjectFacVarMean <- simRescueData(myParams_smallsubjectFacVarMean)

## Log-normalize counts to run TSNE
simDat_smallsubjectFacVarMean <- scater::logNormCounts(simDat_smallsubjectFacVarMean)

## Run and plot TSNE
simDat_smallsubjectFacVarMean <- scater::runTSNE(simDat_smallsubjectFacVarMean)

scater::plotTSNE(simDat_smallsubjectFacVarMean, colour_by = "subjectID")

## Set subjectFacVarMean to -3
myParams_bigsubjectFacVarMean <- updateRescueSimParams(myParams, list(subjectFacVarMean = -3))

## Simulate data
simDat_bigsubjectFacVarMean <- simRescueData(myParams_bigsubjectFacVarMean)

## Log-normalize counts to run TSNE
simDat_bigsubjectFacVarMean <- scater::logNormCounts(simDat_bigsubjectFacVarMean)

## Run and plot TSNE
simDat_bigsubjectFacVarMean <- scater::runTSNE(simDat_bigsubjectFacVarMean)

scater::plotTSNE(simDat_bigsubjectFacVarMean, colour_by = "subjectID")

Differential Expression

rescueSim allows the user to simulate differential expression between time points/groups using two parameters which must be manually set by the user: propDE which controls the proportion of genes differentially expressed, and deLog2FC which controls the log2_2 fold change values of differentially expressed genes. A user can specify a single positive log2_2 fold change value for deLog2FC, in which case differential expression is evenly split between up- and down-regulation in the proportion of genes specified by propDE. Alternatively, a vector of positive and/or negative values can be provided, from which log2_2 fold change values from differentially expressed genes are drawn. In both of these cases, when simulating multiple time points, the log2_2 fold change represents the change between the first (time0) and last time point, with gene expression changes simulated linearly across intermediate time points. In a two-group, multi-time point simulation, the log2_2 fold change represents the difference between time0_group0 and the final time point of group1 (timeN_group1), with no differential expression across time points within group0 and a linear change across time within group1.

Alternatively, users can provide a named list of vectors to deLogFC where each element of the list corresponds to the fold change values at each time point (or time point-group combination) relative to the baseline (time0, group0, or time0_group0). This enables simulation of non-linear or custom differential expression patterns across time and groups. When using a list, the propDE parameter is not used and each vector should have length equal to the number of genes in the data set. The baseline condition is not included in the list.

We show below using a TSNE plot from the scater package how differential expression impacts cell clustering.

If we simulate with no differential expression, cells from both time points cluster together.

## Estimate data with no differential expression (default)
simDat_noDE <- simRescueData(myParams)

## Log-normalize counts to run TSNE
simDat_noDE <- scater::logNormCounts(simDat_noDE)

## Run and plot TSNE
simDat_noDE <- scater::runTSNE(simDat_noDE)

scater::plotTSNE(simDat_noDE, colour_by = "time")

If simulate 20% of genes to be differentially expressed between time points with a ±1\pm1 log2_2 fold change, cells from different time points cluster separately.

## Manually changing DE Params
myParams_de_2timepoints <- updateRescueSimParams(myParams, list(
  propDE = .2, deLog2FC = 1
))

## Estimate data with no differential expression (default)
simDat_de_2timepoints <- simRescueData(myParams_de_2timepoints)

## Log-normalize counts to run TSNE
simDat_de_2timepoints <- scater::logNormCounts(simDat_de_2timepoints)

## Run and plot TSNE
simDat_de_2timepoints <- scater::runTSNE(simDat_de_2timepoints)

scater::plotTSNE(simDat_de_2timepoints, colour_by = "time")

If we simulate three time points, the cells cluster sequentially by time point since rescueSim simulates linear differential expression across time.

## Manually changing DE Params
myParams_de_3timepoints <- updateRescueSimParams(myParams, list(
  propDE = .2, deLog2FC = 1,
  nTimepoints = 3
))

## Estimate data with no differential expression (default)
simDat_de_3timepoints <- simRescueData(myParams_de_3timepoints)

## Log-normalize counts to run TSNE
simDat_de_3timepoints <- scater::logNormCounts(simDat_de_3timepoints)

## Run and plot TSNE
simDat_de_3timepoints <- scater::runTSNE(simDat_de_3timepoints)

scater::plotTSNE(simDat_de_3timepoints, colour_by = "time")

If we simulate a 2-group, 2-time point data set, then the cells from the final time point and last group, which is simulated to be differentially expressed relative to the other groups/times, cluster separately from the rest of the cells.

## Manually changing DE Params
myParams_de_2groups <- updateRescueSimParams(myParams, list(
  propDE = .2, deLog2FC = 1,
  twoGroupDesign = T,
  nTimepoints = 2
))

## Estimate data with no differential expression (default)
simDat_de_2groups <- simRescueData(myParams_de_2groups)

## Make a variable combining group and time
simDat_de_2groups$group_time <- paste(simDat_de_2groups$group,
  simDat_de_2groups$time,
  sep = "_"
)

## Log-normalize counts to run TSNE
simDat_de_2groups <- scater::logNormCounts(simDat_de_2groups)

## Run and plot TSNE
simDat_de_2groups <- scater::runTSNE(simDat_de_2groups)

scater::plotTSNE(simDat_de_2groups, colour_by = "group_time")

Finally, we can simulate differential expression using manually set log2_2 fold change values for each time point or condition, allowing for non-linear changes across time. This is done by providing deLog2FC as a list of named vectors, where each element of the list corresponds to a condition (excluding the reference: time0, group0 or time0_group0). Each vector specifies the log2_2 fold change values for each gene in that condition relative to the reference group.

Below, we show an example of simulating three time points, where we specify log2_2 fold changes for one intermediate time point and larger log_22 fold changes for the final time point. This results in non-linear gene expression changes across time.


## For the 2nd time point, simulate +/- 1 log2FC in 20% of genes
time1_log2FC<-sample(c(-1, 0, 1), nrow(RecAM_sce), replace=T, prob = c(.1, .8, .1))

## For final time point, multiply log2FC by 3 to get +/- 3 log2FC in 20% of genes
time2_log2FC <- time1_log2FC*3

myParams_nonLinearDE <- updateRescueSimParams(myParams, list(
  deLog2FC = list(time1=time1_log2FC, time2=time2_log2FC),
  nTimepoints = 3
))

simDat_nonLinearDE <- simRescueData(myParams_nonLinearDE)

# Log-normalize counts and run TSNE
simDat_nonLinearDE <- scater::logNormCounts(simDat_nonLinearDE)
simDat_nonLinearDE <- scater::runTSNE(simDat_nonLinearDE)

In this case, we can see that the time0 and time1 are much closer together than to time2.

# Plot TSNE colored by time point to visualize clustering
scater::plotTSNE(simDat_nonLinearDE, colour_by = "time")

Session Information

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SingleCellExperiment_1.30.1 SummarizedExperiment_1.38.1
#>  [3] Biobase_2.68.0              GenomicRanges_1.60.0       
#>  [5] GenomeInfoDb_1.44.0         IRanges_2.42.0             
#>  [7] S4Vectors_0.46.0            BiocGenerics_0.54.0        
#>  [9] generics_0.1.4              MatrixGenerics_1.20.0      
#> [11] matrixStats_1.5.0           rescueSim_0.99.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1        viridisLite_0.4.2       dplyr_1.1.4            
#>  [4] vipor_0.4.7             farver_2.1.2            viridis_0.6.5          
#>  [7] fastmap_1.2.0           bluster_1.18.0          digest_0.6.37          
#> [10] rsvd_1.0.5              lifecycle_1.0.4         cluster_2.1.8.1        
#> [13] statmod_1.5.0           magrittr_2.0.3          compiler_4.5.1         
#> [16] rlang_1.1.6             sass_0.4.10             tools_4.5.1            
#> [19] igraph_2.1.4            yaml_2.3.10             data.table_1.17.6      
#> [22] knitr_1.50              labeling_0.4.3          dqrng_0.4.1            
#> [25] S4Arrays_1.8.1          DelayedArray_0.34.1     plyr_1.8.9             
#> [28] RColorBrewer_1.1-3      abind_1.4-8             BiocParallel_1.42.1    
#> [31] Rtsne_0.17              withr_3.0.2             desc_1.4.3             
#> [34] grid_4.5.1              beachmat_2.24.0         edgeR_4.6.2            
#> [37] ggplot2_3.5.2           scales_1.4.0            gtools_3.9.5           
#> [40] MAST_1.33.0             cli_3.6.5               rmarkdown_2.29         
#> [43] crayon_1.5.3            ragg_1.4.0              metapod_1.16.0         
#> [46] httr_1.4.7              reshape2_1.4.4          scuttle_1.18.0         
#> [49] ggbeeswarm_0.7.2        cachem_1.1.0            stringr_1.5.1          
#> [52] parallel_4.5.1          XVector_0.48.0          vctrs_0.6.5            
#> [55] Matrix_1.7-3            jsonlite_2.0.0          BiocSingular_1.24.0    
#> [58] BiocNeighbors_2.2.0     ggrepel_0.9.6           irlba_2.3.5.1          
#> [61] beeswarm_0.4.0          scater_1.36.0           systemfonts_1.2.3      
#> [64] locfit_1.5-9.12         limma_3.64.1            jquerylib_0.1.4        
#> [67] glue_1.8.0              pkgdown_2.1.3           codetools_0.2-20       
#> [70] stringi_1.8.7           gtable_0.3.6            UCSC.utils_1.4.0       
#> [73] ScaledMatrix_1.16.0     tibble_3.3.0            pillar_1.10.2          
#> [76] htmltools_0.5.8.1       GenomeInfoDbData_1.2.14 R6_2.6.1               
#> [79] textshaping_1.0.1       evaluate_1.0.4          lattice_0.22-7         
#> [82] backports_1.5.0         scran_1.36.0            bslib_0.9.0            
#> [85] Rcpp_1.0.14             gridExtra_2.3           SparseArray_1.8.0      
#> [88] checkmate_2.3.2         xfun_0.52               fs_1.6.6               
#> [91] pkgconfig_2.0.3