Skip to contents

Identifying Changes in Allelic Ratio Distribution (Mean or Variance) Across Groups

Introduction

ASPEN can identify genes with differential allelic patterns across groups. Groups can represent cell types, time points, or other experimental factors. If sequencing was performed across multiple time points, group identities correspond directly to the respective time points. Please note, that the batch correction might be required to remove any technical batch effects. Alternatively, cells can be ordered along a developmental trajectory (using tools like Monocle, Slingshot, palantir, etc.), the trajectory is often represented as pseudotime - a probability of a cell to be in a terminal state, bounded within the [0,1][0,1] interval. Since ASPEN detects changes across discrete groups, pseudotime should be binned (eg. based on tertiles or quintiles) before analyses. \ To test for changes in the mean allelic ratio between the groups, ASPEN evaluates whether a gene’s allelic distribution parameters are the same across the groups (H0H_0) or they are group-specific (H1H_1). ASPEN goes through the following steps:\

###Setup

#loading libraries
suppressPackageStartupMessages({
    library(ASPEN)
    library(gridExtra)
    library(openxlsx)
    library(SingleCellExperiment)
    library(scran)
    library(huxtable)
})  

Loading allele-specifc count data

As in other vignettes, we use mouse brain organoids data from CastB6 hybrids (Medina-Cano, 2025). Pseudotime for this data was estimated using palantir (Setty, 2019). For this example, we focus on three cell types representing early neurodevelopment:: radial glial cells (RGCs), intermediate progenitors cells (IPCs), deep layer neurons (cortical neurons).

data("Bl6_Cast_a1")
data("Bl6_Cast_tot")
#load_file <- system.file("extdata", "Bl6Cast_cell_annot.xlsx", package = "ASPEN")
#cell_annot <- read.xlsx(load_file, rowNames = T)
#loading pseudotime assignment
load_time <- system.file("extdata", "pseudotime_Bl6Cast.xlsx", package = "ASPEN")
pseudotime <- read.xlsx(load_time, rowNames = T)
print_md(as_huxtable(head(pseudotime)))
#> ------------------------------------
#>  cell_id                       time 
#> ------------------------ -----------
#>  clone2_TACAGGTCAGAGATGC     0.0142 
#>                                     
#>  clone1_CTCATGCAGAGATCGC     0.0162 
#>                                     
#>  clone1_TAGGTACTCAACCCGG     0.0167 
#>                                     
#>  clone2_GACAGCCCAAACGTGG     0.0167 
#>                                     
#>  clone1_TCCGATCAGTGGTCAG     0.0176 
#>                                     
#>  clone2_AATCACGTCCCTCTAG     0.0182 
#> ------------------------------------

We first select the cell barcodes for which a pseudotime value has been assigned.


Cast_B6_a1 <- Cast_B6_a1[,colnames(Cast_B6_a1) %in% pseudotime$cell_id]
Cast_B6_tot <- Cast_B6_tot[,colnames(Cast_B6_tot) %in% pseudotime$cell_id]

Since ASPEN accepts only discrete groups, we divide the pseudotime vector into five equal-sized bins.

pseudotime <- pseudotime[match(colnames(Cast_B6_tot), 
                               pseudotime$cell_id),]
pseudotime$group <- cut(pseudotime$time, 
                        breaks=c(quantile(pseudotime$time, 
                                          probs = seq(0, 1, by = 0.2))))
#by default the [`cut()`] function skips the first observation - imputing the value manually
pseudotime$group[is.na(pseudotime$group)] <- levels(pseudotime$group)[1]
#adding cell ids to pseudotime obejct row names
rownames(pseudotime) <- pseudotime$cell_id
print_md(as_huxtable(head(pseudotime)))
#> ------------------------------------------------
#>  cell_id                    time group          
#> ------------------------ ------- ---------------
#>  clone1_AAACGAACATGTGCTA   0.546 (0.541,0.566]  
#>                                                 
#>  clone1_AAAGAACCACCTGCAG   0.802 (0.594,1]      
#>                                                 
#>  clone1_AAAGGATGTACTCCGG   0.667 (0.594,1]      
#>                                                 
#>  clone1_AAAGTCCAGCAATTCC   0.608 (0.594,1]      
#>                                                 
#>  clone1_AAAGTCCAGTGCTACT   0.654 (0.594,1]      
#>                                                 
#>  clone1_AAAGTCCGTAACCCGC  0.0415 (0.0142,0.424] 
#> ------------------------------------------------

Checking the number of cells per pseudotime bin

print_md(as_huxtable(table(pseudotime$group)))
#> ---------------------------
#>                 V1         
#> --------------- -----------
#>  (0.0142,0.424] 685        
#>                            
#>   (0.424,0.541] 685        
#>                            
#>   (0.541,0.566] 684        
#>                            
#>   (0.566,0.594] 685        
#>                            
#>       (0.594,1] 685        
#> ---------------------------

Counts normalisation

We first normalize the raw single-cell counts using the [computeSumFactors()] function from scran package (Lun, et al. 2016). We then create a SingleCellExperiment object using the total and reference allele count matrices.


ase_sce <- SingleCellExperiment(assays = list(a1 = as.matrix(Cast_B6_a1),
                                              tot = as.matrix(Cast_B6_tot)))

Lowly expressed genes (expressed in less than 10 cells) are removed.

#removing lowly expressed genes
ase_sce <- ase_sce[rowSums(assays(ase_sce)[['tot']] > 1) >= 10, ]
dim(ase_sce)
#> [1] 1546 3424

#adding sample id to the metadata
colData(ase_sce)$replicate <- gsub("_.*", "", rownames(colData(ase_sce)))

#calculate size factors
ase_sce <- computeSumFactors(ase_sce, 
                             clusters=colData(ase_sce)$replicate, assay.type = "tot")

The reference allele and total counts are normalized in parallel using the same size factor estimates.

#normalizing counts
ase_sce  <- logNormCounts(ase_sce, 
                          size.factors = colData(ase_sce)$sizeFactor,
                          log = NULL, transform = "none", assay.type = "tot", name = "tot_norm")

#normalizing reference counts by the same size factors
ase_sce  <- logNormCounts(ase_sce, 
                          size.factors = colData(ase_sce)$sizeFactor,
                          log = NULL, transform = "none", assay.type = "a1", name = "a1_norm")

#checking that normalised counts assays are added to the SingleCellExperiment object
ase_sce@assays
#> An object of class "SimpleAssays"
#> Slot "data":
#> List of length 4
#> names(4): a1 tot tot_norm a1_norm

Estimating beta-binomial parameters

We start by estimating beta-binomial distribution parameters for each gene across all cells. These estimates will be used to calculate the likelihood under the null hypothesis H0H_0, which assumes no differences in allelic ratio across time points.

#extracting raw counts which will be used to estimate the model parameters
tot_mat <- as.matrix(assays(ase_sce)[['tot']])
a1_mat <- as.matrix(assays(ase_sce)[['a1']])

global_params <- estim_bbparams(a1_mat, tot_mat, min_cells = 5, cores = 6)

Defining lowly expressed genes

ASPEN applies shrinkage selectively - genes with very low dispersion are not moderated and their allelic imbalance is evaluated on the unadjusted values. Genes with stable dispersion are determined based on the residuals from the dispersion modelling step. ASPEN calculates the meadian absolute deviation-squared (MAD2{MAD}^2), which is used as a cut-off.

min_cutoff <- calc_mad(global_params)
min_cutoff
#> [1] 0.001777655

Estimate appropriate shrinkage parameters

We estimate shrinkage parameters, NN and δ\delta, on each cell type separately. As an option, genes with very low dispersion can be excluded from the estimation by setting a minimum cut-off value with thetaFilter parameter

set.seed(1001011)
shrink_pars <- estim_delta(global_params, thetaFilter = round(min_cutoff,3))
shrink_pars
#>     N delta 
#>    17    19

Performing Bayesian shrinkage

global_shrunk <- correct_theta(global_params, 
                               delta_set = shrink_pars[1], 
                               N_set = shrink_pars[2], 
                               thetaFilter = min_cutoff)

Visualizing the local model fit and the shrunk dispersion estimates

fit_plot <- plot_disp_fit_theta(global_shrunk, midpoint = 200)
shrunk_plot <- plot_disp(global_shrunk) + 
  geom_hline(yintercept = log(1e-03), linetype = "dashed", linewidth = 1)
  
grid.arrange(fit_plot, shrunk_plot, ncol = 2)
#> Warning: Removed 31 rows containing non-finite outside the scale range
#> (`stat_pointdensity()`).
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_line()`).
#> Removed 1 row containing missing values or values outside the scale range
#> (`geom_line()`).

Under the alternative hypothesis, we assume that the ASE distributions differ between the time points. To obtain the group-level estimates, we split the count matrices by the time point assignment and repeat the beta-binomial parameter estimation for each group.

#splitting pseudotime assignment by group
psedotime_bins <- split(pseudotime, f = pseudotime$group)


ase_sce_bybin <- list()
for (i in 1:length(psedotime_bins)){
    ase_sce_bybin[[i]] <- ase_sce[,rownames(psedotime_bins[[i]])]
}

#only using genes that are expressed in at least 10 cells 
ase_sce_bybin <- lapply(ase_sce_bybin, function(q) q[rowSums(assays(q)[['tot']] > 1) >= 10, ]) 

#extracting total counts for each pseudotime bin
tot_mat_bybin <- lapply(ase_sce_bybin, function(q) as.matrix(assays(q)[['tot']]))

#extracting reference allele counts
a1_mat_bybin <- lapply(ase_sce_bybin, function(q) as.matrix(assays(q)[['a1']]))
#selecting genes that matched filtering criteria
a1_mat_bybin <- mapply(function(p,q) p[rownames(q), ], a1_mat_bybin, tot_mat_bybin, SIMPLIFY = F)

#Estimating distribution parameters
group_params <- mapply(function(p, q) estim_bbparams(p, q, min_cells = 5, cores = 6),
                       a1_mat_bybin, tot_mat_bybin, SIMPLIFY = F)

#removing groups where optim did not converge
#group_params <- group_params[!is.na(group_params$bb_theta),]
#group_params <- as.data.frame(group_params)

We then apply Bayesian shrinkage to the group-level beta-binomial parameter estimates.


shrunk_group_params <- lapply(group_params, function(q) 
                                            correct_theta(q, 
                                                          delta_set = shrink_pars[1], 
                                                          N_set = shrink_pars[2], 
                                                          thetaFilter = min_cutoff))

Visualizing the local model fit when each group-level observations are treated as independent variables.


samples <- list(levels(pseudotime$group)[1],
                levels(pseudotime$group)[2],
                levels(pseudotime$group)[3],
                levels(pseudotime$group)[4],
                levels(pseudotime$group)[5])

p_disp <- mapply(function(p,q) plot_disp_fit_theta(p, midpoint = 300) +
                               labs(subtitle = q) +
                               geom_hline(yintercept = log(min_cutoff), linetype = "dashed"),
                 shrunk_group_params, samples, SIMPLIFY = F)
do.call(grid.arrange, c(p_disp, ncol = 3))
#> Warning: Removed 86 rows containing non-finite outside the scale range
#> (`stat_pointdensity()`).
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 91 rows containing non-finite outside the scale range
#> (`stat_pointdensity()`).
#> Warning: Removed 71 rows containing non-finite outside the scale range
#> (`stat_pointdensity()`).
#> Warning: Removed 79 rows containing non-finite outside the scale range
#> (`stat_pointdensity()`).
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 76 rows containing non-finite outside the scale range
#> (`stat_pointdensity()`).

ASPEN expects the group level estimates for each gene. To prepare these, we add the gene name and group identifier to each estimates object, combine the results into a single data frame and split the data frame by gene. The shrunk_params_gene object is a list where each element contains the group-level estimates for a single gene.

shrunk_group_params <- mapply(function(p,q) {p$group <- q;
                                             p$gene <- rownames(p);
                                             return(p)}, shrunk_group_params, samples, SIMPLIFY = F)

shrunk_params_comb <- do.call(rbind, shrunk_group_params)
shrunk_params_gene <- split(shrunk_params_comb, f = shrunk_params_comb$gene)

Test for changes in ASE mean across time points

ASPEN performs this test only on genes with valid beta-binomial estimates. Genes for which beta-binomial parameters could not be obtained are excluded from the output. Normalized counts are used to detect changes in the allelic ratio mean across pseudotime bins. The input matrices contain all cells.

To run the test, you must provide:

The group identifiers in the metadata must match exactly the group identifiers in the beta-binomial estimates list (shrunk_params_gene).

By default, ASPEN ensures that the number of informative cells — those meeting the minimum coverage threshold — is the same across groups.

#extracting normalised counts which will be used for testing
a1_norm <- as.matrix(round(assays(ase_sce)[['a1_norm']]))
tot_norm <- as.matrix(round(assays(ase_sce)[['tot_norm']]))

change_mean <- group_mean(a1_norm, tot_norm, 
                         metadata = pseudotime, split.var = "group",
                         min_counts = 5, min_cells = 5, 
                         estimates = global_shrunk,
                         estimates_group = shrunk_params_gene,
                         equalGroups = TRUE)

For genes that do not meet the quality cut-off threshold (here: a minimum of 5 cells with at least 5 mapped reads per cell), the inference is not performed. These genes will have NA values in the relevant output columns.

We remove such genes before calculating the false discovery rate (FDR):

change_mean <- change_mean[!is.na(change_mean$pval),]
change_mean$fdr_mean <- p.adjust(change_mean$pval, method = "fdr")
head(change_mean[order(change_mean$fdr_mean),], n  = 10)
N AR tot_gene_mean tot_gene_variance alpha beta bb_mu bb_theta id theta_smoothed thetaCorrected theta_common resid N.1 tot_gene_mean.1 AR.1 log2FC mu_group.0.0142.0.424. mu_group.0.424.0.541. mu_group.0.541.0.566. mu_group.0.566.0.594. mu_group.0.594.1. loglik0 loglik1 llr pval fdr_mean
482        0.697 2.18 7.25     0.244  0.105  0.699 2.86   98 0.013  1.56   0.013  2.85   482        2.18 0.702 0.984 0.705 0.762 0.764 0.69  0.604 -2.72e+03 -907        -1.81e+03 0         0        
1.55e+03 0.432 6.2  59.3      0.0584 0.0727 0.445 7.63   564 0.0133 4.15   0.0133 7.62   1.55e+03 6.2  0.433 -0.544 0.386 0.436 0.409 0.519 0.493 -1.54e+04 -2.3e+03  -1.31e+04 0         0        
1.01e+03 0.602 3.75 15.6      0.168  0.114  0.597 3.55   607 0.0134 1.93   0.0134 3.54   1.01e+03 3.75 0.604 -0.295 0.562 0.639 0.659 0.588 0.509 -5.01e+03 -1.81e+03 -3.2e+03  0         0        
485        0.701 2.28 4.42     0.361  0.151  0.705 1.95   64 0.0131 1.07   0.0131 1.94   485        2.28 0.706 1.31  0.681 0.723 0.746 0.71  0.665 -1.36e+03 -640        -720        1.44e-310 1.36e-308
568        0.971 2.48 24        0.99   0.0298 0.971 0.981  11 0.0132 0.539  0.0132 0.967  568        2.48 0.972 4.55  0.968 0.961 0.974 0.987 0.963 -884        -339        -544        1.97e-234 1.49e-232
228        0.193 1.32 4.29     0.237  0.963  0.197 0.833  31 0.014  0.46   0.014  0.819  228        1.32 0.2   -2.17  0.125 0.234 0.32  0.263 0.108 -883        -433        -450        1.85e-193 1.16e-191
2.24e+03 0.603 26.5  1.57e+03 20.8    16.9    0.552 0.0266 51 0.0145 0.0219 0.0145 0.0121 2.24e+03 26.5  0.605 0.238 0.514 0.54  0.616 0.657 0.639 -6.94e+03 -6.55e+03 -387        2.42e-166 1.31e-164
519        0.629 2.27 13.5      2.43   1.58   0.606 0.249  162 0.013  0.142  0.013  0.236  519        2.27 0.634 0.558 0.632 0.561 0.55  0.622 0.657 -1.87e+03 -1.58e+03 -286        1.59e-122 7.52e-121
124        0.689 1.32 1.96     0.227  0.101  0.692 3.05   97 0.014  1.67   0.014  3.04   124        1.32 0.693 1.35  0.643 0.719 0.743 0.694 0.646 -366        -158        -208        1.11e-88  4.65e-87 
871        0.61  3.18 16.2      5.84   3.96   0.596 0.102  211 0.0133 0.0622 0.0133 0.0887 871        3.18 0.624 0.583 0.606 0.623 0.611 0.567 0.594 -2.4e+03  -2.19e+03 -205        1.74e-87  6.57e-86 
gene <- "Dbi"
#generating data frame for plotting

simul_data <- make_plotdf_simul(Cast_B6_a1, Cast_B6_tot, gene = gene, estimates_group = shrunk_params_gene,
                                metadata = pseudotime, order.by = "time", split.var = "group")

plot_distr(simul_data, gene = gene, add.density = FALSE, min_counts = 0) + 
  geom_hline(yintercept = c(simul_data$Index[match(unique(simul_data$group), simul_data$group)][-1])) +
  labs(y = "Ordered pseudotime")

Test for changes in ASE variance over time

The changes in allelic variation between the groups can be identifying with [group_var()]. This function requires specifying the global mean allelic ratio (passed through mean_null parameter), which is used when evaluating both the null and alternative hypotheses.

change_var <- group_var(a1_mat, tot_mat, 
                         metadata = pseudotime, split.var = "group",
                         min_counts = 5, min_cells = 5, 
                         mean_null = 0.5,
                         estimates = global_shrunk,
                         estimates_group = shrunk_params_gene,
                         equalGroups = TRUE)
#variance changes overtime, whilst keeping the mean AR constant

change_var <- change_var[!is.na(change_var$pval),]
change_var$fdr_var <- p.adjust(change_var$pval_var, method = "fdr")
change_var[order(change_var$fdr_var)[11:20],]
N AR tot_gene_mean tot_gene_variance alpha beta bb_mu bb_theta id theta_smoothed thetaCorrected theta_common resid N.1 tot_gene_mean.1 AR.1 log2FC theta_orig_.0.0142.0.424. theta_orig_.0.424.0.541. theta_orig_.0.541.0.566. theta_orig_.0.566.0.594. theta_orig_.0.594.1. theta_shrunk_.0.0142.0.424. theta_shrunk_.0.424.0.541. theta_shrunk_.0.541.0.566. theta_shrunk_.0.566.0.594. theta_shrunk_.0.594.1. loglik0_var loglik1_var llr_var pval_var fdr_var
57        0.746  0.378  2.96 4.74     1.56     0.753  0.159  1184 0.0824 0.128  0.0824 0.0763  57        0.378  0.746  1.64  0.119  0.143                    0.0912 0.12                       -134        -117        -16.7 1.02e-06 3.5e-05 
2.11e+03 0.522  7.41   37.9  488        448        0.521  0.0011 465       0.0011 0.0133 -0.0122  2.11e+03 7.41   0.522  0.124 0.0158 0.0014 0.0221 0.0129 0.051  0.0136 0.0014 0.0179  0.0125  0.0388 -4.1e+03  -4.08e+03 -16.2 1.52e-06 4.78e-05
191        0.0238 1.48   2.57 4.51     167        0.0263 0.0058 12 0.0134 0.01   0.0134 -0.00761 191        1.48   0.0238 -5.21  0.0201 0.0162 0.0197 0      0.0002 0.035  0.0214 0.0266  0       0.0002 -680        -664        -15.6 2.81e-06 8.18e-05
1.18e+03 0.59   4.06   13.9  346        267        0.564  0.0016 226       0.0016 0.0133 -0.0117  1.18e+03 4.06   0.59   0.371 0.028  0.0194 0.0537 0.0026 0.0685 0.021  0.0163 0.0374  0.00959 0.0505 -2.19e+03 -2.17e+03 -14.5 7.79e-06 0.00021 
16        0.365  0.0622 0.38 3.44     7.67     0.31   0.09   992 0.199  0.151  0.199  -0.109   16        0.0622 0.365  -1.28        0.0534                         0.134                      -36.8      -22.9      -13.9 1.42e-05 0.000358
2.83e+03 0.568  13.1    91.2  3.08e+03 2.42e+03 0.56   0.0002 293       0.0002 0.0135 -0.0133  2.83e+03 13.1    0.568  0.347 0.0001 0.0001 0.0039 0.0066 0.0002 0.0001 0.0001 0.00776 0.0078  0.0002 -5.97e+03 -5.96e+03 -12   7.88e-05 0.00186 
1.23e+03 0.644  4.37   23.6  404        254        0.614  0.0015 100       0.0015 0.0133 -0.0118  1.23e+03 4.37   0.644  0.667 0.0952 0.0001 0.0001 0.0066 0.0037 0.0777 0.0001 0.0001  0.00836 0.0127 -2.42e+03 -2.4e+03  -11.8 9.9e-05  0.0022  
162        0.954  1.2    2.57 10.4      0.482    0.956  0.0917 3 0.0145 0.0572 0.0145 0.0772  162        1.2    0.954  4.47  0.0776 0.0665 0.106  0.126  0.24   0.0624 0.0508 0.0866  0.0941  0.159  -543        -532        -11.5 0.000126 0.00264 
100        0.552  0.466  3.66 17.6      16.2      0.521  0.0296 1065 0.061  0.0474 0.061  -0.0314  100        0.466  0.552  0.114 0.0002 0.0303 0.0038             0.0002 0.0357 0.136                -194        -183        -11.3 0.000158 0.00314 
407        0.266  2.12   4.19 10.6      30.4      0.259  0.0244 36 0.013  0.0199 0.013  0.0114  407        2.12   0.266  -1.52  0.0513 0.0241 0.0001 0      0.0491 0.0369 0.0218 0.0001  0       0.0443 -928        -917        -10.8 0.000236 0.00445 
gene <- "Tubb3"
#generating data frame for plotting

simul_data <- make_plotdf_simul(Cast_B6_a1, Cast_B6_tot, gene = gene, estimates_group = shrunk_params_gene,
                                metadata = pseudotime, order.by = "time", split.var = "group")

plot_distr(simul_data, gene = gene, add.density = FALSE, min_counts = 0) + 
  geom_hline(yintercept = c(simul_data$Index[match(unique(simul_data$group), simul_data$group)][-1])) +
  labs(y = "Ordered pseudotime")