The Methodological Pitfall of Using Expression Values in Standard GSEA

- 9 mins

What the packages expect

Tool Typical input What it measures
fgsea::fgsea() A named numeric vector of gene-level statistics (one value per gene): e.g. moderated t, log2 fold change, signed correlation with phenotype, or similar. Whether genes in a set tend to lie at the extremes of that statistic (enrichment at the “up” or “down” end of the ranking).
GSVA::gsva(..., method = "ssgsea") A matrix of expression (often log-CPM / log-TPM / log-RPKM), genes × samples. Within each sample, pathway activity from the relative ordering of expression (ssGSEA ranks internally; with continuous data, kcdf = "Gaussian" is usual).

The mistake: passing logTPM (or any “how highly expressed is this gene?” scale) into fgsea(..., stats = ) as if it were a differential statistic. Then the implicit ranking is dominated by absolute expression, not by evidence of regulation / contrast. Pathway “scores” can track highly expressed genes rather than pathway-specific biology.

Better patterns:


Toy example: same numbers, different stories

We use a tiny TPM-like matrix: each column is scaled so the four genes sum to 100 (same constraint as TPM summing to 1e6 per sample; here we use “percent of total” for readability). First block: fgsea on one column as stats. Second block: GSVA with ssgsea on the full matrix.

library(fgsea)
library(GSVA)

Data — TPM-like matrix (each sample sums to 100)

Unscaled abundances live in raw; vstat is TPM-like: each column multiplied so Σ genes = 100 (same idea as TPM summing to 10^6, shown here as percentages).

genes <- c("Gene_A", "Gene_B", "Gene_C", "Gene_D", "Gene_E")
pathways <- list(test = c("Gene_A", "Gene_C"))
raw <- data.frame(
  Condition_A = c(1.2, 1.1, 0.1, 0.2, 0.3),
  Condition_B = c(8.5, 7.2, 4.1, 2.0, 1.0),
  Condition_C = c(100, 99, 10, 1, 0.5),
  row.names = genes
)
# TPM-like: proportions × 100 so each column sums to 100
vstat <- as.data.frame(lapply(raw, function(x) 100 * x / sum(x)))
rownames(vstat) <- genes
stopifnot(all.equal(as.numeric(colSums(vstat)), rep(100, ncol(vstat)), check.attributes = FALSE))
vstat
#>        Condition_A Condition_B Condition_C
#> Gene_A   41.379310   37.280702  47.5059382
#> Gene_B   37.931034   31.578947  47.0308789
#> Gene_C    3.448276   17.982456   4.7505938
#> Gene_D    6.896552    8.771930   0.4750594
#> Gene_E   10.344828    4.385965   0.2375297

Part A — fgsea with column-wise “stats” (= expression-like)

Here test_stats is just one sample’s TPM column (same relative abundances as before, each column summing to 100). fgsea ranks genes by these values and tests enrichment of the set c('Gene_A','Gene_C').


for (sample in colnames(vstat)) {
  set.seed(37)
  test_stats <- vstat[, sample]
  names(test_stats) <- row.names(vstat)
  result <- fgsea::fgsea(
    pathways = pathways,
    stats = test_stats,
    minSize = 0,
    scoreType = "pos"
  )
  print(sample)
  print(result[, c("pathway", "NES", "ES", "pval")])
}
#> [1] "Condition_A"
#>    pathway      NES        ES      pval
#> 1:    test 1.562328 0.9230769 0.2067932
#> [1] "Condition_B"
#>    pathway      NES        ES      pval
#> 1:    test 1.195829 0.6746032 0.4285714
#> [1] "Condition_C"
#>    pathway      NES        ES      pval
#> 1:    test 1.361405 0.9090909 0.4285714

Interpretation: Multiplying each column by a positive constant (here, scaling so the column sums to 100) does not change gene ranks within that sample, so fgsea results match the same relative abundances as before. NES / ES still differ across Condition_* because each sample has a different abundance ranking. The problem is not the TPM scaling; it is that the statistic is still abundance, not a contrast. You are not testing “pathway activation” in the GSEA sense; you are testing whether pathway genes sit at the high end of that sample’s expression. That is often not what people intend when they say “single-sample pathway score from logTPM.”

Part B — GSVA + ssgsea on the expression matrix

Designed for continuous expression; ranks are formed within each sample in the ssGSEA procedure (with kcdf = "Gaussian" for continuous values). Integer counts would use kcdf = "Poisson" instead.

gsva(
  expr = as.matrix(vstat),
  gset.idx.list = pathways,
  method = "ssgsea",
  kcdf = "Gaussian",
  min.sz = 1,
  max.sz = 4,
  verbose = FALSE
)
#>      Condition_A Condition_B Condition_C
#> test   0.2977449    1.297745    1.297745

Here the method is aligned with expression-as-input for per-sample enrichment.


Takeaway

  1. fgsea(stats = TPM_one_sample) (or log-transformed TPM used the same way as a plain ranking scale) ≈ enrichment driven by who is highest on that abundance scale, not necessarily pathway biology you care about.
  2. GSVA / ssgsea(expr = logTPM_matrix)rank-based single-sample pathway summaries intended for expression.
  3. If you truly need fgsea, supply contrast-based gene statistics (DE, correlation, etc.), not raw per-sample TPM-like columns alone.

Session info

sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.utf-8/en_US.utf-8/en_US.utf-8/C/en_US.utf-8/en_US.utf-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] GSVA_1.38.2  fgsea_1.16.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8.3                lattice_0.20-45            
#>  [3] digest_0.6.29               utf8_1.2.2                 
#>  [5] GenomeInfoDb_1.26.7         R6_2.5.1                   
#>  [7] stats4_4.0.4                RSQLite_2.4.6              
#>  [9] evaluate_0.15               httr_1.4.2                 
#> [11] ggplot2_3.3.5               pillar_1.7.0               
#> [13] zlibbioc_1.36.0             rlang_1.0.2                
#> [15] data.table_1.14.2           annotate_1.68.0            
#> [17] blob_1.3.0                  S4Vectors_0.28.1           
#> [19] Matrix_1.4-1                rmarkdown_2.13             
#> [21] BiocParallel_1.24.1         stringr_1.4.0              
#> [23] RCurl_1.98-1.6              bit_4.0.4                  
#> [25] munsell_0.5.0               DelayedArray_0.16.3        
#> [27] compiler_4.0.4              xfun_0.30                  
#> [29] pkgconfig_2.0.3             BiocGenerics_0.36.1        
#> [31] htmltools_0.5.2             tidyselect_1.1.2           
#> [33] SummarizedExperiment_1.20.0 GenomeInfoDbData_1.2.4     
#> [35] tibble_3.1.6                gridExtra_2.3              
#> [37] matrixStats_0.61.0          IRanges_2.24.1             
#> [39] XML_3.99-0.23               fansi_1.0.3                
#> [41] crayon_1.5.1                dplyr_1.0.8                
#> [43] bitops_1.0-7                grid_4.0.4                 
#> [45] xtable_1.8-4                GSEABase_1.52.1            
#> [47] gtable_0.3.0                lifecycle_1.0.1            
#> [49] DBI_1.3.0                   magrittr_2.0.3             
#> [51] scales_1.1.1                graph_1.68.0               
#> [53] cli_3.2.0                   stringi_1.7.6              
#> [55] cachem_1.0.6                XVector_0.30.0             
#> [57] ellipsis_0.3.2              generics_0.1.2             
#> [59] vctrs_0.4.0                 fastmatch_1.1-3            
#> [61] tools_4.0.4                 bit64_4.0.5                
#> [63] Biobase_2.50.0              glue_1.6.2                 
#> [65] purrr_0.3.4                 MatrixGenerics_1.2.1       
#> [67] parallel_4.0.4              fastmap_1.1.0              
#> [69] yaml_2.3.5                  AnnotationDbi_1.52.0       
#> [71] colorspace_2.0-3            GenomicRanges_1.42.0       
#> [73] memoise_2.0.1               knitr_1.38
Jingxin Fu, Ph.D.

Jingxin Fu, Ph.D.

Research Fellow interested in data mining on cancer genomics

comments powered by Disqus
rss facebook twitter github gitlab youtube mail spotify lastfm instagram linkedin google google-plus pinterest medium vimeo stackoverflow reddit quora quora