The Methodological Pitfall of Using Expression Values in Standard GSEA
- 9 minsWhat the packages expect
| Tool | Typical input | What it measures |
|---|---|---|
fgsea::fgsea() | A named numeric vector of gene-level statistics (one value per gene): e.g. moderated t, log2 fold change, signed correlation with phenotype, or similar. | Whether genes in a set tend to lie at the extremes of that statistic (enrichment at the “up” or “down” end of the ranking). |
GSVA::gsva(..., method = "ssgsea") | A matrix of expression (often log-CPM / log-TPM / log-RPKM), genes × samples. | Within each sample, pathway activity from the relative ordering of expression (ssGSEA ranks internally; with continuous data, kcdf = "Gaussian" is usual). |
The mistake: passing logTPM (or any “how highly expressed is this gene?” scale) into fgsea(..., stats = ) as if it were a differential statistic. Then the implicit ranking is dominated by absolute expression, not by evidence of regulation / contrast. Pathway “scores” can track highly expressed genes rather than pathway-specific biology.
Better patterns:
- For single-sample pathway activity from expression, use GSVA / ssGSEA (or similar rank-based methods), not raw
fgseaon TPM. - For fgsea, build
statsfrom a contrast (e.g. tumor vs normal, or correlation with a score) so that large positive/negative values mean “up” / “down” in that comparison.
Toy example: same numbers, different stories
We use a tiny TPM-like matrix: each column is scaled so the four genes sum to 100 (same constraint as TPM summing to 1e6 per sample; here we use “percent of total” for readability). First block: fgsea on one column as stats. Second block: GSVA with ssgsea on the full matrix.
library(fgsea)
library(GSVA)
Data — TPM-like matrix (each sample sums to 100)
Unscaled abundances live in raw; vstat is TPM-like: each column multiplied so Σ genes = 100 (same idea as TPM summing to 10^6, shown here as percentages).
genes <- c("Gene_A", "Gene_B", "Gene_C", "Gene_D", "Gene_E")
pathways <- list(test = c("Gene_A", "Gene_C"))
raw <- data.frame(
Condition_A = c(1.2, 1.1, 0.1, 0.2, 0.3),
Condition_B = c(8.5, 7.2, 4.1, 2.0, 1.0),
Condition_C = c(100, 99, 10, 1, 0.5),
row.names = genes
)
# TPM-like: proportions × 100 so each column sums to 100
vstat <- as.data.frame(lapply(raw, function(x) 100 * x / sum(x)))
rownames(vstat) <- genes
stopifnot(all.equal(as.numeric(colSums(vstat)), rep(100, ncol(vstat)), check.attributes = FALSE))
vstat
#> Condition_A Condition_B Condition_C
#> Gene_A 41.379310 37.280702 47.5059382
#> Gene_B 37.931034 31.578947 47.0308789
#> Gene_C 3.448276 17.982456 4.7505938
#> Gene_D 6.896552 8.771930 0.4750594
#> Gene_E 10.344828 4.385965 0.2375297
Part A — fgsea with column-wise “stats” (= expression-like)
Here test_stats is just one sample’s TPM column (same relative abundances as before, each column summing to 100). fgsea ranks genes by these values and tests enrichment of the set c('Gene_A','Gene_C').
for (sample in colnames(vstat)) {
set.seed(37)
test_stats <- vstat[, sample]
names(test_stats) <- row.names(vstat)
result <- fgsea::fgsea(
pathways = pathways,
stats = test_stats,
minSize = 0,
scoreType = "pos"
)
print(sample)
print(result[, c("pathway", "NES", "ES", "pval")])
}
#> [1] "Condition_A"
#> pathway NES ES pval
#> 1: test 1.562328 0.9230769 0.2067932
#> [1] "Condition_B"
#> pathway NES ES pval
#> 1: test 1.195829 0.6746032 0.4285714
#> [1] "Condition_C"
#> pathway NES ES pval
#> 1: test 1.361405 0.9090909 0.4285714
Interpretation: Multiplying each column by a positive constant (here, scaling so the column sums to 100) does not change gene ranks within that sample, so fgsea results match the same relative abundances as before. NES / ES still differ across Condition_* because each sample has a different abundance ranking. The problem is not the TPM scaling; it is that the statistic is still abundance, not a contrast. You are not testing “pathway activation” in the GSEA sense; you are testing whether pathway genes sit at the high end of that sample’s expression. That is often not what people intend when they say “single-sample pathway score from logTPM.”
Part B — GSVA + ssgsea on the expression matrix
Designed for continuous expression; ranks are formed within each sample in the ssGSEA procedure (with kcdf = "Gaussian" for continuous values). Integer counts would use kcdf = "Poisson" instead.
gsva(
expr = as.matrix(vstat),
gset.idx.list = pathways,
method = "ssgsea",
kcdf = "Gaussian",
min.sz = 1,
max.sz = 4,
verbose = FALSE
)
#> Condition_A Condition_B Condition_C
#> test 0.2977449 1.297745 1.297745
Here the method is aligned with expression-as-input for per-sample enrichment.
Takeaway
fgsea(stats = TPM_one_sample)(or log-transformed TPM used the same way as a plain ranking scale) ≈ enrichment driven by who is highest on that abundance scale, not necessarily pathway biology you care about.GSVA / ssgsea(expr = logTPM_matrix)≈ rank-based single-sample pathway summaries intended for expression.- If you truly need fgsea, supply contrast-based gene statistics (DE, correlation, etc.), not raw per-sample TPM-like columns alone.
Session info
sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.utf-8/en_US.utf-8/en_US.utf-8/C/en_US.utf-8/en_US.utf-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] GSVA_1.38.2 fgsea_1.16.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.8.3 lattice_0.20-45
#> [3] digest_0.6.29 utf8_1.2.2
#> [5] GenomeInfoDb_1.26.7 R6_2.5.1
#> [7] stats4_4.0.4 RSQLite_2.4.6
#> [9] evaluate_0.15 httr_1.4.2
#> [11] ggplot2_3.3.5 pillar_1.7.0
#> [13] zlibbioc_1.36.0 rlang_1.0.2
#> [15] data.table_1.14.2 annotate_1.68.0
#> [17] blob_1.3.0 S4Vectors_0.28.1
#> [19] Matrix_1.4-1 rmarkdown_2.13
#> [21] BiocParallel_1.24.1 stringr_1.4.0
#> [23] RCurl_1.98-1.6 bit_4.0.4
#> [25] munsell_0.5.0 DelayedArray_0.16.3
#> [27] compiler_4.0.4 xfun_0.30
#> [29] pkgconfig_2.0.3 BiocGenerics_0.36.1
#> [31] htmltools_0.5.2 tidyselect_1.1.2
#> [33] SummarizedExperiment_1.20.0 GenomeInfoDbData_1.2.4
#> [35] tibble_3.1.6 gridExtra_2.3
#> [37] matrixStats_0.61.0 IRanges_2.24.1
#> [39] XML_3.99-0.23 fansi_1.0.3
#> [41] crayon_1.5.1 dplyr_1.0.8
#> [43] bitops_1.0-7 grid_4.0.4
#> [45] xtable_1.8-4 GSEABase_1.52.1
#> [47] gtable_0.3.0 lifecycle_1.0.1
#> [49] DBI_1.3.0 magrittr_2.0.3
#> [51] scales_1.1.1 graph_1.68.0
#> [53] cli_3.2.0 stringi_1.7.6
#> [55] cachem_1.0.6 XVector_0.30.0
#> [57] ellipsis_0.3.2 generics_0.1.2
#> [59] vctrs_0.4.0 fastmatch_1.1-3
#> [61] tools_4.0.4 bit64_4.0.5
#> [63] Biobase_2.50.0 glue_1.6.2
#> [65] purrr_0.3.4 MatrixGenerics_1.2.1
#> [67] parallel_4.0.4 fastmap_1.1.0
#> [69] yaml_2.3.5 AnnotationDbi_1.52.0
#> [71] colorspace_2.0-3 GenomicRanges_1.42.0
#> [73] memoise_2.0.1 knitr_1.38