Dimension reduction of the transcript abundance data — reduce

reduce_dimensions() takes a `SummarizedExperiment` as input and computes a reduced-dimensional representation of the transcript abundance.

reduce_dimensions(
  .data,
  assay,
  method,
  .dims = 2,
  top = 500,
  of_samples = TRUE,
  transform = log1p,
  scale = TRUE,
  ...,
  log_transform = NULL
)

# S4 method for class 'SummarizedExperiment'
reduce_dimensions(
  .data,
  assay,
  method,
  .dims = 2,
  top = 500,
  of_samples = TRUE,
  transform = log1p,
  scale = TRUE,
  ...,
  log_transform = NULL
)

# S4 method for class 'RangedSummarizedExperiment'
reduce_dimensions(
  .data,
  assay,
  method,
  .dims = 2,
  top = 500,
  of_samples = TRUE,
  transform = log1p,
  scale = TRUE,
  ...,
  log_transform = NULL
)

Arguments

.data: A `SummarizedExperiment`
assay: Character string: the name of the assay to use for dimension reduction (must match `assayNames(.data)`). This argument must be explicitly specified so that the choice of abundance matrix is deliberate.
method: A character string. The dimension reduction algorithm to use (PCA, MDS, tSNE, UMAP).
.dims: An integer. The number of dimensions you are interested in (e.g., 4 for returning the first four principal components).
top: An integer. How many of the most variable genes to select for dimensionality reduction.
of_samples: A boolean. In case the input is a tidybulk object, it indicates whether the element column will be the sample or the transcript column.
transform: A function that will transform the counts. By default it is log1p, which suits RNA sequencing data; to avoid transformation, use identity.
scale: A boolean. For method="PCA", this is passed to the `prcomp` function. It is not included in the ... argument because, although the default for `prcomp` is FALSE, it is advisable to set it to TRUE.
...: Further parameters passed to the function prcomp if you choose method="PCA", to Rtsne if you choose method="tSNE", or to uwot::tumap if you choose method="umap".
log_transform: DEPRECATED - A boolean, whether the value should be log-transformed (e.g., TRUE for RNA sequencing data)
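
To make the `transform` argument concrete, here is a minimal base-R sketch comparing `log1p` and `identity` on a small made-up count vector. The vector `toy_counts` is hypothetical and not part of tidybulk; the snippet only shows what each function does to the values and does not call reduce_dimensions().

```r
# Hypothetical toy counts for one gene across five samples
toy_counts <- c(0, 9, 99, 999, 9999)

# Default transform: log1p(x) = log(1 + x), which is defined for zero counts
log_scale <- log1p(toy_counts)

# identity returns its input unchanged, e.g. for data that are already transformed
raw_scale <- identity(toy_counts)

print(round(log_scale, 2))
#> [1] 0.00 2.30 4.61 6.91 9.21
```

On the log scale the largest count no longer dominates the range, which is the usual reason to transform raw counts before a variance-based method such as PCA.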

Value

A tbl object with additional columns for the reduced dimensions

A `SummarizedExperiment` object

Details

Lifecycle: maturing

This function reduces the dimensions of the transcript abundances. It can use multi-dimensional scaling (MDS; DOI.org/10.1186/gb-2010-11-3-r25), principal component analysis (PCA), tSNE (Jesse Krijthe et al. 2018), or UMAP (McInnes et al. 2018).

Underlying method for PCA: prcomp(scale = scale, ...)

Underlying method for MDS: limma::plotMDS(ndim = .dims, plot = FALSE, top = top)

Underlying method for tSNE: Rtsne::Rtsne(data, ...)

Underlying method for UMAP: variable features from the chosen assay are optionally PCA-reduced, then uwot::tumap() is applied to the sample coordinates matrix.
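
The PCA path described above can be approximated in base R. This is a rough sketch on simulated data, not tidybulk's internal code: the matrix `sim_counts`, the negative-binomial parameters, the order of the steps, and the choice to put samples in rows are all assumptions made for illustration.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical count matrix: 1000 genes (rows) x 8 samples (columns)
n_genes <- 1000
n_samples <- 8
sim_counts <- matrix(
  rnbinom(n_genes * n_samples, mu = 100, size = 2),
  nrow = n_genes,
  dimnames = list(
    paste0("gene", seq_len(n_genes)),
    paste0("sample", seq_len(n_samples))
  )
)

# Apply the transform (log1p is the documented default)
log_counts <- log1p(sim_counts)

# Keep the `top` most variable genes (assumed selection rule: per-gene variance)
top <- 500
gene_variance <- apply(log_counts, 1, var)
top_genes <- names(sort(gene_variance, decreasing = TRUE))[seq_len(top)]

# prcomp treats rows as observations, so transpose to get samples in rows
pca <- prcomp(t(log_counts[top_genes, ]), scale. = TRUE)

# Keep the first n_dims components and their fraction of variance explained
n_dims <- 3
sample_coordinates <- pca$x[, seq_len(n_dims)]
fraction_of_variance <- (pca$sdev^2 / sum(pca$sdev^2))[seq_len(n_dims)]
```

Here `sample_coordinates` has one row per sample and one column per retained component, which corresponds to the extra columns the function is documented to return for `.dims = 3`.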

References

Mangiola, S., Molania, R., Dong, R., Doyle, M. A., & Papenfuss, A. T. (2021). tidybulk: an R tidy framework for modular transcriptomic data analysis. Genome Biology, 22(1), 42. doi:10.1186/s13059-020-02233-7

Krijthe, J. H. (2015). Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation. R package version 0.15.

McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.

Examples

## Load airway dataset for examples
data('airway', package = 'airway')

# Ensure a 'condition' column exists for examples expecting it
SummarizedExperiment::colData(airway)$condition <- SummarizedExperiment::colData(airway)$dex


counts.MDS =
 airway |>
 identify_abundant() |>
 reduce_dimensions(assay = "counts", method="MDS", .dims = 3)
#> Warning: All samples appear to belong to the same group.
#> Getting the 500 most variable genes
#> [1] "MDS result_df colnames: sample, 1, 2, 3"
#> tidybulk says: to access the raw results do `metadata(.)$tidybulk$MDS`


counts.PCA =
 airway |>
 identify_abundant() |>
 reduce_dimensions(assay = "counts", method="PCA", .dims = 3)
#> Warning: All samples appear to belong to the same group.
#> Getting the 500 most variable genes
#> Fraction of variance explained by the selected principal components
#> # A tibble: 3 × 2
#>   `Fraction of variance`    PC
#>                    <dbl> <int>
#> 1                  0.409     1
#> 2                  0.285     2
#> 3                  0.147     3
#> tidybulk says: to access the raw results do `metadata(.)$tidybulk$PCA`