Identifies transcripts/genes that are consistently expressed above a threshold across samples. This function adds a logical column `.abundant` to indicate which features pass the filtering criteria.
identify_abundant(
.data,
abundance = assayNames(.data)[1],
design = NULL,
formula_design = NULL,
minimum_counts = 10,
minimum_proportion = 0.7,
minimum_count_per_million = NULL,
factor_of_interest = NULL,
...,
.abundance = NULL
)
# S4 method for class 'SummarizedExperiment'
identify_abundant(
.data,
abundance = assayNames(.data)[1],
design = NULL,
formula_design = NULL,
minimum_counts = 10,
minimum_proportion = 0.7,
minimum_count_per_million = NULL,
factor_of_interest = NULL,
...,
.abundance = NULL
)
# S4 method for class 'RangedSummarizedExperiment'
identify_abundant(
.data,
abundance = assayNames(.data)[1],
design = NULL,
formula_design = NULL,
minimum_counts = 10,
minimum_proportion = 0.7,
minimum_count_per_million = NULL,
factor_of_interest = NULL,
...,
.abundance = NULL
)
.data: A `tbl` or `SummarizedExperiment` object containing transcript/gene abundance data.
abundance: The name of the transcript/gene abundance column (character, preferred).
design: A design matrix for more complex experimental designs. If provided, this is passed to filterByExpr instead of factor_of_interest.
formula_design: ...
minimum_counts: ...
minimum_proportion: ...
minimum_count_per_million: Minimum CPM cutoff to use for filtering (passed to CPM.Cutoff in filterByExpr). If provided, this overrides the minimum_counts parameter. Default is NULL (uses edgeR default).
factor_of_interest: DEPRECATED. Use 'design' or 'formula_design' instead; this argument will be removed in a future release. The name of the column containing groups/conditions for filtering, used by edgeR's filterByExpr to define sample groups.
...: Further arguments.
.abundance: DEPRECATED. The name of the transcript/gene abundance column (symbolic, for backward compatibility).
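A design matrix for the `design` argument can be built with base R's model.matrix(). The sketch below uses hypothetical sample metadata (`sample_info` and its columns are made up for illustration); with real data, the rows of the matrix would need to correspond to the samples in `.data`.

```r
# Hypothetical sample metadata: one row per sample
sample_info <- data.frame(
  condition = factor(c("untrt", "trt", "untrt", "trt"), levels = c("untrt", "trt")),
  cell = factor(c("c1", "c1", "c2", "c2"))
)

# Design matrix with an intercept, a condition effect and a cell-line effect
design <- model.matrix(~ condition + cell, data = sample_info)
colnames(design)
```

A matrix of this kind could then be supplied as `identify_abundant(.data, design = design)`.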
Returns the input object with an additional logical column `.abundant` indicating which features passed the abundance threshold criteria.
A `SummarizedExperiment` object
Lifecycle: maturing
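This function uses edgeR's filterByExpr() function to identify consistently expressed features. minimum_counts is a count threshold, which filterByExpr() converts to a CPM cutoff using the median library size; if minimum_count_per_million is provided, it is used as the CPM cutoff instead. A feature is considered abundant if its CPM reaches that cutoff in enough samples, where the required number of samples is derived from the smallest experimental group (defined by design, formula_design, or factor_of_interest) and scaled by minimum_proportion when groups are large.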
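The filtering rule can be sketched in base R. This is a simplified illustration of the logic in edgeR's filterByExpr(), not its actual implementation: the toy count matrix and the helper `simple_abundance_filter()` are hypothetical, and numerical tolerances and design-matrix handling are left out.

```r
# Toy count matrix: 4 genes x 6 samples (hypothetical values)
counts <- matrix(
  c(  0,   1,   0,   2,   0,   1,   # gene_low: near zero everywhere
     50,  60,  55,   0,   1,   0,   # gene_groupA: expressed in group A only
    100, 120, 110, 130, 105, 115,   # gene_high: expressed everywhere
      0,   0,  40,   0,   0,   0),  # gene_sporadic: expressed in one sample
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("gene_low", "gene_groupA", "gene_high", "gene_sporadic"),
    paste0("s", 1:6)
  )
)
group <- factor(c("A", "A", "A", "B", "B", "B"))

# Hypothetical helper mirroring the main steps of the filter
simple_abundance_filter <- function(counts, group,
                                    min_count = 10, min_total_count = 15,
                                    large_n = 10, min_prop = 0.7) {
  lib_size <- colSums(counts)
  # Counts per million for every gene in every sample
  cpm <- t(t(counts) / lib_size) * 1e6
  # Convert the count threshold to a CPM cutoff via the median library size
  cpm_cutoff <- min_count / median(lib_size) * 1e6
  # Required number of samples: the smallest group size,
  # scaled by min_prop only when that group is larger than large_n
  min_samples <- min(table(group))
  if (min_samples > large_n) {
    min_samples <- large_n + (min_samples - large_n) * min_prop
  }
  keep_cpm <- rowSums(cpm >= cpm_cutoff) >= min_samples
  keep_total <- rowSums(counts) >= min_total_count
  keep_cpm & keep_total
}

keep <- simple_abundance_filter(counts, group)
keep
```

In this toy case `gene_groupA` is kept even though it is silent in group B, because it passes the cutoff in as many samples as the smallest group contains.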
Mangiola, S., Molania, R., Dong, R., Doyle, M. A., & Papenfuss, A. T. (2021). tidybulk: an R tidy framework for modular transcriptomic data analysis. Genome Biology, 22(1), 42. doi:10.1186/s13059-020-02233-7
McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288-4297. doi:10.1093/nar/gks042
## Load airway dataset for examples
data('airway', package = 'airway')
# Ensure a 'condition' column exists for examples expecting it
SummarizedExperiment::colData(airway)$condition <- SummarizedExperiment::colData(airway)$dex
# Basic usage
airway |> identify_abundant()
#> Warning: All samples appear to belong to the same group.
#> class: RangedSummarizedExperiment
#> dim: 63677 8
#> metadata(1): ''
#> assays(1): counts
#> rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
#> ENSG00000273493
#> rowData names(11): gene_id gene_name ... symbol .abundant
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(10): SampleName cell ... BioSample condition
# With custom thresholds
airway |> identify_abundant(
minimum_counts = 5,
minimum_proportion = 0.5
)
#> Warning: All samples appear to belong to the same group.
#> class: RangedSummarizedExperiment
#> dim: 63677 8
#> metadata(1): ''
#> assays(1): counts
#> rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
#> ENSG00000273493
#> rowData names(11): gene_id gene_name ... symbol .abundant
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(10): SampleName cell ... BioSample condition
# Using a factor of interest
airway |> identify_abundant(factor_of_interest = condition)
#> Warning: The `factor_of_interest` argument of `identify_abundant()` is deprecated as of
#> tidybulk 2.0.0.
#> ℹ Please use the `formula_design` argument instead.
#> ℹ The argument 'factor_of_interest' is deprecated and will be removed in a
#> future release. Please use the 'design' or 'formula_design' argument instead.
#> class: RangedSummarizedExperiment
#> dim: 63677 8
#> metadata(1): ''
#> assays(1): counts
#> rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
#> ENSG00000273493
#> rowData names(11): gene_id gene_name ... symbol .abundant
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(10): SampleName cell ... BioSample condition
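The deprecation warning above points to `formula_design` as the replacement for `factor_of_interest`. The sketch below assumes that `formula_design` accepts a one-sided formula referring to sample-level columns; this is an assumption, since the argument's description is not given here. The call is guarded so that it only runs when the tidybulk and airway packages are installed.

```r
# Assumption: formula_design takes a one-sided formula over colData columns
if (requireNamespace("tidybulk", quietly = TRUE) &&
    requireNamespace("airway", quietly = TRUE)) {
  data('airway', package = 'airway')
  # Same 'condition' column as in the examples above
  SummarizedExperiment::colData(airway)$condition <-
    SummarizedExperiment::colData(airway)$dex

  # Define sample groups through a formula instead of factor_of_interest
  result <- airway |> tidybulk::identify_abundant(formula_design = ~ condition)

  # Count the features flagged as abundant
  table(SummarizedExperiment::rowData(result)$.abundant)
}
```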