Filters the data to keep only transcripts/genes that are consistently expressed above a threshold across samples. This is a filtering version of identify_abundant() that removes low-abundance features instead of just marking them.

This function is similar to identify_abundant() but instead of adding an .abundant column, it filters out the low-abundance features directly.

keep_abundant(
  .data,
  abundance = assayNames(.data)[1],
  design = NULL,
  formula_design = NULL,
  minimum_counts = 10,
  minimum_proportion = 0.7,
  minimum_count_per_million = NULL,
  factor_of_interest = NULL,
  ...,
  .abundance = NULL
)

Arguments

.data

A `tbl` or `SummarizedExperiment` object containing transcript/gene abundance data

abundance

The name of the transcript/gene abundance column (character, preferred)

design

A design matrix for more complex experimental designs. If provided, this is passed to filterByExpr instead of factor_of_interest.

formula_design

A formula for creating the design matrix

minimum_counts

The minimum count threshold for a feature to be considered abundant

minimum_proportion

The minimum proportion of samples in which a feature must be abundant

minimum_count_per_million

The minimum count per million threshold

factor_of_interest

The name of the column containing groups/conditions for filtering. DEPRECATED: Use 'design' or 'formula_design' instead.

...

Further arguments.

.abundance

DEPRECATED. The name of the transcript/gene abundance column (symbolic, for backward compatibility)

Value

Returns a filtered version of the input object containing only the features that passed the abundance threshold criteria.

Returns a filtered version of the input object containing only the features that passed the abundance threshold criteria.

Details

Filter to keep only abundant transcripts/genes

[Questioning]

This function uses edgeR's filterByExpr() function to identify and keep consistently expressed features. A feature is kept if it has CPM > minimum_counts in at least minimum_proportion of samples in at least one experimental group (defined by factor_of_interest or design).

This function is similar to identify_abundant() but instead of adding an .abundant column, it filters out the low-abundance features directly.

References

McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288-4297. DOI: 10.1093/bioinformatics/btp616

Examples

## Load airway dataset for examples

  data('airway', package = 'airway')
  # Ensure a 'condition' column exists for examples expecting it

    SummarizedExperiment::colData(airway)$condition <- SummarizedExperiment::colData(airway)$dex


# Basic usage
airway |> keep_abundant()
#> Warning: All samples appear to belong to the same group.
#> class: RangedSummarizedExperiment 
#> dim: 14224 8 
#> metadata(1): ''
#> assays(1): counts
#> rownames(14224): ENSG00000000003 ENSG00000000419 ... ENSG00000273382
#>   ENSG00000273486
#> rowData names(11): gene_id gene_name ... symbol .abundant
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(10): SampleName cell ... BioSample condition

# With custom thresholds
airway |> keep_abundant(
  minimum_counts = 5,
  minimum_proportion = 0.5
)
#> Warning: All samples appear to belong to the same group.
#> class: RangedSummarizedExperiment 
#> dim: 15436 8 
#> metadata(1): ''
#> assays(1): counts
#> rownames(15436): ENSG00000000003 ENSG00000000419 ... ENSG00000273478
#>   ENSG00000273486
#> rowData names(11): gene_id gene_name ... symbol .abundant
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(10): SampleName cell ... BioSample condition

# Using a factor of interest
airway |> keep_abundant(factor_of_interest = condition)
#> Warning: The `factor_of_interest` argument of `keep_abundant()` is deprecated as of
#> tidybulk 2.0.0.
#>  Please use the `formula_design` argument instead.
#>  The argument 'factor_of_interest' is deprecated and will be removed in a
#>   future release. Please use the 'design' or 'formula_design' argument instead.
#> class: RangedSummarizedExperiment 
#> dim: 15926 8 
#> metadata(1): ''
#> assays(1): counts
#> rownames(15926): ENSG00000000003 ENSG00000000419 ... ENSG00000273472
#>   ENSG00000273486
#> rowData names(11): gene_id gene_name ... symbol .abundant
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(10): SampleName cell ... BioSample condition