Brings SummarizedExperiment to the tidyverse!

website: stemangiola.github.io/tidySummarizedExperiment/

Another nice introduction by carpentries-incubator.

Please also have a look at

  • tidySingleCellExperiment for tidy manipulation of SingleCellExperiment objects
  • tidyseurat for tidy manipulation of Seurat objects
  • tidybulk for tidy analysis of RNA sequencing data
  • nanny for tidy high-level data analysis and manipulation
  • tidygate for adding custom gate information to your tibble
  • tidyHeatmap for heatmaps produced with tidy principles

Introduction

tidySummarizedExperiment provides a bridge between Bioconductor SummarizedExperiment [@morgan2020summarized] and the tidyverse [@wickham2019welcome]. It creates an invisible layer that enables viewing the Bioconductor SummarizedExperiment object as a tidyverse tibble, and provides SummarizedExperiment-compatible dplyr, tidyr, ggplot and plotly functions. This allows users to get the best of both Bioconductor and tidyverse worlds.

Functions/utilities available

SummarizedExperiment-compatible Functions Description
all After all tidySummarizedExperiment is a SummarizedExperiment object, just better
tidyverse Packages Description
dplyr Almost all dplyr APIs like for any tibble
tidyr Almost all tidyr APIs like for any tibble
ggplot2 ggplot like for any tibble
plotly plot_ly like for any tibble
Utilities Description
as_tibble Convert cell-wise information to a tbl_df

Installation

if (!requireNamespace("BiocManager", quietly=TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("tidySummarizedExperiment")

From Github (development)

devtools::install_github("stemangiola/tidySummarizedExperiment")

Load libraries used in the examples.

Create tidySummarizedExperiment, the best of both worlds!

This is a SummarizedExperiment object but it is evaluated as a tibble. So it is fully compatible both with SummarizedExperiment and tidyverse APIs.

pasilla_tidy <- tidySummarizedExperiment::pasilla 

It looks like a tibble

pasilla_tidy
## class: SummarizedExperiment 
## dim: 14599 7 
## metadata(0):
## assays(1): counts
## rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
## rowData names(0):
## colnames(7): untrt1 untrt2 ... trt2 trt3
## colData names(2): condition type

But it is a SummarizedExperiment object after all

assays(pasilla_tidy)
## List of length 1
## names(1): counts

Tidyverse commands

We can use tidyverse commands to explore the tidy SummarizedExperiment object.

We can use slice to choose rows by position, for example to choose the first row.

pasilla_tidy %>%
    slice(1)
## class: SummarizedExperiment 
## dim: 1 1 
## metadata(1): latest_join_scope_report
## assays(1): counts
## rownames(1): FBgn0000003
## rowData names(0):
## colnames(1): untrt1
## colData names(2): condition type

We can use filter to choose rows by criteria.

pasilla_tidy %>%
    filter(condition == "untreated")
## class: SummarizedExperiment 
## dim: 14599 4 
## metadata(1): latest_filter_scope_report
## assays(1): counts
## rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
## rowData names(0):
## colnames(4): untrt1 untrt2 untrt3 untrt4
## colData names(2): condition type

We can use select to choose columns.

pasilla_tidy %>%
    select(.sample)
## 
[90m# A tibble: 102,193 × 1
[39m
##    .sample
##    
[3m
[90m<chr>
[39m
[23m  
## 
[90m 1
[39m untrt1 
## 
[90m 2
[39m untrt1 
## 
[90m 3
[39m untrt1 
## 
[90m 4
[39m untrt1 
## 
[90m 5
[39m untrt1 
## 
[90m 6
[39m untrt1 
## 
[90m 7
[39m untrt1 
## 
[90m 8
[39m untrt1 
## 
[90m 9
[39m untrt1 
## 
[90m10
[39m untrt1 
## 
[90m# ℹ 102,183 more rows
[39m

We can use count to count how many rows we have for each sample.

pasilla_tidy %>%
    count(.sample)
## 
[90m# A tibble: 7 × 2
[39m
##   .sample     n
##   
[3m
[90m<chr>
[39m
[23m   
[3m
[90m<int>
[39m
[23m
## 
[90m1
[39m trt1    
[4m1
[24m
[4m4
[24m599
## 
[90m2
[39m trt2    
[4m1
[24m
[4m4
[24m599
## 
[90m3
[39m trt3    
[4m1
[24m
[4m4
[24m599
## 
[90m4
[39m untrt1  
[4m1
[24m
[4m4
[24m599
## 
[90m5
[39m untrt2  
[4m1
[24m
[4m4
[24m599
## 
[90m6
[39m untrt3  
[4m1
[24m
[4m4
[24m599
## 
[90m7
[39m untrt4  
[4m1
[24m
[4m4
[24m599

We can use distinct to see what distinct sample information we have.

pasilla_tidy %>%
    distinct(.sample, condition, type)
## 
[90m# A tibble: 7 × 3
[39m
##   .sample condition type      
##   
[3m
[90m<chr>
[39m
[23m   
[3m
[90m<chr>
[39m
[23m     
[3m
[90m<chr>
[39m
[23m     
## 
[90m1
[39m untrt1  untreated single_end
## 
[90m2
[39m untrt2  untreated single_end
## 
[90m3
[39m untrt3  untreated paired_end
## 
[90m4
[39m untrt4  untreated paired_end
## 
[90m5
[39m trt1    treated   single_end
## 
[90m6
[39m trt2    treated   paired_end
## 
[90m7
[39m trt3    treated   paired_end

We could use rename to rename a column. For example, to modify the type column name.

pasilla_tidy %>%
    rename(sequencing=type)
## class: SummarizedExperiment 
## dim: 14599 7 
## metadata(0):
## assays(1): counts
## rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
## rowData names(0):
## colnames(7): untrt1 untrt2 ... trt2 trt3
## colData names(2): condition sequencing

We could use mutate to create a column. For example, we could create a new type column that contains single and paired instead of single_end and paired_end.

pasilla_tidy %>%
    mutate(type=gsub("_end", "", type))
## class: SummarizedExperiment 
## dim: 14599 7 
## metadata(1): latest_mutate_scope_report
## assays(1): counts
## rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
## rowData names(0):
## colnames(7): untrt1 untrt2 ... trt2 trt3
## colData names(2): condition type

We could use unite to combine multiple columns into a single column.

pasilla_tidy %>%
    unite("group", c(condition, type))
## class: SummarizedExperiment 
## dim: 14599 7 
## metadata(0):
## assays(1): counts
## rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
## rowData names(0):
## colnames(7): untrt1 untrt2 ... trt2 trt3
## colData names(1): group

We can also combine commands with the tidyverse pipe %>%.

For example, we could combine group_by and summarise to get the total counts for each sample.

pasilla_tidy %>%
    group_by(.sample) %>%
    summarise(total_counts=sum(counts))
## 
[90m# A tibble: 7 × 2
[39m
##   .sample total_counts
##   
[3m
[90m<chr>
[39m
[23m          
[3m
[90m<int>
[39m
[23m
## 
[90m1
[39m trt1        18
[4m6
[24m
[4m7
[24m
[4m0
[24m279
## 
[90m2
[39m trt2         9
[4m5
[24m
[4m7
[24m
[4m1
[24m826
## 
[90m3
[39m trt3        10
[4m3
[24m
[4m4
[24m
[4m3
[24m856
## 
[90m4
[39m untrt1      13
[4m9
[24m
[4m7
[24m
[4m2
[24m512
## 
[90m5
[39m untrt2      21
[4m9
[24m
[4m1
[24m
[4m1
[24m438
## 
[90m6
[39m untrt3       8
[4m3
[24m
[4m5
[24m
[4m8
[24m426
## 
[90m7
[39m untrt4       9
[4m8
[24m
[4m4
[24m
[4m1
[24m335

We could combine group_by, mutate and filter to get the transcripts with mean count > 0.

pasilla_tidy %>%
    group_by(.feature) %>%
    mutate(mean_count=mean(counts)) %>%
    filter(mean_count > 0)
## 
[90m# A tibble: 86,513 × 6
[39m
## 
[90m# Groups:   .feature [12,359]
[39m
##    .feature    .sample counts condition type       mean_count
##    
[3m
[90m<chr>
[39m
[23m       
[3m
[90m<chr>
[39m
[23m    
[3m
[90m<int>
[39m
[23m 
[3m
[90m<chr>
[39m
[23m     
[3m
[90m<chr>
[39m
[23m           
[3m
[90m<dbl>
[39m
[23m
## 
[90m 1
[39m FBgn0000003 untrt1       0 untreated single_end      0.143
## 
[90m 2
[39m FBgn0000008 untrt1      92 untreated single_end     99.6  
## 
[90m 3
[39m FBgn0000014 untrt1       5 untreated single_end      1.43 
## 
[90m 4
[39m FBgn0000015 untrt1       0 untreated single_end      0.857
## 
[90m 5
[39m FBgn0000017 untrt1    
[4m4
[24m664 untreated single_end   
[4m4
[24m672.   
## 
[90m 6
[39m FBgn0000018 untrt1     583 untreated single_end    461.   
## 
[90m 7
[39m FBgn0000022 untrt1       0 untreated single_end      0.143
## 
[90m 8
[39m FBgn0000024 untrt1      10 untreated single_end      7    
## 
[90m 9
[39m FBgn0000028 untrt1       0 untreated single_end      0.429
## 
[90m10
[39m FBgn0000032 untrt1    
[4m1
[24m446 untreated single_end   
[4m1
[24m085.   
## 
[90m# ℹ 86,503 more rows
[39m

Plotting

my_theme <-
    list(
        scale_fill_brewer(palette="Set1"),
        scale_color_brewer(palette="Set1"),
        theme_bw() +
            theme(
                panel.border=element_blank(),
                axis.line=element_line(),
                panel.grid.major=element_line(size=0.2),
                panel.grid.minor=element_line(size=0.1),
                text=element_text(size=12),
                legend.position="bottom",
                aspect.ratio=1,
                strip.background=element_blank(),
                axis.title.x=element_text(margin=margin(t=10, r=10, b=10, l=10)),
                axis.title.y=element_text(margin=margin(t=10, r=10, b=10, l=10))
            )
    )

We can treat pasilla_tidy as a normal tibble for plotting.

Here we plot the distribution of counts per sample.

pasilla_tidy %>%
    ggplot(aes(counts + 1, group=.sample, color=`type`)) +
    geom_density() +
    scale_x_log10() +
    my_theme