Tidyomics: Enabling Tidy Data Analysis for Complex Biological Data

Michael Love

Biostatistics & Genetics, UNC-Chapel Hill

Tidyomics project

An open, open-source project spanning multiple R packages, and developers from around the world. Operates within the Bioconductor Project, with separate funding and organization. Coordination of development via GitHub Projects.

What is “tidy data”

  • One row per observation, one column per variable.

  • Genomic regions (BED) already in this format.

  • Matrices and annotated matrices are not.

A grammar of data manipulation

In the R package dplyr:

  • mutate() adds new variables that are functions of existing variables.
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • slice() picks cases based on their position.
  • summarize() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.
  • group_by() perform any operation by group.

https://dplyr.tidyverse.org/

The pipe

command | command | command > output.txt

“Pipes rank alongside the hierarchical file system and regular expressions as one of the most powerful yet elegant features of Unix-like operating systems.”

http://www.linfo.org/pipe.html

In R we use %>% or |> instead of | to chain operations.

Tidyomics = dplyr verbs applied to omics

  • Genomic regions (called “ranges”)

  • Matrix data, e.g. gene expression over samples or cells

  • Genomic interactions, e.g. DNA loops

  • And more…

Diagram of tidyomics workflows

Diagram of how packages share a similar grammar=

International development team

Diagram of tidyomics community=

Keep data organized in objects

  • We typically have more information than just a matrix

  • Row and column information (on features and samples)

  • Annotated matrix data, i.e. python’s xarray, AnnData

  • Metadata: organism, genome build, annotation release

  • 2002: ExpressionSet; 2011: SummarizedExperiment

  • Endomorphic functions: x |> f -> x

SummarizedExperiment, “SE”

A SummarizedExperiment is annotated matrix data.

Imagine a matrix of counts:

counts <- matrix(
  rpois(16, lambda=100), ncol=4,
  dimnames=list(c("g1","g2","g3","g4"),
                c("s1","s2","s3","s4"))
)
counts
    s1  s2  s3  s4
g1  94 117 112  94
g2 111 104 103 112
g3  83  87 104 108
g4 101  82 101  99

Row data and column data

metadata about genes (rows)

DataFrame with 4 rows and 2 columns
           id      symbol
  <character> <character>
1          g1         ABC
2          g2         DEF
3          g3         GHI
4          g4         JKL

metadata about samples (columns)

DataFrame with 4 rows and 3 columns
       sample   condition         sex
  <character> <character> <character>
1          s1           x           m
2          s2           y           m
3          s3           x           f
4          s4           y           f

Assembled object

se <- SummarizedExperiment(
  assays = list(counts = counts),
  rowData = genes, 
  colData = samples,
  metadata = list(organism="Hsapiens")
)
se
class: SummarizedExperiment 
dim: 4 4 
metadata(1): organism
assays(1): counts
rownames(4): g1 g2 g3 g4
rowData names(2): id symbol
colnames(4): s1 s2 s3 s4
colData names(3): sample condition sex

Deals with bookkeeping issues

Reordering columns propagates to assay and colData:

se2 <- se[,c(1,3,2,4)]
assay(se2, "counts")
    s1  s3  s2  s4
g1  94 112 117  94
g2 111 103 104 112
g3  83 104  87 108
g4 101 101  82  99
colData(se2)
DataFrame with 4 rows and 3 columns
        sample   condition         sex
   <character> <character> <character>
s1          s1           x           m
s3          s3           x           f
s2          s2           y           m
s4          s4           y           f

Deals with bookkeeping issues

Assignment with conflicting metadata gives an error:

assay(se2) <- counts
# Error:
# please use 'assay(x, withDimnames=FALSE)) <- value' or 
# 'assays(x, withDimnames=FALSE)) <- value'
# when the rownames or colnames of the supplied assay(s) are not 
# identical to those of the receiving  SummarizedExperiment object 'x'
  • Other such validity checks include comparison across different genome builds.

  • Errors triggered by metadata better than silent errors!

Can be hard for new users

slotNames(se)
[1] "colData"         "assays"          "NAMES"           "elementMetadata" "metadata"       
methods(class = class(se)) |> head()
[1] "!=,ANY,Vector-method"                         "!=,Vector,ANY-method"                        
[3] "!=,Vector,Vector-method"                      "[,SummarizedExperiment,ANY,ANY,ANY-method"   
[5] "[[,SummarizedExperiment,ANY,missing-method"   "[[<-,SummarizedExperiment,ANY,missing-method"
  • An advanced R/Bioconductor user should eventually learn these methods, class/method inheritance.

  • Not needed to explore or visualize data, or make basic data summaries.

Beginners know dplyr & ggplot2

library(dplyr)
# filter to samples in condition 'x'
samples |> 
  as_tibble() |> 
  filter(condition == "x")
# A tibble: 2 × 3
  sample condition sex  
  <chr>  <chr>     <chr>
1 s1     x         m    
2 s3     x         f    

Enabling dplyr verbs for omics

  • tidySummarizedExperiment package from Mangiola et al.

  • Printing says: “SummarizedExperiment-tibble abstraction

library(tidySummarizedExperiment)
se
# A SummarizedExperiment-tibble abstraction: 16 × 8
# Features=4 | Samples=4 | Assays=counts
   .feature .sample counts sample condition sex   id    symbol
   <chr>    <chr>    <int> <chr>  <chr>     <chr> <chr> <chr> 
 1 g1       s1          94 s1     x         m     g1    ABC   
 2 g2       s1         111 s1     x         m     g2    DEF   
 3 g3       s1          83 s1     x         m     g3    GHI   
 4 g4       s1         101 s1     x         m     g4    JKL   
 5 g1       s2         117 s2     y         m     g1    ABC   
 6 g2       s2         104 s2     y         m     g2    DEF   
 7 g3       s2          87 s2     y         m     g3    GHI   
 8 g4       s2          82 s2     y         m     g4    JKL   
 9 g1       s3         112 s3     x         f     g1    ABC   
10 g2       s3         103 s3     x         f     g2    DEF   
11 g3       s3         104 s3     x         f     g3    GHI   
12 g4       s3         101 s3     x         f     g4    JKL   
13 g1       s4          94 s4     y         f     g1    ABC   
14 g2       s4         112 s4     y         f     g2    DEF   
15 g3       s4         108 s4     y         f     g3    GHI   
16 g4       s4          99 s4     y         f     g4    JKL   

Tidyomics is an API

Interact with native objects using standard R/Bioc methods.

Counter with a menu and a bell

Still a standard Bioc object

class(se)
[1] "SummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
dim(se)
[1] 4 4

We can use familiar dplyr verbs

se |> 
  filter(condition == "x")
# A SummarizedExperiment-tibble abstraction: 8 × 8
# Features=4 | Samples=2 | Assays=counts
  .feature .sample counts sample condition sex   id    symbol
  <chr>    <chr>    <int> <chr>  <chr>     <chr> <chr> <chr> 
1 g1       s1          94 s1     x         m     g1    ABC   
2 g2       s1         111 s1     x         m     g2    DEF   
3 g3       s1          83 s1     x         m     g3    GHI   
4 g4       s1         101 s1     x         m     g4    JKL   
5 g1       s3         112 s3     x         f     g1    ABC   
6 g2       s3         103 s3     x         f     g2    DEF   
7 g3       s3         104 s3     x         f     g3    GHI   
8 g4       s3         101 s3     x         f     g4    JKL   

We can use familiar dplyr verbs

se |> 
  filter(condition == "x") |>
  colData()
DataFrame with 2 rows and 3 columns
        sample   condition         sex
   <character> <character> <character>
s1          s1           x           m
s3          s3           x           f

Facilitates quick exploration

library(ggplot2)
se |> 
  filter(.feature %in% c("g1","g2")) |> # `.feature` a special name
  ggplot(aes(condition, counts, color=sex, group=sex)) + 
  geom_point(size=2) + geom_line() + facet_wrap(~.feature)

Seurat and SingleCellExperiment

SingleCellExperiment has slots for e.g. reduced dimensions.

sce |>
  filter(Phase == "G1") |>
  ggplot(aes(UMAP_1, UMAP_2, color=nCount_RNA)) +
  geom_point() + scale_color_gradient2(trans="log10")

Efficient operation on SE with plyxp

  • Justin Landis (UNC BCB) noticed some opportunities for more efficient operations.

  • Also, reduce ambiguity, allow multiple ways to access data across context. Leveraging data masks from rlang.

  • Developed in Summer 2024: plyxp, stand-alone but also as a tidySummarizedExperiment engine.

Efficient operation on SE with plyxp

Suppose the following SE object

assay(simple)
  a b c  d
A 1 4 7 10
B 2 5 8 11
C 3 6 9 12
colData(simple) |> as.data.frame()
  type
a    1
b    1
c    2
d    2
rowData(simple) |> as.data.frame()
  length
A     10
B     20
C     30

Trigger methods with new_plyxp()

library(plyxp)
xp <- simple |> 
  new_plyxp()
xp
# A SummarizedExperiment-tibble Abstraction: 3 × 4
    .features .samples | counts | length | type 
    <chr>     <chr>    |  <int> |  <dbl> | <fct>
  1 A         a        |      1 |     10 | 1    
  2 B         a        |      2 |     20 | 1    
  3 C         a        |      3 |     30 | 1    
  4 A         b        |      4 |     10 | 1    
  5 B         b        |      5 |     20 | 1    
  … …         …               …        …   …    
n-4 B         c        |      8 |     20 | 2    
n-3 C         c        |      9 |     30 | 2    
n-2 A         d        |     10 |     10 | 2    
n-1 B         d        |     11 |     20 | 2    
n   C         d        |     12 |     30 | 2    
# ℹ n = 12

Mutate with plyxp

xp |> # assay is default context for mutate()
  mutate(norm_counts = counts / .rows$length)
# A SummarizedExperiment-tibble Abstraction: 3 × 4
    .features .samples | counts norm_counts | length | type 
    <chr>     <chr>    |  <int>       <dbl> |  <dbl> | <fct>
  1 A         a        |      1        0.1  |     10 | 1    
  2 B         a        |      2        0.1  |     20 | 1    
  3 C         a        |      3        0.1  |     30 | 1    
  4 A         b        |      4        0.4  |     10 | 1    
  5 B         b        |      5        0.25 |     20 | 1    
  … …         …               …          …         …   …    
n-4 B         c        |      8        0.4  |     20 | 2    
n-3 C         c        |      9        0.3  |     30 | 2    
n-2 A         d        |     10        1    |     10 | 2    
n-1 B         d        |     11        0.55 |     20 | 2    
n   C         d        |     12        0.4  |     30 | 2    
# ℹ n = 12

What was that .rows$ doing?

Grammar for group and summarize

summed <- xp |>
  group_by(cols(type)) |>
  summarize(sum = rowSums(counts))
summed
# A SummarizedExperiment-tibble Abstraction: 3 × 2
# Groups: cols(type)
  .features .samples |   sum | length |  type
  <chr>        <int> | <dbl> |  <dbl> | <int>
1 A                1 |     5 |     10 |     1
2 B                1 |     7 |     20 |     1
3 C                1 |     9 |     30 |     1
4 A                2 |    17 |     10 |     2
5 B                2 |    19 |     20 |     2
6 C                2 |    21 |     30 |     2
assay(summed)
  [,1] [,2]
A    5   17
B    7   19
C    9   21

Multiple implementations

xp |> # .assays_asis gives direct access to the matrix
  mutate(rows(ave_counts = rowMeans(.assays_asis$counts)))
# A SummarizedExperiment-tibble Abstraction: 3 × 4
    .features .samples | counts | length ave_counts | type 
    <chr>     <chr>    |  <int> |  <dbl>      <dbl> | <fct>
  1 A         a        |      1 |     10        5.5 | 1    
  2 B         a        |      2 |     20        6.5 | 1    
  3 C         a        |      3 |     30        7.5 | 1    
  4 A         b        |      4 |     10        5.5 | 1    
  5 B         b        |      5 |     20        6.5 | 1    
  … …         …               …        …         …    …    
n-4 B         c        |      8 |     20        6.5 | 2    
n-3 C         c        |      9 |     30        7.5 | 2    
n-2 A         d        |     10 |     10        5.5 | 2    
n-1 B         d        |     11 |     20        6.5 | 2    
n   C         d        |     12 |     30        7.5 | 2    
# ℹ n = 12

The plyxp class

A simple wrapper of SE:

class(xp)
[1] "PlySummarizedExperiment"
attr(,"package")
[1] "plyxp"
colData(xp)
DataFrame with 4 rows and 1 column
      type
  <factor>
a        1
b        1
c        2
d        2

Unwrapping plyxp

xp |> se()
class: SummarizedExperiment 
dim: 3 4 
metadata(0):
assays(1): counts
rownames(3): A B C
rowData names(1): length
colnames(4): a b c d
colData names(1): type

Computing on genomic ranges

  • Translate biological questions about the genome into code.

  • Leverage familiar concepts of filters, joins, grouping, mutation, or summarization.

T2T Consoritium, Science.

Genomic overlap as a join

Interwoven stacks of wood

Genomic overlap as a join

library(plyranges)
x
GRanges object with 40 ranges and 1 metadata column:
       seqnames    ranges strand |     score
          <Rle> <IRanges>  <Rle> | <numeric>
   [1]        1   114-115      * |   4.88780
   [2]        1   129-130      * |   4.40817
   [3]        1   154-155      * |   5.18773
   [4]        1   195-196      * |   5.81901
   [5]        1   200-201      * |   4.14720
   ...      ...       ...    ... .       ...
  [36]        1   898-899      * |   6.55006
  [37]        1   922-923      * |   6.46796
  [38]        1   956-957      * |   4.93079
  [39]        1   957-958      * |   4.34292
  [40]        1   966-967      * |   4.53976
  -------
  seqinfo: 1 sequence from an unspecified genome

Genomic overlap as a join

y
GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |       id
         <Rle> <IRanges>  <Rle> | <factor>
  [1]        1   101-300      * |        a
  [2]        1   451-650      * |        b
  [3]        1  801-1000      * |        c
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Computing overlap (here inner join)

x |> join_overlap_inner(y)
GRanges object with 30 ranges and 2 metadata columns:
       seqnames    ranges strand |     score       id
          <Rle> <IRanges>  <Rle> | <numeric> <factor>
   [1]        1   114-115      * |   4.88780        a
   [2]        1   129-130      * |   4.40817        a
   [3]        1   154-155      * |   5.18773        a
   [4]        1   195-196      * |   5.81901        a
   [5]        1   200-201      * |   4.14720        a
   ...      ...       ...    ... .       ...      ...
  [26]        1   898-899      * |   6.55006        c
  [27]        1   922-923      * |   6.46796        c
  [28]        1   956-957      * |   4.93079        c
  [29]        1   957-958      * |   4.34292        c
  [30]        1   966-967      * |   4.53976        c
  -------
  seqinfo: 1 sequence from an unspecified genome

Chaining operations

x |>
  filter(score > 3.5) |>
  join_overlap_inner(y) |>
  group_by(id) |>
  summarize(ave_score = mean(score), n = n())
DataFrame with 3 rows and 3 columns
        id ave_score         n
  <factor> <numeric> <integer>
1        a   5.00465        10
2        b   5.43353        11
3        c   5.45538         7

Options: directed, within, maxgap, minoverlap, etc.

Pipe to plot

x |>
  filter(score > 3.5) |>
  join_overlap_inner(y) |>
  as_tibble() |>
  ggplot(aes(x = id, y = score)) + 
  geom_violin() + geom_jitter(width=.1)

Convenience functions

y |> 
  anchor_5p() |> # 5', 3', start, end, center
  mutate(width=1) |>
  join_nearest(x, distance=TRUE)
GRanges object with 3 ranges and 3 metadata columns:
      seqnames    ranges strand |       id     score  distance
         <Rle> <IRanges>  <Rle> | <factor> <numeric> <integer>
  [1]        1       101      * |        a   4.88780        12
  [2]        1       451      * |        b   3.99047         1
  [3]        1       801      * |        c   6.49877        22
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Assess significance using a tidy grammar

  • Asking about the enrichment of variants near genes, or peaks in TADs, often requires a lot of custom R code, lots of loops and control code.

  • Hard to read, hard to maintain, hard at submission/publication.

  • Instead: use familiar verbs, stacked genomic range sets.

Defining null genomic range sets

  • We developed a package, nullranges, a modular tool to assist with making genomic comparisons.

  • Only provides null genomic range sets for investigating various hypotheses. That’s it!

  • Doesn’t do enrichment analysis per se, but can be combined with plyranges, ggplot2, etc.

Bootstrapping ranges

Statistical papers from the ENCODE project noted that block bootstrapping genomic data preserves important spatial patterns (Bickel et al. 2010).

Diagram of block bootstrapping

bootRanges

library(nullranges)
boot <- bootRanges(x, blockLength=10, R=20)
# keep track of bootstrap iteration, gives boot dist'n
boot |>
  join_overlap_inner(y) |>
  group_by(iter, id) |> # bootstrap iter, range id
  summarize(n_overlaps = n())
DataFrame with 60 rows and 3 columns
     iter       id n_overlaps
    <Rle> <factor>  <integer>
1       1        a          4
2       1        b         10
3       1        c          4
4       2        a         11
5       2        b         10
...   ...      ...        ...
56     19        b         18
57     19        c         11
58     20        a         10
59     20        b         15
60     20        c          8

bootRanges

boot |>
  join_overlap_inner(y) |>
  group_by(iter, id) |> # bootstrap iter, range id
  summarize(n_overlaps = n()) |>
  as_tibble() |>
  ggplot(aes(x = id, y = n_overlaps)) + 
  geom_boxplot()

Matching ranges

Matching on covariates from a large pool allows for more focused hypothesis testing.

Diagram of matching genomic ranges

matchRanges

xprime <- x |>
  filter(score > 5) |> # xprime are hi-score
  mutate(score = rnorm(n(), mean = score, sd = .5)) # jitter data

matchRanges

m <- matchRanges(focal = xprime, pool = x, covar = ~score)
plotCovariate(m)

Bind ranges

combined <- bind_ranges(
  focal = xprime,
  matched = matched(m),
  pool = x,
  .id = "origin"
)

This is now “tidy data” with the two group concatenated and a new metadata column, origin.

Group and summarize

combined |> 
  join_overlap_inner(y) |>
  group_by(id, origin) |>
  summarize(ave_score = mean(score))
DataFrame with 9 rows and 3 columns
        id  origin ave_score
  <factor>   <Rle> <numeric>
1        a   focal   6.17495
2        a matched   5.54412
3        a    pool   5.00465
4        b   focal   5.79665
5        b matched   6.07625
6        b    pool   5.25055
7        c   focal   6.32536
8        c matched   5.82463
9        c    pool   5.14840

Pipe to plot

combined |> 
  join_overlap_inner(y) |>
  group_by(id, origin) |>
  summarize(ave_score = mean(score), sd = sd(score)) |>
  as_tibble() |>
  ggplot(aes(origin, ave_score, 
             ymin=ave_score-2*sd, ymax=ave_score+2*sd)) + 
  geom_pointrange() +
  facet_wrap(~id, labeller = label_both)

bootRanges and matchRanges methods

Published as Application Notes in 2023:

Current state and future directions

  • Writing up plyxp as an Application Note.

  • Have shown it locally and to an industry team.

  • Working with Stefano Mangiola’s team on consistent printing, messaging.

  • Working on tutorials, workshops, documentation.

What has been implemented

  • Matrix-shaped objects (SE, SCE) - Mangiola et al.

  • Ranges - Lee et al.

  • Interactions - Serizay et al.

  • Cytometry - Keyes et al.

  • Spatial - Hutchison et al.

  • more to come…

Ongoing work

  • Package code and non-standard evaluation (IMO I don’t recommend)
  • Optimized code
  • Conflicts
  • Documentation!

How to contribute

Omics data analysis requires looking at:

  • main contributions to variance (e.g. PCA, see plotPCA for bulk and OSCA for sc), also variancePartition
  • column and row densities (geom_density of select rows/columns, or geom_violin)
  • known positive features, feature-level plots (filter to feature, pipe to geom_point etc.)

Acknowledgments

  • Stefano Mangiola (co-lead, tidySE, tidySCE, spatial)

  • Justin Landis (plyxp)

  • Eric Davis, Wancen Mu, Doug Phanstiel (nullranges)

  • Stuart Lee, M. Lawrence, D. Cook (plyranges developers)

tidyomics developers: William Hutchison, Timothy Keyes, Helena Crowell, Jacques Serizay, Charlotte Soneson, Eric Davis, Noriaki Sato, Lambda Moses, Boyd Tarlinton, Abdullah Nahid, Miha Kosmac, Quentin Clayssen, Victor Yuan, Wancen Mu, Ji-Eun Park, Izabela Mamede, Min Hyung Ryu, Pierre-Paul Axisa, Paulina Paiz, Chi-Lam Poon, Ming Tang

tidyomics project funded by an Essential Open Source Software award from CZI and Wellcome Trust