plyranges provides a consistent interface for importing and wrangling genomics data from a variety of sources. The package defines a grammar of genomic data transformation based on dplyr
and the Bioconductor packages IRanges
, GenomicRanges
, and rtracklayer
. It does this by providing a set of verbs for developing analysis pipelines based on Ranges objects that represent genomic regions:
mutate()
and stretch()
functions.anchor_
family of functions.arrange()
.mutate()
, filter()
, and summarise()
functions.group_by()
.join_nearest_
family of functions.join_overlaps_
family of functions.reduce_ranges()
.disjoin_ranges()
.read_/write_
family of functions.For more details on the features of plyranges, read the vignette. For a complete case-study on using plyranges to combine ATAC-seq and RNA-seq results read the fluentGenomics workflow.
plyranges is part of the tidyomics project, providing a dplyr
-based interface for many types of genomics datasets represented in Bioconductor.
plyranges can be installed from the latest Bioconductor release:
# install.packages("BiocManager")
BiocManager::install("plyranges")
To install the development version from GitHub:
BiocManager::install("tidyomics/plyranges")
Ranges
Ranges
objects can either represent sets of integers as IRanges
(which have start, end and width attributes) or represent genomic intervals (which have additional attributes, sequence name, and strand) as GRanges
. In addition, both types of Ranges
can store information about their intervals as metadata columns (for example GC content over a genomic interval).
Ranges
objects follow the tidy data principle: each row of a Ranges
object corresponds to an interval, while each column will represent a variable about that interval, and generally each object will represent a single unit of observation (like gene annotations).
We can construct a IRanges
object from a data.frame
with a start
or width
using the as_iranges()
method.
library(plyranges)
df <- data.frame(start = 1:5, width = 5)
as_iranges(df)
#> IRanges object with 5 ranges and 0 metadata columns:
#> start end width
#> <integer> <integer> <integer>
#> [1] 1 5 5
#> [2] 2 6 5
#> [3] 3 7 5
#> [4] 4 8 5
#> [5] 5 9 5
# alternatively with end
df <- data.frame(start = 1:5, end = 5:9)
as_iranges(df)
#> IRanges object with 5 ranges and 0 metadata columns:
#> start end width
#> <integer> <integer> <integer>
#> [1] 1 5 5
#> [2] 2 6 5
#> [3] 3 7 5
#> [4] 4 8 5
#> [5] 5 9 5
We can also construct a GRanges
object in a similar manner. Note that a GRanges
object requires at least a seqnames column to be present in the data.frame (but not necessarily a strand column).
df <- data.frame(seqnames = c("chr1", "chr2", "chr2", "chr1", "chr2"),
start = 1:5,
width = 5)
as_granges(df)
#> GRanges object with 5 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 1-5 *
#> [2] chr2 2-6 *
#> [3] chr2 3-7 *
#> [4] chr1 4-8 *
#> [5] chr2 5-9 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
# strand can be specified with `+`, `*` (mising) and `-`
df$strand <- c("+", "+", "-", "-", "*")
as_granges(df)
#> GRanges object with 5 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 1-5 +
#> [2] chr2 2-6 +
#> [3] chr2 3-7 -
#> [4] chr1 4-8 -
#> [5] chr2 5-9 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
Let’s look at a more a realistic example (taken from HelloRanges vignette).
Suppose we have two GRanges objects: one containing coordinates of known exons and another containing SNPs from a GWAS.
The first and last 5 exons are printed below, there are two additional columns corresponding to the exon name, and a score.
We could check the number of exons per chromosome using group_by
and summarise
.
exons
#> GRanges object with 459752 ranges and 2 metadata columns:
#> seqnames ranges strand | name score
#> <Rle> <IRanges> <Rle> | <character> <numeric>
#> [1] chr1 11874-12227 + | NR_046018_exon_0_0_c.. 0
#> [2] chr1 12613-12721 + | NR_046018_exon_1_0_c.. 0
#> [3] chr1 13221-14409 + | NR_046018_exon_2_0_c.. 0
#> [4] chr1 14362-14829 - | NR_024540_exon_0_0_c.. 0
#> [5] chr1 14970-15038 - | NR_024540_exon_1_0_c.. 0
#> ... ... ... ... . ... ...
#> [459748] chrY 59338754-59338859 + | NM_002186_exon_6_0_c.. 0
#> [459749] chrY 59338754-59338859 + | NM_176786_exon_7_0_c.. 0
#> [459750] chrY 59340194-59340278 + | NM_002186_exon_7_0_c.. 0
#> [459751] chrY 59342487-59343488 + | NM_002186_exon_8_0_c.. 0
#> [459752] chrY 59342487-59343488 + | NM_176786_exon_8_0_c.. 0
#> -------
#> seqinfo: 93 sequences from an unspecified genome; no seqlengths
exons %>%
group_by(seqnames) %>%
summarise(n = n())
#> DataFrame with 49 rows and 2 columns
#> seqnames n
#> <Rle> <integer>
#> 1 chr1 43366
#> 2 chr10 19420
#> 3 chr11 24476
#> 4 chr12 24949
#> 5 chr13 7974
#> ... ... ...
#> 45 chrUn_gl000222 20
#> 46 chrUn_gl000223 22
#> 47 chrUn_gl000228 85
#> 48 chrX 18173
#> 49 chrY 4128
Next we create a column representing the transcript_id with mutate
:
To find all GWAS SNPs that overlap exons, we use join_overlap_inner
. This will create a new GRanges with the coordinates of SNPs that overlap exons, as well as metadata from both objects.
olap <- join_overlap_inner(gwas, exons)
olap
#> GRanges object with 3439 ranges and 4 metadata columns:
#> seqnames ranges strand | name.x name.y score
#> <Rle> <IRanges> <Rle> | <character> <character> <numeric>
#> [1] chr1 1079198 * | rs11260603 NR_038869_exon_2_0_c.. 0
#> [2] chr1 1247494 * | rs12103 NM_001256456_exon_1_.. 0
#> [3] chr1 1247494 * | rs12103 NM_001256460_exon_1_.. 0
#> [4] chr1 1247494 * | rs12103 NM_001256462_exon_1_.. 0
#> [5] chr1 1247494 * | rs12103 NM_001256463_exon_1_.. 0
#> ... ... ... ... . ... ... ...
#> [3435] chrX 153764217 * | rs1050828 NM_001042351_exon_9_.. 0
#> [3436] chrX 153764217 * | rs1050828 NM_000402_exon_9_0_c.. 0
#> [3437] chrX 153764217 * | rs1050828 NM_001042351_exon_9_.. 0
#> [3438] chrX 153764217 * | rs1050828 NM_000402_exon_9_0_c.. 0
#> [3439] chrX 153764217 * | rs1050828 NM_001042351_exon_9_.. 0
#> tx_id
#> <character>
#> [1] NR_038869
#> [2] NM_001256456
#> [3] NM_001256460
#> [4] NM_001256462
#> [5] NM_001256463
#> ... ...
#> [3435] NM_001042351
#> [3436] NM_000402
#> [3437] NM_001042351
#> [3438] NM_000402
#> [3439] NM_001042351
#> -------
#> seqinfo: 93 sequences from an unspecified genome; no seqlengths
For each SNP we can count the number of times it overlaps a transcript.
olap %>%
group_by(name.x, tx_id) %>%
summarise(n = n())
#> DataFrame with 1619 rows and 3 columns
#> name.x tx_id n
#> <character> <character> <integer>
#> 1 rs10043775 NM_001271723 1
#> 2 rs10043775 NM_030793 1
#> 3 rs10078 NM_001242412 1
#> 4 rs10078 NM_020731 1
#> 5 rs10089 NM_001046 1
#> ... ... ... ...
#> 1615 rs9906595 NM_001008777 1
#> 1616 rs9948 NM_017623 1
#> 1617 rs9948 NM_199078 1
#> 1618 rs995030 NM_000899 4
#> 1619 rs995030 NM_003994 4
We can also generate 2bp splice sites on either side of the exon using flank_left
and flank_right
. We add a column indicating the side of flanking for illustrative purposes. The interweave
function pairs the left and right ranges objects.
left_ss <- flank_left(exons, 2L)
right_ss <- flank_right(exons, 2L)
all_ss <- interweave(left_ss, right_ss, .id = "side")
all_ss
#> GRanges object with 919504 ranges and 4 metadata columns:
#> seqnames ranges strand | name score
#> <Rle> <IRanges> <Rle> | <character> <numeric>
#> [1] chr1 11872-11873 + | NR_046018_exon_0_0_c.. 0
#> [2] chr1 12228-12229 + | NR_046018_exon_0_0_c.. 0
#> [3] chr1 12611-12612 + | NR_046018_exon_1_0_c.. 0
#> [4] chr1 12722-12723 + | NR_046018_exon_1_0_c.. 0
#> [5] chr1 13219-13220 + | NR_046018_exon_2_0_c.. 0
#> ... ... ... ... . ... ...
#> [919500] chrY 59340279-59340280 + | NM_002186_exon_7_0_c.. 0
#> [919501] chrY 59342485-59342486 + | NM_002186_exon_8_0_c.. 0
#> [919502] chrY 59343489-59343490 + | NM_002186_exon_8_0_c.. 0
#> [919503] chrY 59342485-59342486 + | NM_176786_exon_8_0_c.. 0
#> [919504] chrY 59343489-59343490 + | NM_176786_exon_8_0_c.. 0
#> tx_id side
#> <character> <character>
#> [1] NR_046018 left
#> [2] NR_046018 right
#> [3] NR_046018 left
#> [4] NR_046018 right
#> [5] NR_046018 left
#> ... ... ...
#> [919500] NM_002186 right
#> [919501] NM_002186 left
#> [919502] NM_002186 right
#> [919503] NM_176786 left
#> [919504] NM_176786 right
#> -------
#> seqinfo: 93 sequences from an unspecified genome; no seqlengths
The fluentGenomics workflow package shows you how to combine differential expression genes and differential chromatin accessibility peaks using plyranges. It extends the case study by Michael Love for using plyranges with tximeta.
The extended vignette in the plyrangesWorkshops package has a detailed walk through of using plyranges for coverage analysis.
The Bioc 2018 Workshop book has worked examples of using plyranges
to analyse publicly available genomics data.
If you found plyranges
useful for your work please cite our paper:
@ARTICLE{Lee2019,
title = "plyranges: a grammar of genomic data transformation",
author = "Lee, Stuart and Cook, Dianne and Lawrence, Michael",
journal = "Genome Biol.",
volume = 20,
number = 1,
pages = "4",
month = jan,
year = 2019,
url = "http://dx.doi.org/10.1186/s13059-018-1597-8",
doi = "10.1186/s13059-018-1597-8",
pmc = "PMC6320618"
}
We welcome contributions from the R/Bioconductor community. We ask that contributors follow the code of conduct and the guide outlined here.