Instructors

Dr. Stefano Mangiola is currently a Postdoctoral researcher in the laboratory of Prof. Tony Papenfuss at the Walter and Eliza Hall Institute in Melbourne, Australia. His background spans from biotechnology to bioinformatics and biostatistics. His research focuses on prostate and breast tumour microenvironment, the development of statistical models for the analysis of RNA sequencing data, and data analysis and visualisation interfaces.

Workshop goals and objectives

What you will learn

  • Basic tidy operations possible with tidyseurat and tidySingleCellExperiment
  • The differences between Seurat and SingleCellExperiment representation, and tidy representation
  • How to interface Seurat and SingleCellExperiment with tidy manipulation and visualisation
  • A real-world case study that will showcase the power of tidy single-cell methods compared with base/ad-hoc methods

What you will not learn

  • The molecular technology of single-cell sequencing
  • The fundamentals of single-cell data analysis
  • The fundamentals of tidy data analysis

Getting started

Local

We will use the Cloud during the workshop and this method is available if you want to run the material after the workshop. If you want to install on your own computer, see instructions here.

Alternatively, you can view the material at the workshop webpage here.

Introduction to tidytranscriptomics

Here

Part 1 Introduction to tidyseurat

Seurat is a very popular analysis toolkit for single cell RNA sequencing data (Butler et al. 2018; Stuart et al. 2019).

Here we load single-cell data in Seurat object format. This data is peripheral blood mononuclear cells (PBMCs) from metastatic breast cancer patients.

# load single cell RNA sequencing data
seurat_obj <- tidyomicsWorkshop::seurat_obj

# take a look
seurat_obj

tidyseurat provides a bridge between the Seurat single-cell package and the tidyverse (Wickham et al. 2019). It creates an invisible layer that enables viewing the Seurat object as a tidyverse tibble, and provides Seurat-compatible dplyr, tidyr, ggplot and plotly functions.

If we load the tidyseurat package and then view the single cell data, it now displays as a tibble.

library(tidyseurat)

seurat_obj 

If we want to revert to the standard SingleCellExperiment view we can do that.

options("restore_Seurat_show" = TRUE)
seurat_obj

If we want to revert back to tidy SingleCellExperiment view we can.

options("restore_Seurat_show" = FALSE)
seurat_obj

It can be interacted with using Seurat commands such as Assays.

Assays(seurat_obj)

We can also interact with our object as we do with any tidyverse tibble.

Tidyverse commands

We can use tidyverse commands, such as filter, select and mutate to explore the tidyseurat object. Some examples are shown below and more can be seen at the tidyseurat website here.

We can use filter to choose rows, for example, to see just the rows for the cells in G1 cell-cycle stage. Check if have groups or ident present.

seurat_obj |> filter(Phase == "G1")

We can use select to choose columns, for example, to see the sample, cell, total cellular RNA

seurat_obj |> select(.cell, nCount_RNA, Phase)

We also see the UMAP columns as they are not part of the cell metadata, they are read-only. If we want to save the edited metadata, the Seurat object is modified accordingly.

# Save edited metadata
seurat_modified <- seurat_obj |> select(.cell, nCount_RNA, Phase)
# View Seurat metadata
seurat_modified[[]] |> head()

We can use mutate to create a column. For example, we could create a new Phase_l column that contains a lower-case version of Phase.

seurat_obj |>
    mutate(Phase_l=tolower(Phase)) |>
    
    # Select columns to view    
    select(.cell, Phase, Phase_l)

We can use tidyverse commands to polish an annotation column. We will extract the sample, and group information from the file name column into separate columns.

# First take a look at the file column
seurat_obj |> select(.cell, file)
# Create columns for sample and group
seurat_obj <- seurat_obj |>
    
    # Extract sample and group
    extract(file, "sample", "../data/.*/([a-zA-Z0-9_-]+)/outs.+", remove = FALSE)

# Take a look
seurat_obj |> select(.cell, sample)

We could use tidyverse unite to combine columns, for example to create a new column for sample id that combines the sample and patient identifier (BCB) columns.

seurat_obj <- seurat_obj |> unite("sample_id", sample, BCB, remove = FALSE)

# Take a look
seurat_obj |> select(.cell, sample_id, sample, BCB)

Part 2 Signature visualisation

We will now demonstrate a real-world example of the power of using tidy transcriptomics packages in single cell analysis. For more information on single-cell analysis steps performed in a tidy way please see the ISMB2021 workshop.

Data pre-processing

The object seurat_obj we’ve been using was created as part of a study on breast cancer systemic immune response. Peripheral blood mononuclear cells have been sequenced for RNA at the single-cell level. The steps used to generate the object are summarised below.

  • scran, scater, and DropletsUtils packages have been used to eliminate empty droplets and dead cells. Samples were individually quality checked and cells were filtered for good gene coverage.

  • Variable features were identified using Seurat.

  • Read counts were scaled and normalised using SCTtransform from Seurat.

  • Data integration was performed using Seurat with default parameters.

  • PCA performed to reduce feature dimensionality.

  • Nearest-neighbor cell networks were calculated using 30 principal components.

  • 2 UMAP dimensions were calculated using 30 principal components.

  • Cells with similar transcriptome profiles were grouped into clusters using Louvain clustering from Seurat.

Analyse custom signature

The researcher analysing this dataset wanted to to identify gamma delta T cells using a gene signature from a published paper (Pizzolato et al. 2019).

With tidyseurat’s join_features the counts for the genes could be viewed as columns.

seurat_obj |>
    
    
    join_features(
        features = c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"),
        shape = "wide",
        assay = "SCT"
    ) 
Signature calculation

They were able to use tidyseurat’s join_features to select the counts for the genes in the signature, followed by tidyverse mutate to easily create a column containing the signature score.

seurat_obj |>
    
    
    join_features(
        features = c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"),
        shape = "wide",
        assay = "SCT"
        
    ) |>
    
    mutate(signature_score =
                    scales::rescale(CD3D + TRDC + TRGC1 + TRGC2, to=c(0,1)) -
                    scales::rescale(CD8A + CD8B, to=c(0,1))
    ) |>
    
    select(signature_score, everything())

The gamma delta T cells could then be visualised by the signature score using Seurat’s visualisation functions.

seurat_obj |>
    
    
    join_features(
        features = c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"),
        shape = "wide",
        assay = "SCT"
        
    ) |>
    
    mutate(signature_score =
                    scales::rescale(CD3D + TRDC + TRGC1+ TRGC2, to=c(0,1)) -
                    scales::rescale(CD8A + CD8B, to=c(0,1))
    ) |>
    
    Seurat::FeaturePlot(features =  "signature_score", min.cutoff = 0) 

The cells could also be visualised using the popular and powerful ggplot package, enabling the researcher to use ggplot functions they were familiar with, and to customise the plot with great flexibility.

seurat_obj |>
    
    
    join_features(
        features = c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"),
        shape = "wide",
        assay = "SCT"
        
    ) |>
    
    mutate(signature_score =
                    scales::rescale(CD3D + TRDC + TRGC1+ TRGC2, to=c(0,1)) -
                    scales::rescale(CD8A + CD8B, to=c(0,1))
    ) |>
    
    # plot cells with high score last so they're not obscured by other cells
    arrange(signature_score) |>
    
    ggplot(aes(UMAP_1, UMAP_2, color = signature_score)) +
    geom_point(size=0.5) +
    scale_color_distiller(palette = "Spectral") +
    tidyomicsWorkshop::theme_multipanel
Filtering based on gate

The gamma delta T cells (the blue cluster on the left with high signature score) could be interactively selected from the plot using the tidygate package.

This code can only be executed interactively from an R file and not from a markdown file. So, in this casre, we need to copy and paste this into the console.

seurat_obj_gamma_delta = 
  seurat_obj |>
    
    
    join_features(
        features = c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"),
        shape = "wide",
        assay = "SCT"
        
    ) |>
    
    mutate(signature_score =
                    scales::rescale(CD3D + TRDC + TRGC1+ TRGC2, to=c(0,1)) -
                    scales::rescale(CD8A + CD8B, to=c(0,1))
    ) |>
    
    mutate(gate = tidygate::gate_int(
        UMAP_1, UMAP_2, 
        .size = 0.1, 
        .color =signature_score
    )) |> 
    
    filter(gate==1)
Filtering based on threshold

For exploratory analyses, we can select the gamma delta T cells, the red cluster on the left with high signature score. We’ll filter for cells with a signature score > 0.7.

seurat_obj_gamma_delta <-
    
    seurat_obj |>
    
    join_features(
        features = c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"), shape = "wide"
    ) |>
    
    mutate(
        signature_score =
            scales::rescale(CD3D + TRDC + TRGC1 + TRGC2, to = c(0, 1)) -
            scales::rescale(CD8A + CD8B, to = c(0, 1))
    ) |>
    
    # Proper cluster selection should be used instead (see supplementary material)
    filter(signature_score > 0.5)
Reanalysis after filtering

It was then possible to perform analyses on these gamma delta T cells by simply chaining further commands, such as below.

source("https://raw.githubusercontent.com/satijalab/seurat-wrappers/master/R/fast_mnn.R")

seurat_obj |>
    
    
    join_features(
        features = c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"),
        shape = "wide",
        assay = "SCT"
        
    ) |>
    
    mutate(signature_score =
                    scales::rescale(CD3D + TRDC + TRGC1+ TRGC2, to=c(0,1)) -
                    scales::rescale(CD8A + CD8B, to=c(0,1))
    ) |>
    
    # Proper cluster selection should be used instead (see supplementary material)
    filter(signature_score > 0.5) |> 
    
    # Reanalysis
    NormalizeData(assay="RNA") |> 
    FindVariableFeatures(nfeatures = 100, assay="RNA") |>
    SplitObject(split.by = "file") |> 
    SeuratWrappers::RunFastMNN(assay="RNA", features = 100) |> 
    RunUMAP(reduction = "mnn", dims = 1:20) |> 
    FindNeighbors(dims = 1:20, reduction = "mnn") |> 
    FindClusters(resolution = 0.3)

For comparison, we show the alternative using base R and SingleCellExperiment. Note that the code contains more redundancy and intermediate objects.

counts_positive <-
    assay(seurat_obj, "logcounts")[c("CD3D", "TRDC", "TRGC1", "TRGC2"), ] |>
    colSums() |>
    scales::rescale(to = c(0, 1))

counts_negative <-
    assay(seurat_obj, "logcounts")[c("CD8A", "CD8B"), ] |>
    colSums() |>
    scales::rescale(to = c(0, 1))

seurat_obj$signature_score <- counts_positive - counts_negative

seurat_obj_gamma_delta <- seurat_obj[, seurat_obj$signature_score > 0.7]

# Reanalysis
# ...

Interactive plotting

As a final note, it’s also possible to do complex and powerful things in a simple way, due to the integration of the tidy transcriptomics packages with the tidy universe. As one example, we can visualise the cells as a 3D plot using plotly.

The example data we’ve been using only contains a few genes, for the sake of time and size in this demonstration, but below is how you could generate the 3 dimensions needed for 3D plot with a full dataset.

single_cell_object |> 
    RunUMAP(dims = 1:30, n.components = 3L, spread = 0.5,min.dist  = 0.01, n.neighbors = 10L)

We’ll demonstrate creating a 3D plot using some data that has 3 UMAP dimensions.

seurat_obj_UMAP3 <- tidyomicsWorkshop::seurat_obj_UMAP3

# Look at the object
seurat_obj_UMAP3 |> select(contains("UMAP"), everything())

seurat_obj_UMAP3 |>
    
    tidyseurat::plot_ly(
        x = ~`UMAP_1`,
        y = ~`UMAP_2`,
        z = ~`UMAP_3`,
        color = ~cell_type,
        colors = dittoSeq::dittoColors()
    ) |> 
    
    plotly::add_markers(size = I(1))

Exercises

  1. What proportion of all cells are gamma-delta T cells?

  2. There is a cluster of cells characterised by a low RNA output (nCount_RNA). Use tidygate to identify the cell composition (cell_type) of that cluster.

Part 3 Nested analyses

When analysing single cell data is sometimes necessary to perform calculations on data subsets. For example, we might want to estimate difference in mRNA abundance between two condition for each cell type.

tidyr and purrr offer a great tool to perform iterativre analyses in a functional way.

We use tidyverse nest to group the data. The command below will create a tibble containing a column with a SummarizedExperiment object for each cell type. nest is similar to tidyverse group_by, except with nest each group is stored in a single row, and can be a complex object such as Seurat.

First let’s have a look to the cell types that constitute this dataset

seurat_obj |> 
  count(cell_type)

Let’s group the cells based on cell identity using nest

# Set idents for the differential analysis
Idents(seurat_obj) = seurat_obj[[]]$"treatment"

seurat_obj_nested = 
  seurat_obj |> 
  nest(seurat = -cell_type) 

seurat_obj_nested

Let’s see what the first element of the Surat column looks like

seurat_obj_nested |> 
  slice(1) |> 
  pull(seurat)

Now, let’s perform a differential gene-transcript abundance analysis between the two conditions for each cell type.

seurat_obj_nested = 
  seurat_obj_nested |> 
  
  # Select significant genes
  mutate(significant_genes = map(
    seurat,
    ~ .x |> 
      
      # Test
      NormalizeData(assay="RNA") |> 
      FindAllMarkers(assay="RNA") |> 
      
      # Select top genes
      filter(p_val_adj<0.05) |> 
      head(10) |> 
      rownames()
  ))

seurat_obj_nested

We can the lies the top genes with the heat map iteratively across the cell types

seurat_obj_nested = 
  seurat_obj_nested |> 
  
  # Build heatmaps
  mutate(heatmap = map2(
    seurat, significant_genes,
    ~ .x |> 
       ScaleData(assay="RNA") |> 
        DoHeatmap(.y, assay="RNA")
  )) 

seurat_obj_nested

Let’s have a look to the first heatmap

seurat_obj_nested |> 
  slice(1) |> 
  pull(heatmap)

You can do this whole analysis without saving any temporary variable using the piping functionality of tidy R programming

seurat_obj |> 
  
  # Nest
  nest(seurat = -cell_type) |> 
  
  # Select significant genes
  mutate(significant_genes = map(
    seurat,
    ~ .x |> 
      
      # Test
      NormalizeData(assay="RNA") |> 
      FindAllMarkers(assay="RNA") |> 
      
      # Select top genes
      filter(p_val_adj<0.05) |> 
      head(10) |> 
      rownames()
  )) |> 
  
  # Build heatmaps
  mutate(heatmap = map2(
    seurat, significant_genes,
    ~ .x |> 
       ScaleData(assay="RNA") |> 
        DoHeatmap(.y, assay="RNA")
  )) |> 
  
  # Extract heatmaps
  pull(heatmap)

Exercises

  1. Let’s suppose that you want to perform the analyses only for cell types that have a total number of cells bigger than 1000. For example, if a cell type has less than a sum of 1000 cells across all samples, that cell type will be dropped from the dataset.
  • Answer this question avoiding to save temporary variables, and using the function add_count to count the cells (before nesting), and then filter
  • Answer this question avoiding to save temporary variables, and using the function map_int to count the cells (after nesting), and the filter

Session Information

References

Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating Single-Cell Transcriptomic Data Across Different Conditions, Technologies, and Species.” Nature Biotechnology 36 (5): 411–20.
Pizzolato, Gabriele, Hannah Kaminski, Marie Tosolini, Don-Marc Franchini, Fréderic Pont, Fréderic Martins, Carine Valle, et al. 2019. “Single-Cell RNA Sequencing Unveils the Shared and the Distinct Cytotoxic Hallmarks of Human TCRV\(\updelta\)1 and TCRV\(\updelta\)2 \(\upgamma\)\(\updelta\) t Lymphocytes.” Proceedings of the National Academy of Sciences, May, 201818488. https://doi.org/10.1073/pnas.1818488116.
Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck III, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2019. “Comprehensive Integration of Single-Cell Data.” Cell 177 (7): 1888–1902.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686.

  1. <maria.doyle at petermac.org>↩︎

  2. <mangiola.s at wehi.edu.au>↩︎