While SeqGeq was designed with single-cell RNA sequencing data in mind, it can also be applied to bulk data analysis to great effect, and in many of the same ways. However, certain considerations apply when treating this type of data in SeqGeq.

 

Overview

In bulk sequencing, each observation is not a single cell but a bulk sample made up of many cells. This tends to reduce the sparsity of values within the expression matrix, making the data’s parameters richer and less susceptible to dropouts. It also results in fewer data points in almost all cases.
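As a rough illustration of what reduced sparsity means, the sketch below computes the fraction of zero entries in two small count matrices. The matrices and the `sparsity` helper are hypothetical toy values made up for this example; they are not taken from GSE73313 or from SeqGeq itself.

```python
def sparsity(matrix):
    """Return the fraction of zero entries in a genes-by-observations matrix."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


# Hypothetical toy counts: rows are genes, columns are observations.
single_cell_counts = [
    [0, 3, 0, 0],
    [0, 0, 1, 0],
    [5, 0, 0, 0],
]
bulk_counts = [
    [120, 98, 143, 110],
    [15, 0, 22, 18],
    [310, 280, 295, 330],
]

single_cell_sparsity = sparsity(single_cell_counts)  # 9 of 12 entries are zero
bulk_sparsity = sparsity(bulk_counts)                # 1 of 12 entries is zero
```

In this toy case the single-cell matrix is mostly zeros while the bulk matrix is nearly dense, which is the pattern described above.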

The bulk RNA-sequencing data analyzed here come from the Gene Expression Omnibus (GEO), accession GSE73313. (1)

 

Quality Control

The Quality Control platform in SeqGeq can be useful for identifying outlier observations in bulk-sequenced datasets (which sometimes require removal), and for exploring characteristics unique to this data, such as normalization:

 

 

Cell Quality

Often a line will be visible within the cell expression metrics for bulk data. This is due to the normalization techniques used in bulk sequencing. If all samples are normalized to give equal library sizes, the cell QC window will show a vertical line.
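To see why equal library sizes collapse onto a single value, here is a minimal sketch of library-size normalization. It assumes a counts-per-million style scaling; the function names, the `target` value, and the toy counts are illustrative assumptions, and the normalization actually applied to a given bulk dataset may differ.

```python
def library_sizes(matrix):
    """Sum the counts of each observation (column) in a genes-by-observations matrix."""
    return [sum(column) for column in zip(*matrix)]


def normalize_to_target(matrix, target=1_000_000):
    """Scale every observation so that its counts sum to `target`."""
    sizes = library_sizes(matrix)
    return [
        [value * target / size if size else 0.0 for value, size in zip(row, sizes)]
        for row in matrix
    ]


# Hypothetical raw counts: 3 genes (rows) by 3 samples (columns).
raw_counts = [
    [120, 60, 300],
    [30, 20, 100],
    [50, 20, 100],
]

raw_sizes = library_sizes(raw_counts)         # differs from sample to sample
normalized = normalize_to_target(raw_counts)
normalized_sizes = library_sizes(normalized)  # every sample now sums to the target
```

After scaling, every observation shares the same total, so any plot that uses library size as one axis places all observations at one position on that axis.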

In this example dataset, no such normalization has been applied. However, stark population separations are visible. As a first step, we can test whether these differences are biased by the samples that were combined:

 

Note that no such bias by sample appears in the plot above.

 

Gene QC

As in single-cell RNA-sequencing, it may be desirable to remove outlier genes and target highly dispersed genes for clustering. These parameters tend to give better separation of clusters because they are highly variable and neither dim nor bright.
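The idea of keeping genes that are highly variable but neither dim nor bright can be sketched as follows. This assumes dispersion is measured as the variance-to-mean ratio, which is one common definition and not necessarily the one SeqGeq uses; the gene names, the mean thresholds, and the `select_genes` helper are hypothetical.

```python
from statistics import mean, pvariance


def dispersion(values):
    """Variance-to-mean ratio of one gene; 0.0 when the mean is zero."""
    average = mean(values)
    return pvariance(values) / average if average > 0 else 0.0


def select_genes(expression, n, min_mean, max_mean):
    """Drop dim and bright genes by mean, then keep the n most dispersed of the rest."""
    candidates = [
        gene for gene, values in expression.items()
        if min_mean <= mean(values) <= max_mean
    ]
    ranked = sorted(candidates, key=lambda gene: dispersion(expression[gene]), reverse=True)
    return ranked[:n]


# Hypothetical expression values per gene across four observations.
expression = {
    "GENE_A": [10, 10, 10, 10],            # flat: no dispersion
    "GENE_B": [0, 40, 0, 40],              # highly variable
    "GENE_C": [8, 12, 9, 11],              # mildly variable
    "GENE_D": [5000, 9000, 4000, 10000],   # bright outlier
    "GENE_E": [0, 0, 1, 0],                # dim outlier
}

selected = select_genes(expression, n=2, min_mean=1, max_mean=1000)
```

Here the bright and dim genes are excluded before ranking, so the selection is driven by variability among moderately expressed genes.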

 

Dimensionality Reduction

Principal Component Analysis (PCA) can be a powerful tool for mapping global differences in observations from bulk sequencing data:

 

Here the islands of observations clearly cluster by Sample to a large degree, though some samples may still group into broader clusters.

tSNE is typically less useful when few data points are available, and so won’t be applicable to the majority of bulk sequencing data files:

 

The islands in tSNE do tend to comprise similar samples, but again, some global clustering is also evident.

 

Clustering

K-Means clustering is available in the Clustering Platform in SeqGeq. This type of clustering is fairly unbiased, except that it relies on the researcher choosing the expected number of clusters, ‘k’. Here are clustering results relative to PCA and tSNE mappings for k=4 and k=7 respectively:

 

 
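For readers unfamiliar with how K-Means depends on the chosen ‘k’, the following is a minimal sketch of the standard Lloyd iteration in plain Python. It is not SeqGeq's implementation. To keep the result reproducible it uses a deterministic farthest-point initialization, which is a simplification chosen for this example; the toy 2D coordinates stand in for PCA or tSNE positions and are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Start from the first point, then repeatedly add the point farthest from all centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(farthest)
    return centroids


def k_means(points, k, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2D coordinates for six observations forming two groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = k_means(points, k=2)
```

The algorithm always returns exactly k clusters, whether or not the data contain that many groups, which is why the choice of k remains the researcher's responsibility.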

Differential Expression Analysis

Statistics used for differential expression in single-cell sequencing won’t behave the same way with bulk data, because few data points are available on which to set confidence levels. This means that the adjusted p-values from the Volcano Plots in SeqGeq may not be useful for bulk sequencing datasets.

In the example data there are enough observations available to create some differentially expressed gene sets:

 

 

Even when the p-value isn’t useful, the Fold Change parameter can be used to draw some conclusions in population comparisons:

 
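As a sketch of how fold change can rank genes without relying on a p-value, the example below computes a log2 fold change between two populations. It assumes fold change is the ratio of group means with a pseudocount added to avoid division by zero; SeqGeq's exact definition may differ, and the gene names, values, and threshold are hypothetical.

```python
from math import log2
from statistics import mean


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 of (mean of group_a + pseudocount) over (mean of group_b + pseudocount)."""
    return log2((mean(group_a) + pseudocount) / (mean(group_b) + pseudocount))


# Hypothetical expression values: gene -> (population A, population B).
expression = {
    "GENE_UP": ([31, 33, 29], [7, 6, 8]),
    "GENE_DOWN": ([3, 3, 3], [15, 14, 16]),
    "GENE_FLAT": ([10, 9, 11], [10, 11, 9]),
}

fold_changes = {
    gene: log2_fold_change(a, b) for gene, (a, b) in expression.items()
}

# Keep genes whose expression at least doubles or halves (|log2 fold change| >= 1).
candidates = [gene for gene, value in fold_changes.items() if abs(value) >= 1.0]
```

Genes selected this way are candidates for follow-up rather than statistically confirmed differences, since the calculation ignores variability within each population.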

In addition, or alternatively, the Pivot Window can be used to generate hypotheses about over- or under-expressed features in bulk data:

 

References:

  1. West EE, Spolski R, Kazemian M, Yu ZX et al. A TSLP-complement axis mediates neutrophil killing of methicillin-resistant Staphylococcus aureus. Sci Immunol 2016 Nov 18;1(5). PMID: 28783679

 

If you have more questions regarding bulk sequencing, or any other data in SeqGeq, please get in touch: seqgeq@flowjo.com