Gene sets are a vital part of any sequencing analysis. They are collections of parameters (genes) that can represent a wide variety of biological states. Interpreting the meaning of a given gene set within the context of a data set or experiment can be the most challenging aspect of an analysis.

Gene set enrichment is a process for checking the match between a gene set derived from your data and a library of well-annotated gene sets (known as a gene set library).

Curation

Gene sets should be specific to the context of the investigation being performed. Poorly curated gene set libraries cannot accurately support the researcher’s hypothesis or null hypothesis. Similarly, test gene sets derived from noisy data or improperly defined clustering can yield unreliable results.

It is the responsibility of the researcher to perform proper due diligence when ascribing biological meaning to comparisons made. Please make every effort to ensure proper care is taken when investigating gene sets from literature to employ in a given analysis. Below are some general tips for attaining quality in gene sets:

Gene set libraries must contain similar functions or categories aimed at answering similar biological questions in order to make sensible predictions. For example, researchers would likely want to include one or more T-cell gene sets within a PBMC phenotype collection, but not a G2 phase cell cycle gene set.
Deep hierarchies of gene sets are known to introduce bias under typical kinds of enrichment analysis.
Similarly, gene sets within a given library should contain roughly comparable numbers of parameters.
Gene set libraries should contain parameters derived from the same gene model, and match the gene model being tested as closely as possible.
Test gene sets should derive from sensible analysis – for example, differentially expressed gene sets are generated by population comparisons, and gene set enrichment based on those comparisons will only ever be as relevant as the comparison on which the enrichment was based.
Enrichment tests can only report on the annotations contained in the gene set library – gaps in the gene set library will manifest as gaps in your enrichment test results, and the next-best result might be misleading.

Note: SeqGeq does not perform synonym matching for gene names, thus the annotations between model, library, and data analyzed should all match.
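Because names are compared literally, it can help to check a library against the gene model before running a test. The following is a minimal sketch of such a check, assuming exact, case-sensitive string matching; the function name and the example data are hypothetical and are not part of SeqGeq.

```python
def unmatched_genes(gene_set, gene_model):
    """Return genes in gene_set that have no exact match in gene_model."""
    model = set(gene_model)
    return sorted(g for g in gene_set if g not in model)


# Hypothetical data: the model uses the current symbol SELENOF,
# while the library also lists its older alias SEP15.
gene_model = ["CD3E", "CD4", "CD8A", "SELENOF"]
library = {
    "T cell": ["CD3E", "CD4", "CD8A"],
    "Selenoproteins": ["SEP15", "SELENOF"],
}

report = {name: unmatched_genes(genes, gene_model) for name, genes in library.items()}
for name, missing in report.items():
    print(name, "->", missing or "all genes found")
```

Any gene reported as missing would need to be renamed in the library (or the model) by hand, since no synonym lookup is performed.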

Any hypothesis supported by Enrichment Analysis should be carefully validated in further studies.

Models

Gene models are, essentially, a set of all genes for a given species. There are important differences between groups curating/computing these models, such that some are more conservative and others more inclusive. Note: Currently models must live in the sample space within SeqGeq, in order to be utilized by the enrichment platform.

Common Gene Models:

RefSeq – Conservative gene model from the NCBI (US gov’t agency), including many species
- Genes: NC are complete genomic molecules (such as chromosomes) from the reference assembly, AC are alternate assemblies [see RefSeq Accessions]
- Transcripts: NM are messenger RNAs, NR are other RNAs
- Proteins: NP are well-characterized proteins
Entrez – Conservative gene model from the NCBI, including many species
- Genes: integer id
- Transcripts: none
- Proteins: none
Ensembl – Inclusive gene model from EMBL-EBI (a European intergovernmental organization), also for many species
- Genes: ENSG are human genes, ENSMUSG are mouse genes
- Transcripts: ENST are human transcripts
- Proteins: ENSP are human proteins.
UCSC – Moderate gene model from academia, closely associated with the early reference genome project and available for human only. This model is largely derived from RefSeq.
- Genes: Identified solely by gene symbols.
- Transcripts: uc are transcripts.
- Proteins: Modeled by link-out to UniProt and RefSeq.
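The prefixes listed above can be used to spot mixed identifier types in a gene list. This is a rough sketch only: the patterns cover the prefixes named here plus the mouse transcript and protein prefixes (ENSMUST, ENSMUSP), which are an addition, and real identifiers have more variants. The function name and the example identifiers are illustrative.

```python
import re
from collections import Counter

# (label, pattern) pairs, checked in order. Optional ".N" suffixes are versions.
PATTERNS = [
    ("Ensembl gene", re.compile(r"^ENS(MUS)?G\d+(\.\d+)?$")),
    ("Ensembl transcript", re.compile(r"^ENS(MUS)?T\d+(\.\d+)?$")),
    ("Ensembl protein", re.compile(r"^ENS(MUS)?P\d+(\.\d+)?$")),
    ("RefSeq transcript", re.compile(r"^N[MR]_\d+(\.\d+)?$")),
    ("RefSeq protein", re.compile(r"^NP_\d+(\.\d+)?$")),
    ("Entrez gene", re.compile(r"^\d+$")),
    ("UCSC transcript", re.compile(r"^uc\d{3}[a-z]{3}(\.\d+)?$")),
]


def classify_identifier(identifier):
    """Guess the identifier type from its prefix; anything else is treated as a symbol."""
    for label, pattern in PATTERNS:
        if pattern.match(identifier):
            return label
    return "symbol or unknown"


# Illustrative identifiers in several different styles.
ids = ["ENSG00000141510", "NM_000546.6", "7157", "TP53", "uc001aaa.3"]
counts = Counter(classify_identifier(i) for i in ids)
if len(counts) > 1:
    print("Mixed identifier types:", dict(counts))
```

A list that yields more than one type is a sign that translation to a single model is needed before enrichment.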

When possible, researchers should avoid mixing gene identifiers and transcript identifiers in their data upstream. Mixing gene sets from different models requires translating between them, or standardizing on a single model universally.

If users import data from an upstream analysis of their choosing that contains malformed, ambiguous, or unrecognizable gene symbols, common models won’t map to these annotations. Note: Certain gene symbols also have a tendency to get mangled (especially by Excel), such as “SEP15”, which is almost invariably converted to a date. Close attention by the user is required to catch such errors prior to analysis in SeqGeq.
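One way to catch date-mangled symbols before import is to scan the gene column for values that look like dates. The sketch below assumes a few common date renderings (such as "15-Sep", "Sep-15", "2015-09-01", or "9/15/2015"); the exact form depends on spreadsheet and locale settings, so the pattern is a starting point and not an exhaustive check.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Matches day-month, month-year, ISO dates, and slash-separated dates.
DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}-({MONTHS})|({MONTHS})-\d{{2}}|\d{{4}}-\d{{2}}-\d{{2}}|\d{{1,2}}/\d{{1,2}}/\d{{2,4}})$",
    re.IGNORECASE,
)


def find_date_like(symbols):
    """Return the entries that look like dates and not like gene symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


# Hypothetical gene column after a round trip through a spreadsheet.
column = ["CD4", "15-Sep", "1-Mar", "SEP15", "2015-09-01"]
suspects = find_date_like(column)
print(suspects)
```

Flagged entries still have to be restored by hand, since the original symbol cannot always be recovered from the date.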

Classifiers

Many publicly available databases contain a myriad of gene set libraries. These are often provided in Gene Matrix Transposed (GMT) format. SeqGeq is compatible with classifiers in GMT, CSV, and TSV formats. The following is a list of popular gene set library sources available to researchers, provided by third-party distributors (not maintained or specifically endorsed by FlowJo):

MSigDB

EMGenesets

Note: If you’ve identified gene sets you’d like to utilize for classification purposes, SeqGeq can create GMT files from a collection within the Gene Sets pane of the Workspace. To do so, first create a collection of gene sets to combine, then right-click on the collection and choose “Export as GMT”.
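For reference, a GMT file is plain tab-delimited text with one gene set per line: the set name, a description, and then the member genes. The following sketch reads and writes that layout outside of SeqGeq; the function names and the example sets are hypothetical.

```python
import io


def write_gmt(gene_sets, handle, description="NA"):
    """Write {name: [genes]} as GMT lines: name, description, then genes."""
    for name, genes in gene_sets.items():
        handle.write("\t".join([name, description, *genes]) + "\n")


def read_gmt(handle):
    """Parse GMT text into {name: [genes]}, ignoring the description column."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and sets with no genes
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


# Round trip through an in-memory file with two hypothetical gene sets.
library = {"T cell": ["CD3E", "CD4", "CD8A"], "B cell": ["CD19", "MS4A1"]}
buffer = io.StringIO()
write_gmt(library, buffer)
buffer.seek(0)
restored = read_gmt(buffer)
print(restored == library)
```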

Fisher’s Exact Test

SeqGeq’s Gene Set Enrichment platform utilizes Fisher’s Exact Test (1) to determine whether a set of genes discovered (the “test”) overrepresents a function defined by a library of known gene sets (the “controls”). In other words, it asks whether the overlap between the test set and each control set is larger than would be expected by chance, given the gene model. SeqGeq uses a one-sided test to check for enrichment only, not depletion.
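To make the one-sided test concrete, the sketch below computes the enrichment p-value as the upper tail of the hypergeometric distribution, which is what a one-sided Fisher's exact test on the 2x2 table (in test or not, in control set or not) reduces to. This is an illustration under stated assumptions and not SeqGeq's own code; the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(overlap, test_size, set_size, model_size):
    """P(X >= overlap) when test_size genes are drawn from model_size genes,
    of which set_size belong to the control gene set."""
    upper = min(test_size, set_size)
    total = comb(model_size, test_size)
    tail = sum(
        comb(set_size, i) * comb(model_size - set_size, test_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: a 100-gene test set shares 10 genes with a
# 200-gene control set, in a gene model of 20,000 genes.
p = enrichment_p_value(overlap=10, test_size=100, set_size=200, model_size=20_000)
print(f"one-sided p-value: {p:.3g}")
```

Only the upper tail is summed, so a smaller-than-expected overlap (depletion) yields a p-value near 1 and is never reported as significant.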

Running Gene Set Enrichment

To run enrichment tests in SeqGeq, right-click on the test gene set of interest and select “Enrichment Test”.

Within the resulting dialog, select the gene model used. It is sufficient to use the sample itself as its own gene model if the sample contains all of the genes assayed/quantitated. Otherwise, select another gene model (or another sample) that contains the correct set of all genes. Using a gene model that is too large or too small will affect the correctness of the enrichment analysis. If you have multiple samples, using a common gene model (if appropriate) will improve performance.
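The effect of gene model size can be seen by evaluating the same overlap against models of different sizes. The sketch below uses a hypergeometric upper tail as a stand-in for the one-sided test; all numbers are hypothetical and the function name is an assumption.

```python
from math import comb


def one_sided_p(overlap, test_size, set_size, model_size):
    """Upper-tail hypergeometric probability of seeing at least `overlap` shared genes."""
    total = comb(model_size, test_size)
    return sum(
        comb(set_size, i) * comb(model_size - set_size, test_size - i)
        for i in range(overlap, min(test_size, set_size) + 1)
    ) / total


# Same test set (50 genes), control set (100 genes), and overlap (5 genes),
# evaluated against three candidate gene model sizes.
p_by_model = {n: one_sided_p(5, 50, 100, n) for n in (2_000, 20_000, 60_000)}
for n, p in p_by_model.items():
    print(f"model size {n:>6}: p = {p:.3g}")
```

The larger the model, the more surprising the same overlap appears, so an oversized model inflates significance and an undersized one hides it.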

Within the same dialog, select the gene set library to compare with, and a sensible p-value threshold to filter comparisons for display.

This will generate a table of comparator gene sets falling below the p-value threshold, along with their adjusted p-values, the number of genes in common between the test and comparator, and the number of genes in each comparator gene set.
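The adjustment method SeqGeq applies is not stated here. As an illustration only, the sketch below uses the Benjamini-Hochberg procedure, one common choice, and assumes the threshold is applied to the adjusted values; the function name and the result rows are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical rows: (comparator, raw p-value, genes in common, comparator size).
results = [
    ("T cell", 0.001, 12, 150),
    ("B cell", 0.02, 6, 120),
    ("NK cell", 0.03, 5, 90),
    ("Monocyte", 0.5, 1, 200),
]
adjusted = benjamini_hochberg([row[1] for row in results])
threshold = 0.05
table = [
    (name, adj, common, size)
    for (name, _p, common, size), adj in zip(results, adjusted)
    if adj <= threshold
]
for row in table:
    print(row)
```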

Note: The adjusted p-values reported for Fisher’s exact test are generally not very stringent. Any conclusions drawn from enrichment tests should be validated by other means (e.g., experimentally).

References:

1. Routledge R. Fisher’s exact test. Encyclopedia of Biostatistics. 2005.

For any questions regarding the Gene Set Enrichment platform in SeqGeq, please contact tech support: seqgeq@flowjo.com