Thank you for your interest in SeqGeq™, one of the latest software suites from the team at FlowJo, LLC! SeqGeq is designed to make analysis of single cell RNA-Sequencing (scRNA-Seq) data as intuitive and enjoyable as it is for flow data in FlowJo®. The first step in analyzing your scRNA-Seq experiment in this platform is simply to load your sample(s).
In most cases this is possible simply by dragging and dropping your expression matrix of interest into an open SeqGeq workspace (currently .csv, .txt, .tsv, .mtx, .tab, and .h5 formats are supported). You can also browse for data to load directly from the application by clicking the “Add Samples” button within the Analyze tab at the top of the workspace:
Provided the information is correctly organized within the file, loading the sample(s) is as simple as that! SeqGeq has been developed with robust cross-compatibility in mind. In other words, SeqGeq reads and analyzes data from most scRNA-Seq data collection pipelines and allows you to focus on the important work: your analysis.
Supported File Types:
- .csv – Comma-separated values, the most common type of data matrix we encounter.
- .txt – Unicode 8 text, comma separated, preferably without quotes.
- .tsv – Tab separated values.
- .mtx – Matrix standard file format, usually accompanied by genes.tsv and barcodes.tsv files. Note: Moving these files will currently cause SeqGeq to lose the reference to them within saved GeqZip analyses.
- .tab – Another type of tab-separated file format (MapInfo-formatted data file); must contain ASCII characters only.
- .h5 – HDF5, commonly output from 10x Genomics platform.
Files with non-standard characters in their file name cannot be loaded into SeqGeq. In particular the ‘&’ character must be replaced.
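As a minimal sketch, a file name can be cleaned in R before the file is loaded. The helper name `sanitize_file_name` is hypothetical, and the set of characters treated as "standard" here (letters, digits, dot, underscore, hyphen) is an assumption; only the ‘&’ character is named explicitly above.

```r
# Hypothetical helper: make a file name safer for import.
# Assumption: "standard" means letters, digits, dot, underscore, and hyphen.
sanitize_file_name <- function(file_name) {
  # Replace the ampersand first, since it is the character called out explicitly
  cleaned <- gsub("&", "_and_", file_name, fixed = TRUE)
  # Replace anything else outside the assumed safe set with an underscore
  gsub("[^A-Za-z0-9._-]", "_", cleaned)
}

sanitized <- sanitize_file_name("treated&control run1.csv")
print(sanitized)
```

The cleaned name could then be applied with `file.rename()` before dragging the file into the workspace.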
File Import Summary
Data files should have one dedicated row and one dedicated column for gene names and cell IDs (either orientation is accepted). SeqGeq requires data matrices to be rectangular: ragged matrices, i.e. data files with rows or columns of uneven length, cannot be loaded.
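One way to check for a ragged matrix before import is to count the fields on every line of the file. The sketch below uses base R's `count.fields()`; the helper name `is_rectangular` and the toy file contents are illustrative assumptions, not part of SeqGeq.

```r
# Hypothetical helper: report whether a delimited text file is rectangular.
is_rectangular <- function(path, sep = ",") {
  # count.fields() returns the number of fields found on each line
  n_fields <- count.fields(path, sep = sep, quote = "", comment.char = "")
  # Any line whose field count differs from the first line is ragged
  ragged <- which(n_fields != n_fields[1])
  list(rectangular = length(ragged) == 0, ragged_lines = ragged)
}

# Toy file (invented values) whose third line is one field short
demo <- tempfile(fileext = ".csv")
writeLines(c("Gene,Cell1,Cell2", "ACTB,5,3", "GAPDH,7"), demo)

result <- is_rectangular(demo)
print(result)
```

For tab-separated files, pass `sep = "\t"`. The reported line numbers point to the rows that need to be padded or repaired.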
Advanced Data Import
In some cases special steps may be required in order to import data into SeqGeq. The following information is intended to make these steps and their underlying logic clear.
SeqGeq uses some basic logic to infer the size (# of rows and columns) of data within an expression matrix, the end of header information, and the orientation of the matrix (parameters in rows or columns). The software is even smart enough to discover the row or column in which gene names are present! Two parameters are critical: unique event identifier (i.e., cellID) and reads (i.e., genes).
“CellId” – This information tells SeqGeq, for a given expression matrix, the unique identifier of each event (usually one cell). SeqGeq will recognize either Well ID strings or DNA barcodes, whether embedded in an initial column or in the column headers, to make this inference.
Genes – SeqGeq can infer genes based on some commonly known sets, such as UniProt and GenBank. The following standard databases are also used to identify genes.
- Human: HGNC/HUGO, RefSeq, Ensembl, UCSC.
- Mouse: MGI
- Rat: RGD
- Saccharomyces cerevisiae: SGD
- Drosophila melanogaster: FlyBase
- Danio rerio: ZFIN
- C. elegans: WormBase
Simple Parameters in Rows (with annotations for clarity):
Real World Data Matrices:
Tips and Tricks
In some cases difficult or non-standard data may require some extra massaging. In these cases we offer some general troubleshooting options below.
Often it is useful to separate events in an expression matrix based on a known piece of information such as subject, treatment, or time point. Such a grouping variable is often referred to as a “Categorical Parameter”.
In this case it is useful to include a parameter within the expression matrix alongside the other parameters, for use in downstream analyses. Researchers can add this information manually via R or directly in spreadsheet software.
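A minimal sketch of adding such a parameter in R is shown here. It assumes a matrix with genes in rows and cells in columns, and it encodes the groups as integers; the numeric encoding, the parameter name "Treatment", and all values are assumptions made for illustration.

```r
# Toy expression matrix: genes in rows, cells in columns (all values invented)
expr <- data.frame(
  Cell1 = c(5, 0),
  Cell2 = c(3, 2),
  Cell3 = c(0, 7),
  row.names = c("ACTB", "GAPDH")
)

# Known grouping for each cell, keyed by cell ID
treatment <- c(Cell1 = "control", Cell2 = "control", Cell3 = "treated")

# Encode the groups as integers, in the same order as the matrix columns
# (assumption: numeric codes are used for the categorical parameter)
codes <- as.integer(factor(treatment[colnames(expr)],
                           levels = c("control", "treated")))

# Append the categorical parameter as an extra row alongside the genes
expr_annotated <- rbind(expr, codes)
rownames(expr_annotated)[nrow(expr_annotated)] <- "Treatment"

# Write the annotated matrix out for import
out_file <- file.path(tempdir(), "expr_with_treatment.csv")
write.csv(expr_annotated, file = out_file, quote = FALSE)
print(expr_annotated)
```

For a matrix with cells in rows, the same idea applies with an added column instead of an added row.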
SeqGeq will also conveniently add parameterized keywords (one kind of categorical parameter) to concatenated sample files. By default a “SampleID” parameter is created, which separates samples numerically in the order they were displayed within the SeqGeq workspace when concatenating.
Considerations for Excel:
There are some quirks to loading typical expression matrices into Excel. Users should be particularly careful about one of them: certain gene names are automatically interpreted as dates and reformatted accordingly (e.g. the MAR6 gene can be changed to March-6).
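A quick way to spot this problem after a file has passed through Excel is to scan the gene names for date-like strings. The sketch below is only a heuristic: the exact text Excel produces depends on locale and cell formatting, so the pattern and the helper name `looks_like_date` are assumptions.

```r
# English month abbreviations, optionally followed by more letters (e.g. "March")
months <- "(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"

# Match "6-Mar" style or "March-6" style strings (assumed Excel output formats)
date_like <- paste0("^([0-9]{1,2}-", months, "|", months, "-[0-9]{1,4})$")

# Hypothetical helper: TRUE for names that look like converted dates
looks_like_date <- function(gene_names) {
  grepl(date_like, gene_names, ignore.case = TRUE)
}

genes <- c("ACTB", "6-Mar", "SEPT2", "2-Sep", "March-6")
flagged <- genes[looks_like_date(genes)]
print(flagged)
```

Any flagged names need to be restored from the original file, since the conversion cannot be reliably reversed from the date text alone.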
HDF5 (.h5) Support:
SeqGeq supports direct import of this commonly used file type, which is generated by the 10x Genomics pipelines.
For users interested in converting a “filtered_gene_bc_matrices.h5” data file, exported from the Cell Ranger software, to CSV, this can be accomplished in R with the following simple script:
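# Load the Cell Ranger R kit, which provides the matrix loader used below
library(cellrangerRkit)
# Fill in your reference genome and the directory that contains the "outs" folder
genome <- "YOUR_REF_GENOME_HERE"
pipestance_path <- "YOUR_OUTS_DIRECTORY_HERE"
# Read the filtered gene-barcode matrix from the .h5 file
gene_bc_matrix <- load_cellranger_matrix_h5(pipestance_path, genome=genome)
# Convert the sparse matrix to a dense data frame (genes in rows, barcodes in columns)
dataFile = as.data.frame(as.matrix(exprs(gene_bc_matrix)))
# Label the rows with unique gene symbols
fdatas <- fData(gene_bc_matrix)
rownames(dataFile) = make.names(fdatas$symbol, unique=TRUE)
# Write the dense matrix to CSV without quoting
write.csv(dataFile, file="h5_Dense_10x_Symbol.csv", quote = FALSE)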
This will export a CSV of your expression matrix to R's current working directory (by default, often your home folder). Note: You will need to input your particular reference genome and the directory which contains your “outs” folder into this script.
SeqGeq provides advanced researchers who are struggling to import difficult data with a powerful set of meta-information tags (similar to keywords in FACS data) to force their files into submission. These pieces of information and their possible values are described below.
These metadata pairs should be placed in a header that begins with the “[Metadata]” annotation. The tags are inserted below that annotation and above a second indicator, “[Data]”, which directly precedes the expression information.
$DataType should be “Expression”.
$Organization may be: “ParamsInRows”, “ParamsInCols”, or “Triplet”.
$ParamTypes specifies how to read the data in each row (for ParamsInRows) or in each column (for ParamsInCols). There is one letter for each row or column, and the last letter applies to all remaining rows or columns (so the string usually ends in c or f, indicating that all following rows/columns contain count data or floating point data, respectively).
| Letter | Type | Meaning |
|---|---|---|
| I | Identifier | row/col contains Event Ids |
| A | other Annotation | ignore (e.g., descriptive text) |
| s | string | Parameter contains string values |
| c | count | Parameter contains count values |
| f | float/double | Parameter contains float values |
$EventTypes specifies how to read the data in each column (for ParamsInRows) or in each row (for ParamsInCols). There is one letter for each column or row, and the last letter applies to all remaining columns or rows (so the string usually ends in v, indicating that all following columns/rows contain values).
| Letter | Type | Meaning |
|---|---|---|
| I | Identifier | row/col contains Names/Ids of Parameters |
| N | Name | row/col contains additional Names of Parameters |
| A | other Annotation | row/col contains Annotations on Parameters |
| v | Values | row/col contains values |
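The "last letter applies to the rest" rule used by both $ParamTypes and $EventTypes can be illustrated with a short R sketch. The function `expand_types` is a hypothetical helper written for this explanation, not something SeqGeq provides, and the strings "IAc" and "INv" are only example values.

```r
# Hypothetical helper: expand a type string to one letter per row or column,
# repeating the final letter for every remaining position.
expand_types <- function(type_string, n) {
  letters_given <- strsplit(type_string, "", fixed = TRUE)[[1]]
  k <- length(letters_given)
  if (k == 0) stop("type string must contain at least one letter")
  # If the string is at least as long as the matrix side, just truncate it
  if (n <= k) return(letters_given[seq_len(n)])
  # Otherwise pad with copies of the last letter
  c(letters_given, rep(letters_given[k], n - k))
}

# "IAc" over 6 rows: identifier, annotation, then counts for the rest
param_types <- expand_types("IAc", 6)
# "INv" over 5 columns: identifier, extra name, then values for the rest
event_types <- expand_types("INv", 5)
print(param_types)
print(event_types)
```

Writing the strings out this way can help confirm that a short type string describes every row and column of a matrix as intended.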
$FirstRow specifies which line is the first line of the data section (that is, the first header row of the matrix or triplet file). The very first line is line 0 (NOT 1), so you can think of this as “how many lines to skip when reading the data”.
| Gene | Cell 1 | Cell 2 |
For the best, most up-to-date troubleshooting assistance, you can always reach out to our technical support team. We will be more than happy to lend a hand with everything from importing your data to applying the most cutting-edge analysis methods within SeqGeq: email@example.com