A more detailed explanation of the algorithms in the Population Comparison platform

FlowJo’s comparison platforms support four different comparison algorithms. Two algorithms (Overton and SED) are used to calculate the percentage of positive cells found in the sample (not in the control). Two algorithms (Kolmogorov-Smirnov and Probability Binning) are used to determine the statistical difference between samples.

All of these methods begin by ‘binning’ the data: counting the number of events that fall into discrete ranges. This can be visualized in one dimension as a histogram. A bin is one such range.
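As a rough illustration of binning, the sketch below counts events into equal-width ranges. The function name `bin_events`, the scale limits, and the sample values are made up for this example and are not part of FlowJo.

```python
def bin_events(values, lo, hi, n_bins):
    """Count how many values fall into each of n_bins equal-width ranges on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if v < lo or v > hi:
            continue  # off-scale events are simply ignored in this sketch
        # A value exactly equal to hi is placed in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical one-parameter measurements on a 0-4 scale.
events = [0.5, 1.2, 1.9, 2.1, 3.7, 3.9, 4.0]
counts = bin_events(events, lo=0.0, hi=4.0, n_bins=4)
print(counts)
```

The resulting list of counts is the histogram: one number per bin.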

*% Positive Algorithms*

**Overton cumulative histogram subtraction**^{1}: The algorithm created by Roy Overton subtracts histograms on a channel-by-channel basis to provide a percent of positive cells. It is popular because it is easy to understand and works reasonably well. The process for normalization in the Overton method is to find the mode (the bin that has the most cells in it) of each tube and divide the data from that tube by that value. This puts the data on roughly the same scale (0 – 100%) while preserving features. The Overton method then subtracts the control data from the comparison tube and counts the number of events that remain per bin, labeling these “positive”.
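The normalization and subtraction steps described above can be sketched as follows. This is a simplified illustration of the description in this section, not the published cumulative algorithm or FlowJo's implementation. The function names, the example histograms, and the choice to report the remainder as a percentage of the normalized test histogram are all assumptions.

```python
def mode_normalize(counts):
    """Divide every bin by the count in the fullest bin (the mode)."""
    peak = max(counts)
    if peak == 0:
        raise ValueError("histogram is empty")
    return [c / peak for c in counts]

def percent_positive_by_subtraction(control, test):
    """Subtract the mode-normalized control from the mode-normalized test.

    Whatever remains above zero in each bin is treated as 'positive'.
    """
    if len(control) != len(test):
        raise ValueError("histograms must share the same bins")
    c = mode_normalize(control)
    t = mode_normalize(test)
    remainder = [max(ti - ci, 0.0) for ci, ti in zip(c, t)]
    return 100.0 * sum(remainder) / sum(t)

# Hypothetical histograms: the test tube has an extra population on the right.
control = [10, 80, 100, 40, 5, 0, 0, 0]
test = [5, 40, 50, 20, 10, 30, 50, 20]
pct = percent_positive_by_subtraction(control, test)
print(f"{pct:.1f}% positive")
```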

**Super-Enhanced Dmax Subtraction (SED):** A sophisticated algorithm developed by Bruce Bagwell to compute percent positives when comparing histograms, with improved normalization and population estimation compared to Overton. A detailed overview of the algorithm can be found here, though the SED method has never been published. What we’ve put in FlowJo is essentially the Enhanced Normalization Subtraction Method (ENS), which is very similar to the SED (it lacks a correction factor). The difference between ENS and Overton normalization is that the control and test histograms are normalized so that they have the same area. This protects the user from bad normalization due to an outlier bin with an extraordinarily large number of events in it, usually due to a data artifact on the edges of the scale. Under Overton's mode-based normalization, one bin with a large number of cells would cause the algorithm to normalize one of the tubes to this large number and make the scales *very* different. The ENS also estimates the probability distribution function of the positive population, and aligns it to the data using the point of maximum difference between samples. By estimating the shape of the positive population, the algorithm is less likely to create a poor fit because of noisy data. Overall, what is called the SED method in FlowJo is superior to the Overton method because it factors in some safety precautions for data artifacts. For really nice clean data, there should be almost no difference between the two methods.
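To see why equal-area normalization is less sensitive to a single outlier bin than mode-based normalization, consider the sketch below. It covers only the normalization step, not the population estimation or alignment that ENS/SED also perform. The histograms and function names are invented for illustration.

```python
def normalize_by_mode(counts):
    """Overton-style scaling: divide by the fullest bin."""
    peak = max(counts)
    return [c / peak for c in counts]

def normalize_by_area(counts):
    """Equal-area scaling: divide by the total number of events."""
    total = sum(counts)
    return [c / total for c in counts]

# The same hypothetical tube, without and with an edge pile-up in the last bin.
clean = [40, 80, 100, 110, 120, 110, 100, 80, 40, 0]
artifact = [40, 80, 100, 110, 120, 110, 100, 80, 40, 360]

peak_bin = 4  # bin holding the true biological peak
# How much the true peak is shrunk by the artifact under each scheme (1.0 = unchanged).
mode_ratio = normalize_by_mode(artifact)[peak_bin] / normalize_by_mode(clean)[peak_bin]
area_ratio = normalize_by_area(artifact)[peak_bin] / normalize_by_area(clean)[peak_bin]
print(f"mode scaling keeps {mode_ratio:.2f} of the peak, area scaling keeps {area_ratio:.2f}")
```

In this made-up case the artifact bin becomes the mode, so mode scaling squashes the real peak to a third of its height, while area scaling distorts it far less.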

*Confidence Interval Algorithms*

**Kolmogorov-Smirnov (K-S):** An algorithm commonly used to determine the confidence interval with which one can make the assertion that two univariate histograms are different. In other words, it states a confidence interval for the assertion that the two populations are NOT drawn from a common distribution. The K-S test builds the cumulative distribution of each of the two populations being compared, and looks for the maximum difference between them. For a detailed look at the algorithm, go here.
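The "maximum difference between cumulative distributions" can be sketched in a few lines. This computes only the K-S statistic D from raw event values. It is a minimal illustration under that assumption, not FlowJo's code, and the function name and sample data are hypothetical.

```python
def ks_statistic(a, b):
    """Largest vertical gap between the empirical cumulative distributions of a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every event <= x in both samples (this also handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        # i/len(a) and j/len(b) are the cumulative fractions at x.
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Hypothetical samples: the second is shifted to the right.
d = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
print(d)
```

D ranges from 0 (identical cumulative distributions) to 1 (no overlap at all).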

NOTE: This statistic is ideal for comparing differences in small populations (e.g. n=100). In flow cytometry, we compare much larger populations (e.g. tens of thousands to millions of events). With so many events involved, the probability that at least one bin shows a difference large enough to indicate, with 99% confidence, that the two samples are different is extremely high. In fact, the K-S test will erroneously report that two halves of the same population (every other cell makes up one of the halves, while the cells in between make up the other half) are distinct. This algorithm is still in FlowJo largely as a legacy.
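The effect of sample size can be made concrete with the standard large-sample approximation for the two-sample K-S critical value, D_crit = c(α) · sqrt((n + m) / (n · m)) with c(α) = sqrt(−ln(α/2) / 2). The sketch below uses that textbook approximation. The function name is hypothetical, and FlowJo may compute significance differently.

```python
import math

def ks_critical_d(n, m, alpha=0.01):
    """Approximate smallest K-S distance judged significant at level alpha."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))

small = ks_critical_d(100, 100)          # two samples of 100 events
large = ks_critical_d(100_000, 100_000)  # two samples of 100,000 events
print(f"n=100: {small:.4f}   n=100000: {large:.4f}")
```

With 100 events per sample, the cumulative distributions must differ by roughly 0.23 to reach the 99% level. With 100,000 events per sample, a gap of less than 0.01 is enough, so very small shifts between tubes are flagged as significant.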

**Chi Squared (T(X))**^{3-5}: The FlowJo Chi Squared comparison is related to the Cox Chi Squared^{6} approach, but with modified binning that minimizes the maximal expected variance, referred to as probability binning. This algorithm has been shown to detect small differences between two populations, and it does so in a quantitative way. The algorithm divides the control sample into bins of variable size so that each bin has the same number of events, which makes the bin-to-bin variability in the control sample negligible. The test sample is then divided along the same boundaries, so that the variance measured by the calculated Chi Squared value will be low for data distributed like the control and high for samples distributed differently. The Chi Squared value is converted into a metric T(X) that can be used to estimate the probability that a test population is different from a control population. The Probability Binning algorithm was designed for use with flow cytometry data.
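A minimal sketch of the binning idea follows: bin boundaries are taken from the control so that each bin holds about the same number of control events, and the test sample is then counted along those same boundaries. The statistic shown, a sum of (c − t)² / (c + t) over bin fractions, is one normalized chi-squared form and is used here as an assumption. The exact statistic and its conversion to T(X) are given in the cited papers and are not reproduced. All function names and data are hypothetical.

```python
from bisect import bisect_right

def probability_bin_edges(control, n_bins):
    """Interior boundaries that split the control into bins of (about) equal counts."""
    ordered = sorted(control)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]

def bin_fractions(values, edges):
    """Fraction of events falling in each bin defined by the given boundaries."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]

def pb_chi_square(control, test, n_bins):
    """Normalized chi-squared-style distance between control and test bin fractions."""
    edges = probability_bin_edges(control, n_bins)
    c = bin_fractions(control, edges)
    t = bin_fractions(test, edges)
    return sum((ci - ti) ** 2 / (ci + ti) for ci, ti in zip(c, t) if ci + ti > 0)

# Hypothetical data: the test sample is the control shifted to the right.
control = list(range(100))
shifted = [v + 50 for v in control]
same = pb_chi_square(control, control, n_bins=4)
diff = pb_chi_square(control, shifted, n_bins=4)
print(same, diff)
```

Heavily tied values can make the bins unequal in this sketch. A real implementation would need to handle that case.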

NOTE: K-S and T(X) will very readily label two samples as MATHEMATICALLY different. It’s up to the user to establish a threshold for BIOLOGICAL meaning. This can be done by calculating T(X) for multiple control samples to establish the normal biological variance for that particular data.
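One possible way to turn control-versus-control scores into a working threshold is sketched below. The "mean plus two standard deviations" rule, the function name, and the T(X) values are all assumptions made for illustration. The appropriate rule depends on the experiment.

```python
from statistics import mean, stdev

def biological_threshold(control_scores, n_sd=2.0):
    """Threshold above which a T(X) score exceeds the spread seen among controls."""
    return mean(control_scores) + n_sd * stdev(control_scores)

# Made-up T(X) values from comparing control samples against each other.
control_scores = [3.0, 5.0, 4.0, 6.0, 2.0]
threshold = biological_threshold(control_scores)
print(f"treat T(X) above {threshold:.2f} as potentially meaningful")
```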

References:

1) Overton WR. Modified histogram subtraction technique for analysis of flow cytometry data. Cytometry. 1988 Nov;9(6):619-26.

3) Roederer M, Treister A, Moore W, Herzenberg LA. Probability binning comparison: A metric for quantitating univariate distribution differences. Cytometry. 2001 Sep 1;45(1):37-46.

4) Roederer M, Moore W, Treister A, Hardy RR, Herzenberg LA. Probability binning comparison: a metric for quantitating multivariate distribution differences. Cytometry. 2001 Sep 1;45(1):47-55.

5) Roederer M, Hardy RR. Frequency difference gating: A multivariate method for identifying subsets that differ between samples. Cytometry. 2001 Sep 1;45(1):56-64.

6) Cox C, Reeder JE, Robinson RD, Suppes SB, Wheeless LL. Comparison of frequency distributions in flow cytometry. Cytometry. 1988 Jul;9(4):291-8.