Whether you’re aiming to apply KMeans or K Nearest Neighbors (KNN) to your data, we can cluster that!

SeqGeq contains a Classification platform available from the Discovery band within the workspace.


KMeans Clustering


To run KMeans, simply select your population or sample of interest from within SeqGeq’s workspace, and click on the Classification button within the Discovery band.

Within the resulting dialog, select the “KMeans Cluster” algorithm, name the clustering run, select the parameters on which you’d like to perform unbiased clustering, and set your “K” value. K must be a positive integer; it directly sets the number of clusters the algorithm generates:


Having run the clustering algorithm, your data will now contain a categorical parameter which identifies each of the newly created populations:


Note: The AutoCatGate plugin will return populations using this new parameter, corresponding to your clustering.

KMeans stops iterating when the clustering quality improves by 0.1% or less in an iteration. It will also stop if it reaches 500 iterations. The clustering works well when run directly on genes, but runs faster, with nearly identical results, on appropriate PCA parameters.
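To make the stopping rule above concrete, here is a minimal Python sketch of a KMeans loop. It is not SeqGeq's implementation. It assumes that "clustering quality" means the total squared distance from each point to its assigned centroid, and it seeds the centroids with the first K points for simplicity (real implementations usually seed randomly). The function name `kmeans` and its parameters are hypothetical.

```python
def kmeans(points, k, max_iter=500, tol=0.001):
    """Basic k-means on a list of equal-length numeric tuples.

    Returns (labels, centroids, iterations_run).
    """
    # Simplified seeding (assumption): use the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    prev_cost = None
    n_iter = 0
    for n_iter in range(1, max_iter + 1):  # hard cap of 500 iterations
        # Assignment step: attach each point to its nearest centroid.
        cost = 0.0
        for i, p in enumerate(points):
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = min(range(k), key=d2.__getitem__)
            cost += d2[labels[i]]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Stop once the cost improves by 0.1% or less (assumed quality measure).
        if prev_cost is not None and prev_cost - cost <= tol * prev_cost:
            break
        prev_cost = cost
    return labels, centroids, n_iter


# Two well-separated groups of made-up 2D points (e.g., two PCA parameters).
cells = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
         (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels, centroids, n_iter = kmeans(cells, k=2)
```

On this toy data the loop ends after a handful of iterations, long before the 500-iteration cap, because the cost stops improving once the assignments are stable.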


K Nearest Neighbors Classification


The KNN Classify algorithm relies on a training population (i.e. a gold standard classification) and a categorical label (i.e. a categorical parameter) within that training data. A training set can be any sample or population of similar data type to the target set. The categorical label will be used to classify data within the target population or sample.
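The general idea behind KNN classification can be sketched in a few lines of Python. This is an illustration only, not SeqGeq's code: it assumes Euclidean distance and a simple majority vote among the K nearest training cells, and the function name `knn_classify`, the labels, and the coordinates are all made up for the example.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, target_points, k=3):
    """Label each target point by majority vote of its k nearest training points."""
    predictions = []
    for t in target_points:
        # Indices of the k training points closest to this target point.
        nearest = sorted(
            range(len(train_points)),
            key=lambda i: math.dist(t, train_points[i]),
        )[:k]
        # Count the categorical labels among those neighbors.
        votes = Counter(train_labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Training population with a categorical label (hypothetical values).
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"]

# Target population to be classified.
target_points = [(0.5, 0.5), (5.5, 5.5)]
predicted = knn_classify(train_points, train_labels, target_points, k=3)
```

Each target cell receives the label that is most common among its nearest training cells, which is why the training and target data need to be measured on comparable parameters.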

Once you know your training population and categorical label, select the target population (the population to be classified) within the workspace, click on the Classification icon, and in the resulting dialog select “KNN Classify”. Then enter your new KNN classification name, and choose the parameters to use for assigning classification, the Training Population, and the Class Labels parameter:


Once run, the classification categorical parameter should identify populations similar to the training population’s categorical parameter, within the target population:


Distance Metric – If the labeled sample is comparable to your selected sample (that is, from the same technology and general experimental setup), you should choose Euclidean as the distance metric and uncheck “standardize”. If they’re not directly comparable, you should choose Angular distance and check “standardize”.
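The difference between the two metrics can be illustrated with a short Python sketch. The definitions here are assumptions, since the exact formulas SeqGeq uses are not stated: angular distance is taken to be the angle between two expression vectors, and "standardize" is taken to mean z-scoring each parameter. All function names are hypothetical.

```python
import math
import statistics


def euclidean(a, b):
    """Straight-line distance; sensitive to overall magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def angular(a, b):
    """Angle between two vectors in radians; ignores overall magnitude."""
    norm = math.hypot(*a) * math.hypot(*b)
    if norm == 0:
        raise ValueError("angular distance is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def standardize(rows):
    """Z-score each column (parameter) across all rows (cells)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [tuple((x - m) / s for x, m, s in zip(r, means, sds)) for r in rows]


# Same expression profile at two different overall scales (made-up values).
cell_a = (1.0, 2.0, 3.0)
cell_b = (10.0, 20.0, 30.0)
d_euclid = euclidean(cell_a, cell_b)  # large: the magnitudes differ
d_angle = angular(cell_a, cell_b)     # about 0: the profiles point the same way
```

In this sketch the two cells are far apart by Euclidean distance but almost identical by angular distance, which is one reason an angle-based metric can be more forgiving when samples are not on the same scale.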

Note: If your labeled sample contains representatives (e.g., centroids) of clusters (instead of actual cells), you can embed a new sample into the same clusters using KNN with K=1.
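With K=1 and one representative per cluster, KNN reduces to assigning each cell to its nearest centroid. A minimal Python sketch of that special case follows; it assumes Euclidean distance, and the function name `assign_to_centroids`, the cluster names, and the coordinates are hypothetical.

```python
import math


def assign_to_centroids(cells, centroids):
    """Return, for each cell, the name of the closest centroid.

    `centroids` maps a cluster name to that cluster's representative point.
    """
    return [
        min(centroids, key=lambda name: math.dist(cell, centroids[name]))
        for cell in cells
    ]


# One representative point per existing cluster (made-up values).
centroids = {"Cluster 1": (0.0, 0.0), "Cluster 2": (10.0, 10.0)}

# Cells from a new sample to embed into the same clusters.
new_cells = [(1.0, 1.0), (9.0, 8.0)]
assigned = assign_to_centroids(new_cells, centroids)
```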

For questions on SeqGeq or the clustering platform specifically, please reach out to: seqgeq@flowjo.com