NSforest: A Machine Learning Method to Identify Marker Genes from Single Cell/Single Nuclei RNA Sequencing Data

Cells are the fundamental functional units of multicellular organisms, and different cell types play distinct physiological roles in the body. The recent advent of single-cell transcriptional profiling by RNA sequencing is producing "big data," enabling the identification of novel human cell types at an unprecedented rate.

NSforest is a random forest machine learning method for identifying sets of necessary and sufficient marker genes. These marker sets can be used in quantitative PCR and multiplex FISH assays, and to assemble consistent and reproducible cell type definitions for incorporation into the Cell Ontology (CL). Representing defined cell type classes and their relationships in the CL using this strategy will make the cell type classes findable, accessible, interoperable, and reusable (FAIR), allowing the CL to serve as a reference knowledgebase of information about the role that distinct cellular phenotypes play in human health and disease.
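The idea of ranking candidate marker genes with a random forest can be illustrated with a small, self-contained sketch. This is not the NSforest implementation. It builds a forest of depth-one trees (stumps) on a hypothetical toy expression matrix and ranks genes by the Gini impurity decrease they contribute. The function names (`gini`, `best_stump`, `rank_genes`), the parameters, and the data are all assumptions made for illustration.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_stump(rows, labels, genes):
    """Pick the gene whose expressed/not-expressed split most reduces impurity."""
    parent = gini(labels)
    best_gene, best_gain = None, 0.0
    for g in genes:
        on = [y for x, y in zip(rows, labels) if x[g] > 0]
        off = [y for x, y in zip(rows, labels) if x[g] <= 0]
        child = (len(on) * gini(on) + len(off) * gini(off)) / len(labels)
        if parent - child > best_gain:
            best_gene, best_gain = g, parent - child
    return best_gene, best_gain

def rank_genes(rows, labels, n_trees=200, n_try=2, seed=0):
    """Rank genes by impurity decrease summed over a forest of stumps."""
    rng = random.Random(seed)
    n_genes = len(rows[0])
    importance = [0.0] * n_genes
    for _ in range(n_trees):
        # Each tree sees a bootstrap sample of cells and a random gene subset.
        idx = [rng.randrange(len(rows)) for _ in rows]
        genes = rng.sample(range(n_genes), n_try)
        gene, gain = best_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], genes)
        if gene is not None:
            importance[gene] += gain
    ranking = sorted(range(n_genes), key=lambda g: importance[g], reverse=True)
    return ranking, importance

# Hypothetical toy data: 10 target-cluster cells followed by 20 other cells.
# Columns: gene 0 is on only in the target cluster; gene 1 is on everywhere;
# gene 2 is on in half of every group; gene 3 is on in all target cells
# and in half of the other cells.
rows = [[5, 3, i % 2, 2] for i in range(10)]
rows += [[0, 3, i % 2, 2 * (i % 4 < 2)] for i in range(20)]
labels = [1] * 10 + [0] * 20

ranking, importance = rank_genes(rows, labels)
print("genes ranked by importance:", ranking)
```

In this toy setting gene 0 should rank first, because it alone separates the target cluster perfectly. Gene 3 is expressed in every target cell but also in other cells, so it behaves like a marker that is necessary but not sufficient and accumulates far less importance.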


BMC Bioinformatics. 2017-12-21; 18(Suppl 17): 559.
Cell type discovery and representation in the era of high-content single cell phenotyping
Bakken T, Cowell L, Aevermann BD, Novotny M, Hodge R, Miller JA, Lee A, Chang I, McCorrison J, Pulendran B, Qian Y, Schork NJ, Lasken RS, Lein ES, Scheuermann RH
PMID: 29322913


This work is funded by the Chan Zuckerberg Initiative DAF under grant no. 2018-182730.

Principal Investigator

Key Staff

  • Brian Aevermann, MS
  • Mark Novotny


Trygve Bakken, Jeremy A. Miller, and Ed Lein
Allen Institute for Brain Science

Alexander D. Diehl
University at Buffalo

David Osumi-Sutherland
European Bioinformatics Institute
