

There is a need to better understand and handle the “dark matter” of proteomics: the vast diversity of post-translational and chemical modifications that are unaccounted for in a typical analysis and thus remain unidentified. We present a novel fragment-ion indexing method, and its implementation in the peptide identification tool MSFragger, that enables a more than 100-fold improvement in speed over most existing tools. Using some of the largest proteomic datasets to date, we demonstrate how MSFragger empowers the open database search concept for comprehensive identification of peptides and all their modified forms, uncovering dramatic differences in modification rates across experimental samples and conditions. We further illustrate its utility using protein-RNA crosslinked peptide data and affinity purification experiments, where we observe on average a 300% increase in the number of identified spectra for enriched proteins. We also discuss the benefits of open searching for improved false discovery rate estimation in proteomics.

Peptide identification algorithms have served as a cornerstone of shotgun proteomics for several decades [1]. The most commonly used computational strategy searches acquired tandem mass (MS/MS) spectra against a protein sequence database using database search algorithms [2]. However, even given significant improvements in the quality of MS/MS data acquired on modern mass spectrometers, a very significant fraction of spectra remains unexplained. We and others have been fascinated by the underlying complexity of the “dark matter” in shotgun proteomics [3], including the vast diversity of post-translational modifications (PTMs) as well as novel sequences (e.g.
