The Importance of False Discovery Rate Control in Metabolomics

We present an approach that broadens the use of FDR estimation by providing an automated, false-discovery-rate-controlled analysis for data-independent acquisition in metabolomics.

Metabolomics, the study of small molecules, can provide direct insight into the by-products of hundreds of thousands of chemical reactions. The resulting high-throughput data require a thorough analysis workflow composed of several steps for the identification, quantification, and elucidation of small molecules and their interactions in biological systems.

Advances in modern mass spectrometry (MS) instrumentation, such as data-independent acquisition (DIA)-MS, enable the systematic, unbiased sampling of MS/MS fragmentation spectra for small molecules (Gillet et al., 2012). In DIA, the entire range of ionized precursors is fragmented simultaneously, yielding a highly multiplexed, comprehensive overview of the sample. However, the resulting increase in data complexity exposes one of the main challenges in targeted metabolomics: the detection and filtering of false-positive metabolic features in the low signal-to-noise ranges of DIA results (Guo & Huan, 2020).

One measure that provides an estimate of the number of false positives in an analysis is the false discovery rate (FDR). The FDR was first introduced as a control measure for high-throughput data in biology (Benjamini & Hochberg, 1995) and is defined as the expected ratio of false-positive classifications (FP, or false discoveries) to the total number of positive classifications: FDR = FP/(FP + TP), where TP denotes true positives. Within a biological setting, different applications tolerate different levels of false positives. For clinical biomarker validation, for example, a low number of false positives is essential (1-5% FDR). On the other hand, if the aim is to generate novel hypotheses for further testing, an FDR of 10% may still be acceptable, as these annotations would require further experimental and manual validation.
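The control procedure introduced by Benjamini and Hochberg is simple enough to sketch in a few lines. The following Python function is a purely illustrative implementation (not part of our pipeline): given a list of p-values, it reports which hypotheses may be declared discoveries while keeping the expected FDR at or below a chosen level alpha.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean list marking which hypotheses are declared
    discoveries under the Benjamini-Hochberg procedure at level alpha."""
    m = len(pvalues)
    # Sort p-values in ascending order, remembering original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k_max = rank
    # Declare discoveries for the k_max smallest p-values.
    accepted = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            accepted[i] = True
    return accepted
```

Comparing each sorted p-value to its rank-scaled threshold (k/m)·alpha is what bounds the expected fraction of false discoveries among the accepted hypotheses at alpha.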

FDR estimation is not yet widely adopted in the field of metabolomics (Scheubert et al., 2017; Wang et al., 2018). This lack of FDR approaches limits the confidence in reported identifications and quantifications, and manual assessment is still common practice. One approach, the target-decoy method, introduces a set of "decoys" (known false features) that can be used to estimate the number of false positives in the overall result. In proteomics, this approach has been used for many years and has become a standard analysis tool (Elias & Gygi, 2007). Unfortunately, due to the structural diversity of small molecules (with the prominence of structural isomers), creating plausible decoys is not as straightforward in metabolomics. Simply shuffling structural elements in a chemical may yield a molecule that is actually present and thus cannot be considered "known false", or it may yield a completely unrealistic chemical. Scheubert et al. addressed this concern in 2017 by introducing a fragmentation-tree-based method that ensures the consistency of decoy spectra using fragmentation tree re-rooting (Scheubert et al., 2017; Ludwig et al., 2020). Briefly, a fragmentation tree is constructed for the spectrum of interest (with fragments assigned to corresponding metabolite substructures), which is then re-rooted to shift the fragmentation reaction order, creating a decoy.
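Once valid decoys exist, the target-decoy principle reduces to a simple counting argument: at any score threshold, the number of decoys that pass approximates the number of false positives among the passing targets. The sketch below uses hypothetical function names and a deliberately simple linear threshold scan; real pipelines score matches with far richer models.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR at a score threshold: decoys passing the
    threshold stand in for false positives among passing targets."""
    decoys_passing = sum(s >= threshold for s in decoy_scores)
    targets_passing = sum(s >= threshold for s in target_scores)
    if targets_passing == 0:
        return 0.0
    return decoys_passing / targets_passing

def threshold_at_fdr(target_scores, decoy_scores, max_fdr=0.05):
    """Return the lowest target score whose estimated FDR does not
    exceed max_fdr (simple ascending scan over candidate thresholds)."""
    for t in sorted(set(target_scores)):
        if estimate_fdr(target_scores, decoy_scores, t) <= max_fdr:
            return t
    return None
```

The key assumption is that decoy matches follow the same score distribution as false target matches, which is exactly what the fragmentation-tree re-rooting strategy is designed to preserve.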

Building on this approach (Ludwig et al., 2020) to further broaden the portfolio of FDR estimation methods, we provide an automated false-discovery-rate-controlled analysis for data-independent acquisition in metabolomics in our recent publication (Alka et al., 2022). With rigorous FDR control and improved comparability between metabolomics results, we observe fewer missing values and more biologically relevant findings than with other methods. We follow an established approach from targeted proteomics, using an experiment-specific target and decoy prior-knowledge database for targeted extraction. For the available targets and decoys, peak groups are extracted and scored; a peak group consists of the fragment ions derived from a target metabolite, integrated over chromatographic time. This metabolomics data analysis pipeline provides assay library generation together with targeted analysis and statistical validation.
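An intuition for why peak groups are useful scoring units: fragment ions that genuinely derive from the same metabolite should co-elute, i.e., their extracted ion chromatograms (XICs) rise and fall together over chromatographic time. A toy co-elution score (purely illustrative; actual peak-group scoring combines many sub-scores) could average the pairwise Pearson correlations of the fragment traces:

```python
from statistics import mean

def coelution_score(xics):
    """Toy co-elution score for a peak group: the average pairwise
    Pearson correlation of its fragment-ion chromatograms (XICs).
    Co-eluting fragments score near 1; uncorrelated noise near 0."""
    def pearson(a, b):
        ma, mb = mean(a), mean(b)
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        sa = sum((x - ma) ** 2 for x in a) ** 0.5
        sb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (sa * sb) if sa and sb else 0.0
    pairs = [(i, j) for i in range(len(xics)) for j in range(i + 1, len(xics))]
    return mean(pearson(xics[i], xics[j]) for i, j in pairs)
```

Because decoy peak groups lack a true underlying metabolite, their fragment traces tend to correlate poorly, which is what separates the target and decoy score distributions.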

Our work also allows for high-level automation of data analysis in metabolomics. The targeted DIA data analysis strategy relies on spectral assay libraries generally derived from a priori measurements. While both the creation of assay libraries for DIA analysis and the processing of extracted ion chromatograms have been automated in other fields (Röst et al., 2014), this is not the case in metabolomics. Fully automated construction of the assay library permits the discovery and quantification of new metabolites, while still achieving the quantification accuracy of a manually curated, targeted approach.

With the accelerated development of novel methods in metabolomics, data complexity continues to increase. Previously, downstream analysis of DIA data with the available tools led to vast differences in results because reliable precision estimates were lacking. This demonstrated the need for a robust, standardized workflow that uses a statistically well-calibrated FDR and allows practitioners to analyze and compare DIA data systematically. These advances enable the accurate identification of compounds and markers present at low concentrations in DIA data, facilitating biomarker quantification. We expect wider adoption of FDR methods in metabolomics to improve the comparability and statistical significance of the resulting identifications.