MAIT Metabolite Automatic Identification Toolkit

Author

Francesc Fernández-Albert, Rafael Llorach, Cristina Andrés-Lacueva, Alexandre Perera

MAIT

MAIT (Metabolite Automatic Identification Toolkit) is an R package for metabolomic LC/MS analysis. It was developed by Francesc Fernández-Albert, Rafael Llorach, Cristina Andrés-Lacueva and Alexandre Perera, and was published in Bioinformatics (Oxford Journals). The package is currently maintained by the B2SLab at UPC, so please send us a note if you have any suggestions or comments.

Citation

Fernández-Albert F., Llorach R., Andrés-Lacueva C., Perera-Lluna A. An R package to analyse LC/MS metabolomic data: MAIT (Metabolite Automatic Identification Toolkit). Bioinformatics (2014). doi: 10.1093/bioinformatics/btu136.

MAIT Description

MAIT performs peak detection for metabolomic LC/MS data sets. It uses a matched filter (Danielsson, Bylund, and Markides 2002) and the centWave algorithm (Tautenhahn et al. 2008) through the XCMS package, developed by the Scripps Center for Metabolomics.

In the peak annotation stage, MAIT applies further complementary steps on top of the CAMERA package (Kuhl et al. 2011). The peaks within each peak group are annotated following a reference adduct/fragment table and a mass tolerance window. MAIT uses a predefined biotransformation table that lists the biotransformations to search for. A user-defined biotransformation table can be supplied as an input instead.

The package also includes a metabolite identification stage, in which a predefined metabolite database is searched for the significant masses, again using a tolerance window. This database is the Human Metabolome Database (HMDB) (Wishart et al. 2009), 2009/07 version.

In terms of statistical tests, the package automatically tests every feature and selects a significant set of features. Depending on the number of classes defined in the data, MAIT can use Student’s t-test, Welch’s t-test and the Mann-Whitney test for two classes, or ANOVA and the Kruskal-Wallis test for more than two classes. The user can also define custom tests for the automatic procedure.

The MAIT R package also supports peak aggregation techniques that might lead to better feature selection (Fernández-Albert, Llorach, Andrés-Lacueva, and Perera-Lluna 2014), as described in: Fernández-Albert F, Llorach R, Andrés-Lacueva C, Perera-Lluna A. Improvement in the predicting power of LC-MS-based metabolomics profiles by peak aggregation techniques. Analytical Chemistry 2014 86 (5), 2320-2325. DOI: 10.1021/ac403702p

Downloading MAIT

The latest version of the MAIT package is available through the Bioconductor repository:

Package: MAIT
Tutorial: MAITSupplementary

Data of the faahKO package in zip format. This is the data used in the MAIT tutorial.

Using MAIT

The data files for this example are a subset of the data used in reference (Saghatelian, Trauger, Want, Hawkins, Siuzdak, and Cravatt 2004), which are freely distributed through the faahKO package (Smith 2012). These data contain 2 classes of mice: a group where the fatty acid amide hydrolase gene has been suppressed (class knockout or KO) and a group of wild type mice (class wild type or WT). There are 6 spinal cord samples in each class. In the following, the MAIT package will be used to read and analyse these samples using the main functions discussed in Section 5. The significant features related to each class will be found using statistical tests and analysed through the different plots that MAIT produces.

Abstract

Processing metabolomic liquid chromatography and mass spectrometry (LC/MS) data files is time-consuming. Currently available R tools allow for only a limited number of processing steps, and online tools are hard to use in a programmable fashion. This paper introduces the Metabolite Automatic Identification Toolkit (MAIT) package, which allows users to perform end-to-end LC/MS metabolomic data analysis. The package is especially focused on improving the peak annotation stage and provides tools to validate the statistical results of the analysis. This validation stage consists of a repeated random sub-sampling cross-validation procedure evaluated through the classification ratio of the sample files. MAIT also includes functions that create a set of tables and plots, such as principal component analysis (PCA) score plots, cluster heat maps, or boxplots. To identify which metabolites are related to statistically significant features, MAIT includes a metabolite database for a metabolite identification stage.

Introduction

Liquid Chromatography and Mass Spectrometry (LC/MS) is an analytical technique widely used in metabolomics to detect molecules in biological samples. It breaks the molecules down into pieces, some of which are detected as peaks in the mass spectrometer. Metabolic profiling of LC/MS samples consists of a peak detection and signal normalization step, followed by multivariate statistical analysis such as principal component analysis (PCA) and univariate statistical tests such as ANOVA.

Analyzing metabolomic data is time-consuming, and a wide array of software tools is available. Commercial tools such as Analyst® software exist, while R packages such as XCMS (peak detection) and CAMERA and AStream (peak annotation) each address a single stage. Online tools like XCMS Online and MetaboAnalyst are extensively used but difficult to use in a programmable fashion. These tools cover only part of the entire metabolomic analysis process.

We introduce a new R package called MAIT (Metabolite Automatic Identification Toolkit) for automatic LC/MS analysis. The goal of MAIT is to provide an array of tools for programmable metabolomic end-to-end analysis. It has special functions to improve peak annotation through biotransformations and is designed to identify statistically significant metabolites that separate data classes.

Methodology

The main processing steps for metabolomic LC/MS data include the following stages:

Peak Detection: Detecting peaks in LC/MS sample files.
Peak Annotation: Improving the identification of metabolites in the samples.
Statistical Analysis: Identifying significant sample features.

All three steps are covered in the MAIT workflow.

Peak Detection

Peak detection in metabolomic LC/MS datasets is complex, and several approaches have been developed. Two of the most well-established techniques are the matched filter and the centWave algorithm. MAIT can use both algorithms through the XCMS package.

Peak Annotation

The MAIT package uses three complementary steps for peak annotation:

Step 1: Uses a peak correlation distance approach and a retention time window to determine peaks from the same metabolite.
Step 2: Uses a mass tolerance window to detect specific mass losses (biotransformations) based on a predefined table.
Step 3: Performs a metabolite identification stage using the Human Metabolome Database (HMDB).
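
The idea behind Step 1 can be sketched in base R. This is only an illustration, not the algorithm used by MAIT or CAMERA: the toy intensities, the retention times and the thresholds corThreshold and rtWindow are assumptions made for the example. Peaks are linked when they co-elute and correlate across samples, and linked peaks are collected into groups with single-linkage clustering.

```r
# Toy peak table: 5 peaks measured in 6 samples, with retention times (seconds).
a <- c(1, 2, 3, 4, 5, 6)
b <- c(6, 1, 5, 2, 4, 3)
intensity <- rbind(a * 10, a * 3, a * 7, b * 5, b * 2)
rt <- c(100.0, 100.5, 101.0, 250.0, 250.4)

corThreshold <- 0.7  # hypothetical thresholds chosen for illustration
rtWindow <- 2

# Two peaks are linked if they co-elute and correlate across samples.
linked <- cor(t(intensity)) >= corThreshold &
  abs(outer(rt, rt, "-")) <= rtWindow

# Connected components of the link graph via single-linkage clustering:
# linked pairs have distance 0, unlinked pairs distance 1, cut in between.
groups <- cutree(hclust(as.dist(1 - linked), method = "single"), h = 0.5)
print(groups)
```

Peaks that end up with the same group label would be treated as candidates for coming from the same metabolite.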

Statistical Analysis

The goal of metabolomic profiling is to obtain statistically significant features that contain class-related information. MAIT applies standard univariate statistical tests (ANOVA or t-test) and selects significant features based on a user-defined P-value threshold. A validation test quantifies how well data classes are separated through repeated random sub-sampling cross-validation using:

Partial Least Squares - Discriminant Analysis (PLS-DA)
Support Vector Machine (SVM) with a radial kernel
K-Nearest Neighbors (KNN)

Using MAIT

Data Import

Data files should be placed in directories named after their respective classes. The MAIT package requires class folders within a single directory.
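
As a minimal sketch of this layout, the following base R code builds a temporary directory with one subfolder per class and recovers the class labels from the folder names. The folder names KO and WT follow the faahKO example; the directory name and file names are hypothetical placeholders, and empty files stand in for real LC/MS data.

```r
# Build a toy data directory: one subfolder per class (hypothetical file names).
dataDir <- file.path(tempdir(), "mait_layout_demo")
unlink(dataDir, recursive = TRUE)
for (cls in c("KO", "WT")) {
  dir.create(file.path(dataDir, cls), recursive = TRUE)
  file.create(file.path(dataDir, cls, paste0(tolower(cls), 1:3, ".CDF")))
}

# Recover the class of each file from the name of its parent folder.
files <- list.files(dataDir, recursive = TRUE)
classes <- dirname(files)
print(table(classes))
```

A directory organised like dataDir is the kind of input that would be passed as the dataDir argument of sampleProcessing().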

Peak Detection

Once the data is organized, peak detection can be initiated using the sampleProcessing() function:

library(MAIT)
library(faahKO)
cdfFiles <- system.file("cdf", package="faahKO", mustWork=TRUE)
MAIT <- sampleProcessing(dataDir = cdfFiles, project = "MAIT_Demo", snThres=2, rtStep=0.03)

Peak Annotation

The next step in the data processing is the first peak annotation step, which is performed through the peakAnnotation() function. If the input parameter adductTable is not set, the default MAIT table for positive polarisation is selected. If adductTable is set to “negAdducts”, the default MAIT table for negative fragments is chosen instead. The peakAnnotation() function also creates an output table (see Table 3) containing the peak mass (in mass/charge units), the retention time (in minutes) and the spectral ID number for all the peaks detected. A call to peakAnnotation() may look like this:

MAIT <- peakAnnotation(MAIT.object = MAIT, corrWithSamp = 0.7, corrBetSamp = 0.75, perfwhm = 0.6)

Because the parameter adductTable was not set in the peakAnnotation call, a warning was shown informing that the default MAIT table for positive polarisation mode was selected. The xsAnnotated object that contains all the information related to peaks, spectra and their annotation is stored in the MAIT object.

Following the first peak annotation stage, we want to know which features differ between classes. Consequently, we run the function spectralSigFeatures().

MAIT<-spectralSigFeatures(MAIT.object = MAIT, pvalue = 0.05, p.adj = "none",
scale = FALSE)

It is worth mentioning that setting the scale parameter to TRUE scales the data to unit variance. The parameter p.adj selects one of the multiple testing correction methods included in the function p.adjust of the package stats. A summary of the statistically significant features is created and saved in a table called significantFeatures.csv (see Table 3). It is placed inside the Tables subfolder located in the project folder. This table shows characteristics of the statistically significant features, such as their P-value, the peak annotation or the expression of the peaks across samples. This table can be retrieved at any time from the MAIT-class object by typing the instruction:

signTable <- sigPeaksTable(MAIT.object = MAIT, printCSVfile = FALSE)

The number of significant features can be retrieved from the MAIT-class object as follows:

MAIT

By default, when using two classes, the statistical test applied by MAIT is Welch’s t-test. With two classes, MAIT also supports Student’s t-test and the non-parametric Mann-Whitney test.
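
To make the per-feature testing concrete, here is a small base R sketch that applies one test to every feature and filters by P-value. It is not MAIT's internal code: the data are simulated, and testFeature is a hypothetical helper written for this example. Only t.test, wilcox.test and p.adjust from the stats package are used.

```r
# Simulated intensity matrix: 50 features (rows) x 12 samples (columns).
set.seed(1)
classes <- factor(rep(c("KO", "WT"), each = 6))
X <- matrix(rnorm(50 * 12, mean = 100, sd = 10), nrow = 50)
X[1:5, classes == "KO"] <- X[1:5, classes == "KO"] + 60  # 5 truly different features

# One two-class test per feature: Welch, Student or Mann-Whitney.
testFeature <- function(x, cls, test = c("welch", "student", "mannwhitney")) {
  test <- match.arg(test)
  g <- split(x, cls)
  switch(test,
    welch       = t.test(g[[1]], g[[2]])$p.value,
    student     = t.test(g[[1]], g[[2]], var.equal = TRUE)$p.value,
    mannwhitney = wilcox.test(g[[1]], g[[2]])$p.value
  )
}

pvals <- apply(X, 1, testFeature, cls = classes, test = "welch")
padj  <- p.adjust(pvals, method = "none")  # "bonferroni" or "BH" are other options
significant <- which(padj <= 0.05)
print(significant)
```

With method = "none" the P-values are left uncorrected, which mirrors the p.adj = "none" setting used in the spectralSigFeatures() call above.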

Statistical Analysis

To identify statistically significant features:

MAIT <- spectralSigFeatures(MAIT.object = MAIT, pvalue=0.05, p.adj="none", scale=FALSE)
summary(MAIT)

Statistical Plots

Out of 2,402 features, 106 were found to be statistically significant. At this point, several MAIT functions can be used to extract and visualise the results of the analysis. When applied to a MAIT object, the functions plotBoxplot, plotHeatmap, plotPCA and plotPLS automatically generate boxplot, heat map, PCA score plot and PLS score plot files in the project folder.

plotBoxplot(MAIT)
plotHeatmap(MAIT)
MAIT <- plotPCA(MAIT, plot3d=FALSE)
MAIT <- plotPLS(MAIT, plot3d=FALSE)

The plotPCA and plotPLS functions produce MAIT objects with the corresponding PCA and PLS models saved inside. The models, loadings and scores can be retrieved from the MAIT objects by using the functions model, loadings and scores.
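
For readers unfamiliar with the terms, the sketch below shows what scores and loadings are, using prcomp from the stats package on simulated data. It is an assumption-laden illustration: MAIT's plotPCA may build and store its model differently, and the matrix X here is invented.

```r
set.seed(2)
classes <- factor(rep(c("KO", "WT"), each = 6))
# 12 samples (rows) x 20 features (columns), simulated.
X <- matrix(rnorm(12 * 20), nrow = 12)
X[classes == "KO", ] <- X[classes == "KO", ] + 3

pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores   <- pca$x         # sample coordinates on the principal components
loadings <- pca$rotation  # weight of each feature in each component

# Score plot of the first two components, coloured by class.
plot(scores[, 1], scores[, 2], col = as.integer(classes), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA score plot (simulated data)")
legend("topright", legend = levels(classes), col = 1:2, pch = 19)
```

In a score plot like this one, well separated classes appear as distinct clusters of points along the first components.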

All the output figures are saved in their corresponding subfolders contained in the project folder. The names of the folders for the boxplots, heat maps and score plots are Boxplots, Heatmaps, PCA Scoreplots and PLS Scoreplots respectively. Figures 3 and 4 depict a heat map, a PCA score plot and a PLS score plot created when functions plotHeatmap, plotPCA and plotPLS were launched.

Metabolite Identification and Validation

Biotransformations

Before identifying the metabolites, peak annotation can be improved using the function Biotransformations, which makes the results easier to interpret. The MAIT package uses a default biotransformations table, but the user can supply another table through the bioTable input argument. The default table is saved inside the file MAITtables.RData, under the name biotransformationsTable.

MAIT <- Biotransformations(MAIT.object = MAIT, peakPrecision = 0.005)
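
The underlying idea, searching for known mass differences between peaks within a tolerance, can be sketched in base R as follows. The table below is a toy example with three common monoisotopic mass shifts, not the biotransformationsTable shipped with MAIT, and the peak masses are invented so that each shift occurs once.

```r
# Toy biotransformation table (illustrative; not the table shipped with MAIT).
bioTable <- data.frame(
  name      = c("hydroxylation", "methylation", "glucuronidation"),
  massShift = c(15.9949, 14.0157, 176.0321)
)

# Hypothetical peak masses (m/z) of the significant features.
peaks <- c(180.0634, 196.0583, 210.0740, 356.0955)
peakPrecision <- 0.005

# Compare the mass difference of every pair of peaks with every table entry.
idx <- t(combn(seq_along(peaks), 2))
diffs <- abs(peaks[idx[, 2]] - peaks[idx[, 1]])
hits <- do.call(rbind, lapply(seq_len(nrow(bioTable)), function(i) {
  ok <- abs(diffs - bioTable$massShift[i]) <= peakPrecision
  data.frame(from = peaks[idx[ok, 1]], to = peaks[idx[ok, 2]],
             biotransformation = rep(as.character(bioTable$name[i]), sum(ok)))
}))
print(hits)
```

A smaller peakPrecision makes the search stricter, at the cost of missing matches when the measured masses are less accurate.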

Metabolite Identification

Once the biotransformations annotation step is finished, the significant features have been enriched with a more specific annotation. The annotation procedure performed by the Biotransformations() function never replaces the peak annotations already made by other functions. MAIT considers the peak annotations to be complementary; therefore, when new annotations are detected, they are added to the current peak annotation. The identification function may then be launched to identify the metabolites corresponding to the statistically significant features in the data.

MAIT <- identifyMetabolites(MAIT.object = MAIT, peakTolerance = 0.005)

By default, the function identifyMetabolites() looks for the peaks of the significant features in the MAIT default metabolite database. The input parameter peakTolerance defines the maximum difference between a peak and a database compound for them to be considered a possible match. It is set to 0.005 mass/charge units by default. The argument polarity refers to the polarity in which the samples were taken (positive or negative). It is set to “positive” by default, but it should be changed to “negative” if the samples were recorded in negative polarisation mode. To make the results easy to check, identifyMetabolites() creates a table containing the significant feature characteristics and the possible metabolite identifications.
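
The tolerance-window lookup can be illustrated with a few lines of base R. Everything in this sketch is an assumption made for the example: the database entries are hypothetical, matchPeak is not a MAIT function, and the label "Unknown" for unmatched peaks is a placeholder.

```r
# Toy metabolite database (hypothetical entries, not HMDB content).
db <- data.frame(
  compound = c("compound_A", "compound_B", "compound_C"),
  mz       = c(181.0707, 205.0972, 303.2319)
)

# Return all database entries within peakTolerance of a peak mass.
matchPeak <- function(peakMz, db, peakTolerance = 0.005) {
  hit <- abs(db$mz - peakMz) <= peakTolerance
  if (!any(hit)) {
    return(data.frame(peak = peakMz, compound = "Unknown", mz = NA_real_))
  }
  data.frame(peak = peakMz, compound = as.character(db$compound[hit]),
             mz = db$mz[hit])
}

peaks <- c(181.0721, 205.0969, 250.1000)
result <- do.call(rbind, lapply(peaks, matchPeak, db = db))
print(result)
```

Because every entry inside the window is returned, a single peak can receive several candidate identifications when the database contains compounds of very similar mass.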

Validation

Finally, we will use the function Validation() to check the predictive value of the significant features. All the information related to the output of the Validation() function is saved in the project directory in a folder called “Validation”.

MAIT <- Validation(Iterations = 20, trainSamples = 3, MAIT.object = MAIT)
summary(MAIT)
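
The repeated random sub-sampling scheme can be sketched in base R. This is a simplified illustration under stated assumptions: the data are simulated, and a hand-written 1-nearest-neighbour classifier (predict1nn, a hypothetical helper) stands in for the KNN, PLS-DA and SVM classifiers that MAIT uses. In each iteration, trainSamples samples per class are drawn for training, the remaining samples are classified, and the fraction classified correctly is recorded.

```r
set.seed(3)
classes <- factor(rep(c("KO", "WT"), each = 6))
X <- matrix(rnorm(12 * 10), nrow = 12)            # 12 samples x 10 features
X[classes == "KO", ] <- X[classes == "KO", ] + 4  # make the classes separable

# 1-nearest-neighbour prediction using Euclidean distance.
predict1nn <- function(train, trainCls, test) {
  apply(test, 1, function(s) {
    d <- sqrt(colSums((t(train) - s)^2))
    as.character(trainCls[which.min(d)])
  })
}

iterations <- 20
trainSamples <- 3
ratio <- numeric(iterations)
for (i in seq_len(iterations)) {
  # Draw trainSamples samples per class for training; the rest are tested.
  trainIdx <- unlist(lapply(split(seq_along(classes), classes),
                            sample, size = trainSamples))
  pred <- predict1nn(X[trainIdx, , drop = FALSE], classes[trainIdx],
                     X[-trainIdx, , drop = FALSE])
  ratio[i] <- mean(pred == as.character(classes[-trainIdx]))
}
print(summary(ratio))
```

The spread of ratio across iterations gives a rough sense of how stable the class separation is with respect to the choice of training samples.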

Conclusion

MAIT is an R package designed for end-to-end metabolomic LC/MS analysis. It integrates peak detection, annotation, and statistical validation, providing researchers with an automated and programmable tool for metabolite identification.