A strategy to improve the annotation coverage of small molecule compounds in fruits, vegetables and their products by untargeted metabolomics

Junyan Yu; Lei Xu; Nan Zhang; Lingxiao Tang; Xiangyang Zhu; Lu Mi; Qiong Xu; Kewen Wang; Charles Viau; Xue Wang; Zhenzhen Xu

doi:10.48130/fia-0026-0009

Metabolomics is essential for analyzing small molecules in food. Effective extraction and separation technology, along with reliable and efficient analytical tools, are essential for enhancing both the quantity and accuracy of compound analysis. Traditional methods relying on single-solvent extraction and single-column separation often result in target omission and reduced annotation coverage. This study presents a 'Divide, Conquer, and Integrate Strategy' for comprehensive untargeted metabolomics in fruits, vegetables, and their products. The method uses three extraction techniques to capture metabolites across a broad polarity range. Each extract is separated using specific chromatographic columns and mobile phases to ensure high annotation coverage. Data are collected via high-resolution mass spectrometry in both positive and negative ion modes, and analyzed using MS-DIAL and MetaboAnalystR. This integrated approach enhances metabolite discovery and annotation accuracy, with low overlap of metabolites annotated by different extraction methods.

Experimental materials and equipment

Chemicals and reagents

● Ultrapure water, generated by a Milli-Q system (Millipore, Billerica, MA, USA) or similar

● LC/MS-grade methanol (Thermo Fisher Scientific, Waltham, MA, USA)

● LC/MS-grade acetonitrile (Thermo Fisher Scientific, Waltham, MA, USA)

● LC/MS-grade isopropanol (Thermo Fisher Scientific, Waltham, MA, USA)

● HPLC-grade formic acid (DIKMA, Beijing, China)

● AR-grade chloroform (Aladdin, Shanghai, China)

● AR-grade ammonium acetate (Aladdin, Shanghai, China)

Chromatographic materials and other consumables
● Acquity UPLC BEH C18 column (2.1 × 100 mm, 1.7 μm; Waters, Milford, MA, USA)

● InfinityLab Poroshell 120 HILIC-Z column (3.0 × 100 mm, 2.7 μm; Agilent, Santa Clara, CA, USA)

● InfinityLab Poroshell 120 SB-C18 column (2.1 × 100 mm, 2.7 μm; Agilent, Santa Clara, CA, USA)

● Pipettes (Eppendorf, Hamburg, Germany) and tips (Axygen, Silicon Valley, CA, USA)

● Centrifuge tubes (Axygen, Silicon Valley, CA, USA)

● Brown autosampler vials (Titan, Shanghai, China)

● 0.22 μm PES membrane filters (Jin Teng, Tianjin, China)

● 0.22 μm nylon membrane filters (Jin Teng, Tianjin, China)

Laboratory equipment
● −80 °C refrigerator (DW-86L 726G, Haier Biomedical, Qingdao, Shandong, China)

● Vacuum freeze dryer (LG-03, Song Yuan, Beijing, China)

● Freeze grinder (Xinyi-24N, Xinyi, Ningbo, China)

● Ultrasonic cleaner (KQ-500DE, Supmile, Kunshan, Jiangsu, China)

● Vortex mixer (SCI-VS, SCILOGEX, Rocky Hill, CT, USA)

● Electronic balances (0.1 mg, MA204E, Mettler Toledo, Zurich, Switzerland)

● High speed microcentrifuge (L-CM-1524R, LABGIC, Beijing, China)

● LC (ExionLC AD) coupled with a Triple TOF 6600 mass spectrometer (AB SCIEX, Redwood City, CA, USA) or similar

Software
● MSConvert (https://proteowizard.sourceforge.io/)

● MS-DIAL (version 4.9 or higher) (https://systemsomicslab.github.io/compms/msdial/main.html)

● R statistical scripting language (version 4.1.2) (https://www.r-project.org/)

● MetaboAnalystR (https://github.com/xia-lab/MetaboAnalystR)

● OptiLCMS (https://github.com/xia-lab/OptiLCMS)

● Python 3 (https://www.python.org/)

Protocols

Integral design

Extraction and separation by the polarity of compound

Double extraction is typically employed to obtain both non-polar and polar compounds. Polar compounds are retained in the methanol/water phase, while lipids are analyzed after the chloroform phase (chloroform is an organic reagent commonly used for lipid extraction^[18]) has been evaporated and reconstituted in methanol, which expands the scope of metabolomics analysis. However, these systems containing organic solvents exhibit low extraction efficiency for hydrophilic, strongly polar compounds, resulting in their low abundance in the extract and an insufficient response in MS detection.

Therefore, to obtain a comprehensive metabolite profile of the sample, solvents of different polarities are used for extraction. Water and methanol-water mixtures are used to extract polar and semi-polar compounds, respectively. Chloroform-methanol-water mixtures are used to separate lipids from polar compounds, following the Bligh & Dyer method with slight adjustments^[19,20]. For clear juice samples, extraction is not necessary: they can be centrifuged, filtered through a 0.22 µm PES membrane, and analyzed immediately by LC-MS. If the mass spectral intensity is low, solid phase extraction can also be carried out.

Quality control (QC) samples contain the most comprehensive set of metabolites and can be used to monitor the stability and repeatability of the instrument and method. QC samples are prepared by mixing equal volumes of all extracts. Because this protocol uses a different solvent for each polarity range, a separate QC sample needs to be prepared for each extraction method.

The selection of the column is closely related to the polarity of the solvent. Strongly polar compounds exhibit poor retention in reversed-phase liquid chromatography (RPLC), which results in target omission and reduces annotation coverage. Therefore, normal phase liquid chromatography (NPLC), such as an HILIC column, is often used for their separation^[21]. Semi-polar compounds are typically separated using RPLC, for example, with a C18 column^[22]. For weakly polar or non-polar systems, where lipids are the dominant components, lipidomics is commonly employed to fill the gap left by metabolomics in this area^[23].

Untargeted metabolic profile data collected by LC-QTOF-MS
The acquisition modes of QTOF-MS are divided into data-dependent acquisition (DDA) and data-independent acquisition (DIA). DDA selects the several precursor ions with the highest intensity for fragmentation, which yields higher quality MS2 spectra^[24] and is conducive to annotation. DIA fragments all precursor ions within a selected m/z range or within sequential mass windows (such as sequential window acquisition of all theoretical spectra, SWATH)^[6]. This technique obtains more comprehensive MS2 information and compensates for the limitation of DDA in acquiring MS2 spectra of low-abundance ions. The suitable collision energy (CE) differs between metabolites because of their structures. To obtain better product ions for annotation, this protocol sets CE = ±35 V and a collision energy spread (CES) of 15 V in the product ion scan, which corresponds to three CE voltages of ±20, ±35, and ±50 V. The other main parameters of the QTOF-MS are set according to the manufacturer's instructions and optimized according to actual needs.
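The relationship between CE, CES, and the three applied voltages can be checked with a minimal sketch. The function name is hypothetical and is not part of any instrument software; it only reproduces the arithmetic described above.

```python
from __future__ import annotations


def collision_energy_levels(ce: float, ces: float) -> tuple[float, float, float]:
    """Return the low, centre, and high collision energies (in V).

    The sign of `ce` is kept, so negative ion mode values stay negative.
    """
    sign = -1.0 if ce < 0 else 1.0
    magnitude = abs(ce)
    return (
        sign * (magnitude - ces),
        sign * magnitude,
        sign * (magnitude + ces),
    )


print(collision_energy_levels(35, 15))   # positive ion mode
print(collision_energy_levels(-35, 15))  # negative ion mode
```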

Metabolite annotation by multiple tools
To obtain as much credible metabolite annotation information as possible, multiple analytical tools are used together so that they can mutually validate the annotation results and compensate for deficiencies that a single tool may have in deconvolution or reference database searching. MS-DIAL can be used to rapidly acquire high-quality alignment tables and annotation results. After the m/z and retention time (RT) information obtained from MS-DIAL is imported into MetaboAnalystR, the MS2 spectra undergo deconvolution, replicated spectrum consensus screening, and reference database searching, which yields a second set of annotation results^[17]. Comparing the annotation results from both tools allows the MS2 spectrum matches to be verified under the same set of m/z and RT values, and supplements metabolites that a single tool could not annotate. In short, the annotation results of the two tools are merged, and when a feature receives different annotations, the result with the higher score is selected. This protocol applies this method to increase the number of annotation results while reducing the occurrence of false positives.
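The merging rule described above can be illustrated with a minimal sketch. It assumes that both tools report annotations against a shared feature identifier (same m/z and RT) and that their scores are on a comparable scale; the `Annotation` class, the `merge_annotations` function, and the example values are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    feature_id: str  # hypothetical key shared by both tools (same m/z and RT)
    name: str
    score: float     # assumed to be on a comparable scale in both tools
    tool: str


def merge_annotations(
    msdial: list[Annotation], metaboanalyst: list[Annotation]
) -> dict[str, Annotation]:
    """Keep one annotation per feature, preferring the higher score."""
    merged: dict[str, Annotation] = {}
    for ann in [*msdial, *metaboanalyst]:
        best = merged.get(ann.feature_id)
        if best is None or ann.score > best.score:
            merged[ann.feature_id] = ann
    return merged


# Made-up example values, for illustration only.
msdial_hits = [
    Annotation("F1", "compound A", 85, "MS-DIAL"),
    Annotation("F2", "compound B", 72, "MS-DIAL"),
]
metaboanalyst_hits = [
    Annotation("F2", "compound C", 90, "MetaboAnalystR"),
    Annotation("F3", "compound D", 65, "MetaboAnalystR"),
]
for feature_id, ann in sorted(merge_annotations(msdial_hits, metaboanalyst_hits).items()):
    print(feature_id, ann.name, ann.tool)
```

Features annotated by only one tool are kept as supplements, while conflicting annotations are resolved by score.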

Safety rules
All experiments should be performed in accordance with relevant national and international guidelines and regulations. For safety, always wear appropriate personal protective equipment, including gloves, lab coats, masks, and safety goggles, when handling chemical reagents. Additionally, work with volatile reagents in a fume hood to ensure safe ventilation.

Sample collection and pretreatment
This protocol is applicable to fresh fruits and vegetables and their processed products. Collect samples in accordance with sampling norms to ensure that they are representative. Sample pretreatment should be performed as soon as possible according to experimental requirements. The samples selected in this protocol include four sprout types (V1–V4), European plums (F1), passion fruit (F2), and mango pulp treated by two sterilization processes, UHT and HPP (P1, P2).

Clean the fruit and vegetable samples and keep the edible part. Cut solid samples into slices or small pieces; mix semi-solid samples and place them on sterile plates. For fruits and vegetables with high water content, or for cloudy juice/pulp samples, lyophilization before compound extraction is preferred because it improves the feature intensity in subsequent detection. Freeze the samples with liquid nitrogen so that the moisture forms ice crystals, which sublimate during lyophilization. Lyophilize the frozen samples to a crisp state with a vacuum freeze dryer. For a vacuum freeze dryer with a heating function, the plate temperature should not be set too high to avoid thermal degradation of metabolites; generally, less than 10 °C is appropriate. Grind the lyophilized sample into a fine powder with a freeze grinder, then collect and store the powder at −80 °C. Note that juice samples should not be lyophilized.

Preparation of analytical reagents

Extraction solvents
All extraction solvents should be prepared fresh before use.

For extracting polar compounds, the preferred solvent is ultrapure water or ultrapure water containing a small amount of organic acid. Ultrapure water is generated by a Milli-Q system or similar. Suitable organic acids include, but are not limited to, HPLC-grade formic acid, acetic acid, and trifluoroacetic acid. The recommended organic acid concentration is 0.1%–0.5%.

For extracting semi-polar compounds, the preferred solvent is a 50% (vol/vol) water/methanol solution, optionally containing a small amount of organic acid. The 50% (vol/vol) water/methanol solution is prepared by thoroughly mixing equal volumes of ultrapure water and LC/MS-grade methanol. As above, suitable organic acids include, but are not limited to, HPLC-grade formic acid, acetic acid, and trifluoroacetic acid, at a recommended concentration of 0.1%–0.5%.

For extracting lipids, the preferred solvent is 2:1 (vol/vol) chloroform/methanol. To prepare this solution, AR-grade chloroform is thoroughly mixed with LC/MS-grade methanol at a 2:1 volume ratio.

Mobile phase and other reagents
The required mobile phase volume for the anticipated analysis time is determined by considering the number of samples, the run time per injection, and the mobile phase flow rate. It is assumed that 1,000 mL each of the aqueous and organic phases will be sufficient.
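The volume estimate described above can be sketched as follows. The function name and the safety factor (extra volume for equilibration, purging, and re-injections) are assumptions for illustration; the result is the total flow through the column, which is shared between phases A and B according to the gradient.

```python
def mobile_phase_volume_ml(
    n_injections: int,
    run_time_min: float,
    flow_rate_ml_min: float,
    safety_factor: float = 1.2,  # assumed margin, adjust to local practice
) -> float:
    """Estimate the total mobile phase volume (mL) consumed by a sequence."""
    if n_injections < 0 or run_time_min < 0 or flow_rate_ml_min < 0:
        raise ValueError("inputs must be non-negative")
    return n_injections * run_time_min * flow_rate_ml_min * safety_factor


# Example: 100 injections of 18 min each at 0.4 mL/min.
print(round(mobile_phase_volume_ml(100, 18, 0.4), 1), "mL in total (A + B)")
```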

For the BEH C18 column, mobile phase A is prepared by adding 1.0 mL of HPLC-grade formic acid to 1,000 mL of ultrapure water and mixing thoroughly. Similarly, mobile phase B is prepared by adding 1.0 mL of HPLC-grade formic acid to 1,000 mL of LC/MS-grade acetonitrile and mixing thoroughly.

For the HILIC-Z column, prepare the 2 mol/L CH₃COONH₄ solution in advance and store it in the refrigerator at 4 °C. To prepare mobile phase A, add 5 mL of this 2 mol/L CH₃COONH₄ solution to 1,000 mL of a 5% (vol/vol) acetonitrile/water solution and mix thoroughly, giving a final CH₃COONH₄ concentration of 10 mmol/L. For mobile phase B, prepare a 95% (vol/vol) acetonitrile/water solution containing 10 mmol/L CH₃COONH₄ in the same way, again mixing thoroughly.

For an optimal peak profile with the SB-C18 column, CH₃COONH₄ is also added to the mobile phase. To prepare mobile phase A, add 2.5 mL of 2 mol/L CH₃COONH₄ to 1,000 mL of a 6:4 (vol/vol) acetonitrile/water mixture and mix thoroughly, giving a final CH₃COONH₄ concentration of 5 mmol/L. Similarly, mobile phase B is prepared by adding CH₃COONH₄ to a 9:1 (vol/vol) isopropanol/acetonitrile mixture to a final concentration of 5 mmol/L and mixing thoroughly.
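The stated final concentrations can be verified with a short dilution calculation. This is a sketch that assumes the volumes of stock and solvent are additive; the function name is hypothetical.

```python
def final_concentration_mmol_l(
    stock_mol_l: float, stock_ml: float, solvent_ml: float
) -> float:
    """Concentration (mmol/L) after adding a stock solution to a solvent.

    Assumes additive volumes, which is only approximately true for mixed solvents.
    """
    total_ml = stock_ml + solvent_ml
    return stock_mol_l * 1000.0 * stock_ml / total_ml


# 5 mL of 2 mol/L ammonium acetate in 1,000 mL (HILIC-Z, mobile phase A)
print(round(final_concentration_mmol_l(2.0, 5.0, 1000.0), 2))
# 2.5 mL of 2 mol/L ammonium acetate in 1,000 mL (SB-C18, mobile phase A)
print(round(final_concentration_mmol_l(2.0, 2.5, 1000.0), 2))
```

Both values fall slightly below the nominal 10 and 5 mmol/L because the added stock increases the total volume.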

Other solutions needed for LC-QTOF-MS (like needle wash solution, calibration solution) are prepared according to the manufacturer's instructions.

Compound extraction

Polar compound extraction
For each sample, weigh 0.5 g of freeze-dried powder into a centrifuge tube and add 5 mL of ultrapure water (or ultrapure water containing 0.1%–0.5% organic acid). Mix thoroughly using a vortex mixer for 10 s. Ultrasonicate the mixture at 40 kHz and 4 °C for 20 min, then centrifuge at 10,000 × g and 4 °C for 15 min. Transfer the supernatant to a new centrifuge tube, filter it through a 0.22 µm PES membrane, and transfer 1 mL of the filtrate into a brown autosampler vial. Prepare a QC sample by combining equal volumes of each filtrate, and prepare a blank sample by adding 1 mL of the extraction solvent to a separate brown autosampler vial.

Notably, for the ultrapure water extracts, the filtrate should be analyzed as soon as possible to avoid long-term storage, as it may lead to precipitation or bacterial growth in the sample.

Semi-polar compound extraction
For each sample, weigh 0.5 g of freeze-dried powder into a centrifuge tube and add 5 mL of a 50% (vol/vol) methanol/water solution (optionally containing 0.1%–0.5% organic acid). Mix thoroughly using a vortex mixer for 10 s. Ultrasonicate the mixture at 40 kHz and 4 °C for 20 min, then centrifuge at 10,000 × g and 4 °C for 15 min. Transfer the supernatant to a new centrifuge tube, filter it through a 0.22 µm nylon membrane, and transfer 1 mL of the filtrate into a brown autosampler vial. Prepare a QC sample by combining equal volumes of each filtrate, and prepare a blank sample by adding 1 mL of extraction solvent to a separate brown autosampler vial. Store all samples at −20 °C or −80 °C until analysis.

Lipid compound extraction
For lipid extraction, use 0.5 g of freeze-dried powder as a sample. First, add 1 mL of ultrapure water and mix thoroughly on a vortex mixer for 10 s. Then add 4 mL of 2:1 (vol/vol) chloroform/methanol and mix thoroughly again. Ultrasonicate the mixture at 40 kHz and 4 °C for 20 min, centrifuge at 10,000 × g and 4 °C for 15 min, and carefully collect the lower organic layer. For the secondary extraction, add 4 mL of the extraction solution to the supernatant and mix thoroughly. Next, merge the two extracts. Evaporate the combined solution to near dryness under nitrogen gas and reconstitute the sample in 2 mL of 2:1 (vol/vol) chloroform/methanol. Filter the solution through a 0.22 µm nylon membrane and transfer 1 mL of the filtrate into a brown autosampler vial. Combine equal volumes of each filtrate to create a QC sample, and prepare a blank sample by adding 1 mL of extraction solvent to a separate brown autosampler vial. Store all samples at −20 °C or −80 °C until analysis.

Metabolomics data acquisition

Preparation of the analytical system
Metabolomics data were acquired using LC-QTOF-MS (ExionLC-TripleTOF6600, AB SCIEX). Other HRMS instruments with DDA/IDA are also suitable.

First, replace the mobile phases and the chromatographic column. For AB SCIEX instruments, open the purge valve before purging to prevent the expelled liquid from entering the mass spectrometer. For instruments from other manufacturers that do not have a purge valve, operate according to the manufacturer's specifications.

Second, calibrate the QTOF-MS and Product Ion. Set appropriate scanning ranges and ion source parameters according to the experimental needs. Inject the calibration solution and start the acquisition. Once the acquisition signal is stable, stop the acquisition, then tune and calibrate according to the instrument requirements to ensure that the average error is < 2 ppm. The Product Ion calibration is performed similarly; example settings (positive/negative ion mode) are: DP, 80/−80 V; CE, 45/−23 V; Product Of, 609/403 Da. The MS2 scan range is m/z 50–1,500 Da (adjustable according to the experimental needs).

Construction of analysis method and sample sequence
For the polar compound extract solution, chromatographic separation was performed using an InfinityLab Poroshell 120 HILIC-Z column (3.0 × 100 mm, 2.7 µm, Agilent, USA) (Water-HILIC). The flow rate was set at 0.4 mL/min, with the following gradient program: 5% A (0–2.00 min), 5%–20% A (2.00–7.00 min), 20%–32% A (7.00–13.00 min), 32%–35% A (13.00–16.00 min), 35%–5% A (16.00–16.01 min), and 5% A (16.01–18.00 min). The autosampler and column temperatures were maintained at 4 and 40 °C, respectively, with an injection volume of 2 µL.

For the semi-polar compound extract solution, chromatographic separation was carried out using an Acquity UPLC BEH C18 column (2.1 × 100 mm, 1.7 µm, Waters) (Methanol-C18). The flow rate was set at 0.3 mL/min, with the gradient program: 5%–40% B (0–10.00 min), 40%–100% B (10.00–13.00 min), 100% B (13.00–15.00 min), 100%–5% B (15.00–15.01 min), and 5% B (15.01–18.00 min). The autosampler and column temperatures were also maintained at 4 and 40 °C, respectively, with an injection volume of 2 µL.

For the lipid compound extract solution, chromatographic separation was performed using an InfinityLab Poroshell 120 SB-C18 column (3.0 × 100 mm, 2.7 μm, Agilent, USA) (Lipid-C18). The flow rate was set at 0.4 mL/min, with the following gradient program: 60%–0% A (0.00–12.00 min), 0% A (12.00–14.00 min), 0%–60% A (14.00–14.10 min), and 60% A (14.10–18.00 min). The autosampler and column temperatures were maintained at 4 °C and 40 °C, respectively, with an injection volume of 2 µL.
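The three gradient programs above can be written down as data and checked for gaps before they are entered into the instrument method. The following is a minimal sketch; the dictionary keys, the function names, and the linear interpolation within each segment are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Each segment: (start_min, end_min, start_percent, end_percent).
# The percentage refers to phase A or B as stated in the key.
Segment = Tuple[float, float, float, float]

GRADIENTS: Dict[str, List[Segment]] = {
    "Water-HILIC (%A)": [
        (0.0, 2.0, 5, 5), (2.0, 7.0, 5, 20), (7.0, 13.0, 20, 32),
        (13.0, 16.0, 32, 35), (16.0, 16.01, 35, 5), (16.01, 18.0, 5, 5),
    ],
    "Methanol-C18 (%B)": [
        (0.0, 10.0, 5, 40), (10.0, 13.0, 40, 100), (13.0, 15.0, 100, 100),
        (15.0, 15.01, 100, 5), (15.01, 18.0, 5, 5),
    ],
    "Lipid-C18 (%A)": [
        (0.0, 12.0, 60, 0), (12.0, 14.0, 0, 0),
        (14.0, 14.1, 0, 60), (14.1, 18.0, 60, 60),
    ],
}


def check_gradient(segments: List[Segment]) -> None:
    """Raise ValueError if consecutive segments do not join in time and composition."""
    for prev, cur in zip(segments, segments[1:]):
        if abs(prev[1] - cur[0]) > 1e-9 or abs(prev[3] - cur[2]) > 1e-9:
            raise ValueError(f"discontinuity between {prev} and {cur}")


def percent_at(segments: List[Segment], t_min: float) -> float:
    """Composition at time t_min, assuming a linear change within each segment."""
    for start, end, p0, p1 in segments:
        if start <= t_min <= end:
            if end == start:
                return float(p1)
            return p0 + (p1 - p0) * (t_min - start) / (end - start)
    raise ValueError(f"{t_min} min is outside the gradient program")


for name, program in GRADIENTS.items():
    check_gradient(program)
    print(name, "total run time:", program[-1][1], "min")
```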

For all samples, the experimental parameters for TOF-MS scanning in both positive and negative ion modes were set as follows: Curtain Gas, 25 psi; Ion Source Gas 1, 50 psi; Ion Source Gas 2, 50 psi; Source Temperature, 500 °C; IonSpray Voltage Floating, 5,500/−4,500 V; Declustering Potential (DP), 60/−60 V; Collision Energy (CE), 10/−10 V, respectively. Metabolomics data were acquired using Information Dependent Acquisition (IDA) mass spectrometry mode (DDA mode in AB SCIEX). This mode provided data on TOF-MS primary parent ions (MS1) and high-sensitivity secondary product ions (MS2) for each sample. The IDA (cycle time 545 ms) was composed of a TOF-MS scan (accumulation time, 50 ms; CE, 10/−10 V) and 15 dependent product ion scans (accumulation time, 30 ms each; CE, 35/−35 V; CES, 15/–15 V) in the high-sensitivity mode with dynamic background subtraction. The MS1 scan range was m/z 100–1,500 Da, and the MS2 scan range was m/z 50–1,500 Da.

Establish the sequence of samples to be analyzed. Insert five QC injections at the beginning to equilibrate the instrument. Then insert blank and QC samples every five injections. Ensure that all sample names are unique, and set the instrument to calibrate every five injections.
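One way to generate such a sequence is sketched below. The naming scheme and the choice to place the blank and QC after each block of five samples are assumptions; the function is hypothetical and does not write any vendor-specific batch file.

```python
from __future__ import annotations


def build_sequence(
    samples: list[str], n_equilibration_qc: int = 5, block_size: int = 5
) -> list[str]:
    """Return an injection order with equilibration QCs and periodic blank/QC pairs."""
    sequence = [f"QC_equil_{i}" for i in range(1, n_equilibration_qc + 1)]
    for block, start in enumerate(range(0, len(samples), block_size), start=1):
        sequence.extend(samples[start:start + block_size])
        # Assumed order: one blank followed by one QC after each block of samples.
        sequence.append(f"Blank_{block}")
        sequence.append(f"QC_{block}")
    if len(set(sequence)) != len(sequence):
        raise ValueError("sample names must be unique")
    return sequence


for position, name in enumerate(build_sequence([f"S{i}" for i in range(1, 13)]), start=1):
    print(position, name)
```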

Data acquisition
Equilibrate the instrument until the pressure stabilizes, then start the queue. Monitor the system pressure and operating status during data acquisition. Check that the Total Ion Chromatograms (TIC) of the QC samples remain stable to avoid analytical drift. If significant deviations in retention times or peak responses are observed, re-acquire the data so that the quality criteria are met. After acquisition, clean the LC-QTOF-MS following the manufacturer's instructions. Archive the data, making sure to copy each wiff file together with its wiff.scan file and to keep them in the same path.

Annotation analysis

File format conversion
ABFConverter and MSConvert are stand-alone, open-source software tools that support raw MS data conversion from various vendors, including Agilent, Bruker, SCIEX, and Thermo, thereby facilitating analysis in free software such as MS-DIAL. For AB SCIEX instruments, convert wiff files to ABF format using ABFConverter, or convert to mzML format using MSConvert. In MSConvert, the Peak Picking based on the Vendor algorithm is used for filtering, and other parameters remain default.
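For batch conversion, the MSConvert settings above can also be applied from the command line. The sketch below assumes that the ProteoWizard `msconvert` executable is installed and on the system path; the wrapper function names and the file names are hypothetical.

```python
from __future__ import annotations

import subprocess


def build_msconvert_command(wiff_file: str, out_dir: str) -> list[str]:
    """Build an msconvert call: mzML output with vendor peak picking on all MS levels."""
    return [
        "msconvert", wiff_file,
        "--mzML",
        "--filter", "peakPicking vendor msLevel=1-",
        "-o", out_dir,
    ]


def convert(wiff_file: str, out_dir: str) -> None:
    """Run the conversion; requires msconvert and the matching wiff.scan file."""
    subprocess.run(build_msconvert_command(wiff_file, out_dir), check=True)


print(" ".join(build_msconvert_command("sample01.wiff", "mzML_out")))
```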

Metabolite annotation based on MS-DIAL (version 4.9 or higher)
Step 1: Start a project. Set the Ionization type, Separation type, MS method (Collision) type, Ion mode, and Target omics according to the specific experimental scheme. For ABF format data, select profile data as the MS1 and MS/MS data type; for mzML format data, select centroid data.

Step 2: Import the analysis files. The sample type is divided into sample, QC, and blank, and the sample class is marked according to the experimental design.

Step 3: Set the analysis parameters in MS-DIAL. For the Identification module, one recommended parameter group is an MS1 tolerance of 0.01 Da, an MS2 tolerance of 0.05 Da, and a cut-off score of 70% (in MS-DIAL version 4.9). In MS-DIAL version 5 or higher, more detailed identification parameters can be set. A relative spectrum amplitude cutoff of 1% is suggested to reduce the effect of noise peaks on matching scores. The dot product, weighted dot product, and reverse dot product scores are recommended to be set to 700, the matched spectrum percentage to 15%, and the minimum number of matched spectra to 3, so as to obtain better annotated metabolites.

Other parameter groups that meet the actual experimental requirements are also appropriate. The remaining parameters can be adjusted by referring to the MS-DIAL website (https://systemsomicslab.github.io/compms/msdial/main.html) or left at their defaults. Lipidomics uses the lipid database integrated within MS-DIAL; for metabolomics, reference databases can be drawn from a variety of open-source resources such as the MS-DIAL metabolomics MSP spectral kit, HMDB, MassBank, MoNA, RIKEN, or a database spliced together from them. Note that the database needs to be converted to MSP format and separated into positive and negative ion modes before importing.

Step 4: Remove features based on blank information (if blank samples are present), and export the alignment results based on peak area.

Metabolite supplementary annotation based on MetaboAnalystR
Note that the following method can be used only if the MS-DIAL analysis was performed on mzML format data.

Step 1: Import raw data files in mzML format. Sample and QC need to be placed in different folder paths.

Step 2: Convert the alignment results obtained from MS-DIAL to make them compatible with MetaboAnalystR (OptiLCMS). Specifically, rename the columns 'Average Rt (min)' and 'Average Mz' to 'rt' and 'mz', and swap the order of the two columns in Excel. Then convert the unit of retention time from minutes to seconds. Finally, ensure that the column names are in the first row.
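The same conversion can be scripted instead of being done by hand in Excel. The following is a minimal sketch using only the standard library; it assumes that the alignment table has already been read into a list of dictionaries (for example with `csv.DictReader`) and that the column names match those given above. The function name is hypothetical.

```python
from __future__ import annotations


def to_optilcms_rows(
    rows: list[dict],
    mz_col: str = "Average Mz",
    rt_col: str = "Average Rt (min)",
) -> list[dict]:
    """Rename the m/z and RT columns, put 'mz' before 'rt', and convert RT to seconds."""
    converted = []
    for row in rows:
        new_row = {"mz": float(row[mz_col]), "rt": float(row[rt_col]) * 60.0}
        # Keep any remaining columns after the two required ones.
        for key, value in row.items():
            if key not in (mz_col, rt_col):
                new_row[key] = value
        converted.append(new_row)
    return converted


# Made-up example row, for illustration only.
example = [{"Alignment ID": "0", "Average Rt (min)": "1.5", "Average Mz": "181.0707"}]
print(to_optilcms_rows(example))
```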

Step 3: MetaboAnalystR (OptiLCMS) analysis. Skip the peak picking and alignment steps, and directly import the alignment results derived from MS-DIAL to perform MS/MS deconvolution and annotation analysis on the raw data. The file format of the reference database needs to be converted from MSP to SQLite. Parameter setting and adjustment refer to the MetaboAnalyst website (www.metaboanalyst.ca). Annotation results are automatically saved in CSV format. A supplementary script can be added to the original script to generate dot similarity in the result, which is used for the credible filtering of annotated metabolites.

Post-processing of annotated metabolites

Step 1: Filter the annotation results according to the following recommended conditions. For annotation results from MS-DIAL version 4.9, select total score ≥ 70, dot product score ≥ 70, and reverse dot product score ≥ 70. In version 5 or higher, the detailed identification parameters are already defined during parameter setting, so the results do not need to be filtered again. For MetaboAnalystR annotation results, select total score ≥ 60 and dot similarity ≥ 0.7 (MetaboAnalystR does not use the reverse dot product to calculate spectral similarity). For lipidomics annotation results in MS-DIAL, select dot product score ≥ 55; this threshold differs from that used for metabolomics because the MS2 spectra of lipids are more complex. These score ranges are not mandatory and can be adjusted according to actual research needs. Finally, retain only results for which the relative standard deviation (RSD) of the QC samples is less than 30%. In this way, compounds with high qualitative confidence and stable detection in samples can be obtained.
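The filtering conditions for MS-DIAL version 4.9 results can be sketched as below. The function names and example peak areas are hypothetical, and the RSD is computed here with the sample standard deviation, which is an assumption because the calculation is not specified above.

```python
from __future__ import annotations

from statistics import mean, stdev


def rsd_percent(values: list[float]) -> float:
    """Relative standard deviation (%) using the sample standard deviation."""
    if len(values) < 2 or mean(values) == 0:
        return float("inf")  # cannot be assessed, so the feature fails the filter
    return stdev(values) / mean(values) * 100.0


def keep_msdial_annotation(
    total_score: float,
    dot_product: float,
    reverse_dot_product: float,
    qc_areas: list[float],
    min_score: float = 70.0,
    max_rsd: float = 30.0,
) -> bool:
    """Apply the recommended score thresholds and the QC RSD criterion."""
    return (
        total_score >= min_score
        and dot_product >= min_score
        and reverse_dot_product >= min_score
        and rsd_percent(qc_areas) < max_rsd
    )


# Made-up QC peak areas, for illustration only.
print(keep_msdial_annotation(85, 80, 75, [100, 110, 90]))
print(keep_msdial_annotation(85, 80, 75, [100, 200, 50]))
```

For MetaboAnalystR or lipidomics results, the same structure applies with the thresholds given above.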

Step 2: Unify compound names to facilitate the annotation of natural metabolites. Python is used to access the PubChem (https://pubchem.ncbi.nlm.nih.gov/) and LOTUS^[25] (https://lotus.naturalproducts.net/) APIs for batch processing. The script is shown in Algorithm 1. When a compound cannot be found in LOTUS but can be found in PubChem, carefully check whether it is a natural product.

Algorithm 1. The script for the unification of compound names.

import time

import pandas as pd
import requests

# Cache results so that each InChIKey is queried only once per database.
lotus_cache = {}
pubchem_cache = {}


def process_inchikey(inchikey):
    """Convert a non-standard InChIKey (flag 'N') to its standard form (flag 'S')."""
    if inchikey[-4] == 'N':
        return inchikey[:-4] + 'S' + inchikey[-3:]
    return inchikey


def get_lotus_name(inchikey):
    """Return the traditional name recorded in LOTUS, or "Not Found"."""
    inchikey = process_inchikey(inchikey)
    if inchikey in lotus_cache:
        return lotus_cache[inchikey]
    url = f"https://lotus.naturalproducts.net/api/search/simple?query={inchikey}"
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                data = response.json()
                if 'naturalProducts' in data and len(data['naturalProducts']) > 0:
                    product = data['naturalProducts'][0]
                    if 'traditional_name' in product:
                        lotus_cache[inchikey] = product['traditional_name']
                        return product['traditional_name']
                break  # a valid response without a hit will not change on retry
        except requests.exceptions.RequestException as e:
            print(f"Request error for {inchikey}: {e}")
        time.sleep(3)  # wait before retrying a failed request
    lotus_cache[inchikey] = "Not Found"
    return "Not Found"


def get_pubchem_name(inchikey):
    """Return the first synonym recorded in PubChem, or "Not Found"."""
    inchikey = process_inchikey(inchikey)
    if inchikey in pubchem_cache:
        return pubchem_cache[inchikey]
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/{inchikey}/synonyms/JSON"
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                data = response.json()
                if 'InformationList' in data and 'Information' in data['InformationList'] and len(data['InformationList']['Information']) > 0:
                    info = data['InformationList']['Information'][0]
                    if 'Synonym' in info and len(info['Synonym']) > 0:
                        pubchem_cache[inchikey] = info['Synonym'][0]
                        return info['Synonym'][0]
                break  # a valid response without a hit will not change on retry
        except requests.exceptions.RequestException as e:
            print(f"Request error for {inchikey}: {e}")
        time.sleep(3)  # wait before retrying a failed request
    pubchem_cache[inchikey] = "Not Found"
    return "Not Found"


xls = pd.ExcelFile("your input path")

with pd.ExcelWriter("your output path") as writer:
    for sheet_name in xls.sheet_names:
        df = pd.read_excel(xls, sheet_name=sheet_name)
        # The column name must match 'InChiKey' exactly (case-sensitive).
        if 'InChiKey' in df.columns:
            df['Lotus Name'] = df['InChiKey'].apply(get_lotus_name)
            df['Pubchem Name'] = df['InChiKey'].apply(get_pubchem_name)
            time.sleep(1)
        df.to_excel(writer, sheet_name=sheet_name, index=False)

Step 3 (Optional): Use the ClassyFire API (http://classyfire.wishartlab.com) to classify metabolites^[26]. The script is shown in Algorithm 2. Classification helps to further summarize the functional properties of the compounds. The classification dimensions (subclass, class, and superclass) can be freely selected to make the presentation clearer.

Algorithm 2. The script for the classification of metabolites.

import time

import pandas as pd
import requests


def process_inchikey(inchikey):
    """Convert a non-standard InChIKey (flag 'N') to its standard form (flag 'S')."""
    if inchikey[-4] == 'N':
        return inchikey[:-4] + 'S' + inchikey[-3:]
    return inchikey


def get_classification_info(inchikey):
    """Return (subclass, class, superclass) from ClassyFire, or "Error" for each."""
    inchikey = process_inchikey(inchikey)
    url = f"http://classyfire.wishartlab.com/entities/{inchikey}.json"
    max_retries = 5
    for attempt in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                classification = response.json()
                # A level may be missing or null, hence the "or {}" fallback.
                subclass = (classification.get('subclass') or {}).get('name', 'Not available')
                class_ = (classification.get('class') or {}).get('name', 'Not available')
                superclass = (classification.get('superclass') or {}).get('name', 'Not available')
                return subclass, class_, superclass
        except requests.exceptions.RequestException as e:
            print(f"Request error for {inchikey}: {e}")
        time.sleep(3)  # wait before retrying a failed request
    return "Error", "Error", "Error"


xls_named = pd.ExcelFile("your input path")

with pd.ExcelWriter("your output path") as writer:
    for sheet_name in xls_named.sheet_names:
        df = pd.read_excel(xls_named, sheet_name=sheet_name)
        # The column name must match 'InChiKey' exactly, without surrounding spaces.
        if 'InChiKey' in df.columns:
            df[['Subclass', 'Class', 'Superclass']] = df['InChiKey'].apply(lambda x: pd.Series(get_classification_info(x)))
            time.sleep(1)
        df.to_excel(writer, sheet_name=sheet_name, index=False)

Step 4: Filter out potential synthetic compounds and environmental contaminants, as these may result from mismatches in the annotation process. Their presence can impact the analysis of small-molecule compounds in fruits, vegetables, and related products, where the focus is primarily on natural food ingredients.

Troubleshooting

In practice, various problems may occur, especially during compound annotation. Table 1 lists some possible problems together with their possible causes and solutions, to help implement this protocol more smoothly.

Table 1. Troubleshooting table.

Problem	Possible reason	Solution
The compounds have the wrong adduct.	The database may contain non-standard adduct ion forms, leading to adduct misassignment in the annotation information of MS-DIAL.	Calculate the exact mass from the molecular formula and compare it with the average m/z detected; then correct the adduct ion so that it matches the difference between the exact mass and the average m/z. The script for batch calculation of exact mass is given in Algorithm 3.
Compounds are annotated repeatedly.	Deconvolution algorithms may not accurately separate the signals of complex mixtures, so the signal of one compound is split into multiple features that are each annotated by the database.	Select the best candidate by jointly considering the matching scores (total score, dot product score, etc.) and the response intensity.
The searched compound names show 'Not Found', or do not correspond to the name given by the analysis tools (not the same substance).	The InChiKey or compound name of the compounds recorded in the database is incorrect.	For compounds that could not be matched or were matched incorrectly, search manually on PubChem and correct the InChiKey or metabolite names.
Common error in Python script.	The column name does not contain a character matching 'InChiKey' (for Algorithm 1 and 2) or 'Formula' (for Algorithm 3). The file path is incorrect. The corresponding package is not installed.	Check the correctness of column names and file paths, and install the corresponding packages.
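
For the adduct problem in the table above, the comparison between the exact mass and the detected m/z can be sketched as follows. The adduct list is limited to a few common singly charged forms with their usual mass shifts, the 0.01 Da tolerance mirrors the MS1 tolerance recommended for MS-DIAL, and the function name and example values are hypothetical.

```python
from __future__ import annotations

# Mass shift (Da) added to the neutral monoisotopic mass for singly charged adducts.
ADDUCT_SHIFTS = {
    "positive": {
        "[M+H]+": 1.007276,
        "[M+NH4]+": 18.033823,
        "[M+Na]+": 22.989218,
        "[M+K]+": 38.963158,
    },
    "negative": {
        "[M-H]-": -1.007276,
        "[M+Cl]-": 34.969402,
        "[M+FA-H]-": 44.998201,
    },
}


def match_adduct(
    exact_mass: float, observed_mz: float, ion_mode: str, tol_da: float = 0.01
) -> str | None:
    """Return the adduct whose expected m/z is closest to the observed m/z.

    Returns None if no listed adduct lies within the tolerance.
    """
    best_name, best_error = None, tol_da
    for name, shift in ADDUCT_SHIFTS[ion_mode].items():
        error = abs(observed_mz - (exact_mass + shift))
        if error <= best_error:
            best_name, best_error = name, error
    return best_name


# Example with the monoisotopic mass of C6H12O6 (180.06339 Da).
print(match_adduct(180.06339, 203.0526, "positive"))
print(match_adduct(180.06339, 179.0561, "negative"))
```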

Algorithm 3. The script for the calculation of exact mass.

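import re

import pandas as pd
from pyteomics import mass


def calculate_monoisotopic_mass(chemical_formula):
    # Empty cells (NaN) and other non-string values cannot be parsed.
    if not isinstance(chemical_formula, str):
        return None
    # An element symbol followed by an optional atom count, e.g. 'C6' or 'O'.
    pattern = r'([A-Z][a-z]*)(\d*)'
    monoisotopic_mass = 0
    for symbol, count in re.findall(pattern, chemical_formula):
        count = int(count) if count else 1
        # nist_mass[symbol][0] holds (mass, abundance) of the most abundant isotope.
        monoisotopic_mass += count * mass.nist_mass[symbol][0][0]
    return monoisotopic_mass


# Replace with the path to the Excel workbook that holds the annotation results.
file_path = ' your path'

# Read the sheet names, then release the file handle before reopening for writing.
sheets = pd.ExcelFile(file_path)
sheet_names = sheets.sheet_names
sheets.close()

# Append mode with if_sheet_exists='replace' overwrites each sheet in place.
with pd.ExcelWriter(file_path, mode='a', engine='openpyxl', if_sheet_exists='replace') as writer:
    for sheet_name in sheet_names:
        sheet_df = pd.read_excel(file_path, sheet_name=sheet_name)
        # Process every column whose header contains 'Formula'.
        formula_columns = [col for col in sheet_df.columns if 'Formula' in str(col)]
        for formula_col in formula_columns:
            exact_mass_col = f'Exact Mass ({formula_col})'
            sheet_df[exact_mass_col] = sheet_df[formula_col].apply(calculate_monoisotopic_mass)
        sheet_df.to_excel(writer, sheet_name=sheet_name, index=False)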
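Once the exact mass is available, the adduct check described in Table 1 (comparing the exact mass with the detected average m/z) can be sketched as below. This is an illustrative example rather than part of the published scripts: the function name `infer_adduct`, the short list of adducts, and the 10 ppm tolerance are assumptions, and the shift values are standard monoisotopic mass differences for singly charged ions.

```python
# Mass shift (Da) from the neutral monoisotopic mass M to the singly charged ion.
# Only a few common adducts are listed; extend the table as needed.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033826,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
    "[M-H]-": -1.007276,
    "[M+HCOO]-": 44.998203,
}


def infer_adduct(exact_mass, observed_mz, tol_ppm=10.0):
    """Return the adduct whose theoretical m/z best matches observed_mz.

    Returns None when no listed adduct falls within tol_ppm (an assumed
    default; choose a tolerance suited to the instrument).
    """
    best_name, best_err = None, tol_ppm
    for name, shift in ADDUCT_SHIFTS.items():
        theoretical_mz = exact_mass + shift
        err_ppm = abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6
        if err_ppm <= best_err:
            best_name, best_err = name, err_ppm
    return best_name


# Example: C6H12O6 has a monoisotopic mass of about 180.063388 Da.
print(infer_adduct(180.063388, 203.0526))  # → [M+Na]+
print(infer_adduct(180.063388, 179.0561))  # → [M-H]-
```

If the inferred adduct differs from the one reported in the annotation, the reported adduct is a candidate for correction; a result of `None` suggests that the formula or the annotation itself should be rechecked.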

[1]	Wang K, Liao X, Xia J, Xiao C, Deng J, et al. 2023. Metabolomics: a promising technique for uncovering quality-attribute of fresh and processed fruits and vegetables. Trends in Food Science & Technology 142:104213 doi: 10.1016/j.jpgs.2023.104213 CrossRef Google Scholar
[2]	Shao X, Liu F, Shen Q, He W, Jia B, et al. 2024. Transcriptomics and metabolomics reveal major quality regulations during melon fruit development and ripening. Food Innovation and Advances 3:144−154 doi: 10.48130/fia-0024-0013 CrossRef Google Scholar
[3]	Rampler E, El Abiead Y, Schoeny H, Rusz M, Hildebrand F, et al. 2021. Recurrent topics in mass spectrometry-based metabolomics and lipidomics − standardization, coverage, and throughput. Analytical Chemistry 93:519−545 doi: 10.1021/acs.analchem.0c04698 CrossRef Google Scholar
[4]	Li Y, Shen R, Wang S, Zhang J, Deng J, et al. 2025. A comprehensive review on bioactive compounds from Lycium seeds: extraction, characterization, bioactivities, and applications. Food Innovation and Advances 4:212−227 doi: 10.48130/fia-0025-0020 CrossRef Google Scholar
[5]	Yu J, Xu L, Mi L, Zhang N, Liu F, et al. 2025. Integrated, high-throughput metabolomics approach for metabolite analysis of four sprout types. Food Chemistry 463:141182 doi: 10.1016/j.foodchem.2024.141182 CrossRef Google Scholar
[6]	Lacalle-Bergeron L, Izquierdo-Sandoval D, Sancho JV, López FJ, Hernández F, et al. 2021. Chromatography hyphenated to high resolution mass spectrometry in untargeted metabolomics for investigation of food (bio)markers. TrAC Trends in Analytical Chemistry 135:116161 doi: 10.1016/j.trac.2020.116161 CrossRef Google Scholar
[7]	Kohler I, Verhoeven M, Haselberg R, Gargano AFG. 2022. Hydrophilic interaction chromatography − mass spectrometry for metabolomics and proteomics: state-of-the-art and current trends. Microchemical Journal 175:106986 doi: 10.1016/j.microc.2021.106986 CrossRef Google Scholar
[8]	Tang DQ, Zou L, Yin XX, Ong CN. 2016. HILIC-MS for metabolomics: an attractive and complementary approach to RPLC-MS. Mass Spectrometry Reviews 35:574−600 doi: 10.1002/mas.21445 CrossRef Google Scholar
[9]	Lv J, Zhang L, Yan F, Wang X. 2018. Clinical lipidomics: a new way to diagnose human diseases. Clinical and Translational Medicine 7:e12 doi: 10.1186/s40169-018-0190-9 CrossRef Google Scholar
[10]	Kostidis S, Sánchez-López E, Giera M. 2023. Lipidomics analysis in drug discovery and development. Current Opinion in Chemical Biology 72:102256 doi: 10.1016/j.cbpa.2022.102256 CrossRef Google Scholar
[11]	Tietel Z, Hammann S, Meckelmann SW, Ziv C, Pauling JK, et al. 2023. An overview of food lipids toward food lipidomics. Comprehensive Reviews in Food Science and Food Safety 22:4302−4354 doi: 10.1111/1541-4337.13225 CrossRef Google Scholar
[12]	Zhang X, Su M, Long Z, Du J, Zhou H, et al. 2024. Quantitative lipidomics reveals lipid differences among peach (Prunus persica L. Batsch) fruits with varying textures. LWT 201:116226 doi: 10.1016/j.lwt.2024.116226 CrossRef Google Scholar
[13]	Wang K, Xu L, Wang X, Chen A, Xu Z. 2021. Discrimination of beef from different origins based on lipidomics: a comparison study of DART-QTOF and LC-ESI-QTOF. LWT 149:111838 doi: 10.1016/j.lwt.2021.111838 CrossRef Google Scholar
[14]	Tsugawa H, Cajka T, Kind T, Ma Y, Higgins B, et al. 2015. MS-DIAL: data-independent MS/MS deconvolution for comprehensive metabolome analysis. Nature Methods 12:523−526 doi: 10.1038/nmeth.3393 CrossRef Google Scholar
[15]	Pang Z, Lu Y, Zhou G, Hui F, Xu L, et al. 2024. MetaboAnalyst 6.0: towards a unified platform for metabolomics data processing, analysis and interpretation. Nucleic Acids Research 52:W398−W406 doi: 10.1093/nar/gkae253 CrossRef Google Scholar
[16]	Pang Z, Zhou G, Ewald J, Chang L, Hacariz O, et al. 2022. Using MetaboAnalyst 5.0 for LC – HRMS spectra processing, multi-omics integration and covariate adjustment of global metabolomics data. Nature Protocols 17:1735−1761 doi: 10.1038/s41596-022-00710-w CrossRef Google Scholar
[17]	Pang Z, Xu L, Viau C, Lu Y, Salavati R, et al. 2024. MetaboAnalystR 4.0: a unified LC-MS workflow for global metabolomics. Nature Communications 15:3675 doi: 10.1038/s41467-024-48009-6 CrossRef Google Scholar
[18]	Cubero-Leon E, De Rudder O, Maquet A. 2018. Metabolomics for organic food authentication: results from a long-term field study in carrots. Food Chemistry 239:760−770 doi: 10.1016/j.foodchem.2017.06.161 CrossRef Google Scholar
[19]	Bligh EG, Dyer WJ. 1959. A rapid method of total lipid extraction and purification. Canadian Journal of Biochemistry and Physiology 37:911−917 doi: 10.1139/o59-099 CrossRef Google Scholar
[20]	Folch J, Lees M, Sloane Stanley GH. 1957. A simple method for the isolation and purification of total lipides from animal tissues. Journal of Biological Chemistry 226:497−509 doi: 10.1016/S0021-9258(18)64849-5 CrossRef Google Scholar
[21]	Periat A, Krull IS, Guillarme D. 2015. Applications of hydrophilic interaction chromatography to amino acids, peptides, and proteins. Journal of Separation Science 38:357−367 doi: 10.1002/jssc.201400969 CrossRef Google Scholar
[22]	Nielsen NJ, Tomasi G, Christensen JH. 2016. Evaluation of chromatographic conditions in reversed phase liquid chromatography-mass spectrometry systems for fingerprinting of polar and amphiphilic plant metabolites. Analytical and Bioanalytical Chemistry 408:5855−5865 doi: 10.1007/s00216-016-9700-z CrossRef Google Scholar
[23]	Tsiantas K, Konteles SJ, Kritsi E, Sinanoglou VJ, Tsiaka T, et al. 2022. Effects of non-polar dietary and endogenous lipids on gut microbiota alterations: the role of lipidomics. International Journal of Molecular Sciences 23:4070 doi: 10.3390/ijms23084070 CrossRef Google Scholar
[24]	Xu L, Xu Z, Strashnov I, Liao X. 2020. Use of information dependent acquisition mass spectra and sequential window acquisition of all theoretical fragment-ion mass spectra for fruit juices metabolomics and authentication. Metabolomics 16:81 doi: 10.1007/s11306-020-01701-2 CrossRef Google Scholar
[25]	Rutz A, Sorokina M, Galgonek J, Mietchen D, Willighagen E, et al. 2022. The LOTUS initiative for open knowledge management in natural products research. eLife 11:e70780 doi: 10.7554/eLife.70780 CrossRef Google Scholar
[26]	Djoumbou Feunang Y, Eisner R, Knox C, Chepelev L, Hastings J, et al. 2016. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. Journal of Cheminformatics 8(1):61 doi: 10.1186/s13321-016-0174-y CrossRef Google Scholar
[27]	Kirwan JA, Gika H, Beger RD, Bearden D, Dunn WB, et al. 2022. Quality assurance and quality control reporting in untargeted metabolic phenotyping: mQACC recommendations for analytical quality management. Metabolomics 18:70 doi: 10.1007/s11306-022-01926-3 CrossRef Google Scholar
[28]	Blaženović I, Kind T, Ji J, Fiehn O. 2018. Software tools and approaches for compound identification of LC-MS/MS data in metabolomics. Metabolites 8:31 doi: 10.3390/metabo8020031 CrossRef Google Scholar
[29]	Khan WA, Hu H, Ann Cuin T, Hao Y, Ji X, et al. 2022. Untargeted metabolomics and comparative flavonoid analysis reveal the nutritional aspects of pak choi. Food Chemistry 383:132375 doi: 10.1016/j.foodchem.2022.132375 CrossRef Google Scholar
[30]	Louis A, Chich JF, Chepca H, Schmitz I, Hugueney P, et al. 2025. Green extraction method: microwave-assisted water extraction followed by HILIC-HRMS analysis to quantify hydrophilic compounds in plants. Metabolites 15:223 doi: 10.3390/metabo15040223 CrossRef Google Scholar
