Analytica Chimica Acta, volume 1147, pages 64-71

DeepReI: Deep learning-based gas chromatographic retention index predictor

Tomáš Vrzal 1
Michaela Malečková 2, 3
Jana Olšovská 1
1 Research Institute of Brewing and Malting, Plc., Lípová 511/15, 120 44, Prague 2, Czech Republic
2 Research Institute of Brewing and Malting, Plc., Lípová 511/15, 120 44, Prague 2, Czech Republic
Publication type: Journal Article
Publication date: 2021-02-01
scimago Q1
wos Q1
SJR: 0.998
CiteScore: 10.4
Impact factor: 5.7
ISSN: 0003-2670, 1873-4324
Biochemistry
Spectroscopy
Analytical Chemistry
Environmental Chemistry
Abstract
The retention index is an essential tool for reliable analyte identification in gas chromatographic analyses. Many libraries now provide retention indices for a large number of compounds on distinct stationary phase chemistries. However, identification becomes complicated for unknown unknowns, i.e., compounds not present in such libraries. The importance of identifying these compounds has risen with the rapidly expanding interest in non-targeted analyses over the last decade. Precise in silico prediction of retention indices from a suggested molecular structure is therefore highly valuable in such situations. On this basis, a predictive model based on deep learning was developed and is presented in this paper. It is designed for user-friendly and accurate prediction of retention indices of compounds in gas chromatography with the semi-standard non-polar stationary phase. The Simplified Molecular Input Line Entry System (SMILES) is used as the model’s input. The architecture of the model consists of 2D-convolutional layers, together with batch normalization, max pooling, dropout, and three residual connections. The model reaches a median absolute error of retention index prediction of 16.4 and 16.0 units for the validation and test sets, respectively. The median percentage error is lower than or equal to 0.81% for all of the mentioned data sets. Finally, the DeepReI model is provided as an R package and is available at https://github.com/TomasVrzal/DeepReI together with a user-friendly graphical user interface.
• An advanced model for retention index prediction of compounds in GC was developed.
• The model is based on a convolutional neural network and advanced approaches.
• The median percentage error of prediction is ≤ 0.81%.
• The model is publicly available as the R package DeepReI.
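The two error metrics quoted in the abstract can be illustrated with a short Python sketch. It assumes the usual definitions (median of the absolute errors, and median of the absolute errors expressed as a percentage of the observed value); the abstract does not spell out the formulas, and the retention index values below are made up for illustration only.

```python
from statistics import median


def median_absolute_error(observed, predicted):
    """Median of |observed - predicted| over all compounds."""
    return median(abs(o - p) for o, p in zip(observed, predicted))


def median_percentage_error(observed, predicted):
    """Median of |observed - predicted| as a percentage of the observed value."""
    return median(100.0 * abs(o - p) / o for o, p in zip(observed, predicted))


# Hypothetical retention indices, not data from the paper.
observed_ri = [1000.0, 1200.0, 1500.0]
predicted_ri = [1010.0, 1188.0, 1530.0]

print(median_absolute_error(observed_ri, predicted_ri))
print(median_percentage_error(observed_ri, predicted_ri))
```

Medians are less sensitive than means to a few badly predicted compounds, which is one plausible reason for reporting them for large retention index data sets.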
Matyushin D.D., Sholokhova A.Y., Buryak A.K.
Analytical Chemistry scimago Q1 wos Q1
2020-08-12 citations by CoLab: 49 Abstract  
Preliminary compound identification and peak annotation in gas chromatography-mass spectrometry is usually made using mass spectral databases. There are a few algorithms that enable performing a search of a spectrum in a large mass spectral library. In many cases, a library search procedure returns a wrong answer even if a correct compound is contained in a library. In this work, we present a deep learning driven approach to a library search in order to reduce the probability of such cases. Machine learning ranking (learning to rank) is a class of machine learning and deep learning algorithms that perform a comparison (ranking) of objects. This work introduces the usage of deep learning ranking for small molecules identification using low-resolution electron ionization mass spectrometry. Instead of simple similarity measures for two spectra, such as the dot product or the Euclidean distance between vectors that represent spectra, a deep convolutional neural network is used. The deep learning ranking model outperforms other approaches and enables reducing a fraction of wrong answers (at rank-1) by 9-23% depending on the used data set. Spectra from the Golm Metabolome Database, Human Metabolome Database, and FiehnLib were used for testing the model.
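For context, the simple baseline that the abstract contrasts with the learned ranking model can be sketched as a normalized dot product (cosine similarity) between two spectra. This is only an illustration of that baseline, not the authors' ranking model; the dict-based spectrum representation and the peak values are assumptions.

```python
import math


def cosine_similarity(spectrum_a, spectrum_b):
    """Normalized dot product between two spectra.

    Each spectrum is a dict mapping an integer m/z value to an intensity.
    """
    shared = set(spectrum_a) & set(spectrum_b)
    dot = sum(spectrum_a[mz] * spectrum_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spectrum_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spectrum_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical low-resolution EI spectra (m/z -> intensity).
query = {43: 100.0, 58: 30.0, 71: 10.0}
library_hit = {43: 90.0, 58: 35.0, 85: 5.0}
print(round(cosine_similarity(query, library_hit), 3))
```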
Ji H., Deng H., Lu H., Zhang Z.
Analytical Chemistry scimago Q1 wos Q1
2020-06-17 citations by CoLab: 72 Abstract  
Electron ionization-mass spectrometry (EI-MS) hyphenated to gas chromatography (GC) is the workhorse for analyzing volatile compounds in complex samples. The spectral matching method can only identify compounds within the spectral database. In response, we present a deep-learning-based approach (DeepEI) for structure elucidation of an unknown compound with its EI-MS spectrum. DeepEI employs deep neural networks to predict molecular fingerprints from an EI-MS spectrum and searches the molecular structure database with the predicted fingerprints. We evaluated DeepEI with MassBank spectra, and the results indicate DeepEI is an effective identification method. In addition, DeepEI can work cooperatively with database spectral matching and NEIMS (fingerprint to spectrum method) to improve identification accuracy.
Miccio L.A., Schwartz G.A.
Polymer scimago Q1 wos Q2
2020-04-01 citations by CoLab: 62 Abstract  
In this work, convolutional-fully connected neural networks were designed and trained to predict the glass transition temperature of polymers based only on their chemical structure. This approach has been shown to successfully predict the Tg of unknown polymers with average relative errors as low as 6%. Several networks with different architectures or hyperparameters were successfully trained using a previously studied glass transition temperature dataset for validation, and then the same method was employed for an extended dataset, with larger Tg dispersion and polymer structure variability. This approach has been shown to be accurate and reliable, and does not require any time-consuming or expensive measurements and calculations as inputs. Furthermore, it is expected that this method can be easily extended to predict other properties. The possibility of predicting the properties of polymers not yet synthesized will save time and resources for industrial development as well as accelerate the scientific understanding of structure-property relationships in polymer science.
Matyushin D.D., Sholokhova A.Y., Buryak A.K.
Journal of Chromatography A scimago Q2 wos Q1
2019-12-01 citations by CoLab: 46 Abstract  
A deep convolutional neural network was used for the estimation of gas chromatographic retention indices on non-polar (polydimethylsiloxane and polydimethyl(5%-phenyl) siloxane) stationary phases. The neural network can be used for candidate ranking while searching a mass spectral database. A linear representation (SMILES notation) of the molecule structure was used as an input for the model. The input line was converted to a one-hot matrix and then directly processed by the neural network. The calculation of any common molecular descriptors is avoided, following the modern tendency in machine learning: to allow the neural network to find the most preferable features by itself instead of using hard-coded features. The model has two 1D-convolutional layers with 120 neurons each followed by a pooling layer and a fully-connected layer with 200 hidden neurons. The model was compared with state-of-the-art models for prediction of gas chromatographic indices based on molecular descriptors and on functional groups contributions. On different data sets better accuracy is shown together with greater versatility. The applicability to diverse sets of flavors and fragrances, essential oils, metabolites is shown. The possibility of using the model for improvement of mass spectral identification (without reference retention index) is demonstrated. The median absolute error and the median percentage error are in the range of 17.3 (0.93%) to 38.1 (2.15%) depending on used test data set. Ready-to-use neural network parameters are provided.
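The conversion of a SMILES string to a one-hot matrix, as described in this abstract, might look like the following Python sketch. The character-level alphabet, the fixed `max_len`, and the function name are assumptions for illustration; the published models may tokenize differently (for example, treating `Cl` or `Br` as single tokens).

```python
def one_hot_smiles(smiles, alphabet, max_len):
    """Encode a SMILES string as a max_len x len(alphabet) one-hot matrix.

    Rows are string positions, columns are alphabet characters.
    Positions beyond the end of the string stay all-zero (padding).
    """
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(max_len)]
    for pos, ch in enumerate(smiles):
        if ch not in index:
            raise ValueError(f"character {ch!r} is not in the alphabet")
        matrix[pos][index[ch]] = 1
    return matrix


# Hypothetical, deliberately tiny alphabet; acetic acid as the example molecule.
ALPHABET = "CO=()"
encoded = one_hot_smiles("CC(=O)O", ALPHABET, max_len=10)
for row in encoded:
    print(row)
```

The resulting matrix can then be fed to convolutional layers, so the network learns its own features instead of relying on precomputed molecular descriptors.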
Li M., Wang X.R.
Journal of Chromatography A scimago Q2 wos Q1
2019-10-01 citations by CoLab: 35 Abstract  
We present ChromAlignNet, a deep learning model for alignment of peaks in Gas Chromatography-Mass Spectrometry (GC-MS) data. In GC-MS data, a compound's retention time (RT) may not stay fixed across multiple chromatograms. Using GC-MS data for biomarker discovery requires aligning the RTs of identical analytes across different samples. Current methods of alignment are all based on a set of formal, mathematical rules. We present a solution to GC-MS alignment using deep learning neural networks, which are more adept at handling complex, fuzzy data sets. We tested our model on several GC-MS data sets of various complexities and analysed the alignment results quantitatively. We show the model has very good performance (AUC ∼ 1 for simple data sets and AUC ∼ 0.85 for very complex data sets). Further, our model easily outperforms existing algorithms on complex data sets. Compared with existing methods, ChromAlignNet is very easy to use as it requires no user input of reference chromatograms and parameters. This method can easily be adapted to other similar data such as those from liquid chromatography. The source code is written in Python and available online.
Vrzal T., Olšovská J.
Analytica Chimica Acta scimago Q1 wos Q1
2019-06-01 citations by CoLab: 8 Abstract  
The contamination of many products by nitroso compounds has been discussed since the 1970s and has been partially solved, namely for contamination by carcinogenic volatile N-nitrosamines. However, there is still a gap in knowledge of non-volatile nitroso compounds in terms of both the determination of these compounds and the description of their toxicity. Therefore, a procedure for their detailed non-targeted study needs to be developed. Based on these facts, a new method was developed that permits the detection and classification of nitroso compound groups, such as N-nitroso and C-nitroso compounds, and of interfering substances in nitrosamine-specific chemiluminescence detection after gas chromatographic separation. The method is based on signal profiling of chromatographic peaks recorded by a chemiluminescence detector at different pyrolytic temperatures and subsequent multivariate chemometric classification. The resulting classification function obtained by linear discriminant analysis shows good performance, with a total accuracy of 96.12% after method validation. The method was successfully applied and demonstrated on a non-targeted beer sample analysis. Nitroso compounds detected by the method were selected for detailed structural analysis by GC-MS/MS. The combination of the presented method with MS/MS instrumentation provides a powerful analytical tool for the identification of unknown nitroso compounds in complex samples. This study represents a valuable contribution to the protocols for identification of organic compounds with nitrogen functional groups, namely the toxicologically and analytically important nitroso compounds.
Fan X., Ming W., Zeng H., Zhang Z., Lu H.
The Analyst scimago Q2 wos Q2
2019-01-24 citations by CoLab: 153 Abstract  
DeepCID can achieve high accuracy, excellent sensitivity and few false positives for component identification in mixtures based on Raman spectroscopy and deep learning.
Hirohara M., Saito Y., Koda Y., Sato K., Sakakibara Y.
BMC Bioinformatics scimago Q1 wos Q1 Open Access
2018-12-31 citations by CoLab: 144 PDF Abstract  
Previous studies have suggested deep learning to be a highly effective approach for screening lead compounds for new drugs. Several deep learning models have been developed by addressing the use of various kinds of fingerprints and graph convolution architectures. However, these methods are either advantageous or disadvantageous depending on whether they (1) can distinguish structural differences including chirality of compounds, and (2) can automatically discover effective features. We developed another deep learning model for compound classification. In this method, we constructed a distributed representation of compounds based on the SMILES notation, which linearly represents a compound structure, and applied the SMILES-based representation to a convolutional neural network (CNN). The use of SMILES allows us to process all types of compounds while incorporating a broad range of structure information, and representation learning by CNN automatically acquires a low-dimensional representation of input features. In a benchmark experiment using the TOX 21 dataset, our method outperformed conventional fingerprint methods, and performed comparably against the winning model of the TOX 21 Challenge. Multivariate analysis confirmed that the chemical space consisting of the features learned by SMILES-based representation learning adequately expressed a richer feature space that enabled the accurate discrimination of compounds. Using motif detection with the learned filters, not only important known structures (motifs) such as protein-binding sites but also structures of unknown functional groups were detected. The source code of our SMILES-based convolutional neural network software in the deep learning framework Chainer is available at http://www.dna.bio.keio.ac.jp/smiles/ , and the dataset used for performance evaluation in this work is available at the same URL.
Dossin E., Martin E., Diana P., Castellon A., Monge A., Pospisil P., Bentley M., Guy P.A.
Analytical Chemistry scimago Q1 wos Q1
2016-07-22 citations by CoLab: 30 Abstract  
Monitoring of volatile and semivolatile compounds was performed using gas chromatography (GC) coupled to high-resolution electron ionization mass spectrometry, using both headspace and liquid injection modes. A total of 560 reference compounds, including 8 odd n-alkanes, were analyzed and experimental linear retention indices (LRI) were determined. These reference compounds were randomly split into training (n = 401) and test (n = 151) sets. LRI for all 552 reference compounds were also calculated based upon computational Quantitative Structure-Property Relationship (QSPR) models, using two independent approaches: RapidMiner (coupled to Dragon) and ACD/ChromGenius software. Correlation coefficients for experimental versus predicted LRI values for the training and test set compounds were 0.966 and 0.949 for RapidMiner and 0.977 and 0.976 for ACD/ChromGenius, respectively. In addition, the cross-validation correlation from RapidMiner was 0.96, and the residual standard error obtained from ACD/ChromGenius was 53.635. These models were then used to predict LRI values for several thousand compounds reported present in tobacco and tobacco-related fractions, plus a range of specific flavor compounds. It was demonstrated that using the mean of the LRI values predicted by RapidMiner and ACD/ChromGenius, in combination with accurate mass data, could enhance the confidence level for compound identification from the analysis of complex matrixes, particularly when the two predicted LRI values for a compound were in close agreement. Application of this LRI modeling approach to matrixes with unknown composition has already enabled the confirmation of 23 postulated compounds, demonstrating its ability to facilitate compound identification in an analytical workflow. The goal is to reduce the list of putative candidates to a reasonable relevant number that can be obtained and measured for confirmation.
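As background for the experimental LRI values mentioned above, a linear (temperature-programmed) retention index is commonly computed by interpolating the analyte's retention time between the two bracketing n-alkanes (the van den Dool and Kratz approach). The Python sketch below generalizes this to alkanes that are not consecutive, such as a ladder of odd n-alkanes; the abstract does not describe the authors' exact procedure, and the retention times are hypothetical.

```python
def linear_retention_index(t_analyte, alkanes):
    """Linear retention index from bracketing n-alkanes.

    alkanes maps carbon number -> retention time (distinct times assumed).
    LRI = 100 * (n + (N - n) * (t_x - t_n) / (t_N - t_n)),
    where n and N are the carbon numbers of the bracketing alkanes.
    """
    ladder = sorted(alkanes.items(), key=lambda item: item[1])
    for (n_low, t_low), (n_high, t_high) in zip(ladder, ladder[1:]):
        if t_low <= t_analyte <= t_high:
            fraction = (t_analyte - t_low) / (t_high - t_low)
            return 100.0 * (n_low + (n_high - n_low) * fraction)
    raise ValueError("analyte elutes outside the n-alkane ladder")


# Hypothetical retention times (minutes) for C9, C11 and C13.
odd_alkanes = {9: 5.0, 11: 9.0, 13: 14.0}
print(linear_retention_index(7.0, odd_alkanes))
```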
Wilson M.B., Barnes B.B., Boswell P.G.
Journal of Chromatography A scimago Q2 wos Q1
2014-12-01 citations by CoLab: 7 Abstract  
Programmed-temperature gas chromatographic (GC) retention information is difficult to share because it depends on so many experimental factors that vary among laboratories. Though linear retention indexing cannot properly account for experimental differences, retention times can be accurately calculated, or "projected", from shared isothermal retention vs. temperature (T) relationships, but only if the temperature program and hold-up time vs. T profile produced by a GC are known with great precision. The effort required to measure these profiles was previously impractical, but we recently showed that they can be easily back-calculated from the programmed-temperature retention times of a set of 25 n-alkanes using open-source software at www.retentionprediction.org/gc. In a multi-lab study, the approach was shown to account for both intentional and unintentional differences in the temperature programs, flow rates, and inlet pressures produced by the GCs. Here, we tested 16 other experimental factors and found that only 5 could reduce accuracy in retention projections: injection history, exposure to very high levels of oxygen at high temperature, a very low transfer line temperature, an overloaded column, and a very short column (≤15 m). We find that the retention projection methodology acts as a hybrid of conventional retention projection and retention indexing, drawing on the advantages of both; it properly accounts for a wide range of experimental conditions while accommodating the effects of experimental factors not properly taken into account in the calculations. Finally, we developed a four-step protocol to efficiently troubleshoot a GC system after it is found to be yielding inaccurate retention projections.
Zhang J., Koo I., Wang B., Gao Q., Zheng C., Zhang X.
Journal of Chromatography A scimago Q2 wos Q1
2012-08-01 citations by CoLab: 20 Abstract  
▸ A large-scale test dataset was created from the NIST repetitive MS and RI library.
▸ RI was integrated with three MS similarity measures to verify compound identification.
▸ Both the RI threshold and the MS similarity measure can influence the final identification result.

Retention index (RI) is useful for metabolite identification. However, when RI is integrated with mass spectral similarity for metabolite identification, many conflicting RI threshold settings are reported in the literature. In this study, a large-scale test dataset of 5844 compounds with both mass spectra and RI information was created from the National Institute of Standards and Technology (NIST) repetitive mass spectra (MS) and RI library. Three MS similarity measures, the NIST composite measure, the real part of the Discrete Fourier Transform (DFT.R), and the detail of the Discrete Wavelet Transform (DWT.D), were used to investigate the accuracy of compound identification using the test dataset. To imitate real identification experiments, the NIST MS main library was employed as the reference library and the test dataset was used as search data. Our study shows that the optimal RI thresholds are 22, 15, and 15 i.u. for the NIST composite, DFT.R, and DWT.D measures, respectively, when RI and mass spectral similarity are integrated for compound identification. Compared to mass spectrum matching alone, using both RI and mass spectral matching can improve the identification accuracy by 1.7%, 3.5%, and 3.5% for the three mass spectral similarity measures, respectively. It is concluded that the improvement from RI matching for compound identification depends heavily on the MS similarity measure and the accuracy of the RI data.
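One simple way to combine an RI threshold with a mass spectral similarity score is to discard candidates whose library RI falls outside the window and rank the rest by similarity. The Python sketch below assumes this hard-window scheme; the abstract does not state how the study integrates the two criteria, and the candidate list and field names are hypothetical. The threshold of 22 i.u. is the value the abstract reports for the NIST composite measure.

```python
def filter_and_rank(candidates, experimental_ri, ri_threshold):
    """Keep candidates within the RI window, then sort by MS similarity.

    Each candidate is a dict with the keys "name", "ri" and "similarity".
    """
    kept = [
        c for c in candidates
        if abs(c["ri"] - experimental_ri) <= ri_threshold
    ]
    return sorted(kept, key=lambda c: c["similarity"], reverse=True)


# Hypothetical library hits for one unknown peak.
hits = [
    {"name": "candidate A", "ri": 1000, "similarity": 0.90},
    {"name": "candidate B", "ri": 1050, "similarity": 0.95},
    {"name": "candidate C", "ri": 1015, "similarity": 0.80},
]
for hit in filter_and_rank(hits, experimental_ri=1005, ri_threshold=22):
    print(hit["name"], hit["similarity"])
```

In this example the hit with the best spectral match is rejected because its RI lies 45 units away from the measured value, which is the kind of correction RI matching is meant to provide.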
Babushok V.I., Linstrom P.J., Zenkevich I.G.
2011-11-29 citations by CoLab: 652 Abstract  
Gas chromatographic retention indices were evaluated for 505 frequently reported plant essential oil components using a large retention index database. Retention data are presented for three types of commonly used stationary phases: dimethyl silicone (nonpolar), dimethyl silicone with 5% phenyl groups (slightly polar), and polyethylene glycol (polar) stationary phases. The evaluations are based on the treatment of multiple measurements with the number of data records ranging from about 5 to 800 per compound. Data analysis was limited to temperature programmed conditions. The data reported include the average and median values of retention index with standard deviations and confidence intervals.
Araujo P., Nguyen T., Frøyland L., Wang J., Kang J.X.
Journal of Chromatography A scimago Q2 wos Q1
2008-11-01 citations by CoLab: 101 Abstract  
A simplified method for quantitative analysis of fatty acids in various matrices by gas chromatography is proposed as an alternative to the conventional method and the variables of the protocol examined to optimize the processing conditions. The modified method involves direct methylation of fatty acids in homogenized samples with boron trihalide (BF(3) or BCl(3) in methanol) followed by extraction with hexane. The addition of hexane to the reaction mixture after the methylation process can enhance the efficiency of fatty acid methylation and is critical for those samples that contain high levels of triglycerides. A mechanism underlying this effect is proposed.
Cao Y., Charisi A., Cheng L.-., Jiang T., Girke T.
Bioinformatics scimago Q1 wos Q1 Open Access
2008-07-02 citations by CoLab: 291 PDF Abstract  
Software applications for structural similarity searching and clustering of small molecules play an important role in drug discovery and chemical genomics. Here, we present the first open-source compound mining framework for the popular statistical programming environment R. The integration with a powerful statistical environment maximizes the flexibility, expandability and programmability of the provided analysis functions. We discuss the algorithms and compound mining utilities provided by the R package ChemmineR. It contains functions for structural similarity searching, clustering of compound libraries with a wide spectrum of classification algorithms and various utilities for managing complex compound data. It also offers a wide range of visualization functions for compound clusters and chemical structures. The package is well integrated with the online ChemMine environment and allows bidirectional communications between the two services. ChemmineR is freely available as an R package from the ChemMine project site: http://bioweb.ucr.edu/ChemMineV2/chemminer
Goodner K.L.
2008-07-01 citations by CoLab: 175 Abstract  
High-quality regression models of gas chromatographic retention indices were generated for OV-101 (R=0.997), DB-1 (R=0.998), DB-5 (R=0.997), and DB-Wax (R=0.982) using 91, 57, 94, and 102 compounds, respectively. The models were generated using a second-order equation including the cross product utilizing two easily obtained variables, boiling point and the log octanol-water coefficient. Additionally, a method for determining outlier data (the GOodner Outlier Determination (GOOD) method) is presented, which is a combination of several outlier tests and is less prone to discarding legitimate data.
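A second-order model in two variables with a cross product, as described in this abstract, would typically take the general form below. The symbols are assumptions introduced here for illustration: T_b is the boiling point, log P the log octanol-water coefficient, and β_0 to β_5 are fitted coefficients, which the abstract does not report.

```latex
\mathrm{RI} = \beta_0 + \beta_1 T_b + \beta_2 \log P
            + \beta_3 T_b^{2} + \beta_4 (\log P)^{2}
            + \beta_5 \, T_b \log P
```

A separate set of coefficients would be fitted for each stationary phase (OV-101, DB-1, DB-5, DB-Wax).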
Lin H., Zhong C., Wen R., Ma T.H., He D., Martin J.W., Goss G.G., Alessi D.S., He Y.
Water Research scimago Q1 wos Q1
2025-01-01 citations by CoLab: 1
Yang Q., Zhang H., Wang Y., Tan L., Xie T., Wang Y., Long J., Guo Z., Zhang Z., Lu H.
Analytical Chemistry scimago Q1 wos Q1
2024-12-19 citations by CoLab: 0
Matyushin D.D., Sholokhova A.Y., Khrisanfov M.D., Borovikova S.A.
2024-12-01 citations by CoLab: 0 Abstract  
When predicting retention indices using deep learning, there is typically no way to assess the reliability of predictions for specific molecules. The present study demonstrates, using stationary phases based on polyethylene glycol and NIST 17 database, that predictions are generally more accurate when the training dataset includes molecules structurally similar to the compound for which prediction is made. The Tanimoto similarity of “molecular fingerprints” ECFP is the most suitable algorithm for this task among the four algorithms considered. For several transformation products of unsymmetrical dimethylhydrazine whose structures were established using such predictions, the predictions were shown to be unreliable.
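The Tanimoto similarity mentioned above can be sketched in Python for fingerprints represented as sets of "on" bit indices. Computing real ECFP fingerprints requires a cheminformatics toolkit and is not shown; the bit sets and the helper `max_similarity_to_training` are hypothetical illustrations of using nearest-neighbour similarity as a rough reliability indicator.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # two empty fingerprints: treated as dissimilar here
    return len(fp_a & fp_b) / union


def max_similarity_to_training(query_fp, training_fps):
    """Highest Tanimoto similarity between the query and any training molecule."""
    return max((tanimoto(query_fp, fp) for fp in training_fps), default=0.0)


# Hypothetical fingerprints (sets of on-bit indices).
training = [{1, 2, 3, 4}, {10, 11, 12}]
query = {3, 4, 5, 6}
print(round(max_similarity_to_training(query, training), 3))
```

A low maximum similarity suggests that the query molecule lies far from the training data, so its predicted retention index should be treated with more caution.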
Sholokhova A.Y., Matyushin D.D.
Journal of Separation Science scimago Q2 wos Q2
2024-11-04 citations by CoLab: 0 Abstract  
Retention index prediction based on the molecule structure is not often used in practice due to low accuracy, the need to use paid software to calculate molecular descriptors (MD), and the narrow applicability domain of many models. In recent years, relatively accurate and versatile deep learning (DL)‐based models have emerged. These models are now used in practice as an additional criterion in gas chromatography‐mass spectrometry identification. The DB‐225ms stationary phase (usually described as 50%‐cyanopropylphenyl‐50%‐dimethylpolysiloxane in available sources) is widely used, but ready‐to‐use retention index estimation models are not available for it. This study presents such models. The models are linear and use simple constitutional MD and retention indices predicted by DL for the DB‐WAX and DB‐624 stationary phases as MD (we show that it is their use that allows us to achieve satisfactory accuracy). The accuracy obtained for a completely unseen hold‐out test set is: root mean square error 73.2; mean absolute error 45.7; median absolute error 22.0. The models were trained using a retention data set of 266 volatile compounds. All calculations can be performed using the convenient open‐source software CHERESHNYA. The final equations are implemented as a spreadsheet and a code snippet and are available online: https://doi.org/10.6084/m9.figshare.26800789.
Ciccarelli D., Samanipour S., Rapp-Wright H., Bieber S., Letzel T., O’Brien J.W., Marczylo T., Gant T.W., Vineis P., Barron L.P.
2024-09-02 citations by CoLab: 3
Karnaeva A.E., Sholokhova A.Y.
Chemosphere scimago Q1 wos Q1
2024-08-01 citations by CoLab: 3 Abstract  
Thirty-two commercially available standards were used to determine chromatographic retention indices for three different stationary phases (non-polar, polar and mid-polar) commonly used in gas chromatography. The selected compounds were nitrogen-containing heterocycles and amides, which are referred to in the literature as unsymmetrical dimethylhydrazine (UDMH) transformation products or its assumed transformation products. UDMH is a highly toxic compound widely used in the space industry. It is a reactive substance that forms a large number of different compounds in the environment. Well-known transformation products may exceed UDMH itself in their toxicity, but most of the products are poorly investigated, while posing a huge environmental threat. Experimental retention indices for the three stationary phases, retention indices from the NIST database, and predicted retention indices are presented in this paper. It is shown that there are virtually no retention indices for UDMH transformation products in the NIST database. In addition, even among those compounds for which retention indices were known, inconsistencies were identified. Adding retention indices to the database and eliminating erroneous data would allow for more reliable identification when standards are not available. The discrepancies identified between experimental retention index values and predicted values will allow for adjustments to the machine learning models that are used for prediction. Compounds previously proposed as possible transformation products without the use of standards or NMR were confirmed.
Bera D., Kumar A., Roy J., Roy K.
Chromatographia scimago Q3 wos Q4
2024-07-18 citations by CoLab: 0 Abstract  
The demand for novel flavors and fragrance (F&F) compounds has increased, highlighting the need for a systematic design approach. Currently, the F&F industry relies heavily on experimental approaches without considering the potential consequences of altering the features that contribute to the fragrance of the compound. In silico approaches have great potential to identify the necessary features for creating novel F&F compounds. In the present study, Quantitative Structure–Property Relationship (QSPR) models were developed using 1208 compounds and simple 2D descriptors, focusing on the RI (retention index) as the endpoint to predict the olfactory properties of molecules. Feature selection was initially carried out by multi-layered stepwise regression followed by feature thinning using the Genetic Algorithm (GA) and optimal feature combination selection using the BSS (best subset selection) method. Final models were developed using the Partial Least Squares (PLS) method. Additionally, internal and external validation of the models was performed using different validation metrics suggesting that the developed models are reliable, predictive, reproducible, and robust. To enhance the external prediction of the developed models, an Intelligent Consensus Prediction (ICP) method was employed and CM3 (consensus model 3) (best selection of predictions (compound-wise) from individual models) was found to provide the best predictivity. The modeling descriptors suggested that the hydrophobicity, high molecular weight, aromaticity, and presence of large-size fragments (high percentage of carbon) enhance the RI values. Conversely, polarity and hydrophilicity decrease the RI values. This study can be used to optimize the stationary phase according to the flavor and fragrance compounds to obtain the desired retention index (RI values).
Song Z., Nian L., Shi M., Ren X., Tang M., Shi A., Han Y., Liu M., Wang L., Zhang Y., Xu Y., Feng X.
2024-07-12 citations by CoLab: 0 Abstract  
Non-targeted analysis (NTA) was conducted to identify semi-volatile organic compounds (SVOCs) in a museum in China using the gas chromatography (GC)-Orbitrap-mass spectrometer (MS). Approximately 160 SVOCs were detected, of which 93 had not been reported in previous studies of museum environments. Many of the detected SVOCs were found to be associated with the chemical agents applied in conservation treatment and the materials used in furnishings. The results of hierarchical cluster analysis (HCA) indicated a spatial variation of SVOCs in the indoor air in the museum, but there were no obvious temporal differences of SVOCs observed in indoor dust. Spearman’s correlation analysis showed that several classes of SVOCs were well correlated, suggesting their common sources. Fragrances and plasticizers were found to be the primary sources of SVOC pollution detected in the museum. Compared with compounds in outdoor air, indoor SVOCs had a lower level of unsaturation and more portions of chemically reduced compounds. This study is the first of its kind to comprehensively characterize SVOCs in a museum using an automated NTA approach with GC-Orbitrap-MS. The SVOCs identified in the current study are likely to be present in other similar museums; therefore, further examination of their potential impacts on cultural heritage artifacts, museum personnel, and visitors may be warranted.
Yoon N., Jung W., Kim H.
Chemosensors scimago Q2 wos Q1 Open Access
2024-07-07 citations by CoLab: 1 PDF Abstract  
The gas chromatography analysis method for chemical substances enables accurate analysis to precisely distinguish the components of a mixture. This paper presents a technique for augmenting time-series data of chemicals measured by gas chromatography instruments with artificial intelligence techniques such as generative adversarial networks (GAN). We propose a novel GAN algorithm called GCGAN for gas chromatography data, a unified model of autoencoder (AE) and GAN for effective time-series data learning with an attention mechanism. The proposed GCGAN utilizes AE to learn a limited number of data more effectively. We also build a layer of high-performance generative adversarial neural networks based on the analysis of the features of data measured by gas chromatography instruments. Then, based on the proposed learning, we synthesize the features embedded in the gas chromatography data into a feature distribution that extracts the temporal variability. GCGAN synthesizes the features embedded in the gas chromatography data into a feature distribution that extracts the temporal variability of the data over time. We have fully implemented the proposed GCGAN and experimentally verified that the data augmented by the GCGAN have the characteristic properties of the original gas chromatography data. The augmented data demonstrate high quality with the Pearson correlation coefficient, Spearman correlation coefficient, and cosine similarity all exceeding 0.9, significantly enhancing the performance of AI classification models by 40%. This research can be effectively applied to various small dataset domains other than gas chromatography data, where data samples are limited and difficult to obtain.
He M., Li S.
2024-04-26 citations by CoLab: 1 Abstract  
The modernization and globalization of traditional Chinese medicines (TCMs) require the implementation of a robust quality control system, and the application of modern theories and technologies in analytical chemistry can greatly facilitate the establishment of such a system. However, inherent “uncertainties” are often present in the chemical measurement data obtained from TCMs using modern analytical instruments. To address this issue, the utilization and further development of Chemometrics are urgently needed. It plays a crucial role in reducing or eliminating the “uncertainties” associated with the chemical composition, structure, and other relevant information of the TCMs, primarily focusing on qualitative identification and quantitative determination. Given that TCMs are complex multi-component systems, future quality evaluation may encompass in silico prediction of physical/chemical properties and activities. Furthermore, it is essential to establish a link between the measured data and biological activity. Achieving these objectives necessitates continuous advancements in Chemometrics and close collaboration with artificial intelligence. In essence, the quality control of TCMs requires extensive knowledge of Chemometrics, both existing and yet to be explored. This chapter provides a comprehensive discussion on various topics, including “sampling for analytical purposes,” “experimental design and optimization,” “evaluation of experimental measurements,” “fingerprint pre-processing,” “multivariate calibration and multivariate resolution,” “pattern recognition of fingerprint data,” “fingerprint-efficacy modeling,” “structure–property/activity relationship,” and “expert system of TCMs fingerprint.”
Guo Z., Fan Y., Yu C., Lu H., Zhang Z.
Analytical Chemistry scimago Q1 wos Q1
2024-04-01 citations by CoLab: 2
Khrisanfov M.D., Matyushin D.D., Samokhin A.S.
Analytica Chimica Acta scimago Q1 wos Q1
2024-04-01 citations by CoLab: 6 Abstract  
The NIST retention index database is one of the most widely used sources of retention indices. In both untargeted analysis and machine learning studies, filtering for potential errors is rather lacking or nonexistent. According to our estimates, about 80% of the compounds in both the NIST 17 and NIST 20 retention index databases have only one RI value per stationary phase, which makes searching for erroneous values with statistical methods impossible. Manual inspection is also impractical because the database contains more than 300 000 entries. We suggest a two-step procedure based on machine learning to find potentially erroneous retention indices. The first step is to use five predictive models to obtain predicted retention index values for the whole database. The second is to compare these predicted values against the experimental ones. We consider a retention index erroneous if its accuracy (the difference between the predicted and experimental value) is in the bottom 5% for each of the five models simultaneously. Using this method, we were able to detect 2093 outlier entries for standard and semi-standard non-polar stationary phases in the NIST 17 retention index database, of which 566 were corrected or removed by the developers in NIST 20. This is a novel approach to finding potentially erroneous entries in a large-scale database with mostly unique entries, and it can be applied to data other than retention indices. The procedure can help filter and report mishandled data to improve the quality of the dataset for machine learning applications and experimental use.
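The two-step procedure described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "bottom 5% accuracy" means the largest absolute differences between predicted and experimental values, and the function names and the `fraction` parameter are hypothetical.

```python
from typing import List, Optional, Sequence, Set


def worst_fraction_indices(errors: Sequence[float], fraction: float) -> Set[int]:
    """Indices of the entries with the largest errors (the worst `fraction`)."""
    n_flag = max(1, int(len(errors) * fraction))
    ranked = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return set(ranked[:n_flag])


def flag_suspect_entries(
    experimental: Sequence[float],
    predictions_per_model: Sequence[Sequence[float]],
    fraction: float = 0.05,
) -> List[int]:
    """Flag entries whose error is in the worst `fraction` for every model."""
    suspects: Optional[Set[int]] = None
    for predicted in predictions_per_model:
        # Step 2: compare each model's predictions with the experimental values.
        errors = [abs(p - e) for p, e in zip(predicted, experimental)]
        worst = worst_fraction_indices(errors, fraction)
        # Keep only entries that every model so far places among its worst.
        suspects = worst if suspects is None else suspects & worst
    return sorted(suspects or [])
```

Requiring agreement across all models is the conservative part of the design: an entry that only one model predicts poorly is more likely to reflect a weakness of that model than an error in the database.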
Geer L.Y., Stein S.E., Mallard W.G., Slotta D.J.
2024-01-17 citations by CoLab: 7
Li T., Su W., Zhong L., Liang W., Feng X., Zhu B., Ruan T., Jiang G.
2023-11-27 citations by CoLab: 14
Song Z., Shi M., Ren X., Wang L., Wu Y., Fan Y., Zhang Y., Xu Y.
Journal of Hazardous Materials scimago Q1 wos Q1
2023-10-01 citations by CoLab: 8 Abstract  
Household dust contains a wide variety of semi-volatile organic compounds (SVOCs) that may pose health risks. We developed a method integrating non-targeted analysis (NTA) and targeted analysis (TA) to identify SVOCs in indoor dust. Based on a combined use of gas and liquid chromatography with high-resolution mass spectrometry, an automated, time-efficient NTA workflow was developed, and high accuracy was observed. A total of 128 compounds were identified at confidence level 1 or 2 in NIST standard reference material dust (SRM 2585). Among them, 113 compounds had not been reported previously, and this suggested the value of NTA in characterizing contaminants in dust. Additionally, TA was done to avoid the loss of trace compounds. By integrating data obtained from the NTA and TA approaches, SVOCs in SRM 2585 were prioritized based on exposure and chemical toxicity. Six of the top 20 compounds have never been reported in SRM 2585, including melamine, dinonyl phthalate, oxybenzone, diheptyl phthalate, drometrizole, and 2-phenylphenol. Additionally, significant influences of analytical instruments and sample preparation on NTA results were observed. Overall, the developed method provided a powerful tool for identifying SVOCs in indoor dust, which is necessary to obtain a more complete understanding of chemical exposures and risks in indoor environments.
