Open Access
Chemical Science, volume 12, issue 31, pages 10622-10633

ChemPix: automated recognition of hand-drawn hydrocarbon structures using deep learning

Publication type: Journal Article
Publication date: 2021-07-03
Journal: Chemical Science
scimago Q1
wos Q1
SJR: 2.333
CiteScore: 14.4
Impact factor: 7.6
ISSN: 20416520, 20416539
PubMed ID: 34447555
General Chemistry
Abstract
Inputting molecules into chemistry software, such as quantum chemistry packages, currently requires domain expertise, expensive software and/or cumbersome procedures. Leveraging recent breakthroughs in machine learning, we develop ChemPix: an offline, hand-drawn hydrocarbon structure recognition tool designed to remove these barriers. A neural image captioning approach consisting of a convolutional neural network (CNN) encoder and a long short-term memory (LSTM) decoder learned a mapping from photographs of hand-drawn hydrocarbon structures to machine-readable SMILES representations. We generated a large auxiliary training dataset, based on RDKit molecular images, by combining image augmentation, image degradation and background addition. Additionally, a small dataset of ∼600 hand-drawn hydrocarbon chemical structures was crowd-sourced using a phone web application. These datasets were used to train the image-to-SMILES neural network with the goal of maximizing the hand-drawn hydrocarbon recognition accuracy. By forming a committee of the trained neural networks where each network casts one vote for the predicted molecule, we achieved a nearly 10 percentage point improvement of the molecule recognition accuracy and were able to assign a confidence value for the prediction based on the number of agreeing votes. The ensemble model achieved an accuracy of 76% on hand-drawn hydrocarbons, increasing to 86% if the top 3 predictions were considered.
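The committee step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `committee_vote`, the five example predictions, and the assumption that every network emits a SMILES string that has already been canonicalized (so identical molecules compare equal as strings) are all hypothetical. Each network casts one vote, and the confidence of a prediction is taken as the fraction of networks that agree on it.

```python
from collections import Counter
from typing import List, Tuple


def committee_vote(predictions: List[str], top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank SMILES strings by the fraction of committee members that predicted them."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    total = len(predictions)
    # most_common sorts by vote count; ties keep first-seen order.
    return [(smiles, votes / total) for smiles, votes in counts.most_common(top_k)]


# Hypothetical outputs of five independently trained networks for one image.
votes = ["C1CCCCC1", "C1CCCCC1", "C1CCCCC1", "C1CCCC1", "CCCCCC"]
ranked = committee_vote(votes)
print(ranked)
```

The first entry of `ranked` is the committee's prediction with its agreement fraction; the remaining entries are the runner-up candidates that a top-3 evaluation would also count.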
Raucci U., Valentini A., Pieri E., Weir H., Seritan S., Martínez T.J.
Nature Computational Science scimago Q1 wos Q1
2021-01-14 citations by CoLab: 13 Abstract  
Over the past decade, artificial intelligence has been propelled forward by advances in machine learning algorithms and computational hardware, opening up myriads of new avenues for scientific research. Nevertheless, virtual assistants and voice control have yet to be widely used in the natural sciences. Here, we present ChemVox, an interactive Amazon Alexa skill that uses speech recognition to perform quantum chemistry calculations. This new application interfaces Alexa with cloud computing and returns the results through a capable device. ChemVox paves the way to making computational chemistry routinely accessible to the wider community. Using voice-based technologies, ChemVox is able to answer quantum chemistry questions in seconds, thus making such complex questions more accessible to the community.
Noé F., Tkatchenko A., Müller K., Clementi C.
2020-04-20 citations by CoLab: 593 Abstract  
Machine learning (ML) is transforming all areas of science. The complex and time-consuming calculations in molecular simulations are particularly suitable for an ML revolution and have already been profoundly affected by the application of existing ML methods. Here we review recent ML methods for molecular simulation, with particular focus on (deep) neural networks for the prediction of quantum-mechanical energies and forces, on coarse-grained molecular dynamics, on the extraction of free energy surfaces and kinetics, and on generative network approaches to sample molecular equilibrium structures and compute thermodynamics. To explain these methods and illustrate open methodological problems, we review some important principles of molecular physics and describe how they can be incorporated into ML structures. Finally, we identify and describe a list of open challenges for the interface between ML and molecular simulation.
Seritan S., Thompson K., Martínez T.J.
2020-04-08 citations by CoLab: 25 Abstract  
The encapsulation and commoditization of electronic structure arise naturally as interoperability, and the use of nontraditional compute resources (e.g., new hardware accelerators, cloud computing)...
Beard E.J., Cole J.M.
2020-03-26 citations by CoLab: 32 Abstract  
The number of journal articles in the scientific domain has grown to the point where it has become impossible for researchers to capitalize on all findings in their relevant discipline. Information...
Withnall M., Lindelöf E., Engkvist O., Chen H.
Journal of Cheminformatics scimago Q1 wos Q1 Open Access
2020-01-08 citations by CoLab: 147 PDF Abstract  
Neural Message Passing for graphs is a promising and relatively recent approach for applying Machine Learning to networked data. As molecules can be described intrinsically as a molecular graph, it makes sense to apply these techniques to improve molecular property prediction in the field of cheminformatics. We introduce Attention and Edge Memory schemes to the existing message passing neural network framework, and benchmark our approaches against eight different physical–chemical and bioactivity datasets from the literature. We remove the need to introduce a priori knowledge of the task and chemical descriptor calculation by using only fundamental graph-derived properties. Our results consistently perform on-par with other state-of-the-art machine learning approaches, and set a new standard on sparse multi-task virtual screening targets. We also investigate model performance as a function of dataset preprocessing, and make some suggestions regarding hyperparameter selection.
Staker J., Marshall K., Abel R., McQuaw C.M.
2019-02-13 citations by CoLab: 67 Abstract  
Chemical structure extraction from documents remains a hard problem because of both false positive identification of structures during segmentation and errors in the predicted structures. Current approaches rely on handcrafted rules and subroutines that perform reasonably well generally but still routinely encounter situations where recognition rates are not yet satisfactory and systematic improvement is challenging. Complications impacting the performance of current approaches include the diversity in visual styles used by various software to render structures, the frequent use of ad hoc annotations, and other challenges related to image quality, including resolution and noise. We present end-to-end deep learning solutions for both segmenting molecular structures from documents and predicting chemical structures from the segmented images. This deep-learning-based approach does not require any handcrafted features, is learned directly from data, and is robust against variations in image quality and style. Using the deep learning approach described herein, we show that it is possible to perform well on both segmentation and prediction of low-resolution images containing moderately sized molecules found in journal articles and patents.
Krizhevsky A., Sutskever I., Hinton G.E.
Communications of the ACM scimago Q1 wos Q1
2017-05-24 citations by CoLab: 35322 Abstract  
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully connected layers we employed a recently developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Behler J.
Journal of Chemical Physics scimago Q1 wos Q1
2016-11-01 citations by CoLab: 1009 PDF Abstract  
Nowadays, computer simulations have become a standard tool in essentially all fields of chemistry, condensed matter physics, and materials science. In order to keep up with state-of-the-art experiments and the ever growing complexity of the investigated problems, there is a constantly increasing need for simulations of more realistic, i.e., larger, model systems with improved accuracy. In many cases, the availability of sufficiently efficient interatomic potentials providing reliable energies and forces has become a serious bottleneck for performing these simulations. To address this problem, currently a paradigm change is taking place in the development of interatomic potentials. Since the early days of computer simulations simplified potentials have been derived using physical approximations whenever the direct application of electronic structure methods has been too demanding. Recent advances in machine learning (ML) now offer an alternative approach for the representation of potential-energy surfaces by fitting large data sets from electronic structure calculations. In this perspective, the central ideas underlying these ML potentials, solved problems and remaining challenges are reviewed along with a discussion of their current applicability and limitations.
Tajbakhsh N., Shin J.Y., Gurudu S.R., Hurst R.T., Kendall C.B., Gotway M.B., Liang J.
2016-05-01 citations by CoLab: 2244 Abstract  
Training a deep convolutional neural network (CNN) from scratch is difficult because it requires a large amount of labeled training data and a great deal of expertise to ensure proper convergence. A promising alternative is to fine-tune a CNN that has been pre-trained using, for instance, a large set of labeled natural images. However, the substantial differences between natural and medical images may advise against such knowledge transfer. In this paper, we seek to answer the following central question in the context of medical image analysis: Can the use of pre-trained deep CNNs with sufficient fine-tuning eliminate the need for training a deep CNN from scratch? To address this question, we considered four distinct medical imaging applications in three specialties (radiology, cardiology, and gastroenterology) involving classification, detection, and segmentation from three different imaging modalities, and investigated how the performance of deep CNNs trained from scratch compared with the pre-trained CNNs fine-tuned in a layer-wise manner. Our experiments consistently demonstrated that 1) the use of a pre-trained CNN with adequate fine-tuning outperformed or, in the worst case, performed as well as a CNN trained from scratch; 2) fine-tuned CNNs were more robust to the size of training sets than CNNs trained from scratch; 3) neither shallow tuning nor deep tuning was the optimal choice for a particular application; and 4) our layer-wise fine-tuning scheme could offer a practical way to reach the best performance for the application at hand based on the amount of available data.
Hirschberg J., Manning C.D.
Science scimago Q1 wos Q1 Open Access
2015-07-17 citations by CoLab: 963 PDF Abstract  
Natural language processing employs computational techniques for the purpose of learning, understanding, and producing human language content. Early computational approaches to language research focused on automating the analysis of the linguistic structure of language and developing basic technologies such as machine translation, speech recognition, and speech synthesis. Today’s researchers refine and make use of such tools in real-world applications, creating spoken dialogue systems and speech-to-speech translation engines, mining social media for information about health or finance, and identifying sentiment and emotion toward products and services. We describe successes and challenges in this rapidly advancing area.
LeCun Y., Bengio Y., Hinton G.
Nature scimago Q1 wos Q1
2015-05-27 citations by CoLab: 57034 Abstract  
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Frasconi P., Gabbrielli F., Lippi M., Marinai S.
2014-08-06 citations by CoLab: 22 Abstract  
Optical chemical structure recognition is the problem of converting a bitmap image containing a chemical structure formula into a standard structured representation of the molecule. We introduce a novel approach to this problem based on the pipelined integration of pattern recognition techniques with probabilistic knowledge representation and reasoning. Basic entities and relations (such as textual elements, points, lines, etc.) are first extracted by a low-level processing module. A probabilistic reasoning engine based on Markov logic, embodying chemical and graphical knowledge, is subsequently used to refine these pieces of information. An annotated connection table of atoms and bonds is finally assembled and converted into a standard chemical exchange format. We report a successful evaluation on two large image data sets, showing that the method compares favorably with the current state-of-the-art, especially on degraded low-resolution images. The system is available as a web server at http://mlocsr.dinfo.unifi.it.
Rupp M., Tkatchenko A., Müller K., von Lilienfeld O.A.
Physical Review Letters scimago Q1 wos Q1 Open Access
2012-01-31 citations by CoLab: 1711 Abstract  
We introduce a machine learning model to predict atomization energies of a diverse set of organic molecules, based on nuclear charges and atomic positions only. The problem of solving the molecular Schrödinger equation is mapped onto a nonlinear statistical regression problem of reduced complexity. Regression models are trained on and compared to atomization energies computed with hybrid density-functional theory. Cross validation over more than seven thousand organic molecules yields a mean absolute error of ∼10 kcal/mol. Applicability is demonstrated for the prediction of molecular atomization potential energy curves.
Blum L.C., Reymond J.
2009-06-08 citations by CoLab: 592 Abstract  
GDB-13 enumerates small organic molecules containing up to 13 atoms of C, N, O, S, and Cl following simple chemical stability and synthetic feasibility rules. With 977,468,314 structures, GDB-13 is the largest publicly available small organic molecule database to date.
Valko A.T., Johnson A.P.
2009-03-19 citations by CoLab: 65 Abstract  
We present CLiDE Pro, the latest version of the output of the long-term CLiDE project for the development of tools for automatic extraction of chemical information from the literature. CLiDE Pro is concerned with the extraction of chemical structure and generic structure information from electronic images of chemical molecules available online as well as pages of scanned chemical documents. The information is extracted in three phases, first the image is segmented into text and graphical regions, then graphical regions are analyzed and where possible the connection tables are reconstructed, and finally any generic structures are interpreted by matching R-groups found in structure diagrams with the ones located in the text. The program has been tested on a large set of images of chemical structures originating from various sources. The results demonstrate good performance in the reconstruction of connection tables with few errors in the interpretation of the individual drawing features found in the structure diagrams. This full test set is presented for use in the validation of other similar systems.
Tao J., Liu W., Peng X., He X., Luo Y.
2024-11-12 citations by CoLab: 0 Abstract  
The recognition of hand-drawn chemical molecular formulas is crucial for applications such as electronic note-taking and automated grading. Despite the challenges posed by stylistic variations in hand-drawn chemical structure diagrams, we introduce a novel recognition algorithm for hand-drawn hydrocarbon molecular formulas using anchor-free object detection methods. First, we employ an anchor-free detector based on irregular quadrilaterals to identify all potential chemical bonds in input images. By analyzing the collision relationships between these bonds, we then reconstruct all unspecified carbon atoms and assemble them into an adjacency matrix. Finally, we use the RDKit to convert the adjacency matrix into a SMILES string. Notably, our method does not rely on the SMILES string used during training, thereby enabling it to recognize previously unseen hydrocarbons. To verify the effectiveness of the algorithm, we collected a dataset containing 4,217 hand-drawn hydrocarbon molecular structures. Using RepVGG-A0 at a 512 × 512 resolution, our algorithm achieved a recognition accuracy of 85.86%.
Ouyang H., Liu W., Tao J., Luo Y., Zhang W., Zhou J., Geng S., Zhang C.
Scientific Reports scimago Q1 wos Q1 Open Access
2024-07-25 citations by CoLab: 0 PDF Abstract  
Chemical molecular structures are a direct and convenient means of expressing chemical knowledge, playing a vital role in academic communication. In chemistry, hand drawing is a common task for students and researchers. If we can convert hand-drawn chemical molecular structures into machine-readable formats, like SMILES encoding, computers can efficiently process and analyze these structures, significantly enhancing the efficiency of chemical research. Furthermore, with the progress of educational technology, automated grading is gaining popularity. When machines automatically recognize chemical molecular structures and assess the correctness of the drawings, it offers great convenience to teachers. We created ChemReco, a tool designed to identify chemical molecular structures involving three atoms: C, H, and O, providing convenience for chemical researchers. Currently, there are limited studies on hand-drawn chemical molecular structures. Therefore, the primary focus of this paper is constructing datasets. We propose a synthetic image method to rapidly generate images resembling hand-drawn chemical molecular structures, enhancing dataset acquisition efficiency. Regarding model selection, the hand-drawn chemical molecule structural recognition model developed in this article achieves a final recognition accuracy of 96.90%. This model employs the encoder-decoder architecture of EfficientNet + Transformer, demonstrating superior performance compared to other encoder-decoder combinations.
Anjaneyulu B., Goswami S., Banik P., Chauhan V., Raghav N., Chinmay
Chemistry Africa scimago Q3 wos Q3
2024-05-31 citations by CoLab: 1 Abstract  
The field of computational chemistry is one of many sectors that artificial intelligence (AI) has revolutionized in recent years. Chemists are now more equipped to analyze enormous volumes of data, optimize chemical processes, and design new molecules and materials with high speed and accuracy because of advancements in machine-learning (ML) approaches, hardware platforms, and algorithms. This article explores the newest advancements and patterns in artificial intelligence related to chemistry, emphasizing how this technology can potentially transform the subject entirely and the integration of AI in the 14 different software/databases widely used in chemistry.
Aioanei A.C., Hunziker-Rodewald R.R., Klein K.M., Michels D.L.
PLoS ONE scimago Q1 wos Q1 Open Access
2024-04-19 citations by CoLab: 1 PDF Abstract  
Epigraphy is witnessing a growing integration of artificial intelligence, notably through its subfield of machine learning (ML), especially in tasks like extracting insights from ancient inscriptions. However, scarce labeled data for training ML algorithms severely limits current techniques, especially for ancient scripts like Old Aramaic. Our research pioneers an innovative methodology for generating synthetic training data tailored to Old Aramaic letters. Our pipeline synthesizes photo-realistic Aramaic letter datasets, incorporating textural features, lighting, damage, and augmentations to mimic real-world inscription diversity. Despite minimal real examples, we engineer a dataset of 250 000 training and 25 000 validation images covering the 22 letter classes in the Aramaic alphabet. This comprehensive corpus provides a robust volume of data for training a residual neural network (ResNet) to classify highly degraded Aramaic letters. The ResNet model demonstrates 95% accuracy in classifying real images from the 8th century BCE Hadad statue inscription. Additional experiments validate performance on varying materials and styles, proving effective generalization. Our results validate the model’s capabilities in handling diverse real-world scenarios, proving the viability of our synthetic data approach and avoiding the dependence on scarce training data that has constrained epigraphic analysis. Our innovative framework elevates interpretation accuracy on damaged inscriptions, thus enhancing knowledge extraction from these historical resources.
Adhikary T., Basak P.
2023-08-28 citations by CoLab: 0 Abstract  
“Omic” technologies (such as genomics, transcriptomics, proteomics, and metabolomics) generate huge databases that demand computational approaches to state novel conclusions. With the advent of machine learning and artificial intelligence algorithms, the analysis of biological data and protein engineering has taken a step forward. Different virtual screening servers and standalone software paved their importance in the initial phase of drug discovery, aiding in drug repurposing and high-throughput screening. Besides, interaction networks, often encountered in polypharmacology and network pharmacology, guide a researcher in target fishing and developing drug combinations. Visualization and prediction of molecular structures, modeling antibodies, and peptides including homology modeling are crucial to bioinformaticians and clinical biologists. Biological network analysis, pharmacophore modeling, molecular docking, and dynamics simulation are broadly exploited in the domain of computational biology and elucidate the mechanisms underlying biomolecular interactions, consequently revealing the orchestra of biological pathways. Considering the intended purposes, advantages, and limitations of the existing software, this chapter highlights only a fraction of popular platforms and encourages the readers to explore other alternatives in various domains of drug discovery and protein engineering.
Rajan K., Brinkhaus H.O., Agea M.I., Zielesny A., Steinbeck C.
Nature Communications scimago Q1 wos Q1 Open Access
2023-08-19 citations by CoLab: 24 PDF Abstract  
The number of publications describing chemical structures has increased steadily over the last decades. However, the majority of published chemical information is currently not available in machine-readable form in public databases. It remains a challenge to automate the process of information extraction in a way that requires less manual intervention - especially the mining of chemical structure depictions. As an open-source platform that leverages recent advancements in deep learning, computer vision, and natural language processing, DECIMER.ai (Deep lEarning for Chemical IMagE Recognition) strives to automatically segment, classify, and translate chemical structure depictions from the printed literature. The segmentation and classification tools are the only openly available packages of their kind, and the optical chemical structure recognition (OCSR) core application yields outstanding performance on all benchmark datasets. The source code, the trained models and the datasets developed in this work have been published under permissive licences. An instance of the DECIMER web application is available at https://decimer.ai.
Ouyang H., Liu W., Tao J., Luo Y., Zhang W., Zhou J., Geng S., Zhang C.
2023-08-17 citations by CoLab: 0 Abstract  
Chemical molecular structures are important in academic communication because they allow for a more direct and convenient representation of chemical knowledge. Drawing chemical molecular structures by hand is a common task for chemistry students and researchers. If hand-drawn chemical molecular structures could be converted into machine-readable data forms, such as SMILES codes, computers would be able to process and analyze these chemical molecular structures, greatly increasing the efficiency of chemical research. Furthermore, with the advancement of information technology in education, automatic marking is becoming increasingly popular. Teachers will benefit greatly from having a machine recognize the chemical molecular structure and then determine whether it is drawn correctly. In this study, we investigate chemical molecular formulas consisting of three atoms: C, H, and O. Because there has been little research on hand-drawn chemical molecular structures, the first major task of this paper is to create a dataset. This paper proposes a synthetic image method for quickly generating images resembling hand-drawn chemical molecular structures, improving dataset acquisition efficiency. The hand-drawn chemical structure recognition model designed in this paper achieves a final recognition accuracy of 96.90%. The model employs the EfficientNet + Transformer encoder-decoder architecture, which outperforms other encoder-decoder combinations.
Stamatakis M., Gritz W., Oldag J., Hoppe A., Schanze S., Ewerth R.
2023-06-25 citations by CoLab: 0 Abstract  
Automatic analyses of student drawings in chemistry education have the potential to support classroom teaching. To date, related work on handwritten chemical structures or formulas is limited to well-defined presentation formats, e.g., Lewis structures. However, the large variety of possible illustrations in student drawings in chemical education has not been addressed yet. In this paper, we present a novel approach to identify visual primitives in student drawings from chemistry classes. Since the field lacks suitable datasets for the given task, we introduce a method to synthetically create a dataset for visual primitives. We demonstrate how detected visual primitives can be used to automatically classify drawings according to a taxonomy of drawing characteristics in chemistry and physics. Our experiments show that (1) the detection of visual primitives in student drawings, and (2) the subsequent classification of chemistry- and physics-specific drawing characteristics is possible.
Raucci U., Weir H., Sakshuwong S., Seritan S., Hicks C.B., Vannucci F., Rea F., Martínez T.J.
2023-04-24 citations by CoLab: 9 Abstract  
Modern quantum chemistry algorithms are increasingly able to accurately predict molecular properties that are useful for chemists in research and education. Despite this progress, performing such calculations is currently unattainable to the wider chemistry community, as they often require domain expertise, computer programming skills, and powerful computer hardware. In this review, we outline methods to eliminate these barriers using cutting-edge technologies. We discuss the ingredients needed to create accessible platforms that can compute quantum chemistry properties in real time, including graphical processing units–accelerated quantum chemistry in the cloud, artificial intelligence–driven natural molecule input methods, and extended reality visualization. We end by highlighting a series of exciting applications that assemble these components to create uniquely interactive platforms for computing and visualizing spectra, 3D structures, molecular orbitals, and many other chemical properties. Expected final online publication date for the Annual Review of Physical Chemistry, Volume 74 is April 2023. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Brinkhaus H.O., Rajan K., Schaub J., Zielesny A., Steinbeck C.
2023-04-01 citations by CoLab: 12 Abstract  
Recent years have seen a sharp increase in the development of deep learning and artificial intelligence-based molecular informatics. There has been a growing interest in applying deep learning to several subfields, including the digital transformation of synthetic chemistry, extraction of chemical information from the scientific literature, and AI in natural product-based drug discovery. The application of AI to molecular informatics is still constrained by the fact that most of the data used for training and testing deep learning models are not available as FAIR and open data. As open science practices continue to grow in popularity, initiatives which support FAIR and open data as well as open-source software have emerged. It is becoming increasingly important for researchers in the field of molecular informatics to embrace open science and to submit data and software in open repositories. With the advent of open-source deep learning frameworks and cloud computing platforms, academic researchers are now able to deploy and test their own deep learning models with ease. With the development of new and faster hardware for deep learning and the increasing number of initiatives towards digital research data management infrastructures, as well as a culture promoting open data, open source, and open science, AI-driven molecular informatics will continue to grow. This review examines the current state of open data and open algorithms in molecular informatics, as well as ways in which they could be improved in future.
