Head of Laboratory

Denis Derkach

PhD in Physics and Mathematics, Associate Professor
Publications: 831
Citations: 36 200
h-index: 90
Lab team

The Laboratory of Big Data Analysis Methods develops and applies machine learning and data analysis methods to problems in fundamental sciences such as particle physics and astrophysics. The laboratory's main line of development is the search for answers to the mysteries of the universe together with leading scientists in these fields. In particular, we cooperate with the European Organization for Nuclear Research (CERN); our joint work covers both research on the physics of Large Hadron Collider events and improving the efficiency of data processing. In addition, the laboratory's educational activities include organizing academic seminars and summer/winter schools on big data analysis and supervising graduate theses and dissertations. The laboratory was founded in 2015.

Denis Derkach
Head of Laboratory
Fedor Ratnikov
Leading researcher
Andrey Ustyuzhanin
Leading researcher
Mikhail Hushchyn
Senior researcher
Sergei Mokhnenko
Researcher
Mikhail Lazarev
Researcher
Ekaterina Trofimova
Junior researcher
Artem Ryzhikov
Junior researcher
Evgenii Kurbatov
Junior researcher
Vladimir Bocharnikov
Junior researcher
Kenenbek Arzymatov
Junior researcher
Maxim Karpov
Junior researcher
Alexander Rogachev
Research intern
Andrey Shevelev
Research intern
Foma Shipilov
Research intern
Leonid Gremyachikh
Research intern
Abdalaziz Rashid
Research intern
Tigran Ramazyan
Research intern
David Kagramanyan
Research intern
Sergey Popov
Research intern
Aziz Temirkhanov
Research intern

Research directions

Natural language for machine learning

Designing data analysis pipelines from machine learning models is largely routine work: most pipelines combine a small set of recurring patterns. At the same time, such pipelines are extremely important to specialists in subject areas not directly related to data analysis, so there is strong demand for well-built ML pipelines among non-specialists such as biologists, chemists, physicists, and humanities scholars. This project aims to develop an assistant bot/agent capable of generating pipelines for ML-related tasks from a natural-language task description. Such an assistant must rely heavily on natural language processing and program synthesis techniques.
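
As a toy illustration of the intended behavior, the sketch below maps a free-text task description to a scikit-learn pipeline with hand-written keyword rules; the rules, function name, and pipeline choices are hypothetical stand-ins for the NLP and program-synthesis models the project actually targets.

```python
# Minimal sketch: map a free-text task description to a baseline scikit-learn
# pipeline. The keyword rules and model choices are hypothetical illustrations.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, Ridge

def pipeline_from_description(description: str) -> Pipeline:
    """Pick a baseline pipeline from a natural-language task description."""
    text = description.lower()
    if any(word in text for word in ("classify", "classification", "label")):
        model = LogisticRegression(max_iter=1000)
    else:  # fall back to regression for "predict a value"-style requests
        model = Ridge()
    return Pipeline([("scale", StandardScaler()), ("model", model)])

print(pipeline_from_description("Classify cell images as healthy or diseased"))
```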

Interpretable machine learning models and the search for the laws of nature

There are many problems in physics, biology, and other natural sciences in which symbolic regression can provide valuable insights and discover new laws of nature. Widespread deep neural networks do not offer interpretable solutions, whereas symbolic expressions state a clear relation between the observations and the target variable. However, there is currently no dominant solution to the symbolic regression problem, and we aim to close this gap with our project. Our laboratory has started research in this direction: our approach to finding a representation of a symbolic law uses generative models together with constrained optimization methods, and it can be applied to closed-form equations or to systems of differential equations. The objective of the study is to improve the model using active and zero-shot learning methods.
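
To make the problem setting concrete, here is a deliberately simplified sketch of symbolic regression by random search over small expression trees; the laboratory's actual approach replaces this search with generative models (see the SEGVAE paper below) and constrained optimization, so everything here is illustrative.

```python
# Minimal sketch: symbolic regression by random search over small expression
# trees; a simplified stand-in for the generative-model approach described above.
import numpy as np

rng = np.random.default_rng(0)
UNARY = {"sin": np.sin, "cos": np.cos}
BINARY = {"+": np.add, "*": np.multiply}

def random_expr(depth=3):
    """Build a random expression as a nested tuple over the variable 'x'."""
    if depth == 0 or rng.random() < 0.3:
        return "x"
    if rng.random() < 0.5:
        return (rng.choice(list(UNARY)), random_expr(depth - 1))
    return (rng.choice(list(BINARY)), random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    if expr == "x":
        return x
    if expr[0] in UNARY:
        return UNARY[expr[0]](evaluate(expr[1], x))
    return BINARY[expr[0]](evaluate(expr[1], x), evaluate(expr[2], x))

# Try to recover y = sin(x) + x from noisy observations.
x = np.linspace(-3, 3, 200)
y = np.sin(x) + x + rng.normal(0, 0.05, x.size)
best = min((random_expr() for _ in range(5000)),
           key=lambda e: np.mean((evaluate(e, x) - y) ** 2))
print(best)  # with luck, something equivalent to ('+', ('sin', 'x'), 'x')
```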

Platforms for evaluating ML models

Transferring predictive deep learning models from a research environment to an industrial one involves significant verification costs: behavior under load, under RAM limitations, and with streaming data access must all be checked. This project implements algorithms for continuous monitoring of deep learning models in an industrial environment and for early diagnosis of when a model needs to be retrained on the minimum required dataset. The goal is to deploy this platform in the CERN LHCb experiment.
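
A minimal sketch of one such monitoring primitive is shown below: compare a stream of model scores against a reference sample and flag when the distributions diverge. The two-sample Kolmogorov-Smirnov test, window size, and threshold are illustrative assumptions, not the platform's actual algorithms.

```python
# Minimal sketch: flag drift in a stream of model scores against a reference
# sample. The KS test, window size, and alpha are illustrative assumptions.
from collections import deque
from scipy.stats import ks_2samp

class ScoreMonitor:
    def __init__(self, reference_scores, window=500, alpha=0.01):
        self.reference = list(reference_scores)  # scores from validation time
        self.window = deque(maxlen=window)       # most recent production scores
        self.alpha = alpha                       # significance level for the test

    def observe(self, score: float) -> bool:
        """Add one streaming score; return True if retraining looks necessary."""
        self.window.append(score)
        if len(self.window) < self.window.maxlen:
            return False  # not enough recent data to compare yet
        _, p_value = ks_2samp(self.reference, list(self.window))
        return p_value < self.alpha

# usage: monitor = ScoreMonitor(validation_scores)
#        for x in stream: needs_retraining = monitor.observe(model_score(x))
```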

High-precision digital twin of data storage systems (DSS)

High-precision modeling of installations and systems is one of the main directions of industrial data analysis today. Models of systems, their digital twins, are used to predict behavior under various conditions. We have developed a digital twin of a data storage system (DSS) using generative machine learning models. The system consists of several types of components: HDDs and SSDs, disk pools with different RAID arrays, cache, and storage controllers. Each component is represented by a probabilistic model that describes the distribution of its performance metrics as a function of its configuration and the external data load. Machine learning yields a high-precision digital twin of a specific system while spending less time and fewer resources than alternative approaches. The twin quickly predicts the performance of the system and its components under different configurations and external loads, which significantly speeds up the development of new storage systems. Comparing the twin's forecasts with the metrics of the real storage system also makes it possible to diagnose failures and anomalies, increasing the system's reliability.
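
The sketch below illustrates the idea for a single hypothetical component: predict a distribution over a performance metric, rather than a point value, as a function of the load parameters. The Gaussian likelihood and synthetic data are assumptions for illustration; the real twin uses richer generative models per component type.

```python
# Minimal sketch: a probabilistic model of one hypothetical storage component.
# Given load parameters, predict a distribution over latency instead of a
# point value; the Gaussian form and synthetic data are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
load = rng.uniform(0, 1, size=(2000, 2))   # e.g. scaled IOPS and block size
latency = 1.0 + 3.0 * load[:, 0] + rng.normal(0, 0.2 + load[:, 1])

mean_model = GradientBoostingRegressor().fit(load, latency)
spread = np.abs(latency - mean_model.predict(load))
std_model = GradientBoostingRegressor().fit(load, spread)  # load-dependent noise

def sample_latency(loads, n=1000):
    """Draw latency samples for given loads; component models can be chained."""
    mu = mean_model.predict(loads)
    sigma = np.maximum(std_model.predict(loads), 1e-3)
    return rng.normal(mu, sigma, size=(n, len(mu)))

print(sample_latency(np.array([[0.5, 0.3]])).mean())  # expected near 2.5
```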

Detecting temporal changes for predictive analytics systems

Detecting changes in the behavior of complex systems is an important industrial task in signal processing, statistics, and machine learning. Its solutions have found use in many areas: quality control of production processes, condition monitoring of engineering structures, detection of equipment failures from sensor readings, monitoring of distributed computer systems and detection of security violations, video stream segmentation, recognition of sound effects, control of chemical processes, monitoring of seismological data, analysis of financial and economic data, and many others. We have developed a number of new methods for detecting mode changes in complex systems using classification and regression models, generative adversarial networks, normalizing flows, and neural stochastic differential equations, and demonstrated their theoretical and practical advantages over existing analogues. We have successfully applied the new methods to detecting data storage failures, analyzing human activity, and segmenting videos and texts.
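
As one concrete example of the classifier-based family mentioned above, the sketch below scores a candidate change point by how well a classifier separates a past window from a recent one; the window sizes and the toy series are illustrative assumptions.

```python
# Minimal sketch: classifier-based change-point detection. High held-out
# accuracy separating "past" from "recent" samples means the two windows
# differ, i.e. the system's mode has likely changed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def change_score(series, t, width=100):
    """Score a candidate change point t by past-vs-recent separability."""
    past = series[t - width:t]
    recent = series[t:t + width]
    X = np.concatenate([past, recent]).reshape(-1, 1)
    y = np.concatenate([np.zeros(width), np.ones(width)])
    acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    return 2 * acc - 1  # 0: same distribution, 1: fully separable

rng = np.random.default_rng(2)
series = np.concatenate([rng.normal(0, 1, 300), rng.normal(1.5, 1, 300)])
scores = [change_score(series, t) for t in range(100, 500, 10)]
print(100 + 10 * int(np.argmax(scores)))  # should land near the true change at 300
```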

Updating the weather forecast

Weather forecasting and verification is a task of extrapolating a set of indicators. Modern weather research and forecasting models work well under well-known conditions and over short time intervals. On the other hand, AI methods, the available data, and weather simulators are known to match each other imperfectly. This project therefore aims to develop and train new algorithms that adjust the simulator's parameters and obtain reliable forecasts more efficiently. This synergy will, in turn, improve the accuracy of forecasts of both normal and abnormal weather conditions over longer horizons.
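
Below is a minimal sketch of the parameter-adjustment idea, with a toy simulator standing in for a real weather model: find simulator parameters that minimize the mismatch with observations. Real weather models are far more complex, and the project targets ML-based tuning rather than this plain black-box optimization.

```python
# Minimal sketch: calibrate simulator parameters against observations.
# The "simulator" here is a toy stand-in for a real forecast model.
import numpy as np
from scipy.optimize import minimize

def toy_simulator(params, t):
    amplitude, phase = params
    return amplitude * np.sin(t + phase)  # stand-in for a forecast model

t = np.linspace(0, 10, 200)
observations = 2.0 * np.sin(t + 0.5) + np.random.default_rng(3).normal(0, 0.1, t.size)

def forecast_error(params):
    return np.mean((toy_simulator(params, t) - observations) ** 2)

result = minimize(forecast_error, x0=[1.0, 0.0], method="Nelder-Mead")
print(result.x)  # recovered parameters, close to (2.0, 0.5)
```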

Investigation of two-dimensional materials: predicting properties and generating structures with specified parameters

Developing new materials capable of storing electric energy is among the most important tasks of the modern energy industry. Two-dimensional crystals built on the principles of graphene lattices can be used to produce such materials. The search for crystal lattice configurations is complicated by the multitude of possible options and the length of the testing cycle for a single configuration, which requires many resource-intensive in silico and in vitro tests. The project's algorithms aim to predict the energy properties of a crystal of a given configuration and to solve the inverse problem: determining the optimal crystal configuration for a given energy characteristic. Combining these algorithms will significantly reduce the time needed to find and synthesize practically useful energy-storage materials.
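
The sketch below shows the two coupled tasks on toy data: a surrogate model predicting an energy property from a lattice configuration, and an inverse search screening candidates for a target value. The binary site-occupancy encoding and random labels are illustrative assumptions; a real pipeline would use physically meaningful descriptors and simulation-labelled data.

```python
# Minimal sketch: forward property prediction plus inverse configuration
# search on toy data. Encoding, labels, and search strategy are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
configs = rng.integers(0, 2, size=(1000, 16))  # 16-site occupancy vectors
energy = configs.sum(axis=1) * 0.1 + rng.normal(0, 0.05, 1000)  # toy property

surrogate = RandomForestRegressor(n_estimators=200).fit(configs, energy)

def inverse_search(target, n_candidates=20000):
    """Screen random candidate lattices for the one closest to the target."""
    candidates = rng.integers(0, 2, size=(n_candidates, 16))
    predictions = surrogate.predict(candidates)
    return candidates[np.argmin(np.abs(predictions - target))]

print(inverse_search(target=0.8))  # occupancy pattern with predicted energy near 0.8
```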

Publications and patents

Bocharnikov V., Derkach D., Golubeva M., Guber F., Morozov S., Parfenov P., Ratnikov F.
2024-08-18 citations by CoLab: 0
At present, a new compact highly granular neutron detector is being developed for the BM@N experiment. The detector will be used to identify neutrons, measure their energies with the time-of-flight method, and measure neutron yields and the azimuthal flow of neutrons in heavy-ion collisions at beam energies up to 4A GeV. The application of machine learning techniques and preliminary results on neutron identification and energy reconstruction are discussed. First predictions of the anisotropic flow of neutrons using the DCM-QGSM-SMM model of heavy-ion collisions are shown.
Ryzhikov A., Hushchyn M., Derkach D.
IEEE Access Q1 Q2 Open Access
2023-09-22 citations by CoLab: 1
Mistryukova L., Plotnikov A., Khizhik A., Knyazeva I., Hushchyn M., Derkach D.
Solar Physics Q2 Q2
2023-08-28 citations by CoLab: 3
Magnetic fields are responsible for a multitude of solar phenomena, including potentially destructive events such as solar flares and coronal mass ejections, with the number of such events rising as we approach the peak of the 11-year solar cycle in approximately 2025. High-precision spectropolarimetric observations are necessary to understand the variability of the Sun. The field of quantitative inference of magnetic field vectors and related solar atmospheric parameters from such observations has been investigated for a long time. In recent years, very sophisticated codes for spectropolarimetric observations have been developed. Over the past two decades, neural networks have been shown to be a fast and accurate alternative to classic inversion methods. However, most of these codes can be used to obtain point estimates of the parameters, so ambiguities, degeneracies, and uncertainties of each parameter remain uncovered. In this paper, we provide end-to-end inversion codes based on the simple Milne-Eddington model of the stellar atmosphere and deep neural networks for both parameter estimation and uncertainty-interval estimation. The proposed framework is designed in such a way that it can be expanded and adapted to other atmospheric models or combinations of them. Additional information can also be incorporated directly into the model. It is demonstrated that the proposed architecture provides high-accuracy results, including a reliable uncertainty estimation, even in the multidimensional case. The models are tested using simulations and real data samples.
Demianenko M., Malanchev K., Samorodova E., Sysak M., Shiriaev A., Derkach D., Hushchyn M.
2023-08-28 citations by CoLab: 4
Context. Modern-day time-domain photometric surveys collect a lot of observations of various astronomical objects and the coming era of large-scale surveys will provide even more information on their properties. Spectroscopic follow-ups are especially crucial for transients such as supernovae and most of these objects have not been subject to such studies. Aims. Flux time series are actively used as an affordable alternative for photometric classification and characterization, for instance, peak identifications and luminosity decline estimations. However, the collected time series are multidimensional and irregularly sampled, while also containing outliers and without any well-defined systematic uncertainties. This paper presents a search for the best-performing methods to approximate the observed light curves over time and wavelength for the purpose of generating time series with regular time steps in each passband. Methods. We examined several light curve approximation methods based on neural networks such as multilayer perceptrons, Bayesian neural networks, and normalizing flows to approximate observations of a single light curve. Test datasets include simulated PLAsTiCC and real Zwicky Transient Facility Bright Transient Survey light curves of transients. Results. The tests demonstrate that even just a few observations are enough to fit the networks and improve the quality of approximation, compared to state-of-the-art models. The methods described in this work have a low computational complexity and are significantly faster than Gaussian processes. Additionally, we analyzed the performance of the approximation techniques from the perspective of further peak identification and transients classification. The study results have been released in an open and user-friendly Fulu Python library available on GitHub for the scientific community.
Aaij R., Abdelmotteleb A.S., Abellan Beteta C., Abudinén F., Ackernley T., Adeva B., Adinolfi M., Adlarson P., Afsharnia H., Agapopoulou C., Aidala C.A., Ajaltouni Z., Akar S., Akiba K., Albicocco P., et. al.
2023-08-25 citations by CoLab: 0
The B+ → J/ψη′K+ decay is observed for the first time using proton-proton collision data collected by the LHCb experiment at centre-of-mass energies of 7, 8, and 13 TeV, corresponding to a total integrated luminosity of 9 fb−1. The branching fraction of this decay is measured relative to the known branching fraction of the B+ → ψ(2S)K+ decay and found to be
$$\frac{\mathcal{B}(B^{+}\to J/\psi\,\eta' K^{+})}{\mathcal{B}(B^{+}\to \psi(2S)K^{+})} = (4.91 \pm 0.47 \pm 0.29 \pm 0.07)\times 10^{-2},$$
where the first uncertainty is statistical, the second is systematic, and the third is related to external branching fractions. A first look at the J/ψη′ mass distribution is performed and no signal of intermediate resonances is observed.
Dorigo T., Giammanco A., Vischia P., Aehle M., Bawaj M., Boldyrev A., de Castro Manzano P., Derkach D., Donini J., Edelen A., Fanzago F., Gauger N.R., Glaser C., Baydin A.G., Heinrich L., et. al.
2023-06-01 citations by CoLab: 18
The full optimization of the design and operation of instruments whose functioning relies on the interaction of radiation with matter is a super-human task, due to the large dimensionality of the space of possible choices for geometry, detection technology, materials, data-acquisition, and information-extraction techniques, and the interdependence of the related parameters. On the other hand, massive potential gains in performance over standard, “experience-driven” layouts are in principle within our reach if an objective function fully aligned with the final goals of the instrument is maximized through a systematic search of the configuration space. The stochastic nature of the involved quantum processes makes the modeling of these systems an intractable problem from a classical statistics point of view, yet the construction of a fully differentiable pipeline and the use of deep learning techniques may allow the simultaneous optimization of all design parameters. In this white paper, we lay down our plans for the design of a modular and versatile modeling tool for the end-to-end optimization of complex instruments for particle physics experiments as well as industrial and medical applications that share the detection of radiation as their basic ingredient. We consider a selected set of use cases to highlight the specific needs of different applications.
Popov S., Lazarev M., Belavin V., Derkach D., Ustyuzhanin A.
2023-03-07 citations by CoLab: 2
There are many problems in physics, biology, and other natural sciences in which symbolic regression can provide valuable insights and discover new laws of nature. Widespread deep neural networks do not provide interpretable solutions. Meanwhile, symbolic expressions give us a clear relation between observations and the target variable. However, at the moment, there is no dominant solution for the symbolic regression task, and we aim to reduce this gap with our algorithm. In this work, we propose a novel deep learning framework for symbolic expression generation via variational autoencoder (VAE). We suggest using a VAE to generate mathematical expressions, and our training strategy forces generated formulas to fit a given dataset. Our framework allows encoding apriori knowledge of the formulas into fast-check predicates that speed up the optimization process. We compare our method to modern symbolic regression benchmarks and show that our method outperforms the competitors under noisy conditions. The recovery rate of SEGVAE is 65% on the Ngyuen dataset with a noise level of 10%, which is better than the previously reported SOTA by 20%. We demonstrate that this value depends on the dataset and can be even higher.
Bona M., Ciuchini M., Derkach D., Ferrari F., Franco E., Lubicz V., Martinelli G., Morgante D., Pierini M., Silvestrini L., Simula S., Stocchi A., Tarantino C., Vagnoni V., Valli M., et. al.
2023-02-14 citations by CoLab: 36
Flavour mixing and CP violation as measured in weak decays and mixing of neutral mesons are a fundamental tool to test the Standard Model and to search for new physics. New analyses performed at the LHC experiment open an unprecedented insight into the Cabibbo–Kobayashi–Maskawa metrology and new evidence for rare decays. Important progress has also been achieved in theoretical calculations of several hadronic quantities with a remarkable reduction of the uncertainties. This improvement is essential since previous studies of the Unitarity Triangle did show that possible contributions from new physics, if any, must be tiny and could easily be hidden by theoretical and experimental errors. Thanks to the experimental and theoretical advances, the Cabibbo–Kobayashi–Maskawa picture provides very precise Standard Model predictions through global analyses. We present here the results of the latest global Standard Model analysis performed by the UTfit collaboration including all the most updated inputs from experiments, lattice Quantum Chromo-Dynamics and phenomenological calculations.
Demianenko M., Samorodova E., Sysak M., Shiriaev A., Malanchev K., Derkach D., Hushchyn M.
2023-02-01 citations by CoLab: 3
Photometric data-driven classification of supernovae becomes a challenge due to the appearance of real-time processing of big data in astronomy. Recent studies have demonstrated the superior quality of solutions based on various machine learning models. These models learn to classify supernova types using their light curves as inputs. Preprocessing these curves is a crucial step that significantly affects the final quality. In this talk, we study the application of multilayer perceptrons (MLP), Bayesian neural networks (BNN), and normalizing flows (NF) to approximate observations for a single light curve. We use these approximations as inputs for supernova classification models and demonstrate that the proposed methods outperform the state-of-the-art based on Gaussian processes when applied to the Zwicky Transient Facility Bright Transient Survey light curves. The MLP demonstrates quality similar to Gaussian processes at higher speed, while normalizing flows exceed Gaussian processes in approximation quality as well.
Ryzhikov A., Temirkhanov A., Derkach D., Hushchyn M., Kazeev N., Mokhnenko S.
2023-02-01 citations by CoLab: 1
The volume of data processed by the Large Hadron Collider experiments demands sophisticated selection rules typically based on machine learning algorithms. One of the shortcomings of these approaches is their profound sensitivity to the biases in training samples. In the case of particle identification (PID), this might lead to degradation of the efficiency for some decays not present in the training dataset due to differences in input kinematic distributions. In this talk, we propose a method based on the Common Specific Decomposition that takes into account individual decays and possible misshapes in the training data by disentangling common and decay specific components of the input feature set. We show that the proposed approach reduces the rate of efficiency degradation for the PID algorithms for the decays reconstructed in the LHCb detector.
Anderlini L., Barbetti M., Derkach D., Kazeev N., Maevskiy A., Mokhnenko S.
2023-02-01 citations by CoLab: 2
The increasing luminosities of future data taking at the Large Hadron Collider and next-generation collider experiments require an unprecedented amount of simulated events to be produced. Such large-scale productions demand a significant amount of valuable computing resources. This brings a demand to use new approaches to event generation and simulation of detector responses. In this paper, we discuss the application of generative adversarial networks (GANs) to the simulation of the LHCb experiment events. We emphasize the main pitfalls in the application of GANs and study the systematic effects in detail. The presented results are based on the Geant4 simulation of the LHCb Cherenkov detector.
Ratnikov F., Rogachev A., Mokhnenko S., Maevskiy A., Derkach D., Davis A., Kazeev N., Anderlini L., Barbetti M., Siddi B.G.
The abundance of data arriving in the new runs of the Large Hadron Collider creates tough requirements for the amount of necessary simulated events and thus for the speed of generating such events. Current approaches can suffer from long generation times and a lack of storage resources to preserve the simulated datasets. The development of new fast generation techniques is thus crucial for the proper functioning of the experiments. We present a novel approach to simulating LHCb detector events using generative machine learning algorithms and other statistical tools. The approach combines the speed and flexibility of neural networks and encapsulates knowledge about the detector in the form of statistical patterns. Whenever possible, the algorithms are trained on real data, which enhances their robustness against differences between real data and simulation. We discuss particularities of neural-network detector simulation implementations and the corresponding systematic uncertainties.
Mistryukova L., Knyazeva I., Plotnikov A., Khizhik A., Hushchyn M., Derkach D.
2022-10-19 citations by CoLab: 0
Methods of Stokes profile inversion based on spectral polarization analysis represent a powerful tool for obtaining information on magnetic and thermodynamic properties in the solar atmosphere. However, these methods involve solving the radiation transport equation. Over the past decades, several approaches have been developed to provide an analytical solution to the inverse problem, but despite its advantages, in many cases this requires large computing resources. Neural networks have been shown to be a good alternative to these methods, but in general they tend to be overly confident in their predictions. In this paper, the uncertainty estimation of atmospheric parameter prediction is presented. It is shown that deterministic networks containing partially independent MLP blocks allow one to estimate uncertainty in predictions while achieving high-accuracy results.
Aaij R., Abdelmotteleb A.S., Abellán Beteta C., Abudinén F., Ackernley T., Adeva B., Adinolfi M., Afsharnia H., Agapopoulou C., Aidala C.A., Aiola S., Ajaltouni Z., Akar S., Albrecht J., Alessio F., et. al.
2022-08-24 citations by CoLab: 7
The first study of the angular distribution of μ+μ− pairs produced in the forward rapidity region via the Drell-Yan reaction pp → γ*/Z + X → ℓ+ℓ− + X is presented, using data collected with the LHCb detector at a center-of-mass energy of 13 TeV, corresponding to an integrated luminosity of 5.1 fb−1. The coefficients of the five leading terms in the angular distribution are determined as a function of the dimuon transverse momentum and rapidity. The results are compared to various theoretical predictions of the Z-boson production mechanism and can also be used to probe transverse-momentum-dependent parton distributions within the proton.
Aaij R., Abellán Beteta C., Ackernley T., Adeva B., Adinolfi M., Afsharnia H., Aidala C.A., Aiola S., Ajaltouni Z., Akar S., Albrecht J., Alessio F., Alexander M., Alfonso Albero A., Aliouche Z., et. al.
2022-07-23 citations by CoLab: 12
Coherent production of J/ψ mesons is studied in ultraperipheral lead-lead collisions at a nucleon-nucleon centre-of-mass energy of 5 TeV, using a data sample collected by the LHCb experiment corresponding to an integrated luminosity of about 10 μb−1. The J/ψ mesons are reconstructed in the dimuon final state and are required to have transverse momentum below 1 GeV. The cross-section within the rapidity range of 2.0 < y < 4.5 is measured to be 4.45 ± 0.24 ± 0.18 ± 0.58 mb, where the first uncertainty is statistical, the second systematic and the third originates from the luminosity determination. The cross-section is also measured in J/ψ rapidity intervals. The results are compared to predictions from phenomenological models.

Partners

Lab address

Moscow, Pokrovsky Boulevard 11, room S-924