Orekhov, Boris V

PhD in Letters/Foreign languages and literature
Publications: 17
Citations: 15
h-index: 2
Savchuk S.O., Arkhangelskiy T., Bonch-Osmolovskaya A.A., Donina O.V., Kuznetsova Y.N., Lyashevskaya O.N., Orekhov B.V., Podryadchikova M.V.
2024-04-15 citations by CoLab: 6 Abstract  
The paper provides an overview of the results of the fundamental reconstruction and modernization project of the National Corpus of the Russian Language platform, carried out from 2020 to 2023. The focus of the paper is on the new opportunities that are opening up for linguists and a wider audience. These include improved representativeness of existing corpora, newly created corpora, new annotation obtained by applying neural network models, and new interface solutions. Three notable new components are examined in more detail: a resource (the new Social Networks corpus), a search feature (the Panchronic corpus, which combines searches across corpora from different periods), and an analytical tool (the functional complex of statistics and data visualization).
Orekhov B., Skorinkin D.
2023-07-01 citations by CoLab: 0 Abstract  
Traditionally, stylometry has been used to solve problems of authorship attribution. Quantitative attribution methods remain the last resort of researchers when reliable documentary evidence is not available. Over the last twenty years, the Delta method, developed by John F. Burrows, has established itself as the leading attribution method. On the whole, it has proved to be a fairly reliable way of attributing texts in disputed cases. However, as our research shows, the case of Fernando Pessoa stands out: he produced his texts "in the name of" fictitious identities, commonly called "heteronyms". Delta did not identify such works as expected, that is, as texts belonging to the pen of a single person, Fernando Pessoa, but as texts by different authors. The article carries out a series of experiments to test the extent to which Pessoa manages to confound the quantitative assessment of the authorship of his poetic texts. Pessoa's texts are examined both as an independent corpus and against the background of the work of other Lusophone poets. In all cases, the distances between texts belonging to Pessoa's heteronyms are comparable to the distances between texts by different authors, that is, much greater than the distances between texts by a single author.
Skorinkin D., Orekhov B.
2023-04-08 citations by CoLab: 2 Abstract  
Abstract It is a basic assumption of stylometry that texts written by the same person show greater stylometric similarity even if published under multiple pennames. Statistical authorship attribution relies strongly on the ability of Burrows’s Delta and its variants to cluster one author together regardless of pseudonyms. At the same time, the very first computational discoveries by the founder of modern stylometry showed that a single author is capable of producing multiple voices (Burrows, 1987, Computation into Criticism: A Study of Jane Austen’s Novels and an Experiment in Method. Clarendon Press). We investigate two authors whose stylistically autonomous pennames seem to deceive Delta and override authorial signals: the Portuguese poet Fernando Pessoa and the French novelist Romain Gary. Pessoa managed to create at least three pennames (the author himself used the term ‘heteronym’) that exhibit all the traits of individual human beings from a stylometric point of view. Gary’s alter ego Emile Ajar, an intentional literary mystification, also demonstrates traits of stylometric autonomy. At the same time, other pseudonyms used by Gary lack that autonomy completely. Our investigation shows that there appears to be a continuum between a purely formal use of a penname, which brings almost no distinction from the real name of an author, and a strong literary sub-personality such as those created by Pessoa.
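
As a concrete illustration of the Delta measure discussed above, the following sketch computes classic Burrows's Delta on three toy snippets: relative frequencies of the most frequent words are z-scored across the corpus, and Delta between two texts is the mean absolute difference of their z-scores. The snippets and their labels are illustrative placeholders, not the corpus used in the study.

```python
# Minimal sketch of classic Burrows's Delta, not the authors' implementation.
# The texts and their labels are toy placeholders.
from collections import Counter
import numpy as np

texts = {
    "heteronym_a": "o meu olhar e nitido como um girassol",
    "heteronym_b": "grandes sao os desertos e tudo e deserto",
    "other_poet":  "amor e fogo que arde sem se ver",
}

def rel_freqs(text):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

freqs = {name: rel_freqs(t) for name, t in texts.items()}

# Feature set: the n most frequent words of the whole corpus.
corpus_counts = Counter(w for t in texts.values() for w in t.lower().split())
mfw = [w for w, _ in corpus_counts.most_common(100)]

# Relative frequencies per text, z-scored word by word across the corpus.
X = np.array([[freqs[name].get(w, 0.0) for w in mfw] for name in texts])
mu, sigma = X.mean(axis=0), X.std(axis=0)
sigma[sigma == 0] = 1.0                     # guard against constant columns
Z = (X - mu) / sigma

def delta(i, j):
    """Burrows's Delta: mean absolute difference of z-scores."""
    return float(np.mean(np.abs(Z[i] - Z[j])))

names = list(texts)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(names[i], "vs", names[j], round(delta(i, j), 3))
```
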
Chelnokova D., Vdovin A., Orekhov B.
Slovene scimago Q3 Open Access
2023-01-01 citations by CoLab: 0 Abstract  
Using a dataset of 2,036 titles of Russian novels from 1763 to 1917, the article examines how average title length evolved over 150 years of the history of the original Russian novel. Unlike British novels, whose titles, according to Franco Moretti’s hypothesis, became shorter due to market competition, the titles of Russian novels from the 1840s onwards became shorter primarily under the influence of the ‘thick journals’ as a particular cultural form and institutional framework. Leading and authoritative Russian critics set the trend towards shorter and more symbolically loaded titles, discrediting the archaic, long titles common in picaresque novels. In addition, the shortening of titles changed the relationship between their elements: from the 1830s onwards, additional metatextual information (abstract, genre, author) moved almost entirely from the title to the subtitle. As a result, titles acquired a special artistic status and greater semantic significance.
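
The trend described above is, at its core, a simple aggregation: average title length per period. A minimal pandas sketch of that computation is given below; the handful of rows is an invented stand-in for the 2,036-title dataset.

```python
# Sketch of the aggregation behind the title-length trend: mean title length
# per decade. The rows below are invented stand-ins for the 2,036-title dataset.
import pandas as pd

df = pd.DataFrame({
    "year": [1770, 1792, 1834, 1846, 1862, 1899],
    "title": [
        "Prigozhaia povarikha, ili Pokhozhdenie razvratnoi zhenshchiny",
        "Bednaia Liza",
        "Ledianoi dom",
        "Bednye liudi",
        "Ottsy i deti",
        "Voskresenie",
    ],
})

df["title_len_words"] = df["title"].str.split().str.len()
df["decade"] = (df["year"] // 10) * 10
print(df.groupby("decade")["title_len_words"].mean())
```
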
Leibov R., Orekhov B.
Shagi/ Steps scimago Q2
2022-05-23 citations by CoLab: 0
Vyrenkova A., Rakhilina E., Orekhov B.
2022-05-15 citations by CoLab: 1
Leibov R., Orekhov B., Šeļa A.
2021-12-31 citations by CoLab: 0 Abstract  
The paper attempts to formalize an existing theory of poetry, known as ‘the semantic field of meter’, which holds that the various metrical forms of modern lyric poetry accumulate and preserve certain associations of meaning. We analyzed a broad corpus of Russian poetry (1750–1950) with the LDA topic modelling algorithm in order to represent each poem in a topic space and each meter as a distribution of topic probabilities. Using unsupervised classification and extensive sampling, we show that there is a strong relationship between form and meaning both within and across verse forms: two samples belonging to the same meter often appear very similar, and two verse forms from the same family most often end up in the same cluster as well. This relationship remains detectable when the corpus is chronologically controlled and is not a consequence of population size. We argue that a similar approach can be applied when comparing the semantic fields of different languages and poetic traditions, offering relevant answers to some of the most fundamental questions of literary history.
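
As an illustration of the approach summarized above, the sketch below maps a few poems into a topic space with LDA and averages the topic distributions per meter. The four transliterated lines and their meter labels are toy placeholders, not the 1750–1950 corpus.

```python
# Sketch of the setup described above: poems are represented in a topic space
# via LDA, and each meter is summarized by its average topic distribution.
# The four transliterated lines and meter labels are toy placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

poems = [
    "buria mgloiu nebo kroet vikhri snezhnye krutia",
    "ia pomniu chudnoe mgnovene peredo mnoi iavilas ty",
    "vykhozhu odin ia na dorogu skvoz tuman kremnistyi put blestit",
    "belaia bereza pod moim oknom prinakrylas snegom tochno serebrom",
]
meters = ["trochee4", "iamb4", "trochee5", "trochee3"]

X = CountVectorizer().fit_transform(poems)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)           # one topic distribution per poem

# Average topic distribution per meter.
for m in sorted(set(meters)):
    rows = doc_topics[[i for i, lab in enumerate(meters) if lab == m]]
    print(m, np.round(rows.mean(axis=0), 3))
```
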
Orekhov B.
2021-12-31 citations by CoLab: 0 Abstract  
The article discusses the statistically identified properties of Bashkir versification in comparison with the existing descriptions of other Turkic versification systems. The focus is on imparisyllabic forms, predominant meters, and peculiarities of rhyme. The study allows us to conclude that the Bashkir Uzun-Kyuy (a regular alternation of 10- and 9-syllable lines) is unique: its equivalents are not found in other Turkic poetic traditions except the Tatar tradition, with which Bashkir verse has common roots. The frequency of the Bashkir 9-syllable verse is also unusual compared with poetry in other Turkic languages. Octosyllabic lines, which are often used together with 7-syllable verse, are common across Turkic systems and can also be found in Bashkir poetry, most prominently in Kyska-Kyuy (a regular alternation of 8- and 7-syllable lines). More data is needed to judge to what extent the rhythm of Bashkir verse is comparable with the verse rhythm in other Turkic poetic traditions.
Orekhov B.V.
2020-11-01 citations by CoLab: 0 Abstract  
The collected works of Leo Tolstoy were printed and published in 90 volumes of some 46,000 pages between 1928 and 1958. This paper, however, is not about the 90 volumes themselves, but about Volume 91 of this edition, a supplement containing indexes of works and proper names drawn from both the fictional works and the many volumes of Tolstoy’s letters. “Volume 91” is a web application based on the digitised index of proper names for the 90-volume collection of Tolstoy’s works (http://index.tolstoy.ru/). The digitised data features additional properties that can be explored by enthusiasts as well as specialists. This paper not only presents a new tool for literary scholars but also generalizes and shows how this kind of resource can be used to gain new insights into larger text collections.
Orekhov B., Fischer F.
Orbis Litterarum scimago Q2
2020-10-13 citations by CoLab: 3
Glazunov E.V., Orekhov B.V.
2020-09-30 citations by CoLab: 0
Bonch-Osmolovskaya A., Skorinkin D., Pavlova I., Kolbasov M., Orekhov B.
Web Semantics scimago Q2 wos Q2
2019-12-01 citations by CoLab: 2 Abstract  
The paper presents the results of a project devoted to the creation of a digital edition of Leo Tolstoy’s complete works. Our primary source is the 90-volume critical print edition of Tolstoy’s oeuvre. We discuss the rationale for semantic markup of metadata for three classes of texts: works, letters and diaries. We extract information from the critical apparatus and supplement it with new additional markup that enables visualizing Tolstoy’s evolution as a publicist. We show that the named entity index constitutes a valuable knowledge base, which can serve as a basis for generating a knowledge graph that is more detailed and systematic than linked open databases such as DBpedia.
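
As a rough illustration of how a proper-name index can seed a knowledge graph, the sketch below links index entries to the texts that mention them using networkx. The entries are invented examples, not rows from the actual index.

```python
# Sketch of turning a proper-name index into a graph: each indexed person is
# linked to the texts or volumes that mention them. The entries are invented
# examples, not rows from the actual index.
import networkx as nx

index = {
    "Volume 61 (letters)": ["A. A. Fet", "N. N. Strakhov"],
    "Volume 62 (letters)": ["N. N. Strakhov", "I. S. Turgenev"],
    "War and Peace":       ["Napoleon I", "M. I. Kutuzov"],
}

G = nx.Graph()
for source, persons in index.items():
    G.add_node(source, kind="text")
    for p in persons:
        G.add_node(p, kind="person")
        G.add_edge(source, p)

# Two people "co-occur" if they share a neighbouring text node.
print(sorted(nx.common_neighbors(G, "A. A. Fet", "N. N. Strakhov")))
```
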
Izvestiya RAN. Seriya Literatury i Yazyka
2019-01-01 citations by CoLab: 0
Orekhov B.V., Uspensky P.F., Fainberg V.V.
Russkaia Literatura scimago Q4
2018-12-03 citations by CoLab: 0
Morozov D., Garipov T., Lyashevskaya O., Savchuk S., Iomdin B., Glazkova A.
2024-12-30 citations by CoLab: 0 Abstract   Cites 1
Introduction: Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Because of differences in task formulation and in the datasets used, comparing the quality of these algorithms is challenging. It is unclear whether the models' errors stem from the ineffectiveness of the algorithms themselves or from errors and inconsistencies in the morpheme dictionaries. Thus, it remains uncertain whether any algorithm can be used to automatically expand the existing morpheme dictionaries. Purpose: To compare various existing morpheme segmentation algorithms for the Russian language and analyze their applicability to the task of automatically augmenting existing morpheme dictionaries. Results: In this study, we compared several state-of-the-art machine learning algorithms using three datasets structured around different segmentation paradigms. Two experiments were carried out, each employing five-fold cross-validation. In the first experiment, we randomly partitioned the dataset into five subsets. In the second, we grouped all words sharing the same root into a single subset, excluding words that contained multiple roots. During cross-validation, models were trained on four of these subsets and evaluated on the remaining one. Across both experiments, the algorithms that relied on ensembles of convolutional neural networks consistently demonstrated the highest performance. However, we observed a notable decline in accuracy when testing on words containing unfamiliar roots. We also found that, on a randomly selected set of words, the performance of these algorithms was comparable to that of human experts. Conclusion: Our results indicate that although automatic methods have, on average, reached a quality close to expert level, the lack of semantic consideration makes it impossible to use them for automatic dictionary expansion without expert validation. Further work should be aimed at addressing the key issues identified: poor performance on unknown roots and acronyms. At the same time, when only a small number of unfamiliar roots can be assumed in the test data, an ensemble of convolutional neural networks should be used. The presented results can be applied in the development of morpheme-oriented tokenizers and systems for analyzing text complexity.
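
The two evaluation regimes described above (a random five-fold split versus folds that keep all words with the same root together) can be sketched with scikit-learn's KFold and GroupKFold. The word list and root labels below are toy placeholders; with only four distinct roots in the toy list, the grouped split uses four folds rather than five.

```python
# Sketch of the two cross-validation regimes: a random split versus folds that
# keep every word sharing a root in the same fold. Words and roots are toy
# placeholders; with only four distinct roots, the grouped split has four folds.
from sklearn.model_selection import KFold, GroupKFold

words = ["водный", "водяной", "подводный", "лесной", "лесник", "перелесок",
         "домик", "домашний", "бездомный", "ходить"]
roots = ["вод", "вод", "вод", "лес", "лес", "лес", "дом", "дом", "дом", "ход"]

print("random 5-fold test sets:")
for _, test in KFold(n_splits=5, shuffle=True, random_state=0).split(words):
    print("  ", [words[i] for i in test])

print("root-grouped test sets:")
for _, test in GroupKFold(n_splits=4).split(words, groups=roots):
    print("  ", [words[i] for i in test])
```
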
Guéville E., Wrisley D.J.
2024-11-29 citations by CoLab: 0 Abstract   Cites 1
Abstract The topic of this paper is a thirteenth-century manuscript from the French National Library (Paris, BnF français 24428) containing three popular texts: an encyclopedic work, a bestiary and a collection of animal fables. We have automatically transcribed the manuscript using a custom handwritten text recognition (HTR) model for old French. Rather than a content-based analysis of the manuscript’s transcription, we adapt quantitative methods normally used for authorship attribution and clustering to the analysis of scribal contribution in the manuscript. Furthermore, we explore the traces that are left when texts are copied, transcribed and/or edited, and the importance of that trace for computational textual analysis with orthographically unstable historical languages. We argue that the method of transcription is fundamental for being able to think about complex modes of authorship which are so important for understanding medieval textual transmission. The paper is inspired by trends in digital scholarship in the mid-2020s, such as public transcribe-a-thons in the GLAM (Galleries, Libraries, Archives and Museums) sector, the opening up of digitized archival collections with methods such as HTR, and computational textual analysis of the transcriptions.
Bochkarev V.V., Savinkov A.V., Shevlyakova A.V.
2024-11-22 citations by CoLab: 0 Abstract   Cites 1
In this work, we conducted comparative testing of 20 sets of pre-trained vectors for computationally estimating valence ratings of words in the Russian language. Word valence was estimated using neural network predictors: a vector representing a word was fed to the input of a multilayer feed-forward neural network that calculated the valence rating of this word. The currently largest Russian dictionary with valence ratings, KartaSlovSent, was used as the source of word valence ratings for training the models. The highest accuracy of valence rating estimation was obtained using a set of fasttext vectors trained on the CommonCrawl corpus, which includes 103 billion words. Spearman's correlation coefficient between human ratings and machine ratings was 0.859. The high estimation accuracy and the large size of the dictionary allow this set of vectors to be used to extrapolate human valence ratings to the widest range of words in the Russian language. Also worth mentioning are four sets of vectors presented on the RusVectores project page and trained on the texts of the Araneum Russicum Maximum and Taiga corpora: despite the significantly smaller size of the training corpora, these sets of vectors yield only slightly lower accuracy. The lowest results were obtained for sets of vectors trained on corpora of news texts.
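
As a sketch of the prediction setup described above, the code below trains a small feed-forward regressor to map word vectors to valence ratings and scores it with Spearman's correlation. Random vectors and synthetic ratings stand in for the real fastText embeddings and the KartaSlovSent dictionary, so the resulting number is meaningless.

```python
# Sketch of the valence-prediction setup: a feed-forward regressor maps word
# vectors to valence scores, evaluated with Spearman's correlation. Random
# vectors and synthetic ratings replace the real fastText embeddings and the
# KartaSlovSent dictionary, so the printed number is meaningless.
import numpy as np
from scipy.stats import spearmanr
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))                 # stand-in 300-dim word vectors
w = rng.normal(size=300)
y = X @ w + rng.normal(scale=0.5, size=1000)     # synthetic "valence" ratings

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
model.fit(X_tr, y_tr)

rho, _ = spearmanr(y_te, model.predict(X_te))
print("Spearman correlation on held-out words:", round(float(rho), 3))
```
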
Geist L.
2024-10-11 citations by CoLab: 0 Abstract   Cites 1
Abstract Languages show variation in the encoding of plurality in the domain of foodstuffs. Some foodstuffs are lexicalized by singular mass nouns (e.g., garlic) and others by plural count nouns (e.g., beans). In the paper it is argued on the basis of German and Russian that there is no difference in meaning between these two forms: both denote aggregates as clusters of objects. Since objects are built into clusters, they are inaccessible for counting and both types of nouns uniformly behave like mass nouns. Such a uniform behavior would be unexplainable if these forms differed in meaning and the plural form were a regular count plural. This investigation suggests that two types of plural have to be distinguished: the mass aggregate plural, which indicates a clustered plurality of objects, and the count plural, which designates sets of disjoined objects. Regular plural markers may in principle be ambiguous between these two interpretations. However, if a plural marker is attached to a singulative or unit-denoting morpheme of a noun, the plural is unambiguously interpreted as count plural. The mass aggregate plural may receive a special morphological marking in some languages, as in Russian.
DAVIDYUK T.I.
2024-08-23 citations by CoLab: 0 Abstract   Cites 1
The article describes a corpus and experimental study of the variability of person-number agreement in Russian in the context of coordinated subjects, whose elements differ in person. Experimental and corpus studies have demonstrated that in such constructions with the conjunction и ‘and’, in addition to resolved agreement, 3rd person plural agreement and (for VS word order) closest conjunct agreement are possible. This work aims to illustrate the variability of person-number agreement with coordinated subjects that involve other conjunctions, namely и … и ‘both … and’, или ‘or’ and или … или ‘either … or’. The article describes the results of a corpus study and four linguistic experiments. The acceptability judgement experiments aim to complement corpus data on the investigated constructions, which, in some cases, were found to be limited. The results revealed greater variability in agreement with constructions containing disjunctive conjunctions compared to analogous constructions with conjunctive conjunctions. The most frequent and acceptable agreement option in all cases is resolved agreement. The possibility of 3rd person plural agreement was also discovered, as documented in previous experimental studies and corpus data. Other possible agreement strategies for constructions with disjunctive conjunctions include closest disjunct agreement and (for disjunct order X или я ‘X or I’ / или Х, или я ‘either X or I’) 3rd person singular agreement.
Isachenko O.M.
The article describes semantic, pragmatic, and derivational characteristics of zoonyms denoting sets: выводок, гурт, караван, косяк, отара, помёт, рой, свора, стадо, стая, табун (11 lexemes). In the existing explanatory dictionaries, their secondary meanings are presented inconsistently and incompletely. In particular, adverbial uses of metaphorical meanings that actually function in the modern Russian language are not noted. Such metaphors mainly denote the movement of large masses of objects, most often people. Based on the material of explanatory dictionaries and the Russian National Corpus, methods and formulas are proposed for correcting and unifying the lexicographic description of Russian lexemes with the meaning of ‘many animals’, as well as for introducing special pragmatic labels to differentiate their use. A multidimensional analysis of the units of the microgroup “Animal Groups” clearly demonstrates the complex interaction between semantics and grammar in the processes of metaphorization and adverbial conversion. The sample includes 638 contexts with set metaphors, associated in the grammatical aspect with the class of counter words and in the lexical aspect with the thematic field of zoonyms.
Ryzhova D., Rakhilina E., Reznikova T., Badryzlova Y.
Folia Linguistica scimago Q1 wos Q3
2024-01-19 citations by CoLab: 0 Abstract   Cites 1
Abstract The paper contributes to the typology of encoding motion events by highlighting the role of the verbal root meaning in lexicalization of motion. We focus on lexical semantics of the verbs of falling, which we study on a sample of 42 languages using the frame-based approach to lexical typology. We show that, along with downward motion, the verbs of falling regularly denote adjacent situations; and vice versa, the idea of downward motion is systematically conveyed by verbs from adjacent semantic fields. These findings challenge the application of the classical parameters of motion events (e.g. Path) to any given motion event description and offer new insights into the understanding of lexicalization patterns in general.
Jiang K.
2024-01-01 citations by CoLab: 0 PDF Abstract   Cites 1
Abstract This paper explores how artificial intelligence enhances emotional expression and aesthetic imagery in modern Chinese literature. The study focuses on the latent semantic analysis of literary texts, text mining challenges, techniques and algorithms for latent semantic analysis, and the application of artificial intelligence in literary criticism. Through text representation, analysis of text mining challenges, latent semantic analysis (LSA), and the application of specific algorithms, this study provides an in-depth understanding of the intrinsic semantics of literary texts. The study employs text representation methods, such as the TF-IDF formulation, to deal with high-dimensional sparse text data. The main challenges include high dimensionality and discovering latent semantic relationships between words. Latent semantic analysis reduces the dimensionality of the data through singular value decomposition (SVD) to extract the implicit relationships between words in a text. The results show that LSA effectively extracts the latent semantic structure of the text, reduces “noise”, and simplifies the text vectors. In addition, text clustering analysis using a Kohonen self-organizing feature map network reveals the semantic relationships and sentiment distribution among literary works. Intelligent technology can effectively enhance the emotional depth and aesthetic value of literary works and provide a new interpretive perspective on modern literature.
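
The TF-IDF plus LSA pipeline mentioned above reduces to two standard steps, sketched below with scikit-learn: documents become TF-IDF vectors, and truncated SVD projects them into a low-dimensional latent semantic space. The three toy sentences are placeholders.

```python
# Sketch of the TF-IDF + LSA pipeline: documents become TF-IDF vectors and
# truncated SVD projects them into a low-dimensional latent semantic space.
# The three toy sentences are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the river carried the moonlight toward the silent village",
    "moonlight on the river and a silent village asleep",
    "stock prices fell sharply as markets reacted to the report",
]

tfidf = TfidfVectorizer().fit_transform(docs)    # high-dimensional sparse vectors
lsa = TruncatedSVD(n_components=2, random_state=0)
reduced = lsa.fit_transform(tfidf)               # dense 2-dimensional vectors

print(reduced.round(3))   # the two "poetic" sentences should land close together
```
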
Henrickson L., Meroño-Peñuela A.
AI and Society scimago Q1 wos Q2
2023-09-04 citations by CoLab: 39 Abstract   Cites 1
Abstract Recent advances in natural language generation (NLG), such as public accessibility to ChatGPT, have sparked polarised debates about the societal impact of this technology. Popular discourse tends towards either overoptimistic hype that touts the radically transformative potentials of these systems or pessimistic critique of their technical limitations and general ‘stupidity’. Surprisingly, these debates have largely overlooked the exegetical capacities of these systems, which for many users seem to be producing meaningful texts. In this paper, we take an interdisciplinary approach that combines hermeneutics (the study of meaning and interpretation) with prompt engineering (task descriptions embedded in input to NLG systems) to study the extent to which a specific NLG system, ChatGPT, produces texts of hermeneutic value. We design prompts with the goal of optimising hermeneuticity rather than mere factual accuracy, and apply them in four different use cases combining humans and ChatGPT as readers and writers. In most cases, ChatGPT produces readable texts that respond clearly to our requests. However, increasing the specificity of prompts’ task descriptions leads to texts with intensified neutrality, indicating that ChatGPT’s optimisation for factual accuracy may actually be detrimental to the hermeneuticity of its output.
Orekhov B., Skorinkin D.
2023-07-01 citations by CoLab: 0 Abstract   Cites 1
Traditionally, stylometry has been used to solve problems of authorship attribution. Quantitative attribution methods remain the last resort of researchers when reliable documentary evidence is not available. Over the last twenty years, the Delta method, developed by John F. Burrows, has established itself as the leading attribution method. On the whole, it has proved to be a fairly reliable way of attributing texts in disputed cases. However, as our research shows, the case of Fernando Pessoa stands out: he produced his texts "in the name of" fictitious identities, commonly called "heteronyms". Delta did not identify such works as expected, that is, as texts belonging to the pen of a single person, Fernando Pessoa, but as texts by different authors. The article carries out a series of experiments to test the extent to which Pessoa manages to confound the quantitative assessment of the authorship of his poetic texts. Pessoa's texts are examined both as an independent corpus and against the background of the work of other Lusophone poets. In all cases, the distances between texts belonging to Pessoa's heteronyms are comparable to the distances between texts by different authors, that is, much greater than the distances between texts by a single author.
Kolmogorova A.
2023-06-02 citations by CoLab: 0 Abstract   Cites 1
The paper introduces the phenomenon of semantic editions: a new digital representation of texts and personalities of the Great Literature, e.g., The World of Dante, Mapping the Republic of Letters, Chekhov Digital, Tolstoy Digital, and Pushkin Digital. The author analyzed these platforms to reveal the methods and ideology behind this new format. The everyday practice of social networks and messengers serves as a cognitive strategy: the literary heritage undergoes fragmentation, tagging, clustering, and visualization. Semantic editions and network communication share the following features. Information is fragmented and classified by tagging for subsequent clustering. The links are horizontal, e.g., co-occurrence graphs, contact networks, etc. Quantitative data provide qualitative conclusions, e.g., frequencies of mention for toponyms and anthroponyms. Expert comments and hypertext merge into a polyphony.
Litvinenko V.A., Titov R.V., Zubkov A.V., Orlova Y.A., Kulikova Y.V.
2022-09-23 citations by CoLab: 0 Abstract   Cites 1
The work is devoted to the creation of an automated method for compiling a psychological portrait of a person by collecting and analyzing information from social networks. The article describes the model of the resulting psychological portrait, the metrics used to create it, and the process of analyzing these metrics. The method was implemented as software in the form of a REST server and was tested on the social network VKontakte.
Cui M.
2022-09-09 citations by CoLab: 3 Abstract   Cites 1
Poetry is the jewel in the crown of our country’s classical culture and has been praised and studied by countless people for thousands of years. In recent years, with the rapid development of computer technology and the leap-forward improvement of hardware computing power, natural language processing (NLP) technology has achieved remarkable results in practice. We applied NLP to the text analysis of classical poetry, proposed a set of methods to automatically recognize the artistic conception in classical poetry, and built a dataset of classical poetry artistic conceptions for experimentation using a web crawler. In the experiments, we studied the application of different machine learning algorithms to text classification, combined such algorithms with different document vectorization methods, compared their performance on the topic classification problem for poetry, and found that reasonable accuracy rates can be achieved within the classical machine learning framework. Comparing the effects of character-based and word-based vectors, we concluded that ancient poetry vectors constructed from characters give a higher accuracy rate. We further introduced deep learning methods into the research, analyzed the pros and cons of various neural networks, and studied neural network architectures that perform well in NLP practice, such as the TextCNN and BiLSTM models. We also introduced mature NLP pre-training models such as BERT to classify the artistic conception of classical poetry. In addition, we constructed an emotion dictionary matching method based on word vectors for sentiment analysis. The experimental results show that the method proposed in this paper is effective at automatically recognizing the mood of classical poetry and can be used to recommend similar poems and to select poems with a given emotion as their theme.
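
As a minimal sketch of the character-level modelling compared above, the snippet below trains a bag-of-character-n-grams classifier on a few labelled lines (a word-level variant would simply use a default, word-based vectorizer on segmented text). The lines, the labels, and the classifier are invented stand-ins for the paper's data and models.

```python
# Sketch of a character-level classifier as a stand-in for the paper's models;
# a word-level variant would use a default word vectorizer on segmented text.
# The labelled lines and the labels themselves are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

poems = ["白日依山尽", "黄河入海流", "月落乌啼霜满天", "江枫渔火对愁眠"]
labels = ["landscape", "landscape", "melancholy", "melancholy"]

char_model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 2)),   # character n-grams
    LogisticRegression(max_iter=1000),
)
char_model.fit(poems, labels)
print(char_model.predict(["孤帆远影碧空尽"]))   # classify an unseen line
```
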
Sherstinova T., Moskvina A., Kirina M.
2021-05-12 citations by CoLab: 2 Abstract   Cites 1
A significant part of modern technologies associated with the development of artificial intelligence systems and digital analytics of diverse data relies on methods of computer text processing (NLP, speech technologies). However, NLP methods are applied primarily to specialized texts, such as scientific literature, technical documentation, news, etc., or social media discourse. Fiction texts are usually left out of the focus of NLP practitioners as the fictional world seems to be of less significance or less “information value” from a practical point of view. Moreover, due to the poetic and metaphorical nature of literary texts, the use of some NLP methods (e.g., topic modelling) for fiction analysis turned out to be more complicated. At the same time, the influence of literature both on the consciousness of individuals and on the formation of social values can hardly be overestimated. Besides, making computers “understand” fiction in a similar way as humans do would be a real challenge for artificial intelligence. The article puts forward the idea of modelling thematic areas of literature on a national scale, which should reveal the main thematic domains of national literature as a whole. It will allow a better understanding of the cultural traits of the national consciousness in a given historical period and contribute to either literary studies or practical tasks. Methodological approaches to determining and modelling themes of literary works are considered, technical difficulties arising in the process are described, and the ways to solve them are suggested. The proposed methodology has been implemented in the design of the corpus of Russian short stories of 1900-1930s and can be applied in the development of artificial intelligence systems that process large volumes of literary texts in any language.
Skorinkin D., Orekhov B.
2023-04-08 citations by CoLab: 2 Abstract  
Abstract It is a basic assumption of stylometry that texts written by the same person show greater stylometric similarity even if published under multiple pennames. Statistical authorship attribution relies strongly on the ability of Burrows’s Delta and its variants to cluster one author together regardless of pseudonyms. At the same time, the very first computational discoveries by the founder of modern stylometry showed that a single author is capable of producing multiple voices (Burrows, 1987, Computation into Criticism: A Study of Jane Austen’s Novels and an Experiment in Method. Clarendon Press). We investigate two authors whose stylistically autonomous pennames seem to deceive Delta and override authorial signals: the Portuguese poet Fernando Pessoa and the French novelist Romain Gary. Pessoa managed to create at least three pennames (the author himself used the term ‘heteronym’) that exhibit all the traits of individual human beings from a stylometric point of view. Gary’s alter ego Emile Ajar, an intentional literary mystification, also demonstrates traits of stylometric autonomy. At the same time, other pseudonyms used by Gary lack that autonomy completely. Our investigation shows that there appears to be a continuum between a purely formal use of a penname, which brings almost no distinction from the real name of an author, and a strong literary sub-personality such as those created by Pessoa.
Morozov D.A., Glazkova A.V., Iomdin B.L.
2022-06-29 citations by CoLab: 7 Abstract  
Text complexity assessment is a challenging task requiring various linguistic aspects to be taken into consideration. The complexity level of a text should correspond to the reader’s competence: a text that is too complicated may be incomprehensible, whereas one that is too simple may be boring. For many years, simple features were used to assess readability, e.g. average length of words and sentences or vocabulary variety. Thanks to the development of natural language processing methods, the set of text parameters used for evaluating readability has expanded significantly. In recent years, many articles have been published whose authors investigated the contribution of various lexical, morphological, and syntactic features to the readability level. Nevertheless, because the methods and corpora are quite diverse, it is hard to draw general conclusions about the effectiveness of linguistic information for evaluating text complexity. Moreover, the cross-lingual impact of different features on various datasets has not been investigated. The purpose of this study is to conduct a large-scale comparison of features of different natures. We experimentally assessed seven commonly used feature types (readability, traditional features, morphological features, punctuation, syntax, frequency, and topic modeling) on six corpora for text complexity assessment in English and Russian, employing four common machine learning models: logistic regression, random forest, convolutional neural network and feedforward neural network. One of the corpora, a corpus of fiction read by Russian school students, was constructed for the experiment using a large-scale survey to ensure the objectivity of the labeling. We show which feature types can significantly improve performance and analyze their impact according to dataset characteristics, language, and data source.
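
The "traditional" features mentioned above (average word length, average sentence length, vocabulary variety) are easy to compute directly; a minimal sketch follows, with two invented sample passages.

```python
# Sketch of the simple "traditional" readability features named above:
# average word length, average sentence length, and vocabulary variety.
# The two sample passages are invented.
import re

def traditional_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sent_len": len(words) / len(sentences),
        "type_token_ratio": len(set(words)) / len(words),
    }

easy = "The cat sat. The cat ran. It was fun."
hard = ("Notwithstanding the considerable methodological heterogeneity, "
        "the investigation substantiated previously hypothesized regularities.")

print(traditional_features(easy))
print(traditional_features(hard))
```
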
Lyashevskaya O.N., Shavrina T.O., Trofimov I.V., Vlasova N.A.
2020-11-11 citations by CoLab: 8 Abstract  
The paper presents the results of GramEval 2020, a shared task on Russian morphological and syntactic processing. The objective is to process Russian texts starting from provided tokens to parts of speech (pos), grammatical features, lemmas, and labeled dependency trees. To encourage the multi-domain processing, five genres of Modern Russian are selected as test data: news, social media and electronic communication, wiki-texts, fiction, poetry; Middle Russian texts are used as the sixth test set. The data annotation follows the Universal Dependencies scheme. Unlike in many similar tasks, the collection of existing resources, the annotation of which is not perfectly harmonized, is provided for training, so the variability in annotations is a further source of difficulties. The main metric is the average accuracy of pos, features, and lemma tagging, and LAS. In this report, the organizers of GramEval 2020 overview the task, training and test data, evaluation methodology, submission routine, and participating systems. The approaches proposed by the participating systems and their results are reported and analyzed.
Karasik V.I.
Zanry Reci scimago Q2 Open Access
2019-08-24 citations by CoLab: 14
Sorokin A., Kravtsova A.
2018-09-26 citations by CoLab: 16 Abstract  
The present paper addresses the task of morphological segmentation for the Russian language. We show that deep convolutional neural networks solve this problem with an F1-score of 98% over morpheme boundaries and beat existing non-neural approaches.
Georgakopoulos T., Polis S.
2018-02-23 citations by CoLab: 60 Abstract  
The semantic map model is relatively new in linguistic research, but it has been intensively used during the past three decades for studying both cross-linguistic and language-specific questions. The goal of the present contribution is to give a comprehensive overview of the model. After introducing the different types of semantic maps, we present the steps involved for building the maps and discuss in more detail the different types of maps and their respective advantages and disadvantages, focusing on the kinds of linguistic generalizations captured. Finally, we provide a thorough survey of the literature on the topic, and we sketch future avenues for research in the field.
Shilikhina K.M.
Zanry Reci scimago Q2 Open Access
2018-01-01 citations by CoLab: 3
Jannidis F., Evert S., Proisl T., Reger I., Pielström S., Schöch C., Vitt T.
2017-06-09 citations by CoLab: 79
Hoover D.L.
2017-04-28 citations by CoLab: 19
Kirillov A.G.
Zanry Reci scimago Q2 Open Access
2017-01-01 citations by CoLab: 1
Olena O. Z., Olena I. G.
Zanry Reci scimago Q2 Open Access
2017-01-01 citations by CoLab: 5
Mazzei A., Valle A.
2016-12-30 citations by CoLab: 1
Ghazvininejad M., Shi X., Choi Y., Knight K.
2016-12-30 citations by CoLab: 44
2016-10-18 citations by CoLab: 22 Abstract  
This book addresses the question of whether there are continuities in Latin spanning the period from the early Republic through to the Romance languages. It is often maintained that various usages admitted by early comedy were rejected later by the literary language but continued in speech, to resurface centuries later in the written record (and in Romance). Are certain similarities between early and late Latin all that they seem, or might they be superficial, reflecting different phenomena at different periods? Most of the chapters, on numerous syntactic and other topics and using different methodologies, have a long chronological range. All attempt to identify patterns of change that might undermine any theory of submerged continuity. The patterns found are summarised in a concluding chapter. The volume addresses classicists with an interest in any of the different periods of Latin, and Romance linguists.
Rakhilina E., Reznikova T.
2016-08-08 citations by CoLab: 23
Total publications: 17
Total citations: 15
Citations per publication: 0.88
Average publications per year: 1
Average coauthors: 1.53
Publication years: 2008-2024 (17 years)
h-index: 2
i10-index: 0
m-index: 0.12
o-index: 3
g-index: 3
w-index: 0
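
For reference, the derived indicators above follow directly from the raw counts on this profile; the sketch below reproduces them, using the common definition of the m-index as the h-index divided by the number of years since the first publication.

```python
# Reproducing the derived indicators from the raw counts shown on this profile.
# m-index is taken here in its common definition: h-index divided by the
# number of years since the first publication.
publications = 17
citations = 15
h_index = 2
career_years = 2024 - 2008 + 1                     # "2008-2024 (17 years)"

print("citations per publication:", round(citations / publications, 2))   # 0.88
print("publications per year:", round(publications / career_years, 2))    # 1.0
print("m-index:", round(h_index / career_years, 2))                       # 0.12
```
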

Fields of science

Literature and Literary Theory, 3, 17.65%
Computer Science Applications, 1, 5.88%
Information Systems, 1, 5.88%
Computer Networks and Communications, 1, 5.88%
Software, 1, 5.88%
Human-Computer Interaction, 1, 5.88%
Linguistics and Language, 1, 5.88%
General Arts and Humanities, 1, 5.88%
Anthropology, 1, 5.88%
Language and Linguistics, 1, 5.88%
General Psychology, 1, 5.88%
General Social Sciences, 1, 5.88%

Citing journals

Journal not defined, 1, 6.67%

Organizations from articles

Organization not defined, 7, 41.18%

Countries from articles

Russia, 10, 58.82%
Country not defined, 7, 41.18%
Germany, 1, 5.88%
Estonia, 1, 5.88%
United Kingdom, 1, 5.88%

Citing organizations

Organization not defined, 8, 53.33%

Citing countries

Russia, 6, 40%
Country not defined, 3, 20%
USA, 2, 13.33%
China, 2, 13.33%
Germany, 1, 6.67%
Australia, 1, 6.67%
United Kingdom, 1, 6.67%
UAE, 1, 6.67%
  • We do not take into account publications without a DOI.
  • Statistics recalculated daily.