The role of the Russian language in computational linguistics. What does a computational linguist do? Major associations and conferences

COMPUTATIONAL LINGUISTICS (a calque of the English term computational linguistics) is one of the areas of applied linguistics, in which computer programs and computer technologies for organizing and processing data are developed and used to study language and to model its functioning in particular conditions, situations and problem areas. It is also the area in which computer models of language are applied in linguistics and related disciplines. As a distinct scientific field, computational linguistics took shape in European studies in the 1960s. Because the English adjective computational can be rendered in Russian in two ways ("computer" or "calculational"), the term "calculational linguistics" is also found in the literature, but in Russian scholarship it has acquired a narrower meaning that approaches the concept of "quantitative linguistics".

The term "quantitative linguistics" is often subsumed under computational linguistics. It denotes an interdisciplinary direction in applied research in which quantitative or statistical methods of analysis are the main tool for studying language and speech. Quantitative linguistics is sometimes contrasted with combinatorial linguistics, in which the dominant role is played by a "non-quantitative" mathematical apparatus: set theory, mathematical logic, the theory of algorithms, etc. From a theoretical point of view, the use of statistical methods in linguistics makes it possible to supplement the structural model of language with a probabilistic component, that is, to create a theoretical structural-probabilistic model with significant explanatory potential. In the applied field, quantitative linguistics is represented, first of all, by the use of fragments of this model for linguistic monitoring of how a language functions, for deciphering encoded text, for authorship attribution, etc.

The term "computational linguistics" and the problems of this field are often associated with the modeling of communication, above all with enabling human-computer interaction in natural language or a restricted natural language (special natural language processing systems are created for this purpose), as well as with the theory and practice of information retrieval systems (IRS). Human-computer communication in natural language is sometimes denoted by the term "natural language processing" (a translation of the English term Natural Language Processing). This branch of computational linguistics emerged abroad in the late 1960s and developed within the scientific and technological discipline called artificial intelligence (works by R. Schank, M. Lebowitz, T. Winograd and others). Taken literally, the phrase "natural language processing" should cover all areas in which computers are used to process language data. In practice, however, a narrower understanding of the term has taken hold: the development of methods, technologies and specific systems that enable communication between a person and a computer in natural or restricted natural language.

To a certain extent, work on hypertext systems can also be assigned to computational linguistics. Hypertext is regarded as a special way of organizing text and even as a fundamentally new type of text, contrasted in many of its properties with ordinary text formed in the Gutenberg tradition of printing (see Gutenberg).

Machine translation also falls within the scope of computational linguistics.

Within computational linguistics, a relatively new direction has also emerged and has been actively developing since the 1980s and 1990s: corpus linguistics, which develops general principles for building corpora of linguistic data (in particular, text corpora) using modern computer technologies. Text corpora are collections of specially selected texts from books, magazines, newspapers, etc., transferred to machine-readable media and intended for automatic processing. One of the first text corpora was created for American English at Brown University (the so-called Brown Corpus) in 1962-63 under the direction of W. Francis. In Russia, since the early 2000s, the Vinogradov Institute of the Russian Language of the Russian Academy of Sciences has been developing the National Corpus of the Russian Language, a representative sample of Russian-language texts of about 100 million word tokens. In addition to designing the corpora themselves, corpus linguistics creates computer tools (programs) for extracting a variety of information from text corpora. From the user's point of view, text corpora are expected to meet the requirements of representativeness, completeness and economy.
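One common tool of the kind mentioned above is a concordancer, which lists every occurrence of a word together with its immediate context (keyword in context, KWIC). The following is a minimal sketch in Python; the function names `tokenize` and `kwic`, the naive regular-expression tokenizer and the two-sentence toy corpus are assumptions made purely for illustration and do not describe any particular corpus system.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())

def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Toy corpus (hypothetical); a real corpus would be read from files.
corpus = "The corpus is a sample of texts. A corpus must be representative."
tokens = tokenize(corpus)
hits = kwic(tokens, "corpus")
for left, kw, right in hits:
    print(f"{left:>25} [{kw}] {right}")
```

Real corpus tools typically add lemmatization and morphological annotation, so that a query can match all forms of a word rather than a single string.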

Computational linguistics is actively developing both in Russia and abroad, and the flow of publications in this area is very large. In addition to thematic collections, the journal "Computational Linguistics" has been published quarterly in the USA since 1984. Extensive organizational and scientific work is carried out by the Association for Computational Linguistics, which has regional structures around the world (in particular, a European branch). The international COLING conferences are held every two years (in 2008 the conference took place in Manchester). The main directions of computational linguistics are also discussed at the annual international conference "Dialogue", organized by the Russian Research Institute of Artificial Intelligence, the Faculty of Philology of Moscow State University, Yandex and a number of other organizations. The relevant issues are also widely represented at international conferences on artificial intelligence at various levels.

Lit.: Zvegintsev V.A. Theoretical and applied linguistics. M., 1968; Piotrovsky R.G., Bektayev K.B., Piotrovskaya A.A. Mathematical linguistics. M., 1977; Gorodetskiy B.Yu. Actual problems of applied linguistics // New in foreign linguistics. M., 1983. Issue 12; Kibrik A.E. Applied linguistics // Kibrik A.E. Essays on general and applied problems of linguistics. M., 1992; Kennedy G. An introduction to corpus linguistics. L., 1998; Bolshakov I.A., Gelbukh A. Computational linguistics: models, resources, applications. Mexico, 2004; National corpus of the Russian language: 2003-2005. M., 2005; Baranov A.N. Introduction to applied linguistics. 3rd ed. M., 2007; Computational linguistics and intelligent technologies. M., 2008. Issue 7.

Computational linguistics (also: mathematical linguistics; English: computational linguistics) is a scientific field within the mathematical and computer modeling of intellectual processes in humans and animals for the creation of artificial intelligence systems; its aim is to use mathematical models to describe natural languages.

Computational linguistics partially overlaps with natural language processing. In the latter, however, the emphasis is not on abstract models but on applied methods of describing and processing language for computer systems.

The work of computational linguists consists in developing algorithms and application programs for processing linguistic information.

Origins

Mathematical linguistics is a branch of the science of artificial intelligence. Its history began in the United States of America in the 1950s. With the invention of the transistor and the emergence of a new generation of computers, as well as the first programming languages, experiments began with machine translation, especially of Russian scientific journals. In the 1960s, similar studies were carried out in the USSR (for example, an article on translation from Russian into Armenian in the collection "Problems of Cybernetics" for 1964). However, the quality of machine translation is still far inferior to that of human translation.

From May 15 to May 21, 1958, the first All-Union conference on machine translation was held at the 1st Moscow State Pedagogical Institute. The organizing committee was headed by V. Yu. Rosenzweig, with G. V. Chernov as its executive secretary. The full program of the conference was published in the collection "Machine Translation and Applied Linguistics", vol. 1, 1959 (also known as "Bulletin of the Machine Translation Association No. 8"). As V. Yu. Rosenzweig recalls, the published collection of conference abstracts reached the USA and made a great impression there.

In April 1959, the First All-Union Meeting on Mathematical Linguistics was held in Leningrad, convened by Leningrad University and the Committee for Applied Linguistics. The main organizer of the Meeting was N. D. Andreev. A number of prominent mathematicians took part in the Meeting, in particular S. L. Sobolev, L. V. Kantorovich (later a Nobel laureate) and A. A. Markov (the last two took part in the debate). V. Yu. Rosenzweig gave a keynote speech, "General linguistic theory of translation and mathematical linguistics", on the opening day of the Meeting.

Directions of computational linguistics

Natural language processing (syntactic, morphological and semantic analysis of text). This also includes:

Corpus linguistics: creation and use of electronic text corpora.
Creation of electronic dictionaries, thesauri and ontologies, for example Lingvo. Dictionaries are used, for example, for automatic translation and spell checking.
Automatic translation of texts. Promt is popular among Russian translation systems; among the free ones is Google Translate.
Automatic extraction of facts from text (information extraction, fact extraction, text mining).
Automatic text summarization. This feature is included, for example, in Microsoft Word.
Building knowledge management systems. See Expert Systems.
Creation of question answering systems.

Optical character recognition (OCR), for example FineReader.
Automatic speech recognition (ASR). Both commercial and free software is available.
Automatic speech synthesis.

Study programs in Russia

Since 2012, the Institute of Linguistics of the Russian State University for the Humanities has offered the master's program "Computational linguistics" (in the field of study "Fundamental and Applied Linguistics"). The program is designed to train professional linguists who command both the foundations of linguistics and modern methods of research, expert-analytical and engineering work, and who can contribute effectively to the development of innovative computer language technologies.

The educational process involves developers of major research and commercial systems for automatic text processing, which links the master's training to the mainstream of modern computational linguistics. Particular attention is paid to students' participation in Russian and international conferences.

Among the teachers are authors of basic textbooks in linguistic specialties, world-class specialists and project managers of large automatic language processing systems: Ya.G. Testelets, I.M. Boguslavsky, V.I. Belikov, V.I. Podlesskaya, V.P. Selegey, L.L. Iomdin, A.S. Starostin and S.A. Sharov, as well as employees of companies that are world leaders in computational linguistics: IBM (the Watson system), Yandex and ABBYY (the Lingvo, FineReader and Compreno systems).

The training of master's students in this program is built on a project-based approach. Students are drawn into research work in computational linguistics both at the Russian State University for the Humanities and at companies that develop automatic text processing software (ABBYY, IBM, etc.), which is a clear advantage both for the students themselves and for their potential employers. In particular, the program admits employer-sponsored students, whose training is supported by their future employers.

Entrance examination: "Formal models and methods of modern linguistics". Exact information about the time of the exam is available on the website of the master's department of the Russian State University for the Humanities.

Heads of the master's program: Vladimir Pavlovich Selegey, head of the Educational and Scientific Center for Computational Linguistics and Director of Linguistic Research at ABBYY, and Vera Isaakovna Podlesskaya, Doctor of Philosophy, Professor.

The program of the entrance exam and interviews in the discipline "Formal models and methods of modern linguistics."

Comments on the program

Any question in the program may be accompanied by tasks on describing specific linguistic phenomena from the relevant section: building structures, describing constraints, and proposing possible algorithms for construction and/or identification.
Questions marked with asterisks are optional (on the exam tickets they appear as question 3). Command of the relevant material is a significant bonus for candidates, but is not required.
In addition to theoretical questions, each exam ticket will include a short fragment of a specialized (linguistic) text in English for translation and discussion. Applicants are required to demonstrate a satisfactory command of English-language scientific terminology and skills in analyzing scientific text. As an example of a text that should not cause an applicant serious difficulty, below is an excerpt from the article https://en.wikipedia.org/wiki/Anaphora_(linguistics):

In linguistics, anaphora (/ əˈnæfərə /) is the use of an expression whose interpretation depends upon another expression in context (its antecedent or postcedent). In a narrower sense, anaphora is the use of an expression that depends specifically upon an antecedent expression and thus is contrasted with cataphora, which is the use of an expression that depends upon a postcedent expression. The anaphoric (referring) term is called an anaphor. For example, in the sentence Sally arrived, but nobody saw her, the pronoun her is an anaphor, referring back to the antecedent Sally. In the sentence Before her arrival, nobody saw Sally, the pronoun her refers forward to the postcedent Sally, so her is now a cataphor (and an anaphor in the broader, but not the narrower, sense). Usually, an anaphoric expression is a proform or some other kind of deictic (contextually-dependent) expression. Both anaphora and cataphora are species of endophora, referring to something mentioned elsewhere in a dialog or text.

Anaphora is an important concept for different reasons and on different levels: first, anaphora indicates how discourse is constructed and maintained; second, anaphora binds different syntactical elements together at the level of the sentence; third, anaphora presents a challenge to natural language processing in computational linguistics, since the identification of the reference can be difficult; and fourth, anaphora tells some things about how language is understood and processed, which is relevant to fields of linguistics interested in cognitive psychology.

THEORETICAL QUESTIONS

GENERAL ISSUES OF LINGUISTICS

The object of linguistics. Language and speech. Synchrony and diachrony.
Language levels. Formal models of language levels.
Syntagmatics and paradigmatics. The concept of distribution.
Foundations of cross-linguistic comparison: typological, genealogical and areal linguistics.
* Mathematical linguistics: object and research methods.

PHONETICS

The subject of phonetics. Articulatory and acoustic phonetics.
Segmental and suprasegmental phonetics. Prosody and intonation.
Basic concepts of phonology. Typology of phonological systems and their phonetic realizations.
* Computer tools and methods of phonetic research.
* Analysis and synthesis of speech.

MORPHOLOGY

Subject of morphology. Morphs, morphemes, allomorphs.
Inflection and word formation.
Grammatical meanings and ways to implement them. Grammatical categories and grammemes. Morphological and syntactic grammatical meanings.
The concepts of word form, stem, lemma and paradigm.
Parts of speech; basic approaches to identifying parts of speech.
* Formal models for describing inflection and word formation.
* Morphology in automatic language processing tasks: spell checking, lemmatization, POS-tagging

SYNTAX

Subject of syntax. Ways of expressing syntactic relations.
Ways to represent the syntactic structure of a sentence. Advantages and disadvantages of dependency and constituent trees.
Methods for describing linear order. Non-projectivity and discontinuous constituents. The concept of transformation; transformations associated with linear order.
The relationship between syntax and semantics: valencies, government patterns, actants and circonstants.
Diathesis and voice. Actant derivation.
Communicative organization of the utterance. Theme and rheme, given and new, contrast.
* Basic syntactic theories: Meaning-Text Theory, generativism, functional grammar, HPSG.
* Mathematical models of syntax: the Chomsky hierarchy of formal languages, recognition algorithms and their complexity.
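
As an illustration of the last starred topic, the sketch below shows the CYK algorithm, a standard recognizer for context-free grammars in Chomsky normal form that runs in time cubic in the length of the sentence. The toy grammar, the lexicon and the function name `cyk_recognize` are hypothetical and chosen only for illustration; they are not part of the exam program.

```python
from itertools import product

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
BINARY = {                      # rules of the form A -> B C
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICAL = {                     # rules of the form A -> word
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "sees": {"V"},
}

def cyk_recognize(words, start="S"):
    """Return True if the grammar derives the word sequence from `start`."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive words[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][0] = set(LEXICAL.get(word, ()))
    for span in range(2, n + 1):            # length of the substring
        for i in range(n - span + 1):       # start of the substring
            for split in range(1, span):    # length of the left part
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for b, c in product(left, right):
                    table[i][span - 1] |= BINARY.get((b, c), set())
    return start in table[0][n - 1]

print(cyk_recognize("the dog sees the cat".split()))
print(cyk_recognize("dog the sees".split()))
```

The three nested loops over span, start position and split point are the source of the cubic dependence on sentence length; the grammar size enters as an additional factor.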

SEMANTICS

Subject of semantics. Naive and scientific linguistic picture of the world. Sapir-Whorf hypothesis.
Meaning in language and speech: meaning and referent. Reference type (denotative status).
Lexical semantics. Ways to describe the semantics of a word.
Grammatical semantics. The main categories on the example of the Russian language.
Sentence semantics. Propositional component. Deixis and anaphora. Quantifiers and connectives. Modality.
Hierarchy and systematicity of lexical meanings. Polysemy and homonymy. Semantic structure of a polysemous word. The concepts of invariant and prototype.
Paradigmatic and syntagmatic relations in vocabulary. Lexical functions.
Explication. The language of explications. The Moscow Semantic School.
Semantics and logic. The truth value of an utterance.
The theory of speech acts. The utterance and its illocutionary force. Performatives. Classification of speech acts.
Phraseology: inventory and methods of describing phraseological units.
* Models and methods of formal semantics.
* Models of semantics in modern computational linguistics.
* Distributional and operational semantics.
* Basic ideas of Construction Grammar.

TYPOLOGY

Traditional typological classifications of languages.
Typology of grammatical categories of the noun and the verb.
Typology of the simple sentence. The main alignment types: accusative, ergative, active.
Word order typology and Greenberg's correlations. Left-branching and right-branching languages.

LEXICOGRAPHY

Vocabulary as an inventory of culture; social variation of vocabulary, lexical usage, norm, codification.
Typology of dictionaries (in Russian). Reflection of vocabulary in dictionaries of various types.
Bilingual lexicography involving the Russian language.
Descriptive and prescriptive lexicography. Professional linguistic dictionaries.
Specific features of the main Russian explanatory dictionaries. Structure of a dictionary entry. Explication and encyclopedic information.
Vocabulary and grammar. The idea of the integral model of the language in the Moscow Semantic School.
* Methodology of the lexicographer's work.
* Corpus Methods in Lexicography.

TEXT LINGUISTICS AND DISCOURSE

The concept of text and discourse.
Mechanisms of inter-sentence cohesion. The main types of linguistic means that implement them.
A sentence as a unit of language and as an element of text.
Supra-phrasal units: principles of their formation and identification, basic properties.
The main categories for classifying texts (genre, style, register, subject area, etc.).
* Methods for automatic genre classification.

SOCIOLINGUISTICS

The problem of the subject and boundaries of sociolinguistics, its interdisciplinary nature. Basic concepts of sociology and demography. Levels of language structure and sociolinguistics. Basic concepts and directions of sociolinguistics.
Language contacts. Bilingualism and Diglossia. Divergent and convergent processes in the history of language.
Social differentiation of language. Forms of language existence. Literary language: usage, norm, codification. Functional spheres of the language.
Language socialization. The hierarchical nature of social and linguistic identity. Linguistic behavior of the individual and his communicative repertoire.
Sociolinguistic research methods.

COMPUTATIONAL LINGUISTICS

Tasks and methods of computational linguistics.
Corpus linguistics. The main characteristics of a corpus.
Knowledge representation. The main ideas of M. Minsky's theory of frames. The FrameNet system.
Thesauri and ontologies. WordNet.
Fundamentals of statistical analysis of texts. Frequency dictionaries. Collocation analysis.
* The concept of machine learning.
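
Two of the topics in this section, frequency dictionaries and collocation analysis, can be illustrated with a short Python sketch. It builds a frequency list and ranks adjacent word pairs by pointwise mutual information (PMI), one of several association measures used for collocations. The function names, the `min_count` threshold and the sample text are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())

def frequency_dictionary(tokens):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(tokens).most_common()

def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by PMI = log2(P(x, y) / (P(x) * P(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:   # rare pairs give unreliable PMI estimates
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sample text; a real study would use a large corpus.
sample = ("machine translation is hard machine translation is useful "
          "the corpus is large")
tokens = tokenize(sample)
freq = frequency_dictionary(tokens)
ranked = pmi_bigrams(tokens)
print(freq[:3])
print(ranked)
```

On a text this small the scores are only indicative; in practice PMI is computed over large corpora and is often combined with a frequency threshold, as the `min_count` parameter suggests, because it overrates rare pairs.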

LITERATURE

Educational (basic level)

Baranov A.N. Introduction to applied linguistics. M.: Editorial URSS, 2001.

Baranov A.N., Dobrovolskiy D.O. Fundamentals of phraseology (short course): a textbook. 2nd ed. M.: Flinta, 2014.

Belikov V.I., Krysin L.P. Sociolinguistics. M.: RGGU, 2001.

Burlak S.A., Starostin S.A. Comparative-historical linguistics. M.: Academy, 2005.

Vakhtin N.B., Golovko E.V. Sociolinguistics and sociology of language. SPb., 2004.

Knyazev S.V., Pozharitskaya S.K. Modern Russian literary language: phonetics, orthoepy, graphics, orthography. 2nd ed. M., 2010.

Kobozeva I.M. Linguistic semantics. M.: Editorial URSS, 2004.

Kodzasov S.V., Krivnova O.F. General phonetics. M.: RGGU, 2001.

Krongauz M.A. Semantics. M.: RGGU, 2001.

Krongauz M.A. Semantics: problems, assignments, texts. M.: Academy, 2006.

Maslov Yu.S. Introduction to linguistics. 6th ed., stereotype. M.: Academy; Faculty of Philology, SPbSU,

Plungyan V.A. General morphology: an introduction to the problems. 2nd ed. M.: Editorial URSS, 2003.

Testelets Ya.G. An introduction to general syntax. M., 2001.

Shaikevich A.Ya. Introduction to linguistics. M.: Academy, 2005.

Scientific and reference

Apresyan Yu.D. Selected works. Vol. I. Lexical semantics. 2nd ed., revised and enlarged. M.: School "Languages of Russian Culture", 1995.

Apresyan Yu.D. Selected works. Vol. II. Integral description of language and systemic lexicography. M.: School "Languages of Russian Culture", 1995.

Apresyan Yu.D. (ed.) New explanatory dictionary of synonyms of the Russian language. Moscow - Vienna: "Languages of Russian Culture", Wiener Slavistischer Almanach, Sonderband 60, 2004.

Apresyan Yu.D. (ed.) Linguistic picture of the world and systemic lexicography. M.: "Languages of Slavic Cultures", 2006. Preface and Ch. 1, pp. 26-74.

Bulygina T.V., Shmelev A.D. Linguistic conceptualization of the world (based on the material of Russian grammar). M.: School "Languages of Russian Culture", 1997.

Weinreich U. Language contacts. Kiev, 1983.

Wierzbicka A. Semantic universals and description of languages. M.: School "Languages of Russian Culture", 1999.

Galperin I.R. Text as an object of linguistic research. 6th ed. M.: LKI, 2008 ("Linguistic heritage of the XX century").

Zaliznyak A.A. "Russian nominal inflection" with an appendix of selected works on the modern Russian language and general linguistics. M.: Languages of Slavic Culture, 2002.

Zaliznyak A.A., Paducheva E.V. Towards a typology of relative clauses // Semiotics and informatics. Issue 35. M., 1997. Pp. 59-107.

Ivanov Vyach. Vs. Linguistics of the third millennium: questions for the future. M., 2004. Pp. 89-100 (11. The language situation of the world and a forecast for the near future).

Kibrik A.E. Essays on general and applied problems of linguistics. M.: Publishing house of Moscow State University, 1992.

Kibrik A.E. Language constants and variables. SPb.: Aleteya, 2003.

Labov W. On the mechanism of language change // New in linguistics. Issue 7. M., 1975. Pp. 320-335.

Lyons J. Linguistic semantics: an introduction. M.: Languages of Slavic Culture, 2003.

Lyons J. Language and linguistics. Introductory course. M.: URSS, 2004.

Lakoff G. Women, fire and dangerous things: what categories of language tell us about thinking. M.: Languages of Slavic Culture, 2004.

Lakoff G., Johnson M. Metaphors we live by. Translated from English. 2nd ed. M.: URSS, 2008.

Linguistic encyclopedic dictionary / Ed. V.N. Yartseva. M.: Scientific publishing house "Great Russian Encyclopedia", 2002.

Melchuk I.A. Course in general morphology. Vols. I-IV. Moscow - Vienna: "Languages of Slavic Culture", Wiener Slavistischer Almanach, Sonderband 38/1-38/4, 1997-2001.

Melchuk I.A. An essay on the theory of "Meaning ↔ Text" linguistic models. M.: School "Languages of Russian Culture", 1999.

Fedorova L.L. Semiotics. M., 2004.

Filippov K.A. Linguistics of the text: a course of lectures. 2nd ed., revised and enlarged. St. Petersburg University Press, 2007.

Haspelmath M., et al. (eds.) World Atlas of Language Structures. Oxford, 2005.

Dryer M.S., Haspelmath M. (eds.) The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology, 2013. (http://wals.info)

Croft W. Typology and Universals. Cambridge: Cambridge University Press, 2003.

Shopen T. (ed.) Language Typology and Syntactic Description. 2nd edition. Cambridge, 2007.

Belikov V.I. On dictionaries "containing the norms of the modern Russian literary language when it is used as the state language of the Russian Federation". 2010 // Portal Gramota.Ru (http://gramota.ru/biblio/research/slovari-norm)

Computational linguistics and intelligent technologies: proceedings of the annual International Conference "Dialogue". Issues 1-11. M.: Nauka; RSUH Publishing House, 2002-2012. (Articles on computational linguistics, http://www.dialog-21.ru)

National corpus of the Russian language: 2006-2008. New results and prospects / Ed. V.A. Plungyan. SPb.: Nestor-History, 2009.

New in foreign linguistics. Issue XXIV. Computational linguistics / Comp. B.Yu. Gorodetsky. M.: Progress, 1989.

Shimchuk E.G. Russian lexicography: a textbook. M.: Academy, 2009.

National corpus of the Russian language: 2003-2005. Collected articles. M.: Indrik, 2005.

For contacts:

Educational and Scientific Center for Computational Linguistics, Institute of Linguistics, Russian State University for the Humanities

COMPUTATIONAL LINGUISTICS, a branch of applied linguistics focused on the use of computer tools (programs and computer technologies for organizing and processing data) to model the functioning of language in particular conditions, situations, problem areas, etc., together with the whole range of applications of computer models of language in linguistics and related disciplines. Strictly speaking, only the latter is applied linguistics in the narrow sense, since computer modeling of language can also be regarded as an application of computer science and programming theory to problems in the science of language. In practice, however, almost everything related to the use of computers in linguistics is referred to as computational linguistics.

Computational linguistics took shape as a distinct scientific field in the 1960s. The Russian term "kompyuternaya lingvistika" is a calque of the English computational linguistics. Since the adjective computational can also be rendered in Russian as "vychislitelnaya", the term "vychislitelnaya lingvistika" is also found in the literature, but in Russian scholarship it has a narrower meaning, close to the concept of "quantitative linguistics". The flow of publications in this area is very large. In addition to thematic collections, the journal "Computational Linguistics" is published quarterly in the USA. A great deal of organizational and scientific work is carried out by the Association for Computational Linguistics, which has regional structures (in particular, a European chapter). The international conferences on computational linguistics, COLING, are held every two years. The relevant topics are also usually well represented at various conferences on artificial intelligence.

Computational Linguistics Toolkit.

Computational linguistics as a distinct applied discipline is defined primarily by its instrument, i.e. by the use of computer tools for processing language data. Since computer programs that model particular aspects of the functioning of language can use a wide variety of programming tools, it might seem that there is no point in speaking of a common conceptual apparatus of computational linguistics. This is not so, however. There are general principles of computer modeling of thinking that are implemented, in one way or another, in any computer model. They rest on the theory of knowledge that was originally developed in the field of artificial intelligence and later became one of the branches of cognitive science. The most important conceptual categories of computational linguistics are knowledge structures such as "frames" (conceptual structures for the declarative representation of knowledge about a typified, thematically unified situation), "scripts" (conceptual structures for the procedural representation of knowledge about a stereotypical situation or stereotyped behavior), and "plans" (knowledge structures that capture ideas about the possible actions leading to a certain goal). Closely related to the category of frame is the concept of "scene". In the literature on computational linguistics, the category of scene is mainly used to designate a conceptual structure for the declarative representation of situations, and their parts, that are actualized in a speech act and singled out by linguistic means (lexemes, syntactic constructions, grammatical categories, etc.).
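
As a rough illustration of how these three knowledge structures differ, the sketch below models a frame, a script (scenario) and a plan as plain Python data classes. The class names, the restaurant example and the helper next_event are assumptions made for illustration only; they are not taken from any particular system.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """Declarative knowledge about a typified situation: named slots with fillers."""
    name: str
    slots: dict = field(default_factory=dict)

@dataclass
class Script:
    """Procedural knowledge about a stereotyped situation: an ordered list of events."""
    name: str
    events: list = field(default_factory=list)

@dataclass
class Plan:
    """Possible actions that lead to a goal."""
    goal: str
    actions: list = field(default_factory=list)

# Hypothetical example content.
restaurant_frame = Frame(
    "restaurant",
    {"participants": ["customer", "waiter"], "place": "dining room"},
)
restaurant_script = Script("eating out", ["enter", "order", "eat", "pay", "leave"])
dinner_plan = Plan("have dinner", ["choose a restaurant", "go there", "follow the script"])

def next_event(script, current):
    """Return the event expected after `current`, or None at the end of the script."""
    position = script.events.index(current)
    if position + 1 < len(script.events):
        return script.events[position + 1]
    return None
```

The frame only states what belongs to the situation, while the script additionally fixes an order, which is what makes expectations such as next_event(restaurant_script, "order") possible.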

A set of knowledge structures organized in a particular way forms the "world model" of a cognitive system and of its computer model. In artificial intelligence systems the world model is a separate block which, depending on the chosen architecture, may include general knowledge about the world (in the form of simple propositions such as "it is cold in winter" or of production rules such as "if it is raining outside, then put on a raincoat or take an umbrella"), some specific facts ("the highest peak in the world is Everest"), and also values and their hierarchies, which are sometimes placed in a special "axiological block".
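
A minimal sketch of such a world model, assuming it is stored as a set of facts plus if-then production rules. The rule contents and the function name forward_chain are illustrative assumptions; real systems use far richer representations.

```python
# Hypothetical miniature world model: known facts plus production rules.
facts = {"it is raining outside"}

# Each rule is (set of conditions, conclusion). Contents are illustrative only.
rules = [
    ({"it is raining outside"}, "take an umbrella"),
    ({"it is winter"}, "it is cold"),
    ({"it is cold"}, "wear a coat"),
]

def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new conclusion can be added."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

derived = forward_chain(facts, rules)
winter = forward_chain({"it is winter"}, rules)
```

Here winter ends up containing "wear a coat" through two chained rules, while derived does not, because nothing in the rainy-day facts supports it.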

Most elements of the conceptual toolkit of computational linguistics are homonymous: they simultaneously designate real entities of the human cognitive system and the ways of representing those entities that are used in their theoretical description and modeling. In other words, the elements of the conceptual apparatus have an ontological and an instrumental aspect. In the ontological aspect, for example, the distinction between declarative and procedural knowledge corresponds to the different types of knowledge a person has: knowing WHAT (declarative; for example, knowing the postal address of some person NN) on the one hand, and knowing HOW (procedural; for example, the knowledge that lets you find NN's apartment even without knowing the formal address) on the other. In the instrumental aspect, knowledge can be embodied in a set of descriptions, i.e. a data set, on the one hand, and in an algorithm, an instruction executed by a computer or some other model of a cognitive system, on the other.

Directions of computational linguistics.

The field of CL is very diverse and includes such areas as computer modeling of communication, modeling of plot structure, hypertext technologies for presenting text, machine translation, and computer lexicography. In a narrow sense, CL is often identified with an interdisciplinary applied field that bears the somewhat unfortunate name "natural language processing" (a translation of the English term Natural Language Processing). It emerged in the late 1960s and developed within the scientific and technological discipline of "artificial intelligence". Taken literally, the phrase "natural language processing" covers all areas in which computers are used to process language data. In practice, however, a narrower understanding of the term has taken hold: the development of methods, technologies and specific systems that enable communication between a person and a computer in natural language or a restricted natural language.

The rapid development of "natural language processing" came in the 1970s and was linked to an unexpected exponential growth in the number of end users of computers. Since it is impossible to teach programming languages and technologies to all users, the problem arose of organizing their interaction with computer programs. This communication problem was tackled along two main paths. On the first, attempts were made to adapt programming languages and operating systems to the end user. The result was high-level languages such as Visual Basic, as well as convenient operating systems built in the conceptual space of metaphors familiar to people: the DESKTOP, the LIBRARY. The second path is the development of systems that allow the user to interact with a computer in a specific problem area in natural language or in some restricted version of it.

The architecture of natural language processing systems generally includes a unit that analyzes the user's speech message, a unit that interprets the message, a unit that generates the meaning of the response, and a unit that synthesizes the surface structure of the utterance. A special part of the system is the dialogue component, which records the strategies for conducting a dialogue, the conditions for applying these strategies, and ways of overcoming possible communication failures.
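
The four units described above can be sketched as a chain of functions. This is a toy illustration under strong assumptions: the knowledge base CAPITALS, the single supported intent and all function names are hypothetical, and a real system would perform morphological, syntactic and semantic analysis instead of keyword spotting.

```python
CAPITALS = {"france": "Paris", "japan": "Tokyo"}  # toy knowledge base (assumption)

def analyze(message):
    """Analysis unit: turn the user's message into lowercase tokens."""
    return message.lower().strip("?!. ").split()

def interpret(tokens):
    """Interpretation unit: map tokens to a crude meaning representation."""
    if "capital" in tokens:
        return {"intent": "capital_of", "country": tokens[-1]}
    return {"intent": "unknown"}

def plan_response(meaning):
    """Response-meaning unit; falls back to a recovery strategy on failure."""
    country = meaning.get("country")
    if meaning["intent"] == "capital_of" and country in CAPITALS:
        return {"act": "inform", "country": country, "capital": CAPITALS[country]}
    return {"act": "clarify"}  # dialogue strategy for a communication failure

def synthesize(response):
    """Synthesis unit: produce the surface form of the reply."""
    if response["act"] == "inform":
        return f"The capital of {response['country'].title()} is {response['capital']}."
    return "Sorry, I did not understand. Could you rephrase?"

def reply(message):
    """Run the message through all four units in order."""
    return synthesize(plan_response(interpret(analyze(message))))
```

For example, reply("What is the capital of France?") passes through all four stages and returns an informing sentence, while an unrecognized message triggers the clarification request.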

Among computer systems for natural language processing, a distinction is usually made between question-answering systems, dialogue systems for problem solving, and systems for processing connected texts. Question-answering systems were first developed as a reaction to the poor quality of query encoding in information retrieval systems. Since the problem area of such systems was very limited, this somewhat simplified the algorithms for translating queries into a formal-language representation and the reverse procedure of transforming a formal representation into natural-language statements. Among Russian developments, programs of this type include the POET system, created by a team of researchers led by E.V. Popov. The system processes requests in Russian (with minor restrictions) and synthesizes a response. The block diagram of the program provides for all stages of analysis (morphological, syntactic and semantic) and the corresponding stages of synthesis.

Dialogue systems for problem solving, unlike systems of the previous type, play an active role in communication, since their task is to obtain a solution to a problem on the basis of the knowledge represented in the system and the information that can be obtained from the user. The system contains knowledge structures that record typical sequences of actions for solving problems in a given problem area, as well as information about the required resources. When the user asks a question or sets a specific task, the corresponding script is activated. If some components of the script are not filled in or some resources are lacking, the system initiates communication. This is how, for example, the SNUKA system, which solves problems of planning military operations, works.
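
The behaviour just described, where the system starts asking questions once a script has unfilled components, can be sketched as follows. The trip-booking script and the function names are hypothetical; this is not a description of how SNUKA or any other real system is implemented.

```python
# Hypothetical script: the components a "book a trip" task needs, in order.
TRIP_SCRIPT = ["destination", "date", "passengers"]

def missing_slots(script, known):
    """Return the script components not yet supplied, in script order."""
    return [slot for slot in script if slot not in known]

def next_system_move(script, known):
    """Ask about the first missing component, or report that the task can be solved."""
    missing = missing_slots(script, known)
    if missing:
        return f"Please specify: {missing[0]}"
    return "All information collected; solving the task."
```

With known = {"destination": "Kazan"} the system takes the initiative and asks for the date; once all three components are present it proceeds to the solution.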

Systems for processing connected texts vary considerably in structure. Their common feature is the extensive use of knowledge representation technologies. The function of such systems is to understand a text and answer questions about its content. Understanding is treated not as a universal category but as a process of extracting information from a text, determined by a specific communicative intention. In other words, the text is "read" only with a view to what the potential user wants to learn from it. Thus systems for processing connected texts turn out to be not universal but problem-oriented. Typical examples are the RESEARCHER and TAILOR systems, which form a single software package that allows the user to obtain information from patent abstracts describing complex physical objects.

The most important area of computational linguistics is the development of information retrieval systems (IRS). These emerged in the late 1950s and early 1960s as a response to a sharp increase in the volume of scientific and technical information. By the type of information stored and processed, and by the nature of the search, IRS are divided into two large groups: documentary and factographic. Documentary IRS store the texts of documents or their descriptions (abstracts, bibliographic cards, etc.). Factographic IRS deal with descriptions of specific facts, not necessarily in textual form: these may be tables, formulas and other kinds of data. There are also mixed IRS, which contain both documents and factographic information. Today factographic IRS are built on database (DB) technologies. To support retrieval, special information retrieval languages are created, based on information retrieval thesauri. An information retrieval language is a formal language designed to describe particular aspects of the content of the documents stored in the IRS and of the query. The procedure of describing a document in an information retrieval language is called indexing. As a result of indexing, each document is assigned a formal description in the information retrieval language, the search image of the document. The query is indexed in the same way and is assigned a search image of the query and a search prescription. Retrieval algorithms are based on comparing the search prescription with the search image of the document. The criterion for returning a document in response to a query may be full or partial coincidence of the document's search image with the search prescription. In some cases the user can formulate the retrieval criteria personally, in accordance with his or her information need.

Descriptor information retrieval languages are the ones most often used in automated IRS. The subject of a document is described by a set of descriptors. Descriptors are words or terms denoting simple, fairly elementary categories and concepts of the problem area. The search image of a document contains as many descriptors as there are distinct topics covered in the document. The number of descriptors is not limited, which makes it possible to describe a document in a multidimensional matrix of features. A descriptor information retrieval language often imposes restrictions on the combinability of descriptors. In that case the information retrieval language can be said to have a syntax.
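
The indexing and matching procedure can be sketched in a few lines. This is a minimal illustration, assuming a toy thesaurus that maps words to descriptors and treating search images as sets; the descriptor names, the documents and the min_overlap parameter are all invented for the example.

```python
# Toy thesaurus mapping words to descriptors (all entries are illustrative).
THESAURUS = {
    "translation": "MACHINE_TRANSLATION",
    "translating": "MACHINE_TRANSLATION",
    "dictionary": "LEXICOGRAPHY",
    "dictionaries": "LEXICOGRAPHY",
    "hypertext": "HYPERTEXT",
}

def index(text):
    """Indexing: build the search image (a set of descriptors) of a text."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return {THESAURUS[word] for word in words if word in THESAURUS}

def search(query, documents, min_overlap=None):
    """Return the ids of documents whose search image matches the query.

    min_overlap=None demands full coincidence (every query descriptor present);
    an integer demands at least that many shared descriptors (partial coincidence).
    """
    prescription = index(query)
    needed = len(prescription) if min_overlap is None else min_overlap
    hits = []
    for doc_id, text in documents.items():
        shared = prescription & index(text)
        if prescription and len(shared) >= needed:
            hits.append(doc_id)
    return sorted(hits)

DOCS = {
    "d1": "Dictionaries for machine translation.",
    "d2": "Hypertext and electronic dictionaries.",
    "d3": "Speech synthesis.",
}
```

Calling search("translation dictionary", DOCS) returns only the document that carries both descriptors, whereas min_overlap=1 relaxes the criterion to partial coincidence and also returns the second document.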

One of the first systems to work with a descriptor language was the American UNITERM system, created by M. Taube. The descriptors in this system were the keywords of the document, called uniterms. The distinctive feature of this IRS is that the vocabulary of the information language was not specified in advance but arose in the course of indexing documents and queries. The development of modern information retrieval systems is associated with IRS of the thesaurus-free type. Such IRS interact with the user in a restricted natural language, and the search is carried out in the texts of document abstracts, in their bibliographic descriptions, and often in the documents themselves. Indexing in thesaurus-free IRS uses the words and phrases of natural language.

Work on hypertext systems can also, to a certain extent, be assigned to the field of computational linguistics. Hypertext is regarded as a special way of organizing text and even as a fundamentally new type of text, opposed in many of its properties to ordinary text shaped by the Gutenberg tradition of printing. The idea of hypertext is associated with the name of Vannevar Bush, President Roosevelt's science advisor. Bush gave a theoretical justification for the project of a technical system, "Memex", which allowed the user to connect texts and their fragments by various types of links, mainly associative ones. The absence of computer technology made the project difficult to realize, since a mechanical system proved too complex for practical implementation.

In the 1960s Bush's idea was reborn in T. Nelson's Xanadu system, which already assumed the use of computer technology. Xanadu allowed the user to read a set of texts entered into the system in different ways and in different orders; the software made it possible both to remember the sequence of texts viewed and to select almost any of them at an arbitrary moment. Nelson called a set of texts together with the relations connecting them (a system of transitions) hypertext. Many researchers see the creation of hypertext as the beginning of a new information era, opposed to the era of printing. The linearity of writing, which outwardly reflects the linearity of speech, turns out to be a fundamental category that limits human thinking and the understanding of text. The world of meaning is nonlinear, so compressing semantic information into a linear stretch of speech requires special "communicative packaging": the division into theme and rheme, and the division of the content of an utterance into explicit (assertion, proposition, focus) and implicit (presupposition, entailment, discourse implicature) layers. According to the theorists, abandoning the linearity of text, both when it is presented to the reader (i.e. in reading and understanding) and when it is synthesized, would help to "liberate" thinking and even give rise to new forms of it.

In a computer system, hypertext is presented in the form of a graph, in the nodes of which there are traditional texts or their fragments, images, tables, videos, etc. The nodes are linked by a variety of relationships, the types of which are defined by the hypertext software developers or the reader himself. Relationships define the potential for movement, or hypertext navigation. Relationships can be unidirectional or bidirectional. Accordingly, bidirectional arrows allow the user to move in both directions, and unidirectional arrows only in one direction. The chain of nodes through which the reader passes when viewing the components of the text forms a path, or route.
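
A minimal sketch of this graph representation, assuming nodes are stored in a dictionary and links as typed, directed edges. The class Hypertext, its method names and the sample nodes are hypothetical and serve only to make the notions of unidirectional links, bidirectional links and routes concrete.

```python
class Hypertext:
    """Nodes hold content; typed links between them are one-way or two-way."""

    def __init__(self):
        self.nodes = {}  # node id -> content
        self.links = {}  # node id -> list of (link type, target id)

    def add_node(self, node_id, content):
        self.nodes[node_id] = content
        self.links.setdefault(node_id, [])

    def add_link(self, source, target, link_type, bidirectional=False):
        self.links[source].append((link_type, target))
        if bidirectional:
            self.links[target].append((link_type, source))

    def neighbours(self, node_id):
        """Nodes reachable from node_id in a single step."""
        return [target for _, target in self.links[node_id]]

    def is_valid_route(self, route):
        """True if every step of the route follows an existing link."""
        return all(b in self.neighbours(a) for a, b in zip(route, route[1:]))

ht = Hypertext()
for node_id in ("intro", "frames", "scripts"):
    ht.add_node(node_id, f"text of section '{node_id}'")
ht.add_link("intro", "frames", "see-also", bidirectional=True)
ht.add_link("frames", "scripts", "next")  # one-way link
```

In this example the route intro, frames, scripts is valid, but the reader cannot go back from scripts to frames, because that link was declared in one direction only.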

Computer implementations of hypertext are either hierarchical or networked. The hierarchical (tree-like) structure of hypertext considerably restricts the possibilities of transition between its components. In such a hypertext the relations between components resemble the structure of a thesaurus based on genus-species relations. Network hypertext allows different types of relations between components, not limited to the genus-species relation. According to its mode of existence, hypertext is either static or dynamic. Static hypertext does not change in the course of use; the user can record comments in it, but they do not change the substance. For dynamic hypertext, change is the normal form of existence. Dynamic hypertext typically functions where a flow of information has to be analyzed continuously, i.e. in information services of various kinds. An example is the Arizona Information System (AAIS), to which 300-500 abstracts are added every month.

The relations between the elements of a hypertext can be fixed by its creators from the outset, or they can be generated each time the user accesses the hypertext. In the first case we speak of hypertexts with a rigid structure, in the second of hypertexts with a soft structure. A rigid structure is technologically straightforward. The technology for organizing a soft structure has to rest on a semantic analysis of how close documents (or other information sources) are to one another, which is a non-trivial task for computational linguistics. At present, soft-structure technologies based on keywords are widespread. The transition from one node of the hypertext network to another results from a keyword search. Since the set of keywords may differ each time, the structure of the hypertext also changes each time.
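
A keyword-based soft structure can be sketched as links that are computed on demand rather than stored. The node texts and the function soft_links below are illustrative assumptions; a realistic system would use stemming, stop-word lists or a semantic similarity measure instead of exact word matching.

```python
# Hypothetical node contents.
NODES = {
    "n1": "Frames and scripts in artificial intelligence",
    "n2": "Scripts for dialogue systems",
    "n3": "Hypertext navigation",
}

def soft_links(nodes, current_id, query):
    """Compute the links leaving the current node for this particular query:
    every other node that shares at least one keyword with the query."""
    wanted = {word.lower() for word in query.split()}
    return sorted(
        node_id
        for node_id, text in nodes.items()
        if node_id != current_id and wanted & {word.lower() for word in text.split()}
    )
```

The same starting node yields different outgoing links for different keyword sets, which is exactly why the structure of such a hypertext changes from one access to the next.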

The technology of building hypertext systems makes no distinction between textual and non-textual information. However, including visual and audio information (videos, pictures, photographs, sound recordings, etc.) requires substantial changes to the user interface and more powerful software and hardware support. Such systems are called hypermedia, or multimedia. The visual clarity of multimedia systems has led to their wide use in teaching and in computer versions of encyclopedias. There are, for example, superbly produced CD-ROMs with multimedia systems based on the children's encyclopedias of the Dorling Kindersley publishing house.

Computer lexicography develops computer technologies for compiling and using dictionaries. Special programs (databases, computer card files, word processing programs) make it possible to generate dictionary entries automatically and to store and process dictionary information. The many different lexicographic programs fall into two large groups: programs that support lexicographic work, and automatic dictionaries of various types, including lexicographic databases. An automatic dictionary is a dictionary in a special machine format intended for use on a computer, either by a human user or by a text-processing program. In other words, a distinction is made between automatic dictionaries for human end users and automatic dictionaries for text-processing programs. Automatic dictionaries intended for the end user differ considerably, in interface and in the structure of the dictionary entry, from the automatic dictionaries built into machine translation systems, automatic summarization systems, information retrieval systems, etc. Most often they are computer versions of well-known conventional dictionaries. The software market offers computer counterparts of explanatory dictionaries of English (an automatic Webster, an automatic Collins English Dictionary) and an automatic version of the New Comprehensive English-Russian Dictionary edited by Yu.D. Apresyan and E.M. Mednikova; there is also a computer version of Ozhegov's dictionary. Automatic dictionaries for text-processing programs can be called automatic dictionaries in the strict sense. As a rule they are not intended for the ordinary user. The features of their structure and the scope of their vocabulary are determined by the programs that interact with them.

Computer modeling of plot structure is another promising area of computational linguistics. The study of plot structure belongs to the problems of structural literary studies (in the broad sense), semiotics and cultural studies. The existing computer programs for plot modeling rest on three basic formalisms of plot representation: the morphological approach, the syntactic approach, and the cognitive approach. Ideas about the morphological organization of plot go back to the famous works of V.Ya. Propp on the Russian fairy tale. Propp noticed that, for all the abundance of characters and events in fairy tales, the number of character functions is limited, and he proposed an apparatus for describing these functions. Propp's ideas formed the basis of the TALE computer program, which simulates the generation of a fairy-tale plot. The algorithm of TALE is based on the sequence of functions of the fairy-tale characters. In effect, Propp's functions define a set of typified situations, ordered on the basis of an analysis of empirical material. In the generation rules, the possibilities for linking different situations were determined by the typical sequence of functions, in the form in which it can be established from the texts of fairy tales. In the program, typical sequences of functions were described as typical scenarios of character encounters.
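
The core idea, that a tale is a selection of functions kept in a fixed canonical order, can be illustrated with a very small generator. This is not the algorithm of the TALE program, whose details are not given here; the choice of nine functions, the set of required functions and the 0.5 probability are assumptions made purely for the sketch.

```python
import random

# A small subset of Propp's functions, listed in their canonical order.
FUNCTIONS = [
    "absentation", "interdiction", "violation", "villainy",
    "departure", "struggle", "victory", "return", "wedding",
]

def generate_plot(rng, required=("villainy", "departure", "return")):
    """Select functions for one tale: required ones always, the others at
    random, with the canonical order preserved."""
    return [f for f in FUNCTIONS if f in required or rng.random() < 0.5]

plot = generate_plot(random.Random(0))
```

Because the selection is a filter over the ordered list, every generated outline respects the sequence of functions, whatever subset happens to be chosen.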

The theoretical basis of the syntactic approach to plot was provided by "plot grammars", or "story grammars". They appeared in the mid-1970s, when the ideas of N. Chomsky's generative grammar were carried over to the description of the macrostructure of text. Whereas the most important constituents of syntactic structure in generative grammar were verb phrases and noun phrases, most story grammars took setting, event and episode as their basic categories. The theory of story grammars devoted much discussion to minimality conditions, that is, the restrictions that determine whether a sequence of plot elements counts as a normal plot. It turned out, however, that this could not be done by purely linguistic methods: many of the restrictions are sociocultural in nature. Story grammars, while differing considerably in the set of categories in the generation tree, allowed only a very limited set of rules for modifying the narrative structure.
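
A story grammar can be sketched as a set of rewrite rules expanded top-down, in the same way as a phrase-structure grammar. The two rules below are invented for illustration and use only the categories named above; actual story grammars differ in their category sets and are considerably richer.

```python
# Hypothetical rewrite rules over the categories setting, event and episode.
GRAMMAR = {
    "STORY": ["SETTING", "EPISODE"],
    "EPISODE": ["EVENT", "EVENT"],
}

def expand(symbol):
    """Expand a category top-down into the sequence of terminal categories."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal category: nothing left to rewrite
    result = []
    for child in GRAMMAR[symbol]:
        result.extend(expand(child))
    return result
```

Expanding "STORY" yields a setting followed by two events, the leaves of a small generation tree.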

In the early 1980s W. Lehnert, a student of R. Schank, proposed, as part of work on a computer plot generator, an original formalism of affective plot units, which proved to be a powerful means of representing plot structure. Although it was originally developed for an artificial intelligence system, this formalism was also used in purely theoretical studies. The essence of Lehnert's approach is that the plot is described as a successive change in the cognitive and emotional states of the characters. Thus the focus of Lehnert's formalism is not on the external components of the plot (exposition, event, episode, moral) but on its content characteristics. In this respect Lehnert's formalism is partly a return to Propp's ideas.

The competence of computational linguistics also includes machine translation, which is currently experiencing a rebirth.

Literature:

Popov E.V. Communication with a Computer in Natural Language. M., 1982
Sadur V.G. Speech communication with electronic computers and problems of their development. - In: Speech Communication: Problems and Prospects. M., 1983
Baranov A.N. Categories of Artificial Intelligence in Linguistic Semantics: Frames and Scripts. M., 1987
Kobozeva I.M., Laufer N.I., Saburova I.G. Modeling communication in human-machine systems. - In: Linguistic Support of Information Systems. M., 1987
Alker H.R. Fairy tales, tragedies and ways of presenting world history. - In: Language and Modeling of Social Interaction. M., 1987
Gorodetsky B.Yu. Computational linguistics: modeling language communication
McKeown K. Discourse strategies for natural language text synthesis. - New in Foreign Linguistics. Issue XXIV, Computational Linguistics. M., 1989
Popov E.V., Preobrazhensky A.B. Features of the implementation of NL systems
Preobrazhensky A.B. The state of development of modern NL systems. - Artificial Intelligence. Book 1, Communication Systems and Expert Systems. M., 1990
Subbotin M.M. Hypertext: a new form of written communication. - VINITI, Ser. Informatics, 1994, vol. 18
Baranov A.N. Introduction to Applied Linguistics. M., 2000

Computational linguists develop algorithms for recognizing text and spoken language, work on the synthesis of artificial speech, build semantic translation systems and develop artificial intelligence itself (in the classical sense of the word, as a replacement for the human mind, it is unlikely ever to appear, but various expert systems based on data analysis will).

Speech recognition algorithms will be used more and more in everyday life - “smart houses” and electronic devices will not have remotes and buttons, and instead will use a voice interface. This technology is being refined, but there are still many challenges: it is difficult for a computer to recognize human speech because different people speak very differently. Therefore, as a rule, recognition systems work well either when they are trained for one speaker and are already adjusted to his pronunciation features, or when the number of phrases that the system can recognize is limited (as, for example, in voice commands for the TV).

Developers of semantic translation programs still have a great deal of work ahead: at present, good algorithms exist only for translation into and from English. The problems are many. Languages are organized differently in semantic terms, and the differences show even at the level of phrase construction; not every meaning in one language can be conveyed with the semantic apparatus of another. In addition, the program must distinguish homonyms, identify parts of speech correctly, and choose the meaning of a polysemous word that fits the context.

Synthesis of artificial speech (for example, for home robots) is also painstaking work. It is hard to make artificially generated speech sound natural to the human ear, because there are millions of nuances that we do not notice but without which the speech no longer sounds "right": false starts, pauses, hesitations, etc. The flow of speech is continuous and at the same time discrete: we speak without pausing between words, yet we have no difficulty telling where one word ends and the next begins, whereas for a machine this is a major problem.

The largest area of computational linguistics is connected with Big Data. There are huge corpora of texts, such as news feeds, from which particular information has to be extracted: for example, to pick out newsworthy events or to tailor an RSS feed to the tastes of a particular user. Such technologies already exist and will continue to develop, because computing power is growing rapidly. Linguistic analysis of texts is also used to ensure security on the Internet and to find information needed by the security services.

Where can one study to become a computational linguist? Unfortunately, in our country the specialties related to classical linguistics are quite strictly separated from programming, statistics and data analysis, and to become a computational linguist you need to understand both. Foreign universities offer degree programs in computational linguistics, whereas here the best option for now is to get a basic linguistic education and then master the fundamentals of IT. It is good that there are now many online courses; unfortunately, there were none in my student years. I studied at the Faculty of Applied Linguistics at Moscow State Linguistic University, where we had courses on artificial intelligence and speech recognition, though not enough of them. IT companies are now actively trying to work with universities. My colleagues from Kaspersky Lab and I also try to take part in the educational process: we give lectures, hold student conferences, and award grants to graduate students. But so far the initiative comes more from employers than from universities.