Text Corpus and Translation
In linguistics, a corpus is a collection of written texts or recorded spoken material in a given language. The creation of corpora has acquired special significance for humanities disciplines such as linguistics, literary studies, historiography, and jurisprudence. A corpus allows for systematic research into individual issues and problems. In translation, corpus-based text linguistics lets us study how a particular word or term is used in different texts and gain a professional understanding of the semantics, etymology, and frequency of use of a given linguistic form. For this reason, it is important to have both monolingual corpora and bilingual, so-called parallel, corpora.
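As a rough illustration of the kind of study described above, the following minimal Python sketch counts how often a word occurs in each text of a small corpus and lists every occurrence with its surrounding context (a simple concordance). The two sample texts and the function names `tokenize`, `word_frequency`, and `concordance` are invented for this example; they are assumptions and do not describe any particular corpus tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_frequency(texts, word):
    """Count how often `word` occurs in each named text."""
    word = word.lower()
    return {name: Counter(tokenize(body))[word] for name, body in texts.items()}


def concordance(texts, word, width=3):
    """List every hit with up to `width` tokens of context on each side."""
    word = word.lower()
    lines = []
    for name, body in texts.items():
        tokens = tokenize(body)
        for i, tok in enumerate(tokens):
            if tok == word:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((name, left, tok, right))
    return lines


# Hypothetical two-text corpus, invented purely for illustration.
corpus = {
    "contract": "The term of the contract is one year. The term may be extended.",
    "glossary": "A term is a word with a precise meaning in a given field.",
}

print(word_frequency(corpus, "term"))
for name, left, hit, right in concordance(corpus, "term", width=2):
    print(f"{name}: {left} [{hit}] {right}")
```

Real corpus tools add lemmatization and morphological annotation on top of this, so that inflected forms of the same word are counted together.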
The Delta Translation Group's English-Russian parallel corpora contain more than a million language units. The database search engine is located at:
In our work we also use the following text corpora:
-
The Georgian Web Corpus contains over 150 million words and is available on the University of Leeds website:
http://corpus.leeds.ac.uk/internet.html
-
The Brown Corpus consists of 1 million words of American English, taken from texts on a variety of topics and grouped into 15 thematic categories
http://clu.uni.no/icame/brown/bcm.html
-
TITUS - ARMAZI - Caucasian Languages and Cultures: the first academic electronic database of Georgian-language texts, combining textual material from Georgian literature of different periods
http://armazi.uni-frankfurt.de/framee.htm
-
Modern Georgian Language Corpus - includes two subcorpora: the Corpus of Modern Georgian Language (124,055,170 units) and the Georgian Literary Corpus, with morphological annotation (20,903,850 units). The project is led by Paul Meurer, Senior Research Fellow, University of Bergen
http://clarino.uib.no/gekko/corpus-list
-
English Web Corpus - ukWaC. The corpus contains over 2 billion words, mostly from the .uk domain, and was built using medium-frequency words from the British National Corpus as seed words. It is morphologically annotated and lemmatized. The ukWaC corpus can be searched through Sketch Engine:
https://www.sketchengine.co.uk/documentation/wiki/Corpora/UKWaC
-
German Language Corpus / Deutsches Referenzkorpus DeReKo des Instituts für Deutsche Sprache (IDS)
http://www.ids-mannheim.de/kt/projekte/korpora/
-
20th Century German Reference Corpus / Referenzkorpus der deutschen Sprache des 20. Jahrhunderts (DWDS Kernkorpus)
http://www.dwds.de/
-
Corpus C4 / Korpus C4 (includes the corpus of the Digital Dictionary of the 20th Century German Language (DWDS), the Austrian Academy Corpus (AAC), the Swiss Text Corpus (CHTK), and the South Tyrol Corpus)
http://www.korpus-c4.org
-
German Text Archive / Deutsches Textarchiv (historical corpus of German texts, 1600-1900, 1,300 books)
http://www.deutschestextarchiv.de
-
British National Corpus (BNC)
http://www.natcorp.ox.ac.uk/
-
The Corpus of Contemporary American English (COCA)
http://corpus.byu.edu/coca/
-
Dortmund Chatkorpus
http://www.chatkorpus.uni-dortmund.de/