About the Corpus


The Electronic Corpus of 17th and 18th c. Polish Texts (up to 1772), also known as KorBa (an acronym of Korpus Barokowy, which is Polish for The Baroque Corpus), is the primary result of a 2013-2018 project by the Department of the History of the 17th and 18th Century Polish of the Institute of Polish Language, Polish Academy of Sciences (PAS), in cooperation with the Linguistic Engineering Group of the Institute of Computer Science, Polish Academy of Sciences (information on the project can be found in the "KORBA 1 Project"). The corpus consists of nearly 13.5 million segments as understood by the creators of the National Corpus of Polish [1] (Narodowy Korpus Języka Polskiego, hereafter also NKJP). The texts in the corpus are presented in both transcribed and transliterated forms. Rich metadata, structural and linguistic tags, morphosyntactic annotation and lemmatization allow for a wide range of queries, as well as filtering the search results and locating them in the source, down to the exact page number.

The project was heterogeneous in character. On the one hand, it involved selecting sample historical texts, then digitizing and editing them. On the other hand, it required creating IT tools for storing, converting, displaying and searching the text fragments contained in the corpus, as well as modifying existing IT tools that were originally created for corpora of modern texts. The project has modernized historical linguistics research methods and integrated them more closely with corpus linguistics.

Work on the corpus was resumed in 2019 under the "Extending of the Electronic Corpus of 17th and 18th c. Polish Texts and its integration with the Electronic Dictionary of the 17th- and 18th-century Polish" project, which is planned to last until 2023 (information about the project can be found in the "KORBA 2 project"). The expansion involves both an increase in the corpus's volume within the current chronological constraints (1601-1772) and an extension of its chronological range to encompass texts written from 1773 to 1800. In total, the Corpus is planned to consist of 25 million segments. An integration of the various resources on 17th-18th century Polish is also planned. These resources include the Electronic Corpus of 17th and 18th c. Polish Texts, the Electronic Dictionary of the 17th- and 18th-century Polish [2] (e-SXVII), the digitized card index the dictionary was based on [3] and the Digital Library of Polish and Poland-Related Ephemeral Prints from the 16th, 17th and 18th Centuries [4] (Cyfrowa Biblioteka Druków Ulotnych Polskich i Polski Dotyczących z XVI, XVII i XVIII Wieku, CBDU).

KorBa and other Polish corpora

Until 2013, Polish language corpora were, for the most part, limited to modern texts. Work on each corpus proceeded entirely separately until 2007-2012, when the National Corpus of Polish (NKJP) was created as a Ministry of Science and Higher Education project, on the initiative of the Institute of Computer Science PAS (which coordinated the project), the Institute of Polish Language PAS, Polish Scientific Publishers PWN and the Department of Computational and Corpus Linguistics at the University of Łódź. It is currently the largest Polish language corpus, but it is mainly focused on relatively modern texts. KorBa was conceived as a historical supplement – the first stage in extending the National Corpus’ range to include older texts. We hope that, in the near future, it will be possible to create (sub)corpora encompassing the entire history of written Polish. Work on such corpora has already begun. A so-called corpus of Old Polish texts [5] (up to the year 1500) has already been created by the Institute of Polish Language, PAS; at this moment, however, it contains no annotations (neither structural nor morphosyntactic) and offers no search engine. The first corpus of old Polish texts that meets the modern standards for such resources was created for the international IMPACT (Improving Access to Texts [6]) project. Work on this corpus lasted from 2009 to 2012 and was carried out in the Formal Linguistics Department of the University of Warsaw by a team under the supervision of Janusz S. Bień. One characteristic of this corpus, stemming from the overall goals of the project, is the extreme closeness of its transliteration to the original texts – the corpus differentiates between all shapes of graphemes present in the source texts [7]. Recently, the Institute of Polish Language of the University of Warsaw created a comparatively small corpus of 19th century texts [8] (from the years 1830-1918).
The Institute of Literary Research of the Polish Academy of Sciences has also begun work on a corpus of 16th century Polish texts [9].

This pattern can often be seen outside of Poland as well. Large corpora of modern texts are created first, and only later (in rare cases simultaneously) supplemented with corpora of older texts. These historical corpora are usually much smaller than their counterparts focused on modern texts. Exceptions include some relatively large historical corpora of the most widely used languages, such as English (Early English Books Online [10], 755 million words, and the Corpus of Historical American English [11], 400 million words) or Spanish [12] (more than 100 million words). Compared to these large historical corpora, our corpus may seem small. However, its size is comparable to most other historical corpora of European languages.

The creators of NKJP set many of the standards for Polish corpus linguistics; they also created the IT tools necessary for its construction and for easy access. It should come as no surprise then that we used the NKJP as a model for the KorBa project's approach to tool creation and have ensured that our corpus will be compatible with NKJP in the future, both in terms of its linguistic and structural aspects. In particular, we have strived for maximum consistency with the NKJP's morphosyntactic tagging system. However, for obvious reasons, it was not possible to use the same set of tags as in a corpus of modern texts. Some changes were necessary, as 17th and 18th century Polish, on the one hand, distinguished between some grammatical categories (or their values) that are not present in modern day Polish, while on the other hand it lacked some categories characteristic of the modern Polish language. The full set of morphosyntactic tags is presented in detail in "Instruction".

Our corpus represents an attempt at building a relatively large corpus of old Polish texts that would meet all the requirements set for modern resources of this type, while being designed with multidirectional research in mind. We have decided to forgo the extremely faithful transliteration featured in the IMPACT corpus, as it is a limiting factor in accessing the texts (users of the corpus for whom notation is essential may prefer the aforementioned IMPACT corpus). Our corpus gives easy access to the Polish national heritage left behind by the Baroque era and documents a stage in the evolution of the Polish language. Most importantly, however, the corpus can serve as a new tool for research in many branches of the humanities, such as linguistics, literature studies, culture studies, history and sociology. It allows one to easily search through and analyze old Polish texts.

The text selection process

During the conceptual work, we took into account the generally accepted essential attributes of language corpora: balance and representativeness. In the case of historical material, however, these attributes cannot always be fully achieved. In the end, the choice of texts was influenced by various criteria, including ones not related to language.

The contents of the corpus were constrained by our limited knowledge of the era's writing as a whole (and not only of the texts that have been preserved to the present day). Naturally, most of the preserved texts are works of literature, as they were treated with more care, often reprinted and passed down from generation to generation as an element of cultural heritage. This made it impossible to apply the principle used for most modern corpora, according to which literary fiction should comprise no more than 20% of a given corpus. Much of the information that plays a key role in constructing a modern corpus is very limited for the 17th-18th century and must often be inferred indirectly. A text's popularity, for example, may be inferred from the presence of multiple editions, or from references in other texts of that era.

Authors of historical corpora are bound by many such constraints on the available materials. They may only use one communication channel – written sources, and only those that have been preserved over hundreds of years. There are, however, also some advantages to working with historical material. Researchers have at their disposal a closed set of texts, one that has already finished its evolution – in comparison, a corpus of modern texts will always be open. Using historical material also provides a research perspective distanced by time, which makes synthesis easier.

The limited availability of materials has forced us to utilize many different types of sources, some of which are inherently flawed. The most desirable sources are original historical prints and manuscripts. These materials were, however, especially difficult to prepare, since they had to be transcribed into a digital form, which is an expensive and time-consuming task. Despite this, such texts comprise around half of the Corpus. Similar in value and credibility are those historical prints that have been carefully transliterated and made available to us by the Polish division of the IMPACT project (around 1.6 million segments in total). Our corpus also contains later editions of Baroque texts. Of these, 19th century editions are especially flawed as a result of significant alterations of the text by the publisher, which were considered standard practice at that time. We have, however, chosen to include them, as they were often the only preserved editions of highly interesting and valuable texts. Another type of source we have included is modern digital editions of 17th and 18th century texts. Technically, these were the most convenient, since they are already digital and as such can be easily incorporated. However, these sources often caused other difficulties, either legal (copyright laws) or linguistic and editorial, the latter stemming from the fact that modern publishers often base their editions on not one, but several historical editions. In general, we have adhered to the rule that it is preferable to include a text that is only available in a later form, rather than omit it from the corpus. Precise bibliographical information has been provided for all sources, including information on the type of the source (a transliterated historical print or manuscript, or a later 19th, 20th or 21st century edition).

In order to keep the corpus balanced, we have decided that very long texts (such as the so-called Gdańsk Bible, Syreniusz's herbarium, or the collected sermons of Birkowski, Młodzianowski or Starowolski) will only be included in the form of selected fragments.

In choosing texts for KorBa, we have considered the following types of variation: chronological, geographical, genre and subject matter.

According to our criteria for chronological variation, all parts of the Baroque era should be represented equally in terms of volume. For this purpose, we have chosen to divide the era into four parts: 1601-1650, 1651-1700, 1701-1750 and 1751-1772. These timeframes are, of course, entirely artificial and serve only to order the material. The aforementioned volume criterion often had to be adjusted to account for other factors. For example, many of the key works considered an integral part of the Polish 17th-18th-century literary canon were released in the first half of the 17th century. On the other hand, the first decades of the 18th century brought with them a cultural decline and a noticeable decrease in original, interesting written works.
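The four-part division can be illustrated with a minimal Python sketch. It only shows how texts might be assigned to the periods listed above and how segment volume per period could be tallied. The function names and the sample data are hypothetical and are not taken from the KorBa tools.

```python
# Period boundaries exactly as listed in the text (inclusive on both ends).
PERIODS = [(1601, 1650), (1651, 1700), (1701, 1750), (1751, 1772)]


def period_of(year):
    """Return the 'start-end' label of the period that contains `year`."""
    for start, end in PERIODS:
        if start <= year <= end:
            return f"{start}-{end}"
    raise ValueError(f"{year} lies outside the 1601-1772 range")


def segments_per_period(texts):
    """Sum segment counts per period for (year, segment_count) pairs."""
    totals = {f"{start}-{end}": 0 for start, end in PERIODS}
    for year, segments in texts:
        totals[period_of(year)] += segments
    return totals


# Hypothetical sample data: (year of the text, number of segments).
sample = [(1610, 12000), (1650, 8000), (1651, 5000), (1772, 3000)]
print(segments_per_period(sample))
```

Comparing such per-period totals is one simple way to check how far a selection departs from the equal-volume target described above.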

The chronological representation of texts in KorBa can be seen on the following graph:

When it comes to geographical variation, the corpus includes texts from all regions where the Polish language was used, in accordance with most historical studies on this period. These are: Mazovia, Lesser Poland, Greater Poland, Ruthenian Lands, the Grand Duchy of Lithuania, Silesia, Livonia, Pomerania and Prussia. The number of texts included from each region varies considerably. This is a reflection of the state of literary and, especially, publishing activity in these regions.

The geographical variation of texts in KorBa can be seen on the following map:

Genre variation is another factor that influenced our choice of texts for the corpus. The genologies and typologies used for constructing modern corpora are not ideal for historical material. For this reason, we have deemed it necessary to prepare a typology that would meet the requirements of KorBa. It is made up of several levels.

At the highest level, the texts are divided into rhymed (which constitute 21% of segments in the corpus), non-rhymed (76% of segments) and mixed (3% of segments). Information on this subject is particularly important, as the rhythm and rhyme of a poem may influence the inflectional form of any word a corpus user might search for.

The second level is the division between literary and non-literary texts. Literature has been divided further, in accordance with literary tradition, into epic, lyric, drama and syncretic texts. These categories, of course, contain texts representative of the literary genres characteristic of the epoch. The detailed classification of non-literary texts corresponds to the typology used in NKJP, as far as this is possible in the case of historical texts. Due to difficulties with categorizing the Bible, we have decided to treat it separately from all other texts. Listed below are all the types and genres of texts in the corpus:

Type: genres
Epic: epic poems, fables, hagiographies, parables and specula (mirrors), romances
Lyric: carols and folk carols, emblems, epigrams, epithalamia, epitaphs, laments, odes, panegyrics, psalms, riddles, songs, sonnets
Drama: comedies, dialogues, nativity plays, tragedies
Syncretic texts: pastorals, satires
Scientific-didactic texts: calendars, culinary recipes, encyclopedias and compendiums, guidebooks, guides, herbaria, instructions, lectures, phrasebooks, textbooks, treatises
Persuasive texts: dedications, sermons, political speeches, proverbs, speeches for various occasions, writings on political and social topics, writings on religious topics
Factual literature: accounts of events, chronicles, descriptions of journeys, geographical descriptions, memoirs, rolls of arms
Official & secretarial texts: city laws, contracts, documents of regional parliaments, inventories, judicial records, official letters, parliamentary bills, prenuptial agreements, privileges and charters, Sejm journals, Sejm texts, testaments
Press releases and leaflets

The volume of each genre in the corpus is represented by the following percentage values:

One of our main goals when building KorBa was to properly represent the subject matter characteristic of the era. Of particular importance was representing those areas that were popular at the time, but are now outliers (e.g. alchemy, astrology, herbal medicine).

Search options

The data encompassed within the Electronic Corpus of 17th and 18th c. Polish Texts (up to 1772) is available through the MTAS search engine, which uses Corpus Query Language (CQL). This language enables searching for either single segments or sequences of segments. Their form and relationship can be precisely set, using the attributes assigned to every segment. Less advanced users can use the query builder, which will enable them to substitute the CQL symbols with more familiar grammatical terms.
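The idea of a query builder can be sketched in Python as follows. The sketch assumes the bracketed `[attribute="value"]` notation common to CQL dialects, with `&` combining conditions on one segment and `[]` standing for any segment. The attribute names `base` and `orth` are illustrative assumptions only. The attributes and operators actually accepted by the KorBa search engine are described in "Instruction".

```python
def token(**attrs):
    """Build one CQL segment pattern, e.g. [base="pan"]; no attributes gives []."""
    parts = []
    for name, value in attrs.items():
        escaped = value.replace('"', '\\"')  # keep embedded quotes from ending the value
        parts.append(f'{name}="{escaped}"')
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens):
    """Join segment patterns into a query for consecutive segments."""
    return " ".join(tokens)


# Attribute names `base` and `orth` are assumptions made for this sketch.
print(token(base="pan"))
print(sequence(token(orth="wielki"), token()))
```

A helper of this kind keeps the bracket and quoting rules in one place, which is roughly the convenience a graphical query builder offers to less advanced users.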

The search results are available both in transcribed and transliterated form – the search engine allows one to easily switch between the two. Thanks to the rich metadata, it is also possible to search for segments of interest in any possible subcorpus.

A detailed description of all the functions of the search engine and MTAS search language can be found in "Instruction".


  1. Corpus available at: http://nkjp.pl.
  2. Dictionary available at: https://sxvii.pl/.
  3. Card index available at: https://www.rcin.org.pl/dlibra/publication/20029.
  4. Library available at: https://cbdu.ijp.pan.pl/.
  5. Corpus available at: https://ijp.pan.pl/publikacje-elektroniczne/korpus-tekstow-staropolskich.
  6. Cf.: http://www.impact-project.eu.
  7. Cf.: Bień, Janusz S. (2014) The IMPACT project Polish Ground-Truth texts as a DjVu corpus. „Cognitive Studies | Études Cognitives” (14). pp. 75-84; https://ispan.waw.pl/journals/index.php/cs-ec/article/view/cs.2014.008. The corpus is currently available at these addresses: https://szukajwslownikach.uw.edu.pl/IMPACT_GT_1/ and https://szukajwslownikach.uw.edu.pl/IMPACT_GT_2/.
  8. Cf.: http://www.f19.uw.edu.pl/ and https://szukajwslownikach.uw.edu.pl/f19/.
  9. Cf.: https://spxvi.edu.pl/korpus/.
  10. Cf.: https://corpus.byu.edu/eebo/.
  11. Cf.: https://corpus.byu.edu/coha/.
  12. Cf.: http://www.corpusdelespanol.org/.