Frequency lists

About the frequency list

For the project, we created a list of lemmas (basic forms) ranked by frequency of use. The lemmas and their frequencies were extracted from both versions of the corpus (one annotated with the Toygger tagger, the other with the Concraft tagger). Foreign elements, punctuation and numbers were not included. Frequency was calculated for lemmas only, so that, for example, the conjunction żeby and the particle żeby were counted as one. This decision was prompted by uncertainty about the inflectional category of many words in the corpus, in particular the indeclinable ones. Finally, all the lemmas were merged into a single list, which gives the frequency of use in each version of the corpus in a separate column. The initial list contained 286 980 lemmas. However, many of these were misinterpretations of tokens unknown to the Korbeusz morphological analyzer and had to be excluded from the final list. Such misinterpretations result from errors that inevitably arise at every stage of corpus building, such as typos in transliteration. The way in which we removed them from the list is described in the next section.

Included below are the 200 most common lemmas in the corpus, along with their frequency of use in both versions of the corpus (tagged by the two above-mentioned taggers). The frequency may differ between the two versions, since some tokens were interpreted differently by each tagger and thus linked to different basic forms. For example, Toygger interpreted the token winę in a given passage as a form of the noun wina ‘fault’, whereas Concraft interpreted the same token as a form of the verb winąć ‘to plait’.

The lemmas are sorted by their frequency of use in the Toygger version.
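The merging and sorting described above can be illustrated with a minimal Python sketch. It is not the project's actual tooling: the function name `merge_frequency_lists` and the toy input are assumptions made for illustration, and each input is taken to be a sequence of lemmas (one per token) from which foreign elements, punctuation and numbers have already been removed.

```python
from collections import Counter

def merge_frequency_lists(toygger_lemmas, concraft_lemmas):
    """Return rows (lemma, toygger_count, concraft_count), sorted by Toygger count."""
    toygger = Counter(toygger_lemmas)
    concraft = Counter(concraft_lemmas)
    # A lemma may occur in only one version; Counter returns 0 for the other.
    all_lemmas = set(toygger) | set(concraft)
    rows = [(lemma, toygger[lemma], concraft[lemma]) for lemma in all_lemmas]
    # Descending Toygger frequency; ties are broken by the lemma itself.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows

# Toy data (hypothetical): the same token "winę" is lemmatized differently
# by the two taggers, so "wina" and "winąć" each get a zero in one column.
toygger_tokens = ["i", "być", "wina", "i", "żeby"]
concraft_tokens = ["i", "być", "winąć", "i", "żeby"]
rows = merge_frequency_lists(toygger_tokens, concraft_tokens)
for lemma, n_toygger, n_concraft in rows:
    print(lemma, n_toygger, n_concraft)
```

Counting by lemma alone, as here, is also what makes the conjunction żeby and the particle żeby fall into a single entry.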

The process of creating the frequency list

In accordance with the nature of our project, the list of basic forms from the corpus should adhere to the rules for entries in the Electronic Dictionary of 17th-18th-Century Polish. Therefore, wherever possible, we have omitted the above-mentioned incorrect interpretations, as well as lemmas that would not constitute an independent lexeme. The lemmas are, needless to say, transcribed, not transliterated (for more on transliteration and transcription see “Instruction”).

As mentioned above, we have consistently omitted words tagged as foreign (in foreign languages), punctuation and numerals (including Roman numerals).

Among the 286 980 lemmas, 6198 contained symbols not found in the Polish alphabet: punctuation marks and other symbols (e.g. the lemmas to_jest, arcy-biskup, w-tobie, k'myśli, po-, bę-, otwierał', ś^o^, G**, \), numerals (e.g. 6-funtowy, ½, niesie1318), and letters from foreign alphabets (e.g. jεy, εkstractu). These are often expansions of abbreviations (e.g. to_jest stands for the frequently used abbreviation tj. ‘i.e.’) or the abbreviations themselves, as well as adjectives containing numerals (e.g. 6-funtowy ‘6-pound’). Many of the others result from incorrect segmentation or transcription or, less commonly, from other errors typical of the various stages of large-scale research. Of these 6198 lemmas, only three were included in the list, two particles (among them +że) and the adverb +kroć, as they appear in the database of the Korbeusz morphological analyzer. After the deletion of the remaining 6195 basic forms containing symbols not found in the Polish alphabet, 280 785 lemmas were left on the list.
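The alphabet-based filtering described above could look roughly like the following Python sketch. It rests on assumptions: the names `POLISH_ALPHABET`, `KEPT_EXCEPTIONS`, `has_only_polish_letters` and `filter_lemmas` are hypothetical, the alphabet is taken to be the 32 letters of the modern Polish alphabet (whether q, v and x were treated as acceptable in the project is not stated), and only the two exceptions named in the text are listed.

```python
import unicodedata

# Assumption: the 32 letters of the modern Polish alphabet (no q, v, x).
POLISH_ALPHABET = frozenset("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")

# Lemmas kept despite containing "+", because the analyzer's database has them.
# Only the exceptions named in the text are listed here; this set is illustrative.
KEPT_EXCEPTIONS = frozenset({"+że", "+kroć"})

def has_only_polish_letters(lemma, alphabet=POLISH_ALPHABET):
    """True if the lemma is non-empty and every character is a Polish letter."""
    # NFC makes sure letters such as "ó" are single code points before comparison.
    normalized = unicodedata.normalize("NFC", lemma).lower()
    return bool(normalized) and all(ch in alphabet for ch in normalized)

def filter_lemmas(lemmas, exceptions=KEPT_EXCEPTIONS):
    """Keep lemmas made only of Polish letters, plus the listed exceptions."""
    return [l for l in lemmas if l in exceptions or has_only_polish_letters(l)]

sample = ["to_jest", "arcy-biskup", "6-funtowy", "jεy", "+że", "+kroć", "Bóg", "zwada"]
kept = filter_lemmas(sample)
print(kept)

# The counts reported in the text are consistent with each other:
flagged, retained = 6198, 3
removed = flagged - retained      # 6195
remaining = 286_980 - removed     # 280 785
```

Lowercasing before the check keeps capitalised lemmas such as Bóg on the list, in line with the decision to retain them.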

Next, we removed from the list the 214 600 lemmas assigned to tokens that could not be identified by the Korbeusz morphological analyzer. Most of these were the result of various errors: if a token is not identified by Korbeusz, the taggers have to decide on an interpretation based on patterns taken from hand-tagged material. In such cases, the unmodified form of the token is treated as a lemma. For these reasons, we decided to remove all such lemmas from the list; searching them for possible lexemes would be an additional, time-consuming task and, as such, was not included in the project. Most of the lemmas in question rarely appear in the corpus. Only 3 of them can be found among the first thousand entries on the frequency list, 5 in the second thousand, 4 in the third and 8 in the fourth. It is only much lower on the list that they become more common. More than half of these lemmas have a frequency of one. Therefore, although such lemmas are numerous on the list, the corpus itself contains comparatively few tokens lemmatized in this way.
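As an illustration of this step, the sketch below removes lemmas marked as unknown to the analyzer and counts how many of them fall into each successive band of the ranked list, which is how figures such as "3 in the first thousand" can be obtained. The function names and the toy data are hypothetical, and the set of unknown lemmas is assumed to be available from the tagging output.

```python
def drop_unknown(ranked_lemmas, unknown):
    """Remove lemmas that were assigned to tokens unknown to the analyzer."""
    return [lemma for lemma in ranked_lemmas if lemma not in unknown]

def unknown_per_band(ranked_lemmas, unknown, band=1000, bands=4):
    """Count unknown lemmas in each of the first `bands` rank bands of size `band`."""
    counts = [0] * bands
    for rank, lemma in enumerate(ranked_lemmas):
        index = rank // band
        if index >= bands:
            break  # the list is ranked, so later entries lie outside the bands
        if lemma in unknown:
            counts[index] += 1
    return counts

# Toy data (hypothetical), with a band size of 2 instead of 1000.
ranked = ["i", "być", "xx", "w", "yy", "zz"]
unknown = {"xx", "yy", "zz"}
per_band = unknown_per_band(ranked, unknown, band=2, bands=3)
cleaned = drop_unknown(ranked, unknown)
print(per_band, cleaned)
```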

After the rejection of the aforementioned lemmas, the final list contains 66 185 entries. It may seem as if a great many of the words in the corpus were misidentified, seeing that more than 220 thousand lemmas were removed from the list. However, misidentified tokens comprise merely 4% of all segments from the corpus.
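The figures quoted in this section can be cross-checked with a few lines of Python. All numbers come from the text above; only the variable names are introduced here.

```python
initial_lemmas = 286_980           # lemmas extracted from both corpus versions
removed_non_alphabetic = 6_195     # lemmas with symbols outside the Polish alphabet
removed_unknown = 214_600          # lemmas of tokens unknown to the analyzer

after_alphabet_filter = initial_lemmas - removed_non_alphabetic
final_lemmas = after_alphabet_filter - removed_unknown
total_removed = removed_non_alphabetic + removed_unknown
retained_share = final_lemmas / initial_lemmas

print(after_alphabet_filter)   # 280785
print(final_lemmas)            # 66185
print(total_removed)           # 220795
print(f"{retained_share:.1%}")
```

The contrast in the paragraph above is one between lemma types and tokens: most distinct lemmas were removed, yet because more than half of the unknown ones occur only once, they account for only about 4% of the segments in the corpus.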

We have decided to leave lemmas that start with a capital letter on the list, as they constitute a significant part of the corpus, especially those near the top of the list. The list contains 14 595 lemmas starting with at least one capital letter (in total there were many more, but a great many of them were removed as unknown to Korbeusz). It comes as no surprise that the word Bóg ‘God’ is the most common in this category. Also common are proper names and nationalities (e.g. Chrystus ‘Christ’, Turek ‘Turk’, Polak ‘Pole’, Wojciech (Polish first name), Marcin (Polish first name), Mahomet ‘Muhammad’, Rzeczpospolita ‘Republic’, Lwów ‘Lviv’, Potocki (Polish surname), Jowisz ‘Jupiter’, Pegaz ‘Pegasus’), as well as segments that have been interpreted as modern-day acronyms (e.g. BC, SA, CD). Some common nouns have, for various reasons, been erroneously interpreted as surnames, so that the list has two lemmas instead of one (e.g. Zwada and zwada instead of zwada alone), but the frequency of such cases is generally low.

The final list contains 66 185 lemmas; 200 of these are included below.

Note: when searching for chosen lemmas, it is important to select the “reject foreign segments” option. Otherwise the result will often be higher than the one recorded on the frequency list.

Lemma                 Number of occurrences - Toygger   Number of occurrences - Concraft  
i 368873 368666
być 262116 261846
w 246195 246196
z 219308 213025
on 214170 212091
się 184744 184744
na 184692 184692
nie 182146 182785
ten 147993 148927
to 117827 114197
który 117419 117419
do 113801 113455
a 113550 111727
mieć 85040 84448
co 73163 73186
swój 69302 68799
że 67147 67147
od 58168 58101
tak 57176 57164
o 54757 54976
jako 53898 53887
za 47162 47162
ja 46430 46265
pan 45694 45476
wielki 43846 29877
po 43792 43734
mój 41790 29601
ale 38635 38619
by 37857 37857
móc 35210 35070
siebie 34296 34139
gdy 32551 32551
sam 32194 32110
jeden 31753 31763
aby 29185 29185
przez 27463 27463
bo 27431 27431
dla 27113 27113
albo 26860 26896
też 26661 26661
król 26616 22506
Bóg 25825 24487
ty 25470 25695
człowiek 25069 25030
drugi 24981 24458
rok 24316 24081
dać 23348 22725
my 23307 23305
chcieć 22842 22781
już 22553 22553
tylko 22358 22146
nasz 20809 18555
dzień 20557 18086
przy 20362 20362
czas 20316 19845
20176 20176
tam 19944 19907
pod 19739 19739
19687 19687
kto 18868 18820
wszystek 18384 18338
u 18320 18312
dobry 18151 17472
święty 17927 17474
taki 17894 17794
twój 17762 17426
rzecz 17747 17491
miasto 17325 28170
jak 17131 17099
nad 17100 18689
żeby 16932 16932
mówić 16925 16784
gdzie 16875 16875
tedy 16788 16616
wszytek 16778 15702
kiedy 16551 16531
zaś 16059 16059
widzieć 15748 15663
tu 15460 15460
dwa 15434 15434
czynić 15151 15114
każdy 14950 14950
ani 14919 14905
inszy 14855 14601
+że 14711 14711
wiele 14596 14596
ziemia 14434 14396
bez 13992 13605
jaki 13743 13631
przed 13230 13230
wziąć 13006 12829
miejsce 12992 12993
świat 12831 12539
abo 12489 12489
część 12271 12279
iść 12211 12164
syn 12210 10852
jeśli 12193 12193
wiedzieć 11854 11846
11761 11761
żaden 11700 11700
stać 11587 11145
uczynić 11294 11294
potym 11246 11246
dobrze 11134 11134
woda 11035 10867
pierwszy 11000 10350
ręka 10990 10891
serce 10964 10964
inny 10836 10842
jeszcze 10795 10795
rzec 10733 10834
nic 10707 10694
książę 10467 8234
barzo 10462 10457
także 10391 10391
wojsko 10224 10190
ku 10046 10046
zły 9899 9470
kościół 9739 9739
raz 9473 9482
oko 9449 9368
dom 9131 9056
według 9078 9078
teraz 9064 9064
cały 9027 8976
mały 8946 8611
jednak 8892 8892
prawo 8886 8945
trzy 8879 8814
strona 8862 9984
ojciec 8861 8439
niech 8818 8818
głowa 8805 8494
słowo 8755 8723
ciało 8754 8754
stary 8582 8006
złoty 8564 8401
lecz 8508 8477
sposób 8472 8336
śmierć 8429 8429
dużo 8163 7427
wy 7955 7955
koń 7953 7837
polski 7935 7293
musieć 7923 7869
ów 7861 7344
dawać 7796 7776
przyjść 7717 7575
zaraz 7428 7415
niebo 7384 7382
różny 7365 7301
brać 7352 7120
prosić 7296 7251
potrzeba 7221 7138
góra 7213 7021
kazać 7086 7021
więc 7081 7081
nowy 7080 6759
imć 7040 3001
między 7026 7026
zwać 7022 6994
droga 6942 6776
choć 6898 6898
pisać 6785 6549
sprawa 6647 6602
boży 6635 6548
bywać 6617 6557
zawsze 6570 6570
dusza 6553 6345
trzeci 6539 6404
niż 6536 6509
trzeba 6486 6486
tysiąc 6473 6466
jeżeli 6455 6455
imię 6415 3496
krew 6337 6298
morze 6301 6262
ogień 6273 6240
pański 6256 5831
bardzo 6230 6231
miłość 6229 6229
lubo 6139 6137
rozumieć 6025 5909
mało 5933 5834
powinien 5920 5852
daleko 5872 5911
czy 5871 5871
powiedzieć 5833 5845
koniec 5827 5731
znać 5722 5703
sejm 5689 5678
cesarz 5639 5192
wiara 5635 5625
żyć 5630 5364
wojna 5600 5317
ksiądz 5581 7656
siła 5519 5510
brat 5468 5578
cnota 5395 5133

Full frequency list of lemmas