Korba - Frequency lists

Frequency lists

About the frequency list

For the project, we created a list of lemmas (basic forms) ranked by frequency of use. The lemmas, along with information on their frequency, were extracted from both versions of the corpus (one annotated using the Toygger tagger, the other using the Concraft tagger). Foreign elements, punctuation and numbers were not included. Frequency was calculated only for lemmas (basic forms), regardless of grammatical class; for example, the conjunction żeby and the particle żeby were counted as one. This decision was motivated by uncertainty regarding the inflectional category of many of the words contained in the corpus, in particular the indeclinable ones. Finally, all of the lemmas were combined into a single list, which records, in separate columns, the frequency of use in each version of the corpus. The initial list contained 286 980 lemmas. However, many of these were misinterpretations of tokens unknown to the Korbeusz morphological analyzer and had to be excluded from the final list. Such misinterpretations result from errors that inevitably appear at every stage of corpus building, such as typos in transliteration. The way in which we removed them from the list is described in the next section.
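The merging step described above can be illustrated with a minimal sketch. It assumes that each version of the corpus is available as a plain sequence of lemmas; the function name `merge_frequency_lists` and the toy input are hypothetical and are not taken from the project's actual tooling.

```python
from collections import Counter

def merge_frequency_lists(toygger_lemmas, concraft_lemmas):
    """Return rows (lemma, toygger_count, concraft_count).

    Counting is done per lemma only, so homonymous lemmas of different
    grammatical classes (e.g. the conjunction and the particle "żeby")
    fall into one entry. Rows are sorted by the Toygger count,
    descending, with the lemma itself as a tie-breaker.
    """
    toygger = Counter(toygger_lemmas)
    concraft = Counter(concraft_lemmas)
    all_lemmas = toygger.keys() | concraft.keys()
    # A Counter returns 0 for a lemma it has never seen.
    rows = [(lemma, toygger[lemma], concraft[lemma]) for lemma in all_lemmas]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows

# Toy input (hypothetical): the two taggers disagree on one token.
rows = merge_frequency_lists(
    ["żeby", "i", "i", "wina"],
    ["żeby", "i", "i", "winąć"],
)
for lemma, toygger_count, concraft_count in rows:
    print(lemma, toygger_count, concraft_count, sep="\t")
```

A lemma produced by only one tagger still gets a row, with a count of 0 in the other column.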

Included below are the 200 most common lemmas in the corpus, along with their frequency of use in both versions of the corpus (tagged by the two above-mentioned taggers). The frequency usually differs only slightly between the two versions; differences arise because some tokens were interpreted differently depending on the tagger, and thus linked with different basic forms. For example, in the same passage Toygger interpreted the token winę as a form of the noun wina ‘fault’, while Concraft interpreted it as a form of the verb winąć ‘to plait’.

The lemmas are sorted by their frequency of use in the Toygger version.
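To see where the two taggers disagree most, the table can be read programmatically. The sketch below assumes the table is available as tab-separated text with one header line; the four data rows are copied from the table on this page, while the names `load_rows` and `relative_gap` are hypothetical helpers.

```python
import csv
import io

# Header plus four rows copied from the frequency table on this page.
SAMPLE = "\n".join([
    "Lemma\tNumber of occurrences - Toygger\tNumber of occurrences - Concraft",
    "się\t184744\t184744",
    "wielki\t43846\t29877",
    "miasto\t17325\t28170",
    "imć\t7040\t3001",
])

def load_rows(text):
    """Parse tab-separated text into (lemma, toygger, concraft) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header line
    return [(lemma, int(t), int(c)) for lemma, t, c in reader]

def relative_gap(toygger, concraft):
    """Absolute difference as a fraction of the larger of the two counts."""
    larger = max(toygger, concraft) or 1  # guard against division by zero
    return abs(toygger - concraft) / larger

rows = load_rows(SAMPLE)
by_gap = sorted(rows, key=lambda r: relative_gap(r[1], r[2]), reverse=True)
for lemma, t, c in by_gap:
    print(f"{lemma}\t{t}\t{c}\t{relative_gap(t, c):.1%}")
```

Sorting by the relative rather than the absolute gap keeps very frequent lemmas with small proportional differences from dominating the result.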

The process of creating the frequency list

In keeping with the nature of our project, the list of basic forms from the corpus should adhere to the rules for entries in the Electronic Dictionary of 17th-18th-Century Polish. Therefore, wherever possible, we have omitted the above-mentioned incorrect interpretations, as well as lemmas that would not constitute an independent lexeme. The lemmas are, needless to say, transcribed, not transliterated (for more on transliteration and transcription, see “Instruction”).

As mentioned above, we have consistently omitted words tagged as foreign (in foreign languages), punctuation and numerals (including Roman numerals).

Among the 286 980 lemmas, 6198 contained symbols not found in the Polish alphabet: punctuation marks and symbols (e.g. the lemmas to_jest, arcy-biskup, w-tobie, k'myśli, po-, bę-, otwierał', ś^o^, G**, \), numerals (e.g. 6-funtowy, ½, niesie1318), as well as letters from foreign alphabets (e.g. jεy, εkstractu, až). These are often expansions of abbreviations (e.g. to_jest stands for the frequently used abbreviation tj. ‘i.e.’) or the abbreviations themselves, as well as adjectives containing numerals (e.g. 6-funtowy ‘6-pound’). Beyond that, many of these lemmas result from incorrect segmentation or transcription or, less commonly, from other errors typical of the various stages of large-scale research. Out of these 6198 lemmas, only three were kept on the list: the particles +ż and +że and the adverb +kroć, as they appear in the database of the Korbeusz morphological analyzer. After the deletion of the remaining 6195 basic forms containing symbols not found in the Polish alphabet, 280 785 lemmas remained on the list.
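A filter of this kind might look roughly as follows. This is only a sketch under stated assumptions: the alphabet is taken to be the 32 letters of the modern Polish alphabet (whether the project also accepted q, v and x is not stated here), and the names `POLISH_LETTERS`, `KEPT_EXCEPTIONS` and `keep_lemma` are hypothetical.

```python
import unicodedata

# Assumption: the 32 letters of the modern Polish alphabet (no q, v, x).
POLISH_LETTERS = frozenset("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")

# The three lemmas that the text says were kept despite the "+" symbol.
KEPT_EXCEPTIONS = frozenset({"+ż", "+że", "+kroć"})

def keep_lemma(lemma, alphabet=POLISH_LETTERS, exceptions=KEPT_EXCEPTIONS):
    """Return True if the lemma consists only of letters from `alphabet`
    (case-insensitively) or is one of the listed exceptions."""
    # Normalise so that e.g. "o" + combining acute is compared as "ó".
    lemma = unicodedata.normalize("NFC", lemma)
    if lemma in exceptions:
        return True
    return bool(lemma) and all(ch.lower() in alphabet for ch in lemma)

# Examples taken from the text above.
candidates = ["zwada", "Bóg", "+ż", "to_jest", "arcy-biskup",
              "6-funtowy", "jεy", "až"]
kept = [lemma for lemma in candidates if keep_lemma(lemma)]
print(kept)
```

Passing the alphabet as a parameter makes it easy to repeat the filtering with a wider set of letters if that matches the project's criteria better.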

Next, we removed from the list the 214 600 lemmas that had been assigned to tokens which could not be identified by the Korbeusz morphological analyzer. Most of these were the result of various errors: if a token is not identified by Korbeusz, the taggers are left to decide on an interpretation based on patterns taken from hand-tagged material, and in such cases the unmodified form of the token is treated as its lemma. For these reasons, we decided to remove all such lemmas from the list; searching them for possible lexemes would be an additional, time-consuming task that lies outside the scope of the project. Most of the lemmas in question appear only rarely in the corpus. Only 3 of them can be found among the first thousand entries on the frequency list, 5 in the second thousand, 4 in the third and 8 in the fourth. It is only much lower on the list that they become more common. More than half of these lemmas have a frequency of one. Therefore, although they make up a large share of the entries on the list, the corpus itself contains comparatively few tokens lemmatized in this way.
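The difference between the share of removed lemmas and the share of affected tokens can be illustrated with a small sketch. Everything in it is hypothetical: membership in a set of known lemmas stands in for the analyzer's judgement, and the counts and the unknown forms are invented toy data, not corpus figures.

```python
def split_by_known(freq, known_lemmas):
    """Split a lemma -> count mapping into (kept, removed) dictionaries.

    `known_lemmas` is a stand-in for "the token was identified by the
    morphological analyzer"; the real criterion is applied per token.
    """
    kept = {l: n for l, n in freq.items() if l in known_lemmas}
    removed = {l: n for l, n in freq.items() if l not in known_lemmas}
    return kept, removed

def token_share(part, whole):
    """Fraction of all tokens covered by the lemmas in `part`."""
    return sum(part.values()) / sum(whole.values())

# Invented toy data: two frequent known lemmas, three rare unknown forms.
freq = {"i": 90, "być": 60, "xyzt": 1, "abcq": 1, "qrst": 2}
known = {"i", "być"}

kept, removed = split_by_known(freq, known)
lemma_share = len(removed) / len(freq)
print(f"removed lemmas: {lemma_share:.0%} of entries")
print(f"removed tokens: {token_share(removed, freq):.1%} of tokens")
```

In the toy data most entries are removed, yet they account for only a small fraction of tokens, which is the same pattern the paragraph above describes for the corpus.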

After the rejection of the aforementioned lemmas, the final list contains 66 185 entries. It may seem as if a great many of the words in the corpus were misidentified, seeing that more than 220 thousand lemmas were removed from the list. However, misidentified tokens make up merely 4% of all segments in the corpus.
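The counts reported in this section are mutually consistent, as the following arithmetic check shows. All figures are taken from the text; only the variable names are introduced here for illustration.

```python
initial_lemmas = 286_980        # all lemmas extracted from the corpus
with_foreign_symbols = 6_198    # lemmas with symbols outside the Polish alphabet
kept_exceptions = 3             # +ż, +że and +kroć stay on the list

removed_for_symbols = with_foreign_symbols - kept_exceptions
after_symbol_filter = initial_lemmas - removed_for_symbols

removed_as_unknown = 214_600    # lemmas of tokens unknown to Korbeusz
final_lemmas = after_symbol_filter - removed_as_unknown

total_removed = removed_for_symbols + removed_as_unknown

print(removed_for_symbols)   # 6195
print(after_symbol_filter)   # 280785
print(final_lemmas)          # 66185
print(total_removed)         # 220795
```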

We have decided to leave lemmas that start with a capital letter on the list, as they constitute a significant part of the corpus, especially those that stand near the top of the list. The list contains 14 595 lemmas starting with a capital letter (in total there were many more, but a great many of them were removed as unknown to Korbeusz). It comes as no surprise that the word Bóg ‘God’ is the most common in this category. Also common are proper names and nationalities (e.g. Chrystus ‘Christ’, Turek ‘Turk’, Polak ‘Pole’, Wojciech (Polish first name), Marcin (Polish first name), Mahomet ‘Muhammad’, Rzeczpospolita ‘Republic’, Lwów ‘Lviv’, Potocki (Polish surname), Jowisz ‘Jupiter’, Pegaz ‘Pegasus’), as well as segments that have been interpreted as modern-day acronyms (e.g. BC, SA, CD). Some common nouns have, for various reasons, been erroneously interpreted as surnames, so that the list contains two lemmas instead of one (e.g. Zwada and zwada instead of just zwada); however, the frequency of such lemmas is generally low.
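Capitalised lemmas and possible doublets of the Zwada/zwada type can be located with a few lines of code. This is a sketch only: the helper names `capitalised` and `case_doublets` are hypothetical, and a doublet found this way is merely a candidate, since a proper name and a common noun may legitimately coexist.

```python
def capitalised(lemmas):
    """Lemmas whose first character is an uppercase letter."""
    return [lemma for lemma in lemmas if lemma[:1].isupper()]

def case_doublets(lemmas):
    """Pairs (Capitalised, lowercase) where both forms are on the list."""
    present = set(lemmas)
    return sorted(
        (lemma, lemma.lower())
        for lemma in present
        if lemma[:1].isupper() and lemma.lower() in present
    )

# Examples taken from the text above, plus two ordinary lemmas.
lemmas = ["Bóg", "Zwada", "zwada", "BC", "król", "+ż"]
print(capitalised(lemmas))
print(case_doublets(lemmas))
```

Each candidate pair would still need manual review before the two entries are merged.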

The final list contains 66 185 lemmas; 200 of these are included below.

Note: when searching the corpus for a given lemma, it is important to select the “reject foreign segments” option. Otherwise, the number of results will often be higher than the one recorded on the frequency list.

Lemma	Number of occurrences - Toygger	Number of occurrences - Concraft
i	368873	368666
być	262116	261846
w	246195	246196
z	219308	213025
on	214170	212091
się	184744	184744
na	184692	184692
nie	182146	182785
ten	147993	148927
to	117827	114197
który	117419	117419
do	113801	113455
a	113550	111727
mieć	85040	84448
co	73163	73186
swój	69302	68799
że	67147	67147
od	58168	58101
tak	57176	57164
o	54757	54976
jako	53898	53887
za	47162	47162
ja	46430	46265
pan	45694	45476
wielki	43846	29877
po	43792	43734
mój	41790	29601
ale	38635	38619
by	37857	37857
móc	35210	35070
siebie	34296	34139
gdy	32551	32551
sam	32194	32110
jeden	31753	31763
aby	29185	29185
przez	27463	27463
bo	27431	27431
dla	27113	27113
albo	26860	26896
też	26661	26661
król	26616	22506
Bóg	25825	24487
ty	25470	25695
człowiek	25069	25030
drugi	24981	24458
rok	24316	24081
dać	23348	22725
my	23307	23305
chcieć	22842	22781
już	22553	22553
tylko	22358	22146
nasz	20809	18555
dzień	20557	18086
przy	20362	20362
czas	20316	19845
+ż	20176	20176
tam	19944	19907
pod	19739	19739
iż	19687	19687
kto	18868	18820
wszystek	18384	18338
u	18320	18312
dobry	18151	17472
święty	17927	17474
taki	17894	17794
twój	17762	17426
rzecz	17747	17491
miasto	17325	28170
jak	17131	17099
nad	17100	18689
żeby	16932	16932
mówić	16925	16784
gdzie	16875	16875
tedy	16788	16616
wszytek	16778	15702
kiedy	16551	16531
zaś	16059	16059
widzieć	15748	15663
tu	15460	15460
dwa	15434	15434
czynić	15151	15114
każdy	14950	14950
ani	14919	14905
inszy	14855	14601
+że	14711	14711
wiele	14596	14596
ziemia	14434	14396
bez	13992	13605
jaki	13743	13631
przed	13230	13230
wziąć	13006	12829
miejsce	12992	12993
świat	12831	12539
abo	12489	12489
część	12271	12279
iść	12211	12164
syn	12210	10852
jeśli	12193	12193
wiedzieć	11854	11846
aż	11761	11761
żaden	11700	11700
stać	11587	11145
uczynić	11294	11294
potym	11246	11246
dobrze	11134	11134
woda	11035	10867
pierwszy	11000	10350
ręka	10990	10891
serce	10964	10964
inny	10836	10842
jeszcze	10795	10795
rzec	10733	10834
nic	10707	10694
książę	10467	8234
barzo	10462	10457
także	10391	10391
wojsko	10224	10190
ku	10046	10046
zły	9899	9470
kościół	9739	9739
raz	9473	9482
oko	9449	9368
dom	9131	9056
według	9078	9078
teraz	9064	9064
cały	9027	8976
mały	8946	8611
jednak	8892	8892
prawo	8886	8945
trzy	8879	8814
strona	8862	9984
ojciec	8861	8439
niech	8818	8818
głowa	8805	8494
słowo	8755	8723
ciało	8754	8754
stary	8582	8006
złoty	8564	8401
lecz	8508	8477
sposób	8472	8336
śmierć	8429	8429
dużo	8163	7427
wy	7955	7955
koń	7953	7837
polski	7935	7293
musieć	7923	7869
ów	7861	7344
dawać	7796	7776
przyjść	7717	7575
zaraz	7428	7415
niebo	7384	7382
różny	7365	7301
brać	7352	7120
prosić	7296	7251
potrzeba	7221	7138
góra	7213	7021
kazać	7086	7021
więc	7081	7081
nowy	7080	6759
imć	7040	3001
między	7026	7026
zwać	7022	6994
droga	6942	6776
choć	6898	6898
pisać	6785	6549
sprawa	6647	6602
boży	6635	6548
bywać	6617	6557
zawsze	6570	6570
dusza	6553	6345
trzeci	6539	6404
niż	6536	6509
trzeba	6486	6486
tysiąc	6473	6466
jeżeli	6455	6455
imię	6415	3496
krew	6337	6298
morze	6301	6262
ogień	6273	6240
pański	6256	5831
bardzo	6230	6231
miłość	6229	6229
lubo	6139	6137
rozumieć	6025	5909
mało	5933	5834
powinien	5920	5852
daleko	5872	5911
czy	5871	5871
powiedzieć	5833	5845
koniec	5827	5731
znać	5722	5703
sejm	5689	5678
cesarz	5639	5192
wiara	5635	5625
żyć	5630	5364
wojna	5600	5317
ksiądz	5581	7656
siła	5519	5510
brat	5468	5578
cnota	5395	5133

Full frequency list of lemmas