The Role of Corpus in Dictionary Compilation

A dictionary is a description of the vocabulary of a language. It explains what words mean and shows how they work together to form sentences. The information presented in a dictionary comes from two main sources: introspection and observation. Introspection means looking inside our own minds and trying to recall everything we know about words. Observation means examining real-life examples of language in use (in newspapers, novels, blogs, tweets, etc.) to see how people use words when they communicate with each other.

Fluent speakers of a language naturally already know a great deal about its vocabulary, so introspection can be a useful source of insight into what a word means and how it is used. However, a dictionary must give a complete and balanced account of a word’s behavior, and introspection alone cannot supply enough information for that purpose. As a result, lexicographers since Samuel Johnson’s time in the 18th century have chosen to base their dictionaries on observation. In Johnson’s era, observing language was laborious work: it meant reading hundreds of books and extracting good examples of words in use. Today’s computer technology makes all of that far easier.

This article describes the role of the corpus in compiling a dictionary. It first presents the definition of a corpus, a brief history of corpora, and a typology of corpora. It then turns to the dictionary itself and the role of the corpus in compiling it.

Observation of language: citations and corpus

Lexicographers have long relied on citations: examples of words in use, taken from books or other sources, that serve as a basis for describing a language. Citation data are very useful for tracking changes in language and for recognizing new words and phrases as they come into use. Citations still play a role in dictionary compilation today.

Meanwhile, using a corpus has become common practice in dictionary compilation. A corpus is a systematically organized collection of natural texts, both spoken and written. The texts are “natural” because they were produced and used in real communication rather than invented. They include novels, academic books and papers, newspapers, magazines, recorded talk shows and broadcast interviews, blogs, online journals, discussion groups, and much more. The collection is “systematic” because its structure and content follow certain extralinguistic principles, particularly the sampling principle, which governs the selection of texts for inclusion. For example, a corpus may be limited to a certain type of text, to one or more varieties of English, or to a certain period of time. “Systematic” also means that information about the exact composition of the corpus is available to the researcher (including the number of words in each category and in the corpus as a whole, how the texts were sampled, and so on). Although the term can refer to any systematic collection of texts, today it is usually used in a narrower sense, for a systematic collection of texts that has been computerized or is available in electronic form.
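To make the idea of a documented composition concrete, here is a minimal Python sketch of a corpus whose texts carry sampling metadata, with a function that reports the number of words per category and in total. The class name, the category labels, and the sample texts are all invented for illustration, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CorpusText:
    """One text in the corpus plus the metadata recorded when it was sampled."""
    title: str
    category: str  # hypothetical category label, e.g. "newspaper" or "blog"
    mode: str      # "written" or "spoken"
    text: str


def composition(corpus):
    """Return the number of running words per category and the overall total."""
    words_per_category = Counter()
    for item in corpus:
        # Whitespace splitting is a deliberate simplification of tokenization.
        words_per_category[item.category] += len(item.text.split())
    return dict(words_per_category), sum(words_per_category.values())


# Invented sample texts, for illustration only.
corpus = [
    CorpusText("Morning news", "newspaper", "written",
               "The council approved the new budget today"),
    CorpusText("Travel diary", "blog", "written",
               "We reached the village before dark"),
    CorpusText("Radio interview", "interview", "spoken",
               "well I think it went quite well"),
    CorpusText("Evening news", "newspaper", "written",
               "Heavy rain closed two roads"),
]

per_category, total = composition(corpus)
print(per_category)
print(total)
```

Keeping the metadata next to each text is what lets a researcher later restrict a search to one category or check how balanced the collection is.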


Early stage of corpus formation

The use of corpora in language research is a fairly recent approach. Corpus linguistics emerged in the 1960s, at the same time that Noam Chomsky was having a major impact on modern language studies. His book Syntactic Structures appeared in 1957 and quickly became a much-discussed text. His second book, Aspects of the Theory of Syntax, published in 1965, triggered a revision of the standard paradigm in theoretical linguistics. However, as linguistic theory focused increasingly on language as a universal phenomenon, other linguists grew dissatisfied with the descriptions available for the individual languages they studied. Some of the grammar rules in those descriptions did not match what was found in written texts. Natural language data were therefore needed.

In the late 1950s Randolph Quirk began the Survey of English Usage for empirical grammar research. Initially the data were not computerized; that was not done until the mid-1980s, by Quirk and Greenbaum. The survey was later the starting point for the International Corpus of English (ICE). The Survey corpus itself consists of 1 million words: 500 thousand spoken and 500 thousand written.

The second corpus project, carried out in the 1960s, was the Brown Corpus, named after Brown University in Providence, Rhode Island. Compiled by Nelson Francis and Henry Kučera, it consists of 1 million words: samples of about 2,000 words each, taken from 500 American texts covering 15 text categories, such as those represented in the Library of Congress, America’s national library. The Brown Corpus was carefully constructed, is very easy to use, and has been proofread so thoroughly that it contains almost no errors.

The third corpus project, English Lexical Studies, began in Edinburgh in 1963 and was completed in Birmingham. It was led by John Sinclair, who was the first to use a corpus specifically for lexical research and who put forward the new concept of collocation. The project was based on a very small sample of written and spoken texts in electronic form, amounting to less than one million words.

The next corpus project was created for the purpose of compiling a dictionary: the Collins Cobuild English Language Dictionary, which was compiled in the mid-1970s and published in 1987 under the guidance of John Sinclair. It was the first general-language dictionary compiled on the basis of a corpus. The corpus therefore had to be large enough to cover all the entries and word senses the dictionary would contain. It consists of 18.3 million words.

Corpus projects have continued to emerge since then, including the London-Lund Corpus of Spoken English (500 thousand words, spoken), the British National Corpus (100 million words), the Bank of English (455 million words), the American National Corpus (14 million words), the Corpus of Contemporary American English (450 million words), and the International Corpus of English (1 million words for each regional or national variety).


Typology of the corpus

There are several types of corpus, suited to different types of analysis. Among them are:

  • the general or reference corpus (vs. the special corpus), for example the British National Corpus (BNC) or the Bank of English, which describes a language or language variety as a whole (spoken and written language, different types of text, etc.);
  • the historical corpus (vs. the contemporary corpus), for example A Representative Corpus of Historical English Registers (ARCHER), which describes earlier stages of a language;
  • the regional corpus (vs. the corpus containing more than one variety), for example the Wellington Corpus of Written New Zealand English (WCNZE), which describes one regional variety of a language;
  • the learner corpus (vs. the native-speaker corpus), such as the International Corpus of Learner English, which describes the language used by learners of a language;
  • the multilingual corpus (vs. the monolingual corpus), which contains the same texts in two or more languages (for contrastive analysis); and
  • the spoken corpus (vs. the written corpus vs. the mixed corpus), for example the London-Lund Corpus of Spoken English, which describes spoken language.

Corpora can also be distinguished not by the texts they contain but by how those texts are treated. One such type is:

  • the annotated corpus (vs. the orthographic corpus), in which some kind of linguistic analysis has been applied to the text, such as sentence analysis (parsing) and word-class tagging.
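The difference between an orthographic and an annotated corpus can be illustrated with a small Python sketch. It assumes a simple "word/TAG" notation in which every token carries a word-class label; the tag names (PRON, VERB, and so on) and the sample sentence are invented for illustration and do not reproduce any particular corpus's tagset.

```python
from collections import Counter


def parse_annotated(line):
    """Split a 'word/TAG' annotated line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last slash, so words containing '/' survive.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def tag_counts(pairs, target):
    """Count the word-class tags attached to one word form (case-insensitive)."""
    return Counter(tag for word, tag in pairs if word.lower() == target.lower())


# The orthographic (plain) version holds only the words themselves.
plain = "They house the books in a small house"
# The annotated version adds a word-class tag to every token.
# The tag names are illustrative, not a real tagset.
annotated = ("They/PRON house/VERB the/DET books/NOUN "
             "in/PREP a/DET small/ADJ house/NOUN")

pairs = parse_annotated(annotated)
print(tag_counts(pairs, "house"))
```

With the annotation in place, a query can separate house as a verb from house as a noun, which a search over the plain text alone cannot do.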


The role of the corpus in compiling a dictionary

In dictionary compilation, the corpus is very helpful for working on the microstructure of the dictionary, which includes entries and sub-entries, word classes, definitions, and usage examples. Lexicographers use computer programs to extract information from the corpus. The following are the things a corpus can do in compiling a dictionary.

  1. At the entry collection stage, the corpus helps lexicographers compile a list of words ordered from highest to lowest frequency. Based on frequency of occurrence, the lexicographer can decide how many words to include, according to the type of dictionary being compiled. Figure 1 shows a word list extracted from the corpus using Sketch Engine. The table shows that yang (‘which’), dan (‘and’), and di (‘in, at’) are the three most frequent words, each occurring more than 1 million times.

Figure 1: Word list


  1. At the stage of determining entries, the corpus, together with a concordance program, can help lexicographers distinguish which candidates are compound words or idioms and which are not. The data in Figure 2 show that the corpus contains five noun phrases: households, hospitals, wooden house, the house itself, and the house of the Lord. Two of them, wooden house and house, are not compound words and therefore cannot be entered in the dictionary as entries or sub-entries.


Figure 2: Concordance of the word house


  1. The corpus can help determine the word class of an entry because it provides the different contexts in which the word occurs. For example, KBBI Edition IV (2008: 1211) assigns the word salut only one word class, noun. However, when salut is extracted from the corpus, it turns out that the word can also function as an adjective, as in sentences 56279, 19125, and 4172 in Figure 3.

Figure 3: Concordance of the word salut


  1. The corpus assists lexicographers in defining an entry. Defining usually requires several stages of analysis before a good, precise definition can be produced. Based on the available data, the sub-entry to enter can be defined as ‘entering data, information, etc. into somewhere’.


Figure 4: concordance of the drafting word


  1. The corpus gives lexicographers flexibility in finding and selecting good examples for dictionary users. After looking up the meaning of a word, dictionary users usually go on to read the example sentences. A good example shows how the word is used in context and helps explain what it means. An example for an entry should therefore reflect how the word is actually used in real life.
  1. The corpus helps identify the collocations of a word. As Figure 5 shows, the word eat can collocate with the words drink, lunch, eat, dinner, food, and so on.


Figure 5: Collocations of the word eat


  1. The corpus can help track changes in words. New words are the clearest sign of language change, but the corpus can also help detect subtler changes, such as new meanings of existing words, changes in spelling, or even changes in grammar.

Figure 6: Concordance of the word documentation

The entry documentation in KBBI Edition IV (2008: 338) has two senses: 1 the collection, selection, processing, and storage of information in a field of knowledge; 2 the provision or collection of evidence and information (such as pictures, quotations, newspaper clippings, and other reference material): the committee includes sections for exhibitions, publications, and documentation. If the corpus data in Figure 6 are examined closely, it appears that documentation can also mean ‘documents provided or collected as evidence or reference material’. The entry documentation can thus be said to have undergone language change, specifically a change in meaning.
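Three of the corpus operations described in the list above (the frequency list, the concordance, and collocation) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how Sketch Engine or any other corpus tool works internally: the function names are invented, the tokenizer handles plain lowercase English only, and the sample sentences are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplification)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs ordered from most to least frequent."""
    return Counter(tokens).most_common()


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines: `width` tokens on each side of a hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


def collocates(tokens, keyword, window=2):
    """Count the words occurring within `window` tokens of the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            neighbours = (tokens[max(0, i - window):i]
                          + tokens[i + 1:i + 1 + window])
            counts.update(neighbours)
    return counts


# Invented miniature corpus, for illustration only.
sample = (
    "We eat lunch at noon. They eat dinner at home. "
    "I eat lunch with friends. Children eat and drink at school."
)
tokens = tokenize(sample)

print(frequency_list(tokens)[:3])
for line in concordance(tokens, "eat"):
    print(line)
print(collocates(tokens, "eat").most_common(3))
```

Raw neighbour counts like these also pick up function words such as at. Corpus tools typically rank collocates with a statistical association measure instead, which this sketch leaves out.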


The use of the corpus in dictionary making is still relatively new, yet it has already become standard in the preparation of modern dictionaries. This has been supported by the development of computer technology capable of processing corpora in ways that make lexicographers’ work easier. Systematic analysis of corpus data has also produced many new findings, which lead to updated and improved dictionary entries and thus to the most accurate possible description of a language.


by Abdullah Sam
