What Is Corpus Linguistics and What Are the Stages in Building a Corpus

Corpus linguistics deals with the structure, preparation and evaluation of (electronic) corpora. A corpus is a collection of linguistic data: texts compiled according to linguistic criteria. Most corpora consist of written language, and the texts may take a variety of forms, including transcribed conversations. There are also corpora of audio or video files.

A corpus can be part of a library of electronic texts, but it is built according to explicit criteria for a special purpose. The word corpus comes from the Latin for "body"; in linguistics it refers to a systematic collection of a large amount of text that can be stored and processed electronically.

Data sources for making a corpus:

Here are some of the written-language sources used to build a corpus:

  • Popular articles (from a newspaper or magazine), which discuss a topic on a periodic basis.
  • Scientific articles (journals, teaching materials, papers), which cover a variety of topics in text form.
  • Poems: a collection of poetry from several authors, where each collection is treated as one text.

Text from spoken language:

Text can also be derived from spoken language: a speech or conversation is recorded and then transcribed into written form. Such talks can run for a fairly long time, and recording and transcribing a long conversation carries a significant financial cost. Long talks can be divided into shorter pieces.

A talk, or a piece of one, can be used in a corpus if it meets two conditions. First, the conversation is begun and ended by the participants themselves. Second, the conversation has a clear opening and closing. Examples of speech that can be converted into text: an informal face-to-face conversation between several people, a telephone conversation, a class discussion led by a lecturer, a conversation in a meeting, an interview, and a debate.

What is corpus linguistics, and what software does it use:

Corpus linguistics is the study of language, and a method of language analysis, based on corpora. Corpora are very useful in language teaching and research. Areas that rely on corpora include lexicography (the compilation of dictionaries), grammar, sociolinguistics, translation, language teaching, stylistics, dialectology and historical linguistics. To get the most out of a corpus, there are several steps to carry out, and they take a little time to master.

Some well-known corpus software:

You will need some tools, many of which can be downloaded for free on the internet, such as the concordance tool AntConc, SCP, a parallel concordancer, Vocabulary Profiler or RANGE. After downloading the desired corpus and installing the required software, you can analyze the text as needed.

A concordance tool can analyze not only English but also other languages such as Arabic. One example is an analysis by part of speech, such as finding which prepositions follow the word "symbol" in an English corpus, or a given word in an Arabic corpus. The tool can also present quantitative results, such as how many times a certain word is used in the chosen corpus.
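To make the idea of a concordance concrete, here is a minimal sketch in Python of a keyword-in-context search and a count of the words that follow a keyword. It is not how AntConc or any other tool is implemented; the function names (tokenize, concordance, following_words) and the sample sentence are assumptions made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return one keyword-in-context line per hit, with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def following_words(tokens, keyword):
    """Count the words that appear immediately after the keyword."""
    return Counter(
        tokens[i + 1]
        for i, tok in enumerate(tokens[:-1])
        if tok == keyword
    )

# Hypothetical two-sentence sample standing in for a real corpus.
sample = "The dove is a symbol of peace. A flag is a symbol for a nation."
tokens = tokenize(sample)
hits = concordance(tokens, "symbol")
for line in hits:
    print(line)
print(following_words(tokens, "symbol"))
```

In this small sample the words after "symbol" are "of" and "for", which is the kind of preposition pattern a concordance tool lets you inspect across a much larger corpus.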

Five Stages in Building a Corpus

A corpus is built with a systematic structure, and text documents are collected up to the planned corpus size. Before building a corpus, note that the stages should be planned in accordance with the purpose of the corpus, the size of the text and the project cost.

Some linguistic points also need to be considered in building a corpus, for example: the size of the text to be sampled, the range of language varieties to cover (synchronic), and the time period of the texts (diachronic).

So there are 5 stages:

  • Planning and designing the corpus.
  • Selection of data sources.
  • Permission from the owner of the data.
  • Data collection and encoding.
  • Handling the corpus.


Planning and design of the corpus

The corpus is built by starting with a plan. It is designed together with experts who consider how the corpus will be used. For example, for a synchronic corpus of general language, you should consult a sociolinguist; if the sampling strategy must account for stylistic variation, you need to consult a statistician. Hardware and software are also considerations in the design of the corpus. Planning is therefore an important task.

  Selection of Data Sources

Data sources are selected through a systematic analysis of the population. The Web contains text data in many different languages, so it can be used as a data source when developing a corpus. Search engines can be used to find texts that match the corpus plan.

  Permission for Use of Data from the Owner

Text collected for the corpus needs permission for use from the owner of the data. Only in this way can you use the text legally in a corpus. The corpus should be used wisely and within the agreed legal conditions, even when it is used only by researchers at a university.

 Data collection and encoding

Data collection takes considerable time, and how much depends on the volume of text to be collected. Printed material can be scanned and then converted into text form. Sometimes, if we are lucky, the text is already available in electronic form.

The collected text is then marked up to indicate its structural parts and characteristics, which helps preserve the authenticity of the text. SGML is used as the markup language in the corpus; SGML tagging enables the text to be recognized by a computer or other machine.
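As a rough illustration of this kind of encoding, the following Python sketch wraps a plain-text document in SGML-style tags. The tag and attribute names (text, p, id, source, n) are assumptions chosen for the example and do not follow any particular corpus encoding standard; a real project would use the tag set defined in its own design.

```python
from html import escape

def encode_document(doc_id, source, paragraphs):
    """Wrap a document in SGML-style markup (illustrative tag names only)."""
    # The opening tag records where the text came from.
    lines = [f'<text id="{escape(doc_id)}" source="{escape(source)}">']
    for n, para in enumerate(paragraphs, start=1):
        # Escape &, < and > so the original characters cannot be
        # mistaken for markup when the file is read back.
        lines.append(f'<p n="{n}">{escape(para, quote=False)}</p>')
    lines.append("</text>")
    return "\n".join(lines)

# Hypothetical document used only for demonstration.
marked = encode_document(
    "news001",
    "newspaper",
    ["Prices rose by 5% & wages fell.", "Readers asked: is x < y?"],
)
print(marked)
```

Keeping the markup separate from the words themselves is what lets software later strip or read the tags without altering the original text.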

Handling the corpus

An existing corpus can never meet every need for language data, so one issue to consider is the addition of data to the corpus. Adding a single text document can add hundreds or even thousands of words, so the statistics of the corpus change.

If data is added dynamically as the corpus grows, software is needed to speed up the provision of data. Basic processing tools include concordancers, which retrieve the data, and word frequency lists.
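A word frequency list, and the way it shifts when a document is added, can be sketched in a few lines of Python. The three sentences below are invented stand-ins for real corpus documents, and word_frequencies is a hypothetical helper name.

```python
import re
from collections import Counter

def word_frequencies(documents):
    """Count every word token across a list of document strings."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"\w+", doc.lower()))
    return counts

# Invented sample documents for illustration.
corpus = [
    "The corpus is a collection of texts.",
    "A corpus is built with a plan.",
]
before = word_frequencies(corpus)

# Adding one document changes both the totals and the individual counts.
corpus.append("The plan changes when the corpus grows.")
after = word_frequencies(corpus)

print("tokens before:", sum(before.values()), "after:", sum(after.values()))
print("'corpus' before:", before["corpus"], "after:", after["corpus"])
```

Because every addition changes the counts, frequency lists are usually regenerated by software after each update rather than maintained by hand.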

Advanced processing tools are used for lemmatization, part-of-speech tagging, parsing, word pairing, disambiguation, and linking to a lexical database.
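Of these, word pairing is the easiest to sketch without extra resources. One simple reading of it, assumed here for illustration, is counting adjacent word pairs (bigrams); lemmatization, tagging and parsing need dictionaries or trained models and are not attempted. The function name and sample sentence are hypothetical.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count each pair of adjacent word tokens in the text."""
    tokens = re.findall(r"\w+", text.lower())
    # zip pairs every token with the one that follows it.
    return Counter(zip(tokens, tokens[1:]))

# Invented sentence used only for demonstration.
pairs = bigram_counts(
    "Strong tea and strong coffee, but never strong tea with milk."
)
print(pairs.most_common(1))
```

Pairs that recur far more often than chance would predict are candidates for collocations, though a real analysis would apply a statistical measure rather than raw counts.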

Development of the corpus

If the aim is to obtain an unbiased corpus, the corpus should be evaluated and adjusted. At first, it is built to represent a population. The corpus is then used and analyzed to find its strengths and weaknesses. Information from experts and feedback from the analysis can be used to improve the corpus. Its quality can be improved by continuously adding or removing material.

Determination of Population and Sampling

We should follow the rules of statistical sampling theory, but applying that theory to build a corpus as a source of language data is a challenge. The problems arise when defining the population to sample from. Setting the boundaries of the population is very difficult, whereas statistics textbooks always assume a clearly defined population. There is also no obvious unit of language from which samples can be drawn.

Sample bias can occur. Inaccuracy in determining the sample is a problem that cannot be tolerated. Researchers should therefore ask how many samples have been obtained and whether that is enough to give confidence in the results.

Two kinds of criteria are used to determine the population: internal and external. Internal criteria are benchmarks based on the fundamental properties of the language itself, for example classifying texts by grammar or word form, while external criteria are benchmarks that are not based on those basic properties.
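One common way to reduce sample bias is stratified sampling: divide the population into groups and draw from each group separately. The Python sketch below assumes, purely for illustration, that genre is used as an external criterion and that each text is identified by a short id; the function name, genres and ids are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(texts, per_stratum, seed=0):
    """Draw up to `per_stratum` text ids at random from each genre.

    `texts` is a list of (genre, text_id) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    strata = defaultdict(list)
    for genre, text_id in texts:
        strata[genre].append(text_id)
    sample = {}
    for genre, ids in strata.items():
        # A small stratum contributes everything it has.
        k = min(per_stratum, len(ids))
        sample[genre] = rng.sample(ids, k)
    return sample

# Hypothetical population: 5 news texts, 3 science texts, 1 poem.
population = (
    [("news", f"n{i}") for i in range(1, 6)]
    + [("science", f"s{i}") for i in range(1, 4)]
    + [("poetry", "p1")]
)
sample = stratified_sample(population, per_stratum=2)
for genre, ids in sample.items():
    print(genre, ids)
```

Sampling each genre separately keeps a large group such as news from crowding out a small one such as poetry, although it does not by itself settle how the population boundaries should be drawn.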

