N-gram

N-gram. In artificial intelligence, natural language processing, bioinformatics, and information retrieval, an n-gram is a subsequence of n consecutive elements of a given sequence. For n = 2 they are called bigrams and for n = 3 trigrams; for n >= 4 they are generically called n-grams. A probabilistic model based on n-grams is a Markov model of order n-1.

N-grams are very useful in text processing, in determining the language of documents, in the statistical analysis of texts, and in the discovery or prediction of genes, since genes are subsequences within the genetic material.

Contents

  • 1 Definition
  • 2 Examples
  • 3 Historical background
  • 4 Application
    • 4.1 Estimation of the language of documents
  • 5 Sources

Definition

Let S = s_1 s_2 s_3 … s_k … be a sequence of ordered elements. Any subsequence A = s_{i+1} s_{i+2} … s_{i+n} is called an n-gram, where i is a value between 0 and |S| - n, which guarantees that the length of A is always n, that is, |A| = n, with n > 1.

The particular cases n = 2 and n = 3 are called bigrams and trigrams respectively.

Note that the definition describes a single n-gram, but in practice it is more useful to think of the set of all the n-grams that make up the given sequence.
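The definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `ngrams` is hypothetical, and the sketch also accepts n = 1 even though the definition asks for n > 1.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def ngrams(seq: Sequence[T], n: int) -> List[Tuple[T, ...]]:
    """Return every n-gram of seq as a tuple, in order of appearance."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # i runs from 0 to |S| - n, so every slice has exactly n elements.
    # If the sequence is shorter than n, the range is empty and so is the result.
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


# Works on any ordered sequence: characters of a string, a list of words, bases of a gene...
print(ngrams("abcde", 3))
print(ngrams(["A", "C", "G", "T"], 2))
```

A sequence of length |S| yields |S| - n + 1 n-grams, which is why the index i stops at |S| - n.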

Examples

Given the text “Platero y yo”, if the characters that compose it are taken as elements, its trigrams are: “Pla”, “lat”, “ate”, “ter”, “ero”, “ro ”, “o y”, “ y ”, “y y” and finally “ yo”.

In the case of “The horizon is the limit of what we can see”, if the words of the text are taken as elements, its bigrams are: “The horizon”, “horizon is”, “is the”, “the limit”, “limit of”, “of what”, “what we”, “we can”, “can see”.

Note: when words are taken as elements, punctuation marks and the blank spaces between words are usually removed.
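As a rough sketch of the word-level case, the snippet below strips punctuation and spaces before forming the n-grams. The function name `word_ngrams` and the use of the regular expression `\w+` as a tokenizer are assumptions made for illustration; real tokenizers are usually more careful.

```python
import re
from typing import List, Tuple


def word_ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Split text into words, discarding punctuation and spaces, then form n-grams."""
    # \w+ matches runs of letters, digits and underscores; everything else is dropped
    words = re.findall(r"\w+", text)
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = word_ngrams("The horizon is the limit, of what we can see.", 2)
print([" ".join(b) for b in bigrams])
```

The comma and the final period in the sample sentence do not appear in any bigram, because they are discarded before the bigrams are formed.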

Historical background

According to Claude E. Shannon's mathematical theory of information, in a language L derived from an alphabet A, each linguistic sequence w = s_1 s_2 … s_{n-1}, where each s_i is a symbol of the alphabet, determines for each symbol a_j of A a probability P(a_j) of being the next element s_n of w, so that the P(a_j) sum to 1 over the alphabet, where the probabilities P(a_j) are estimated from accumulated historical frequencies. In short, the preceding symbols in a sentence determine which symbols of the alphabet are most likely to appear next. This idea is the basis of n-gram models.

Elements of the same idea also appear in Markov chains, which reached a high level of mathematical and probabilistic formalization and have found wide application in many fields to this day.
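The idea that the preceding symbols fix the probability of the next one can be illustrated by counting. The sketch below is only one simple way to do it, assuming that "historical accumulation" means relative frequencies observed in a sample text; the function name `next_symbol_probabilities` is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict


def next_symbol_probabilities(text: str, n: int) -> Dict[str, Dict[str, float]]:
    """Map each context of n-1 symbols to the estimated P(next symbol | context)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        # Split each n-gram into its first n-1 symbols and its last symbol
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        counts[context][nxt] += 1

    probs: Dict[str, Dict[str, float]] = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        # Relative frequency: the values for one context always add up to 1
        probs[context] = {sym: c / total for sym, c in counter.items()}
    return probs


model = next_symbol_probabilities("abababac", 2)
print(model["a"])  # after "a", "b" was seen 3 times and "c" once
print(model["b"])  # after "b", only "a" was seen
```

With n = 2 the context is a single symbol, which corresponds to a Markov model of order 1.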

Application

N-grams have a wide range of applications: statistical processing of natural language (written, visual, or spoken), pattern analysis, image analysis, and gene prediction and protein formation in bioinformatics.

Estimation of the language of documents.

One of the most interesting uses is measuring the similarity between a document and a group of documents. This is of great importance for estimating the predominant language of a text.

Suppose that we have two sets of trigrams I_1 and I_2, associated with English and Spanish respectively, obtained by extracting the trigrams from large volumes of text in each language, so that I_1 and I_2 are representative. Suppose also that we have the set of trigrams T of a document that could be written in English or Spanish. The predominant language can be estimated with a similarity function f(T, I), where T and I are sets of trigrams.

The values f(T, I_1) = t_1 and f(T, I_2) = t_2 are calculated. If t_1 > t_2, the document associated with T is taken to be written in English; if t_1 < t_2, in Spanish. If t_1 = t_2, the language cannot be decided.

More generally, given the trigram sets I_1, I_2, …, I_n associated with languages 1, 2, …, n and the similarity values f(T, I_1), f(T, I_2), …, f(T, I_n), the language of the text associated with T is the one with the highest value f(T, I_i). If more than one language reaches the highest value, the language cannot be decided.
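The decision rule can be sketched as follows. The similarity measure used here, the share of the document's trigrams that also occur in the language's set, is an assumption chosen for illustration and is not necessarily the function f intended by the text. The names `similarity`, `guess_language` and the toy profiles are likewise hypothetical.

```python
from typing import Dict, Optional, Set


def trigrams(text: str) -> Set[str]:
    """Set of character trigrams of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def similarity(t: Set[str], profile: Set[str]) -> float:
    """Assumed measure: fraction of the document's trigrams found in the profile."""
    return len(t & profile) / len(t) if t else 0.0


def guess_language(t: Set[str], profiles: Dict[str, Set[str]]) -> Optional[str]:
    """Return the language with the highest similarity, or None when undecided."""
    if not profiles:
        return None
    scores = {lang: similarity(t, p) for lang, p in profiles.items()}
    best = max(scores.values())
    winners = [lang for lang, s in scores.items() if s == best]
    # More than one language with the highest value: cannot decide
    return winners[0] if len(winners) == 1 else None


# Toy profiles; representative ones would come from large volumes of text
profiles = {
    "english": trigrams("the horizon is the limit of what we can see"),
    "spanish": trigrams("el horizonte es el limite de lo que podemos ver"),
}
print(guess_language(trigrams("what can we see"), profiles))
print(guess_language(trigrams("lo que es el limite"), profiles))
```

Returning `None` on a tie mirrors the rule that the language cannot be decided when several similarity values share the maximum.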

 

by Abdullah Sam