**Compiler theory**. In man-machine communication there is a real difficulty: __computers__ operate on __bits__ (zeros and ones) and registers, while people understand each other through natural languages, or through a scientific notation determined by the problem area in which they work, among other means. __Programming languages__ are the vehicle of communication between man and __computer__. A __programming language__ is a standardized communication technique that allows the instructions to be executed on a __computer__ to be expressed; it consists of a set of syntactic and semantic rules that define a computer program. A source program written in a __programming language__ needs to go through a conversion process into the machine language that the __computer__ understands. This process of converting the source program into machine code is called the compilation process, or simply compilation, and the program capable of carrying it out is called a __compiler__ or translator.

Summary

- 1 Compiler theory
  - 1.1 Basic concepts
  - 1.2 Compilers and translators
  - 1.3 Structure of a compiler
- 2 Lexical analysis (scanner)
- 3 Operations on the symbol table
- 4 Syntactic analysis (parsing)
- 5 Semantic analysis
  - 5.1 Error handling and reporting
  - 5.2 Generation of intermediate code
  - 5.3 Optimization of the intermediate code
  - 5.4 Generation of the object code
- 6 Representation of languages
  - 6.1 Basic definitions
    - 6.1.1 Formal definition of a string
    - 6.1.2 String concatenation
    - 6.1.3 Reverse of a string
    - 6.1.4 Prefixes and suffixes of a string
    - 6.1.5 Length of a string
    - 6.1.6 Language
    - 6.1.7 Closure of an alphabet
  - 6.2 Operations on languages
- 7 Concept of grammar
  - 7.1 Definitions related to the concept of grammars
- 8 Chomsky classification
- 9 Regular expressions
- 10 Recognizing schemes of formal languages. Finite automata
  - 10.1 Recognizing schemes
  - 10.2 Finite automata
    - 10.2.1 Definition 1
    - 10.2.2 Deterministic finite automata (AFD)
    - 10.2.3 Non-deterministic finite automata (AFND)
  - 10.3 Construction of finite automata from regular expressions
- 11 Lexical analysis
  - 11.1 How a lexical analyzer works
  - 11.2 Specification of a lexical analyzer
    - 11.2.1 Preliminary concepts: tokens, patterns and lexemes
- 12 See also
- 13 Sources

Compiler theory

Basic concepts

Let’s look informally at some basic concepts.

- Language: set of __symbols__ and rules that allow __communication__, that is, the transmission of an idea or a __message__.
- Lexicon (vocabulary): set of words that form part of a specific language.
- __Grammar__: groups the __elements__ of form, structure and meaning that allow one to express oneself in a specific language.
- __Syntax__: set of rules necessary to construct correct __sentences__ in a language.
- __Semantics__: meaning of the phrases generated by the __syntax__ and the __lexicon__.
- __Programming language__: set of symbols and rules that allow communication with a computer.
- High-__level__ language: language that allows communication with a computer through conventional symbols close to a natural language.
- Low-level language: similar to machine language, with small __mnemonic__ modifications that facilitate its use, for example __assembly__ language.
- Machine language: combination of __binary__ digits with which a computer works.

Compilers and translators

A __translator__ is a program that takes as input a program written in one programming language (the source language) and produces as output a program in another language (the object language). The translator itself is written in a language called the implementation language.

When the source language is a high-level language (__Pascal__, C++, etc.) and the object language is a low-level or machine language, the translator is called a compiler.

The implementation language can generally be any programming language, although there are languages explicitly designed to write compilers (FSL, CDL, etc.). The fundamental criterion followed to choose an implementation language is: “It should minimize the implementation effort and maximize the quality of the compiler.” Translators are generally represented by a T in which the languages involved in the process are included.

- An __interpreter__, for its part, is a program that takes the __source code__, analyzes it and, unlike a compiler, executes it directly, without generating an object program.

Structure of a compiler.

The job of a compiler is to take the source string of the program, determine if it is __syntactically__ valid, and, at the same time, generate an equivalent program in a language that the computer understands. Compiler work can be divided into different parts:

For a particular compiler, the order of the processes can vary, and in many cases several processes can be combined into a single phase. In real translators the process does not proceed in the linear form in which it appears in the __scheme__ proposed in the figure; normally the center of the compilation process revolves around the __syntactic analyzer__, which is in charge of activating each of the remaining phases as necessary. The general __modules__ are:

- Lexical analysis (__scanner__)
- Syntactic analysis (__parser__)
- Semantic analysis
- Generation of intermediate code
- Optimization of the intermediate code
- Generation of __executable__ code
- Symbol table management
- __Error handling__

Lexical analysis (scanner).

The input to a compiler is the code of a program written in a programming language. This code is nothing more than a sequence of symbols belonging to the alphabet of the programming language. The lexical analyzer or scanner is in charge of taking these symbols and grouping them into simple or elementary syntactic entities called __tokens__ (whose concrete spellings in the source text are called __lexemes__). The categories of __tokens__ can vary from one language to another, but in general the following are distinguished:

- reserved words
- identifiers
- numeric and literal constants
- __operators__

Each token is assigned a __lexical__ structure consisting of a pair of the form <token type, info>. The first component is a syntactic category such as "constant", "identifier" or "operator", and the second provides __information__ related to the particular token (the value of the constant, the index of the symbol in the symbol table, etc.). We can therefore say that the scanner is a translator whose input is a string of symbols (the source program) and whose output is a sequence of lexical structures or tokens.
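The grouping of characters into <token type, info> pairs can be sketched in a few lines of Python. The token categories and regular expressions below are simplified illustrations, not a specification from this article:

```python
import re

# Illustrative token categories; a real language would define many more.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),           # numeric constants
    ("IDENTIFIER", r"[A-Za-z_]\w*"),  # identifiers and reserved words
    ("OPERATOR",   r"[+\-*/=]"),      # a few operators
    ("SKIP",       r"\s+"),           # whitespace is discarded
]
PATTERN = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    """Group the input characters into <token type, info> pairs."""
    tokens = []
    for match in PATTERN.finditer(source):
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(scan("a = b + 7"))
# [('IDENTIFIER', 'a'), ('OPERATOR', '='), ('IDENTIFIER', 'b'),
#  ('OPERATOR', '+'), ('NUMBER', '7')]
```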

Operations on the __symbol table__ .

A fundamental task in a compiler is to store the __identifiers__ used in a program together with their main attributes, so that at any moment it is possible to know an identifier's type, scope, etc., and, in the case of procedures, the number and types of its parameters. This __information__ is generally stored in a structure known as a symbol table, which has an entry for each identifier and its attributes. The __tokens__ that represent constants or identifiers are inserted into the symbol table as they appear.

Example:

| # | Lexeme | Attributes |
|---|--------|------------------|
| 1 | a | real variable |
| 2 | b | real variable |
| 3 | c | real variable |
| 4 | 7 | integer constant |
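The symbol table above can be modeled as a dictionary keyed by the lexeme; the attribute names "kind" and "type" are illustrative assumptions of this sketch, not part of the article:

```python
# A sketch of a symbol table as a dictionary keyed by identifier name.
symbol_table = {}

def insert(name, kind, type_):
    symbol_table[name] = {"index": len(symbol_table) + 1,
                          "kind": kind, "type": type_}

for ident in ("a", "b", "c"):
    insert(ident, "variable", "real")
insert("7", "constant", "integer")

print(symbol_table["b"])  # {'index': 2, 'kind': 'variable', 'type': 'real'}
```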

Syntactic analysis (parsing)

Syntactic analysis is a process in which the sequence of tokens is examined to determine whether the order of that sequence is correct according to certain structural conventions (rules) of the syntactic definition of the language. The input to the __parser__, or syntactic analyzer, is the sequence of tokens generated by the scanner. The parser analyzes only the first component of each token; the second component is used in later phases.

Example: Suppose there are rules that define an expression in the following way:

exp -> exp + exp
exp -> exp * exp
exp -> exp / exp
exp -> id
exp -> const

What result would be obtained from parsing the __expression__ A + * B?

A + * B --scanner--> id1 + * id2 --parser--> Error: the syntactic specification of the language is not met.

What result would be obtained from parsing the expression A + B * 7?

A + B * 7 --scanner--> id1 + id2 * 7 --parser--> Yes, the syntactic specification of the language is met.
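The two verdicts above can be reproduced by a deliberately tiny check: a well-formed expression alternates operands and binary operators. This sketch ignores precedence and parentheses and is only meant to mirror the accept/reject behavior of the two examples:

```python
def is_valid_expression(kinds):
    """True if the token kinds alternate operand, operator, operand, ..."""
    expect_operand = True
    for kind in kinds:
        if expect_operand and kind not in ("id", "const"):
            return False
        if not expect_operand and kind not in ("+", "-", "*", "/"):
            return False
        expect_operand = not expect_operand
    return not expect_operand  # a valid expression must end on an operand

print(is_valid_expression(["id", "+", "*", "id"]))           # False (A + * B)
print(is_valid_expression(["id", "+", "id", "*", "const"]))  # True  (A + B * 7)
```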

It is necessary to know the syntactic structure of a sequence of __tokens__. For example, in an expression A = B + C * 7 in which the variables A, B and C represent real values, the syntactic structure must reflect the fact that C and 7 must be multiplied before the sum, and that after all the operations the __assignment__ is made. To do this, the tokens are grouped into a __tree-like__ structure, known as a syntactic tree.

Example: Let’s look at the syntactic tree for the sequence id1 = id2 + id3 * 7. This sequence leads to the following __operations__ being carried out:

1. id3 is multiplied by 7
2. the result of 1 is added to id2
3. the result of 2 is stored in id1

In the previous syntactic tree, the direct __descendants__ of each __node__ represent the actions to be performed and the values for which the actions must be performed.
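The syntactic tree for id1 = id2 + id3 * 7 can be written as nested tuples, where the nesting itself encodes the order of the three operations listed above; walking the tree in post-order visits operands and operators in exactly the order in which the actions must be performed. This is a sketch with the tree hand-built rather than produced by a parser:

```python
# The tree for id1 = id2 + id3 * 7; deeper nodes are evaluated first.
tree = ("=", "id1", ("+", "id2", ("*", "id3", "7")))

def postorder(node, out):
    """Visit children before their operator, mirroring the evaluation order."""
    if isinstance(node, tuple):
        op, left, right = node
        postorder(left, out)
        postorder(right, out)
        out.append(op)
    else:
        out.append(node)
    return out

print(postorder(tree, []))
# ['id1', 'id2', 'id3', '7', '*', '+', '=']
```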

Semantic analysis.

In the __semantic analysis__, __errors__ related to the validity of the program are detected. These errors are of a syntactic-semantic type, but they cannot be detected by the parser, since they involve interdependencies between different parts of a program that are not reflected in a grammatical analysis. The semantic analyzer receives the information resulting from the syntactic analysis, which can be a hierarchical tree with the information related to the organization of the tokens in the instruction being analyzed. Examples of errors that can be detected during semantic analysis are incompatibilities between the declaration of an identifier and its use (type checking), disagreement between the definition of a function and its activation or call, etc.

Example: Let’s go back to the previous example: id1 = id2 + id3 * 7. During the semantic analysis of a strongly typed language, the syntactic tree must be modified to reflect the analysis of the data types. Note that when type checking is done, the constant 7, being an integer constant, must be converted to real before it can be operated on; the other __variables__ present no difficulty, since they are all real.
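The tree modification just described can be sketched as a recursive type check that wraps the integer constant in a conversion node (named EntToReal here, following the notation used later in this article; the tuple representation is an illustrative assumption):

```python
def check(node, types):
    """Return (possibly rewritten node, type). The integer constant is wrapped
    in an EntToReal node so it can be operated with the real variables."""
    if isinstance(node, int):                  # an integer constant
        return ("EntToReal", node), "real"
    if isinstance(node, str):                  # an identifier
        return node, types[node]
    op, left, right = node
    left, lt = check(left, types)
    right, rt = check(right, types)
    result_type = "real" if "real" in (lt, rt) else "integer"
    return (op, left, right), result_type

types = {"id1": "real", "id2": "real", "id3": "real"}
tree, t = check(("+", "id2", ("*", "id3", 7)), types)
print(tree)  # ('+', 'id2', ('*', 'id3', ('EntToReal', 7)))
print(t)     # real
```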

Error handling and reporting.

If compilers had to process only correct programs, their design and implementation would be greatly simplified. But programmers frequently write incorrect programs, and a good compiler should help the programmer locate and identify errors. Errors in a program can be classified into four large groups:

- lexical (a misspelled number, an illegal symbol, etc.)
- syntactic (an arithmetic expression with unbalanced parentheses)
- semantic (applying an operator to an incompatible operand)
- logic or programming errors (an infinite loop)

The lexical analyzer detects errors when the remaining characters of the input do not form a valid token of the language. Errors in which the sequence of tokens violates the structural rules of the language are detected during __syntactic analysis__. During semantic analysis the compiler tries to detect constructions whose syntax is correct but which are semantically incorrect according to the operations involved, for example an attempt to add two __identifiers__ where one is the name of an __array__ and the other the name of a procedure. The syntactic and semantic analysis phases usually handle most of the errors detectable by the compiler. As we can see, errors can be found in each phase of the compilation process. However, after an error is detected, the phase should handle it so that the compiler can continue, thus allowing later errors to be detected; a compiler that stops at the first error is not very useful. The handling of errors during the compilation process (in any of its phases) must therefore, at a minimum, report each error clearly and allow compilation to continue.

Generation of intermediate code.

After syntactic and semantic analysis, many compilers generate an explicit intermediate representation of the source code. This representation can be seen as a program for an abstract machine. Examples: MSIL, Java bytecode, quadruple notation, Polish notation, etc. The intermediate code must have two very important characteristics: it must be easy to produce and easy to translate into the object program. The tree created by the semantic analyzer is used to generate the translation of the source program and thus realize the actions associated with the current statement or statements. The translation, as mentioned, can be to machine language or to an intermediate language such as assembly language. An elegant and effective method to generate the intermediate code from the syntactic tree is so-called direct syntactic translation: each node n is associated with an intermediate code C(n), and the code of node n is built by concatenating the code of its descendants. The translation is therefore done from the bottom up.
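Direct syntactic translation can be sketched as a post-order walk that emits one three-address instruction per interior node; the textual instruction format and temporary names are illustrative choices of this sketch:

```python
counter = 0  # for generating fresh temporary names

def gen(node, code):
    """Emit code for the children first, then one instruction for this node;
    return the name holding the node's value (bottom-up translation)."""
    global counter
    if not isinstance(node, tuple):
        return node                      # an identifier or constant
    op, left, right = node
    a = gen(left, code)
    b = gen(right, code)
    counter += 1
    temp = f"temp{counter}"
    code.append(f"{temp} = {a} {op} {b}")
    return temp

code = []
result = gen(("+", "id2", ("*", "id3", "7.0")), code)
code.append(f"id1 = {result}")
print(code)
# ['temp1 = id3 * 7.0', 'temp2 = id2 + temp1', 'id1 = temp2']
```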

Optimization of the intermediate code.

The optimization phase is carried out with the aim of improving the efficiency of the intermediate code. Some optimizations are trivial. For example, code similar to that generated in the previous section can be represented in only two instructions:

temp1 = id3 * 7.0
id1 = id2 + temp1

The compiler can deduce that the EntToReal(7) operation can be performed once, during compilation, so that instruction can be eliminated by substituting 7.0 directly. Likewise, temp3 is used only once, to transmit its value to id1, so there is no objection to substituting id1 for temp3.
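The two improvements just described (folding the EntToReal conversion at compile time and removing the copy through the last temporary) can be sketched as a small pass over the intermediate instructions; the textual instruction format is an assumption of this sketch:

```python
code = [
    "temp1 = EntToReal(7)",
    "temp2 = id3 * temp1",
    "temp3 = id2 + temp2",
    "id1 = temp3",
]

def optimize(code):
    out, subst = [], {}
    for line in code:
        dest, expr = [part.strip() for part in line.split("=", 1)]
        for name, value in subst.items():      # apply pending substitutions
            expr = expr.replace(name, value)
        if expr.startswith("EntToReal("):      # fold the conversion: 7 -> 7.0
            subst[dest] = expr[len("EntToReal("):-1] + ".0"
            continue
        out.append((dest, expr))
    # collapse a trailing copy "id1 = tempN" into the instruction defining tempN
    if len(out) >= 2 and out[-1][1] == out[-2][0]:
        out[-2] = (out[-1][0], out[-2][1])
        out.pop()
    return [f"{dest} = {expr}" for dest, expr in out]

print(optimize(code))  # ['temp2 = id3 * 7.0', 'id1 = id2 + temp2']
```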

Generation of the object code.

The final phase of a compiler is the generation of the object code, consisting of machine code or assembly code. The first thing is to select memory locations for each variable used in the program. Intermediate instructions are translated into machine code instruction sequences that perform the same operation. A crucial aspect is the assignment of variables to registers.

Example: the two instructions obtained after optimization are translated into assembly instructions:

MOVF id3, R2
MULF #7.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1

In each instruction, the two operands determine the origin and the destination respectively. The suffix F of each instruction indicates that it works with floating-point (real) numbers. This code moves the content of address id3 to register 2 and then multiplies it by the number 7.0; the symbol # indicates that 7.0 should be treated as a constant. The third instruction moves the content of id2 to register 1, and the next adds to it the previously calculated value stored in register 2. Finally, the value computed in register 1 is moved to the address determined by id1.

Language representation.

In general, there are two different schemes to define a language, known as the generator scheme and the recognizer scheme. A generator scheme is a mechanism that allows us to "generate" the different sentences of the language. A recognizer scheme is a mechanism that allows us to recognize or verify whether or not a certain sentence belongs to a language. If a language has a finite number of strings, it can be defined simply by listing them, but how do we define languages with an infinite number of strings? To define a language we can use a grammar (a generator scheme) or a finite automaton (a recognizer scheme).

Basic Definitions.

Before addressing the concept of language, we will look at a set of related basic definitions.

- Alphabet (Σ): an arbitrary, finite and nonempty set of symbols. Among the most common alphabets we can mention the binary alphabet {0, 1}, the __Latin alphabet__ {a, b, c, d, e, f, g, …}, etc.
- String: a sequence of symbols of a certain alphabet placed one after the other. For example, 11001 is a string over the alphabet {0, 1}, and abcd is a string over the alphabet of Latin letters.

- Empty String: An empty string is nothing more than a string with no elements, the empty string is denoted by e.

Notation: in what follows we will use the following convention to denote strings: a, b, c, d, … represent symbols; x, y, u, v, w, … represent strings; a^n = aa…a (n times); a^0 = e.

Formal Definition of String:

A string on a certain alphabet Σ is defined as follows:

- e (empty string) is a string over Σ.
- If x is a string over Σ, and a ∈ Σ then xa is a string over Σ, for all a ∈ Σ.
- Any other string over Σ can be obtained by applying the two previous rules.

String Concatenation:

If x and y are strings over Σ, then xy is a string over Σ, which is called the concatenation of x and y.

Reverse of a String:

The reverse of a string x is denoted x^R; it is the string obtained by writing the symbols of x in reverse order.

Prefixes and Suffixes of a String:

Let x, y, z be strings over Σ. Then x is a prefix of the string xy and y is a suffix of the string xy. Also, y is a substring of the string xyz. If x ≠ y and x is a prefix of y, then x is said to be a proper prefix. The concept of a proper suffix is defined similarly.

Length of a String:

The length of a string is the number of symbols that make it up. Thus we say that the length of the string abc is 3, which is denoted |abc| = 3. Obviously, |e| = 0.

Language:

A language over Σ is a set of strings over that alphabet that obeys a certain formation rule.

Closure of an Alphabet:

The set Σ* denotes the set of all strings over Σ, including the empty string; this set is called the closure of Σ. For example, if Σ = {0, 1} then Σ* = {e, 0, 1, 00, 01, 10, 11, 001, …}. It is evident that if L is a language over Σ, then L ⊆ Σ*. The set Σ+ is called the positive closure of Σ and is defined as Σ+ = Σ* – {e}.
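Since Σ* is infinite, a program can only enumerate it up to a bounded string length. A short sketch:

```python
from itertools import product

def closure_up_to(alphabet, max_len):
    """Enumerate the strings of the closure of an alphabet up to max_len."""
    strings = [""]                                   # e, the empty string
    for n in range(1, max_len + 1):
        strings += ["".join(p) for p in product(alphabet, repeat=n)]
    return strings

print(closure_up_to("01", 2))
# ['', '0', '1', '00', '01', '10', '11']
```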

Operations on Languages:

Some of the previous operations on strings can be generalized to languages. The concatenation or product of languages is defined as follows: let L1 be a language over Σ1 and L2 a language over Σ2; then the language L1L2, called the concatenation or product of L1 and L2, is defined as L1L2 = {xy | x ∈ L1, y ∈ L2}. Analogously, the closure of a language L is defined as:

L^0 = {e}
L^n = L L^(n-1) for n ≥ 1
L* = ∪ L^n for n ≥ 0

Example: give a string that belongs to the language L = {0^n 1^n | n ≥ 1}. Correct answers: 01, 0011, 000111, 00001111, … Incorrect answers: 0, 1, 001, 10, 1100, 011, …
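The product and the closure of finite languages can be sketched with Python sets; the closure is computed only up to a bounded power, since L* is generally infinite:

```python
def concat(L1, L2):
    """L1L2 = {xy | x in L1, y in L2}."""
    return {x + y for x in L1 for y in L2}

def closure(L, max_power):
    """Union of L^0, L^1, ..., L^max_power, with L^0 = {e}, L^n = L L^(n-1)."""
    result, power = {""}, {""}
    for _ in range(max_power):
        power = concat(L, power)
        result |= power
    return result

print(concat({"a", "ab"}, {"c"}))   # the set {'ac', 'abc'}
print(sorted(closure({"0"}, 2)))    # ['', '0', '00']
```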

Concept of grammar.

A grammar is a mathematical entity or model that allows a language to be specified; that is, it is the set of rules capable of generating all the combinatorial possibilities of that language, and only those of that language, whether it is a formal language or a natural language. The objectives of a grammar are:

- Define whether or not a sentence belongs to a language.
- Structurally describe the sentences of the language.

In the formal definition of a grammar, two disjoint sets of symbols are used:

- N: the set of non-terminal symbols (which allow us to represent combinations of symbols).
- Σ: the set of terminal symbols, which corresponds to the concept of alphabet studied previously.

The center of a grammar is its set of production rules (P), rules for the formation of strings, which are elements of the relation (N ∪ Σ)* N (N ∪ Σ)* → (N ∪ Σ)*.

Definition: a grammar is a quadruple G = {N, Σ, P, S} where:

- N: set of non-terminal symbols.
- Σ: set of terminal symbols.
- P: set of production rules.
- S: axiom or distinguished symbol (S ∈ N).

Definitions related to the concept of Grammars

Sentential Form: a sentential form in a grammar is defined recursively as follows. Let G = {N, Σ, P, S} be a grammar; then:

- S is a sentential form.
- If αβλ is a sentential form and β → δ ∈ P, then αδλ is a sentential form.

Sentence: a sentential form that contains only terminal symbols is called a sentence. Derivation: let G = {N, Σ, P, S} be a grammar. The relation ⇒ (direct derivation) is defined as follows: if αβλ ∈ (N ∪ Σ)+ and β → δ ∈ P, then αβλ ⇒ αδλ. A derivation of length k, denoted ⇒^k, means that there exists a sequence of strings α = α0, α1, …, αk = β such that α = α0 ⇒ α1 ⇒ … ⇒ αk = β. The relation ⇒+ (derives non-trivially) is defined as the positive closure of ⇒, equivalent to ⇒^k with k > 0. The relation ⇒* is defined as the reflexive-transitive closure of ⇒, equivalent to ⇒^k with k ≥ 0.
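A single direct derivation step can be sketched as string rewriting: find an occurrence of a production's left side in the sentential form and replace it with the right side. The arithmetic-expression grammar below is an illustrative choice, not one defined in this article:

```python
# Illustrative productions: E -> E+T | T, T -> T*F | F, F -> (E) | a
productions = {"E": ["E+T", "T"], "T": ["T*F", "F"], "F": ["(E)", "a"]}

def derive(form, nonterminal, alternative):
    """One direct derivation step: rewrite the leftmost occurrence."""
    i = form.index(nonterminal)
    return form[:i] + productions[nonterminal][alternative] + form[i + 1:]

form = "E"
form = derive(form, "E", 0)   # E => E+T
form = derive(form, "E", 1)   # E+T => T+T
form = derive(form, "T", 1)   # T+T => F+T
print(form)                   # 'F+T'
```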

Chomsky classification.

As we have seen, every grammar generates a single language, but different grammars can generate the same language. For example, the language generated in Example 2.3 is the same as the one generated by the grammar in the following example. Example: G0 = ({E, T, F}, {a, +, *, (, )}, P, E). We could think of classifying grammars by the language they generate, which motivates the following definition. Definition: two grammars are said to be weakly equivalent if they generate the same language. However, when attempting this classification we find that the problem of knowing whether two grammars generate the same language is undecidable: there is no algorithm that accepts two grammars as input and tells us (as output) whether or not they generate the same language. Thus we have to think of classifications based on the form of the grammar rather than on the nature of the language it generates. The following classification, known as the Chomsky hierarchy, follows this direction. Let G = (N, Σ, P, S) be a grammar:

- If each production in P is of the form A → xB or A → x, with A, B ∈ N and x ∈ Σ*, or of the form A → Bx or A → x, with A, B ∈ N and x ∈ Σ*, then the grammar G is called right regular or left regular respectively, or in general a regular grammar.
- If each production in P is of the form A → α, where A ∈ N and α ∈ (N ∪ Σ)*, then G is called a context-free grammar.
- If each production in P is of the form α → β, where |α| ≤ |β| and α contains at least one non-terminal, then G is called a context-sensitive (context-dependent) grammar.
- If a grammar G does not meet the above restrictions, it is called an unrestricted grammar.

Regular expressions.

So far we have seen one very powerful mechanism for the representation of languages: grammars. There is another mechanism which, although not as general as grammars, provides great facility for representing certain very simple languages. In particular, we will see later that this mechanism is very useful for representing the part of a language that is analyzable by a lexical analyzer, that is, reserved words, identifiers, etc. This new representation mechanism is known as regular expressions. A regular expression over a certain alphabet Σ can be defined in the following way:

- e is a regular expression over Σ; that is, the empty string is a regular expression over any alphabet.
- If a is a symbol belonging to Σ, then a is a regular expression denoting the language {a}, that is, the set containing the element a.
- If r and s are regular expressions denoting the languages L(r) and L(s), then:
  - r | s is a regular expression denoting the language L(r) ∪ L(s).
  - r.s is a regular expression denoting the language L(r).L(s).
  - r* is a regular expression denoting the language (L(r))*.
  - r+ is a regular expression denoting the language (L(r))+.

Recognizing schemes of formal languages. Finite automata.

Recognizing schemes.

Languages, generally non-finite, can be represented by generating schemes (regular expressions and grammars) or by recognizing schemes. Next we will analyze how to represent languages using this second type of scheme. A recognizer for a language L is a program (or mechanism) that takes as input a string x and answers "yes" if x belongs to L or "no" if it does not. In general, a recognizer can be represented graphically in the following way:

Figure 1.1 General diagram of a recognizer

Input tape: a sequence of symbols ai belonging to an alphabet Σ that constitute the string to be recognized. Usually the sequence ends with a special symbol whose function is to mark the end of the string.

Read head: Mechanism that allows the recognizer to obtain the symbols from the input string. It can move left, right or remain stationary at a certain time.

State control: the center of the recognizer. Through a finite set of states and a transition function, it describes how the recognizer behaves on each symbol of the language's alphabet.

Memory: a device for storing information. The information consists of symbols belonging to a memory alphabet and is used by the recognizer during the analysis of a string. This device is organized with a certain structure. There is a wide variety of recognizers:

- Finite automata.
- Stack Automata.
- Linear bounded automata.
- Turing machine.

The recognizers that we will analyze in this course, given their relationship with the lexical analysis phase, and the recognition of the tokens of a language, are finite automata.

Finite automata.

A finite automaton or finite state machine is an abstract tool that is used to recognize a certain regular language. It is a mathematical model of a system that receives a string made up of characters from a certain alphabet and determines whether that string belongs to the language that the automaton recognizes. Formally:

Definition 1.

Finite automaton. Formally, a finite automaton is defined by a 5-tuple N = (S, Σ, δ, S0, F), where:

- S: a finite set of states.
- Σ: the alphabet of the language, the set of input symbols of the automaton.
- δ: the transition function, defined as δ: S × (Σ ∪ {e}) → P(S).
- S0: the initial state of the automaton, S0 ∈ S.
- F: the set of final or acceptance states of the automaton, F ⊆ S.

Deterministic finite automata (AFD)

An AFD, or deterministic finite automaton, is a finite automaton whose next state is uniquely determined by the current state and the character read by the automaton. An AFD is a particular case of a finite automaton in which:

- There are no e-transitions.
- For any state s and any symbol a ∈ Σ, there is at most one transition leaving s labeled a.

An AFD is called complete when it has a transition for each state and each character of the alphabet. If it does not meet this condition, it is called incomplete. The great advantage provided by Deterministic Finite Automata is given by the simplicity of the mechanism that simulates the recognition of a chain.
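That simplicity is visible in code: simulating an AFD is a single loop that follows exactly one transition per input character. The automaton below, which accepts binary strings ending in 1, is an illustrative example:

```python
def run_dfa(delta, start, finals, string):
    """Follow one transition per symbol; accept if we end in a final state."""
    state = start
    for symbol in string:
        state = delta[(state, symbol)]
    return state in finals

# Example AFD over {0, 1} that accepts strings ending in 1.
delta = {("s0", "0"): "s0", ("s0", "1"): "s1",
         ("s1", "0"): "s0", ("s1", "1"): "s1"}

print(run_dfa(delta, "s0", {"s1"}, "10101"))  # True
print(run_dfa(delta, "s0", {"s1"}, "110"))    # False
```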

Non-deterministic finite automata (AFND)

A non-deterministic finite automaton (AFND) is a finite automaton that may have zero, one or more transitions for the same character of the alphabet. AFNDs are classified into two groups, with e-transitions and without e-transitions, depending on whether or not there are transitions the automaton can take without reading the next character of the string it is analyzing.
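An AFND without e-transitions can be simulated by tracking the set of states the automaton could be in at each step. The automaton below, which accepts strings over {0, 1} containing the substring 01, is an illustrative example:

```python
def run_nfa(delta, start, finals, string):
    """Track every state the automaton could be in; accept if any is final."""
    states = {start}
    for symbol in string:
        states = {t for s in states for t in delta.get((s, symbol), set())}
    return bool(states & finals)

# Example AFND (without e-transitions) accepting strings that contain "01".
delta = {("q0", "0"): {"q0", "q1"}, ("q0", "1"): {"q0"},
         ("q1", "1"): {"q2"},
         ("q2", "0"): {"q2"}, ("q2", "1"): {"q2"}}

print(run_nfa(delta, "q0", {"q2"}, "1101"))  # True
print(run_nfa(delta, "q0", {"q2"}, "111"))   # False
```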

Construction of finite automata from regular expressions

Regular expressions are not only intuitive to construct, but also a convenient way to represent regular languages. They have one great disadvantage, however: they become unusable when we want to determine computationally whether or not a certain string belongs to the language they represent, since that would require generating all the strings of the language, or at least a large part of them, which can be an infinite process. The languages described by regular expressions are exactly the languages recognized by finite automata. As previously noted, one of the great advantages of deterministic finite automata is the simplicity of the mechanism that simulates the recognition of a string. Moreover, there is an algorithm to convert a regular expression into a corresponding non-deterministic finite automaton. The algorithm, called Thompson's construction, builds from the regular expression an AFND with empty transitions, that is, an automaton containing arcs labeled with e. This automaton with empty transitions can then be converted into a finite automaton without empty transitions that recognizes the same language.

Lexical Analysis

How a lexical analyzer works

The lexical analyzer is the first phase of a compiler. Its main function is to read the input characters and produce as output a sequence of tokens that the parser then uses for syntactic analysis. The interaction between the two components (lexical analyzer and parser), represented in Fig. 1, is basically as follows: upon a request for a token (a "next token" message), the lexical analyzer reads characters from the program's source code until a token is identified, which is returned to the parser in response to its request.

Specification of a Lexical Analyzer

Before implementing a lexical analyzer for a given language, it is necessary to design its specification. To do this, one must first identify the collection of tokens that make up the language. Each token must be specified by some generator or recognizer scheme, preferably regular expressions. Then the transition diagram of the lexical analyzer must be specified. Each of these steps will be explained in detail, based on an example.

Preliminary Concepts: Tokens, Patterns and Lexemes.

When the specification of a lexical analyzer is approached, terms such as token, pattern and lexeme are used, each with its own meaning. Let’s look at the example corresponding to Fig. 2. In general, there is a set of strings for which the same token is produced as output. This set of strings is described by a rule called a pattern, which is associated with a specific token. A lexeme is a sequence of characters of the source program that matches the pattern of a token. For example, in the following statement of the Pascal programming language, const pi = 3.1416; the substring pi is a lexeme for the "identifier" token. Lexemes that match a token's pattern represent strings of the source program that can be handled together as a lexical unit. This lexical unit is represented by the token, which, in the following phases of the compilation process, will constitute a terminal symbol in the grammar that generates the language. In most programming languages the following sets constitute tokens:

- reserved words: strings predetermined by the language, which cannot be used as identifiers.
- operators: which in turn can be divided into four tokens: arithmetic, relational, logical and assignment operators.
- constants or literals: which in turn can be divided into three tokens: real literals, integer literals and string literals.
- punctuation symbols: this includes periods, commas, parentheses, braces, and so on. They can be divided into several tokens; it all depends on the syntactic meaning a given __punctuation symbol__ has in the language.

Patterns can be specified in various ways. In Fig. 2, the pattern of the const token is the string const itself, and the pattern of the relation token is the set of relational operators of C. To describe patterns for more complex tokens, any generator or recognizer scheme for languages can be used. They are generally described using regular expressions, although sometimes it is also convenient to describe them using a finite automaton.

Attributes of a token

For each token identified by the lexical analyzer, it is necessary to store additional information such as the __lexeme__, the line number of its first appearance and other data that depend on the type of token (whether an identifier names an array or a function and, in each case, its bounds or the number and types of its parameters, etc.). This __information__ is needed by later phases of the compilation process and is stored in the compiler's symbol table. That is why a token usually has only one attribute: a pointer to the entry of the symbol table where all the information regarding the token is stored. For example, the tokens identified in the __Fortran__ statement E = M * C ** 2 can be written as the sequence of pairs:

<id, pointer to the symbol table entry for E>
<assign_op, >
<id, pointer to the symbol table entry for M>
<mult_op, >
<id, pointer to the symbol table entry for C>
<exp_op, >
<num, pointer to the symbol table entry for '2'>
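The pair sequence above can be produced mechanically. In this sketch the "pointer" attribute is simplified to an index into a list acting as the symbol table; the token names follow the ones used in the statement above, while the tokenizing regular expression is an assumption of the sketch:

```python
import re

symbols = []  # a minimal stand-in for the symbol table

def lookup(lexeme):
    """Insert the lexeme on first appearance and return its index."""
    if lexeme not in symbols:
        symbols.append(lexeme)
    return symbols.index(lexeme)

OPERATORS = {"=": "assign_op", "*": "mult_op", "**": "exp_op"}

pairs = []
for lexeme in re.findall(r"\*\*|[A-Za-z]\w*|\d+|[=*]", "E = M * C ** 2"):
    if lexeme.isidentifier():
        pairs.append(("id", lookup(lexeme)))
    elif lexeme.isdigit():
        pairs.append(("num", lookup(lexeme)))
    else:
        pairs.append((OPERATORS[lexeme], None))

print(pairs)
# [('id', 0), ('assign_op', None), ('id', 1), ('mult_op', None),
#  ('id', 2), ('exp_op', None), ('num', 3)]
```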

Note that in some pairs the pointer value is not necessary, since the first component of the pair is sufficient to identify the __lexeme__. In this small example, for the num token, the __compiler__ can store the string that forms the number in the symbol table, and the __pointer__ to that entry would be the attribute of the num token.