Unicode is a system for representing characters from all the world's languages. It was designed so that documents in different languages can be exchanged without problems, and its development began in late 1987.
What is Unicode?
Unicode provides a unique number for each character, regardless of platform, program, or language. When Python parses an XML document, all the data is stored in memory as Unicode. Python has supported Unicode since version 2.0 of the language. The XML package uses Unicode to store all the XML data, but you can use Unicode anywhere.
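Unicode was originally designed to represent each character as a 2-byte number, from 0 to 65535. The current standard has outgrown that range: code points now run from 0 to 0x10FFFF (1,114,111 in decimal), and the first 65,536 of them form the Basic Multilingual Plane (BMP). Each assigned number represents a single character used in at least one of the world's languages (characters used in more than one language have the same numeric code). There is exactly one number per character, and exactly one character per number. Unicode data is never ambiguous.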
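The one-to-one mapping between characters and numbers can be checked from Python itself. The following is a minimal sketch, assuming Python 3 (where every `str` is a Unicode string); `ord`, `chr`, and `unicodedata.name` come from the standard library, and the sample characters are chosen only for illustration.

```python
import unicodedata

# Sample characters written as escapes: A, n with tilde, euro sign, an emoji.
samples = ["A", "\u00f1", "\u20ac", "\U0001f600"]

# ord() returns the code point (the unique number) of a character.
code_points = [ord(ch) for ch in samples]

for ch, cp in zip(samples, code_points):
    # U+XXXX is the conventional notation for a code point.
    print(f"U+{cp:04X} ({cp}): {unicodedata.name(ch)}")

# chr() is the inverse of ord(): one number per character,
# and one character per number.
round_trip = [chr(cp) for cp in code_points]
```

The last sample has a code point above 65535, so it falls outside the range a single 2-byte number can hold.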
The establishment of Unicode has been an ambitious project to replace existing character encoding schemes, many of which are very limited in size and incompatible with multilingual environments. Unicode has become the most extensive and complete character encoding scheme, and the dominant one in the internationalization and localization of computer software. The standard has been implemented in a considerable number of recent technologies, including XML, Java and modern operating systems.
Computers only work with numbers. They store letters and other characters by assigning a number to each. Before Unicode was invented, there were hundreds of different encoding systems to assign these numbers. No single encoding could contain enough characters: for example, the European Union, by itself, needs several different encoding systems to cover all its languages. Even for a single language like English, there was no single encoding system that covered all commonly used letters, punctuation, and technical symbols.
Any computer, and servers in particular, needs to support many different encoding systems; however, each time data is transferred between different encodings or platforms, it runs the risk of being corrupted.
Unicode provides a unique number for each character, regardless of platform, program, or language.
In its original 16-bit design, Unicode did this by using two bytes for each character. For reference, in classic ASCII format a single byte is sufficient to represent each character. Current encodings vary: UTF-8 uses one to four bytes per character, and UTF-16 uses two or four. The programs and operating systems that support Unicode handle this extra space, and it should not be a problem under normal circumstances.
A little history
Before Unicode, there were different character encoding systems for each language, each using the same numbers (0-255) to represent the characters in that language. Some languages (like Russian) had several incompatible standards that represented the same characters; other languages (such as Japanese) have so many characters that they require more than one byte. Exchanging documents between these systems was difficult because there was no way for a computer to know for sure what character encoding scheme the document author had used; the computer only saw numbers, and numbers can mean many things. Unicode was designed to solve these problems.
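The Russian case can be reproduced with two legacy single-byte encodings that ship with Python. This is an illustrative sketch, assuming Python 3 and its standard `koi8_r` and `cp1251` codecs; the sample word is arbitrary.

```python
# One Russian word ("Privet"), written with escapes so the source stays ASCII.
word = "\u041f\u0440\u0438\u0432\u0435\u0442"

# Store it under two legacy encodings for Russian.
koi8_bytes = word.encode("koi8_r")
cp1251_bytes = word.encode("cp1251")

# Each encoding uses one byte per letter, but with different numbers.
same_bytes = koi8_bytes == cp1251_bytes

# Reading KOI8-R bytes with the Windows-1251 table raises no error here;
# it silently produces different Cyrillic letters.
misread = koi8_bytes.decode("cp1251")

# Only the table the author actually used recovers the text.
recovered = koi8_bytes.decode("koi8_r")
```

Nothing in the bytes themselves says which table applies, which is the ambiguity described above.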
The Unicode project started in late 1987, after discussions between engineers from Apple and Xerox: Joe Becker, Lee Collins, and Mark Davis. As a result of their collaboration, the first draft of Unicode was released in August 1988 under the name Unicode88. This first version, with 16-bit codes, was published assuming that only the characters necessary for modern use would be encoded.
During 1989 the work continued with the addition of collaborators from other companies such as Microsoft and Sun Microsystems. The Unicode Consortium was formed on January 3, 1991, and in October 1991 the first version of the standard was published. The second version, including Han ideographic writing, was published in June 1992.
Unicode’s real advantage is its ability to store non-ASCII characters, such as the Spanish “ñ”. The Unicode character for ñ is 0xf1 in hexadecimal (241 in decimal), which can be written like this: \xf1
>>> s = u'Dive in'
>>> print s
Dive in
To create a Unicode string instead of a normal ASCII string, prefix the string with the letter “u”. This particular string does not have any non-ASCII characters. This is not a problem; Unicode is a superset of ASCII, so you can also store a normal ASCII string as Unicode.
When Python prints a string it will try to convert it to the default encoding, which is usually ASCII. Since this Unicode string is made of characters that are all ASCII, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn’t know that “s” was a Unicode string you would never notice the difference.
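The interpreter session above uses Python 2 syntax. As a rough Python 3 counterpart, where every string is already Unicode and the `u` prefix is optional, the conversion to ASCII can be made explicit. This is a minimal sketch; `to_ascii_or_none` is a hypothetical helper introduced only for illustration.

```python
def to_ascii_or_none(text):
    """Return text encoded as ASCII bytes, or None if that is impossible."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character has no ASCII equivalent.
        return None

# A pure ASCII string converts cleanly.
plain = to_ascii_or_none("Dive in")

# '\xf1' in a string literal denotes code point 0xF1 (241), the letter n with tilde.
n_tilde = "\xf1"

# A string containing that letter cannot be represented in ASCII.
accented = to_ascii_or_none("La Pe" + n_tilde + "a")
```

The first call returns the same seven bytes an ASCII string would hold, while the second returns `None` because ASCII has no code for ñ.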
Information processing: Coding forms
Unicode code points are identified by an integer. Depending on its architecture, a computer will use 8-, 16-, or 32-bit units to represent these integers. Unicode encoding forms define how code points are transformed into units that a computer can process.
Unicode defines three encoding forms under the name UTF (Unicode Transformation Format):
UTF-8 – byte-oriented encoding in which each character takes a variable number of bytes (one to four).
UTF-16 – variable-length encoding with 16-bit units, optimized for representing the Basic Multilingual Plane (BMP).
UTF-32 – fixed-length encoding with 32-bit units.
Coding forms only describe how code points are represented in machine-readable format. From the 3 identified forms, 7 coding schemes are defined.
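To make the difference between the encoding forms concrete, the sketch below counts how many code units a few characters occupy in each form. It assumes Python 3 and its standard `utf-8`, `utf-16-le`, and `utf-32-le` codecs; `code_units` is a hypothetical helper, and the little-endian codecs are used only because they do not prepend a byte order mark.

```python
def code_units(text, form):
    """Count the code units that text occupies in the given encoding form."""
    # Map each form to a codec and to the size of one code unit in bytes.
    codec, unit_size = {
        "UTF-8": ("utf-8", 1),
        "UTF-16": ("utf-16-le", 2),
        "UTF-32": ("utf-32-le", 4),
    }[form]
    return len(text.encode(codec)) // unit_size

forms = ("UTF-8", "UTF-16", "UTF-32")

# Characters written as escapes so the source file stays ASCII.
samples = {
    "A": "A",
    "n-tilde": "\u00f1",
    "euro": "\u20ac",
    "emoji": "\U0001f600",
}

# For each character, the number of code units in UTF-8, UTF-16, and UTF-32.
table = {
    name: [code_units(ch, form) for form in forms]
    for name, ch in samples.items()
}

for name, counts in table.items():
    print(name, counts)
```

UTF-32 always needs exactly one unit per character, UTF-16 needs two units only for characters outside the BMP, and UTF-8 grows from one to four bytes as the code point gets larger.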
Coding schemes deal with the way encoded information is serialized. Reliable information exchange between heterogeneous systems requires a way to determine the correct order of bits and bytes and to guarantee that the information is reconstructed correctly. A fundamental difference between processors is the order in which bytes are arranged in 16- and 32-bit words, which is called endianness. Coding schemes must ensure that both ends of a communication know how to interpret the information received. From the 3 coding forms, 7 schemes are defined: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. Although they share names, encoding schemes and forms should not be confused.
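The effect of endianness on serialization can be observed with the UTF-16 family of schemes. This is a minimal sketch, assuming Python 3 and the standard `codecs` module; the single-letter sample is arbitrary.

```python
import codecs

text = "A"  # code point U+0041

# The same 16-bit code unit serialized in the two possible byte orders.
big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first

# The unmarked "utf-16" scheme prepends a byte order mark (BOM) so the reader
# can tell which order follows; Python writes the platform's native order.
marked = text.encode("utf-16")
has_bom = marked[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# A decoder uses the BOM to choose the byte order and then discards it.
decoded_be = (codecs.BOM_UTF16_BE + big).decode("utf-16")
decoded_le = (codecs.BOM_UTF16_LE + little).decode("utf-16")

# Assuming the wrong order does not fail here; it yields a different character.
wrong = big.decode("utf-16-le")
```

Without a BOM or an agreed scheme name, the receiver has no way to distinguish U+0041 from U+4100, which is why the scheme has to be part of the exchange.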