Unicode

What is Unicode?

Unicode is a computing standard that lets computers consistently represent and manipulate text written in any of the world’s writing systems. In Python, “Unicode” also refers to the built-in data type that holds such text: unicode in Python 2 and str in Python 3.

Why is Unicode important?

Before Unicode, there were hundreds of different encoding schemes for representing text, and these schemes often conflicted with one another. This created a problem: whenever text was exchanged between systems that used different encodings, the same bytes could be interpreted as different characters, leading to garbled text.
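The effect is easy to reproduce. The sketch below is only an illustration: the sample word and the choice of UTF-8 and Latin-1 as the two encodings are assumptions made for the example, not something the text above prescribes.

```python
# The same bytes decoded with two different encodings.
original = "café"

# Encode the text to bytes using UTF-8.
data = original.encode("utf-8")

# Decoding with the matching encoding recovers the text.
correct = data.decode("utf-8")

# Decoding with a different encoding (Latin-1 here) garbles
# the non-ASCII character.
garbled = data.decode("latin-1")

print(correct)   # café
print(garbled)   # cafÃ©
```

The two bytes that UTF-8 uses for é are read by Latin-1 as two separate characters, which is the kind of garbling described above.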

Unicode was invented to solve this problem. It is a universal standard that assigns a unique number, called a code point, to every character across languages and scripts, making it possible to represent virtually any character on any computer.
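In Python 3, the number assigned to a character (its code point) can be inspected with the built-in ord() function, and chr() converts a number back to a character. A minimal sketch, with sample characters chosen only for illustration:

```python
# ord() returns the code point of a single character as an integer.
code_points = {ch: ord(ch) for ch in "AÜ😉"}

for ch, number in code_points.items():
    # Code points are conventionally written in hexadecimal, e.g. U+00DC.
    print(ch, number, f"U+{number:04X}")

# chr() goes the other way: from code point to character.
letter = chr(0xDC)
print(letter)   # Ü
```

The loop reports 65 for A, 220 for Ü, and 128521 for 😉.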

Unicode in Python

In Python 2, Unicode text had its own data type, unicode, separate from the default str type, which held bytes. Python 3 simplified this by making str itself a Unicode string type.

# Unicode strings
str1 = 'Hello World!'
str2 = u'Hello World!'

print(type(str1))
print(type(str2))

In the code snippet above, both str1 and str2 are Unicode strings; in Python 3, each print call outputs <class 'str'>.

In Python 3, the u prefix is optional because every string literal is already Unicode; it is accepted only for compatibility with Python 2 code. In Python 2, a plain string literal is a byte string by default, so the u prefix is needed to create a Unicode string.

Note

In Python, Unicode strings can be written with single quotes, double quotes, or triple quotes.
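As a quick check of the note above, the following sketch (sample text chosen only for illustration) shows that the quote style does not affect the resulting string in Python 3:

```python
single = 'Grüße'
double = "Grüße"
triple = """Grüße"""

# All three literals produce equal str objects.
print(single == double == triple)   # True

# Triple quotes additionally allow a literal to span several lines.
multiline = """Grüße
aus Köln"""
print(multiline.splitlines())       # ['Grüße', 'aus Köln']
```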

Unicode Characters in Python

Unicode characters can be written in Python strings using the \u escape sequence followed by exactly four hexadecimal digits that give the character’s code point.

# Unicode string with Unicode character
str3 = u'\u00DCnicode'
print(str3)

The code above prints Ünicode, because \u00DC is the code point of the character Ü.
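An escape sequence is only a way of spelling a character in source code; the resulting string is the same as one written with the character itself. The sketch below illustrates this in Python 3. The \N{...} form, which looks a character up by its official Unicode name, is shown as a further alternative that the text above does not cover.

```python
escaped = '\u00DCnicode'
literal = 'Ünicode'
named = '\N{LATIN CAPITAL LETTER U WITH DIAERESIS}nicode'

# All three spellings produce the same seven-character string.
print(escaped == literal == named)   # True
print(len(escaped))                  # 7

# The first character has code point 0xDC.
print(hex(ord(escaped[0])))          # 0xdc
```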

If a code point needs more than four hex digits, use \U (uppercase) followed by exactly eight hex digits, padding with leading zeros.

# Unicode string with Unicode character
str4 = u'\U0001F609'
print(str4)

This script prints the emoji 😉, which has the Unicode code point U+1F609.
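The distinction between \u and \U matters because each escape consumes a fixed number of hex digits. The following Python 3 sketch (variable names are illustrative) shows the correct form next to a common mistake:

```python
emoji = '\U0001F609'

# The emoji is a single character with code point 0x1F609.
print(len(emoji))        # 1
print(hex(ord(emoji)))   # 0x1f609

# Pitfall: \u consumes exactly four hex digits, so here the trailing "9"
# becomes a separate, ordinary character.
mistake = '\u1F609'
print(len(mistake))      # 2
print(mistake[1])        # 9
```

Python raises no error for the second literal, so this mistake tends to show up only as unexpected output.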

Overall, understanding Unicode is essential when dealing with multilingual text data and ensuring it can be consistently represented and decoded correctly across different systems.