If you are a self-taught front-end developer, this may be a line of code that you have been writing mechanically without fully understanding it. I used to write it offhandedly, and occasionally left it out altogether, until I did some research and discovered its importance.
Simply put, when you declare the "charset" as "UTF-8", you are telling your browser to use the UTF-8 character encoding, which is a method of converting your typed characters into machine-readable code.
Character set vs character encoding
Your computer can only manipulate information it receives in binary format. This means that any characters on your web page must be converted. This is done in two stages: character set and character encoding.
- Each letter, punctuation mark, or other character is assigned a unique number, called a "code point" (Character Set)
- The code point is then converted to a sequence of bytes, and so to binary (Character Encoding)
HTML documents can only contain characters defined by the Unicode character set, so we do not need to define the character set in our document. However, there are several forms of encoding that can be used with Unicode, so we do need to declare which we would like to use. Presently, UTF-8 is the recommended character encoding by the W3C.
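As an example, the text "Hello, World!" is converted to binary in the following way. Every character in this example falls in the ASCII range, so UTF-8 encodes each one as a single byte.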
Character Set (Unicode)
U+0048 U+0065 U+006C U+006C U+006F U+002C U+0020 U+0057 U+006F U+0072 U+006C U+0064 U+0021
Character Encoding (UTF-8)
01001000 01100101 01101100 01101100 01101111 00101100 00100000 01010111 01101111 01110010 01101100 01100100 00100001
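The two stages above can be reproduced in a few lines of Python. This is only an illustrative sketch, not part of the HTML itself. The variable names are my own, and the copyright symbol is added as an assumption to show that a character outside the ASCII range takes more than one byte in UTF-8.

```python
text = "Hello, World!"

# Stage 1 (character set): look up the Unicode code point of each character.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Stage 2 (character encoding): encode to UTF-8 bytes and show each byte in binary.
utf8_bits = [f"{byte:08b}" for byte in text.encode("utf-8")]

print(" ".join(code_points))
print(" ".join(utf8_bits))

# A non-ASCII character (illustrative addition): U+00A9 becomes two bytes in UTF-8.
copyright_bits = [f"{byte:08b}" for byte in "\u00a9".encode("utf-8")]
print(" ".join(copyright_bits))
```

Running this should print the same code points and bytes listed above, followed by the two bytes for the copyright symbol.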
Character references and entities
Within HTML, we occasionally need to access characters from Unicode besides what is on a standard keyboard. To solve this problem, we can use "numeric character references" and "named character entities" to reference them.
Every Unicode character has a numeric reference, and many of the commonly used ones also have a named entity. You can use either one. For example, one that you may frequently use is the copyright symbol, which can be written as &copy; (named entity) or &#169; (numeric reference).
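To check that different references point at the same character, you can decode them with Python's standard html module. This is a small sketch for experimenting outside the browser. The hexadecimal form &#xA9; is an extra variant that the text above does not mention.

```python
import html

# Named entity, decimal numeric reference, and hexadecimal numeric reference.
named = html.unescape("&copy;")
decimal = html.unescape("&#169;")
hexadecimal = html.unescape("&#xA9;")

# All three resolve to the same code point, U+00A9.
print(named, decimal, hexadecimal)
print(f"U+{ord(named):04X}")
```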
Why you need to declare character encoding
You may notice that, in certain situations, nothing significant happens if you skip declaring the character encoding. This is because the character encoding can also be specified elsewhere. There are three main methods, listed below in order of precedence (highest to lowest).
- A byte order mark (BOM) at the very start of the file
- The charset parameter of the HTTP Content-Type header
- The meta charset tag in the document
The first two methods are not so easily accessible and may be present without your knowledge. You can use the W3C Internationalization Checker to look these up for your website.
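The order of precedence can be sketched in code. The function below is a simplified illustration, assuming the three sources are a UTF-8 byte order mark, the HTTP Content-Type header, and the meta charset tag, checked in that order. The name declared_encoding and the regular expression are my own. Real browsers follow a more detailed algorithm, so treat this as a rough model rather than a faithful reimplementation.

```python
import codecs
import re
from email.message import Message
from typing import Optional

# Simplified pattern for <meta charset="..."> (hypothetical, not a full HTML parser).
META_CHARSET = re.compile(rb'''<meta[^>]*charset=["']?([A-Za-z0-9_-]+)''', re.IGNORECASE)


def declared_encoding(body: bytes, content_type: Optional[str] = None) -> Optional[str]:
    """Return the declared encoding, checking the sources from highest to lowest precedence."""
    # 1. A UTF-8 byte order mark at the very start of the file.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"

    # 2. The charset parameter of the HTTP Content-Type header, if one was sent.
    if content_type:
        header = Message()
        header["Content-Type"] = content_type
        charset = header.get_content_charset()
        if charset:
            return charset

    # 3. A meta charset tag near the top of the document.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()

    return None


page = b'<!doctype html><html><head><meta charset="UTF-8"></head></html>'
print(declared_encoding(page))
print(declared_encoding(page, "text/html; charset=iso-8859-1"))
```

In the second call the header value is returned instead of the meta tag, which is the overriding behaviour described above.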
Even though the Meta tag may be overridden by the other two methods, it is still advised that you specify it for a few reasons.
- Clarity - it helps other people reading your code more easily determine what character encoding you intended to be used.
- Wrongful encoding - in some cases, particularly for static websites, the character encoding may not be specified by these other methods. In this case, you risk incorrect encoding of your content.
- Validation - according to the W3C, declaring your character encoding through the meta charset tag is essential for your code to validate.
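The "wrongful encoding" risk is easy to reproduce. The sketch below assumes a page saved as UTF-8 that is then read as windows-1252, a common legacy encoding. The sample text is my own.

```python
original = "café ©"

# The file on disk holds UTF-8 bytes.
raw = original.encode("utf-8")

# With no declaration, suppose the reader guesses windows-1252 instead.
misread = raw.decode("windows-1252")

# Decoding with the encoding that was actually used restores the text.
correct = raw.decode("utf-8")

print(misread)
print(correct)
```

Each non-ASCII character is stored as two bytes in UTF-8, and the wrong decoder turns every one of those bytes into a separate, unrelated character.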
How to declare character encoding
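The proper way to declare your character encoding is to state it immediately after the opening head tag in your document, before anything else. The HTML specification requires the declaration to fall within the first 1024 bytes of the document, and putting it first makes sure of that.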
<!doctype html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>My Website</title>
  </head>
  <body>
  </body>
</html>