For users who would rather not dig into the details of internationalization, there is a simple variant.
The default variant (an HTML document with no character set specification) is suitable only for ISO 8859-1 (Latin1) Web pages and not for Russian Web pages: in that case the standards require an explicit character set specification. For those who are too lazy to look them up, here are the directly relevant quotes from Hypertext Transfer Protocol -- HTTP/1.1 (RFC 2616):
3.4.1 Missing Charset
Some HTTP/1.0 software has interpreted a Content-Type header without charset parameter incorrectly to mean "recipient should guess." Senders wishing to defeat this behavior MAY include a charset parameter even when the charset is ISO-8859-1 and SHOULD do so when it is known that it will not confuse the recipient.
Unfortunately, some older HTTP/1.0 clients did not deal properly with an explicit charset parameter. HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document. See section 3.7.1.
3.7.1 Canonicalization and Text Defaults
...
The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.
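The default rule quoted above can be sketched in a few lines of Python. This is only an illustration of the RFC 2616 wording, not code from any particular browser or server; the function name effective_charset is made up for this example.

```python
from email.message import Message
from typing import Optional


def effective_charset(content_type: str) -> Optional[str]:
    """Return the charset a recipient should assume for a Content-Type value.

    Follows the RFC 2616 section 3.7.1 rule quoted above: an explicit
    charset parameter wins; otherwise "text" subtypes default to ISO-8859-1.
    (Hypothetical helper, written for illustration only.)
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset is not None:
        # An explicit label from the sender must be respected.
        return str(charset).lower()
    if msg.get_content_maintype() == "text":
        # No label on a text type: the HTTP default applies.
        return "iso-8859-1"
    # The text default does not apply to other media types.
    return None


print(effective_charset("text/html; charset=KOI8-R"))
print(effective_charset("text/html"))
```

The second call is the case that hurts unlabeled Russian pages: with no charset parameter, the recipient is entitled to assume ISO-8859-1.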
There are two methods to specify that your document is written in a specific character set, such as KOI8-R:
- A method that requires additional HTTP daemon actions (this is the primary method).
- A document-only method that requires no additional server configuration (this is an alternative method and is not recommended).
If you do not use one of these, your document is treated as a Latin1 document, i.e. the default character set (ISO 8859-1) is assumed in accordance with standards.
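To make the two methods and the Latin1 fallback concrete, here is a minimal Python sketch. It assumes that the primary method amounts to a charset parameter in the HTTP Content-Type header and that the document-only method is the usual HTML META HTTP-EQUIV element; the helper names and the sample word are invented for this example.

```python
RUSSIAN_TEXT = "Привет"  # sample text, chosen only for illustration


def content_type_header(charset: str) -> str:
    """Primary method: the server labels the response in the HTTP header."""
    return f"Content-Type: text/html; charset={charset}"


def html_with_meta(body: str, charset: str) -> bytes:
    """Document-only method (assumed form): the label sits inside the HTML."""
    html = (
        "<HTML><HEAD>\n"
        f'<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset={charset}">\n'
        "</HEAD><BODY>\n"
        f"{body}\n"
        "</BODY></HTML>\n"
    )
    # The document bytes must really be in the charset that the label names.
    return html.encode(charset)


# What happens with no label at all: KOI8-R bytes read as ISO 8859-1.
raw = RUSSIAN_TEXT.encode("koi8-r")
as_latin1 = raw.decode("iso-8859-1")  # what a standards-following client sees
as_koi8r = raw.decode("koi8-r")       # what the author intended

print(content_type_header("koi8-r"))
print(as_latin1)
print(as_koi8r)
```

Decoding as ISO 8859-1 never fails, because every byte maps to some Latin1 character; that is why a missing label produces unreadable text rather than an error.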
WARNING: Lots of people never bother to follow standards, so many Russian pages written in KOI8-R, windows-1251, or some other Cyrillic code table are not accompanied by any sort of charset= attribute. As a result, browsers that adhere to standards may display such documents using the Latin1 (ISO 8859-1) character set, which renders Russian text completely unreadable. Don't blame your browser for this; contact the author of the page instead and ask him/her to fix the page using one of the two methods described here. See also My Impressions (in Russian) about the current state of the Russian-language Web.
Here are the most popular Russian encodings and their corresponding registered character set names for use in Web pages:
KOI8 (Unix) charset=koi8-r
CP1251 (Windows) charset=windows-1251
ISO 8859-5 (SunOS) charset=iso-8859-5
CP866 (DOS) charset=cp866
MacCyrillic (Mac) charset=x-mac-cyrillic (unregistered, but commonly recognized)
- How to request KOI8-R documents when several encodings are available.
- How to inform the remote server about your local character set.
- KOI8-R <FORM>s handling.
- Handling KOI8-R resources that are served via non-HTTP.
- Tuning Apache HTTPD for KOI8-R support.
- Standards and References:
- HTML Document Representation from HTML 4.01 specification, charset & encoding.
- W3C Internationalization / Localization
- HTML Unleashed: Internationalizing HTML by Dmitry Kirsanov
- Internationalization of the Hypertext Markup Language (RFC 2070)
- Hypertext Transfer Protocol -- HTTP/1.1 (RFC 2616)
- Transparent Content Negotiation in HTTP (RFC 2295)
- Form-based File Upload in HTML (RFC 1867)
- Tags for the Identification of Languages (RFC 5646)
- HTML 4.01 Specification
For dessert: our impressions, formed under the influence of what is happening on the Russian-language Web, as well as a preliminary study.