Language identification

A language identification algorithm is used to determine the language of a given text. Some languages can be determined reliably from their character use alone. In these cases, it is sufficient to analyse the characters and encoding of the web content. For languages that share a script (such as the many languages written in the Latin alphabet), the distinction is more difficult.
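As a rough illustration of the character-based case, the sketch below guesses a language from the dominant Unicode script of a text. The function names and the SCRIPT_TO_LANGUAGE table are assumptions made for this example, and the table is deliberately tiny: it lists only a few scripts that are used almost exclusively by one language. Scripts shared by many languages (Latin, Cyrillic, Arabic) are left out on purpose, so the function returns None for them.

```python
import unicodedata
from collections import Counter

# Hypothetical, deliberately incomplete mapping from a script name
# (the first word of the Unicode character name) to a language code.
SCRIPT_TO_LANGUAGE = {
    "GREEK": "el",
    "HANGUL": "ko",
    "THAI": "th",
}


def dominant_script(text):
    """Return the most frequent script among the letters of text, or None."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "GREEK SMALL LETTER ALPHA" -> "GREEK"
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def language_from_script(text):
    """Return a language code if the script alone identifies it, else None."""
    return SCRIPT_TO_LANGUAGE.get(dominant_script(text))


if __name__ == "__main__":
    print(language_from_script("Καλημέρα κόσμε"))
    print(language_from_script("Hello world"))
```

A None result here means only that the script is not decisive, which is the situation where a statistical method is needed.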

Languages that share a script are best distinguished with statistical methods.

A widely used approach is supervised machine learning based on character n-grams. These algorithms are trained on example texts whose language is known. The resulting model can then be used to label unknown text.
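The train-then-label workflow can be sketched as follows. This is a minimal illustration, not the method of any particular library: it builds one character trigram count profile per language and labels unknown text with the language whose profile is most similar by cosine similarity. The function names, the choice of trigrams, the similarity measure, and the two short training sentences are all assumptions made for the example. A real model would be trained on far more text per language.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count character n-grams of text, lowercased and space-padded."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (sqrt(sum(c * c for c in a.values()))
            * sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def train(samples):
    """Build one n-gram profile per language from labelled example text."""
    return {lang: char_ngrams(text) for lang, text in samples.items()}


def identify(text, model):
    """Return the label of the profile most similar to text."""
    profile = char_ngrams(text)
    return max(model, key=lambda lang: cosine(profile, model[lang]))


# Illustrative training data only; far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog "
          "and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund "
          "und die katze schläft im haus",
}

if __name__ == "__main__":
    model = train(SAMPLES)
    print(identify("the dog is in the house", model))
    print(identify("die katze und der hund", model))
```

Note that this sketch always returns one of the trained labels, even for text in a language it has never seen, so a practical tool would also need a confidence threshold or an "unknown" outcome.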

This class of algorithms has the following properties:

Auto-WCAG recommendation

The Auto-WCAG group does not recommend the use of any particular algorithm. Instead, each tool that uses automatic language identification should disclose which algorithm, implementation, or third-party libraries it uses.

Background

List of implementations (both libraries and web services) from Wikipedia.