This specification is one of a family of related specifications that compose EPUB 3, the third major revision of an interchange and delivery format for digital publications based on XML and Web Standards. It is meant to be read and understood in concert with the other specifications that make up EPUB 3. The Overview should be read first. This specification supersedes Open Package Format 2.

The mapping is defined by the encoding.

Thus, the number of code units required to represent a code point depends on the encoding: a code point is mapped to between one and four code units. Where a code point is represented by a pair of code units, such pairs have their own term in the UTF encoding concerned.

Rather than mapping characters directly to octets (bytes), a standard such as Unicode separately defines what characters are available, their corresponding natural numbers (code points), how those numbers are encoded as a series of fixed-size natural numbers (code units), and finally how those units are encoded as a stream of octets.
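The dependence of the code unit count on the encoding can be checked directly. The following is a minimal sketch using Python's built-in codecs; the helper names `code_unit_count` and `unit_counts` and the choice of sample characters are illustrative assumptions, not part of the text above.

```python
# Each entry pairs a codec name with the size of one code unit in bytes.
# The "-be" variants are used so that no byte order mark is added to the output.
ENCODINGS = [("utf-8", 1), ("utf-16-be", 2), ("utf-32-be", 4)]


def code_unit_count(char, encoding, unit_bytes):
    """Return how many code units `encoding` needs for a single character."""
    return len(char.encode(encoding)) // unit_bytes


def unit_counts(char):
    """Map each sample encoding to the number of code units used for `char`."""
    return {name: code_unit_count(char, name, size) for name, size in ENCODINGS}


# Sample code points: U+0041, U+00E9, U+20AC, U+1F600 (illustrative choices).
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", unit_counts(ch))
```

The same code point can therefore occupy one unit in one encoding and several in another, which is why the count cannot be stated without naming the encoding.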

The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways. The character repertoire may be closed, that is, fixed so that no further characters can be added. The characters in a given repertoire reflect decisions that have been made about how to divide writing systems into basic information units.

The basic variants of the Latin, Greek, and Cyrillic alphabets can be broken down into letters, digits, punctuation, and a few special characters such as the space, which can all be arranged in simple linear sequences that are displayed in the same order they are read.

But even with these alphabets, diacritics pose a complication, and ligatures pose similar problems. Other writing systems, such as Arabic and Hebrew, are represented with more complex character repertoires due to the need to accommodate things like bidirectional text and glyphs that are joined together in different ways for different situations.
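One concrete form the diacritic complication takes in Unicode is that an accented letter can be stored either as a single code point or as a base letter followed by a combining mark. The sketch below is an illustration added here, not drawn from the text above; it uses the standard `unicodedata` module, and the helper name `code_points` is hypothetical.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE: one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT: two code points


def code_points(text):
    """List the code points of `text` in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]


print(code_points(precomposed))
print(code_points(decomposed))

# The two strings look the same when rendered but are different sequences.
print(precomposed == decomposed)

# Normalization converts between the composed and decomposed forms.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

Software that compares or searches text usually has to pick one normalization form first, since a raw comparison treats the two spellings as different.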

A coded character set (CCS) is a function that maps characters to code points (each code point represents one character). For example, in a given repertoire, the capital letter "A" in the Latin alphabet might be represented by the code point 65, the character "B" by 66, and so on.
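The character-to-code-point mapping can be inspected with Python's built-in `ord` and `chr`, which use Unicode as the coded character set. This is a small sketch; the function names `to_code_points` and `from_code_points` are illustrative assumptions.

```python
def to_code_points(text):
    """Map each character to its code point (the CCS direction)."""
    return [ord(ch) for ch in text]


def from_code_points(points):
    """Map code points back to characters (the inverse direction)."""
    return "".join(chr(p) for p in points)


print(to_code_points("AB"))        # [65, 66]
print(from_code_points([65, 66]))  # AB
```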

A character encoding form (CEF) is the mapping of code points to code units to facilitate storage in a system that represents numbers as bit sequences of fixed length.

For example, a system that stores numeric information in 16-bit units can only directly represent code points 0 to 65,535 in each unit, but larger code points (65,536 and above) have to be represented using multiple units. This correspondence is defined by a CEF. Next, a character encoding scheme (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network.
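The two steps can be sketched separately in code. The example below assumes UTF-16 as the encoding form (the text above does not name one) and uses hypothetical helper names: `utf16_code_units` performs the CEF step and `serialize` performs the CES step by choosing a byte order.

```python
import struct


def utf16_code_units(code_point):
    """CEF step: map one code point to one or two 16-bit code units."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]          # fits in a single 16-bit unit
    offset = code_point - 0x10000    # 20 bits, split into two 10-bit halves
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def serialize(units, big_endian=True):
    """CES step: write each 16-bit code unit as two octets in a fixed byte order."""
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)


units = utf16_code_units(0x1F600)
print([hex(u) for u in units])
print(serialize(units, big_endian=True).hex())
print(serialize(units, big_endian=False).hex())
```

The code units are the same in both cases; only the octet order differs, which is exactly the part that the encoding scheme decides.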

See comparison of Unicode encodings for a detailed discussion. Finally, there may be a higher-level protocol which supplies additional information to select the particular variant of a Unicode character, particularly where there are regional variants that have been 'unified' in Unicode as the same character.

An example is an XML attribute with the xml: prefix.

Today, however, the terms have related but distinct meanings [5], due to efforts by standards bodies to use precise terminology when writing about and unifying many different encoding systems.

A "code page" usually means a byte-oriented encoding, but with regard to some suite of encodings (covering different scripts), where many characters share the same codes in most or all those code pages. Most, but not all, encodings referred to as code pages are single-byte encodings (but see octet on byte size).
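The sharing of codes across a suite of code pages can be observed with Python's bundled codecs. The sketch below picks three code pages (`cp437`, `cp850`, `cp1252`) purely as an illustrative assumption; the helper `decode_everywhere` is hypothetical.

```python
CODE_PAGES = ["cp437", "cp850", "cp1252"]  # illustrative selection


def decode_everywhere(byte_value):
    """Decode one byte under each sample code page and return the results."""
    raw = bytes([byte_value])
    return {page: raw.decode(page) for page in CODE_PAGES}


# Codes in the ASCII range are shared by all three code pages.
print(decode_everywhere(0x41))

# Codes in the upper half are where the code pages diverge.
print(decode_everywhere(0xE9))
```

This is why a byte stream alone is ambiguous: the same octet is a different character depending on which code page the reader assumes.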

Contrasted to CCS above, a "character encoding" is a map from abstract characters to code words. The term "legacy encoding" is mostly used in the context of Unicodification, where it refers to encodings that fail to cover all Unicode code points or, more generally, that use a somewhat different character repertoire. Some sources refer to an encoding as legacy only because it preceded Unicode.

Character encoding translation

As a result of having many character encoding methods in use (and the need for backward compatibility with archived data), many computer programs have been developed to translate data between encoding schemes as a form of data transcoding.
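At its core, translating between encoding schemes is a decode followed by an encode. The following is a minimal sketch with Python's built-in codecs; the function name `transcode` and the sample data are assumptions made for illustration, and real tools add error handling for bytes or characters that one side cannot represent.

```python
def transcode(data, source, target):
    """Reinterpret `data` from the `source` encoding into the `target` encoding."""
    text = data.decode(source)   # octets -> abstract characters
    return text.encode(target)   # abstract characters -> octets


legacy = b"caf\xe9"              # "cafe" with an acute accent, stored as ISO-8859-1
utf8 = transcode(legacy, "iso-8859-1", "utf-8")
print(utf8)                      # b'caf\xc3\xa9'

# Converting back recovers the original octets, since every character
# in this sample exists in both encodings.
print(transcode(utf8, "utf-8", "iso-8859-1") == legacy)
```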

Some of these are cited below. Web browsers: most modern web browsers feature automatic character encoding detection.

The newer versions of the Unix file command attempt to do a basic detection of character encoding (also available on Cygwin).

EPUB Publications

EPUB is an interchange and delivery format for digital publications based on XML and Web Standards. EPUB 3, the third major revision of EPUB, is defined by a set of specification documents including this document, which defines publication-level semantics and conformance requirements for EPUB 3.

