Unicode
This is a near-universal converter for various Unicode encodings.
Supported formats
Encodings
- UTF-8: A variable-length encoding that produces 1 to 4 bytes per character. Identical to ASCII for characters in the ASCII range.
- UTF-16: A variable-length encoding that produces either 2 or 4 bytes per character.
- UTF-32 / UCS-4: A fixed-length encoding that produces 4 bytes per character.
- Codepoints: The codepoint of each character, separated by spaces.
- Punycode: An ASCII representation of Unicode used in internet hostnames, defined by RFC 3492. Note that the xn-- prefix of internationalized domain names is not part of Punycode and must be omitted when decoding.
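The byte counts listed above can be checked with a short Python sketch. This is only an illustration using Python's built-in codecs, not the converter's own implementation; the sample strings, the hexadecimal codepoint format, and the strip_idn_prefix helper are assumptions made for the example.

```python
# Sample text: "a", e-acute (U+00E9), euro sign (U+20AC), an emoji (U+1F600).
text = "a\u00e9\u20ac\U0001F600"

# Bytes per character as (UTF-8, UTF-16, UTF-32). The "-be" codec variants
# are used so that Python does not prepend a byte order mark.
sizes = [
    (len(ch.encode("utf-8")), len(ch.encode("utf-16-be")), len(ch.encode("utf-32-be")))
    for ch in text
]

# Codepoints as space-separated integers (hexadecimal chosen for illustration).
codepoints = " ".join(format(ord(ch), "x") for ch in text)


def strip_idn_prefix(label):
    """Hypothetical helper: drop the xn-- prefix, which is not part of Punycode."""
    return label[4:] if label.lower().startswith("xn--") else label


# Punycode round trip for "bucher" with a u-umlaut (U+00FC).
puny = "b\u00fccher".encode("punycode").decode("ascii")
decoded = strip_idn_prefix("xn--bcher-kva").encode("ascii").decode("punycode")

print(sizes)        # [(1, 2, 4), (2, 2, 4), (3, 2, 4), (4, 4, 4)]
print(codepoints)   # 61 e9 20ac 1f600
print(puny)         # bcher-kva
print(decoded == "b\u00fccher")  # True
```

Only the emoji, which lies outside the Basic Multilingual Plane, needs 4 bytes in UTF-16; UTF-32 uses 4 bytes for every character.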
The Raw format simply prints the string itself. Note that most valid codepoints are either unassigned or do not have a glyph in the current font; such characters may be missing or shown as a square box.
This converter is only for Unicode text. Use the binary converter for arbitrary byte sequences.
Bases
Numbers can be represented in binary, octal, decimal or hexadecimal bases. When encoding bytes, the binary, octal and hexadecimal representations are padded to a fixed length (8, 3 and 2 digits per byte, respectively). The decimal representation, and all representations of codepoints, are space-separated.
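As a rough illustration of these padding rules, the following Python sketch formats the UTF-8 bytes of a two-character string in each base. It assumes the fixed-width byte representations are concatenated without separators, which the text implies but does not state outright; the variable names are chosen for the example.

```python
# "h" followed by e-acute (U+00E9); UTF-8 yields the bytes 0x68 0xC3 0xA9.
text = "h\u00e9"
data = text.encode("utf-8")

# Fixed-width representations: 8, 3 and 2 digits per byte.
# Assumption: padded bytes are joined without a separator.
binary = "".join(format(b, "08b") for b in data)
octal = "".join(format(b, "03o") for b in data)
hexadecimal = "".join(format(b, "02x") for b in data)

# Decimal has no fixed width, so the bytes are space-separated.
decimal = " ".join(str(b) for b in data)

# Codepoints are space-separated in every base, since their width varies.
codepoints_hex = " ".join(format(ord(ch), "x") for ch in text)

print(binary)          # 011010001100001110101001
print(octal)           # 150303251
print(hexadecimal)     # 68c3a9
print(decimal)         # 104 195 169
print(codepoints_hex)  # 68 e9
```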
Base64 is a special encoding that represents byte sequences as alphanumeric characters (plus the characters /, + and =), using four characters for every three bytes.
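The four-characters-per-three-bytes ratio can be seen with Python's standard base64 module. This is a minimal sketch on arbitrary sample input, not the converter's own code.

```python
import base64

# Three bytes map to exactly four characters; shorter input is padded with "=".
three = base64.b64encode(b"Man").decode("ascii")
two = base64.b64encode(b"Ma").decode("ascii")
one = base64.b64encode(b"M").decode("ascii")

print(three)  # TWFu
print(two)    # TWE=
print(one)    # TQ==

# Decoding reverses the mapping.
roundtrip = base64.b64decode(two)
print(roundtrip)  # b'Ma'
```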
The PGP word list encodes bytes as a sequence of words, and is useful for conveying data over an audio channel.
Note that both Base64 and PGP words encode a byte stream, and cannot be used with codepoints, which are integers rather than bytes.
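A small Python sketch of why a byte-oriented encoding needs an encoding step first: a codepoint can be larger than 255, so it has to be turned into bytes by a specific Unicode encoding, and the Base64 output depends on which one is chosen. The euro sign is just a sample character.

```python
import base64

text = "\u20ac"        # euro sign
codepoint = ord(text)  # 8364, an integer rather than a byte

# A codepoint above 255 cannot be stored directly as a single byte.
try:
    bytes([codepoint])
    fits_in_byte = True
except ValueError:
    fits_in_byte = False

# The bytes, and therefore the Base64 text, differ per Unicode encoding.
b64_utf8 = base64.b64encode(text.encode("utf-8")).decode("ascii")
b64_utf16 = base64.b64encode(text.encode("utf-16-be")).decode("ascii")

print(codepoint, fits_in_byte)  # 8364 False
print(b64_utf8)                 # 4oKs
print(b64_utf16)                # IKw=
```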