TerSCII: Ternary Standard Code for Information Interchange


source: http://homepage.cs.uiowa.edu/~jones/ternary/terscii.shtml


 


Part of http://www.cs.uiowa.edu/~jones/ternary/ 
by Douglas W. Jones 
THE UNIVERSITY OF IOWA Department of Computer Science

Disclaimer: Nobody but the author endorses the use of this character set, and even he isn't so sure of it.

Abstract

The TerSCII character set is designed to play, in the world of ternary information processing systems, the role that ASCII and its compatible descendant, Unicode, play in the world of binary information processing. As coding systems, ASCII and Unicode are resolutely binary, with blocks of 16, 32 and 64 characters as their basis. This is inappropriate in the ternary world, where 9, 27 and 81 are far more likely to be relevant. In addition, just as the UTF-8 and UTF-16 codes are appropriate for representing Unicode as strings of 8-bit and 16-bit words, ternary computers will require TTF-3 and TTF-9 encodings for representing TerSCII as strings of 3-trit trybbles or 9-trit trytes.

  1. Background and Motivation
  2. The TerSCII Code
  3. The TTF-3 and TTF-9 Encodings

1. Background and Motivation

A single ternary (base 3) digit is a trit, which may take on the values 0, 1 and 2 for unsigned data, or –1, 0 and +1 for signed data. Trits are naturally grouped into triplets referred to as trybbles, where each trybble has 3^3 or 27 possible values. Where we need to represent ternary values compactly, we will use heptavintimal, that is, base 27, with the following encoding:

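Heptavintimal trybble encodings

Weight:    0   1   2   3   4   5   6   7   8
Ternary: 000 001 002 010 011 012 020 021 022
Digit:     0   1   2   3   4   5   6   7   8

Weight:    9  10  11  12  13  14  15  16  17
Ternary: 100 101 102 110 111 112 120 121 122
Digit:     9   A   B   C   D   E   F   G   H

Weight:   18  19  20  21  22  23  24  25  26
Ternary: 200 201 202 210 211 212 220 221 222
Digit:     K   M   N   P   R   T   V   X   Z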
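The digit table above can be read as a simple lookup. The following is a minimal Python sketch, not part of the original specification; the names `HEPT_DIGITS`, `to_heptavintimal` and `ternary_to_heptavintimal` are made up for illustration.

```python
# The 27 heptavintimal digits, in weight order, as listed in the table.
HEPT_DIGITS = "0123456789ABCDEFGHKMNPRTVXZ"


def to_heptavintimal(value: int, trybbles: int = 1) -> str:
    """Render a non-negative integer as a fixed number of base-27 digits."""
    if not 0 <= value < 27 ** trybbles:
        raise ValueError("value does not fit in the requested number of trybbles")
    digits = []
    for _ in range(trybbles):
        value, weight = divmod(value, 27)
        digits.append(HEPT_DIGITS[weight])
    return "".join(reversed(digits))


def ternary_to_heptavintimal(trits: str) -> str:
    """Convert a string of trits (length a multiple of 3) to heptavintimal."""
    if len(trits) % 3 != 0:
        raise ValueError("expected whole trybbles")
    return to_heptavintimal(int(trits, 3), len(trits) // 3)


print(ternary_to_heptavintimal("200121"))  # two trybbles: 200 and 121
```

Fixed-width output is used here so that a tryte always prints as exactly three heptavintimal digits.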

Three consecutive trybbles make up a tryte, able to represent a range of 27^3 or 19,683 distinct values. It is natural to pack 3 trytes or 27 trits into a word. This gives us a word that can be represented in 43 bits on a binary computer.

Representing text on such a machine using ASCII would naturally suggest using 5 trits per character, since 3^5 is 243. This is almost but not quite sufficient to represent UTF-8, but if we use 6 trits per character, we have 729 possible values, an awkward number even if we invent UTF-9 so that we can pack 512 values into each character.
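The sizes quoted in the two paragraphs above are easy to check. This short Python sketch is only a sanity check of the arithmetic; the helper name `bits_needed` is hypothetical.

```python
def bits_needed(trits: int) -> int:
    """Smallest number of bits that can hold every value of a `trits`-trit field."""
    return (3 ** trits - 1).bit_length()


print(27 ** 3)          # values in one tryte (3 trybbles)
print(bits_needed(27))  # bits needed for a 27-trit word
print(3 ** 5, 3 ** 6)   # values in 5 and 6 trits
```

Five trits fall just short of the 256 values of an 8-bit byte, while six trits overshoot the 512 values of a 9-bit unit.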

Further investigation of ASCII and Unicode shows that the code is resolutely based on powers of two. The upper and lower case equivalents of each Roman letter are separated by a difference of 32. There are provisions for 32 control characters, most of which are rarely or inconsistently used. The blocks set aside in Unicode for different alphabets are all documented in hexadecimal and most begin cleanly on addresses that are multiples of 16 or 256. In contrast, the TerSCII code is resolutely designed in terms of multiples of powers of 3. The natural code block size is 9 by 9 instead of 16 by 16.

A second motive for developing a new character set follows from serious security problems caused by Unicode. In Unicode, there are many glyphs that display identically. Numerous letters appear identically in the Roman, Greek and Russian alphabets, and there are numerous different ways of displaying accent marks, some as single glyphs that render the letter with the accent mark, and some as sequences consisting of a letter followed by the accent mark. As a result, there are many character strings that render identically but are actually quite different.

Consider, for example, the string "ТerSСΙI" which should resemble the string "TerSCII" on a web browser conforming to modern standards, but is composed of the following Unicode entities:

  • 0422 CYRILLIC CAPITAL LETTER TE
  • 0065 LATIN SMALL LETTER E
  • 0072 LATIN SMALL LETTER R
  • 0053 LATIN CAPITAL LETTER S
  • 0421 CYRILLIC CAPITAL LETTER ES
  • 0399 GREEK CAPITAL LETTER IOTA
  • 0049 LATIN CAPITAL LETTER I
This has serious security consequences, for example, when a bogus web site has a URL that renders identically to a legitimate site. TerSCII, in contrast, does not permit this. There is exactly one encoding for each glyph. Accent marks may only be encoded as combining marks, never as separate accented characters. One consequence of this is that conversion from Unicode to TerSCII is deterministic and straightforward, while conversion from TerSCII to Unicode is nondeterministic, although there are sensible heuristics that can be used to pick the most appropriate of several identical Unicode glyphs in any particular context.
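The look-alike string from the example can be inspected with Python's standard `unicodedata` module. This is an illustrative sketch; the variable and function names are made up here.

```python
import unicodedata

# The look-alike string from the example, spelled with explicit escapes,
# next to the plain Latin spelling.
spoof = "\u0422erS\u0421\u0399I"
genuine = "TerSCII"


def describe(text: str) -> list[str]:
    """List each character as 'code point in hex, Unicode name'."""
    return [f"{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(spoof == genuine)  # the two strings are different code point sequences
for line in describe(spoof):
    print(line)
```

A byte-wise or code-point-wise comparison tells the two strings apart, even though a reader looking at the rendered text usually cannot.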

2. The TerSCII Code

With 26 letters in the English alphabet and comparable numbers in other western and middle-eastern alphabets, the first power of 3 that lends itself to representing a reasonable character set is 3^4 or 81. A 4-trit character allows encoding the Roman alphabet in both upper and lower case, plus 10 digits and a modest (but insufficient) set of control characters and punctuation marks. In this environment, a code extension system comparable to that of Unicode invites a character code built on 81-character blocks.

2.1 The TerSCII Basic Roman Block

Consider the following block for the basic Roman alphabet:

 00    0    1    2    3    4    5    6    7    8
  0   ES   SP    0    9    I    R    _    i    r
  1   EL    -    1    A    J    S    a    j    s
  2   ET    '    2    B    K    T    b    k    t
  3   LR    ,    3    C    L    U    c    l    u
  4   OP    ;    4    D    M    V    d    m    v
  5   RL    :    5    E    N    W    e    n    w
  6   SU    .    6    F    O    X    f    o    x
  7   HT    !    7    G    P    Y    g    p    y
  8   SD    ?    8    H    Q    Z    h    q    z

Code  Meaning
ES    End of String, analogous to NULL
EL    End of Line, analogous to LF or CR/LF
ET    End of Text file
LR    Left to Right rendering of following text
OP    OverPrint following text on previous char
RL    Right to Left rendering of following text
SU    Shift Up (superscript) following by 1/3 baseline
HT    Horizontal Tab in current rendering direction
SD    Shift Down (subscript) following by 1/3 baseline
SP    Space

This is a meagre character set, but it is good enough to typeset the body text of a novel, if substitution of apostrophes for quote marks is acceptable. Control over rendering direction eliminates the need for backspace. The ability to overprint allows underlining and, with the addition of more characters, accent marks. Shift up and shift down can be used to superscript or subscript text.
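The block can be written down as a lookup table. The Python sketch below makes one assumption that the table itself does not state: that the numeric code is 9 times the column number plus the row number. That numbering is inferred from section 3, where HT is encoded as 021 (decimal 7) and codes 0 to 8 are described as control characters only. All names (`COLUMNS`, `CODE_OF`, `encode_text`) are hypothetical.

```python
# Columns 0..8 of block 00, each listed from row 0 to row 8.
COLUMNS = [
    ["ES", "EL", "ET", "LR", "OP", "RL", "SU", "HT", "SD"],  # control codes
    list(" -',;:.!?"),   # SP and punctuation
    list("012345678"),
    list("9ABCDEFGH"),
    list("IJKLMNOPQ"),
    list("RSTUVWXYZ"),
    list("_abcdefgh"),
    list("ijklmnopq"),
    list("rstuvwxyz"),
]

# ASSUMPTION: code = 9 * column + row (inferred, not stated by the table).
CODE_OF = {
    symbol: 9 * col + row
    for col, column in enumerate(COLUMNS)
    for row, symbol in enumerate(column)
}


def encode_text(text: str) -> list[int]:
    """Map printable block-00 characters to their assumed numeric codes."""
    return [CODE_OF[ch] for ch in text]


print(encode_text("TerSCII"))
```

Under this assumed numbering, each lower-case letter sits exactly 27 codes above its upper-case equivalent, in the same way that ASCII separates the two cases by 32.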

Some rules apply to overprinting: All characters following an OP control code will overprint until the next LR or RL control code. No other overprinting mechanism is present. Specifically, there is no equivalent of the ASCII CR, which, on printing terminals, allowed the following characters to overprint (from left to right) an entire line of text. Instead, if an EL is encountered in LR mode, the next line is rendered starting at the left, and in RL mode, EL starts rendering the next line at the right. Where supported, changes from LR to RL mode in midline should operate so that "RL a b c d" should render identically to "RL a LR c b RL d" and both should print as "abcd".

2.2 The TerSCII Extended Roman Block

The next block we add provides characters that were missing from the above and that are useful for Western European languages.

 01    0    1    2    3    4    5    6    7    8
  0    ‘
  1    *
  2    ’
  3    /
  4    |
  5    \
  6    ‹
  7    ◊
  8    ›

Note that double quotes are merely pairs of single quotes. This eliminates the distinction between ‘‘this’’ (quoted with pairs of single quotes) and “that” (quoted with double quotes).

3. The TTF-3 and TTF-9 Encodings

The basic TerSCII character set can be encoded in 4-trit quartets, but addressing 4-trit units on a ternary computer is as difficult as addressing 6-bit units on a binary machine. 6 or 9 trits per character make far more sense. The Trinary Text Formats TTF-3 and TTF-9 use these sizes. These formats borrow some ideas from the UTF-8 encoding of Unicode, but they do so without threatening to create any degree of compatibility.

TTF-3 encodes each character as a sequence of one or more 3-trit trybbles. The leading trit of each trybble indicates whether that trybble is a stand-alone character, the first trybble of a long character, or a subsequent trybble of a long character. Each trybble carries 2 trits of the character representation.

codes        trybble 0   trybble 1   trybble 2   trybble 3   blocks
trit         0  1  2     3  4  5     6  7  8     9  10 11
0 – 8        0  t1 t0                                        0: only control characters
9 – 80       2  t3 t2    1  t1 t0                            0: basic Roman except control characters
81 – 728     2  t5 t4    1  t3 t2    1  t1 t0                1 to 8
729 – 6560   2  t7 t6    1  t5 t4    1  t3 t2    1  t1 t0    9 to 80

As with Unicode, each character must be encoded using its shortest encoding. Thus, while HT (Horizontal Tab) could be encoded as 200 121 (KG in heptavintimal) or even 200 100 121 (K9G in heptavintimal), we require it to be encoded as 021 (7 in heptavintimal). This constraint plus our encoding scheme guarantees that simple trybble-by-trybble comparison of two strings in their TTF-3 form will alphabetize them as if the characters had been fully expanded into their canonical fixed-size representation.

Unlike Unicode, the first trybble of the longer character encodings does not indicate the length of the character. This scheme can, potentially, be stretched to arbitrary-length codes, but we arbitrarily declare that any character encoding with more than 8 trybbles is illegal. This sets an excessively generous upper bound on the size of the character set and permits encoding of any data that can be encoded in TTF-9.
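A minimal Python sketch of the TTF-3 rules described above follows. It assumes the layout read from the table: leading trit 0 for a stand-alone trybble, 2 for the first trybble of a long character, 1 for each subsequent trybble, with two payload trits per trybble. The function names are hypothetical, and trybbles are shown as 3-character strings of trits for readability.

```python
def _pair(digit: int) -> str:
    """Two-trit payload for one base-9 digit (0..8)."""
    return f"{digit // 3}{digit % 3}"


def ttf3_encode(code: int) -> list[str]:
    """Shortest TTF-3 encoding of one character code, as ternary trybbles."""
    if not 0 <= code < 9 ** 8:
        raise ValueError("code would need more than 8 trybbles")
    if code < 9:
        return ["0" + _pair(code)]           # stand-alone trybble
    digits = []
    while code:
        code, low = divmod(code, 9)
        digits.append(low)
    digits.reverse()                          # most significant pair first
    return ["2" + _pair(digits[0])] + ["1" + _pair(d) for d in digits[1:]]


def ttf3_payload(trybbles: list[str]) -> int:
    """Value carried by a trybble sequence, ignoring the leading trits."""
    value = 0
    for trybble in trybbles:
        value = value * 9 + int(trybble[1:], 3)
    return value


def is_shortest(trybbles: list[str]) -> bool:
    """True if the sequence is the required shortest form of its value."""
    return ttf3_encode(ttf3_payload(trybbles)) == trybbles


print(ttf3_encode(7))                 # HT
print(is_shortest(["200", "121"]))    # the overlong form of HT from the text
```

Because eight trybbles carry sixteen payload trits, the limit of 9^8 codes matches what two TTF-9 trytes can hold.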

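TTF-9 encodes each character as a sequence of one or more 9-trit trytes. The first trit of each tryte indicates its place in the encoding: 0 for a single-tryte character, 2 for the first tryte of a two-tryte character, and 1 for the second tryte.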

codes               tryte 0 (trits 0 to 8)            tryte 1 (trits 9 to 17)     blocks
0 – 6560            0 t7 t6 t5 t4 t3 t2 t1 t0                                     0 to 80
6561 – 43046720     2 t15 t14 t13 t12 t11 t10 t9 t8   1 t7 t6 t5 t4 t3 t2 t1 t0   81 and up

This encoding scheme is generous, allowing for about 43 million (3^16 = 43,046,721) distinct character codes, considerably more than Unicode's upper limit. Like UTF-8 and TTF-3, TTF-9 allows lexical sorting of strings based on their full TerSCII representation while doing successive comparisons one tryte at a time.
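A TTF-9 encoder following the table above might look like the sketch below. It assumes that layout (leading trit 0 for a single tryte, 2 then 1 for a two-tryte character, eight payload trits per tryte); the names `to_trits` and `ttf9_encode` are made up for illustration, and trytes are shown as 9-character strings of trits.

```python
def to_trits(value: int, width: int) -> str:
    """Fixed-width base-3 rendering of a non-negative integer."""
    digits = []
    for _ in range(width):
        value, trit = divmod(value, 3)
        digits.append(str(trit))
    return "".join(reversed(digits))


def ttf9_encode(code: int) -> list[str]:
    """TTF-9 encoding of one character code as one or two 9-trit trytes."""
    if not 0 <= code < 3 ** 16:
        raise ValueError("code does not fit in two trytes")
    if code < 3 ** 8:
        return ["0" + to_trits(code, 8)]
    high, low = divmod(code, 3 ** 8)
    return ["2" + to_trits(high, 8), "1" + to_trits(low, 8)]


print(ttf9_encode(7))      # HT fits in a single tryte
print(ttf9_encode(6561))   # first code that needs two trytes
```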

Because TTF-3 encodes the common Roman characters in just 2 trybbles while TTF-9 encodes them in 3, TTF-3 should be more compact for European languages. It should remain competitive even where characters in blocks 1 to 8 dominate because of the efficient encoding of spaces and control characters.
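The size comparison can be made concrete by counting trybbles per character under each format. The Python sketch below assumes the code ranges given in the two encoding tables; the function names and the sample codes are hypothetical.

```python
def ttf3_trybbles(code: int) -> int:
    """Trybbles used by the shortest TTF-3 encoding of a code."""
    if code < 9:
        return 1                      # stand-alone control character
    count = 0
    while code:
        code //= 9                    # one trybble per base-9 digit
        count += 1
    return count


def ttf9_trybbles(code: int) -> int:
    """Trybbles used by TTF-9: one tryte (3 trybbles) or two (6)."""
    return 3 if code < 3 ** 8 else 6


# Arbitrary sample: three block-0 printable codes and one control code.
sample = [28, 55, 9, 1]
print(sum(map(ttf3_trybbles, sample)), sum(map(ttf9_trybbles, sample)))
```

For codes in blocks 1 to 8 (81 to 728) both formats use three trybbles, so any saving in such text comes from the shorter block 0 characters mixed in with it.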

