On type & fonts
Readable structure, unreadable text
25 May 2010
One advantage of XML, even a design goal according to the specification, is that it is human readable as regards data structure. [1] [2] As a consequence, however, data itself turns out to be less readable:
Data usually means text, that is, a string of characters, accordingly called "character data" in the specification. What makes XML readable in terms of data structure is that this structure is represented via markup, which is text too. This implies that some characters may operate on two levels: first, the level of data, and second, the (meta) level of data structure, where they are constitutive of markup since they serve to indicate tags, attribute assignment, etc. [3] To avoid ambiguity, those few characters that are constitutive of markup must be escaped when appearing in data, i.e. replaced by a somewhat longer descriptive character string (< and & becoming "&lt;" and "&amp;"). This makes data or text less readable and, I feel tempted to say, corrupts the text.
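To make the loss of readability concrete, here is a minimal Python sketch. It assumes the standard library's xml.sax.saxutils helpers as a stand-in for whatever an XML serializer does; the sample string is made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

# A made-up piece of character data that happens to contain markup characters.
text = 'if a < b && b < c: print("R&D")'

# escape() replaces &, < and > with their entity references.
escaped = escape(text)
print(escaped)

# The round trip is lossless, but the stored form is harder to read.
assert unescape(escaped) == text
```

The escaped form stores every & as &amp; and every < as &lt;, so three ampersands and two less-than signs turn into five entity references.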
This is not news but worth repeating. The ambiguous role of some characters, being part of data and representing data structure, is problematic.
Introducing Data Format Non-Characters
There may be a simple solution for making readable, i.e. text-based, data formats more readable and ensuring that all text can be expressed as a string of Unicode characters without need for escaping. Unicode already knows a range of non-characters. These are located near the end of the 16-bit code range (the Basic Multilingual Plane), at FDD0..FDEF, are meant for internal use in applications, and are "private" as regards the meaning of individual non-characters. [4] In analogy to these non-characters, Unicode could reserve code points for non-characters for use in data formats. Such Data Format Non-Characters are "private" too at the level of Unicode. Their meaning is expected to be defined in individual data formats' specifications. [5]
I leave it open which code points to reserve for them, or how many. A range like FD40..FD4F suggests itself: it is a full column and, moreover, on the same code chart page as the existing non-characters.
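The two kinds of code points can be told apart with a few lines of Python. This is only a sketch: the first function reflects what Unicode defines today (FDD0..FDEF plus the last two code points of each of the 17 planes, which the text above does not discuss), while PROPOSED_DATA_FORMAT_RANGE and is_data_format_noncharacter are hypothetical names for the range floated above, not anything Unicode has reserved.

```python
def is_noncharacter(cp: int) -> bool:
    """True for code points that Unicode designates as noncharacters."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: xxFFFE and xxFFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


# Hypothetical: the range suggested in the text, not reserved by Unicode.
PROPOSED_DATA_FORMAT_RANGE = range(0xFD40, 0xFD50)


def is_data_format_noncharacter(cp: int) -> bool:
    """True for code points in the hypothetical Data Format range."""
    return cp in PROPOSED_DATA_FORMAT_RANGE


# Count the existing noncharacters across the whole code space.
count = sum(is_noncharacter(cp) for cp in range(0x110000))
```

Here count comes out as 66: the 32 code points in FDD0..FDEF plus two per plane.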
XML, for example, needs only very few such non-characters to do the same job, unambiguously, as < > = " do now. Everything else (tag names, property names, etc.) can be described with normal characters. The same would be true for other text-based formats.
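A rough sketch of what this could look like, in Python. The code points assigned to TAG_OPEN and TAG_CLOSE are hypothetical picks from the FD40..FD4F range; neither Unicode nor XML defines them, and the tokenizer is deliberately simplistic (tags only, no attributes).

```python
import re

# Hypothetical assignments; a real data format would fix these in its spec.
TAG_OPEN, TAG_CLOSE = "\uFD40", "\uFD41"

# A tag is anything between the two delimiter code points.
TAG = re.compile(f"{TAG_OPEN}([^{TAG_CLOSE}]*){TAG_CLOSE}")


def tokenize(source: str) -> list[tuple[str, str]]:
    """Split source into ("tag", name) and ("text", data) tokens."""
    tokens, pos = [], 0
    for match in TAG.finditer(source):
        if match.start() > pos:
            tokens.append(("text", source[pos:match.start()]))
        tokens.append(("tag", match.group(1)))
        pos = match.end()
    if pos < len(source):
        tokens.append(("text", source[pos:]))
    return tokens


# The data contains <, & and " as plain characters, with no escaping.
doc = f'{TAG_OPEN}note{TAG_CLOSE}a < b & "c"{TAG_OPEN}/note{TAG_CLOSE}'
tokens = tokenize(doc)
```

Because the delimiters can never occur in ordinary text, the text token comes back exactly as written: a < b & "c".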
Visual representation of Data Format Non-Characters
Visual representation of Data Format Non-Characters is not standardized by Unicode either. For visual representation, I expect Data Format Non-Characters to be mapped to representative normal characters from which they borrow shapes. This mapping, too, is part of each data format's specification. [6]
XML, for example, would map Non-Character code points for tag delimiters, assignment and enclosure to < > = " as representative characters.
Just as text editors do syntax coloring today, they would know a data format's mapping and represent Data Format Non-Characters by representative normal characters but highlight them visually, e.g. by coloring or bolding, so that a user knows that these are not normal characters.
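As a sketch of how an editor might do this, the following Python function replaces each Data Format Non-Character by its representative character and, optionally, wraps it in ANSI bold escape sequences for a terminal. The REPRESENTATIVE table and its four code points are hypothetical; in the proposal, each data format's specification would supply the mapping.

```python
# Hypothetical mapping for an XML-like format: code point -> borrowed shape.
REPRESENTATIVE = {"\uFD40": "<", "\uFD41": ">", "\uFD42": "=", "\uFD43": '"'}

# ANSI escape sequences for bold on/off, understood by most terminals.
BOLD, RESET = "\x1b[1m", "\x1b[0m"


def render(source: str, highlight: bool = True) -> str:
    """Return a displayable version of source."""
    out = []
    for ch in source:
        shape = REPRESENTATIVE.get(ch)
        if shape is None:
            out.append(ch)  # normal character, shown as is
        elif highlight:
            out.append(f"{BOLD}{shape}{RESET}")  # markup, shown in bold
        else:
            out.append(shape)
    return "".join(out)


plain = render("\uFD40b\uFD41 1 < 2", highlight=False)
styled = render("\uFD40b\uFD41 1 < 2")
```

Without highlighting, markup and data become indistinguishable on screen (plain is "<b> 1 < 2"), which is exactly why the visual distinction matters.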
Text editors may even, as a user is editing data of such a format, map Data Format Non-Characters to the representative characters' keys (assuming that the latter are more likely to be used when editing such files) and give access to normal ones via shortcuts. Or the other way round. I am merely pointing out options.
Advantages
Data Format Non-Characters would remove the ambiguity described above. Data Format Non-Characters would be encoded by Unicode yet explicitly considered as not being characters. Representation of data structure would be expected to involve Data Format Non-Characters at least as delimiters, to clearly separate markup from content. And data could, again, make use of the full repertory of characters.
I am not thinking of XML and related formats alone. Actually, any text-based format that needs to distinguish between data and data structure would benefit from such non-characters, including CSV, which could use two of them as column and line separators.
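The CSV case is simple enough to sketch in a few lines of Python. COLUMN_SEP and LINE_SEP are hypothetical code points chosen for illustration, and dump and load are made-up helper names, not part of any existing CSV library.

```python
# Hypothetical separator code points; a format spec would define the real ones.
COLUMN_SEP, LINE_SEP = "\uFD44", "\uFD45"


def dump(rows: list[list[str]]) -> str:
    """Serialize rows without quoting or escaping any field."""
    return LINE_SEP.join(COLUMN_SEP.join(row) for row in rows)


def load(data: str) -> list[list[str]]:
    """Parse data produced by dump() back into rows."""
    return [line.split(COLUMN_SEP) for line in data.split(LINE_SEP)]


# Fields may contain commas, quotes and line breaks as ordinary text.
rows = [["name", "quote"], ["Ada", 'said "a, b"\nand left']]
assert load(dump(rows)) == rows
```

Conventional CSV needs quoting rules for exactly the field shown in the second row; here the parser is two calls to split.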
Postscriptum
Why not use existing FDD0..FDEF non-characters? According to Unicode, applications "should never attempt to exchange them." And if they find such a non-character in a text they should "take appropriate action, such as removing it from the text." [7] This implies that applications which are ignorant of the special use of Data Format Non-Characters might open a file of a data format that uses them, remove non-characters, and thus corrupt the file.
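The failure mode is easy to reproduce. In this Python sketch, strip_noncharacters is a hypothetical stand-in for what a format-unaware application might do on opening a file, and the separator is deliberately taken from the existing FDD0..FDEF range.

```python
def strip_noncharacters(text: str) -> str:
    """Remove every existing Unicode noncharacter from text."""
    return "".join(
        ch
        for ch in text
        if not (0xFDD0 <= ord(ch) <= 0xFDEF or (ord(ch) & 0xFFFE) == 0xFFFE)
    )


# A record whose column separator is an existing noncharacter.
COLUMN_SEP = "\uFDD0"
record = COLUMN_SEP.join(["Ada", "Lovelace"])

# A well-meaning cleanup pass silently destroys the column boundary.
damaged = strip_noncharacters(record)
```

After the cleanup, damaged is "AdaLovelace": the data survives, but the structure is gone.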
[2] In my own terminology, "data" refers to pure data and "document" refers to data-as-represented or representation-of-data. Going by my distinction, XML deals solely with the pure data side, leaving the representation side to CSS or XSLT. In this respect, what the XML specification calls "XML documents" I would simply call "XML files" when the emphasis is on the physical side.
[4] The Unicode Consortium: The Unicode 5.0 Standard. Addison-Wesley 2007, pp 899 (code chart), 57, 549f (info).
The same online: code chart holding non-characters in Arabic Presentation Forms-A (pp 1190 & 1203f, PDF-pp 3 & 16f), info in chapters General structure (p 50, PDF-p 45, follow bookmark "2.13 Special characters and noncharacters") and Special areas and format characters (p 514f, PDF-p 17f, follow bookmark "16.7 Noncharacters").
The latter explicitly says, "noncharacters can be thought of as application-internal private-use code points" and "have no interpretation whatsoever outside of their possible application-internal private uses".
[5] Summing up: Existing non-characters for application-internal use and non-characters for data formats alike are "private" at the level of Unicode. While the meaning of the former is defined at the level of individual applications ("proprietary"), the meaning of the latter is expected to be standardized at the level of the data formats that make use of them.
[6] In contrast to non-characters for application-internal use, Data Format Non-Characters are explicitly meant for exchange. Applications should interpret them as defined in the data format's specification. If they do not know the respective data format, they should refuse to open the file since they are not able to represent Data Format Non-Characters properly.
All texts & images, unless noted otherwise:
Copyright © Karsten Luecke
All rights reserved.