Digital Document Quarterly: Perspectives on Trustworthy Information
Volume 2, Number 4, 4Q2003
HMG Consulting, 20044 Glen Brae Drive, Saratoga, CA 95070
© 2003, H.M. Gladney. ISSN: 1547-8610
Riddles can be instructive. If you agree, you might try the following before you look at the answers given later in this DDQ number.
In each case below, the assignment is to provide further members of the set or sequence given.
1) O, T, T, F, F, S, S, E, ...
2) 79, 72, 66, 59, 50, 42, 34, 28, 23, 14, ...
3) cherry, apple, rhubarb, plum, beet, Japanese maple, ...
4) 3, 7, 11, 15, 19, ...
Digital preservation is critical to most of the history of the future.[1] This expectation justifies every practical effort to ensure that the technical and administrative methodology used is sound and widely understood.
To examine the criteria that should be used, we have returned to early 20th-century epistemology. The thinking of Ludwig Wittgenstein and his successors teaches the importance of sharing definitions that are precise enough to minimize community confusion. It further teaches that we should pay diligent attention to the boundary between what can be specified precisely (what is objective, and can therefore be automated) and what must forever remain issues of human values, opinions, and imperfectly communicated intentions (what is subjective, and therefore cannot be automated).
Such philosophic distinctions are critical to issues of trust and of preserved information authenticity. The latest member of our Trustworthy 100-Year Digital Objects series identifies essential criteria for any approach to long term digital preservation. Subtitled Syntax and Semantics—Tension between Facts and Values, its abstract reads:
Prior Trustworthy 100-Year Digital Object articles describe a method for preserving digitally represented information. Trustworthy Digital Object (TDO) representation and packaging makes any digital content reliably meaningful to consumers, no matter how distant these are in time, in space, and in social affiliation from their information sources. The current article focuses on digital document authenticity and on evidence a consumer can use to decide whether to trust the content.
Such considerations are necessarily epistemological. Arguing the issues must start by conveying as unambiguously as possible what we mean by words like ‘authenticity’ and ‘evidence’ and by distinguishing between such words as ‘objective’ and ‘subjective’. These arguments apply Wittgenstein’s teaching to pictorial models of digital and conventional communication.
The analysis leads us to identify an ethical imperative for digital preservation, and to suggest that the TDO method defines a quality standard against which any method of digital preservation should be judged.
A preprint copy is available on e-mail request. I would welcome critical commentary.
Trouble-free transfer of textual information across otherwise incompatible digital platforms depends on proper handling of character representation. This topic has become surprisingly complicated, partly because incomplete protocols were widely used before a standard became available. (Declaring a fresh start that discarded old data and obsolete tools would be impractical.)
Conceptually, the topic is relatively simple. However, since for many years I did not understand it clearly, I presume that some readers might welcome a brief explanation.[2]
Unicode/UCS is a function from natural numbers in [0, 2^31 - 1] (31-bit integers) to character names.
UTF-8 is the most popular of several ways of representing Unicode text so that it takes less storage space than would be required if each character were represented by a 32-bit word (4 bytes).
A glyph is a picture for displaying and/or printing a visual representation of a character.
A font is a set of glyphs for some Unicode subset, with stylistic commonalities in order to achieve a pleasing appearance when many glyphs combine to represent a text. [3]
A code point is the number or index that uniquely identifies a Unicode character.
The meaning of a character and the picture of the character are distinct. Generally there are many pictures that mean the same character, one for every font variation. For instance, "the first letter in the Latin alphabet" (a meaning) can be depicted by any of [glyph images of that letter in several fonts].
In a digital machine, the encoding is a string of zeros and ones (or on and off indications, or true and false indications; these are different ways of saying the same thing). Such a string can be viewed as the binary encoding of a number; for instance, ‘10100’ in binary is the same number as ‘20’ in decimal.
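The correspondence between a bit string and the number it encodes can be checked directly. The following is a minimal Python sketch; the variable names are illustrative only.

```python
# Interpret the bit string from the text as a base-2 number.
bits = "10100"
value = int(bits, 2)

# Rebuild the same value from place values: 1*16 + 0*8 + 1*4 + 0*2 + 0*1.
place_value_sum = sum(
    int(bit) << position for position, bit in enumerate(reversed(bits))
)

print(value)            # 20
print(place_value_sum)  # 20
print(format(20, "b"))  # 10100
```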
Unicode characters are intended to represent the written forms of all the world’s major languages.[4] ‘Unicode’ is an informal name for the ISO 10646 international standard that defines the ‘Universal Character Set (UCS)’. Relative to all other character standards, UCS is a superset[5] that guarantees round-trip compatibility. (No information will be lost if you convert any text string to UCS and then back to the original encoding.) Key Unicode and character representation concepts are illustrated by:
Code point (hex) | Unicode name | Storage representation (UTF-8, binary) | Human presentation
002D | HYPHEN-MINUS | 00101101 | -
2010 | HYPHEN | 11100010 10000000 10010000 | ‐
2013 | EN DASH | 11100010 10000000 10010011 | –
2212 | MINUS SIGN | 11100010 10001000 10010010 | −
00E9 | LATIN SMALL LETTER E WITH ACUTE | 11000011 10101001 | é é é é
01A9 | LATIN CAPITAL LETTER ESH | 11000110 10101001 | Ʃ
03A3 | GREEK CAPITAL LETTER SIGMA | 11001110 10100011 | Σ Σ Σ Σ Σ
2211 | N-ARY SUMMATION | 11100010 10001000 10010001 | ∑ ∑ ∑ ∑
0633 | ARABIC LETTER SEEN | 11011000 10110011 | س
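The name and storage columns of such a table can be derived mechanically from a code point. The following is a minimal Python sketch, assuming that the storage representation is UTF-8; the helper name `describe` is hypothetical, while `unicodedata.name` and `str.encode` are standard-library calls.

```python
import unicodedata


def describe(code_point: int) -> tuple[str, str, str]:
    """Return (hex code point, Unicode name, UTF-8 bytes written as bits)."""
    char = chr(code_point)
    name = unicodedata.name(char)
    bits = " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))
    return f"{code_point:04X}", name, bits


for cp in (0x002D, 0x2010, 0x2013, 0x2212, 0x00E9, 0x01A9, 0x03A3, 0x2211, 0x0633):
    print(" | ".join(describe(cp)))
```

Code points below 128 occupy a single byte in UTF-8, which is why HYPHEN-MINUS is stored in eight bits while the other hyphen-like characters need three bytes each.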
Epistemological analysis of Unicode and of fonts: Unicode defines a function from code points (integers) to names (ASCII character strings), saying nothing about how the integers should be represented by binary code or about how the characters should be depicted by glyphs or sound when spoken.[7] Unicode character names are surrogates for conceptual objects. They are also mnemonics by virtue of being well-known English words. How can the Unicode definition, in itself, be useful?
A character takes its meaning from how it is used, not from the appearance of any associated picture (glyph). For instance, a ‘PARENTHESIS, LEFT’ signals the start of a delimited string. Provided that a glyph used in formatted text is understood to mean ‘PARENTHESIS, LEFT’, it is almost irrelevant whether it looks like ‘(’, ‘(’, or ‘(’ (the same character set in different fonts).
The identity of characters is defined by the first two columns illustrated in the table above, and different characters might, in some fonts, have identical glyphs. For instance, ASCII contains characters with multiple uses; its ‘hyphen’ is used also for ‘minus’ and as a ‘dash’. In contrast, Unicode defines ‘hyphen’ and ‘minus’ (as well as different dash characters). For compatibility, the old ASCII character is preserved in Unicode also (in the old code position, with the name ‘hyphen-minus’).
Why might a distinction between ‘hyphen’ and ‘minus’ be important if their glyphs are identical in many fonts? Although the distinction might be unimportant for print and display appearance, it is almost surely critical in programs such as those that include sorting and searching. For instance, when I search for minus signs, I prefer not to be distracted by hyphens.
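The effect on searching can be sketched in a few lines of Python. The sample sentence and the helper name `count_by_name` are invented for illustration; the three characters are written as escapes because their glyphs may look alike.

```python
import unicodedata

HYPHEN_MINUS = "\u002D"  # the old ASCII character
HYPHEN = "\u2010"
MINUS_SIGN = "\u2212"

# Invented sample text that contains one of each character.
text = (
    f"A well{HYPHEN}known identity: 7 {MINUS_SIGN} 3 = 4 "
    f"(typed in ASCII as 7 {HYPHEN_MINUS} 3)."
)


def count_by_name(haystack: str) -> dict[str, int]:
    """Count each hyphen-like character, keyed by its Unicode name."""
    return {
        unicodedata.name(ch): haystack.count(ch)
        for ch in (HYPHEN_MINUS, HYPHEN, MINUS_SIGN)
    }


# A search for the minus sign finds only the minus sign, never a hyphen.
minus_positions = [i for i, ch in enumerate(text) if ch == MINUS_SIGN]

print(count_by_name(text))
print(minus_positions)
```

Because the comparison is made on code points rather than on glyphs, the search result does not depend on how any font happens to draw the characters.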
When a text file is sent to an application, how does the application know what character coding[8] is being used? The answer is that applications that support Unicode typically require that their input files have header records that identify the encoding. For instance, a proper XML header record is:
<?xml version="1.0" encoding="utf-8"?>
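An application might read such a header before decoding the rest of the file. The following Python sketch makes that step concrete. It assumes that the declaration itself is written in an ASCII-compatible encoding (so it would not handle UTF-16 input) and that a file without a declaration should be treated as UTF-8. The function name `declared_encoding` and the sample documents are hypothetical.

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding attribute.
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML header, or `default` if absent."""
    match = _DECLARATION.match(data)
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


utf8_doc = '<?xml version="1.0" encoding="utf-8"?><t>caf\u00e9</t>'.encode("utf-8")
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><t>caf\u00e9</t>'.encode("latin-1")

for doc in (utf8_doc, latin1_doc):
    encoding = declared_encoding(doc)
    print(encoding, doc.decode(encoding))
```

Both sample documents decode to the same text although their bytes differ, which is the purpose of the header record.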
Invest to Save, Report and Recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation is now available in PDF format from the Delos website. It recommends[9] research into:
Emerging Research
1A: Repository development for existing models, for repositories for software and file format specifications, and management of peripheral devices.
1B: Cheap, long-lasting, efficient and verifiable storage media
1C: Generic devices capable of reading diverse classes of media
1D: Identify how the emergence of newer classes of storage devices will change digital entity encoding formats
1E: Descriptive language for the performance and behavior of preserved digital entities
1F: Inquiry into context sensitivity, risk awareness and proper preservation behavior.
1G: Accelerated ageing of media, systems and software, for predicting risks to digital objects
1H: Semantics to represent temporal, procedural and spatial relationships of digital entities

Re-engineering
2A: Modeling digital preservation processes
2B: Automation of digital preservation processes
2C: Detecting trustworthiness and information quality
2D: Scalability of long-range archives
2E: Characterization of collection completeness
2F: Distributed and grid storage

Systems
3A: Formats of digital entities
3B: Managing complex and dynamic digital entities
3C: Automated metadata creation
3D: Long-term metadata viability
3E: Multilingual entities and technology
3F: Impact of preservation strategies on information loss
3G: Repurposing e-content
Although I contributed as a member of this workgroup, I could not agree with all its recommendations, and the format of the final report did not include dissenting opinions. Partly for this reason I talked again with an IBM Research expert on the storage device industry.
The report’s list of 21 recommended topics is too long, with the consequence that less promising topics might divert attention and resources from those that promise rapid, effective, and durable progress. If we give credence to expressions of urgency for digital preservation action that must also conform to reliably sound technological practice,[10] we should avoid the distraction and loss of focus that attention to unpromising topics will surely create.
For instance, the 1B, 1C, and 1D recommendations deal with topics whose reduction to practice would have to be handled by industrial enterprises. Close collaboration across disciplines and across enterprise types would be essential. Such collaboration is not evidenced by the recommendations, which read: [11]
“1B: Archival Media: To bring new classes of technology to bear on the recovery, reconstruction and interpretation of the meaning represented by bitstreams, they need to be encoded in preservation formats and on ‘archival media’. Research into generating cheap, long-lasting, efficient and verifiable media for storing the bitstreams is needed.
“1C: Salvage and Rescue: Preservation strategies depend upon our ability to access storage media over time. While we know that some storage media can have a shelf life of thirty years or more, the devices for reading particular classes of media tend to have much shorter life-spans, often only a couple of years. While a peripheral device repository might help here (see above), generic devices capable of reading diverse classes of media are needed to address peripheral device obsolescence.
“1D: Storage abstractions: Preservation systems map between the operations that can be done on digital entity encoding formats and the operations that are supported by storage repositories. As newer classes of storage devices are developed research will be necessary to identify how their emergence will change digital entity encoding formats to take advantage of content-based addressing and parallel processing of data. …”
To allocate scarce research grant funds to these topics would be unnecessary and ineffective. As written, they fail to reflect well-known engineering and business facts, such as:
1) Achieving low unit price for a technology depends on finding or creating a large market—an unlikely prospect for a digital storage subsystem specialized for long term retention, unless it also happened to provide competitive storage density and read/write speed. Industrial participants have conducted, and continue to conduct, a sophisticated program seeking optimal combinations of durability, density, speed, and price. Only products of such processes are likely to offer prices that digital preservation programs can afford.
2) Looking for durable storage media as an isolated technical objective makes little sense. High performance solutions invariably require matched media, read-write heads, mechanics, packaging, and microcode. Today’s early-phase cost (for prototypes good enough to attract product managers) of a new storage technology is between $10M and $100M—well beyond what the NSF has typically awarded.[12]
3) Storage devices are typically packaged and sealed against dirt and damage.[13] After they leave their factories, the only non-destructive means of accessing their content is through their electrical connections. These support data stream protocols that are independent of the storage media and almost independent of device characteristics. Modern operating system software hides raw device characteristics from all higher level software.[14]
4) It is easy to copy even large amounts of data from aging devices to their replacements inexpensively with low error rates so that media risks are dwarfed by unrelated preservation risks.[15]
It might be possible to reformulate 1B, 1C, and 1D to avoid such problems. Doing so will require information and skills mostly to be found in industrial R&D laboratories.
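The copying described in point 4 can be checked mechanically. As a minimal Python sketch, a copy can be verified by comparing cryptographic digests of the old and new files. The function names, file names, and sample data are all hypothetical, and a real migration would work at the level of whole devices or archives rather than a single file.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Digest a file in chunks so that large files need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, target: Path) -> bool:
    """Copy source to target and report whether the two digests agree."""
    shutil.copyfile(source, target)
    return sha256_of(source) == sha256_of(target)


with tempfile.TemporaryDirectory() as workdir:
    old = Path(workdir) / "aging_device.bin"
    new = Path(workdir) / "replacement_device.bin"
    old.write_bytes(bytes(range(256)) * 4096)  # 1 MiB of sample data
    verified = copy_and_verify(old, new)

print(verified)  # True
```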
The preservation literature often compares the lifetime of practical digital storage to that of paper. Two digital storage media are known to be as durable as paper. The first, single crystal nickel, has been pursued in the Long Now Foundation’s Digital Rosetta Project, which enjoys some NSF funding. The second digital medium as durable as paper is, in fact, paper![16]
Two-dimensional bar code printing technology is available commercially.[17] Such digital paper technology has not been diligently pursued for digital preservation. We wonder, “Why not?”
The sequence or set extensions I had in mind are shown as the tails of:
1) O, T, T, F, F, S, S, E, N, T, E, T, T, …
2) 79, 72, 66, 59, 50, 42, 34, 28, 23, 14, Christopher, Houston, Canal, …
3) cherry, apple, rhubarb, plum, beet, Japanese maple, stop sign.
4) 3, 7, 11, 15, 19, 0, 4, 8, …
By sending these puzzles to 30 friends, I gained confidence in a conjecture that few people will guess such responses, and that some will object that the test is unfair. How might my answers make sense?
The answer (1) is given more often by children than by adults, as the sequence I had in mind was the first letters of the English words for the natural numbers: ‘one’, ‘two’, ‘three’, ‘four’, … This illustrates that the answers to such riddles depend on shared experience and shared context. Perhaps children are more likely than adults to answer this one correctly because they have relatively few possibilities to explore.
Residents of New York City might provide the answer (2) from their common experience, because what I had in mind were subway stations of the 8th and Broadway line. This sequence is finite, unlike that in (1).
What I had in mind for (3) is also finite. However, only one member is required (or allowed) to complete the full set intended. What every object has in common is that it is partly colored red.
Mathematicians are likely to recognize that what I had in mind in puzzle (4) was ‘(3 + 4n) mod 23, with n ranging over 0, 1, 2, …’. Readers unfamiliar with modulo arithmetic might recognize that they use it in real life whenever they consider rotation or time-of-day; rotating an object 270° (270 degrees) clockwise leaves it facing the same way as rotating it 90° counterclockwise, and the time-of-day 24 hours from now will be the same as the time-of-day right now. For a digital computer that represents any integer in a fixed-length memory cell, modulo arithmetic is essential; adding 256 to an integer represented in one byte (an 8-bit binary string) does not change the value represented.
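One rule that reproduces the printed sequence 3, 7, 11, 15, 19, 0, 4, 8 is (3 + 4n) mod 23 for n = 0, 1, 2, …, since 19 + 4 = 23 wraps around to 0. The following Python sketch checks that rule together with the rotation, time-of-day, and one-byte examples; the function name `puzzle_four` and its parameters are illustrative only.

```python
def puzzle_four(count: int, start: int = 3, step: int = 4, modulus: int = 23) -> list[int]:
    """First `count` terms of (start + step * n) mod modulus, n = 0, 1, 2, ..."""
    return [(start + step * n) % modulus for n in range(count)]


print(puzzle_four(8))  # [3, 7, 11, 15, 19, 0, 4, 8]

# Rotation: 270 degrees clockwise equals 90 degrees counterclockwise (-90).
print(270 % 360 == -90 % 360)  # True

# Time of day: 24 hours after 15:00 it is 15:00 again.
print((15 + 24) % 24)  # 15

# One byte holds values 0..255, so adding 256 leaves the stored value unchanged.
print((200 + 256) % 256)  # 200
```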
In each of the puzzles, the set members I had in mind shared some attribute: respectively being first letters of certain words, being related places, having a color in common, and conforming to an arithmetical rule. However, a priori shared attributes are not needed to choose some particular set or sequence. If you offered to pay me handsomely to load a truck, but provided no further specification, I would return with a truckload (or set) of objects (members) that you could not have predicted.
What’s going on? Notice that each puzzle answer includes “what I had in mind”. The presentation of each riddle was what Wittgenstein calls ‘ein Bild’—a ‘picture’ or ‘model’.[18] Each symbolizes something I had in mind—some thought or concept. However, to communicate a thought precisely and accurately is difficult. We must be sensitive to this difficulty if we wish to achieve the most economical reliable long-term digital preservation.
The core problem is a mathematical fact: suppose we are told that a set B is a subset of another set A, and are also given a tabulation of all the members of B. What does this tell us about members of A that are not also members of B?