Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 2, Number 4, 4Q2003

 

 

 

HMG Consulting

20044 Glen Brae Drive

Saratoga, CA 95070

©  2003, H.M. Gladney

 

ISSN: 1547-8610

 

Digital Preservation

A Test of Reasoning

Riddles can be instructive.  If you agree, you might try the following before you look at the answers given later in this DDQ number. 

In each case below, the assignment is to provide further members of the set or sequence given.

1)     O, T, T, F, F, S, S, E, ...

2)     79, 72, 66, 59, 50, 42, 34, 28, 23, 14, ...

3)     cherry, apple, rhubarb, plum, beet, Japanese maple, ...

4)     3, 7, 11, 15, 19, ...

Philosophical Basis for Digital Preservation

Digital preservation is critical to most of the history of the future.[1]  This expectation justifies every practical effort to ensure that the technical and administrative methodology used is sound and widely understood. 

To examine the criteria that should be used, we have returned to early 20th-century epistemology.  The thinking of Ludwig Wittgenstein and his successors teaches the importance of sharing definitions that are precise enough to minimize community confusion.  It further teaches that we should pay diligent attention to the boundary between what can be specified precisely (what is objective, and can therefore be automated) and what must forever remain issues of human values, opinions, and imperfectly communicated intentions (what is subjective, and therefore cannot be automated).

Such philosophic distinctions are critical to issues of trust and of preserved information authenticity.  The latest member of our Trustworthy 100-Year Digital Objects series identifies essential criteria for any approach to long term digital preservation.  Subtitled Syntax and Semantics—Tension between Facts and Values, its abstract reads:

Prior Trustworthy 100-Year Digital Object articles describe a method for preserving digitally represented information.  Trustworthy Digital Object (TDO) representation and packaging makes any digital content reliably meaningful to consumers, no matter how distant these are in time, in space, and in social affiliation from their information sources.  The current article focuses on digital document authenticity and on evidence a consumer can use to decide whether to trust the content.

Such considerations are necessarily epistemological.  Arguing the issues must start by conveying as unambiguously as possible what we mean by words like ‘authenticity’ and ‘evidence’ and by distinguishing  between such words as ‘objective’ and ‘subjective’.  These arguments apply Wittgenstein’s teaching to pictorial models of digital and conventional communication.

The analysis leads us to identify an ethical imperative for digital preservation, and to suggest that the TDO method defines a quality standard against which any method of digital preservation should be judged.

A preprint copy is available on e-mail request.  I would welcome critical commentary.

Unicode, UTF-8, and Fonts

Trouble-free transfer of textual information across otherwise incompatible digital platforms depends on proper handling of character representation.  This topic has become surprisingly complicated, partly because incomplete protocols were widely used before a standard became available.  (Declaring a fresh start that discarded old data and obsolete tools would be impractical.)

Conceptually, the topic is relatively simple.  However, since for many years I did not understand it clearly, I presume that some readers might welcome a brief explanation.[2]

Unicode/UCS

is a function from natural numbers in [0, 2^31-1] (31-bit integers) to character names.

UTF-8

is the most popular of several ways of representing Unicode text so that it takes less storage space than would be required if every character were represented by a 32-bit word (4 bytes).

A glyph

is a picture for displaying and/or printing a visual representation of a character.

A font

is a set of glyphs for some Unicode subset, with stylistic commonalities in order to achieve a pleasing appearance when many glyphs combine to represent a text. [3]

A code point

is the number or index that uniquely identifies a Unicode character.
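The storage saving claimed for UTF-8 can be checked with a minimal Python sketch.  The sample strings and the helper name storage_bytes are illustrative assumptions, not part of any standard; UTF-32 stands in here for "one 32-bit word per character".

```python
# Compare the storage needed for the same text in UTF-8 and in
# fixed 4-byte words (UTF-32). Sample strings are illustrative only.
samples = {
    "English": "Digital preservation",
    "French": "préservation numérique",
    "Greek": "Σωκράτης",
}


def storage_bytes(text):
    """Return (UTF-8 byte count, UTF-32 byte count) for text."""
    utf8 = len(text.encode("utf-8"))
    # "utf-32-be" avoids the 4-byte byte-order mark that plain "utf-32" prepends.
    utf32 = len(text.encode("utf-32-be"))
    return utf8, utf32


for label, text in samples.items():
    utf8, utf32 = storage_bytes(text)
    print(f"{label}: {len(text)} code points, {utf8} bytes UTF-8, {utf32} bytes UTF-32")
```

ASCII characters take one byte each in UTF-8, and the accented Latin and Greek letters used here take two, so UTF-8 is never larger than the 4-byte form for these samples.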

The meaning of a character and the picture of the character are distinct.  Generally there are many pictures that mean the same character, one for every font variation.  For instance, "the first letter in the Latin alphabet" (a meaning) can be depicted by any of several glyphs, one per typeface [sample glyph images not reproduced].

In a digital machine, the encoding is a string of zeros and ones (or on and off indications, or true and false indications; these are different ways of saying the same thing).  Such a string can be viewed as the binary encoding of a number; for instance, ‘10100’ (binary) is the same number as ‘20’ (decimal).
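The equivalence of the binary and decimal forms can be confirmed in a few lines of Python; this is only a sketch using the built-in int and format functions, with variable names chosen for illustration.

```python
# Interpret a string of zeros and ones as a base-2 number.
bits = "10100"
value = int(bits, 2)

# Convert back, once without padding and once padded to a full 8-bit byte.
unpadded = format(value, "b")
one_byte = format(value, "08b")

print(value, unpadded, one_byte)
```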

Unicode characters are intended to represent the written forms of all the world’s major languages.[4]  ‘Unicode’ is an informal name for the ISO 10646 international standard that defines the ‘Universal Character Set (UCS)’.  Relative to all other character standards, UCS is a superset[5] that guarantees round-trip compatibility.  (No information will be lost if you convert any text string to UCS and then back to the original encoding.)  Key Unicode and character representation concepts are illustrated by the following table:

Unicode     Unicode name                      Storage representation       Human presentation
code point                                    (UTF-8 encoding [6])         (sample glyphs)
----------  --------------------------------  ---------------------------  ------------------
002D        HYPHEN-MINUS                      00101101                     -
2010        HYPHEN                            11100010 10000000 10010000   ‐
2013        EN DASH                           11100010 10000000 10010011   –
2212        MINUS SIGN                        11100010 10001000 10010010   −
00E9        LATIN SMALL LETTER E WITH ACUTE   11000011 10101001            é  é  é  é
01A9        LATIN CAPITAL LETTER ESH          11000110 10101001            Ʃ
03A3        GREEK CAPITAL LETTER SIGMA        11001110 10100011            Σ Σ Σ Σ Σ
2211        N-ARY SUMMATION                   11100010 10001000 10010001   ∑ ∑ ∑ ∑
0633        ARABIC LETTER SEEN                11011000 10110011            س
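The first three columns of this table can be recomputed for the same code points with Python's standard unicodedata module.  The sketch below is a check, not a definitive tool; the function name describe is an assumption introduced for illustration.

```python
import unicodedata


def describe(code_point):
    """Return (hex code point, Unicode name, UTF-8 bits) for one character."""
    char = chr(code_point)
    name = unicodedata.name(char)
    # Each UTF-8 byte is shown as eight binary digits.
    bits = " ".join(format(byte, "08b") for byte in char.encode("utf-8"))
    return (format(code_point, "04X"), name, bits)


# The code points listed in the table.
for cp in (0x002D, 0x2010, 0x2013, 0x2212, 0x00E9, 0x01A9, 0x03A3, 0x2211, 0x0633):
    print(*describe(cp), sep="  ")
```

The glyph column cannot be computed this way, because a glyph belongs to a font rather than to the character.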

Epistemological analysis of Unicode and of fonts: Unicode defines a function from code points (integers) to names (ASCII character strings), saying nothing about how the integers should be represented by binary code or about how the characters should be depicted by glyphs or sound when spoken.[7]  Unicode character names are surrogates for conceptual objects.  They are also mnemonics by virtue of being well-known English words.  How can the Unicode definition, in itself, be useful?

A character takes its meaning from how it is used, not from the appearance of any associated picture (glyph).  For instance, a ‘LEFT PARENTHESIS’ signals the start of a delimited string.  Provided that a glyph used in formatted text is understood to mean ‘LEFT PARENTHESIS’, it is almost irrelevant which typeface’s rendering of ‘(’ is displayed.

The identity of characters is defined by the first two columns of the table above, and different characters might, in some fonts, have identical glyphs.  For instance, ASCII contains characters with multiple uses; its ‘hyphen’ is used also for ‘minus’ and as a ‘dash’.  In contrast, Unicode defines ‘hyphen’ and ‘minus’ (as well as different dash characters).  For compatibility, the old ASCII character is preserved in Unicode also (in the old code position, with the name ‘hyphen-minus’).

Why might a distinction between ‘hyphen’ and ‘minus’ be important if their glyphs are identical in many fonts?  Although the distinction might be unimportant for print and display appearance, it is almost surely critical in programs such as those that include sorting and searching.  For instance, when I search for minus signs, I prefer not to be distracted by hyphens.
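A short Python sketch illustrates the point about searching and sorting.  The sample sentence is invented for the example, and the sort shown is a plain code-point sort; a real application would more likely use a locale-aware collation.

```python
# Three characters whose glyphs look alike in many fonts.
HYPHEN_MINUS = "\u002D"  # the old ASCII character
HYPHEN = "\u2010"
MINUS_SIGN = "\u2212"

# Invented sample text using a true hyphen and true minus signs.
text = (
    f"A well{HYPHEN}known result: 7 {MINUS_SIGN} 3 = 4, "
    f"whereas 3 {MINUS_SIGN} 7 = {MINUS_SIGN}4."
)

# Searching for one character does not match the look-alikes.
minus_count = text.count(MINUS_SIGN)
hyphen_count = text.count(HYPHEN)
ascii_count = text.count(HYPHEN_MINUS)
print(minus_count, hyphen_count, ascii_count)

# A code-point sort also treats the three as different characters.
ordered = sorted([MINUS_SIGN, HYPHEN, HYPHEN_MINUS])
print([format(ord(c), "04X") for c in ordered])
```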

When a text file is sent to an application, how does it know what character coding[8] is being used?  The answer is that applications that support Unicode typically require that their input files have header records that identify the encoding.  For instance, a proper XML header record is:

                      <?xml version="1.0" encoding="utf-8"?>
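How an application might read such a header record can be sketched in Python.  This is a simplified illustration, not a conforming XML parser: the function name declared_encoding is hypothetical, the pattern assumes the file begins with the declaration in an ASCII-compatible encoding, and the fallback to UTF-8 when no encoding is declared is an assumption of the sketch.

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding attribute.
DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML header record, lower-cased."""
    match = DECLARATION.match(data)
    if match is None:
        return default  # assumption: treat undeclared input as UTF-8
    return match.group(1).decode("ascii").lower()


raw = b'<?xml version="1.0" encoding="utf-8"?><note>caf\xc3\xa9</note>'
encoding = declared_encoding(raw)
print(encoding, raw.decode(encoding))
```

Files in encodings that are not ASCII-compatible, such as UTF-16, would need a byte-order-mark check before this pattern could apply.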

A Digital Preservation Research Recommendation

Invest to Save, Report and Recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation is now available in PDF format from the Delos website.  It recommends[9] research into:

Emerging Research

1A  Repository development for existing models, for repositories for software and file format specifications, and management of peripheral devices.
1B  Cheap, long-lasting, efficient and verifiable storage media
1C  Generic devices capable of reading diverse classes of media
1D  Identify how their emergence will change digital entity encoding formats
1E  Descriptive language for the performance and behavior of preserved digital entities
1F  Inquiry into context sensitivity, risk awareness and proper preservation behavior.
1G  Accelerated ageing of media, systems and software, for predicting risks to digital objects
1H  Semantics to represent temporal, procedural and spatial relationships of digital entities

Re-engineering

2A  Modeling digital preservation processes
2B  Automation of digital preservation processes
2C  Detecting trustworthiness and information quality
2D  Scalability of long-range archives
2E  Characterization of collection completeness
2F  Distributed and grid storage

Systems

3A  Formats of digital entities
3B  Managing complex and dynamic digital entities
3C  Automated metadata creation
3D  Long-term metadata viability
3E  Multilingual entities and technology
3F  Impact of preservation strategies on information loss
3G  Repurposing e-content

Although I contributed as a member of this workgroup, I could not agree with all its recommendations, and the format of the final report did not include dissenting opinions. Partly for this reason I talked again with an IBM Research expert on the storage device industry.

The report’s list of 21 recommended topics is too long, with the consequence that less promising topics might divert attention and resources from those that promise rapid, effective, and durable progress.  If we give credence to expressions of urgency for digital preservation action that must also conform to reliably sound technological practice,[10] we should avoid the distraction and loss of focus that attention to unpromising topics will surely create.

For instance, the 1B, 1C, and 1D recommendations deal with topics whose reduction to practice would have to be handled by industrial enterprises.  Close collaboration across disciplines and across enterprise types would be essential.  Such collaboration is not evidenced by the recommendations, which read: [11]

“1B: Archival Media: To bring new classes of technology to bear on the recovery, reconstruction and interpretation of the meaning represented by bitstreams, they need to be encoded in preservation formats and on ‘archival media’.  Research into generating cheap, long-lasting, efficient and verifiable media for storing the bitstreams is needed.

1C: Salvage and Rescue: Preservation strategies depend upon our ability to access storage media over time.  While we know that some storage media can have a shelf life of thirty years or more, the devices for reading particular classes of media tend to have much shorter life-spans, often only a couple of years.  While a peripheral device repository might help here (see above), generic devices capable of reading diverse classes of media are needed to address peripheral device obsolescence.

1D: Storage abstractions: Preservation systems map between the operations that can be done on digital entity encoding formats and the operations that are supported by storage repositories.  As newer classes of storage devices are developed research will be necessary to identify how their emergence will change digital entity encoding formats to take advantage of content-based addressing and parallel processing of data.  …”

To allocate scarce research grant funds to these topics would be unnecessary and ineffective.  As written, they fail to reflect well-known engineering and business facts, such as:

1)      Achieving low unit price for a technology depends on finding or creating a large market—an unlikely prospect for a digital storage subsystem specialized for long term retention, unless it also happened to provide competitive storage density and read/write speed.  Industrial participants have conducted, and continue to conduct, a sophisticated program seeking optimal combinations of durability, density, speed, and price.  Only products of such processes are likely to offer prices that digital preservation programs can afford.

2)      Looking for durable storage media as an isolated technical objective makes little sense.  High performance solutions invariably require matched media, read-write heads, mechanics, packaging, and microcode.  Today’s early-phase cost (for prototypes good enough to attract product managers) of a new storage technology is between $10M and $100M—well beyond what the NSF has typically awarded.[12]

3)      Storage devices are typically packaged and sealed against dirt and damage.[13]  After they leave their factories, the only non-destructive means of accessing their content is through their electrical connections.  These support data stream protocols that are independent of the storage media and almost independent of device characteristics.  Modern operating system software hides raw device characteristics from all higher level software.[14]

4)      It is easy to copy even large amounts of data from aging devices to their replacements, inexpensively and with low error rates, so that media risks are dwarfed by unrelated preservation risks.[15]

It might be possible to reformulate 1B, 1C, and 1D to avoid such problems.  Doing so will require information and skills mostly to be found in industrial R&D laboratories.

Digital Storage Media as Durable as Paper

The preservation literature often compares the lifetime of practical digital storage to that of paper.  Two digital storage media are known to be as durable as paper.  The first, single crystal nickel, has been pursued in the Long Now Foundation’s Digital Rosetta Project, which enjoys some NSF funding.  The second digital medium as durable as paper is, in fact, paper![16] 

Technology for printing two-dimensional bar codes is available commercially.[17]  Such digital paper technology has not been diligently pursued for digital preservation.  We wonder, “Why not?”

Test of Reasoning (continued)

The sequence or set extensions I had in mind are shown as the tails of:

1)     O, T, T, F, F, S, S, E, N, T, E, T, T, …

2)     79, 72, 66, 59, 50, 42, 34, 28, 23, 14, Christopher, Houston, Canal, …

3)     cherry, apple, rhubarb, plum, beet, Japanese maple, stop sign.

4)     3, 7, 11, 15, 19, 0, 4, 8, …

By sending these puzzles to 30 friends, I gained confidence in a conjecture that few people will guess such responses, and that some will object that the test is unfair.  How might my answers make sense?

The answer (1) is given more often by children than by adults, as the sequence I had in mind was the first letters of the English words for the natural numbers: ‘one’, ‘two’, ‘three’, ‘four’, …  This illustrates that the answers to such riddles depend on shared experience and shared context.  Perhaps children are more likely than adults to answer this one correctly because they have relatively few possibilities to explore.

Residents of New York City might provide the answer (2) from their common experience, because what I had in mind were subway stations of the 8th and Broadway line.  This sequence is finite, unlike that in (1).

What I had in mind for (3) is also finite.  However, only one member is required (or allowed) to complete the full set intended.  What every object has in common is that it is partly colored red. 

Mathematicians are likely to recognize that what I had in mind in puzzle (4) was ‘(3 + 4n) mod 23, with n ranging over the ordered natural numbers 0, 1, 2, …’.  Readers unfamiliar with modulo arithmetic might recognize that they use it in real life whenever they consider rotation or time-of-day; rotating an object 270° (270 degrees) clockwise leaves it facing the same way as rotating it 90° counterclockwise, and the time-of-day 24 hours from now will be the same as the time-of-day right now.  For a digital computer that represents any integer in a fixed-length memory cell, modulo arithmetic is essential; adding 256 to an integer represented in one byte (an 8-bit binary string) does not change the value represented.
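The listed terms 3, 7, 11, 15, 19, 0, 4, 8 are reproduced by the rule (3 + 4n) mod 23 for n = 0, 1, 2, and so on; that rule is inferred here from the terms themselves.  A minimal Python sketch, with function names chosen only for illustration, generates the sequence and checks the rotation, time-of-day, and one-byte examples.

```python
def puzzle_four(count):
    """First `count` terms of (3 + 4n) mod 23 for n = 0, 1, 2, ..."""
    return [(3 + 4 * n) % 23 for n in range(count)]


def add_in_one_byte(value, addend):
    """Add two integers as an 8-bit cell would, keeping only the low 8 bits."""
    return (value + addend) % 256


print(puzzle_four(8))

# Rotation: 270 degrees clockwise equals 90 degrees counterclockwise (mod 360).
same_heading = (270 % 360) == (-90 % 360)

# Time of day: 24 hours from now the clock reads the same hour.
hour_now = 15
same_hour = (hour_now + 24) % 24 == hour_now

# One byte: adding 256 leaves the stored value unchanged.
unchanged = add_in_one_byte(200, 256) == 200

print(same_heading, same_hour, unchanged)
```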

In each of the puzzles, the set members I had in mind shared some attribute: respectively being first letters of certain words, being related places, having a color in common, and conforming to an arithmetical rule.  However, a priori shared attributes are not needed to choose some particular set or sequence.  If you offered to pay me handsomely to load a truck, but provided no further specification, I would return with a truckload (or set) of objects (members) that you could not have predicted.

What’s going on?  Notice that each puzzle answer includes “what I had in mind”.  The presentation of each riddle was what Wittgenstein calls ‘ein Bild’—a ‘picture’ or ‘model’.[18]  Each symbolizes something I had in mind—some thought or concept.  However, to communicate a thought precisely and accurately is  difficult.  We must be sensitive to this difficulty if we wish to achieve the most economical reliable long-term digital preservation.

The core problem is a mathematical fact:  suppose we are told that a set B is a subset of another set A, and are also given a tabulation of all the members of B.  What does this tell us about members of A that are not also members of B?