Digital Document Quarterly
Perspectives on Trustworthy Information
Volume 2, Number 1, 1Q2003
HMG Consulting
© 2003, H.M. Gladney
Progress on Trustworthy 100-Year Digital Objects
Completing Trustworthy 100-Year Digital Objects: Durable Encoding for When It’s Too Late to Ask has been delayed from the target announced in DDQ 1(4) in order to deal with a weakness pointed out by Peter Lucas. The program translation model included was too simple to manage modern compiler output[1] that bridges incompatible operating systems[2] and permits dynamic resource linking.[3] The preliminary version of the paper—sound but obviously incomplete—is sufficient for many file types; it is available to anyone who requests it.[4]
A second member of the series, Trustworthy 100-Year Digital Objects: Evidence Even After Every Witness is Dead, is available on request.[4] Its abstract reads:
How can a publisher store digital
information so that any reader can reliably test its authenticity and
provenance, even years later when no witness can vouch for its validity? What is the simplest security infrastructure
sufficient to protect evidence for authenticity testing?
In ancient times, wax seals impressed with signet rings were affixed to documents as evidence of their authenticity. A digital counterpart is a message authentication code fixed firmly to each important document. If a digital object is sealed together with its own audit trail, each user can examine this evidence to decide whether to trust the content—no matter how distant this user is in time, space, and social affiliation from the document’s source.
This is true for any kind of document, independently of its purposes, and provides each user with autonomy for most of what he does. Producers can prepare works for preservation without permission from or synchronization with any authority or service agent. Librarians can add metadata without communicating with document originators or repository managers. Consumers can test authenticity without Internet delays, apart from those for fetching cryptographic keys.
We suggest technical means for accomplishing this: encapsulation of the document content with metadata describing its origins, cryptographic sealing, webs of trust for public keys rooted in a forest of respected institutions, and a certain way of managing document identifiers. These means will satisfy emerging needs in civilian and military record management, including medical patient records, regulatory records for aircraft and pharmaceuticals, business records for financial audit, and scholarly works.
Our method accomplishes much of what is sought under labels such as “trusted digital repositories”, and does so more flexibly and economically than any method yet proposed. It requires at most easy extensions of available content management software, and is therefore compatible with what most digital repositories have installed and are using today.
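The sealing idea in the abstract can be illustrated with a minimal sketch. It assumes a shared secret key and Python's standard-library HMAC as a stand-in for the public-key sealing and webs of trust that the paper proposes. The function names `seal` and `verify` and the JSON packaging are hypothetical choices made for illustration, not the paper's format.

```python
import hashlib
import hmac
import json
from typing import Dict, List


def _canonical_bytes(payload: Dict) -> bytes:
    # Serialize deterministically so sealer and verifier hash identical bytes.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal(content: bytes, audit_trail: List[str], key: bytes) -> Dict:
    """Bundle content with its audit trail and attach a MAC over both (illustrative only)."""
    payload = {"content": content.hex(), "audit_trail": audit_trail}
    tag = hmac.new(key, _canonical_bytes(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "mac": tag}


def verify(sealed: Dict, key: bytes) -> bool:
    """Recompute the MAC and compare in constant time."""
    expected = hmac.new(key, _canonical_bytes(sealed["payload"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sealed["mac"])


key = b"hypothetical shared secret"
sealed = seal(b"Minutes of the meeting", ["created by X as part of event Z"], key)
print(verify(sealed, key))  # True
```

Any change to the content or to the audit trail makes `verify` fail. A shared-key MAC requires the verifier to hold the secret, so a public-key signature would be needed for the reader autonomy the abstract describes.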
An unexpected challenge, described below, stimulated an article supporting the Trustworthy … series, What Do We Mean by “Authentic”? Until this has been published, a copy can be had from me on request.[4]
An Archivist Looks at the World of Archivists
The Trustworthy 100-Year Digital Objects papers might confirm fears of some professional archivists, fears that David Bearman encountered in 1994 when he outraged an archivists' association meeting by suggesting that bringing digital records into archives was not required in the electronic age. In a 1999 article, Lilly Koltun writes:
“We [archivists] want even more urgently to fence and direct records access and research strategies, to compel these to progress from the general to the specific. We want to educate people to use our systematizations, our authorities and subject headings, to think like us, to discover and use the clues we leave to locate and hence define archival treasures. We want to create descriptions, which are actually value interpretations, by rote and by rules, to “facilitate” records access in the “neutral” environment we formulate; and we want to call this “contextuallization” when it is more truly reductivism. We are, therefore, in danger of being entirely by-passed by our own clientele; and those who read our descriptions will rightly demand to see, use, and revalue the particulars for themselves, impatient of the delay.” [Koltun, p.131]
Archivists could surmount the apparent technological threat to the archival profession by embracing the future (as well as the past).[5] The issue might be more than their livelihoods. They might no longer be able to enjoy a sense of special role, at least not as characterized by the artificial and self-serving claims [Koltun] calls into question. Some of the prestige of the professional archivist (the claim for hierarchical status and respect in the world) will vanish when anybody can create and preserve his choice of valuable information.[6] An effective reaction would be thoughtful reconstruction of skills and institutions that can best manage the evidence for cultural histories.
On a related note, the Trustworthy 100-Year Digital Objects series calls into question [Koltun]’s premises that digital documents are necessarily fluid rather than fixed, that digital document provenance necessarily is impeached, and that digital document content will inevitably degrade as people attempt to keep content accessible in spite of technology migration. The optimal[7] packaging for a digital object and evidence of its authenticity seems to be the digital equivalent of a wax seal impressed with a ruler's signet ring. Encoding (representation) based on relatively simple standards that include a universal virtual machine definition promises durable intelligibility.
This proposed mimicry of ancient practice is faintly ironical, but might be disturbing because error-free digital copying might reduce the role of archival specialists. For records on material media, we need archival institutions to create stability and trustworthy audit trails. For digital evidence, we probably will not need the kind of institution that serves paper-based records.
The value of Koltun’s article is not so much that it says new things (what it says, it says extremely well), but that it says so from within the archival community. Archivists can less easily reject professional concerns raised by a fellow archivist than they might reject similar words from outside their own community.
Today we use digital documents mostly for entertainment and efficiency. These documents are mostly copies that are not critical, often because they represent hard copy versions. However, in business applications, we now occasionally depend on a digital document that may be the only copy readily found, and that might cause damage if it is falsified. Such applications will grow because they are faster, cheaper, and often more reliable than their paper-based counterparts.
Authenticity is among the document security properties that need attention. Much work on privacy and confidentiality is reported, but little on authenticity except from library and archive communities.
I had thought the meaning of “authentic” to be sufficiently agreed upon that we could unambiguously consider methods for testing a particular digital object. Recent papers in the digital library and archive literature make such confidence seem unduly optimistic.
That we have long used “authentic” for quite different object kinds—for fossils, for antique furniture, for manuscripts and books, for musical performances, and so on—suggests a (hidden) common meaning or conceptual structure. We need to extend this for use with digital objects, and find that the following definition accomplishes this without violence to any traditional use of the word.[8]
Both for signals and for material entities, we choose the following definition of authentic.
Given a derivation assertion R, “V is a copy of Y ( V = C(Y) )”,
a provenance assertion S, “X said or created Y as part of event Z”, and
a copy function, “C(y) = T_n(…(T_2(T_1(y))))”,
we say that V is a derivative of Y if V is related to Y according to R.
We say that “by X as part of event Z” is a true provenance of V if R and S are true.
We say that V is sufficiently faithful to Y if C conforms to social conventions for the genre and for the circumstances at hand.
We say that V is an authentic copy of Y if it is a sufficiently faithful derivative with true provenance.
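The structure of this definition can be sketched in code. The two example transformations and all function names below are hypothetical. The truth of the provenance assertion S, and the judgment that C conforms to social conventions, are supplied as plain booleans because they are human judgments that no program can compute.

```python
from functools import reduce
from typing import Callable, Sequence

Transform = Callable[[bytes], bytes]


def compose(transforms: Sequence[Transform]) -> Transform:
    """Build the copy function C(y) = T_n(...(T_2(T_1(y)))), applying T_1 first."""
    def copy_function(y: bytes) -> bytes:
        return reduce(lambda acc, t: t(acc), transforms, y)
    return copy_function


# Hypothetical transformations of the kind a copy might undergo.
def normalize_newlines(data: bytes) -> bytes:
    return data.replace(b"\r\n", b"\n")


def strip_trailing_whitespace(data: bytes) -> bytes:
    return b"\n".join(line.rstrip() for line in data.split(b"\n"))


def is_derivative(v: bytes, y: bytes, c: Transform) -> bool:
    """Assertion R: V is a copy of Y, that is, V = C(Y)."""
    return v == c(y)


def is_authentic(v: bytes, y: bytes, c: Transform,
                 s_is_true: bool, c_is_conventional: bool) -> bool:
    """Authentic = sufficiently faithful derivative with true provenance (R and S)."""
    r_is_true = is_derivative(v, y, c)
    true_provenance = r_is_true and s_is_true
    return r_is_true and c_is_conventional and true_provenance


C = compose([normalize_newlines, strip_trailing_whitespace])
original = b"line one  \r\nline two\r\n"
copy = C(original)
print(is_authentic(copy, original, C, s_is_true=True, c_is_conventional=True))  # True
```

The sketch makes one point of the definition concrete: a bit-for-bit match with the original is not required, only agreement with an accepted chain of transformations.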
To illustrate how this definition encompasses the traditional uses of “authentic” as well as what we need for digital objects would be too lengthy for DDQ. We provide this in a just-completed article, What Do We Mean by “Authentic”?, that might interest some readers.[4] We do suggest, however, that not every reader will have sufficient patience for the careful wording and analysis that we feel necessary for a topic encrusted with long tradition and burdened by evident emotion among archivists and research librarians.[8]
What Constitutes Documentary Evidence?
If we agree what we
mean by authentic, we soon encounter
a new question, “What is acceptable documentary evidence?” We need to understand what it means to be
evidence—how to distinguish evidence from information that no cautious reader
would assume to be authoritative—a subtle and difficult issue with which
archivists, jurists, and others have long grappled.[9]
The word “evidence” originates in the Latin evidens (visible, clear, plain), which in turn comes from evidere (ex videre); literally, ex vident means “out of [the fact] that they see.” People usually accept what they themselves see and much of what they read, because skepticism about everything would be exhausting, tiresome, and counter-productive. They do, however, notice when information clashes with their practical and professional experience.[10] That’s when they might seek evidence.
Suppose we were given 100 documents among which 50 were evidence (of something or other) and 50 were not evidence. The answer to “What constitutes evidence?” would not be a pile of the 50 evidentiary documents, but rather a statement of the criteria that are proper bases for distinguishing the evidentiary documents from those that are not evidence—the rules for separating the 100 documents into two piles.
We consider a
document to be evidence if it shows itself to have been indelibly recorded by
collaboration of two or more people, and if we further believe that these
people have insufficient motive to conspire in embedding falsehoods within the
document at issue.[11] This
distills more than a century’s deliberation, discussion, and debate by
professional archivists. See, for
instance, [Duranti] on a discipline called “diplomatics”—content analysis of documents that
are bound into document sequences and individually sealed.[12]
Evidence is rarely, if ever, certain, but rather probabilistic. [MacNeil 1, pp.72-5] That is why, for instance, we speak of a jury as “weighing the evidence”. A single piece of evidence is rarely accepted as deciding a sensitive issue, even if it is strong evidence. Instead, the party at risk usually seeks corroborative or contradictory evidence, and may regard even weak corroborations as valuable. If the risk to the user is low and the fact at issue is of a kind for which chicanery is unlikely, one piece of strong evidence might be sufficient. If the risk is great, and misdeeds are common (e.g., if money is at stake), deciding with one piece of evidence might be imprudent.
Trustworthy 100-Year Digital Objects: Evidence Even After Every Witness is Dead proposes cryptographic evidence that can be constructed to be very strong. This paper hardly mentions either corroborative evidence or internal evidence that is part of a document’s content. However, it does not imply that cryptographic audit trails are, all alone, sufficient. For instance, for old scholarly documents, the place and manner of holding will commonly be considered useful supporting evidence, as will the existence of closely related documents.
Russell’s Paradox: Significance and Resolution
Russell’s Paradox is historically significant[13] because any inconsistency in a mathematical or logical system calls the entire system into question, and because Lord Russell discovered that an early version of what he published in Principia Mathematica (PM) suffered such an inconsistency. Recall that PM was intended to put all logic and mathematics on an axiomatic base. Russell tried to avoid the paradox by a theory of types for classes, but was himself not happy with this resolution.[14] Wittgenstein soon attacked this theory of class types as unnecessary.[15]
The paradox continues to intrigue, because it is amusing and an early example of more than 50 years of difficulties with the concept of infinity. Zermelo had, in 1904, already avoided the problem by limiting his set theory to finite sets.[16] A colloquial expression of the paradox is, “In a village with a single barber, the barber shaves everyone who does not shave himself. Who shaves the barber?” The mathematical form of the paradox is, “Consider the set that contains all sets that do not contain themselves. Is this set a member of itself or not?”
There have been many discussions of the paradox and some of these are lengthy. There is, in fact, a simple solution that requires no mathematics and that can be stated in under 50 words. If you are fond of riddles and logical problems, you might want to try to solve it yourself before looking ahead to where this solution is given.
What’s a Good Research Initiative?
As I pruned old files, I encountered a chart of criteria that I tried to use for selecting among research alternatives. As it still seems useful, its content is reproduced below.
Criteria for project choice:
• judged to have scientific importance
• having national or international effect
• having an achievable goal
• intelligible for non-specialists
• my having a history of progress in the topic
• my having a "sound byte" slogan for the topic
• having appeal to many researchers
Emphases during project execution:
• having crisp, measurable goals
• having a credible program description
• encouraging community creativity
On 20th September BusinessWeek assessed the broadband marketplace—why broadband growth rates are slowing, how communities are rolling out free wireless networks, and AOL versus MSN.
Digital Rights Management Conference Oversubscribed
The March 2003 Digital Rights Management (DRM) Conference sponsored by the Berkeley Center for Law & Technology was intended to raise the level of understanding about DRM technology and the legal and policy issues it poses. It was a sell-out performance. The organizers have made as much as possible of the presented and cited material available at http://www.law.berkeley.edu/institutes/bclt/drm/resources.html.
Major sales growth for Microsoft seems possible only in the Third World and in mid-sized commercial enterprises. A friend and MS employee suggests that MS is worried about only two threat sources to its expansion plans: IBM and Linux. BusinessWeek reports, “How a ragtag band of software geeks is threatening Sun and Microsoft”. On 26th March, Gartner Group advertised several reports on Linux as a business.
According to BusinessWeek, Japan leads in CPU speed right now, but IBM promises to put the U.S. far ahead in 2005. IBM’s effort is spearheaded by the “Blue Gene” advanced technology project. The projected Blue Gene performance is useful to estimate cryptographic key lengths required for public key cryptography.[17]
Reuters summarized the history of emoticons. :-) Emoticons started as a single “:)” in an e-mail sent 20 years ago by Scott Fahlman, an IBM Research Staff Member. Fahlman was not paid for his creation, but expresses no concern about that. :o and ;-) seem appropriate comments!
Macrae’s John von Neumann biography[18] is an easily read, good account of von Neumann’s immense contributions to computing, game theory, and fluid dynamics, including application of the last to nuclear bomb design.
I strongly recommend Landes’ Wealth and Poverty of Nations.[19] Among many reasons for this, one pertinent to DDQ is its objective analysis of evidence against the historical authenticity of works claiming to be scholarly and objective, but in fact rationalizing prior bias, which they declare exempt from scholarly debate.
Scholarly emotions run high on Middle Eastern matters. Readers and audiences know the answers in advance. Debates, often angry or sullen, are anything but debates. … Among the casus belli: the Arab-Israeli conflict; European economic imperialism, formal and informal; and Western criticisms (hence slanders) of Arab or Islamic culture, especially the treatment of women.
In these circumstances,
much of the debate has taken the form of name-calling. The purpose (or effect) … is to marginalize
or exclude the adversary. He is a ... (fill in the classifier). Nothing more need be said.
The most influential of
these dismissive strikes has been the invention
of "orientalism." This is the
sin of writing about … the Middle East from the outside, that is, from …
the condescending, hostile, exploitative
West. … the publication in 1978 of Edward Said's book … Orientalism … called into question most
Western writing on the Middle East. The bill of indictment ran as follows:
1. Studies by outsiders distort the subject of inquiry by turning persons into objects. … For Said, such systems as orientalism are "discourses of power, ideological fictions-mind-forg'd manacles."
2. Such pseudo-scholarship tends to stereotype … "Orientals" are the same through the ages … from … Islam that "never changes." Orientalists have no room for details, nuances, or texture.
3. Stereotyping lends itself to racism and prejudice. It separates one group from another, promotes arrogance on one side, resentment on the other. If we could get rid of "the Orient," we would have "scholars, critics, intellectuals, human beings, for whom the racial, ethnic, and national distinctions [are] less important than the common enterprise of promoting human community."
One can hardly quarrel with lofty sentiments, but sentiments
are not enough. The effort to purge the field of these factitious diseases has become an assault on knowledge. In the first place, the anti-"orientalist"
method would exclude indispensable tools of inquiry. As any good comparativist
knows, distinctions are the stuff of understanding. The anti-"orientalist" cannot have
it both ways—denounce, that is, the pursuit of distinctive characteristics as "essentialist,"
while calling for an understanding of intergroup differences. …
Secondly, the reality of nuance does not rule out the light that comes from generalization. Everything, to be sure, is more complex than appears. … Even so, some effort must be made to simplify, to find patterns. Otherwise we have nothing but a grab bag of unrelated data.
Thirdly, bad news is not
necessarily wrong. Substantive observations may cast an unfavorable light, but
such evidence must be judged on its merits, not dismissed as a priori falsehood. That way orientalist
critique boils down to a lawyer's brief for the defense. …
Scholars have a higher obligation.
… one must reject the implication that outsideness disqualifies: that only Muslims can understand Islam, only blacks understand black history, …, and so on. That way [leads to] a dialogue of the deaf. … Discrimination in such exclusionary fields, moreover, invites a loyalty test: is a given scholar on the right side? This applies both to outsiders, who can "earn" acceptance by right-think, and to insiders, where it overrides even color. Thus an Afro-American historian … who does not meet the standards of political correctness is an "oreo" …
On the other hand, some outside scholars qualify because
they agree politically with the gatekeepers.
So Edward Said makes an exception in Orientalism for a handful of Western scholars—pro-Palestinian, pro-Arab,
pro-Muslim—who may or may not be right, …
Motive trumps truth and fact.
That way lies censorship by exclusion and indifference. Scholarship and research are the losers.
Recall the power of Wittgenstein’s distinction between objective facts/statements and subjective values/opinions. This Landes excerpt illustrates a way of marshalling objective facts to critique subjective (and emotional) opinion.
A friend asked, "... why does Windows/2000 save a filename on the floppy disk with "%20" in all the spaces?" A careful explanation follows.[20]
"20" is the print representation of the hexadecimal number 20 (decimal 32, binary 100000), which is the ASCII encoding for a blank character. "%" is an "escape" symbol; here "to escape" means that, rather than showing the glyph for a character, the display program shows that character's encoding as the two hexadecimal digits that follow the "%". In a computer, the encoding is necessarily a string of zeros and ones (or on and off indications, or true and false indications; there are many ways of conveying a notion). Such a string can be viewed as the binary encoding of a number. For instance, 100000 (binary) is the same number as 32 (decimal), 40 (octal), and 20 (hexadecimal).
The filename in the
directory being displayed probably has a blank character wherever a display
program shows "%20". In case
the reason this must be done is not obvious, the following paragraphs try to
explain.
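The escape convention can be tried with Python's standard `urllib.parse` module, which implements URL-style percent-encoding. The two digits after "%" are read as a hexadecimal number. The file name is hypothetical, and the text does not say whether the Windows display program used exactly this convention, so treat the sketch as an illustration of the encoding and not of that program.

```python
from urllib.parse import quote, unquote


def describe_code(ch: str) -> dict:
    """Show one character's numeric encoding in several number bases."""
    code = ord(ch)
    return {
        "decimal": code,
        "binary": format(code, "b"),
        "octal": format(code, "o"),
        "hexadecimal": format(code, "x"),
    }


filename = "my report.txt"       # hypothetical file name containing a blank
escaped = quote(filename)        # replace the blank with its escaped encoding
print(escaped)                   # my%20report.txt
print(unquote(escaped))          # my report.txt
print(describe_code(" "))
```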
A character’s meaning
and its picture (called a
“glyph”) are distinct. We
commonly use many different pictures that mean the same character—different for
every font. For instance, "the
first letter in the Latin alphabet" (a meaning) can be depicted by any
glyph in the following row:
A A A A A A A A A A
We usually construe each of these as meaning the same thing, viz., "the first letter in the Latin alphabet", even though the glyphs are manifestly different. In a single document, we typically use font differences to signal secondary aspects, such as emphasis.
Thus, when you use a directory browser, you are not seeing file names, but instead pictures that correspond to file names! Picturing a blank would often be ambiguous. However, any software engineer would infer the intended meaning of every "%20".
Other display programs choose different glyphs to represent the blank character (encoding 20 hexadecimal, 32 decimal). For instance, the Opera browser represents this character with a character-sized rectangle.
Home Computing Technology and Price Watch
Starting with this issue, I will factor California sales tax into the prices I report.[21] I do this because many best buys include a manufacturer’s rebate by snail-mail; for these, the sales tax is reckoned on the in-store price rather than the net price. Thus, $0 loss leaders[22] still incur sales tax.
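A small sketch of the arithmetic: the tax is computed on the in-store price and the mail-in rebate is subtracted afterwards. The 8% rate and the function name `net_cost` are assumptions for illustration. The actual California rate is not stated here and varies by locality.

```python
def net_cost(in_store_price: float, rebate: float, tax_rate: float) -> float:
    """Out-of-pocket cost when sales tax applies to the in-store price, not the net price."""
    tax = in_store_price * tax_rate
    return round(in_store_price + tax - rebate, 2)


ASSUMED_TAX_RATE = 0.08  # hypothetical rate, not taken from the text

# A "$0 loss leader": $50 in store with a $50 mail-in rebate still costs the tax.
print(net_cost(50.00, 50.00, ASSUMED_TAX_RATE))
```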
Laptop PC | 2 GHz, 15” XGA, 512 MB, 40 GB HDD, CD-RW/DVD-RO, Win/XP home ed. | Compaq + AMD + Microsoft | $1410
“Desk bottom” PC | 1.8 GHz, 128 MB, 40 GB HDD, CD-RW + DVD-RO, Win/XP home ed. | No-name + AMD + Microsoft | $540
“Desk bottom” PC | 733 MHz, 128 MB, 30 GB HDD, CD-RO, Linux | No-name + AMD | $194
Hard disk drive | 120 GB, 7200 rpm | No-name | $100
Wireless 300-foot service has now become sufficiently inexpensive and reliable that I purchased a Siemens router for our home network and a Siemens PCMCIA adapter for my laptop PC. (I’m looking forward to writing the next DDQ in a shady garden corner.) I believe the prices will fall a further 25%, but that this will take a year to happen. I avoided devices for the new, higher speed 802.11g standard because they are still expensive and because reviews question their reliability.
Ethernet switch (LAN router) | 5-port | Siemens | $2
PCMCIA wireless adapter | 802.11b | Network Assoc. | $22
Wireless/DSL router | 802.11b, 5 Ethernet ports | Siemens | $56
Cat-5 patch cord | 14-foot | | $3
I’ve used each of the following utility programs for more than a year, and would buy each again at twice today’s price. I invoke each many times daily, and have encountered no reliability problems! Opera and PowerDesk each replaces MS Windows functionality, but is faster, uses screen space more efficiently, and integrates helpful additions. Free versions are available for testing.[23]
Opera WWW and file browser | Among many features: multiple panels, links collected for navigation, PW saving for protected Web pages, keyboard shortcuts, built-in search utility, pop-up spam control, great hot-list management | $39
PowerDesk[23] directory browser and management utilities | Dual panel, built-in file zipping and unzipping, built-in file viewer, disk space utilization monitor, directory synchronization (e.g., between an office machine and a notebook), file finder, better toolbars than competitive products | $25
SnagIt screen scraper |