Digital Document Quarterly Perspectives on Trustworthy Information
Volume 2, Number 2, 2Q2003
HMG Consulting
© 2003, H.M. Gladney
Colleagues and I believe we now know a good solution for every technical challenge of long-term digital preservation, at least in principle.[1] For some kinds of data, implementations are likely to be easy. In contrast, complex data, such as real-time simulations, are likely to require sophisticated engineering. We nevertheless believe it possible to structure institutional archiving plans seamlessly, dealing correctly with easy cases in the near future, while we design tools for handling complex data.
The complete technical solution is not conceptually difficult in the sense of including any unsolved basic problem. Instead it is merely complex—consisting of many interrelated parts, with each part amenable to a well-known approach. We can explain such a solution only with a divide-and-conquer exposition in which each component is itself partitioned sufficiently to persuade any skeptical critic.
Note, however, that no preservation solution can be demonstrated to be correct by satisfying a “show me” demand today, because we cannot wait to see whether test cases survive or fail several centuries from now. All that’s feasible is to analyze every potential failure type[2] to explain how the solution precludes this failure, or else to repair shortfalls exposed.[3] To do this will be a painstaking task that is likely to be tedious and might also be lengthy.
What were the key technical challenges and where can their proposed solutions be inspected? The high level challenges and references to their solutions are:
Ø Ensuring that eventual information consumers can read or otherwise use each preserved object as completely as its producers intended. See Durable Encoding below.
Ø Ensuring that each eventual information consumer can decide whether preserved information is sufficiently trustworthy for his intended application. See the first section of DDQ 2(1).
Ø Ensuring that files (bit streams) are not destroyed or lost. See Stanford’s LOCKSS.
Ø Designing to minimize labor costs and skills required, replacing human effort by automatic procedures whenever doing so is feasible. This will be addressed in Trustworthy 100-Year Digital Documents: End User Interfaces, whose release is planned for late 2003.
Ø Empowering each information producer to package content and metadata to minimize what a professional archivist or librarian must do. This will be included in End User Interfaces (see above).
Does this apply to aspects of your digital preservation strategy?
Ø Hiding the complexity of our durable encoding solution from end users by clever packaging in programmers’ tools.
Ø Demonstrating that the proposed solution handles everything that technology can handle, and does not attempt what essentially depends on human judgment. This will be addressed in Trustworthy 100-Year Digital Documents: Syntax and Semantics—Tension between Facts and Values, which we hope to have ready in August.
Ø Ensuring the communication of meaning as expressed by ontologies, doing so as to embrace evolving metadata standards. This is planned for Trustworthy 100-Year Digital Documents: What's Authentic? Essential and Accidental in Documents, planned for October.
Ø Persuading skeptics that our proposed solution is optimal in certain well-defined ways.[4] We plan to address this in Trustworthy 100-Year Digital Documents: Economic Effects on Archive Design, targeted for November.
Ø Designing implementations that maximally exploit deployed technology and that make minimal new infrastructure requirements. We will address this only if funding applied for is granted.
Most of the technology needed to accomplish these objectives is readily available today either as open-source software or as commercial-off-the-shelf (COTS) offerings. Implementations will consist mostly of integration of existing offerings to satisfy needs that vary among users that include repository institutions.
In contrast to the approach sought in most articles addressing digital preservation, our solution makes no new requirements of digital repository technology.[5] Nor does it depend on archival institutes practicing business controls that include independent certification inspections.[6] Existing content management (a.k.a. digital library) offerings are adequate, or almost adequate.[7]
Formal publication of Trustworthy 100-Year Digital Documents articles will probably not occur in 2003. Until they are published, those whose preliminary versions are complete[8] can be obtained by sending a request to hgladney@pacbell.net.
Migration, Emulation, and Durable Encoding
Pessimism seems to be growing regarding completely correct and durably intelligible digital preservation.
“… the emulation-versus-migration debate has largely played itself out. Neither approach provides a sufficient, general answer to the problem of digital preservation, and it has proven largely fruitless to debate the merits of these approaches in the abstract.[9] Instead, there is growing recognition that different kinds of information captured in different ways for long-term preservation will need various kinds of support.” [Waters][10]
In this, “growing recognition … various kinds of support” seems futile, because Waters neither describes nor alludes to specific measures for “different kinds of information.” Furthermore, the statement suggests pessimism about invention of a single integrated set of measures—pessimism that we believe premature not only because inquiries are at an early stage, rather than a terminal stage, but even more because we are now asserting a complete solution.
The consensus about making complex content durably intelligible seems to admit only two possibilities:
“Signals … degrade, and not at a consistent rate, and hardware and software become obsolete. Data must therefore be transferred to new media or migrated to newer platforms, operating systems, and program applications. An alternate strategy is to emulate the original; that is, to provide a way through software to mimic the hardware on which a given system ran. Either way, each item in a digital archive requires active management.” [11] [Marcum][12]
For seven years, these two approaches have dominated information preservation discussions, almost excluding any other thinking.[13] Extensive debates have neither resolved any issues nor demonstrated that either method precludes errors. Avoiding small errors is helpful when the data being preserved represent natural language text, and essential if the data include computer programs.
That two methods fail does not demonstrate that no method will work. In fact, a 1995 Lorie idea[14]—based on defining a simple “universal virtual computer (UVC)”—will almost surely work.[15] We call an elaboration of Lorie’s idea “durable encoding,” and are working towards evaluating whether it accomplishes everything wanted, showing that practical implementations are possible, and how to package such implementations to hide their complexity from end users.
How can pessimism prevail when only a few ideas have been considered and when a promising idea has been announced, but not carefully examined? That’s hard to say, beyond conventional mutterings about people’s failure to look across professional boundaries and “too much consensus.” [16]
Undiscriminating use of the words ‘migration’ and ‘emulation’ may have contributed to the problem.[17] In software engineering, ‘migration’ denotes diverse procedures for copying data between storage locations and perhaps also altering them. Similarly, ‘emulation’ denotes making some machine behave like a different kind of machine, but is, in itself, silent about the machine types. That a few investigations of particular kinds of migration and/or emulation have failed is simply an insufficient basis for concluding that other techniques are not worth pursuing.[18]
What are the problems of transformative migration and preservation emulation,[17] and how does durable encoding avoid them? To answer these questions would require explanations that are more technical and longer than most DDQ readers would want; we recommend that readers wanting evaluation of what we claim ask their in-house experts to inspect the report described in §Durable Encoding. Among other valuable properties, durable encoding avoids the expensive “active management” alluded to by [Marcum].[12]
Preserving Dynamic Behavior (or Content)
Recent literature reveals surprising confusions about the authenticity of preserved dynamic information.
“Professor Duranti ... went on: ‘But the reason for the InterPARES project 2 is that we are discovering that by stabilising records that, by their nature, are dynamic we, in fact, end up forging them. That is, we are eliminating their authenticity.’
“…
“After further lengthy discussion on varying requirements for the integrity of different digital objects, Hans Hofman suggested that from users’ perspectives the question was simply one of trust. Professor Duranti agreed but warned against archives’ past faith in creators. She said: ‘This is no longer true. The person who generates the material may trust it and might be wrong. Because, with digital records, the fluidity of the record is such that if you don’t have very detailed methods of control in place all along, so that you can say that you have a trusted system, it doesn’t work.’” [Steemson][19]
This confounds aspects that can and should be treated as distinct, including at least:
Maitland recently circulated observations about new art forms and requested comments.[20] Her concerns included some (column 1 of Table 1) that seem to follow from the confusions [Steemson] reports.
Table 1: Observations about preserving ephemeral art
| Maitland’s observations (excerpted) | Reactions to the observations |
| Existing boundaries between artistic disciplines which have hitherto been distinct have been eroded by the … complexities of new hybrid forms | These “existing boundaries” are new distinctions that can safely be ignored. For instance, opera is an art form that has mostly ignored the boundaries alluded to. |
| … information technology as a central part of the creative process has led to a higher degree of interactivity and practical engagement [by] the viewer/audience, [leading] to a fundamental change in our understanding of the notion of "artistic integrity" | This introduces no new technical requirement. Of course, copyright law and privileges intrude, creating tensions that only law courts and legislatures can resolve. |
| There exists a tension (which is exacerbated within the new forms focused on here) between notions of permanence in art/its place (sic) in our cultural heritage and the idea of some pieces being entirely transient and ephemeral, with no existence or future beyond the temporal or spatial boundaries within which they occur | If a work is truly “entirely transient and ephemeral”, by definition it cannot be preserved. I.e., there is no practical conflict, because the sentence is nonsensical[21] if the conventional meanings of “transient and ephemeral” and “permanence” are intended. Part of the issue is, “Who says that this is the way it has to be?” We can choose to record[22] an artistic performance, saying, “This is not truly ephemeral.” |
What’s the underlying problem? These excerpts express uncertainty with dynamic digital information apparently because of difficulties with progressions in time. In engineering parlance, a repeat R(t) of an original performance P(t) would be called authentic if it were a faithful copy except for a constant time-shift, t_start, i.e., if R(t) = P(t - t_start). This conforms to ‘variable instantiations’ in:
“…dynamic documents dependent upon data that might have variable instantiations and be held in databases and spreadsheets.” [Ross][23]
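The time-shift criterion, R(t) = P(t - t_start), can be illustrated with a minimal sketch. It assumes that both the performance and its repeat are available as sequences of sampled values; the function name `find_time_shift` and the sample data are hypothetical and are not taken from any report cited here.

```python
from typing import Optional, Sequence


def find_time_shift(original: Sequence[float],
                    repeat: Sequence[float]) -> Optional[int]:
    """Return t_start (in samples) such that repeat[t] == original[t - t_start]
    for every sample of the original, or None if no constant shift exists."""
    n = len(original)
    if n == 0:
        return None
    # Try every candidate shift at which the whole original still fits.
    for t_start in range(len(repeat) - n + 1):
        window = repeat[t_start:t_start + n]
        if all(r == p for r, p in zip(window, original)):
            return t_start
    return None


# Hypothetical sampled performance P(t).
performance = [0.0, 0.5, 1.0, 0.5, 0.0]

# A faithful repeat that merely starts three samples late.
faithful_repeat = [0.0, 0.0, 0.0] + performance + [0.0]

# A repeat in which one sample has been altered.
altered_repeat = [0.0, 0.0, 0.0, 0.0, 0.5, 0.9, 0.5, 0.0, 0.0]

print(find_time_shift(performance, faithful_repeat))  # 3
print(find_time_shift(performance, altered_repeat))   # None
```

Exact equality is used because digital states are discrete; comparing analog recordings would need a tolerance, which this sketch deliberately omits.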
The trick is simply to choose some instance or some sequence of instances to preserve. This works for any kind of signal or real-world situation. Its meaning is simpler for digital documents than for analog recordings or for live performances because digital states are static most of the time, whereas we think of real world performances as being continuous in time.[24]
In casual conversation, we often say that a recording copy is authentic if it is “close enough to the original.” But consider, for instance, an orchestral performance and how signals flow from its musical instruments, with wall reflections to imperfect microphones, followed by deliberate and accidental changes in studio electronic circuits, and so on, until we finally hear it reproduced in our homes. We cannot say with objective certainty which of many different signal versions is “the original”. We can do no better than choosing some particular version[25] that we describe carefully (we call that description “technical and provenance metadata”), and judging authenticity by comparison with that version. There might be no circumstance or object type for which “the original” has an unambiguously objective meaning!
The difficulty with “the original” illustrated above is conceptual, rather than being caused by technology use. It would occur for most works even if the signal channels were perfect, because no author or artist creates much by any single action.[26] Leppard, the noted British conductor, illustrates this in a brief history of Gluck’s Orfeo ed Euridice.[27] An effective coping strategy is to judge an object’s authenticity in terms of its state at the event of passing it between some donating and some receiving custodian.
The DDQ 1(4) announcement of a preliminary version of Trustworthy 100-Year Digital Objects: Durable Encoding for When It’s Too Late to Ask proved, in retrospect, slightly premature. Its request for critical comments was answered by a close colleague, Peter Lucas. Peter pointed out that its program compilation model was not merely simple, but also simplistic, with the consequence that our proposal for preserving complex programs would be insufficient, especially when source code was inaccessible to whoever was preparing information for preservation.
The model is now corrected, leading to a crisper distinction between the data types for which preservation methodology is adequately specified—ordinary static data files and the program class called filters—and more complex cases for which a practical engineering approach still needs to be worked out.
Another reader suggested that both Durable Encoding … and the companion Trustworthy 100-Year Digital Objects: Evidence Even After Every Witness is Dead[8] were written for software engineers and for computer scientists who might want to check the validity of their proposed preservation solution, but were more technical than would appeal to managers and administrators. This is a fair objection; we do believe that managers and administrators charged with starting digital archives should decide whether we are in fact addressing problems pertinent to their institutional requirements; if so, they should request a critical look by technology experts whose judgment they trust.[28]
The core ideas of Durable Encoding … include:
Ø Using as a basis a few relatively simple and broadly accepted EDP standards, such as the ISO Unicode and UTF-8 character encoding standards, but depending only on standards that we are confident will survive and be correctly handled in centuries to come.
Ø Avoiding including in the preserved data anything irrelevant to the information being conveyed.[29]
Ø Using XML packaging to convey metadata and structure relating whatever number of bit-streams are needed to convey the main content of each digital object.
Ø Using a Turing-equivalent virtual computer[30] to encode bit-streams for which the aforementioned standards are insufficiently expressive. In particular, computer programs cannot be reliably preserved without this device.
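The XML packaging idea in the list above can be sketched as follows. This is only an illustration under stated assumptions: the element names (`preservedObject`, `metadata`, `bitStream`), the use of base64 for binary payloads, and both function names are hypothetical and are not the schema of the Durable Encoding report. The sketch shows one UTF-8 XML document carrying metadata together with any number of bit-streams.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Dict


def package_object(title: str, streams: Dict[str, bytes]) -> bytes:
    """Wrap metadata and any number of bit-streams in one UTF-8 XML document."""
    root = ET.Element("preservedObject")        # hypothetical element name
    meta = ET.SubElement(root, "metadata")      # hypothetical element name
    ET.SubElement(meta, "title").text = title
    for name, payload in streams.items():
        # Binary content is base64-encoded so that the package stays plain text.
        stream = ET.SubElement(root, "bitStream", name=name, encoding="base64")
        stream.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def unpack_streams(document: bytes) -> Dict[str, bytes]:
    """Recover the original bit-streams from a packaged document."""
    root = ET.fromstring(document)
    return {
        element.get("name"): base64.b64decode(element.text or "")
        for element in root.iter("bitStream")
    }


package = package_object(
    "Sample object",
    {"text.txt": "héllo".encode("utf-8"), "image.bin": b"\x00\xff\x10"},
)
recovered = unpack_streams(package)
```

The round trip returns every bit-stream unchanged, which is the property a consumer would check first; a real package would also need the technical and provenance metadata discussed elsewhere in this issue.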
A graph in the “How Quickly is Technology Changing?” section of DDQ 1(1) reminded readers that persistent storage cost/effectiveness had improved exponentially since 1990 (at the rate of ~28% p.a. for the price/Mbyte), and that trade experts predicted that such improvements would probably continue for another decade. The impact might be more easily understood from the following equivalent statement: for a home computer, $100 will today buy approximately 100 gigabytes of HDD storage, but is expected in 2013 to buy more than 1 terabyte, enough for about a million good quality digital photographs.
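The arithmetic behind this projection can be checked with a short sketch. It assumes that the ~28% annual rate is read as compound growth in the storage capacity that a fixed sum buys; that reading is an interpretation made for illustration, and the function name is hypothetical.

```python
def projected_capacity_gb(capacity_now_gb: float,
                          annual_growth: float,
                          years: int) -> float:
    """Compound the capacity bought by a fixed sum over a number of years."""
    return capacity_now_gb * (1.0 + annual_growth) ** years


# Assumption: 100 GB for $100 in 2003, improving by 28% per year for 10 years.
capacity_2013_gb = projected_capacity_gb(100.0, 0.28, 10)

# Storage per photograph if one terabyte holds about a million photographs.
megabytes_per_photo = 1_000_000 / 1_000_000  # 1 TB = 1,000,000 MB

print(f"Projected 2013 capacity for $100: {capacity_2013_gb:.0f} GB")
print(f"Implied size per photograph: {megabytes_per_photo:.0f} MB")
```

Under this reading the capacity grows roughly twelvefold in ten years, which agrees with “more than 1 terabyte”, and a million photographs per terabyte implies about one megabyte per photograph.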
Below, we identify current evidence supporting such projections. Their significance to digital preservation is that, as whenever an important parameter changes by an order of magnitude, enterprise strategies must be changed—perhaps drastically—if optimality is wanted. Such changes are likely to affect how enterprises are organized and how they relate to other enterprises.[31]
Two decades ago, data center disaster recovery recommendations included the suggestion that duplicates of tape library holdings should be shipped by van to remote locations. A Microsoft Research team suggests a currently economical replacement: configure a PC with a terabyte of storage carrying the backup copies, and ship the entire PC between locations by parcel post.[32]
A variant of this is that CDs may be close to replacing floppy disks. We all receive new software and promotional material on CDs. Perhaps you already share content by an ordinary PC user’s equivalent of the TeraScale Internet: burning and snail-mailing a CD. Although test marketing of PCs without floppy drives apparently suggests that such machines are premature, such a change cannot be long in coming, given that you can have a CD-RW drive for $25 and 700 Mbyte blank disks for about $0.15 apiece.
Both magnetic and optical disks are essentially 2-dimensional storage devices.[33] 3-dimensional storage offers dramatic improvement in device capacity (for a given device size and weight); the technique receiving attention is holographic storage.[34] InPhase Technologies™ has developed a prototype holographic video storage device it calls Tapestry™. Tapestry can store 100 Gb (equal to about 20 compressed feature films) on a DVD-like disc. InPhase projects 1.3 terabytes on a single disc.
A relatively unknown effect of magnetic disk improvements is that IBM has started to leave the business, probably for similar reasons to those that motivated its exit from most of the computer printer business a decade ago.[35] The core of IBM’s business is high expertise in technical and business matters—expertise that commands high gross margins, rather than mass manufacturing methods and consumer marketplace infrastructure. The latter require different organization and different skills than are common in IBM.
Specifically, what IBM did was sell a controlling interest in its HDD business[36] to Hitachi Corp. about a year ago; Hitachi manages the joint venture on which the two corporations agreed. This arrangement might itself be temporary, lasting only until IBM finds a buyer for its minority interest.
Managing Criticism by Investment
Suppose that, working at a reasonable pace, you (or the digital archiving organization that you manage) cannot accomplish everything that people expect of you. How would you handle the situation?

What people usually do in this common circumstance is try harder, often by working overtime. Doing so is often not the best tactic.
Consider your underlying objectives. These surely include one that is seldom voiced: avoiding your employer’s, your stockholders’, or your spouse’s disappointment, criticism, or verbal abuse. If so, it’s worth considering how you might reduce or even eliminate verbal abuse.
What determines the level and vigor of abuse? The shortfall between what you can accomplish and what would be accepted as satisfactory performance is certainly a factor, but the critic’s energy and enthusiasm for complaining might also be important. The following graph, which applies to any stage in a project that takes some time to accomplish, suggests the form of the relationship between complaint levels and what you are actually accomplishing.

If you are accomplishing the job with sufficient quality (right end of the graph), your customer will not criticize your performance. If you are far short of what’s expected (left end of the graph), you will be criticized, but the level of criticism will depend more on the energy of the critic than on the shortfall in accomplishing what’s needed. In such circumstances, trying harder will not reduce the level of abuse. That’s possible only if your performance is in the reverse-s-shaped portion of the abuse curve.
The trick is to notice that, if the shortfall is large, producing less will also not increase the abuse level! Consider dropping back on the job at hand, in order to free time/resources for developing skills and tools that make you more effective, making it possible in the future to do the kind of work at hand with sufficiently less energy so that you can close similar shortfalls. It’s called investment!
Of course, you must hide from the critic that you are using such tactics, or you might discover him willing to increase, and capable of increasing, the abuse you receive beyond what you thought possible. Or else, he might switch from abuse to punishment! [37]
It’s on the Web, so it Must Be True!
Colleagues have mentioned how difficult it is to teach undergraduates to use the literature, and that many copy what they find into their writing assignments without acknowledgement and without evaluating whether what they accept is true. This practice exposes itself when several use the same material in a class assignment, possibly because they submitted the same query to the same Web search engine.
“… many students have difficulty recognizing trustworthy sources, though perhaps the underlying problem is a lack of understanding of the Internet as an unmonitored source of information. …
“Students are also not consistently able to differentiate between advertising and fact. Many responses to [a survey] mentioned that as the Web site was just trying to sell a product, its claims could not be readily believed. However, many of these same students immediately believed claims made by Microsoft on its commercial Web site. …
“The very small number of students who double-checked information is also concerning. … Students in this study seemed to have a great deal of confidence in their abilities to distinguish the good sites from the bad. Colleges themselves often encourage this attitude [by how they] help students …” [Graham][38]
“… the reality of the situation we currently face. At this time, technologies frequently are designed and developed more for the benefit of vendors than for users, and persons concerned with digital preservation are expected to jump through whatever hoops are required by those technologies.” [Granger][39]
This is a bizarre utterance from an author who otherwise seems to believe that “collaboration structures” should include commercial institutions. It has the ring of left-wing political rhetoric, in contrast to my comparatively boring interpretation of two decades observing the internal workings of IBM’s marketing and development teams for content management and database products.[40]
Consider “persons concerned with digital preservation are expected to jump through whatever hoops are required”. The product managers who decide R&D investments simply ignore this community, because it does not present itself as part of the market they are charged with addressing.[41] We should expect such behavior in our private enterprise system, because these managers are appraised and rewarded primarily on achieving schedule[42] and revenue targets. What you hear said is along the lines of, “If you don’t make your numbers, nothing else counts!”
The lesson for anyone sympathetic to the [Granger] complaint is simple and direct: if you want a commercial vendor to meet your requirements, become a (potential) customer.
Ted Codd, Inventor of Relational Database
On April 18, Edgar F. Codd, the IBM computer scientist who created the relational database model at the core of today’s ~$8 billion industry of storing the world's business data, died in Florida.[43]
Before Codd's landmark research[44], it was possible to store lots of information--but analyzing it was difficult, requiring many lines of code for even simple tasks. His solution, based on simple mathematics, called for representing all database information as values in table rows and columns.
As often happens when a replacement technology threatens an established business,[45] the change was effected only with vigorous, and sometimes even rancorous, debate because people’s careers were affected.[46] Oracle™ was the first to release a successful R-DBMS product, perhaps partly because it did not risk losing DB customers (it had none).[47] The fact of a competing DBMS product helped settle the debate inside IBM, which announced SQL/DS™ running on a VM/CMS™ operating system base in 1981, and what has evolved as IBM/DB2™ on an OS/MVS™ base in 1983.
A technical issue at the core of the debate was that pointers (references) permit a hierarchical DBMS, such as IBM/IMS, to be more economical in its use of computing resources than an R-DBMS. However, programming relational database applications is much easier, especially because diverse and unanticipated applications can be created without restructuring the database layout, which is impractical with a hierarchical database. As the price of computing machinery decreased over the years, while skilled programmers became more expensive and harder to find, the IMS performance advantage disappeared, and economics gradually drove customers from hierarchical to relational DBMSs. In addition, steady progress over a quarter century (mostly by the IBM Research DB team) found optimizations that made R-DBMS performance better than anybody in 1975 dreamt possible.
Today, every digital library and search service depends on a relational database.
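The claim that unanticipated applications need no restructuring of the database layout can be illustrated with a minimal sketch using Python's standard `sqlite3` module. The tables, column names, and data are hypothetical and chosen only for illustration; SQLite stands in for any relational DBMS and is not one of the products named above.

```python
import sqlite3

# An in-memory relational database: all information is held as
# values in table rows and columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE purchase (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        amount REAL
    );
    INSERT INTO customer VALUES (1, 'Ada', 'London'), (2, 'Grace', 'New York');
    INSERT INTO purchase VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")


def total_by_city(connection: sqlite3.Connection) -> list:
    """An ad hoc question the table designer did not anticipate:
    total purchases per city, answered by a join with no layout change."""
    return connection.execute("""
        SELECT c.city, SUM(p.amount)
        FROM customer AS c
        JOIN purchase AS p ON p.customer_id = c.id
        GROUP BY c.city
        ORDER BY c.city
    """).fetchall()


print(total_by_city(conn))  # [('London', 25.0), ('New York', 12.5)]
```

In a pointer-based hierarchical layout, a new question of this kind would typically require following, or adding, access paths chosen when the database was designed; here the join is expressed entirely in terms of values.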
OCLC/RLG Form a Metadata Implementation Group
In June, OCLC and RLG announced the formation of a working group to explore preservation metadata implementation strategies. According to listserv postings and a Web page, the new PREMIS (PREservation Metadata: Implementation Strategies) group is to develop a broadly applicable and implementable set of "core" preservation metadata elements, and a data dictionary to support them.
It is intended also to evaluate strategies for managing preservation metadata within a digital preservation system, and for the exchange of preservation metadata between systems; to establish pilot programs for testing the group's recommendations and best practices in a variety of systems settings; and to explore opportunities for the cooperative creation and sharing of preservation metadata.
In a 29th June 2003 N.Y. Times op-ed column, a Wi-Fi provider is quoted as follows:
"If I can operate Google, I can find anything. And with wireless, it means I will be able to find anything, anywhere, anytime. Which is why I say that Google, combined with Wi-Fi, is a little bit like God. God is wireless, God is everywhere and God sees and knows everything. Throughout history, people connected to God without wires. Now, for many questions in the world, you ask Google, and increasingly, you can do it without wires, too."
“… once Wi-Fi is in place, with one little Internet connection I can download anything from anywhere and I can spread anything from anywhere. That is good news for both scientists and terrorists, … While we may be emotionally distancing ourselves from the world, the world is getting more integrated.”
“These contradictions have to do with the protection of the authors' interest and have become apparent with the rise of open access publishing as an alternative to the traditional commercial … journal subscriptions. … This paper reviews the specifics of publishers' contracts with editors and authors, as well as the larger spirit of copyright law in seeking to help scholars to better understand the consequences of the choices they make between commercial and open access publishing models for the future of academic knowledge.” [Willinsky] [48]
The Supreme Court of Canada has held the "Harvard Mouse" to be unpatentable,[49] making Canada the first major jurisdiction whose high court has refused to recognize the patentability of higher life forms. Only time will tell whether this will have commercial significance. In the short term, the decision has strong symbolic significance.
The decision turned on the definition of invention in the Patent Act. In Canada, as in the United States, the definition includes five eligible subject categories: art, process, machine, manufacture or composition of matter. The decision primarily focused on whether the oncomouse was a "composition of matter".
In late May, Barbra Streisand filed a lawsuit that promises to test the limits of privacy, partly because Streisand is a public figure whose business includes deliberate visibility.[50] Claiming her privacy was violated, Streisand filed a $10 million lawsuit against Silicon Valley millionaire and environmentalist Ken Adelman, demanding that he remove an aerial photograph of her oceanfront Malibu mansion from the California Coastal Records Project Web site.
Adelman photographed the entire California coastline from a small helicopter (one picture every 500 feet) and put the pictures on the site. The site contains ~12,200 photos and has won environmental groups’ praise for helping document coastal building law violations. Streisand’s lawsuit claims the site violates California's “anti-paparazzi” law because Adelman did not ask permission to photograph her house.[51]
On 24th June, the San Jose Mercury News reported an effect that you might have expected: “Photo of Streisand home becomes an Internet hit.”
The United States Patent and Trademark Office (USPTO) has released its Congressional report on technology designed to protect digitized copyrighted works from infringement, as required under the 'Technology, Education and Copyright Harmonization Act of 2002' (TEACH Act).
This mandated report is intended solely for information purposes. Congress is to use it in establishing a baseline of knowledge for what technology is or could be made available and implemented. The USPTO does not intend the report to be a recommendation or a comparative assessment of offerings.
Recent months saw the release of several reports, such as the USPTO Report just mentioned, that seem to be required reading for anyone serious about managing digital documents, partly because they represent current thinking towards significant U.S. Government expenditures for digital preservation. Collectively, these reports are longer and weightier than I can readily absorb in a few months, so I identify them here with only minimal comment.
In June 2002, the U.S. General Accounting Office (GAO) released Information Management: Challenges in Managing and Preserving Electronic Records, GAO-02-586. It recommended:
“GAO recommends that the Archivist of the United States develop documented strategies … for conducting systematic inspections of [records management] programs. In addition, to reduce risks, GAO recommends that the Archivist reassess the schedule for acquiring the new archival system so that the agency can complete key planning tasks and address IT management weaknesses.”
Congress has approved the Library of Congress’ Plan for the National Digital Information Infrastructure and Preservation Program (NDIIPP), which will enable the initial phase of building an infrastructure for the collection and long-term preservation of digital content. The complete text explains the key components of the digital preservation infrastructure.
In a first reading of the parts of this report addressing technology and architecture, I learned nothing new.
In contrast to the NDIIPP plan document, the preliminary Building an Electronic Records Archive at the National Archives and Records Administration (NARA): Recommendations for Initial Development report contains interesting guidance on technical issues of archive design and management. In its summary Finding 4, this report dismisses NARA’s investments at the San Diego Supercomputer Center with,
“Demonstrations conducted at the [SDSC] for NARA have provided a useful opportunity for NARA to explore relevant technologies. However, the work has not informed many significant aspects of the [Electronic Records Archives] design, has not reduced the engineering risk of the program, and …”
“The UNESCO Draft Charter on the Preservation of the Digital Heritage (Document 166 EX/18) presents a compelling case for digital preservation. The Guidelines have been prepared to offer realistic and useful guidance for those responsible for preserving digital heritage, including those having only very limited resources.”
To argue some position in copyright policy, it often helps to understand the objectives of copyright as taught by its history. The Statute of Anne is the world’s first copyright legislation. A facsimile image of its first page announces The History of Copyright: A Critical Overview With Source Texts in Five Languages (a forthcoming book by Karl-Erik Tallmo).
E-Mail: Replacing Memoranda or Conversation?
DDQ 2(1) tried to mount a mini-survey inquiring What Does E-mail Supercede? That very few people responded suggests that what once seemed an interesting question is no longer one.
The tools and resources mentioned in this section are particularly promising selections from hundreds that I’ve inspected. These, and further instances that will appear in later DDQ numbers, are either free of charge or inexpensive.
WordNet® is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. Although its coverage is small compared to well-known dictionaries, I find it very helpful for most words that I’ve had occasion to look up.
This PowerPoint presentation will help you educate end users on your organization's security policies. Modify the slides to meet your organization's needs, and use the presentation to enlist user support in the fight against security breaches.[52] A free utility generates random passwords. Other security education aids are readily available from the same Web site.
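The utility mentioned above is not described further here. As a rough illustration of what a random-password generator does, the following Python sketch draws characters with the standard library's `secrets` module. The function name `random_password`, the default length, and the rule requiring one character from each class are assumptions made for this example, not features of the TechRepublic tool.

```python
import secrets
import string


def random_password(length: int = 16) -> str:
    """Return a random password of the given length.

    Hypothetical helper: the character classes and the minimum
    length are illustrative choices, not a published policy.
    """
    if length < 4:
        raise ValueError("length must be at least 4")
    alphabet = string.ascii_letters + string.digits + string.punctuation
    while True:
        # secrets.choice uses the operating system's secure random source.
        candidate = "".join(secrets.choice(alphabet) for _ in range(length))
        # Retry until every character class is represented.
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)
                and any(c in string.punctuation for c in candidate)):
            return candidate


if __name__ == "__main__":
    print(random_password())
```

The `secrets` module is used rather than `random` because the latter is not intended for security-sensitive values.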
It's difficult for any IT professional to imagine life before PCs. Download this poster and have some fun examining the history of computers, and brighten up the IT workspace at the same time!
Home Computing Technology and Price Watch
DVD-writable technology is still too expensive for me. The best price I have seen for a 4x DVD+RW/+R burner is $180. Blank write-once DVD platters still cost about $1 each. I’m waiting for better prices, which I expect to see in about a year.
Many color inkjet printers can be purchased for less than $50, but that does not tempt me because the vendors’ profit source is the cartridges, which are “locked in” to printer models. The effective cost for color inkjet printing is an order of magnitude greater than that for laser grayscale printing. Photograph-quality color prints are another order of magnitude more costly. I’m waiting for lower prices for color laser printers and the solid ink refills that they use.
Prices observed[53] since DDQ 2(1) appeared include:
| Item | Description | Price | Unit |
| Mid-range PC | Almost everything needed for a new home installation: 120 GB UDMA hard disk drive; CD-RW drive (48x/12x/48x); DVD-ROM drive (16x); graphics processor w/32 MB memory; 10/100 Ethernet & 56k v.90 modem; Windows™ XP Home Edition; 17” flat CRT monitor (1600x1200); HP™ DeskJet 3420 printer | $774 | each |
| 17” CRT monitor | .27 mm dot pitch, 1280x1024 max. resolution | $85 | each |
| Flat panel display | 17” with 1280x1024 resolution | $444 | each |
| Optical mouse | 3-button + scroll wheel | $8 | each |
| CD-RW drive | Verbatim™, 42x24x52x | $24 | each |
| CD-R blanks | 700 MB, 80 minute, in packs of 50 | $0.15 | each |
| 3.5” HDD | Maxtor 160 GB, 7200 rpm | $0.61 | per GB |
| USB mobile drive | 128 MB | $33 | each |
| Scanner | Epson 1650 w/1200x2400 dpi @ 48-bit color | $53 | each |
| Scanner | CanonScan LIDE 20 w/600x1200 dpi | $34 | each |
| Printer paper | Multipurpose 20 lb. | $0.002 | per sheet |
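To compare the storage media in the price list on a common footing, the prices can be converted to dollars per gigabyte. The Python sketch below does this for the three storage entries. The prices and capacities come from the list, while the names `MEDIA`, `cost_per_gb`, and `ranked`, and the convention 1 GB = 1000 MB, are assumptions made for this illustration.

```python
# (price in dollars, capacity in GB); 1 GB = 1000 MB is assumed here.
MEDIA = {
    "CD-R blank (700 MB)": (0.15, 0.7),
    "3.5-inch HDD (160 GB)": (0.61 * 160, 160.0),  # listed at $0.61 per GB
    "USB mobile drive (128 MB)": (33.00, 0.128),
}


def cost_per_gb(price: float, capacity_gb: float) -> float:
    """Return the price of one gigabyte of capacity."""
    return price / capacity_gb


def ranked() -> list[tuple[str, float]]:
    """Return (name, dollars per GB) pairs, cheapest first."""
    pairs = [(name, cost_per_gb(p, c)) for name, (p, c) in MEDIA.items()]
    return sorted(pairs, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, dollars in ranked():
        print(f"{name}: ${dollars:.2f} per GB")
```

On these figures CD-R blanks work out to roughly $0.21 per GB, the hard disk to $0.61 per GB, and the USB drive to roughly $258 per GB. The comparison covers media cost only, not the drives needed to read and write them.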
Acknowledgements
Once again, it is a pleasure to acknowledge that discussions with John Bennett, Tom Gladney, Peter Lucas, and John Swinden were extremely helpful towards creating this DDQ number.

[1] The technical work remaining is what, in IBM Research, used to be facetiously called “SMOP” (a “small matter of programming”): creating prototypes that embody the solutions, then pilots, and finally industrial-strength offerings. An implicit critical distinction is that between novel thinking (as in the best computer science) and software engineering (which requires high skills different from those of computer science).
[2] Beyond this, one can in fact test the solution portion identified below in Durable Encoding by porting information between incompatible computing platforms.
[3] If these assertions are indeed correct, the exposition and justification of a “complete” solution will be lengthy and detailed beyond the patience of most readers. We seem to be forced in this direction. If any reader knows an alternative, we would be very pleased to have it communicated and explained!
[4] To prove optimality is very difficult, if not impossible. When plausibility arguments are exhausted, all that can be done is to issue a challenge to skeptics, “If you think you can do better, show the world!”
[5] An example of attempting to manage preservation by repository modifications is San Diego Supercomputer Center’s research for the (U.S.) National Archives and Records Administration. This work is evaluated in a new National Academies Report.
[6] See Neil Beagrie, Meg Bellinger, Robin Dale, Marianne Doerr, Margaret Hedstrom, Maggie Jones, Anne Kenney, Catherine Lupovici, Kelly Russell, Colin Webb, and Deborah Woodyard, Trusted Digital Repositories: Attributes and Responsibilities, RLG-OCLC Report, May 2002. The controls and audits this recommends are expensive, difficult to implement, and not proof against failures that impeach the archival integrity for which they are proposed. See DDQ 1(2).
[7] To evaluate repository management software in the light of a particular institution’s needs cannot be done without a written analysis that, depending on its depth and style, includes between 200 and 500 individual requirements statements. Commercial content management offerings, such as the IBM Content Manager™, anticipate unique customer requirements. They address these in their “out of the box” offerings with interfaces for tailoring and extensions.
If long-term preservation requires anything new of content management components, it can surely be handled as a few among many institutional tailoring actions. We currently believe that the only pertinent preservation need is replication. See Vicky Reich and David S.H. Rosenthal, LOCKSS: A Permanent Web Publishing and Access System, D-Lib Magazine, June 2001.
[8] As of July 2003, Durable Encoding for When It’s Too Late to Ask and Evidence Even After Every Witness is Dead are available upon request. What Do We Mean by “Authentic”? What’s the Real McCoy? is to appear in D-Lib 9(7) in mid-July.
[9] Waters amplifies this with “… the largely polemical debate on the relative merits of emulation and migration in [Jeff Rothenberg, Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation, a Report to the Council on Library and Information Resources, January 1999.] and [D. Bearman, Reality and Chimeras in the Preservation of Electronic Records, D-Lib Magazine 5(4), 1999]. For a more balanced view, see [Stewart Granger, Emulation as a Digital Preservation Strategy, D-Lib Magazine 6(10), October 2000]”.
[10] Donald Waters, Good Archives Make Good Scholars: Reflections on Recent Steps Toward the Archiving of Digital Information, in Council on Library and Information Resources and the Library of Congress, The State of Digital Preservation: An International Perspective, pub107, April 2002. ISBN 1-887334-92-0
[11] What is meant here by “active management” includes human intervention for each preserved holding (or group of very similar holdings) whenever its current encoding format is threatened with obsolescence. This might be as frequently as every five years. The authors are communicating two concerns. (1) For collections of more than trivial size, the implied labor is likely to be costly and to require skills not readily available, so that the resources might be beyond what archiving institutions could muster even if funding were not a problem. (2) Even with skilled labor, transformation errors might occur, be very difficult to detect, and degrade content cumulatively.
The boldface highlighting in the final sentence is a DDQ addition.
[12] Deanna Marcum and Amy Friedlander, Keepers of the Crumbling Culture: What Digital Preservation Can Learn from Library History, D-Lib Magazine 9(5), 2003.
[13] A third proposal has recently appeared in a Netherlands project, and also in a National Archives of Australia project. These consider XML as an alternative to migration and emulation. This is a curious approach, something like considering apple trees as an alternative to apples, because XML should be used together with other measures for complex data types, rather than instead of other measures.
[14] Raymond A. Lorie, A Project on Preservation of Digital Data, RLG DigiNews 5(3), June 2001. See also Raymond Lorie, Long-term Archiving of Digital Information, Proc. First ACM/IEEE-CS Joint Conf. on Digital Libraries, 346-352, June 24-28, 2001.
[15] Readers’ skepticism about this assertion is quite justified until it has been critically examined by qualified scientists. It also must be shown to lead to practical implementations. Lorie is building prototypes intended to help show practicality, but this work is, arguably, under-funded. Furthermore, see Durable Encoding below.
[16] Deanna B. Marcum, Too Much Consensus, CLIR Issues 18, Nov./Dec. 2000.
[17] About two years ago, a few authors started to use “transformative migration” to describe some preservation investigations, and “preservation emulation” to describe what Rothenberg has long advocated.
[18] A prominent investigator of digital preservation some time ago considered what we call Durable Encoding (see below) and observed that it combined a form of migration with a form of emulation. When I confirmed that this was correct, the conversation ended. I do not know why, but I cannot help wondering whether he concluded that the combination of two failing procedures, migration and emulation, must create another failing procedure.
[19] Michael Steemson, Digital Experts Search for E-Archive Permanence in Integrity and Authenticity of Digital Cultural Heritage Objects, DigiCULT Thematic Issue 1, August 2002.
Did the participants have similar conceptual problems with analog recordings, or with ink on paper? Probably not. It seems that scholarly, legal, and social expectations for information authenticity are greater than was ever previously the case. Perhaps such increasing expectations can be satisfied only with refined tools, such as unusual care in how we use natural language, as is illustrated by our reactions to Maitland’s note below.
Wittgenstein sums up philosophical problems that preceded his work with: “Most of the propositions and questions to be found in philosophical works are not false but nonsensical. Consequently, we cannot give any answer to questions of this kind, but can only point out that they are nonsensical. Most of the propositions and questions of philosophers arise from our failure to understand the logic of our language.” [Wittgenstein, Tractatus 4.003]
[20] Eileen Maitland (University of Glasgow) on the digital-preservation@jiscmail.ac.uk listserv on 29 May 2003. DDQ’s tabulated reactions address only what can be accomplished by objective procedures. They are deliberately silent about subjective aspects (opinions, values, intentions, …).
[21] Natural language allows us to construct sentences that contradict themselves, as Wittgenstein’s works repeatedly illustrate.
[22] The interpretation of a work is a human subjective decision, as are values inherent in the work, and its conceptual relationships to other entities in the world. Furthermore, there is no a priori reason for an independent observer to accept its creators’ view of a work’s meaning, value, or significance in the world. In fact, it is impossible for such an observer to know the creators’ views with objective certainty! [Wittgenstein]
[23] Seamus Ross, Changing trains at Wigan: digital preservation and the future of scholarship, November 2000.
[24] H.M. Gladney and J.L. Bennett, What Do We Mean By Authentic? What’s the Real McCoy? to appear in D-Lib Magazine, July 2003.
[25] Doing this well enough to please large audiences is a high skill that requires much practice. We call the people who acquire and exercise such skill “experts” or “artists”, and regard their work products as “authoritative”.
[26] Not even natural phenomena create anything instantaneously. That this is true even in the smallest natural events is asserted by the Heisenberg Uncertainty Principle.
[27] Raymond Leppard, Authenticity in music, Faber Music, 1988. ISBN 0-571-10088-0
[28] An overview article is being prepared to explain what challenges are addressed by the Trustworthy 100-Year Digital Objects articles. It will also explain, in non-technical terms, our approach and how it avoids problems like those articulated above in Migration, Emulation, and Durable Encoding. Finally, it will sketch the arguments for the claim that our approach is optimal.
[29] For instance, nothing about the computer or the software being used to prepare DDQ is relevant to the content of DDQ 2(2).
[30] A purely Turing-equivalent machine is, in fact, sufficient only for data that can be handled with a program class called filters. For real-time, multiprocessing applications, we need to add obvious extensions to the simple Turing-equivalent machine.
[31] The author intends to address this topic in a forthcoming article, Trustworthy 100-Year Digital Objects: Economic Effects on Archive Design.
[32] Jim Gray, Wyman Chong, Tom Barclay, Alex Szalay, and Jan Vandenberg, TeraScale SneakerNet: Using Inexpensive Disks for Backup, Archiving, and Data Exchange, May 2002.
[33] Prototype multilevel optical disks with a few layers have been demonstrated. However, “true 3-dimensional storage” will have large numbers in every dimension.
[34] See J. Ashley, M.-P. Bernal, G. W. Burr, H. Coufal, H. Guenther, J. A. Hoffnagle, C. M. Jefferson, B. Marcus, R. M. Macfarlane, R. M. Shelby, and G. T. Sincerbox, Holographic data storage, IBM J. Res. & Dev. 44(3), 2000, which is on-line at http://www.research.ibm.com/journal/rd/443/ashley.html .
[35] The assessment that follows depends on no “insider information” that might be supposed to come from my having been employed by IBM. The relationships between major marketplace factors and corporate decisions are revealed only to the few IBM managers that need to know them.
[36] The deal included the transfer of the IBM Research departments most immediately responsible for the cost/effectiveness improvements alluded to above and in DDQ 1(1). Arguably, it’s a classic example of working yourself out of a job!
[37] René Thom described such transitions in his ‘catastrophe theory’, a topic that a future issue of DDQ might take up.
[38] Leah Graham and Panagiotis Takis Metaxas, "Of Course It's True; I Saw It on the Internet" Critical Thinking in the Internet Era, Comm. ACM 46(5), 70-75, May 2003. See also Peter Neumann, E-Epistemology and Misinformation, Comm. ACM 46(5), 104.
[39] Stewart Granger, Digital Preservation and Deep Infrastructure, D-Lib Magazine 8(2), February 2002.
[40] There is no reason to believe that IBM is anything but representative of many commercial vendors.
[41] This is what corporate shareholders demand of their employees, the managers responsible for prudent disposition of resources. It says nothing about IBM’s charitable activities, which are separately managed to different criteria. For a more balanced assessment than that of Granger, see The Political Economy of Public Goods in Waters’ article cited in footnote 10.
[42] In addition to its importance for marketplace competition, schedule adherence is critical to cost control.
[43] San Jose Mercury News at http://www.bayarea.com/mld/mercurynews/news/5676110.htm.
[44] E.F. Codd, A Relational Model of Data for Large Shared Data Banks, Comm. ACM 13(6), 377-387, June 1970.
[45] In 1980, IBM had over 1100 corporate customers of its hierarchical DBMS, the Information Management System (IMS).
[46] My knowledge of this is first-hand, as at the time I managed the computing service in the IBM San Jose Research Laboratory (SJRL), where Codd did the work and where the first relational database prototype system was developed. The technical debate raged both within the SJRL, and also between the Research team and the IBM IMS product team housed at what today is called the IBM Silicon Valley Laboratory. In Research, Michael Senko (deceased) was the strongest proponent of hierarchical and network databases, and first Leonard Liu and then Frank King led the winning side.
[47] A brief history can be found in §1.3 of D.D. Chamberlin, Using the new DB2: IBM's object-relational database system, Morgan Kaufmann Publishers, 1996. ISBN 1-55860-373-5
[48] John Willinsky, Copyright Contradictions in Scholarly Publishing, First Monday 7(11), Nov. 2002.
[49] Toronto Globe and Mail, 6 Dec. 2002. See also http://www.biomedcentral.com/news/20021206/04/.
[50] San Jose Mercury News, 31 May, or see http://www.townhall.com/news/politics/200305/CUL20030530e.shtml.
[51] With a newspaper rendition in hand, it took me about five minutes to locate the controversial photograph, whose scale would make impossible the identification of anybody who happened to appear in it (nobody does).
[52] For access to this and other TechRepublic information, you will have to register. Registration is free of charge.
[53] The prices are mostly from San Jose Mercury News advertisements. Better deals might be available from on-line shopping services. To facilitate “level playing field” comparison, sales taxes and shipping costs are included in the estimates.