Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 4, Number 1, 1Q2005

 

 

 

HMG Consulting

Saratoga, CA 95070

©  2005, H.M. Gladney

 

ISSN: 1547-8610

 

 

“In less than a decade, Internet search engines have completely changed how people gather information. No longer must we run to a library to look up something; rather we can pull up relevant documents with just a few clicks …  [O]nline search engines are poised for a series of upgrades that promise to further enhance how we find what we need.”  Scientific American, February 2005[1]

Information Retrieval

Today’s most effective information discovery infrastructure consists of Internet catalogs and search engines; it is no longer the catalogs of research libraries.  Of course, research library catalogs are part of this infrastructure.

ACM SIGIR[2] has tracked the vast search literature for about four decades.  Three IEEE journals provide online numbers on recent technical developments.[3]  IEEE MultiMedia looks at the growing amount of visual information available electronically, and asks, "Is It Time for a Moratorium on Metadata?"  IEEE Intelligent Systems examines searching from cell phones.  IEEE Distributed Systems Online addresses personalization and asks, "What's Next in Web Search?"  The Digicult Thematic Issue 6 treats the topic from the perspective of cultural heritage enthusiasts.

The number of offerings is bewildering, and is likely to remain so for some time before simplification sets in.  The proliferation is driven by the potential for advertising revenue.  When I planned the current DDQ number in September, there was an upsurge of offerings.  Journalists noticed the trend, so now it is old news.  What follows is therefore a synopsis, a readers’ digest, organized to help readers make sense of the frenzy of news releases.  It also identifies tools that I find particularly helpful.[4]

More Precise Search Results (Context Sensitivity): Tools similar to Web-search tools are appearing for individual consumers’ local collections—PC files and electronic mail—and to limit search to Internet subsets, e.g., Google Scholar.  Their usefulness is enhanced by quality ratings for periodicals.[5]  Vendors also offer extensions to enterprise-confidential files, databases, and communications.

“Enterprise search engines … unlike Web search engines, can search files no matter what [their] format … or what repository contains them.  … [They] enable classification, taxonomies, personalization, profiling, agent alert, … collaborative filtering, and real-time analysis, … ability … to add servers to scale up …, metadata search, international [language] support, … fault tolerance, … security management for document access control and communication protocol encryption, … and software development kits that let users construct search-enabled applications with no need for reengineering.  The differentiating factor for enterprise search engines is how well these various features are deployed, as well as the relevance of the results they generate.” [6]

Tools are being tested for filtering and prioritization according to personal interest profiles and prior search history, and also for dynamic generation of search term refinements.[7]

Improved aggregation services and access to standard reference sources (dictionaries, thesauri, encyclopediae, …) will appeal to many users.  A current favorite is Refdesk.com.  Of interest also is the Web Reference Shelf, which is part of The Extreme Searcher's Internet Handbook.

User Interface Convenience and Information Visualization:  The simplest search results are classes (sets of object identifications), as illustrated by Google and U.C. Melvyl result deliveries, and quite imaginatively by NewsMap, which frequently updates its current news feed.  The next simplest results are sets of pairs, i.e., binary relations.  The most complex that I see frequently used are sets of triples, i.e., ternary relations.  These can be depicted as graphs;[8] see KartOO and TouchGraph.
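The three result shapes described above can be made concrete with a small sketch.  The document identifiers and relationship names below are invented for illustration; the code shows only one plausible way to hold a class, a binary relation, and a ternary relation, and to turn the triples into a labeled graph of the kind that tools such as KartOO and TouchGraph draw.

```python
from collections import defaultdict

# Hypothetical search results, invented for illustration only.
result_class = {"doc1", "doc2", "doc3"}                 # a class: a set of object identifiers
binary_relation = {("doc1", "doc2"), ("doc2", "doc3")}  # a set of pairs, e.g. "cites"
ternary_relation = {                                    # a set of triples: (subject, relationship, object)
    ("doc1", "cites", "doc2"),
    ("doc2", "cites", "doc3"),
    ("doc1", "same-author-as", "doc3"),
}


def triples_to_graph(triples):
    """Turn (subject, label, object) triples into a labeled adjacency map."""
    graph = defaultdict(list)
    for subject, label, obj in triples:
        graph[subject].append((label, obj))
    # Sort the edges so that the result does not depend on set ordering.
    return {node: sorted(edges) for node, edges in graph.items()}


graph = triples_to_graph(ternary_relation)
for node in sorted(graph):
    for label, target in graph[node]:
        print(f"{node} --{label}--> {target}")
```

Each triple becomes one labeled edge, which is why ternary relations lend themselves to graph displays while plain classes are usually shown as lists.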

The Pluck browser add-in illustrates services that monitor search results dynamically, keeping users up-to-date when new results appear for a prior query.

Grokker E.D.U. gives students access to special libraries and proprietary databases.  It selects from several search result sets collected by other search tools, which can include MSN Search, Yahoo and Google, categorizes the results, and delivers them in visual maps that show their relationships.

Geographic Searching and Sensitivity:  These topics have been discussed in the daily press, e.g., in a review of MSN Search and in praise for the Google Maps website, which can be coupled with a GPS device.  Amazon has added street-level photographs to its business directory.

Excellent aerial photographs and U.S. Geological Survey topographic maps are available on Terraserver, a prototype the Microsoft Bay Area Research Center is using for developing database technology.  Nothing approaching this level of coverage or detail seems to be available for any other region of the world, according to the U.C. Berkeley Earth Sciences and Map Library.

Of the street guide and route planning services that cover Europe, I particularly like Map24, which also covers the United States.  For now, at least, this service is free.

Favorite tools:  When I search for scholarly work, I first exhaust what is easily found with Google, then use the Univ. of California Melvyl catalog, and finally the Library of Congress catalog and similar international resources.  The U.S. Government Portal surely has counterparts in other countries.

For your local environment, consider Yahoo Desktop for full-text search in files and e-mail, and Google Picasa to browse and organize your digital photographs and other images.

To help me know where I am driving and find where I’m going, I recently acquired the Delorme Earthmate GPS receiver and its coupled Street Atlas USA™, on sale for a mere $75!

Research and Future Prospects[9]

“Currently, all search engines fail to capture the bulk of the ‘invisible Web’: resources locked up in databases and inaccessible by the engines' indexing crawlers.  These include regulatory filings at the U.S. Securities and Exchange Commission, detailed reports on charities at GuideStar and complete archives of most newspapers.”  New York Times, 26th March

The Bielefeld Academic Search Engine (BASE, at Bielefeld Univ.) developers recommend the technology of Norway’s Fast Search & Transfer, an off-shoot of the Norwegian University of Science and Technology.  Fast itself is advertising its technology for enterprise search.  The Bielefeld site also mentions two alternatives: Nutch, an open source search base, and the Apache Jakarta Lucene text search engine.

In view of the current business and scholarly interest in information discovery, and the immense literature that has not been systematically exploited, we expect many more practical enhancements.[10]  These will include combining the best features of current separate offerings.  Research groups are also investigating adding semantics to search engines that currently use only document keys and syntactic features.

Individual researchers (or, more realistically, small interest groups) will find it easy and affordable to construct search databases better suited to their particular interests than those that libraries provide.  Automatic means could keep such databases up-to-date.  This suggests possible restructuring of how and where information functionality is laid out in the Internet.  In a decade or two, libraries might be neither the most used repositories nor the preferred search providers.  They will still have a critical role in scholarly activities and in preservation of the cultural record, but it could well be different from what it is today.

Topics for Information Preservation

On Metadata

“[T]he number and variety of resources on the World Wide Web has made … resource description … central to discussions about the efficiency and evolution of this medium.  The inappropriateness of traditional schemas of resource description for web resources has encouraged … web-compatible schemas named "metadata".  While conceptually old for library and information professionals, metadata [will take a] more significant and paramount role than ever before …” [11]

However, the work to define adequate metadata schemas that can be used within the time and effort that writers, publishers, and libraries are willing to invest, and the many published recommendations and debates about various schemas, have not been matched by practical uptake.[12]  Bulterman asks, “Is It Time for a Moratorium on Metadata?” [13]

Java Standard 170 for Content Management

This proposal, Java Specification Request (JSR) 170, specifies an application programming interface for content repositories in Java 2.

“JSR 170 works on two levels.  Level 1 … governs access … at the content element level …

“With comprehensive repository functionality, Level 2 … permits complex applications to exchange data … and provides definitions for future, mature repository developments, emphasizing:

·       Read/write access: … for bi-directional interaction of content elements.  Procedure is not only checked at the document level, but also at the “properties” level, …

·       Versioning: … transparent version control within the whole content repository, [with] … easy access to various versions … [and] also problem-free modification of versions.

·       Full text search and filtering: [targeting] the entire non-binary content of a repository … [with] search … that controls the specific or sub-string search method respectively.

·       Object classes: [with] limitations … within which an applications developer can concentrate on specific content object types …

“Standardization of the methods for handling binary and text-based, as well as structured, semi-structured and unstructured data, is being examined, in addition to event monitoring, namespaces and standard properties, linking, locking and concurrency.” [14]
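To make the quoted feature list more tangible, here is a minimal sketch of a repository offering property-level read/write access, versioning, and substring search over non-binary content.  It is written in Python and is emphatically not the JSR 170 interface; the class ToyRepository, its method names, and the sample paths are all hypothetical, chosen only to illustrate the ideas.

```python
class ToyRepository:
    """In-memory stand-in for a content repository; NOT the JSR 170 API."""

    def __init__(self):
        self._versions = {}  # path -> list of property dicts, oldest first

    def write(self, path, **properties):
        """Store a new version of the content element at `path`."""
        self._versions.setdefault(path, []).append(dict(properties))
        return len(self._versions[path])  # version number, starting at 1

    def read(self, path, version=None):
        """Return the latest version, or a specific numbered one."""
        history = self._versions[path]
        return history[-1] if version is None else history[version - 1]

    def search(self, text):
        """Substring search over the text properties of the latest versions."""
        needle = text.lower()
        return sorted(
            path
            for path, history in self._versions.items()
            # Binary (bytes) properties are skipped: only str values are searched.
            if any(isinstance(v, str) and needle in v.lower()
                   for v in history[-1].values())
        )


repo = ToyRepository()
repo.write("/articles/ddq", title="Draft", body="Notes on metadata")
repo.write("/articles/ddq", title="Final", body="Notes on trustworthy information")
repo.write("/images/logo", data=b"\x89PNG")

print(repo.read("/articles/ddq")["title"])
print(repo.read("/articles/ddq", version=1)["title"])
print(repo.search("trustworthy"))
```

A real repository would add access control, locking, and persistence; the sketch keeps only enough structure to show how versioning and search interact.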

Universal Unique Identifiers Revisited

An IETF RFC (Internet Engineering Task Force Request for Comments) proposes simplification of resource identifiers.[15]  Its authors intend it only for information assets.  However many resource management applications, including library, archive, and museum catalogs, need to include descriptions of material and property assets.  Happily the details of the proposal work for all kinds of asset.
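As a rough illustration of how simple such identifiers are to handle, the sketch below splits an "info" URI into its parts.  It assumes the general shape info:namespace/identifier with an optional #fragment, and it assumes the namespace may be compared without regard to case; the function name parse_info_uri and the sample values are illustrative, and the proposal itself remains the authority on syntax and normalization.

```python
from typing import NamedTuple, Optional


class InfoURI(NamedTuple):
    namespace: str
    identifier: str
    fragment: Optional[str]


def parse_info_uri(uri: str) -> InfoURI:
    """Split an "info" URI into namespace, identifier, and optional fragment."""
    scheme, colon, rest = uri.partition(":")
    if not colon or scheme.lower() != "info":
        raise ValueError(f"not an info URI: {uri!r}")
    rest, hash_mark, fragment = rest.partition("#")
    # Only the first slash separates namespace from identifier;
    # later slashes belong to the identifier itself.
    namespace, slash, identifier = rest.partition("/")
    if not slash or not namespace or not identifier:
        raise ValueError(f"expected info:namespace/identifier, got {uri!r}")
    # Assumption: namespaces are case-insensitive, so normalize to lowercase.
    return InfoURI(namespace.lower(), identifier, fragment if hash_mark else None)


print(parse_info_uri("info:lccn/2002022641"))
print(parse_info_uri("info:ddc/22/eng//004.678"))
```

Nothing in this parsing depends on whether the identified asset is a document, a museum object, or a piece of property.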

XML Packaging Standard Proposals

XOP (XML-binary Optimized Packaging) specifies efficient serializing of XML Infosets.  A XOP package places a serialization inside an extensible packaging format (such as MIME Multipart/Related).  Selected content portions that are base64-encoded binary data are extracted, decoded from base64, and placed into the package as binary parts.
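The extraction step can be sketched as follows.  This is a simplified illustration using the Python standard library, not a conforming XOP implementation: it decodes one base64 element, replaces its content with an Include element in the XOP namespace that points at a content identifier, and returns the raw bytes that a real packager would place in a separate MIME part.  The function name extract_binary, the element names, and the content identifier are hypothetical, and the MIME packaging itself is omitted.

```python
import base64
import xml.etree.ElementTree as ET

XOP_NS = "http://www.w3.org/2004/08/xop/include"
ET.register_namespace("xop", XOP_NS)


def extract_binary(xml_text, tag, content_id):
    """Replace the base64 text of child `tag` with an xop:Include reference.

    Returns the rewritten XML and the decoded bytes for a separate part.
    """
    root = ET.fromstring(xml_text)
    element = root.find(tag)
    raw = base64.b64decode(element.text)   # decode from base64 to raw bytes
    element.text = None                    # remove the bulky text form
    include = ET.SubElement(element, f"{{{XOP_NS}}}Include")
    include.set("href", f"cid:{content_id}")
    return ET.tostring(root, encoding="unicode"), raw


payload = bytes(range(256)) * 4            # 1024 bytes of sample binary data
encoded = base64.b64encode(payload).decode("ascii")
doc = f"<doc><photo>{encoded}</photo></doc>"

xop_doc, part = extract_binary(doc, "photo", "photo1@example.org")
print(xop_doc)
print(len(encoded), "base64 characters versus", len(part), "raw bytes")
```

The size comparison printed at the end shows the motivation: base64 text is roughly a third larger than the binary data it represents.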

XFDU, a draft specification from the originators of OAIS, is for encoding and encapsulation of metadata and content for the AIPs, SIPs, and DIPs that OAIS calls for.[16]

These proposals are too new for DDQ comment, even on their relationship (compatible? conflicting?), except to say that XFDU seems compatible with the preservation document structure proposed by Evidence After Every Witness is Dead.[17]

XML Projections for 2005

W3C (the World Wide Web Consortium) is closer to adopting a multi-vendor standard for XQuery.  Jerry King, general manager for DataDirect's XML products, predicts:

·  Moving Beyond SQL: SQL pre-dates many software development cornerstones, making applications difficult to implement using current technologies.  XQuery will help with XML content management applications, XML reporting, native XML programming, data integration and Web message processing.

·  Access Relational Databases as XML: That XQuery can use XML views to query relational databases the same way that it queries XML will greatly ease developers’ jobs.

·  Access Non-Relational Data as XML: Because most data formats can easily be translated to XML, XQuery will become popular for data integration.

·  Access Distributed Data Sources: Because XQuery provides built-in facilities for loading and querying data sources anywhere on the Internet, it will be used to join, integrate, share and manipulate data on the Internet as though it were on the local file system.

·  Standards-Based Programmatic Data Access: The XQuery API for Java (XQJ), the XML equivalent to JDBC or ADO, is a powerful new Java specification for processing query results in a JDBC-like fashion. Data access component vendors will provide embeddable components that support XQuery data access through XQJ for all major databases.

King also says that skills and tools for XSLT and XML Schema will be in much demand.
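The idea of reaching relational data through an XML view can be illustrated loosely without XQuery itself.  The sketch below uses only the Python standard library: it exposes a SQLite table as an XML tree and then selects rows with the limited XPath subset that ElementTree supports.  It is an analogy for the "relational data as XML" prediction, not an XQuery or XQJ example; the table, its rows, and the function table_as_xml are invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_as_xml(connection, table):
    """Expose every row of `table` as a <row> element under a root element."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element(table)
    for values in cursor:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, values):
            ET.SubElement(row, column).text = str(value)
    return root


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE journals (title TEXT, publisher TEXT)")
connection.executemany(
    "INSERT INTO journals VALUES (?, ?)",
    [
        ("IEEE MultiMedia", "IEEE"),
        ("IEEE Intelligent Systems", "IEEE"),
        ("Comm. ACM", "ACM"),
    ],
)

view = table_as_xml(connection, "journals")
# Path query over the XML view: rows whose <publisher> child equals "IEEE".
ieee_titles = sorted(row.findtext("title") for row in view.findall("row[publisher='IEEE']"))
print(ieee_titles)
```

Once the rows are visible as XML, the same path expressions work on them as on any other XML document, which is the convenience the prediction points to.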

Digital Cultural Heritage: Problem or Opportunity?

“Experts are both in awe and in frustration about the state of the internet.  They celebrate search technology, peer-to-peer networks, and blogs; they bemoan institutions that have been slow to change. … The experts are startled that educational institutions have changed so little, …”                     Fox et al.[18]

"… digital is not generally viewed as a suitable long-term preservation archival surrogate for print.  It is currently regarded more as an access medium.  As a preservation medium, [it was seen] as unstable, experimental, immature, unproven on a mass scale and unreliable in the long-term." [19]

The second quotation needs careful attention to its context.  Whose perspective is represented?  What questions were the speakers asked to address?  It is from a poll of the directors of 16 major libraries—mostly people with a liberal arts background,[20] apparently without any technical experts.  They were asked only about digital surrogates for content already held in older formats (on paper and other media), and only about current practice, not about how means and controlling social conventions (including legal constraints) might evolve in either the near or the distant future.

I am reminded of an intellectual property attorney who told an IBM research staff audience, "You need to be careful which question you ask an attorney.  You might ask either, 'What problems might I encounter if I do X?' or 'If I choose to do X, how should I proceed to stay out of trouble?'

"Well, we attorneys are professionals, and as professionals will answer the specific question you ask.  The answer to the second is likely to be very different from that to the first, and much more useful."

It seems to me that the literature from research librarians and information scientists predominantly treats digitally-represented information as a problem, rather than as an immense opportunity.[21]  I wonder whether this impression is reasonable and, if so, why their views are pessimistic.  I would appreciate views from the digital heritage community.

Trusted Computing Data Protection

The trusted personal computer hardware platform, which provides a hardware-based secure environment rather than a software-only solution, is emerging as a powerful new tool to improve enterprise data protection and user authentication.  Industry offers many PCs and motherboards equipped with a Trusted Platform Module (TPM), a dedicated microchip enabled for security-specific capabilities.  Specifications have been developed and promoted by an industry standards organization called the Trusted Computing Group.

In contrast to the criticism that appears in Trust, Trusted, Trustworthy in DDQ 1(2), the word ‘trusted’ in the paragraph above is not misleading.  The critical distinction is that, in this case, the trusting entity is known; it is an operating system that depends on information from the Trusted Platform Module.

News Reports

Attacks on ICANN Not Deserved

While no one owns the Internet, it cannot function without ICANN (Internet Corporation for Assigned Names and Numbers)—the not-for-profit corporation that manages the Internet addressing system.  For several years ICANN has been attacked by international organizations that say the United States holds too much control over the Internet’s core functions.  ICANN CEO Paul Twomey has explained how his organization has become a lightning rod for criticism and why he thinks it is undeserved.

U.K. Freedom of Information Act

Since the beginning of the year, British citizens have been able to request information at any time and expect an answer unless an exemption applies.[22]  The 30-year rule has disappeared.  Over 50,000 files less than 30 years old have been released by The National Archives.

How Not to Write

Stylistic examples for prospective authors! ...  from secondary school essays:

1. His thoughts tumbled in his head, making and breaking alliances like underpants in a dryer without Cling Free.

2. He spoke with the wisdom that can only come from experience, like a guy who went blind because he looked at a solar eclipse without one of those boxes with a pinhole in it and now goes around the country speaking at high schools about the dangers of looking at a solar eclipse without one of those boxes with a pinhole in it.

3. She grew on him like she was a colony of E. coli and he was room-temperature Canadian beef.

4. Her vocabulary was as bad as, like, whatever.

5. He was as tall as a six-foot-three-inch tree.

6. The revelation that his marriage of 30 years had disintegrated because of his wife's infidelity came as a rude shock, like a surcharge at a formerly surcharge-free ATM.

7. The little boat gently drifted across the pond exactly the way a bowling ball wouldn't.

8. McBride fell 12 stories, hitting the pavement like a Hefty bag filled with vegetable soup.

9. The scene had an eerie, surreal quality, like when you're on vacation in another city and Jeopardy comes on television at 7:00 p.m. instead of 7:30.

10. The hailstones leaped from the pavement, just like maggots when you fry them in hot grease.

11. John and Mary had never met.  They were like two hummingbirds who had also never met.

12. He fell for her like his heart was a mob informant and she was the East River.

13. Even in his last years, Grandpappy had a mind like a steel trap, only one that had been left out so long, it had rusted shut.

14. Shots rang out, as shots are wont to do.

Reading Recommendations

The electronic proceedings from the Virtual Reference Desk 2004 Conference are available online, as are all the conference papers at the Electronic Publishing conferences from 1997 to 2004.

The Future of the Internet

The Pew Foundation has made available a study on the future of the Internet, briefly profiled in the New York Times on 11th January.

Trying to Understand Muslim Politics

To many people with a European or North American cultural tradition, Muslim political behavior must be puzzling, since many of its manifestations seem contrary to the best interests of their perpetrators and countrymen.  Books and film suggest that distrust and hatred are important beyond Western experience.

A Lawrence of Arabia scene shows an Arab League meeting quickly degenerating from co-operation to violent tribal jealousies that weakened all the participants in their dealings with the English and French.  This allowed the latter to establish vassal state governments that Arabs have ever since hated.[23]

Chapter 24 of Landes’ Wealth and Poverty of Nations[24] begins:

“No-one can understand the economic performance of Muslim nations without attending to the experience of Islam as faith and culture.  By the time Europeans entered the Indian Ocean by sea (1498), Islam had planted itself in parts of China and the Philippines, down the east coast of Africa, in southeastern Europe into the Danube basin, and along the trade routes of central Asia.

“This explosion of passion and commitment was the most important feature of Eurasian history in … the thousand years from the fall of the western Roman empire … to the overseas expansion of Christian Europe.  In this sense, it anticipates the potency of the later European imperial sweep, …

“The critical difference between the two rushes of power is the place of technology.  The Muslim rested on old ways but new men, on the fighting zeal of fast-moving, horse-mounted warriors who were convinced that God and history were on their side.    The European push was based on superior firepower and moved by profit: loot yes, but above all, continuing, sustainable profit.”

Landes’ ideas are elaborated in Bernard Lewis’ more focused and shorter historical account, What Went Wrong? Western Impact and Middle Eastern Response, a book that I believe should be on everybody’s short list of social history.[25]

A partial explanation is suggested by early chapters of Leon Uris’ novel, The Haj.[26]  If its description of a boy’s education by his father accurately depicts common behavior, from a very young age Muslim youth are trained to distrust and loot from anyone—even family and tribe members.  Similar suggestions were the core of a recent editorial,[27] which included:

“Americans are still puzzled over why well-off Islamic fundamentalists crashed planes into skyscrapers and now send mercenaries to the Sunni triangle to slaughter us as we sponsor democracy.  Yet since Sept. 11, 2001, we have grasped that Muslim fascists understood that the course of American-led world history—democracy and globalized capitalism—was leaving them behind.  Thus they strike the United States before they are made irrelevant.

“…

“The United States has adopted a rational strategy against Islamic fascism: kill the terrorists, remove illegitimate regimes that aid the extremists, foster democracies in their places and alter American policy from tolerance of the corrupt status quo to calls for reform.  Yet we cannot finish the Islamicists' war unless we understand why they started it.  For that answer, look at who Americans are and what we represent—not what we supposedly have done.”

I am reminded of a Russian tale in which only one villager owned a cow.  A genie appeared to grant a wish to another villager, but was surprised at the choice made: “Kill that guy’s cow!”

Web Mobs

“Crime is now organized on the Internet.  Operating in the anonymity of cyberspace, the Shadowcrew and Web mobs like it threaten the trust companies have spent years trying to build with customers, online.  Here's how one cybercrime network uses administrators, vendors and forums to traffick in millions of credit card accounts and Social Security numbers.”       John McCormick and Deborah Gage[28]

To learn how one identity-theft business worked, read McCormick and Gage’s account.

Practical Matters and Software Recommendations

You might find the TCP/IP and TCPDUMP Pocket Reference helpful.

Lock Down a Linksys Router in 10 Steps

TechRepublic points out that, for small businesses and home offices, Linksys routers are popular targets for hackers.  It recommends a 10 step procedure to secure such a router.

The SPX Bundle - Capture, Edit, Enhance

Lockergnome favorably reviews this package for screen shot presentations, graphic artwork, photos, and Web comics.  PC users who have not already invested in screen capture and graphics software should consider this $30 package.

Microsoft Word® Tips

http://www.word-answers.com/ helps with Microsoft Word.  It points to over 900 articles in over 100 topic areas, addressing MS Word versions 6 through 2003.

WinAudit

WinAudit reports a PC’s hardware and software configuration, complementing BelArc Advisor.  It details installed software, licenses, peripherals, memory usage, processor model, network settings, etc.

WinAudit is free and works with all Windows versions since Windows 95.  It requires no installation, and fits easily onto a floppy disk, enabling quick computer inspections with minimal effort.

Mozilla Firefox: A Performance Tip[29]

With Firefox open, enter about:config in the URL box; then enter network.http in the filter box.  In the line identified as network.http.pipelining, change the setting from "false" to "true" by double-clicking on the line.  Do the same in the line identified as network.http.proxy.pipelining.  In the line identified as network.http.pipelining.maxrequests, double-click on the line and a window will open.  Change the value to 20.

These changes enable Firefox to use network connections more efficiently and should somewhat speed up Web page retrievals.
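For readers who prefer a file to clicking, the same three settings can be written as preference lines.  The sketch below only generates the text; it assumes that Firefox reads a user.js file from the profile folder at startup and applies user_pref() lines found there, and it does not locate the profile folder or write any file.  Whether a particular Firefox release honors these preferences is not something this tip establishes.  The names PIPELINING_PREFS and user_js_lines are illustrative.

```python
# Preference names and values are taken from the tip above.
PIPELINING_PREFS = {
    "network.http.pipelining": True,
    "network.http.proxy.pipelining": True,
    "network.http.pipelining.maxrequests": 20,
}


def user_js_lines(prefs):
    """Format preferences as user_pref() lines suitable for a user.js file."""
    lines = []
    for name, value in prefs.items():
        # Check bool before formatting: booleans must appear as lowercase true/false.
        literal = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(f'user_pref("{name}", {literal});')
    return lines


for line in user_js_lines(PIPELINING_PREFS):
    print(line)
```

Copying the printed lines into user.js would be a manual step; the about:config route described above changes the same settings interactively.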

An Alternative to Windows

Technology Review senior editor Wade Roush purchased a new PC that he knew wouldn't be a fancy machine.  But it cost only $278.  He chose it because it was without any Microsoft software whatsoever.  Instead, it came with Linspire 4.5, a commercial open-source Linux version.  Plugged in, the machine revealed a glamorous new desktop screen and sophisticated help menus and audio tutorials.  Software giving Linux the look, feel, and functions of a Windows PC is increasingly available both in free, unsupported versions and in enhanced commercial versions.[30]

Home Computing Price Watch

Item                     Description                                           Price   Unit cost
USB 2.0 Memory Key       128 Mbyte                                             $20     $160/Gbyte
PC Memory                512 Mbyte PC3200 DDR                                  $35     $70/Gbyte
Digital camera storage   1 Gbyte compact flash card                            $50     $50/Gbyte
Mobile drive             USB connect 2.2 Gbyte                                 $80     $36/Gbyte
Serial-ATA HDD           120 Gbyte internal                                    $50     $0.42/Gbyte
DVD-R disks              8x                                                    $0.07   each
DVD-R disks              4x                                                    $0.05   each
DVD-ROM drive            16x                                                   $20     each
DVD writer               8x Dual ±R / ±RW                                      $50     each
Wireless Router          Airlink 54Mbps Wireless-G cable/DSL wireless router   $19     each
PC Wireless Adapter      Airlink 54Mbps Wireless-G laptop PC or PCI adapter    $15     each
Flat panel LCD display   17” .264 mm pitch, 450:1 contrast ratio               $200    each
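The per-Gbyte figures in the price watch above are simple quotients of price and capacity.  The short sketch below recomputes them from the listed prices, assuming 1 Gbyte = 1024 Mbyte for the memory key; the function name and dictionary layout are illustrative.

```python
def dollars_per_gbyte(price_dollars, capacity_gbytes):
    """Unit cost of storage: price divided by capacity in Gbyte."""
    return price_dollars / capacity_gbytes


# (price in dollars, capacity in Gbyte), taken from the price watch listing.
storage_items = {
    "USB 2.0 memory key": (20, 128 / 1024),
    "PC memory": (35, 0.5),
    "Compact flash card": (50, 1),
    "Mobile drive": (80, 2.2),
    "Serial-ATA HDD": (50, 120),
}

unit_costs = {
    name: round(dollars_per_gbyte(price, capacity), 2)
    for name, (price, capacity) in storage_items.items()
}
for name, cost in unit_costs.items():
    print(f"{name}: ${cost}/Gbyte")
```

The spread between the memory key and the hard disk, more than two orders of magnitude, is the point of listing unit costs rather than prices alone.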

 

Acknowledgements

Critique by and discussions with John Bennett, Tom Gladney, and John Swinden have helped create this DDQ number.  Their help is gratefully acknowledged.



[1]     Mostafa, Javed. Seeking Better Web Searches, Scientific American 292(2), 67-73, 2005.

[2]     ACM Special Interest Group for Information Retrieval

[3]     From What's New @ IEEE In Computing 5(12), December 2004.

[4]     However, I do expect my preferred tools to be a different set a year from now.

[5]     For instance, see Mylonopoulos, N.A. and Theoharakis, V. Global perceptions of IS journals, Comm. ACM 44(9) 29-33, Sept. 2001.  Ruth Bolotin Schwartz and Michele C. Russo, How to Quickly Find Articles in the Top IS Journals, Comm. ACM 47(2) 98-101, 2004.  See also http://63.151.43.10/csaunders/rankings.htm.

[6]     Wang, Roland. Enterprise Search: The Next Frontier, Software Development 12(12), 36, 2004.

[7]     For instance, see the Google Labs service test at http://labs.google.com/personalized/.

[8]     A recent Museums and the Web conference paper introduces graphical presentation tools.  See Addis et al., New Ways to Search, Navigate and Use Multimedia Museum Collections over the Web, in Museums and the Web 2005: Proceedings, last updated March 22, 2005.

[9]     Just as I started the final preparation of DDQ 4(1) for release, the ACM News Service announced availability of David Southgate’s Powerful Query Technology Will Optimize Knowledge Management for Project Managers, TechRepublic, March 2005.

[10]   For a longer discussion, see Asadi, S. Jamali M.H.R. Shifts in search engine development: A review of past, present and future trends in research on search engines, Webology, 1(2), Article 6, 2004.

[11]   Safari, Mehdi.  Metadata and the Web, Webology 1(2), December 2004.

[12]    Greenberg, J. Spurgin, K. Crystal, A.  Final Report for the AMeGA (Automatic Metadata Generation Applications) Project, submitted to the Library of Congress, February 2005.

[13]   Bulterman, Dick C.A. Is It Time for a Moratorium on Metadata? IEEE Multimedia 5(12), Dec. 2004.

[14]   Cadoff, Dave. Java Standard 170, ServerWorld Magazine, 2003.

[15]   Van de Sompel, Herbert. et al. The "info" URI Scheme for Information Assets with Identifiers in Public Namespaces, IETF RFC, January 2005.

[16]    CCSDS 650.0-R-2, Reference Model for an Open Archival Information System (OAIS), July 2001.

[17]   Gladney, H.M. Trustworthy 100-Year Digital Objects: Evidence After Every Witness is Dead, ACM Trans. Info. Sys. 22(3), 406-436, July 2004.  See especially its Figures 2 and 3.

[18]    Fox, Susannah. Anderson, Janna Quitney. Rainie, Lee.  The Future of the Internet, Pew Internet and American Life Project, January 2005.  See also PC World commentary.

[19]   Anonymous from the British Library, Digital versus print as a preservation format – expert views from international comparator libraries, 2005.

[20]   One must again consider the implications of C.P. Snow’s The Two Cultures, 1959.  See DDQ 2(4).

[21]   See, for instance, Beebe, Linda. Meyers, Barbara. The Unsettled State of Archiving, Journal of Electronic Publishing 4(4), June 1999.  

[22]   National Archives, Release of over 50,000 files to mark the full implementation of the Freedom of Information Act, January 2005.

[23]   See, for instance, The Sykes-Picot Agreement: 1916.

[24]    Landes, David S. The wealth and poverty of nations: why some are so rich and some so poor, W.W. Norton, 1998. ISBN: 0-393-04017-8     Chapter 24, History Gone Wrong.

[25]   Lewis, Bernard.  What Went Wrong? Western Impact and Middle Eastern Response, Oxford UP, 2002.  ISBN 0-19-514420-1

[26]   Uris, Leon.  The Haj, Bantam reprint, 1995.  ISBN 0-553-24864-2

[27]   Hanson, Victor Davis.  They Hate Us For Who We Are, Not What We Do. To Terrorists, America Symbolizes Onset Of Modernism, San Jose Mercury News, January 13, 2005.

[28]   McCormick, John. Gage, Deborah.  Shadowcrew: Web Mobs, Baseline Magazine, March 28, 2005.

[29]   Adapted from J. Teems’ Neat Net Tricks, 31st March 2005.

[30]   Adapted from Technology Review.  See http://www.technologyreview.com/articles/04/09/roush0904.asp?trk=nl