Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 4, Number 1, 1Q2005

HMG Consulting

Saratoga, CA 95070

©  2005, H.M. Gladney

ISSN: 1547-8610

In less than a decade, Internet search engines have completely changed how people gather information. No longer must we run to a library to look up something; rather we can pull up relevant documents with just a few clicks …  [O]nline search engines are poised for a series of upgrades that promise to further enhance how we find what we need.  Scientific American, February 2005[1]

Information Retrieval

Today’s most effective information discovery infrastructure consists of Internet catalogs and search engines rather than the catalogs of research libraries, although those catalogs remain part of it.

ACM SIGIR[2] has tracked the vast search literature for about four decades.  Three IEEE journals provide online numbers on recent technical developments.[3]  IEEE MultiMedia looks at the growing amount of visual information available electronically, and asks, "Is It Time for a Moratorium on Metadata?"  IEEE Intelligent Systems examines searching from cell phones.  IEEE Distributed Systems Online addresses personalization and asks, "What's Next in Web Search?"  The Digicult Thematic Issue 6 treats the topic from the perspective of cultural heritage enthusiasts.

The number of offerings is bewildering, a circumstance likely to continue for some time before simplification sets in.  This growth is driven by the potential for advertising revenue.  When I planned the current DDQ number in September, there was an upsurge of offerings.  Journalists noticed the trend, so now it is old news.  What follows is therefore a synopsis, a readers’ digest, organized to help readers make sense of the frenzy of news releases.  It also identifies tools that I find particularly helpful.[4]

More Precise Search Results (Context Sensitivity): Tools similar to Web-search tools are appearing for individual consumers’ local collections (PC files and electronic mail) and for limiting searches to subsets of the Internet, e.g., Google Scholar.  Their usefulness is enhanced by quality ratings for periodicals.[5]  Vendors also offer extensions to enterprise-confidential files, databases, and communications.

“Enterprise search engines … unlike Web search engines, can search files no matter what [their] format …. or what repository contains them.  … [They] enable classification, taxonomies, personalization, profiling, agent alert, … collaborative filtering, and real-time analysis, … ability … to add servers to scale up …, metadata search, international [language] support, … fault tolerance, … security management for document access control and communication protocol encryption, … and software development kits that let users construct search-enabled applications with no need for reengineering.  The differentiating factor for enterprise search engines is how well these various features are deployed, as well as the relevance of the results they generate.” [6]

Tools are being tested for filtering and prioritization according to personal interest profiles and prior search history, and also for dynamic generation of search term refinements.[7]
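
One way to picture prioritization by interest profile is the following minimal sketch.  It assumes that a profile is nothing more than term counts drawn from prior queries and that results are plain titles; the function names and sample data are hypothetical and do not describe any vendor's method.

```python
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphabetic words."""
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()

def build_profile(past_queries):
    """Count how often each term occurs in a user's prior queries."""
    profile = Counter()
    for query in past_queries:
        profile.update(tokenize(query))
    return profile

def rerank(results, profile):
    """Order titles by overlap with the profile; ties keep the engine's order."""
    def score(title):
        return sum(profile[term] for term in set(tokenize(title)))
    return sorted(results, key=score, reverse=True)

# Hypothetical search history and result list, invented for illustration.
history = ["digital preservation metadata", "preservation of digital archives"]
results = ["Jaguar cars for sale", "Digital preservation handbook", "Metadata basics"]
ranked = rerank(results, build_profile(history))
```

Here the title that shares the most terms with the history moves to the top, and the unrelated title falls to the bottom.  A real service would weigh far more signals than term overlap.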

Improved aggregation services and access to standard reference sources (dictionaries, thesauri, encyclopediae, …) will appeal to many users.  A current favorite is Refdesk.com.  Of interest also is the Web Reference Shelf, which is part of The Extreme Searcher's Internet Handbook.

User Interface Convenience and Information Visualization:  The simplest search results are classes (sets of object identifications), as illustrated by Google and U.C. Melvyl result deliveries, and quite imaginatively by NewsMap, which frequently updates its current news feed.  The next simplest results are sets of pairs, i.e., binary relations.  The most complex that I see frequently used are sets of triples, i.e., ternary relations.  These can be depicted as graphs;[8] see KartOO and TouchGraph.
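
The three result shapes can be written down directly as data.  The sketch below is illustrative only: the identifiers and relation names are invented, and the `to_graph` helper is a hypothetical name for one simple way to turn triples into a labeled directed graph of the kind such visualization tools draw.

```python
from collections import defaultdict

# Hypothetical sample data; every identifier here is invented for illustration.
# A class: a plain set of object identifiers, like a flat result list.
hits = {"doc1", "doc2", "doc3"}

# A binary relation: a set of pairs, here (document, document it cites).
cites = {("doc1", "doc2"), ("doc1", "doc3"), ("doc3", "doc2")}

# A ternary relation: a set of triples, here (subject, relation, object).
facts = {
    ("doc1", "cites", "doc2"),
    ("doc1", "author", "Smith"),
    ("doc3", "cites", "doc2"),
}

def to_graph(triples):
    """Turn triples into a labeled directed graph: node -> [(label, node)]."""
    graph = defaultdict(list)
    for subject, label, obj in sorted(triples):
        graph[subject].append((label, obj))
    return dict(graph)

graph = to_graph(facts)
```

Each subject becomes a node and each triple becomes a labeled edge.  A set of pairs is the special case in which all edges carry the same implicit label.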

The Pluck browser add-in illustrates services that monitor search results dynamically, keeping users up-to-date when new results appear for a prior query.

Grokker E.D.U. gives students access to special libraries and proprietary databases.  It selects from several result sets collected by other search tools, which can include MSN Search, Yahoo and Google, categorizes the results, and delivers them in visual maps that show their relationships.

Geographic Searching and Sensitivity:  These have been discussed in the daily press, e.g., in a review of MSN Search and in an article praising the Google Maps website, which can be coupled with a GPS device.  Amazon has added street-level photographs to its business directory.

Excellent aerial photographs and U.S. Geological Survey topographic maps are available on Terraserver, a prototype the Microsoft Bay Area Research Center is using for developing database technology.  Nothing approaching this level of coverage or detail seems to be available for any other region of the world, according to the U.C. Berkeley Earth Sciences and Map Library.

Of the street guide and route planning services that cover Europe, I particularly like Map24, which also covers the United States.  For now, at least, this service is free.

Favorite tools:  When I search for scholarly work, I first exhaust what is easily found with Google, then use the Univ. of California Melvyl catalog, and finally the Library of Congress catalog and similar international resources.  The U.S. Government Portal surely has counterparts in other countries.

For your local environment, consider Yahoo Desktop for full-text search in files and e-mail, and Google Picasa to browse and organize your digital photographs and other images.

To help me know where I am driving and find where I’m going, I recently acquired the DeLorme Earthmate GPS receiver and its coupled Street Atlas USA™, on sale for a mere $75!

Research and Future Prospects[9]

“Currently, all search engines fail to capture the bulk of the ‘invisible Web’, resources locked up in databases and inaccessible by the engines' indexing crawlers.  These include regulatory filings at the U.S. Securities and Exchange Commission, detailed reports on charities at GuideStar and complete archives of most newspapers.”  New York Times, 26th March

The developers of the Bielefeld Academic Search Engine (BASE, at Bielefeld Univ.) recommend the technology of Norway’s Fast Search & Transfer, an offshoot of the Norwegian University of Science and Technology.  Fast itself is advertising its technology for enterprise search.  The Bielefeld site also mentions two alternatives: Nutch, an open source search base, and the Apache Jakarta Lucene text search engine.

In view of the current business and scholarly interest in information discovery, and the immense literature that has not been systematically exploited, we expect many more practical enhancements.[10]  These will include combining the best features of current separate offerings.  Research groups are also investigating adding semantics to search engines that currently use only document keys and syntactic features.

Individual researchers (or, more realistically, small interest groups) will find it easy and affordable to construct search databases better suited to their particular interests than those that libraries provide.  Automatic means could keep such databases up-to-date.  This suggests possible restructuring of how and where information functionality is laid out in the Internet.  In a decade or two, libraries might be neither the most used repositories nor the preferred search providers.  They will still have a critical role in scholarly activities and in preservation of the cultural record, but that role could well differ from the one they have today.

Topics for Information Preservation

On Metadata

“[T]he number and variety of resources on the World Wide Web has made … resource description … central to discussions about the efficiency and evolution of this medium.  The inappropriateness of traditional schemas of resource description for web resources has encouraged … web-compatible schemas named "metadata".  While conceptually old for library and information professionals, metadata [will take a] more significant and paramount role than ever before …” [11]

However, the work to define adequate metadata schemas that can be used within the time and effort that writers, publishers, and libraries are willing to invest, and the many published recommendations and debates about various schemas, have not been matched by practical uptake.[12]  Bulterman asks, “Is It Time for a Moratorium on Metadata?” [13]

Java Specification Request 170 (JSR 170) for Content Management

This proposal specifies an application programming interface for content repositories in Java 2.

“JSR 170 works on two levels.  Level 1 … governs access … at the content element level …

“With comprehensive repository functionality, Level 2 … permits complex applications to exchange data … and provides definitions for future, mature repository developments, emphasizing:

·       Read/write access: … for bi-directional interaction of content elements.  Procedure is not only checked at the document level, but also at the “properties” level, …

·       Versioning: … transparent version control within the whole content repository, [with] … easy access to various versions … [and] also problem-free modification of versions.

·       Full text search and filtering: [targeting] the entire non-binary content of a repository … [with] search … that controls the specific or sub-string search method respectively.

·       Object classes: [with] limitations … within which an applications developer can concentrate on specific content object types …

“Standardization of the methods for handling binary and text-based, as well as structured, semi-structured and unstructured data, is being examined, in addition to event monitoring, namespaces and standard properties, linking, locking and concurrency.” [14]
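
To make the two levels concrete, here is a toy in-memory model written in Python.  It is emphatically not the JSR 170 interface, which is a Java API; the class `ToyRepository`, its method names, and the sample paths are all hypothetical.  It only sketches the ideas quoted above: property-level read access, write access that preserves earlier versions, and substring search over non-binary content.

```python
class ToyRepository:
    """In-memory stand-in for a content repository; NOT the JSR 170 API."""

    def __init__(self):
        # path -> list of versions; each version is a dict of properties.
        # Index 0 is the empty initial state, index 1 the first real version.
        self._versions = {}

    def get_property(self, path, name, version=-1):
        """Level 1 style: read one property, by default from the latest version."""
        return self._versions[path][version][name]

    def set_property(self, path, name, value):
        """Level 2 style: write a property, keeping every earlier version."""
        history = self._versions.setdefault(path, [{}])
        latest = dict(history[-1])
        latest[name] = value
        history.append(latest)

    def version_count(self, path):
        """Number of stored versions, not counting the empty initial state."""
        return len(self._versions[path]) - 1

    def search(self, substring):
        """Level 2 style: substring search over the latest text properties."""
        needle = substring.lower()
        return sorted(
            path
            for path, history in self._versions.items()
            if any(isinstance(value, str) and needle in value.lower()
                   for value in history[-1].values())
        )

# Hypothetical content, invented for illustration.
repo = ToyRepository()
repo.set_property("/articles/ddq", "title", "Information Retrieval")
repo.set_property("/articles/ddq", "title", "Trustworthy Information Retrieval")
repo.set_property("/articles/jsr", "title", "Content Repository API")
```

After these calls, `repo.get_property("/articles/ddq", "title", version=1)` returns the original title, while the default call returns the revised one.  A real repository would add access control, locking, and event monitoring, none of which this sketch attempts.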

Universal Unique Identifiers Revisited

An IETF RFC (Internet Engineering Task Force Request for Comments) proposes simplification of resource identifiers.