In the past, global networks have usually transported textual information, but there is a growing need for these networks to transport other forms of information such as images, video, and audio. Until recently, electronic information sources served mainly specialized clients, but these sources will now be accessed by a wide range of users, from computer specialists, discipline experts, and engineers to the general public, including novice computer users and students at all levels.
These trends have created an emerging,
important discipline: digital libraries. Several US agencies,
including NASA, ARPA, and NSF, have made a considerable amount of
money available over the past few years to support research in this
field. Other countries, including Canada, the UK, France, Italy, and
the Netherlands, have also invested in digital library development:
National Library of Canada Electronic Collection: http://www.nlc-bnc.ca/eppp/e-coll-e.h
Initiative for Access, British Library Board: http://portico.bl.uk/access/overview.html
International Institute for Electronic Library Research (involved in several projects): http://ford.mk.dmu.ac.uk
Elite Project (Italy): http://cosimo.ing.unifi.it/research/elite/elitinfo.html
As a result of
these activities, a number of recent symposia, workshops, and
conferences have been devoted to digital library issues, and
several journals have published editorials about digital libraries,
including Computer and IEEE Transactions on Knowledge and Data
Engineering.
A typical digital library uses a variety of database-management systems. Current DBMSs range from relational and extended relational systems to object-oriented database systems. Relational DBMSs are most often used for the storage of metadata and indexes with attributes that contain pointers to files in a file system. Most of the commercial RDBMSs also support the storage of Binary Large Objects (BLOBs); in an Oracle RDBMS, BLOBs can be as large as 2 Gbytes. Object-oriented database systems are slowly gaining acceptance and overcoming earlier performance and implementation problems. An OODBS can make it easier to model, store, and work with real-world objects such as images or maps.
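The arrangement described above, in which a relational table holds metadata and an attribute points to a file in the file system, can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table name, column names, and file paths are hypothetical and do not come from any particular digital library.

```python
import sqlite3


def build_catalog(conn):
    """Create a metadata table whose file_path column points into a file system."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               doc_id     INTEGER PRIMARY KEY,
               title      TEXT NOT NULL,
               media_type TEXT NOT NULL,
               file_path  TEXT NOT NULL   -- pointer to the stored object
           )"""
    )
    # Index the attribute that queries are expected to filter on.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_media_type ON documents(media_type)"
    )


def add_document(conn, title, media_type, file_path):
    """Insert one metadata row and return its generated identifier."""
    cur = conn.execute(
        "INSERT INTO documents (title, media_type, file_path) VALUES (?, ?, ?)",
        (title, media_type, file_path),
    )
    conn.commit()
    return cur.lastrowid


def find_paths(conn, media_type):
    """Return the file paths of all documents of the given media type."""
    rows = conn.execute(
        "SELECT file_path FROM documents WHERE media_type = ? ORDER BY doc_id",
        (media_type,),
    )
    return [row[0] for row in rows]
```

The database stores only descriptive attributes and a path, so the large objects themselves can live on whatever storage suits them.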
Compression techniques save storage. For text-only documents, the Unix compress or freeware gzip utilities provide anywhere from 10- to 60-percent compression. Several compression standards exist for digital images (JPEG), audio (uLaw), and video (MPEG).
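As a rough illustration of how storage savings might be measured for text, the sketch below uses Python's standard gzip module. The function name is an assumption made for this example, and actual savings depend heavily on the content being compressed.

```python
import gzip


def compression_savings(data: bytes) -> float:
    """Return the fraction of bytes saved by gzip.

    0.0 means no savings; the value can be negative for data that
    does not compress, because gzip adds a small header.
    """
    if not data:
        return 0.0
    compressed = gzip.compress(data)
    return 1.0 - len(compressed) / len(data)
```

Calling compression_savings on the bytes of a text document returns, for example, 0.4 when the compressed copy is 40 percent smaller than the original.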
Digital library collections
that are too large to store entirely on a disk use hierarchical
storage mechanisms (HSMs). In an HSM, the most frequently used data is
kept on fast disks while less frequently used data is kept in near-line
storage such as an automated (for example, robotic) tape library. Using
data-usage statistics, the HSM can automatically migrate data from
tape (near-line) storage to disk (on-line) and back, as required.
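The migration behavior described above can be sketched with a toy two-tier store. This is only an illustration under simplifying assumptions: the class name, the object-count capacity, and the least-frequently-used demotion rule are all hypothetical, and a real HSM would use richer usage statistics and policies.

```python
from dataclasses import dataclass, field


@dataclass
class HierarchicalStore:
    """Toy HSM: a small 'disk' tier in front of an unbounded 'tape' tier."""

    disk_capacity: int                       # hypothetical: max objects on disk
    disk: set = field(default_factory=set)
    tape: set = field(default_factory=set)
    access_counts: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.disk_capacity < 1:
            raise ValueError("disk_capacity must be at least 1")

    def add(self, name):
        """New objects start on tape with no recorded accesses."""
        if name in self.access_counts:
            raise ValueError(f"{name!r} is already stored")
        self.tape.add(name)
        self.access_counts[name] = 0

    def read(self, name):
        """Record the access and make sure the object ends up on disk."""
        self.access_counts[name] += 1        # raises KeyError if unknown
        if name in self.tape:
            self._migrate_to_disk(name)
        return name

    def _migrate_to_disk(self, name):
        if len(self.disk) >= self.disk_capacity:
            # Demote the least-used disk resident; ties are broken by name.
            victim = min(self.disk, key=lambda n: (self.access_counts[n], n))
            self.disk.remove(victim)
            self.tape.add(victim)
        self.tape.remove(name)
        self.disk.add(name)
```

With a disk capacity of two objects, reading a third object from tape pushes the least-used disk resident back to tape.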
A user interface for digital libraries must display
large volumes of data effectively. Typically the user is presented
with one or more overlapping windows that can be resized and
rearranged. In digital libraries, a large amount of data spread
through a number of resources necessitates intuitive interfaces for
users to query and retrieve information. The ability to smoothly
change the user's perspective from high-level summary
information down to a specific paragraph of a document or a scene
from a film remains a challenge to user interface researchers.
Automated classification systems differ significantly in their approaches, depending on the type of content under consideration. Classifying short stories is quite different from classifying maps, both in terms of the mechanics involved and the appropriate classes. These distinctions make current automated classification efforts highly domain-specific.
Automated document
classification methods can be grouped into two general approaches, but
neither can yet capture the meaning of words in the documents. Image
classification approaches are conceptually different from those used
for text classification. Although many domain-specific systems allow
"content-based" querying, most are restricted to a very narrow range of
images and may require the services of human classifiers. Video
classification and indexing require systems that can parse video into
manageable portions, typically called camera shots. As with image
classification, the type of classification and indexing performed on
video is driven by the types of queries posed by users. The
classification of audio, musical notation, and maps presents
additional research challenges.
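One simple way to parse video into camera shots is to compare intensity histograms of consecutive frames and mark a boundary where they differ sharply. The sketch below illustrates that idea only; the text does not prescribe a method, and the frame representation (flat lists of 0 to 255 intensities of equal length), the bin count, and the threshold are all assumptions made for the example.

```python
def histogram(frame, bins=4, max_value=256):
    """Count how many pixel intensities fall into each of `bins` equal ranges."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return counts


def shot_boundaries(frames, threshold=0.5):
    """Return the indices of frames that appear to start a new shot.

    Assumes every frame is a non-empty list of the same length.
    """
    boundaries = []
    for i in range(1, len(frames)):
        prev_hist = histogram(frames[i - 1])
        curr_hist = histogram(frames[i])
        diff = sum(abs(a - b) for a, b in zip(prev_hist, curr_hist))
        # The largest possible difference is twice the pixel count,
        # so this ratio lies between 0.0 and 1.0.
        if diff / (2 * len(frames[i])) > threshold:
            boundaries.append(i)
    return boundaries
```

Abrupt cuts produce large histogram differences, while gradual transitions such as fades would need a more careful method than this sketch provides.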
The success of information retrieval can be measured in terms of the percentage of relevant and extraneous information retrieved. It is difficult to pinpoint quantitatively the effectiveness of information retrieval; only an individual user can determine what is truly useful. Techniques to improve retrieval effectiveness include preprocessing documents to extract additional metadata before storing them in a document database.
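Two measures commonly used to quantify the proportion of relevant and extraneous results are precision (the fraction of retrieved items that are relevant) and recall (the fraction of relevant items that were retrieved). The text does not name these measures, so the sketch below is offered as one conventional reading; the relevance judgments are assumed to come from a user.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers the system returned
    relevant:  identifiers a user judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If a query returns four documents, two of which are among the three documents the user considers relevant, precision is 2/4 and recall is 2/3.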
Several
researchers are focusing on automating the creation and maintenance of
user profiles and applying these profiles to information
retrieval. Software agents are an extension of filtering techniques,
although filtering tends to imply passive mechanisms whereas the use
of agents implies a more proactive approach. Many people have put
forth definitions of software agents, ranging from an adaptable
information filter to an autonomous program that works in conjunction
with or on behalf of a human user. Software agents also embody the
notion of improving over time as they record additional user actions
and reactions.
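The idea of a profile that improves as it records user reactions can be sketched with a very small word-weight filter. The class and method names are hypothetical, and the scoring rule (add one for each word in a liked document, subtract one for each word in a disliked one) is a deliberate simplification rather than a technique taken from the research mentioned above.

```python
from collections import Counter


class ProfileFilter:
    """Toy user profile: per-word weights learned from like/dislike feedback."""

    def __init__(self):
        self.weights = Counter()

    def record_feedback(self, text, liked):
        """Raise or lower the weight of each distinct word in the text."""
        delta = 1 if liked else -1
        for word in set(text.lower().split()):
            self.weights[word] += delta

    def score(self, text):
        """Sum the learned weights of the distinct words in the text."""
        return sum(self.weights[word] for word in set(text.lower().split()))

    def rank(self, documents):
        """Order documents from highest to lowest profile score."""
        return sorted(documents, key=self.score, reverse=True)
```

Each additional piece of feedback changes the weights, so later rankings reflect more of the user's recorded reactions.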
Increased demands for networking bandwidth come from two main fronts. First, the number of digital library users will undoubtedly increase. If the Internet is any indication, exponential growth in the number of users will be the rule. Second, as the delivery of multimedia data becomes the norm, the demands for high bandwidth increase. However, high bandwidth, in and of itself, is not enough to support digital libraries. The intelligent use of bandwidth and the ability to guarantee bandwidth for a given time period are also required.
Today's open networking standards such as TCP/IP and
the ever-growing Internet make it clear that successful digital
libraries must be built on an open, interoperable networking
infrastructure. Current digital libraries may be run exclusively on a
single computer, on several computers connected on a LAN, or on a
large number of computers spread out over a wide area
network. Delivery systems that require high bandwidth, such as video
and image libraries, are predominantly installed on LANs that run at
10 to 100 Mbps. In contrast, the Internet's major
backbones run at 1.5 Mbps to 150 Mbps, while links to individual
organizations fall in the 56 Kbps to 1.5 Mbps range. Individual users
typically connect to the Internet through service providers, local
universities, or other organizations at 2.4 Kbps to 28.8 Kbps.
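To see what these link speeds mean in practice, the following sketch estimates the ideal transfer time for a single object. The 1-Mbyte image size is a hypothetical example, and the calculation ignores protocol overhead, latency, and congestion, so real transfers would take longer.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal time to move size_bytes over a link, ignoring all overhead."""
    return size_bytes * 8 / link_bits_per_second


# Link speeds taken from the ranges quoted in the text, in bits per second.
LINKS = {
    "individual user, 28.8 Kbps": 28_800,
    "organization link, 1.5 Mbps": 1_500_000,
    "LAN, 10 Mbps": 10_000_000,
}

IMAGE_BYTES = 1_000_000  # hypothetical 1-Mbyte image

if __name__ == "__main__":
    for name, speed in LINKS.items():
        print(f"{name}: {transfer_seconds(IMAGE_BYTES, speed):.1f} s")
```

Under these assumptions the same image needs under a second on a 10-Mbps LAN but more than four and a half minutes over a 28.8-Kbps connection.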
There may also be times when a small group of individuals wants access to a portion of digital library content, such as when authors are preparing initial drafts of a document. In these cases, security mechanisms must be put into place to ensure that only authorized users gain access. Current digital libraries employ the basic security measures offered by the supporting operating systems. For example, any digital library running on Unix can restrict access using username and password authentication and protect files using group membership and file-access rights. This basic security will not meet the demands of large-scale digital libraries.
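The Unix group-membership and file-access check mentioned above can be sketched as a small function over the traditional permission bits. The function name and the numeric user and group identifiers are hypothetical, and the sketch deliberately ignores the superuser and access control lists.

```python
import stat


def may_read(mode, owner_uid, group_gid, uid, gids):
    """Apply the classic Unix read check: owner first, then group, then others.

    mode:      permission bits of the file, for example 0o640
    owner_uid: user id that owns the file
    group_gid: group id assigned to the file
    uid, gids: identity and group memberships of the requesting user
    """
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if group_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)
```

For a draft stored with mode 0o640, the owning author and members of the file's group can read it, while all other users are refused.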
Finally, digital libraries must protect the identity of their users, who may wish to browse content that may be embarrassing.
The task force sponsors activities that benefit its members and the profession. Such activities include sponsoring and cosponsoring symposia, sessions in large conferences, tutorials, and a newsletter, edited by Prof. Erich Neuhold, GMD-IPSI. Send newsletter contributions (news, brief articles, conference announcements) to neuhold@darmstadt.gmd.de. The task force cosponsored the Forum on Research and Technology Advances in Digital Libraries, held last May at the Library of Congress, and is cosponsoring the International Journal of Digital Libraries, which Springer-Verlag will begin publishing this year.
The executive committee of the task force includes Nabil R. Adam (chair), Rutgers University; David Choy, IBM Almaden Research Center; Milton Halem, NASA Goddard Space Flight Center; Nahum Gershon, Mitre; Erich Neuhold, GMD-IPSI; and Yelena Yesha, UMBC/CESDIS.
Membership in the Task Force on Digital Libraries is free. We invite you to join and contribute ideas, suggestions, comments, and time. For more information, visit our home page at http://cimic3.rutgers.edu/ieee_dltf.html, reach it through the IEEE Computer Society's home page at http://www.computer.org, or send e-mail to adam@adam.rutgers.edu.
The 1997 International Conference on the Advances in Digital Libraries Home Page.