Part 6
An underlying assumption of the CLASS Project from the beginning was that it would develop a network application. Project staff scan books at a workstation located in the library, near the brittle material. An image-server filing system is located at a distance from that workstation, and a printer is located in another building. All of the materials digitized and stored on the image-filing system are cataloged in the on-line catalogue. In fact, a record for each of these electronic books is stored in the RLIN database so that a record of what is in the digital library exists through standard catalogue procedures. In the future, researchers working from their own workstations in their offices, or on their networks, will have access--wherever they might be--through a request server being built into the new digital library. A second assumption is that the preferred means of finding the material will be by looking through a catalogue. PERSONIUS described the scanning process, which uses a prototype scanner being developed by Xerox and which scans a very high resolution image at great speed. Another significant feature, because this is a preservation application, is that the pages, which fall apart, are placed one by one on the platen. Ordinarily, a scanner could be used with some sort of a document feeder, but that is not feasible for this brittle material. Further, because CLASS is a preservation application, after the paper replacement is made, a very careful quality control check is performed. The original book is compared to the printed copy and verification is made, before proceeding, that all of the image, all of the information, has been captured. Then, a new library book is produced: The printed images are rebound by a commercial binder and a new book is returned to the shelf.
Significantly, the books returned to the library shelves are beautiful and useful replacements on acid-free paper that should last a long time, in effect, the equivalent of preservation photocopies. Thus, the project has a library of digital books. In essence, CLASS is scanning and storing books as 600 dot-per-inch bit-mapped images, compressed using Group 4 CCITT (i.e., the French acronym for International Consultative Committee for Telegraph and Telephone) compression. They are stored as TIFF files on an optical filing system that is composed of a database used for searching and locating the books and an optical jukebox that stores 64 twelve-inch platters. A very-high-resolution printed copy of these books at 600 dots per inch is created, using a Xerox DocuTech printer to make the paper replacements on acid-free paper.
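To give a sense of scale for 600 dot-per-inch bit-mapped images, the following sketch estimates the uncompressed size of a single bitonal page. The 6 x 9 inch page size and the function name are illustrative assumptions, not figures from the project; rows are assumed to be padded to whole bytes, and the size after Group 4 compression depends on page content and is not estimated here.

```python
def raw_bitonal_bytes(width_in: float, height_in: float, dpi: int = 600) -> int:
    """Uncompressed size in bytes of a 1-bit-per-pixel page image.

    Assumes each pixel row is padded to a whole number of bytes.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px + 7) // 8  # round up to a full byte
    return bytes_per_row * height_px


# Hypothetical 6 x 9 inch page; the page size is an assumption.
page_bytes = raw_bitonal_bytes(6, 9)
print(f"{page_bytes / 1_000_000:.2f} MB per uncompressed page")
```

Under these assumptions a single page occupies a little under two and a half million bytes before compression, which suggests why compression and an optical jukebox figure in the design.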
PERSONIUS maintained that the CLASS Project presents an opportunity to introduce people to books as digital images by using a paper medium. Books are returned to the shelves while people are also given the ability to print on demand--to make their own copies of books. (PERSONIUS distributed copies of an engineering journal published by engineering students at Cornell around 1900 as an example of what a print-on-demand copy of material might be like. This very cheap copy would be available to people to use for their own research purposes and would bridge the gap between an electronic work and the paper that readers like to have.) PERSONIUS then attempted to illustrate a very early prototype of networked access to this digital library. Xerox Corporation has developed a prototype of a view station that can send images across the network to be viewed.
The particular library brought down for demonstration contained two mathematics books. CLASS is developing and will spend the next year developing an application that allows people at workstations to browse the books. Thus, CLASS is developing a browsing tool, on the assumption that users do not want to read an entire book from a workstation, but would prefer to be able to look through and decide if they would like to have a printed copy of it.
******
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DISCUSSION * Re retrieval software * "Digital file copyright" * Scanning rate during production * Autosegmentation * Criteria employed in selecting books for scanning * Compression and decompression of images * OCR not precluded *
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
During the question-and-answer period that followed her presentation, PERSONIUS made these additional points:
* Re retrieval software, Cornell is developing a Unix-based server as well as clients for the server that support multiple platforms (Macintosh, IBM and Sun workstations), in the hope that people from any of those platforms will retrieve books; a further operating assumption is that standard interfaces will be used as much as possible, where standards can be put in place, because CLASS considers this retrieval software a library application and would like to be able to look at material not only at Cornell but at other institutions.
* The phrase "digital file copyright by Cornell University" was added at the advice of Cornell's legal staff with the caveat that it probably would not hold up in court. Cornell does not want people to copy its books and sell them but would like to keep them available for use in a library environment for library purposes.
* In production the scanner can scan about 300 pages per hour, capturing 600 dots per inch.
* The Xerox software has filters to scan halftone material and avoid the moire patterns that occur when halftone material is scanned. Xerox has been working on hardware and software that would enable the scanner itself to recognize this situation and deal with it appropriately--a kind of autosegmentation that would enable the scanner to handle halftone material as well as text on a single page.
* The books subjected to the elaborate process described above were selected because CLASS is a preservation project. The first 500 books came from Cornell's mathematics collection: they were still being heavily used and, although they were in need of preservation, the mathematics library and the mathematics faculty were uncomfortable having them microfilmed. (They wanted a printed copy.) Thus, these books became a logical choice for this project. Other books were chosen by the project's selection committees for experiments with the technology, as well as to meet a demand or need.
* Eventually, images will be decompressed before they are sent over the line. At this time they are compressed and sent to the image filing system and then sent to the printer as compressed images; they are returned to the workstation as compressed 600-dpi images, and the workstation decompresses and scales them for display--an inefficient way to access the material, though it works quite well for printing and other purposes.
* CLASS is also decompressing on Macintosh and IBM, a slow process right now. Eventually, compression and decompression will take place on an image conversion server. Trade-offs will be made, based on future performance testing, concerning where the file is compressed and what resolution image is sent.
* OCR has not been precluded; images are being stored that have been scanned at a high resolution, which presumably would suit them well to an OCR process. Because the material being scanned is about 100 years old and was printed with less-than-ideal technologies, very early and preliminary tests have not produced good results. But the project is capturing an image that is of sufficient resolution to be subjected to OCR in the future. Moreover, the system architecture and the system plan have a logical place to store an OCR image if it has been captured. But that is not being done now.
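The scaling step mentioned in the points above (a workstation reducing a decompressed 600-dpi bitonal image for display) can be sketched as a simple box filter that averages blocks of black-and-white pixels into gray levels. This is only an illustration under stated assumptions, not the method used by the Xerox or Cornell software: the function name, the list-of-lists bitmap representation, and the integer reduction factor are all hypothetical.

```python
def scale_to_gray(bitmap: list[list[int]], factor: int) -> list[list[int]]:
    """Reduce a bitonal bitmap (1 = black, 0 = white) by an integer factor.

    Each factor x factor block becomes one gray pixel in 0..255, where 0 is
    black and 255 is white. Partial blocks at the right and bottom edges
    are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    rows = len(bitmap) // factor
    cols = (len(bitmap[0]) // factor) if bitmap else 0
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Count black pixels in the current block.
            black = sum(
                bitmap[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            )
            out_row.append(255 - (255 * black) // (factor * factor))
        out.append(out_row)
    return out


# A 4 x 4 test pattern reduced by a factor of 2.
page = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
print(scale_to_gray(page, 2))  # [[0, 255], [192, 192]]
```

A factor of 8, for example, would take a 600-dpi image down to 75 dpi. Doing this work on every workstation for every page is the kind of cost that the trade-offs mentioned above would weigh against scaling on a server.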
******
SESSION III. DISTRIBUTION, NETWORKS, AND NETWORKING: OPTIONS FOR DISSEMINATION
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ZICH * Issues pertaining to CD-ROMs * Options for publishing in CD-ROM *
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Robert ZICH, special assistant to the associate librarian for special projects, Library of Congress, and moderator of this session, first noted the blessed but somewhat awkward circumstance of having four very distinguished people representing networks and networking or at least leaning in that direction, while lacking anyone to speak from the strongest possible background in CD-ROMs. ZICH expressed the hope that members of the audience would join the discussion. He stressed the subtitle of this particular session, "Options for Dissemination," and, concerning CD-ROMs, the importance of determining when it would be wise to consider dissemination in CD-ROM versus networks. A shopping list of issues pertaining to CD-ROMs included: the grounds for selecting commercial publishers, and in-house publication where possible versus nonprofit or government publication. A similar list for networks included: determining when one should consider dissemination through a network, identifying the mechanisms or entities that exist to place items on networks, identifying the pool of existing networks, determining how a producer would choose between networks, and identifying the elements of a business arrangement in a network.
Options for publishing in CD-ROM: an outside publisher versus self-publication. If an outside publisher is used, it can be nonprofit, such as the Government Printing Office (GPO) or the National Technical Information Service (NTIS), in the case of government. The pros and cons associated with employing an outside publisher are obvious. Among the pros, there is no trouble getting accepted. One pays the bill and, in effect, goes one's way. Among the cons, when one pays an outside publisher to perform the work, that publisher will perform the work it is obliged to do, but perhaps without the production expertise and skill in marketing and dissemination that some would seek. There is the body of commercial publishers that do possess that kind of expertise in distribution and marketing but that obviously are selective. In self-publication, one exercises full control, but then one must handle matters such as distribution and marketing. Such are some of the options for publishing in the case of CD-ROM.
In the case of technical and design issues, which are also important, there are many matters which many at the Workshop already knew a good deal about: retrieval system requirements and costs, what to do about images, the various capabilities and platforms, the trade-offs between cost and performance, concerns about local-area networkability, interoperability, etc.
******
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
LYNCH * Creating networked information is different from using networks as an access or dissemination vehicle * Networked multimedia on a large scale does not yet work * Typical CD-ROM publication model a two-edged sword * Publishing information on a CD-ROM in the present world of immature standards * Contrast between CD-ROM and network pricing * Examples demonstrated earlier in the day as a set of insular information gems * Paramount need to link databases * Layering to become increasingly necessary * Project NEEDS and the issues of information reuse and active versus passive use * X-Windows as a way of differentiating between network access and networked information * Barriers to the distribution of networked multimedia information * Need for good, real-time delivery protocols * The question of presentation integrity in client-server computing in the academic world * Recommendations for producing multimedia
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Clifford LYNCH, director, Library Automation, University of California, opened his talk with the general observation that networked information constituted a difficult and elusive topic because it is something just starting to develop and not yet fully understood. LYNCH contended that creating genuinely networked information was different from using networks as an access or dissemination vehicle and was more sophisticated and more subtle. He invited the members of the audience to extrapolate, from what they heard about the preceding demonstration projects, to what sort of a world of electronic information--scholarly, archival, cultural, etc.--they wished to end up with ten or fifteen years from now. LYNCH suggested that to extrapolate directly from these projects would produce unpleasant results.
Putting the issue of CD-ROM in perspective before getting into generalities on networked information, LYNCH observed that those engaged in multimedia today who wish to ship a product, so to say, probably do not have much choice except to use CD-ROM: networked multimedia on a large scale basically does not yet work because the technology does not exist. For example, anybody who has tried moving images around over the Internet knows that this is an exciting touch-and-go process, a fascinating and fertile area for experimentation, research, and development, but not something that one can become deeply enthusiastic about committing to production systems at this time.
This situation will change, LYNCH said. He differentiated CD-ROM from the practices that have been followed up to now in distributing data on CD-ROM. For LYNCH the problem with CD-ROM is not its portability or its slowness but the two-edged sword of having the retrieval application and the user interface inextricably bound up with the data, which is the typical CD-ROM publication model. It is not a case of publishing data but of distributing a typically stand-alone, typically closed system, all--software, user interface, and data--on a little disk. Hence, all the between-disk navigational issues as well as the impossibility in most cases of integrating data on one disk with that on another. Most CD-ROM retrieval software does not network very gracefully at present. However, in the present world of immature standards and lack of understanding of what network information is or what the ground rules are for creating or using it, publishing information on a CD-ROM does add value in a very real sense.
LYNCH drew a contrast between CD-ROM and network pricing and in doing so highlighted something bizarre in information pricing. A large institution such as the University of California has vendors who will offer to sell information on CD-ROM for a price per year in four digits, but for the same data (e.g., an abstracting and indexing database) on magnetic tape, regardless of how many people may use it concurrently, will quote a price in six digits.
What is packaged with the CD-ROM in one sense adds value--a complete access system, not just raw, unrefined information--although it is not generally perceived that way. This is because the access software, although it adds value, is viewed by some people, particularly in the university environment where there is a very heavy commitment to networking, as being developed in the wrong direction.
Given that context, LYNCH described the examples demonstrated as a set of insular information gems--Perseus, for example, offers nicely linked information, but would be very difficult to integrate with other databases, that is, to link seamlessly with source files from other sources. It resembles an island, and in this respect is similar to numerous stand-alone projects that are based on videodiscs, that is, on the single-workstation concept.
As scholarship evolves in a network environment, the paramount need will be to link databases. We must link personal databases to public databases, to group databases, in fairly seamless ways--which is extremely difficult in the environments under discussion with copies of databases proliferating all over the place.
The notion of layering also struck LYNCH as lurking in several of the projects demonstrated. Several databases in a sense constitute information archives without a significant amount of navigation built in. Educators, critics, and others will want a layered structure--one that defines or links paths through the layers to allow users to reach specific points. In LYNCH's view, layering will become increasingly necessary, and not just within a single resource but across resources (e.g., tracing mythology and cultural themes across several classics databases as well as a database of Renaissance culture). This ability to organize resources, to build things out of multiple other things on the network or select pieces of it, represented for LYNCH one of the key aspects of network information.
Contending that information reuse constituted another significant issue, LYNCH commended to the audience's attention Project NEEDS (i.e., National Engineering Education Delivery System). This project's objective is to produce a database of engineering courseware as well as the components that can be used to develop new courseware. In a number of the existing applications, LYNCH said, the issue of reuse (how much one can take apart and reuse in other applications) was not being well considered. He also raised the issue of active versus passive use, one aspect of which is how much information will be manipulated locally by users. Most people, he argued, may do a little browsing and then will wish to print. LYNCH was uncertain how these resources would be used by the vast majority of users in the network environment.
LYNCH next said a few words about X-Windows as a way of differentiating between network access and networked information. A number of the applications demonstrated at the Workshop could be rewritten to use X across the network, so that one could run them from any X-capable device--a workstation, an X terminal--and transact with a database across the network. Although this opens up access a little, assuming one has enough network capacity to handle it, it does not provide an interface to develop a program that conveniently integrates information from multiple databases. X is a viewing technology that has limits. In a real sense, it is just a graphical version of remote log-in across the network. X-type applications represent only one step in the progression towards real access.
LYNCH next discussed barriers to the distribution of networked multimedia information. The heart of the problem is a lack of standards to provide the ability for computers to talk to each other, retrieve information, and shuffle it around fairly casually. At the moment, little progress is being made on standards for networked information; for example, present standards do not cover images, digital voice, and digital video. A useful tool kit of exchange formats for basic texts is only now being assembled. The synchronization of content streams (i.e., synchronizing a voice track to a video track, establishing temporal relations between different components in a multimedia object) constitutes another issue for networked multimedia that is just beginning to receive attention.
Underlying network protocols also need some work; good, real-time delivery protocols on the Internet do not yet exist. In LYNCH's view, highly important in this context is the notion of networked digital object IDs, the ability of one object on the network to point to another object (or component thereof) on the network. Serious bandwidth issues also exist. LYNCH was uncertain if billion-bit-per-second networks would prove sufficient if numerous people ran video in parallel.
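The bandwidth concern can be made concrete with simple arithmetic: the number of video streams a link can carry in parallel is its usable capacity divided by the rate of one stream. The sketch below is illustrative only; the function name, the per-stream rates, and the usable-fraction parameter are assumptions and were not given in the talk.

```python
def max_concurrent_streams(
    link_bps: float, stream_bps: float, usable_fraction: float = 1.0
) -> int:
    """Whole number of equal-rate streams that fit on a link.

    usable_fraction discounts capacity lost to protocol overhead and
    other traffic; its value is an assumption supplied by the caller.
    """
    if stream_bps <= 0:
        raise ValueError("stream_bps must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return int(link_bps * usable_fraction // stream_bps)


GIGABIT = 1_000_000_000  # a billion bits per second

# Hypothetical per-stream rates in bits per second.
for rate in (1_500_000, 45_000_000):
    print(rate, max_concurrent_streams(GIGABIT, rate))
```

With these assumed rates, a billion-bit-per-second link carries a few hundred low-rate streams but only about two dozen high-rate ones, so whether such a network suffices depends heavily on the video quality users expect and on how many of them share the link.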
LYNCH concluded by offering an issue for database creators to consider, as well as several comments about what might constitute good trial multimedia experiments. In a networked information world the database builder or service builder (publisher) does not exercise the same extensive control over the integrity of the presentation; strange programs "munge" with one's data before the user sees it. Serious thought must be given to what guarantees integrity of presentation. Part of that is related to where one draws the boundaries around a networked information service. This question of presentation integrity in client-server computing has not been stressed enough in the academic world, LYNCH argued, though commercial service providers deal with it regularly.
Concerning multimedia, LYNCH observed that good multimedia at the moment is hideously expensive to produce. He recommended producing multimedia with a very high sale value, multimedia with a very long life span, or multimedia with a very broad usage base, whose costs can therefore be amortized among large numbers of users. In this connection, historical and humanistically oriented material may be a good place to start, because it tends to have a longer life span than much of the scientific material, as well as a wider user base. LYNCH noted, for example, that American Memory fits many of the criteria outlined. He remarked on the extensive discussion about bringing the Internet or the National Research and Education Network (NREN) into the K-12 environment as a way of helping the American educational system.
LYNCH closed by noting that the kinds of applications demonstrated struck him as excellent justifications of broad-scale networking for K-12, but that at this time no "killer" application exists to mobilize the K-12 community to obtain connectivity.
******
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DISCUSSION * Dearth of genuinely interesting applications on the network a slow-changing situation * The issue of the integrity of presentation in a networked environment * Several reasons why CD-ROM software does not network *
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
During the discussion period that followed LYNCH's presentation, several additional points were made.
LYNCH reiterated even more strongly his contention that, historically, once one goes outside high-end science and the group of those who need access to supercomputers, there is a great dearth of genuinely interesting applications on the network. He saw this situation changing slowly, with some of the scientific databases and scholarly discussion groups and electronic journals coming on as well as with the availability of Wide Area Information Servers (WAIS) and some of the databases that are being mounted there. However, many of those things do not seem to have piqued great popular interest. For instance, most high school students of LYNCH's acquaintance would not qualify as devotees of serious molecular biology.
Concerning the issue of the integrity of presentation, LYNCH believed that a couple of information providers have laid down the law at least on certain things. For example, his recollection was that the National Library of Medicine feels strongly that one needs to employ the identifier field if one is to mount a database commercially. The problem with a real networked environment is that one does not know who is reformatting and reprocessing one's data when one enters a client-server mode. It becomes anybody's guess, for example, if the network uses a Z39.50 server, or what clients are doing with one's data. A data provider can say that his contract will only permit clients to have access to his data after he vets them and their presentation and makes certain it suits him. But LYNCH held out little expectation that the network marketplace would evolve in that way, because it required too much prior negotiation.
CD-ROM software does not network for a variety of reasons, LYNCH said. He speculated that CD-ROM publishers are not eager to have their products really hook into wide area networks, because they fear it will make their data suppliers nervous. Moreover, until relatively recently, one had to be rather adroit to run a full TCP/IP stack plus applications on a PC-size machine, whereas nowadays it is becoming easier as PCs grow bigger and faster. LYNCH also speculated that software providers had not heard from their customers until the last year or so, or had not heard from enough of their customers.
******
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
BESSER * Implications of disseminating images on the network; planning the distribution of multimedia documents poses two critical implementation problems * Layered approach represents the way to deal with users' capabilities * Problems in platform design; file size and its implications for networking * Transmission of megabyte size images impractical * Compression and decompression at the user's end * Promising trends for compression * A disadvantage of using X-Windows * A project at the Smithsonian that mounts images on several networks *
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Howard BESSER, School of Library and Information Science, University of Pittsburgh, spoke primarily about multimedia, focusing on images and the broad implications of disseminating them on the network. He argued that planning the distribution of multimedia documents posed two critical implementation problems, which he framed in the form of two questions: 1) What platform will one use and what hardware and software will users have for viewing the material? and 2) How can one deliver a sufficiently robust set of information in an accessible format in a reasonable amount of time? Depending on whether network or CD-ROM is the medium used, the second question raises different issues of storage, compression, and transmission.
Concerning the design of platforms (e.g., sound, gray scale, simple color, etc.) and the various capabilities users may have, BESSER maintained that a layered approach was the way to deal with users' capabilities. A result would be that users with less powerful workstations would simply have less functionality. He urged members of the audience to advocate standards and accompanying software that handle layered functionality across a wide variety of platforms.
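One way to picture the layered approach is as a list of renditions ordered from richest to simplest, from which a server or client picks the first one the user's workstation can handle. The sketch below is a hypothetical illustration, not something BESSER presented: the class, the layer names, and the two capability fields (display depth and sound) are assumptions chosen to echo the platform features mentioned above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rendition:
    name: str
    bits_per_pixel: int        # 1 = bitonal, 8 = gray scale, 24 = full color
    needs_sound: bool = False


# Hypothetical layers, ordered from richest to simplest.
LAYERS = [
    Rendition("color image with narration", 24, needs_sound=True),
    Rendition("color image", 24),
    Rendition("gray-scale image", 8),
    Rendition("bitonal image", 1),
]


def best_rendition(
    display_bits: int, has_sound: bool, layers: Sequence[Rendition] = LAYERS
) -> Optional[Rendition]:
    """Return the richest layer the workstation supports, or None."""
    for layer in layers:
        fits_display = layer.bits_per_pixel <= display_bits
        fits_sound = has_sound or not layer.needs_sound
        if fits_display and fits_sound:
            return layer
    return None


# A workstation with an 8-bit display and no sound hardware.
print(best_rendition(8, has_sound=False).name)  # gray-scale image
```

Under this scheme a less powerful workstation still receives the document, only with less functionality, which is the outcome the layered approach is meant to produce.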