Chapter 8 of 17 · 3996 words · ~20 min read

Part 8

Among CXP's findings concerning the production of microfilm from digital files, KENNEY reported that the digital files for the same Reed lecture were used to produce sample film using an electron beam recorder. The resulting film was faithful to the image capture of the digital files, and while CXP felt that the text and image pages represented in the Reed lecture were superior to that of the light-lens film, the resolution readings for the 600 dpi were not as high as standard microfilming. KENNEY argued that the standards defined for light-lens technology are not totally transferable to a digital environment. Moreover, they are based on definition of quality for a preservation copy. Although making this case will prove to be a long, uphill struggle, CXP plans to continue to investigate the issue over the course of the next year.

KENNEY concluded this portion of her talk with a discussion of the advantages of creating film: it can serve as a primary backup and as a preservation master to the digital file; it could then become the print or production master and service copies could be paper, film, optical disks, magnetic media, or on-screen display.

Finally, KENNEY presented details re production:

* Development and testing of a moderately-high resolution production scanning workstation represented a third goal of CXP; to date, 1,000 volumes have been scanned, or about 300,000 images.

* The resulting digital files are stored and used to produce hard-copy replacements for the originals and additional prints on demand; although the initial costs are high, scanning technology offers an affordable means for reformatting brittle material.

* A technician in production mode can scan 300 pages per hour when performing single-sheet scanning, which is a necessity when working with truly brittle paper; this figure is expected to increase significantly with subsequent iterations of the software from Xerox; a three-month time-and-cost study of scanning found that the average 300-page book would take about an hour and forty minutes to scan (this figure included the time for setup, which involves keying in primary bibliographic data, going into quality control mode to define page size, establishing front-to-back registration, and scanning sample pages to identify a default range of settings for the entire book--functions not dissimilar to those performed by filmers or those preparing a book for photocopy).

* The final step in the scanning process involved rescans, which happily were few and far between, representing well under 1 percent of the total pages scanned.

In addition to technician time, CXP costed out equipment, amortized over four years, the cost of storing and refreshing the digital files every four years, and the cost of printing and binding, book-cloth binding, a paper reproduction. The total amounted to a little under $65 per single 300-page volume, with 30 percent overhead included--a figure competitive with the prices currently charged by photocopy vendors.

Of course, with scanning, in addition to the paper facsimile, one is left with a digital file from which subsequent copies of the book can be produced for a fraction of the cost of photocopy, with readers afforded choices in the form of these copies.

KENNEY concluded that digital technology offers an electronic means for a library preservation effort to pay for itself. If a brittle-book program included the means of disseminating reprints of books that are in demand by libraries and researchers alike, the initial investment in capture could be recovered and used to preserve additional but less popular books. She disclosed that an economic model for a self-sustaining program could be developed for CXP's report to the Commission on Preservation and Access (CPA).

KENNEY stressed that the focus of CXP has been on obtaining high quality in a production environment. The use of digital technology is viewed as an affordable alternative to other reformatting options.

******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ANDRE * Overview and history of NATDP * Various agricultural CD-ROM products created inhouse and by service bureaus * Pilot project on Internet transmission * Additional products in progress * +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Pamela ANDRE, associate director for automation, National Agricultural Text Digitizing Program (NATDP), National Agricultural Library (NAL), presented an overview of NATDP, which has been underway at NAL the last four years, before Judith ZIDAR discussed the technical details. ANDRE defined agricultural information as a broad range of material going from basic and applied research in the hard sciences to the one-page pamphlets that are distributed by the cooperative state extension services on such things as how to grow blueberries.

NATDP began in late 1986 with a meeting of representatives from the land-grant library community to deal with the issue of electronic information. NAL and forty-five of these libraries banded together to establish this project--to evaluate the technology for converting what were then source documents in paper form into electronic form, to provide access to that digital information, and then to distribute it. Distributing that material to the community--the university community as well as the extension service community, potentially down to the county level--constituted the group's chief concern.

Since January 1988 (when the microcomputer-based scanning system was installed at NAL), NATDP has done a variety of things, concerning which ZIDAR would provide further details. For example, the first technology considered in the project's discussion phase was digital videodisc, which indicates how long ago it was conceived.

Over the four years of this project, four separate CD-ROM products on four different agricultural topics were created, two at a scanning-and-OCR station installed at NAL, and two by service bureaus. Thus, NATDP has gained comparative information in terms of those relative costs. Each of these products contained the full ASCII text as well as page images of the material, or between 4,000 and 6,000 pages of material on these disks. Topics included aquaculture, food, agriculture and science (i.e., international agriculture and research), acid rain, and Agent Orange, which was the final product distributed (approximately eighteen months before the Workshop).

The third phase of NATDP focused on delivery mechanisms other than CD-ROM. At the suggestion of Clifford LYNCH, who was a technical consultant to the project at this point, NATDP became involved with the Internet and initiated a project with the help of North Carolina State University, in which fourteen of the land-grant university libraries are transmitting digital images over the Internet in response to interlibrary loan requests--a topic for another meeting. At this point, the pilot project had been completed for about a year and the final report would be available shortly after the Workshop. In the meantime, the project's success had led to its extension. (ANDRE noted that one of the first things done under the program title was to select a retrieval package to use with subsequent products; Windows Personal Librarian was the package of choice after a lengthy evaluation.)

Three additional products had been planned and were in progress:

1) An arrangement with the American Society of Agronomy--a professional society that has published the Agronomy Journal since about 1908--to scan and create bit-mapped images of its journal. ASA granted permission first to put and then to distribute this material in electronic form, to hold it at NAL, and to use these electronic images as a mechanism to deliver documents or print out material for patrons, among other uses. Effectively, NAL has the right to use this material in support of its program. (Significantly, this arrangement offers a potential cooperative model for working with other professional societies in agriculture to try to do the same thing--put the journals of particular interest to agriculture research into electronic form.)

2) An extension of the earlier product on aquaculture.

3) The George Washington Carver Papers--a joint project with Tuskegee University to scan and convert from microfilm some 3,500 images of Carver's papers, letters, and drawings.

It was anticipated that all of these products would appear no more than six months after the Workshop.

******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ZIDAR * (A separate arena for scanning) * Steps in creating a database * Image capture, with and without performing OCR * Keying in tracking data * Scanning, with electronic and manual tracking * Adjustments during scanning process * Scanning resolutions * Compression * De-skewing and filtering * Image capture from microform: the papers and letters of George Washington Carver * Equipment used for a scanning system * +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program (NATDP), National Agricultural Library (NAL), illustrated the technical details of NATDP, including her primary responsibility, scanning and creating databases on a topic and putting them on CD-ROM.

(ZIDAR remarked a separate arena from the CD-ROM projects, although the processing of the material is nearly identical, in which NATDP is also scanning material and loading it on a Next microcomputer, which in turn is linked to NAL's integrated library system. Thus, searches in NAL's bibliographic database will enable people to pull up actual page images and text for any documents that have been entered.)

In accordance with the session's topic, ZIDAR focused her illustrated talk on image capture, offering a primer on the three main steps in the process: 1) assemble the printed publications; 2) design the database (database design occurs in the process of preparing the material for scanning; this step entails reviewing and organizing the material, defining the contents--what will constitute a record, what kinds of fields will be captured in terms of author, title, etc.); 3) perform a certain amount of markup on the paper publications. NAL performs this task record by record, preparing work sheets or some other sort of tracking material and designing descriptors and other enhancements to be added to the data that will not be captured from the printed publication. Part of this process also involves determining NATDP's file and directory structure: NATDP attempts to avoid putting more than approximately 100 images in a directory, because placing more than that on a CD-ROM would reduce the access speed.

This up-front process takes approximately two weeks for a 6,000-7,000-page database. The next step is to capture the page images. How long this process takes is determined by the decision whether or not to perform OCR. Not performing OCR speeds the process, whereas text capture requires greater care because of the quality of the image: it has to be straighter and allowance must be made for text on a page, not just for the capture of photographs.

NATDP keys in tracking data, that is, a standard bibliographic record including the title of the book and the title of the chapter, which will later either become the access information or will be attached to the front of a full-text record so that it is searchable.

Images are scanned from a bound or unbound publication, chiefly from bound publications in the case of NATDP, however, because often they are the only copies and the publications are returned to the shelves. NATDP usually scans one record at a time, because its database tracking system tracks the document in that way and does not require further logical separating of the images. After performing optical character recognition, NATDP moves the images off the hard disk and maintains a volume sheet. Though the system tracks electronically, all the processing steps are also tracked manually with a log sheet.

ZIDAR next illustrated the kinds of adjustments that one can make when scanning from paper and microfilm, for example, redoing images that need special handling, setting for dithering or gray scale, and adjusting for brightness or for the whole book at one time.

NATDP is scanning at 300 dots per inch, a standard scanning resolution. Though adequate for capturing text that is all of a standard size, 300 dpi is unsuitable for any kind of photographic material or for very small text. Many scanners allow for different image formats, TIFF, of course, being a de facto standard. But if one intends to exchange images with other people, the ability to scan other image formats, even if they are less common, becomes highly desirable.

CCITT Group 4 is the standard compression for normal black-and-white images, JPEG for gray scale or color. ZIDAR recommended 1) using the standard compressions, particularly if one attempts to make material available and to allow users to download images and reuse them from CD-ROMs; and 2) maintaining the ability to output an uncompressed image, because in image exchange uncompressed images are more likely to be able to cross platforms.

ZIDAR emphasized the importance of de-skewing and filtering as requirements on NATDP's upgraded system. For instance, scanning bound books, particularly books published by the federal government whose pages are skewed, and trying to scan them straight if OCR is to be performed, is extremely time-consuming. The same holds for filtering of poor-quality or older materials.

ZIDAR described image capture from microform, using as an example three reels from a sixty-seven-reel set of the papers and letters of George Washington Carver that had been produced by Tuskegee University. These resulted in approximately 3,500 images, which NATDP had had scanned by its service contractor, Science Applications International Corporation (SAIC). NATDP also created bibliographic records for access. (NATDP did not have such specialized equipment as a microfilm scanner.

Unfortunately, the process of scanning from microfilm was not an unqualified success, ZIDAR reported: because microfilm frame sizes vary, occasionally some frames were missed, which without spending much time and money could not be recaptured.

OCR could not be performed from the scanned images of the frames. The bleeding in the text simply output text, when OCR was run, that could not even be edited. NATDP tested for negative versus positive images, landscape versus portrait orientation, and single- versus dual-page microfilm, none of which seemed to affect the quality of the image; but also on none of them could OCR be performed.

In selecting the microfilm they would use, therefore, NATDP had other factors in mind. ZIDAR noted two factors that influenced the quality of the images: 1) the inherent quality of the original and 2) the amount of size reduction on the pages.

The Carver papers were selected because they are informative and visually interesting, treat a single subject, and are valuable in their own right. The images were scanned and divided into logical records by SAIC, then delivered, and loaded onto NATDP's system, where bibliographic information taken directly from the images was added. Scanning was completed in summer 1991 and by the end of summer 1992 the disk was scheduled to be published.

Problems encountered during processing included the following: Because the microfilm scanning had to be done in a batch, adjustment for individual page variations was not possible. The frame size varied on account of the nature of the material, and therefore some of the frames were missed while others were just partial frames. The only way to go back and capture this material was to print out the page with the microfilm reader from the missing frame and then scan it in from the page, which was extremely time-consuming. The quality of the images scanned from the printout of the microfilm compared unfavorably with that of the original images captured directly from the microfilm. The inability to perform OCR also was a major disappointment. At the time, computer output microfilm was unavailable to test.

The equipment used for a scanning system was the last topic addressed by ZIDAR. The type of equipment that one would purchase for a scanning system included: a microcomputer, at least a 386, but preferably a 486; a large hard disk, 380 megabyte at minimum; a multi-tasking operating system that allows one to run some things in batch in the background while scanning or doing text editing, for example, Unix or OS/2 and, theoretically, Windows; a high-speed scanner and scanning software that allows one to make the various adjustments mentioned earlier; a high-resolution monitor (150 dpi ); OCR software and hardware to perform text recognition; an optical disk subsystem on which to archive all the images as the processing is done; file management and tracking software.

ZIDAR opined that the software one purchases was more important than the hardware and might also cost more than the hardware, but it was likely to prove critical to the success or failure of one's system. In addition to a stand-alone scanning workstation for image capture, then, text capture requires one or two editing stations networked to this scanning station to perform editing. Editing the text takes two or three times as long as capturing the images.

Finally, ZIDAR stressed the importance of buying an open system that allows for more than one vendor, complies with standards, and can be upgraded.

******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ WATERS *Yale University Library's master plan to convert microfilm to digital imagery (POB) * The place of electronic tools in the library of the future * The uses of images and an image library * Primary input from preservation microfilm * Features distinguishing POB from CXP and key hypotheses guiding POB * Use of vendor selection process to facilitate organizational work * Criteria for selecting vendor * Finalists and results of process for Yale * Key factor distinguishing vendors * Components, design principles, and some estimated costs of POB * Role of preservation materials in developing imaging market * Factors affecting quality and cost * Factors affecting the usability of complex documents in image form * +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Donald WATERS, head of the Systems Office, Yale University Library, reported on the progress of a master plan for a project at Yale to convert microfilm to digital imagery, Project Open Book (POB). Stating that POB was in an advanced stage of planning, WATERS detailed, in

## particular, the process of selecting a vendor partner and several key

issues under discussion as Yale prepares to move into the project itself. He commented first on the vision that serves as the context of POB and then described its purpose and scope.

WATERS sees the library of the future not necessarily as an electronic library but as a place that generates, preserves, and improves for its clients ready access to both intellectual and physical recorded knowledge. Electronic tools must find a place in the library in the context of this vision. Several roles for electronic tools include serving as: indirect sources of electronic knowledge or as "finding" aids (the on-line catalogues, the article-level indices, registers for documents and archives); direct sources of recorded knowledge; full-text images; and various kinds of compound sources of recorded knowledge (the so-called compound documents of Hypertext, mixed text and image, mixed-text image format, and multimedia).

POB is looking particularly at images and an image library, the uses to which images will be put (e.g., storage, printing, browsing, and then use as input for other processes), OCR as a subsequent process to image capture, or creating an image library, and also possibly generating microfilm.

While input will come from a variety of sources, POB is considering especially input from preservation microfilm. A possible outcome is that the film and paper which provide the input for the image library eventually may go off into remote storage, and that the image library may be the primary access tool.

The purpose and scope of POB focus on imaging. Though related to CXP, POB has two features which distinguish it: 1) scale--conversion of 10,000 volumes into digital image form; and 2) source--conversion from microfilm. Given these features, several key working hypotheses guide POB, including: 1) Since POB is using microfilm, it is not concerned with the image library as a preservation medium. 2) Digital imagery can improve access to recorded knowledge through printing and network distribution at a modest incremental cost of microfilm. 3) Capturing and storing documents in a digital image form is necessary to further improvements in access. (POB distinguishes between the imaging, digitizing process and OCR, which at this stage it does not plan to perform.)

Currently in its first or organizational phase, POB found that it could use a vendor selection process to facilitate a good deal of the organizational work (e.g., creating a project team and advisory board, confirming the validity of the plan, establishing the cost of the project and a budget, selecting the materials to convert, and then raising the necessary funds).

POB developed numerous selection criteria, including: a firm committed to image-document management, the ability to serve as systems integrator in a large-scale project over several years, interest in developing the requisite software as a standard rather than a custom product, and a willingness to invest substantial resources in the project itself.

Two vendors, DEC and Xerox, were selected as finalists in October 1991, and with the support of the Commission on Preservation and Access, each was commissioned to generate a detailed requirements analysis for the project and then to submit a formal proposal for the completion of the project, which included a budget and costs. The terms were that POB would pay the loser. The results for Yale of involving a vendor included: broad involvement of Yale staff across the board at a relatively low cost, which may have long-term significance in carrying out the project (twenty-five to thirty university people are engaged in POB); better understanding of the factors that affect corporate response to markets for imaging products; a competitive proposal; and a more sophisticated view of the imaging markets.

The most important factor that distinguished the vendors under consideration was their identification with the customer. The size and internal complexity of the company also was an important factor. POB was looking at large companies that had substantial resources. In the end, the process generated for Yale two competitive proposals, with Xerox's the clear winner. WATERS then described the components of the proposal, the design principles, and some of the costs estimated for the process.

Components are essentially four: a conversion subsystem, a network-accessible storage subsystem for 10,000 books (and POB expects 200 to 600 dpi storage), browsing stations distributed on the campus network, and network access to the image printers.

Among the design principles, POB wanted conversion at the highest possible resolution. Assuming TIFF files, TIFF files with Group 4 compression, TCP/IP, and ethernet network on campus, POB wanted a client-server approach with image documents distributed to the workstations and made accessible through native workstation interfaces such as Windows. POB also insisted on a phased approach to implementation: 1) a stand-alone, single-user, low-cost entry into the business with a workstation focused on conversion and allowing POB to explore user access; 2) movement into a higher-volume conversion with network-accessible storage and multiple access stations; and 3) a high-volume conversion, full-capacity storage, and multiple browsing stations distributed throughout the campus.

The costs proposed for start-up assumed the existence of the Yale network and its two DocuTech image printers. Other start-up costs are estimated at $1 million over the three phases. At the end of the project, the annual operating costs estimated primarily for the software and hardware proposed come to about $60,000, but these exclude costs for labor needed in the conversion process, network and printer usage, and facilities management.

Finally, the selection process produced for Yale a more sophisticated view of the imaging markets: the management of complex documents in image form is not a preservation problem, not a library problem, but a general problem in a broad, general industry. Preservation materials are useful for developing that market because of the qualities of the material. For example, much of it is out of copyright. The resolution of key issues such as the quality of scanning and image browsing also will affect development of that market.

The technology is readily available but changing rapidly. In this context of rapid change, several factors affect quality and cost, to which POB intends to pay particular attention, for example, the various levels of resolution that can be achieved. POB believes it can bring resolution up to 600 dpi, but an interpolation process from 400 to 600 is more likely. The variation quality in microfilm will prove to be a highly important factor. POB may reexamine the standards used to film in the first place by looking at this process as a follow-on to microfilming.