chapter 12
, I have changed "he stood on the tock", to "he stood on the rock".]
V.129. Having investigated what looks like a typo, I find it isn't. Do I need to do anything?
Often in PG work, you come across an odd word or usage. Might be a typo; might not. You check it out, and find that it is deliberate--perhaps a word from local dialect that just happens to resemble a different word, perhaps the author is using an odd word or spelling to make a point with the language. Especially if it's an isolated incident, and especially if it's not obvious, you can add a transcriber's note to the end noting that the word is thus in your edition, and that it is probably right. This may prevent some well-intentioned converter from changing it.
It's rare that you will need to do this; you may encounter such a case only once in a hundred PG books, but it is an option.
V.130. Aarrgh! Some pages are missing! Do I have to abandon the book?
No. It happens more often than you might think, and we're quite used to dealing with it.
Finish the book, and ask other volunteers to help by finding another copy of the book to fill in the missing section. For something like this, you can try asking on [V.12] the WebBoard, or gutvol-d, or ask Michael Hart to put a note in the Newsletter asking for assistance. We can post the book incomplete, and put a Transcriber's Note [V.97] in the header asking any future reader who has a copy to fill in the gap.
V.131. Some words are spelled inconsistently in my book (e.g. sometimes "surprise", sometimes "surprize"). Should I make them consistent?
No.
English spelling didn't really standardize until the start of the 20th Century (and even then it fractured; e.g. "standardize" vs. "standardise") and the further back you go, the more inconsistent it becomes. Shakespeare, for example, signed his own name with several different spellings.
Where your printed edition genuinely uses alternate spellings of the same word, you should preserve them.
Word Processor FAQ
W.1. What's the difference between an editor and a word processor?
An editor shows you the characters you type, exactly as you type them. It puts new-line characters in when you hit the Enter key, and only when you hit the Enter key. Its ultimate aim is to give you exact control of plain text. EDIT in DOS, Notepad in Windows, vi and emacs in *nix, Tex-Edit Plus and BBEdit Lite in Mac, are all editors.
A word processor, in addition to entering the characters, also lets you change the font, the size of individual words, and whether they are italic or bold. It doesn't generally want individual line-ends put in on each line; it just rewraps the text as you change it. Its ultimate aim is to print your document on paper with full formatting facilities. WordPerfect for MS-DOS and Windows, MS-Word for Windows and Mac, AbiWord for Windows and Linux, and Nisus Writer for Mac are all word processors.
W.2. Should I use an editor or a word processor?
For dealing with plain text, which is what PG is about, you might expect a text editor to have the edge, since the formatting features of word processors can get in the way of making a clean text.
However, if you use a word processor, and you ignore all of the layout and formatting that have to do with fonts and paper, it will work equally well. A few common problems associated with word processors are mentioned below.
W.3. Which editor or word processor should I use?
The one you like best!
Any of them will do the job. Even the most primitive editors of 1971 will do the job. The most feature-bloated word processor of tomorrow will do the job. No editor or word processor affects in the slightest the "quality" of the text produced.
For PG purposes, therefore, the only difference between them all is how easy you find them to use, and what facilities they have for helping you--and those are decisions that only you can make.
If you already have a favorite editor or word processor, stick to it. If you don't, there's a huge selection available for you to consider, on any type of computer.
Sometimes, using a word processor, you may encounter some problems in saving your book as plain text. You have to figure out how to get it right just once, and then use that same method thereafter. If you have problems with this, ask other volunteers or one of the Posting Team for help.
W.4. How can I make my word processor easier to work with for plain text?
First, switch off _everything_ called "Smart ------" or "Automatic". Modern word processors commonly offer lots of typical typing support features--"Smart Quotes", "Auto Correct", automatically capitalizing the first word in each sentence, anything like that. By all means, leave on any informative highlighting of misspelled words or other errors that it offers, but switch off any feature that changes what you type without asking you. Older books contain text that doesn't sit comfortably with modern rules, and we don't want your word processor deciding what Chaucer really wrote!
Now, choose a non-proportional font, and apply it to the whole document. It's important to work in a non-proportional font, because you may have to line words up underneath each other, and it is not possible to do this consistently in proportional fonts like Times or Arial.
If you work in Courier, size 10, 11 or 12, and your word processor is set for a normal page size, about 7 inches across excluding margins, then what you see in your WP is a pretty good approximation to how the text will look in PG plain text format. One formula, suggested by John Mamoun in the Volunteers' Voices section, is to Select All the text, choose Courier New font, 10 point size, and set the margins at 5.5 inches, then Save As "Text with layout".
W.5. What is the difference between proportional and non-proportional fonts?
A non-proportional, or "monospaced", or "typewriter" font, is one where all of the letters take up exactly the same amount of space on screen: a capital "W", a lower-case "i" and a space are all equally wide. The Courier family of fonts is commonly used for this.
A proportional font is one where each letter takes up just the amount of space it needs, so that a capital "W" is much wider than a small "i".
Unfortunately, the different sizes of the letters in different proportional fonts means that it's not possible to line up letters consistently: a "W" may be equivalent to three "i"s in one proportional font, and to four "i"s in another. This means, for example, that it is not possible to use a proportional font to format plain text tables or poetry correctly--lining up the spaces and words using one proportional font will cause it to look skewed using another.
You should always look at PG texts in a non-proportional font, even if you prefer to work mostly using a proportional font, because readers and automatic converter programs will assume that you meant your text to be viewed using a non-proportional font.
W.6. I can't get words in a table or poem to line up under each other.
You are using a proportional font. You should always use a non-proportional font like Courier for PG work. Change the font of the entire document to Courier and try again.
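If you would rather not count spaces by hand, a small script can pad the columns for you. This is only a sketch in Python: the function name align_columns and the sample rows are made up for illustration, and it assumes every row has the same number of cells. The result only lines up when viewed in a non-proportional font.

```python
def align_columns(rows, gap=2):
    """Pad each cell with spaces so that columns start at the same position.

    Assumes every row has the same number of cells (an assumption of this
    sketch, not a PG rule).
    """
    column_count = len(rows[0])
    widths = [max(len(row[i]) for row in rows) for i in range(column_count)]
    lines = []
    for row in rows:
        padded = [cell.ljust(width) for cell, width in zip(row, widths)]
        # Strip trailing spaces left by padding the last column.
        lines.append((" " * gap).join(padded).rstrip())
    return "\n".join(lines)


rows = [("Wine", "12"), ("Illicit spirits", "3")]
table = align_columns(rows)
print(table)
```

Viewed in Courier, the second column starts at the same character position on every line; viewed in Times or Arial, it will look skewed.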
About using Microsoft Word:
PG volunteers use many different word-processors, but Microsoft Word is the one we hear most queries and problems about.
W.7. I've edited my book in Word--how do I save it as plain text?
First, make sure that all text is using Courier or Courier New and is at the same point size (usually 10-12). Move your right margin so that you see roughly the right number of characters per line (usually 65-70). Then choose File / Save As and then choose the format "Text Only with Line Breaks". Save your file with the extension ".txt" to distinguish it from your Word format file.
After saving, open your text file using Notepad or some other simple text editor and look at the results. You should see a typical PG layout of the text--lines up to 70 characters long, a blank line between paragraphs and no indentation at the start of each paragraph. If so, you're done.
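For a long book, checking the layout by eye gets tedious. The following Python sketch lists lines longer than 70 characters and paragraphs that start with an indent. The function names and the sample text are invented for this example; the 70-character limit is the one mentioned above.

```python
def long_lines(text, limit=70):
    """Return (line_number, length) pairs for lines longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


def indented_paragraph_starts(text):
    """Return the numbers of lines that open a paragraph with leading whitespace."""
    lines = text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        # A paragraph starts on the first line or after a blank line.
        previous_blank = index == 0 or not lines[index - 1].strip()
        if previous_blank and line.strip() and line[0].isspace():
            hits.append(index + 1)
    return hits


sample = "A short line.\n\n" + "x" * 75 + "\n\n  An indented paragraph.\n"
print(long_lines(sample))
print(indented_paragraph_starts(sample))
```

Some indentation is deliberate, as in poetry or tables, so treat the output as a list of places to look at rather than a list of errors.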
W.8. Quotes look wrong when I save a Word document as plain text.
You may have left "Smart Quotes" on in Word options. This tells Word to use left- and right-slanted quote marks at the beginning and end of a quote instead of the plain ASCII straight quotes. When you save a document that contains these angled quotes as plain text, they come out as non-ASCII characters that look wrong on most editors and viewers. The solution is to turn off Smart Quotes in Word and/or replace the ones it has already created.
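If the curly quotes are already in your text, one way to replace them is a short script. This Python sketch assumes the text has already been read in as a string; the name straighten_quotes is hypothetical, and the four characters listed are the usual curly quotes, which may not cover everything your word processor produced.

```python
# Map the usual curly quotes to plain ASCII straight quotes.
SMART_QUOTES = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark, also used as an apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}


def straighten_quotes(text):
    """Return `text` with curly quotes replaced by straight ones."""
    return text.translate(str.maketrans(SMART_QUOTES))


example = "\u201cDon\u2019t,\u201d she said."
print(straighten_quotes(example))
```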
W.9. Dashes look wrong when I save a Word document as plain text.
When Word recognizes an em-dash as such, it may try to use a special character for it. This may appear as a black square, an empty box, or a funny accented letter when you Save As text and look at it in a different editor.
You can usually do a Find and Replace on this character either in Word or in another editor after Saving As text to change it to two dashes.
For those interested, the "funny character" is character 151 (97H), and is specific to Codepage 1252 [V.76].
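As a sketch of the Find and Replace described above, the Python below decodes bytes saved in Codepage 1252 and turns character 151 into two dashes. The function name fix_dashes and the sample bytes are made up, and the sketch assumes your file really was saved in Codepage 1252.

```python
def fix_dashes(raw_bytes):
    """Decode Codepage 1252 bytes and replace the em-dash with two hyphens."""
    text = raw_bytes.decode("cp1252")
    # Byte 151 (97H) in Codepage 1252 decodes to the em-dash, U+2014.
    return text.replace("\u2014", "--")


raw = b"deliberate" + bytes([151]) + b"perhaps"
print(fix_dashes(raw))
```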
W.10. I saved my Word document as HTML, but the HTML looks terrible.
Yes. Word is not unique in having this problem, but HTML saved from Word is the case we hear most about. Microsoft themselves offer a free plug-in to Word that saves the file in "Compact HTML", which is a bit better. You can fix it by hand, or you can use Tidy <http://tidy.sourceforge.net>, a handy utility, which will do some clean-up on the HTML. If you're working with HTML, you really need a copy of Tidy anyway, because it's such a great way to do a check on the correctness of your HTML.
Tidy is also embedded in some Windows GUI tools, like Tidy-GUI, HTML-Kit and NoteTab.
Scanning FAQ
S.1. What is a scanner?
A scanner is a machine that makes an image, a picture of the page that is fed to it, and sends that image to your computer. It only makes an image, like a camera does; it doesn't turn that image into text.
S.2. What types of scanners are there?
The most common type of scanner, the kind you're likely to find in your local computer store, is a flatbed scanner. It has a glass bed usually a bit bigger than Letter paper size (or A4 if you live in Europe! :-) and most of the common models are optimized for typical office correspondence. One of these may cost anything from under $100 to $400, depending on its features, or you can pick them up cheaper second-hand. You use this by placing the paper or book face-down flat onto the glass, and scanning from there. This is the kind of scanner most commonly used by PG volunteers.
Some stores will call sheetfed scanners a different category. These are flatbed scanners with Automatic Document Feed (ADF), but they are fundamentally the same machine, and the ADF sheetfeeder unit may often be bought as an accessory to the flatbed scanner. Recently, a few sheetfed scanners have appeared that are very small, without a full flatbed, just a narrow strip that the paper rolls through. Avoid these for PG work; you often need to be able to scan the book flat.
Hand scanners, as their name implies, are much smaller, and typically very cheap, or even thrown in free. You use these by holding them in your hand and running them along the text like a brush. These are really not intended for PG work; you need a very steady hand movement to get them to scan a page of text into a readable image, and they shouldn't be considered as an option for a 400-page book--scanning and OCR is tough enough without that!
You can think of production scanners as industrial-strength flatbed scanners. The basic mechanisms are the same, but a production scanner will certainly have ADF (sheetfeeder), more features and speed, and be rated for very high volume scanning. Production scanners are used by publishers, businesses with high-volume paper processing needs, and print shops. This last is useful, because you may be able to get some scanning done by a print shop. It can't hurt to ask. If you're thinking about buying one of these babies (and who among us hasn't? :-), be sure you have $2000 or more to spend.
Drum scanners are mostly used by publishers for professional, high-quality artwork. The paper is placed on the surface of a drum that rotates past a fixed scanning head. The drum can be very large. Because the sensors don't have to move, the electronics and optics can be of higher quality, and produce very accurate, high-definition images. They are exactly what you would want for making professional quality scans of old movie posters, but they're expensive, and not very useful for scanning War and Peace to OCR.
Planetary scanners are a different breed to all the others. They are really not scanners at all, but a very high-end digital camera on a stand. You place the book face-up with the pages open, with the camera looking straight down on it. It takes a picture, and passes it on to the connected computer. Planetary scanners are ideal for old, fragile, valuable books that can't be exposed to the stress of normal scanning. They typically come supplied with specialized software, sometimes even their own dedicated computer, and they are very, very expensive--$20,000+.
S.3. Which scanner should I get?
For most people, the answer is simple. Unless you have a lot of money and are sure you will be scanning a lot of books, you should get a normal, consumer-or-office type flatbed scanner, with or without an ADF sheetfeeder.
Having decided that, you're faced with the question of which scanner to buy. More good news! The market in scanners is very competitive, and there are many top-line vendors all watching each others' features like hawks, eager to deliver the highest-spec machine they can. There are only a couple of critical factors in this decision--most of it is about getting the best buy.
For PG work, you really _need_ an optical resolution no less than 300 by 300 dpi (dots per inch), and 600 by 600 is very desirable. Obviously, more is better, but it would be very rare to need more than 600 dpi for PG work. Pay no attention to the "interpolated" or "enhanced" resolution, where the software "guesses" what dots should fill in the gaps--you're only interested in the optical resolution. The good news is that it's very difficult to find modern scanners with a maximum optical resolution of less than 600 dpi, but if you're buying second-hand, you should check this out first.
You will also _need_ a scanning surface on the glass big enough to place your book with two facing pages flat. Again, the good news is that it's very hard to find a flatbed whose scanning surface is too small for PG work, since these scanners tend to be designed to handle office paper, which is about the right size. Most flatbed scanners have scanning surfaces of about 8.5" by 11.5", and this is standard for PG work. If you're working on books with very large pages, you may need to resign yourself to scanning one page at a time, but buying a scanner with a big flatbed for these rare occasions will be much more expensive.
You must make sure that you get a scanner that will connect correctly to your computer. There are currently (mid-2002) three main types of connections commonly available: SCSI, USB, and parallel.
SCSI (Small Computer Systems Interface) is the highest-quality option, but it means that you need a SCSI card in your computer, and must be willing to figure out how to install it. If you're already a SCSI enthusiast, you don't need to read further; if you're not, I suggest you avoid it unless you enjoy tinkering. Production scanners mostly require SCSI.
Parallel-port connections used to be common, as a cheaper, easier alternative to SCSI. Since the introduction of USB they have become rarer, but you will still see them for sale second-hand. These plug into your printer port, and don't require any further engineering skills.
Most new scanners hook up using a USB (Universal Serial Bus) interface, which is a no-muss, no-fuss "plug-in and go" option, but be sure, if you have an old PC, that it actually has a USB port and that your operating system supports it; some older Windows PCs and Macs may not. If your PC doesn't support USB, you should probably look at Parallel-port scanners.
By the time you read this FAQ, FireWire and USB 2.0 interfaces may also be common. For your purposes, these are like more advanced versions of USB. Just make sure that your computer has the right support to match the scanner.
If you're buying second-hand--and used scanners can be very cheap--make absolutely sure that you're getting the original software that came with the scanner, and that that software will work with your current operating system on your PC.
Having ensured that your choice of scanners passes these tests, you're now free to indulge your tastes for any extras you like. Color is nice, but rarely used, since we mostly transcribe older books that have no color printing. Higher resolutions are comforting to have, both since you may occasionally find them useful and because it shows that the optics are of higher quality than you actually need for your PG scans.
If you are nervous about your choice of scanner, or how easy it is to get one working, feel free to contact other PG volunteers for their opinions, as described in the FAQ "How do PG volunteers communicate?" [V.12].
S.4. What is ADF?
ADF stands for Automatic Document Feed, and it's just a jargon term for a sheetfeeder, where you put in a stack of pages to be scanned and go away while that's happening instead of putting in each page manually.
S.5. Should I get ADF?
That depends. Yes, ADF is a great idea, and can be a huge work-saver, and if you have the cash to spend, it may well be worth it. But ADF has a dirty little secret: like any other gizmo with moving parts, it occasionally jams. The sheetfeeders built into these low-cost machines are aimed at handling typical office paper straight from the laser printer--large, smooth, good quality, with perfectly-cut, perfectly-aligned edges. In your PG work, you will be dealing with hundred-year-old pages of various thicknesses and textures, usually much smaller than the sheetfeeder was designed to work with. And you will have to have cut the pages, and may leave ragged edges in doing so.
Under these conditions, you may find that paper often jams in your sheetfeeder, and it defeats the purpose if you have to stand over the scanner while it works, or if you end up having to lift the cover and use your scanner as an ordinary flatbed, or, worse, if your paper gets scrunched up as if a dog had been playing with it.
And of course, in order to feed the pages through, you will have to cut them out of the book, destroying it. (It may be possible, with the help of a bookbinder, to have the pages professionally cut, and later re-bound.)
With ADF, you probably won't actually scan much faster than scanning flat, but you won't have to keep turning over the pages during that time.
So when you're making that choice, think carefully. If money isn't a problem, or you do expect to be working with cut sheets, then go ahead and get a sheetfeeder--it's great when it works! But don't be disappointed when it doesn't work all the time.
S.6. What's a "TWAIN driver" and why do I need one?
A TWAIN driver (see <http://www.twain.org>) is a piece of software that installs onto your Windows PC or Mac and controls your scanner from there. With any modern scanner, there will be a TWAIN driver included in its software package. Once installed, you shouldn't have to think about it again, or even know it's there.
A modern OCR package will usually find your TWAIN driver and use it to control the scanner. This is very handy. There may also be a small scanning package with your TWAIN driver, which will provide a screen where you can make fine adjustments to scanner settings, and start scans. You probably won't _need_ this, since your OCR package will probably do it for you, but it may be useful for semi-manual control of the scanner.
Unix-based systems like Linux use SANE <http://www.mostang.com/sane/> rather than TWAIN drivers.
S.7. How do I scan a book?
This depends on whether you have cut the pages out, or whether you are working with an intact book.
If you have cut the pages out, and you have an ADF, then you will obviously feed them through that.
If you don't have an ADF, there usually isn't much point in cutting the pages. Most modern OCR will recognize a "dual-page" or "two-up" scan, and, if yours does, then that's normally the best option. Scanning the uncut book, open and flat, is the most common scanning method used in PG.
Take the book and place it open, flat on the scanner glass. To fit both pages on the glass, you may need to position it lengthways, at 90 degrees to its natural angle. Most OCR software will recognize that the image has been rotated through a right-angle, and will correct it when it reads the text.
A common problem with scanning an opened book is "guttering", which happens when the spine of the book is not pressed flat enough, and the inside of each page, where it meets the spine, is curved against the glass. There's more about this, and an example, scan3, in the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?". To avoid guttering, make sure that the spine is held down throughout the scan. (Some people put a weight on the spine to hold the spine down on each scan; others just press their hand against it.)
Another common problem is light scattering, when too much light gets into the scanner. The scanner head detects light, and you want the only light source to be the scanner itself, not ambient room light or sunlight. Scanners have covers that are intended to be closed while scanning, for a controlled light level, but when you're scanning a book held open and flat, you can't close the cover fully. In a bad case, this can make the scan look like overexposed film; you can see an example in scan4 of the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?". If this happens, just make sure that your room is dim while you scan--don't have a ray of bright sunlight bouncing around the inside of the scanner!
Occasionally, when scanning cut pages with very thin paper, you may get a shadow of the text on the other side showing through. If this happens, you can try covering the inside of the scanner lid, which is normally white, with a piece of black paper.
Many modern OCR packages will control the scanner automatically, and you may be able to set your OCR so that it does an automatic timed scan every, say, 30 seconds. This is a great timesaver, since you don't have to go back and forth between the scanner and the screen. Just set your timer, hold down the book for the scan, take the book up, turn the page, put it down again, and wait for the next scan to start. Set the timer for whatever interval you are comfortable with. Highly recommended, if your OCR or scanning package can do it.
By default, most scanners will always scan the entire area of the flatbed, but usually, your book will occupy only about half of it. Look for a setting on your OCR or scanning package which allows you to reduce the area that the head scans. Just scan enough to get the image of your pages. This makes the time for each scan and subsequent OCR recognition shorter, and in a really good case can cut your total scanning and OCR time in half.
Scanning all pages together is usually fastest, but you may prefer to scan each double-page, then correct it in your OCR package's editor, then scan the next. This is a more leisurely approach favored by some volunteers.
S.8. My book won't open flat enough for a good scan, and I don't want to cut the pages.
Well, then, you have a difficult choice to make, but you do still have several options:
You can accept a poor-quality scan, and spend a lot of time fixing up the guttering on the margins.
You can bite the bullet, and cut the pages.
You can type the book, or find a typist who will work on it for you.
You can find a print shop or bookbinder who will cut the pages professionally, and re-bind the book when you're done. You may even replace it with a fresh new binding that will give the book a new lease of life.
Take your choice.
Most books will open flat enough for an adequate scan, though you may have to put stress on the spine to do it.
If you have a really precious book, and you can't find a typist, you might consider the options of a digital camera [S.11] or finding someone with a planetary scanner [S.2] to scan it for you.
Michael Hart said: "I would give up every book I own, including my first edition of the OED, my Civil War edition of the Merriam Webster's Unabridged, etc., etc., etc., so everyone could use it any time they wanted rather than that only I or my friends could use it . . . and obviously _I_ could use it too."
Fortunately, it rarely comes to that.
S.9. How long does it take to scan a book?
Putting the book flat on the glass means that you scan two pages at a time. A reasonable modern scanner will scan the area of two typical pages at 400dpi in anywhere from 20 to 40 seconds--let's call it 30 seconds for two pages. That's four pages a minute, or 240 pages an hour. You could reasonably get through a 400 page book in two hours, even allowing for an occasional break or glitch.
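The arithmetic above is easy to redo for your own scanner. In this Python sketch the function names are made up, and the defaults (30 seconds per scan, two pages per scan) are just the round figures used in this answer; it counts scanning time only, with no allowance for breaks or glitches.

```python
def pages_per_hour(seconds_per_scan=30, pages_per_scan=2):
    """Pages scanned in an hour of uninterrupted scanning."""
    return 3600 / seconds_per_scan * pages_per_scan


def scanning_hours(pages, seconds_per_scan=30, pages_per_scan=2):
    """Hours of pure scanning time for a book of `pages` pages."""
    scans = -(-pages // pages_per_scan)  # round up to a whole number of scans
    return scans * seconds_per_scan / 3600


print(pages_per_hour())
print(round(scanning_hours(400), 2))
```

With the default figures, a 400-page book needs 200 scans, which is well under the two hours suggested above and leaves room for trial scans and the occasional break.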
Of course, you should also allow time for scanning a few trial pages with different settings before you start, to decide which settings to use. Ten minutes spent here can save you hours of proofreading time.
There are two big tips that can save you a lot of scanning time:
If your OCR or scanner control package has a timer setting, that automatically keeps scanning without user intervention, you can forget about the screen and just keep turning the pages as needed.
You should set your scanner just to scan the area the book covers on the glass. By default, your software will probably scan the full area of the glass, and usually, your book won't need that. By scanning only what you need, you may typically save anything from 20% to 70% of the time taken to scan the full area. If your book is small enough to open flat _across_ the scanner instead of "down" the side, 400 pages an hour is not out of the question with this trick.
S.10. What scanner settings are best?
For a given book, scanner, PC and OCR software, there must be some "ideal" scanner settings, but if you change any of these components, the ideal scanner settings will change with them. Some OCR packages recognize greyscale better than black and white; some don't like greyscale at all. Some books have small print needing higher resolution; some are speckled so that higher resolution leads to more errors.
Obviously, the best settings also depend on the individual book, and some books will require you to get downright creative with the settings, but most PG books are scanned in Black and White or greyscale, somewhere between 300dpi and 600dpi.
This decision is a trade-off between speed and accuracy, and an illustration of the difference between principle and practice. In principle, a true-color, 9600dpi scan is a much better rendering of the page than a B&W 400dpi scan. In practice, all that extra information doesn't usually help the OCR make better distinctions between letters, and the larger and more detailed the scan, the longer it takes to make the scan, the more disk space the image file takes, and the more processing time and memory the OCR package needs to recognize it.
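To get a feel for the disk space involved, you can estimate the size of an uncompressed scan. This Python sketch makes several assumptions: a scanned area of 8.5 by 11.5 inches, 1 bit per pixel for Black and White, 8 for greyscale and 24 for true color, and no compression, which real image formats usually apply. The function name is made up.

```python
def uncompressed_scan_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.5):
    """Estimate the size in bytes of one uncompressed scan.

    The default area is an assumed flatbed size; bits_per_pixel is assumed
    to be 1 for Black and White, 8 for greyscale, 24 for true color.
    """
    width_px = int(round(width_in * dpi))
    height_px = int(round(height_in * dpi))
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


for label, dpi, bits in [
    ("B&W 400dpi", 400, 1),
    ("greyscale 400dpi", 400, 8),
    ("true-color 9600dpi", 9600, 24),
]:
    size_mb = uncompressed_scan_bytes(dpi, bits) / 1_000_000
    print(f"{label}: about {size_mb:,.1f} MB per scan")
```

Doubling the resolution quadruples the number of dots, which is why very high settings slow everything down so quickly.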
A further paradox emerges when considering higher vs. lower resolutions: depending on the paper and ink quality, you may see _more_ errors start to appear on very high resolution scans. These are caused by small imperfections in the paper or ink spots that show up on the high-res scan, and that the OCR tries to interpret as letters or punctuation.
So, in summary, bigger is better, but only up to a point.
Brightness is an often-neglected setting that can make quite a big difference to your results. Look at the scanned image: if you see lots of dark patches, make your scan lighter; if your letters appear thin and faded, make your scan darker.
See the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?" for some typical scans and results.
S.11. Can I use a digital camera in place of a scanner?
Digital cameras are getting better resolution all the time, and some volunteers have experimented with making a kind of home-made planetary scanner from a digital camera and a stand. So far, the results don't quite match a dedicated scanner, but as digital cameras improve, this may become a common option. One problem, which planetary scanners use specialized software to correct, is that the natural curve of the pages near the middle of the book tends to give a foreshortened aspect to the letters there, which can cause problems for OCR software, like guttering.
Whatever the current problems, the prospect of using digital cameras is exciting, because it will mean that non-typists will be able to produce old books borrowed from libraries without worrying about scan quality vs. damage to the spine.
S.12. What is OCR?
OCR stands for Optical Character Recognition. This is very important software that looks at the picture of the page that your scanner has supplied, and turns it into text.
When the scanner delivers the image of the page, that image is only a picture. You can't, for example, search for text in it, or edit the text to add a blank line. Your editor or word processor can't work with it. The OCR program does the job of "reading" and "typing" the image for you. OCR packages call this "reading" or "recognizing".
S.13. What differences are there between OCR packages?
One word: huge. All OCR packages do the same job, but they do it in different ways, with different features, and with different levels of accuracy. OCR can save you a lot of time, or cost you a lot of time. It's really worth putting some effort into making sure you get the right OCR package, and, once you have it, into understanding how to use it. It'll save you time in the long run.
S.14. How accurate should OCR be?
OCR packages commonly say that they are "99%+" accurate, or something like that. Let's analyze what that actually means: say there are 1,000 characters (letters) on each page, then with 99.9% accuracy, you would expect to have to make 1 correction per page. With 99% accuracy, that would be up to 10 corrections per page. And in a 400-page book, this all adds up.
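The same sum can be run for any accuracy figure and book length. This is a minimal Python sketch; the function name is made up, and the 1,000 characters per page is the round figure used above, not a measurement.

```python
def expected_errors(chars_per_page, accuracy_percent, pages=1):
    """Expected number of wrong characters, given an OCR accuracy percentage."""
    error_rate = (100 - accuracy_percent) / 100
    return chars_per_page * pages * error_rate


for accuracy in (99.9, 99.0):
    per_page = expected_errors(1000, accuracy)
    per_book = expected_errors(1000, accuracy, pages=400)
    print(f"{accuracy}% accurate: about {round(per_page)} per page, "
          f"{round(per_book)} in a 400-page book")
```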
But there's a "Your Mileage May Vary" clause built into that. Typically, the manufacturers test their OCR on fresh, laser-printed or press-printed copy with perfect scans, and this is fair, since they are aiming their products primarily at businesses that process these kinds of materials. _You_ are not dealing with fresh print; you're dealing with old books, yellowed, spotted, marked, imperfectly printed in the first place, and possibly using unfamiliar fonts. And it's unlikely that you will have the patience to get a perfect scan on every page. The result is that the accuracy of OCR for typical PG work doesn't match the accuracy on images of perfect, fresh paper.
Apart from the scan quality, OCR also has to contend with different fonts and sizes for the letters.
However, if you're getting more than 10 errors per page, you should look at some examples of OCR in the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?".
S.15. Which OCR package should I get?
The accuracy of OCR software has improved enormously in the last few years, and OCR technology looks likely to keep improving even faster than software in general. Further, there is competition in this area, and products leapfrog each other with new versions regularly. The brands most commonly mentioned by PG volunteers (mid-2002) are Abbyy, OmniPage and TextBridge [P.1], and trial versions of all three have been available for download over the Web, and may still be when you read this. [Warning: these are big downloads--40MB or more.]
Most common OCR packages will offer two main working options: to scan a page and view/edit the resulting text on the spot before saving, and to scan a whole batch of pages together and view/edit them all later. Some people like to fix up one page at a time; others prefer to get all of the OCR work done at once, then get the whole text into their editor. Most OCR software will cater for both, and if this is important to you, you should check that the OCR you're buying supports the way you want to work.
If you intend to work in a language other than English, make sure that the OCR you buy supports the characters in your language.
Some OCR software has a "training" or "learning" mode. Using this mode, it scans and "reads" or "recognizes" a page, then you correct that page, and the OCR "learns" from its mistakes and tries to do better on the letters it misread when it recognizes the next page. If you're dealing with a very rare font, this can make a difference to your OCR quality, but modern OCR packages come with enough inbuilt font knowledge for most languages, and you probably won't need this.
If possible, try a couple of OCR packages before you decide. If you want opinions on specific versions, contact other PG volunteers and ask for their opinions, as described in the FAQ "How do PG volunteers communicate?" [V.12].
S.16. What types of mistakes do OCR packages typically make?
Each text has its own peculiarities, but there are a number of well-known scanning errors you will be dealing with all the time.
Punctuation is always a problem. Periods, commas and semi-colons are often confused, as are colons and semi-colons. There are also usually a number of extra or missing spaces in the e-text.
The problem of quotes can assume nightmarish proportions in a text which contains a lot of dialog, particularly when single and double quotes are nested.
The numeral 1, the lower-case letter l, the exclamation mark ! and the capital I are routinely confused, and often, single or double quotes may be mistaken for one of these.
Lower-case m is often mistaken for rn or ni.
The letter pairs h/b and e/c are commonly misread, and these are probably the hardest errors of all to catch, since ear/car, eat/cat, he/be, hear/bear and heard/beard are all common words which no spell-checker will flag as problems.
For example:
" Hello1' caIled jirnmy breczily. 11Anyone home ? "
There seemed to he no-oneabout. Only tbe eat beard him."
should read:
"Hello!" called Jimmy breezily, "Anyone home?"
There seemed to be no-one about. Only the cat heard him.
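Some of these error types can be flagged mechanically for a human to review. The following is a minimal sketch in Python, not a description of any existing PG tool: the pattern list, the word list and the function name are all assumptions made for this illustration, and the word list contains only the pairs mentioned above. It cannot decide whether "he" should be "be"; it only points at places worth a second look. The rn/m confusion is not covered, since catching it needs a dictionary.

```python
import re

# Real words from the pairs listed above; a spell-checker passes them,
# so all we can do is ask a human to check them in context.
CONFUSABLE_WORDS = {"he", "be", "ear", "car", "eat", "cat",
                    "hear", "bear", "heard", "beard"}

# Illustrative patterns only; a real checklist would be much longer.
SUSPECT_PATTERNS = [
    (re.compile(r"\b(?=\w*[0-9])(?=\w*[A-Za-z])\w+\b"),
     "letters and digits mixed in one word"),
    (re.compile(r"\b[a-z]+[A-Z]\w*"), "capital letter inside a word"),
    (re.compile(r"\s+[.,;:!?]"), "space before punctuation"),
]


def flag_suspects(text):
    """Return (fragment, reason) pairs that deserve a human look."""
    findings = []
    for pattern, reason in SUSPECT_PATTERNS:
        for match in pattern.finditer(text):
            findings.append((match.group(0), reason))
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() in CONFUSABLE_WORDS:
            findings.append((word, "real word that OCR often confuses"))
    return findings


# The garbled example from above.
GARBLED = (
    '" Hello1\' caIled jirnmy breczily. 11Anyone home ? "\n'
    'There seemed to he no-oneabout. Only tbe eat beard him."'
)

for fragment, reason in flag_suspects(GARBLED):
    print(repr(fragment), "-", reason)
```

Note that "jirnmy", "breczily" and "tbe" are not flagged by this sketch; those are the kind of errors an ordinary spell-checker will catch.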
S.17. Why am I getting a lot of mistakes in my OCRed text?
If you're new to OCR, you may have come to it with the idea that OCR is almost perfect, and just makes a few mistakes now and then. No. It's slightly amazing that OCR works at all, and even when it does, it isn't perfect.
You might reasonably expect to average anything up to 10 errors per page for typical PG work; if you're seeing more, then there is a problem with
a) your printed book, b) your scan, or c) your OCR package.
Problems with the printed book fall into three categories: bad printing, age, and unusual fonts. Bad printing consists of problems like too much or too little ink on the press at the time the book was printed, and irregularities in the print where the metal type was damaged. Age causes yellowing--even browning--of the paper, and faded print. Unusual fonts may be hard for OCR to recognize, and very tightly-spaced print may make adjacent letters seem to touch, which confuses OCR software.
There are many ways for you to have problems with your scan. Obviously, if your scanner is defective or the glass is dirty, you will notice it immediately, but there are many mistakes you can make that will result in a poor-quality image, and cause later problems for your OCR.
You may not be able to control the quality of the paper you have to work with, but there is a lot you can do about the quality of your scan.
The two mistakes that people inexperienced with scanners most commonly make are not holding the spine down firmly enough to get a flat image of the paper, and not setting the brightness correctly, or letting too much light get in. In your early scans, watch out for these problems.
First, if you haven't already, read the FAQ "How do I scan a book?" [S.7] and check that you're following the basic recommendations there.
Now let's look at some samples, and see the kinds of problems you might encounter.
A disclaimer about these samples: specific OCR packages are named, but you should _not_ take these as a fair and comprehensive comparative review of the software. The object of this exercise is to show typical scanning conditions and problems, and the resulting OCR output. OCR packages have quite a range of variance within themselves, may work better on some texts than others, may improve with "training" or different settings, and I have even seen the same OCR package produce different text from the same image with the same settings! Further, since OCR quality is improving rapidly, and packages leapfrog each other in quality, the next version of a particular brand may be vastly better than any of the software mentioned here. Of particular interest in this context is the leap in quality between OmniPage 10 and OmniPage 11.
* * * * *
Scan 1--A Perfect Scan
Scan1 is as near to a perfect scan as you can expect in PG work. It comes from "The Founder of New France" by Charles W. Colby. It is only a 300 dpi image, but given the quality of the print and of the scan, 300dpi is all we need. Ironically, it comes from Gardner Buchanan, who complains about the age and infirmity of his scanner in his description of how he produces a text. The moral is that you don't have to have the latest equipment to get good results!
The actual scan is in the image file scan1-3.tif
It doesn't really need any comment, and all of the packages except gocr rendered it perfectly. Note the fake "space" before the semicolon--if you look closely at the image, you will see why the OCR packages mistook it for a full space, as discussed in the FAQ [V.104] "My book leaves a space before punctuation like semicolons, question marks, exclamation marks and quotes. Should I do the same?"
Champlain was now definitely committed to the task of gaining for France a foothold in North America. This was to be his steady purpose, whether fortune frowned or smiled. At times circumstances seemed favourable ; at other times they were most disheartening. Hence, if we are to understand his life and character, we must consider, however briefly, the conditions under which he worked.
gocr 0.3.6 converted this as:
Champtain was now definitely committed to the task of gaining for France a foothotd in _orth America. This was to be his steady purpose, whether fortune frowned or smiled. At times circumstances seemed favourable ., at other times they were most disheartening. _ence, if we are to understand his life and character, we must consider, however brieRy, the conditions under which he worked.
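If you have a corrected copy of a passage, you can list exactly which words an OCR package got wrong instead of hunting for them by eye. Here is a minimal sketch using Python's standard difflib module; the function name is made up for this example, and the two strings are the first sentence of the correct text and of the gocr output shown above.

```python
import difflib


def word_differences(reference, ocr_output):
    """Return (expected, got) pairs for the words that differ."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words,
                                      autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((" ".join(ref_words[i1:i2]),
                                " ".join(ocr_words[j1:j2])))
    return differences


REFERENCE = ("Champlain was now definitely committed to the task of "
             "gaining for France a foothold in North America.")
OCR_OUTPUT = ("Champtain was now definitely committed to the task of "
              "gaining for France a foothotd in _orth America.")

for expected, got in word_differences(REFERENCE, OCR_OUTPUT):
    print(f"expected {expected!r}, got {got!r}")
```

For this sentence it reports three differences: Champlain/Champtain, foothold/foothotd and North/_orth. Comparing word by word rather than character by character keeps the report readable, at the cost of counting a word with two wrong letters as a single difference.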
* * * * *
Scan 2--A Typical Scan
Scan2 is a paragraph from Baroness Orczy's "Castles in the Air". Notice the ink-splotch above the capital "I" in the first line, which will give our OCR some problems. The page is also unevenly inked elsewhere, and I have scanned it with the brightness level a bit too high.
I have made two separate scans, one at 300dpi and one at 400dpi, both Black and White, named scan2-3.tif and scan2-4.tif respectively. The page was cleanly cut, and carefully placed straight onto the scanner glass with the cover down. The original print is somewhere between the size of Times New Roman 10 and 11, with capital letters about 2.2 millimeters high, but better and more clearly spaced. These scans are fairly typical for PG work. Because of the relatively large letters, and the reasonable scan, there isn't much difference between the text produced from the 300 dpi scan and the 400 dpi scan.
I actually cut this book to get the pages out so that I could feed them through my ADF (automatic document feeder), but the paper is so thick and textured that the sheets stick together and jam when feeding through. The thick, absorbent paper, combined with the uneven inking, means that, no matter how good the scan, any OCR has to contend with the irregular edges of letters, which are clearly visible even at 300dpi.
Here is the output for these scans from some OCR software packages. I changed just one thing: Abbyy recognized the em-dashes as such, and output them as a special character in Codepage 1252 for em-dashes, which isn't available in ASCII, so I converted that to the PG standard 2 dashes.
Abbyy FineReader 6:
Yes, indeed, I was on the track of M. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain %vas seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs--a goodly sum in those days, Sir--was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
Yes, indeed, Twas on the track of M. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs--a goodly sum in those days, Sir--was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
gocr 0.3.6:
__e_, indeed, f___as on_the track of h_. hristide Fournier, 3nd of one of the most im__ant hau1s of enem)_ goods ___hich had e__er been made in France. h?ot onl3_ that. I had a1so before me one of the most brUtish crimînat_s it h__4 e___er been m31 misfortune to co_me acro__3. A bu113_, a tiend oí cruelt__. In very truth m3_ fertiIe brain ___as s_e_1_::_g __-ith planS for e__entua113_ _ay:ng that abominab1e ru_iin b.__ t1_e hee1s . hanginig __ou1d be a n_erciful pun- i;__,i__gnt íor such a miscreanf. yes, in_i__ee3, fj_1e thou3and francî-a b_ood13_ sum in those days, _ir-_vas practica1l3_
a3_ured me. _ut o___er and above n_ere lucre there was the certaint_v that in a few_ da3_s' ti_e I shou1d see the lib_ht of gratitude shininb_ out of a pair _f _usLtrous btue e3_e3_, and a ___inning smi1e chasing a__ay the Ioo_ of _ear and of sorrow from the s__eetest iace T had Seen fof man)_ a day.
Yes, indeed, f___as on the track of h__. Ariseide Fournier, and of one of the most important hau1s _f enemy goods ___hich had ever been made in France. NoEUR on1y that. I had also before me one of the most brutish crimina1s it h_ad ever been my misfo__tune to come acros__. A bu11y, a fiend of crue1ty. _n very truth my fertib brain _vas seeî3_:i_g __ith plans for e__entua11p 1aying _at abom_in_ ab1e ru_an by the heels. hanging _____ou1d _ a merciful pun- iï_h_ment for such a miscreant. Yes, indeed, five thou__and f_ancs-a b_ood1y sum in those days, _ir-_vas practica1ly a3îured me. But over and above mere _ucre th.ere was th_e certainty that in a few days' ti_e _ shou1d see the 1i__t of gratjtude shining out of a pair o_, _userous b1ue b . e__es, and a __inning smi1e chasing away the l_k of _,ear and of sorrow from the s___,eetest face _ _ad _.een _o_ many a day. . .
Recognita Standard 3.2.7AK:
~'es, indeed, ~w-as on the track of ltT. Aristide Fournier, and of one of the most important hauls of enemy goods "=hich had ever been made in France. ~Tot only that. I ha~i also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully-, a fiend of cruelty. In very truth my fertiIe brain was s; ething w-ith plans for eventually iaying that abominable ruffian by the heels : hanging ~-ould be a merciful pun- ishment for such a miscreant. ires, indeed, five thousand franes-a goodly sum in those days, Sir-was practically as~ured me. But over and above mere lucre there was thP certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous btue ey·es, and a winning smile chasing away the hk of fear and of sorrow from the sweetest face I had seen for many a day.
Yes, indeed, l~was on the track of h~i. Aristide Fournier, and of one of the most important hauls of enemy goods w~hich had ever been made in France. lVot only that. I had also before mP one of the most brutish criminals it had ever been my misfortune to come acrass. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for ez~entually laying that abomin_ able ruffian by the heels : hanging ~~.-ould be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand f:ancs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should~ see the Iight of gratitude shining out of a pair of iEustrous blue eyes, and a w inning smile chasing away the Iook of fear and of sorrow from the s"-eetest face ~ had seen ~'or rr~any a day.
OmniPage Pro 10:
Yes, indeed, twas on the track of 11T. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I ha(i also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
Yes, indeed, fwas on the track of h-I. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
OmniPage Pro 11:
Yes, indeed, twas on the track of AT. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
Yes, indeed, fwas on the track of h-I. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day.
TextBridge Millennium Pro:
Yes, indeed, rwas on the track of M. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I hail also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for many a day. - - -
Yes, indeed, f was on the track of M. Aristide Fournier, and of one of the most important hauls of enemy goods which had ever been made in France. Not only that. I had also before me one of the most brutish criminals it had ever been my misfortune to come across. A bully, a fiend of cruelty. In very truth my fertile brain was seething with plans for eventually laying that abominable ruffian by the heels: hanging would be a merciful pun- ishment for such a miscreant. Yes, indeed, five thousand francs-a goodly sum in those days, Sir-was practically assured me. But over and above mere lucre there was the certainty that in a few days' time I should see the light of gratitude shining out of a pair of lustrous blue eyes, and a winning smile chasing away the look of fear and of sorrow from the sweetest face I had seen for manyaday. -
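The one change made to the Abbyy output above, converting the Codepage 1252 em-dash to two dashes, can be sketched in Python as follows. This is an illustration only, not the method actually used to prepare the samples; the function name is made up, and the sample bytes are a fragment of the Scan 2 text with the em-dash written as byte 0x97, which is its position in Codepage 1252.

```python
def to_pg_dashes(text):
    """Replace each em-dash (U+2014) with two ASCII hyphens."""
    return text.replace("\u2014", "--")


# A fragment of the Scan 2 paragraph as it might be saved in
# Codepage 1252, where byte 0x97 is the em-dash.
raw = (b"five thousand francs\x97a goodly sum in those days, "
       b"Sir\x97was practically assured me.")

converted = to_pg_dashes(raw.decode("cp1252"))

# Encoding to ASCII raises UnicodeEncodeError if any other
# non-ASCII character is still present in the text.
converted.encode("ascii")
print(converted)
```

The final encode step is a cheap way of checking that nothing else outside plain ASCII is left in the text.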
* * * * *
Scan 3--Guttering and Smaller Print
Scan3 is a paragraph from "The Egoist" by George Meredith. It was scanned in a dim room, with the scanner cover open and the book held open, flat against the scanner glass. However, the spine was not pressed firmly enough against the glass, and as a result you can see that the words on the left-hand edge (which were near the spine) appear to be slanted, a bit distorted, and not well lit. This problem is familiar to people who scan for PG--everybody gets distracted sometimes, and fails to keep enough pressure on the spine. As you see from the results below, it caused problems for all of the OCR packages on the words affected. If you find this kind of "guttering" regularly in your own scans, where the characters near the spine are not being recognized correctly by your OCR, you need to make sure that your book is down as flat as possible before making a scan. Because of the smaller size and the guttering problem, the 400dpi scan made for better quality text in this case.
Here's the output from the sample OCR:
Abbyy FineReader 6:
NEITHER Clara nor Vernon appeared at the mid-day table, n Middleton talked with Miss Dale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an uncdified audience might really suppose, upon seeing her over the difficulty, she had done something for herself. Sir \Villoughby was proud of her, and therefore anxious to soltlo her business while he was in the humour to lose her. He hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from him, had vaguely frightened even more than it offended hia nrido.
NEITHER Clara nor Vernon appeared at the mid-day table. Dr. Middleton talked with Miss Bale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an unedified audience might really suppose, upon seeing her over the difficulty, she had done something for herself. Sir "VVilloughby was proud of her, and therefore anxious to settle her business while he was in the humour to lose her. He hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from him, had vaguely frightened even more than it offended his pride.
gocr 0.3.6:
__,,,____,_ Cl,_I._c nor Vernon a__e_Ped _t tl_le _id_da_ tab1e_ _, _ii_(__etoiI f,,_lk(;cl with _MiSs _ale _U_1d_ abS8iG_l I_i_t_t_l.__ i,_i,;,_ .,, _(_u_-i,L_t_ii.e(l 6iiLIblt 6'7_V. ill_ _ C 'll . tf e__Ul__b rU_l gt(),ii_, tu _fj(),I(, ,_uruSS.,__ T__ Illl_ g UlOUUt_lU o_ _ 8O .t _' t_ail u,,_,_ifj(;il ;,_i((ic,IGG l_i_' lt re_ y 8UE)_OB_'_ U_Oll 8eelll6 lttr _,__i. t_ic (li__icu1ty, SIIe t1_d iluI_e 8ol_eth_ng_ fo_ be_.Self. _i__ _ji___()_i___lIl)y w,,s prui_il of heT_ and k__eTefope an_iouS to _(_(.__u l___i. i)i__, ii,ess wIlile he Wa8 in the hU_ouT to luse Iier_ j__ l_()_)(_(l t() tiiIish it b_ ShOOtiltg a WOTd o__ t_O &t Verno_ _o__(),__ (li,_iIci._ Cl__T_'S _eti_tio_ tO be Set fTee_.Te1ea8ecl fro_ )ii))),, lIL_Ll v_b__uely f_.ighteUe eVen _OTe kba_ lt OfEe_ded hi_ pi_i..(l_u- . _ , , --.___ _ _,- - -__-
________ Cl__i.a nop Vernon appeared &t t'h_e _id_day t__le_ D_. _id(lle_oi_ t_lked with Miss _ale ,on _ _Ssi__l __i tt_r_'_ iij_e _ 6ood-n___tLi_.ed 6iai_t 6_i_ing & Ghild the ___np _'_.on_ _tune to _tone aGro_S a braWlin( __ inOU__taiß _foPd_ So t2_at a__ u__p,(_ified ___idiei_Ge _ni62it real y 8uppO.8e_ upon _seeii_6 l_e_ o______ the difhculty_ she had done _o_neth_n6 fop ber_elf_ _i_ _viljoli____k)y w__s proud of heT, and the_efo_e an_iouS to ___.tle li__i. i)u__inesS Whike he W_S î_ the hum'ou_ to_ lose her_ __e l_op(_d to finish it by 8hooting a wopd o_ tWo ak Verno__ _ _eforR_ _(in_icr_ Clara's petition to _ Set _free, releaSed fro_ )ii__, h_d va6uely frigbte_ed eve_ _ore tban it o_e_ded hiD pi.icle. -. - - - - - '
Recognita Standard 3.2.7AK:
~rFr~rrmx Clara nor Vernon apneared at the mid-da~'table. Dr. bLidrlleton talkc;d wi.th Miss Dale vn elassieal matters, like a ~n~a-mZtured giant gi.ving a child th© jucnp frvm stonc to stone across a brawling mounta,in ford, so that au uiicilificd .ruciicucc mil;·ht really suppasc, upon seeixig hor ·n~er thc ciillicul.ty, she had clouo something for herself. Sir ~Villcm;;lrlry wvs proua of her, and therefors angiaus to sct.tla lrur tn~sincss while he was in the humoar to lose her. lle lu,hcot to iinish it by shooting a word ar two at Vernon bol'ore ~linncr. Clara's petition to bo set froe, released £rom JGGnt., hvd vagucly frighteued even more than it offended hia ri~le. p
NEITfi~R Clara nor Vernon appeareci at the xnid-day table. Dr. Middleton talked with Miss Dalo on classics,l rnatters', like a good-natured giant giving a child the jtimp from stone to stone across a brawling mountain ford, so that an unedified audience might really suppose, upon ~ seeing her over the difficulty, she had done something for herself. Sir yillon ;hby was proud of her, and therefore anxiotis to scttle luer business while he w~as in the hurxiour to lose her: He hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from jcLm, had vaguely frighteued even more than it offended his pride.
OmniPage Pro 10:
NF r~rn,Px Clara nor Vernon appeared at the mid-dap table. Dr. Middleton talked with Miss Dale on classical matter, like .t good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an uneVified audience might really suppose, upon seeing her over the difficulty, she had done something for herself. Sir jV;llo,r;;lrl>y was proud of her, and therefore anxious to set.tlo lror Uusiness while he was in the humour to lose her. Ile. lropcol to finish it by shooting a word or two at Vernon bol'ore dinner. Clara's petition to beset free, released from )zinc, had vaguely frightened even more than it offended his pride.
NEITHER Clara nor Vernon appeared at the mid-day table. Dr. Middleton talked with Miss Bale on classical matters', like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an unedified audience might really suppose, upon ~ seeing her over the difficulty, she had done something for herself. Sir yillou ;hby was proud of her, and therefore anxious to settle her business while he was in the humour to lose her. He hoped to finish it by shooting a word or two at Vernon before dinner. Clam's petition to be set free, released from him, had vaguely frightened even more than it offended his pride.
OmniPage Pro 11:
NF f,rnMR Clara nor Vernon appeared at the mid-day table. Dr. Middleton talked with Miss Dale on classical matters, like .t good-natared giant giving a child the jump from stone to stone across a brawling mountain ford, so that an une(lifie(l audience might really suppose, upon seeing her over the difficulty, she had done something for herself. Sir jVillon;hl)y was proud of her, and therefore anxious to setale leer business while he was in the humour to lose her. lle hoped to finish it by shooting a word or two at Vernon bofore dinner. Clara's petition to beset free, released from )lint, had vaguely frightened even more than it offended his pride. -.2 ..1_ - ____
NEITHER Clara nor Vernon appeared at the mid-day table. Dr. Middleton talked with Miss Dale on classical matters', like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an unedified audience might really suppose, upon,seeing her over the difficulty, she had done something for herself. Sir Willoughby was proud of her, and therefore anxious to settle her business while he was in the huniour to lose her. Il"e hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from hint, had vaguely frightened even more than it offended his pride. - -
TextBridge Millennium Pro:
NErr'!'~~ Clara nor Vernon appeared at the mid.day table. pr. ~1id(lIeto11 talked with Miss Dale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that au ~1edifi~ tLU(llCIlCC might really suppose, upon seeing her over the (hjiheulty, she had done something for herself. Sir wiflouighby was proud of her, and therefore anxious to settle her business while he was in the humour to lose her. lie ho1)ed to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from him, had vaguely frightened even more than it offended his prú~t~.
NEITHER Clara nor Vernon appeared at the mid-day table. Pr. Middleton talked with Miss Dale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an une(lified audience might really suppose, upon - seeing her over the difficulty, she had done something for herself. Sir Willoughby was proud of her, and therefore anxious to settle hier l)uSifleSS while he was in the humour to lose her. lie hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from hirn~, had vaguely frightened even more than it offended his pri(le.
* * * * *
Scan 4--A Really Bad Case!
Scan4 is a paragraph from Pope's translation of Homer's "Odyssey". This is a very, very tough one. It was obviously a cheap printing to begin with, using thin, poor-quality paper in a page size of 6" by 4.5", with capital letters about 1.5 mm high, a little bigger than Times New Roman size 8. Text this small really needs a higher-resolution scan. The book was falling apart when I got it, the ink was fading and flaking, and there was no point in even thinking about trying to scan it flat, so I cut the pages. To add an extra challenge, I scanned the sample with the cover open in a medium-lit room for the 300 and 400dpi scans, but closed the cover for the 600dpi to show the best quality I could possibly get. (I was pleased to note that Abbyy, while recognizing the page in the 300dpi and 400dpi images, flashed up a suggestion that I should lower the brightness of the scan.)
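To see why small print benefits from a higher-resolution scan, it helps to work out how many pixels tall a capital letter is in the image. The sketch below is plain unit conversion (25.4 millimeters to the inch), using the letter heights quoted for Scan 2 and Scan 4; the function name is made up, and no claim is made here about how many pixels any particular OCR package needs.

```python
MM_PER_INCH = 25.4


def letter_height_in_pixels(height_mm, dpi):
    """Approximate number of pixel rows a letter of this height spans."""
    return height_mm / MM_PER_INCH * dpi


for label, height_mm in (("Scan 2 capitals (2.2 mm)", 2.2),
                         ("Scan 4 capitals (1.5 mm)", 1.5)):
    for dpi in (300, 400, 600):
        pixels = letter_height_in_pixels(height_mm, dpi)
        print(f"{label} at {dpi} dpi: about {pixels:.0f} pixels high")
```

On these figures, the 1.5 mm capitals of Scan 4 are only about 18 pixels high at 300 dpi, against about 26 for the 2.2 mm capitals of Scan 2. At 600 dpi they reach about 35 pixels, roughly what Scan 2 gets at 400 dpi.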
This particular book was one I sporadically tried to produce, without success, on an older scanner and a bundled OCR program over a period of two years, back in 1998/99. Eventually, in 2000, it was the first book processed through Charles Franks' Distributed Proofreaders site. The initial text produced by the OCR was very poor, but the human volunteers made up for it! Thanks, guys! Today, just two years later, with a better scanner and better OCR, I could have done it myself, as you will see from the best of the results of the 600dpi scans. That's how much things have improved recently.
A separate point to note here is that you can see the "three-quarter space" effect before the exclamation mark and semi-colon that was discussed in [V.104].
The results of the OCR are:
Abbyy FineReader 6:
" Ah me ! on what inhospitable coast, On Tvh.it new region is Ulysses toss'd ; Possess'd by wild barbarians fierce in arms ; Or men. whose bosom tender pity warms ? What sounds are these that gather from the shores ? The voice of nymphs that haunt the sylvan bowers, The fair-hair'd Pryads of the shady wood ; Or azure daughters of the silver flood ; Or human voir-e? but issuing1 from the shades, AVhv cease I straight to learn what sound invades?"
" Ah me ! on what inhospitable coast, On what new region is Ulysses toss'd ; Possess'd by wild barbarians fierce in arms ; Or men, whose bosom tender pity warms '? "What sounds are these that gather from the shores ? The voice of nymphs that haunt the sylvan bowers, The fair-hair'd Dryads of the shady wood ; Or azure daughters of the silver flood ; Or human voice? but issuing from the shades, Why cease I straight to learn what sound invades?"
" Ah me ! on what inhospitable coast, On what new region is Ulysses toss'd ; Possess'd by wild barbarians fierce in arms ; Or men, whose bosom tender pity warms ? "What sounds are these that gather from the shores ? The voice of nymphs that haunt the sylvan bowers, The fair-hair'd*Dryads of the slrady wood ; Or azure daughters of the silver flood ; Or human voice? but issuing from the shades, Why cease I straight to learn what sound invades?"
gocr 0.3.6:
[The 300 and 400 dpi scans produced nothing recognizable. The result of the 600 dpi scan is below.]
'' _hh i_3e ! o_1 ___l_at_ i__l__sl__ it_nble CoaSt_ On ___l_,__ _)e_v i_e_io__ i__ ___ _._____ses toss'd ; _(3s3gs3_d l3.__ ___iiíi l3_3__b___i_c_i3_ fie_Ce in il__S- _ Or i11pn, __-i)c3se l_osonl te_1de_ _it____ __ai_n3__ ? ___l_at __o__i1ds Qre tlipse tliat g__tl_p_r fE_oi33 the shoTes ? '_ilie __oi__e of i)____ E1)l3l3s tl3nT 1i_n__nt the s__l__inn bo_Ye_5_ 3'l_e fni___i____ir'd _____-ads of' il_e sli__d__ i___oOd _ Op az(_pe da_____litc__s of _tlie sil __?r t1ood ; Or l___i31_nn ___)i___? l3__t i3____ii_6 fi_oi11 tlie __hiade__ _ __'!3.__ _ea___e _ s_rai__li.t to l_ar_i1- i_--li__t so_nd- in__ad_S___''
Recognita Standard 3.2.7AK:
.: lh nt"'. on w-hat inlu,;y:t, I,:e co;;~t, On ~cli^t ne~- re~ion i.. 1= 1-.-:.:e~ tm:'d ; Possea'd 1n- wil~l L;,rba~:c, .~ fierce in arm~ ; Or u.~u. w-Ln.e bossum tender pit~- warna'? ~l-u:lt .<,:~;;::;3s are tll~ce that ~atl:er from the shnre~ ? 'I'l.e -;;o'.re :,; nwtthil: tW ,t l:aa;nt the s~-l:c 1llJOR'er5, 'lhe :a,:~-h ~;r'd~It.wa~i~ ot' tl:e ~Il;;dv vood; Or az.lre dau~~l.ts~: oY tl:c ·:iv-~~r floo;:3 ; C?r humnn ~-<:i: e'? l,~:tt i~~; from tl:c· ~had~~, 11-lts- cea~e I ctrai rlit to learn ~s-l:, t socud incades %"
" ~h me ! ou "-Mat iuMospita~le coast, On ~i-lmt ne~c reyion is L 1~-~ses to~s'd ; Pos:e;s'd 1"~ w-iMl lrvrbaria:ns fiet~ce in arms ; Or m~ n, "-hose hosom tender pit~- warm5 ? ~~~hat ~ounds are tlmse tMat ~;atMer from t:he shores ? ~t'I~e ~-oi~~e of n~-Inhhs t.hat liaunt the s~-l~~a n howers . Tlie fair-hnir'd D~ vads ot tl:e shad~- "-ood ; Or aznre dau~liters of tMe sil~-~r fiood ; Or lmman ~-oi:~e'? but iauin~ frotn the shades, a lVly cea.~e I straibht to learn "-Mat souud in~ad°s?"
" Ah me ! on what inhospitable coast On ~~-hat new r e~ion is L;1 ~-sses toss'd ~ , Possess'd 1J~- "-ilil I:OII'uai'la ils fierce in arms_ · Or men, whose hosom tender pit~l ~varn~s ? ~'G'l~at somnds are these tliat ~atl~er from the shores ? ~I'Iie v oice of n~-mpl~S that ~munt the sy Ivan bowers, Tlie fair -hair'd D~~~-ads of tl~e slmdy wood ; Or azure daylltcrs of tlle silver flood ; Or lm:nan voice? uut issL~ing from the shades, ~~'lm cea~e I strai~ht to Iearn ~~-lmt so~nd inv ades ?"
OmniPage Pro 10:
,. _lh in- ' on "-hat inh-slit al.:e coast, On "M.^t new reion is 1=1;-a:e~ to-s'd ; P"::e:~'d hw "ild Larba.:an~ fierce in arms ; Or inn. "-hnse bo.,om tender pity warms What <m-,n ds are thFSe that gather from the shores? '1-l.e vo_,e o2 u~vnhit: thm hn,,-,nt The sylvan bowers, The is ;r-ha;r'd h.-;-ads of the liz-Ay iNood Or azure dau_ht;- of tl:c o=1 cr flooj ; Or hnnmn wire? l,11t i--rii:g from the shadP3, Al-ly cease I straiAlit to learn what sound invades?"
'Wh me ! on what inhospitable coast, On what new region is L fusses toss'd ; Possess'd br wild barbaric ns fierce in arms ; Or men, whose bosom tender pith- warms AN-hat sounds are these that gather from the shores ? The voice of nymphs that Haunt the sylvan bowers, The fair-hair'd IWvads of the shady -wood ; Or azure daughters of the silver flood ; Or human voice? bat iauina from the shades, Why cease I straight to learn what sound invades?"
" Ah me! on what inhospitable coast, On what new region is Ll ysses toss'd ; Possess'd bv -wild barbarians fierce in arms ; Or men, whose bosom tender pity warnis ? AVlia± sounds are these that gatller from the shores The voice of nYI11pliS that haunt the -sylvan bowers, The fair -hair'd D.-yads of the shady wood ; Or azure daughters of the silver flood ; Or human voice? lout issuing from the shades, Why cease I straight to learn what sound invades?"
OmniPage Pro 11:
.` lh in-' on what inhospital,le co-st, On xclznt near region is t 1:-sse~ toss'(: ; Possess'd bY Mild barbarians fierce in aims ; Or inn. whose boson tender pity warms What <m-,n ds are tlipse that gather from the shores ? '_I-I.e 1-o=,- of nv:npii? that haunt the sylvan bowers, She ra;r-ha;r'd 1):, ads of the shad- wood ; Or az.ire dau_lit~- of tl:e silo-:-r flood ; Or human voice? l,,tt i?snina from the shadpq, Al-lry cease I straiAit to learn shat sound invades?"
''' :Ah me ! on what inhospitable coast, On iyhat new region is Ulysses toss'd ; Possess'd br wild barbarimis fierce in arms ; Or men, whose bosom tender pity warms AN-hat sounds are tliese that gather from the shores ? The voice of nymphs that haunt the sylvan bowers, The fair-hair'd D~ yads of the shady -wood ; Or azure dau.L-hters of the silver flood ; Or human voice? but issuing from the shades, Why cease I straight to learn what sound invades?"
" Ah me! on what inhospitable coast, On what new region is Ulysses toss'd ; Possess'd by -wild barbarians fierce in arms ; Or n1en, whose bosom tender pity warnis ? AVliat sounds are these that gather from the shores The voice of nyniplis that haunt the sylvan bowers, The fair-hair'd Dryads of the shady Wood ; Or azure daughters of the silver flood ; Or human voice? but issuing from the shades, Why cease I straight to learn what sound invades?"
TextBridge Millennium Pro:
no on what inhe~ptaEie coast, On what new realun is hivs,e' to5sd ,s~s Ä-~d liv wild lie il)~m.ihI fir see in al-rn~ Or u~,-n. w'linse bo,uuiu tender pity warnls Wl at ~ are t1ie~e that ~atler from the shores ? 'n.e a oro of imvntpirs tint he~nt the sad van bowers, 'flie tah'-ha~r'd D~vahs ct the shady wood 1)1' az Ire dauul~t ~ of tl,e shvr flood Or liunian vi i 'I ? h'tt is- eng from the shades, \VIiv cea-~e I straight to learn w hat sound invades 1"
Ah me on what inhospitable coast, On what new region is U vases toss'd Possess'd by wild barbarians fierce in arms Or men, whose bosom tender pity warms ~ What sounds are these that gather from the shores? The voi'e of nymphs that haunt the sylvan bowers, The fair-baird Prvads of tl~e shady wood Or azure daughters of the silver flood Or human vuiae? but issuing fi'om the shades, Why cease I straigl~t to learn what sound invades?"
Ah me on what inhospitable coast, On what new region is Ulysses toss'd Possess'd by wild barbarians fierce in arms Or men, whose bosom tender pity warms? What sounds are these that gather from the shores? rfhe voice of nymphs that haunt the sylvan bowers, The fair-hair'd Dtyads of the shady wood; Or azure daughters of 'the silver flood Or human voice? but issuing from the shades, Why cease I straigl~t to learn what sOund invades?"
What can we conclude from this?
Small mistakes in scanning, like letting too much light in, getting your scanner settings wrong for the page, or not pressing the paper flat enough, can make a major difference to the final quality of the text that you will have to correct.
Sometimes, no matter what you do with your scanner, problems with the paper or the print will make it difficult for your OCR package to give good output.
Generally, bigger is better within the range 300dpi-600dpi, but you only need higher resolution with more difficult material.
Different OCR packages will produce widely differing texts from the same images. Given a really good image, most OCR software will work acceptably, but when you have lower quality material to work with, the gap between OCR packages shows clearly.
S.18. I got an OCR package bundled with my scanner. Is it good enough to use?
That depends on how well your package performs on the actual scans that you do, and how much you value your time vs. money. Most scanners are bundled with OCR software, but these OCR packages are often older or "brain-damaged" versions, with their functionality deliberately lowered. It's unlikely that you'll get a current-version, top-of-the-line OCR package thrown in for free.
You may have to pay extra for better OCR, but it means that you spend less time making corrections. The question is how much better you want your OCR to be.
Save the images from the FAQ "Why am I getting a lot of mistakes in my OCRed text?" [S.17] and try processing them with the OCR you have. Compare the quality of the text produced with the quality of the samples. This should give you some idea of how your OCR compares to others.
Try a few pages from your book with your OCR. How many mistakes do you see on each page? Do you find that acceptable?
S.19. I want to include some images with a HTML version. How should I scan them?
We don't often see color prints in our books, but if you do have one, then scan it in color. Otherwise, try both greyscale and B&W, and see which gives you the best image.
It's usually better to scan images in a higher resolution than you're going to use, and then use an image manipulation package to reduce them [H.10] to a size appropriate for your HTML file. An initial scan at 600dpi is often good. Image manipulation programs will also allow you to "clean up" the pictures, by increasing contrast, despeckling, or other filtering.
S.20. I want to include some images with a HTML version. What type of image should I use?
GIF, JPEG and PNG images are supported by current browsers, and you should stick with those unless you have a specific reason not to.
GIF and PNG tend to be more efficient--provide better quality at a given file size--for simple line-drawings; JPEG is usually better for photographic images.
S.21. Will PG store scanned page images of my book?
No. Or, at least, not yet.
The idea has been kicked around a bit. There's no question of replacing etexts with page images, but many volunteers who have already scanned the book anyway like the idea of saving page images as well--for general information, and as a means of checking future correction suggestions against the original. Some volunteers already keep their page images, stored for possible future use.
Working some back-of-the-napkin figures: a page of text might take up 1KB of space on a computer as plain text or HTML or XML. The same page might take 70KB if stored as a black-and-white image, of just enough quality to serve as a reliable guide to making corrections. Pages with pictures, or stored with enough resolution to allow some future researcher to write a paper on the changing shape of serifs in the 18th and 19th centuries, would start at around 350KB per page, and go up from there.
A 300 page book thus becomes
about 300KB as plain text (and around 150K zipped)
about 20,000KB as minimal-quality images
about 100,000KB as high-quality images
and with the images, we won't save much space on the zipping, because they're already compressed.
On a normal "56K" modem, getting about 4KB / second, it would take:
75 seconds to download the text file (40 for the Zip)
80 minutes to download the minimal images
over 5 hours to download the high-res images.
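The back-of-the-napkin figures above can be reproduced with a few lines of Python. This is only a sketch: the per-page sizes and the 4KB per second rate are the estimates given above, and the function names are made up for illustration. Worked out exactly, the image figures come out a little higher than the rounded numbers quoted above.

```python
# Rough storage and download estimates for a 300-page book.
# The per-page sizes and the modem rate are the estimates from the text.
PAGES = 300
KB_PER_PAGE = {
    "plain text": 1,
    "minimal-quality images": 70,
    "high-quality images": 350,
}
MODEM_KB_PER_SECOND = 4  # what a "56K" modem delivers in practice


def book_size_kb(pages, kb_per_page):
    """Total size of the book in KB."""
    return pages * kb_per_page


def download_seconds(size_kb, rate=MODEM_KB_PER_SECOND):
    """Seconds needed to download size_kb at the given rate."""
    return size_kb / rate


if __name__ == "__main__":
    for label, per_page in KB_PER_PAGE.items():
        size = book_size_kb(PAGES, per_page)
        minutes = download_seconds(size) / 60
        print(f"{label}: {size:,} KB, about {minutes:.1f} minutes")
```

Changing PAGES or MODEM_KB_PER_SECOND lets you redo the sums for your own book and connection.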
Someday, the disk and bandwidth capacities that we will take for granted will be such that uploading images, when we have them, will be quite natural, just for the few people who will want them. But we're not quite there yet.
Late flash! As of late 2002, the Internet Archive is providing space to volunteers for storing page images. To see the images, and find out more, go to <http://texts01.archive.org/gutenberg-images/>
HTML FAQ
H.1. Can I submit a HTML version of my text?
Yes.
H.2. Why should I make a HTML version?
Well, you can make one just because you want to, but on some texts there is special reason to.
If you want to preserve the pictures that accompany the text, making a HTML version means that you can specify where and how those images appear.
If there is particular meaningful information in the layout of the text that can't be expressed in ASCII, like special characters or complex tables or fonts, HTML may offer an open format alternative.
H.3. Can I submit a HTML version without a plain ASCII version?
You can submit it, but the Posting Team will then consider whether we should also make an ASCII, or perhaps ISO-8859 or Unicode version of it. We really do want our texts to be viewable by everybody, under all circumstances, and we do not want to start posting texts that are in any way inaccessible to anyone.
See also the FAQ [G.17] "Why is PG so set on using Plain Vanilla ASCII?"
H.4. What are the PG rules for HTML texts?
1. The only absolute rule is that the HTML should be valid according to one of the W3C HTML standards.
You can verify that your HTML is valid at the W3C's HTML Validator at <http://validator.w3.org/>
For a more convenient and friendly, though less official, check of the correctness of your HTML, you should use Dave Raggett's Tidy program at <http://tidy.sourceforge.net>, which not only points out any messiness in your HTML code, but also has some neat modes to clean it up and standardize the formatting.
After that, we have some requirements and recommendations. Compliance with the requirements might be waived if there is a really good reason to make an exception in this case.
2. Requirement: File names and extensions
If you want your text to work within 8.3 filename conventions, you may use .htm as the extension for your HTML files; otherwise, use .html as the extension. If you are working to 8.3 conventions, all of your images as well as your HTML files should have 8.3-compliant filenames.
All file names and extensions should be in lower-case throughout. Yes, we know this is not strictly necessary, but we don't want to have to correct every file that comes with "image.gif" referenced in the HTML accompanied by a file IMAGE.GIF.
3. Requirement: HTML and plain-text
Project Gutenberg does publish well-formatted, standards compliant HTML. However, we insist that a plain text version be available for all HTML documents we publish (even if images or formatting are absent), except when ASCII can't reasonably be used at all, for example with Arabic, or mathematical texts.
4. Requirement: Archive format for posting
If the HTML book contains more than one file (including images), create a ZIP (preferable) or TAR archive containing all of the files in the book. The ZIP file may, if you wish, unzip to a subdirectory named for the book. For example, a book called 'The Humour of Mark Twain' might unzip in a directory called 'mthumor'. Make sure directory names contain only alphabetic and numeric characters, no spaces, and are 8 characters or less, even if you're not sticking to 8.3 conventions for filenames.
5. Recommendation: Simplicity
Make your HTML as simple as possible. HTML is an evolving standard, and one that may be completely obsolete in the long term. Use of advanced features may just mean that your version will be obsolete or unreadable that much faster.
6. Recommendation: Images
Images included with your HTML should be in a format that Web browsers can read: GIF, JPEG or PNG. Images should be edited for high quality in a reasonably small file size. Make the best decision you can concerning the image size and placement in the text. Every image included must be linked into (referenced by) the HTML.
7. Recommendation: Line lengths
If it is reasonable to do so, try to wrap paragraphs of text at around the normal PG margin of 70 characters. Ideally, your HTML should be as near as possible identical to your text version except for the HTML tags and entities. People who open your HTML won't all be using browsers, people will need to make corrections, not all editors can handle very long lines, and even with editors that can handle long lines, it's easier to work with short lines.
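If your editor can't re-wrap text for you, a short script can. The following is a minimal sketch using Python's standard textwrap module; the function name is hypothetical, and it assumes that paragraphs are separated by blank lines. It deliberately never splits a long word, so a tag or URL longer than 70 characters will still produce a long line.

```python
import textwrap


def wrap_paragraphs(text, width=70):
    """Re-wrap each blank-line-separated paragraph to `width` columns."""
    wrapped = []
    for paragraph in text.split("\n\n"):
        # Collapse existing line breaks and runs of spaces first
        normalized = " ".join(paragraph.split())
        wrapped.append(
            textwrap.fill(
                normalized,
                width=width,
                break_long_words=False,  # never split a long tag or URL
                break_on_hyphens=False,
            )
        )
    return "\n\n".join(wrapped)
```

Don't run something like this over poetry, tables, or anything else where the line breaks matter; it treats every block as ordinary prose.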
Apart from these rules and recommendations, we also have a rule about the PG header, but that will normally be handled by the Posting Team. Where your HTML is all in one file, the header text will be inserted within PRE tags in that file. Where the HTML is split into multiple pages, the header will be put into a separate file named index.htm or index.html, and will link to the first page of your HTML.
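Requirements 2 and 4 above lend themselves to a mechanical check before you submit. Here is a sketch in Python using only the standard library. The function names are invented for this example, and the directory-name test is simply one reading of "alphabetic and numeric characters, 8 characters or less".

```python
import os
import re
import zipfile


def non_lowercase_names(directory):
    """Return the names in `directory` that contain upper-case letters."""
    return sorted(
        name for name in os.listdir(directory) if name != name.lower()
    )


def valid_directory_name(name):
    """True if name is 1 to 8 characters, letters and digits only."""
    return re.fullmatch(r"[A-Za-z0-9]{1,8}", name) is not None


def zip_book(source_dir, zip_path, book_dir):
    """Zip every file under source_dir so that it unzips into book_dir/."""
    if not valid_directory_name(book_dir):
        raise ValueError(f"bad directory name: {book_dir!r}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                relative = os.path.relpath(full_path, source_dir)
                # Names inside a ZIP always use forward slashes
                member = book_dir + "/" + relative.replace(os.sep, "/")
                archive.write(full_path, member)
```

For the example above, something like zip_book("work", "mthumor.zip", "mthumor") would do, where "work" is a hypothetical folder holding your HTML and images. Write the ZIP file somewhere outside that folder, so that it doesn't try to include itself.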
H.5. Can I use Javascript or other scripting languages in my HTML?
No.
We don't want our readers to have to worry about any potential for malicious or just plain buggy code.
H.6. Should I make my HTML edition all on one page, or split it into multiple linked pages?
For a typical novel, one page or HTML file is appropriate, but when that single HTML file gets up around 2 megabytes in size, it may be worth considering a split because of the difficulty of loading it in some browsers.
In some other cases, where the content requires different styles on different pages, or different pages need different character sets, or the page, with images, just gets too heavy, you may need to split the HTML even if the HTML itself isn't technically too big.
When we post a HTML eBook containing multiple files, whether they contain text or images, we post them only in zipped format, so if you don't have images, and want your text to be directly accessible, you should stick to one file where possible.
H.7. How can I check that I haven't made mistakes in coding my HTML?
There are two kinds of mistakes you can make in coding HTML: you can produce invalid HTML, or you can produce HTML that doesn't do what you want.
Checking for invalid HTML is straightforward. The W3C site <http://validator.w3.org> will formally validate your file and point out any mistakes, and this is the official standard. However, it is not always convenient to use, especially when you're in a cycle of fix-and-retest. For this, you should try the program Tidy <http://tidy.sourceforge.net>, which runs on your computer, tells you about errors, and has other useful functions as well. Tidy is available for just about every operating system, and there are several Windows utilities that include Tidy. The links on the main Tidy page will lead you to the right version for you.
Tidy is fast and friendly, compared to validation over the web, but it is not the last word. The W3C Validator may find formal errors, such as DOCTYPE mismatches with HTML tags or entities, that Tidy may not. The best solution is to complete your HTML tests using Tidy, and then, when Tidy finds nothing further to gripe about, submit it to <http://validator.w3.org> for the official seal of approval. Please run these checks before submitting your HTML; we can generally fix it for you, but it may take us a lot of work.
Producing HTML that actually does what you want is equally important. If you've converted the eBook from text, you may have created inconsistencies, or closed an italics tag in the wrong place, or used the wrong tag at some points. The only way to check this is by reading through the HTML in a browser.
H.8. Can I submit a HTML or other format of somebody else's text?
Maybe.
This question has several complications. First, you must understand that it is quite possible, even likely, that your HTML file will eventually be overwritten by better information.
The value of a HTML file, as opposed to a plain text file, lies in its ability to capture elements of the original that have been lost in the plain text. A plain text file, using extended character sets like ISO-8859 [V.76] or Unicode [V.77] and _underscores_ for italics, can capture all of the author's intent in almost all cases. Sometimes, images and other important features of the original cannot be captured in plain text alone, but can be captured in HTML, or other markup.
When Michael Hart stopped posting books, in September 2001, we had HTML formats of about 1.6% of all our eBooks. At the end of 2002, that has risen to nearly 11% of all our eBooks. If you have a clearable copy of an existing posted book, with extra features not included in the original plain text, we would encourage you to make a new edition, or version, or format, correcting any errors in the original, and adding any new information not included there.
If, on the other hand, you just want to make a "blind format change"--making your best guess at what the HTML, or other format, layout should be for a book you've never seen, based on the original producer's work--your best bet is to get in touch with the original producer, and ask whether they can supply more material for you to work with. Otherwise, you are at best just rearranging information rather than contributing something new.
A blind format conversion can be done in anything from 2 minutes [R.33] to an hour. It just doesn't make sense for us to keep posting these files when they contain nothing new, and especially when two people may want to convert the same text. It is likely that, at some time in the next couple of years, we will start on a large-scale conversion project, to add some form of markup to all of the existing text files for ease of serving, and having a mish-mash of existing markup styles to deal with at that point won't help either.
H.9. How big can the images be in a HTML file?
The images should be as big as necessary, and no bigger.
Sorry, but there is no clear number to give here. Web page designers sweat blood to save an extra 20K on a page; so should you. If you're an experienced HTML maker, you know this stuff; if you're not, take it as a guideline that you should generally aim to keep your images in the 30K to 50K size range, with occasional forays into 70-80K territory. That's generally big enough for a clear picture, unless you're reproducing fine artwork.
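If your book has a lot of images, a quick script can list the ones that have strayed beyond that guideline. This is a sketch only: the function name and the 80K default are choices made for this example, taken from the upper end of the range suggested above.

```python
import os

IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def oversized_images(directory, limit_kb=80):
    """Return (name, size in KB) for image files larger than limit_kb."""
    results = []
    for name in sorted(os.listdir(directory)):
        if name.lower().endswith(IMAGE_EXTENSIONS):
            size_kb = os.path.getsize(os.path.join(directory, name)) / 1024
            if size_kb > limit_kb:
                results.append((name, round(size_kb, 1)))
    return results
```

Anything it reports is a candidate for another look, not automatically a mistake; fine artwork may genuinely need the extra size.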
H.10. The images I've scanned are too big for inclusion in HTML. What can I do about it?
This is a common problem, where images from the book occupy a full or half page. Your images should be of an appropriate size for downloading, and 2 megabytes of high-quality scan per image is not really an appropriate size for most PG texts!
You should reduce the size, and maybe the quality, of the original scan for simple viewing purposes. There is lots of image-manipulation software to do this. For Windows, you might look at the freeware Irfanview, and for both *nix and Windows there is ImageMagick [P.1]. Look for the words "resize" and "resample" in the Help.
Apart from simple converters, which do enough for this purpose, you can also manipulate the images in full imaging creation and editing packages like Paint Shop Pro, Adobe Photoshop and The Gimp [P.1].
Different image encoding methods can make a huge difference to the filesize. Any of the packages mentioned above can encode images as GIF, JPEG or PNG, and, particularly for black and white line drawings, these can encode to very different sizes. So, for example, a 60K JPEG may save as a 30K GIF, because the GIF encoding works better for that particular image. Try your images out, and see what works.
When manipulating images, always work from your original. Don't convert your original to a JPEG, and then shrink that and convert it to a GIF. Depending on the format, images may lose definition as they are converted (search for "lossy compression" in your favorite search engine to find out more about this), and they certainly lose definition as they are resized, and you end up with the "imperfect copy of an imperfect copy of an . . ." effect. When you're experimenting, take your original, resize and Save As GIF, then go back to your original, resize and Save As JPG, and so on.
You can also use an image optimizer. These are specialist software programs that try to make image files smaller without sacrificing resolution or detail.
H.11. Can I include decorative images I've made or found?
No.
Please include only the images you got from the book. If you want to make an edition of the book for your own web site, you can of course use whatever you like there, but for PG purposes, we want the book, the whole book, and nothing but the book.
H.12. How can I make a plain text version from a HTML file?
You can edit out the HTML by hand, of course, but there are several easier ways to convert.
You can view the HTML in a browser, Select All text, and just Copy and Paste into your editor. This is easiest, but doesn't handle formatting like tables very well.
You can use the Lynx [P.1] browser to convert your text with the command
lynx -dump myfile.html > myfile.txt
Bruce Guthrie's HTMSTRIP for MS-DOS [P.1] is very configurable.
<http://www.w3.org/Tools/html2things.html> has a list of other HTML to plain text converters.
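If none of these tools is to hand, the general idea can be sketched with Python's standard html.parser module. This is an illustration, not a replacement for the converters above: the class and function names are made up, the list of block-level tags is an arbitrary selection, each paragraph comes out as one long unwrapped line, and tables and poetry are not handled at all.

```python
import re
from html.parser import HTMLParser

# Tags treated as paragraph breaks (an arbitrary selection for this sketch)
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "div",
              "blockquote", "pre", "br", "hr", "li", "tr"}
# Elements whose contents should not appear in the text at all
SKIPPED_TAGS = {"head", "script", "style"}


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, with breaks at block tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; etc. into characters
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html_text):
    """Return the text of html_text, one paragraph per line, blank-line separated."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    blocks = re.split(r"\n\s*\n", "".join(parser.chunks))
    paragraphs = [" ".join(block.split()) for block in blocks]
    return "\n\n".join(p for p in paragraphs if p) + "\n"
```

You would still need to re-wrap the result to the usual PG margin and restore _underscores_ for italics by hand.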
H.13. How can I make a HTML version from my plain text file?
This is not a course in HTML, but, for most books, you don't really need a course in HTML. Making a HTML format of most books is very easy, and doesn't take long, once you have mastered basic HTML. Let's assume you have your completed PG plain text file ready, and walk through the steps commonly needed to make a HTML version. We'll do this by successive approximation, doing the major things first, and then dealing more and more with the detail.
There are lots of specialized HTML editors out there, but you don't actually need any of them. The same editor that you used to create your text will also create your HTML. HTML is just text, with two types of special instructions added: tags and entities.
A _tag_ is an instruction to the browser, usually to display something with specific rules. Tags are shown within angled brackets: for example, <p> is the instruction to start a new paragraph.
An _entity_ is a named special character that might not be available in your character set. Entities are shown starting with an ampersand "&" and ending with a semi-colon ";" : for example, &mdash; is the representation of an em-dash.
I'm marking up a made-up short text as I write these steps, loosely based on the sample page from question [V.121]. You can see the changes made at each stage by looking at the files
htmstep0.txt (text before starting)
htmstep1.htm (after adding the HTML header and footer)
htmstep2.htm (after adding paragraph marks)
htmstep3.htm (after marking main headings)
htmstep4.htm (after adding special line breaks and indents)
htmstep5.htm (after adding italics and bold)
htmstep6.htm (after adding accents and non-ASCII characters)
htmstep7.htm (after adding an image)
htmstep8.htm (showing some extra techniques)
Before you start, make sure that you can see these files both in your browser and in your editor. In your editor, you should see the HTML codes; in your browser, you should see the text as it is intended to be viewed.
Note for people who already know HTML: yes, this example omits lots of possible ways to do things, and lots of refinements. You already know how to do what you want to do--skip onwards, and give the beginners room to learn in peace! :-)
Step 1. Add the HTML header and footer information
Add the following lines at the top of your text file:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>The Project Gutenberg eBook of My Book, by A. N. Author</title>
</head>
<body>
Let's explain these one by one:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
says that your file is HTML 4.01 Transitional, which is the latest version, allowing the widest range of tags and entities.
<html>
denotes the start of the HTML
<head>
denotes the start of the HTML header information.
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
says that the characters are text, using ISO-8859-1 encoding. If you need to use a different character set, you should change ISO-8859-1 to whatever you intend to use. ISO-8859-1 is good for lots of PG books in English that use French or German words.
<title>The Project Gutenberg eBook of My Book, by A. N. Author</title>
You should obviously change this to the actual title and author of the book you're producing. The
</head>
denotes the end of the HTML header information and
<body>
denotes the start of the actual text itself - the body of the book.
At the very end of the file, you should append these two lines
</body>
</html>
these denote the end of the body of the book, and the end of the HTML.
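If you produce a lot of books, you may prefer to let a script add the header and footer. The following Python sketch wraps a text in exactly the lines described above. The function name is hypothetical, and it assumes ISO-8859-1 is the right character set for your book.

```python
import html

HEADER = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>The Project Gutenberg eBook of {title}, by {author}</title>
</head>
<body>
"""

FOOTER = "</body>\n</html>\n"


def add_header_and_footer(text, title, author):
    """Return `text` wrapped in the Step 1 HTML header and footer."""
    header = HEADER.format(
        # Escape &, < and > so that a title such as "Pride & Prejudice" is safe
        title=html.escape(title, quote=False),
        author=html.escape(author, quote=False),
    )
    if not text.endswith("\n"):
        text += "\n"
    return header + text + FOOTER
```

The result is the same as typing the lines in by hand; the only extra the script adds is escaping of any special characters in the title and author.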
At this point, you actually have a valid HTML file! OK, if you view it with a browser, it doesn't look anything like the way it's supposed to, but it _is_ HTML. Save it with a name like MYFILE1.HTM or STEP1.HTM and get a copy of Tidy for your DOS, Unix, Mac or Windows system from <http://tidy.sourceforge.net>. Run Tidy on your file, telling it just to look for errors (tidy -e if running from a command-line; if you're using a GUI version, there should be a menu option or tickbox for showing errors only). Tidy should tell you that there are no errors. Yay!
If it does say that there are errors, deal with them now, before you continue. Make sure, at each step, that you have cleaned up any errors; it's a lot easier now than later. Also, when you've finished each step, save your file with a number in its name, so that if you run into problems later and get confused, you can, at worst, drop back to the correct version at the end of the previous step.
The most likely error you might have at this point relates to the characters "<", ">", or "&". These are the characters used by HTML to indicate tags and entities. If these characters are used in the text of your file, (and ampersand is likely to be), you should replace them with entities, so that HTML will know that they are to be displayed as characters, not interpreted as commands.
Replace
& with &amp;
< with &lt;
> with &gt;
There is an example of this in the file htmstep1.htm
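The replacement of the ampersand with &amp;amp;, the less-than sign with &amp;lt; and the greater-than sign with &amp;gt; can be done in one pass with Python's standard html.escape function. In this sketch the wrapper name is made up; note that the ampersand has to be replaced first, which html.escape takes care of.

```python
import html


def escape_for_html(text):
    """Replace &, < and > with the entities &amp;, &lt; and &gt;."""
    # quote=False leaves " and ' alone; they need no escaping in body text
    return html.escape(text, quote=False)


if __name__ == "__main__":
    print(escape_for_html("Smith & Jones <1850>"))
```

Run this on your plain text before you add any tags, and only once: running it on a file that already contains HTML would turn the tags themselves into visible text.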
Step 2. Add paragraph marks.
For novels and general prose, paragraphs are the main logical and display unit. Paragraphs are marked in HTML with the sign <p> at the start, and </p> at the end. You don't actually need the </p> at the end, but adding these is a good habit to get into. You do, very much, need the <p> at the start.
The line-lengths within a <p> </p> pair are irrelevant; the browser in which the text is viewed will ignore extra spaces and line-ends, and will wrap text to fit the screen. This is bad for poetry and tables, but we will discuss those later. For this step, all you need to know is that you can leave your text exactly as it is, and just add the paragraph marks.
Put a <p> at the start of the line before the first letter of every paragraph, and a </p> just after the last letter or punctuation of every paragraph. If you can do macros in your editor, this will just take a minute; otherwise, it may be rather boring, but at least it is simple. For this step, put the paragraph marks around _everything_ that has a blank line after it, even poetry or chapter titles. We'll come back and change that later.
Now save your text as something like MYFILE2.HTM or STEP2.HTM. Again, run Tidy to check for errors, and fix them before continuing.
If you now look at the file htmstep2.htm in your browser, you will see that it is starting to take shape. Look at it in your editor, and you will see the paragraph marks.
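If your editor doesn't do macros, the same step can be sketched as a short Python function. The name is invented for this example, and it assumes, as the step does, that every block of text followed by a blank line is to be treated as a paragraph for now.

```python
import re


def add_paragraph_marks(text):
    """Wrap every blank-line-separated block of `text` in <p> and </p>."""
    blocks = re.split(r"\n\s*\n", text.strip())
    marked = [f"<p>{block.strip()}</p>" for block in blocks if block.strip()]
    return "\n\n".join(marked) + "\n"
```

Apply it to the body of the book only, not to the header and footer lines from Step 1, and expect to come back to chapter titles and poetry in the later steps.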
Step 3. Add marks for headings.
We want to indicate to the reader that certain lines are for chapter or other headings. HTML provides the tags <h1>, <h2>, and so on for this. <h1> is for the biggest heading, and usually, you will reserve this for the title, and use <h2> for chapter headings. If you find these too big, you could choose <h2> for main headings, and <h3> for chapters. Whenever you use one of these header tags, you must close it with its equivalent end tag. So a chapter heading might look like:
<h2>