Chapter 1 of 12 · 19530 words · ~98 min read

Chapter 14

" and have irregular numbers of blank lines) or spacing and a few 8-bit characters. Occasionally, we have to rewrap a text. We also look out for included publishers' trademarks, which we normally prefer to remove (trademarks are NOT subject to copyright expiration: Macmillan(TM), the publishing house, is still around and trading), unnecessary or downright odd indentation or centering, stray page numbers, and prefaces or introductions or appendices that may not be in the public domain. If the file has lots of 8-bit characters, we probably need to make a separate 7-bit version, and post both.

If the gutcheck and spellcheck don't look clean, or if conversion is required, we may spend a lot more than 15 minutes on it. In a bad case, we may have to get the file re-proofed.

If you are conscious that you're doing something non-standard, and really mean it to stay, say so in your e-mail. (For example, I recently posted a text containing a family-tree representation that had lines over 80 characters. Now, I would have left that one alone anyway, but it helped that the submitter drew my attention to it in the e-mail.) If it's too non-standard, the poster may not allow it to stay, but at least you can discuss it. When a text needs a lot of non-standard formatting or markup, you really need to ask yourself whether you shouldn't be submitting it in HTML, with all the bells and whistles, and settle for something more normal in the text variant.

Mostly, errors are obvious, and there are at least some obvious errors in most texts. When errors are completely obvious, we just fix them without feedback to the producer unless you have specifically asked for feedback in your e-mail.

We're getting more HTML formats now, which is great, but incoming HTML often needs a lot of work, because people who are not experienced with HTML often make mistakes. The W3C <http://validator.w3.org> is the official standard for valid HTML, but, for the average volunteer, it's awkward to use. However, if you're submitting a HTML format, please use Tidy, which you can get from <http://tidy.sourceforge.net>, to check your text before sending it.

Header and Footer

We add the PG header and footer. If there is a header and footer already there, we strip them off first, since recent changes in the header mean that a lot of people send files with headers that are out of date. We have written programs to help with this.

We get the number for the text from a program on beryl called "ticket" that Brett Fishburne wrote, that dispenses the next number. That way, if two or three of us are posting at the same time, we won't all grab the same number. We create a 5-letter base filename, checking that it hasn't been used before, and finally zip up the file.

Posting

We now transfer the .ZIP and .TXT files to two servers: ftp.ibiblio.org and ftp.archive.org. (This is usually the point at which we realize that we forgot to make a change we noticed while checking. Aaaargh!)

5. Notification

At this point, the book is posted, but nobody knows about it! We need to do something about that. . . .

We compose an e-mail to the "posted" e-mail list, cc: the producer, with the line that is to go into GUTINDEX.ALL, the master list of PG files.

The "posted" list has only a few subscribers. These are the people who index and create links to PG texts, and include both PG volunteers and the maintainers of other sites that link to PG texts.

They also commonly download the texts to get more information for their indexes, and tell us if there is anything wrong with the files.

This e-mail is simply the official notification to all these people and the producer that the file has been posted. Here's a sample of such an e-mail:

To: "Posted Etexts for Project Gutenberg" <posted@listserv.unc.edu> Subject: [posted] Posted (#5301, Duncan) ! From: "Jim Tinsley" <jtinsley@pobox.com> Date: Tue, 25 Jun 2002 06:21:27 -0400 (EDT) Cc: you@example.com

Mar 2004 The Imperialist, by Sara Jeannette Duncan [SJD#4][mprlsxxx.xxx]5301

There may also be some remarks, if the text is in any way non-standard, or if files other than plain text were posted with it.

From this e-mail, you can, if you want to see any corrections made, immediately download the posted file and compare it to your version. Since the notification is made _after_ the file has been copied to the servers, it should be there waiting for you.

To find out how to download a book that has just been posted, see the FAQ "How can I download a PG text that hasn't been cataloged yet?" [R.3]

6. Indexing

From the "posted" list, the posting line is added to GUTINDEX.ALL and our indexers begin the cataloging process, which is much more thorough, for the website. This includes work like finding author's dates of birth & death, getting the Library of Congress classification, and the other information that makes up the website searchable index. That process takes extra time, which is why the website searchable catalog must always lag behind the actual titles posted.

7. Corrections

It's remarkable how many people who went over and over the text to the point of hating it suddenly see problems with it when they download it a couple of days after it's posted! Something psychological there, I expect. Anyhow, if you do download your text and see problems with it, don't worry, just e-mail whoever posted it, or any other member of the Posting Team. No, you're not stupid, or if you are, you're in good company, because we've all done it! There's no big deal about replacing the posted file with a corrected copy immediately.

Over time, other readers may submit corrections. If you find an error in a PG etext, see the FAQ "I've found some obvious typos in a Project Gutenberg text. How should I report them?" [R.26]

When the corrections are small, as most are, we will just make the change to the existing text. If there are a lot of changes, we may post a new edition [R.35] with a new edition number; e.g. if the file abcde10 was corrected, we may post abcde11. We never make a new edition when we get corrections immediately after posting.

V.17. How long must a text be to qualify for PG?

The rule of thumb is that we try not to post texts shorter than 25K, or about 350 lines of 70 characters. This rules out, for example, a lot of individual short poems. If you are interested in contributing this type of material, consider making a collection of similar texts--poems by the same author, or magazine articles on the same subject. We have made a few exceptions, like Martin Luther King's "I have a dream" speech, but very few.

V.18. What books are eligible?

A book is "eligible" for posting if we can legally publish it. This is the case if:

1. it is in the public domain in the U.S.A., OR, 2. the copyright holder has granted unlimited non-exclusive distribution rights to PG.

V.19. Are reprints or facsimiles eligible?

A reprint or facsimile of a book that would be eligible is itself eligible.

For example, if a book published in 1995 is a reprint of a book published in 1900, then it is eligible. However, the onus is on us to prove that it _is_ a reprint, and if it doesn't _say_ on the TP&V that it is a reprint, confirming its eligibility may be impractical.

V.20. What is the difference between a reprint and a facsimile?

A facsimile retains the page layout and formatting of the original. A reprint keeps the same words, but may lay the pages out differently. For our copyright purposes, there is no difference--we can use either.

V.21. What is the difference between a reprint and a "new edition"?

A reprint contains only the words and pictures that were printed in the original. A new edition is in some way changed; it has different text, or pictures. It may be abridged, or expanded. It may have material added or changed, using other versions of the book.

A new edition gets a new copyright, and has to be cleared based on its own copyright date and status, not the date of the original printing of the title. See also the FAQ "How come my paper book of Shakespeare says it's 'Copyright 1988'?" [C.16] for an example.

Please note that we are talking here about a new edition of the printed book, not a new (corrected) edition number for Project Gutenberg naming purposes.

V.22. What book should I work on?

Nobody in Gutenberg is going to set assignments for you. You decide what book to process. Just pick one that no-one else has already done, or is working on. It's also sensible to pick one that you'll like--you'll be living with it for a while. On a practical note, it's probably better to start with a short book or even a short story, since a long book can take quite a while to produce.

Start by thinking of books written before 1923. Pick a book you like, and check it out. If it's already done or still in copyright, try other books by the same author.

Visit the Project Gutenberg site and download a full list of Gutenberg books in GUTINDEX.ALL. Have a look at the List of Books In Progress and Complete [B.1]. Look for authors you like, and see what books by them aren't yet available.

Check out your old books. Maybe you have an eligible edition that would be of great help to the project.

Try your library. They may have some eligible editions--books we can prove to be in the public domain--and you will certainly come away with ideas. Ask your librarian. Librarians are keen to help on projects like this.

Browse second-hand bookshops in your area. There are lots of treasures to be picked up very cheaply.

Search for literature pages and bookshops on the Internet.

If all else fails, you can always ask on the Volunteers' Board or try the gutvol-d mailing [V.12] list for ideas. Others may know of books that people are especially looking for, or projects already started where you could help out.

V.23. I have a book in mind, but I don't have an eligible copy.

First, determine whether there are any eligible copies of the book, by finding out the date it was published, possibly from the Catalog of the Library of Congress [B.5] and checking the Public Domain and Copyright Rules [B.1]. If there is a public domain edition, the next problem is to find one to work with.

V.24. Where can I find an eligible book?

The most commonly used outlets are used bookstores, garage sales, library sales, charity shops and any other place that sells old books.

The Internet is a wonderful medium for finding used and antiquarian books--used bookstores all over the world have found ways of co-operating and listing their inventories on the Net, so that whether you live in Los Angeles, Moscow or Perth, you can still find that book you're looking for in a shop in a laneway of Amsterdam. Most on-line listings will quote the publication year of the book, so you can check that it's pre-1923.

Two such sites that allow second-hand booksellers to list their inventory are:

Advanced Book Exchange <http://www.abebooks.com>

Alibris <http://www.alibris.com>

The book search page at trussel.com [B.5] has a list of many such Net bookshops, or you can simply visit any search engine and search for Used or Antiquarian Bookshops. You can often buy eligible books through these sites very cheaply.

If you still can't find the book you need, post a message on the Volunteers' Board or to the gutvol-d mailing list; maybe someone else can find it for you.

Sometimes, it may be possible for you to work from a later edition, so long as somebody who has an eligible edition can check it to make sure that no changes have been made. Sometimes, you may be able to find a modern reprint; reprints may be eligible, as long as they say they are reprints of an edition that would be eligible.

If you can type, or can scan without damaging the book, you can borrow books long enough to produce them. Even if your local library doesn't have the books you want, they may well be able to get them for you on inter-library loan. Ask your librarian about it.

V.25. What is "TP&V"?

This is an abbreviation for "Title Page and Verso", and means a paper or image copy of the front and back of the title page.

Even if the back is blank, we need to have an image of it for the files, to show that it _is_ blank, so that if, in ten years' time, somebody queries our right to publish, we can show that we haven't just lost it.

Publishers print copyright information, like title, author, copyright year and owner, and whether the book was a reprint, on the TP&V, and by filing this, we can prove that the book we produced was in the public domain.

Sending us the TP&V is the One True Way to getting PG copyright clearance [V.37].

V.26. What is "Posting"?

Posting is the final stage in the production process, where the file is given a number and official PG header, and copied onto our FTP servers for distribution. See section 4 of the FAQ "How does a text get produced?" [V.16] for a blow-by-blow account.

V.27. I think I've found an eligible book that I'd like to work on. What do I do next?

Make sure nobody else is working on it, and that it's not already online somewhere.

V.28. What books are currently being worked on?

Check out David Price's In Progress List (a.k.a. "the InProg List") online at <http://www.dprice48.freeserve.co.uk/GutIP.html>. David gets the information from Copyright Clearances that have been done, and organizes it into a list. It can never be 100% up to date, since clearances come in all the time, but it's the best online facility we have, and it's much more clearly presented than the original clearance files.

V.29. How do I find out if my book is already on-line somewhere?

There's no foolproof method; some student somewhere could have scanned it and put it on her college web page without announcing it anywhere. However, there are some regular places to check.

It may sound obvious, but you should always look in the PG archives first. Download GUTINDEX.ALL and keep it handy. Search the InProg List [B.1].

The two other main places to search for your book are the Internet Public Library <http://www.ipl.org> and the On-Line Books Page <http://onlinebooks.library.upenn.edu/>. These projects specialize in indexing books that people make available on-line.

If you still don't see your book on-line anywhere, hit your favorite search engine, and give it the title, author's last name, and preferably a few uncommon words from the first page of the book. Sometimes one of those solo efforts shows up in a general search.

V.30. My book is not on the In-Progress list, and I can't find it on-line. Is it safe to go ahead and buy it?

Probably. It could have been cleared, but not included in the InProg list yet. If the amount of money to buy it is a consideration, you can e-mail any of the members of the Posting Team, and ask them to check the latest clearances for you. Even this isn't foolproof; another volunteer could be placing their order at the same time you're placing yours. Such duplications do happen, but they are very rare.

V.31. My book is on-line, but not in Project Gutenberg. What should I do?

If the on-line file is from the same edition as the one you have (e.g. not a different translation) then you may be able to submit that file, perhaps slightly edited, to Gutenberg using the clearance from your paper copy. See "I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Can I just submit it to PG?" [V.62] for how to do that.

And of course, you can always still make your own version for PG. It's surprising how often even very similar paper editions have small differences that can be interesting or significant.

V.32. My book is already on-line in Project Gutenberg, but my printed book is different from the version already archived. Can I add my version?

Yes! In fact, assuming that the version already there is in the public domain, you can piggyback on the work already done by what is called "comparative retyping". For example, let's say that you have a later edition than the existing file; you can just take the existing file, edit it to match your paper version, and submit it as a new file. Of course, you must have Copyright Cleared [V.37] your paper version as well.

V.33. I see a book that was being worked on three years ago. Is anyone still working on it?

Maybe, maybe not. Some people abandon books, some people who are regular producers clear them and put them at the bottom of the pile, perhaps for years (though they will get round to them sometime), and some people just simply take two or three years to produce a book.

Once, we put names and contact details on the public InProg list, but for privacy and spam-prevention reasons, we've taken them off. However, the Posting Team have access to the master list of cleared files, and will send a message on your behalf to the person who originally cleared the book, asking if the project is still active, or if the producer wants help.

So if you really want to check this situation out, e-mail one of the Posting Team.

V.34. I've decided which book to produce. How do I tell PG I'm working on it?

As soon as you get Copyright Clearance [V.37], your book is entered in the "cleared" files. David Price will take these, and add your entry in his next release of the In Progress List.

V.35. I have a two- or three-volume set. Should I submit them as one text, or one text for each volume?

Both.

Quite a lot of 18th and 19th Century books, even straightforward novels, were published as multipart sets. When you have such a set, you should usually submit one text for each volume, and a "complete" text with the contents of all volumes together.

People who do this often complete and submit one volume at a time, until they've finished, and then contribute the "complete" file.

V.36. I have one physical book, with multiple works in it (like a collection of plays). Should I submit each text separately?

If the works are clearly separate, stand-alone texts, and are long enough [V.17] to warrant inclusion on their own in the archives, then yes, you should, and you _may_ also submit a "complete" version as well, if it seems appropriate. This most commonly happens in a collection of plays, though essays and other works may also fit the criteria. Collections of poetry rarely do, since most poems are too short to submit as stand-alone texts.

Sometimes the book includes a preface or introduction or glossary covering all the works in it. In this case, you can decide whether to include these with each of the parts, or save them for the "complete" version.

V.37. How do I get copyright clearance?

Basically we need to see images of the front and back of the title page of the book, which is where copyright information is usually shown. This is called "TP&V", for "Title Page and Verso" [V.25].

To Submit Online:

As of late 2002, we have a new automated upload procedure using a web page. This is by far the fastest and easiest way to get clearance. You need scanned images (PNG, JPEG, TIFF, GIF), of the two pages, of good enough resolution that the text can be read clearly, though the files don't need to be huge.

Just go to <http://beryl.ils.unc.edu/copy.html> and follow the instructions.

There are two other, older ways to submit a text for clearance.

To submit by paper mail, photocopy the front and back of the title page, even if the back is blank, write your e-mail address on it, and send the photocopies to:

MICHAEL STERN HART 405 WEST ELM STREET URBANA, IL 61801-3231 USA

This is called Title Page & Verso, or TP&V for short, and is needed for copyright research. A colored envelope is best, to make sure your letter is easily recognized as TP&V.

E-mail Michael hart@pobox.com when you send them, so he knows they're on the way. It's a good idea to check back with him by e-mail after a week or so if you haven't heard from him.

About this, Michael says: "Please include always your e-mail name and address, and mark the envelope with some distinctive mark and or color. Colored envelopes fine. Just something so I can find it easily, the mail here is slow and deep, like snow. Please send a note to: <hart@pobox.com> for more info."

To submit by e-mail, scan the front and back of the title page, even if the back is blank, and e-mail the images to Greg Newby <gbnewby@ils.unc.edu> as TIFF, JPEG or GIF in medium resolution. Make sure that the print is legible before you send.

Whichever method you use, you should expect to get an e-mail back after about a week, with one line containing the Author, Title, your name and date with the word "OK" at the end. This means that your text has been cleared.

A Clearance Line looks something like:

The Works Of Homer [Iliad/Odyssey] Tr. George Chapman Jim Tinsley 06/14/01 ok

If you don't get any response, e-mail to check that your TP&V was received OK. If the word at the end of the line is not "OK", then your text is not eligible, and a comment will probably be appended explaining why it is not eligible.

Don't start work on your book until you get that OK! It's very sickening to do all that work, and then find out that your text can't legally be put on-line!

V.38. I have a two- or three-volume set. Do I have to get a separate clearance on each physical book?

Yes.

Some multi-volume works, notably reference books and translations, were published in a series, and it may be that the first volume is 1922, but the others are 1923 or later, so we have to clear each individually.

V.39. I have one physical book, with multiple works in it (like a collection of plays). Do I have to get a separate clearance for each work?

No. Since they were all printed together, one TP&V will suffice for all, but . . .

You should list each separate title included, if you intend to submit each title separately (see the FAQ "I have one physical book, with multiple works in it like a collection of plays. Should I submit each work separately?" [V.36]). If, say, you clear a "Collected Plays of Sheridan", and later submit an eBook of "The School for Scandal", we will have trouble finding your clearance unless we have made a note that "School for Scandal" is part of the contents of "Collected Plays".

In a case like this, you should include, on your paper or e-mail, something like:

George Bernard Shaw. Plays Unpleasant. 1905. Contents: Preface to Unpleasant Plays Widower's Houses The Philanderer Mrs. Warren's Profession

You only need to do this when you are going to submit each part separately, which is commonly the case with plays, and sometimes essays, stories and novellas. Taking a different example, the "Collected Poems of Emily Dickinson", we would not need to list the contents, since we wouldn't publish each poem separately.

There is one exceptional case: if your book was printed after 1923, but contains stories or plays some of which are stated to be reprints of pre-1923 editions, you should give as much detail as possible about what you intend to submit.

V.40. Who will check up on my progress? When?

Nobody. There are no schedules or timetables. You're welcome to contact other volunteers [V.12] with comments or questions, though.

V.41. How long should it take me to complete a book?

Most books get done in between one and three months, but this varies wildly. It depends on the amount of time you can afford to give it, the length of the book and, if you're not typing, the quality of the scan--if the book scans badly, you need to put more time into proofing.

Some very productive volunteers manage to turn out an e-text a week; some books can take a year or more.

Scanning itself doesn't take too long. Even if it takes you as much as two minutes per page to scan, you will still complete a 300 page book in 10 hours, and you will probably be scanning much faster than that [S.9]. The problem is that the text generated by the scanner and your OCR package is usually faulty. There are many cute scanner errors, mistaking b for h, or e for c, so that "heard" is scanned as "beard" or "ear" as "car". Makes the story more interesting sometimes!

So now you need to do a first proof of the e-text. Read it carefully, correct scanning mistakes, and make sure that you haven't left out pages or got them in the wrong order. Unless your scan was exceptionally good, this is the time-burner in the process.

When you've done the first proof, you can either do a second proof yourself, or send it to another volunteer for second proofing.

If you're a typist, of course, you can skip right over the messy scanning and scan-correction process. Yay typists!!

V.42. I want/don't want my name published on my e-text

No problem. When you send the e-text for posting, mention exactly what, if anything, you want the Credits Line [V.47] to say.

V.43. I'd like to put a copy of my finished e-text, or another Gutenberg text, on my own web page.

Great! PG encourages the widest possible distribution of e-texts. We like to publish everything in plain text, which is the most accessible format, since everybody can read plain text. But once it's available in plain text, it's open to you or anyone else to convert it to other formats like HTML for further distribution.

If you are reposting a text, though, please be careful to check that your posting complies with the conditions spelled out in the header, especially for copyrighted works.

V.44. I've scanned, edited and proofed my text. How do I find someone to second-proof it?

You can post a request on the Volunteers' Board, or on the gutvol-d Mailing List. You will probably get some offers there. In a difficult case, you might ask Michael Hart to add it to the "Requests for Assistance" section of the next Newsletter.

In general, the best way to handle it is to make a co-operative proofing project out of it. This is like a miniature version of the distributed proofreading sites, without the page images.

There are always people looking for proofing work, but many beginners take on more than they can handle, and don't finish the job, and this can be very disappointing if you give the whole thing to one volunteer who then vanishes without trace. You can minimize the risk of this by splitting the book into chunks of about 20-30 pages, or one chapter if that's around the right size, each. Write explicit instructions about what you want them to do when they spot a suspected error, like fix it or mark it with an asterisk. (Marking is probably safer with beginners who don't have the book or an image of the page to refer to.) Give the first chapter to the first person who responds, the second to the second, and so on. As you hand out the chapters, let the proofers know that if they're not returned within three or five days, you'll assume they've quit. Three days is more than plenty of time for 20 pages. If someone returns a chapter, you can give them another. If someone doesn't get back to you within the time set, assume they're not going to, and recycle that chapter to someone else. No hard feelings, no problem. This process of "co-operative proofing" ensures that beginning proofers don't choke on the work, and that one vanishing volunteer doesn't hold up the whole project.

V.45. I've gone over and over my text. I can't find any more errors, and I'm sick of looking at it. What should I do now?

We all know that feeling! Particularly with your first book, you've probably gone through a patch when you thought you'd never finish--and when you do, you can't stand the idea of looking at it again. Heh. Cheer up--the first twenty texts are the worst! :-) And you'll feel a lot better when you see your text available for everyone to read.

You have three choices:

You can send it for posting as it is. [V.46]

You can put it aside for week or so, and come back to it with fresh eyes.

You can ask in any of the standard ways [V.12] for someone else to second-proof it for you. This has a lot to recommend it; it gets other sets of eyes looking at the text, it relieves the pressure that you may feel, it may rekindle your enthusiasm for the text, it allows you to "meet" other volunteers, and possibly form partnerships for future PG collaboration. Above all, it gives new proofers a chance to get their feet wet, and this is good for them, and good for PG. You are not only contributing a text, you're helping to train and encourage the next generation of producers.

V.46. Where and how can I send my text for posting?

As of late 2002, we have a new automated upload procedure using a web page. This has a lot of good things going for it, because we keep a record of what's uploaded, you get an e-mailed copy of the notification, you don't have to fiddle with FTP, and we can make up the header automatically from the information you enter, which saves time and prevents keying errors.

As always, it's better to ZIP your file first, because it'll take less time to transfer.

Just go to <http://beryl.ils.unc.edu/cgi-bin/upload>, fill in the form, specify the file to upload, and hit "Send" at the bottom.

And you're done!

If, for some reason, you can't use this page, there are two backup options: you can e-mail it, or you can upload it by FTP. Whichever you use, it is always best to ZIP the file first if you can.

If you are comfortable with sending files by FTP, this is better than e-mail, First, you will need a username and password, which you can get by e-mailing any of the Posting Team.

If you already know how to use command-line FTP, here's how to do it:

Log in to beryl.ils.unc.edu using the username and password supplied and change to the work directory by typing "cd work". Change to binary mode with the "bin" command and "put" your file.

Summary instructions: ftp beryl.ils.unc.edu login: yourlogin password: yourpassword cd work bin put yourfile.ext quit

Here is a sample session:

>ftp beryl.ils.unc.edu Connected to beryl.ils.unc.edu. 220-Access from unknown@127.0.0.1 logged. 220 FTP Server User (beryl.ils.unc.edu:(none)): xxxxxxxx 331 Password required for xxxxxxxx. Password: xxxxxxxx 230 User xxxxxxxx logged in. ftp> cd work 250 CWD command successful. ftp> bin 200 Type set to I. ftp> put MYFILE.ZIP 200 PORT command successful. 150 Opening BINARY mode data connection for MYFILE.ZIP. 226 Transfer complete. ftp: 172313 bytes sent in 17.34Seconds 9.94Kbytes/sec. ftp> quit

When you are in the work directory, you will not be able to list files, but they _do_ exist and they _are_ there.

When you have uploaded your file, e-mail a note to any or all of the Posting Team, including your 1. filename 2. credits line as you want it on your text 3. clearance line you received [V.37]

An ideal note might be:

Subject: Beryl upload for posting: Hamlet

I have uploaded to beryl: Hamlet, by William Shakespeare

File is: hamlet.zip

Credits line is: Produced by John Doe <jdoe@example.com>

Clearance was given as: Hamlet William Shakespeare John Doe 05/03/02 ok

If you'd rather send it by e-mail, send the e-mail, including the Credits Line and Clearance Line as in the sample above, to any or all of the Posting Team, with your text as an attachment. Again, ZIPped is better, since it avoids certain damage that can happen to a plain text e-mail along the way.

Do not add the Project Gutenberg header or footer to your file, unless we specifically asked you to. If you do add it, we'll just have to strip it off again, since we add headers automatically when posting. There are times, perhaps when you're working in an unusual non-editable format, when we may give you a header and ask you to add it, but this is rare.

Please read section "4: Posting" of the FAQ "How does a text get produced?" [V.16] for more detail about what happens in posting. Especially, if you want to draw some peculiarities of this text to the Posting Team's attention, or want feedback on any minor edits done during posting, you should say so in the e-mail you send.

_Don't assume that we know anything_ when you send the e-mail. We don't know what you want us to put on the Credits Line. We don't know that this is an unusual text, and needs some kind of special reformatting. We don't know that the text should be split into two volumes before posting. We don't know that you would really like us to check it closely before posting. You have to tell us, exactly and precisely, what you want on the Credits Line. If the text needs some specific work, you have to tell us exactly what that is. And please do that in your e-mail, not in the text itself. Remember that we could be dealing with five or ten other texts at the same time, and even if the poster you discussed it with two weeks ago is the same one who posts the book, he may not remember.

V.47. What is the "Credits Line"?

The Credits line is a line that the Posting Team can insert into each PG text naming the producer or producers of a particular text.

You should decide what you want on the credits line of your text; it's really not up to us.

Most credits lines are something like:

Produced by John Doe <jdoe@example.com>.

If you don't want to be mentioned by name at all, just say, in your e-mail:

Please omit the Credits Line for this text. I want to contribute it anonymously.

If you do want to be mentioned, please give the exact wording you want us to use. Some people want their name only; they don't want us to include their e-mail addresses. Others want to make their e-mail addresses public so that readers can contact them with comments. That is entirely up to you, but you do need to tell us. If you do want to include your e-mail, remember that having it permanently on the net is a spam-magnet, and we can't effectively remove or change it later.

Occasionally, a Credits Line can spill onto more than one line, for example:

This text was converted to HTML by Jane Roe <jroe@example.com> from an original ASCII text scanned by Jack Went and proofed by Jill Hill

V.48. How soon after I send it will my text be posted?

First read the "Posting" section of the FAQ "How does a book get produced?" [V.16] to understand the process.

You should expect some response within three or four days. We try to get to all submissions within that time. In most cases, that response will be simply the official notification that it has been posted. If there is a query on your text, for example if we can't find the copyright clearance or if we have trouble converting or correcting your text, we will probably e-mail you back directly with questions.

If you don't hear from us within four days, send a follow-up e-mail; it could be that your original note never got to us, or just fell through the cracks.

If your file happens to arrive while one of us is logged in and working, it could get posted within the hour. Some frequent contributors who know our habits know just how to time their uploads!

V.49. I found a problem with my posted text. What do I do?

Most postings go smoothly, but problems can happen. Sometimes, one of the servers is down. Sometimes a file gets corrupted for some unknown reason. Sometimes, let's face it, we screw up.

Usually, one of the indexers will tell us about it, but if you catch it first, e-mail whoever sent out your notification e-mail and explain the problem. Don't worry; your original file will be quite safe, since we keep these long after posting them.

V.50. Someone has e-mailed me about my posted text, pointing out errors.

Great!

Since you're the original producer, you're in the best position to decide whether these are real errors. If they're right about it, tell the Posting Team and we'll correct the text.

V.51. Someone has e-mailed me about my posted text, thanking me.

Nice feeling, isn't it? :-)

About Proofing

V.52. What role does proofing play in Project Gutenberg?

A very big one!

Typists' work doesn't usually need many corrections, but unfortunately, scanners and OCR packages are far from perfect, and scanned text varies from "almost-right" down to "maybe I should consider typing instead of scanning". Proofing is the process that turns a scan into a readable e-text.

Proofing a typist's work is straightforward; you just read it, and keep an eye out for mistakes. Typists typically have few mistakes in their texts, but the errors that they do make tend to be hard to spot. Proofing OCRed text has its quirks, and you can expect many, many errors to correct.

The only thing that all proofers agree on is to differ in their methods. Some people scan and almost complete the proofing process within their OCR package, others do no editing at all within their OCR. Some spell-check first, others spell-check last. Some work through in one pass, doggedly line by line, others make several light passes. Some start at the end and work backwards! Some proofers mark all queries with special characters like asterisks (*) in the text, most just make all the obvious changes and mark only the dubious ones. Some people always send their texts out for proofing, others prefer to do it all themselves.

So this guide is not prescriptive; this is not the "only way" to do it. The only rule is that, at the end of the process, your e-text should be as error-free as you can make it, and should conform to Gutenberg's editing standards, which are mostly just common sense guidelines to make readable text.

The aim of this FAQ is to give you an understanding of what text looks like when it comes fresh off the scanner, and an overview of the whole process by which it becomes a publishable e-text.

V.53. What is Distributed Proofing?

It has always been common for volunteers to share proofing work among themselves--you take the first five chapters, I'll take the next, and so on.

When you're just starting as a PG volunteer, you should go to one of the Distributed Proofing sites [B.4] and do some work there to get a grounding in the basics and a feel for whether you would like to continue working in PG. In distributed proofing, you get a very short section, as little as a page of text at a time, and usually an image file of the page as it scanned. You then make the text match the image. This is a great start, since all you have to do is read, compare and correct. However, other work also needs to be done, and will normally be done by the project managers of these sites. The samples below give you an idea of the whole process, and also some ideas of what proofing a whole book from start to finish is like.

V.54. What do I need to proof an e-text?

You actually need only two things: the e-text itself and a text editor or word-processor that can handle book-sized files and save them as text.

Nearly all word processors and text editors in current use will work. Volunteers use many common programs, including WordPerfect, Microsoft Word, WordPad, DOS EDIT, vi, Brief, Crisp, EditPad, MetaPad, emacs, AbiWord, and the word processors from Open Office abd AppleWorks. And all of these are in actual use by volunteers today. Since all of them contain the necessary basic functions, the best program is the one you're most comfortable with.

Be cautious with recent, powerful word-processors that "auto-correct" text, or use "smart quotes" or any other such automatic retyping or formatting feature, since they can Do Bad Things to your e-text without your consent! When using any such package, it is best to switch off any feature that makes changes without asking you.

Two utilities which may come in useful are a spell-checker and a version difference checker. These may be built into your word processor, or you may have them as separate packages.

A spell-checker is like a chain-saw: a powerful tool, but one to be used very carefully. It is very easy to say "Yes" to the wrong change, and make a really bad mess of the text. Spell-checkers have problems with proper names, foreign words, archaic usages, and dialects. Incautious use can leave you with a text such as that immortalized in the

Owed two a Spell in Chequer.

Eye half a spell in chequer, It cane with my Pea Sea. It plane lee marques four my revue Miss steaks eye can knot sea.

Every e-text should pass through a spell-checker at some point, but the human half of the partnership needs a very light hand on the confirmations of change!

A difference checker, such as FC or COMP for MS-DOS, diff for Unix or ExamDiff <http://www.prestosoft.com/examdiff/examdiff.htm> for Windows, may also come in handy. A difference checker compares two versions of the text, and points out the changes. This is important when you've sent a text out for proofing, and you get it back with changes. Rather than re-reading the whole text, you can use a difference checker to highlight the changes so that you can verify them against the printed text. As a proofer, you can use it to compare the original text with what you're sending back to ensure that you've only changed what you meant to change.

V.55. Do I need to have a paper copy of the book I'm proofing?

No.

Your job as proofer is to ensure that the e-text you're working on is readable in itself, and contains no obvious errors. Where you think there might be an error, but you're not sure, you mark the spot in the e-text, and let the volunteer who has the paper book look it up.

V.56. What's the difference between "first proof" and "second proof"?

These are fuzzy terms used to indicate how accurate the e-text is, and what type of work is needed to improve it. Quite commonly, the same volunteer who scans the book proofs the whole thing in one or two passes. Sometimes, given a good scan, the text can be sent out for "first proof" with little or no preparatory fixing-up. Often, the scanner makes quite a lot of corrections, then sends the text out for "second proof".

A text is ready for first proofing when it's obvious that there are plenty of errors, but it's possible to figure out, in almost every case, what the correct text should be without needing to refer to the book.

The objective of first proofing is to eliminate all the obvious errors, so that if you speed-read quickly through the text, you probably won't notice any.

Second proofing involves taking a text that has been first-proofed and correcting all the remaining, more subtle errors. Often, some simple errors such as incorrect spacing and quotes may be left for second proofing. Texts that have been typed instead of scanned will always be of at least second-proof quality.

V.57. What do I do with an e-text sent to me for proofing?

First, establish reasonable expectations. A typical book takes 10-15 hours of concentrated effort, and when you first start, you're climbing a learning curve. For your first session, decide to mark out a chapter or two--something like 500 to 1,000 lines--and work only on that. If you get through 1,000 lines in your first sitting, you have done extremely well! It's a good idea to send this first 1,000 lines or so back immediately. The volunteer who sent you the e-text will comment on it, and let you know about any style guidelines you may have breached or common errors you may have missed. Most beginning proofers do make mistakes, so don't worry about it--it's easier to correct these in 1,000 lines than to go back over them in 15,000 lines!

You will usually receive the e-text as an attachment to your e-mail. It's better to send e-texts as attachments than to paste them as text into the body of the e-mail to make sure that the text isn't changed by different e-mail clients. It's better to send e-mailed attachments as ZIP files [R.20], since e-mails sent as text can be damaged along the way. But whether you receive a TXT file or a ZIP file that you have to open, you should save the .TXT file to your hard disk and open it with your editor.

It may be that the text you see appears double-spaced--every second line is blank--or that all the text is on one incredibly long line. This is a familiar effect when moving between a DOS/Windows computer and a Mac or Unix system, but it can happen between any two editors. It is caused by the use of different characters to mark the end of a line. If you have this problem, ask whoever sent you the text to re-send it, telling them what kind of computer and editor you have.

Now you make any changes that obviously need to be made, and mark any places where the text looks wrong, but you're not sure what the right text should be. You can usually use asterisks (*) to mark these dubious spots, but you might use other characters if the text already contains asterisks. When in doubt, mark them all, and let the volunteer with the text sort them out!

It is usually best not to make global changes to line lengths by reformatting lots of paragraphs, since the person who sent you the e-text may want to use a difference checker when you return it, and changed line-lengths throughout mean that every line will be different.

When working on a long text, or when making a lot of changes, it may be wise to save several versions of the text with different filenames at different stages so that if something goes badly wrong, you can revert to the last good version. This applies especially to saving the text just before performing a spell-check.

When you're finished with the e-text, make sure you save it as a plain text file (.TXT) and send it back by zipping it if you can, and attaching it to an e-mail.

V.58. What kinds of errors will I have to correct?

Each text has its own peculiarities, but there are a number of well-known scanning errors you will be dealing with all the time.

Punctuation is always a problem. Periods, commas and semi-colons are often confused, as are colons and semi-colons. There are also usually a number of extra or missing spaces in the e-text.

The problem of quotes can assume nightmarish proportions in a text which contains a lot of dialog, particularly when single and double quotes are nested.

The numeral 1, the lower-case letter l, the exclamation mark ! and the capital I are routinely confused, and often, single or double quotes may be mistaken for one of these.

Lower-case m is often mistaken for rn or ni.

The letters h and b and e and c are commonly mis-read, and these are probably the hardest of all to catch, since ear/car, eat/cat, he/be, hear/bear, heard/beard are all common words which no spell-checker will flag as problems.

For example:

" Hello1' caIled jirnmy breczily. 11Anyone home ? "

There seemed to he no-oneabout. Only tbe eat beard him."

should read:

"Hello!" called Jimmy breezily, "Anyone home?"

There seemed to be no-one about. Only the cat heard him.

As well as scanner errors, which affect one letter at a time, you have to keep an eye out for editing mistakes by the volunteer who scanned the text or by previous proofers. These are typically cases where a whole line, paragraph or page has been omitted or misplaced. They show up as sentences that don't make sense, or paragraphs that don't follow from the previous one.

This means that you have to keep reading the flow of the text, so that you can spot context errors as well as typos.

V.59. How long does it take to proof an e-text?

This depends on how long the e-text is, how clean the text is when you start, and how thorough you're being, as well as how much time per day you can give it and how fast you can proof.

On a first proof, it can take a very long time to get the e-text to a readable condition if it scanned badly. As a beginner, you would be unlikely to be given such a difficult text to work with. First proofs are usually done by the same person who did the scanning, and are only given out in the context of established scanning/proofing teams.

You might expect to proof anywhere between 500 and 2,000 lines per hour during a second proof. A short novel or novella might have as few as 6,000 or 7,000 lines; War and Peace weighs in at about 54,000 lines. Most novels run to 10,000 to 15,000 lines. So you might spend anything between 5 and 30 hours second-proofing a standard book, with 10 to 15 hours being typical.

For an average novel, a week or two for second proofing is good going. A month is reasonable.

Proofing an e-text is a significant amount of work, and you may find it psychologically more comfortable to take on a chunk at a time--say 1,000 lines per session--and send that proofed section back, rather than wait until the whole job is done before sending anything back. This helps to avoid the fairly common case where you keep falling behind where you expect to be until you dread the thought of getting back to the text, and finally just abandon it.

If you find after a while that you just don't want to continue, please tell the person who sent you the text that you're not going ahead with it. It's very frustrating for the volunteer who scanned the book, and who wants to get it posted, to wait for two or three months, only to have to start all over again with another proofer.

V.60. Are there any special techniques for proofing?

The classic way to proof is to open the text in your editor or word processor, and just start reading carefully.

This method has received a major boost since editors and word processors have added a feature of showing squiggly red underlines under words not in their dictionary. While this is very useful, you still need to read carefully, since not all errors produce misspelled words. The classic, and very common, example of this is scanning "he" for "be". These visual spellchecks also commonly do not check words beginning with capitals. Capitalized words are commonly names not in the dictionary, and when checking of capitalized words is switched off, they will not query "Tbe". Other errors that a spellchecker doesn't look for include missing spaces, mismatched quotes and misplaced punctuation. For these, you can try gutcheck [P.1]. And of course, no automatic check will find omitted lines or words. Worse, spellcheckers will query words not in their dictionary that might be quite correct, and this can be quite troublesome when dealing with older texts or dialect.

Still, if your concentration is up to the job, scrolling through a text with non-dictionary words underlined in red is a fast and effective way of giving a text the final once-over.

Volunteers have also used other techniques for proofing. Some people can't sit at their screen and read for hours; many people don't want to.

Some people just use the good old-fashioned method of printing out the text to be proofed, and blue-pencilling the mistakes.

It is becoming fairly common now for people to load the text onto their PDA, and read it from that. Mistakes found can be bookmarked or jotted down and fixed when they go back to their PC.

Getting your computer to read the text aloud is a very effective way of achieving high accuracy. Modern PCs have audio capabilities built in, and it is possible to find free or cheap shareware "read-aloud" text-to-speech packages for just about everything. Some PDAs are also capable of doing text-to-speech.

The first time you try text-to-speech, it will probably sound and feel a little strange, but you will quickly learn to _hear_ errors in words. This can be very effective, but you should have given the text at least a light proofing before you begin; it is hard to deal with a high number of errors using a text-to-speech method.

When proofing by a speech program, you either set your text-to-speech program to pronounce all punctuation, or, if that is not possible, you make a special version of your text to feed it, first doing a global replace of "," with " comma ", ";" with " semi-colon ", and so on. Mark a block of 500 to 1,000 lines for reading aloud, and set the reading speed to whatever is comfortable for you. Then you sit down with the original book in front of you, and listen. When you hear an error, mark the place in the text with a light pencil. Stopping the reading at every error, editing the text and restarting is possible, but it breaks the flow, and ends up taking longer. When the reading is done, go to your keyboard and correct the errors found.

V.61. What actually happens during a proof?

Stage One--The original Scan

We start with a scanned e-text, in this case a paragraph from The Odyssey. The paragraph used as an example here has been "enhanced" with more errors than in the real scanned text, so that you can see samples of many problems all in one place.

We begin by looking at the original OCRed text, of which our sample section reads:

1There Periniedes and Eurylochus held the victims, but l drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink- offering to all the dead, first with mead and thereafter with sweet wine, and for the third time with water, And 1 sprink-

## BOOK XL

ODYSSEY X, 24-56. 173

ODYSS.EY XI, %4-56. 173 lef white incal thereon, and entreated with many prayers strengthless beads of the dead, and prornised that on my return to Ithaea 1 would offer in my halls a barren heifer, the best 1 had, and fil the pyre with treasure, and apart unto Teiresias alone sacrifice a black rarn without spot, the fairest of my flock. But when 1 bad hesought the tribes of the d with vows and prayers, 1 took the sheep and cut their s over the trench. and the dark blood flowed forth, he spirits of the dead that he departed gathered from out of Erebus.

It's clear that we should tidy up the page headings and numbers that have been scanned in with the main text, and that we should separate the paragraphs and remove the spaces inserted by the scan at the start of some lines. We also need to restore some of the text that got lost in the scan. Since there isn't much of it, we just type it in. Having done this, we get to . . .

Stage Two--First pass through the scanned text

At this point, we have a complete text. All of the words are actually there, and we have eliminated page breaks and other extraneous artifacts of proofing. Again, mileage varies: some people like to preserve page breaks and numbering until much later, to make it easy to refer back from the e-text to the book.

Our job in this phase is to fix all of the obvious scanning errors and double-check that we really do have all the text. Our aim here is to create an e-text that is ready for First Proof. In fact, since it's fairly clear what all the words are, this text could be considered ready for first proof.

1There Periniedes and Eurylochus held the victims, but l drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink- offering to all the dead, first with mead and there after with sweet wine, and for the third time with water. And 1 sprink- led white incal thereon, and entreated with many prayers the strengthless beads of the dead, and prornised that on my return to Ithaea 1 would offer in my halls a barren heifer, the best 1 had, and fill the pyre with treasure, and apart unto Teiresias alone sacrifice a black rarn without spot, the fairest of my flock. But when 1 bad besought the tribes of the dead with vows and prayers, 1 took the sheep and cut their throats over the trench. and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.

Now we convert those numeral 1s to capital Is and to quotes, where appropriate, we straighten up the quotes and we deal with other obvious scanning errors, which brings us to . . .

Stage Three--The First Proof

At this point, we could hand over the text to an experienced proofer who doesn't have a copy of the book. This would be called a "first proof". An e-text is at first proof stage when there are still plenty of errors, but in each case it's pretty obvious what the correct word is. The excerpt now looks like normal text.

Unfortunately, in stage two above, we accidentally deleted a line.

'There Periniedes and Eurylochus held the victims, but l drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink- offering to all the dead, first with mead and there after with sweet wine, and for the third time with water. And I sprink- led white incal thereon, and entreated with many prayers the strengthless beads of the dead, and prornised that on my return to Ithaea I would offer in my halls a barren heifer, Teiresias alone sacrifice a black rarn without spot, the fairest of my flock. But when I bad besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.

Stage Four--Corrections from First Proof

We receive the first proof back from the proofer, and find that it has been mostly corrected.

The corrections made were "l/I", "there after/thereafter", "prornised/promised", "bad/had", and "rarn/ram".

We have also wrapped the lines--at 60 characters in this case, but it is commonly as much as 70 characters per line. Sentences which look wrong, but where it isn't clear what the right text should be, have been marked with asterisks (*).

'There Periniedes and Eurylochus held the victims, but I drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink-offering to all the dead, first with mead and thereafter with sweet wine, and for the third time with water. And I sprinkled white incal * thereon, and entreated with many prayers the strengthless beads of the dead, and promised that on my return to Ithaea I would offer in my halls a barren heifer, * Teiresias alone sacrifice a black ram without spot, the fairest of my flock. But when I had besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.

We look up the text where the first proofer has asterisked it, and make the corrections.

The text is now ready for second proofing. An e-text is ready for second proofing when you can skim through the text without noticing that there are errors.

We can either do a second proof ourselves, or send it out for second proofing.

Second proofing involves a very careful reading of the text, looking for small errors. In some ways, it's much harder than first proofing, since it's very easy to let your eyes run on auto-pilot and in doing so, miss subtle errors.

Having performed the second proof, which caught errors like "beads/heads", "Ithaea/Ithaca", "Periniedes/Perimedes" and "he/be", we now have our final e-text.

'There Perimedes and Eurylochus held the victims, but I drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink-offering to all the dead, first with mead and thereafter with sweet wine, and for the third time with water. And I sprinkled white meal thereon, and entreated with many prayers the strengthless heads of the dead, and promised that on my return to Ithaca I would offer in my halls a barren heifer, the best I had, and fill the pyre with treasure, and apart unto Teiresias alone sacrifice a black ram without spot, the fairest of my flock. But when I had besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that be departed gathered them from out of Erebus.

Hooray! At long last we have an e-text to post, which can be downloaded, read and enjoyed by anyone in the world from now on.

About Net searching:

V.62. I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Can I just submit it to PG?

You can submit it, but you can't "just" submit it.

We wish we could give a permanent home to all the etexts that people have produced and placed on the Net, but without proof of their public domain [C.10] status, we can't.

We need to be able to prove that the eBooks we publish are in the public domain, so, in order to use one of the many texts that are just floating around the Net, you need to find a matching paper edition that we can prove is eligible [V.18].

(By the way, please be sure that it isn't already in the PG archive. A lot of texts circulating on the Net originated at PG, and people quite often submit them back to us.)

Before you get into this, you should check whether the text you have found is likely to be in the public domain in the U.S. A quick way to verify this is to hit the Library of Congress Catalog site at <http://catalog.loc.gov> and search for the title or author. If you find no publications before 1923, then you should probably move on; the Library of Congress doesn't list every book, and in particular doesn't list all books published outside the U.S., but, if there isn't a pre-1923 copy there, it may be difficult to follow up on. If you're not dissuaded, do a search on the Net for used book shops that might have pre-1923 copies.

Sometimes, with a text on the Net, you know who typed it; it's on someone's website, or the transcriber is named in the text. Sometimes, the text has just been floating around Usenet or old gopher sites for years, with no attribution.

The first thing to remember is that we would like to give credit to the original transcriber if they want it, and if we can identify them.

The next thing to consider is that the original transcriber may well have an eligible copy of the book, and may be able to provide TP&V [V.25] for it.

So, if you can locate the original transcriber, it makes sense to e-mail them, explain what you propose to do, and ask them whether they can help with copyright clearance and whether they would like to be credited in the PG edition. Often, you will get no response, or a response but no prospect of material that will help with clearance, but sometimes you will get lucky.

If the transcriber can't help with TP&V, it's up to you to find a matching paper edition of the same book. This may not be as hard as it sounds. Libraries can help, and may get editions for you on interlibrary loan.

This is an ideal way for students, academics and librarians to contribute texts to PG, since you probably have access to a good library with stocks of old books to find matching paper editions.

If you find a matching paper edition, you then need to compare the etext you found with the book. Legally, what we're trying to prove here is that we have done "due diligence"--that we have done our best to prove that the etext is indeed a copy of a public domain work.

The minimum "due diligence" we can perform is to compare the first and last pages of each chapter, (or every 20 pages where the book is not neatly divided into chapters of about that size). You should list all of the differences between the book and the etext that you find on those pages. It is to be expected that there will be some minor differences of punctuation, spacing and spelling, and even perhaps of wording. Minor differences are OK, but we do need to list them, to prove that we did the comparison. When you have your lists, you can send in the TP&V as normal, accompanied by your lists, for clearance.

Many texts floating round without attribution, and indeed many with attribution, could do with a thorough checking, and another option you have is "comparative retyping", where you go through the whole etext, proofing it carefully against the cleared paper book, and changing everything that is different in the etext to match the paper edition. If you do this, you don't need to produce a list of differences, since there won't be any by the time you've finished; you can just submit it as a normal text--_and_ it may well be a lot cleaner! However, if you do take this path, please do a very thorough job on the proofing and comparison.

If the etext you find has been marked up, in HTML for example, you should remove all HTML for the PG edition, because, even though the text itself has been proved to be in the public domain, the original transcribers may hold copyright on the HTML markup, even if you can't find them. If you do want to make a HTML edition of it for PG, strip out all of the original markup and then re-add your own markup.

If you do find the producer and he or she wants to be identified, you may submit a double credits line like:

Transcribed by Sally Wright <theoriginaltranscriber@example.com> Produced for PG by You <you@example.com>

V.63. I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Why should I submit it to PG?

The first reason is file safety.

Yes, we accept that the file is already available to everyone today, but it may not be safe in the long term. We've seen college students who put books on their personal site, and then lose that site when they graduate. We've seen individuals who transcribe several books, and later lose interest, or move, or die, and the work they've done is lost. We've seen small projects with a few volunteers who produce and post books for a few years, but then break up or run out of funds to maintain their site. We've seen large institutions drop their collections as part of a cost-cutting exercise. We've even seen organizations lock public domain works up behind licenses, requiring users to commit to registration and a "no copying" agreement before downloading them.

Whenever a set of etexts is published and distributed by only one person or organization, there is a danger that their etexts will disappear from the Net sometime. We want _all_ etexts to be spread as widely as possible, copied as much as possible, so that no one event or loss, or whim of a sponsor, can obliterate them.

We think that the PG collection is, for that reason, the safest place to put a text for its long-term survival. There are copies of the PG archives all over the world, on public servers and private CDs. PG publications are widely converted, collected and read on PDAs. Other text projects copy works from PG.

The PG archive is so valuable, yet free and easily portable, that even if every current PG volunteer vanished overnight, people around the world would copy and preserve it. Even if PG itself decided to withdraw all our texts, we couldn't do it, because so many people have made copies.

The second reason is legal safety.

Unlike some other projects and individual efforts, PG retains documentary proof of the public domain status of its texts. This is more valuable than it might appear at first glance.

Publishers often claim a new copyright [C.17] on works that they republish, and as time goes on, it becomes harder and harder to prove that a particular book is in the public domain. Walk into your local bookstore and check out how many works by Shakespeare, Poe, Dickens, and Twain have copyright notices on them! People who want to translate these, or create derivative works like screenplays or lyrics or films must first prove that they are basing their work on a public domain edition, but the creeping copyright practices of commercial publishers make that difficult.

Here's a practical example: we were approached by a film student who wanted to make a short piece based on characters from James Joyce's "Ulysses". But before he could do that, he needed to confirm that the material on which he was basing his movie was in the public domain, and all the editions he could find were copyrighted. However, because PG had already established the public domain status of Ulysses, we could point him to our established PD version, and even tell him where to find a paper copy published in 1922. Without that evidence, he could not have made his project.

V.64. I have already scanned or typed a book; it's on my web site. How can I get it included in the Gutenberg archives?

Great! We get these a lot, but it's always nice to see another!

You need to send us the TP&V [V.25] so that we can prove that your edition is in the public domain. If you don't have the TP&V, you will need to find a matching paper book with eligible TP&V for us to be able to use it.

V.65. I have already scanned or typed a book; it's on my web site. The world can already access it. Why should I add it to the Gutenberg archives?

The Project Gutenberg archives are widely copied and searched, and much safer and more permanent that any individual website can possibly be. We aim to keep this collection together over not just years, but centuries. You took the trouble to transcribe this book. We can relate; that's what _we_ do, as well. We know you want this work to survive you and your ISP, and we believe we can do that. And it's not as if you have to take it off your website when we make a copy; you're just using your candle to light another!

If you want to let readers know that your site has other related material, you can put that information in the Credits Line [V.47]. Taking a real-world example, you could ask us to add this to the Credits line for a C. M. Yonge text:

A web page for Charlotte M. Yonge will be found at www.menorot.com/cmyonge.htm

V.66. I have already scanned or typed a book, but it's not in plain text format. Can I submit it to PG?

Yes, of course. We'll be happy to discuss format options with you, and we're quite experienced in converting between multiple formats and deciding which formats work best and will have the longest life. All you need is to get us a copy of your TP&V [V.25].

About author-submitted eBooks:

V.67. I've written a book. Will PG publish it?

Maybe.

PG gets submissions from young people, for example, who just want to get a story they wrote published in PG. We wish them well with their writing, but that's not really why we're here.

If you are a published author, or perhaps an academic who wants to put a textbook into the archives, it's quite likely that we will publish it.

V.68. I have translated a classic book from one language to another. Will PG publish my translation?

Yes, if we can.

The book that you translated needs to be in the public domain, and we will need the same proof of eligibility that we would use if you were contributing the book in its original language.

For example, if you were translating Hesse's Siddhartha (published pre-1923 in German, but no pre-1923 English translation available), we would need to copyright clear [V.25] the original German edition from which you worked--it needs to be a pre-1923 or otherwise public domain edition. (We actually did this one, thanks to the hard work and scholarship of some volunteers.)

V.69. OK, this is one of the cases where PG will publish it. What do I do next?

You need to decide about copyright issues. Do you want to release your work to the public domain, or do you want to retain copyright? If you want to retain copyright, what terms do you want to release it under? The next few questions deal with those issues.

Having decided that you want PG to publish it, and decided what restrictions (if any) you want to place on further distribution, you just need to write the appropriate letter and send the text to us. [V.46]

V.70. I hold the copyright on a book. Can I release it to the public domain?

You can. All you need to do is put a statement into the released version of the text saying that you have.

If you want to release it into the public domain and distribute it through Project Gutenberg, you should send us a letter to that effect.

To: Michael S. Hart Founder, Project Gutenberg 405 West Elm Street Urbana IL, 61801-3231, USA

Dear Project Gutenberg:

I am the sole copyright holder for the book, "Wallaby Happiness." It gives me pleasure to release this work into the public domain, and I invite Project Gutenberg to publish this public domain edition.

Sincerely,

Gregory B. Newby

Once you have released it into the public domain, neither we nor anyone else needs your permission to publish it, but for us to be sure that it _is_ a public domain version, we do need a signed letter.

V.71. I hold the copyright on a book. Do I have to release the book into the public domain for Project Gutenberg to publish it?

Absolutely not! For example, many contributors of copyrighted material want to share it with the world, but do not want it commercially republished by other companies.

You can grant Project Gutenberg perpetual, non-exclusive, world-wide rights to distribute your book on a royalty-free basis by sending a letter to Michael Hart. Your letter may be brief, but must be signed, and must include the name of the book and the assertion that you are the copyright holder or the agent for the copyright holder.

If you want some related information, like a link to your website, included in the text, we will be happy to oblige.

Once we have posted a text, many people will copy it. We have no effective mechanism for "recalling" texts that we have posted, so please be sure, before you commit to this, that you intend to follow through with it, because there is no way to change your mind later.

Here is a sample letter, including the address to send it to:

To: Michael S. Hart Founder, Project Gutenberg 405 West Elm Street Urbana IL, 61801-3231, USA

Dear Project Gutenberg:

I am the sole copyright holder for the book, "Wallaby Happiness." It gives me pleasure to grant Project Gutenberg perpetual, worldwide, non-exclusive rights to distribute this book in electronic form through Project Gutenberg Web sites, CDs or other current and future formats. No royalties are due for these rights.

Sincerely,

Gregory B. Newby

V.72. I hold the copyright on a book, and would like Project Gutenberg to publish it. Can I choose what rights to assign?

For PG to be in a position to copy it, we do need perpetual, worldwide, non-exclusive, royalty-free rights to distribute the book in electronic form. What rights you choose to assign to readers after that is a decision for you to make.

The Creative Commons site <http://www.creativecommons.org> may give you some ideas of what practical use you can make of your copyright to see that the work is used in the ways you intended.

About what goes into the texts:

V.73. Why does PG format texts the way it does?

PG texts are formatted as plain ASCII, with 60-70 characters per line, with a hard return [CR/LF] at end of line, and some people ask "Why do it _this_ way? You could omit the hard returns and let the reader's word processor or Reader software wrap the lines. You could use "8-bit" accented characters for non-English characters." "You could use ' - ' instead of '--' for an em-dash." And so on, through a different choice we could make for every formatting feature. And the answer, of course, is that we _could_ do it differently, and sometimes we do, but mostly we keep to one consistent style.

We'll be discussing each of the formatting decisions below, not only giving the summary PG answer, but also discussing the plusses and minuses of each, and the possible options.

Like any question beginning "Why does/doesn't PG . . . ?", the answer is "Because that's what the volunteers and readers want!". These conventions have been worked out over the years, largely by Michael Hart, our founder and chief volunteer, in conjunction with all of us volunteers, as the result of feedback from readers.

We are guided throughout by the principle that we want to produce texts in the simplest format that will adequately express the content. Quoting Michael Hart (1994):

Etext as developed and distributed by Project Gutenberg since 1971 was never intended to be a copy of a paper or a parchment [remember, first Project Gutenberg Etext was typed in from parchment replicas of the US Declaration of Independence].

The major purposes of Project Gutenberg have always been:

1. to encourage the creation and distribution of electronic texts for the general audience.

2. to provide these Etexts in a manner available to everyone in terms of price and accessibility [i.e. no special hardware or software], and no price tag attached to the Etexts themselves.

3. to make the Etexts as readily usable as possible, with no forms or other paperwork required, and as easily readable to the human eyes as to computer programs, and in fact, more readable than paper.

There is sometimes a conflict between "simplest format" and "adequately express the content"; further, different people have different views on what is "simple" or "adequate". You, the producer of the text, have spent the time and effort to make the eBook available to the world, you have thought more about it than anyone else, and we respect your informed judgment. However, please make sure that your judgment _has_ been informed, by studying the precedents and reasons behind our guidelines.

Where a simple, standard PG-ASCII layout does not, in your view, "adequately express the content", you should think of making your text in another open format, perhaps HTML or XML or TeX, that allows you to use more characters, more formatting options, and images. We are always happy to accept these kinds of files. In these cases, you should also provide a standard PG-ASCII version, even if you feel it is unacceptably degraded, for those who cannot use your preferred format.

Just ten years ago, presentation as plain ASCII was not only a universal standard, it was effectively the only way that most people could view the books. The first version of the HTML specification had been drafted, but was unknown among the general public. XML did not exist. SGML was (as it still is) the province of specialists. Specialized eBook readers and PDAs had not yet appeared.

In 2002, plain vanilla ASCII is still readable everywhere, but people also want to convert our texts into other formats for more convenient loading on readers and web sites. We therefore have to keep in mind that our works will be processed by automatic conversion programs, none of which is perfect, and we have evolved some "defensive formatting" practices, which, while retaining the universality of plain text, also supply clues to automatic converters about how they should treat the layout. These do help to keep converters from making at least the worst mistakes. The most significant "defensive formatting" practices are indenting unwrappable text like quotations, and using _underscores_ rather than CAPITALS for italics. Different volunteers have different priorities: at one extreme, some people want to make the best plain text they can, giving no weight to conversion issues; at the other, some people emphasize the cues that will allow automatic reformatters to convert the texts well, even if that causes some ugliness in the plain text. Most of us operate somewhere between, making the choices we feel are best depending on the context. Getting a text on-line is the important thing; which choices you make in doing so is a matter of detail.

About the characters you use:

V.74. What characters can I use?

a) You should use plain ASCII for straight English texts.

b) When producing a text partly or completely in a language that requires accents, you should use the appropriate ISO-8859 character set for the language, and specify which you are using, and also provide a 7-bit plain ASCII version with the accents stripped.

c) When producing a text in a language that doesn't use one of the ISO-8859 character sets, you should use the encoding most commonly used for that language. [e.g. Chinese--Big 5]

d) When producing a text containing more characters than can be found in any one of the ISO-8859 character sets, you should use Unicode.

You should use plain ASCII wherever possible--that is, the letters and numbers and punctuation available on a standard U.S. keyboard, without accented letters. The immediate and major exception to this is when you are typing a text written in a language like French or German that requires accents.

There is a problem with using non-ASCII characters. They do not display consistently on all computers; in fact, they do not even display consistently on the same computer! On my computer, for example, what looks like an e-acute in this editor just shows as a black box in another editor, or even using a different font in the same editor. And this is by no means confined to some theoretical minority; we have to deal with it all the time when posting texts.

Further, standards are changing: ten years ago, the character set Codepage 850 [MS-DOS] was very common; now it's rare except in some texts that have survived those ten years.

We want to preserve these texts over _centuries_, not just decades, and at the moment there is no single clear standard that we can use across all texts. Unicode may perhaps be a future standard, but, right now, it's not something that people use every day, and it's not supported by a lot of common software.

ASCII, while limited, is supported by almost all computers everywhere, so we make a point of always supplying an ASCII version where possible, even if the ASCII version is degraded when compared to the 8-bit original. When we get a text in, say, German, we post two versions of it--one with accents and one without.

V.75. What is ASCII?

Don't get scared by the computer jargon; ASCII (pronounced ASS-key) is just a name for the set of unaccented letters, numbers and other symbols on a standard U.S. keyboard.

ASCII (American Standard Code for Information Interchange) is a set of common characters, including just about everything that you can type in on an English-language keyboard. It includes the letters A-Z, a-z, space, numbers, punctuation and some basic symbols. Every character in this document is an ASCII character, and each character is identified with a number from 0 through 127 internally in the computer.

Just about every computer in the world can show ASCII characters correctly, which makes it ideal for PG's purpose of providing texts that can be read by anyone, anywhere, but ASCII does not include accented characters, Greek letters, Arabic script and other non-English characters, which causes some problems when we produce texts that need non-ASCII characters.

V.76. So what is ISO-8859? What is Codepage 437? What is Codepage 1252? What is MacRoman?

Today's computers mostly work on the basis of dealing with one "byte" at a time. A byte is a unit of storage than can contain any number from 0 through 255--256 values in all. It's very convenient for computers to associate one character with each of these numbers, so that we can have up to 256 "letters" viewable from the values stored in one byte. The first 128 values, zero through 127, are defined by ASCII--so, for example, in ASCII, the number 65 represents a capital "A", 97 represents a lowercase "a", 49 stands for the digit "1", 45 for the hyphen "-", and so on.

ASCII doesn't define characters for the values 128 through 255, and in early days computer manufacturers used these values to hold non-ASCII characters like accented letters and box-drawing lines. Of course, 128 wasn't nearly enough values to hold all of the characters that people needed to use for different languages, so they made the character sets switchable, so that a PC in France could use a different set of accented letters from a PC in Poland. Microsoft's version of this was called Codepages. Each Codepage held a different set of non-ASCII characters. Codepage 437, and later Codepage 850, were commonly used for English and some major Western European languages on MS-DOS.

MacRoman was Apple's first codepage, containing most of the accented letters in Latin-derived languages, and MacRoman is still in common use on Apple Macs today.

Later, the International Standards Organization ISO got around to looking at the problem, and defined ISO-8859-1, ISO-8859-2 and so on, as the standards for different language groups. These sets all define the characters 160 through 255 as accented letters and other symbols, and define the 32 characters from 128 through 159 as control characters.

Since Microsoft Windows has no use for the control characters 128 through 159, Windows fonts commonly use Codepage 1252, which has ASCII in the first 128 characters, ISO-8859-1 in characters 160 through 255, and other symbols in the characters 128 through 159. Just to make an already chaotic system worse, all characters can be defined differently in different fonts!

Of course, most of these codepages are incompatible with each other. For example, the byte value 232 shows as a lower-case "e" with a grave accent in ISO-8859-1 and CP1252, a capital letter "E" with diaeresis in MacRoman, a Latin capital letter "Thorn" in CP850, a Cyrillic lower-case "Sha" in ISO-8859-5, a Greek capital letter "Phi" in CP437, and so on. So if you view a text intended for one of these character sets with a program that assumes a different character set, you see gibberish.

The good news, for mostly-English texts at least, is that ISO-8859-1, Codepage 1252 and Unicode agree on the numerical values of the accented characters and symbols to be represented by the values 160 through 255. And everybody accepts ASCII--a pure ASCII file is valid ISO-8859-anything, valid Codepage-anything, and valid Unicode UTF-8.

For more detail about the mappings between Unicode and other formats, you can view Unicode<-->ISO-8859 mappings at ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/ Unicode<-->Windows mappings at ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/ and Unicode<-->Apple mappings at ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/

If you're not confused enough by now, please read the excellent guide to the whole "alphabet soup" problem at <http://czyborra.com>.

V.77. What is Unicode?

Recognizing that no single set of 256 characters can hold all of the symbols necessary for true multi-lingual texts, ISO 10646 was created. This defined the Universal Character Set (UCS) using 31 bits, which has the potential for a staggering _2 billion_ characters.

The Unicode Consortium is a group of computer industry companies who agree the Unicode standard. Unicode accepts the ISO 10646 standards, and adds some restrictions and implementation processes. It plans for a modest million or so characters; however, this is enough for all living and extinct languages, and imaginable future ones too.

Using 4 bytes for each character is wasteful, though, when most characters need only one or two, and there are programming problems with implementing 4-byte characters, so Unicode provides Transformation Formats (UTF) which allow the characters to be encoded using fewer bytes where possible. UTF-8 and UTF-16 are common.

UTF-8, which is the most practical of these from the PG point of view, allows ASCII to be encoded normally, and usually uses two or three bytes for other non-ASCII characters.

Because of the extra work needed to support this extra space, and the fact that most people work mostly in one or maybe two languages, Unicode is being adopted only slowly, and most computer programs in 2002 do not fully support it. But when you need to mix Arabic, Greek, Ogham and Sanskrit in one text, it's the only possible answer!

For more about this, go straight to the source at <http://www.unicode.org>.

V.78. What is Big-5?

Big 5 is an encoding of a set of 13,000+ traditional Chinese characters.

V.79. What are "8-bit" and "7-bit" texts?

For practical purposes, 7-bit texts are plain ASCII; 8-bit texts have accented letters.

This comes from computer jargon. You can represent the 128 characters of ASCII using 7 bits--binary digits--but to represent the 256 characters needed for the various codepages and ISO-8859 standards, like accented letters, you need 8 bits. Hence, we call a text that uses non-ASCII characters in a character set like Codepage 850 or ISO-8859-1 an "8-bit" text.

When we post a text as both 8-bit and 7-bit, as we do when ASCII is not enough to render the text acceptably, we name the file with an "8" or a "7" at the start. So, for example, Crime and Punishment by Dostoevsky is named 8crmp10 for the 8-bit version with accents, and 7crmp10 for the 7-bit version without accents.

See also FAQ [R.35]: "What do the filenames of the texts mean?"

V.80. I have an English text with some quotations from a language that needs accents--what should I do about the accents?

If stripping the accents would unacceptably degrade the book, then submit two versions, one "8-bit" with the accents included and one "7-bit" plain ASCII, and we will post both.

This is a hard choice. What constitutes "unacceptable degradation"?

Clearly this is a decision that all of us in PG have to make. It's a very common problem, and different people have different views. For that matter, different print publishers have different views; you will see the words "debris", "facade" and "cafe" printed with and without accents in different books, and even in different editions of the same book.

We don't want to post two versions when we don't have to. It doubles the posting work, doubles the disk space needed, potentially confuses downloaders, doubles the maintenance when we need to correct the text. On the other hand, we don't want to degrade the text.

There is no clear line, no definitive answer to what level of degradation is acceptable. Most producers feel that there is no point in making a separate version when dealing only with a few foreign words thrown in among the English, but when, for example, some significant dialog between the characters is in French or Spanish, it's harder to say that stripping the accents is acceptable. You, the producer, need to decide this on a case-by-case basis. If you're not sure, discuss it with one of the Directors of Production or one of the Posting Team.

If you have made the text with accents, you can choose to make your own 7-bit version and send it to us, or just send the 8-bit version and we'll make the 7-bit version from it. Some people prefer to make their own 7-bit editions; some don't. Whether you use a Microsoft Codepage, one of the ISO standards or MacRoman doesn't matter--we can convert any of them for you.

V.81. I have some Greek quotations in my book. How can I handle them?

There is no way to show Greek letters in ASCII. You have three options:

You can just replace the Greek words with [Greek] to indicate to the reader that you have omitted it.

You can "transliterate" the Greek to ASCII. Greek letters do have a correspondence to plain "Latin" letters--for example, the Greek letter "delta" can be represented by the letter "d". There is a simple PG guide to transliteration at <http://www.promo.net/pg/vol/greek.html>. This practice has had a long and honorable history: words like "amphora" and "hubris", for example, are straight transliteration from the Greek. This is usually the best option.

If there is enough Greek to warrant it, and no other accented characters, you may be able to use the ISO-8859-7 character set, and submit both 7-bit and 8-bit versions [V.79]. ISO-8859-7 is for modern rather than classical Greek, but, if necessary, you will surely be able to express the Greek fully in Unicode. However accurate your Greek, that still leaves the issue of what to do with the 7-bit ASCII version, where transliteration is probably still your best bet.

V.82. I want to produce a book in a language like Spanish or French with accented characters. What should I do?

Use the appropriate ISO-8859 Character set [V.76] for your 8-bit version.

About the formatting of a text file:

This section of the FAQ goes into great detail about all kinds of formatting questions. However, looked at from a higher level, the only real issue is that we want to render texts clearly, with formatting that reflects the original, so that readers of the plain text format can read them easily, and people converting them to other formats can do so reliably. When you come across a case that is not covered by the detailed guidelines below, keep this ultimate aim in mind, and make the best decision you can. Don't get hung up for hours or days over a question of formatting--if you want advice, look at how other people have handled the same situation in previous texts, or ask other volunteers for their ideas.

V.83. How long should I make my lines of text?

For normal prose, such as you find in a novel, your lines should mostly be 60 to 70 characters long, not shorter than 55, not longer than 75 except where it can't be helped. Never, ever longer than 80, except where you're trying to render a non-text structure, like a family tree.

For poetry, make the text look as much like the book as possible. This also applies to some plays where the lines are clearly intended to be broken at specific points, whether blank verse or not.

V.84. Why should I break lines at all? Why not make the text as one line per paragraph, and let the reader wrap it?

We could either use 70-character lines and let readers unwrap them if they want to, or use infinite-length lines and let readers wrap them if they want to. We choose to wrap the lines so that they are readable on even the simplest of text editors and viewers.

V.85. Why use a CR/LF at end of line?

CR/LF can lead to double-spacing, notably on Mac and Unix, but at least there _is_ a CR in there for Mac users, and there _is_ an LF for *nix users.

If you don't know or care what this is about, please skip blithely on.

There are three differing standards for how to represent the end of a line of text. In brief, Apple Macs use the CR character. Unix and its variants use the LF character. Microsoft systems, from MS-DOS through Windows, use both together.

If you want the history behind these:

CR stands for Carriage Return, and comes from the old typewriter / teletype idea of a command to move the print head from the right of the page back to the left when it reaches the end;

LF stands for Line Feed, and comes from the old typewriter / teletype idea of a command to move the print head down a line;

CR/LF together indicate moving down a line and back to the left of the page.

The history is not relevant to today's computers in principle, but in practice they all use one of these legacy conventions, and there's nothing we can do about it but pick one.

V.86. One space or two at the end of a sentence?

Whichever you prefer, but if using two spaces, please use them only at the end of a sentence, not after abbreviations like "Dr." and "per cent.", and not after non-sentence-ending punctuation like the question-mark in the sentence: "Must you go? when the night is yet so black!"

Many people have strong views on either side of the "one space or two?" question, and we're not about to try and argue with them. Use whichever is most natural for you.

However, if using two, you take responsibility for deciding where the sentence ends. You can't just place two spaces after every period, question-mark and exclamation mark, since periods are also used for abbreviations end ellipses, and question-marks and exclamation-marks don't always end sentences.

V.87. How do I indicate paragraphs?

Just leave a blank line before each paragraph.

V.88. Should I indent the start of every paragraph?

No.

Printers do this when publishing paper books because they do not leave blank lines in the text, but there is no need for indenting in our eBooks.

V.89. Are there any places where I should indent text?

Yes. You should always make poetry look like the original, and that may mean indenting some lines, for example:

I was a child and she was a child, In a kingdom by the sea; But we loved with a love that was more than love-- I and my Annabel Lee;

Even when poetry doesn't have indented lines, it is a good idea to indent quotations embedded in prose. Remember, others will be converting your text later--to HTML, to PDA reader formats, to formats that don't even exist yet--and much of this conversion will be done automatically, by computer programs. It is very hard for a program to know when it can and can't re-wrap lines to fit a screen size unless it has a clear signal that _this_ line should not be wrapped. This is one of the biggest problems with auto-converting PG texts.

Just about all formatting programs "know" that lines that are indented shouldn't be wrapped, so by indenting lines just a space or two, you can prevent

I think that I shall never see A poem lovely as a tree.

from turning into

I think that I shall never see A poem lovely as a tree.

in some future reader's eBook.

You don't really need to do this in texts where the whole book is poetry or blank verse, since these will probably be recognized as whole books that shouldn't be rewrapped, but when there are a few lines of quotation amid an acre of straight prose, a few spaces will be a life-saver. Even in the original plain text version, the extra spaces serve to set the quotation off from the main text.

You shouldn't get carried away and indent things 20 spaces for this reason, though. Anything up to four spaces is reasonable; more is excessive. If you're indenting many short verses in this way, keep your number of spaces for indentation consistent throughout the book.

There are some other times when you may judge it best to indent, where text is indented in the paper book, like newspaper headlines or pictures of handwritten notes.

V.90. Can I use tabs (the TAB key) to indent?

No.

The problem with tab characters is that they act differently in different applications. Typically a tab will move the text to the next tab stop, which might be four spaces on your PC, but 20, or none, on someone else's. The effects are unpredictable.

V.91. How should I treat dashes (hyphens) between words?

In typography, there are four standard types of dashes: the hyphen, the en-dash, the em-dash, and the three-em-dash.

Originally, printers called these the "em-dash" because it was the same width as the capital letter M in whichever font they were using, the "en-dash" because it was the same width as the capital letter N, and the "three-em-dash" because it was as long as three capital Ms.

The hyphen is used for hyphenated words, like "en-dash" itself, or "to-day" or "drawing-room". For this, you just press the single dash or hyphen key on your keyboard.

In typography, the en-dash is a little longer than the hyphen, and is typically used for duration, where you could substitute the word "to". For example, if you were printing "1830-1874", or "9:00-5:30", you would use an en-dash instead of a hyphen. The en-dash is also sometimes used as hyphenation between words that are already hyphenated, for example, "bed-room-sitting-room" might use an en-dash as its central dash to emphasize that it is a different type of separator from the plain hyphens before "room". However, there is no ASCII character for an en-dash, and we use the hyphen in these cases. (HTML and some character sets do provide separate entities for en-dash and em-dash.)

The em-dash is shown in print as a longer dash, and for PG purposes, you should render it as two hyphens with no spaces around them.

You use the em-dash as a kind of parenthesis--as I am doing here--or to indicate a break in thought or subject within a sentence. There is no ASCII equivalent of the em-dash; there is no key on your keyboard that you can press to get one. For PG texts, we represent the em-dash as two dashes with no space between or around them--like this.

The em-dash can also be used at the end of a sentence or speech to indicate that the speaker stopped or trailed off. For example:

"When I saw you with Emily, I thought you were-- I thought she was--"

In a case like this, there may be a space following the em-dash, and the context may demand that there _should_ be a space following the em-dash, not because of the em-dash as such, but to make the break between the statements or sentences clear.

These two hyphens represent _one_ character, so you should never break them at line end, with one hyphen at the end of the first line and the other at the start of the second. If you have an em-dash near line end, you can break the line either before or after the em-dash, but never in the middle.

The fourth type of dash, the three-em-dash, is used to represent a missing word, or an undetermined number of missing letters. You will often see it in a sentence like:

Dr. P------ was known for his honesty.

or

Dr. ------ was known for his honesty.

where there is a convention that the character's name has been redacted. Logically, we should represent the three-em-dash as six dashes, but you may reduce that to four. Whichever you choose, do use it consistently in the text you're producing.

Unlike the em-dash, you should leave a space in such cases wherever a space would have been before the letters were replaced by dashes.

Here's a summary table of the dashes:

Name ASCII Used for

Hyphen - Hyphenated Words En-dash - Durations, like "3:00-5:30" Em-dash -- Break in sentence or parenthetical comment Three-em-dash ------ Indicating a word that was edited out.

V.92. How should I treat dashes replacing letters?

If the dashes obviously represent individual letters, use the same number of hyphens. Otherwise, you can use a three-em-dash (see above: 6 or 4 hyphens) in such places.

A common convention when a character in a novel is using bad language, or when reference is given to a character whose full name is not being used, is to replace the letters with dashes. For example,

"That D---l, Mr. C------s will regret his hasty actions!"

In this case, it is clear that "D---l" is meant to represent "Devil" and that there is a character whose name begins with "C" and ends in "s" whose name is not spelled out in full. Where the book makes it clear how many letters are represented by hyphens, just use that number of hyphens.

Where the number of letters omitted is not clear, you can decide how long you want to make your extended dash. Typographers often use the "three-em-dash" for this, so called because it is as wide as three capital Ms. Logically, since we represent an em-dash by two hyphens, we might represent a three-em-dash as six, but if you feel that six hyphens is too long, you can choose a shorter length, like four, but if you do, keep it consistent within your text:

It was in the town of S----, walking on M---- Street, that Sowerby came upon Dr. T---- taking the morning air.

V.93. What about hyphens at end of line?

Remove the hyphens from single words that were wrapped by the printer at line-end on the paper copy. Where two words are joined with a hyphen, you can leave the hyphen at end of the text line.

Books are usually printed with words broken at end of line to make the right side of the text perfectly even. You should remove all such hyphens. For example, in the sentence:

Mary's mouth tightened as she saw the marks on the car- pet, and her hands balled into fists.

you should remove the hyphen from "carpet".

Words which are strung together and hyphenated by the author pose a different question. It is perfectly OK from the point of view of a reader of the plain text version for such a hyphen to occur at end of line, for example:

Now that the guns were silent, convoys brought badly- needed medical supplies and food.

However, be aware that if somebody later rewraps the text for use in a different format like HTML, it is possible that they will introduce a space where it should not be:

Now that the guns were silent, convoys brought badly- needed medical supplies and food.

so there is still a small disadvantage to having a hyphen at line-end.

Sometimes it's not entirely clear whether the hyphen is there because it has to be, or just because it happens to fall at the end of the line:

Daisy rushed to the door, but there were no letters for her to- day, and she retreated sadly.

Sometimes "today" is written as "to-day", especially in older works. So which is this? Should we remove the hyphen or not? In this case, the best thing to do is search the rest of the text for the same word, and see whether it is consistently hyphenated or not in other places.

V.94. What should I do with italics?

There are three different ways volunteers currently render italics: like THIS, like _this_ and like /this/. Pick one, and use it consistently in your text.

There are really two questions here: "How should I render italics?" and "When should I render italics?"

The original PG standard for italics was to render emphasis italics as CAPITALS, using underscores for an italicized _I_, and do nothing for non-emphasis italics like foreign words and names of ships, and this is still the most common usage. For reading a plain-text file in a plain text editor, it is still arguably the most reader-friendly usage as well.

It has two drawbacks:

1. if you do want to preserve italics for non-emphasis words, you may end up with a very ugly text where there are too many capitals.

2. it is impossible to convert CAPITALS reliably back into italics, since the original text might have had a capital letter, or even been all capitals in the first place. This is especially true of automatic conversion for people who want to read PG texts on eBook readers.

To overcome these problems, many volunteers now use _underscores_ or /slants/ to render italics. These allow you to preserve all italics without creating an ugly plain-text, and to remove the ambiguity of CAPITALS. Underscores are more popular than slants, but some people feel that underscores should properly be reserved for underlined text. Since printers tend to avoid underlines, however, there aren't many books where this causes a real conflict.

V.95. Yes, but I have a long passage of my book in italics! I can't really CAPITALIZE or _otherwise_ /mark/ all that text, can I?

No, you really can't. On the other hand, if the author intended that section to stand out, you don't want to ignore that information and withhold it from future readers.

What you _can_ do is format it differently from the rest of the text. For example, if you're averaging a 68-character line throughout normal paragraphs, you could reasonably use shorter lines, like 58 characters, for the italicized section. Going a step further, you could shorten the lines and indent them a space or two as well. This will give a clear signal to future readers and converters that this section is to be treated specially.

V.96. Should I capitalize the first word in each chapter?

No.

Capitalization of the first word is often used in printed material to emphasize the break at the start of a section or chapter on the paper, but it is not necessary in an eBook, and leads to the same kind of ambiguity as does the capitalization of italics, and for far less reason.

If you feel you really _must_ capitalize the first word, we probably won't stop you, but if so, please do it consistently throughout the book, not just in one or two places, so that a future reader can be certain that these capitalized words were a chapter-head convention, and not otherwise intended for emphasis.

V.97. What is a Transcriber's Note? When should I add one?

A Transcriber's Note is a small section you can add to a text you produce to give the reader some information about changes you made to the book when rendering it into text.

A Transcriber's Note is not the same as a footnote--a footnote is part of the text you have transcribed; a Transcriber's Note is a note that _you_ add to the text, explaining something _you_ have done or omitted. If there is a Transcriber's Note, it may be at the top or the end of the text, and it should be clearly marked so that a reader cannot confuse it with the main text or an introduction.

The main thing is to ensure that a reader cannot confuse text that you have added with text that was in the original book.

Transcriber's Notes are rarely needed, but if, for example, you found misprints in the text, or things that might look like misprints even though they're not, you may note them here, if it seems relevant. If there is an image in the book that is important to the content, you may describe it in a note. If there was unusual typography that you had to represent in some uncommon way, you might well explain that here.

You don't need to add a Transcriber's Note just for common conversions like italics, and you should not use such a note to add your own comments or views about the text or the author. It's just there to let the reader know what decision you have made about rendering the text.

Here are some examples of Transcribers' Notes:

Transcriber's Note:

The irregular inclusion or omission of commas between repeated words ("well, well"; "there there", etc.) in this etext is reproduced faithfully from the 1914 edition . . .

Transcriber's Note:

Inserted music notation is represented like [MUSIC--2 bars, melody] or [MUSIC--4-part, 8 bars]

[Transcriber's Note: This letter was handwritten in the original.]

Transcriber's Note:

The spelling "Freindship" is thus in the original book.

Transcriber's Note: Some words which appear to be typos are printed thus in the original book. A list of these possible misprints follows:

If there is an image that is important to the content you may describe it at the point in the text where it appears, for example:

[Transcriber's Note: Here there is a map of three islands just West of and parallel to a coastline running SW to NE, with a big X marked on the North of the middle island. A spur of land extends from the mainland, sheltering the islands from the north-east.]

Transcriber's Notes that apply to the whole text should be placed at the start or end of the text--your choice. Notes that pertain to a specific point in the text, like the map example above, should be placed at the point where in the text where they are relevant, but not interrupting a paragraph except where it cannot be avoided.

V.98. Should I keep page numbers in the e-text?

No. But there are exceptional cases . . .

In general, the page numbers of the original book are irrelevant when making a reader's edition for PG; they are annoying and intrusive for anyone trying to read it, and if you did keep them, they would probably be removed by anyone converting it. Get rid of them!

But there are a few books where page numbers are appropriate. Non-fiction books that use page numbers as internal cross-references are the prime example; if, on page 204, the text reads

"Our studies of plants (see pp. 141-145) show that this is true."

and this kind of cross-reference is frequent throughout the text, then it is probably best to keep the page numbers, since it is otherwise very difficult to honor the author's intent.

In the more common case where cross-references exist, but are not frequent, and not essential to the text, you have several choices: leave the cross-references in, meaningless though the page numbers are, remove the cross-references, change the cross-references to something relevant (like "Start of