Publishing scans at the Internet Archive

From DPWiki

(This page is about an activity that--at least currently--is not part of Distributed Proofreaders (DP), but that may be useful or desirable for many DPers anyway.)

Note: The steps shown below may differ slightly depending on which upload method you choose.

Rationale

Distributed Proofreaders keeps all page scans it produces and is planning to offer them to the general public one day through something called the Open Library System. Project Gutenberg also offers the possibility of publishing scans, but only of books that are accompanied by a "plain vanilla" e-text.

So why would you post a project's scanned pages on The Internet Archive? Because...

  • ... you want to publish them now
  • ... you want to offer them as high quality images
  • ... you want to offer them before an e-text has been produced.

Steps

Step 0. Log in at archive.org. If you haven't got an account there yet, create one, then log in.

Step 1. Go to http://www.archive.org/create/.

If you are presented with multiple upload options, pick one.

Step 2. Choose an identifier (called Title) for your "project". The name should be both fairly unique and self-explanatory. It will end up being part of the URL where visitors can access the files you uploaded.

Step 3. Enter your identifier in the input box, then click Next.

Step 4. Archive.org will start crunching some numbers. If the identifier was created successfully, the webpage will change to tell you so. It will then present you with a form for entering further meta-data and for selecting the file to upload.

Step 4.b. Upload your file.

Your file should be called IDENTIFIER_FILETYPE.zip or IDENTIFIER_FILETYPE.rar where IDENTIFIER should be replaced by the identifier you just chose, and FILETYPE by an abbreviation of the name of the filetype that you stored your images in, for example 'jpg' for JPEG.

If you have already named your files, but their names do not match the Internet Archive pattern (IDENTIFIER_NNNN.FILETYPE), now would be a good time to rename them.

For example: if your project's identifier is "myproject," and your scans are in JPEG format, then the scans need to be in a ZIP file called myproject_jpg.zip, which should contain a folder called myproject_jpg, which in turn should contain files numbered myproject_0001.jpg, myproject_0002.jpg, and so on.
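The renaming and zipping described above can be scripted. The following is a minimal Python sketch, not an official Internet Archive tool: the function name build_scan_zip and its parameters are made up for illustration, and it assumes that your existing file names already sort in page order and that numbering starts at 0001 as in the example above.

```python
import zipfile
from pathlib import Path


def build_scan_zip(scan_dir, identifier, filetype="jpg", out_dir="."):
    """Zip the scans in scan_dir under IDENTIFIER_NNNN.FILETYPE names.

    Hypothetical helper: the original files are left untouched; only the
    names stored inside the ZIP follow the pattern. Returns the ZIP path.
    """
    scan_dir = Path(scan_dir)
    folder = f"{identifier}_{filetype}"           # e.g. myproject_jpg
    zip_path = Path(out_dir) / f"{folder}.zip"    # e.g. myproject_jpg.zip

    # Assumption: sorting the existing names gives the reading order.
    # Files with a different extension (such as .jpeg or .txt) are skipped.
    scans = sorted(
        p for p in scan_dir.iterdir()
        if p.is_file() and p.suffix.lower() == f".{filetype}"
    )

    # Scans are stored uncompressed; JPEGs barely shrink when zipped anyway.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as zf:
        for number, scan in enumerate(scans, start=1):
            new_name = f"{identifier}_{number:04d}.{filetype}"
            zf.write(scan, arcname=f"{folder}/{new_name}")
    return zip_path
```

Calling build_scan_zip("scans", "myproject") would, under these assumptions, write myproject_jpg.zip containing a folder myproject_jpg with the files myproject_0001.jpg, myproject_0002.jpg, and so on.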

Uploading files in any other format or with any other naming scheme also works. Following the scheme outlined above, though, triggers the Internet Archive's derivation process, which generates all kinds of download formats from your scans.

Step 5. Wait, wait, wait while your file is being uploaded. This can take a fair amount of time, and unfortunately you won't get to see how far along the upload is.

Step 6. Go to the new page of your book (if you aren't redirected there automatically). Click the Edit Item link in the left column.

Step 7. Fill out the rest of the meta-data. Don't forget to click Submit when you are done.

Step 8. If you have other files to upload for the same project, use the Item Manager to upload them: starting at the project's home page, click Edit Item, click Item Manager, click Checkout Edit Item's Files, follow the instructions. You'll get FTP access to the directory that contains your files.

Step 9. If you have uploaded extra files as outlined in step 8, go back to the Edit Item page and fill out the meta-data at the bottom of the page, then click Submit.

Step 10. There is no step 10.

Derivation and file names

You can upload your books any way you like, but The Internet Archive (TIA) has an extra feature called derivation that you may wish to use, at the cost of some extra effort. Derivation means that TIA will automatically convert your uploads to all kinds of other formats. If you upload music in FLAC format, they will convert your songs to MP3, Ogg Vorbis and much more for you (the official list of formats is here).

Books are converted from scan image files to other image formats (notably JPEG2000), PDF, DJVU, and OCR-ed text, to mention but a few.

In order to identify which files need to kick-start the derivation process, Archive.org has fairly rigid expectations of how you name your files. If your identifier is called my_identifier and your files are in the JPEG format (.jpg), your files should be named my_identifier_0001.jpg, my_identifier_0002.jpg, my_identifier_0003.jpg, etcetera. They should be in a directory called my_identifier_jpg, and that directory should be stored in a ZIP file called my_identifier_jpg.zip.
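Because a single misnamed file can matter (see the hyphen mishap described below), it may help to check the names before uploading. The following Python sketch is only an illustration of the naming rules as stated above; the function check_names is hypothetical, and the requirement of exactly four digits is an assumption based on the examples on this page, not a documented Archive.org rule.

```python
import re


def check_names(identifier, filetype, zip_name, folder_name, file_names):
    """Return a list of problems; an empty list means the names look fine.

    Hypothetical checker based on the convention described on this page.
    """
    problems = []
    expected_folder = f"{identifier}_{filetype}"   # e.g. my_identifier_jpg

    if zip_name != f"{expected_folder}.zip":
        problems.append(f"ZIP file should be called {expected_folder}.zip")
    if folder_name != expected_folder:
        problems.append(f"folder should be called {expected_folder}")

    # Assumption: page numbers are exactly four digits, as in _0001.
    pattern = re.compile(
        rf"{re.escape(identifier)}_\d{{4}}\.{re.escape(filetype)}"
    )
    for name in file_names:
        if not pattern.fullmatch(name):
            problems.append(f"unexpected file name: {name}")
    return problems
```

For instance, check_names("my_identifier", "jpg", "my_identifier_jpg.zip", "my_identifier_jpg", ["my_identifier-0001.jpg"]) reports the hyphenated file name as a problem, while the same call with my_identifier_0001.jpg returns an empty list.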

A new, more forgiving format has been developed where you store your files in a folder called my_identifier_images and ZIP that up as my_identifier_images.zip, but I am not sure what the rules for this are. I stumbled upon this format when I accidentally labelled my files my_identifier-0001.jpg and so on (hyphen instead of underscore), and some kind administrator at TIA advised me to rename the folder. You can find a description of this process here.

Extras

  • Mention in your description which PG etext is based on these images (if the answer is "as yet none", you can always come back later and edit your entry).
  • Use the tags "project gutenberg; distributed proofreaders; pgdp", so that folks looking for our books can find them.

See also