Harvesting high-resolution images

From DPWiki

Sometimes when harvesting images from other websites, special techniques or tricks may be useful to get the best quality images. This page documents some of these techniques.

Whichever of the following methods you use, if you are harvesting high resolution images for replacement or addition of images to a DP project, make sure that you contact db-req and arrange a way to have the images added to the project. Do not send images to db-req as an attachment.


Internet Archive

a.k.a. TIA, a.k.a. IA, a.k.a. archive.org

Typically when harvesting from the Internet Archive, you will download a complete set of images for the project. However, at times you may only need a small number of pages, and don't need the entire set of book scans. Some techniques for doing this are described below. Take into consideration the number of images you need, the amount of time it would take to download the whole zip or tar, and the amount of space you have to store the files.

For purposes of these instructions, an example is Alice's Adventures in Wonderland. Scroll through and find a likely looking illustration. If you don't have another archive in mind, you can use this for practice.

Using Internet Archive's interface

The instructions below for accessing images via the read online interface are still valid, but for accessing raw images, where available, the best option is to scroll down the landing page for the book in question, and look for the "Show All" link at the bottom of the sidebar on the right-hand side of the page. This will take you to a list of items in the directory, and, for zip or tar files, will allow you to view the contents of that file.

To harvest individual high-resolution images, click on the "(View Contents)" link beside the appropriate zip or tar archive. This is especially useful for harvesting individual raw images without having to download the whole set of raw images. (Raw images may be identified by the word "raw" or "orig" in the zip or tar filename.)

If the raw scans are jp2 files, they cannot be viewed directly in the browser, but they are the full-sized images. Each jp2 has a link to a browser-viewable jpg that you can use to confirm that you have the correct image. That jpg is only half the size of the jp2, so once you have found the correct image, download the jp2 file to use for the project.

"Get image links for a TIA item" form (superseded)

This section has been largely superseded by the improved Internet Archive interface as described above. The script is still available in noncvs as of May 2022, and instructions have been retained in a collapsed section in case they may be of use.

Note: As of February 2018, the script was moved into noncvs. The link below has been updated. (14:27, 21 February 2018 (EST))

Follow the instructions at the top of the Get image links for a TIA item form.

Accessing specific pages
If you know which page you want, you can specify it in the four-digit TIA format (for example, 0001). Then click on Create Link. You will see three entries: one for the listing of the contents of the zip or tar, one for the full-sized page image, and one for the medium-sized page image for the page specified.
Note that this is the only way of getting a full-sized .jpg file by harvesting if the source files are jp2. If you choose this method, you should choose the full-sized image for illustration images.
Accessing a list of all images
If you have not specified a page number, you will just see a link to the listing of the contents of the zip or tar when you click on Create Link.
If the source files are .jp2s, the list of the contents of the tar or zip will provide links to both .jp2 and .jpg files. Note that the dimensions of the .jpg files are half those of the .jp2 files. Full-sized .jpg files are not available using this method. If you are harvesting for high-res images, and cannot open .jp2 files, use the Accessing specific pages method above to harvest them. However, whichever OS you use, there should be some way for you to use .jp2 files.

Read Online

The first two methods (method 2 is a variation on method 1) use the Read Online version of the book. Find the image that you want to extract, then:

Method 1

  1. Right-click (also control-click if you're using a Mac) and choose "View Image" (Firefox), "Open Image in a New Window" (Safari), or whatever the equivalent choice is in the browser you use (all major browsers have an equivalent command). In the location bar at the top of your new browser window, you should see something like the following:
    http://ia311325.us.archive.org/BookReader/BookReaderImages.php?zip=/3/items/alicesadventur00carr/alicesadventur00carr_jp2.zip&file=alicesadventur00carr_jp2/alicesadventur00carr_0001.jp2&scale=4&rotate=0
  2. Position your cursor at the end of the url. The last two items at the end of the url should be "&scale=4&rotate=0". The one we're concerned with is "scale". Change the "&scale=4" to "&scale=1",[1] and press return. The pixel count in the title bar should change to much larger numbers than were there before.
  3. Save the image to your computer.

[1] Note that the original scale number can vary widely. 4 is rather typical, but 2, 8, and arcane numbers with multiple digits after the decimal point may sometimes appear. In all cases, just change the scale to 1. If the image opens from the Reader in a new window with scale already equal to 2, changing it to 1 will not give you a much larger image. You can also change the scale number to 0, but there generally doesn't seem to be a difference between scale=0 and scale=1.
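
Steps 1 and 2 can also be done with a few lines of Python. This is only a sketch, not part of the tested instructions: the function name set_scale is made up for illustration, and it assumes the url carries its options in the query string exactly as in the Alice example above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_scale(url, scale=1):
    """Return the Book Reader image url with its scale value replaced."""
    parts = urlsplit(url)
    # Keep every parameter in its original order; only "scale" is changed.
    params = [
        (key, str(scale) if key == "scale" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # If the url had no scale parameter at all, add one.
    if not any(key == "scale" for key, _ in params):
        params.append(("scale", str(scale)))
    # safe="/" keeps the slashes in the zip= and file= paths unescaped.
    return urlunsplit(parts._replace(query=urlencode(params, safe="/")))


reader_url = (
    "http://ia311325.us.archive.org/BookReader/BookReaderImages.php"
    "?zip=/3/items/alicesadventur00carr/alicesadventur00carr_jp2.zip"
    "&file=alicesadventur00carr_jp2/alicesadventur00carr_0001.jp2"
    "&scale=4&rotate=0"
)
print(set_scale(reader_url))
```

The printed url is the same as the one you started with, except that it ends in &scale=1&rotate=0. The same call works whatever the original scale number was.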

Method 2

  1. Using the zoom button in the IA Read Online interface, zoom all the way in on the page you want to save to your computer.
  2. Use step 1 of Method 1 above to bring up a copy of the picture. Check the end of the url to make sure that it reads &scale=1 (paranoid check).
  3. Save the image to your computer.

Using All Files: HTTP

Note: Typically if the images are from Cornell University or Google, they're tif images, are already cropped and rotated, and do not have a raw version. As of 2010-08-09, neither of the following methods works if the raw images are in _jpg.tar form. Methods 3 and 4 are more useful if you wish to harvest the raw images.

Method 3

See the "Get image links for a TIA item" form section, above, for the current, updated version of this method. The instructions below are being left in place for the time being for potential use in refining the newer instructions.

Sometimes, you may want to check the raw image (assuming it is a project that has them)—for example, if the edge of the text is cut off in the book display copy of the page, checking the raw image may actually show that it's there, but was cropped off when the "read online" images were prepared.

This script allows you to pull images out of any Internet Archive zip file ending in _jp2.zip, _jp2.tar or _tif.zip.

  1. Find the book at IA for which you wish to look at a raw image. In a separate browser window, open this page.
  2. Over on the left-hand side of the IA window, click on All Files: HTTP at the bottom of the "View the book" section.
  3. You'll see a number of files there. Close to the bottom of the list, you should see a file ending in something like _orig_jp2.tar, _raw_jp2.zip or _tif.zip.[1] Copy this link (right-click, or control-click on a Mac).
  4. Paste it into the box provided for the url on the script page.
  5. Guess at what page number you want—it will probably be a four digit number, and leading zeroes are important. TIA page numbering generally starts with 0001, with the cover of the book, so this page number will not correspond to either the printed page number or the DP project png number.
  6. Under "Results" choose from Large image or Medium image. If this is the wrong page, you can either use your browser back button to go back and adjust the page number from the script page, or edit the url to change the page number if you're comfortable doing that.

[1] Generally, if you're using this script, you're looking for the original unprocessed scans. These will probably be labeled _raw_jp2.zip or _orig_jp2.tar, and will be toward the end of the list, if not the last item on it. This script can also be used to access other _jp2.zip files.
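
The leading zeroes mentioned in step 5 are easy to get wrong when typing by hand. A minimal sketch of building the four-digit page number follows. It assumes, as in the Alice urls on this page, that page files are named identifier_NNNN.jp2 (with "orig" or "raw" inserted for unprocessed scans); ia_page_file is a made-up helper name.

```python
def ia_page_file(identifier, page, label="", extension="jp2"):
    """Build a TIA page file name such as alicesadventur00carr_0001.jp2.

    label is "" for the processed images, or "orig" / "raw" for the
    unprocessed scans.
    """
    middle = f"_{label}" if label else ""
    # :04d pads the page number with leading zeroes to four digits.
    return f"{identifier}{middle}_{page:04d}.{extension}"


print(ia_page_file("alicesadventur00carr", 1))
print(ia_page_file("alicesadventur00carr", 37, label="orig"))
```

Remember that page 1 here is the first scan (usually the cover), not the printed page 1.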

Method 4

Manipulate the url of the Book Reader image.

If you know where to look, there are three places in the url that you can change to pull up the unprocessed image without using the script.

To use Alice as an example again, compare the first url with the second (The first url is the same one you would get using method 1 or 2 above—by right- or control-clicking on the image in the "read online" version, and bringing up that individual image in another window or tab):

http://ia311325.us.archive.org/BookReader/BookReaderImages.php?zip=/3/items/alicesadventur00carr/alicesadventur00carr_jp2.zip&file=alicesadventur00carr_jp2/alicesadventur00carr_0001.jp2&scale=1&rotate=0


http://ia311325.us.archive.org/BookReader/BookReaderImages.php?zip=/3/items/alicesadventur00carr/alicesadventur00carr_orig_jp2.tar&file=alicesadventur00carr_orig_jp2/alicesadventur00carr_orig_0001.jp2&scale=1&rotate=0


Comparing the two urls, three places differ in the second: the archive name (_jp2.zip becomes _orig_jp2.tar), the directory name (_jp2/ becomes _orig_jp2/), and the file name (_0001.jp2 becomes _orig_0001.jp2). These are the pieces of text that need to change or be added.

To figure out what to add/substitute into the url, check the "All Files: HTTP" section. This will tell you whether you're dealing with "orig" and "tar" or "raw" and "zip", or some other combination. If the uncropped, unrotated images are labeled with "raw" instead of "orig", and the archive is a .zip instead of a .tar, substitute "raw" where "orig" is in the above example, and .zip where .tar is.

Below is a table of substitutions for all of the forms that have been tested as of 23:13, 20 August 2010 (PDT). Take special note of the fact that for the raw/tar combination, the extension is in upper case: JPG.

  Archive ending     Label   Archive type   Extension
  _orig_jp2.tar      orig    tar            jp2
  _raw_jp2.zip       raw     zip            jp2
  _raw_jpg.tar       raw     tar            JPG

Note that the final instance of .jp2 in the url above does not change for the first two cases; it changes only for _raw_jpg.tar.

You can also rotate the image by changing the "rotate=0" at the end of the url to "rotate=90" (note that you can also use 270 or 180—just in case you want to view the image upside down).
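
The substitutions for Method 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not a tested tool: raw_image_url is a made-up name, the starting url is assumed to point into a processed _jp2.zip archive as in the first Alice url, and for the _raw_jpg.tar form the directory name (identifier_raw_jpg/) is a guess by analogy with the other two forms.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def raw_image_url(reader_url, archive_suffix="_orig_jp2.tar", rotate=None):
    """Rewrite a Book Reader image url so it points at the unprocessed scan.

    archive_suffix is the ending of the raw archive as listed under
    "All Files: HTTP": _orig_jp2.tar, _raw_jp2.zip or _raw_jpg.tar.
    """
    suffix = re.fullmatch(r"_(orig|raw)_(jp2|jpg)\.(tar|zip)", archive_suffix)
    if suffix is None:
        raise ValueError(f"untested archive form: {archive_suffix}")
    label, image_format, _archive_type = suffix.groups()
    # Per the table, only the jpg form changes the final extension (to JPG).
    extension = "JPG" if image_format == "jpg" else "jp2"

    parts = urlsplit(reader_url)
    params = dict(parse_qsl(parts.query))
    archive = re.fullmatch(r"(.*/)([^/]+)_jp2\.zip", params["zip"])
    page = re.search(r"_(\d+)\.jp2$", params["file"])
    if archive is None or page is None:
        raise ValueError("not a processed _jp2.zip Book Reader url")
    folder, identifier = archive.groups()

    # The three places that change: archive name, directory, file name.
    # (The directory name for the jpg form is an assumption.)
    params["zip"] = f"{folder}{identifier}{archive_suffix}"
    params["file"] = (
        f"{identifier}_{label}_{image_format}/"
        f"{identifier}_{label}_{page.group(1)}.{extension}"
    )
    if rotate is not None:
        params["rotate"] = str(rotate)  # 0, 90, 180 or 270
    return urlunsplit(parts._replace(query=urlencode(params, safe="/")))


processed = (
    "http://ia311325.us.archive.org/BookReader/BookReaderImages.php"
    "?zip=/3/items/alicesadventur00carr/alicesadventur00carr_jp2.zip"
    "&file=alicesadventur00carr_jp2/alicesadventur00carr_0001.jp2"
    "&scale=1&rotate=0"
)
print(raw_image_url(processed))
print(raw_image_url(processed, "_raw_jp2.zip", rotate=90))
```

With the default arguments, the first call reproduces the second Alice url shown above. Always check the "All Files: HTTP" listing first to see which archive ending the book actually has.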

Google Book images

Firefox

If you don't have another title in mind, you can use Gleanings of Natural History, ... to practice on.

  1. In your browser window, bring up the link to the book in which you are interested, or bring up the practice link above.
  2. Find the image you wish to harvest, or use the example page linked to above.
  3. Zoom in as far as you can using the Google books interface zoom tool.
  4. From the menu bar, choose Tools → Page Info, and then click on the Media tab.
  5. Using the list of links at the top of the page, scroll down till the image that you want appears in the bottom section of the window.
  6. In the center section of the window, copy the link labeled Location: and paste it into the location bar in a new window or tab.
  7. Save the image to your computer.

Note: At the end of the url is a w=nnnn value. You can change this number to a larger value than that achieved by zooming in with the Google books interface, and can save an image up to 2500 pixels wide. You can try saving both, but don't be surprised if the larger image is no better in quality than the image that you can get by using the Google books interface to zoom in as far as you can. In the case of the example used above, 1025 pixels wide is the best image you can get from this source.
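
Changing the w= value can also be scripted. The sketch below is illustrative only: set_width is a made-up name, the example url is hypothetical (copy the real one from the Location: field), and the 2500-pixel cap simply reflects the limit reported on this page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

MAX_WIDTH = 2500  # widest image this page reports being able to save


def set_width(image_url, width=MAX_WIDTH):
    """Return image_url with its w= value replaced (capped at MAX_WIDTH)."""
    parts = urlsplit(image_url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params["w"] = str(min(int(width), MAX_WIDTH))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical url for illustration only; copy the real one from Page Info.
example = "https://books.google.com/books?id=EXAMPLE&pg=PA1&img=1&w=1025"
print(set_width(example))
```

As noted above, a larger w= value does not guarantee more detail, so compare the result with the image obtained by zooming in.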

Safari

By default, Safari has the Develop menu disabled. You'll need to enable it to use these instructions. Bring up the Preferences menu, click on the Advanced tab and check the checkbox at the bottom of the pane that says "Show Develop menu in menu bar". Note: these instructions were developed and tested on a Mac. It is not currently known how to do this in Safari for Windows.

Update 2011-08-01: Note that the instructions immediately following work through Safari 5.0. With 5.1, the "Inspect Element" interface has changed. Instructions for 5.1 are in process.

  1. Working with the same sample title above, bring the page up in a browser window.
  2. Ctrl-click or right-click on the image in the window, and choose Inspect Element. This will divide the window in half. The top half has the book you brought up, the bottom half has lots of information, most of which is irrelevant, with several tabs across the bar at the top of that section.
  3. Choose the Resources tab. You will probably see a message telling you that you need to enable resource tracking. You have the choice of enabling it just for the current session, or for all subsequent sessions.
  4. Once resource tracking is enabled, you'll see another set of tabs. Click on the one that says Images.
  5. Go back to the top half of the window and scroll down until you see the image that you want to harvest. Notice that as you scroll down (and, it turns out, up) in the top half of the window, the information in the sidebar at the left side of the bottom pane grows, as evidenced by the changing size of the scroll-bar.
  6. Back in the bottom half of the window, the right-hand frame should have two tabs: Headers and Content. Click on the Content tab, then scroll all the way down to the bottom of the left-hand frame, and (unless you spot the image you want, immediately), start with the last one in the list, and move up until you find it. If you stopped on it in the upper frame, it should be fairly close to the end of the list.
  7. Switch from the Content tab to the Headers tab. At the top is the link you want, the Request URL.
  8. Copy the Request URL and paste it into another window.

Take special note of the "w=nnn" at the end of the url. This is the width, in pixels, of the image. This number will vary, depending on how far in you are zoomed on the image, but is never more than around 1025. You can change this and get an image that is up to 2500 pixels wide.

Opera

These instructions are known to work for Opera 11, using Windows. They have not been tested on other versions or platforms. If you have tried on other versions/platforms and found the method to work (or not work), please update this paragraph.

  1. Working with the same sample title above, bring the page up in a browser window.
  2. Zoom in as far as you can using the Google books interface zoom tool.
  3. This step will differ slightly depending on whether you have the Show Menu Bar option on or off:
    • If you have the "Show Menu Bar" option switched OFF, go to Menu → Page → Developer Tools → Cache.
    • If you have the "Show Menu Bar" option switched ON, go to View → Developer Tools → Cache.
    • Or you can just type "opera:cache" in the address bar.
  4. Scroll down to google.com and click on Preview.
  5. In the list that appears, you will find thumbnail previews of cached images down the right side. It also gives info such as image size in pixels, and file size. Scroll down to the largest version of your image (you will have all the sizes stored in the cache, so you want to find the version with the highest resolution), which you can determine either by resolution size or file size. In the case of the example title, the one you want is 98KB and 1046 x 1353 pixels.
  6. When you find the image, click on its link (which will say /books), and it will open the image. Alternatively, you can right-click on the link and then select Save Linked Content As, and skip the next step.
  7. You can now save it as you would normally do: by right-clicking and then selecting Save image; or clicking File → Save as (Menu Bar ON) / Menu → Page → Save as (Menu Bar OFF); or pressing Ctrl+S.

Note that it is better to have a relatively empty cache when using this method, since it will be easier to find the correct image and you won't have a huge list to scroll through to find it.

Update (2015-04-21): Confirmed to work in Opera 12.16 under Mac OS X 10.8.5. Note also that you do not have to zoom in before previewing the Cache. All cached images should be there, regardless of your zoom level. If you wish, you can also manipulate the image size in the URL, as you can in Firefox and Safari. This may or may not gain you any increased detail.

HathiTrust Digital Library

Before harvesting images from Hathi, check their usage policy. They have different sets of restrictions, based on the source of their scans.

To harvest the largest available scans:

  1. Find the image you want to harvest.
  2. At the top of the page, switch to Classic mode.
  3. In the URL, you should see size=100. Using the zoom button (magnifying glass with a + on it), you should be able to zoom all the way to size=400, or you can change it in the URL.
  4. In Firefox, one of the following:
    • Right-click, View Image, and save,
    • Right-click, Copy Image Location, paste it into a new window or tab, and save,
    • If you have a plug-in that allows you to open an image in another tab or window, do so, and save,
    • Or right-click, Save Image As....
  5. In Safari: right-click, Open Image in New Window (or tab), and save.