Harvesting high-resolution images

Sometimes when harvesting images from other websites, special techniques or tricks may be useful to get the best quality images. This page documents some of these techniques.

Whichever of the following methods you use, if you are harvesting high resolution images for replacement or addition of images to a DP project, make sure that you contact db-req and arrange a way to have the images added to the project. Do not send images to db-req as an attachment.

Internet Archive

Note

This section is outdated and being revised -- 2023-05-11

a.k.a. TIA, a.k.a. IA, a.k.a. archive.org

Typically when harvesting from the Internet Archive, you will download a complete set of images for the project. However, at times you may only need a small number of pages, and don't need the entire set of book scans. Some techniques for doing this are described below. Take into consideration the number of images you need, the amount of time it would take to download the whole zip or tar, and the amount of space you have to store the files.

For purposes of these instructions, an example is Alice's Adventures in Wonderland. Scroll through and find a likely looking illustration. If you don't have another archive in mind, you can use this for practice.

Using Internet Archive's interface

The instructions below for accessing images via the read online interface are still valid, but for accessing raw images, where available, the best option is to scroll down the landing page for the book in question, and look for the "Show All" link at the bottom of the sidebar on the right-hand side of the page. This will take you to a list of items in the directory, and, for zip or tar files, will allow you to view the contents of that file.

To harvest individual high-resolution images, click on the "(View Contents)" link beside the appropriate zip or tar archive. This is especially useful for harvesting individual raw images without having to download the whole set of raw images. (Raw images may be identified by the word "raw" or "orig" in the zip or tar filename.)

If the raw scans are jp2 files, they cannot be viewed directly in the browser, but these will be the full-sized images. There will be a link to a browser-viewable jpg for each jp2 that you can use to confirm that you have the correct image. Where there is also a jp2 file, the linked jpg is only half-sized: once you have found the correct image, you should download the jp2 file to use for the project.

"Get image links for a TIA item" form (superseded)

This section has been largely superseded by the improved Internet Archive interface as described above. The script is still available in noncvs as of May 2022, and instructions have been retained in a collapsed section in case they may be of use.


Note: As of February 2018, the script was moved into noncvs. The link below has been updated. (14:27, 21 February 2018 (EST))

Follow the instructions at the top of the Get image links for a TIA item form.

Accessing specific pages: If you know which page you want, you can specify that in the 4 digit TIA format. Then click on Create Link. You will see three entries. One for the listing of the contents of the zip or tar, one to the full-sized page image, and one to the medium-sized page image for the page specified.

Note that this is the only way of getting a full-sized .jpg file by harvesting if the source files are jp2. If you choose this method, you should choose the full-sized image for illustration images.

Accessing a list of all images: If you have not specified a page number, you will just see a link to the listing of the contents of the zip or tar when you click on Create Link.

If the source files are .jp2s, the list of the contents of the tar or zip will provide links to both .jp2 and .jpg files. Note that the dimensions of the .jpg files are half those of the .jp2 files. Full-sized .jpg files are not available using this method. If you are harvesting for high-res images, and cannot open .jp2 files, use the Accessing specific pages method above to harvest them. However, whichever OS you use, there should be some way for you to use .jp2 files.

Read Online

The first two methods (method 2 is a variation on method 1) use the Read Online version of the book. Find the image that you want to extract, then:

Method 1

Right-click (also control-click if you're using a Mac) and choose "View Image" (Firefox), "Open Image in a New Window" (Safari), or whatever the equivalent choice is in the browser you use (all major browsers have an equivalent command). In the location bar at the top of your new browser window, you should see something like the following:
http://ia311325.us.archive.org/BookReader/BookReaderImages.php?zip=/3/items/alicesadventur00carr/alicesadventur00carr_jp2.zip&file=alicesadventur00carr_jp2/alicesadventur00carr_0001.jp2&scale=4&rotate=0
Position your cursor at the end of the url. The last two items at the end of the url should be "&scale=4&rotate=0". The one we're concerned with is "scale". Change the "&scale=4" to "&scale=1",[1] and press return. The pixel count in the title bar should change to much larger numbers than were there before.
Save the image to your computer.

[1] Note that the original scale number can vary widely. 4 is rather typical, but 2, 8, and arcane numbers with multiple digits after the decimal point may sometimes appear. In all cases, just change the scale to 1. If the image opens from the Reader with the scale already equal to 2, you're not going to get a whole lot larger an image. You can also change the scale number to 0, but generally there doesn't seem to be a difference between scale=0 and scale=1.
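The url edit in Method 1 can also be done programmatically. The following is a minimal Python sketch, not part of any Internet Archive tooling: set_query_param is a hypothetical helper name, and it assumes the url has the same shape as the Alice example above, with "scale" as an ordinary query parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_query_param(url, name, value):
    """Return url with the query parameter `name` set to `value`.

    The parameter is replaced in place if present, or appended if absent.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    updated = []
    found = False
    for key, old_value in pairs:
        if key == name:
            updated.append((key, str(value)))
            found = True
        else:
            updated.append((key, old_value))
    if not found:
        updated.append((name, str(value)))
    # safe="/" keeps the slashes in the zip= and file= paths unescaped.
    return urlunsplit(parts._replace(query=urlencode(updated, safe="/")))


if __name__ == "__main__":
    reader_url = (
        "http://ia311325.us.archive.org/BookReader/BookReaderImages.php"
        "?zip=/3/items/alicesadventur00carr/alicesadventur00carr_jp2.zip"
        "&file=alicesadventur00carr_jp2/alicesadventur00carr_0001.jp2"
        "&scale=4&rotate=0"
    )
    print(set_query_param(reader_url, "scale", 1))
```

The same helper can change any other parameter in the url, whatever the starting scale value happens to be.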

Method 2

Using the zoom button in the IA Read Online interface, zoom all the way in on the page you want to save to your computer.
Use the right-click technique from Method 1 above to bring up a copy of the picture. Check the end of the url to make sure that it says &scale=1 (paranoid check).
Save the image to your computer.

Using All Files: HTTP

Note: Typically if the images are from Cornell University or Google, they're tif images, are already cropped and rotated, and do not have a raw version. As of 2010-08-09, neither of the following methods works if the raw images are in _jpg.tar form. Methods 3 and 4 are more useful if you wish to harvest the raw images.

Method 3

See the "Get image links for a TIA item" form section, above, for the current, updated version of this method. The instructions below are being left in place for the time being for potential use in refining the newer instructions.

Sometimes, you may want to check the raw image (assuming it is a project that has them)—for example, if the edge of the text is cut off in the book display copy of the page, checking the raw image may actually show that it's there, but was cropped off when the "read online" images were prepared.

This script allows you to pull images out of any Internet Archive zip file ending in _jp2.zip, _jp2.tar or _tif.zip.

Find the book at IA for which you wish to look at a raw image. In a separate browser window, open this page.
Over on the left-hand side of the IA window, click on All Files: HTTP at the bottom of the "View the book" section.
You'll see a number of files there. Close to the bottom of the list, you should see a file ending in something like _orig_jp2.tar, _raw_jp2.zip or _tif.zip.[1] Copy this link (right-click, or control-click on a Mac).
Paste it into the box provided for the url on the script page.
Guess at what page number you want—it will probably be a four digit number, and leading zeroes are important. TIA page numbering generally starts with 0001, with the cover of the book, so this page number will not correspond to either the printed page number or the DP project png number.
Under "Results" choose from Large image or Medium image. If this is the wrong page, you can either use your browser back button to go back and adjust the page number from the script page, or edit the url to change the page number if you're comfortable doing that.

[1] Generally, if you're using this script, you're looking for the original unprocessed scans. These will probably be labeled _raw_jp2.zip or _orig_jp2.tar, and will be toward the end of the list, if not the last item on it. This script can also be used to access other _jp2.zip files.

Method 4

Manipulate the url of the Book Reader image.

If you know where to look, there are three places in the url that you can edit to pull up the unprocessed image without using the script.

To use Alice as an example again, compare the first url with the second (The first url is the same one you would get using method 1 or 2 above—by right- or control-clicking on the image in the "read online" version, and bringing up that individual image in another window or tab):

http://ia311325.us.archive.org/BookReader/BookReaderImages.php?zip=/3/items/alicesadventur00carr/alicesadventur00carr_jp2.zip&file=alicesadventur00carr_jp2/alicesadventur00carr_0001.jp2&scale=1&rotate=0

http://ia311325.us.archive.org/BookReader/BookReaderImages.php?zip=/3/items/alicesadventur00carr/alicesadventur00carr_orig_jp2.tar&file=alicesadventur00carr_orig_jp2/alicesadventur00carr_orig_0001.jp2&scale=1&rotate=0

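Comparing the two urls, three places differ: the archive name (alicesadventur00carr_jp2.zip becomes alicesadventur00carr_orig_jp2.tar), the directory name (alicesadventur00carr_jp2 becomes alicesadventur00carr_orig_jp2), and the file name (alicesadventur00carr_0001.jp2 becomes alicesadventur00carr_orig_0001.jp2). In each place "orig" is added, and in the archive name .zip also changes to .tar.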

To figure out what to add/substitute into the url, check the "All Files: HTTP" section. This will tell you whether you're dealing with "orig" and "tar" or "raw" and "zip", or some other combination. If the uncropped, unrotated images are labeled with "raw" instead of "orig", and the archive is a .zip instead of a .tar, substitute "raw" where "orig" is in the above example, and .zip where .tar is.

Below is a table of substitutions for all of the forms that have been tested as of 23:13, 20 August 2010 (PDT). Take special note of the fact that for the raw/tar combination, the extension is in upper case: JPG.

Archive suffix	Label	Archive type	Image extension
_orig_jp2.tar:	orig	tar	jp2
_raw_jp2.zip:	raw	zip	jp2
_raw_jpg.tar:	raw	tar	JPG

Note that the final instance of .jp2 in the url above does not change for the first two cases; it changes only for _raw_jpg.tar, where it becomes .JPG.

You can also rotate the image by changing the "rotate=0" at the end of the url to "rotate=90" (note that you can also use 270 or 180—just in case you want to view the image upside down).
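The three substitutions in Method 4 can be sketched in Python as follows. This is an illustration only: raw_image_url is a hypothetical name, the regular expression assumes a Book Reader url shaped like the Alice example, and for the _raw_jpg.tar case the directory name inside the archive (item_raw_jpg) is an assumption inferred from the archive suffix rather than something the table above states.

```python
import re

# Archive suffix -> (label, image extension), taken from the table above.
SUBSTITUTIONS = {
    "_orig_jp2.tar": ("orig", "jp2"),
    "_raw_jp2.zip": ("raw", "jp2"),
    "_raw_jpg.tar": ("raw", "JPG"),
}

# Matches the zip= and file= parts of a processed Book Reader image url.
_READER_PARTS = re.compile(
    r"zip=(?P<dir>[^&]*/)(?P<item>[^/&]+)_jp2\.(?:zip|tar)"
    r"&file=(?P=item)_jp2/(?P=item)_(?P<leaf>\d+)\.jp2"
)


def raw_image_url(url, archive_suffix):
    """Rewrite a processed-image url so that it points at the raw scan."""
    label, image_ext = SUBSTITUTIONS[archive_suffix]
    # Directory inside the archive: the suffix without its extension
    # (assumed for _raw_jpg.tar; confirmed by the example for _orig_jp2.tar).
    stem = archive_suffix.rsplit(".", 1)[0]

    def rebuild(match):
        item = match["item"]
        return (
            f"zip={match['dir']}{item}{archive_suffix}"
            f"&file={item}{stem}/{item}_{label}_{match['leaf']}.{image_ext}"
        )

    new_url, count = _READER_PARTS.subn(rebuild, url)
    if count != 1:
        raise ValueError("url does not look like a Book Reader image url")
    return new_url


if __name__ == "__main__":
    processed = (
        "http://ia311325.us.archive.org/BookReader/BookReaderImages.php"
        "?zip=/3/items/alicesadventur00carr/alicesadventur00carr_jp2.zip"
        "&file=alicesadventur00carr_jp2/alicesadventur00carr_0001.jp2"
        "&scale=1&rotate=0"
    )
    print(raw_image_url(processed, "_orig_jp2.tar"))
```

Check the "All Files: HTTP" listing first to see which suffix applies to the item; the sketch does not look that up for you.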

Google Book images

Note

This section is outdated and being revised -- 2023-05-11

Firefox

If you don't have another title in mind, you can use Gleanings of Natural History, ... to practice on.

In your browser window, bring up the link to the book in which you are interested, or bring up the practice link above.
Find the image you wish to harvest, or use the example page linked to above.
Zoom in as far as you can using the Google books interface zoom tool.
From the menu bar, choose Tools → Page Info, and then click on the Media tab.
Using the list of links at the top of the page, scroll down till the image that you want appears in the bottom section of the window.
In the center section of the window, copy the link labeled Location: and paste it into the location bar in a new window or tab.
Save the image to your computer.

Note: At the end of the url is a w=nnnn value. You can change this number to a larger value than that achieved by zooming in with the Google books interface, and can save an image up to 2500 pixels wide. You can try saving both, but don't be surprised if the larger image is no better in quality than the image that you can get by using the Google books interface to zoom in as far as you can. In the case of the example used above, 1025 pixels wide is the best image you can get from this source.
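Changing the w= value can be sketched in Python as below. The function name with_width and the sample url are made up for illustration (the url is not a real Google Books image link); the 2500 limit is the figure quoted in the note above. A regular expression is used so that the rest of the url is left byte-for-byte unchanged.

```python
import re

MAX_WIDTH = 2500  # upper limit quoted in the note above


def with_width(url, width):
    """Return url with its w= value set to width, capped at MAX_WIDTH."""
    width = min(int(width), MAX_WIDTH)
    # Replace the digits that follow "?w=" or "&w=".
    new_url, count = re.subn(r"(?<=[?&]w=)\d+", str(width), url)
    if count == 0:
        # No w= parameter yet: append one.
        separator = "&" if "?" in url else "?"
        new_url = f"{url}{separator}w={width}"
    return new_url


if __name__ == "__main__":
    # Hypothetical url, for illustration only.
    sample = "https://books.google.com/books?id=HYPOTHETICAL_ID&pg=PA1&img=1&w=685"
    print(with_width(sample, 2500))
```

As the note says, a larger width does not guarantee more detail, so it is worth comparing the result with the zoomed-in image.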

Safari

By default, Safari has the Develop menu disabled. You'll need to enable it to use these instructions. Bring up the Preferences menu, click on the Advanced tab and check the checkbox at the bottom of the pane that says "Show Develop menu in menu bar". Note: these instructions were developed and tested on a Mac. It is not currently known how to do this in Safari for Windows.

Update 2011-08-01: Note that the instructions immediately following work through Safari 5.0. With 5.1, the "Inspect Element" interface has changed. Instructions for 5.1 are in process.

Working with the same sample title above, bring the page up in a browser window.
Ctrl-click or right-click on the image in the window, and choose Inspect Element. This will divide the window in half. The top half has the book you brought up, the bottom half has lots of information, most of which is irrelevant, with several tabs across the bar at the top of that section.
Choose the Resources tab. You will probably see a message telling you that you need to enable resource tracking. You have the choice of enabling it just for the current session, or for all subsequent sessions.
Once resource tracking is enabled, you'll see another set of tabs. Click on the one that says Images.
Go back to the top half of the window and scroll down until you see the image that you want to harvest. Notice that as you scroll down (and, it turns out, up) in the top half of the window, the information in the sidebar at the left side of the bottom pane grows, as evidenced by the changing size of the scroll-bar.
Back in the bottom half of the window, the right-hand frame should have two tabs: Headers and Content. Click on the Content tab, then scroll all the way down to the bottom of the left-hand frame, and (unless you spot the image you want, immediately), start with the last one in the list, and move up until you find it. If you stopped on it in the upper frame, it should be fairly close to the end of the list.
Switch from the Content tab to the Headers tab. Up at the top is the link you want, the Request URL.
Copy the Request URL and paste it into another window.

Take special note of the "w=nnn" at the end of the url. This is the width, in pixels, of the image. This number will vary, depending on how far in you are zoomed on the image, but is never more than around 1025. You can change this and get an image that is up to 2500 pixels wide.

Opera

These instructions are known to work for Opera 11, using Windows. They have not been tested on other versions or platforms. If you have tried on other versions/platforms and found the method to work (or not work), please update this paragraph.

Working with the same sample title above, bring the page up in a browser window.
Zoom in as far as you can using the Google books interface zoom tool.
This step will differ slightly depending on whether you have the Show Menu Bar option on or off:
- If you have "Show Menu Bar" option switched OFF, go to Menu → Page → Developer Tools → Cache.
- If you have "Show Menu Bar" option switched ON, go to View → Developer Tools → Cache.
- Or you can just type "opera:cache" in the address bar.
Scroll down to google.com and click on Preview.
In the list that appears, you will find thumbnail previews of cached images down the right side. It also gives info such as image size in pixels, and file size. Scroll down to the largest version of your image (you will have all the sizes stored in the cache, so you want to find the version with the highest resolution), which you can determine either by resolution size or file size. In the case of the example title, the one you want is 98KB and 1046 x 1353 pixels.
When you find the image, click on its link (which will say /books), and it will open the image. Alternatively, you can right-click on the link and then select Save Linked Content As, and skip the next step.
You can now save it as you would normally do: by right-clicking and then selecting Save image; or clicking File → Save as (Menu Bar ON) / Menu → Page → Save as (Menu Bar OFF); or pressing Ctrl+S.

Note that it is better to have a relatively empty cache when using this method, since it will be easier to find the correct image and you won't have a huge list to scroll through to find it.

Update (2015-04-21): Confirmed to work in Opera 12.16 under Mac OS X 10.8.5. Note also that you do not have to zoom in before previewing the Cache. All cached images should be there, regardless of your zoom level. If you wish, you can also manipulate the image size in the URL, as you can in Firefox and Safari. This may or may not gain you any increased detail.

HathiTrust Digital Library

The example used for the screenshots is the frontispiece from Louisa of Prussia and her times. Often Hathi images are viewable only in the US.

To harvest the largest available scans (this method should work for all browsers):

Find the image you want to harvest, and make sure that the sidebar is showing. Also make sure that the image you want is the one that will be downloaded; different pages may have different maximum sizes.
Open the "Download" menu. Your options will depend on the permissions for the particular book you're looking at. For most books you will only be able to download one image at a time. Choose the format (TIFF is the least lossy of the formats offered, and can be saved later in JPEG format for uploading to the project). Please download the highest resolution image available. Often, "High" and "Full" are the same, but if "Full" gives you a larger image, use that.
If the watermark at the bottom of the pages indicates that the original source was Google, open the "Get This Item" menu to see if there's a link to the Google original. If there is, check whether you can get a larger image directly from Google. For instance, for the frontispiece for Louisa of Prussia, from Hathi, the largest available through the web interface is 1418 x 2133 pixels; from Google, it's 1663 x 2500 pixels.
If the watermark at the bottom of the pages indicates that the original source was The Internet Archive, there will probably not be a link directly to the original. If the "Full" resolution image is 400 dpi or more, you may not be able to do better directly from TIA, but if it's only 300, please look for the corresponding scanset at The Internet Archive, as you will more than likely be able to get better and larger images from the original source there.

Section last updated 19:57, 11 May 2023 (EDT)

DownThemAll Extension

In order to download images from HathiTrust en masse, you will want to install the DownThemAll! extension for your browser. It is available for Mozilla Firefox and Microsoft Edge. After installing, it will be available from your Extensions menu, accessible by clicking on the puzzle piece icon in your browser's top bar.

Click on the puzzle piece icon to open the menu. For Firefox, then click on the gear icon, and in the popup menu choose Pin to Toolbar. For Edge it is faster; simply click on the pin icon next to the extension's name.

You will then see the DownThemAll arrow to the right of the Extensions menu in Firefox and to the left of the Extensions menu in Edge. Click on it and in the menu that opens, click on Preferences. A new tab in the browser will open, titled DownThemAll! Preferences. There are three tabs at the top: General, Filters, and Network. Click on Network.

Because HathiTrust throttles downloads from an individual, we will encounter server errors during the process and need to ensure that downloads are retried after a period of time. Under the Global Network Limits heading, set Concurrent downloads to 1, Number of retries of downloads on temporary errors to 99, and Retry every (in minutes) to 2. Changes are saved immediately and you can close the tab.

Open the DownThemAll menu again as before, clicking on its arrow icon. This time, click Manager, which will open another new tab titled DownThemAll! Manager.

DownThemAll Download Manager

In order to demonstrate the use of the DownThemAll download manager, we will use this book at HathiTrust, which is 24 images in total, including covers. The process will be similar for any book you wish to download. Take a look at the text in the top address bar for the book in question. In our example, it is https://babel.hathitrust.org/cgi/pt?id=njp.32101073965608&seq=5. Without getting too technical, this contains two name-value pairs of interest to us. The first is the id which has a value of njp.32101073965608.

This is a unique identifier for the book. The second name-value pair of interest is the seq which has a value of 5.

This numerical value is the number of the image we are looking at in the sequence of images. In this case it's the fifth. We need to note the value for the last image of the book, though. You can either look at the bottom of the website and note the second number in the X/X or click on the >> button to jump to the last page in the image set, at which point the seq in the address bar will update to the value.
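For readers who prefer to pull the two values out of the address programmatically, here is a minimal Python sketch using only the standard library. The function name hathi_id_and_seq is hypothetical, and the sketch assumes the address has the same form as the example above.

```python
from urllib.parse import parse_qs, urlsplit


def hathi_id_and_seq(url):
    """Return (id, seq) from a HathiTrust page-turner address."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each name to a list of values; take the first of each.
    return query["id"][0], int(query["seq"][0])


if __name__ == "__main__":
    address = "https://babel.hathitrust.org/cgi/pt?id=njp.32101073965608&seq=5"
    book_id, seq = hathi_id_and_seq(address)
    print(book_id, seq)
```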

On the DownThemAll! Manager tab, click the + sign button, which will open a popup window. In the Download box, you will need to enter in a value like this: https://babel.hathitrust.org/cgi/imgsrv/image?id={ID NUMBER};res=0;seq=[1:X]. Let's break this down. Copy and paste https://babel.hathitrust.org/cgi/imgsrv/image?id= into the box. Then copy and paste the id we noted in the address above (our example was njp.32101073965608). Copy and paste ;res=0;seq=[1:X] after that. Replace the X at the end with the number that corresponds to the last image for the book. In our example it is 24 and so the whole string of text for that book is https://babel.hathitrust.org/cgi/imgsrv/image?id=njp.32101073965608;res=0;seq=[1:24]

If you know you don't want front or back covers, or otherwise wish to download a subset of the images for a book, you can modify the starting and ending image numbers to your liking. So for this example we might decide to do 5 through 20 to avoid having the blank front/back covers. That would look like https://babel.hathitrust.org/cgi/imgsrv/image?id=njp.32101073965608;res=0;seq=[5:20]
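The batch string described above can be assembled, and its range expanded for checking, with a short Python sketch. The function names are hypothetical, and expand_batch only imitates what the text implies DownThemAll does with [first:last] (one download per number, both ends included); it is not DownThemAll's own code.

```python
import re

IMAGE_ENDPOINT = "https://babel.hathitrust.org/cgi/imgsrv/image"


def batch_url(book_id, first, last):
    """Build the string to paste into DownThemAll's Download box."""
    return f"{IMAGE_ENDPOINT}?id={book_id};res=0;seq=[{first}:{last}]"


def expand_batch(url):
    """List the individual image addresses a [first:last] range stands for."""
    match = re.search(r"\[(\d+):(\d+)\]", url)
    if match is None:
        return [url]
    first, last = int(match[1]), int(match[2])
    prefix, suffix = url[: match.start()], url[match.end():]
    # Both ends of the range are included, as in the 24-image example.
    return [f"{prefix}{number}{suffix}" for number in range(first, last + 1)]


if __name__ == "__main__":
    whole_book = batch_url("njp.32101073965608", 1, 24)
    print(whole_book)
    for address in expand_batch(batch_url("njp.32101073965608", 5, 20)):
        print(address)
```

Listing the expanded addresses is a quick way to confirm the first and last image numbers before starting a long download.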

The next step is to click on the Download button at the bottom of the window. You will get a notice that The current URL seems to contain instructions for a batch download, which is expected. Click on the Batch Download button to proceed. We will not do this as a single download because we need to have individual images retried if the server throttles us.

You will be taken back to the download manager and see the individual images being downloaded, along with progress bars. The first handful of images will download fine, then at some point you will see a progress bar turn red with a message Server Error next to it. That is the server throttling your progress. With the network settings we set before, we simply need to wait and it will try getting the rest of the files.
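What the network settings amount to (one download at a time, up to 99 retries, two minutes between attempts) can be illustrated with a small Python sketch. This is not how DownThemAll is implemented; download_with_retries is a hypothetical name, and fetch stands for any function you supply that returns the image data or raises OSError on a server error.

```python
import time


def download_with_retries(fetch, url, retries=99, wait_seconds=120, sleep=time.sleep):
    """Call fetch(url), waiting and retrying when it raises OSError.

    retries and wait_seconds mirror the settings above:
    99 retries, one attempt every 2 minutes.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries:
                raise  # out of retries: report the last error
            sleep(wait_seconds)


if __name__ == "__main__":
    attempts = []

    def flaky_fetch(url):
        # Stand-in for a throttled server: fails twice, then succeeds.
        attempts.append(url)
        if len(attempts) < 3:
            raise OSError("Server Error")
        return b"image data"

    # sleep is replaced so that the demonstration does not actually wait.
    data = download_with_retries(flaky_fetch, "example", sleep=lambda seconds: None)
    print(len(attempts), data)
```

Downloading one file at a time and pausing after each error is what keeps the server's throttling from turning into permanent failures.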

At some point it will finish, but you will see some stragglers that still have red bars. These errored out when the server started temporarily blocking us. Right-click on any of them and, in the menu that appears, choose Force Start. To make these troublemakers easier to find, you can first right-click one of the files and choose Remove Complete Downloads (just below Force Start in the same menu).

You may come to a point where even retrying failed downloads does not succeed. This is because the server is not simply throttling your progress, but halting it entirely. At that point you have to go to https://www.hathitrust.org/, where you'll see one of those verifications to ensure you're not a bot. Once you pass that (maybe just by moving the mouse or checking a box), DownThemAll will be able to resume downloading once you retry the downloads.

The files will be saved in the download folder you have set in your browser (e.g. Downloads on Windows by default), though you can specify a subfolder in the dialogue where the download address is entered prior to beginning the downloads.