Talk:Harvesting/Google Book Search

From DPWiki

status page

I'm not sure whether we should still refer to the rather huge project status page. Yes, it is huge, but it tends to go out of date pretty fast. --Keichwa 21:21, 20 May 2006 (PDT)

It is not updated at all right now. The person is creating a webpage to support this list. When it will be available is not known yet. --De2164 16:24, 25 May 2006 (PDT)

In the meantime, though, it's hoped that those of us harvesting from Google Book Search will log our harvesting activities on the Google_Book_Search_Coordination page so that, if nothing else, we don't duplicate efforts. Sihaya 11:46, 15 June 2006 (PDT)

Preparing Images for OCR

How can I make the Google scans suitable for OCR purposes? I tried

mogrify -resample 300x300 *.png

See Image_cleanup. But this increases the file size considerably, and I'm not sure whether it actually improves the OCRibility. --Keichwa 10:26, 15 June 2006 (PDT)

All that does is take the pixels at the current resolution and create more pixels in the same pattern, so it adds no detail that was not already in the scan. I have found no good way to improve the images yet. De2164 16:19, 21 June 2006 (PDT)
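To illustrate the point above, here is a minimal sketch in plain Python (not ImageMagick itself). It assumes the simplest possible resampling, nearest-neighbour pixel repetition; mogrify uses a smoother filter, but its new pixels are likewise computed only from the old ones. The function names and the tiny sample "page" are made up for the example.

```python
def upsample_nearest(pixels, factor):
    """Enlarge a 2-D grid of grey values by repeating each pixel factor x factor times."""
    out = []
    for row in pixels:
        # Repeat every value horizontally...
        wide = [value for value in row for _ in range(factor)]
        # ...then repeat the widened row vertically.
        out.extend(list(wide) for _ in range(factor))
    return out


def downsample(pixels, factor):
    """Keep every factor-th pixel in both directions."""
    return [row[::factor] for row in pixels[::factor]]


# Hypothetical 3x2 "scan": 0 = black, 255 = white.
page = [
    [0, 255, 0],
    [255, 0, 255],
]

big = upsample_nearest(page, 4)
print(len(big[0]), "x", len(big))   # 16 times as many pixels to store
print(downsample(big, 4) == page)   # the original is recovered exactly
```

The enlarged grid holds 16 times as many pixels, yet throwing the extra ones away gives back exactly the original, so nothing new was added for the OCR engine to work with. The same arithmetic explains the file sizes: tripling the resolution in each direction, for instance, means nine times as many pixels.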

Hmm, I am running into similar problems. When I use ABBYY on the Google PDFs it creates 600+ dpi TIFFs, which are unnecessarily huge. I used "pdfimages" but I'm not sure what exactly the use of the pbm is. Gren 01:48, 30 December 2006 (PST)
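On the pbm question: assuming the tool meant here is pdfimages from Xpdf/Poppler, the .pbm files it writes are Netpbm "portable bitmap" files, i.e. the embedded page images saved as pure black-and-white (1 bit per pixel). The sketch below reads the plain-text variant of the format (magic number P1) just to show how simple it is; files produced by extraction tools are normally the binary variant (P4), which this sketch does not handle. The function name and the sample bitmap are made up for the example.

```python
def parse_plain_pbm(text):
    """Parse a plain (P1) PBM file into rows of bits, where 1 = black and 0 = white."""
    tokens = []
    for line in text.splitlines():
        # Everything after '#' on a line is a comment.
        tokens.extend(line.split("#", 1)[0].split())
    if not tokens or tokens[0] != "P1":
        raise ValueError("not a plain PBM file")
    width, height = int(tokens[1]), int(tokens[2])
    # Whitespace between the bits is optional, so join before splitting into digits.
    bits = [int(ch) for ch in "".join(tokens[3:])]
    if len(bits) != width * height:
        raise ValueError("pixel count does not match the header")
    return [bits[r * width:(r + 1) * width] for r in range(height)]


# Hypothetical 4x3 bitmap.
sample = """P1
# a tiny example
4 3
0 1 1 0
1 0 0 1
0110
"""

grid = parse_plain_pbm(sample)
for row in grid:
    print("".join("#" if bit else "." for bit in row))
```

Since a pbm holds only black and white, it is a reasonable input for OCR on text pages, and it can be converted to TIFF or PNG with the usual image tools if the OCR program does not read it directly.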


I am not sure why you created the background section. It does nothing for the page, IMHO. De2164 16:34, 21 June 2006 (PDT)

Yeah, you can remove it. But I think the article is still too long and confusing. --Keichwa 21:05, 21 June 2006 (PDT)

gharvest (obsolete?)

gharvest does not work with the new GBS interface. This is the old description:

Now that Google allows the download of the entire book as a PDF, the manual download or the script should not be needed anymore. However, a few books seem to be fully available but still lack the PDF download option.
Google presents the page images for a book one page at a time. You could download all the images manually, or you could use the gharvest download script to harvest the images. gharvest is a Perl command line script written by bgalbrecht.
If gharvest is running when you get the verification screen, most likely it has stopped and cannot be restarted for 24 hours.