Talk:Harvesting/Google Book Search

From DPWiki

status page

I'm not sure whether we should still refer to the rather huge project status page. Yes, it is huge, but it tends to go out of date pretty fast. --Keichwa 21:21, 20 May 2006 (PDT)

It is not updated at all right now. The person is creating a webpage to support this list. When it will be available is not known yet. --De2164 16:24, 25 May 2006 (PDT)

In the meantime, though, it's hoped that those of us harvesting from Google Book Search will log our harvesting activities on the Google_Book_Search_Coordination page so that, if nothing else, we don't duplicate efforts. Sihaya 11:46, 15 June 2006 (PDT)

Preparing Images for OCR

How can I make the google scans suitable for OCR purposes? I tried

mogrify -resample 300x300 *.png

See Image_cleanup. But this increases the file size considerably, and I'm not sure whether it actually improves the OCR results.--Keichwa 10:26, 15 June 2006 (PDT)

All that does is take the image at its current resolution and create more pixels in the same pattern as the original; it adds no detail that was not already in the scan. I have found no good way to improve the images yet. De2164 16:19, 21 June 2006 (PDT)
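The point about resampling can be illustrated with a small Python sketch. It is only a simplified stand-in: it enlarges a tiny 1-bit image by plain pixel replication, whereas ImageMagick's -resample applies an interpolation filter, and the function names and the sample image are made up for illustration. The idea it shows is the same, though: the pixel count (and so the file size) grows, while the original can be recovered exactly from the enlarged copy, so nothing new was added for the OCR to use.

```python
def upsample_nearest(pixels, factor):
    """Enlarge a 2D grid by repeating every pixel `factor` times in each direction."""
    out = []
    for row in pixels:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def downsample_nearest(pixels, factor):
    """Undo upsample_nearest by keeping every `factor`-th pixel."""
    return [row[::factor] for row in pixels[::factor]]


# A tiny hypothetical 1-bit "scan": 1 = ink, 0 = paper.
scan = [
    [0, 1, 0],
    [1, 1, 1],
]

big = upsample_nearest(scan, 4)
print(len(scan) * len(scan[0]), "pixels before")
print(len(big) * len(big[0]), "pixels after")
print("original recoverable:", downsample_nearest(big, 4) == scan)
```

With a factor of 4 the image holds 16 times as many pixels, yet shrinking it again gives back the original exactly.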

Hmm, I am running into similar problems. When I use ABBYY on the Google PDFs it creates 600+ dpi TIFFs, which are unnecessarily huge. I used "pdfimages", but I'm not sure what exactly the use of the PBM files it produces is. Gren 01:48, 30 December 2006 (PST)
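On the PBM question: PBM is the 1-bit (black and white) member of the Netpbm family of image formats, and it is what pdfimages writes by default for monochrome page images. The raw variant ("P4") is a short text header followed by the pixels packed eight to a byte, with each row padded to a whole byte and a 1 bit meaning black. The Python sketch below decodes such a file under some assumptions: the header is laid out as "P4", newline, "width height", newline, with no comment lines, and the function name and the sample bytes are invented for illustration.

```python
def read_pbm_p4(data):
    """Decode raw PBM ("P4") bytes into rows of 0/1 values (1 = black).

    Assumption: the header is "P4\\n<width> <height>\\n" with no '#' comment
    lines; the format allows comments, but this sketch does not handle them.
    """
    magic, size, body = data.split(b"\n", 2)
    if magic != b"P4":
        raise ValueError("not a raw PBM file")
    width, height = (int(n) for n in size.split())
    stride = (width + 7) // 8  # each row is padded to a whole number of bytes
    rows = []
    for y in range(height):
        packed = body[y * stride:(y + 1) * stride]
        # The most significant bit of each byte is the leftmost pixel.
        rows.append([(packed[x // 8] >> (7 - x % 8)) & 1 for x in range(width)])
    return rows


# Hypothetical 3x2 image: the first row is black-white-black.
sample = b"P4\n3 2\n" + bytes([0b10100000, 0b01000000])
for row in read_pbm_p4(sample):
    print(row)
```

Since the files are already pure black and white, they can be converted to TIFF or PNG for OCR without any loss.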

Background

I am not sure why you created the background section. It does nothing for the page, IMHO. De2164 16:34, 21 June 2006 (PDT)

Yeah, you can remove it. But I think the article is still too long and confusing.--Keichwa 21:05, 21 June 2006 (PDT)

gharvest (obsolete?)

gharvest does not work with the new GBS interface. This is the old description:

Now that Google allows an entire book to be downloaded as a PDF, the manual download or script should not be needed anymore. However, a few books seem to be fully available but still lack the PDF download option.
Google presents the page images for a book one page at a time. You could download all the images manually, or you could use the gharvest download script to harvest the images. gharvest is a Perl command line script written by bgalbrecht.
If gharvest is running when you get the verification screen, most likely it has stopped and cannot be restarted for 24 hours.