User:Bertzi
My books
DP Projects
http://dl.dropbox.com/u/28726210/dpprojects.html
My scripts
Features
- downloading TIA books at full bandwidth (several files at once), then extracting and renaming them, all with a single command
- generating proofing images (centering, cropping margins using unpaper)
- cropping the headers before doing ocr
- whitening out the illustrations (after manual cropping)
- it uses multiple cores, so a 400-page book should not take more than 1 hour on a computer with 4 cores (I am looking for someone who can test this: I have only 1 core, and it has already been tested on 2 cores)
- you can work on several projects at once, with a separate config file for each project (where you specify the resolution, alignments, margin sizes, etc.)
- the whole thing is pausable and resumable (useful during power failures)
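As a rough illustration of the multi-core feature, pages can be handed out to several worker processes at once. This is only a sketch of the general idea, not the actual script: the scratch directory, the dummy page files and the `cp` stand-in for the real per-page work are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch: run one job per page, at most $nr_cores at a time.
nr_cores=2                      # same idea as nr_cores in the config file
work=$(mktemp -d)               # scratch directory, for illustration only

# Create a few empty dummy pages to work on.
for n in 001 002 003 004; do
    : > "$work/$n.jp2"
done

# xargs -P starts up to $nr_cores processes in parallel;
# "cp" stands in for the real conversion of a single page.
printf '%s\n' "$work"/*.jp2 |
    xargs -P "$nr_cores" -n 1 sh -c 'cp "$1" "${1%.jp2}.png"' _

ls "$work"
```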
Installing the necessary tools
You can find my latest scripts here. Download and extract them to a directory (this directory will be the working directory for all projects).
List of tools
- aria2 (For faster downloading...)
- axel (Another alternative for faster downloading...)
- imagemagick (For image conversion...)
- exact-image (Faster image conversion...)
- optipng (For smaller proofing images...)
- unpaper (Preparing images: centering, cropping margins etc....)
- tesseract (For OCR-ing...)
Archlinux packages
If you use Archlinux, simply run this command:
sudo pacman -S aria2 axel imagemagick optipng unpaper tesseract
Here is a patched tesseract-svn-622-1-i686.pkg.tar.xz package for Archlinux, which inserts blank lines between paragraphs. Download it, then install it with the following command:
sudo pacman -U tesseract-svn-622-1-i686.pkg.tar.xz
Or you can compile it yourself with makepkg, downloading the tarballs from AUR: https://aur.archlinux.org/packages.php?ID=10320 and https://aur.archlinux.org/packages.php?ID=28718 (don't forget to apply the patch after the svn checkout)
I also suggest reading this and setting up a config file in tessdata/configs with whitelisted characters, to avoid ending up with three different quotation marks, for example.
Another package that is missing from the official repos is exactimage-svn-1812-1-i686.pkg.tar.xz; install it with the following command:
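A whitelist config could look roughly like the sketch below. `tessedit_char_whitelist` is a standard tesseract variable, but the character set chosen here is only an assumption (adjust it to your book), and the file is written to a scratch directory rather than to the real tessdata/configs so the example is safe to run. The config name bertzi matches the ocr_config value used in the default config file further down this page.

```shell
#!/bin/sh
# Hypothetical sketch: write a tesseract config file that whitelists characters.
cfg_dir=$(mktemp -d)            # stand-in for your tessdata/configs directory

# One "variable value" pair per line; only straight quotes are allowed here,
# so curly quotation marks cannot appear in the OCR output.
cat > "$cfg_dir/bertzi" <<'EOF'
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789.,;:!?()-'"
EOF

cat "$cfg_dir/bertzi"
```

Once the file is copied into tessdata/configs, it is selected by giving its name as the last argument, for example `tesseract page.tif page -l eng bertzi`.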
sudo pacman -U exactimage-svn-1812-1-i686.pkg.tar.xz
Ubuntu packages
On Ubuntu, unfortunately, I couldn't compile tesseract (and the official package is quite an old version), but you can still use my script for generating proofing images. Install the tools with:
sudo apt-get install aria2 axel imagemagick optipng unpaper
Downloading from TIA, and setting up the project folder
Video tutorial
Examples
$ ./cp -d http://ia600509.us.archive.org/15/items/newtendencyinart00poor/newtendencyinart00poor_jp2.zip
This downloads the zip file to the current directory, creates a new directory (newtendencyinart00poor), extracts the zip into the jp2 subdirectory and renames all the jp2 files to 001.jp2, 002.jp2... It also does the same with the jpg files.
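The renaming step described above can be pictured with the following sketch. It is not the code of the cp script itself: the scratch directory and the dummy input file names are assumptions made so that the example runs without downloading anything.

```shell
#!/bin/sh
# Hypothetical sketch: renumber extracted page files as 001.jp2, 002.jp2, ...
book=$(mktemp -d)               # stand-in for the project directory
mkdir "$book/jp2"

# Dummy files imitating what an extracted zip might contain (names assumed).
for n in 0000 0001 0002; do
    : > "$book/jp2/newtendencyinart00poor_$n.jp2"
done

i=1
for f in "$book"/jp2/*.jp2; do          # the glob expands in sorted order
    new=$(printf '%03d.jp2' "$i")       # zero-padded to three digits
    mv "$f" "$book/jp2/$new"
    i=$((i + 1))
done

ls "$book/jp2"
```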
$ ./cp -dx http://ia600509.us.archive.org/15/items/newtendencyinart00poor/newtendencyinart00poor_jp2.zip
This one uses axel download accelerator instead of aria2.
$ ./cp -d links.txt
$ ./cp -dx links.txt
Here links.txt contains multiple links and must be placed inside the working directory:
http://ia600301.us.archive.org/26/items/newtendencyinart00pooriala/newtendencyinart00pooriala_jp2.zip
http://ia600509.us.archive.org/15/items/newtendencyinart00poor/newtendencyinart00poor_jp2.zip
These commands do the same as the previous two, but for multiple books at once.
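To show how a list of links maps to project directories, here is a small sketch that reads the URLs and derives each directory name from the zip file name. How the cp script actually parses links.txt is not documented here, so the loop and the variable names are assumptions; the sketch accepts URLs separated by spaces or newlines.

```shell
#!/bin/sh
# Hypothetical sketch: derive one project directory name per link.
links=$(mktemp)                 # stand-in for links.txt
cat > "$links" <<'EOF'
http://ia600301.us.archive.org/26/items/newtendencyinart00pooriala/newtendencyinart00pooriala_jp2.zip
http://ia600509.us.archive.org/15/items/newtendencyinart00poor/newtendencyinart00poor_jp2.zip
EOF

projects=""
for url in $(cat "$links"); do
    zip=$(basename "$url")      # e.g. newtendencyinart00poor_jp2.zip
    name=${zip%_jp2.zip}        # strip the suffix to get the directory name
    projects="$projects $name"
    echo "$url -> $name/"
done
```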
Setting up the config file
Video tutorial
Here is a default config file:
# default values
resolution=1000 # 1000
pre_crop=50x50 # 50x50
top_margin=2 # 2 (cm)
bottom_margin=2 # 2 (cm)
threshold=auto # 50 (or auto)
exact_image=yes # yes
nr_cores=1 # 1 (1, 2 and 4 are supported)
crop=50x50
header_crop=120
skip_pages=,
blank_pages=,
align_center_pages=,
align_bottom_pages=,
rotate_cw_pages=,
rotate_ccw_pages=,
double_illos=,
triple_illos=,
ocr_single_block=yes # change to no if you want column detection
ocr_language=eng
ocr_config=bertzi #see the wiki page of tesseract
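Because every setting is a plain key=value line with optional # comments, a file like this happens to be valid shell and could be read by sourcing it. The sketch below shows that idea on a shortened copy; whether the cp script reads its config this way is an assumption, and the comma-separated page numbers in skip_pages are made-up values, since the default file only shows an empty list.

```shell
#!/bin/sh
# Hypothetical sketch: load a key=value config by sourcing it.
conf=$(mktemp)                  # stand-in for a project's config file
cat > "$conf" <<'EOF'
resolution=1000 # 1000
nr_cores=1 # 1 (1, 2 and 4 are supported)
skip_pages=1,2,5,
ocr_language=eng
EOF

. "$conf"                       # each line becomes a shell variable

# Split the comma-separated page list (format assumed) into separate items.
old_ifs=$IFS
IFS=,
set -- $skip_pages
IFS=$old_ifs
skip_count=$#

echo "resolution=$resolution cores=$nr_cores skipping $skip_count pages: $*"
```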
Copy it into the project folder (every project has its own config file). To test it, make a new directory (test-config) with a jp2 subdirectory and copy a few files there (you don't want to run the test script on the whole book, only on 3-4 pages). Then you can run a test command to see the results with the default config file:
$ ./cp -t test-config
This command generates the png, tif, and txt files (and a few more temporary files). You can easily compare the different results, then use the final config file for the whole book in the next step.
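Preparing the test-config folder described above might look like the sketch below. It works in a scratch directory with empty dummy pages so that it can be run safely; in real use the source would be your project's jp2 folder. Copying the config file is left out because its file name is not stated on this page.

```shell
#!/bin/sh
# Hypothetical sketch: build a small test project from a few pages of a book.
work=$(mktemp -d)               # stand-in for the working directory

# Dummy project with six empty pages, for illustration only.
mkdir -p "$work/newtendencyinart00poor/jp2"
for n in 001 002 003 004 005 006; do
    : > "$work/newtendencyinart00poor/jp2/$n.jp2"
done

# Create test-config/jp2 and copy only three pages into it.
mkdir -p "$work/test-config/jp2"
for n in 001 002 003; do
    cp "$work/newtendencyinart00poor/jp2/$n.jp2" "$work/test-config/jp2/"
done

ls "$work/test-config/jp2"
```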
Generating the proofing images (png) and images for ocr (tif)
Video tutorial
$ ./cp -p newtendencyinart00poor
Ocr-ing
$ ./cp -o newtendencyinart00poor
Preparing the Illustrations
You have to make a folder called illos-jpg and move the files that contain illustrations from the jpg folder into it, then run
$ ./cp -i newtendencyinart00poor
This will generate high quality grayscale 8 bit images in a folder called png-8bit.
The rest is in progress...