User:Bertzi

From DPWiki

My books

DP Projects

http://dl.dropbox.com/u/28726210/dpprojects.html

My scripts

Features

  1. downloading TIA books using the full bandwidth (several files are downloaded at once), then extracting and renaming them, all with one command
  2. generating proofing images (centering, cropping margins using unpaper)
  3. cropping the headers before doing OCR
  4. whitening out the illustrations (after manual cropping)
  5. it uses multiple cores, so a 400-page book should not take more than 1 hour on a computer with 4 cores (I am looking for someone who can test this: I have only 1 core, and it has already been tested with 2 cores)
  6. you can work on several projects at once, and you can set up a separate config file for each project (where you specify the resolution, alignments, size of margins etc.)
  7. the whole thing is pausable and resumable (useful during power failures)

Installing the necessary tools

You can find my latest scripts here. Download and extract them to a directory (this directory will be the working directory for all projects).

List of tools

  1. aria2 (For faster downloading...)
  2. axel (Another alternative for faster downloading...)
  3. imagemagick (For image conversion...)
  4. exact-image (Faster image conversion...)
  5. optipng (For smaller proofing images...)
  6. unpaper (Preparing images: centering, cropping margins etc....)
  7. tesseract (For OCR-ing...)

Archlinux packages

If you use Archlinux, simply run this command:

sudo pacman -S aria2 axel imagemagick optipng unpaper tesseract

Here is a patched tesseract-svn-622-1-i686.pkg.tar.xz package for Archlinux, which inserts blank lines between paragraphs. Download it, then install it with the following command:

sudo pacman -U tesseract-svn-622-1-i686.pkg.tar.xz

Or you can compile it yourself with makepkg, downloading the tarballs from AUR: https://aur.archlinux.org/packages.php?ID=10320 and https://aur.archlinux.org/packages.php?ID=28718 (don't forget to apply the patch after the svn checkout).

I also suggest reading this and setting up a config file in tessdata/configs with whitelisted characters, to avoid getting three different quotation marks, for example.
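A minimal sketch of such a whitelist config, written from the shell. `tessedit_char_whitelist` is a standard Tesseract variable; the character set, the helper name `write_whitelist_config` and the example directory are assumptions, so adjust them to your installation and to the language of the book.

```shell
#!/bin/sh
# Sketch: create a Tesseract config file that limits the recognised characters.
# Only one kind of single quote and one kind of double quote is whitelisted,
# so the OCR output cannot contain several quotation mark variants.
write_whitelist_config() {
    configs_dir=$1   # e.g. /usr/share/tessdata/configs (assumed location)
    name=$2          # name of the config file, e.g. bertzi
    mkdir -p "$configs_dir" || return 1
    # Quoted heredoc delimiter: the line is written literally, nothing is expanded.
    cat > "$configs_dir/$name" <<'EOF'
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789.,;:!?()[]-'"&
EOF
}
```

After `write_whitelist_config /usr/share/tessdata/configs bertzi` (probably as root), the config can be passed by name at the end of the command line, for example `tesseract 001.tif 001 -l eng bertzi`.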

Another package that is missing from the official repos is exactimage-svn-1812-1-i686.pkg.tar.xz; install it with the following command:

sudo pacman -U exactimage-svn-1812-1-i686.pkg.tar.xz

Ubuntu packages

For Ubuntu, unfortunately, I couldn't compile tesseract (and the official package is quite an old version), but you can still use my scripts for generating proofing images. Install the tools with:

sudo apt-get install aria2 axel imagemagick optipng unpaper

Downloading from TIA, and setting up the project folder

Video tutorial

Examples

$ ./cp -d  http://ia600509.us.archive.org/15/items/newtendencyinart00poor/newtendencyinart00poor_jp2.zip

This downloads the zip file to the current directory, makes a new directory (newtendencyinart00poor), extracts the zip to the jp2 subdirectory and renames all the jp2 files to 001.jp2, 002.jp2... It also does the same with the jpg files.
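For readers who want to see what these steps amount to, here is a rough sketch done by hand. It is not the code of the ./cp script: the function names and the connection count of 4 are assumptions, while `aria2c -x/-s` and `unzip -j/-d` are standard options of those tools.

```shell
#!/bin/sh
# Sketch: download one TIA zip, unpack it and renumber the pages.

# Download with several connections and unpack into <id>/jp2.
fetch_and_unpack() {
    url=$1
    zip=$(basename "$url")            # newtendencyinart00poor_jp2.zip
    id=$(basename "$url" _jp2.zip)    # newtendencyinart00poor
    aria2c -x 4 -s 4 "$url" || return 1
    mkdir -p "$id/jp2" || return 1
    # -j drops the folder stored inside the zip, -d sets the target directory
    unzip -j -q "$zip" -d "$id/jp2"
}

# Rename every *.<ext> file in a directory to 001.<ext>, 002.<ext>, ...
# in the sorted order of the original names.
renumber_pages() {
    dir=$1
    ext=$2
    n=0
    for f in "$dir"/*."$ext"; do
        [ -e "$f" ] || continue       # the glob matched nothing
        n=$((n + 1))
        new=$(printf '%s/%03d.%s' "$dir" "$n" "$ext")
        [ "$f" = "$new" ] || mv -- "$f" "$new"
    done
}
```

With these definitions, `fetch_and_unpack <url>` followed by `renumber_pages newtendencyinart00poor/jp2 jp2` gives roughly the same layout as the command above.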

$ ./cp -dx  http://ia600509.us.archive.org/15/items/newtendencyinart00poor/newtendencyinart00poor_jp2.zip

This one uses the axel download accelerator instead of aria2.

$ ./cp -d links.txt
$ ./cp -dx links.txt

Here links.txt contains multiple links and must be placed inside the working directory:

http://ia600301.us.archive.org/26/items/newtendencyinart00pooriala/newtendencyinart00pooriala_jp2.zip
http://ia600509.us.archive.org/15/items/newtendencyinart00poor/newtendencyinart00poor_jp2.zip

These commands do the same as the previous two, but for multiple books at once.
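To check in advance which project directories a links.txt will produce, the names can be derived from the URLs. This is only a sketch: that the directory name is the zip name without `_jp2.zip` matches the example above, but the helper name `book_ids_from_links` is hypothetical and not part of the scripts.

```shell
#!/bin/sh
# Sketch: print one project directory name per link in a links file.
book_ids_from_links() {
    # "|| [ -n "$url" ]" also handles a last line without a trailing newline
    while IFS= read -r url || [ -n "$url" ]; do
        [ -n "$url" ] || continue         # skip empty lines
        basename "$url" _jp2.zip          # strip the path and the _jp2.zip suffix
    done < "$1"
}
```

Calling `book_ids_from_links links.txt` on the file above prints newtendencyinart00pooriala and newtendencyinart00poor, one per line.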

Setting up the config file

Video tutorial

Here is a default config file:

			# default values
resolution=1000		# 1000
pre_crop=50x50		# 50x50
top_margin=2		# 2 (cm)
bottom_margin=2		# 2 (cm)
threshold=auto		# 50 (or auto)
exact_image=yes		# yes

nr_cores=1		# 1 (1, 2 and 4 are supported) 
crop=50x50		
header_crop=120		

skip_pages=,

blank_pages=,
align_center_pages=,
align_bottom_pages=,

rotate_cw_pages=,
rotate_ccw_pages=,

double_illos=,
triple_illos=,

ocr_single_block=yes	# change to no if you want column detection
ocr_language=eng
ocr_config=bertzi	#see the wiki page of tesseract
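The file is a plain list of key=value lines with optional # comments. A small sketch for checking what value a key currently has, which can help when comparing configs of several projects. The helper name `config_value` and the file path in the usage line are hypothetical; how the scripts themselves read the file is not shown here.

```shell
#!/bin/sh
# Sketch: print the value of one key from a key=value config file,
# ignoring trailing whitespace and "# ..." comments.
config_value() {
    file=$1
    key=$2
    # Keep everything after "key=" up to the first whitespace or '#'.
    sed -n "s/^${key}=\([^#[:space:]]*\).*/\1/p" "$file" | head -n 1
}
```

For example, `config_value myproject/config resolution` prints the resolution set for that project (the path is a placeholder for wherever your config file lives).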

Copy the config file into the project folder (every project has its own config file). To try out the settings, make a new directory (test-config) with a jp2 subdirectory and copy a few files there (you don't want to run the testing script on the whole book, only on 3-4 pages). Then you can run a test command to see the results with the default config file:

$ ./cp -t test-config

This command generates the png, tif and txt files (and a few more temporary files). You can easily compare the different results, then run the proper config file on the whole book in the next step.
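The test-config folder used above can be prepared by hand or with a few lines of shell. The following is a sketch only; the function name `make_test_project` and the page numbers in the usage line are made up for illustration, and the config file still has to be copied in separately.

```shell
#!/bin/sh
# Sketch: copy a handful of pages from an existing project into a test project.
# Usage: make_test_project <source project> <test project> <page> [<page> ...]
make_test_project() {
    src=$1
    dest=$2
    shift 2
    mkdir -p "$dest/jp2" || return 1
    for page in "$@"; do
        # Pages are named 001.jp2, 002.jp2, ... after the download step.
        cp -- "$src/jp2/$page.jp2" "$dest/jp2/" || return 1
    done
}
```

For example, `make_test_project newtendencyinart00poor test-config 010 011 012` copies three pages, after which `./cp -t test-config` can be run.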

Generating the proofing images (png) and images for ocr (tif)

Video tutorial

$ ./cp -p newtendencyinart00poor

Ocr-ing

$ ./cp -o newtendencyinart00poor

Preparing the Illustrations

You have to make a folder called illos-jpg and move the files that contain illustrations into it from the jpg folder, then run

$ ./cp -i newtendencyinart00poor

This will generate high-quality 8-bit grayscale images in a folder called png-8bit.
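The manual step before running the command (creating illos-jpg and moving the illustration pages into it) could be done like this. A sketch only: it assumes illos-jpg sits inside the project folder next to jpg, and the function name `collect_illos` and the page numbers in the usage line are hypothetical.

```shell
#!/bin/sh
# Sketch: move the pages that contain illustrations from jpg/ to illos-jpg/.
# Usage: collect_illos <project> <page> [<page> ...]
collect_illos() {
    project=$1
    shift
    mkdir -p "$project/illos-jpg" || return 1
    for page in "$@"; do
        mv -- "$project/jpg/$page.jpg" "$project/illos-jpg/" || return 1
    done
}
```

For example, `collect_illos newtendencyinart00poor 007 023` moves pages 007 and 023, and `./cp -i newtendencyinart00poor` can be run afterwards.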

Rest is in progress...