User:Srjfoo/scribbles
Guiguts has several features to support Content Providers, including functionality similar to guiprep. These features are accessed via the Content Providing submenu of the File menu.
Import Prep Text Files
This is used to import individual text files for pre-processing by Content Providers before the files are first uploaded for proofreading.
Guiguts presents a file-open dialog. Navigate to the folder containing the text files and click Open. Guiguts searches that folder for files with names in the expected format and loads the contents of each in numeric sequence. It records a page separator between each file's data. After loading the file(s), Guiguts goes directly to the "file save as" dialog so that you can save the concatenated file in the parent directory, and it creates a .json file for it.
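The loading step described above can be sketched roughly as follows. This is only an illustration, not Guiguts' own code: the function name concatenate_pages, the assumption that the "expected format" means all-digit names such as 001.txt, and the exact separator layout are all assumptions made for the example. The sketch writes a separator ahead of each page's text.

```python
import re
from pathlib import Path

# Assumed separator layout; the line Guiguts actually writes may differ.
SEPARATOR = "-----File: {name}.png-----"


def concatenate_pages(folder):
    """Join every file named like 001.txt in `folder`, in numeric order."""
    pages = [p for p in Path(folder).iterdir()
             if re.fullmatch(r"\d+\.txt", p.name)]
    # Sort on the numeric value so that 2.txt comes before 10.txt.
    pages.sort(key=lambda p: int(p.stem))
    chunks = []
    for page in pages:
        chunks.append(SEPARATOR.format(name=page.stem))
        chunks.append(page.read_text(encoding="utf-8").rstrip("\n"))
    return "\n".join(chunks) + "\n"
```

Files whose names do not match the pattern are simply skipped, which mirrors the description of Guiguts searching only for files in the expected format.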
Dehyphenation
This tool has two modes. Run without a dictionary, it uses only a list of the unique words in the project and a few simple rules, such as dehyphenating "******-ing". You can also tell it to use the dictionary for the language of the current document, in which case it will use both sources of information.
As with removing page headers and footers, the list can be sorted by line number, which will interleave message types, or alphabetically, which more clearly delineates which is which.
You'll see a list of words, each marked either "Keep" (keep the hyphen and move the second half of the word up) or "Remove" (remove the hyphen and move the second half of the word up). Most of the "Remove" words are likely to be correctly identified; the "Keep" group will likely have a higher percentage misidentified, in which case you can change an entry from "Keep" to "Remove".
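As a rough illustration of how a Keep/Remove decision might be reached, here is a minimal sketch. The function classify_hyphenated, the order of the checks, and the single "-ing" suffix rule are assumptions for the example; the rules Guiguts really applies are not documented here and may well differ.

```python
def classify_hyphenated(first, second, project_words, dictionary=None):
    """Return "Remove" or "Keep" for a word split as first- / second."""
    # Pool the project's word list with the optional language dictionary.
    known = {w.lower() for w in project_words}
    known |= {w.lower() for w in (dictionary or ())}
    joined = (first + second).lower()
    hyphenated = (first + "-" + second).lower()
    if joined in known:
        return "Remove"   # the joined form is seen elsewhere
    if hyphenated in known:
        return "Keep"     # the hyphenated form is seen elsewhere
    if second.lower() == "ing":
        return "Remove"   # simple suffix rule, as in "******-ing"
    return "Keep"         # no evidence either way: leave the hyphen
```

Defaulting to "Keep" when nothing is known is consistent with the note above that the "Keep" group tends to contain more misidentified words.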
Guiprep users will be most familiar with this process as two separate tabs, one for headers and one for footers. This tool does the same job, but does it somewhat differently. If you turn on Auto Image from the toolbar or the View menu, then as you view each header or footer, the scan image for the page will be displayed in the adjacent image viewer.
Header and footer removal in GG2 follows the general GG2 tool layout, and presents headers and footers all in the same menu. If you sort by line numbers, headers and footers will be interleaved, but sorting the list alphabetically separates headers from footers. The tool will also try to guess whether the top or bottom line of a page has a page number associated with it.
With the list sorted alphabetically, you can scroll through text and images together, identify lines that should not be removed and hide them, and then fix all remaining instances of each type separately.
Prep Text Filtering
The 80-odd text filtering options have been combined and condensed into 26 options, which include what was formerly "CP Character Substitutions". They can all be run at once, or turned on individually or in groups and run in separate passes.
Add [Blank Page] to Empty Pages
Does what it says: a blank page is defined as one with nothing between two page separators.
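The definition above can be sketched as follows. The separator layout checked by is_separator is an assumption, as are two choices made for simplicity: a page containing only whitespace counts as blank, and an empty final page (one followed by the end of the file rather than another separator) is marked too.

```python
BLANK = "[Blank Page]"


def is_separator(line):
    # Assumed separator layout; adjust to match the real file.
    return line.startswith("-----File:")


def mark_blank_pages(text):
    """Insert BLANK after every separator whose page has no content."""
    lines = text.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if not is_separator(line):
            continue
        # Gather this page's lines up to the next separator or end of file.
        page = []
        for nxt in lines[i + 1:]:
            if is_separator(nxt):
                break
            page.append(nxt)
        if not any(l.strip() for l in page):
            out.append(BLANK)
    return "\n".join(out) + "\n"
```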
Fix Olde Englifh
Use with care, and only on books that contain the long s (ſ) that was OCR'd as the letter "f". Every word in the file that contains the letter "f", and that would also be a legitimate word with the letter "s" in the same position, will have the "f" turned into an asterisk for manual inspection. For example, finger/singer, found/sound, fail/sail, left/lest.
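A minimal sketch of the substitution described above, assuming a plain set of lowercase words stands in for the dictionary. The name flag_long_s is hypothetical, and the sketch tests one "f" at a time, so a word that needs two letters changed at once would not be flagged; Guiguts' own behaviour in that case is not stated here.

```python
import re


def flag_long_s(text, dictionary):
    """Replace "f" with "*" wherever an "s" there gives a known word."""
    def fix(match):
        word = match.group(0)
        chars = list(word)
        for i, ch in enumerate(word):
            if ch == "f":
                # Try the word with "s" in this one position.
                candidate = word[:i] + "s" + word[i + 1:]
                if candidate.lower() in dictionary:
                    chars[i] = "*"
        return "".join(chars)

    return re.sub(r"[A-Za-z]+", fix, text)
```

For instance, with a dictionary containing "singer", "sound" and "lest", the words "finger", "found" and "left" become "*inger", "*ound" and "le*t", each of which then needs a human decision.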
Compress PNG files
In the Advanced tab in Settings, you can specify a command to compress the PNGs. The default command is for pngcrush. The tooltip gives the command for both pngcrush and optipng. If you clear the field, GG2 will attempt a built-in file compression, but it is nowhere near as good as either of the two mentioned. If you don't have either program, you'll need to install one; neither is packaged with GG2.
Renumber PNG files
The current version of this option is very basic, and will rename both the page separators and the .png files starting from 001.png. In order for the renumbering to work, the filenames and the page separator page labels must be in sync. Otherwise, you'll get a "Filename mismatch" error.
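The sync check and renumbering can be sketched like this. It is an illustration under assumptions: the separator layout matched by SEP_RE, the function name renumber, and the decision to return a rename plan instead of touching files are all choices made for the example.

```python
import re

# Assumed separator layout; the real one may differ.
SEP_RE = re.compile(r"^(-----File: )(\w+)(\.png-+)$", re.MULTILINE)


def renumber(text, png_names):
    """Return (new_text, renames) with pages numbered from 001."""
    labels = [m.group(2) for m in SEP_RE.finditer(text)]
    stems = sorted(n[:-4] for n in png_names if n.endswith(".png"))
    # Page labels and image files must describe the same set of pages.
    if sorted(labels) != stems:
        raise ValueError("Filename mismatch")
    mapping = {old: f"{i:03d}" for i, old in enumerate(labels, start=1)}
    new_text = SEP_RE.sub(
        lambda m: m.group(1) + mapping[m.group(2)] + m.group(3), text)
    renames = [(f"{old}.png", f"{new}.png") for old, new in mapping.items()]
    return new_text, renames
```

Applying the rename plan to real files needs care when an old name and a new name collide (for example 002.png becoming 001.png while 001.png still exists); renaming through temporary names is one way around that.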
This is probably most useful if your OCR is from TIA's Abbyy OCR. If you do your own OCR, or if it's provided by another volunteer, the files are probably already correctly numbered.
Export As Prep Text Files
After importing the prep text files and pre-processing the text using Guiguts' other features, this option lets you write the current document as separate files again. Navigate to, and select, the folder where you want to store the many small nnn.txt files. Guiguts writes a file for each page.
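Splitting the document back into per-page files can be sketched as below. The separator layout in SEP_RE and the function names split_pages and export_pages are assumptions for the example, not Guiguts' actual implementation.

```python
import re
from pathlib import Path

# Assumed separator layout; the real one may differ.
SEP_RE = re.compile(r"^-----File: (\w+)\.png-+$")


def split_pages(text):
    """Map each page label to the text that follows its separator."""
    pages = {}
    current = None
    for line in text.splitlines():
        m = SEP_RE.match(line)
        if m:
            current = m.group(1)
            pages[current] = []
        elif current is not None:
            pages[current].append(line)
    return {name: "\n".join(body) + "\n" for name, body in pages.items()}


def export_pages(text, folder):
    """Write one nnn.txt file per page into `folder`."""
    for name, body in split_pages(text).items():
        (Path(folder) / f"{name}.txt").write_text(body, encoding="utf-8")
```

Any text that appears before the first separator is ignored by this sketch, since it does not belong to a page.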
Import TIA Abbyy OCR File
This is specifically tailored to import Abbyy OCR files downloaded from TIA. Not all TIA scansets will have them. Investigation showed that the other formats available at TIA, and the OCR available via HathiTrust, all had problems (like missing end-of-line hyphens) that could not be overcome. This option is usable if the file is available, but it does not do well with tables.
CP Character Substitutions
Subsumed into "Filter File...". This section will be removed in a future version of the manual.