PPTools/Guiguts/Guiguts 2 Manual/Content Providing Menu



GUIGUTS VERSION 2 MANUAL

Describes features included in release 2.0.3 (September 2025)


Content Providing Submenu

Guiguts has several features to support Content Providers, including functionality similar to guiprep. These features are accessed via the Content Providing submenu of the File menu.

Content Providing submenu

Import Prep Text Files

This is used to import individual text files for pre-processing by Content Providers before the files are first uploaded for proofreading.

If you use guiprep and generate both textw and textwo files, import the textw files.

Guiguts presents a file-open dialog. Navigate to the folder containing the text files and click Open. Guiguts searches that folder for files with names in the expected format and loads the contents of each in numeric sequence, recording a page separator between each file's data. After loading the files, Guiguts goes directly to the "file save as" dialog so that you can save the concatenated file in the parent directory, and it creates a .json file for it.
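
The loading step described above can be pictured with a short Python sketch. This is an illustration only, not Guiguts's own code: the function name combine_pages is hypothetical, and both the digits-only file names (e.g. "001.txt") and the "-----File: 001.png-----" separator layout are assumptions made for the example.

```python
import re
from pathlib import Path


def combine_pages(folder: Path) -> str:
    """Concatenate page text files in numeric order, one separator per page."""
    # Assumption: page files have digits-only names such as "001.txt".
    pages = [p for p in folder.glob("*.txt") if re.fullmatch(r"\d+", p.stem)]
    # Sort numerically so that "10.txt" follows "9.txt".
    pages.sort(key=lambda p: int(p.stem))
    chunks = []
    for page in pages:
        # Assumption: a separator line naming the matching scan image.
        chunks.append(f"-----File: {page.stem}.png-----")
        chunks.append(page.read_text(encoding="utf-8").rstrip("\n"))
    return "\n".join(chunks) + "\n"
```

Files whose names do not fit the assumed pattern (a notes.txt, say) are simply skipped in this sketch.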

Dehyphenation

This tool has two modes. Run without a dictionary, it relies on a list of the unique words in the project; if the project is English, it also applies a few simple rules, such as dehyphenating "******-ing". Alternatively, you can tell it to use the dictionary for the language of the current document, in which case it will use both sources of information. See this section for information on additional dictionaries.
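
As a rough illustration of the dictionary-free mode, the Python sketch below decides between "Remove" and "Keep" for a word split across a line break. The function names and the exact decision rules are assumptions for the example; the logic Guiguts actually uses may differ.

```python
import re


def unique_words(text: str) -> set[str]:
    """Collect lower-cased words; internal hyphens and apostrophes are kept."""
    pattern = r"[^\W\d_]+(?:[-'][^\W\d_]+)*"
    return {w.lower() for w in re.findall(pattern, text)}


def classify_split_word(first: str, second: str, words: set[str],
                        english: bool = True) -> str:
    """Return "Remove" or "Keep" for a word split as first-<newline>second."""
    joined = (first + second).lower()
    hyphenated = f"{first}-{second}".lower()
    # The joined form occurs elsewhere in the project, the hyphenated form does not.
    if joined in words and hyphenated not in words:
        return "Remove"
    # Assumed simple English rule: a trailing "-ing" is dehyphenated.
    if english and second.lower() == "ing":
        return "Remove"
    return "Keep"
```

For example, with a project containing both "Something" and "well-known", a line-end "some-" followed by "thing" would be classed as "Remove", while "well-" followed by "known" would be classed as "Keep".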

As with removing page headers and footers, the list can be sorted by line number, which will interleave message types, or alphabetically, which more clearly delineates which is which.

You'll see a list of words split across line ends, each marked either "Keep" (keep the hyphen and move the second half of the word up) or "Remove" (remove the hyphen and move the second half up). Most of the "Remove" words are likely to be correctly identified; the "Keep" group is likely to contain a higher percentage of misidentified words, which you can change from "Keep" to "Remove" as follows. To change just one word from "Keep" to "Remove", or vice versa, select the word in the list and use "Keep⇔Remove" to swap its type. To change all the "Keep" entries in the list to "Remove", use the "All⇒Remove" button; the "All⇒Keep" button does the reverse.

Any entries that you do not want to dehyphenate (i.e. leave the word split across the line break with the end-of-line hyphen for the proofers to deal with) you can hide from the list, either with the Hide button, or by right-clicking it.

Once you are happy that the "Remove" entries in the list are the correct ones, you can dehyphenate all the "Remove" entries at once by selecting one of them and using "Fix All" or "Fix & Hide All". Similarly for the "Keep" entries once you are happy with them.

Header/Footer Removal

Guiprep users will be most familiar with this process as two separate tabs, one for headers and one for footers. This tool does the same job, but does it somewhat differently. If you turn on Auto Image from the toolbar or the View Menu, then as you view each header or footer, the scan image for the page will be displayed in the adjacent image viewer.

Header and footer removal in GG2 follows the general GG2 tool layout, and presents all headers and footers in the same dialog. By default, if you sort by line numbers, headers and footers will be interleaved, but there are checkboxes to control the inclusion of Odd Headers, Even Headers, Odd Footers and Even Footers in the list. Depending on the structure of your book, you may find it more convenient to work on just one of these 4 types at a time, or work on all the headers first, or work on all headers and footers sequentially.

Header-footer-options.png

In addition to listing whether each is an odd/even header/footer, Guiguts will also try to detect certain types of header/footer to make it easier to deal with them. For example, if a footer looks like a simple page number, e.g. "27" or "xiv", it will be reported as "Footer Page Number". If there is one mis-OCRed digit, e.g. "J23" for "123", it will be reported as "Footer Page Num?". If a header contains a (page) number and the remainder is allcaps, e.g. "THE STORY BEGINS 27", it will be reported as "Header Num & Allcap".
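
The detection rules above can be approximated with a couple of regular expressions. The Python sketch below is an assumption-laden illustration, not the detection code Guiguts uses: the function names are hypothetical, and "one mis-OCRed digit" is modelled as exactly one non-digit character among digits.

```python
import re

# A page number: all digits, or all Roman-numeral letters.
PAGE_NUMBER = re.compile(r"\d+|[ivxlcdm]+", re.IGNORECASE)
# Digits with exactly one non-digit, non-space character somewhere.
ONE_BAD_CHAR = re.compile(r"\d*[^\d\s]\d*")


def is_page_number(token: str) -> bool:
    return PAGE_NUMBER.fullmatch(token) is not None


def classify_header_footer(line: str, kind: str) -> str:
    """kind is "Header" or "Footer"; returns a label like those in the list."""
    text = line.strip()
    if is_page_number(text):
        return f"{kind} Page Number"
    if len(text) >= 2 and ONE_BAD_CHAR.fullmatch(text):
        return f"{kind} Page Num?"
    words = text.split()
    numbers = [w for w in words if is_page_number(w)]
    others = [w for w in words if not is_page_number(w)]
    # At least one number, and every remaining word in capitals.
    if numbers and others and all(w.isupper() for w in others):
        return f"{kind} Num & Allcap"
    return kind
```

One limitation of this sketch is that ordinary words made only of Roman-numeral letters, such as "MIX", are treated as page numbers.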

With the list sorted by Alpha/Type, each of the above types will be listed together. If you are happy that all the headers/footers of a particular type are true headers and footers, so you want to remove them all, select one of them and use "Fix All" or "Fix & Hide All".

If you spot a header/footer that has been misidentified, and you do not want it removed, hide it by right-clicking it (or select it and use the Hide button). Once you have hidden the headers/footers you do not want removed, you can use "Fix (& Hide) All" as described above.

Shortcut keys for the above buttons are described in the tooltips.

Prep Text Filtering

The 80-odd text filtering options have been combined and condensed into 26 options, which include what was formerly "CP Character Substitutions". They can all be run at once, or turned on individually or in groups and run in separate passes.

Filter-files-options.png

Upon completion, you will briefly see a message at the bottom of the screen in the log that indicates how many changes were made:

Log-confirmation-filter.png

Fix Common English Scannos

Checks through the file for common scannos (English only - do not use if the main language is not English) and fixes them.

Add [Blank Page] to Empty Pages

Adds [Blank Page] to every empty page, where an empty page is defined as one with nothing between two page separators.
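
A minimal Python sketch of this rule follows. It assumes that page separator lines start with "-----File: " and treats whitespace-only lines as "nothing"; both are assumptions for the example, and the function name is hypothetical.

```python
import re

# Assumption: page separator lines start with "-----File: ".
SEPARATOR = re.compile(r"-----File: ")


def add_blank_page_markers(text: str) -> str:
    """Insert "[Blank Page]" after a separator when the page is empty."""
    lines = text.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if not SEPARATOR.match(line):
            continue
        # Index of the next separator line, if there is one.
        nxt = next((j for j in range(i + 1, len(lines))
                    if SEPARATOR.match(lines[j])), None)
        # Empty page: only blank lines between this separator and the next.
        if nxt is not None and all(not l.strip() for l in lines[i + 1:nxt]):
            out.append("[Blank Page]")
    return "\n".join(out) + "\n"
```

Following the definition above literally, the sketch leaves the final page untouched, since no second separator follows it.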

Fix Olde Englifh

Use with care, and only on books that contain the long s (ſ) that was OCR'd as the letter "f". In every word that contains the letter "f" and would also be a legitimate word with the letter "s" in the same position, the "f" is turned into an asterisk for manual inspection. For example, "finger" will be replaced with "*inger", since "singer" is a legitimate word. Similarly for found/sound, fail/sail, left/lest, etc.
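
The substitution rule can be sketched in Python as follows. The function name and the tiny word set are hypothetical, the sketch only looks at lowercase "f", and it tests each "f" on its own; the real tool may handle these details differently.

```python
def flag_possible_long_s(word: str, dictionary: set[str]) -> str:
    """Replace each "f" with "*" where swapping in "s" gives a known word."""
    chars = list(word)
    for i, ch in enumerate(word):
        if ch != "f":
            continue
        # Try an "s" in this position only, leaving any other "f" alone.
        candidate = word[:i] + "s" + word[i + 1:]
        if candidate.lower() in dictionary:
            chars[i] = "*"
    return "".join(chars)


# Hypothetical mini-dictionary, for illustration only.
WORDS = {"singer", "sound", "sail", "lest"}
print(flag_possible_long_s("finger", WORDS))
print(flag_possible_long_s("left", WORDS))
```

With the mini-dictionary shown, the two calls give "*inger" and "le*t", while a word such as "fifty" is left unchanged.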

Compress PNG files

In the Advanced tab in Settings, you can specify a command to compress the PNG files. The default command is for pngcrush. The tooltip gives the command for both pngcrush and optipng. If you clear the field, GG2 will attempt a built-in file compression, but it is nowhere near as good as either of the two programs mentioned. If you don't have either program, you'll need to install it; neither one is packaged with GG2.

Renumber Pages and PNG Files

This option allows you to renumber the PNG files and the corresponding page separator lines in the combined text file. When you Export Prep Text Files later, the text files will be given the new numbers.

The dialog consists of 5 rows of entry fields. Each row corresponds to a section of the book that requires a different naming/numbering system, e.g. frontmatter, main matter, index, etc.

  • Each section can have a unique prefix, e.g. "a001", "a002", etc., for the first section, "b001", "b002", etc., for the second section, and so on. These must remain in correct alphabetical order for them to be displayed in the correct order at DP.
  • Next, the start and end number for each section are specified. You may either continue numbering from one section to the next, e.g. "a001" to "a020", then "b021" to "b300", or restart numbering, e.g. "a001" to "a020", then "b001" to "b180".
  • Finally, suffixes may be specified. Suffixes will be used sequentially in a cycle. For example, if an index is in 2 columns, you may enter "a b" or "_1 _2" in the suffix field. If your prefix is "c" and the index begins at page 200, the following page numbers will be used: "c200a", "c200b", "c201a", "c201b", etc., or "c200_1", "c200_2", "c201_1", "c201_2", etc. If there are 3 columns, just specify 3 suffixes, e.g. "_1l, _2c, _3r" (for left, centre and right columns).
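
The naming scheme in the list above can be sketched in a few lines of Python. The function name, its parameters and the 3-digit zero padding are assumptions made for illustration; they are not taken from the Guiguts source.

```python
def section_names(prefix: str, start: int, end: int,
                  suffixes: tuple[str, ...] = (), width: int = 3) -> list[str]:
    """Generate page names for one section, cycling through any suffixes."""
    names = []
    for number in range(start, end + 1):
        # Assumption: numbers are zero-padded to a fixed width.
        base = f"{prefix}{number:0{width}d}"
        if suffixes:
            # One name per suffix before moving on to the next number.
            names.extend(base + suffix for suffix in suffixes)
        else:
            names.append(base)
    return names


front = section_names("a", 1, 3)
index = section_names("c", 200, 201, ("a", "b"))
print(front)
print(index)
```

Here the first call produces "a001" to "a003", and the second produces "c200a", "c200b", "c201a", "c201b". Because the prefixes are in alphabetical order, the combined list also sorts correctly as plain strings.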

Note: This option only renumbers whole sections of the book, up to a maximum of 5 sections. If the project has multiple tipped-in plates that do not have a formal page number, and you wish to keep the PNG files matching the physical page numbers, that will need to be handled separately.

Export As Prep Text Files

After using Import Prep Text Files and pre-processing the text using Guiguts' other features, this option lets you write the current document as separate files again. Navigate to, and select, the folder where you want to store the individual page files (nnn.txt). Guiguts writes one file for each page.

Import TIA Abbyy OCR File

This option is specifically tailored to import Abbyy OCR files downloaded from TIA; not all TIA scansets have them. Investigation showed that the other OCR formats available at TIA, and the OCR available via HathiTrust, all had problems (such as missing end-of-line hyphens) that could not be overcome. The import is usable whenever the Abbyy file is available, but it does not do well with tables.

Highlight WF Chars not in Selected Suites

Highlights the characters in Word Frequency's Character Count that are not in any of the selected character suites for this file.
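
Conceptually, this is a set difference between the characters that occur in the file and the characters allowed by the selected suites. The Python sketch below illustrates the idea; the function name and the suite contents are hypothetical and much smaller than the real DP character suites.

```python
from collections import Counter


def chars_outside_suites(text: str, suites: dict[str, set[str]]) -> dict[str, int]:
    """Count the non-space characters that belong to none of the given suites."""
    # Union of every character allowed by the selected suites.
    allowed = set().union(*suites.values())
    counts = Counter(ch for ch in text if not ch.isspace())
    return {ch: n for ch, n in sorted(counts.items()) if ch not in allowed}


# Hypothetical suite, for illustration only.
selected = {"demo": set("abcdefghijklmnopqrstuvwxyz.")}
print(chars_outside_suites("café naïve.", selected))
```

In this example only "é" and "ï" are reported, each with a count of 1, because every other character is in the demo suite.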

To use this feature, first select any additional character suites you may need for your project from the Manage Character Suites... option in the Content Providing menu.

Then select the Highlight WF Chars not in Selected Suites option in the Content Providing menu. When it is selected, you should see a checkmark next to the item in the menu:

WF-char-selection.png

Once you have the checkmark, go to Tools > Word Frequency:

Tools-word frequency.png

You will see characters highlighted in yellow with a notation next to them that they are not in any character suite:

Highlight-not-in-char-suite.png

If you click once on the line item, Guiguts takes you to the place in the text where the character occurs, and you can edit it in the main window, either directly or via the Search and Replace functionality.

Note that you do not need to replace the curly quotes via this feature; they are addressed more easily via the Filter Files tool.

Manage Character Suites

Allows the user to enable or disable any of the DP character suites for this book - for use in conjunction with the above highlighting setting.

CP Character Substitutions

Subsumed into "Filter File..." (this section will be removed in a future version of the manual)