Jargon related to Content-Providing

Jargon Guides

Organizations and specialized activities develop their own terminology, or jargon, and DP is no exception. Accordingly, we have developed some FAQ-like Jargon Guides to help you learn our lingo.

The LONG DP Jargon Guide, and the Jargon Guides related to The Guidelines, User Roles, and Workflow contain acronyms and terms you will likely encounter as a new volunteer at DP.

Other Jargon Guides contain terms that are a bit more specialized. The Group Activities Jargon Guide will become especially relevant to you if you start using Jabber. The remaining Jargon Guides shown in the Jargon Navigator box relate to the specific activities mentioned in their titles.

If you come across an acronym or term that isn't mentioned in one of these Jargon Guides, please ask about it in one of the DP forums.

Detailed suggestions on how best to add and edit Jargon-related information can be found at Help:Jargon.

ABBYY FineReader

ABBYY FineReader is an OCR software program.

For more information, see DPWiki's full ABBYY FineReader article.

clearance

See copyright clearance.

Cognative OpenOCR

Cognative OpenOCR is an OCR software program.

For more information, see DPWiki's full Cognative OpenOCR article.

copyright

A copyright is the set of exclusive rights granted to the author or creator of an original work, including the right to copy, distribute and adapt the work. Wikipedia's article provides a more detailed definition.

For more information about copyrights and their importance at Distributed Proofreaders (DP), see DPWiki's complete copyright article.

copyright clearance

Copyright clearance (or clearance) is Distributed Proofreaders's verification that a document meets the public domain and copyright criteria established by the Project Gutenberg Literary Archive Foundation.

For more information, see the complete copyright clearance article.

CP: Content Provision/Provider

Content Providing/Provision (CP) is the process of providing the page images used in proofreading, either by scanning a book or harvesting the images from an online source.

CP also refers to a person who does such work (a Content Provider, or CPer).

If you are interested in becoming a CPer, visit Access Requirements.

You can automate some content providing tasks by using Guiguts 2 (Guiprep has been superseded by Guiguts 2; it can still be downloaded but is no longer being actively maintained). For more information you can see the Content Providing FAQ.

dpscans

dpscans is a special directory on the DP server, accessible via the Remote File Manager script (http://www.pgdp.net/c/tools/project_manager/remote_file_manager.php).

Content Providers, OCR Pool volunteers, Project Managers, and others upload zip files to their dpscans folders. These zips may contain images, text, illustrations, and occasionally music files to be loaded into projects.

Once all the project files are ready, the PM "loads" the files from his or her dpscans subfolder into the project itself. This copies the files into the active project database and means that the "original" files in dpscans can now be deleted.

Before June 2010, access to dpscans was via FTP. After malware-infected files were found in dpscans, FTP access was disabled and a basic script was provided in its place. An enhanced version of that script was released in November 2015 as Remote File Manager.

(This is a brief overview; a squirrel could give a much more technically accurate and detailed description.)

FineReader

See ABBYY FineReader.

graiding

Graiding refers to group-proofraiding.

grOCRing

Group OCR.

GuiPrep

Warning

Guiprep is no longer supported and has been superseded by Guiguts 2.

This page has been updated for release .41e.

Guiprep is a Perl script, written by the author of Guiguts, that is used to prepare OCR text for uploading to DP.

Recent versions can be found on the GitHub guiprep releases page. Guiprep works with Strawberry Perl, which is also recommended for use with Guiguts.

The salient changes in version .41e (May 2021) are:

Remove FTP tab. Neither DP nor DPC supports the use of the FTP tab.
Make fcannos.bin platform independent (which was the only portion of guiprep that was not platform independent).
Move scrollbar to right-hand side of Select Option tab.
Remove pause from run_guiprep.bat.
Filter form feeds from files. (Tesseract support.)
Make "Convert Windows 1252 codepage glyphs 80-9F" default to off. (UTF-8.)
Make removal of headers and footers utf safe.

The changes in version .41d (August 2020) are:

Added option to tidy up or mark dubious spaced curly quotes (see below for close single quotes)
Added option to fix spaced close single curly quotes (not mark as unknown) - leave unchecked if your book has apostrophes at the start of words, e.g. 'orrible

Newer versions are announced from time to time in the guiprep forum thread. That thread is also the best place for Q&A regarding guiprep. If you would like to report a bug or request a change to the package, please enter it on GitHub.

Each download zip contains the guiprep Perl script and supporting data files. It also contains a change log (changelog.html) and a manual. Prior to version .41e the manual was named guiprep.html; as of .41e it is named guiprep-userguide.html.

harvest

At Distributed Proofreaders (DP), harvesting (also proofraiding) refers to the process of downloading the page images of a book or book-like thing from an online source.

For more information, see DPWiki's harvest article.

life +50 copyright

A life +50 copyright is a copyright that expires on January 1 of the year following the 50th anniversary of the author's death.

For more information about copyrights and their importance at Distributed Proofreaders (DP), see DPWiki's complete copyright article.

life +70 copyright

A life +70 copyright is a copyright that expires on January 1 of the year following the 70th anniversary of the author's death.

For more information about copyrights and their importance at Distributed Proofreaders (DP), see DPWiki's complete copyright article.
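The arithmetic in the life +50 and life +70 definitions above can be sketched in a few lines of Python. This is only an illustration of the date calculation as worded in those definitions; the function name and the author's death year are made up for the example, and real copyright terms depend on the jurisdiction and on the work itself.

```python
def copyright_expiry_year(death_year: int, term_years: int) -> int:
    """Return the year on whose January 1 a life +N copyright expires.

    Follows the definition above: January 1 of the year following the
    N-th anniversary of the author's death.
    """
    anniversary_year = death_year + term_years
    return anniversary_year + 1


# Hypothetical author who died in 1950 (an assumption for illustration).
life_plus_50 = copyright_expiry_year(1950, 50)
life_plus_70 = copyright_expiry_year(1950, 70)
print(f"life +50 expires on January 1, {life_plus_50}")
print(f"life +70 expires on January 1, {life_plus_70}")
```

For this made-up author, the two terms end twenty years apart, which is why the same book can be cleared in one country and not in another.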

LoC

LoC and LOC are the standard abbreviations used to refer to the Library of Congress (U.S.).

Missing Page Finder

The Missing Page Finders are volunteers who love to spend a lot of time in libraries and don't mind searching for the exact edition of obscure tomes, photographing or scanning the missing pages, and sending the files on to those who need to complete a project. In many ways, these hard workers are the unsung heroes of DP.

See also Missing Page Finders and Missing pages.

OCR

See optical character recognition.

OCR Pool

The OCR Pool is the part of Distributed Proofreaders's (DP's) server used to store the scanned pages of documents that require pre-processing.

For more information, see DPWiki's full OCR Pool article.

Ocrad

Ocrad is an OCR software program.

For more information, see DPWiki's full Ocrad article.

OmniPage

OmniPage is an OCR software program, usable with Windows, OSX and Linux operating systems.

For more information, see DPWiki's full OmniPage article.

optical character recognition

Optical character recognition (OCR) is the electronic translation of scanned images of printed text into editable text.

At Distributed Proofreaders, the abbreviation OCR is used in various contexts (and tenses/forms) to refer to:

OCR software - the software that performs optical character recognition,
the process of using optical character recognition software,
the person using optical character recognition software, and
OCR text - the editable text produced by optical character recognition software.

For more information, see the full optical character recognition article.

PD

See public domain.

PGLAF

See Project Gutenberg Literary Archive Foundation.

PM: Project Manager

The Project Manager (PM) is the person in charge of a project and its progress through the rounds. The ultimate goal of the PM is to help the project be as consistently proofed and formatted as possible for the PPer. One way the PM (usually) does this is by writing Project Comments.

Different PMs have different styles. Some provide a handful of books that they pre-process themselves, then during proofreading monitor the project threads closely, and finally post-process the project themselves; others provide large quantities of books and rely on others to PP them. Other PMs fall somewhere between, perhaps closely following some books, while only glancing in on others, as questions are asked in the project thread.

If you are interested in becoming a PM, visit Access Requirements. If you are a new PM, see the Project Managing FAQ.

Pre-processing

Pre-processing is the process of preparing a book (which becomes known as a "project") for proofreading here at DP. Steps include scanning the book (or "book-like thing"), running the OCR software (which generally includes some spellchecking function), and uploading the files to the DP servers using Remote File Manager. These tasks are performed by a person known as the Content Provider (CP), who may also serve as the Project Manager (PM).

Project Gutenberg Literary Archive Foundation

The Project Gutenberg Literary Archive Foundation (PGLAF) is the legal entity supporting the work of Project Gutenberg (PG). See PG's Project Gutenberg Literary Archive Foundation article for more detailed information.

For more information about Distributed Proofreaders's (DP's) relationship with the PGLAF, see DPWiki's full article about the Project Gutenberg Literary Archive Foundation.

proofraid

See harvest.

public domain

The term public domain (PD) refers to information, creative works, etc. that are part of the common body of knowledge or cultural heritage, which are not protected by any copyright or patent.

For more information about public domain and its importance for Distributed Proofreaders, see the complete public domain article.

Readiris

Readiris is an OCR software program, usable with Windows, OSX and Linux operating systems.

For more information, see DPWiki's full Readiris article.

regex: regular expression

A regular expression (known as regex for short) is a string of characters that describes or matches a set of strings, according to certain syntax rules.

Regexes may be used in many editors and word processors, to provide powerful search and replace functions. DP-specific uses include the Search & Replace feature of the Proofreading interface and guiguts. Regexes can also be used in guiprep, but using the CP prep functionality of Guiguts 2 is recommended.

For a much more detailed article, including rules and examples, see Wikipedia's article on regular expressions. There is even more information, and tutorials, at regular-expressions.info.
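As a small taste of what a regex can do, here is a minimal sketch using Python's standard re module. The sample strings are invented for illustration, and regex syntax differs a little between tools, so a pattern that works here may need adjusting for the Proofreading interface or guiguts.

```python
import re

# A regex describes a set of strings: "colou?r" makes the "u" optional,
# so it matches both spellings but not the misspelling.
pattern = re.compile(r"colou?r")
candidates = ["color", "colour", "colr"]
matches = [word for word in candidates if pattern.fullmatch(word)]
print(matches)

# Search and replace: remove the whitespace that sometimes appears
# before punctuation. \s+ matches the stray whitespace, and the group
# ([,;:!?]) captures the punctuation mark so \1 can put it back.
sample = "Hello , world ! How are you ?"
cleaned = re.sub(r"\s+([,;:!?])", r"\1", sample)
print(cleaned)
```

The first half shows matching, the second shows replacing; most day-to-day regex use at DP is some combination of the two.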

scanning/scans

The terms scan, scans, scanner, and scanning are used in many places in multiple ways at DP.

"Scan" and "scans" (n.) usually refer to the image files created by Content Providers (occasionally referred to as "scanners" [n.], in the sense of people who scan [v.]), who use hardware known as "scanners" (n.) to "scan" (v.) the individual pages of a book or other textual material. This process is referred to as "scanning" (v. or gerund). In other words, "scans" are the results of running a "scanner" or "scanning." (Sometimes Content Providers harvest scans from other online sources instead of scanning them themselves.)

OCR software is used to create an OCR text from the scanned images (scans). As a project begins its journey through DP's rounds, the proofers working in P1 compare each page's OCR text to its original scan. Thus, "the scans" are the foundation of the e-texts produced by DP.

Joint Photographic Experts Group (JPG) file format

The Joint Photographic Experts Group (JPEG, JPG; file extension .jpg) file format is a lossy compressed image file format.

For more information, see DPWiki's full JPG article.

JPEG

See Joint Photographic Experts Group (JPG) file format.

JPG

See Joint Photographic Experts Group (JPG) file format.

Portable Network Graphics

Portable Network Graphics (PNG or png; file extension .png) is a lossless compressed image file format.
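"Lossless" means that decompressing gives back exactly the data that went in. PNG's compression is based on the DEFLATE algorithm, which Python's standard zlib module also provides, so the idea can be sketched without any image library. The byte string below is a placeholder standing in for pixel data, not a real image, and this is not how a PNG file is actually assembled.

```python
import zlib

# Placeholder bytes standing in for raw pixel data (an assumption for
# illustration): 500 repetitions of a black-then-white pattern.
pixels = bytes([0, 0, 0, 255, 255, 255] * 500)

compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)

print(len(pixels), "bytes before compression")
print(len(compressed), "bytes after compression")
print("identical after round trip:", restored == pixels)
```

A lossy format such as JPG makes no such promise: the decoded image is close to the original, but not necessarily byte-for-byte the same.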

PNG

See Portable Network Graphics.

TIFF

TIFF, or tiff, which stands for Tagged Image File Format, is an image file format (actually, a group of image file formats) that supports both lossy and lossless compression. The tiff format can easily be extended with non-standard options and tags, which can lead to compatibility problems. It was originally developed as a file format for desktop scanners, and most document imaging systems still work natively in tiff format.

See a detailed explanation at Wikipedia.

Tesseract

Tesseract is an OCR software program, usable with Windows, OSX and Linux operating systems.

For more information, see DPWiki's full Tesseract article.

TOCR

TOCR is an OCR software program.

For more information, see DPWiki's full TOCR article.

TP&V

TP&V is an abbreviation for title page and verso.

You will often see this abbreviation used in the Content Providing Forums because in order to obtain a copyright clearance for a project from the PG Copyright Team, the minimum amount of documentation a CP must submit is a copy of the work's TP&V.

uberproject

An uberproject is a large-scale, multi-volume Distributed Proofreaders project.

For more information, see DPWiki's complete uberprojects article.

WorldCat

WorldCat is a bibliographic database useful for searching library collections. It is frequently used by Content Providers to locate specific editions of books to scan for DP or to supplement Missing pages.

Many university libraries have subscriptions to WorldCat, with direct links to the search pages for students and employees. Some public libraries also make this search resource available to their patrons.

There is now a free portal called WorldCat.org, which searches the WorldCat database created by OCLC members directly. You can sign up for the Affiliate Program and add a WorldCat search box to your website.

There is also a version called Open WorldCat that works through both Yahoo! and Google search engines. To use Open WorldCat, include the phrase "find in a library" along with the title information in your search, and the first hit will be the WorldCat link. You can then continue your searches through this results page. (Through Google, there is a "Find a Library" search text box at the top of the results page, for example.)

More Wiki-quality info about WorldCat is available at Wikipedia.

ZIP archive

See ZIP file.

ZIP file

A ZIP file (also ZIP archive, file extension .zip) contains one or more files that have either been stored intact or been compressed to reduce file size, using the ZIP file format. Wikipedia's article has more detailed information about the ZIP file format and ZIP files.

For more information about Distributed Proofreaders's (DP's) use of ZIP files, see DP's article about ZIP files.
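The difference between files that are "stored intact" and files that are "compressed" inside a ZIP file can be seen with Python's standard zipfile module. This is a minimal sketch: the file names and contents are invented for illustration, and the archive is built in memory rather than on disk.

```python
import io
import zipfile

# Build a small ZIP archive in memory. The names and contents are
# placeholders, not real project files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # Stored intact: the bytes are copied into the archive unchanged.
    archive.writestr("001.png", b"placeholder image bytes",
                     compress_type=zipfile.ZIP_STORED)
    # Compressed: repetitive text shrinks considerably with DEFLATE.
    archive.writestr("001.txt", "sample OCR text\n" * 100,
                     compress_type=zipfile.ZIP_DEFLATED)

# Reopen the archive and compare original and compressed sizes.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    image_info = archive.getinfo("001.png")
    text_info = archive.getinfo("001.txt")
    for info in archive.infolist():
        print(info.filename, info.file_size, "->", info.compress_size)
```

The stored entry keeps its original size, while the compressed entry is smaller. Image formats such as PNG and JPG are already compressed, so zipping them mostly bundles them together rather than shrinking them.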