Jargon related to Content-Providing

From DPWiki
Jargon Guides

Organizations and specialized activities develop their own terminology, or jargon, and DP is no exception. Accordingly, we have developed some FAQ-like Jargon Guides to help you learn our lingo.

The LONG DP Jargon Guide, and the Jargon Guides related to The Guidelines, User Roles, and Workflow contain acronyms and terms you will likely encounter as a new volunteer at DP.

Other Jargon Guides contain terms that are a bit more specialized. The Group Activities Jargon Guide will become especially relevant to you if you start using Jabber. The remaining Jargon Guides shown in the Jargon Navigator box relate to the specific activities mentioned in their titles.

If you come across an acronym or term that isn't mentioned in one of these Jargon Guides, please ask about it in one of the DP forums.

Detailed suggestions on how best to add and edit Jargon-related information can be found at Help:Jargon.


See also the CP FAQ.


ABBYY FineReader

ABBYY FineReader is an OCR software program.

clearance

See copyright clearance.

Cognitive OpenOCR

Cognitive OpenOCR is an OCR software program.

copyright

A copyright is the set of exclusive rights granted to the author or creator of an original work, including the right to copy, distribute and adapt the work. Wikipedia's article provides a more detailed definition.


copyright clearance

Copyright clearance (or clearance) is Distributed Proofreaders' verification that a document meets the public domain and copyright criteria established by the Project Gutenberg Literary Archive Foundation.

CP: Content Provision/Provider

Content Providing/Provision (CP) is the process of providing the page images used in proofreading, either by scanning a book or harvesting the images from an online source.

CP also refers to a person who does such work (Content Provider, or CPer).

If you are interested in becoming a CPer, visit Access Requirements.

You can automate some content providing tasks by using guiprep and guiguts. For more information, see the Content Providing FAQ.


dpscans

dpscans is a special directory on the DP server, accessible via the Remote File Manager. (Accessed via the script: http://www.pgdp.net/c/tools/project_manager/remote_file_manager.php)

Content Providers, OCR Pool volunteers, Project Managers, and others upload zip files to their dpscans folders. These zips may contain images, text, illustrations, and occasionally music files to be loaded into projects.

Once all the project files are ready, the PM "loads" the files from his or her dpscans subfolder into the project itself. This copies the files into the active project database and means that the "original" files in dpscans can now be deleted.

Before June of 2010, access to dpscans was via ftp. After malware-infected files were found in dpscans, ftp access was disabled, and a basic script was provided. The script has been enhanced, and was released in November of 2015 as Remote File Manager.

(This is a brief overview; a squirrel could give a much more technically accurate and detailed description.)


FineReader

See ABBYY FineReader.

graiding

Graiding refers to group-proofraiding.


grOCRing

Group OCR.

GuiPrep

Guiprep is a Perl script by the author of Guiguts that is used to prepare OCR text for uploading to DP.

Recent versions can be found on the GitHub guiprep releases page. Guiprep works with Strawberry Perl, which is also recommended for use with Guiguts.

The changes in version .41d are:

  1. Added option to tidy up or mark dubious spaced curly quotes (see below for close single quotes)
  2. Added option to fix spaced close single curly quotes (not mark as unknown) - leave unchecked if your book has apostrophes at the start of words, e.g. 'orrible

The changes in version .41c are:

  1. Fixed failure to mark empty files with [Blank page] when they contain a UTF byte order mark (BOM)
  2. Fixed '@INC' error message and failure to reload settings with recent Perl versions
  3. Renamed winprep.exe to winprep40.exe to make it clear it's not running the latest version
  4. Added run_guiprep41c.bat to support running it under Windows without a proper Perl installation
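
The first .41c fix above concerns page files that have no text but still begin with a byte order mark, so they are not literally zero bytes long. The following is only a minimal sketch of that idea in Python. It is not guiprep's own code (guiprep is written in Perl), and the function name mark_if_blank is made up for this illustration.

```python
BLANK_MARKER = "[Blank page]"


def mark_if_blank(raw: bytes) -> str:
    """Return the blank-page marker for a page with no visible text.

    The "utf-8-sig" codec decodes UTF-8 and silently drops a leading
    byte order mark, so a file holding only a BOM decodes to "".
    """
    text = raw.decode("utf-8-sig")
    # A page counts as blank when nothing but whitespace remains.
    if text.strip() == "":
        return BLANK_MARKER
    return text


# A file containing only the three BOM bytes is treated as blank.
print(mark_if_blank(b"\xef\xbb\xbf"))
# A file with real text is returned without the BOM.
print(mark_if_blank(b"\xef\xbb\xbfCHAPTER I."))
```

Decoding with "utf-8-sig" rather than plain "utf-8" is the design choice that matters here: with plain "utf-8" the BOM survives as an invisible character and the page no longer looks empty to a simple test.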

Newer versions are announced from time to time in the guiprep forum thread.

Each download zip contains the guiprep Perl script and supporting data files. It also contains a change log (changelog.html) and a manual (guiprep.html, current as of guiprep .40).


harvest

At Distributed Proofreaders (DP), harvest (also proofraid) is defined as the process of downloading the page images of a document from an online source.


life +50 copyright

A life +50 copyright is a copyright that expires on January 1 of the year following the 50th anniversary of the author's death.


life +70 copyright

A life +70 copyright is a copyright that expires on January 1 of the year following the 70th anniversary of the author's death.
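
The expiry rule in this entry and in the life +50 entry is simple date arithmetic, and can be sketched in Python as follows. The function names are invented for this illustration, and the sketch covers only the arithmetic described here, not the clearance criteria that PGLAF actually applies.

```python
from datetime import date


def expiry_date(death_year: int, term_years: int) -> date:
    """January 1 of the year after the Nth anniversary of the author's death."""
    # The anniversary falls in death_year + term_years; the copyright
    # runs to the end of that year and expires the following January 1.
    return date(death_year + term_years + 1, 1, 1)


def has_expired(death_year: int, term_years: int, today: date) -> bool:
    """True if a life +N copyright has expired on the given day."""
    return today >= expiry_date(death_year, term_years)


# Hypothetical author who died in 1950:
print(expiry_date(1950, 50))  # life +50 term
print(expiry_date(1950, 70))  # life +70 term
```

For the hypothetical author above, the life +50 term ends on January 1, 2001 and the life +70 term on January 1, 2021.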


LoC

LoC and LOC are the standard abbreviations used to refer to the Library of Congress (U.S.).


Missing Page Finder

The Missing Page Finders are volunteers who love to spend a lot of time in libraries and don't mind searching for the exact edition of obscure tomes, photographing or scanning the missing pages, and sending the files on to those who need to complete a project. In many ways, these hard workers are the unsung heroes of DP.

See also Missing Page Finders and Missing pages.


OCR

See optical character recognition.

OCR Pool

The OCR Pool is the part of Distributed Proofreaders' (DP's) server used to store the scanned pages of documents that require pre-processing.

Ocrad

Ocrad is an OCR software program.

OmniPage

OmniPage is an OCR software program, usable with Windows, OSX and Linux operating systems.

optical character recognition

Optical character recognition (OCR) is the electronic translation of scanned images of printed text into editable text.

At Distributed Proofreaders, the abbreviation OCR is used in various contexts (and tenses/forms) to refer to:

  • OCR software - the software that performs optical character recognition,
  • the process of using optical character recognition software,
  • the person using optical character recognition software, and
  • OCR text - the editable text produced by optical character recognition software.

PD

See public domain.

PGLAF

See Project Gutenberg Literary Archive Foundation.

PM: Project Manager

The Project Manager (PM) is the person in charge of a project and its progress through the rounds. The ultimate goal of the PM is to help the project be as consistently proofed and formatted as possible for the PPer. One way the PM (usually) does this is by writing Project Comments.

Different PMs have different styles. Some provide a handful of books that they pre-process themselves, then during proofreading monitor the project threads closely, and finally post-process the project themselves; others provide large quantities of books and rely on others to PP them. Other PMs fall somewhere between, perhaps closely following some books, while only glancing in on others, as questions are asked in the project thread.

If you are interested in becoming a PM, visit Access Requirements. If you are a new PM, see the Project Managing FAQ.


Pre-processing

Pre-processing is the process of preparing a book (which becomes known as a "project") for proofreading here at DP. Steps include scanning the book (or "book-like thing"), running the OCR software (which generally includes some spellchecking function), and uploading the files to the DP servers using Remote File Manager. These tasks are performed by a person known as the Content Provider (CP), who may also serve as the Project Manager (PM).


Project Gutenberg Literary Archive Foundation

The Project Gutenberg Literary Archive Foundation (PGLAF) is the legal entity supporting the work of Project Gutenberg (PG). See PG's Project Gutenberg Literary Archive Foundation article for more detailed information.

proofraid

See harvest.

public domain

The term public domain (PD) refers to information, creative works, etc. that are part of the common body of knowledge or cultural heritage and are not protected by any copyright or patent.

Readiris

Readiris is an OCR software program, usable with Windows, OSX and Linux operating systems.

regex: regular expression

A regular expression (known as regex for short) is a string of characters that describes or matches a set of strings, according to certain syntax rules.

Regexes may be used in many editors and word processors to provide powerful search and replace functions. DP-specific uses include the Search & Replace feature of the Proofreading interface, guiprep, and guiguts.

For a much more detailed article, including rules and examples, see Wikipedia's article on regular expressions. There is even more information, and tutorials, at regular-expressions.info.
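
As a small illustration of the kind of search and replace a regex makes possible, here is a sketch in Python. The sample line and the function name tidy_line are made up for this example, and regex syntax differs slightly between tools, so a pattern may need adjusting before it is used elsewhere.

```python
import re

# \s+ matches a run of one or more whitespace characters.
# ([,.;:!?]) captures one punctuation mark so it can be reused as \1.
SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?])")

# A space followed by at least one more space.
MULTIPLE_SPACES = re.compile(r" {2,}")


def tidy_line(line: str) -> str:
    """Remove spaces before punctuation and collapse repeated spaces."""
    line = SPACE_BEFORE_PUNCT.sub(r"\1", line)
    return MULTIPLE_SPACES.sub(" ", line)


# Hypothetical line of OCR text with stray spaces.
print(tidy_line("The  quick brown fox , it jumped ."))
```

Each pattern describes a whole set of strings at once: the first matches any amount of whitespace before any of six punctuation marks, which would otherwise take many separate plain-text searches.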

scanning/scans

The terms scan, scans, scanner, and scanning are used in many places in multiple ways at DP.

"Scan" and "scans" (n.) usually refer to the image files created by Content Providers (occasionally referred to as "scanners" [n.], in the sense of people who scan [v.]), who use hardware known as "scanners" (n.) to "scan" (v.) the individual pages of a book or other textual material. This process is referred to as "scanning" (v. or gerund). In other words, "scans" are the results of running a "scanner" or "scanning." (Sometimes Content Providers harvest scans from other online sources instead of scanning them themselves.)

OCR software is used to create an OCR text from the scanned images (scans). As a project begins its journey through DP's rounds, the proofers working in P1 compare each page's OCR text to its original scan. Thus, "the scans" are the foundation of the e-texts produced by DP.


Joint Photographic Experts Group (JPG) file format

The Joint Photographic Experts Group (JPEG, JPG; file extension .jpg) file format is a lossy compressed image file format.

JPEG

See Joint Photographic Experts Group (JPG) file format.

JPG

See Joint Photographic Experts Group (JPG) file format.

Portable Network Graphics

Portable Network Graphics (PNG or png; file extension .png) is a lossless compressed image file format.


PNG

See Portable Network Graphics.

TIFF

TIFF, or tiff, which stands for Tagged Image File Format, is an image file format (actually, a group of image file formats) that supports both lossy and lossless compression. The tiff format can easily be extended with non-standard options and tags, which can lead to compatibility problems. It was originally developed as a file format for desktop scanners, and most document imaging systems still work natively in tiff format.

See a detailed explanation at Wikipedia.


Tesseract

Tesseract is an OCR software program, usable with Windows, OSX and Linux operating systems.

TOCR

TOCR is an OCR software program.

TP&V

TP&V is an abbreviation for title page and verso.

You will often see this abbreviation used in the Content Providing Forums because in order to obtain a copyright clearance for a project from the PG Copyright Team, the minimum amount of documentation a CP must submit is a copy of the work's TP&V.


uberproject

An uberproject is a large-scale, multi-volume Distributed Proofreaders project.

WorldCat

WorldCat is a bibliographic database useful for searching library collections. It is frequently used by Content Providers to locate specific editions of books to scan for DP or to supplement Missing pages.

Many university libraries have subscriptions to WorldCat, with direct links to the search pages for students and employees. Some public libraries also make this search resource available to their patrons.

There is now a free portal called WorldCat.org, which searches the WorldCat database created by OCLC members directly. You can sign up for the Affiliate Program and add a WorldCat search box to your website.

There is also a version called Open WorldCat that works through both Yahoo! and Google search engines. To use Open WorldCat, include the phrase "find in a library" along with the title information in your search, and the first hit will be the WorldCat link. You can then continue your searches through this results page. (Through Google, there is a "Find a Library" search text box at the top of the results page, for example.)

More Wiki-quality info about WorldCat is available at Wikipedia.


ZIP archive

See ZIP file.

ZIP file

A ZIP file (also ZIP archive, file extension .zip) contains one or more files that have either been stored intact or been compressed to reduce file size, using the ZIP file format. Wikipedia's article has more detailed information about the ZIP file format and ZIP files.
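
The difference between files that are stored intact and files that are compressed can be seen with Python's standard zipfile module. This is only a sketch: the file names and contents are invented, and the rule of storing .png and .jpg files without further compression is an assumption made for the example, not a DP requirement.

```python
import io
import zipfile


def build_zip(files: dict[str, bytes]) -> bytes:
    """Pack the given files into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            # PNG and JPG data is already compressed, so store it intact;
            # compress everything else with the usual deflate method.
            if name.lower().endswith((".png", ".jpg")):
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            archive.writestr(name, data, compress_type=method)
    return buffer.getvalue()


def list_zip(zip_bytes: bytes) -> list[tuple[str, int, int]]:
    """Return (name, original size, size inside the archive) per file."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            (info.filename, info.file_size, info.compress_size)
            for info in archive.infolist()
        ]


# Hypothetical project files: one page image and its OCR text.
archive_bytes = build_zip({
    "001.png": b"\x89PNG placeholder image bytes",
    "001.txt": b"a" * 1000,
})
for name, original, packed in list_zip(archive_bytes):
    print(name, original, packed)
```

The stored file takes up exactly its original size inside the archive, while the repetitive text file shrinks considerably.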