Regular Expressions for Copyright Renewals

From DPWiki

This is the list of regular expressions used for post-processing Copyright Renewals with Guiguts, taken from this post.

Instructions for use

Save this list as a file (CRregex.rc) in the Guiguts Scannos subdirectory. Launch Guiguts, select 'Search' -> 'Stealth Scannos', and load the CRregex.rc file. Make sure the 'Regex' box is ticked on the 'Search and Replace' popup, then work your way through the scannos.

Please feel free to update the list as required.
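
When updating the list, a pattern with a typo (for example an unbalanced parenthesis) may only show up once Guiguts tries to use it. The following is a minimal sketch of a compile check that can be run on its own before saving the file. The helper name bad_patterns and the deliberately broken third entry are illustrative assumptions, not part of Guiguts or of the list.

```perl
use strict;
use warnings;

# Hypothetical helper: return the patterns of a scannos-style hash
# that Perl cannot compile as regular expressions.
sub bad_patterns {
    my (%rules) = @_;
    my @bad;
    for my $pattern ( sort keys %rules ) {
        # qr// dies on a malformed pattern; eval traps the error.
        my $ok = eval { my $re = qr/$pattern/; 1 };
        push @bad, $pattern unless $ok;
    }
    return @bad;
}

# Two entries from the list plus one broken pattern for illustration.
my %sample = (
    'Hay\s+(\d)' => 'May $1',
    'q(?!u)'     => 'g',
    '(\d{6}'     => 'R',    # unbalanced parenthesis, illustration only
);

print "Does not compile: $_\n" for bad_patterns(%sample);
```

To check the real list, the contents of the %scannoslist hash could be passed to the helper in place of %sample.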

%scannoslist = (
'0ct' => 'Oct',            # obvious scanno
'Ju1' => 'Jul',            # obvious scanno
'Alien' => 'Allen',        # obvious scanno
'Prance' => 'France',      # obvious scanno
'Prank' => 'Frank',        # obvious scanno
'PRANK' => 'FRANK',        # obvious scanno
'Hay\s+(\d)' => 'May $1',            # obvious scanno
'PVH' => 'PWH',            # obvious scanno
'PCV' => 'PCW',            # obvious scanno
'PPV' => 'PPW',            # obvious scanno
'pseud,' => 'pseud.',      # obvious scanno
'\bSee\b' => 'SEE',        # obvious scanno -- BEWARE FALSE POSITIVES
' 0\.' => ' O.',           # obvious scanno
' C ' => ' (C) ',            # obvious scanno
' \(R\) ' => ' (C) ',          # obvious scanno
'--' => '--',               # obvious scanno
' \.' => '.',              # obvious scanno
' ,' => ',',               # obvious scanno
' ;' => ';',               # obvious scanno
'\)j' => ');',             # obvious scanno
'1\'' => 'l\'',            # obvious scanno
'll([A-Z])' => '11$1',     # obvious scanno     
'l([A-Z])' => '1$1',       # obvious scanno
'All([0-9]+)' => 'A11$1',  # Fix 'All' for 'A11' renewal numbers scanno
'q(?!u)' => 'g',      #Find q's with no u's - turn to g's in most instances - watch for words like Iraq
'(\([^ACEW]\))' => 'Check Copyright Holder',           #Find Single Character Copyright Holders which aren't A, C, E, or W
'(\((\b([^AWN][^srK])|Ar|AK|Ws|WK|Ns|Nr)\b\))' => 'Check Copyright Holder',       #Find Copyright Holders which aren't As, Wr or NK - warning false positives for volumes like 2d
'(\((\b([^P][^CWP][^BWH])|PCH|PPB|PWW|PWB)\b\))' => 'Check Copyright Holder',    #Find Copyright Holders which aren't PCB, PCW, PWH or PPW
# '^\p{IsUpper}.*SEE' => 'FIX SEE REF ON SAME LINE AS AUTHOR',  # points out lines starting with an upper-case char with SEE on the same line
'^(?!$|\S+| {3}\S+| {5}\S+| {7}\S+).*$' => 'FIX INDENTATION', # points out lines not starting with 0, 3, 5, or 7 spaces
' [PFK](\d+)' => ' R$1',    # find scannos in the renewals numbers that start with P, F, K
'([a-z])1([a-z])' => '$1l$2', # 1's embedded in lowercase chars
'([A-Z])1([A-Z])' => '$1I$2', # 1's embedded in uppercase chars
'([A-Z])0([A-Z])' => '$1O$2', # 0's embedded in uppercase chars
'^[A-Z]\s*$|^<b>[A-Z]' => 'Remove "Chapter Heading"',       # Get rid of A, B, C....
'^[^R](?=\d{6})' => 'R',        #Find a renewal number at a line start that doesn't begin with R
'[^R](?=\d{6}\.)$' => 'R',      #Find a renewal number at a line end that doesn't begin with R
'(R\d{6})$' => '$1.',           #Find a renewal number that doesn't end with a period
'\bG(?=\s)' => '&',             #Convert lone G to &
'[b-df-hj-np-tv-xz]{5,}' => '' , # **check "case insensitive" Finds 5 or more consonants in a row. Typically either missing vowel or German word. :-)

'(\b\d\w*)[^\w-](\w*\d)\b' =>'',                #A better regex for finding bad characters in dates, not so many false positives
'(\(\p{Alpha}{1,3}\))([^;]|$)' => '$1; ',       #Find copyright holders that aren't followed by semicolon. BEWARE FALSE POSITIVES!
'\(c\)(\S+)' => '(c) $1',                       #copyright symbols without whitespace
'(\b\d{1,2}\p{Alpha}{3}\d{2})[^;,]' => '$1; ',  #Find dates that aren't followed by semicolon or comma
'\b(\d{1,2}\p{Alpha}{3})[^47](\d)\b' => '$14$2', #or 7  Find dates that aren't in the Forties or Seventies, but 4s seem to be especially hard for the OCR
'\b(\p{Upper})(\d*?)l(\d+)' => '$1$21$3',       #Find l for 1 in /renewal numbers
'\b(\p{Upper})(\d+?)l(\d*)' => '$1$21$3',       #Find l for 1 in /renewal numbers that slipped through
'\b(\p{Upper})(\d*?)O(\d+)' => '$1$20$3',       #Find O for 0 in /renewal numbers
'\b(\p{Upper})(\d+?)O(\d*)' => '$1$20$3',       #Find O for 0 in /renewal numbers that slipped through
'l(\p{Upper}\p{Lower}{2})' => '1$1',            #Find l for 1 in dates
'O(\p{Upper}\p{Lower}{2})' => '0$1',            #Find O for 0 in dates
'\bl(\d\p{Upper}\p{Lower})' => '1$1',           #Find yet more l for 1 in dates
'\b(\d\d?\p{Alpha}{3}\d)l' => '$11',            #Find still further l for 1 in dates
'\b(\d\d?\p{Upper})(\p{Upper}{2})(\d\d)\b'=>'$1\L$2\E$3',  #Find ALL CAPS dates and transform them
'(\S)\s\s+(\S)' => '$1 $2',                     #Fix multiple consecutive spaces
'([MW])l' => '$1i',                             #Find l following M or W and replace with i
'!(\w)' => 'l$1' #Find an exclamation point that is NOT at the end of a word, and replace with l
);
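
To see what a rule would do before stepping through it in Guiguts, it can be dry-run on a sample line. The sketch below is an assumption-laden illustration, not Guiguts code: apply_rule is a hypothetical helper, the sample line is invented, and "$1" to "$9" in a replacement are read as single-digit group references, so that '$14$2' means group 1, a literal 4, then group 2. Unlike Guiguts, which lets you accept or skip each hit, this applies every match unattended, so it does not filter false positives.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Guiguts: apply one rule to a string
# and return the result. The \L...\E case conversion used by one rule
# in the list is not handled here.
sub apply_rule {
    my ( $text, $pattern, $replacement ) = @_;
    my ( $out, $last ) = ( '', 0 );
    while ( $text =~ /$pattern/gm ) {
        my ( $start, $end ) = ( $-[0], $+[0] );
        # Copy the captured groups before running another regex.
        my @caps = ('');
        for my $i ( 1 .. $#+ ) {
            push @caps,
              defined $-[$i] ? substr( $text, $-[$i], $+[$i] - $-[$i] ) : '';
        }
        # Expand single-digit group references in the replacement.
        ( my $repl = $replacement ) =~
          s/\$(\d)/defined $caps[$1] ? $caps[$1] : ''/ge;
        $out .= substr( $text, $last, $start - $last ) . $repl;
        $last = $end;
    }
    return $out . substr( $text, $last );
}

# A few rules from the list, kept in an array so the order is fixed.
my @rules = (
    [ 'Prank',           'Frank' ],
    [ 'Hay\s+(\d)',      'May $1' ],
    [ 'q(?!u)',          'g' ],
    [ '([a-z])1([a-z])', '$1l$2' ],
);

my $line = 'Wi1son, Prank. Publishinq in Hay 1949.';    # invented sample
$line = apply_rule( $line, @$_ ) for @rules;
print "$line\n";    # prints "Wilson, Frank. Publishing in May 1949."

# The date rule with the '$14$2' replacement, read as group 1, "4", group 2.
print apply_rule( '25AprA9', '\b(\d{1,2}\p{Alpha}{3})[^47](\d)\b', '$14$2' ),
  "\n";             # prints "25Apr49"
```

The rules are held in an array of pairs rather than a hash because a Perl hash has no defined order, and the result of a dry run can depend on the order in which rules are applied.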