User:Rossolson/GutenBot


https://chatgpt.com/s/t_68e4c854dcc0819197e5021d65cafbae


Below is a complete, ready-to-run design plus production-ready Python code you can copy and paste, along with deployment & maintenance notes, tests, and templates for the bot’s user-facing messages. I aim for a clear, pragmatic implementation you can stand up quickly and iterate on.

Summary / architecture (short)

  1. Ingest: read new posts & comments from selected subreddits via PRAW (Reddit API).
  2. Extract: find candidate book titles and authors from raw text using a mix of rules (quotes, italics, “Title — by Author”) and an NLP Named-Entity approach (spaCy pattern matcher).
  3. Match: fuzzy-match candidates against a local Project Gutenberg metadata index (CSV/SQLite) using RapidFuzz for fast similarity scoring.
  4. Notify: for high-confidence matches, send messages to a configured list of Reddit users (PMs) and store the notification in SQLite to avoid duplicates. Optionally also reply to the comment/post with a ready-to-paste message.
  5. Monitor & maintain: logging, rate-limit handling, health endpoint, auto-restart via systemd / Docker, and periodic metadata refresh.

What I provide here

  • Full Python implementation (single script plus helper modules) using:
    • PRAW (Reddit)
    • spaCy (or fallback lightweight rules)
    • RapidFuzz (fuzzy matching)
    • SQLite (persistence)
    • requests (optional, to download/update Gutenberg metadata)
  • Configuration examples
  • Deployment recommendations (Dockerfile, systemd)
  • Testing checklist and unit-test examples
  • Message templates you can paste directly into Reddit replies

Required external pieces (you must supply or download once)

  1. A Project Gutenberg metadata file (CSV or SQLite). The code below can load:
    • a local CSV with columns: gutenberg_id,title,authors,author_fullname,lang,subjects.
    • OR you can ingest a JSON/RDF dump from Project Gutenberg and convert to CSV (script included).
  2. Reddit app credentials (client_id, client_secret, user_agent, username, password or use OAuth refresh token).
  3. A list of usernames to notify (in config).

Key design choices (why)

  • Use both rule-based patterns and spaCy entity matching: book titles are often quoted or Title-Cased, but not always, so the combined approach raises recall.
  • Fuzzy matching (RapidFuzz) with normalized strings yields robust matches against metadata variations such as punctuation and subtitles (see the short example after this list).
  • SQLite persistence prevents duplicate notifications and keeps system lightweight.
  • Minimal permissions and polite interactions to avoid spam: bot will only notify on high-confidence matches (configurable threshold) and keep a cooldown per source/post.
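
As a quick illustration of why token_set_ratio suits this job (a minimal sketch; the scores rely on RapidFuzz's default behaviour, where one title's tokens being a subset of the other's yields a full score):

from rapidfuzz import fuzz

# A subtitle or edition note on the catalog side does not lower the score when
# the mentioned title's tokens are a subset of the catalog title's tokens:
print(fuzz.token_set_ratio("the count of monte cristo", "the count of monte cristo illustrated"))  # 100.0
print(fuzz.token_set_ratio("frankenstein", "frankenstein or the modern prometheus"))               # 100.0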

Full implementation

Save this as gutenbot.py (or split into modules). I tried to keep it self-contained and well commented.

#!/usr/bin/env python3

"""

gutenbot.py - A Reddit bot that detects book titles/authors and matches them

to Project Gutenberg entries, notifying configured Reddit users.

Requirements (pip):

  pip install praw spacy rapidfuzz python-dotenv requests

  python -m spacy download en_core_web_sm

Configuration via .env (see .env.example below) or environment variables.

"""

import os

import re

import time

import json

import logging

import sqlite3

from dataclasses import dataclass

from typing import List, Optional, Tuple, Dict, Any

import praw

import requests

from rapidfuzz import fuzz, process

# Optional: spaCy for better NER (fallback to rules if not available)

try:

    import spacy

    nlp = spacy.load("en_core_web_sm")

except Exception:

    nlp = None

# Load environment variables

from dotenv import load_dotenv

load_dotenv()

# ---------- CONFIG ----------

REDDIT_CLIENT_ID = os.getenv("REDDIT_CLIENT_ID")

REDDIT_CLIENT_SECRET = os.getenv("REDDIT_CLIENT_SECRET")

REDDIT_USERNAME = os.getenv("REDDIT_USERNAME")

REDDIT_PASSWORD = os.getenv("REDDIT_PASSWORD")

REDDIT_USER_AGENT = os.getenv("REDDIT_USER_AGENT", "GutenBot/0.1 by " + (REDDIT_USERNAME or "gutenbot"))

SUBREDDITS = os.getenv("SUBREDDITS", "bookclub,books").split(",")  # comma-separated

NOTIFY_USERNAMES = [u.strip() for u in os.getenv("NOTIFY_USERNAMES", "").split(",") if u.strip()]

GUTEN_METADATA_CSV = os.getenv("GUTEN_METADATA_CSV", "gutenberg_metadata.csv")

MIN_SCORE = float(os.getenv("MIN_SCORE", "85"))  # RapidFuzz score threshold for title matching

SIMILARITY_AUTHOR_BOOST = float(os.getenv("SIMILARITY_AUTHOR_BOOST", "10"))  # extra points if author also matches

DB_PATH = os.getenv("DB_PATH", "gutenbot.sqlite")

POLL_INTERVAL = int(os.getenv("POLL_INTERVAL", "30"))  # seconds between checks for new stream items

REPLY_IN_COMMENT = os.getenv("REPLY_IN_COMMENT", "true").lower() in ("true", "1", "yes")

SEND_PM = os.getenv("SEND_PM", "true").lower() in ("true", "1", "yes")

MAX_NOTIFY_PER_POST = int(os.getenv("MAX_NOTIFY_PER_POST", "3"))  # prevent spam

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

logger = logging.getLogger("gutenbot")

# ---------- DB ----------

def init_db(conn: sqlite3.Connection):

    cur = conn.cursor()

    cur.execute("""

    CREATE TABLE IF NOT EXISTS notified (

        id TEXT PRIMARY KEY,      -- e.g. reddit_fullname or source link + guten_id

        reddit_kind TEXT,         -- 't1' for comment, 't3' for post

        reddit_permalink TEXT,

        guten_id INTEGER,

        score REAL,

        created INTEGER

    )

    """)

    conn.commit()

conn = sqlite3.connect(DB_PATH, check_same_thread=False)

init_db(conn)

# ---------- Metadata Loader ----------

@dataclass

class GutenbergEntry:

    gutenberg_id: int

    title: str

    authors: List[str]  # one or more author names

    author_fullname: str  # primary author full name (if available)

    lang: str

    subjects: List[str]

def normalize_text(s: str) -> str:

    s = s or ""

    s = s.strip()

    s = s.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')

    s = re.sub(r"\s+", " ", s)

    s = re.sub(r"[^0-9A-Za-z'\"&\s\-:,\.\?]", "", s)

    return s.lower()

def load_metadata_from_csv(path: str) -> List[GutenbergEntry]:

    import csv

    entries = []

    if not os.path.exists(path):

        logger.error("Gutenberg metadata CSV not found at %s", path)

        return entries

    with open(path, newline='', encoding='utf-8') as fh:

        reader = csv.DictReader(fh)

        for r in reader:

            try:

                gid = int(r.get("gutenberg_id") or r.get("id") or 0)

                title = r.get("title", "")

                authors = []

                a = r.get("authors") or r.get("author") or r.get("author_fullname") or ""

                # authors may be pipe/comma separated

                for part in re.split(r"[|,;/]+", a):

                    part = part.strip()

                    if part:

                        authors.append(part)

                lang = r.get("language") or r.get("lang") or "en"

                subjects = []

                subj = r.get("subjects") or r.get("subject") or ""

                for s in re.split(r"[|;]+", subj):

                    s = s.strip()

                    if s:

                        subjects.append(s)

                entries.append(GutenbergEntry(gutenberg_id=gid, title=title, authors=authors,

                                            author_fullname=authors[0] if authors else "", lang=lang, subjects=subjects))

            except Exception as e:

                logger.exception("Error parsing row: %s", e)

    logger.info("Loaded %d Gutenberg metadata entries", len(entries))

    return entries

gutenberg_entries = load_metadata_from_csv(GUTEN_METADATA_CSV)

# Build a quick title lookup list for rapidfuzz

title_to_entry_map = {e.title: e for e in gutenberg_entries}

title_list = list(title_to_entry_map.keys())

# Pre-normalize titles once so matching does not re-normalize the whole catalog per candidate

normalized_title_list = [normalize_text(t) for t in title_list]

# ---------- Extraction: find candidate titles/authors ----------

TITLE_QUOTE_RE = re.compile(r'["“”‘’](.+?)["“”‘’]')  # text in quotes

PAREN_TITLE_RE = re.compile(r'(.+?)\s*\(by\s+([A-Za-z ,.\-]+)\)', re.I)  # "Title (by Author)"

BY_AUTHOR_RE = re.compile(r'(.{2,120}?)\s+by\s+([A-Z][A-Za-z\-\.\s]{2,80})', re.I)  # "Title by Author"

ITALIC_MARK_RE = re.compile(r'\*(.+?)\*|_(.+?)_|<i>(.+?)</i>', re.I)

def extract_candidates(text: str) -> List[Tuple[str, Optional[str]]]:

    """

    Returns list of (title_candidate, author_candidate_or_None)

    """

    candidates = []

    text = text.strip()

    # 1) quoted strings

    for m in TITLE_QUOTE_RE.finditer(text):

        t = m.group(1).strip()

        if len(t) >= 2 and len(t) <= 200:

            candidates.append((t, None))

    # 2) italic markers like *Title*

    for m in ITALIC_MARK_RE.finditer(text):

        t = next(g for g in m.groups() if g)

        candidates.append((t.strip(), None))

    # 3) "Title (by Author)" pattern

    for m in PAREN_TITLE_RE.finditer(text):

        candidates.append((m.group(1).strip(), m.group(2).strip()))

    # 4) "Title by Author" pattern but restrict to short author tokens

    for m in BY_AUTHOR_RE.finditer(text):

        title = m.group(1).strip()

        author = m.group(2).strip()

        if 3 <= len(title) <= 200:

            candidates.append((title, author))

    # 5) spaCy: look for WORK_OF_ART entities or TITLE-like proper noun sequences

    if nlp:

        try:

            doc = nlp(text)

            for ent in doc.ents:

                if ent.label_ in ("WORK_OF_ART", "WORK", "PRODUCT", "EVENT"):

                    candidates.append((ent.text.strip(), None))

            # also heuristics: consecutive PROPN, TITLE CASE sequences

            tokens = [t for t in doc]

            for i in range(len(tokens)-1):

                if tokens[i].text[0].isupper() and tokens[i+1].text[0].isupper():

                    # build sequence

                    j = i

                    seq = []

                    while j < len(tokens) and tokens[j].text[0].isupper() and len(seq) < 8:

                        seq.append(tokens[j].text)

                        j += 1

                    if len(seq) >= 2:

                        cand = " ".join(seq)

                        candidates.append((cand, None))

        except Exception:

            pass

    # dedupe while preserving order

    seen = set()

    uniq = []

    for t,a in candidates:

        key = (t.lower(), (a or "").lower())

        if key not in seen:

            seen.add(key)

            uniq.append((t,a))

    return uniq

# ---------- Matching ----------

def score_candidate_against_metadata(title_cand: str, author_cand: Optional[str]) -> List[Tuple[GutenbergEntry, float]]:

    """

    Returns a list of (GutenbergEntry, score) sorted desc. Score is in [0,100].

    """

    if not title_cand or len(title_cand) < 2:

        return []

    norm = normalize_text(title_cand)

    # rapidfuzz token_set_ratio gives robust matching against punctuation/subtitle variations;

    # process.extract returns (match_string, score, index) tuples against the pre-normalized titles

    matches = process.extract(norm, normalized_title_list,

                              scorer=fuzz.token_set_ratio, limit=10)

    results = []

    for match_string, score, idx in matches:

        title_orig = title_list[idx]

        entry = title_to_entry_map[title_orig]

        sc = score

        # boost if author candidate matches an author

        if author_cand and entry.authors:

            auth_norm = normalize_text(author_cand)

            best_author = max(entry.authors, key=lambda a: fuzz.token_set_ratio(auth_norm, normalize_text(a)))

            auth_score = fuzz.token_set_ratio(auth_norm, normalize_text(best_author))

            if auth_score > 40:

                sc = min(100.0, sc + SIMILARITY_AUTHOR_BOOST * (auth_score/100.0))

        results.append((entry, sc))

    results.sort(key=lambda x: x[1], reverse=True)

    return results

def pick_best_match(results: List[Tuple[GutenbergEntry, float]]) -> Optional[Tuple[GutenbergEntry, float]]:

    if not results:

        return None

    best = results[0]

    if best[1] >= MIN_SCORE:

        return best

    return None

# ---------- Reddit / PRAW Setup ----------

def get_reddit():

    return praw.Reddit(client_id=REDDIT_CLIENT_ID,

                      client_secret=REDDIT_CLIENT_SECRET,

                      username=REDDIT_USERNAME,

                      password=REDDIT_PASSWORD,

                      user_agent=REDDIT_USER_AGENT)

reddit = get_reddit()

# ---------- Notification & reply templates ----------

def build_reply_template(entry: GutenbergEntry, source_permalink: str) -> str:

    # Nice compact reply suitable for pasting

    url = f"https://www.gutenberg.org/ebooks/{entry.gutenberg_id}"

    template = (

        f"Hi — I think the book mentioned here matches **{entry.title}** "

        f"by **{entry.author_fullname or ', '.join(entry.authors)}**.\n\n"

        f"You can find it on Project Gutenberg: {url}\n\n"

        f"I thought this was relevant because it appears to directly relate to the title/author you mentioned."

    )

    return template

def notify_users_pm(reddit_obj, recipients: List[str], subject: str, body: str):

    for user in recipients:

        try:

            reddit_obj.redditor(user).message(subject=subject, message=body)  # newer PRAW versions require keyword arguments

            logger.info("Sent PM to %s", user)

            time.sleep(2)  # gentle pacing

        except Exception as e:

            logger.exception("Failed to send PM to %s: %s", user, e)

def reply_to_item(item, body: str):

    try:

        item.reply(body)

        logger.info("Replied to %s", getattr(item, "id", "<unknown>"))

    except Exception as e:

        logger.exception("Failed to reply: %s", e)

# ---------- Processing a single Reddit item ----------

def canonical_permalink_from_item(item) -> str:

    # PRAW provides .permalink for comments/posts; ensure full url

    try:

        return "https://www.reddit.com" + item.permalink

    except Exception:

        return ""

def register_notification(conn: sqlite3.Connection, id_key: str, reddit_kind: str, permalink: str, guten_id: int, score: float):

    cur = conn.cursor()

    try:

        cur.execute("INSERT INTO notified (id, reddit_kind, reddit_permalink, guten_id, score, created) VALUES (?, ?, ?, ?, ?, ?)",

                    (id_key, reddit_kind, permalink, guten_id, score, int(time.time())))

        conn.commit()

    except sqlite3.IntegrityError:

        # already recorded

        pass

def already_notified(conn: sqlite3.Connection, id_key: str) -> bool:

    cur = conn.cursor()

    cur.execute("SELECT 1 FROM notified WHERE id = ?", (id_key,))

    return cur.fetchone() is not None

def process_reddit_item(item):

    # item can be Comment or Submission

    text = getattr(item, "body", None) or getattr(item, "selftext", None) or getattr(item, "title", "") or ""

    if not text:

        return

    candidates = extract_candidates(text)

    if not candidates:

        return

    permalink = canonical_permalink_from_item(item)

    id_key_prefix = f"{item.fullname}"  # e.g. t1_..., t3_...

    notify_count = 0

    for title_cand, author_cand in candidates:

        results = score_candidate_against_metadata(title_cand, author_cand)

        best = pick_best_match(results)

        if not best:

            continue

        entry, score = best

        id_key = f"{id_key_prefix}|g{entry.gutenberg_id}"

        if already_notified(conn, id_key):

            logger.debug("Already notified for %s", id_key)

            continue

        # create message

        reply_text = build_reply_template(entry, permalink)

        subject = f"Gutenberg match: {entry.title}"

        # Notify configured users via PM

        if SEND_PM and NOTIFY_USERNAMES:

            pm_body = (

                f"Gutenberg match found for a reddit item: {permalink}\n\n"

                f"Title: {entry.title}\nAuthor: {entry.author_fullname or ', '.join(entry.authors)}\n"

                f"Project Gutenberg link: https://www.gutenberg.org/ebooks/{entry.gutenberg_id}\n\n"

                f"Confidence score: {score:.1f}\n\n"

                f"Suggested reply (you can paste this into the thread):\n\n{reply_text}"

            )

            notify_users_pm(reddit, NOTIFY_USERNAMES, subject, pm_body)

        # Optionally reply directly to the comment/post with the suggested text (if configured)

        if REPLY_IN_COMMENT:

            try:

                reply_to_item(item, reply_text)

            except Exception as e:

                logger.warning("Couldn't reply directly: %s", e)

        register_notification(conn, id_key, item.fullname.split("_")[0], permalink, entry.gutenberg_id, score)  # 't1'/'t3' prefix from the fullname; PRAW items have no .kind attribute

        notify_count += 1

        if notify_count >= MAX_NOTIFY_PER_POST:

            break

# ---------- Streams / Poll loop ----------

def monitor_subreddits():

    subreddits = "+".join([s.strip() for s in SUBREDDITS if s.strip()])

    logger.info("Monitoring subreddits: %s", subreddits)

    subreddit = reddit.subreddit(subreddits)

    # PRAW stream generators: pause_after=-1 makes each stream yield None once it

    # has no new items, so we can alternate between comments and submissions

    comment_stream = subreddit.stream.comments(skip_existing=True, pause_after=-1)

    submission_stream = subreddit.stream.submissions(skip_existing=True, pause_after=-1)

    while True:

        try:

            for comment in comment_stream:

                if comment is None:

                    break

                logger.debug("New comment %s", comment.id)

                process_reddit_item(comment)

            for submission in submission_stream:

                if submission is None:

                    break

                logger.debug("New submission %s", submission.id)

                process_reddit_item(submission)

        except Exception as e:

            logger.exception("Error while streaming: %s", e)

            time.sleep(10)

            # recreate the streams in case one of them died with the exception

            comment_stream = subreddit.stream.comments(skip_existing=True, pause_after=-1)

            submission_stream = subreddit.stream.submissions(skip_existing=True, pause_after=-1)

        time.sleep(POLL_INTERVAL)

# ---------- Command-line helpers ----------

def quick_test_on_text(text: str):

    cands = extract_candidates(text)

    print("Candidates:", cands)

    for t,a in cands:

        print("->", t, "|", a)

        res = score_candidate_against_metadata(t, a)

        if res:

            for entry,score in res[:3]:

                print(f"    {entry.title} (ID {entry.gutenberg_id}) score={score:.1f} author={entry.author_fullname}")

        else:

            print("    no matches")

if __name__ == "__main__":

    import argparse

    parser = argparse.ArgumentParser()

    parser.add_argument("--test-text", help="Run candidate extraction & match on a single text")

    parser.add_argument("--run", action="store_true", help="Run bot (streaming mode)")

    parser.add_argument("--load-csv", help="Load a different metadata csv")

    args = parser.parse_args()

    if args.load_csv:

        gutenberg_entries = load_metadata_from_csv(args.load_csv)

        title_to_entry_map = {e.title: e for e in gutenberg_entries}

        title_list = list(title_to_entry_map.keys())

        normalized_title_list = [normalize_text(t) for t in title_list]

    if args.test_text:

        quick_test_on_text(args.test_text)

    if args.run:

        monitor_subreddits()

Example

.env.example

REDDIT_CLIENT_ID=your_client_id

REDDIT_CLIENT_SECRET=your_client_secret

REDDIT_USERNAME=YourBotAccount

REDDIT_PASSWORD=bot_password

REDDIT_USER_AGENT=GutenBot/0.1 (by /u/YourReddit)

SUBREDDITS=books,askliterature

NOTIFY_USERNAMES=mod1,helper2

GUTEN_METADATA_CSV=/path/to/gutenberg_metadata.csv

MIN_SCORE=85

SIMILARITY_AUTHOR_BOOST=10

DB_PATH=/var/lib/gutenbot/gutenbot.sqlite

REPLY_IN_COMMENT=true

SEND_PM=true

POLL_INTERVAL=30

Metadata: obtaining Project Gutenberg metadata

You need a metadata file (CSV) with gutenberg_id,title,authors,language,subjects. Recommended approaches:

  • Use the Project Gutenberg RDF/catalog and convert to CSV (the script is not included here, but it’s straightforward to parse RDF or JSON dumps).
  • Or use community-maintained CSV exports of Gutenberg metadata (download once and refresh periodically). Store the CSV on the machine where the bot runs and point GUTEN_METADATA_CSV at it.

(If you want I can provide a separate script to convert the RDF catalog to CSV — tell me and I’ll generate it.)
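
For reference, a couple of rows in the expected format (values are illustrative; the loader also accepts language instead of lang, author instead of authors, and pipe- or semicolon-separated multi-value fields):

gutenberg_id,title,authors,author_fullname,lang,subjects
1184,The Count of Monte Cristo,Alexandre Dumas,Alexandre Dumas,en,Adventure stories|Revenge -- Fiction
84,"Frankenstein; Or, The Modern Prometheus",Mary Wollstonecraft Shelley,Mary Wollstonecraft Shelley,en,Science fiction|Horror tales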

Deploying & running

Options (short):

  • Docker container (recommended): build an image with the script, proper env variables, and run in a managed service (Railway, Fly.io, DigitalOcean App Platform).
  • Systemd service on a VPS: run in a virtualenv, create systemd unit that restarts on failure.
  • Heroku / similar: possible, but watch for worker dynos sleeping and limits on long-running background processes.

Minimal Dockerfile:

FROM python:3.11-slim

WORKDIR /app

COPY . /app

RUN pip install --no-cache-dir praw spacy rapidfuzz python-dotenv requests && \

    python -m spacy download en_core_web_sm

ENV PYTHONUNBUFFERED=1

CMD ["python", "gutenbot.py", "--run"]
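
Build and run (image name and host path are illustrative; point GUTEN_METADATA_CSV and DB_PATH in your .env at files inside the mounted directory, e.g. /var/lib/gutenbot/gutenberg_metadata.csv):

docker build -t gutenbot .

docker run -d --name gutenbot --env-file .env -v /srv/gutenbot:/var/lib/gutenbot gutenbot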

Systemd unit example (/etc/systemd/system/gutenbot.service):

[Unit]

Description=GutenBot Reddit notifier

After=network.target

[Service]

User=youruser

WorkingDirectory=/opt/gutenbot

EnvironmentFile=/opt/gutenbot/.env

ExecStart=/usr/bin/python3 /opt/gutenbot/gutenbot.py --run

Restart=always

RestartSec=5

[Install]

WantedBy=multi-user.target
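
After installing the unit file, run systemctl daemon-reload and then systemctl enable --now gutenbot so the service starts immediately and restarts on boot.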

Rate limits, politeness & Reddit rules

  • Use a dedicated bot account and clear user_agent.
  • Keep PM frequency low and respectful; avoid spamming users.
  • Respect Reddit API rate limits: PRAW has built-in rate-limit handling, but add pacing (the code uses small sleeps on PM sends); a retry-with-backoff sketch follows this list.
  • Ensure templates are polite and helpfully attribute the match and confidence.
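
A minimal retry-with-backoff sketch you could wrap around PM sends or replies (the helper name and delays are illustrative, not part of the script above):

import logging
import time

logger = logging.getLogger("gutenbot")

def send_with_backoff(send_fn, max_tries=4, base_delay=5):
    """Call send_fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_tries):
        try:
            return send_fn()
        except Exception:
            delay = base_delay * (2 ** attempt)
            logger.warning("Send failed; retrying in %s seconds", delay)
            time.sleep(delay)
    logger.error("Giving up after %d attempts", max_tries)

Usage (illustrative): send_with_backoff(lambda: reddit.redditor(user).message(subject=subject, message=body))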

Testing & quality checks

  1. Unit tests for extract_candidates() with various sample strings (quotes, italics, by-author); a pytest sketch follows this list.
  2. Unit tests for score_candidate_against_metadata() using a small sample metadata file.
  3. Live run in a private subreddit or with --test-text to verify matches and avoid public mistakes.
  4. Confirm DB persistence prevents duplicate notifications.
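
For example, a minimal pytest sketch for the extractor and normalizer (it assumes these helpers can be imported from gutenbot without triggering the Reddit setup, e.g. after splitting the script into modules; the sample strings are illustrative):

# test_gutenbot.py
from gutenbot import extract_candidates, normalize_text

def test_quoted_title_is_extracted():
    cands = extract_candidates('Has anyone read "Frankenstein" by Mary Shelley?')
    assert any("Frankenstein" in title for title, _ in cands)

def test_title_by_author_pattern_captures_author():
    cands = extract_candidates("I just finished Dracula by Bram Stoker.")
    assert any(author and "Stoker" in author for _, author in cands)

def test_normalize_text_lowercases_and_fixes_smart_quotes():
    assert normalize_text("“Moby-Dick”") == '"moby-dick"'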

Example inputs → outputs (samples)

Input comment:

“Has anyone read ‘The Count of Monte Cristo’ by Alexandre Dumas? Thinking of reading it next.”

Bot behavior:

  • Extracts candidate: “The Count of Monte Cristo” + author “Alexandre Dumas”
  • Matches entry Gutenberg ID 1184 (example) with score 95
  • Sends PM to configured reviewers with message containing the reddit permalink, Gutenberg link https://www.gutenberg.org/ebooks/1184, and suggested reply:

Hi — I think the book mentioned here matches **The Count of Monte Cristo** by **Alexandre Dumas**.

You can find it on Project Gutenberg: https://www.gutenberg.org/ebooks/1184

I thought this was relevant because it appears to directly relate to the title/author you mentioned.

Logging, monitoring & maintainability

  • Use rotating file logs or forward logs to an aggregator (Papertrail/LogDNA).
  • Add a /health HTTP endpoint (small Flask server) for readiness checks if running in containers; a minimal sketch follows this list.
  • Periodically refresh the Gutenberg metadata (cron job) and rebuild the CSV/SQLite index.
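
A minimal sketch of such an endpoint, run as a small separate process or thread alongside the bot (route and port are illustrative):

# health.py
from flask import Flask

app = Flask(__name__)

@app.route("/health")
def health():
    return {"status": "ok"}, 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)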

Improvements & advanced ideas (next steps)

  • Use a real vector search (FAISS) on canonicalized title embeddings for more advanced matching.
  • Add language detection to avoid matching non-English metadata (one possible shape is sketched after this list).
  • Provide a confidence explanation (why matched: token overlap, author match).
  • Implement an opt-in mechanism for notified users (so you don’t PM people who don’t want messages).
  • Add rate-limiting and exponential-backoff logic when replying/PMing to avoid account suspension.
  • Add a web dashboard for recent matches, manual approve/reject flows.
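
One possible shape for the language-detection idea, using the langdetect package (not a dependency of the script above; a sketch only):

from langdetect import detect

def looks_english(text: str) -> bool:
    # detect() raises on very short or non-linguistic input; treat that as "not English"
    try:
        return detect(text) == "en"
    except Exception:
        return False

You could call this on the Reddit text before extraction, and/or restrict gutenberg_entries to rows whose lang field matches the detected language.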

Security & privacy

  • Keep Reddit credentials secret (use environment variables or a secret manager).
  • Store minimal data: only what’s needed to avoid duplicate notifications (no user PII beyond Reddit usernames).
  • Do not auto-post links to full-text downloads without checking the license. Project Gutenberg texts are public domain, so that is fine, but be careful with other sources.

Wrap-up / next steps for you

  1. Download or create a Gutenberg metadata CSV and set GUTEN_METADATA_CSV.
  2. Create a Reddit app & bot account; fill .env.
  3. Install dependencies and run with python gutenbot.py --test-text "..." to verify.
  4. Run in a private subreddit until you’re happy with precision/thresholds, then enable live monitors.

If you want, I can:

  • Generate a small sample gutenberg_metadata.csv (100 entries) so you can test quickly.
  • Provide a script to convert Gutenberg RDF catalog to CSV.
  • Provide unit tests for the extractor & matcher (pytest).

Which of those would you like next?