User:Rossolson/GutenBot
https://chatgpt.com/s/t_68e4c854dcc0819197e5021d65cafbae
Below is a complete, ready-to-run design plus production-ready Python code you can copy/paste, deployment & maintenance notes, tests, and templates for the bot’s user-facing messages. I aim for a clear, pragmatic implementation you can stand up quickly and iterate on.
Summary / architecture (short)
- Ingest: read new posts & comments from selected subreddits via PRAW (Reddit API).
- Extract: find candidate book titles and authors from raw text using a mix of rules (quotes, italics, “Title — by Author”) and an NLP approach (spaCy named-entity recognition plus capitalization heuristics).
- Match: fuzzy-match candidates against a local Project Gutenberg metadata index (CSV/SQLite) using RapidFuzz for fast similarity scoring.
- Notify: for high-confidence matches, send messages to a configured list of Reddit users (PMs) and store the notification in SQLite to avoid duplicates. Optionally also reply to the comment/post with a ready-to-paste message.
- Monitor & maintain: logging, rate-limit handling, health endpoint, auto-restart via systemd / Docker, and periodic metadata refresh.
What I provide here
- Full Python implementation (single script plus helper modules) using:
  - PRAW (Reddit)
  - spaCy (or fallback lightweight rules)
  - RapidFuzz (fuzzy matching)
  - SQLite (persistence)
  - requests (optional, to download/update Gutenberg metadata)
- Configuration examples
- Deployment recommendations (Dockerfile, systemd)
- Testing checklist and unit-test examples
- Message templates you can paste directly into Reddit replies
Required external pieces (you must supply or download once)
- A Project Gutenberg metadata file (CSV or SQLite). The code below can load:
  - a local CSV with columns: gutenberg_id,title,authors,author_fullname,lang,subjects (a two-row sample appears just after this list).
  - OR you can ingest a JSON/RDF dump from Project Gutenberg and convert it to CSV (a conversion sketch is included in the Metadata section below).
- Reddit app credentials (client_id, client_secret, user_agent, username, password or use OAuth refresh token).
- A list of usernames to notify (in config).
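For reference, a metadata CSV in the expected shape might look like this (two illustrative rows; double-check IDs and fields against whatever catalog you actually download):
gutenberg_id,title,authors,author_fullname,lang,subjects
1184,The Count of Monte Cristo,Alexandre Dumas,Alexandre Dumas,en,Adventure stories|Historical fiction
1342,Pride and Prejudice,Jane Austen,Jane Austen,en,England -- Social life and customs -- Fiction|Love stories
Note that the loader splits the authors field on commas and pipes, so keep names in “First Last” form or use a pipe separator.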
Key design choices (why)
- Use both rule-based patterns and spaCy entity matching: book titles are often quoted or Title-Cased, but not always, so the combined approach raises recall.
- Fuzzy matching (RapidFuzz) with normalized strings yields robust matches against metadata variations (punctuation, subtitles); a short demonstration follows this list.
- SQLite persistence prevents duplicate notifications and keeps system lightweight.
- Minimal permissions and polite interactions to avoid spam: bot will only notify on high-confidence matches (configurable threshold) and keep a cooldown per source/post.
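To make the fuzzy-matching choice concrete, here is a tiny standalone illustration of why token_set_ratio is used: partial or re-ordered mentions of a title still score at or near 100.
from rapidfuzz import fuzz

# A partial mention is a token subset of the full title, so token_set_ratio gives 100.
print(fuzz.token_set_ratio("monte cristo", "the count of monte cristo"))               # 100
# Word order does not matter either.
print(fuzz.token_set_ratio("count of monte cristo the", "the count of monte cristo"))  # 100
# An unrelated title shares almost no tokens and scores far below the MIN_SCORE threshold.
print(fuzz.token_set_ratio("a tale of two cities", "the count of monte cristo"))       # low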
Full implementation
Save this as gutenbot.py (or split into modules). I tried to keep it self-contained and well commented.
#!/usr/bin/env python3
"""
gutenbot.py - A Reddit bot that detects book titles/authors and matches them
to Project Gutenberg entries, notifying configured Reddit users.
Requirements (pip):
pip install praw spacy rapidfuzz python-dotenv requests
python -m spacy download en_core_web_sm
Configuration via .env (see .env.example below) or environment variables.
"""
import os
import re
import time
import json
import logging
import sqlite3
from dataclasses import dataclass
from typing import List, Optional, Tuple, Dict, Any
import praw
import requests
from rapidfuzz import fuzz, process
# Optional: spaCy for better NER (fallback to rules if not available)
try:
import spacy
nlp = spacy.load("en_core_web_sm")
except Exception:
nlp = None
# Load environment variables
from dotenv import load_dotenv
load_dotenv()
# ---------- CONFIG ----------
REDDIT_CLIENT_ID = os.getenv("REDDIT_CLIENT_ID")
REDDIT_CLIENT_SECRET = os.getenv("REDDIT_CLIENT_SECRET")
REDDIT_USERNAME = os.getenv("REDDIT_USERNAME")
REDDIT_PASSWORD = os.getenv("REDDIT_PASSWORD")
REDDIT_USER_AGENT = os.getenv("REDDIT_USER_AGENT", "GutenBot/0.1 by " + (REDDIT_USERNAME or "gutenbot"))
SUBREDDITS = os.getenv("SUBREDDITS", "bookclub,books").split(",") # comma-separated
NOTIFY_USERNAMES = [u.strip() for u in os.getenv("NOTIFY_USERNAMES", "").split(",") if u.strip()]
GUTEN_METADATA_CSV = os.getenv("GUTEN_METADATA_CSV", "gutenberg_metadata.csv")
MIN_SCORE = float(os.getenv("MIN_SCORE", "85")) # RapidFuzz score threshold for title matching
SIMILARITY_AUTHOR_BOOST = float(os.getenv("SIMILARITY_AUTHOR_BOOST", "10")) # extra points if author also matches
DB_PATH = os.getenv("DB_PATH", "gutenbot.sqlite")
POLL_INTERVAL = int(os.getenv("POLL_INTERVAL", "30")) # seconds between checks for new stream items
REPLY_IN_COMMENT = os.getenv("REPLY_IN_COMMENT", "true").lower() in ("true", "1", "yes")
SEND_PM = os.getenv("SEND_PM", "true").lower() in ("true", "1", "yes")
MAX_NOTIFY_PER_POST = int(os.getenv("MAX_NOTIFY_PER_POST", "3")) # prevent spam
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("gutenbot")
# ---------- DB ----------
def init_db(conn: sqlite3.Connection):
cur = conn.cursor()
cur.execute("""
CREATE TABLE IF NOT EXISTS notified (
id TEXT PRIMARY KEY, -- e.g. reddit_fullname or source link + guten_id
reddit_kind TEXT, -- 't1' for comment, 't3' for post
reddit_permalink TEXT,
guten_id INTEGER,
score REAL,
created INTEGER
)
""")
conn.commit()
conn = sqlite3.connect(DB_PATH, check_same_thread=False)
init_db(conn)
# ---------- Metadata Loader ----------
@dataclass
class GutenbergEntry:
gutenberg_id: int
title: str
authors: List[str] # one or more author names
author_fullname: str # primary author full name (if available)
lang: str
subjects: List[str]
def normalize_text(s: str) -> str:
s = s or ""
s = s.strip()
s = s.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
s = re.sub(r"\s+", " ", s)
s = re.sub(r"[^0-9A-Za-z'\"&\s\-:,\.\?]", "", s)
return s.lower()
def load_metadata_from_csv(path: str) -> List[GutenbergEntry]:
import csv
entries = []
if not os.path.exists(path):
logger.error("Gutenberg metadata CSV not found at %s", path)
return entries
with open(path, newline='', encoding='utf-8') as fh:
reader = csv.DictReader(fh)
for r in reader:
try:
gid = int(r.get("gutenberg_id") or r.get("id") or 0)
title = r.get("title", "")
authors = []
a = r.get("authors") or r.get("author") or r.get("author_fullname") or ""
# authors may be pipe/comma separated
for part in re.split(r"[|,;/]+", a):
part = part.strip()
if part:
authors.append(part)
lang = r.get("language") or r.get("lang") or "en"
subjects = []
subj = r.get("subjects") or r.get("subject") or ""
for s in re.split(r"[|;]+", subj):
s = s.strip()
if s:
subjects.append(s)
entries.append(GutenbergEntry(gutenberg_id=gid, title=title, authors=authors,
author_fullname=authors[0] if authors else "", lang=lang, subjects=subjects))
except Exception as e:
logger.exception("Error parsing row: %s", e)
logger.info("Loaded %d Gutenberg metadata entries", len(entries))
return entries
gutenberg_entries = load_metadata_from_csv(GUTEN_METADATA_CSV)
# Build a quick title lookup list for rapidfuzz
title_to_entry_map = {e.title: e for e in gutenberg_entries}
title_list = list(title_to_entry_map.keys())
# ---------- Extraction: find candidate titles/authors ----------
TITLE_QUOTE_RE = re.compile(r'["“”‘’](.+?)["“”‘’]') # text in quotes
PAREN_TITLE_RE = re.compile(r'(.+?)\s*\(by\s+([A-Za-z ,.\-]+)\)', re.I) # "Title (by Author)"
BY_AUTHOR_RE = re.compile(r'(.{2,120}?)\s+by\s+([A-Z][A-Za-z\-\.\s]{2,80})', re.I) # "Title by Author"
ITALIC_MARK_RE = re.compile(r'\*(.+?)\*|_(.+?)_|<i>(.+?)</i>', re.I)
def extract_candidates(text: str) -> List[Tuple[str, Optional[str]]]:
"""
Returns list of (title_candidate, author_candidate_or_None)
"""
candidates = []
text = text.strip()
# 1) quoted strings
for m in TITLE_QUOTE_RE.finditer(text):
t = m.group(1).strip()
if len(t) >= 2 and len(t) <= 200:
candidates.append((t, None))
# 2) italic markers like *Title*
for m in ITALIC_MARK_RE.finditer(text):
t = next(g for g in m.groups() if g)
candidates.append((t.strip(), None))
# 3) "Title (by Author)" pattern
for m in PAREN_TITLE_RE.finditer(text):
candidates.append((m.group(1).strip(), m.group(2).strip()))
# 4) "Title by Author" pattern but restrict to short author tokens
for m in BY_AUTHOR_RE.finditer(text):
title = m.group(1).strip()
author = m.group(2).strip()
if 3 <= len(title) <= 200:
candidates.append((title, author))
# 5) spaCy: look for WORK_OF_ART entities or TITLE-like proper noun sequences
if nlp:
try:
doc = nlp(text)
for ent in doc.ents:
if ent.label_ in ("WORK_OF_ART", "WORK", "PRODUCT", "EVENT"):
candidates.append((ent.text.strip(), None))
# also heuristics: consecutive PROPN, TITLE CASE sequences
tokens = [t for t in doc]
for i in range(len(tokens)-1):
if tokens[i].text[0].isupper() and tokens[i+1].text[0].isupper():
# build sequence
j = i
seq = []
while j < len(tokens) and tokens[j].text[0].isupper() and len(seq) < 8:
seq.append(tokens[j].text)
j += 1
if len(seq) >= 2:
cand = " ".join(seq)
candidates.append((cand, None))
except Exception:
pass
# dedupe while preserving order
seen = set()
uniq = []
for t,a in candidates:
key = (t.lower(), (a or "").lower())
if key not in seen:
seen.add(key)
uniq.append((t,a))
return uniq
# ---------- Matching ----------
def score_candidate_against_metadata(title_cand: str, author_cand: Optional[str]) -> List[Tuple[GutenbergEntry, float]]:
"""
Returns a list of (GutenbergEntry, score) sorted desc. Score is in [0,100].
"""
if not title_cand or len(title_cand) < 2:
return []
norm = normalize_text(title_cand)
# rapidfuzz process.extract with partial_ratio / token_set_ratio is helpful
# use process.extract with scorer token_set_ratio for robust matching
matches = process.extract(norm, [normalize_text(t) for t in title_list],
scorer=fuzz.token_set_ratio, limit=10)
results = []
for match_string, score, idx in matches:
title_orig = title_list[idx]
entry = title_to_entry_map[title_orig]
sc = score
# boost if author candidate matches an author
if author_cand and entry.authors:
auth_norm = normalize_text(author_cand)
best_author = max(entry.authors, key=lambda a: fuzz.token_set_ratio(auth_norm, normalize_text(a)))
auth_score = fuzz.token_set_ratio(auth_norm, normalize_text(best_author))
if auth_score > 40:
sc = min(100.0, sc + SIMILARITY_AUTHOR_BOOST * (auth_score/100.0))
results.append((entry, sc))
results.sort(key=lambda x: x[1], reverse=True)
return results
def pick_best_match(results: List[Tuple[GutenbergEntry, float]]) -> Optional[Tuple[GutenbergEntry, float]]:
if not results:
return None
best = results[0]
if best[1] >= MIN_SCORE:
return best
return None
# ---------- Reddit / PRAW Setup ----------
def get_reddit():
return praw.Reddit(client_id=REDDIT_CLIENT_ID,
client_secret=REDDIT_CLIENT_SECRET,
username=REDDIT_USERNAME,
password=REDDIT_PASSWORD,
user_agent=REDDIT_USER_AGENT)
# Only build the Reddit client when credentials are configured, so --test-text works offline.
reddit = None
if REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET:
    reddit = get_reddit()
else:
    logger.warning("Reddit credentials not set; PM/reply features and --run will not work")
# ---------- Notification & reply templates ----------
def build_reply_template(entry: GutenbergEntry, source_permalink: str) -> str:
# Nice compact reply suitable for pasting
url = f"https://www.gutenberg.org/ebooks/{entry.gutenberg_id}"
template = (
f"Hi — I think the book mentioned here matches **{entry.title}** "
f"by **{entry.author_fullname or ', '.join(entry.authors)}**.\n\n"
f"You can find it on Project Gutenberg: {url}\n\n"
f"I thought this was relevant because it appears to directly relate to the title/author you mentioned."
)
return template
def notify_users_pm(reddit_obj, recipients: List[str], subject: str, body: str):
    for user in recipients:
        try:
            # Recent PRAW versions require keyword arguments for Redditor.message()
            reddit_obj.redditor(user).message(subject=subject, message=body)
            logger.info("Sent PM to %s", user)
            time.sleep(2)  # gentle pacing between PMs
        except Exception as e:
            logger.exception("Failed to send PM to %s: %s", user, e)
def reply_to_item(item, body: str):
try:
item.reply(body)
logger.info("Replied to %s", getattr(item, "id", "<unknown>"))
except Exception as e:
logger.exception("Failed to reply: %s", e)
# ---------- Processing a single Reddit item ----------
def canonical_permalink_from_item(item) -> str:
# PRAW provides .permalink for comments/posts; ensure full url
try:
return "https://www.reddit.com" + item.permalink
except Exception:
return ""
def register_notification(conn: sqlite3.Connection, id_key: str, reddit_kind: str, permalink: str, guten_id: int, score: float):
cur = conn.cursor()
try:
cur.execute("INSERT INTO notified (id, reddit_kind, reddit_permalink, guten_id, score, created) VALUES (?, ?, ?, ?, ?, ?)",
(id_key, reddit_kind, permalink, guten_id, score, int(time.time())))
conn.commit()
except sqlite3.IntegrityError:
# already recorded
pass
def already_notified(conn: sqlite3.Connection, id_key: str) -> bool:
cur = conn.cursor()
cur.execute("SELECT 1 FROM notified WHERE id = ?", (id_key,))
return cur.fetchone() is not None
def process_reddit_item(item):
# item can be Comment or Submission
    text = getattr(item, "body", None) or getattr(item, "selftext", None) or getattr(item, "title", None) or ""
if not text:
return
candidates = extract_candidates(text)
if not candidates:
return
permalink = canonical_permalink_from_item(item)
id_key_prefix = f"{item.fullname}" # e.g. t1_..., t3_...
notify_count = 0
for title_cand, author_cand in candidates:
results = score_candidate_against_metadata(title_cand, author_cand)
best = pick_best_match(results)
if not best:
continue
entry, score = best
id_key = f"{id_key_prefix}|g{entry.gutenberg_id}"
if already_notified(conn, id_key):
logger.debug("Already notified for %s", id_key)
continue
# create message
reply_text = build_reply_template(entry, permalink)
subject = f"Gutenberg match: {entry.title}"
# Notify configured users via PM
if SEND_PM and NOTIFY_USERNAMES:
pm_body = (
f"Gutenberg match found for a reddit item: {permalink}\n\n"
f"Title: {entry.title}\nAuthor: {entry.author_fullname or ', '.join(entry.authors)}\n"
f"Project Gutenberg link: https://www.gutenberg.org/ebooks/{entry.gutenberg_id}\n\n"
f"Confidence score: {score:.1f}\n\n"
f"Suggested reply (you can paste this into the thread):\n\n{reply_text}"
)
notify_users_pm(reddit, NOTIFY_USERNAMES, subject, pm_body)
# Optionally reply directly to the comment/post with the suggested text (if configured)
if REPLY_IN_COMMENT:
try:
reply_to_item(item, reply_text)
except Exception as e:
logger.warning("Couldn't reply directly: %s", e)
        # derive the kind ('t1'/'t3') from the fullname; PRAW items have no .kind attribute
        register_notification(conn, id_key, item.fullname.split("_")[0], permalink, entry.gutenberg_id, score)
notify_count += 1
if notify_count >= MAX_NOTIFY_PER_POST:
break
# ---------- Streams / Poll loop ----------
def monitor_subreddits():
    subreddits = "+".join([s.strip() for s in SUBREDDITS if s.strip()])
    logger.info("Monitoring subreddits: %s", subreddits)
    subreddit = reddit.subreddit(subreddits)
    # pause_after=0 makes each stream yield None once it has no new items, so we can
    # alternate between comments and submissions in a single loop instead of getting
    # stuck inside the (otherwise endless) comment stream.
    comment_stream = subreddit.stream.comments(skip_existing=True, pause_after=0)
    submission_stream = subreddit.stream.submissions(skip_existing=True, pause_after=0)
    while True:
        try:
            for comment in comment_stream:
                if comment is None:
                    break
                logger.debug("New comment %s", comment.id)
                process_reddit_item(comment)
            for submission in submission_stream:
                if submission is None:
                    break
                logger.debug("New submission %s", submission.id)
                process_reddit_item(submission)
            time.sleep(POLL_INTERVAL)
        except Exception as e:
            logger.exception("Error while streaming: %s", e)
            time.sleep(10)
            # recreate the streams in case the generators were broken by the error
            comment_stream = subreddit.stream.comments(skip_existing=True, pause_after=0)
            submission_stream = subreddit.stream.submissions(skip_existing=True, pause_after=0)
# ---------- Command-line helpers ----------
def quick_test_on_text(text: str):
cands = extract_candidates(text)
print("Candidates:", cands)
for t,a in cands:
print("->", t, "|", a)
res = score_candidate_against_metadata(t, a)
if res:
for entry,score in res[:3]:
print(f" {entry.title} (ID {entry.gutenberg_id}) score={score:.1f} author={entry.author_fullname}")
else:
print(" no matches")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--test-text", help="Run candidate extraction & match on a single text")
parser.add_argument("--run", action="store_true", help="Run bot (streaming mode)")
parser.add_argument("--load-csv", help="Load a different metadata csv")
args = parser.parse_args()
if args.load_csv:
gutenberg_entries = load_metadata_from_csv(args.load_csv)
title_to_entry_map = {e.title: e for e in gutenberg_entries}
title_list = list(title_to_entry_map.keys())
if args.test_text:
quick_test_on_text(args.test_text)
if args.run:
monitor_subreddits()
Example configuration (.env.example)
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
REDDIT_USERNAME=YourBotAccount
REDDIT_PASSWORD=bot_password
REDDIT_USER_AGENT=GutenBot/0.1 (by /u/YourReddit)
SUBREDDITS=books,askliterature
NOTIFY_USERNAMES=mod1,helper2
GUTEN_METADATA_CSV=/path/to/gutenberg_metadata.csv
MIN_SCORE=85
SIMILARITY_AUTHOR_BOOST=10
DB_PATH=/var/lib/gutenbot/gutenbot.sqlite
REPLY_IN_COMMENT=true
SEND_PM=true
POLL_INTERVAL=30
Metadata: obtaining Project Gutenberg metadata
You need a metadata file (CSV) with columns gutenberg_id, title, authors, language (or lang), and subjects; author_fullname is optional, and the loader falls back to the first author. Recommended approaches:
- Use the Project Gutenberg RDF catalog and convert it to CSV (a rough conversion sketch follows below).
- Or use community-maintained CSV exports of Gutenberg metadata (download once and refresh periodically). Store the CSV on the machine where the bot runs and point GUTEN_METADATA_CSV at it.
(If you want I can provide a separate script to convert the RDF catalog to CSV — tell me and I’ll generate it.)
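Until then, a rough sketch of such a converter might look like the following. It assumes the standard rdf-files.tar.bz2 catalog layout and the dcterms/pgterms namespaces; the element paths and file names are assumptions to verify against the current dump before relying on the output.
#!/usr/bin/env python3
"""Rough sketch: convert the Project Gutenberg RDF catalog to the CSV gutenbot expects.
Assumes the rdf-files.tar.bz2 layout (one pg<ID>.rdf per book); verify element paths
against the current dump before trusting the output."""
import csv
import re
import tarfile
import xml.etree.ElementTree as ET

NS = {
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

def flip_name(name: str) -> str:
    # PG names are usually "Last, First"; flip to "First Last" (rough heuristic)
    parts = [p.strip() for p in name.split(",")]
    return " ".join(reversed(parts)) if len(parts) == 2 else name

def parse_rdf(data: bytes):
    root = ET.fromstring(data)
    ebook = root.find("pgterms:ebook", NS)
    if ebook is None:
        return None
    gid = re.sub(r"\D", "", ebook.get(f"{{{NS['rdf']}}}about", ""))
    title = (ebook.findtext("dcterms:title", default="", namespaces=NS) or "").replace("\n", " ")
    authors = [flip_name(n.text) for n in ebook.findall(".//pgterms:agent/pgterms:name", NS) if n.text]
    lang = ebook.findtext("dcterms:language/rdf:Description/rdf:value", default="en", namespaces=NS)
    subjects = [v.text for v in ebook.findall("dcterms:subject/rdf:Description/rdf:value", NS) if v.text]
    if not gid or not title:
        return None
    return {"gutenberg_id": gid, "title": title, "authors": "|".join(authors),
            "author_fullname": authors[0] if authors else "", "lang": lang,
            "subjects": "|".join(subjects)}

def convert(tar_path: str, csv_path: str):
    fields = ["gutenberg_id", "title", "authors", "author_fullname", "lang", "subjects"]
    with tarfile.open(tar_path, "r:bz2") as tar, open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        for member in tar:
            if not member.name.endswith(".rdf"):
                continue
            row = parse_rdf(tar.extractfile(member).read())
            if row:
                writer.writerow(row)

if __name__ == "__main__":
    convert("rdf-files.tar.bz2", "gutenberg_metadata.csv")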
Deploying & running
Options (short):
- Docker container (recommended): build an image with the script, proper env variables, and run in a managed service (Railway, Fly.io, DigitalOcean App Platform).
- Systemd service on a VPS: run in a virtualenv, create systemd unit that restarts on failure.
- Heroku / similar: possible but watch worker sleeping limits and background processes.
Minimal Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir praw spacy rapidfuzz python-dotenv requests && \
python -m spacy download en_core_web_sm
ENV PYTHONUNBUFFERED=1
CMD ["python", "gutenbot.py", "--run"]
Systemd unit example (/etc/systemd/system/gutenbot.service):
[Unit]
Description=GutenBot Reddit notifier
After=network.target
[Service]
User=youruser
WorkingDirectory=/opt/gutenbot
EnvironmentFile=/opt/gutenbot/.env
ExecStart=/usr/bin/python3 /opt/gutenbot/gutenbot.py --run
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Rate limits, politeness & Reddit rules
- Use a dedicated bot account and clear user_agent.
- Keep PM frequency low and respectful; avoid spamming users.
- Respect Reddit API rate limits: PRAW has built-in rate-limit handling, but add your own pacing on top (the code sleeps briefly between PM sends).
- Keep templates polite, and have them clearly state the matched title/author and the confidence score.
Testing & quality checks
- Unit tests for extract_candidates() with various sample strings (quotes, italics, "Title by Author"); see the pytest sketch after this list.
- Unit tests for score_candidate_against_metadata() using a small sample metadata set.
- Live run in a private subreddit or with --test-text to verify matches and avoid public mistakes.
- Confirm DB persistence prevents duplicate notifications.
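A few illustrative pytest cases (they assume gutenbot.py is importable from the test directory; if the module-level Reddit or metadata setup complains on import in your environment, stub or guard it first):
# test_gutenbot.py - illustrative tests; assumes gutenbot.py is importable
import gutenbot

def test_extract_quoted_title_with_author():
    text = 'Has anyone read "The Count of Monte Cristo" by Alexandre Dumas?'
    titles = [t for t, _ in gutenbot.extract_candidates(text)]
    assert "The Count of Monte Cristo" in titles

def test_extract_italic_title():
    cands = gutenbot.extract_candidates("I just finished *Frankenstein* last night.")
    assert ("Frankenstein", None) in cands

def test_match_against_small_catalog(monkeypatch):
    entry = gutenbot.GutenbergEntry(gutenberg_id=1184, title="The Count of Monte Cristo",
                                    authors=["Alexandre Dumas"], author_fullname="Alexandre Dumas",
                                    lang="en", subjects=[])
    # point the matcher at a one-entry catalog
    monkeypatch.setattr(gutenbot, "title_list", [entry.title])
    monkeypatch.setattr(gutenbot, "title_to_entry_map", {entry.title: entry})
    results = gutenbot.score_candidate_against_metadata("count of monte cristo", "Dumas")
    assert results and results[0][0].gutenberg_id == 1184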
Example inputs → outputs (samples)
Input comment:
“Has anyone read ‘The Count of Monte Cristo’ by Alexandre Dumas? Thinking of reading it next.”
Bot behavior:
- Extracts candidate: “The Count of Monte Cristo” + author “Alexandre Dumas”
- Matches entry Gutenberg ID 1184 (example) with score 95
- Sends PM to configured reviewers with message containing the reddit permalink, Gutenberg link https://www.gutenberg.org/ebooks/1184, and suggested reply:
Hi — I think the book mentioned here matches **The Count of Monte Cristo** by **Alexandre Dumas**.
You can find it on Project Gutenberg: https://www.gutenberg.org/ebooks/1184
I thought this was relevant because it appears to directly relate to the title/author you mentioned.
Logging, monitoring & maintainability
- Use rotating file logs or forward logs to an aggregator (Papertrail/LogDNA).
- Add a /health HTTP endpoint (a small Flask or stdlib server) for readiness checks if running in containers; a minimal stdlib sketch follows this list.
- Periodically refresh the Gutenberg metadata (cron job) and rebuild the CSV/SQLite index.
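A minimal /health sketch using only the standard library (no extra dependency); the port is illustrative, and you would call start_health_server() in gutenbot's __main__ before monitor_subreddits():
# health.py - minimal /health endpoint sketch (stdlib only)
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the bot's logs clean

def start_health_server(port: int = 8080):
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server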
Improvements & advanced ideas (next steps)
- Use a real vector search (FAISS) on canonicalized title embeddings for more advanced matching.
- Add language detection to avoid matching non-English metadata.
- Provide a confidence explanation (why matched: token overlap, author match).
- Implement an opt-in mechanism for notified users (so you don’t PM people who don’t want messages).
- Rate-limit and exponential-backoff logic when replying/PMing to avoid account suspension (a small retry helper sketch follows this list).
- Add a web dashboard for recent matches, manual approve/reject flows.
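For the backoff item above, a small retry helper might look like this (a sketch; wrap reply/PM calls with it and tune retries and delays to taste):
import logging
import random
import time

logger = logging.getLogger("gutenbot")

def with_backoff(fn, *args, retries=4, base_delay=5.0, **kwargs):
    """Call fn, retrying with exponential backoff plus jitter on any exception."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            logger.warning("Call failed, retrying in %.1fs", delay)
            time.sleep(delay)

# usage inside process_reddit_item: with_backoff(item.reply, reply_text)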
Security & privacy
- Keep Reddit credentials secret (use environment variables or a secret manager).
- Store minimal data: only what’s needed to avoid duplicate notifications (no user PII beyond Reddit usernames).
- Do not auto-post links to full-text downloads without checking the license; Project Gutenberg texts are public domain, so they are fine, but be careful with other sources.
Wrap-up / next steps for you
- Download or create a Gutenberg metadata CSV and set GUTEN_METADATA_CSV.
- Create a Reddit app & bot account; fill .env.
- Install dependencies and run with python gutenbot.py --test-text "..." to verify.
- Run in a private subreddit until you’re happy with precision/thresholds, then enable live monitors.
If you want, I can:
- Generate a small sample gutenberg_metadata.csv (100 entries) so you can test quickly.
- Provide a script to convert Gutenberg RDF catalog to CSV.
- Provide unit tests for the extractor & matcher (pytest).
Which of those would you like next?