User:Rossolson/GutenBot


https://chatgpt.com/s/t_68e4c854dcc0819197e5021d65cafbae


Below is a complete, ready-to-run design plus production-ready Python code you can copy and paste, along with deployment & maintenance notes, tests, and templates for the bot’s user-facing messages. I aim for a clear, pragmatic implementation you can stand up quickly and iterate on.

Summary / architecture (short)

  1. Ingest: read new posts & comments from selected subreddits via PRAW (Reddit API).
  2. Extract: find candidate book titles and authors from raw text using a mix of rules (quotes, italics, “Title — by Author”) and an NLP Named-Entity approach (spaCy pattern matcher).
  3. Match: fuzzy-match candidates against a local Project Gutenberg metadata index (CSV/SQLite) using RapidFuzz for fast similarity scoring.
  4. Notify: for high-confidence matches, send messages to a configured list of Reddit users (PMs) and store the notification in SQLite to avoid duplicates. Optionally also reply to the comment/post with a ready-to-paste message.
  5. Monitor & maintain: logging, rate-limit handling, health endpoint, auto-restart via systemd / Docker, and periodic metadata refresh.

What I provide here

  • Full Python implementation (single script plus helper modules) using:
    • PRAW (Reddit)
    • spaCy (or fallback lightweight rules)
    • RapidFuzz (fuzzy matching)
    • SQLite (persistence)
    • requests (optional, to download/update Gutenberg metadata)
  • Configuration examples
  • Deployment recommendations (Dockerfile, systemd)
  • Testing checklist and unit-test examples
  • Message templates you can paste directly into Reddit replies

Required external pieces (you must supply or download once)

  1. A Project Gutenberg metadata file (CSV or SQLite). The code below can load:
    • a local CSV with columns: gutenberg_id,title,authors,author_fullname,lang,subjects.
    • OR you can ingest a JSON/RDF dump from Project Gutenberg and convert to CSV (script included).
  2. Reddit app credentials (client_id, client_secret, user_agent, username, password or use OAuth refresh token).
  3. A list of usernames to notify (in config).

Key design choices (why)

  • Use both rule-based patterns and spaCy entity matching: book titles are often quoted or Title-Cased, but not always, so the combined approach raises recall.
  • Fuzzy matching (RapidFuzz) with normalized strings yields robust matches against metadata variations such as punctuation and subtitles (see the short example after this list).
  • SQLite persistence prevents duplicate notifications and keeps system lightweight.
  • Minimal permissions and polite interactions to avoid spam: bot will only notify on high-confidence matches (configurable threshold) and keep a cooldown per source/post.
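
As a quick illustration of why token_set_ratio suits this job (a minimal sketch; the scores rely on RapidFuzz's default behaviour, where one title's tokens being a subset of the other's yields a full score):

from rapidfuzz import fuzz

# A subtitle or edition note on the catalog side does not lower the score when
# the mentioned title's tokens are a subset of the catalog title's tokens:
print(fuzz.token_set_ratio("the count of monte cristo", "the count of monte cristo illustrated"))  # 100.0
print(fuzz.token_set_ratio("frankenstein", "frankenstein or the modern prometheus"))               # 100.0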

Full implementation

Save this as gutenbot.py (or split into modules). I tried to keep it self-contained and well commented.

#!/usr/bin/env python3

"""

gutenbot.py - A Reddit bot that detects book titles/authors and matches them

to Project Gutenberg entries, notifying configured Reddit users.

Requirements (pip):

  pip install praw spacy rapidfuzz python-dotenv requests

  python -m spacy download en_core_web_sm

Configuration via .env (see .env.example below) or environment variables.

"""

import os

import re

import time

import json

import logging

import sqlite3

from dataclasses import dataclass

from typing import List, Optional, Tuple, Dict, Any

import praw

import requests

from rapidfuzz import fuzz, process

# Optional: spaCy for better NER (fallback to rules if not available)

try:

    import spacy

    nlp = spacy.load("en_core_web_sm")

except Exception:

    nlp = None

# Load environment variables

from dotenv import load_dotenv

load_dotenv()

# ---------- CONFIG ----------

REDDIT_CLIENT_ID = os.getenv("REDDIT_CLIENT_ID")

REDDIT_CLIENT_SECRET = os.getenv("REDDIT_CLIENT_SECRET")

REDDIT_USERNAME = os.getenv("REDDIT_USERNAME")

REDDIT_PASSWORD = os.getenv("REDDIT_PASSWORD")

REDDIT_USER_AGENT = os.getenv("REDDIT_USER_AGENT", "GutenBot/0.1 by " + (REDDIT_USERNAME or "gutenbot"))

SUBREDDITS = os.getenv("SUBREDDITS", "bookclub,books").split(",")  # comma-separated

NOTIFY_USERNAMES = [u.strip() for u in os.getenv("NOTIFY_USERNAMES", "").split(",") if u.strip()]

GUTEN_METADATA_CSV = os.getenv("GUTEN_METADATA_CSV", "gutenberg_metadata.csv")

MIN_SCORE = float(os.getenv("MIN_SCORE", "85"))  # RapidFuzz score threshold for title matching

SIMILARITY_AUTHOR_BOOST = float(os.getenv("SIMILARITY_AUTHOR_BOOST", "10"))  # extra points if author also matches

DB_PATH = os.getenv("DB_PATH", "gutenbot.sqlite")

POLL_INTERVAL = int(os.getenv("POLL_INTERVAL", "30"))  # seconds between checks for new stream items

REPLY_IN_COMMENT = os.getenv("REPLY_IN_COMMENT", "true").lower() in ("true", "1", "yes")

SEND_PM = os.getenv("SEND_PM", "true").lower() in ("true", "1", "yes")

MAX_NOTIFY_PER_POST = int(os.getenv("MAX_NOTIFY_PER_POST", "3"))  # prevent spam

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

logger = logging.getLogger("gutenbot")

# ---------- DB ----------

def init_db(conn: sqlite3.Connection):

    cur = conn.cursor()

    cur.execute("""

    CREATE TABLE IF NOT EXISTS notified (

        id TEXT PRIMARY KEY,      -- e.g. reddit_fullname or source link + guten_id

        reddit_kind TEXT,         -- 't1' for comment, 't3' for post

        reddit_permalink TEXT,

        guten_id INTEGER,

        score REAL,

        created INTEGER

    )

    """)

    conn.commit()

conn = sqlite3.connect(DB_PATH, check_same_thread=False)

init_db(conn)

# ---------- Metadata Loader ----------

@dataclass

class GutenbergEntry:

    gutenberg_id: int

    title: str

    authors: List[str]  # one or more author names

    author_fullname: str  # primary author full name (if available)

    lang: str

    subjects: List[str]

def normalize_text(s: str) -> str:

    s = s or ""

    s = s.strip()

    s = s.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')

    s = re.sub(r"\s+", " ", s)

    s = re.sub(r"[^0-9A-Za-z'\"&\s\-:,\.\?]", "", s)

    return s.lower()

def load_metadata_from_csv(path: str) -> List[GutenbergEntry]:

    import csv

    entries = []

    if not os.path.exists(path):

        logger.error("Gutenberg metadata CSV not found at %s", path)

        return entries

    with open(path, newline='', encoding='utf-8') as fh:

        reader = csv.DictReader(fh)

        for r in reader:

            try:

                gid = int(r.get("gutenberg_id") or r.get("id") or 0)

                title = r.get("title", "")

                authors = []

                a = r.get("authors") or r.get("author") or r.get("author_fullname") or ""

                # authors may be pipe/comma separated

                for part in re.split(r"[|,;/]+", a):

                    part = part.strip()

                    if part:

                        authors.append(part)

                lang = r.get("language") or r.get("lang") or "en"

                subjects = []

                subj = r.get("subjects") or r.get("subject") or ""

                for s in re.split(r"[|;]+", subj):

                    s = s.strip()

                    if s:

                        subjects.append(s)

                entries.append(GutenbergEntry(gutenberg_id=gid, title=title, authors=authors,

                                            author_fullname=authors[0] if authors else "", lang=lang, subjects=subjects))

            except Exception as e:

                logger.exception("Error parsing row: %s", e)

    logger.info("Loaded %d Gutenberg metadata entries", len(entries))

    return entries

gutenberg_entries = load_metadata_from_csv(GUTEN_METADATA_CSV)

# Build a quick title lookup list for rapidfuzz

title_to_entry_map = {e.title: e for e in gutenberg_entries}

title_list = list(title_to_entry_map.keys())

# Pre-normalize titles once so matching does not re-normalize the whole catalog per candidate

normalized_title_list = [normalize_text(t) for t in title_list]

# ---------- Extraction: find candidate titles/authors ----------

TITLE_QUOTE_RE = re.compile(r'["“”‘’](.+?)["“”‘’]')  # text in quotes

PAREN_TITLE_RE = re.compile(r'(.+?)\s*\(by\s+([A-Za-z ,.\-]+)\)', re.I)  # "Title (by Author)"

BY_AUTHOR_RE = re.compile(r'(.{2,120}?)\s+by\s+([A-Z][A-Za-z\-\.\s]{2,80})', re.I)  # "Title by Author"

ITALIC_MARK_RE = re.compile(r'\*(.+?)\*|_(.+?)_|<i>(.+?)</i>', re.I)

def extract_candidates(text: str) -> List[Tuple[str, Optional[str]]]:

    """

    Returns list of (title_candidate, author_candidate_or_None)

    """

    candidates = []

    text = text.strip()

    # 1) quoted strings

    for m in TITLE_QUOTE_RE.finditer(text):

        t = m.group(1).strip()

        if len(t) >= 2 and len(t) <= 200:

            candidates.append((t, None))

    # 2) italic markers like *Title*

    for m in ITALIC_MARK_RE.finditer(text):

        t = next(g for g in m.groups() if g)

        candidates.append((t.strip(), None))

    # 3) "Title (by Author)" pattern

    for m in PAREN_TITLE_RE.finditer(text):

        candidates.append((m.group(1).strip(), m.group(2).strip()))

    # 4) "Title by Author" pattern but restrict to short author tokens

    for m in BY_AUTHOR_RE.finditer(text):

        title = m.group(1).strip()

        author = m.group(2).strip()

        if 3 <= len(title) <= 200:

            candidates.append((title, author))

    # 5) spaCy: look for WORK_OF_ART entities or TITLE-like proper noun sequences

    if nlp:

        try:

            doc = nlp(text)

            for ent in doc.ents:

                if ent.label_ in ("WORK_OF_ART", "WORK", "PRODUCT", "EVENT"):

                    candidates.append((ent.text.strip(), None))

            # also heuristics: consecutive PROPN, TITLE CASE sequences

            tokens = [t for t in doc]

            for i in range(len(tokens)-1):

                if tokens[i].text[0].isupper() and tokens[i+1].text[0].isupper():

                    # build sequence

                    j = i

                    seq = []

                    while j < len(tokens) and tokens[j].text[0].isupper() and len(seq) < 8:

                        seq.append(tokens[j].text)

                        j += 1

                    if len(seq) >= 2:

                        cand = " ".join(seq)

                        candidates.append((cand, None))

        except Exception:

            pass

    # dedupe while preserving order

    seen = set()

    uniq = []

    for t,a in candidates:

        key = (t.lower(), (a or "").lower())

        if key not in seen:

            seen.add(key)

            uniq.append((t,a))

    return uniq

# ---------- Matching ----------

def score_candidate_against_metadata(title_cand: str, author_cand: Optional[str]) -> List[Tuple[GutenbergEntry, float]]:

    """

    Returns a list of (GutenbergEntry, score) sorted desc. Score is in [0,100].

    """

    if not title_cand or len(title_cand) < 2:

        return []

    norm = normalize_text(title_cand)

    # rapidfuzz token_set_ratio gives robust matching against punctuation/subtitle variations;

    # process.extract returns (match_string, score, index) tuples against the pre-normalized titles

    matches = process.extract(norm, normalized_title_list,

                              scorer=fuzz.token_set_ratio, limit=10)

    results = []

    for match_string, score, idx in matches:

        title_orig = title_list[idx]

        entry = title_to_entry_map[title_orig]

        sc = score

        # boost if author candidate matches an author

        if author_cand and entry.authors:

            auth_norm = normalize_text(author_cand)

            best_author = max(entry.authors, key=lambda a: fuzz.token_set_ratio(auth_norm, normalize_text(a)))

            auth_score = fuzz.token_set_ratio(auth_norm, normalize_text(best_author))

            if auth_score > 40:

                sc = min(100.0, sc + SIMILARITY_AUTHOR_BOOST * (auth_score/100.0))

        results.append((entry, sc))

    results.sort(key=lambda x: x[1], reverse=True)

    return results

def pick_best_match(results: List[Tuple[GutenbergEntry, float]]) -> Optional[Tuple[GutenbergEntry, float]]:

    if not results:

        return None

    best = results[0]

    if best[1] >= MIN_SCORE:

        return best

    return None

# ---------- Reddit / PRAW Setup ----------

def get_reddit():

    return praw.Reddit(client_id=REDDIT_CLIENT_ID,

                      client_secret=REDDIT_CLIENT_SECRET,

                      username=REDDIT_USERNAME,

                      password=REDDIT_PASSWORD,

                      user_agent=REDDIT_USER_AGENT)

reddit = get_reddit()

# ---------- Notification & reply templates ----------

def build_reply_template(entry: GutenbergEntry, source_permalink: str) -> str:

    # Nice compact reply suitable for pasting

    url = f"https://www.gutenberg.org/ebooks/{entry.gutenberg_id}"

    template = (

        f"Hi — I think the book mentioned here matches **{entry.title}** "

        f"by **{entry.author_fullname or ', '.join(entry.authors)}**.\n\n"

        f"You can find it on Project Gutenberg: {url}\n\n"

        f"I thought this was relevant because it appears to directly relate to the title/author you mentioned."

    )

    return template

def notify_users_pm(reddit_obj, recipients: List[str], subject: str, body: str):

    for user in recipients:

        try:

            reddit_obj.redditor(user).message(subject=subject, message=body)  # newer PRAW versions require keyword arguments

            logger.info("Sent PM to %s", user)

            time.sleep(2)  # gentle pacing

        except Exception as e:

            logger.exception("Failed to send PM to %s: %s", user, e)

def reply_to_item(item, body: str):

    try:

        item.reply(body)

        logger.info("Replied to %s", getattr(item, "id", "<unknown>"))

    except Exception as e:

        logger.exception("Failed to reply: %s", e)

# ---------- Processing a single Reddit item ----------

def canonical_permalink_from_item(item) -> str:

    # PRAW provides .permalink for comments/posts; ensure full url

    try:

        return "https://www.reddit.com" + item.permalink

    except Exception:

        return ""

def register_notification(conn: sqlite3.Connection, id_key: str, reddit_kind: str, permalink: str, guten_id: int, score: float):

    cur = conn.cursor()

    try:

        cur.execute("INSERT INTO notified (id, reddit_kind, reddit_permalink, guten_id, score, created) VALUES (?, ?, ?, ?, ?, ?)",

                    (id_key, reddit_kind, permalink, guten_id, score, int(time.time())))

        conn.commit()

    except sqlite3.IntegrityError:

        # already recorded

        pass

def already_notified(conn: sqlite3.Connection, id_key: str) -> bool:

    cur = conn.cursor()

    cur.execute("SELECT 1 FROM notified WHERE id = ?", (id_key,))

    return cur.fetchone() is not None

def process_reddit_item(item):

    # item can be Comment or Submission

    text = getattr(item, "body", None) or getattr(item, "selftext", None) or getattr(item, "title", "") or ""

    if not text:

        return

    candidates = extract_candidates(text)

    if not candidates:

        return

    permalink = canonical_permalink_from_item(item)

    id_key_prefix = f"{item.fullname}"  # e.g. t1_..., t3_...

    notify_count = 0

    for title_cand, author_cand in candidates:

        results = score_candidate_against_metadata(title_cand, author_cand)

        best = pick_best_match(results)

        if not best:

            continue

        entry, score = best

        id_key = f"{id_key_prefix}|g{entry.gutenberg_id}"

        if already_notified(conn, id_key):

            logger.debug("Already notified for %s", id_key)

            continue

        # create message

        reply_text = build_reply_template(entry, permalink)

        subject = f"Gutenberg match: {entry.title}"

        # Notify configured users via PM

        if SEND_PM and NOTIFY_USERNAMES:

            pm_body = (

                f"Gutenberg match found for a reddit item: {permalink}\n\n"

                f"Title: {entry.title}\nAuthor: {entry.author_fullname or ', '.join(entry.authors)}\n"

                f"Project Gutenberg link: https://www.gutenberg.org/ebooks/{entry.gutenberg_id}\n\n"

                f"Confidence score: {score:.1f}\n\n"

                f"Suggested reply (you can paste this into the thread):\n\n{reply_text}"

            )

            notify_users_pm(reddit, NOTIFY_USERNAMES, subject, pm_body)

        # Optionally reply directly to the comment/post with the suggested text (if configured)

        if REPLY_IN_COMMENT:

            try:

                reply_to_item(item, reply_text)

            except Exception as e:

                logger.warning("Couldn't reply directly: %s", e)

        register_notification(conn, id_key, item.fullname.split("_")[0], permalink, entry.gutenberg_id, score)  # 't1'/'t3' prefix from the fullname; PRAW items have no .kind attribute

        notify_count += 1

        if notify_count >= MAX_NOTIFY_PER_POST:

            break

# ---------- Streams / Poll loop ----------

def monitor_subreddits():

    subreddits = "+".join([s.strip() for s in SUBREDDITS if s.strip()])

    logger.info("Monitoring subreddits: %s", subreddits)

    subreddit = reddit.subreddit(subreddits)

    # PRAW stream generators: pause_after=-1 makes each stream yield None once it

    # has no new items, so we can alternate between comments and submissions

    comment_stream = subreddit.stream.comments(skip_existing=True, pause_after=-1)

    submission_stream = subreddit.stream.submissions(skip_existing=True, pause_after=-1)

    while True:

        try:

            for comment in comment_stream:

                if comment is None:

                    break

                logger.debug("New comment %s", comment.id)

                process_reddit_item(comment)

            for submission in submission_stream:

                if submission is None:

                    break

                logger.debug("New submission %s", submission.id)

                process_reddit_item(submission)

        except Exception as e:

            logger.exception("Error while streaming: %s", e)

            time.sleep(10)

            # recreate the streams in case one of them died with the exception

            comment_stream = subreddit.stream.comments(skip_existing=True, pause_after=-1)

            submission_stream = subreddit.stream.submissions(skip_existing=True, pause_after=-1)

        time.sleep(POLL_INTERVAL)

# ---------- Command-line helpers ----------

def quick_test_on_text(text: str):

    cands = extract_candidates(text)

    print("Candidates:", cands)

    for t,a in cands:

        print("->", t, "|", a)

        res = score_candidate_against_metadata(t, a)

        if res:

            for entry,score in res[:3]:

                print(f"    {entry.title} (ID {entry.gutenberg_id}) score={score:.1f} author={entry.author_fullname}")

        else:

            print("    no matches")

if __name__ == "__main__":

    import argparse

    parser = argparse.ArgumentParser()

    parser.add_argument("--test-text", help="Run candidate extraction & match on a single text")

    parser.add_argument("--run", action="store_true", help="Run bot (streaming mode)")

    parser.add_argument("--load-csv", help="Load a different metadata csv")

    args = parser.parse_args()

    if args.load_csv:

        gutenberg_entries = load_metadata_from_csv(args.load_csv)

        title_to_entry_map = {e.title: e for e in gutenberg_entries}

        title_list = list(title_to_entry_map.keys())

        normalized_title_list = [normalize_text(t) for t in title_list]

    if args.test_text:

        quick_test_on_text(args.test_text)

    if args.run:

        monitor_subreddits()

Example

.env.example

REDDIT_CLIENT_ID=your_client_id

REDDIT_CLIENT_SECRET=your_client_secret

REDDIT_USERNAME=YourBotAccount

REDDIT_PASSWORD=bot_password

REDDIT_USER_AGENT=GutenBot/0.1 (by /u/YourReddit)

SUBREDDITS=books,askliterature

NOTIFY_USERNAMES=mod1,helper2

GUTEN_METADATA_CSV=/path/to/gutenberg_metadata.csv

MIN_SCORE=85

SIMILARITY_AUTHOR_BOOST=10

DB_PATH=/var/lib/gutenbot/gutenbot.sqlite

REPLY_IN_COMMENT=true

SEND_PM=true

POLL_INTERVAL=30

Metadata: obtaining Project Gutenberg metadata

You need a metadata file (CSV) with gutenberg_id,title,authors,language,subjects. Recommended approaches:

  • Use the Project Gutenberg RDF/catalog and convert to CSV (the script is not included here, but it’s straightforward to parse RDF or JSON dumps).
  • Or use community-maintained CSV exports of Gutenberg metadata (download once and refresh periodically). Store the CSV on the machine where the bot runs and point GUTEN_METADATA_CSV at it.

(If you want I can provide a separate script to convert the RDF catalog to CSV — tell me and I’ll generate it.)
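
For reference, a couple of rows in the expected format (values are illustrative; the loader also accepts language instead of lang, author instead of authors, and pipe- or semicolon-separated multi-value fields):

gutenberg_id,title,authors,author_fullname,lang,subjects
1184,The Count of Monte Cristo,Alexandre Dumas,Alexandre Dumas,en,Adventure stories|Revenge -- Fiction
84,"Frankenstein; Or, The Modern Prometheus",Mary Wollstonecraft Shelley,Mary Wollstonecraft Shelley,en,Science fiction|Horror tales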

Deploying & running

Options (short):

  • Docker container (recommended): build an image with the script, proper env variables, and run in a managed service (Railway, Fly.io, DigitalOcean App Platform).
  • Systemd service on a VPS: run in a virtualenv, create systemd unit that restarts on failure.
  • Heroku / similar: possible, but watch for worker dynos sleeping and limits on long-running background processes.

Minimal Dockerfile:

FROM python:3.11-slim

WORKDIR /app

COPY . /app

RUN pip install --no-cache-dir praw spacy rapidfuzz python-dotenv requests && \

    python -m spacy download en_core_web_sm

ENV PYTHONUNBUFFERED=1

CMD ["python", "gutenbot.py", "--run"]
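
Build and run (image name and host path are illustrative; point GUTEN_METADATA_CSV and DB_PATH in your .env at files inside the mounted directory, e.g. /var/lib/gutenbot/gutenberg_metadata.csv):

docker build -t gutenbot .

docker run -d --name gutenbot --env-file .env -v /srv/gutenbot:/var/lib/gutenbot gutenbot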

Systemd unit example (/etc/systemd/system/gutenbot.service):

[Unit]

Description=GutenBot Reddit notifier

After=network.target

[Service]

User=youruser

WorkingDirectory=/opt/gutenbot

EnvironmentFile=/opt/gutenbot/.env

ExecStart=/usr/bin/python3 /opt/gutenbot/gutenbot.py --run

Restart=always

RestartSec=5

[Install]

WantedBy=multi-user.target
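
After installing the unit file, run systemctl daemon-reload and then systemctl enable --now gutenbot so the service starts immediately and restarts on boot.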

Rate limits, politeness & Reddit rules

  • Use a dedicated bot account and clear user_agent.
  • Keep PM frequency low and respectful; avoid spamming users.
  • Respect Reddit API rate limits: PRAW has built-in rate-limit handling, but add pacing (the code uses small sleeps on PM sends); a retry-with-backoff sketch follows this list.
  • Ensure templates are polite and helpfully attribute the match and confidence.
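
A minimal retry-with-backoff sketch you could wrap around PM sends or replies (the helper name and delays are illustrative, not part of the script above):

import logging
import time

logger = logging.getLogger("gutenbot")

def send_with_backoff(send_fn, max_tries=4, base_delay=5):
    """Call send_fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_tries):
        try:
            return send_fn()
        except Exception:
            delay = base_delay * (2 ** attempt)
            logger.warning("Send failed; retrying in %s seconds", delay)
            time.sleep(delay)
    logger.error("Giving up after %d attempts", max_tries)

Usage (illustrative): send_with_backoff(lambda: reddit.redditor(user).message(subject=subject, message=body))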

Testing & quality checks

  1. Unit tests for extract_candidates() with various sample strings (quotes, italics, by-author); a pytest sketch follows this list.
  2. Unit tests for score_candidate_against_metadata() using a small sample metadata file.
  3. Live run in a private subreddit or with --test-text to verify matches and avoid public mistakes.
  4. Confirm DB persistence prevents duplicate notifications.
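
For example, a minimal pytest sketch for the extractor and normalizer (it assumes these helpers can be imported from gutenbot without triggering the Reddit setup, e.g. after splitting the script into modules; the sample strings are illustrative):

# test_gutenbot.py
from gutenbot import extract_candidates, normalize_text

def test_quoted_title_is_extracted():
    cands = extract_candidates('Has anyone read "Frankenstein" by Mary Shelley?')
    assert any("Frankenstein" in title for title, _ in cands)

def test_title_by_author_pattern_captures_author():
    cands = extract_candidates("I just finished Dracula by Bram Stoker.")
    assert any(author and "Stoker" in author for _, author in cands)

def test_normalize_text_lowercases_and_fixes_smart_quotes():
    assert normalize_text("“Moby-Dick”") == '"moby-dick"'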

Example inputs → outputs (samples)

Input comment:

“Has anyone read ‘The Count of Monte Cristo’ by Alexandre Dumas? Thinking of reading it next.”

Bot behavior:

  • Extracts candidate: “The Count of Monte Cristo” + author “Alexandre Dumas”
  • Matches entry Gutenberg ID 1184 (example) with score 95
  • Sends PM to configured reviewers with message containing the reddit permalink, Gutenberg link https://www.gutenberg.org/ebooks/1184, and suggested reply:

Hi — I think the book mentioned here matches **The Count of Monte Cristo** by **Alexandre Dumas**.

You can find it on Project Gutenberg: https://www.gutenberg.org/ebooks/1184

I thought this was relevant because it appears to directly relate to the title/author you mentioned.

Logging, monitoring & maintainability

  • Use rotating file logs or forward logs to an aggregator (Papertrail/LogDNA).
  • Add a /health HTTP endpoint (small Flask server) for readiness checks if running in containers; a minimal sketch follows this list.
  • Periodically refresh the Gutenberg metadata (cron job) and rebuild the CSV/SQLite index.
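
A minimal sketch of such an endpoint, run as a small separate process or thread alongside the bot (route and port are illustrative):

# health.py
from flask import Flask

app = Flask(__name__)

@app.route("/health")
def health():
    return {"status": "ok"}, 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)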

Improvements & advanced ideas (next steps)

  • Use a real vector search (FAISS) on canonicalized title embeddings for more advanced matching.
  • Add language detection to avoid matching non-English metadata (one possible shape is sketched after this list).
  • Provide a confidence explanation (why matched: token overlap, author match).
  • Implement an opt-in mechanism for notified users (so you don’t PM people who don’t want messages).
  • Add rate-limiting and exponential-backoff logic when replying/PMing to avoid account suspension.
  • Add a web dashboard for recent matches, manual approve/reject flows.
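
One possible shape for the language-detection idea, using the langdetect package (not a dependency of the script above; a sketch only):

from langdetect import detect

def looks_english(text: str) -> bool:
    # detect() raises on very short or non-linguistic input; treat that as "not English"
    try:
        return detect(text) == "en"
    except Exception:
        return False

You could call this on the Reddit text before extraction, and/or restrict gutenberg_entries to rows whose lang field matches the detected language.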

Security & privacy

  • Keep Reddit credentials secret (use environment variables or a secret manager).
  • Store minimal data: only what’s needed to avoid duplicate notifications (no user PII beyond Reddit usernames).
  • Do not auto-post links to full-text downloads without checking the license. Project Gutenberg texts are public domain, so that is fine, but be careful with other sources.

Wrap-up / next steps for you

  1. Download or create a Gutenberg metadata CSV and set GUTEN_METADATA_CSV.
  2. Create a Reddit app & bot account; fill .env.
  3. Install dependencies and run with python gutenbot.py --test-text "..." to verify.
  4. Run in a private subreddit until you’re happy with precision/thresholds, then enable live monitors.

If you want, I can:

  • Generate a small sample gutenberg_metadata.csv (100 entries) so you can test quickly.
  • Provide a script to convert Gutenberg RDF catalog to CSV.
  • Provide unit tests for the extractor & matcher (pytest).

Which of those would you like next?