corkboard

onboarding — set up a project & hand it to an agent

a 5-minute walkthrough · v0

what it is

networks of evidence-backed claims

corkboard helps you build a graph of entities — people, companies, papers, whatever — connected by memos that cite the sources they came from.


You give it seed entities and an ontology. An AI agent runs the search → scrape → extract loop. You get a queryable corkboard with full provenance.

Good for: investigations, lit reviews, influence maps, prosopography, any "who's connected to whom and how do we know" question.

the loop

how a corkboard grows

  • search: Serper, pairs of nodes
  • scrape: aiohttp · cloudscraper · archive.is
  • extract: LLM → your Pydantic schema
  • triage: candidates → new nodes

Every step writes to the corkboard. Every action is logged. Resuming is automatic — queues are derived from what's missing, not from a checkpoint file.


No run_all(). The agent calls primitives; the library handles state, caching, provenance, and cost tracking.

data model

four things to know

  • Node: an entity. node_id · aliases · meta. Customise meta via your schema.
  • Memo: an evidence-backed claim over one or more nodes. target_node_ids · meta · fodder_ids. Customise meta via your schema.
  • Fodder: raw scraped content. URL, body, blob.
  • Candidate: an entity an LLM mentioned but you haven't promoted to a node yet. Triage decides.

Methodology, not ontology. The library is opinionated about how you build the network. It has zero opinions about what kinds of nodes or memos you want. That's meta + your schema.
setup

step 1 — install

# Install corkboard. With uv (recommended):
uv tool install corkboard

# Or pip:
pip install corkboard

# Optional extras:
pip install 'corkboard[archive]'   # archive.is Selenium fallback
pip install 'corkboard[wiki]'      # Wikipedia matcher (v1+)

Python 3.11+. You'll also want a Serper API key (search) and an Anthropic API key (extraction). Set them in your shell — the library reads SERPER_API_KEY and ANTHROPIC_API_KEY.

setup

step 2 — create a project

corkboard init beat_generation
cd beat_generation

You now have an empty corkboard project, scaffolded from the built-in template.


what got written

beat_generation/
├── corkboard.db        # SQLite (WAL). nodes · memos · fodder · candidates · runs
├── blobs/              # content-addressed raw HTML/PDF/text
├── caches/             # LLM + Serper response caches
├── config.toml         # schemas + LLM + rate-limit config
├── schemas.py          # your project-local Pydantic schemas
├── pyproject.toml      # so `uv run python -c '…'` resolves `corkboard`
└── AGENTS.md           # instructions for the AI agent that runs the work
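blobs/ being content-addressed means each raw document is stored under the hash of its own bytes, so re-scraping the same page stores nothing new. A minimal sketch of the idea (hypothetical helper names, not the library's actual code):

```python
import hashlib
from pathlib import Path

def put_blob(blob_dir: Path, content: bytes) -> str:
    """Store content under its SHA-256 hash; identical bytes dedupe for free."""
    digest = hashlib.sha256(content).hexdigest()
    path = blob_dir / digest[:2] / digest  # fan out by prefix, like git objects
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
    return digest  # a fodder row would store this id, not the bytes

def get_blob(blob_dir: Path, digest: str) -> bytes:
    """Round-trip: the id alone is enough to find the bytes again."""
    return (blob_dir / digest[:2] / digest).read_bytes()
```

Because the filename is derived from the content, writing the same page twice is a no-op and the database only ever needs to hold the digest.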
key file

AGENTS.md is your runbook

The template drops a comprehensive AGENTS.md at the project root. It's written for an AI agent — and it's what makes the "point Claude/Cursor at the directory" workflow work.

It covers:

  • Project layout & what each file is for
  • How to call the library: uv run python -c '…' snippets, not a CLI
  • One-liners for every step of the loop (alias · search · scrape · extract · triage)
  • The fallback path when scraping hits a Cloudflare wall
  • Implicit dedup behaviours to be aware of
  • How to extend the schema

Treat it as living documentation — agents and humans alike read it; project-specific notes belong next to it (e.g. SCRAPING.md).

setup

step 3 — define your ontology

Edit schemas.py with the shapes your domain needs. The template ships commented examples:

from typing import Annotated, Literal
from pydantic import BaseModel, Field


class _PersonNode(BaseModel):
    kind: Literal["person"]
    country: str | None = None
    roles: list[str] = []

class _CompanyNode(BaseModel):
    kind: Literal["company"]
    web_domain: str | None = None

# Discriminated union over meta['kind']:
NodeMetaSchema = Annotated[
    _PersonNode | _CompanyNode, Field(discriminator="kind")
]


class _CollaboratedMemo(BaseModel):
    kind: Literal["collaborated"]
    year_started: int | None = None
    notes: str | None = None

class _MentoredMemo(BaseModel):
    kind: Literal["mentored"]
    mentor: str         # entity name
    mentee: str         # entity name

MemoMetaSchema = Annotated[
    _CollaboratedMemo | _MentoredMemo, Field(discriminator="kind")
]
setup

step 4 — wire the schema in config.toml

[schemas]
node = "schemas:NodeMetaSchema"
memo = "schemas:MemoMetaSchema"

[llm]
default_model = "claude-opus-4-7"
# cache_path = "../shared-llm-cache"   # share cache across sibling projects

[rate_limits]
serper_rps = 30
requests_rps = 10
per_site_request_wait_s = 10

Now every add_node / add_memo call is validated against your Pydantic schema. The LLM extraction prompt is built from the same schema, so you get matching structure end-to-end.

Schemas are optional. If you don't set them, meta is a free-form dict and you'll get back whatever the LLM decides is relevant. Useful for exploration — switch on schemas when the shape settles.
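The [rate_limits] keys map onto a straightforward semantics: a global requests-per-second cap plus a minimum gap between hits to the same host. A toy model of the per-site wait (my illustration, not the library's scheduler):

```python
from urllib.parse import urlparse

class SiteThrottle:
    """Minimum-gap throttle: at most one hit per host every per_site_wait_s seconds."""
    def __init__(self, per_site_wait_s: float):
        self.per_site_wait_s = per_site_wait_s
        self.last_hit: dict[str, float] = {}  # host -> time of (reserved) last hit

    def delay_for(self, url: str, now: float) -> float:
        """Return how long to sleep before fetching url, and reserve the slot."""
        host = urlparse(url).netloc
        last = self.last_hit.get(host)
        wait = 0.0 if last is None else max(0.0, self.per_site_wait_s - (now - last))
        self.last_hit[host] = now + wait
        return wait

throttle = SiteThrottle(per_site_wait_s=10)
print(throttle.delay_for("https://example.com/a", now=0.0))  # first hit: 0.0
print(throttle.delay_for("https://example.com/b", now=2.0))  # same host too soon: 8.0
print(throttle.delay_for("https://other.org/x", now=2.0))    # different host: 0.0
```

With per_site_request_wait_s = 10, a batch fetch hitting one slow site spreads out while other hosts proceed at full speed.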
setup

step 5 — seed a few nodes

From inside the project directory:

uv run python -c "
from corkboard.board import Corkboard
from corkboard.nodes import add_node

cb = Corkboard.open()      # walks up to find corkboard.db
add_node(cb, 'jack_kerouac',     meta={'kind': 'person'})
add_node(cb, 'allen_ginsberg',   meta={'kind': 'person'})
add_node(cb, 'william_burroughs', meta={'kind': 'person'})
"

That's enough to start the loop. The agent will resolve aliases (Wikipedia-style name variants), then search every pair.

Corkboard.open() walks upward from the cwd looking for corkboard.db, like git finds .git. Scripts run from any subdir of the project.

step 6

hand the project to an agent

corkboard is agent-driven by default. You don't run a pipeline — you point a coding agent at the project directory and tell it what you want.


Claude Code

cd beat_generation
claude

Cursor / others

Open the project directory. Most coding agents auto-discover AGENTS.md; if yours doesn't, paste its contents into the system prompt.


Then say something like:

"Resolve aliases for all unaliased nodes, search every pair, scrape the results, extract memos. Keep going until corkboard coverage reports zero unfetched pairs."
under the hood

what the agent actually runs

The agent calls library primitives directly via short uv run python -c '…' snippets — no project-specific CLI. Concretely, one full loop pass looks like:

import asyncio
from corkboard.board import Corkboard
from corkboard.agent import resolve_aliases, extract_memo
from corkboard.search.batch import search_pairs
from corkboard.scrape.fetch import batch_fetch
from corkboard.triage import (
    nodes_missing_aliases, pairs_without_fodder,
    fodder_unfetched, pairs_without_memos,
)

async def main():
    cb = Corkboard.open()

    for n in nodes_missing_aliases(cb, limit=20):
        await resolve_aliases(cb, n.node_id)

    pairs = pairs_without_fodder(cb, accept_n_squared=True, limit=50)
    await search_pairs(cb, pairs)

    todo = [(f.source_url, f.target_node_ids)
            for f in fodder_unfetched(cb, limit=50)]
    await batch_fetch(cb, todo)

    for pair in pairs_without_memos(cb, limit=20):
        await extract_memo(cb, list(pair),
                           schema=cb.memo_schema,
                           model='claude-sonnet-4-6',
                           chunk_strategy='paragraphs_with_both',
                           chunk_window=1)

asyncio.run(main())

Triage queues (nodes_missing_aliases, pairs_without_fodder, …) are SQL views — they always return what's still outstanding, so re-running is safe and resumes automatically.
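A derived queue is just a query over missing state. Here is the idea reduced to plain SQLite (schema simplified for illustration; the real views are the library's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (node_id TEXT PRIMARY KEY, aliases TEXT);
    -- The queue is a view over what's missing, not a checkpoint file:
    CREATE VIEW nodes_missing_aliases AS
        SELECT node_id FROM nodes WHERE aliases IS NULL;
""")
conn.executemany("INSERT INTO nodes VALUES (?, ?)", [
    ("jack_kerouac", None),
    ("allen_ginsberg", '["Irwin Allen Ginsberg"]'),
])
print([r[0] for r in conn.execute("SELECT node_id FROM nodes_missing_aliases")])
# → ['jack_kerouac']

# Doing the work empties the queue automatically; there is nothing to update:
conn.execute("UPDATE nodes SET aliases = ? WHERE node_id = ?",
             ('["Jean-Louis Kerouac"]', "jack_kerouac"))
print(conn.execute("SELECT count(*) FROM nodes_missing_aliases").fetchone()[0])
# → 0
```

Kill the loop at any point and rerun it: the views recompute what's outstanding, so resumption needs no bookkeeping.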

inspect

what's in the corkboard right now?

# Aggregate stats — nodes, memos, fodder, pending pairs, total LLM cost:
corkboard coverage .

# Tail the audit log (every action, every cost, every error):
corkboard runs tail .

# Or just query the SQLite directly:
sqlite3 corkboard.db 'SELECT node_id, aliases FROM nodes LIMIT 5'

Three CLI commands — init, coverage, runs tail — and that's the whole CLI. Everything else is Python.

Why only three? Agents call Python directly; humans want a quick "where am I" snapshot. The CLI serves the second need, not the first.

triage

candidates — entities the LLM noticed

When extract_memo runs, the prompt asks the LLM to also surface any third-party entities it mentioned. Those become candidates.

# Pending candidates:
uv run python -c "
from corkboard.board import Corkboard
from corkboard.triage import unpromoted_candidates
cb = Corkboard.open()
for c in unpromoted_candidates(cb):
    print(f'{c.suggested_name:30} mentions={c.mention_count}')
"

# Promote one — it becomes a Node and joins the next search round:
uv run python -c "
from corkboard.board import Corkboard
from corkboard.candidates import promote_candidate
cb = Corkboard.open()
promote_candidate(cb, 'CANDIDATE_ID')
"

Promote → next loop iteration searches the new node against everything. That's how networks grow beyond the seed.

when things go wrong

scraping hit a wall — three options

aiohttp + cloudscraper covers most sites. Cloudflare-heavy or paywalled ones don't. AGENTS.md documents the fallbacks in detail; the short version:

  1. batch_archive_fetch: a Selenium-driven archive.is fallback. It's captcha'd, so interactive by default; for unattended runs use captcha_strategy="fail_fast" and use_headless_browser=True.
  2. The agent fetches it itself. Got a browser-MCP tool, a paywalled session, or curl in a different network context? Pull the content and call add_fodder(cb, ..., full_text=..., raw_bytes=...). The extract pipeline treats it identically.
  3. Skip it. Failed fetches are recorded with the error. Move on; some URLs aren't worth fighting.

Use pairs_blocked_by_fetch(cb) to see the "dead-letter" queue — pairs where every fodder URL failed to fetch.

tuning

extract knobs worth knowing

  • model='claude-sonnet-4-6': use Sonnet instead of Opus (the library default). A sensible default: ~5× cheaper, and it handles memo extraction well. Escalate to Opus only on retries where Sonnet got the structure wrong.
  • chunk_strategy='paragraphs_with_both': only send the LLM paragraphs that mention both entities. Use on long sources (Wikipedia, biographies): massive token savings, sharper memos.
  • chunk_window=1: include the paragraphs before and after each match. Use when one entity is the topic and the other is mentioned in passing; the extra context lets the LLM type the relation instead of falling back to kind="other".
  • max_tokens=8000: raise the response budget. Use for verbose schemas or long target tuples where extractions truncate mid-JSON.

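The chunking knobs are easy to picture: filter to paragraphs that mention both entities, then widen the selection by chunk_window neighbours on each side. A toy version of that selection (my sketch of the strategy, not the library's code):

```python
def paragraphs_with_both(text: str, a: str, b: str, window: int = 0) -> list[str]:
    """Keep paragraphs naming both entities, plus `window` neighbours each side."""
    paras = [p for p in text.split("\n\n") if p.strip()]
    hits = {i for i, p in enumerate(paras)
            if a.lower() in p.lower() and b.lower() in p.lower()}
    keep: set[int] = set()
    for i in hits:
        keep.update(range(max(0, i - window), min(len(paras), i + window + 1)))
    return [paras[i] for i in sorted(keep)]

doc = ("Kerouac grew up in Lowell.\n\n"
       "Kerouac met Ginsberg at Columbia.\n\n"
       "Ginsberg later moved west.")
print(paragraphs_with_both(doc, "Kerouac", "Ginsberg"))  # middle paragraph only
print(len(paragraphs_with_both(doc, "Kerouac", "Ginsberg", window=1)))  # → 3
```

On a long biography this discards most of the text before the LLM ever sees it, which is where both the token savings and the sharper memos come from.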
gotchas

implicit behaviours to know about

The library silently drops a few things to avoid garbage state. Each has an opt-out — but the defaults are what you want 99% of the time.

  • Pair search drops shared aliases. If "Father of the Atomic Bomb" is an alias on two physicists, it's filtered out of the boolean query (otherwise a single mention satisfies both AND-groups). Aliases on Node rows stay; only the query is filtered.
  • Candidates matching existing nodes are silently skipped. LLM extractions often surface entities you already have. add_candidates filters them case-insensitively. Override with skip_known_nodes=False.
  • Corkboard.open() walks upward from the cwd looking for corkboard.db, like git. Pass an explicit path to override.
  • Memo prompts use Entity A/B/C labels in target_node_ids order. Schema fields can reference them positionally (direction: Literal["a_cites_b"]) or by name (mentor: str, mentee: str). Free-text fields like summary use actual names — the prompt forbids leaking the A/B labels there.
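The shared-alias rule from the first bullet is simple set arithmetic over the two nodes' name lists before the boolean query is built. A sketch of the filtering step (hypothetical query builder, for illustration only):

```python
def build_pair_query(names_a: list[str], names_b: list[str]) -> str:
    """AND two OR-groups of names, dropping aliases both entities share."""
    shared = {n.lower() for n in names_a} & {n.lower() for n in names_b}
    group_a = [n for n in names_a if n.lower() not in shared]
    group_b = [n for n in names_b if n.lower() not in shared]
    quoted = lambda ns: "(" + " OR ".join(f'"{n}"' for n in ns) + ")"
    return f"{quoted(group_a)} AND {quoted(group_b)}"

print(build_pair_query(
    ["J. Robert Oppenheimer", "Father of the Atomic Bomb"],
    ["Edward Teller", "Father of the Atomic Bomb"],  # shared epithet gets dropped
))
# → ("J. Robert Oppenheimer") AND ("Edward Teller")
```

Without the filter, any page containing just the shared epithet would satisfy both AND-groups and flood the pair with irrelevant results; the nodes themselves keep all their aliases.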
recipe

typical first session

  1. corkboard init my_project && cd my_project
  2. Edit schemas.py; uncomment the [schemas] block in config.toml.
  3. Seed 3–10 nodes by hand (or from a CSV).
  4. Open the project in your coding agent. Tell it to run the loop until corkboard coverage reports zero pending pairs.
  5. Triage candidates. Promote the ones worth pursuing. Repeat from step 4.
  6. Query the SQLite, or write a small query.py that uses list_memos, cb.match, etc. Render a graph, write a report, hand it to the next agent.

Look at sample_projects/ in the corkboard repo for full worked examples: beat_generation, cambridge_five, quantum_pioneers, manhattan_project, etc.

further reading

where to look next

inside your project

  • AGENTS.md — full one-liner reference
  • schemas.py — your ontology lives here
  • config.toml — rate limits, LLM, schemas
  • corkboard.db — query it directly when curious

the library

  • corkboard.board — Corkboard.open / create
  • corkboard.nodes / memos / fodder / candidates — CRUD
  • corkboard.search / scrape / extract — the loop
  • corkboard.triage — derived queues
  • corkboard.match — predicates with & | ~
  • corkboard.agent — convenience helpers

When in doubt: read AGENTS.md — it's written for exactly this question.

that's it — go build a corkboard

corkboard init <dir> · edit schemas.py · seed · hand to agent

questions live in _scratch/CORKBOARD_PLAN.md