onboarding — set up a project & hand it to an agent
a 5-minute walkthrough · v0
corkboard helps you build a graph of entities — people, companies, papers, whatever — connected by memos that cite the sources they came from.
You give it seed entities and an ontology. An AI agent runs the search → scrape → extract loop. You get a queryable corkboard with full provenance.
Good for: investigations, lit reviews, influence maps, prosopography, any "who's connected to whom and how do we know" question.
Every step writes to the corkboard. Every action is logged. Resuming is automatic — queues are derived from what's missing, not from a checkpoint file.
No run_all(). The agent calls primitives; the library handles state, caching, provenance, and cost tracking.
| thing | what it is | where you customise |
|---|---|---|
| Node | an entity. node_id · aliases · meta. | meta via your schema |
| Memo | an evidence-backed claim over one or more nodes. target_node_ids · meta · fodder_ids. | meta via your schema |
| Fodder | raw scraped content. URL, body, blob. | — |
| Candidate | an entity an LLM mentioned but you haven't promoted to a node yet. Triage decides. | — |
# Install corkboard. With uv (recommended):
uv tool install corkboard
# Or pip:
pip install corkboard
# Optional extras:
pip install 'corkboard[archive]' # archive.is Selenium fallback
pip install 'corkboard[wiki]' # Wikipedia matcher (v1+)
Python 3.11+. You'll also want a Serper API key (search) and an Anthropic API key (extraction). Set them in your shell — the library reads SERPER_API_KEY and ANTHROPIC_API_KEY.
corkboard init beat_generation
cd beat_generation
You now have an empty corkboard project, scaffolded from the built-in template.
beat_generation/
├── corkboard.db # SQLite (WAL). nodes · memos · fodder · candidates · runs
├── blobs/ # content-addressed raw HTML/PDF/text
├── caches/ # LLM + Serper response caches
├── config.toml # schemas + LLM + rate-limit config
├── schemas.py # your project-local Pydantic schemas
├── pyproject.toml # so `uv run python -c '…'` resolves `corkboard`
└── AGENTS.md # instructions for the AI agent that runs the work
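blobs/ being content-addressed means a file's name is derived from its bytes, so the same page fetched twice is stored once. A minimal sketch of the idea (the actual layout corkboard uses may differ):

```python
import hashlib
from pathlib import Path

def store_blob(blob_dir: Path, data: bytes) -> str:
    """Write data under its sha256 hex digest; return the digest (the blob id)."""
    digest = hashlib.sha256(data).hexdigest()
    path = blob_dir / digest
    if not path.exists():  # identical content is stored exactly once
        path.write_bytes(data)
    return digest

def load_blob(blob_dir: Path, digest: str) -> bytes:
    return (blob_dir / digest).read_bytes()
```

Deduplication falls out for free, and a memo's provenance chain stays verifiable: re-hash the blob and compare.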
AGENTS.md is your runbook. The template drops a comprehensive AGENTS.md at the project root. It's written for an AI agent — and it's what makes the "point Claude/Cursor at the directory" workflow work.
It covers the whole loop — alias resolution, search, scrape, extract — as uv run python -c '…' snippets, not a CLI. Treat it as living documentation: agents and humans alike read it, and project-specific notes belong next to it (e.g. SCRAPING.md).
Edit schemas.py with the shapes your domain needs. The template ships commented examples:
from typing import Annotated, Literal
from pydantic import BaseModel, Field

class _PersonNode(BaseModel):
    kind: Literal["person"]
    country: str | None = None
    roles: list[str] = []

class _CompanyNode(BaseModel):
    kind: Literal["company"]
    web_domain: str | None = None

# Discriminated union over meta['kind']:
NodeMetaSchema = Annotated[
    _PersonNode | _CompanyNode, Field(discriminator="kind")
]

class _CollaboratedMemo(BaseModel):
    kind: Literal["collaborated"]
    year_started: int | None = None
    notes: str | None = None

class _MentoredMemo(BaseModel):
    kind: Literal["mentored"]
    mentor: str  # entity name
    mentee: str  # entity name

MemoMetaSchema = Annotated[
    _CollaboratedMemo | _MentoredMemo, Field(discriminator="kind")
]
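If the discriminated union looks opaque, this is all it does at validation time: read meta['kind'], pick the matching branch, then enforce that branch's fields. A stdlib sketch of the dispatch (pydantic's actual machinery does more — type coercion, defaults, error aggregation):

```python
# Required fields per kind; fields with defaults are omitted.
REQUIRED = {
    "person": set(),
    "company": set(),
    "collaborated": set(),
    "mentored": {"mentor", "mentee"},
}

def validate_meta(meta: dict) -> dict:
    kind = meta.get("kind")
    if kind not in REQUIRED:                 # the discriminator picks the branch
        raise ValueError(f"unknown kind: {kind!r}")
    missing = REQUIRED[kind] - meta.keys()   # then that branch's fields are enforced
    if missing:
        raise ValueError(f"{kind} meta missing {sorted(missing)}")
    return meta
```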
config.toml:

[schemas]
node = "schemas:NodeMetaSchema"
memo = "schemas:MemoMetaSchema"
[llm]
default_model = "claude-opus-4-7"
# cache_path = "../shared-llm-cache" # share cache across sibling projects
[rate_limits]
serper_rps = 30
requests_rps = 10
per_site_request_wait_s = 10
Now every add_node / add_memo call is validated against your Pydantic schema. The LLM extraction prompt is built from the same schema, so you get matching structure end-to-end.
Without schemas configured, meta is a free-form dict and you'll get back whatever the LLM decides is relevant. Useful for exploration — switch on schemas once the shape settles.
From inside the project directory:
uv run python -c "
from corkboard.board import Corkboard
from corkboard.nodes import add_node
cb = Corkboard.open() # walks up to find corkboard.db
add_node(cb, 'jack_kerouac', meta={'kind': 'person'})
add_node(cb, 'allen_ginsberg', meta={'kind': 'person'})
add_node(cb, 'william_burroughs', meta={'kind': 'person'})
"
That's enough to start the loop. The agent will resolve aliases (Wikipedia-style name variants), then search every pair.
Corkboard.open() walks upward from the cwd looking for corkboard.db, like git finds .git. Scripts run from any subdir of the project.
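The upward walk is the same trick git uses to find .git. A sketch of the discovery logic (not corkboard's actual implementation):

```python
from pathlib import Path

def find_project_root(start: Path, marker: str = "corkboard.db") -> Path:
    """Walk from start up to the filesystem root; return the first dir containing marker."""
    for d in [start, *start.parents]:
        if (d / marker).exists():
            return d
    raise FileNotFoundError(f"no {marker} found above {start}")
```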
corkboard is agent-driven by default. You don't run a pipeline — you point a coding agent at the project directory and tell it what you want.
cd beat_generation
claude
Open the project directory. Most coding agents auto-discover AGENTS.md; if yours doesn't, paste its contents into the system prompt.
Then say something like:
"Resolve aliases for all unaliased nodes, search every pair, scrape the results, extract memos. Keep going until corkboard coverage reports zero unfetched pairs."
The agent calls library primitives directly via short uv run python -c '…' snippets — no project-specific CLI. Concretely, one full loop pass looks like:
import asyncio
from corkboard.board import Corkboard
from corkboard.agent import resolve_aliases, extract_memo
from corkboard.search.batch import search_pairs
from corkboard.scrape.fetch import batch_fetch
from corkboard.triage import (
    nodes_missing_aliases, pairs_without_fodder,
    fodder_unfetched, pairs_without_memos,
)

async def main():
    cb = Corkboard.open()

    # 1. fill in missing aliases
    for n in nodes_missing_aliases(cb, limit=20):
        await resolve_aliases(cb, n.node_id)

    # 2. search pairs that have no fodder yet
    pairs = pairs_without_fodder(cb, accept_n_squared=True, limit=50)
    await search_pairs(cb, pairs)

    # 3. fetch fodder that was found but never downloaded
    todo = [(f.source_url, f.target_node_ids)
            for f in fodder_unfetched(cb, limit=50)]
    await batch_fetch(cb, todo)

    # 4. extract memos for pairs that have fodder but no memo
    for pair in pairs_without_memos(cb, limit=20):
        await extract_memo(cb, list(pair),
                           schema=cb.memo_schema,
                           model='claude-sonnet-4-6',
                           chunk_strategy='paragraphs_with_both',
                           chunk_window=1)

asyncio.run(main())
Triage queues (nodes_missing_aliases, pairs_without_fodder, …) are SQL views — they always return what's still outstanding, so re-running is safe and resumes automatically.
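"Queues are SQL views" is worth internalizing: the queue is recomputed from current state on every query, so a crash mid-run loses nothing and re-running never repeats finished work. A toy version with hypothetical minimal tables, not corkboard's actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE pairs (a TEXT, b TEXT);
    CREATE TABLE memos (a TEXT, b TEXT);
    -- the 'queue' is just a view over what's still missing
    CREATE VIEW pairs_without_memos AS
        SELECT a, b FROM pairs
        EXCEPT SELECT a, b FROM memos;
""")
con.executemany("INSERT INTO pairs VALUES (?, ?)",
                [("kerouac", "ginsberg"), ("kerouac", "burroughs")])

def queue():
    return con.execute("SELECT * FROM pairs_without_memos ORDER BY a, b").fetchall()

print(queue())  # both pairs outstanding
con.execute("INSERT INTO memos VALUES ('kerouac', 'ginsberg')")
print(queue())  # re-querying resumes with only the remaining pair
```

There is no checkpoint to corrupt or restore; finishing work is what dequeues it.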
# Aggregate stats — nodes, memos, fodder, pending pairs, total LLM cost:
corkboard coverage .
# Tail the audit log (every action, every cost, every error):
corkboard runs tail .
# Or just query the SQLite directly:
sqlite3 corkboard.db 'SELECT node_id, aliases FROM nodes LIMIT 5'
Three CLI commands — init, coverage, runs tail — and that's the whole CLI. Everything else is Python.
When extract_memo runs, the prompt asks the LLM to also surface any third-party entities it mentioned. Those become candidates.
# Pending candidates:
uv run python -c "
from corkboard.board import Corkboard
from corkboard.triage import unpromoted_candidates
cb = Corkboard.open()
for c in unpromoted_candidates(cb):
print(f'{c.suggested_name:30} mentions={c.mention_count}')
"
# Promote one — it becomes a Node and joins the next search round:
uv run python -c "
from corkboard.board import Corkboard
from corkboard.candidates import promote_candidate
cb = Corkboard.open()
promote_candidate(cb, 'CANDIDATE_ID')
"
Promote → next loop iteration searches the new node against everything. That's how networks grow beyond the seed.
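Promotion grows the pair frontier fast: one new node pairs with every existing node. The arithmetic, sketched:

```python
from itertools import combinations

def all_pairs(nodes):
    return set(combinations(sorted(nodes), 2))

nodes = {"kerouac", "ginsberg", "burroughs"}
before = all_pairs(nodes)             # 3 choose 2 = 3 pairs
nodes.add("cassady")                  # promoting a candidate
new_work = all_pairs(nodes) - before  # only the pairs touching the new node
print(sorted(new_work))
```

This is why triage limits and accept_n_squared exist — promote liberally and the search queue grows quadratically.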
aiohttp + cloudscraper covers most sites. Cloudflare-heavy or paywalled ones don't. AGENTS.md documents the fallbacks in detail; the short version:
- batch_archive_fetch — Selenium-driven archive.is fallback. Captcha'd, so interactive by default. For unattended runs use captcha_strategy="fail_fast" and use_headless_browser=True.
- add_fodder(cb, ..., full_text=..., raw_bytes=...) — hand in content you obtained some other way; the extract pipeline treats it identically.
- pairs_blocked_by_fetch(cb) — the "dead-letter" queue: pairs where every fodder URL failed to fetch.
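The fallback order reads naturally as a chain of fetchers tried in sequence, with failures accumulated for the dead-letter record. A generic sketch (function and parameter names hypothetical, not corkboard's API):

```python
def fetch_with_fallbacks(url, fetchers):
    """Try (name, fn) fetchers in order; return (name, body), or raise with all errors."""
    errors = []
    for name, fn in fetchers:
        try:
            return name, fn(url)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all fetchers failed for {url}: " + "; ".join(errors))
```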
| knob | what it does | when to use |
|---|---|---|
| model='claude-sonnet-4-6' | Sonnet instead of Opus (the library default). | Default. ~5× cheaper, handles memo extraction well. Drop to Opus only on retries where Sonnet got the structure wrong. |
| chunk_strategy='paragraphs_with_both' | Only send the LLM paragraphs that mention both entities. | Long sources (Wikipedia, biographies). Massive token savings, sharper memos. |
| chunk_window=1 | Include the paragraphs before and after each match. | When one entity is the topic and the other is mentioned in passing — gives the LLM enough context to type the relation instead of falling back to kind="other". |
| max_tokens=8000 | Raise the response budget. | Verbose schemas or long target tuples where extractions truncate mid-JSON. |
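chunk_strategy and chunk_window compose simply: keep the paragraphs mentioning both entities, then widen each hit by the window. A stdlib sketch of the selection, assuming paragraphs are blank-line separated (not the library's exact implementation):

```python
def select_paragraphs(text: str, a: str, b: str, window: int = 0) -> list[str]:
    """Paragraphs mentioning both a and b (case-insensitive), plus `window` neighbours each side."""
    paras = [p for p in text.split("\n\n") if p.strip()]
    hits = [i for i, p in enumerate(paras)
            if a.lower() in p.lower() and b.lower() in p.lower()]
    keep = set()
    for i in hits:
        keep.update(range(max(0, i - window), min(len(paras), i + window + 1)))
    return [paras[i] for i in sorted(keep)]
```

On a long biography this can cut the prompt from hundreds of paragraphs to a handful, which is where the token savings come from.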
The library silently drops a few things to avoid garbage state. Each has an opt-out — but the defaults are what you want 99% of the time.
- Candidates that match existing nodes: add_candidates filters them case-insensitively. Override with skip_known_nodes=False.
- Corkboard.open() walks upward from the cwd looking for corkboard.db, like git. Pass an explicit path to override.
- Memo extraction preserves target_node_ids order. Schema fields can reference the entities positionally (direction: Literal["a_cites_b"]) or by name (mentor: str, mentee: str). Free-text fields like summary use actual names — the prompt forbids leaking the A/B labels there.

The whole workflow, start to finish:

1. corkboard init my_project && cd my_project
2. Edit schemas.py, uncomment the [schemas] block in config.toml.
3. Seed nodes, then hand the project to an agent until corkboard coverage reports zero pending pairs.
4. Write a query.py that uses list_memos, cb.match, etc. Render a graph, write a report, hand it to the next agent.

Look at sample_projects/ in the corkboard repo for full worked examples: beat_generation, cambridge_five, quantum_pioneers, manhattan_project, etc.
- AGENTS.md — full one-liner reference
- schemas.py — your ontology lives here
- config.toml — rate limits, LLM, schemas
- corkboard.db — query it directly when curious
- corkboard.board — Corkboard.open / create
- corkboard.nodes / memos / fodder / candidates — CRUD
- corkboard.search / scrape / extract — the loop
- corkboard.triage — derived queues
- corkboard.match — predicates with & | ~
- corkboard.agent — convenience helpers

When in doubt: read AGENTS.md — it's written for exactly this question.
corkboard init <dir> · edit schemas.py · seed · hand to agent
questions live in _scratch/CORKBOARD_PLAN.md