Bootstrap ingest plan
Seed ~/cosmo-memory/topics/ from existing knowledge sources. Three-pass Opus pipeline. Shadow-mode review. Dashboard upgrade first.
~/cosmo-memory/topics/ is empty. The router can't pull anything into Cosmo's prompt because there's nothing to pull. Until topics exist, the read path is task-only. Bootstrap fills topics from sources we already have, in one careful pass with verification, before the dream pass starts running nightly.
What's at stake
Bootstrap is the only ingest pass that runs over all existing knowledge before the dream pass takes over. The dream pass after this only sees new episodes from chat turns. Anything not picked up here either:
- Lives forever in the source file (CLAUDE.md, skill docs) and Cosmo has to re-read it on every relevant turn — slow and tokens-expensive, OR
- Gets caught later by the dream pass when the user mentions it again. If the user never mentions it again, it's effectively invisible.
So: we ARE aiming for completeness, but only over a defined corpus. The trick is defining the corpus precisely, then verifying coverage of THAT corpus. We're not aiming for "all knowledge ever" — that's impossible. We're aiming for "every durable fact present in these specific files."
How do we know what we missed? Three verification mechanisms working together. No single one is sufficient.
- Coverage pass (Pass 2 of the pipeline). Opus reads source files + generated topics, lists facts NOT extracted, categorises as (a) intentionally excluded, (b) ambiguous, (c) genuine miss. Human reviews the miss list.
- Round-trip pass (Pass 3). Opus reads each topic, traces it back to source paragraphs. Untraceable topics = invented = revise or delete.
- Use-driven gap detection (deferred). Over time, log router-misses (zero topics matched) and direct-source-reads. Each is a missing-topic signal. Captured as a future task; lands with step 3 dream pass infrastructure.
Success criteria
- Topic count: probably 60-100 (more than v1's "40-60" estimate, since the corpus is broader)
- Asking Cosmo "what's my training plan?" routes to the right topic
- Asking "where do I live?" routes correctly
- Asking "what's H2OS?" routes correctly
- Coverage pass shows <5% genuine misses on a manual sample of source files
- Round-trip pass shows zero invented topics
- Every fact has a date per the SCHEMA "date everything" rule
The corpus — what's in scope
Programmatic enumeration. The bootstrap script discovers sources at runtime rather than hardcoding paths. Categories:
| Category | Pattern | Notes |
|---|---|---|
| Auto-memory (existing) | ~/.claude/projects/-Users-sahil-cosmo/memory/MEMORY.md + linked files | Highest-signal source. Treated as migration (see next section). |
| Cosmo project rules | ~/cosmo/CLAUDE.md | Deployment, ports, MCP, communication preferences. |
| User global rules | ~/.claude/CLAUDE.md | Cross-project conventions, date format, em-dash rule, etc. |
| All Sites projects | ~/Sites/**/CLAUDE.md (recursive) | Per-project rules. Includes subprojects under ~/Sites/labs/ etc. All of them, not just active. |
| Project skills (cosmo) | ~/cosmo/.claude/skills/**/*.md | health (extensive), dms, fleet, cosmo-development, etc. |
| User skills (global) | ~/.claude/skills/**/*.md | Bear, gmail, calendar, contacts, etc. — the user-level skills. |
| Specs & reference docs | ~/cosmo/specs/**/*.md, ~/cosmo/.claude/skills/health/reference/**/*.md | Architecture, research notes, health protocols, daily-commands, progressive-overload reference. |
| Session continuation files | Programmatically discovered by reading /continue and /catchup skill source code to enumerate every location they touch | Includes ~/cosmo/session-docs/, ~/cosmo/.sessions/, ~/cosmo/.claude/, per-project ~/Sites/{project}/session-docs/ + .sessions/ + .claude/, domain logs at ~/cosmo/.claude/skills/health/sessions/, etc. Old AND new files. |
| Domain session logs | ~/cosmo/.claude/skills/*/sessions/**/*.md | Coaching threads (health/Phase 9), work coaching, etc. Each session log is a continuity artefact. |
Total input estimate: 150-250K tokens. Larger than v1 estimate because session-files set was undercounted before.
The bootstrap script parses the /continue and /catchup skill source files (under ~/.claude/skills/continue/ and ~/.claude/skills/catchup/) and extracts the directory patterns they reference. Those skills already enumerate every session-file location authoritatively, so the bootstrap script's source list matches reality automatically.
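That skill-parsing step can be sketched as a small pure function. This is an illustrative assumption, not the real /continue implementation: it assumes the skill docs reference session locations as `~/`-rooted path strings, and the regex and the name `extractSessionDirs` are made up for the sketch.

```javascript
// Sketch: pull ~-rooted directory patterns out of a skill's markdown source.
// Assumption: paths appear inline as literal ~/… strings (possibly with
// {project}-style placeholders); .md file references are not directories.
function extractSessionDirs(skillSource) {
  const re = /~\/[\w.{}\/-]+/g;          // ~-rooted path-ish tokens
  const hits = skillSource.match(re) || [];
  // Keep only directory-looking references, deduplicated, in first-seen order
  return [...new Set(hits.filter((p) => !p.endsWith('.md')))];
}

// Usage sketch (paths assumed, not verified):
// const fs = require('fs'), os = require('os'), path = require('path');
// const dirs = ['continue', 'catchup'].flatMap((s) =>
//   extractSessionDirs(
//     fs.readFileSync(path.join(os.homedir(), '.claude/skills', s, 'SKILL.md'), 'utf8')));
```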
What's NOT in scope (yet)
These are real sources of memory but require their own ingest strategy. Captured as task files in ~/cosmo-memory/tasks/active/ so we don't forget:
- deferred Bear notes — encrypted SQLite, ~hundreds of long-form notes. Different SCHEMA decisions (granularity, tag mapping). Separate ingest later. (Task: Future ingest: Bear notes into cosmo-memory)
- deferred Strava activities — already accessible via healthyst integration. Separate ingest tied to running history. (Task: Future ingest: Strava activities)
- deferred Gmail — needs email-specific extraction rules. Separate ingest later. (Task: Future ingest: Gmail)
- deferred Photos library — separate ingest with vision-based metadata. (Task: Future ingest: Photos library)
- deferred Alternative note-app evaluation — Bear may not be the long-term home. (Task: Evaluate alternative note app)
And the SCHEMA-level exclusions (always):
- always exclude Secrets / credentials. Skip `.env` entirely. Reference patterns ("VAR lives in cosmo/.env") OK; values not.
- always exclude Code. No JS/Python/HTML files. Bootstrap reads docs, not source.
- always exclude Transient state derivable from runtime (process lists, etc.).
MEMORY.md as migration, not rebuild
The existing MEMORY.md at ~/.claude/projects/-Users-sahil-cosmo/memory/ is human-curated. Each entry has explicit Why: and How to apply: lines that took real failures to learn. Re-deriving via Opus risks losing those load-bearing rationales.
Migration approach:
- Preserve the rule text verbatim. "NEVER copy dist files to server" stays exactly as is.
- Preserve Why: and How to apply: lines intact.
- Preserve the date when known.
- Drop the `<type>` categorisation (user/feedback/project/reference). The old structure organises by category-of-rule; the new system organises by life domain. The type is metadata that doesn't carry semantic weight in topic-space.
- Re-route by domain. A `feedback_h2os_deployment.md` becomes a section inside `topics/work-h2os.md`, OR its own topic file if substantial enough.
This is migration in the sense that matters: facts kept, structure rebuilt around them. Pass 1 of the pipeline handles MEMORY.md with explicit instructions to preserve verbatim rule text + rationale.
Topic granularity rules
One topic per:
- Distinct life domain — health is one, work-h2os is another, activism is a third
- Distinct durable initiative inside a domain — phase-9-comeback, parkruns-history, migraine-history are three separate health topics
- Distinct entity that gets referenced repeatedly — Kathryn (relationship), CCSA (org), AGM action (campaign)
- Distinct procedural reference — h2os-deployment-process, claude-code-keybindings, port-reservations
NOT one topic per:
- Time slice — don't make "April 2026" a topic. Time = episode date, not topic identity.
- Person mentioned once — doesn't get a topic. Wait for dream pass to elevate.
- Generic concept — "running" alone is too broad. Specific (e.g. user's own training phases) or skip.
Splitting heuristic
If a topic file would mix "current state" + "history" of something, split into two: X-current and X-history. The router pulls current-state files way more often; history files only when the user explicitly asks back in time.
Date extraction rules
| Date type | Format | Example |
|---|---|---|
| Ongoing fact | *(since YYYY-MM)* | "Lives in Adelaide *(since 2024-03)*" |
| Range | *(YYYY-MM to YYYY-MM)* | "Used Mem0 *(2025-04 to 2026-04)*" |
| Exact day | *(YYYY-MM-DD)* | "First date with Kathryn *(2026-01-25)*" |
| Fuzzy | *(~YYYY)* | "Started caring about climate *(~2018)*" |
| Unknown | (no date — flag in dream pass review) | — |
If the source doesn't include a date and Pass 1 can't infer with confidence, leave dateless and add to a TODO list for human review. Don't invent precision.
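If the bootstrap script ever post-processes dates, the table above reduces to a tiny formatter. A minimal sketch; in practice Pass 1 emits these tags in-model, and the shape of the input object (`since`, `from`/`to`, `on`, `approx`) is an assumption made for illustration:

```javascript
// Sketch of the date-tagging rules: one tag format per date type.
// Returns null for unknown dates, which the caller should flag for review.
function dateTag(d) {
  if (!d) return null;                                   // unknown: leave dateless
  if (d.approx) return `*(~${d.approx})*`;               // fuzzy year
  if (d.since) return `*(since ${d.since})*`;            // ongoing fact
  if (d.from && d.to) return `*(${d.from} to ${d.to})*`; // range
  if (d.on) return `*(${d.on})*`;                        // exact day
  return null;
}
```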
Three-pass Opus pipeline
One Opus run isn't enough. Three sequential passes, each with one job:
Pass 1: Extract
Input: all source files concatenated, plus SCHEMA.md, plus the rule set above.
Job: produce topic files. Output format: stream of === FILE: topics/<basename>.md === blocks. Plus a TODO list of dates/facts that were ambiguous.
Cost: ~150-250K input + ~50K output ≈ $4-6 with Opus 4.7.
Time: 10-15 min.
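The `=== FILE: … ===` stream format from Pass 1 splits mechanically into files. A sketch under that assumption; the delimiter comes from the plan, while the function name and the exact trimming behaviour are illustrative:

```javascript
// Sketch: split a Pass 1 output stream into { path: content } pairs.
// Lines before the first delimiter (e.g. model preamble) are discarded.
function splitFileBlocks(stream) {
  const files = {};
  let current = null;
  for (const line of stream.split('\n')) {
    const m = line.match(/^=== FILE: (.+) ===$/);
    if (m) { current = m[1]; files[current] = []; continue; }
    if (current) files[current].push(line);
  }
  // Normalise each body to end with exactly one newline
  return Object.fromEntries(
    Object.entries(files).map(([k, v]) => [k, v.join('\n').trim() + '\n'])
  );
}
```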
Pass 2: Coverage check
Input: all source files + Pass 1's topic outputs.
Job: for each source file, list facts that did NOT make it into a topic. Categorise as:
- (a) intentionally excluded per rules (secret, code, transient)
- (b) ambiguous — the model wasn't sure if it qualified
- (c) genuine miss — should have been extracted but wasn't
Output: a coverage report at ~/cosmo-memory/inbox/bootstrap-coverage-YYYY-MM-DD.md. Human reviews the (c) list.
Cost: ~250-300K input + ~5-10K output ≈ $4-5.
Time: 10-15 min.
Pass 3: Round-trip
Input: Pass 1's topic outputs + source files.
Job: for each topic, identify which source paragraph(s) produced it. Flag any topic that can't be traced as "potentially invented."
Output: source-link metadata appended to topic frontmatter as sources: [path1, path2, ...], plus a "potentially invented" list at ~/cosmo-memory/inbox/bootstrap-roundtrip-YYYY-MM-DD.md.
Cost: ~200-300K input + ~5K output ≈ $3-4.
Time: 10-15 min.
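Appending the `sources:` metadata can be sketched as a small frontmatter edit. This assumes topics use simple `---`-delimited YAML frontmatter; `addSources` and the inline-list serialisation are illustrative, not a settled format:

```javascript
// Sketch: write Pass 3's source paths into a topic's frontmatter.
// Creates a frontmatter block if the topic doesn't have one yet.
function addSources(topicMd, sources) {
  const line = `sources: [${sources.join(', ')}]`;
  if (topicMd.startsWith('---\n')) {
    const end = topicMd.indexOf('\n---', 4); // closing delimiter of frontmatter
    return topicMd.slice(0, end) + '\n' + line + topicMd.slice(end);
  }
  return `---\n${line}\n---\n${topicMd}`;
}
```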
Dashboard upgrade — built first
The current dashboard at http://127.0.0.1:9876 is a viewer. Bootstrap output is 60-100 topic files. Reviewing in a viewer with no review tooling is brutal. So upgrade dashboard FIRST, then run bootstrap.
New features:
- Per-topic review state — a `bootstrap_reviewed: approved | needs-revision | delete` field stored in frontmatter (or a side-file). Buttons in the dashboard to set state.
- Comments — a `bootstrap_notes:` field where you leave a one-liner per file ("missing the X fact", "merge with Y", etc.). Editable from the dashboard.
- Source links — each topic shows which source(s) produced it (from Pass 3 metadata). Clickable.
- Coverage view — a tab showing every source file → which topics reference it. Highlights sources with zero topic references.
- Diff vs source — for a topic, show the source paragraphs that fed into it. Lets you spot misinterpretations.
Time: ~2-3 hours of work.
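What an "approve" button would actually write, assuming the frontmatter variant of the review-state design (not the side-file variant), can be sketched as follows. `setReviewState` and the note-quoting are illustrative:

```javascript
// Sketch: set bootstrap_reviewed (and optionally bootstrap_notes) in a
// topic's frontmatter. Assumes the file already has a --- frontmatter block.
function setReviewState(topicMd, state, note) {
  let out = topicMd.replace(/^bootstrap_reviewed:.*$/m, `bootstrap_reviewed: ${state}`);
  if (out === topicMd) {
    // Field absent: insert it just after the opening ---
    out = topicMd.replace(/^---\n/, `---\nbootstrap_reviewed: ${state}\n`);
  }
  if (note) out = out.replace(/^---\n/, `---\nbootstrap_notes: "${note}"\n`);
  return out;
}
```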
Shadow-mode review flow
Bootstrap writes to ~/cosmo-memory/topics-draft-YYYY-MM-DD/, NOT directly to topics/. Review happens in the draft dir. Once approved, files move into topics/.
The flow:
- Run bootstrap script → writes to `topics-draft-2026-04-29/`
- Open dashboard → see all draft topics with their review state (initially all "pending")
- Review each: approved | needs-revision | delete
- For "needs-revision": edit the file directly via VSCode or leave a note and come back
- For "delete": click delete, file goes to `topics-draft-DATE/.deleted/` (audit trail, not lost)
- Once everything is approved or deleted: run `cosmo-mem bootstrap merge` → moves approved files into `topics/`, archives the draft dir
- Commit cosmo-memory repo
- Restart cosmo-agent so the live system picks up the new topics
topics/ stays untouched until merge. This makes re-runs trivial and risk-free.
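The merge gate reduces to a pure check over each draft file's frontmatter. The state values come from the review-state design above; `mergeAction` and its return labels are illustrative, and a real `cosmo-mem bootstrap merge` would map them onto actual file moves:

```javascript
// Sketch: decide what the merge step does with one draft topic file.
// Anything still pending or needs-revision blocks the whole merge.
function mergeAction(topicMd) {
  const m = topicMd.match(/^bootstrap_reviewed:\s*(\S+)/m);
  const state = m ? m[1] : 'pending';
  if (state === 'approved') return 'move-to-topics';
  if (state === 'delete') return 'move-to-.deleted';
  return 'block-merge';
}
```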
Estimated cost & time
| Item | Estimate |
|---|---|
| Pass 1 (extract) | ~$4-6 · 10-15 min |
| Pass 2 (coverage) | ~$4-5 · 10-15 min |
| Pass 3 (round-trip) | ~$3-4 · 10-15 min |
| LLM cost total | ~$11-15 |
| Pipeline wall-clock | ~30-45 min |
| Dashboard upgrade build | ~2-3 h |
| Bootstrap script build | ~1-2 h |
| Human review time | ~3-4 h spread over a day |
| Total wall time | ~1-1.5 days |
This is bounded and worth it. The downstream value (router has things to pick from, dream pass starts running on real topics, "what does Cosmo know about me" actually works) is huge.
Deferred to later steps
Captured as task files in ~/cosmo-memory/tasks/active/ so they don't get forgotten:
- `future-ingest-bear-notes-into-cosmo-memory`
- `future-ingest-strava-activities-into-cosmo-memory`
- `future-ingest-gmail-into-cosmo-memory`
- `future-ingest-photos-library-into-cosmo-memory`
- `evaluate-alternative-note-app-replace-bear-long-term`
- `build-use-driven-gap-detection-log-router-0-topic-returns-direct-source-reads`
Build order
1. This doc, signed off. ✅ (if you're reading this, we're here)
2. Deploy doc to Cloudflare Pages (sibling to memory-v2.pages.dev).
3. Dashboard upgrade — review state, comments, source links, coverage view, diff-vs-source. (~2-3h)
4. Bootstrap script at `~/cosmo/scripts/cosmo-mem-bootstrap.js`. Reads the /continue + /catchup skills to enumerate sources, runs Pass 1, Pass 2, Pass 3 sequentially, writes to `topics-draft-YYYY-MM-DD/`, generates inbox reports. (~1-2h)
5. Dry-run on MEMORY.md only — sanity check the migration approach for ~$1, ~3 min.
6. Full bootstrap run — all sources, three passes. (~30-45 min, ~$11-15)
7. Review via dashboard — mark every topic. (~3-4h spread over a day)
8. Merge into `topics/`. Archive draft dir. Commit cosmo-memory.
9. Restart cosmo-agent. Live system picks up new INDEX (auto-loaded).
10. Smoke-test via Telegram — "what's my current training plan?", "where do I live?", "what's H2OS?".
Checkpoints
- After step 5 (dry-run): show migration output for MEMORY.md, decide if the approach is right before the full run.
- After step 7 (review): final approval before merging into `topics/`.