knowledge_base Schema

Collection: `knowledge_base`

Purpose: Personal knowledge repository organized by topic/domain, not by source or project.

Metadata Schema

{
  "domain": "Python",                    // Primary knowledge area (Python, Networking, Android...)
  "path": "Python/AsyncIO/Patterns",     // Hierarchical: domain/subject/specific
  "subjects": ["async", "concurrency"],  // Cross-linking topics
  
  "category": "reference",               // reference | tutorial | snippet | troubleshooting | concept
  "content_type": "code",                // web_page | code | markdown | pdf | note
  
  "title": "Async Context Managers",     // Display name
  "checksum": "sha256:...",              // For duplicate detection
  "source_url": "https://...",           // Source attribution (always stored)
  "date_added": "2026-02-05",            // Date first stored
  "date_scraped": "2026-02-05T10:30:00", // Exact timestamp scraped
  "text_preview": "..."                  // First 300 chars of content (auto-generated)
}

Field Descriptions

Field	Required	Description
`domain`	Yes	Primary knowledge domain (e.g., Python, Networking)
`path`	Yes	Hierarchical location: `Domain/Subject/Specific`
`subjects`	No	Array of related topics for cross-linking
`category`	Yes	Content classification: reference, tutorial, snippet, troubleshooting, concept
`content_type`	Yes	Format: web_page, code, markdown, pdf, note
`title`	Yes	Human-readable title
`checksum`	Auto	SHA256 hash for duplicate detection
`source_url`	Yes	Original source (web pages) or reference
`date_added`	Auto	Date stored (YYYY-MM-DD)
`date_scraped`	Auto	ISO timestamp when content was acquired
`text_preview`	Auto	First 300 chars of content (for display)
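
As a rough illustration of how the fields marked Auto might be derived, here is a minimal sketch. The function name `auto_fields` is hypothetical, and hashing the raw UTF-8 text and using UTC timestamps are assumptions; the actual scripts may normalize content or use local time.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional

PREVIEW_CHARS = 300  # "First 300 chars of content" per the field table


def auto_fields(content: str, now: Optional[datetime] = None) -> dict:
    """Compute the Auto fields for one piece of content (hypothetical helper)."""
    # Assumption: timestamps are taken in UTC when none is supplied.
    now = now or datetime.now(timezone.utc)
    # Assumption: the checksum covers the raw text encoded as UTF-8.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return {
        "checksum": f"sha256:{digest}",
        "date_added": now.strftime("%Y-%m-%d"),
        "date_scraped": now.strftime("%Y-%m-%dT%H:%M:%S"),
        "text_preview": content[:PREVIEW_CHARS],
    }
```

The result can be merged into the manually supplied fields (`domain`, `path`, `title`, and so on) before the entry is stored.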

Content Categories

Category	Use For
`reference`	Documentation, specs, cheat sheets
`tutorial`	Step-by-step guides, how-tos
`snippet`	Code snippets, short examples
`troubleshooting`	Error fixes, debugging steps
`concept`	Explanations, theory, patterns

Examples

Content	Domain	Path	Category
DNS troubleshooting	Networking	Networking/DNS/Reverse-Lookup	troubleshooting
Kotlin coroutines	Android	Android/Kotlin/Coroutines	tutorial
Systemd timers	Linux	Linux/Systemd/Timers	reference
Python async patterns	Python	Python/AsyncIO/Patterns	reference
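
The schema rules can be checked mechanically before an entry is stored. The sketch below is one possible validator, not the project's own: `validate_metadata` is a hypothetical name, and the rule that `path` must begin with `domain` is inferred from the `Domain/Subject/Specific` convention.

```python
CATEGORIES = {"reference", "tutorial", "snippet", "troubleshooting", "concept"}
CONTENT_TYPES = {"web_page", "code", "markdown", "pdf", "note"}
REQUIRED = ("domain", "path", "category", "content_type", "title", "source_url")


def validate_metadata(meta: dict) -> list:
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = [f"missing required field: {key}" for key in REQUIRED if not meta.get(key)]

    category = meta.get("category")
    if category and category not in CATEGORIES:
        problems.append(f"unknown category: {category}")

    content_type = meta.get("content_type")
    if content_type and content_type not in CONTENT_TYPES:
        problems.append(f"unknown content_type: {content_type}")

    # Inferred rule: the first path segment repeats the domain.
    domain, path = meta.get("domain"), meta.get("path")
    if domain and path and path.split("/")[0] != domain:
        problems.append(f"path {path!r} does not start with domain {domain!r}")

    return problems
```

A check like this catches the easy mistake of putting a `content_type` value such as `code` into the `category` field.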

Workflow

Smart Search (`smart_search.py`)

Always follow this pattern:

1. Search knowledge_base first: vector similarity search
2. Search the web via SearXNG: get fresh results
3. Synthesize: combine KB and web findings
4. Store new info if the web results contain substantial new content
   - Auto-check for duplicates (checksum comparison)
   - Only store if content is unique and substantial (>500 chars)
   - Auto-tag with domain, date_scraped, source_url
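
The control flow of this pattern can be sketched as below. This is not the contents of `smart_search.py`: the KB search, SearXNG query, duplicate check, and store step are passed in as callables so the example stays self-contained, and the hit shape (`text`, `source_url` keys) is an assumption.

```python
from typing import Callable, List

MIN_CHARS = 500  # "substantial" threshold from the workflow


def smart_search(
    query: str,
    search_kb: Callable[[str], List[dict]],
    search_web: Callable[[str], List[dict]],
    is_duplicate: Callable[[str], bool],
    store: Callable[[dict], None],
) -> dict:
    """KB first, then web, then store substantial new web content."""
    kb_hits = search_kb(query)    # Step 1: vector similarity search
    web_hits = search_web(query)  # Step 2: fresh web results

    stored = []
    for hit in web_hits:          # Step 4: store new info
        text = hit.get("text", "")
        if len(text) <= MIN_CHARS or not hit.get("source_url"):
            continue              # too short, or no attribution
        if is_duplicate(text):
            continue              # already in the knowledge base
        store(hit)
        stored.append(hit["source_url"])

    # Step 3: synthesis is left to the caller; both result sets are returned.
    return {"kb": kb_hits, "web": web_hits, "stored": stored}
```

Injecting the four callables keeps the ordering logic testable without a vector store or a running SearXNG instance.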

Storage Policy

Store when:

Content is substantial (>500 chars)
Not duplicate of existing KB entry
Has clear source attribution
Belongs to a defined domain

Skip when:

Too short (500 chars or fewer)
Duplicate/similar content exists
No clear source URL
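
The policy can be expressed as a single decision function that also reports why an item was skipped. This is a minimal sketch under stated assumptions: `should_store` and its parameters are hypothetical, the set of defined domains is supplied by the caller, and only exact duplicates are detected via checksum (the similarity check is omitted).

```python
import hashlib

MIN_CHARS = 500  # content must be longer than this to count as substantial


def should_store(content, source_url, domain, known_checksums, defined_domains):
    """Return (decision, reason) for one candidate entry."""
    if len(content) <= MIN_CHARS:
        return False, "too short"
    if not source_url:
        return False, "no source URL"
    if domain not in defined_domains:
        return False, "undefined domain"
    # Assumption: checksums are SHA256 over the UTF-8 text, prefixed "sha256:".
    checksum = "sha256:" + hashlib.sha256(content.encode("utf-8")).hexdigest()
    if checksum in known_checksums:
        return False, "duplicate"
    return True, "ok"
```

Returning the reason alongside the decision makes it easier to log why content was skipped.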

Review Schedule

Monthly review (cron `0 3 1 * *`: 1st of month at 3 AM):

Check entries older than 180 days
Fast-moving domains (AI/ML, Python, JavaScript, Docker, OpenClaw, DevOps): 90 days
Remove outdated entries or flag for update

Fast-Moving Domains

These domains get shorter freshness thresholds:

AI/ML (models change fast)
Python (new versions, packages)
JavaScript (framework churn)
Docker (image updates)
OpenClaw (active development)
DevOps (tools evolve)
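
A staleness check built on these thresholds might look like the sketch below. The name `is_stale` is hypothetical, and measuring age from `date_added` (rather than `date_scraped`) is an assumption; `kb_review.py` may do this differently.

```python
from datetime import date

DEFAULT_MAX_AGE_DAYS = 180
FAST_MAX_AGE_DAYS = 90
FAST_MOVING = {"AI/ML", "Python", "JavaScript", "Docker", "OpenClaw", "DevOps"}


def is_stale(domain: str, date_added: str, today: date) -> bool:
    """True if the entry is older than its domain's freshness threshold."""
    limit = FAST_MAX_AGE_DAYS if domain in FAST_MOVING else DEFAULT_MAX_AGE_DAYS
    # Assumption: age is counted from date_added (YYYY-MM-DD).
    added = date.fromisoformat(date_added)
    return (today - added).days > limit
```

An entry exactly at the threshold is not yet stale, matching the "older than" wording.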

Scripts

Script	Purpose
`smart_search.py`	KB → web → store workflow
`kb_store.py`	Manual content storage
`kb_review.py`	Monthly outdated review
`scrape_to_kb.py`	Direct URL scraping

Design Decisions

Subject-first: Organize by topic/domain, not by source or project
Path-based hierarchy: Navigate Domain/Subject/Specific
Separate from memories: knowledge_base and openclaw_memories are isolated
Duplicate handling: Checksum + content similarity → skip duplicates
Auto-freshness: Monthly cleanup of outdated entries
Full attribution: Always store source_url and date_scraped
