# knowledge_base Schema

## Collection: `knowledge_base`

Purpose: Personal knowledge repository organized by topic/domain, not by source or project.

## Metadata Schema

```json
{
  "domain": "Python",                      // Primary knowledge area (Python, Networking, Android...)
  "path": "Python/AsyncIO/Patterns",       // Hierarchical: domain/subject/specific
  "subjects": ["async", "concurrency"],    // Cross-linking topics
  "category": "reference",                 // reference | tutorial | snippet | troubleshooting | concept
  "content_type": "code",                  // web_page | code | markdown | pdf | note
  "title": "Async Context Managers",       // Display name
  "checksum": "sha256:...",                // For duplicate detection
  "source_url": "https://...",             // Source attribution (always stored)
  "date_added": "2026-02-05",              // Date first stored
  "date_scraped": "2026-02-05T10:30:00",   // Exact timestamp scraped
  "text_preview": "..."                    // Auto-generated: first 300 chars of content
}
```

## Field Descriptions

| Field | Required | Description |
|-------|----------|-------------|
| `domain` | Yes | Primary knowledge domain (e.g., Python, Networking) |
| `path` | Yes | Hierarchical location: `Domain/Subject/Specific` |
| `subjects` | No | Array of related topics for cross-linking |
| `category` | Yes | Content type classification |
| `content_type` | Yes | Format: web_page, code, markdown, pdf, note |
| `title` | Yes | Human-readable title |
| `checksum` | Auto | SHA-256 hash for duplicate detection |
| `source_url` | Yes | Original source (web pages) or reference |
| `date_added` | Auto | Date stored (YYYY-MM-DD) |
| `date_scraped` | Auto | ISO timestamp when content was acquired |
| `text_preview` | Auto | First 300 chars of content (for display) |

## Content Categories

| Category | Use For |
|----------|---------|
| `reference` | Documentation, specs, cheat sheets |
| `tutorial` | Step-by-step guides, how-tos |
| `snippet` | Code snippets, short examples |
| `troubleshooting` | Error fixes, debugging steps |
| `concept` | Explanations, theory, patterns |

## Examples

| Content | Domain | Path | Category |
|---------|--------|------|----------|
| DNS troubleshooting | Networking | Networking/DNS/Reverse-Lookup | troubleshooting |
| Kotlin coroutines | Android | Android/Kotlin/Coroutines | tutorial |
| Systemd timers | Linux | Linux/Systemd/Timers | reference |
| Python async patterns | Python | Python/AsyncIO/Patterns | snippet |

## Workflow

### Smart Search (`smart_search.py`)

Always follow this pattern:

1. **Search knowledge_base first** — vector similarity search
2. **Search web via SearXNG** — get fresh results
3. **Synthesize** — combine KB + web findings
4. **Store new info** — if the web results contain substantial new content
   - Auto-check for duplicates (checksum comparison)
   - Only store if content is unique and substantial (>500 chars)
   - Auto-tag with domain, date_scraped, source_url

### Storage Policy

**Store when:**
- Content is substantial (>500 chars)
- Not a duplicate of an existing KB entry
- Has clear source attribution
- Belongs to a defined domain

**Skip when:**
- Too short (<500 chars)
- Duplicate/similar content already exists
- No clear source URL

### Review Schedule

**Monthly review** (cron: 1st of the month at 3 AM):
- Check entries older than 180 days
- Fast-moving domains (AI/ML, Python, JavaScript, Docker, DevOps): 90 days
- Remove outdated entries or flag them for update

### Fast-Moving Domains

These domains get shorter freshness thresholds:
- AI/ML (models change fast)
- Python (new versions, packages)
- JavaScript (framework churn)
- Docker (image updates)
- OpenClaw (active development)
- DevOps (tools evolve)

## Scripts

| Script | Purpose |
|--------|---------|
| `smart_search.py` | KB → web → store workflow |
| `kb_store.py` | Manual content storage |
| `kb_review.py` | Monthly review of outdated entries |
| `scrape_to_kb.py` | Direct URL scraping |

## Design Decisions

- **Subject-first**: Organize by knowledge type, not source
- **Path-based hierarchy**: Navigate `Domain/Subject/Specific`
- **Separate from memories**: `knowledge_base` and `openclaw_memories` are isolated collections
- **Duplicate handling**: Checksum + content similarity → skip duplicates
- **Auto-freshness**: Monthly cleanup of outdated entries
- **Full attribution**: Always store `source_url` and `date_scraped`