knowledge_base Schema
Collection: knowledge_base
Purpose: Personal knowledge repository organized by topic/domain, not by source or project.
Metadata Schema
{
"domain": "Python", // Primary knowledge area (Python, Networking, Android...)
"path": "Python/AsyncIO/Patterns", // Hierarchical: domain/subject/specific
"subjects": ["async", "concurrency"], // Cross-linking topics
"category": "reference", // reference | tutorial | snippet | troubleshooting | concept
"content_type": "code", // web_page | code | markdown | pdf | note
"title": "Async Context Managers", // Display name
"checksum": "sha256:...", // For duplicate detection
"source_url": "https://...", // Source attribution (always stored)
"date_added": "2026-02-05", // Date first stored
"date_scraped": "2026-02-05T10:30:00" // Exact timestamp scraped
}
Field Descriptions
| Field | Required | Description |
|---|---|---|
| domain | Yes | Primary knowledge domain (e.g., Python, Networking) |
| path | Yes | Hierarchical location: Domain/Subject/Specific |
| subjects | No | Array of related topics for cross-linking |
| category | Yes | Kind of content: reference, tutorial, snippet, troubleshooting, concept |
| content_type | Yes | Format: web_page, code, markdown, pdf, note |
| title | Yes | Human-readable title |
| checksum | Auto | SHA256 hash for duplicate detection |
| source_url | Yes | Original source (web pages) or reference |
| date_added | Auto | Date stored (YYYY-MM-DD) |
| date_scraped | Auto | ISO timestamp when content was acquired |
| text_preview | Auto | First 300 chars of content (for display) |
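A minimal sketch of how the Required and Auto fields could fit together. The helper name build_metadata, the rule that the path must start with the domain, and the use of naive local time are assumptions for illustration, not part of the scripts listed in this document.

```python
import hashlib
from datetime import datetime
from typing import Optional

# Field names and allowed values are taken from the schema above.
REQUIRED = ("domain", "path", "category", "content_type", "title", "source_url")
CATEGORIES = {"reference", "tutorial", "snippet", "troubleshooting", "concept"}
CONTENT_TYPES = {"web_page", "code", "markdown", "pdf", "note"}


def build_metadata(content: str, fields: dict, now: Optional[datetime] = None) -> dict:
    """Hypothetical helper: validate the required fields, then add the Auto ones."""
    missing = [name for name in REQUIRED if not fields.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if fields["category"] not in CATEGORIES:
        raise ValueError(f"unknown category: {fields['category']!r}")
    if fields["content_type"] not in CONTENT_TYPES:
        raise ValueError(f"unknown content_type: {fields['content_type']!r}")
    # Assumption: the first path segment must repeat the domain.
    if fields["path"].split("/")[0] != fields["domain"]:
        raise ValueError("path must start with the domain")

    now = now or datetime.now()  # assumption: naive local time, as in the example
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return {
        **fields,
        "subjects": list(fields.get("subjects", [])),  # optional, defaults to []
        "checksum": f"sha256:{digest}",
        "date_added": now.strftime("%Y-%m-%d"),
        "date_scraped": now.strftime("%Y-%m-%dT%H:%M:%S"),
        "text_preview": content[:300],
    }
```

Passing a fixed now makes the output reproducible, which helps when testing duplicate detection.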
Content Categories
| Category | Use For |
|---|---|
| reference | Documentation, specs, cheat sheets |
| tutorial | Step-by-step guides, how-tos |
| snippet | Code snippets, short examples |
| troubleshooting | Error fixes, debugging steps |
| concept | Explanations, theory, patterns |
Examples
| Content | Domain | Path | Category |
|---|---|---|---|
| DNS troubleshooting | Networking | Networking/DNS/Reverse-Lookup | troubleshooting |
| Kotlin coroutines | Android | Android/Kotlin/Coroutines | tutorial |
| Systemd timers | Linux | Linux/Systemd/Timers | reference |
| Python async patterns | Python | Python/AsyncIO/Patterns | reference |
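The paths in the table can be split and filtered with plain string handling. The sketch below assumes every path has exactly three levels. KBPath, parse_path, and under are hypothetical names, not functions from the listed scripts.

```python
from typing import List, NamedTuple


class KBPath(NamedTuple):
    domain: str
    subject: str
    specific: str


def parse_path(path: str) -> KBPath:
    """Split 'Domain/Subject/Specific' into its three levels."""
    parts = path.strip("/").split("/")
    # Assumption: exactly three non-empty segments.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Domain/Subject/Specific, got {path!r}")
    return KBPath(*parts)


def under(paths: List[str], prefix: str) -> List[str]:
    """Return the paths located at or below a prefix such as 'Linux/Systemd'."""
    prefix = prefix.strip("/")
    return [p for p in paths if p == prefix or p.startswith(prefix + "/")]
```

Matching on prefix + "/" rather than the bare prefix keeps a domain such as Python from also matching a hypothetical sibling like Python3.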
Workflow
Smart Search (smart_search.py)
Always follow this pattern:
- Search knowledge_base first: vector similarity search
- Search web via SearXNG: get fresh results
- Synthesize: combine KB + web findings
- Store new info: if web has substantial new content
  - Auto-check for duplicates (checksum comparison)
  - Only store if content is unique and substantial (>500 chars)
  - Auto-tag with domain, date_scraped, source_url
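The order of the steps can be sketched as a small orchestration function. This is not the code of smart_search.py. The search, policy, and storage callables are injected by the caller, and the Hit shape is an assumption made for the example.

```python
from typing import Callable, Dict, List

Hit = Dict[str, str]  # assumed shape: {"text": ..., "source_url": ...}


def smart_search(
    query: str,
    search_kb: Callable[[str], List[Hit]],
    search_web: Callable[[str], List[Hit]],
    should_store: Callable[[Hit], bool],
    store: Callable[[Hit], None],
) -> Dict[str, List[Hit]]:
    """Run the four steps in order; every callable is supplied by the caller."""
    kb_hits = search_kb(query)    # 1. knowledge_base first
    web_hits = search_web(query)  # 2. then the web
    # 3. Synthesis is left to the caller; both result sets are returned.
    stored = []
    for hit in web_hits:          # 4. store only what passes the policy
        if should_store(hit):
            store(hit)
            stored.append(hit)
    return {"kb": kb_hits, "web": web_hits, "stored": stored}
```

Keeping the backends as parameters lets the flow be exercised with stubs, without a running vector store or SearXNG instance.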
Storage Policy
Store when:
- Content is substantial (>500 chars)
- Not duplicate of existing KB entry
- Has clear source attribution
- Belongs to a defined domain
Skip when:
- Too short (500 chars or fewer)
- Duplicate/similar content exists
- No clear source URL
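The policy reduces to a single predicate. The sketch below covers exact duplicates only, using a SHA256 checksum, and leaves similarity detection out. The function names, the argument list, and the idea of a known_domains set are assumptions for illustration.

```python
import hashlib
from typing import Set, Tuple

MIN_CHARS = 500  # threshold from the policy above


def checksum(text: str) -> str:
    """Return the content hash in the 'sha256:...' form used by the schema."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def should_store(
    text: str,
    source_url: str,
    domain: str,
    existing_checksums: Set[str],
    known_domains: Set[str],
) -> Tuple[bool, str]:
    """Return (decision, reason) so that skipped items can be logged."""
    if len(text) <= MIN_CHARS:
        return False, "too short"
    if not source_url:
        return False, "no source URL"
    if domain not in known_domains:
        return False, "undefined domain"
    if checksum(text) in existing_checksums:
        return False, "duplicate"
    return True, "ok"
```

Returning the reason alongside the decision makes it easier to see why a page was skipped during a search run.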
Review Schedule
Monthly review (cron: 1st of month at 3 AM):
- Check entries older than 180 days
- Fast-moving domains (AI/ML, Python, JavaScript, Docker, OpenClaw, DevOps): 90 days
- Remove outdated entries or flag for update
Fast-Moving Domains
These domains get shorter freshness thresholds:
- AI/ML (models change fast)
- Python (new versions, packages)
- JavaScript (framework churn)
- Docker (image updates)
- OpenClaw (active development)
- DevOps (tools evolve)
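A sketch of the freshness check, assuming age is measured from date_added and that "older than" means strictly more than the threshold. The names max_age_days and is_stale are hypothetical and are not taken from kb_review.py.

```python
from datetime import date, timedelta

# Domains from the list above get the shorter threshold.
FAST_MOVING = {"AI/ML", "Python", "JavaScript", "Docker", "OpenClaw", "DevOps"}
DEFAULT_MAX_AGE_DAYS = 180
FAST_MAX_AGE_DAYS = 90


def max_age_days(domain: str) -> int:
    """Return the freshness threshold, in days, for a domain."""
    return FAST_MAX_AGE_DAYS if domain in FAST_MOVING else DEFAULT_MAX_AGE_DAYS


def is_stale(entry: dict, today: date) -> bool:
    """True when the entry is older than its domain's threshold."""
    # Assumption: age is counted from date_added (YYYY-MM-DD).
    added = date.fromisoformat(entry["date_added"])
    return today - added > timedelta(days=max_age_days(entry["domain"]))
```

In standard crontab syntax, a run at 3 AM on the 1st of each month is written as `0 3 1 * *`.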
Scripts
| Script | Purpose |
|---|---|
| smart_search.py | KB → web → store workflow |
| kb_store.py | Manual content storage |
| kb_review.py | Monthly outdated review |
| scrape_to_kb.py | Direct URL scraping |
Design Decisions
- Subject-first: Organize by knowledge type, not source
- Path-based hierarchy: Navigate Domain/Subject/Specific
- Separate from memories: knowledge_base and openclaw_memories are isolated
- Duplicate handling: Checksum + content similarity → skip duplicates
- Auto-freshness: Monthly cleanup of outdated entries
- Full attribution: Always store source_url and date_scraped