# knowledge_base Schema

## Collection: `knowledge_base`

Purpose: Personal knowledge repository organized by topic/domain, not by source or project.

## Metadata Schema

```json
{
  "domain": "Python",  // Primary knowledge area (Python, Networking, Android...)
  "path": "Python/AsyncIO/Patterns",  // Hierarchical: domain/subject/specific
  "subjects": ["async", "concurrency"],  // Cross-linking topics

  "category": "reference",  // reference | tutorial | snippet | troubleshooting | concept
  "content_type": "code",  // web_page | code | markdown | pdf | note

  "title": "Async Context Managers",  // Display name
  "checksum": "sha256:...",  // For duplicate detection
  "source_url": "https://...",  // Source attribution (always stored)
  "date_added": "2026-02-05",  // Date first stored
  "date_scraped": "2026-02-05T10:30:00"  // Exact timestamp scraped
}
```

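As a rough illustration of how a record with this shape might be assembled, the sketch below fills in the auto-generated fields. The function name `build_metadata` and its parameters are hypothetical and do not come from the scripts in this repository. It also assumes the checksum is a SHA-256 digest of the raw UTF-8 text and that timestamps are naive local times, matching the example values above.

```python
import hashlib
from datetime import datetime


def build_metadata(text, *, domain, path, title, category, content_type,
                   source_url, subjects=None, now=None):
    """Assemble a metadata dict shaped like the schema above (illustrative only)."""
    now = now or datetime.now()
    # Assumption: the checksum covers the raw UTF-8 bytes of the content.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {
        "domain": domain,
        "path": path,
        "subjects": list(subjects or []),
        "category": category,
        "content_type": content_type,
        "title": title,
        "checksum": f"sha256:{digest}",
        "source_url": source_url,
        "date_added": now.date().isoformat(),                    # YYYY-MM-DD
        "date_scraped": now.replace(microsecond=0).isoformat(),  # ISO timestamp
        "text_preview": text[:300],                              # first 300 chars
    }
```

Passing `now` explicitly keeps the function deterministic, which is convenient when testing duplicate detection.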
## Field Descriptions

| Field | Required | Description |
|-------|----------|-------------|
| `domain` | Yes | Primary knowledge domain (e.g., Python, Networking) |
| `path` | Yes | Hierarchical location: `Domain/Subject/Specific` |
| `subjects` | No | Array of related topics for cross-linking |
| `category` | Yes | Content classification: reference, tutorial, snippet, troubleshooting, concept |
| `content_type` | Yes | Format: web_page, code, markdown, pdf, note |
| `title` | Yes | Human-readable title |
| `checksum` | Auto | SHA256 hash for duplicate detection |
| `source_url` | Yes | Original source (web pages) or reference |
| `date_added` | Auto | Date stored (YYYY-MM-DD) |
| `date_scraped` | Auto | ISO timestamp when content was acquired |
| `text_preview` | Auto | First 300 chars of content (for display) |

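The table above can be turned into a small validation helper. This is a minimal sketch, not the validation used by the actual scripts: `validate_metadata` and the constant names are hypothetical. The check that `path` begins with `domain` is an assumption inferred from the `Domain/Subject/Specific` convention.

```python
REQUIRED_FIELDS = ("domain", "path", "category", "content_type", "title", "source_url")
CATEGORIES = {"reference", "tutorial", "snippet", "troubleshooting", "concept"}
CONTENT_TYPES = {"web_page", "code", "markdown", "pdf", "note"}


def validate_metadata(meta):
    """Return a list of problems; an empty list means the record looks valid."""
    errors = [f"missing required field: {name}"
              for name in REQUIRED_FIELDS if not meta.get(name)]
    if meta.get("category") and meta["category"] not in CATEGORIES:
        errors.append(f"unknown category: {meta['category']}")
    if meta.get("content_type") and meta["content_type"] not in CONTENT_TYPES:
        errors.append(f"unknown content_type: {meta['content_type']}")
    # Assumption: the first path segment repeats the domain.
    domain, path = meta.get("domain"), meta.get("path")
    if domain and path and path.split("/")[0] != domain:
        errors.append(f"path {path!r} does not start with domain {domain!r}")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue with a record at once.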
## Content Categories

| Category | Use For |
|----------|---------|
| `reference` | Documentation, specs, cheat sheets |
| `tutorial` | Step-by-step guides, how-tos |
| `snippet` | Code snippets, short examples |
| `troubleshooting` | Error fixes, debugging steps |
| `concept` | Explanations, theory, patterns |

## Examples

| Content | Domain | Path | Category |
|---------|--------|------|----------|
| DNS troubleshooting | Networking | Networking/DNS/Reverse-Lookup | troubleshooting |
| Kotlin coroutines | Android | Android/Kotlin/Coroutines | tutorial |
| Systemd timers | Linux | Linux/Systemd/Timers | reference |
| Python async patterns | Python | Python/AsyncIO/Patterns | reference |

## Workflow

### Smart Search (`smart_search.py`)

Always follow this pattern:

1. **Search knowledge_base first**: vector similarity search
2. **Search web via SearXNG**: get fresh results
3. **Synthesize**: combine KB + web findings
4. **Store new info**: if web has substantial new content
   - Auto-check for duplicates (checksum comparison)
   - Only store if content is unique and substantial (>500 chars)
   - Auto-tag with domain, date_scraped, source_url

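The four steps above can be sketched as a single function with the search, synthesis, and storage backends injected as callables. This is only an outline of the control flow and not the contents of `smart_search.py`. All parameter names, and the assumed shape of a web hit (a dict with `url` and `text` keys), are hypothetical.

```python
def smart_search(query, search_kb, search_web, synthesize, store, is_duplicate,
                 min_chars=500):
    """Run the KB -> web -> synthesize -> store pattern (illustrative only)."""
    kb_hits = search_kb(query)               # 1. knowledge_base first
    web_hits = search_web(query)             # 2. fresh web results
    answer = synthesize(kb_hits, web_hits)   # 3. combine both sources
    stored = []
    for hit in web_hits:                     # 4. store substantial, unique content
        text = hit["text"]
        if len(text) > min_chars and not is_duplicate(text):
            store(hit)
            stored.append(hit["url"])
    return answer, stored
```

Injecting the backends keeps the flow testable without a running vector store or SearXNG instance.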
### Storage Policy

**Store when:**
- Content is substantial (>500 chars)
- Not a duplicate of an existing KB entry
- Has clear source attribution
- Belongs to a defined domain

**Skip when:**
- Too short (500 chars or fewer)
- Duplicate/similar content exists
- No clear source URL

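The store/skip rules above could be expressed as one predicate that also reports why an item was skipped. This is a minimal sketch under stated assumptions: `should_store` and its parameters are hypothetical, and the duplicate check covers only exact matches via a SHA-256 checksum. Detecting merely similar content would need a separate similarity search, which is not shown.

```python
import hashlib

MIN_CHARS = 500  # threshold taken from the policy above


def should_store(text, source_url, domain, known_checksums, known_domains):
    """Return (decision, reason) following the store/skip rules (illustrative only)."""
    if len(text) <= MIN_CHARS:
        return False, "too short"
    if not source_url:
        return False, "no source URL"
    if domain not in known_domains:
        return False, "undefined domain"
    # Exact-duplicate check only; near-duplicates are not detected here.
    checksum = "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
    if checksum in known_checksums:
        return False, "duplicate"
    return True, "ok"
```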
### Review Schedule

**Monthly review** (cron: 1st of month at 3 AM):
- Check entries older than 180 days (default threshold)
- Fast-moving domains (AI/ML, Python, JavaScript, Docker, OpenClaw, DevOps): check entries older than 90 days
- Remove outdated entries or flag for update

### Fast-Moving Domains

These domains get shorter freshness thresholds:
- AI/ML (models change fast)
- Python (new versions, packages)
- JavaScript (framework churn)
- Docker (image updates)
- OpenClaw (active development)
- DevOps (tools evolve)

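A freshness check along these lines might look like the sketch below. The names `is_stale`, `FAST_MOVING_DOMAINS`, and the two threshold constants are hypothetical and not taken from `kb_review.py`. It assumes age is measured from `date_added` and that "older than" means strictly more than the threshold.

```python
from datetime import date, timedelta

DEFAULT_MAX_AGE_DAYS = 180
FAST_MAX_AGE_DAYS = 90
FAST_MOVING_DOMAINS = {"AI/ML", "Python", "JavaScript", "Docker", "OpenClaw", "DevOps"}


def is_stale(meta, today):
    """Return True when an entry is older than its domain's threshold."""
    fast = meta["domain"] in FAST_MOVING_DOMAINS
    limit = FAST_MAX_AGE_DAYS if fast else DEFAULT_MAX_AGE_DAYS
    # Assumption: age is counted from date_added (YYYY-MM-DD).
    added = date.fromisoformat(meta["date_added"])
    return today - added > timedelta(days=limit)
```

A monthly job could filter the collection with `is_stale` and then remove or flag the matching entries.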
## Scripts

| Script | Purpose |
|--------|---------|
| `smart_search.py` | KB → web → store workflow |
| `kb_store.py` | Manual content storage |
| `kb_review.py` | Monthly outdated review |
| `scrape_to_kb.py` | Direct URL scraping |

## Design Decisions

- **Subject-first**: Organize by topic/domain, not by source
- **Path-based hierarchy**: Navigate `Domain/Subject/Specific`
- **Separate from memories**: `knowledge_base` and `openclaw_memories` are isolated
- **Duplicate handling**: Checksum + content similarity → skip duplicates
- **Auto-freshness**: Monthly cleanup of outdated entries
- **Full attribution**: Always store source_url and date_scraped