skills/qdrant-memory/knowledge_base_schema.md

# knowledge_base Schema

## Collection: `knowledge_base`

Purpose: Personal knowledge repository organized by topic/domain, not by source or project.

## Metadata Schema

```jsonc
{
  "domain": "Python",                     // Primary knowledge area (Python, Networking, Android...)
  "path": "Python/AsyncIO/Patterns",      // Hierarchical: domain/subject/specific
  "subjects": ["async", "concurrency"],   // Cross-linking topics

  "category": "reference",                // reference | tutorial | snippet | troubleshooting | concept
  "content_type": "code",                 // web_page | code | markdown | pdf | note

  "title": "Async Context Managers",      // Display name
  "checksum": "sha256:...",               // For duplicate detection
  "source_url": "https://...",            // Source attribution (always stored)
  "date_added": "2026-02-05",             // Date first stored
  "date_scraped": "2026-02-05T10:30:00",  // Exact timestamp scraped
  "text_preview": "..."                   // First 300 chars of content (for display)
}
```

## Field Descriptions

| Field | Required | Description |
|-------|----------|-------------|
| `domain` | Yes | Primary knowledge domain (e.g., Python, Networking) |
| `path` | Yes | Hierarchical location: `Domain/Subject/Specific` |
| `subjects` | No | Array of related topics for cross-linking |
| `category` | Yes | Kind of content: reference, tutorial, snippet, troubleshooting, concept |
| `content_type` | Yes | Format: web_page, code, markdown, pdf, note |
| `title` | Yes | Human-readable title |
| `checksum` | Auto | SHA256 hash for duplicate detection |
| `source_url` | Yes | Original source (web pages) or reference |
| `date_added` | Auto | Date stored (YYYY-MM-DD) |
| `date_scraped` | Auto | ISO timestamp when content was acquired |
| `text_preview` | Auto | First 300 chars of content (for display) |
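
As a rough illustration of how the `Auto` fields could be derived, here is a minimal sketch. The function name `auto_fields` is hypothetical, and hashing the raw UTF-8 bytes without any normalization is an assumption; the actual scripts may compute these values differently.

```python
import hashlib
from datetime import date, datetime

PREVIEW_CHARS = 300  # preview length from the field table


def auto_fields(content: str, scraped_at: datetime, added_on: date) -> dict:
    """Derive the fields marked Auto (hypothetical helper, not the real script)."""
    # Assumption: the checksum covers the raw UTF-8 bytes of the content.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return {
        "checksum": f"sha256:{digest}",
        "date_added": added_on.isoformat(),                        # YYYY-MM-DD
        "date_scraped": scraped_at.isoformat(timespec="seconds"),  # ISO timestamp
        "text_preview": content[:PREVIEW_CHARS],
    }
```

Passing the timestamps in as arguments, rather than calling `datetime.now()` inside, keeps the helper deterministic and easy to test.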

## Content Categories

| Category | Use For |
|----------|---------|
| `reference` | Documentation, specs, cheat sheets |
| `tutorial` | Step-by-step guides, how-tos |
| `snippet` | Code snippets, short examples |
| `troubleshooting` | Error fixes, debugging steps |
| `concept` | Explanations, theory, patterns |

## Examples

| Content | Domain | Path | Category |
|---------|--------|------|----------|
| DNS troubleshooting | Networking | Networking/DNS/Reverse-Lookup | troubleshooting |
| Kotlin coroutines | Android | Android/Kotlin/Coroutines | tutorial |
| Systemd timers | Linux | Linux/Systemd/Timers | reference |
| Python async patterns | Python | Python/AsyncIO/Patterns | concept |
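
The field and category rules can be checked before an entry is stored. The sketch below is one possible validator; `validate_metadata` is a hypothetical name, and requiring exactly three path segments with the first equal to `domain` is an assumption drawn from the `Domain/Subject/Specific` pattern, not a documented constraint.

```python
CATEGORIES = {"reference", "tutorial", "snippet", "troubleshooting", "concept"}
CONTENT_TYPES = {"web_page", "code", "markdown", "pdf", "note"}
REQUIRED = ("domain", "path", "category", "content_type", "title", "source_url")


def validate_metadata(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata looks valid."""
    errors = [f"missing required field: {name}" for name in REQUIRED if not meta.get(name)]

    category = meta.get("category")
    if category and category not in CATEGORIES:
        errors.append(f"unknown category: {category}")

    content_type = meta.get("content_type")
    if content_type and content_type not in CONTENT_TYPES:
        errors.append(f"unknown content_type: {content_type}")

    domain, path = meta.get("domain"), meta.get("path")
    if domain and path:
        parts = path.split("/")
        # Assumption: paths always have exactly three non-empty segments.
        if len(parts) != 3 or not all(parts):
            errors.append("path must have the form Domain/Subject/Specific")
        elif parts[0] != domain:
            errors.append("path must start with the domain")
    return errors
```

Returning every problem at once, instead of raising on the first one, makes it easier to fix a rejected entry in a single pass.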

## Workflow

### Smart Search (`smart_search.py`)

Always follow this pattern:

1. **Search knowledge_base first** — vector similarity search
2. **Search web via SearXNG** — get fresh results
3. **Synthesize** — combine KB + web findings
4. **Store new info** — if web has substantial new content
   - Auto-check for duplicates (checksum comparison)
   - Only store if content is unique and substantial (>500 chars)
   - Auto-tag with domain, date_scraped, source_url
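
The four steps above can be sketched roughly as follows. This is not the code of `smart_search.py`: the callables `search_kb`, `search_web`, and `store`, and the `text` / `source_url` keys on each web hit, are hypothetical stand-ins for the Qdrant and SearXNG calls.

```python
import hashlib
from typing import Callable

MIN_CHARS = 500  # "substantial" threshold from step 4


def smart_search(
    query: str,
    search_kb: Callable[[str], list[dict]],
    search_web: Callable[[str], list[dict]],
    known_checksums: set[str],
    store: Callable[[dict], None],
) -> dict:
    """KB first, then web, then store unique and substantial web hits."""
    kb_hits = search_kb(query)    # 1. knowledge base first
    web_hits = search_web(query)  # 2. fresh web results

    stored = []
    for hit in web_hits:          # 4. store new info
        text = hit.get("text", "")
        checksum = "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
        is_new = checksum not in known_checksums
        if len(text) > MIN_CHARS and hit.get("source_url") and is_new:
            store({**hit, "checksum": checksum})
            known_checksums.add(checksum)
            stored.append(hit["source_url"])

    # 3. synthesis is left to the caller, which receives both result sets
    return {"kb": kb_hits, "web": web_hits, "stored": stored}
```

Injecting the search and store functions keeps the control flow testable without a running Qdrant or SearXNG instance.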

### Storage Policy

**Store when:**
- Content is substantial (>500 chars)
- Not duplicate of existing KB entry
- Has clear source attribution
- Belongs to a defined domain

**Skip when:**
- Too short (500 chars or fewer)
- Duplicate/similar content exists
- No clear source URL
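
One way to express this policy as a single decision function is sketched below. The name `skip_reason`, the `known_domains` set, and the precomputed `is_duplicate` flag are all hypothetical; treating exactly 500 characters as too short is also an assumption, since the policy only states `>500` and `<500`.

```python
from typing import Optional

MIN_CHARS = 500


def skip_reason(
    text: str,
    source_url: Optional[str],
    domain: Optional[str],
    known_domains: set[str],
    is_duplicate: bool,
) -> Optional[str]:
    """Return why an item should be skipped, or None if it should be stored."""
    if len(text) <= MIN_CHARS:  # assumption: exactly 500 chars counts as too short
        return "too short"
    if is_duplicate:            # result of the checksum/similarity check done elsewhere
        return "duplicate"
    if not source_url:
        return "no source URL"
    if domain not in known_domains:
        return "undefined domain"
    return None
```

Returning the reason rather than a bare boolean makes skipped items easy to log and audit later.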

### Review Schedule

**Monthly review** (cron `0 3 1 * *`: 1st of the month at 3 AM):
- Default: check entries older than 180 days
- Fast-moving domains (AI/ML, Python, JavaScript, Docker, OpenClaw, DevOps): check entries older than 90 days
- Remove outdated entries or flag for update

### Fast-Moving Domains

These domains get shorter freshness thresholds:
- AI/ML (models change fast)
- Python (new versions, packages)
- JavaScript (framework churn)
- Docker (image updates)
- OpenClaw (active development)
- DevOps (tools evolve)
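
A staleness check with per-domain thresholds might look like the sketch below. `is_stale` is a hypothetical name, and measuring age from `date_added` (rather than `date_scraped`) is an assumption; `kb_review.py` may use a different field.

```python
from datetime import date, timedelta

DEFAULT_MAX_AGE_DAYS = 180
FAST_MAX_AGE_DAYS = 90
FAST_MOVING = {"AI/ML", "Python", "JavaScript", "Docker", "OpenClaw", "DevOps"}


def is_stale(domain: str, date_added: str, today: date) -> bool:
    """True if the entry is older than the freshness threshold for its domain."""
    limit = FAST_MAX_AGE_DAYS if domain in FAST_MOVING else DEFAULT_MAX_AGE_DAYS
    # Assumption: age is counted from date_added (YYYY-MM-DD).
    age = today - date.fromisoformat(date_added)
    return age > timedelta(days=limit)
```

For example, a Python entry added on 2026-02-05 is 91 days old on 2026-05-07 and would be flagged, while a Linux entry of the same age would not.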

## Scripts

| Script | Purpose |
|--------|---------|
| `smart_search.py` | KB → web → store workflow |
| `kb_store.py` | Manual content storage |
| `kb_review.py` | Monthly outdated review |
| `scrape_to_kb.py` | Direct URL scraping |

## Design Decisions

- **Subject-first**: Organize by topic/domain, not by source or project
- **Path-based hierarchy**: Navigate `Domain/Subject/Specific`
- **Separate from memories**: `knowledge_base` and `openclaw_memories` are isolated
- **Duplicate handling**: Checksum + content similarity → skip duplicates
- **Auto-freshness**: Monthly cleanup of outdated entries
- **Full attribution**: Always store source_url and date_scraped