diff --git a/README.md b/README.md
index f7ec823..acf1955 100644
--- a/README.md
+++ b/README.md
@@ -246,7 +246,132 @@ python3 clean_memories_tr.py --execute --limit 100
 
 ---
 
-### 5. OpenClaw Compactor Configuration
+### 5. Semantic Deduplication (Similarity Checking)
+
+**Why:** Smaller models (4b) often extract duplicate or near-duplicate gems. Without a check, your `gems_tr` collection fills with redundant entries.
+
+**The Problem:**
+- "User decided on Redis" and "User selected Redis for caching" are the same gem
+- Smaller models lack nuance: they extract surface variations as separate gems
+- Over time, 30-50% of gems may be duplicates
+
+**Solution: Semantic Similarity Check**
+
+Before inserting a new gem:
+1. Embed the candidate gem text
+2. Search `gems_tr` for similar embeddings (past 24h)
+3. If similarity > 0.85, SKIP (don't insert)
+4. If similarity 0.70-0.85, MERGE (update existing with richer context)
+5. If similarity < 0.70, INSERT (new unique gem)
+
+**Implementation Options:**
+
+#### Option A: Built-in Curator Check (Recommended)
+
+Modify `curator_timer.py` to add a pre-insertion similarity check:
+
+```python
+import requests
+from qdrant_client import QdrantClient
+
+qdrant = QdrantClient("http://:6333")
+
+def is_duplicate(gem_text: str, user_id: str = "rob", threshold: float = 0.85) -> bool:
+    """Check if a similar gem exists in the past 24h"""
+    # Embed the candidate
+    response = requests.post(
+        "http://:11434/api/embeddings",
+        json={"model": "mxbai-embed-large", "prompt": gem_text}
+    )
+    embedding = response.json()["embedding"]
+
+    # Search for similar gems
+    results = qdrant.search(
+        collection_name="gems_tr",
+        query_vector=embedding,
+        limit=3,
+        query_filter={
+            "must": [
+                {"key": "user_id", "match": {"value": user_id}},
+                {"key": "timestamp", "range": {"gte": "now-24h"}}  # placeholder: pass a concrete cutoff; Qdrant does not parse "now-24h"
+            ]
+        }
+    )
+
+    # Check similarity scores
+    for result in results:
+        if result.score > threshold:
+            return True  # Duplicate found
+    return False
+
+# In main loop, before inserting:
+if is_duplicate(gem["gem"]):
+    log.info(f"Skipping duplicate gem: {gem['gem'][:50]}...")
+    continue
+```
+
+**Pros:** Catches duplicates at the source, no extra jobs
+**Cons:** Adds ~50-100ms per gem (embedding call)
+
+#### Option B: Periodic AI Review (Subagent Task)
+
+Have a subagent periodically review and merge duplicates:
+
+```bash
+# Run weekly via cron
+0 3 * * 0 cd && python3 dedup_gems.py
+```
+
+**dedup_gems.py approach:**
+1. Load all gems from the past 7 days
+2. Group by semantic similarity (clustering)
+3. For each cluster with more than one gem:
+   - Keep the highest-confidence gem as primary
+   - Merge context from the others into the primary
+   - Delete the duplicates
+
+**Pros:** Can use a reasoning model for nuanced merging
+**Cons:** Batch job, so duplicates exist until cleanup runs
+
+#### Option C: Real-time Watcher Hook
+
+Add deduplication to the real-time watcher before memories are even stored:
+
+```python
+# In watcher, before upsert to memories_tr
+if is_similar_to_recent(memory_text, window="1h"):
+    memory["duplicate_of"] = similar_id  # Tag but still store
+```
+
+**Pros:** Prevents duplicate memories upstream
+**Cons:** Memories may differ slightly even if the gems would be the same
+
+**Recommendation by Model:**
+
+| Model | Recommended Approach | Reason |
+|-------|---------------------|--------|
+| **4b** | **Option A + B** | Built-in check prevents duplicates; periodic review catches edge cases |
+| **30b** | **Option B only** | 30b produces fewer duplicates; weekly review is sufficient |
+| **Production** | **Option A** | Best balance of prevention and performance |
+
+**Configuration:**
+
+Add to `curator_config.json` (`mode` accepts `"skip"`, `"merge"`, or `"flag"`):
+
+```json
+{
+  "deduplication": {
+    "enabled": true,
+    "similarity_threshold": 0.85,
+    "lookback_hours": 24,
+    "mode": "skip"
+  }
+}
+```
+
+---
+
+### 6. OpenClaw Compactor Configuration
 
 **Status:** ✅ Applied
 
@@ -278,7 +403,7 @@ python3 clean_memories_tr.py --execute --limit 100
 
 ---
 
-### 6. Configuration Options Reference
+### 7. Configuration Options Reference
 
 **All configurable options with defaults:**
 
@@ -308,7 +433,7 @@ python3 clean_memories_tr.py --execute --limit 100
 
 ---
 
-### 7. Embedding Models
+### 8. Embedding Models
 
 **Current Setup:**
 - `memories_tr`: `snowflake-arctic-embed2` (capture similarity)
@@ -322,7 +447,7 @@ python3 clean_memories_tr.py --execute --limit 100
 
 ---
 
-### 6. memory-qdrant Plugin
+### 9. memory-qdrant Plugin
 
 **Location:** `~/.openclaw/extensions/memory-qdrant/`
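
**Note on section 5 (skip / merge / insert):** The numbered steps in the new deduplication section describe three outcomes, while the Option A snippet only covers the skip case. A minimal sketch of the full three-way decision might look like the following. It assumes the best similarity score among recent gems has already been retrieved. The name `dedup_action`, its parameters, and the handling of the exact boundary values (0.85 and 0.70 both fall into MERGE) are illustrative assumptions, not part of `curator_timer.py`.

```python
from typing import Optional

# Thresholds taken from the section 5 steps; the names are hypothetical.
SKIP_THRESHOLD = 0.85   # similarity > 0.85        -> SKIP
MERGE_THRESHOLD = 0.70  # similarity 0.70 to 0.85  -> MERGE


def dedup_action(best_score: Optional[float],
                 skip_threshold: float = SKIP_THRESHOLD,
                 merge_threshold: float = MERGE_THRESHOLD) -> str:
    """Return "skip", "merge", or "insert" for a candidate gem.

    best_score is the highest similarity among recent gems,
    or None when the search returned no matches.
    """
    if best_score is None:
        return "insert"   # nothing comparable stored yet
    if best_score > skip_threshold:
        return "skip"     # near-identical gem already exists
    if best_score >= merge_threshold:
        return "merge"    # related gem: enrich the existing entry
    return "insert"       # sufficiently different: store as new


if __name__ == "__main__":
    for score in (0.92, 0.78, 0.42, None):
        print(score, dedup_action(score))
```

In a curator loop, the returned string could be matched against the `mode` setting in `curator_config.json`, for example by treating "merge" as "skip" when merging is disabled.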