docs: Add Semantic Deduplication section with similarity checking

- Why smaller models need deduplication (4b vs 30b)
- Three implementation options (built-in, periodic AI, watcher hook)
- Code example for pre-insertion similarity check
- Configuration options for deduplication settings
- Recommendations by model size
- Fixed section numbering
2026-02-24 21:15:04 -06:00
parent 05acb4f22b
commit 99a1aabd11
1 changed files with 129 additions and 4 deletions
--- a/README.md
+++ b/README.md
@@ -246,7 +246,132 @@ python3 clean_memories_tr.py --execute --limit 100

 ---

-### 5. OpenClaw Compactor Configuration
+### 5. Semantic Deduplication (Similarity Checking)
+
+**Why:** Smaller models (4b) often extract duplicate or near-duplicate gems. Without checking, your `gems_tr` collection fills with redundant entries.
+
+**The Problem:**
+- "User decided on Redis" and "User selected Redis for caching" are the same gem
+- Smaller models lack nuance: they extract surface variations of the same fact as separate gems
+- Over time, 30-50% of gems may be duplicates
+
+**Solution: Semantic Similarity Check**
+
+Before inserting a new gem:
+1. Embed the candidate gem text
+2. Search `gems_tr` for similar embeddings (past 24h)
+3. If similarity > 0.85, SKIP (don't insert)
+4. If similarity 0.70-0.85, MERGE (update existing with richer context)
+5. If similarity < 0.70, INSERT (new unique gem)
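
The thresholds in steps 3-5 can be sketched as a small decision function. This is a minimal illustration, not code from `curator_timer.py`: the `Action` enum and the `dedup_action` name are hypothetical, and the score is assumed to be a similarity where higher means more alike.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    SKIP = "skip"      # near-identical gem already stored
    MERGE = "merge"    # related gem exists; enrich it
    INSERT = "insert"  # nothing similar; store as new


def dedup_action(best_score: Optional[float],
                 skip_above: float = 0.85,
                 merge_from: float = 0.70) -> Action:
    """Map the best similarity score among recent gems to an action.

    best_score is None when the search returned no recent gems.
    """
    if best_score is None or best_score < merge_from:
        return Action.INSERT
    if best_score > skip_above:
        return Action.SKIP
    # merge_from <= best_score <= skip_above
    return Action.MERGE
```

A score of exactly 0.85 falls into the merge band here, because the list above only skips when similarity is strictly greater than 0.85.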
+
+**Implementation Options:**
+
+#### Option A: Built-in Curator Check (Recommended)
+
+Modify `curator_timer.py` to add a pre-insertion similarity check:
+
+```python
+import time
+
+import requests
+from qdrant_client import QdrantClient, models
+
+qdrant = QdrantClient("http://<QDRANT_IP>:6333")
+
+def is_duplicate(gem_text: str, user_id: str = "rob", threshold: float = 0.85) -> bool:
+    """Return True if a similar gem was stored in the past 24 hours."""
+    # Embed the candidate
+    response = requests.post(
+        "http://<OLLAMA_IP>:11434/api/embeddings",
+        json={"model": "mxbai-embed-large", "prompt": gem_text},
+        timeout=30,
+    )
+    response.raise_for_status()
+    embedding = response.json()["embedding"]
+
+    # Qdrant has no "now-24h" shorthand, so compute the cutoff explicitly.
+    # Assumption: `timestamp` is stored as epoch seconds. If it is stored as
+    # an ISO 8601 string, use models.DatetimeRange instead of models.Range.
+    cutoff = time.time() - 24 * 3600
+
+    # Search for similar gems from the same user within the window
+    results = qdrant.search(
+        collection_name="gems_tr",
+        query_vector=embedding,
+        limit=3,
+        query_filter=models.Filter(
+            must=[
+                models.FieldCondition(key="user_id", match=models.MatchValue(value=user_id)),
+                models.FieldCondition(key="timestamp", range=models.Range(gte=cutoff)),
+            ]
+        ),
+    )
+
+    # Scores are comparable to the threshold only if the collection
+    # uses a similarity metric such as cosine
+    return any(result.score > threshold for result in results)
+
+# In main loop, before inserting:
+if is_duplicate(gem["gem"]):
+    log.info(f"Skipping duplicate gem: {gem['gem'][:50]}...")
+    continue
+```
+
+**Pros:** Catches duplicates at source, no extra jobs
+**Cons:** Adds ~50-100ms per gem (embedding call)
+
+#### Option B: Periodic AI Review (Subagent Task)
+
+Have a subagent periodically review and merge duplicates:
+
+```bash
+# Run weekly via cron
+0 3 * * 0 cd <PROJECT_PATH> && python3 dedup_gems.py
+```
+
+**dedup_gems.py approach:**
+1. Load all gems from past 7 days
+2. Group by semantic similarity (clustering)
+3. For each cluster containing more than one gem:
+   - Keep highest confidence gem as primary
+   - Merge context from others into primary
+   - Delete duplicates
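
One way to sketch steps 2 and 3 is a greedy single-pass grouping over embeddings. This is an illustration under assumptions, not the actual `dedup_gems.py`: the `Gem` fields, the function names, and the 0.85 default are hypothetical, and "merge context" is reduced to collecting the texts of the discarded gems.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Gem:
    id: str
    text: str
    confidence: float
    vector: Sequence[float]
    merged_from: List[str] = field(default_factory=list)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_gems(gems: List[Gem], threshold: float = 0.85) -> List[List[Gem]]:
    """Greedy grouping: a gem joins the first cluster whose seed it resembles."""
    clusters: List[List[Gem]] = []
    for gem in gems:
        for cluster in clusters:
            if cosine(gem.vector, cluster[0].vector) > threshold:
                cluster.append(gem)
                break
        else:
            # No existing cluster was similar enough, so start a new one
            clusters.append([gem])
    return clusters


def merge_cluster(cluster: List[Gem]) -> Tuple[Gem, List[str]]:
    """Keep the highest-confidence gem; return it plus the IDs to delete."""
    primary = max(cluster, key=lambda g: g.confidence)
    duplicates = [g for g in cluster if g is not primary]
    primary.merged_from.extend(g.text for g in duplicates)
    return primary, [g.id for g in duplicates]
```

Greedy grouping depends on input order and compares only against each cluster's first gem, so a real job might prefer a proper clustering algorithm or hand the merge step to a reasoning model, as the pros below suggest.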
+
+**Pros:** Can use reasoning model for nuanced merging
+**Cons:** Batch job, duplicates exist until cleanup runs
+
+#### Option C: Real-time Watcher Hook
+
+Add a similarity check to the real-time watcher, before memories are upserted to `memories_tr`:
+
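+```python
+# In watcher, before upsert to memories_tr
+# find_similar_recent is a hypothetical helper: it returns the ID of a
+# similar memory stored within the window, or None if there is none
+similar_id = find_similar_recent(memory_text, window="1h")
+if similar_id is not None:
+    memory["duplicate_of"] = similar_id  # Tag but still store
+```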
+
+**Pros:** Catches duplicate memories upstream, before gems are extracted
+**Cons:** Memories may differ slightly even when the resulting gems would be the same
+
+**Recommendation by Model:**
+
+| Model / Setup | Recommended Approach | Reason |
+|-------|---------------------|--------|
+| **4b** | **Option A + B** | Built-in check prevents duplicates; periodic review catches edge cases |
+| **30b** | **Option B only** | 30b produces fewer duplicates; weekly review sufficient |
+| **Production** | **Option A** | Best balance of prevention and performance |
+
+**Configuration:**
+
+Add to `curator_config.json` (`mode` accepts `"skip"`, `"merge"`, or `"flag"`):
+
+```json
+{
+  "deduplication": {
+    "enabled": true,
+    "similarity_threshold": 0.85,
+    "lookback_hours": 24,
+    "mode": "skip"
+  }
+}
+```
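
A minimal sketch of how the curator might read these settings, assuming the keys shown above. `load_dedup_config`, `DEDUP_DEFAULTS`, and the fallback behavior are hypothetical; the real `curator_timer.py` may load its configuration differently.

```python
import json
from pathlib import Path

# Fallback values mirror the example configuration
DEDUP_DEFAULTS = {
    "enabled": True,
    "similarity_threshold": 0.85,
    "lookback_hours": 24,
    "mode": "skip",
}
VALID_MODES = {"skip", "merge", "flag"}


def load_dedup_config(path: str = "curator_config.json") -> dict:
    """Read the "deduplication" block, filling missing keys with defaults."""
    config_file = Path(path)
    raw = json.loads(config_file.read_text()) if config_file.exists() else {}
    settings = {**DEDUP_DEFAULTS, **raw.get("deduplication", {})}
    if settings["mode"] not in VALID_MODES:
        raise ValueError(f"Unknown deduplication mode: {settings['mode']!r}")
    return settings
```

Python's `json` module rejects `//` comments, so the file has to be plain JSON for a loader like this to work.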
+
+---
+
+### 6. OpenClaw Compactor Configuration

 **Status:** ✅ Applied

@@ -278,7 +403,7 @@ python3 clean_memories_tr.py --execute --limit 100

 ---

-### 6. Configuration Options Reference
+### 7. Configuration Options Reference

 **All configurable options with defaults:**

@@ -308,7 +433,7 @@ python3 clean_memories_tr.py --execute --limit 100

 ---

-### 7. Embedding Models
+### 8. Embedding Models

 **Current Setup:**
 - `memories_tr`: `snowflake-arctic-embed2` (capture similarity)
@@ -322,7 +447,7 @@ python3 clean_memories_tr.py --execute --limit 100

 ---

-### 6. memory-qdrant Plugin
+### 9. memory-qdrant Plugin

 **Location:** `~/.openclaw/extensions/memory-qdrant/`