docs: Add Semantic Deduplication section with similarity checking

- Why smaller models need deduplication (4b vs 30b)
- Three implementation options (built-in, periodic AI, watcher hook)
- Code example for pre-insertion similarity check
- Configuration options for deduplication settings
- Recommendations by model size
- Fixed section numbering
root
2026-02-24 21:15:04 -06:00
parent 05acb4f22b
commit 99a1aabd11

133
README.md

@@ -246,7 +246,132 @@ python3 clean_memories_tr.py --execute --limit 100
---
### 5. OpenClaw Compactor Configuration
### 5. Semantic Deduplication (Similarity Checking)
**Why:** Smaller models (4b) often extract duplicate or near-duplicate gems. Without checking, your `gems_tr` collection fills with redundant entries.
**The Problem:**
- "User decided on Redis" and "User selected Redis for caching" are the same gem
- Smaller models lack nuance — they extract surface variations as separate gems
- Over time, 30-50% of gems may be duplicates
**Solution: Semantic Similarity Check**
Before inserting a new gem:
1. Embed the candidate gem text
2. Search `gems_tr` for similar embeddings (past 24h)
3. If similarity > 0.85, SKIP (don't insert)
4. If similarity 0.70-0.85, MERGE (update existing with richer context)
5. If similarity < 0.70, INSERT (new unique gem)
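The three thresholds above can be sketched as a small decision function. This is only an illustration: the name `dedup_action` and the two constants are hypothetical, and treating a score of exactly 0.85 as a merge is an assumption, since the steps above do not say which side the boundary falls on.

```python
from typing import Literal, Optional

SKIP_THRESHOLD = 0.85   # above this, treat the candidate as a duplicate
MERGE_THRESHOLD = 0.70  # from here up to SKIP_THRESHOLD, merge into the existing gem


def dedup_action(best_score: Optional[float]) -> Literal["skip", "merge", "insert"]:
    """Map the best similarity score found in gems_tr to an action.

    best_score is None when the search returned no gems in the lookback window.
    """
    if best_score is None:
        return "insert"
    if best_score > SKIP_THRESHOLD:
        return "skip"
    if best_score >= MERGE_THRESHOLD:
        return "merge"
    return "insert"
```

Keeping the decision separate from the embedding and search calls makes the thresholds easy to tune and to test without a running Qdrant instance.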
**Implementation Options:**
#### Option A: Built-in Curator Check (Recommended)
Modify `curator_timer.py` to add pre-insertion similarity check:
```python
import time

import requests
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="http://<QDRANT_IP>:6333")


def is_duplicate(gem_text: str, user_id: str = "rob", threshold: float = 0.85) -> bool:
    """Check if a similar gem exists in the past 24h."""
    # Embed the candidate
    response = requests.post(
        "http://<OLLAMA_IP>:11434/api/embeddings",
        json={"model": "mxbai-embed-large", "prompt": gem_text},
        timeout=30,
    )
    response.raise_for_status()
    embedding = response.json()["embedding"]

    # Qdrant has no "now-24h" syntax, so compute the cutoff explicitly.
    # Assumption: `timestamp` is stored as Unix epoch seconds. If it is stored
    # as an ISO 8601 string, use models.DatetimeRange with a datetime instead.
    cutoff = time.time() - 24 * 3600

    # Search for similar gems
    results = qdrant.search(
        collection_name="gems_tr",
        query_vector=embedding,
        limit=3,
        query_filter=models.Filter(
            must=[
                models.FieldCondition(key="user_id", match=models.MatchValue(value=user_id)),
                models.FieldCondition(key="timestamp", range=models.Range(gte=cutoff)),
            ]
        ),
    )

    # Any hit above the threshold counts as a duplicate
    return any(result.score > threshold for result in results)


# In main loop, before inserting:
if is_duplicate(gem["gem"]):
    log.info(f"Skipping duplicate gem: {gem['gem'][:50]}...")
    continue
```
**Pros:** Catches duplicates at source, no extra jobs
**Cons:** Adds ~50-100ms per gem (embedding call)
#### Option B: Periodic AI Review (Subagent Task)
Have a subagent periodically review and merge duplicates:
```bash
# Run weekly via cron
0 3 * * 0 cd <PROJECT_PATH> && python3 dedup_gems.py
```
**dedup_gems.py approach:**
1. Load all gems from past 7 days
2. Group by semantic similarity (clustering)
3. For each cluster with more than one gem:
   - Keep highest confidence gem as primary
   - Merge context from others into primary
   - Delete duplicates
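A minimal sketch of the grouping step, assuming each gem has already been loaded as a dict with `id`, `confidence`, and `embedding` keys (these field names are hypothetical, not taken from the `gems_tr` schema). It uses simple greedy clustering rather than a full clustering algorithm, and it only plans the merges; merging context and deleting from Qdrant are left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_gems(gems, threshold=0.85):
    """Greedy clustering: each gem joins the first cluster whose primary it resembles.

    Gems are visited in descending confidence, so the first gem in every
    cluster is the highest-confidence one and serves as the primary.
    """
    clusters = []
    for gem in sorted(gems, key=lambda g: g["confidence"], reverse=True):
        for cluster in clusters:
            if cosine(gem["embedding"], cluster[0]["embedding"]) > threshold:
                cluster.append(gem)
                break
        else:
            clusters.append([gem])  # no similar primary found, start a new cluster
    return clusters


def plan_merges(clusters):
    """Return (primary_id, [duplicate_ids]) for every cluster with more than one gem."""
    return [(c[0]["id"], [g["id"] for g in c[1:]]) for c in clusters if len(c) > 1]
```

Greedy clustering is order-dependent and compares only against each cluster's primary, which keeps the pass cheap for a week of gems; a reasoning model can then review each planned merge before anything is deleted.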
**Pros:** Can use reasoning model for nuanced merging
**Cons:** Batch job, duplicates exist until cleanup runs
#### Option C: Real-time Watcher Hook
Add deduplication to the real-time watcher before memories are even stored:
```python
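# In watcher, before upsert to memories_tr.
# is_similar_to_recent() is a helper you would need to write; it is assumed
# here to return the ID of a matching recent memory, or None.
similar_id = is_similar_to_recent(memory_text, window="1h")
if similar_id is not None:
    memory["duplicate_of"] = similar_id  # Tag but still store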
```
**Pros:** Flags duplicate memories upstream, before gems are extracted
**Cons:** Memories may differ slightly even if gems would be same
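One possible shape for the recent-memory check is an in-process sliding window of embeddings. The sketch below is hypothetical: the class `RecentMemoryWindow`, its method names, and the 0.85 threshold (borrowed from the gem check) are all assumptions, and it expects the watcher to supply an embedding for each memory.

```python
import math
import time
from collections import deque


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RecentMemoryWindow:
    """Holds embeddings of recently stored memories for a fixed time window."""

    def __init__(self, window_seconds=3600, threshold=0.85):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._items = deque()  # (stored_at, memory_id, embedding), oldest first

    def add(self, memory_id, embedding, now=None):
        """Record a memory that was just stored."""
        stored_at = time.time() if now is None else now
        self._items.append((stored_at, memory_id, embedding))

    def find_similar(self, embedding, now=None):
        """Return the ID of the most similar recent memory above the threshold, else None."""
        now = time.time() if now is None else now
        # Drop entries that have aged out of the window
        while self._items and now - self._items[0][0] > self.window_seconds:
            self._items.popleft()
        best_id, best_score = None, self.threshold
        for _, memory_id, other in self._items:
            score = _cosine(embedding, other)
            if score > best_score:
                best_id, best_score = memory_id, score
        return best_id
```

An in-memory window avoids an extra Qdrant query per message, at the cost of losing its state whenever the watcher restarts.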
**Recommendation by Model:**
| Model | Recommended Approach | Reason |
|-------|---------------------|--------|
| **4b** | **Option A + B** | Built-in check prevents duplicates; periodic review catches edge cases |
| **30b** | **Option B only** | 30b produces fewer duplicates; weekly review sufficient |
| **Production** | **Option A** | Best balance of prevention and performance |
**Configuration:**
Add to `curator_config.json`:
```json
{
  "deduplication": {
    "enabled": true,
    "similarity_threshold": 0.85,
    "lookback_hours": 24,
    "mode": "skip"
  }
}
```
`mode` accepts `"skip"`, `"merge"`, or `"flag"`.
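Reading these settings in the curator might look like the sketch below. The function `load_dedup_config` is hypothetical, and using the values shown above as fallbacks when a key is missing is an assumption rather than documented behavior.

```python
import json

# Assumed fallbacks, mirroring the example values above
DEDUP_DEFAULTS = {
    "enabled": True,
    "similarity_threshold": 0.85,
    "lookback_hours": 24,
    "mode": "skip",
}
VALID_MODES = {"skip", "merge", "flag"}


def load_dedup_config(raw_json: str) -> dict:
    """Parse the config text and return validated deduplication settings."""
    section = json.loads(raw_json).get("deduplication", {})
    settings = {**DEDUP_DEFAULTS, **section}
    if settings["mode"] not in VALID_MODES:
        raise ValueError(f"Unknown deduplication mode: {settings['mode']!r}")
    if not 0.0 < settings["similarity_threshold"] <= 1.0:
        raise ValueError("similarity_threshold must be in (0, 1]")
    return settings
```

For example, `load_dedup_config(open("curator_config.json").read())` would return the merged settings, failing early on a mistyped mode instead of silently skipping deduplication.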
---
### 6. OpenClaw Compactor Configuration
**Status:** ✅ Applied
@@ -278,7 +403,7 @@ python3 clean_memories_tr.py --execute --limit 100
---
### 6. Configuration Options Reference
### 7. Configuration Options Reference
**All configurable options with defaults:**
@@ -308,7 +433,7 @@ python3 clean_memories_tr.py --execute --limit 100
---
### 7. Embedding Models
### 8. Embedding Models
**Current Setup:**
- `memories_tr`: `snowflake-arctic-embed2` (capture similarity)
@@ -322,7 +447,7 @@ python3 clean_memories_tr.py --execute --limit 100
---
### 6. memory-qdrant Plugin
### 9. memory-qdrant Plugin
**Location:** `~/.openclaw/extensions/memory-qdrant/`