docs: Add Semantic Deduplication section with similarity checking
- Why smaller models need deduplication (4b vs 30b)
- Three implementation options (built-in, periodic AI, watcher hook)
- Code example for pre-insertion similarity check
- Configuration options for deduplication settings
- Recommendations by model size
- Fixed section numbering
133
README.md
@@ -246,7 +246,132 @@ python3 clean_memories_tr.py --execute --limit 100

---

### 5. OpenClaw Compactor Configuration
### 5. Semantic Deduplication (Similarity Checking)

**Why:** Smaller models (4b) often extract duplicate or near-duplicate gems. Without checking, your `gems_tr` collection fills with redundant entries.

**The Problem:**
- "User decided on Redis" and "User selected Redis for caching" are the same gem
- Smaller models lack nuance: they extract surface variations as separate gems
- Over time, 30-50% of gems may be duplicates

**Solution: Semantic Similarity Check**

Before inserting a new gem:
1. Embed the candidate gem text
2. Search `gems_tr` for similar embeddings (past 24h)
3. If similarity > 0.85, SKIP (don't insert)
4. If similarity 0.70-0.85, MERGE (update existing with richer context)
5. If similarity < 0.70, INSERT (new unique gem)

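The skip/merge/insert steps above can be sketched as a small pure function. This is only an illustration under stated assumptions, not the curator's actual code: `cosine_similarity`, `classify_gem`, and `decide` are hypothetical names, the score is assumed to be cosine similarity, and a score of exactly 0.85 is assumed to fall into the MERGE band.

```python
import math
from typing import Sequence

SKIP_THRESHOLD = 0.85   # above this: treat the candidate as a duplicate
MERGE_THRESHOLD = 0.70  # from here up to SKIP_THRESHOLD: merge into the match


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_gem(best_score: float) -> str:
    """Map the best similarity score to 'skip', 'merge', or 'insert'."""
    if best_score > SKIP_THRESHOLD:
        return "skip"
    if best_score >= MERGE_THRESHOLD:
        return "merge"
    return "insert"


def decide(candidate: Sequence[float], existing: Sequence[Sequence[float]]) -> str:
    """Classify a candidate vector against the vectors of recent gems."""
    # With no recent gems, the best score defaults to 0.0, which means insert.
    best = max((cosine_similarity(candidate, v) for v in existing), default=0.0)
    return classify_gem(best)
```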
**Implementation Options:**

#### Option A: Built-in Curator Check (Recommended)

Modify `curator_timer.py` to add pre-insertion similarity check:

```python
from datetime import datetime, timedelta, timezone

import requests
from qdrant_client import QdrantClient, models

qdrant = QdrantClient("http://<QDRANT_IP>:6333")


def is_duplicate(gem_text: str, user_id: str = "rob", threshold: float = 0.85) -> bool:
    """Return True if a similar gem was stored in the past 24h."""
    # Embed the candidate
    response = requests.post(
        "http://<OLLAMA_IP>:11434/api/embeddings",
        json={"model": "mxbai-embed-large", "prompt": gem_text},
        timeout=30,
    )
    response.raise_for_status()
    embedding = response.json()["embedding"]

    # Qdrant has no "now-24h" shorthand, so compute the cutoff explicitly.
    # Assumes `timestamp` is stored as an RFC 3339 datetime string; use
    # models.Range with an epoch value if it is stored as a number.
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

    # Search for similar gems
    results = qdrant.search(
        collection_name="gems_tr",
        query_vector=embedding,
        limit=3,
        query_filter=models.Filter(
            must=[
                models.FieldCondition(key="user_id", match=models.MatchValue(value=user_id)),
                models.FieldCondition(key="timestamp", range=models.DatetimeRange(gte=cutoff)),
            ]
        ),
    )

    # Check similarity scores: any hit above the threshold is a duplicate
    return any(result.score > threshold for result in results)


# In the main loop, before inserting (`gems` and `log` stand for the
# curator's existing loop variable and logger):
for gem in gems:
    if is_duplicate(gem["gem"]):
        log.info(f"Skipping duplicate gem: {gem['gem'][:50]}...")
        continue
    # ... existing insert logic ...
```

**Pros:** Catches duplicates at source, no extra jobs
**Cons:** Adds ~50-100ms per gem (embedding call)

#### Option B: Periodic AI Review (Subagent Task)

Have a subagent periodically review and merge duplicates:

```bash
# Run weekly via cron
0 3 * * 0 cd <PROJECT_PATH> && python3 dedup_gems.py
```

**dedup_gems.py approach:**
1. Load all gems from past 7 days
2. Group by semantic similarity (clustering)
3. For each cluster with more than one gem:
   - Keep highest confidence gem as primary
   - Merge context from others into primary
   - Delete duplicates

**Pros:** Can use reasoning model for nuanced merging
**Cons:** Batch job, duplicates exist until cleanup runs

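The clustering and keep-the-primary steps can be sketched in plain Python. This is one possible approach rather than the contents of `dedup_gems.py`: it uses a simple greedy grouping in place of a real clustering library, and the `id` and `confidence` fields, the function names, and the 0.85 threshold are all assumptions made for illustration. Merging context and deleting from Qdrant are left out.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def cluster_by_similarity(vectors, threshold=0.85):
    """Greedily group indices: each unassigned vector seeds a cluster and
    absorbs every later unassigned vector scoring above the threshold."""
    clusters, assigned = [], set()
    for i, seed in enumerate(vectors):
        if i in assigned:
            continue
        members = [i] + [
            j for j in range(i + 1, len(vectors))
            if j not in assigned and _cosine(seed, vectors[j]) > threshold
        ]
        assigned.update(members)
        clusters.append(members)
    return clusters


def plan_dedup(gems, vectors, threshold=0.85):
    """Return (primary_id, duplicate_ids) for each cluster with more than one gem."""
    plan = []
    for members in cluster_by_similarity(vectors, threshold):
        if len(members) < 2:
            continue
        # Highest-confidence gem becomes the primary; the rest are duplicates.
        ranked = sorted(members, key=lambda i: gems[i]["confidence"], reverse=True)
        plan.append((gems[ranked[0]]["id"], [gems[i]["id"] for i in ranked[1:]]))
    return plan
```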
#### Option C: Real-time Watcher Hook

Add deduplication to the real-time watcher so near-duplicate memories are tagged as they are stored:

```python
# In watcher, before upsert to memories_tr.
# is_similar_to_recent is a helper you would need to write; it is assumed
# here to return the ID of a matching recent memory, or None.
similar_id = is_similar_to_recent(memory_text, window="1h")
if similar_id is not None:
    memory["duplicate_of"] = similar_id  # Tag but still store
```

**Pros:** Flags duplicate memories upstream
**Cons:** Memories may differ slightly even if gems would be same

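One way to back a watcher-side check is a small in-memory index of recently seen memories. The sketch below is hypothetical throughout: `RecentMemoryIndex` is not part of the watcher, it works on vectors that have already been embedded, it assumes entries are added in time order, and the one-hour window and 0.85 threshold are placeholder values.

```python
import math
import time
from collections import deque


def _cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


class RecentMemoryIndex:
    """Keeps (timestamp, id, vector) for memories seen within a time window."""

    def __init__(self, window_seconds=3600.0, threshold=0.85):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._items = deque()  # oldest entry on the left

    def add(self, memory_id, vector, now=None):
        """Record a stored memory; assumes calls arrive in time order."""
        now = time.time() if now is None else now
        self._items.append((now, memory_id, vector))

    def find_similar(self, vector, now=None):
        """Return the ID of the most similar recent memory above the
        threshold, or None if there is no such memory."""
        now = time.time() if now is None else now
        # Drop entries that have aged out of the window.
        while self._items and now - self._items[0][0] > self.window_seconds:
            self._items.popleft()
        best_id, best_score = None, self.threshold
        for _, memory_id, other in self._items:
            score = _cosine(vector, other)
            if score > best_score:
                best_id, best_score = memory_id, score
        return best_id
```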
**Recommendation by Model:**

| Model | Recommended Approach | Reason |
|-------|---------------------|--------|
| **4b** | **Option A + B** | Built-in check prevents duplicates; periodic review catches edge cases |
| **30b** | **Option B only** | 30b produces fewer duplicates; weekly review sufficient |
| **Production** | **Option A** | Best balance of prevention and performance |

**Configuration:**

Add to `curator_config.json`:

```json
{
  "deduplication": {
    "enabled": true,
    "similarity_threshold": 0.85,
    "lookback_hours": 24,
    "mode": "skip"
  }
}
```

Valid values for `mode` are `"skip"`, `"merge"`, and `"flag"`.

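Reading these settings might look like the sketch below. It is a sketch under assumptions rather than the curator's loader: `load_dedup_config` is a hypothetical name, and the values from the example above are assumed to double as defaults. Note that Python's `json` module rejects `//` comments, so the file has to be plain JSON.

```python
import json

# Assumed defaults, taken from the example configuration.
DEFAULTS = {
    "enabled": True,
    "similarity_threshold": 0.85,
    "lookback_hours": 24,
    "mode": "skip",
}
VALID_MODES = {"skip", "merge", "flag"}


def load_dedup_config(raw_json: str) -> dict:
    """Parse config text and return deduplication settings with defaults applied."""
    overrides = json.loads(raw_json).get("deduplication", {})
    settings = {**DEFAULTS, **overrides}
    if settings["mode"] not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if not 0.0 <= settings["similarity_threshold"] <= 1.0:
        raise ValueError("similarity_threshold must be between 0 and 1")
    return settings
```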

---

### 6. OpenClaw Compactor Configuration

**Status:** ✅ Applied

@@ -278,7 +403,7 @@ python3 clean_memories_tr.py --execute --limit 100

---

### 6. Configuration Options Reference
### 7. Configuration Options Reference

**All configurable options with defaults:**

@@ -308,7 +433,7 @@ python3 clean_memories_tr.py --execute --limit 100

---

### 7. Embedding Models
### 8. Embedding Models

**Current Setup:**
- `memories_tr`: `snowflake-arctic-embed2` (capture similarity)
@@ -322,7 +447,7 @@ python3 clean_memories_tr.py --execute --limit 100

---

### 6. memory-qdrant Plugin
### 9. memory-qdrant Plugin

**Location:** `~/.openclaw/extensions/memory-qdrant/`