dify/api/core/rag/docstore
Frederick2313072 626e71cb3b feat: implement content-based deduplication for document segments
- Add database index on (dataset_id, index_node_hash) for efficient deduplication queries
- Add deduplication check in SegmentService.create_segment and multi_create_segment methods
- Add deduplication check in DatasetDocumentStore.add_documents method to prevent duplicate embedding processing
- Skip creating segments with identical content hashes across the entire dataset

This prevents duplicate content from being re-processed and re-embedded when uploading documents with repeated content, improving efficiency and reducing unnecessary compute costs.
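
The deduplication described above can be illustrated with a minimal, self-contained sketch. It is not the code from this commit: `content_hash`, `InMemorySegmentStore`, and `add_segments` are hypothetical names, SHA-256 is an assumed hash function, and an in-memory set stands in for a lookup against the (dataset_id, index_node_hash) database index.

```python
import hashlib
from dataclasses import dataclass, field


def content_hash(text: str) -> str:
    """Hypothetical stand-in for the project's hash helper: SHA-256 of the UTF-8 text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class InMemorySegmentStore:
    """Toy stand-in for the segments table (assumption, not the real model)."""

    # Mirrors the role of the (dataset_id, index_node_hash) index: fast membership checks.
    _seen: set = field(default_factory=set)
    segments: list = field(default_factory=list)

    def add_segments(self, dataset_id: str, texts: list) -> list:
        """Store only texts whose hash is new for this dataset; return the texts stored."""
        added = []
        for text in texts:
            key = (dataset_id, content_hash(text))
            if key in self._seen:
                # Identical content already exists in this dataset: skip it
                # before any embedding work would be done.
                continue
            self._seen.add(key)
            self.segments.append((dataset_id, text))
            added.append(text)
        return added


if __name__ == "__main__":
    store = InMemorySegmentStore()
    print(store.add_segments("ds-1", ["alpha", "beta", "alpha"]))
    print(store.add_segments("ds-1", ["beta", "gamma"]))
    print(store.add_segments("ds-2", ["alpha"]))
```

Because the key includes the dataset ID, the same text is skipped only within one dataset; a different dataset can still store it. In a real database the set lookup would be a query served by the composite index.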
2025-09-20 06:28:14 +08:00
__init__.py Feat/delete single dataset retrival (#6570) 2024-07-24 12:50:11 +08:00
dataset_docstore.py feat: implement content-based deduplication for document segments 2025-09-20 06:28:14 +08:00