dify/docstore at feat/chunk-deduplication - dify

mirror of https://github.com/langgenius/dify.git synced 2026-06-21 01:41:08 +08:00

Frederick2313072 626e71cb3b feat: implement content-based deduplication for document segments

- Add database index on (dataset_id, index_node_hash) for efficient deduplication queries
- Add deduplication check in SegmentService.create_segment and multi_create_segment methods
- Add deduplication check in DatasetDocumentStore.add_documents method to prevent duplicate embedding processing
- Skip creating segments with identical content hashes across the entire dataset

This prevents duplicate content from being re-processed and re-embedded when uploading documents with repeated content, improving efficiency and reducing unnecessary compute costs.

2025-09-20 06:28:14 +08:00

__init__.py	Feat/delete single dataset retrival (#6570 )	2024-07-24 12:50:11 +08:00
dataset_docstore.py	feat: implement content-based deduplication for document segments	2025-09-20 06:28:14 +08:00
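
The deduplication described in the commit message can be illustrated with a minimal sketch. This is not the code in dataset_docstore.py: the table layout, the helper names (content_hash, open_store, add_segments), and the choice of SHA-256 are assumptions made for illustration. Only the column names dataset_id and index_node_hash come from the commit message. The sketch uses SQLite from the Python standard library to show a composite index on (dataset_id, index_node_hash) and a lookup that skips segments whose content hash already exists in the same dataset.

```python
import hashlib
import sqlite3


def content_hash(text: str) -> str:
    """Hypothetical stand-in for the value stored in index_node_hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def open_store() -> sqlite3.Connection:
    """Create an in-memory table that loosely mimics a segments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE segments ("
        "id INTEGER PRIMARY KEY, "
        "dataset_id TEXT NOT NULL, "
        "index_node_hash TEXT NOT NULL, "
        "content TEXT NOT NULL)"
    )
    # Composite index so the duplicate lookup below does not scan the table.
    conn.execute(
        "CREATE INDEX idx_segments_dataset_hash "
        "ON segments (dataset_id, index_node_hash)"
    )
    return conn


def add_segments(conn: sqlite3.Connection, dataset_id: str, texts: list[str]) -> list[str]:
    """Insert only texts whose hash is new to the dataset; return the inserted texts."""
    inserted = []
    for text in texts:
        digest = content_hash(text)
        exists = conn.execute(
            "SELECT 1 FROM segments "
            "WHERE dataset_id = ? AND index_node_hash = ? LIMIT 1",
            (dataset_id, digest),
        ).fetchone()
        if exists:
            # Duplicate content in this dataset: skip it before any
            # (expensive) embedding step would run.
            continue
        conn.execute(
            "INSERT INTO segments (dataset_id, index_node_hash, content) "
            "VALUES (?, ?, ?)",
            (dataset_id, digest, text),
        )
        inserted.append(text)
    conn.commit()
    return inserted
```

In this sketch the check is scoped to a single dataset, so the same text can still be stored once in each of two different datasets. Duplicates inside one batch are also skipped, because each row is inserted before the next lookup runs.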
