dify/rag at feat/chunk-deduplication

mirror of https://github.com/langgenius/dify.git synced 2026-06-21 01:41:08 +08:00

Frederick2313072 626e71cb3b feat: implement content-based deduplication for document segments

- Add database index on (dataset_id, index_node_hash) for efficient deduplication queries
- Add deduplication check in SegmentService.create_segment and multi_create_segment methods
- Add deduplication check in DatasetDocumentStore.add_documents method to prevent duplicate embedding processing
- Skip creating segments with identical content hashes across the entire dataset

This prevents duplicate content from being re-processed and re-embedded when uploading documents with repeated content, improving efficiency and reducing unnecessary compute costs.

2025-09-20 06:28:14 +08:00
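The deduplication idea described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `InMemorySegmentStore`, `add_segments`, and `content_hash` are hypothetical names, the in-memory set only stands in for the (dataset_id, index_node_hash) database index, and the use of SHA-256 is an assumption about how the hash is computed.

```python
import hashlib
from dataclasses import dataclass, field


def content_hash(text: str) -> str:
    """Hash the chunk text. SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class InMemorySegmentStore:
    """Hypothetical stand-in for a segment table with a (dataset_id, hash) index."""

    # Plays the role of the database index: one entry per (dataset_id, hash).
    _seen: set = field(default_factory=set)
    # Stored segments as (dataset_id, text) pairs.
    segments: list = field(default_factory=list)

    def add_segments(self, dataset_id: str, texts: list) -> list:
        """Store only texts whose hash is new within the dataset.

        Returns the texts that were actually created, so the caller can
        embed just those and skip the duplicates.
        """
        created = []
        for text in texts:
            key = (dataset_id, content_hash(text))
            if key in self._seen:
                # Identical content already exists in this dataset: skip it.
                continue
            self._seen.add(key)
            self.segments.append((dataset_id, text))
            created.append(text)
        return created


store = InMemorySegmentStore()
first = store.add_segments("ds-1", ["alpha", "beta", "alpha"])
second = store.add_segments("ds-1", ["beta", "gamma"])
other = store.add_segments("ds-2", ["alpha"])
```

In this sketch the repeated "alpha" in the first call and the repeated "beta" in the second call are skipped, while "alpha" in a different dataset is still stored, because the lookup key includes the dataset ID.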
cleaner	fix: drop dead code phase2 unused class (#22042)	2025-07-17 09:33:07 +08:00
data_post_processor	chore: add ast-grep rule to convert Optional[T] to T | None (#25560)	2025-09-15 13:06:33 +08:00
datasource	refactor: replace print statements with proper logging (#25773)	2025-09-18 20:35:47 +08:00
docstore	feat: implement content-based deduplication for document segments	2025-09-20 06:28:14 +08:00
embedding	chore: add ast-grep rule to convert Optional[T] to T | None (#25560)	2025-09-15 13:06:33 +08:00
entities	feat: knowledge pipeline (#25360)	2025-09-18 12:49:10 +08:00
extractor	feat: knowledge pipeline (#25360)	2025-09-18 12:49:10 +08:00
index_processor	feat: knowledge pipeline (#25360)	2025-09-18 12:49:10 +08:00
models	feat: knowledge pipeline (#25360)	2025-09-18 12:49:10 +08:00
rerank	chore: add ast-grep rule to convert Optional[T] to T | None (#25560)	2025-09-15 13:06:33 +08:00
retrieval	feat: knowledge pipeline (#25360)	2025-09-18 12:49:10 +08:00
splitter	chore: add ast-grep rule to convert Optional[T] to T | None (#25560)	2025-09-15 13:06:33 +08:00
__init__.py	Feat/dify rag (#2528)	2024-02-22 23:31:57 +08:00