dify/api/core
Frederick2313072 626e71cb3b feat: implement content-based deduplication for document segments
- Add database index on (dataset_id, index_node_hash) for efficient deduplication queries
- Add deduplication check in SegmentService.create_segment and multi_create_segment methods
- Add deduplication check in DatasetDocumentStore.add_documents method to prevent duplicate embedding processing
- Skip creating segments with identical content hashes across the entire dataset

This prevents duplicate content from being re-processed and re-embedded when uploading documents with repeated content, improving efficiency and reducing unnecessary compute costs.
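
The deduplication described in this commit message can be illustrated with a minimal sketch. It is not the code from this commit: `content_hash`, `InMemorySegmentStore`, and `add_segments` are hypothetical names, SHA-256 is an assumed hash function, and an in-memory set stands in for the database index on (dataset_id, index_node_hash).

```python
import hashlib
from dataclasses import dataclass, field


def content_hash(text: str) -> str:
    """Hypothetical stand-in for the segment hash; the real helper may differ."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class InMemorySegmentStore:
    """Toy replacement for the segments table (an assumption for illustration).

    The `seen` set plays the role of the (dataset_id, index_node_hash) index:
    a membership check replaces a scan over every stored segment.
    """

    seen: set = field(default_factory=set)
    segments: list = field(default_factory=list)

    def add_segments(self, dataset_id: str, texts: list) -> list:
        """Store only texts whose hash is new for this dataset; return those created."""
        created = []
        for text in texts:
            key = (dataset_id, content_hash(text))
            if key in self.seen:
                # Duplicate content in this dataset: skip it before any
                # embedding work would be done.
                continue
            self.seen.add(key)
            segment = {
                "dataset_id": dataset_id,
                "index_node_hash": key[1],
                "content": text,
            }
            self.segments.append(segment)
            created.append(segment)
        return created


store = InMemorySegmentStore()
first = store.add_segments("ds-1", ["alpha", "beta", "alpha"])
second = store.add_segments("ds-1", ["beta", "gamma"])
other = store.add_segments("ds-2", ["alpha"])
```

In this sketch the scope of deduplication is a single dataset, matching the "across the entire dataset" wording above: the same text uploaded to a different dataset is still stored, because the key includes `dataset_id`.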
2025-09-20 06:28:14 +08:00
agent feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
app chore: Update the value of sys.dialogue_count to start from 1. (#25905) 2025-09-18 15:52:52 +08:00
base feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
callback_handler feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
datasource feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
entities feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
errors chore: add ast-grep rule to convert Optional[T] to T | None (#25560) 2025-09-15 13:06:33 +08:00
extension chore: add ast-grep rule to convert Optional[T] to T | None (#25560) 2025-09-15 13:06:33 +08:00
external_data_tool chore: add ast-grep rule to convert Optional[T] to T | None (#25560) 2025-09-15 13:06:33 +08:00
file feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
helper feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
llm_generator feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
mcp feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
memory chore: add ast-grep rule to convert Optional[T] to T | None (#25560) 2025-09-15 13:06:33 +08:00
model_runtime feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
moderation chore: add ast-grep rule to convert Optional[T] to T | None (#25560) 2025-09-15 13:06:33 +08:00
ops refactor: replace print statements with proper logging (#25773) 2025-09-18 20:35:47 +08:00
plugin feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
prompt chore: add ast-grep rule to convert Optional[T] to T | None (#25560) 2025-09-15 13:06:33 +08:00
rag feat: implement content-based deduplication for document segments 2025-09-20 06:28:14 +08:00
repositories Chore: correct inconsistent logging and typo (#25945) 2025-09-19 10:36:16 +08:00
schemas Fix: replace stdout prints with debug logging (#25931) 2025-09-18 21:03:20 +08:00
tools fix: ensure original response are maintained by yielding text messages in ApiTool (#23456) (#25973) 2025-09-19 18:27:33 +08:00
variables feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
workflow Chore: correct inconsistent logging and typo (#25945) 2025-09-19 10:36:16 +08:00
__init__.py Fix basedpyright type errors (#25435) 2025-09-10 01:54:26 +08:00
hosting_configuration.py chore: add ast-grep rule to convert Optional[T] to T | None (#25560) 2025-09-15 13:06:33 +08:00
indexing_runner.py feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00
model_manager.py chore: add ast-grep rule to convert Optional[T] to T | None (#25560) 2025-09-15 13:06:33 +08:00
provider_manager.py feat: knowledge pipeline (#25360) 2025-09-18 12:49:10 +08:00