mirror of https://github.com/langgenius/dify.git
- Add database index on (dataset_id, index_node_hash) for efficient deduplication queries
- Add deduplication check in SegmentService.create_segment and multi_create_segment methods
- Add deduplication check in DatasetDocumentStore.add_documents method to prevent duplicate embedding processing
- Skip creating segments with identical content hashes across the entire dataset

This prevents duplicate content from being re-processed and re-embedded when uploading documents with repeated content, improving efficiency and reducing unnecessary compute costs.

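
The deduplication described in this commit message can be illustrated with a small standalone sketch. It is not the Dify implementation: `content_hash`, `InMemorySegmentStore`, and `add_segments` are hypothetical names, SHA-256 is an assumed hash function, and an in-memory dictionary keyed by `(dataset_id, index_node_hash)` stands in for the indexed database table.

```python
import hashlib
from dataclasses import dataclass, field


def content_hash(text: str) -> str:
    """Hash segment content (SHA-256 is an assumption, not confirmed by the source)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class InMemorySegmentStore:
    """Hypothetical stand-in for a segment table indexed on (dataset_id, index_node_hash)."""

    # Maps (dataset_id, index_node_hash) -> segment content.
    _by_key: dict[tuple[str, str], str] = field(default_factory=dict)

    def add_segments(self, dataset_id: str, contents: list[str]) -> list[str]:
        """Store only segments whose hash is new to this dataset; return those created."""
        created = []
        for text in contents:
            key = (dataset_id, content_hash(text))
            if key in self._by_key:
                # Identical content already exists in this dataset: skip it,
                # so it is never re-processed or re-embedded.
                continue
            self._by_key[key] = text
            created.append(text)
        return created


store = InMemorySegmentStore()
first = store.add_segments("dataset-1", ["alpha", "beta", "alpha"])
second = store.add_segments("dataset-1", ["beta", "gamma"])
other = store.add_segments("dataset-2", ["alpha"])
```

In this sketch the lookup is scoped to one dataset, so the same content can still be stored in a different dataset. In a real database the composite index would serve the same role as the dictionary key, letting the existence check avoid a full scan.
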
||
|---|---|---|
| cleaner | ||
| data_post_processor | ||
| datasource | ||
| docstore | ||
| embedding | ||
| entities | ||
| extractor | ||
| index_processor | ||
| models | ||
| rerank | ||
| retrieval | ||
| splitter | ||
| __init__.py | ||