feat: Download the uploaded files (#31068)

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
盐粒 Yanli 2026-01-19 16:48:13 +08:00 committed by GitHub
parent 2d4289a925
commit 62ac02a568
20 changed files with 1226 additions and 10 deletions

View File

@@ -0,0 +1,52 @@
## Purpose
`api/controllers/console/datasets/datasets_document.py` contains the console (authenticated) APIs for managing dataset documents (list/create/update/delete, processing controls, estimates, etc.).
## Storage model (uploaded files)
- For local file uploads into a knowledge base, the binary is stored via `extensions.ext_storage.storage` under the key:
- `upload_files/<tenant_id>/<uuid>.<ext>`
- File metadata is stored in the `upload_files` table (`UploadFile` model), keyed by `UploadFile.id`.
- Dataset `Document` records reference the uploaded file via:
- `Document.data_source_info.upload_file_id`
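A minimal sketch of that resolution chain (the model fields are real, but the function and its `session` parameter are illustrative; the production logic lives in `DocumentService`, shown later in this commit):

```python
from models import Document, UploadFile


def resolve_upload_file(document: Document, session) -> UploadFile | None:
    """Follow Document.data_source_info.upload_file_id to its UploadFile row."""
    if document.data_source_type != "upload_file":
        return None
    info = document.data_source_info_dict or {}
    upload_file_id = info.get("upload_file_id")
    if not upload_file_id:
        return None
    # UploadFile.key points at `upload_files/<tenant_id>/<uuid>.<ext>` in storage.
    return session.get(UploadFile, str(upload_file_id))
```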
## Download endpoint
- `GET /datasets/<dataset_id>/documents/<document_id>/download`
- Only supported when `Document.data_source_type == "upload_file"`.
- Performs dataset permission + tenant checks via `DocumentResource.get_document(...)`.
- Delegates `Document -> UploadFile` validation and signed URL generation to `DocumentService.get_document_download_url(...)`.
- Applies `cloud_edition_billing_rate_limit_check("knowledge")` to match other KB operations.
- Response body is **only**: `{ "url": "<signed-url>" }`.
- `POST /datasets/<dataset_id>/documents/download-zip`
- Accepts `{ "document_ids": ["..."] }` (upload-file only).
- Returns `application/zip` as a single attachment download.
- Rationale: browsers often block multiple automatic downloads; a ZIP avoids that limitation.
- Applies `cloud_edition_billing_rate_limit_check("knowledge")`.
- Delegates dataset permission checks, document/upload-file validation, and download-name generation to
`DocumentService.prepare_document_batch_download_zip(...)` before streaming the ZIP.
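A hypothetical client-side sketch of both endpoints (the base URL and auth header are assumptions about the deployment, not part of this commit):

```python
import requests

BASE = "https://example.com/console/api"  # assumption: console API base path
HEADERS = {"Authorization": "Bearer <console-token>"}  # assumption: console auth
dataset_id, document_id = "<dataset-uuid>", "<document-uuid>"

# Single document: the response body is only {"url": "<signed-url>"}.
r = requests.get(f"{BASE}/datasets/{dataset_id}/documents/{document_id}/download", headers=HEADERS)
signed_url = r.json()["url"]

# Batch: the response is an application/zip attachment streamed directly.
r = requests.post(
    f"{BASE}/datasets/{dataset_id}/documents/download-zip",
    headers=HEADERS,
    json={"document_ids": [document_id]},
)
assert r.headers["Content-Type"].startswith("application/zip")
```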
## Verification plan
- Upload a document from a local file into a dataset.
- Call the download endpoint and confirm it returns a signed URL.
- Open the URL and confirm:
- Response headers force download (`Content-Disposition`), and
- Downloaded bytes match the uploaded file.
- Select multiple uploaded-file documents and download as ZIP; confirm all selected files exist in the archive.
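These checks can be scripted; a sketch assuming `signed_url`, the original `uploaded_bytes`, a `zip_response` from the batch endpoint, and the `expected_names` of the selected files are already in hand:

```python
from io import BytesIO
from zipfile import ZipFile

import requests

dl = requests.get(signed_url)
# Headers must force a download rather than inline rendering.
assert "attachment" in dl.headers.get("Content-Disposition", "")
# Downloaded bytes must match the original upload byte-for-byte.
assert dl.content == uploaded_bytes

# Every selected file must appear in the batch ZIP.
with ZipFile(BytesIO(zip_response.content)) as zf:
    assert set(expected_names) <= set(zf.namelist())
```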
## Shared helper
- `DocumentService.get_document_download_url(document)` resolves the `UploadFile` and signs a download URL.
- `DocumentService.prepare_document_batch_download_zip(...)` performs dataset permission checks, batches
document + upload file lookups, preserves request order, and generates the client-visible ZIP filename.
- Internal helpers now live in `DocumentService` (`_get_upload_file_id_for_upload_file_document(...)`,
`_get_upload_file_for_upload_file_document(...)`, `_get_upload_files_by_document_id_for_zip_download(...)`).
- ZIP packing is handled by `FileService.build_upload_files_zip_tempfile(...)`, which also:
- sanitizes entry names to avoid path traversal, and
- deduplicates names while preserving extensions (e.g., `doc.txt` → `doc (1).txt`).
Streaming the response and deferring cleanup is handled by the route via `send_file(path, ...)` + `ExitStack` +
`response.call_on_close(...)` (the file is deleted when the response is closed).

View File

@@ -0,0 +1,18 @@
## Purpose
`api/services/dataset_service.py` hosts dataset/document service logic used by console and API controllers.
## Batch document operations
- Batch document workflows should avoid N+1 database queries by using set-based lookups.
- Tenant checks must be enforced consistently across dataset/document operations.
- `DocumentService.get_documents_by_ids(...)` fetches documents for a dataset using `id.in_(...)`.
- `FileService.get_upload_files_by_ids(...)` performs tenant-scoped batch lookup for `UploadFile` (dedupes ids with `set(...)`).
- `DocumentService.get_document_download_url(...)` and `prepare_document_batch_download_zip(...)` handle
dataset/document permission checks plus `Document -> UploadFile` validation for download endpoints.
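The set-based lookup pattern, abbreviated from the implementation later in this commit:

```python
from sqlalchemy import select

from extensions.ext_database import db
from models import Document


def get_documents_by_ids(dataset_id: str, document_ids: list[str]):
    # One IN-clause query instead of one query per id (no N+1).
    return db.session.scalars(
        select(Document).where(
            Document.dataset_id == dataset_id,
            Document.id.in_(document_ids),
        )
    ).all()
```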
## Verification plan
- Exercise document list and download endpoints that use the service helpers.
- Confirm batch download uses constant query count for documents + upload files.
- Request a ZIP with a missing document id and confirm a 404 is returned.

View File

@@ -0,0 +1,35 @@
## Purpose
`api/services/file_service.py` owns business logic around `UploadFile` objects: upload validation, storage persistence,
previews/generators, and deletion.
## Key invariants
- All storage I/O goes through `extensions.ext_storage.storage`.
- Uploaded file keys follow: `upload_files/<tenant_id>/<uuid>.<ext>`.
- Upload validation is enforced in `FileService.upload_file(...)` (blocked extensions, size limits, dataset-only types).
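Key construction under that invariant is then a one-liner; an illustrative sketch (the real code in `FileService.upload_file(...)` also enforces the validation above):

```python
import uuid


def make_upload_key(tenant_id: str, filename: str) -> str:
    # upload_files/<tenant_id>/<uuid>.<ext>
    extension = filename.rsplit(".", 1)[-1].lower()
    return f"upload_files/{tenant_id}/{uuid.uuid4()}.{extension}"
```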
## Batch lookup helpers
- `FileService.get_upload_files_by_ids(tenant_id, upload_file_ids)` is the canonical tenant-scoped batch loader for
`UploadFile`.
## Dataset document download helpers
The dataset document download/ZIP endpoints now delegate “Document → UploadFile” validation and permission checks to
`DocumentService` (`api/services/dataset_service.py`). `FileService` stays focused on generic `UploadFile` operations
(uploading, previews, deletion), plus generic ZIP serving.
### ZIP serving
- `FileService.build_upload_files_zip_tempfile(...)` builds a ZIP from `UploadFile` objects and yields a seeked
tempfile **path** so callers can stream it (e.g., `send_file(path, ...)`) without hitting "read of closed file"
issues from file-handle lifecycle during streamed responses.
- Flask `send_file(...)` and the `ExitStack`/`call_on_close(...)` cleanup pattern are handled in the route layer.
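Condensed from the route code later in this commit, the caller-side shape of that contract (with `upload_files` and `name` supplied by the caller) looks like:

```python
from contextlib import ExitStack

from flask import send_file

from services.file_service import FileService

with ExitStack() as stack:
    zip_path = stack.enter_context(FileService.build_upload_files_zip_tempfile(upload_files=upload_files))
    response = send_file(zip_path, mimetype="application/zip", as_attachment=True, download_name=name)
    cleanup = stack.pop_all()  # keep the tempfile alive past this `with` block
    response.call_on_close(cleanup.close)  # delete it once streaming finishes
```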
## Verification plan
- Unit: `api/tests/unit_tests/controllers/console/datasets/test_datasets_document_download.py`
- Verify signed URL generation for upload-file documents and ZIP download behavior for multiple documents.
- Unit: `api/tests/unit_tests/services/test_file_service_zip_and_lookup.py`
- Verify ZIP packing produces a valid, openable archive and preserves file content.

View File

@@ -0,0 +1,28 @@
## Purpose
Unit tests for the console dataset document download endpoint:
- `GET /datasets/<dataset_id>/documents/<document_id>/download`
## Testing approach
- Uses `Flask.test_request_context()` and calls the `Resource.get(...)` method directly.
- Monkeypatches console decorators (`login_required`, `setup_required`, rate limit) to no-ops to keep the test focused.
- Mocks:
- `DatasetService.get_dataset` / `check_dataset_permission`
- `DocumentService.get_document` for single-file download tests
- `DocumentService.get_documents_by_ids` + `FileService.get_upload_files_by_ids` for ZIP download tests
- `FileService.get_upload_files_by_ids` for `UploadFile` lookups in single-file tests
- `services.dataset_service.file_helpers.get_signed_file_url` to return a deterministic URL
- Document mocks include `id` fields so batch lookups can map documents by id.
## Covered cases
- Success returns `{ "url": "<signed>" }` for upload-file documents.
- 404 when document is not `upload_file`.
- 404 when `upload_file_id` is missing.
- 404 when referenced `UploadFile` row does not exist.
- 403 when document tenant does not match current tenant.
- Batch ZIP download returns `application/zip` for upload-file documents.
- Batch ZIP download rejects non-upload-file documents.
- Batch ZIP download uses a random `.zip` attachment name (`download_name`), so tests only assert the suffix.

View File

@@ -0,0 +1,18 @@
## Purpose
Unit tests for `api/services/file_service.py` helper methods that are not covered by higher-level controller tests.
## What's covered
- `FileService.build_upload_files_zip_tempfile(...)`
- ZIP entry name sanitization (no directory components / traversal)
- name deduplication while preserving extensions
- writing streamed bytes from `storage.load(...)` into ZIP entries
- yields a tempfile path so callers can open/stream the ZIP without holding a live file handle
- `FileService.get_upload_files_by_ids(...)`
- returns `{}` for empty id lists
- returns an id-keyed mapping for non-empty lists
## Notes
- These tests intentionally stub `storage.load` and `db.session.scalars(...).all()` to avoid needing a real DB/storage.

View File

@@ -2,10 +2,12 @@ import json
import logging
from argparse import ArgumentTypeError
from collections.abc import Sequence
from typing import Literal, cast
from contextlib import ExitStack
from typing import Any, Literal, cast
from uuid import UUID
import sqlalchemy as sa
from flask import request
from flask import request, send_file
from flask_restx import Resource, fields, marshal, marshal_with
from pydantic import BaseModel, Field
from sqlalchemy import asc, desc, select
@@ -42,6 +44,7 @@ from models import DatasetProcessRule, Document, DocumentSegment, UploadFile
from models.dataset import DocumentPipelineExecutionLog
from services.dataset_service import DatasetService, DocumentService
from services.entities.knowledge_entities.knowledge_entities import KnowledgeConfig, ProcessRule, RetrievalModel
from services.file_service import FileService
from ..app.error import (
ProviderModelCurrentlyNotSupportError,
@@ -65,6 +68,9 @@ from ..wraps import (
logger = logging.getLogger(__name__)
# NOTE: Keep constants near the top of the module for discoverability.
DOCUMENT_BATCH_DOWNLOAD_ZIP_MAX_DOCS = 100
def _get_or_create_model(model_name: str, field_def):
existing = console_ns.models.get(model_name)
@@ -104,6 +110,12 @@ class DocumentRenamePayload(BaseModel):
name: str
class DocumentBatchDownloadZipPayload(BaseModel):
"""Request payload for bulk downloading documents as a zip archive."""
document_ids: list[UUID] = Field(..., min_length=1, max_length=DOCUMENT_BATCH_DOWNLOAD_ZIP_MAX_DOCS)
class DocumentDatasetListParam(BaseModel):
page: int = Field(1, title="Page", description="Page number.")
limit: int = Field(20, title="Limit", description="Page size.")
@@ -120,6 +132,7 @@ register_schema_models(
RetrievalModel,
DocumentRetryPayload,
DocumentRenamePayload,
DocumentBatchDownloadZipPayload,
)
@@ -853,6 +866,62 @@ class DocumentApi(DocumentResource):
return {"result": "success"}, 204
@console_ns.route("/datasets/<uuid:dataset_id>/documents/<uuid:document_id>/download")
class DocumentDownloadApi(DocumentResource):
"""Return a signed download URL for a dataset document's original uploaded file."""
@console_ns.doc("get_dataset_document_download_url")
@console_ns.doc(description="Get a signed download URL for a dataset document's original uploaded file")
@setup_required
@login_required
@account_initialization_required
@cloud_edition_billing_rate_limit_check("knowledge")
def get(self, dataset_id: str, document_id: str) -> dict[str, Any]:
# Reuse the shared permission/tenant checks implemented in DocumentResource.
document = self.get_document(str(dataset_id), str(document_id))
return {"url": DocumentService.get_document_download_url(document)}
@console_ns.route("/datasets/<uuid:dataset_id>/documents/download-zip")
class DocumentBatchDownloadZipApi(DocumentResource):
"""Download multiple uploaded-file documents as a single ZIP (avoids browser multi-download limits)."""
@console_ns.doc("download_dataset_documents_as_zip")
@console_ns.doc(description="Download selected dataset documents as a single ZIP archive (upload-file only)")
@setup_required
@login_required
@account_initialization_required
@cloud_edition_billing_rate_limit_check("knowledge")
@console_ns.expect(console_ns.models[DocumentBatchDownloadZipPayload.__name__])
def post(self, dataset_id: str):
"""Stream a ZIP archive containing the requested uploaded documents."""
# Parse and validate request payload.
payload = DocumentBatchDownloadZipPayload.model_validate(console_ns.payload or {})
current_user, current_tenant_id = current_account_with_tenant()
dataset_id = str(dataset_id)
document_ids: list[str] = [str(document_id) for document_id in payload.document_ids]
upload_files, download_name = DocumentService.prepare_document_batch_download_zip(
dataset_id=dataset_id,
document_ids=document_ids,
tenant_id=current_tenant_id,
current_user=current_user,
)
# Delegate ZIP packing to FileService, but keep Flask response+cleanup in the route.
with ExitStack() as stack:
zip_path = stack.enter_context(FileService.build_upload_files_zip_tempfile(upload_files=upload_files))
response = send_file(
zip_path,
mimetype="application/zip",
as_attachment=True,
download_name=download_name,
)
cleanup = stack.pop_all()
response.call_on_close(cleanup.close)
return response
@console_ns.route("/datasets/<uuid:dataset_id>/documents/<uuid:document_id>/processing/<string:action>")
class DocumentProcessingApi(DocumentResource):
@console_ns.doc("update_document_processing")

View File

@@ -13,10 +13,11 @@ import sqlalchemy as sa
from redis.exceptions import LockNotOwnedError
from sqlalchemy import exists, func, select
from sqlalchemy.orm import Session
from werkzeug.exceptions import NotFound
from werkzeug.exceptions import Forbidden, NotFound
from configs import dify_config
from core.errors.error import LLMBadRequestError, ProviderTokenNotInitError
from core.file import helpers as file_helpers
from core.helper.name_generator import generate_incremental_name
from core.model_manager import ModelManager
from core.model_runtime.entities.model_entities import ModelFeature, ModelType
@@ -73,6 +74,7 @@ from services.errors.document import DocumentIndexingError
from services.errors.file import FileNotExistsError
from services.external_knowledge_service import ExternalDatasetService
from services.feature_service import FeatureModel, FeatureService
from services.file_service import FileService
from services.rag_pipeline.rag_pipeline import RagPipelineService
from services.tag_service import TagService
from services.vector_service import VectorService
@@ -1162,6 +1164,7 @@ class DocumentService:
Document.archived.is_(True),
),
}
DOCUMENT_BATCH_DOWNLOAD_ZIP_FILENAME_EXTENSION = ".zip"
@classmethod
def normalize_display_status(cls, status: str | None) -> str | None:
@@ -1288,6 +1291,143 @@ class DocumentService:
else:
return None
@staticmethod
def get_documents_by_ids(dataset_id: str, document_ids: Sequence[str]) -> Sequence[Document]:
"""Fetch documents for a dataset in a single batch query."""
if not document_ids:
return []
document_id_list: list[str] = [str(document_id) for document_id in document_ids]
# Fetch all requested documents in one query to avoid N+1 lookups.
documents: Sequence[Document] = db.session.scalars(
select(Document).where(
Document.dataset_id == dataset_id,
Document.id.in_(document_id_list),
)
).all()
return documents
@staticmethod
def get_document_download_url(document: Document) -> str:
"""
Return a signed download URL for an upload-file document.
"""
upload_file = DocumentService._get_upload_file_for_upload_file_document(document)
return file_helpers.get_signed_file_url(upload_file_id=upload_file.id, as_attachment=True)
@staticmethod
def prepare_document_batch_download_zip(
*,
dataset_id: str,
document_ids: Sequence[str],
tenant_id: str,
current_user: Account,
) -> tuple[list[UploadFile], str]:
"""
Resolve upload files for batch ZIP downloads and generate a client-visible filename.
"""
dataset = DatasetService.get_dataset(dataset_id)
if not dataset:
raise NotFound("Dataset not found.")
try:
DatasetService.check_dataset_permission(dataset, current_user)
except NoPermissionError as e:
raise Forbidden(str(e))
upload_files_by_document_id = DocumentService._get_upload_files_by_document_id_for_zip_download(
dataset_id=dataset_id,
document_ids=document_ids,
tenant_id=tenant_id,
)
upload_files = [upload_files_by_document_id[document_id] for document_id in document_ids]
download_name = DocumentService._generate_document_batch_download_zip_filename()
return upload_files, download_name
@staticmethod
def _generate_document_batch_download_zip_filename() -> str:
"""
Generate a random attachment filename for the batch download ZIP.
"""
return f"{uuid.uuid4().hex}{DocumentService.DOCUMENT_BATCH_DOWNLOAD_ZIP_FILENAME_EXTENSION}"
@staticmethod
def _get_upload_file_id_for_upload_file_document(
document: Document,
*,
invalid_source_message: str,
missing_file_message: str,
) -> str:
"""
Normalize and validate `Document -> UploadFile` linkage for download flows.
"""
if document.data_source_type != "upload_file":
raise NotFound(invalid_source_message)
data_source_info: dict[str, Any] = document.data_source_info_dict or {}
upload_file_id: str | None = data_source_info.get("upload_file_id")
if not upload_file_id:
raise NotFound(missing_file_message)
return str(upload_file_id)
@staticmethod
def _get_upload_file_for_upload_file_document(document: Document) -> UploadFile:
"""
Load the `UploadFile` row for an upload-file document.
"""
upload_file_id = DocumentService._get_upload_file_id_for_upload_file_document(
document,
invalid_source_message="Document does not have an uploaded file to download.",
missing_file_message="Uploaded file not found.",
)
upload_files_by_id = FileService.get_upload_files_by_ids(document.tenant_id, [upload_file_id])
upload_file = upload_files_by_id.get(upload_file_id)
if not upload_file:
raise NotFound("Uploaded file not found.")
return upload_file
@staticmethod
def _get_upload_files_by_document_id_for_zip_download(
*,
dataset_id: str,
document_ids: Sequence[str],
tenant_id: str,
) -> dict[str, UploadFile]:
"""
Batch load upload files keyed by document id for ZIP downloads.
"""
document_id_list: list[str] = [str(document_id) for document_id in document_ids]
documents = DocumentService.get_documents_by_ids(dataset_id, document_id_list)
documents_by_id: dict[str, Document] = {str(document.id): document for document in documents}
missing_document_ids: set[str] = set(document_id_list) - set(documents_by_id.keys())
if missing_document_ids:
raise NotFound("Document not found.")
upload_file_ids: list[str] = []
upload_file_ids_by_document_id: dict[str, str] = {}
for document_id, document in documents_by_id.items():
if document.tenant_id != tenant_id:
raise Forbidden("No permission.")
upload_file_id = DocumentService._get_upload_file_id_for_upload_file_document(
document,
invalid_source_message="Only uploaded-file documents can be downloaded as ZIP.",
missing_file_message="Only uploaded-file documents can be downloaded as ZIP.",
)
upload_file_ids.append(upload_file_id)
upload_file_ids_by_document_id[document_id] = upload_file_id
upload_files_by_id = FileService.get_upload_files_by_ids(tenant_id, upload_file_ids)
missing_upload_file_ids: set[str] = set(upload_file_ids) - set(upload_files_by_id.keys())
if missing_upload_file_ids:
raise NotFound("Only uploaded-file documents can be downloaded as ZIP.")
return {
document_id: upload_files_by_id[upload_file_id]
for document_id, upload_file_id in upload_file_ids_by_document_id.items()
}
@staticmethod
def get_document_by_id(document_id: str) -> Document | None:
document = db.session.query(Document).where(Document.id == document_id).first()

View File

@@ -2,7 +2,11 @@ import base64
import hashlib
import os
import uuid
from collections.abc import Iterator, Sequence
from contextlib import contextmanager, suppress
from tempfile import NamedTemporaryFile
from typing import Literal, Union
from zipfile import ZIP_DEFLATED, ZipFile
from sqlalchemy import Engine, select
from sqlalchemy.orm import Session, sessionmaker
@@ -17,6 +21,7 @@ from constants import (
)
from core.file import helpers as file_helpers
from core.rag.extractor.extract_processor import ExtractProcessor
from extensions.ext_database import db
from extensions.ext_storage import storage
from libs.datetime_utils import naive_utc_now
from libs.helper import extract_tenant_id
@@ -167,6 +172,9 @@ class FileService:
return upload_file
def get_file_preview(self, file_id: str):
"""
Return a short text preview extracted from a document file.
"""
with self._session_maker(expire_on_commit=False) as session:
upload_file = session.query(UploadFile).where(UploadFile.id == file_id).first()
@@ -253,3 +261,101 @@ class FileService:
return
storage.delete(upload_file.key)
session.delete(upload_file)
@staticmethod
def get_upload_files_by_ids(tenant_id: str, upload_file_ids: Sequence[str]) -> dict[str, UploadFile]:
"""
Fetch `UploadFile` rows for a tenant in a single batch query.
This is a generic `UploadFile` lookup helper (not dataset/document specific), so it lives in `FileService`.
"""
if not upload_file_ids:
return {}
# Normalize and deduplicate ids before using them in the IN clause.
upload_file_id_list: list[str] = [str(upload_file_id) for upload_file_id in upload_file_ids]
unique_upload_file_ids: list[str] = list(set(upload_file_id_list))
# Fetch upload files in one query for efficient batch access.
upload_files: Sequence[UploadFile] = db.session.scalars(
select(UploadFile).where(
UploadFile.tenant_id == tenant_id,
UploadFile.id.in_(unique_upload_file_ids),
)
).all()
return {str(upload_file.id): upload_file for upload_file in upload_files}
@staticmethod
def _sanitize_zip_entry_name(name: str) -> str:
"""
Sanitize a ZIP entry name to avoid path traversal and weird separators.
We keep this conservative: the upload flow already rejects `/` and `\\`, but older rows (or imported data)
could still contain unsafe names.
"""
# Drop any directory components and prevent empty names.
base = os.path.basename(name).strip() or "file"
# ZIP uses forward slashes as separators; remove any residual separator characters.
return base.replace("/", "_").replace("\\", "_")
@staticmethod
def _dedupe_zip_entry_name(original_name: str, used_names: set[str]) -> str:
"""
Return a unique ZIP entry name, inserting suffixes before the extension.
"""
# Keep the original name when it's not already used.
if original_name not in used_names:
return original_name
# Insert suffixes before the extension (e.g., "doc.txt" -> "doc (1).txt").
stem, extension = os.path.splitext(original_name)
suffix = 1
while True:
candidate = f"{stem} ({suffix}){extension}"
if candidate not in used_names:
return candidate
suffix += 1
@staticmethod
@contextmanager
def build_upload_files_zip_tempfile(
*,
upload_files: Sequence[UploadFile],
) -> Iterator[str]:
"""
Build a ZIP from `UploadFile`s and yield a tempfile path.
We yield a path (rather than an open file handle) to avoid "read of closed file" issues when Flask/Werkzeug
streams responses. The caller is expected to keep this context open until the response is fully sent, then
close it (e.g., via `response.call_on_close(...)`) to delete the tempfile.
"""
used_names: set[str] = set()
# Build a ZIP in a temp file and keep it on disk until the caller finishes streaming it.
tmp_path: str | None = None
try:
with NamedTemporaryFile(mode="w+b", suffix=".zip", delete=False) as tmp:
tmp_path = tmp.name
with ZipFile(tmp, mode="w", compression=ZIP_DEFLATED) as zf:
for upload_file in upload_files:
# Ensure the entry name is safe and unique.
safe_name = FileService._sanitize_zip_entry_name(upload_file.name)
arcname = FileService._dedupe_zip_entry_name(safe_name, used_names)
used_names.add(arcname)
# Stream file bytes from storage into the ZIP entry.
with zf.open(arcname, "w") as entry:
for chunk in storage.load(upload_file.key, stream=True):
entry.write(chunk)
# Flush so `send_file(path, ...)` can re-open it safely on all platforms.
tmp.flush()
assert tmp_path is not None
yield tmp_path
finally:
# Remove the temp file when the context is closed (typically after the response finishes streaming).
if tmp_path is not None:
with suppress(FileNotFoundError):
os.remove(tmp_path)

View File

@@ -0,0 +1,430 @@
"""
Unit tests for the dataset document download endpoint.
These tests validate that the controller returns a signed download URL for
upload-file documents, and rejects unsupported or missing file cases.
"""
from __future__ import annotations
import importlib
import sys
from collections import UserDict
from io import BytesIO
from types import SimpleNamespace
from typing import Any
from zipfile import ZipFile
import pytest
from flask import Flask
from werkzeug.exceptions import Forbidden, NotFound
@pytest.fixture
def app() -> Flask:
"""Create a minimal Flask app for request-context based controller tests."""
app = Flask(__name__)
app.config["TESTING"] = True
return app
@pytest.fixture
def datasets_document_module(monkeypatch: pytest.MonkeyPatch):
"""
Reload `controllers.console.datasets.datasets_document` with lightweight decorators.
We patch auth / setup / rate-limit decorators to no-ops so we can unit test the
controller logic without requiring the full console stack.
"""
from controllers.console import console_ns, wraps
from libs import login
def _noop(func): # type: ignore[no-untyped-def]
return func
# Bypass login/setup/account checks in unit tests.
monkeypatch.setattr(login, "login_required", _noop)
monkeypatch.setattr(wraps, "setup_required", _noop)
monkeypatch.setattr(wraps, "account_initialization_required", _noop)
# Bypass billing-related decorators used by other endpoints in this module.
monkeypatch.setattr(wraps, "cloud_edition_billing_resource_check", lambda *_args, **_kwargs: (lambda f: f))
monkeypatch.setattr(wraps, "cloud_edition_billing_rate_limit_check", lambda *_args, **_kwargs: (lambda f: f))
# Avoid Flask-RESTX route registration side effects during import.
def _noop_route(*_args, **_kwargs): # type: ignore[override]
def _decorator(cls):
return cls
return _decorator
monkeypatch.setattr(console_ns, "route", _noop_route)
module_name = "controllers.console.datasets.datasets_document"
sys.modules.pop(module_name, None)
return importlib.import_module(module_name)
def _mock_user(*, is_dataset_editor: bool = True) -> SimpleNamespace:
"""Build a minimal user object compatible with dataset permission checks."""
return SimpleNamespace(is_dataset_editor=is_dataset_editor, id="user-123")
def _mock_document(
*,
document_id: str,
tenant_id: str,
data_source_type: str,
upload_file_id: str | None,
) -> SimpleNamespace:
"""Build a minimal document object used by the controller."""
data_source_info_dict: dict[str, Any] | None = None
if upload_file_id is not None:
data_source_info_dict = {"upload_file_id": upload_file_id}
else:
data_source_info_dict = {}
return SimpleNamespace(
id=document_id,
tenant_id=tenant_id,
data_source_type=data_source_type,
data_source_info_dict=data_source_info_dict,
)
def _wire_common_success_mocks(
*,
module,
monkeypatch: pytest.MonkeyPatch,
current_tenant_id: str,
document_tenant_id: str,
data_source_type: str,
upload_file_id: str | None,
upload_file_exists: bool,
signed_url: str,
) -> None:
"""Patch controller dependencies to create a deterministic test environment."""
import services.dataset_service as dataset_service_module
# Make `current_account_with_tenant()` return a known user + tenant id.
monkeypatch.setattr(module, "current_account_with_tenant", lambda: (_mock_user(), current_tenant_id))
# Return a dataset object and allow permission checks to pass.
monkeypatch.setattr(module.DatasetService, "get_dataset", lambda _dataset_id: SimpleNamespace(id="ds-1"))
monkeypatch.setattr(module.DatasetService, "check_dataset_permission", lambda *_args, **_kwargs: None)
# Return a document that will be validated inside DocumentResource.get_document.
document = _mock_document(
document_id="doc-1",
tenant_id=document_tenant_id,
data_source_type=data_source_type,
upload_file_id=upload_file_id,
)
monkeypatch.setattr(module.DocumentService, "get_document", lambda *_args, **_kwargs: document)
# Mock UploadFile lookup via FileService batch helper.
upload_files_by_id: dict[str, Any] = {}
if upload_file_exists and upload_file_id is not None:
upload_files_by_id[str(upload_file_id)] = SimpleNamespace(id=str(upload_file_id))
monkeypatch.setattr(module.FileService, "get_upload_files_by_ids", lambda *_args, **_kwargs: upload_files_by_id)
# Mock signing helper so the returned URL is deterministic.
monkeypatch.setattr(dataset_service_module.file_helpers, "get_signed_file_url", lambda **_kwargs: signed_url)
def _mock_send_file(obj, **kwargs): # type: ignore[no-untyped-def]
"""Return a lightweight representation of `send_file(...)` for unit tests."""
class _ResponseMock(UserDict):
def __init__(self, sent_file: object, send_file_kwargs: dict[str, object]) -> None:
super().__init__({"_sent_file": sent_file, "_send_file_kwargs": send_file_kwargs})
self._on_close: object | None = None
def call_on_close(self, func): # type: ignore[no-untyped-def]
self._on_close = func
return func
return _ResponseMock(obj, kwargs)
def test_batch_download_zip_returns_send_file(
app: Flask, datasets_document_module, monkeypatch: pytest.MonkeyPatch
) -> None:
"""Ensure batch ZIP download returns a zip attachment via `send_file`."""
# Arrange common permission mocks.
monkeypatch.setattr(datasets_document_module, "current_account_with_tenant", lambda: (_mock_user(), "tenant-123"))
monkeypatch.setattr(
datasets_document_module.DatasetService, "get_dataset", lambda _dataset_id: SimpleNamespace(id="ds-1")
)
monkeypatch.setattr(
datasets_document_module.DatasetService, "check_dataset_permission", lambda *_args, **_kwargs: None
)
# Two upload-file documents, each referencing an UploadFile.
doc1 = _mock_document(
document_id="11111111-1111-1111-1111-111111111111",
tenant_id="tenant-123",
data_source_type="upload_file",
upload_file_id="file-1",
)
doc2 = _mock_document(
document_id="22222222-2222-2222-2222-222222222222",
tenant_id="tenant-123",
data_source_type="upload_file",
upload_file_id="file-2",
)
monkeypatch.setattr(
datasets_document_module.DocumentService,
"get_documents_by_ids",
lambda *_args, **_kwargs: [doc1, doc2],
)
monkeypatch.setattr(
datasets_document_module.FileService,
"get_upload_files_by_ids",
lambda *_args, **_kwargs: {
"file-1": SimpleNamespace(id="file-1", name="a.txt", key="k1"),
"file-2": SimpleNamespace(id="file-2", name="b.txt", key="k2"),
},
)
# Mock storage streaming content.
import services.file_service as file_service_module
monkeypatch.setattr(file_service_module.storage, "load", lambda _key, stream=True: [b"hello"])
# Replace send_file used by the controller to avoid a real Flask response object.
monkeypatch.setattr(datasets_document_module, "send_file", _mock_send_file)
# Act
with app.test_request_context(
"/datasets/ds-1/documents/download-zip",
method="POST",
json={"document_ids": ["11111111-1111-1111-1111-111111111111", "22222222-2222-2222-2222-222222222222"]},
):
api = datasets_document_module.DocumentBatchDownloadZipApi()
result = api.post(dataset_id="ds-1")
# Assert: we returned via send_file with correct mime type and attachment.
assert result["_send_file_kwargs"]["mimetype"] == "application/zip"
assert result["_send_file_kwargs"]["as_attachment"] is True
assert isinstance(result["_send_file_kwargs"]["download_name"], str)
assert result["_send_file_kwargs"]["download_name"].endswith(".zip")
# Ensure our cleanup hook is registered and execute it to avoid temp file leaks in unit tests.
assert getattr(result, "_on_close", None) is not None
result._on_close() # type: ignore[attr-defined]
def test_batch_download_zip_response_is_openable_zip(
app: Flask, datasets_document_module, monkeypatch: pytest.MonkeyPatch
) -> None:
"""Ensure the real Flask `send_file` response body is a valid ZIP that can be opened."""
# Arrange: same controller mocks as the lightweight send_file test, but we keep the real `send_file`.
monkeypatch.setattr(datasets_document_module, "current_account_with_tenant", lambda: (_mock_user(), "tenant-123"))
monkeypatch.setattr(
datasets_document_module.DatasetService, "get_dataset", lambda _dataset_id: SimpleNamespace(id="ds-1")
)
monkeypatch.setattr(
datasets_document_module.DatasetService, "check_dataset_permission", lambda *_args, **_kwargs: None
)
doc1 = _mock_document(
document_id="33333333-3333-3333-3333-333333333333",
tenant_id="tenant-123",
data_source_type="upload_file",
upload_file_id="file-1",
)
doc2 = _mock_document(
document_id="44444444-4444-4444-4444-444444444444",
tenant_id="tenant-123",
data_source_type="upload_file",
upload_file_id="file-2",
)
monkeypatch.setattr(
datasets_document_module.DocumentService,
"get_documents_by_ids",
lambda *_args, **_kwargs: [doc1, doc2],
)
monkeypatch.setattr(
datasets_document_module.FileService,
"get_upload_files_by_ids",
lambda *_args, **_kwargs: {
"file-1": SimpleNamespace(id="file-1", name="a.txt", key="k1"),
"file-2": SimpleNamespace(id="file-2", name="b.txt", key="k2"),
},
)
# Stream distinct bytes per key so we can verify both ZIP entries.
import services.file_service as file_service_module
monkeypatch.setattr(
file_service_module.storage, "load", lambda key, stream=True: [b"one"] if key == "k1" else [b"two"]
)
# Act
with app.test_request_context(
"/datasets/ds-1/documents/download-zip",
method="POST",
json={"document_ids": ["33333333-3333-3333-3333-333333333333", "44444444-4444-4444-4444-444444444444"]},
):
api = datasets_document_module.DocumentBatchDownloadZipApi()
response = api.post(dataset_id="ds-1")
# Assert: response body is a valid ZIP and contains the expected entries.
response.direct_passthrough = False
data = response.get_data()
response.close()
with ZipFile(BytesIO(data), mode="r") as zf:
assert zf.namelist() == ["a.txt", "b.txt"]
assert zf.read("a.txt") == b"one"
assert zf.read("b.txt") == b"two"
def test_batch_download_zip_rejects_non_upload_file_document(
app: Flask, datasets_document_module, monkeypatch: pytest.MonkeyPatch
) -> None:
"""Ensure batch ZIP download rejects non upload-file documents."""
monkeypatch.setattr(datasets_document_module, "current_account_with_tenant", lambda: (_mock_user(), "tenant-123"))
monkeypatch.setattr(
datasets_document_module.DatasetService, "get_dataset", lambda _dataset_id: SimpleNamespace(id="ds-1")
)
monkeypatch.setattr(
datasets_document_module.DatasetService, "check_dataset_permission", lambda *_args, **_kwargs: None
)
doc = _mock_document(
document_id="55555555-5555-5555-5555-555555555555",
tenant_id="tenant-123",
data_source_type="website_crawl",
upload_file_id="file-1",
)
monkeypatch.setattr(
datasets_document_module.DocumentService,
"get_documents_by_ids",
lambda *_args, **_kwargs: [doc],
)
with app.test_request_context(
"/datasets/ds-1/documents/download-zip",
method="POST",
json={"document_ids": ["55555555-5555-5555-5555-555555555555"]},
):
api = datasets_document_module.DocumentBatchDownloadZipApi()
with pytest.raises(NotFound):
api.post(dataset_id="ds-1")
def test_document_download_returns_url_for_upload_file_document(
app: Flask, datasets_document_module, monkeypatch: pytest.MonkeyPatch
) -> None:
"""Ensure upload-file documents return a `{url}` JSON payload."""
_wire_common_success_mocks(
module=datasets_document_module,
monkeypatch=monkeypatch,
current_tenant_id="tenant-123",
document_tenant_id="tenant-123",
data_source_type="upload_file",
upload_file_id="file-123",
upload_file_exists=True,
signed_url="https://example.com/signed",
)
# Build a request context then call the resource method directly.
with app.test_request_context("/datasets/ds-1/documents/doc-1/download", method="GET"):
api = datasets_document_module.DocumentDownloadApi()
result = api.get(dataset_id="ds-1", document_id="doc-1")
assert result == {"url": "https://example.com/signed"}
def test_document_download_rejects_non_upload_file_document(
app: Flask, datasets_document_module, monkeypatch: pytest.MonkeyPatch
) -> None:
"""Ensure non-upload documents raise 404 (no file to download)."""
_wire_common_success_mocks(
module=datasets_document_module,
monkeypatch=monkeypatch,
current_tenant_id="tenant-123",
document_tenant_id="tenant-123",
data_source_type="website_crawl",
upload_file_id="file-123",
upload_file_exists=True,
signed_url="https://example.com/signed",
)
with app.test_request_context("/datasets/ds-1/documents/doc-1/download", method="GET"):
api = datasets_document_module.DocumentDownloadApi()
with pytest.raises(NotFound):
api.get(dataset_id="ds-1", document_id="doc-1")
def test_document_download_rejects_missing_upload_file_id(
app: Flask, datasets_document_module, monkeypatch: pytest.MonkeyPatch
) -> None:
"""Ensure missing `upload_file_id` raises 404."""
_wire_common_success_mocks(
module=datasets_document_module,
monkeypatch=monkeypatch,
current_tenant_id="tenant-123",
document_tenant_id="tenant-123",
data_source_type="upload_file",
upload_file_id=None,
upload_file_exists=False,
signed_url="https://example.com/signed",
)
with app.test_request_context("/datasets/ds-1/documents/doc-1/download", method="GET"):
api = datasets_document_module.DocumentDownloadApi()
with pytest.raises(NotFound):
api.get(dataset_id="ds-1", document_id="doc-1")
def test_document_download_rejects_when_upload_file_record_missing(
app: Flask, datasets_document_module, monkeypatch: pytest.MonkeyPatch
) -> None:
"""Ensure missing UploadFile row raises 404."""
_wire_common_success_mocks(
module=datasets_document_module,
monkeypatch=monkeypatch,
current_tenant_id="tenant-123",
document_tenant_id="tenant-123",
data_source_type="upload_file",
upload_file_id="file-123",
upload_file_exists=False,
signed_url="https://example.com/signed",
)
with app.test_request_context("/datasets/ds-1/documents/doc-1/download", method="GET"):
api = datasets_document_module.DocumentDownloadApi()
with pytest.raises(NotFound):
api.get(dataset_id="ds-1", document_id="doc-1")
def test_document_download_rejects_tenant_mismatch(
app: Flask, datasets_document_module, monkeypatch: pytest.MonkeyPatch
) -> None:
"""Ensure tenant mismatch is rejected by the shared `get_document()` permission check."""
_wire_common_success_mocks(
module=datasets_document_module,
monkeypatch=monkeypatch,
current_tenant_id="tenant-123",
document_tenant_id="tenant-999",
data_source_type="upload_file",
upload_file_id="file-123",
upload_file_exists=True,
signed_url="https://example.com/signed",
)
with app.test_request_context("/datasets/ds-1/documents/doc-1/download", method="GET"):
api = datasets_document_module.DocumentDownloadApi()
with pytest.raises(Forbidden):
api.get(dataset_id="ds-1", document_id="doc-1")

View File

@@ -0,0 +1,99 @@
"""
Unit tests for `services.file_service.FileService` helpers.
We keep these tests focused on:
- ZIP tempfile building (sanitization + deduplication + content writes)
- tenant-scoped batch lookup behavior (`get_upload_files_by_ids`)
"""
from __future__ import annotations
from types import SimpleNamespace
from typing import Any
from zipfile import ZipFile
import pytest
import services.file_service as file_service_module
from services.file_service import FileService
def test_build_upload_files_zip_tempfile_sanitizes_and_dedupes_names(monkeypatch: pytest.MonkeyPatch) -> None:
"""Ensure ZIP entry names are safe and unique while preserving extensions."""
# Arrange: three upload files that all sanitize down to the same basename ("b.txt").
upload_files: list[Any] = [
SimpleNamespace(name="a/b.txt", key="k1"),
SimpleNamespace(name="c/b.txt", key="k2"),
SimpleNamespace(name="../b.txt", key="k3"),
]
# Stream distinct bytes per key so we can verify content is written to the right entry.
data_by_key: dict[str, list[bytes]] = {"k1": [b"one"], "k2": [b"two"], "k3": [b"three"]}
def _load(key: str, stream: bool = True) -> list[bytes]:
# Return the corresponding chunks for this key (the production code iterates chunks).
assert stream is True
return data_by_key[key]
monkeypatch.setattr(file_service_module.storage, "load", _load)
# Act: build zip in a tempfile.
with FileService.build_upload_files_zip_tempfile(upload_files=upload_files) as tmp:
with ZipFile(tmp, mode="r") as zf:
# Assert: names are sanitized (no directory components) and deduped with suffixes.
assert zf.namelist() == ["b.txt", "b (1).txt", "b (2).txt"]
# Assert: each entry contains the correct bytes from storage.
assert zf.read("b.txt") == b"one"
assert zf.read("b (1).txt") == b"two"
assert zf.read("b (2).txt") == b"three"
def test_get_upload_files_by_ids_returns_empty_when_no_ids(monkeypatch: pytest.MonkeyPatch) -> None:
"""Ensure empty input returns an empty mapping without hitting the database."""
class _Session:
def scalars(self, _stmt): # type: ignore[no-untyped-def]
raise AssertionError("db.session.scalars should not be called for empty id lists")
monkeypatch.setattr(file_service_module, "db", SimpleNamespace(session=_Session()))
assert FileService.get_upload_files_by_ids("tenant-1", []) == {}
def test_get_upload_files_by_ids_returns_id_keyed_mapping(monkeypatch: pytest.MonkeyPatch) -> None:
"""Ensure batch lookup returns a dict keyed by stringified UploadFile ids."""
upload_files: list[Any] = [
SimpleNamespace(id="file-1", tenant_id="tenant-1"),
SimpleNamespace(id="file-2", tenant_id="tenant-1"),
]
class _ScalarResult:
def __init__(self, items: list[Any]) -> None:
self._items = items
def all(self) -> list[Any]:
return self._items
class _Session:
def __init__(self, items: list[Any]) -> None:
self._items = items
self.calls: list[object] = []
def scalars(self, stmt): # type: ignore[no-untyped-def]
# Capture the statement so we can at least assert the query path is taken.
self.calls.append(stmt)
return _ScalarResult(self._items)
session = _Session(upload_files)
monkeypatch.setattr(file_service_module, "db", SimpleNamespace(session=session))
# Provide duplicates to ensure callers can safely pass repeated ids.
result = FileService.get_upload_files_by_ids("tenant-1", ["file-1", "file-1", "file-2"])
assert set(result.keys()) == {"file-1", "file-2"}
assert result["file-1"].id == "file-1"
assert result["file-2"].id == "file-2"
assert len(session.calls) == 1

View File

@@ -1,4 +1,4 @@
import type { FC } from 'react'
import type { FC, MouseEvent } from 'react'
import type { Resources } from './index'
import Link from 'next/link'
import { Fragment, useState } from 'react'
@@ -18,6 +18,8 @@ import {
PortalToFollowElemContent,
PortalToFollowElemTrigger,
} from '@/app/components/base/portal-to-follow-elem'
import { useDocumentDownload } from '@/service/knowledge/use-document'
import { downloadUrl } from '@/utils/download'
import ProgressTooltip from './progress-tooltip'
import Tooltip from './tooltip'
@@ -36,6 +38,30 @@ const Popup: FC<PopupProps> = ({
? (/\.([^.]*)$/.exec(data.documentName)?.[1] || '')
: 'notion'
const { mutateAsync: downloadDocument, isPending: isDownloading } = useDocumentDownload()
/**
* Download the original uploaded file for citations whose data source is upload-file.
* We request a signed URL from the dataset document download endpoint, then trigger browser download.
*/
const handleDownloadUploadFile = async (e: MouseEvent<HTMLElement>) => {
// Prevent toggling the citation popup when user clicks the download link.
e.preventDefault()
e.stopPropagation()
// Only upload-file citations can be downloaded this way (needs dataset/document ids).
const isUploadFile = data.dataSourceType === 'upload_file' || data.dataSourceType === 'file'
const datasetId = data.sources?.[0]?.dataset_id
const documentId = data.documentId || data.sources?.[0]?.document_id
if (!isUploadFile || !datasetId || !documentId || isDownloading)
return
// Fetch signed URL (usually points to `/files/<id>/file-preview?...&as_attachment=true`).
const res = await downloadDocument({ datasetId, documentId })
if (res?.url)
downloadUrl({ url: res.url, fileName: data.documentName })
}
return (
<PortalToFollowElem
open={open}
@@ -49,6 +75,7 @@ const Popup: FC<PopupProps> = ({
<PortalToFollowElemTrigger onClick={() => setOpen(v => !v)}>
<div className="flex h-7 max-w-[240px] items-center rounded-lg bg-components-button-secondary-bg px-2">
<FileIcon type={fileType} className="mr-1 h-4 w-4 shrink-0" />
{/* Keep the trigger purely for opening the popup (no download link here). */}
<div className="truncate text-xs text-text-tertiary">{data.documentName}</div>
</div>
</PortalToFollowElemTrigger>
@@ -57,7 +84,21 @@
<div className="px-4 pb-2 pt-3">
<div className="flex h-[18px] items-center">
<FileIcon type={fileType} className="mr-1 h-4 w-4 shrink-0" />
<div className="system-xs-medium truncate text-text-tertiary">{data.documentName}</div>
<div className="system-xs-medium truncate text-text-tertiary">
{/* If it's an upload-file reference, the title becomes a download link. */}
{(data.dataSourceType === 'upload_file' || data.dataSourceType === 'file') && !!data.sources?.[0]?.dataset_id
? (
<button
type="button"
className="cursor-pointer truncate text-text-tertiary hover:underline"
onClick={handleDownloadUploadFile}
disabled={isDownloading}
>
{data.documentName}
</button>
)
: data.documentName}
</div>
</div>
</div>
<div className="max-h-[450px] overflow-y-auto rounded-lg bg-components-panel-bg px-4 py-0.5">

View File

@@ -30,9 +30,10 @@ import { useDatasetDetailContextWithSelector as useDatasetDetailContext } from '
import useTimestamp from '@/hooks/use-timestamp'
import { ChunkingMode, DataSourceType, DocumentActionType } from '@/models/datasets'
import { DatasourceType } from '@/models/pipeline'
import { useDocumentArchive, useDocumentBatchRetryIndex, useDocumentDelete, useDocumentDisable, useDocumentEnable } from '@/service/knowledge/use-document'
import { useDocumentArchive, useDocumentBatchRetryIndex, useDocumentDelete, useDocumentDisable, useDocumentDownloadZip, useDocumentEnable } from '@/service/knowledge/use-document'
import { asyncRunSafe } from '@/utils'
import { cn } from '@/utils/classnames'
import { downloadBlob } from '@/utils/download'
import { formatNumber } from '@/utils/format'
import BatchAction from '../detail/completed/common/batch-action'
import StatusItem from '../status-item'
@@ -222,6 +223,7 @@ const DocumentList: FC<IDocumentListProps> = ({
const { mutateAsync: disableDocument } = useDocumentDisable()
const { mutateAsync: deleteDocument } = useDocumentDelete()
const { mutateAsync: retryIndexDocument } = useDocumentBatchRetryIndex()
const { mutateAsync: requestDocumentsZip, isPending: isDownloadingZip } = useDocumentDownloadZip()
const handleAction = (actionName: DocumentActionType) => {
return async () => {
@@ -300,6 +302,39 @@
return dataSourceType === DatasourceType.onlineDrive
}, [])
const downloadableSelectedIds = useMemo(() => {
const selectedSet = new Set(selectedIds)
return localDocs
.filter(doc => selectedSet.has(doc.id) && doc.data_source_type === DataSourceType.FILE)
.map(doc => doc.id)
}, [localDocs, selectedIds])
/**
* Generate a random ZIP filename for bulk document downloads.
* We intentionally avoid leaking dataset info in the exported archive name.
*/
const generateDocsZipFileName = useCallback((): string => {
// Prefer UUID for uniqueness; fall back to time+random when unavailable.
const randomPart = (typeof crypto !== 'undefined' && typeof crypto.randomUUID === 'function')
? crypto.randomUUID()
: `${Date.now().toString(36)}${Math.random().toString(36).slice(2, 10)}`
return `${randomPart}-docs.zip`
}, [])
const handleBatchDownload = useCallback(async () => {
if (isDownloadingZip)
return
// Download as a single ZIP to avoid browser caps on multiple automatic downloads.
const [e, blob] = await asyncRunSafe(requestDocumentsZip({ datasetId, documentIds: downloadableSelectedIds }))
if (e || !blob) {
Toast.notify({ type: 'error', message: t('actionMsg.downloadUnsuccessfully', { ns: 'common' }) })
return
}
downloadBlob({ data: blob, fileName: generateDocsZipFileName() })
}, [datasetId, downloadableSelectedIds, generateDocsZipFileName, isDownloadingZip, requestDocumentsZip, t])
return (
<div className="relative mt-3 flex h-full w-full flex-col">
<div className="relative h-0 grow overflow-x-auto">
@@ -463,6 +498,7 @@
onArchive={handleAction(DocumentActionType.archive)}
onBatchEnable={handleAction(DocumentActionType.enable)}
onBatchDisable={handleAction(DocumentActionType.disable)}
onBatchDownload={downloadableSelectedIds.length > 0 ? handleBatchDownload : undefined}
onBatchDelete={handleAction(DocumentActionType.delete)}
onEditMetadata={showEditModal}
onBatchReIndex={hasErrorDocumentsSelected ? handleBatchReIndex : undefined}

View File

@@ -1,8 +1,10 @@
import type { OperationName } from '../types'
import type { CommonResponse } from '@/models/common'
import type { DocumentDownloadResponse } from '@/service/datasets'
import {
RiArchive2Line,
RiDeleteBinLine,
RiDownload2Line,
RiEditLine,
RiEqualizer2Line,
RiLoopLeftLine,
@@ -28,6 +30,7 @@ import {
useDocumentArchive,
useDocumentDelete,
useDocumentDisable,
useDocumentDownload,
useDocumentEnable,
useDocumentPause,
useDocumentResume,
@@ -37,6 +40,7 @@
} from '@/service/knowledge/use-document'
import { asyncRunSafe } from '@/utils'
import { cn } from '@/utils/classnames'
import { downloadUrl } from '@/utils/download'
import s from '../style.module.css'
import RenameModal from './rename-modal'
@@ -69,7 +73,7 @@
scene = 'list',
className = '',
}: OperationsProps) => {
const { id, enabled = false, archived = false, data_source_type, display_status } = detail || {}
const { id, name, enabled = false, archived = false, data_source_type, display_status } = detail || {}
const [showModal, setShowModal] = useState(false)
const [deleting, setDeleting] = useState(false)
const { notify } = useContext(ToastContext)
@@ -80,6 +84,7 @@
const { mutateAsync: enableDocument } = useDocumentEnable()
const { mutateAsync: disableDocument } = useDocumentDisable()
const { mutateAsync: deleteDocument } = useDocumentDelete()
const { mutateAsync: downloadDocument, isPending: isDownloading } = useDocumentDownload()
const { mutateAsync: syncDocument } = useSyncDocument()
const { mutateAsync: syncWebsite } = useSyncWebsite()
const { mutateAsync: pauseDocument } = useDocumentPause()
@@ -158,6 +163,24 @@
onUpdate()
}, [onUpdate])
const handleDownload = useCallback(async () => {
// Avoid repeated clicks while the signed URL request is in-flight.
if (isDownloading)
return
// Request a signed URL first (it points to `/files/<id>/file-preview?...&as_attachment=true`).
const [e, res] = await asyncRunSafe<DocumentDownloadResponse>(
downloadDocument({ datasetId, documentId: id }) as Promise<DocumentDownloadResponse>,
)
if (e || !res?.url) {
notify({ type: 'error', message: t('actionMsg.downloadUnsuccessfully', { ns: 'common' }) })
return
}
// Trigger download without navigating away (helps avoid duplicate downloads in some browsers).
downloadUrl({ url: res.url, fileName: name })
}, [datasetId, downloadDocument, id, isDownloading, name, notify, t])
return (
<div className="flex items-center" onClick={e => e.stopPropagation()}>
{isListScene && !embeddingAvailable && (
@@ -214,6 +237,20 @@
<RiEditLine className="h-4 w-4 text-text-tertiary" />
<span className={s.actionName}>{t('list.table.rename', { ns: 'datasetDocuments' })}</span>
</div>
{data_source_type === DataSourceType.FILE && (
<div
className={s.actionItem}
onClick={(evt) => {
evt.preventDefault()
evt.stopPropagation()
evt.nativeEvent.stopImmediatePropagation?.()
handleDownload()
}}
>
<RiDownload2Line className="h-4 w-4 text-text-tertiary" />
<span className={s.actionName}>{t('list.action.download', { ns: 'datasetDocuments' })}</span>
</div>
)}
{['notion_import', DataSourceType.WEB].includes(data_source_type) && (
<div className={s.actionItem} onClick={() => onOperate('sync')}>
<RiLoopLeftLine className="h-4 w-4 text-text-tertiary" />
@@ -223,6 +260,23 @@
<Divider className="my-1" />
</>
)}
{archived && data_source_type === DataSourceType.FILE && (
<>
<div
className={s.actionItem}
onClick={(evt) => {
evt.preventDefault()
evt.stopPropagation()
evt.nativeEvent.stopImmediatePropagation?.()
handleDownload()
}}
>
<RiDownload2Line className="h-4 w-4 text-text-tertiary" />
<span className={s.actionName}>{t('list.action.download', { ns: 'datasetDocuments' })}</span>
</div>
<Divider className="my-1" />
</>
)}
{!archived && display_status?.toLowerCase() === 'indexing' && (
<div className={s.actionItem} onClick={() => onOperate('pause')}>
<RiPauseCircleLine className="h-4 w-4 text-text-tertiary" />

View File

@@ -1,5 +1,5 @@
import type { FC } from 'react'
import { RiArchive2Line, RiCheckboxCircleLine, RiCloseCircleLine, RiDeleteBinLine, RiDraftLine, RiRefreshLine } from '@remixicon/react'
import { RiArchive2Line, RiCheckboxCircleLine, RiCloseCircleLine, RiDeleteBinLine, RiDownload2Line, RiDraftLine, RiRefreshLine } from '@remixicon/react'
import { useBoolean } from 'ahooks'
import * as React from 'react'
import { useTranslation } from 'react-i18next'
@@ -14,6 +14,7 @@ type IBatchActionProps = {
selectedIds: string[]
onBatchEnable: () => void
onBatchDisable: () => void
onBatchDownload?: () => void
onBatchDelete: () => Promise<void>
onArchive?: () => void
onEditMetadata?: () => void
@@ -26,6 +27,7 @@ const BatchAction: FC<IBatchActionProps> = ({
selectedIds,
onBatchEnable,
onBatchDisable,
onBatchDownload,
onArchive,
onBatchDelete,
onEditMetadata,
@@ -103,6 +105,16 @@
<span className="px-0.5">{t(`${i18nPrefix}.reIndex`, { ns: 'dataset' })}</span>
</Button>
)}
{onBatchDownload && (
<Button
variant="ghost"
className="gap-x-0.5 px-3"
onClick={onBatchDownload}
>
<RiDownload2Line className="size-4" />
<span className="px-0.5">{t(`${i18nPrefix}.download`, { ns: 'dataset' })}</span>
</Button>
)}
<Button
variant="ghost"
destructive

View File

@@ -61,6 +61,7 @@
"account.workspaceName": "Workspace Name",
"account.workspaceNamePlaceholder": "Enter workspace name",
"actionMsg.copySuccessfully": "Copied successfully",
"actionMsg.downloadUnsuccessfully": "Download failed. Please try again later.",
"actionMsg.generatedSuccessfully": "Generated successfully",
"actionMsg.generatedUnsuccessfully": "Generated unsuccessfully",
"actionMsg.modifiedSuccessfully": "Modified successfully",

View File

@@ -26,6 +26,7 @@
"list.action.archive": "Archive",
"list.action.batchAdd": "Batch add",
"list.action.delete": "Delete",
"list.action.download": "Download",
"list.action.enableWarning": "Archived file cannot be enabled",
"list.action.pause": "Pause",
"list.action.resume": "Resume",

View File

@@ -7,6 +7,7 @@
"batchAction.cancel": "Cancel",
"batchAction.delete": "Delete",
"batchAction.disable": "Disable",
"batchAction.download": "Download",
"batchAction.enable": "Enable",
"batchAction.reIndex": "Re-index",
"batchAction.selected": "Selected",

View File

@@ -40,6 +40,15 @@ type CommonDocReq = {
documentId: string
}
export type DocumentDownloadResponse = {
url: string
}
export type DocumentDownloadZipRequest = {
datasetId: string
documentIds: string[]
}
type BatchReq = {
datasetId: string
batchId: string
@@ -158,6 +167,18 @@ export const resumeDocIndexing = ({ datasetId, documentId }: CommonDocReq): Prom
return patch<CommonResponse>(`/datasets/${datasetId}/documents/${documentId}/processing/resume`)
}
export const fetchDocumentDownloadUrl = ({ datasetId, documentId }: CommonDocReq): Promise<DocumentDownloadResponse> => {
return get<DocumentDownloadResponse>(`/datasets/${datasetId}/documents/${documentId}/download`, {})
}
export const downloadDocumentsZip = ({ datasetId, documentIds }: DocumentDownloadZipRequest): Promise<Blob> => {
return post<Blob>(`/datasets/${datasetId}/documents/download-zip`, {
body: {
document_ids: documentIds,
},
})
}
export const preImportNotionPages = ({ url, datasetId }: { url: string, datasetId?: string }): Promise<{ notion_info: DataSourceNotionWorkspace[] }> => {
return get<{ notion_info: DataSourceNotionWorkspace[] }>(url, { params: { dataset_id: datasetId } })
}

View File

@@ -1,4 +1,4 @@
import type { MetadataType, SortType } from '../datasets'
import type { DocumentDownloadResponse, DocumentDownloadZipRequest, MetadataType, SortType } from '../datasets'
import type { CommonResponse } from '@/models/common'
import type { DocumentDetailResponse, DocumentListResponse, UpdateDocumentBatchParams } from '@/models/datasets'
import {
@@ -8,7 +8,7 @@ import {
import { normalizeStatusForQuery } from '@/app/components/datasets/documents/status-filter'
import { DocumentActionType } from '@/models/datasets'
import { del, get, patch, post } from '../base'
import { pauseDocIndexing, resumeDocIndexing } from '../datasets'
import { downloadDocumentsZip, fetchDocumentDownloadUrl, pauseDocIndexing, resumeDocIndexing } from '../datasets'
import { useInvalid } from '../use-base'
const NAME_SPACE = 'knowledge/document'
@@ -164,6 +164,26 @@ export const useDocumentResume = () => {
})
}
export const useDocumentDownload = () => {
return useMutation({
mutationFn: ({ datasetId, documentId }: UpdateDocumentBatchParams) => {
if (!datasetId || !documentId)
throw new Error('datasetId and documentId are required')
return fetchDocumentDownloadUrl({ datasetId, documentId }) as Promise<DocumentDownloadResponse>
},
})
}
export const useDocumentDownloadZip = () => {
return useMutation({
mutationFn: ({ datasetId, documentIds }: DocumentDownloadZipRequest) => {
if (!datasetId || !documentIds?.length)
throw new Error('datasetId and documentIds are required')
return downloadDocumentsZip({ datasetId, documentIds })
},
})
}
export const useDocumentBatchRetryIndex = () => {
return useMutation({
mutationFn: ({ datasetId, documentIds }: { datasetId: string, documentIds: string[] }) => {

web/utils/download.ts
View File

@@ -0,0 +1,34 @@
export type DownloadUrlOptions = {
url: string
fileName?: string
rel?: string
target?: string
}
const triggerDownload = ({ url, fileName, rel, target }: DownloadUrlOptions) => {
if (!url)
return
const anchor = document.createElement('a')
anchor.href = url
if (fileName)
anchor.download = fileName
if (rel)
anchor.rel = rel
if (target)
anchor.target = target
anchor.style.display = 'none'
document.body.appendChild(anchor)
anchor.click()
anchor.remove()
}
export const downloadUrl = ({ url, fileName, rel = 'noopener noreferrer', target }: DownloadUrlOptions) => {
triggerDownload({ url, fileName, rel, target })
}
export const downloadBlob = ({ data, fileName }: { data: Blob, fileName: string }) => {
const url = window.URL.createObjectURL(data)
triggerDownload({ url, fileName, rel: 'noopener noreferrer' })
window.URL.revokeObjectURL(url)
}