dify/web/app/(commonLayout)/datasets/template/template.zh.mdx
zxhlyh 9dbb8acd4b
Feat/dataset support api service (#1240)
Co-authored-by: Joel <iamjoel007@gmail.com>
Co-authored-by: crazywoola <427733928@qq.com>
2023-09-27 16:06:49 +08:00

793 lines
25 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

import { CodeGroup } from '@/app/components/develop/code.tsx'
import { Row, Col, Properties, Property, Heading, SubProperty, Paragraph } from '@/app/components/develop/md.tsx'
# 数据集 API
<br/>
<br/>
<Heading
url='/datasets'
method='POST'
title='创建空数据集'
name='#create_empty_dataset'
/>
<Row>
<Col>
### Request Body
<Properties>
<Property name='name' type='string' key='name'>
数据集名称
</Property>
</Properties>
</Col>
<Col sticky>
<CodeGroup
title="Request"
tag="POST"
label="/datasets"
targetCode={`curl --location --request POST '${props.apiBaseUrl}/datasets' \\\n--header 'Authorization: Bearer {api_key}' \\\n--header 'Content-Type: application/json' \\\n--data-raw '{"name": "name"}'`}
>
```bash {{ title: 'cURL' }}
curl --location --request POST 'https://api.dify.ai/v1/datasets' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
"name": "name"
}'
```
</CodeGroup>
<CodeGroup title="Response">
```json {{ title: 'Response' }}
{
"id": "",
"name": "name",
"description": null,
"provider": "vendor",
"permission": "only_me",
"data_source_type": null,
"indexing_technique": null,
"app_count": 0,
"document_count": 0,
"word_count": 0,
"created_by": "",
"created_at": 1695636173,
"updated_by": "",
"updated_at": 1695636173,
"embedding_model": null,
"embedding_model_provider": null,
"embedding_available": null
}
```
</CodeGroup>
</Col>
</Row>
---
<Heading
url='/datasets'
method='GET'
title='数据集列表'
name='#dataset_list'
/>
<Row>
<Col>
### Path Query
<Properties>
<Property name='page' type='string' key='page'>
页码
</Property>
<Property name='limit' type='string' key='limit'>
返回条数,默认 20范围 1-100
</Property>
</Properties>
</Col>
<Col sticky>
<CodeGroup
title="Request"
tag="POST"
label="/datasets"
targetCode={`curl --location --request GET '${props.apiBaseUrl}/datasets?page=1&limit=20' \\\n--header 'Authorization: Bearer {api_key}'`}
>
```bash {{ title: 'cURL' }}
curl --location --request GET 'https://api.dify.ai/v1/datasets?page=1&limit=20' \
--header 'Authorization: Bearer {api_key}'
```
</CodeGroup>
<CodeGroup title="Response">
```json {{ title: 'Response' }}
{
"data": [
{
"id": "",
"name": "数据集名称",
"description": "描述信息",
"permission": "only_me",
"data_source_type": "upload_file",
"indexing_technique": "",
"app_count": 2,
"document_count": 10,
"word_count": 1200,
"created_by": "",
"created_at": "",
"updated_by": "",
"updated_at": ""
},
...
],
"has_more": true,
"limit": 20,
"total": 50,
"page": 1
}
```
</CodeGroup>
</Col>
</Row>
---
<Heading
url='/datasets/{dataset_id}/document/create_by_text'
method='POST'
title='通过文本创建文档'
name='#create_by_text'
/>
<Row>
<Col>
此接口基于已存在数据集,在此数据集的基础上通过文本创建新的文档
### Path Params
<Properties>
<Property name='dataset_id' type='string' key='dataset_id'>
数据集 ID
</Property>
</Properties>
### Request Body
<Properties>
<Property name='name' type='string' key='name'>
文档名称
</Property>
<Property name='text' type='string' key='text'>
文档内容
</Property>
<Property name='indexing_technique' type='string' key='indexing_technique'>
索引方式
- high_quality 高质量:使用 embedding 模型进行嵌入,构建为向量数据库索引
- economy 经济:使用 Keyword Table Index 的倒排索引进行构建
</Property>
<Property name='process_rule' type='object' key='process_rule'>
处理规则
- mode (string) 清洗、分段模式 automatic 自动 / custom 自定义
- rules (text) 自定义规则(自动模式下,该字段为空)
- pre_processing_rules (array[object]) 预处理规则
- id (string) 预处理规则的唯一标识符
- 枚举:
- remove_extra_spaces 替换连续空格、换行符、制表符
- remove_urls_emails 删除 URL、电子邮件地址
- enabled (bool) 是否选中该规则,不传入文档 ID 时代表默认值
- segmentation (object) 分段规则
- separator 自定义分段标识符,目前仅允许设置一个分隔符。默认为 \n
- max_tokens 最大长度 (token) 默认为 1000
</Property>
</Properties>
</Col>
<Col sticky>
<CodeGroup
title="Request"
tag="POST"
label="/datasets/{dataset_id}/document/create_by_text"
targetCode={`curl --location --request POST '${props.apiBaseUrl}/datasets/{dataset_id}/document/create_by_text' \\\n--header 'Authorization: Bearer {api_key}' \\\n--header 'Content-Type: application/json' \\\n--data-raw '{"name": "text","text": "text","indexing_technique": "high_quality","process_rule": {"mode": "automatic"}}'`}
>
```bash {{ title: 'cURL' }}
curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/document/create_by_text' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
"name": "text",
"text": "text",
"indexing_technique": "high_quality",
"process_rule": {
"mode": "automatic"
}
}'
```
</CodeGroup>
<CodeGroup title="Response">
```json {{ title: 'Response' }}
{
"document": {
"id": "",
"position": 1,
"data_source_type": "upload_file",
"data_source_info": {
"upload_file_id": ""
},
"dataset_process_rule_id": "",
"name": "text.txt",
"created_from": "api",
"created_by": "",
"created_at": 1695690280,
"tokens": 0,
"indexing_status": "waiting",
"error": null,
"enabled": true,
"disabled_at": null,
"disabled_by": null,
"archived": false,
"display_status": "queuing",
"word_count": 0,
"hit_count": 0,
"doc_form": "text_model"
},
"batch": ""
}
```
</CodeGroup>
</Col>
</Row>
---
<Heading
url='/datasets/{dataset_id}/document/create_by_file'
method='POST'
title='通过文件创建文档 '
name='#create_by_file'
/>
<Row>
<Col>
此接口基于已存在数据集,在此数据集的基础上通过文件创建新的文档
### Path Params
<Properties>
<Property name='dataset_id' type='string' key='dataset_id'>
数据集 ID
</Property>
</Properties>
### Request Body
<Properties>
<Property name='original_document_id' type='string' key='original_document_id'>
源文档 ID (选填)
- 用于重新上传文档或修改文档清洗、分段配置,缺失的信息从源文档复制
- 源文档不可为归档的文档
- 当传入 original_document_id 时代表文档进行更新操作process_rule 为可填项目,不填默认使用源文档的分段方式
- 未传入 original_document_id 时代表文档进行新增操作process_rule 为必填
</Property>
<Property name='file' type='multipart/form-data' key='file'>
需要上传的文件。
</Property>
<Property name='indexing_technique' type='string' key='indexing_technique'>
索引方式
- high_quality 高质量:使用 embedding 模型进行嵌入,构建为向量数据库索引
- economy 经济:使用 Keyword Table Index 的倒排索引进行构建
</Property>
<Property name='process_rule' type='object' key='process_rule'>
处理规则
- mode (string) 清洗、分段模式 automatic 自动 / custom 自定义。
- rules (text) 自定义规则(自动模式下,该字段为空)
- pre_processing_rules (array[object]) 预处理规则
- id (string) 预处理规则的唯一标识符
- 枚举:
- remove_extra_spaces 替换连续空格、换行符、制表符
- remove_urls_emails 删除 URL、电子邮件地址
- enabled (bool) 是否选中该规则,不传入文档 ID 时代表默认值。
- segmentation (object) 分段规则
- separator 自定义分段标识符,目前仅允许设置一个分隔符,默认为 \n
- max_tokens 最大长度 (token) 默认为 1000
</Property>
</Properties>
</Col>
<Col sticky>
<CodeGroup
title="Request"
tag="POST"
label="/datasets/{dataset_id}/document/create_by_file"
targetCode={`curl --location POST '${props.apiBaseUrl}/datasets/{dataset_id}/document/create_by_file' \\\n--header 'Authorization: Bearer {api_key}' \\\n--form 'data="{"name":"Dify","indexing_technique":"high_quality","process_rule":{"rules":{"pre_processing_rules":[{"id":"remove_extra_spaces","enabled":true},{"id":"remove_urls_emails","enabled":true}],"segmentation":{"separator":"###","max_tokens":500}},"mode":"custom"}}";type=text/plain' \\\n--form 'file=@"/path/to/file"'`}
>
```bash {{ title: 'cURL' }}
curl --location POST 'https://api.dify.ai/v1/datasets/{dataset_id}/document/create_by_file' \
--header 'Authorization: Bearer {api_key}' \
--form 'data="{\"name\":\"Dify\",\"indexing_technique\":\"high_quality\",\"process_rule\":{\"rules\":{\"pre_processing_rules\":[{\"id\":\"remove_extra_spaces\",\"enabled\":true},{\"id\":\"remove_urls_emails\",\"enabled\":true}],\"segmentation\":{\"separator\":\"###\",\"max_tokens\":500}},\"mode\":\"custom\"}}";type=text/plain' \
--form 'file=@"/path/to/file"'
```
</CodeGroup>
<CodeGroup title="Response">
```json {{ title: 'Response' }}
{
"document": {
"id": "",
"position": 1,
"data_source_type": "upload_file",
"data_source_info": {
"upload_file_id": ""
},
"dataset_process_rule_id": "",
"name": "Dify.txt",
"created_from": "api",
"created_by": "",
"created_at": 1695308667,
"tokens": 0,
"indexing_status": "waiting",
"error": null,
"enabled": true,
"disabled_at": null,
"disabled_by": null,
"archived": false,
"display_status": "queuing",
"word_count": 0,
"hit_count": 0,
"doc_form": "text_model"
},
"batch": ""
}
```
</CodeGroup>
</Col>
</Row>
---
<Heading
url='/datasets/{dataset_id}/documents/{document_id}/update_by_text'
method='POST'
title='通过文本更新文档 '
name='#update_by_text'
/>
<Row>
<Col>
此接口基于已存在数据集,在此数据集的基础上通过文本更新文档
### Path Params
<Properties>
<Property name='dataset_id' type='string' key='dataset_id'>
数据集 ID
</Property>
<Property name='document_id' type='string' key='document_id'>
文档 ID
</Property>
</Properties>
### Request Body
<Properties>
<Property name='name' type='string' key='name'>
文档名称 (选填)
</Property>
<Property name='text' type='string' key='text'>
文档内容(选填)
</Property>
<Property name='process_rule' type='object' key='process_rule'>
处理规则(选填)
- mode (string) 清洗、分段模式 automatic 自动 / custom 自定义。
- rules (text) 自定义规则(自动模式下,该字段为空)
- pre_processing_rules (array[object]) 预处理规则
- id (string) 预处理规则的唯一标识符
- 枚举:
- remove_extra_spaces 替换连续空格、换行符、制表符
- remove_urls_emails 删除 URL、电子邮件地址
- enabled (bool) 是否选中该规则,不传入文档 ID 时代表默认值。
- segmentation (object) 分段规则
- separator 自定义分段标识符,目前仅允许设置一个分隔符。默认为 \n
- max_tokens 最大长度 (token) 默认为 1000
</Property>
</Properties>
</Col>
<Col sticky>
<CodeGroup
title="Request"
tag="POST"
label="/datasets/{dataset_id}/documents/{document_id}/update_by_text"
targetCode={`curl --location --request POST '${props.apiBaseUrl}/datasets/{dataset_id}/documents/{document_id}/update_by_text' \\\n--header 'Authorization: Bearer {api_key}' \\\n--header 'Content-Type: application/json' \\\n--data-raw '{"name": "name","text": "text"}'`}
>
```bash {{ title: 'cURL' }}
curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/update_by_text' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
"name": "name",
"text": "text"
}'
```
</CodeGroup>
<CodeGroup title="Response">
```json {{ title: 'Response' }}
{
"document": {
"id": "",
"position": 1,
"data_source_type": "upload_file",
"data_source_info": {
"upload_file_id": ""
},
"dataset_process_rule_id": "",
"name": "name.txt",
"created_from": "api",
"created_by": "",
"created_at": 1695308667,
"tokens": 0,
"indexing_status": "waiting",
"error": null,
"enabled": true,
"disabled_at": null,
"disabled_by": null,
"archived": false,
"display_status": "queuing",
"word_count": 0,
"hit_count": 0,
"doc_form": "text_model"
},
"batch": ""
}
```
</CodeGroup>
</Col>
</Row>
---
<Heading
url='/datasets/{dataset_id}/documents/{document_id}/update_by_file'
method='POST'
title='通过文件更新文档 '
name='#update_by_file'
/>
<Row>
<Col>
此接口基于已存在数据集,在此数据集的基础上通过文件更新文档的操作。
### Path Params
<Properties>
<Property name='dataset_id' type='string' key='dataset_id'>
数据集 ID
</Property>
<Property name='document_id' type='string' key='document_id'>
文档 ID
</Property>
</Properties>
### Request Body
<Properties>
<Property name='name' type='string' key='name'>
文档名称 (选填)
</Property>
<Property name='file' type='multipart/form-data' key='file'>
需要上传的文件
</Property>
<Property name='process_rule' type='object' key='process_rule'>
处理规则(选填)
- mode (string) 清洗、分段模式 automatic 自动 / custom 自定义。
- rules (text) 自定义规则(自动模式下,该字段为空)
- pre_processing_rules (array[object]) 预处理规则
- id (string) 预处理规则的唯一标识符
- 枚举:
- remove_extra_spaces 替换连续空格、换行符、制表符
- remove_urls_emails 删除 URL、电子邮件地址
- enabled (bool) 是否选中该规则,不传入文档 ID 时代表默认值
- segmentation (object) 分段规则
- separator 自定义分段标识符,目前仅允许设置一个分隔符,默认为 \n
- max_tokens 最大长度 (token) 默认为 1000
</Property>
</Properties>
</Col>
<Col sticky>
<CodeGroup
title="Request"
tag="POST"
label="/datasets/{dataset_id}/documents/{document_id}/update_by_file"
targetCode={`curl --location POST '${props.apiBaseUrl}/datasets/{dataset_id}/document/{document_id}/create_by_file' \\\n--header 'Authorization: Bearer {api_key}' \\\n--form 'data="{"name":"Dify","indexing_technique":"high_quality","process_rule":{"rules":{"pre_processing_rules":[{"id":"remove_extra_spaces","enabled":true},{"id":"remove_urls_emails","enabled":true}],"segmentation":{"separator":"###","max_tokens":500}},"mode":"custom"}}";type=text/plain' \\\n--form 'file=@"/path/to/file"'`}
>
```bash {{ title: 'cURL' }}
curl --location POST 'https://api.dify.ai/v1/datasets/{dataset_id}/document/{document_id}/create_by_file' \
--header 'Authorization: Bearer {api_key}' \
--form 'data="{\"name\":\"Dify\",\"indexing_technique\":\"high_quality\",\"process_rule\":{\"rules\":{\"pre_processing_rules\":[{\"id\":\"remove_extra_spaces\",\"enabled\":true},{\"id\":\"remove_urls_emails\",\"enabled\":true}],\"segmentation\":{\"separator\":\"###\",\"max_tokens\":500}},\"mode\":\"custom\"}}";type=text/plain' \
--form 'file=@"/path/to/file"'
```
</CodeGroup>
<CodeGroup title="Response">
```json {{ title: 'Response' }}
{
"document": {
"id": "",
"position": 1,
"data_source_type": "upload_file",
"data_source_info": {
"upload_file_id": ""
},
"dataset_process_rule_id": "",
"name": "Dify.txt",
"created_from": "api",
"created_by": "",
"created_at": 1695308667,
"tokens": 0,
"indexing_status": "waiting",
"error": null,
"enabled": true,
"disabled_at": null,
"disabled_by": null,
"archived": false,
"display_status": "queuing",
"word_count": 0,
"hit_count": 0,
"doc_form": "text_model"
},
"batch": "20230921150427533684"
}
```
</CodeGroup>
</Col>
</Row>
---
<Heading
url='/datasets/{dataset_id}/batch/{batch}/indexing-status'
method='GET'
title='获取文档嵌入状态(进度)'
name='#indexing_status'
/>
<Row>
<Col>
### Path Params
<Properties>
<Property name='dataset_id' type='string' key='dataset_id'>
数据集 ID
</Property>
<Property name='batch' type='string' key='batch'>
上传文档的批次号
</Property>
</Properties>
</Col>
<Col sticky>
<CodeGroup
title="Request"
tag="GET"
label="/datasets/{dataset_id}/batch/{batch}/indexing-status"
targetCode={`curl --location --request GET '${props.apiBaseUrl}/datasets/{dataset_id}/documents/{batch}/indexing-status' \\\n--header 'Authorization: Bearer {api_key}'`}
>
```bash {{ title: 'cURL' }}
curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{batch}/indexing-status' \
--header 'Authorization: Bearer {api_key}' \
```
</CodeGroup>
<CodeGroup title="Response">
```json {{ title: 'Response' }}
{
"data":[{
"id": "",
"indexing_status": "indexing",
"processing_started_at": 1681623462.0,
"parsing_completed_at": 1681623462.0,
"cleaning_completed_at": 1681623462.0,
"splitting_completed_at": 1681623462.0,
"completed_at": null,
"paused_at": null,
"error": null,
"stopped_at": null,
"completed_segments": 24,
"total_segments": 100
}]
}
```
</CodeGroup>
</Col>
</Row>
---
<Heading
url='/datasets/{dataset_id}/documents/{document_id}'
method='DELETE'
title='删除文档'
name='#delete_document'
/>
<Row>
<Col>
### Path Params
<Properties>
<Property name='dataset_id' type='string' key='dataset_id'>
数据集 ID
</Property>
<Property name='document_id' type='string' key='document_id'>
文档 ID
</Property>
</Properties>
</Col>
<Col sticky>
<CodeGroup
title="Request"
tag="DELETE"
label="/datasets/{dataset_id}/documents/{document_id}"
targetCode={`curl --location --request DELETE '${props.apiBaseUrl}/datasets/{dataset_id}/documents/{document_id}' \\\n--header 'Authorization: Bearer {api_key}'`}
>
```bash {{ title: 'cURL' }}
curl --location --request DELETE 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}' \
--header 'Authorization: Bearer {api_key}' \
```
</CodeGroup>
<CodeGroup title="Response">
```json {{ title: 'Response' }}
{
"result": "success"
}
```
</CodeGroup>
</Col>
</Row>
---
<Heading
url='/datasets/{dataset_id}/documents'
method='GET'
title='数据集文档列表'
name='#dataset_document_list'
/>
<Row>
<Col>
### Path Params
<Properties>
<Property name='dataset_id' type='string' key='dataset_id'>
数据集 ID
</Property>
</Properties>
### Path Query
<Properties>
<Property name='keyword' type='string' key='keyword'>
搜索关键词,可选,目前仅搜索文档名称
</Property>
<Property name='page' type='string' key='page'>
页码,可选
</Property>
<Property name='limit' type='string' key='limit'>
返回条数,可选,默认 20范围 1-100
</Property>
</Properties>
</Col>
<Col sticky>
<CodeGroup
title="Request"
tag="GET"
label="/datasets/{dataset_id}/documents"
targetCode={`curl --location --request GET '${props.apiBaseUrl}/datasets/{dataset_id}/documents' \\\n--header 'Authorization: Bearer {api_key}'`}
>
```bash {{ title: 'cURL' }}
curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents' \
--header 'Authorization: Bearer {api_key}' \
```
</CodeGroup>
<CodeGroup title="Response">
```json {{ title: 'Response' }}
{
"data": [
{
"id": "",
"position": 1,
"data_source_type": "file_upload",
"data_source_info": null,
"dataset_process_rule_id": null,
"name": "dify",
"created_from": "",
"created_by": "",
"created_at": 1681623639,
"tokens": 0,
"indexing_status": "waiting",
"error": null,
"enabled": true,
"disabled_at": null,
"disabled_by": null,
"archived": false
},
],
"has_more": false,
"limit": 20,
"total": 9,
"page": 1
}
```
</CodeGroup>
</Col>
</Row>
---
<Heading
url='/datasets/{dataset_id}/documents/{document_id}/segments'
method='POST'
title='新增分段'
name='#create_new_segment'
/>
<Row>
<Col>
### Path Params
<Properties>
<Property name='dataset_id' type='string' key='dataset_id'>
数据集 ID
</Property>
<Property name='document_id' type='string' key='document_id'>
文档 ID
</Property>
</Properties>
### Request Body
<Properties>
<Property name='segments' type='object list' key='segments'>
segments (object list) 分段内容
- content (text) 文本内容/问题内容,必填
- answer(text) 答案内容非必填如果数据集的模式为qa模式则传值
- keywords(list) 关键字,非必填
</Property>
</Properties>
</Col>
<Col sticky>
<CodeGroup
title="Request"
tag="POST"
label="/datasets/{dataset_id}/documents/{document_id}/segments"
targetCode={`curl --location --request POST '${props.apiBaseUrl}/datasets/{dataset_id}/documents/{document_id}/segments' \\\n--header 'Authorization: Bearer {api_key}' \\\n--header 'Content-Type: application/json' \\\n--data-raw '{"segments": [{"content": "1","answer": "1","keywords": ["a"]}]}'`}
>
```bash {{ title: 'cURL' }}
curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/segments' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
"segments": [
{
"content": "1",
"answer": "1",
"keywords": ["a"]
}
]
}'
```
</CodeGroup>
<CodeGroup title="Response">
```json {{ title: 'Response' }}
{
"data": [{
"id": "",
"position": 1,
"document_id": "",
"content": "1",
"answer": "1",
"word_count": 25,
"tokens": 0,
"keywords": [
"a"
],
"index_node_id": "",
"index_node_hash": "",
"hit_count": 0,
"enabled": true,
"disabled_at": null,
"disabled_by": null,
"status": "completed",
"created_by": "",
"created_at": 1695312007,
"indexing_at": 1695312007,
"completed_at": 1695312007,
"error": null,
"stopped_at": null
}],
"doc_form": "text_model"
}
```
</CodeGroup>
</Col>
</Row>
---
错误信息
- **document_indexing**: 文档索引失败
- **provider_not_initialize**: Embedding 模型未配置
- **not_found**,文档不存在
- **dataset_name_duplicate**: 数据集名称重复
- **provider_quota_exceeded**: 模型额度超过限制
- **dataset_not_initialized**: 数据集还未初始化
- **unsupported_file_type**: 不支持的文件类型
- 目前只支持txt, markdown, md, pdf, html, htm, xlsx, docx, csv
- **too_many_files**: 文件数量过多,暂时只支持单一文件上传
- **file_too_large*: 文件太大默认支持15M以下, 具体需要参考环境变量配置