+ - book Book
+ - web_page Web page
+ - paper Academic paper/article
+ - social_media_post Social media post
+ - wikipedia_entry Wikipedia entry
+ - personal_document Personal document
+ - business_document Business document
+ - im_chat_log Chat log
+ - synced_from_notion Notion document
+ - synced_from_github GitHub document
+ - others Other document types
+ For book:
+ - title Book title
+ - language Book language
+ - author Book author
+ - publisher Publisher name
+ - publication_date Publication date
+ - isbn ISBN number
+ - category Book category
+
+ For web_page:
+ - title Page title
+ - url Page URL
+ - language Page language
+ - publish_date Publish date
+ - author/publisher Author or publisher
+ - topic/keywords Topic or keywords
+ - description Page description
+
+ Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
+
+ For doc_type "others", any valid JSON object is accepted
+ high_quality High quality: embedding using embedding model, built as vector database index
@@ -195,6 +233,68 @@ import { Row, Col, Properties, Property, Heading, SubProperty, PropertyInstructi
- hierarchical_model Parent-child mode
- qa_model Q&A Mode: Generates Q&A pairs for segmented documents and then embeds the questions
+ - doc_type Type of document (optional)
+ - book Book
+ Document records a book or publication
+ - web_page Web page
+ Document records web page content
+ - paper Academic paper/article
+ Document records an academic paper or research article
+ - social_media_post Social media post
+ Content from social media posts
+ - wikipedia_entry Wikipedia entry
+ Content from Wikipedia entries
+ - personal_document Personal document
+ Documents related to personal content
+ - business_document Business document
+ Documents related to business content
+ - im_chat_log Chat log
+ Records of instant messaging chats
+ - synced_from_notion Notion document
+ Documents synchronized from Notion
+ - synced_from_github GitHub document
+ Documents synchronized from GitHub
+ - others Other document types
+ Other document types not listed above
+
+ - doc_metadata Document metadata (required if doc_type is provided)
+ Fields vary by doc_type:
+
+ For book:
+ - title Book title
+ Title of the book
+ - language Book language
+ Language of the book
+ - author Book author
+ Author of the book
+ - publisher Publisher name
+ Name of the publishing house
+ - publication_date Publication date
+ Date when the book was published
+ - isbn ISBN number
+ International Standard Book Number
+ - category Book category
+ Category or genre of the book
+
+ For web_page:
+ - title Page title
+ Title of the web page
+ - url Page URL
+ URL address of the web page
+ - language Page language
+ Language of the web page
+ - publish_date Publish date
+ Date when the web page was published
+ - author/publisher Author or publisher
+ Author or publisher of the web page
+ - topic/keywords Topic or keywords
+ Topics or keywords of the web page
+ - description Page description
+ Description of the web page content
+
+ Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
+ For doc_type "others", any valid JSON object is accepted
+
- doc_language In Q&A mode, specify the language of the document, for example: English, Chinese
- process_rule Processing rules
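The pairing of `doc_type` and `doc_metadata` described in this hunk can be sketched as a small client-side check. This is only an illustration: `DOC_METADATA_FIELDS` and `check_doc_metadata` are hypothetical names, the field lists are copied from the `book` and `web_page` lists above, and the authoritative schema is the one in `api/services/dataset_service.py`. The sketch flags unknown field names only; whether the server treats every listed field as mandatory is not stated here.

```python
from typing import Any, Dict, List, Set

# Field names copied from the lists above (assumption: other doc_types have
# their own lists, which live in api/services/dataset_service.py).
DOC_METADATA_FIELDS: Dict[str, Set[str]] = {
    "book": {"title", "language", "author", "publisher",
             "publication_date", "isbn", "category"},
    "web_page": {"title", "url", "language", "publish_date",
                 "author/publisher", "topic/keywords", "description"},
}


def check_doc_metadata(doc_type: str, doc_metadata: Any) -> List[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    if not isinstance(doc_metadata, dict):
        return ["doc_metadata must be a JSON object"]
    if doc_type == "others":
        return []  # any valid JSON object is accepted for "others"
    known = DOC_METADATA_FIELDS.get(doc_type)
    if known is None:
        return ["no field list recorded here for doc_type %r" % doc_type]
    # Flag keys that do not appear in the documented list for this doc_type.
    unknown = sorted(set(doc_metadata) - known)
    return ["unexpected field: %s" % name for name in unknown]
```

Running the check before sending a request gives a quick local signal, but the server-side validation remains the source of truth.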
@@ -307,6 +407,44 @@ import { Row, Col, Properties, Property, Heading, SubProperty, PropertyInstructi
+ - book Book
+ - web_page Web page
+ - paper Academic paper/article
+ - social_media_post Social media post
+ - wikipedia_entry Wikipedia entry
+ - personal_document Personal document
+ - business_document Business document
+ - im_chat_log Chat log
+ - synced_from_notion Notion document
+ - synced_from_github GitHub document
+ - others Other document types
+ For book:
+ - title Book title
+ - language Book language
+ - author Book author
+ - publisher Publisher name
+ - publication_date Publication date
+ - isbn ISBN number
+ - category Book category
+
+ For web_page:
+ - title Page title
+ - url Page URL
+ - language Page language
+ - publish_date Publish date
+ - author/publisher Author or publisher
+ - topic/keywords Topic or keywords
+ - description Page description
+
+ Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
+
+ For doc_type "others", any valid JSON object is accepted
+ high_quality High quality
@@ -624,6 +762,67 @@ import { Row, Col, Properties, Property, Heading, SubProperty, PropertyInstructi
- separator Segmentation identifier. Currently, only one delimiter is allowed. The default is ***
- max_tokens The maximum length (tokens) must be validated to be shorter than the length of the parent chunk
- chunk_overlap Define the overlap between adjacent chunks (optional)
+ - doc_type Type of document (optional)
+ - book Book
+ Document records a book or publication
+ - web_page Web page
+ Document records web page content
+ - paper Academic paper/article
+ Document records an academic paper or research article
+ - social_media_post Social media post
+ Content from social media posts
+ - wikipedia_entry Wikipedia entry
+ Content from Wikipedia entries
+ - personal_document Personal document
+ Documents related to personal content
+ - business_document Business document
+ Documents related to business content
+ - im_chat_log Chat log
+ Records of instant messaging chats
+ - synced_from_notion Notion document
+ Documents synchronized from Notion
+ - synced_from_github GitHub document
+ Documents synchronized from GitHub
+ - others Other document types
+ Other document types not listed above
+
+ - doc_metadata Document metadata (required if doc_type is provided)
+ Fields vary by doc_type:
+
+ For book:
+ - title Book title
+ Title of the book
+ - language Book language
+ Language of the book
+ - author Book author
+ Author of the book
+ - publisher Publisher name
+ Name of the publishing house
+ - publication_date Publication date
+ Date when the book was published
+ - isbn ISBN number
+ International Standard Book Number
+ - category Book category
+ Category or genre of the book
+
+ For web_page:
+ - title Page title
+ Title of the web page
+ - url Page URL
+ URL address of the web page
+ - language Page language
+ Language of the web page
+ - publish_date Publish date
+ Date when the web page was published
+ - author/publisher Author or publisher
+ Author or publisher of the web page
+ - topic/keywords Topic or keywords
+ Topics or keywords of the web page
+ - description Page description
+ Description of the web page content
+
+ Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
+ For doc_type "others", any valid JSON object is accepted
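As a rough illustration of how the two new properties might travel together in a request body, the sketch below serializes a payload and enforces the "required if doc_type is provided" rule. The helper `build_document_body` is hypothetical, and the `name` and `text` keys are assumptions about the surrounding endpoint that this diff does not show; only `doc_type`, `doc_metadata`, and the `high_quality` value come from the text above.

```python
import json
from typing import Any, Dict, Optional


def build_document_body(name: str, text: str,
                        doc_type: Optional[str] = None,
                        doc_metadata: Optional[Dict[str, Any]] = None) -> str:
    """Serialize a request body; doc_metadata is required once doc_type is set."""
    if doc_type is not None and doc_metadata is None:
        raise ValueError("doc_metadata is required if doc_type is provided")
    # "name" and "text" are assumed keys, not taken from this diff.
    body: Dict[str, Any] = {
        "name": name,
        "text": text,
        "indexing_technique": "high_quality",
    }
    if doc_type is not None:
        body["doc_type"] = doc_type
        body["doc_metadata"] = doc_metadata
    # ensure_ascii=False keeps non-ASCII titles readable in the payload.
    return json.dumps(body, ensure_ascii=False)
```

A caller would pass, for example, `doc_type="web_page"` with a dictionary holding `title` and `url`; omitting both arguments yields a body without either property, which matches `doc_type` being optional.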
+ - book Book
+ - web_page Web page
+ - paper Academic paper/article
+ - social_media_post Social media post
+ - wikipedia_entry Wikipedia entry
+ - personal_document Personal document
+ - business_document Business document
+ - im_chat_log Chat log
+ - synced_from_notion Notion document
+ - synced_from_github GitHub document
+ - others Other document types
+ For book:
+ - title Book title
+ - language Book language
+ - author Book author
+ - publisher Publisher name
+ - publication_date Publication date
+ - isbn ISBN number
+ - category Book category
+
+ For web_page:
+ - title Page title
+ - url Page URL
+ - language Page language
+ - publish_date Publish date
+ - author/publisher Author or publisher
+ - topic/keywords Topic or keywords
+ - description Page description
+
+ Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
+
+ For doc_type "others", any valid JSON object is accepted
+ high_quality High quality: embeds using an embedding model and builds a vector database index
@@ -194,6 +234,68 @@ import { Row, Col, Properties, Property, Heading, SubProperty, PropertyInstructi
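- text_model Text documents are embedded directly; this is the default in economy mode
- hierarchical_model Parent-child mode
- qa_model Q&A mode: generates Q&A pairs for segmented documents and then embeds the questions
+ - doc_type Type of document (optional)
+ - book Book
+ Document records a book or publication
+ - web_page Web page
+ Document records web page content
+ - paper Academic paper/article
+ Document records an academic paper or research article
+ - social_media_post Social media post
+ Content from social media posts
+ - wikipedia_entry Wikipedia entry
+ Content from Wikipedia entries
+ - personal_document Personal document
+ Documents related to personal content
+ - business_document Business document
+ Documents related to business content
+ - im_chat_log Chat log
+ Records of instant messaging chats
+ - synced_from_notion Notion document
+ Documents synchronized from Notion
+ - synced_from_github GitHub document
+ Documents synchronized from GitHub
+ - others Other document types
+ Other document types not listed above
+
+ - doc_metadata Document metadata (required if doc_type is provided)
+ Fields vary by doc_type
+
+ For book:
+ - title Book title
+ Title of the book
+ - language Book language
+ Language of the book
+ - author Book author
+ Author of the book
+ - publisher Publisher name
+ Name of the publisher
+ - publication_date Publication date
+ Date when the book was published
+ - isbn ISBN number
+ ISBN of the book
+ - category Book category
+ Category of the book
+
+ For web_page:
+ - title Page title
+ Title of the web page
+ - url Page URL
+ URL address of the web page
+ - language Page language
+ Language of the web page
+ - publish_date Publish date
+ Date when the web page was published
+ - author/publisher Author or publisher
+ Author or publisher of the web page
+ - topic/keywords Topic or keywords
+ Topics or keywords of the web page
+ - description Page description
+ Description of the web page
+
+ Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
+
+ For doc_type "others", any valid JSON object is accepted
- doc_language In Q&A mode, specify the language of the document, for example: English, Chinese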
@@ -504,6 +606,46 @@ import { Row, Col, Properties, Property, Heading, SubProperty, PropertyInstructi
+ - book Book
+ - web_page Web page
+ - paper Academic paper/article
+ - social_media_post Social media post
+ - wikipedia_entry Wikipedia entry
+ - personal_document Personal document
+ - business_document Business document
+ - im_chat_log Chat log
+ - synced_from_notion Notion document
+ - synced_from_github GitHub document
+ - others Other document types
+ For book:
+ - title Book title
+ - language Book language
+ - author Book author
+ - publisher Publisher name
+ - publication_date Publication date
+ - isbn ISBN number
+ - category Book category
+
+ For web_page:
+ - title Page title
+ - url Page URL
+ - language Page language
+ - publish_date Publish date
+ - author/publisher Author or publisher
+ - topic/keywords Topic or keywords
+ - description Page description
+
+ Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
+
+ For doc_type "others", any valid JSON object is accepted
+ mode (string) Cleaning and segmentation mode: automatic / custom
@@ -624,6 +766,68 @@ import { Row, Col, Properties, Property, Heading, SubProperty, PropertyInstructi
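- separator Segmentation identifier. Currently, only one delimiter is allowed. The default is ***
- max_tokens The maximum length (in tokens); must be validated to be shorter than the length of the parent chunk
- chunk_overlap The overlap between adjacent chunks when the data is segmented (optional)
+ - doc_type Type of document (optional)
+ - book Book
+ Document records a book or publication
+ - web_page Web page
+ Document records web page content
+ - paper Academic paper/article
+ Document records an academic paper or research article
+ - social_media_post Social media post
+ Content from social media posts
+ - wikipedia_entry Wikipedia entry
+ Content from Wikipedia entries
+ - personal_document Personal document
+ Documents related to personal content
+ - business_document Business document
+ Documents related to business content
+ - im_chat_log Chat log
+ Records of instant messaging chats
+ - synced_from_notion Notion document
+ Documents synchronized from Notion
+ - synced_from_github GitHub document
+ Documents synchronized from GitHub
+ - others Other document types
+ Other document types not listed above
+
+ - doc_metadata Document metadata (required if doc_type is provided)
+ Fields vary by doc_type
+
+ For book:
+ - title Book title
+ Title of the book
+ - language Book language
+ Language of the book
+ - author Book author
+ Author of the book
+ - publisher Publisher name
+ Name of the publisher
+ - publication_date Publication date
+ Date when the book was published
+ - isbn ISBN number
+ ISBN of the book
+ - category Book category
+ Category of the book
+
+ For web_page:
+ - title Page title
+ Title of the web page
+ - url Page URL
+ URL address of the web page
+ - language Page language
+ Language of the web page
+ - publish_date Publish date
+ Date when the web page was published
+ - author/publisher Author or publisher
+ Author or publisher of the web page
+ - topic/keywords Topic or keywords
+ Topics or keywords of the web page
+ - description Page description
+ Description of the web page
+
+ Please check [api/services/dataset_service.py](https://github.com/langgenius/dify/blob/main/api/services/dataset_service.py#L475) for more details on the fields required for each doc_type.
+
+ For doc_type "others", any valid JSON object is accepted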