Commit Graph

9010 Commits

Author SHA1 Message Date
GareArc
11ab67c8cb
Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-02 04:20:06 -08:00
GareArc
fe741140d5
fix(telemetry): fix zero-value message and workflow duration histograms
Workflow RT: replace float(info.workflow_run_elapsed_time) with
(end_time - start_time).total_seconds() using workflow_run.created_at and
workflow_run.finished_at. The elapsed_time DB field defaults to 0 and can
be stale if the workflow_storage Celery task has not committed yet when the
trace fires. Wall-clock timestamps are more reliable; elapsed_time is kept
as fallback.

Message RT: change end_time from created_at + provider_response_latency to
message.updated_at when updated_at > created_at. The pipeline explicitly
sets message.updated_at = naive_utc_now() at the moment the LLM response
is complete, making it the canonical response-complete timestamp.
Falls back to the latency-based calculation for error/aborted messages.
2026-03-02 04:14:57 -08:00
GareArc
9b5b355a4e
fix(telemetry): gate ObservabilityLayer content attrs behind ENTERPRISE_INCLUDE_CONTENT
Add should_include_content() helper to extensions/otel/parser/base.py that
returns True in CE (no behaviour change) and respects ENTERPRISE_INCLUDE_CONTENT
in EE. Gate all content-bearing span attributes in LLM, retrieval, tool, and
default node parsers so that gen_ai.completion, gen_ai.prompt, retrieval.document,
tool call arguments/results, and node input/output values are suppressed when
ENTERPRISE_ENABLED=True and ENTERPRISE_INCLUDE_CONTENT=False.
2026-03-02 04:04:26 -08:00
GareArc
ff35f1bfaa
Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-02 02:28:30 -08:00
GareArc
3364003f90
fix(telemetry): add credential_name lookup with async-safe fallback 2026-03-02 02:27:31 -08:00
GareArc
e387d0205b
Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-02 01:54:55 -08:00
GareArc
6df00c83ae
fix(telemetry): populate LLM credential info in node execution traces
- Add _lookup_llm_credential_info() to query Provider/ProviderModel tables
- Lookup LLM credentials when tool credential_id is null
- Fall back to provider-level credential if no model-specific credential
2026-03-02 01:47:39 -08:00
GareArc
05cf2336ac
docs(telemetry): add token consumption query patterns to data dictionary
Add token hierarchy diagram, common PromQL queries (totals, drill-down,
rates), and app name lookup via trace query.
2026-03-02 01:19:00 -08:00
GareArc
b710c9ad59
fix(telemetry): populate missing fields in node execution trace
- Extract model_provider/model_name from process_data (LLM nodes store
  model info there, not in execution_metadata)
- Add invoke_from to node execution trace metadata dict
- Add credential_id to node execution trace metadata dict
- Add conversation_id to metadata after message_id lookup
- Add tool_name to tool_info dict in tool node
2026-03-02 01:18:59 -08:00
GareArc
a2a5b02a53
docs(telemetry): add token consumption query patterns to data dictionary
Add token hierarchy diagram, common PromQL queries (totals, drill-down,
rates), and app name lookup via trace query.
2026-03-02 01:07:18 -08:00
GareArc
1fcb05432d
fix(telemetry): populate missing fields in node execution trace
- Extract model_provider/model_name from process_data (LLM nodes store
  model info there, not in execution_metadata)
- Add invoke_from to node execution trace metadata dict
- Add credential_id to node execution trace metadata dict
- Add conversation_id to metadata after message_id lookup
- Add tool_name to tool_info dict in tool node
2026-03-02 01:07:10 -08:00
L1nSn0w
9c148218fc Merge branch 'deploy/enterprise' of https://github.com/langgenius/dify into deploy/enterprise 2026-03-02 16:58:01 +08:00
L1nSn0w
02ab3a34b4 Merge branch 'release/e-1.12.1' into deploy/enterprise 2026-03-02 16:57:31 +08:00
L1nSn0w
58524fd7fd feat(enterprise): auto-join newly registered accounts to the default workspace (#32308)
Co-authored-by: Yunlu Wen <yunlu.wen@dify.ai>
2026-03-02 16:38:43 +08:00
GareArc
aa7f648712
Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-01 22:30:09 -08:00
GareArc
9d4b2715e8
fix(celery): register enterprise_telemetry_task in worker imports
Fixes Celery worker error where process_enterprise_telemetry task
was unregistered despite being dispatched from the app.

Added conditional import when ENTERPRISE_TELEMETRY_ENABLED=true
to ensure the task is available in the worker process.

Resolves: KeyError 'tasks.enterprise_telemetry_task.process_enterprise_telemetry'
2026-03-01 22:27:44 -08:00
GareArc
e2fc3417be
Merge branch 'fix/otel-upgrade-e-1.12.1' into deploy/enterprise 2026-03-01 21:48:37 -08:00
GareArc
2d7bffcc11
fix: upgrade OpenTelemetry packages from 0.48b0 to 0.49b0
Fixes "Failed to detach context" error in production by upgrading to OTEL 0.49b0,
which includes None token guards in Celery instrumentor (PR opentelemetry-python-contrib#2927).

Package Updates:
- OTEL instrumentation: 0.48b0 → 0.49b0
- OTEL SDK/API: 1.27.0 → 1.28.0
- protobuf: 4.25.8 → 5.29.6 (required by opentelemetry-proto 1.28.0)
- Google Cloud packages upgraded for protobuf 5.x compatibility:
  - google-api-core: 2.18.0 → 2.19.1+
  - google-auth: 2.29.0 → 2.47.0+
  - google-cloud-aiplatform: 1.49.0 → 1.123.0+
  - googleapis-common-protos: 1.63.0 → 1.65.0+
  - google-cloud-storage: 2.16.0 → 3.0.0+
- httpx: 0.27.0 → 0.28.0 (required by google-genai 1.37+)

Also removed duplicate opentelemetry-instrumentation-httpx entry in pyproject.toml.
2026-03-01 21:47:51 -08:00
GareArc
eb1b1eb09c
Merge 1.12.1-otel-ee into deploy/enterprise 2026-03-01 19:37:06 -08:00
GareArc
83f5850d0a
refactor(telemetry): add resolved_parent_context property and fix edge cases
- Add resolved_parent_context property to BaseTraceInfo for reusable parent context extraction
- Refactor enterprise_trace.py to use property instead of duplicated dict plucking (~19 lines eliminated)
- Fix UUID validation in exporter.py with specific error logging for invalid trace correlation IDs
- Add error isolation in event_handlers.py to prevent telemetry failures from breaking user operations
- Replace pickle-based payload_fallback with JSON storage rehydration for security
- Update TelemetryEnvelope to use Pydantic v2 ConfigDict with extra='forbid'
- Update tests to reflect contract changes and new error handling behavior
2026-03-01 19:33:59 -08:00
yunlu.wen
3368d4cf02 Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-02 10:10:28 +08:00
yunlu.wen
7a92c1764f fix token label 2026-03-02 10:10:01 +08:00
yunlu.wen
5617d69ca7 try to fix exception logging 2026-03-02 09:53:11 +08:00
GareArc
1a6aded8e0
Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-01 02:25:23 -08:00
GareArc
9952a17fed
fix(telemetry): use URL scheme instead of API key for gRPC TLS detection
- Change insecure parameter from API key-based to URL scheme-based detection
- https:// endpoints now correctly use TLS (insecure=False)
- All other endpoints (http://, no scheme) use insecure=True
- Update tests to reflect URL scheme-based logic
- Remove incorrect documentation claiming API key controls TLS
2026-03-01 02:24:25 -08:00
GareArc
36ff9b447d
Merge origin/release/e-1.12.1 into 1.12.1-otel-ee
Sync enterprise 1.12.1 changes:
- feat: implement heartbeat mechanism for database migration lock
- refactor: replace AutoRenewRedisLock with DbMigrationAutoRenewLock
- fix: improve logging for database migration lock release
- fix: make flask upgrade-db fail on error
- fix: include sso_verified in access_mode validation
- fix: inherit web app permission from original app
- fix: make e-1.12.1 enterprise migrations database-agnostic
- fix: get_message_event_type return wrong message type
- refactor: document_indexing_sync_task split db session
- fix: trigger output schema miss
- test: remove unrelated enterprise service test

Conflict resolution:
- Combined OTEL telemetry imports with tool signature import in easy_ui_based_generate_task_pipeline.py
2026-03-01 00:18:46 -08:00
GareArc
1fa1960201
Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-02-28 20:34:15 -08:00
GareArc
ff877ee39c
fix(telemetry): add resolved_trace_id property to eliminate trace_id inconsistencies
Add computed property to BaseTraceInfo that provides intelligent fallback:
1. External trace_id (from X-Trace-Id header)
2. workflow_run_id (for workflow-related traces)
3. message_id (as final fallback)

This ensures attribute dify.trace_id always matches log-level trace_id,
eliminating inconsistencies where attribute was null but log-level had value.

Changes:
- Add resolved_trace_id property to BaseTraceInfo (trace_entity.py)
- Replace 4 direct trace_id attribute assignments with resolved_trace_id
- Add trace_id_source parameter to 5 emit_metric_only_event calls

Fixes trace_id inconsistency found in MESSAGE_RUN, TOOL_EXECUTION,
MODERATION_CHECK, SUGGESTED_QUESTION_GENERATION, GENERATE_NAME_EXECUTION,
DATASET_RETRIEVAL, and PROMPT_GENERATION_EXECUTION events.

All 78 telemetry tests passing.
2026-02-28 20:32:15 -08:00
GareArc
370e1fa5e2
Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-02-28 19:30:49 -08:00
GareArc
abcf14a571
refactor(telemetry): move gateway to core as stateless module-level functions
Move routing table, emit(), and is_enterprise_telemetry_enabled() from
enterprise/telemetry/gateway.py into core/telemetry/gateway.py so both
CE and EE share one code path. The ce_eligible flag in CASE_ROUTING
controls which events flow in CE — flipping it is the only change needed
to enable an event in community edition.

- Delete enterprise/telemetry/gateway.py (class-based singleton)
- Create core/telemetry/gateway.py (stateless functions, no shared state)
- Simplify core/telemetry/__init__.py to thin facade over gateway
- Remove TelemetryGateway class and get_gateway() from ext_enterprise_telemetry
- Single-source is_enterprise_telemetry_enabled in core.telemetry.gateway
- Fix pre-existing test bugs (missing dify.event.id in metric handler tests)
- Update all imports and mock paths across 7 test files
2026-02-28 19:27:24 -08:00
GareArc
9bd938b4e1
Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-02-28 17:41:17 -08:00
GareArc
5e57f73598
feat(telemetry): add model provider and name tags to all trace metrics
Add comprehensive model tracking across all OTEL metrics and logs:
- Node execution metrics now include model_name for LLM operations
- Suggested question metrics include model_provider and model_name
- Dataset retrieval captures both embedding and rerank model info
- Updated DATA_DICTIONARY.md with complete metric label documentation

This enables granular cost tracking, performance analysis, and usage monitoring per model across all operation types.
2026-02-28 00:06:44 -08:00
GareArc
62592be60b
docs(enterprise): split telemetry docs into README and data dictionary
Separate background/configuration instructions from the data dictionary:
- README.md: Overview, configuration, correlation model, content gating
- DATA_DICTIONARY.md: Pure reference format with signals and attributes

The data dictionary is now concise (465 lines vs 911) and focuses on
attribute types and relationships without verbose explanations.
2026-02-27 12:32:48 -08:00
L1nSn0w
7a8c96b4b7 Merge branch 'release/e-1.12.1' into deploy/enterprise 2026-02-14 17:00:06 +08:00
L1nSn0w
5025e29220 test: remove unrelated enterprise service test
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-14 16:34:49 +08:00
L1nSn0w
3cdc9c119e refactor(api): enhance DbMigrationAutoRenewLock acquisition logic
- Added a check to prevent double acquisition of the DB migration lock, raising an error if an attempt is made to acquire it while already held.
- Implemented logic to reuse the lock object if it has already been created, improving efficiency and clarity in lock management.
- Reset the lock object to None upon release to ensure proper state management.

(cherry picked from commit d4b102d3c8a473c4fd6409dba7c198289bb5f921)
2026-02-14 16:28:38 +08:00
L1nSn0w
18ba367b11 refactor(api): improve DbMigrationAutoRenewLock configuration and logging
- Introduced constants for minimum and maximum join timeout values, enhancing clarity and maintainability.
- Updated the renewal interval calculation to use defined constants for better readability.
- Improved logging messages to include context information, making it easier to trace issues during lock operations.

(cherry picked from commit 1471b77bf5156a95417bde148753702d44221929)
2026-02-14 16:28:38 +08:00
autofix-ci[bot]
d0bd74fccb [autofix.ci] apply automated fixes
(cherry picked from commit 907e63cdc57f8006017837a74c2da2fbe274dcfb)
2026-02-14 16:28:38 +08:00
L1nSn0w
5ccbc00eb9 refactor(api): replace AutoRenewRedisLock with DbMigrationAutoRenewLock
- Updated the database migration locking mechanism to use DbMigrationAutoRenewLock for improved clarity and functionality.
- Removed the AutoRenewRedisLock implementation and its associated tests.
- Adjusted integration and unit tests to reflect the new locking class and its usage in the upgrade_db command.

(cherry picked from commit c812ad9ff26bed3eb59862bd7a5179b7ee83f11f)
2026-02-14 16:28:38 +08:00
L1nSn0w
94603b5408 refactor(api): replace heartbeat mechanism with AutoRenewRedisLock for database migration
- Removed the manual heartbeat function for renewing the Redis lock during database migrations.
- Integrated AutoRenewRedisLock to handle lock renewal automatically, simplifying the upgrade_db command.
- Updated unit tests to reflect changes in lock handling and error management during migrations.

(cherry picked from commit 8814256eb5fa20b29e554264f3b659b027bc4c9a)
2026-02-14 16:28:38 +08:00
L1nSn0w
8d4bd5636b refactor(tests): replace hardcoded wait time with constant for clarity
- Introduced HEARTBEAT_WAIT_TIMEOUT_SECONDS constant to improve readability and maintainability of test code.
- Updated test assertions to use the new constant instead of a hardcoded value.

(cherry picked from commit 0d53743d83b03ae0e68fad143711ffa5f6354093)
2026-02-14 16:28:38 +08:00
autofix-ci[bot]
ee0c4a8852 [autofix.ci] apply automated fixes
(cherry picked from commit 326cffa553ffac1bcd39a051c899c35b0ebe997d)
2026-02-14 16:28:38 +08:00
L1nSn0w
6032c598b0 fix(api): improve logging for database migration lock release
- Added a migration_succeeded flag to track the success of database migrations.
- Enhanced logging messages to indicate the status of the migration when releasing the lock, providing clearer context for potential issues.

(cherry picked from commit e74be0392995d16d288eed2175c51148c9e5b9c0)
2026-02-14 16:28:38 +08:00
L1nSn0w
afdd5b6c86 feat(api): implement heartbeat mechanism for database migration lock
- Added a heartbeat function to renew the Redis lock during database migrations, preventing long blockages from crashed processes.
- Updated the upgrade_db command to utilize the new locking mechanism with a configurable TTL.
- Removed the deprecated MIGRATION_LOCK_TTL from DeploymentConfig and related files.
- Enhanced unit tests to cover the new lock renewal behavior and error handling during migrations.

(cherry picked from commit a3331c622435f9f215b95f6b0261f43ae56a9d9c)
2026-02-14 16:28:38 +08:00
L1nSn0w
9acdfbde2f feat(api): enhance database migration locking mechanism and configuration
- Introduced a configurable Redis lock TTL for database migrations in DeploymentConfig.
- Updated the upgrade_db command to handle lock release errors gracefully.
- Added documentation for the new MIGRATION_LOCK_TTL environment variable in the .env.example file and docker-compose.yaml.

(cherry picked from commit 4a05fb120622908bc109a3715686706aab3d3b59)
2026-02-14 16:28:38 +08:00
longbingljw
1977e68b2d fix: make flask upgrade-db fail on error (#32024)
(cherry picked from commit d9530f7bb7)
2026-02-14 16:28:38 +08:00
GareArc
f17b51ab3a
Merge branch 'fix/access-mode-sso-verified-e-1.12.1' into deploy/enterprise 2026-02-13 23:41:04 -08:00
Xiyuan Chen
e9a7e8f77f
fix: include sso_verified in access_mode validation (#32325) 2026-02-13 23:40:37 -08:00
GareArc
23c75c7ec7
fix: centralize access_mode validation and support sso_verified
- Add ALLOWED_ACCESS_MODES constant to centralize valid access modes
- Include 'sso_verified' in validation to fix app duplication errors
- Update error message to dynamically list all allowed modes
- Refactor for maintainability: single source of truth for access modes

This fixes the issue where apps with access_mode='sso_verified' could not
be duplicated because the validation in update_app_access_mode() was missing
this mode, even though it was documented in WebAppSettings model.
2026-02-13 23:29:05 -08:00
GareArc
588e6561dc
Merge branch 'hotfix/e-1.12.1-app-copy-inherit-webapp-permission' into deploy/enterprise 2026-02-13 22:42:35 -08:00