
5308 Commits

Author SHA1 Message Date
GareArc
05cf2336ac
docs(telemetry): add token consumption query patterns to data dictionary
Add token hierarchy diagram, common PromQL queries (totals, drill-down,
rates), and app name lookup via trace query.
2026-03-02 01:19:00 -08:00
GareArc
b710c9ad59
fix(telemetry): populate missing fields in node execution trace
- Extract model_provider/model_name from process_data (LLM nodes store
  model info there, not in execution_metadata)
- Add invoke_from to node execution trace metadata dict
- Add credential_id to node execution trace metadata dict
- Add conversation_id to metadata after message_id lookup
- Add tool_name to tool_info dict in tool node
2026-03-02 01:18:59 -08:00
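The process_data lookup described in b710c9ad59 might look roughly like the sketch below. It assumes the keys are literally named model_provider and model_name. The fallback to execution_metadata for non-LLM nodes is an illustrative assumption; the commit does not state it.

```python
from __future__ import annotations

from typing import Any, Mapping


def extract_model_info(
    process_data: Mapping[str, Any] | None,
    execution_metadata: Mapping[str, Any] | None,
) -> tuple[str | None, str | None]:
    """Return (model_provider, model_name) for a node execution.

    Assumption: the keys are named exactly "model_provider" and "model_name".
    LLM nodes keep them in process_data, so that source is checked first;
    falling back to execution_metadata is a hypothetical extra for other nodes.
    """
    for source in (process_data or {}, execution_metadata or {}):
        provider = source.get("model_provider")
        name = source.get("model_name")
        if provider or name:
            return provider, name
    return None, None
```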
L1nSn0w
9c148218fc Merge branch 'deploy/enterprise' of https://github.com/langgenius/dify into deploy/enterprise 2026-03-02 16:58:01 +08:00
L1nSn0w
02ab3a34b4 Merge branch 'release/e-1.12.1' into deploy/enterprise 2026-03-02 16:57:31 +08:00
L1nSn0w
58524fd7fd feat(enterprise): auto-join newly registered accounts to the default workspace (#32308)
Co-authored-by: Yunlu Wen <yunlu.wen@dify.ai>
2026-03-02 16:38:43 +08:00
GareArc
aa7f648712
Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-03-01 22:30:09 -08:00
GareArc
9d4b2715e8
fix(celery): register enterprise_telemetry_task in worker imports
Fixes a Celery worker error where the process_enterprise_telemetry task
was unregistered even though the app dispatched it.

Added a conditional import, active when ENTERPRISE_TELEMETRY_ENABLED=true,
so the task is registered in the worker process.

Resolves: KeyError 'tasks.enterprise_telemetry_task.process_enterprise_telemetry'
2026-03-01 22:27:44 -08:00
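One way to express the conditional registration from 9d4b2715e8 is to build Celery's imports list at startup. The module path tasks.enterprise_telemetry_task is inferred from the KeyError above, since Celery names a task after its module plus function. The helper name and the base module list are hypothetical.

```python
from __future__ import annotations

import os
from typing import Mapping, Sequence


def build_celery_imports(
    base_modules: Sequence[str],
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return the task modules a Celery worker should import at startup.

    Hypothetical helper: the enterprise telemetry module is added only when
    ENTERPRISE_TELEMETRY_ENABLED=true, so the worker registers
    tasks.enterprise_telemetry_task.process_enterprise_telemetry exactly
    when the app may dispatch it.
    """
    imports = list(base_modules)
    if env.get("ENTERPRISE_TELEMETRY_ENABLED", "false").strip().lower() == "true":
        imports.append("tasks.enterprise_telemetry_task")
    return imports
```

The result could then be handed to the worker through Celery's imports setting, for example celery_app.conf.update(imports=build_celery_imports(existing_modules)).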
GareArc
e2fc3417be
Merge branch 'fix/otel-upgrade-e-1.12.1' into deploy/enterprise 2026-03-01 21:48:37 -08:00
GareArc
2d7bffcc11
fix: upgrade OpenTelemetry packages from 0.48b0 to 0.49b0
Fixes the "Failed to detach context" error in production by upgrading to OTEL 0.49b0,
which adds None token guards to the Celery instrumentor (PR opentelemetry-python-contrib#2927).

Package Updates:
- OTEL instrumentation: 0.48b0 → 0.49b0
- OTEL SDK/API: 1.27.0 → 1.28.0
- protobuf: 4.25.8 → 5.29.6 (required by opentelemetry-proto 1.28.0)
- Google Cloud packages upgraded for protobuf 5.x compatibility:
  - google-api-core: 2.18.0 → 2.19.1+
  - google-auth: 2.29.0 → 2.47.0+
  - google-cloud-aiplatform: 1.49.0 → 1.123.0+
  - googleapis-common-protos: 1.63.0 → 1.65.0+
  - google-cloud-storage: 2.16.0 → 3.0.0+
- httpx: 0.27.0 → 0.28.0 (required by google-genai 1.37+)

Also removed duplicate opentelemetry-instrumentation-httpx entry in pyproject.toml.
2026-03-01 21:47:51 -08:00
GareArc
83f5850d0a
refactor(telemetry): add resolved_parent_context property and fix edge cases
- Add resolved_parent_context property to BaseTraceInfo for reusable parent context extraction
- Refactor enterprise_trace.py to use property instead of duplicated dict plucking (~19 lines eliminated)
- Fix UUID validation in exporter.py with specific error logging for invalid trace correlation IDs
- Add error isolation in event_handlers.py to prevent telemetry failures from breaking user operations
- Replace pickle-based payload_fallback with JSON storage rehydration for security
- Update TelemetryEnvelope to use Pydantic v2 ConfigDict with extra='forbid'
- Update tests to reflect contract changes and new error handling behavior
2026-03-01 19:33:59 -08:00
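The error isolation and UUID validation in 83f5850d0a follow a common pattern: log and swallow failures inside telemetry code, and reject malformed correlation IDs instead of exporting them. The sketch below uses only the standard library; the decorator and function names are hypothetical, not the ones in event_handlers.py or exporter.py.

```python
from __future__ import annotations

import functools
import logging
import uuid
from typing import Any, Callable, TypeVar

logger = logging.getLogger(__name__)
F = TypeVar("F", bound=Callable[..., Any])


def isolate_telemetry_errors(func: F) -> F:
    """Hypothetical decorator: log and swallow any exception raised by a
    telemetry handler so the user's operation continues unaffected."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.exception("Telemetry handler %s failed", getattr(func, "__name__", func))
            return None

    return wrapper  # type: ignore[return-value]


def parse_trace_correlation_id(value: str | None) -> uuid.UUID | None:
    """Hypothetical helper: return a UUID, or None (with a specific log line)
    when the correlation ID is missing or malformed."""
    if value is None:
        return None
    try:
        return uuid.UUID(value)
    except ValueError:
        logger.warning("Invalid trace correlation ID %r; not attaching it", value)
        return None
```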
yunlu.wen
7a92c1764f fix token label 2026-03-02 10:10:01 +08:00
GareArc
9952a17fed
fix(telemetry): use URL scheme instead of API key for gRPC TLS detection
- Change insecure parameter from API key-based to URL scheme-based detection
- https:// endpoints now correctly use TLS (insecure=False)
- All other endpoints (http://, no scheme) use insecure=True
- Update tests to reflect URL scheme-based logic
- Remove incorrect documentation claiming API key controls TLS
2026-03-01 02:24:25 -08:00
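The TLS rule in 9952a17fed reduces to a single check on the endpoint string. The helper name below is hypothetical; the resulting flag would feed the insecure argument of the gRPC OTLP exporter (for example OTLPSpanExporter from the opentelemetry-exporter-otlp-proto-grpc package).

```python
def grpc_insecure_flag(endpoint: str) -> bool:
    """Return the value for the gRPC exporter's insecure argument.

    Only https:// endpoints use TLS (insecure=False). http:// endpoints and
    bare host:port values without a scheme use insecure=True. The API key
    plays no part in this decision.
    """
    return not endpoint.strip().lower().startswith("https://")
```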
GareArc
36ff9b447d
Merge origin/release/e-1.12.1 into 1.12.1-otel-ee
Sync enterprise 1.12.1 changes:
- feat: implement heartbeat mechanism for database migration lock
- refactor: replace AutoRenewRedisLock with DbMigrationAutoRenewLock
- fix: improve logging for database migration lock release
- fix: make flask upgrade-db fail on error
- fix: include sso_verified in access_mode validation
- fix: inherit web app permission from original app
- fix: make e-1.12.1 enterprise migrations database-agnostic
- fix: get_message_event_type return wrong message type
- refactor: document_indexing_sync_task split db session
- fix: trigger output schema miss
- test: remove unrelated enterprise service test

Conflict resolution:
- Combined OTEL telemetry imports with tool signature import in easy_ui_based_generate_task_pipeline.py
2026-03-01 00:18:46 -08:00
GareArc
ff877ee39c
fix(telemetry): add resolved_trace_id property to eliminate trace_id inconsistencies
Add a computed property to BaseTraceInfo that falls back through these sources in order:
1. External trace_id (from X-Trace-Id header)
2. workflow_run_id (for workflow-related traces)
3. message_id (as final fallback)

This ensures the dify.trace_id attribute always matches the log-level trace_id,
eliminating inconsistencies where the attribute was null but the log-level trace_id had a value.

Changes:
- Add resolved_trace_id property to BaseTraceInfo (trace_entity.py)
- Replace 4 direct trace_id attribute assignments with resolved_trace_id
- Add trace_id_source parameter to 5 emit_metric_only_event calls

Fixes trace_id inconsistency found in MESSAGE_RUN, TOOL_EXECUTION,
MODERATION_CHECK, SUGGESTED_QUESTION_GENERATION, GENERATE_NAME_EXECUTION,
DATASET_RETRIEVAL, and PROMPT_GENERATION_EXECUTION events.

All 78 telemetry tests passing.
2026-02-28 20:32:15 -08:00
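The fallback order in ff877ee39c maps naturally onto a read-only property. This sketch uses a plain dataclass with the hypothetical name TraceInfoSketch rather than the real BaseTraceInfo, whose base class and other fields are not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TraceInfoSketch:
    """Hypothetical stand-in for BaseTraceInfo with only the relevant fields."""

    trace_id: str | None = None         # external ID from the X-Trace-Id header
    workflow_run_id: str | None = None  # set for workflow-related traces
    message_id: str | None = None       # final fallback

    @property
    def resolved_trace_id(self) -> str | None:
        # First non-empty value wins; empty strings are treated as missing.
        return self.trace_id or self.workflow_run_id or self.message_id
```

Reading resolved_trace_id at every emission site, rather than the raw trace_id field, is what keeps the attribute and the log-level value identical.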
GareArc
abcf14a571
refactor(telemetry): move gateway to core as stateless module-level functions
Move routing table, emit(), and is_enterprise_telemetry_enabled() from
enterprise/telemetry/gateway.py into core/telemetry/gateway.py so both
CE and EE share one code path. The ce_eligible flag in CASE_ROUTING
controls which events flow in CE — flipping it is the only change needed
to enable an event in community edition.

- Delete enterprise/telemetry/gateway.py (class-based singleton)
- Create core/telemetry/gateway.py (stateless functions, no shared state)
- Simplify core/telemetry/__init__.py to thin facade over gateway
- Remove TelemetryGateway class and get_gateway() from ext_enterprise_telemetry
- Single-source is_enterprise_telemetry_enabled in core.telemetry.gateway
- Fix pre-existing test bugs (missing dify.event.id in metric handler tests)
- Update all imports and mock paths across 7 test files
2026-02-28 19:27:24 -08:00
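A minimal sketch of the stateless gateway in abcf14a571, assuming each CASE_ROUTING entry pairs a handler with its ce_eligible flag and that is_enterprise_telemetry_enabled() reads ENTERPRISE_TELEMETRY_ENABLED. The Route type, the example cases and their ce_eligible values, and the env parameter are hypothetical.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class Route:
    handler: Callable[[Mapping[str, Any]], None]
    # Flipping this flag is the only change needed to enable an event in CE.
    ce_eligible: bool


def _noop_handler(payload: Mapping[str, Any]) -> None:
    """Placeholder for a real trace, metric, or log exporter."""


# Hypothetical routing table: case names appear in this log, flags are illustrative.
CASE_ROUTING: dict[str, Route] = {
    "MESSAGE_RUN": Route(_noop_handler, ce_eligible=True),
    "APP_CREATED": Route(_noop_handler, ce_eligible=False),
}


def is_enterprise_telemetry_enabled(env: Mapping[str, str] = os.environ) -> bool:
    return env.get("ENTERPRISE_TELEMETRY_ENABLED", "false").strip().lower() == "true"


def emit(case: str, payload: Mapping[str, Any], env: Mapping[str, str] = os.environ) -> bool:
    """Route one event and return True if a handler ran.

    No module-level state is mutated, so CE and EE can share this code path.
    """
    route = CASE_ROUTING.get(case)
    if route is None:
        return False
    if not route.ce_eligible and not is_enterprise_telemetry_enabled(env):
        return False
    route.handler(payload)
    return True
```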
GareArc
5e57f73598
feat(telemetry): add model provider and name tags to all trace metrics
Add comprehensive model tracking across all OTEL metrics and logs:
- Node execution metrics now include model_name for LLM operations
- Suggested question metrics include model_provider and model_name
- Dataset retrieval captures both embedding and rerank model info
- Updated DATA_DICTIONARY.md with complete metric label documentation

This enables granular cost tracking, performance analysis, and usage monitoring per model across all operation types.
2026-02-28 00:06:44 -08:00
GareArc
62592be60b
docs(enterprise): split telemetry docs into README and data dictionary
Separate background/configuration instructions from the data dictionary:
- README.md: Overview, configuration, correlation model, content gating
- DATA_DICTIONARY.md: Pure reference format with signals and attributes

The data dictionary is now concise (465 lines vs 911) and focuses on
attribute types and relationships without verbose explanations.
2026-02-27 12:32:48 -08:00
L1nSn0w
7a8c96b4b7 Merge branch 'release/e-1.12.1' into deploy/enterprise 2026-02-14 17:00:06 +08:00
L1nSn0w
5025e29220 test: remove unrelated enterprise service test
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-14 16:34:49 +08:00
L1nSn0w
3cdc9c119e refactor(api): enhance DbMigrationAutoRenewLock acquisition logic
- Added a check to prevent double acquisition of the DB migration lock, raising an error if an attempt is made to acquire it while already held.
- Implemented logic to reuse the lock object if it has already been created, improving efficiency and clarity in lock management.
- Reset the lock object to None upon release to ensure proper state management.

(cherry picked from commit d4b102d3c8a473c4fd6409dba7c198289bb5f921)
2026-02-14 16:28:38 +08:00
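The acquisition rules in 3cdc9c119e (refuse a second acquire while held, reuse an already-created lock object, reset it to None on release) can be sketched as a small wrapper. The class and exception names are hypothetical; threading.Lock stands in for the Redis lock, which in redis-py also offers acquire(blocking=False) and release().

```python
from __future__ import annotations

import threading
from typing import Any, Callable


class LockAlreadyHeldError(RuntimeError):
    """Raised when acquire() is called while this wrapper already holds the lock."""


class MigrationLockSketch:
    """Hypothetical wrapper illustrating the guard, reuse, and reset rules."""

    def __init__(self, lock_factory: Callable[[], Any] = threading.Lock) -> None:
        self._lock_factory = lock_factory
        self._lock: Any = None  # created lazily, reused until released
        self._held = False

    def acquire(self) -> bool:
        if self._held:
            raise LockAlreadyHeldError("DB migration lock is already held")
        if self._lock is None:
            self._lock = self._lock_factory()
        # A failed earlier attempt leaves self._lock set, so it is reused here.
        self._held = bool(self._lock.acquire(blocking=False))
        return self._held

    def release(self) -> None:
        if not self._held:
            return
        try:
            self._lock.release()
        finally:
            # Reset state so a later acquire() starts from a fresh lock object.
            self._held = False
            self._lock = None
```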
L1nSn0w
18ba367b11 refactor(api): improve DbMigrationAutoRenewLock configuration and logging
- Introduced constants for minimum and maximum join timeout values, enhancing clarity and maintainability.
- Updated the renewal interval calculation to use defined constants for better readability.
- Improved logging messages to include context information, making it easier to trace issues during lock operations.

(cherry picked from commit 1471b77bf5156a95417bde148753702d44221929)
2026-02-14 16:28:38 +08:00
autofix-ci[bot]
d0bd74fccb [autofix.ci] apply automated fixes
(cherry picked from commit 907e63cdc57f8006017837a74c2da2fbe274dcfb)
2026-02-14 16:28:38 +08:00
L1nSn0w
5ccbc00eb9 refactor(api): replace AutoRenewRedisLock with DbMigrationAutoRenewLock
- Updated the database migration locking mechanism to use DbMigrationAutoRenewLock for improved clarity and functionality.
- Removed the AutoRenewRedisLock implementation and its associated tests.
- Adjusted integration and unit tests to reflect the new locking class and its usage in the upgrade_db command.

(cherry picked from commit c812ad9ff26bed3eb59862bd7a5179b7ee83f11f)
2026-02-14 16:28:38 +08:00
L1nSn0w
94603b5408 refactor(api): replace heartbeat mechanism with AutoRenewRedisLock for database migration
- Removed the manual heartbeat function for renewing the Redis lock during database migrations.
- Integrated AutoRenewRedisLock to handle lock renewal automatically, simplifying the upgrade_db command.
- Updated unit tests to reflect changes in lock handling and error management during migrations.

(cherry picked from commit 8814256eb5fa20b29e554264f3b659b027bc4c9a)
2026-02-14 16:28:38 +08:00
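The automatic renewal that replaced the manual heartbeat in 94603b5408 is typically a daemon thread that re-extends the lock's TTL on a fixed interval until told to stop. This standard-library sketch uses a hypothetical AutoRenewer class and an injected extend callable; with redis-py that callable could wrap Lock.extend(). Choosing the interval as a fraction of the TTL is an assumption, not something the commit specifies.

```python
from __future__ import annotations

import logging
import threading
from typing import Callable

logger = logging.getLogger(__name__)


class AutoRenewer:
    """Call extend() every interval_seconds from a daemon thread until stop()."""

    def __init__(self, extend: Callable[[], None], interval_seconds: float) -> None:
        self._extend = extend
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread: threading.Thread | None = None

    def _run(self) -> None:
        # Event.wait() returns True as soon as stop() sets the event,
        # which ends the loop without waiting out a full interval.
        while not self._stop.wait(self._interval):
            try:
                self._extend()
            except Exception:
                # Keep renewing; one failed extend should not kill the thread.
                logger.warning("Lock renewal failed", exc_info=True)

    def start(self) -> None:
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self, join_timeout: float = 1.0) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=join_timeout)
```

Because the thread is a daemon, a crashed process simply stops renewing, and the lock expires after its TTL instead of blocking later migrations indefinitely.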
L1nSn0w
8d4bd5636b refactor(tests): replace hardcoded wait time with constant for clarity
- Introduced HEARTBEAT_WAIT_TIMEOUT_SECONDS constant to improve readability and maintainability of test code.
- Updated test assertions to use the new constant instead of a hardcoded value.

(cherry picked from commit 0d53743d83b03ae0e68fad143711ffa5f6354093)
2026-02-14 16:28:38 +08:00
autofix-ci[bot]
ee0c4a8852 [autofix.ci] apply automated fixes
(cherry picked from commit 326cffa553ffac1bcd39a051c899c35b0ebe997d)
2026-02-14 16:28:38 +08:00
L1nSn0w
6032c598b0 fix(api): improve logging for database migration lock release
- Added a migration_succeeded flag to track the success of database migrations.
- Enhanced logging messages to indicate the status of the migration when releasing the lock, providing clearer context for potential issues.

(cherry picked from commit e74be0392995d16d288eed2175c51148c9e5b9c0)
2026-02-14 16:28:38 +08:00
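The migration_succeeded flag from 6032c598b0 fits a try/finally shape: the flag flips only after the migration returns, and the release path logs it. The function name and the acquire/release/migrate callables are hypothetical. Letting a migration error propagate is in line with the later "make flask upgrade-db fail on error" change, and tolerating release errors echoes 9acdfbde2f.

```python
from __future__ import annotations

import logging
from typing import Callable

logger = logging.getLogger(__name__)


def run_migration_with_lock(
    acquire: Callable[[], bool],
    release: Callable[[], None],
    migrate: Callable[[], None],
) -> bool:
    """Hypothetical sketch: run migrate() under a lock and log the outcome on release."""
    if not acquire():
        logger.info("Database migration lock is held elsewhere; skipping")
        return False
    migration_succeeded = False
    try:
        migrate()  # any exception propagates so the command fails
        migration_succeeded = True
        return True
    finally:
        try:
            release()
        except Exception:
            logger.warning(
                "Failed to release migration lock (migration_succeeded=%s)",
                migration_succeeded,
                exc_info=True,
            )
        else:
            logger.info("Released migration lock (migration_succeeded=%s)", migration_succeeded)
```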
L1nSn0w
afdd5b6c86 feat(api): implement heartbeat mechanism for database migration lock
- Added a heartbeat function to renew the Redis lock during database migrations, preventing long blockages from crashed processes.
- Updated the upgrade_db command to utilize the new locking mechanism with a configurable TTL.
- Removed the deprecated MIGRATION_LOCK_TTL from DeploymentConfig and related files.
- Enhanced unit tests to cover the new lock renewal behavior and error handling during migrations.

(cherry picked from commit a3331c622435f9f215b95f6b0261f43ae56a9d9c)
2026-02-14 16:28:38 +08:00
L1nSn0w
9acdfbde2f feat(api): enhance database migration locking mechanism and configuration
- Introduced a configurable Redis lock TTL for database migrations in DeploymentConfig.
- Updated the upgrade_db command to handle lock release errors gracefully.
- Added documentation for the new MIGRATION_LOCK_TTL environment variable in the .env.example file and docker-compose.yaml.

(cherry picked from commit 4a05fb120622908bc109a3715686706aab3d3b59)
2026-02-14 16:28:38 +08:00
longbingljw
1977e68b2d fix: make flask upgrade-db fail on error (#32024)
(cherry picked from commit d9530f7bb7)
2026-02-14 16:28:38 +08:00
GareArc
f17b51ab3a
Merge branch 'fix/access-mode-sso-verified-e-1.12.1' into deploy/enterprise 2026-02-13 23:41:04 -08:00
Xiyuan Chen
e9a7e8f77f
fix: include sso_verified in access_mode validation (#32325) 2026-02-13 23:40:37 -08:00
GareArc
23c75c7ec7
fix: centralize access_mode validation and support sso_verified
- Add ALLOWED_ACCESS_MODES constant to centralize valid access modes
- Include 'sso_verified' in validation to fix app duplication errors
- Update error message to dynamically list all allowed modes
- Refactor for maintainability: single source of truth for access modes

This fixes the issue where apps with access_mode='sso_verified' could not
be duplicated because the validation in update_app_access_mode() was missing
this mode, even though it was documented in the WebAppSettings model.
2026-02-13 23:29:05 -08:00
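A single source of truth for access modes, as described in 23c75c7ec7, might look like this. The four values are the ones listed in efbdb4c706; the function name validate_access_mode is hypothetical (the commit only names update_app_access_mode()).

```python
from __future__ import annotations

# A tuple keeps the order stable, so the error message is deterministic.
ALLOWED_ACCESS_MODES: tuple[str, ...] = ("public", "private", "private_all", "sso_verified")


def validate_access_mode(access_mode: str) -> str:
    """Hypothetical validator: return the mode unchanged or raise ValueError."""
    if access_mode not in ALLOWED_ACCESS_MODES:
        allowed = ", ".join(ALLOWED_ACCESS_MODES)
        raise ValueError(f"Invalid access_mode {access_mode!r}; expected one of: {allowed}")
    return access_mode
```

Since the error message is built from the same constant, adding a mode later updates both the check and the message.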
GareArc
588e6561dc
Merge branch 'hotfix/e-1.12.1-app-copy-inherit-webapp-permission' into deploy/enterprise 2026-02-13 22:42:35 -08:00
Xiyuan Chen
9e2b28c950
fix(app-copy): inherit web app permission from original app (#32322) 2026-02-13 22:33:51 -08:00
GareArc
efbdb4c706
fix(app-copy): inherit web app permission from original app
When copying an app, no web_app_settings record was created for the copy.
As a result, the enterprise service queried for settings that did not exist
and fell back to default behavior.

This fix ensures copied apps inherit the same access mode as the original:
- If original has explicit settings (public/private/private_all/sso_verified),
  the copy gets the same setting
- If original has no settings (old apps), copy defaults to 'public' to match
  the original's effective permission via fallback

This prevents permission mismatches between original and copied apps and
ensures the enterprise service has explicit settings to query.

Related: langgenius/dify-enterprise#423
2026-02-13 22:11:03 -08:00
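The inheritance rule in efbdb4c706 can be sketched as a pure function that always produces a settings record for the copy. WebAppSettingsSketch and copy_web_app_settings are hypothetical stand-ins for the real model and service code.

```python
from __future__ import annotations

from dataclasses import dataclass

DEFAULT_ACCESS_MODE = "public"  # effective permission of apps without settings


@dataclass(frozen=True)
class WebAppSettingsSketch:
    app_id: str
    access_mode: str


def copy_web_app_settings(
    original: WebAppSettingsSketch | None, new_app_id: str
) -> WebAppSettingsSketch:
    """Always return an explicit settings record for the copied app."""
    if original is not None:
        # Explicit settings (public/private/private_all/sso_verified) carry over.
        return WebAppSettingsSketch(app_id=new_app_id, access_mode=original.access_mode)
    # Older apps without a record behave as "public" via the fallback,
    # so the copy gets that value written explicitly.
    return WebAppSettingsSketch(app_id=new_app_id, access_mode=DEFAULT_ACCESS_MODE)
```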
L1nSn0w
2bbe74be23
fix: make e-1.12.1 enterprise migrations database-agnostic for MySQL/TiDB (#32269)
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-12 15:57:38 +08:00
L1nSn0w
affd07ae94
fix: make e-1.12.1 enterprise migrations database-agnostic for MySQL/TiDB (#32267)
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-12 15:45:24 +08:00
GareArc
76471821d7
Merge branch 'release/e-1.12.1' into deploy/enterprise 2026-02-11 21:43:42 -08:00
NFish
111c76b71f Merge remote-tracking branch 'origin/hotfix/1.12.1-fix.6' into release/e-1.12.1 2026-02-12 13:26:12 +08:00
GareArc
25c457e2ed
Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-02-10 20:12:16 -08:00
GareArc
262b7d4d08
docs(enterprise): add telemetry data dictionary for OTEL signals
- Comprehensive reference for all enterprise telemetry signals
- Documents 3 span types, 10 counters, 6 histograms, 13 log events
- Includes trace correlation model with ASCII diagrams
- Configuration reference for all 8 ENTERPRISE_* variables
- Per-emission-site label tables for metrics
- Full JSON schemas for structured log events
- Content gating behavior and token double-counting warnings
2026-02-10 19:51:14 -08:00
wangxiaolei
793d22754e
fix: fix get_message_event_type return wrong message type (#32019)
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
2026-02-11 11:00:40 +08:00
GareArc
efeae4c46f
Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-02-10 00:31:34 -08:00
GareArc
b5dbabf5d0
feat(telemetry): add missing ID fields for name attributes
- Add dify.credential.id to node execution events
- Add dify.event.id to all telemetry events (APP_CREATED, APP_UPDATED, APP_DELETED, FEEDBACK_CREATED)

This ensures all .name fields have corresponding .id fields for reliable aggregation and deduplication.
2026-02-10 00:09:41 -08:00
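The invariant in b5dbabf5d0, that every .name attribute has a matching .id, is easy to check mechanically, for example in a unit test over emitted attributes. The helper below is hypothetical, and attribute names other than those in the commit are illustrative.

```python
from __future__ import annotations

from typing import Any, Mapping


def missing_id_attributes(attributes: Mapping[str, Any]) -> list[str]:
    """Return the sorted <prefix>.id keys missing for any <prefix>.name key."""
    missing = set()
    for key in attributes:
        if key.endswith(".name"):
            id_key = key[: -len(".name")] + ".id"
            if id_key not in attributes:
                missing.add(id_key)
    return sorted(missing)
```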
GareArc
d207ca3f1e
Merge branch 'deploy/enterprise' of https://github.com/langgenius/dify into deploy/enterprise 2026-02-09 01:57:13 -08:00
GareArc
aa34ec0d25
test(enterprise-telemetry): add unit tests for OTEL bearer auth and insecure flag 2026-02-09 01:44:21 -08:00
GareArc
ffa8aedc48
feat(enterprise-telemetry): wire bearer token auth and configurable insecure flag into OTEL exporter 2026-02-09 01:44:21 -08:00
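Wiring the ENTERPRISE_OTLP_API_KEY value into the exporter, as in ffa8aedc48, usually means sending it as gRPC metadata. This sketch assumes the key travels as an "authorization: Bearer" header and returns keyword arguments for a gRPC OTLP exporter, whose headers parameter accepts a sequence of (key, value) tuples. The helper name is hypothetical. The insecure flag is left out here because 9952a17fed derives it from the endpoint's URL scheme.

```python
from __future__ import annotations

from typing import Any


def otlp_grpc_exporter_kwargs(endpoint: str, api_key: str | None) -> dict[str, Any]:
    """Build keyword arguments for a gRPC OTLP exporter (hypothetical helper).

    When an API key is configured it is sent as gRPC metadata in the form
    "authorization: Bearer <key>"; gRPC requires lowercase metadata keys.
    """
    kwargs: dict[str, Any] = {"endpoint": endpoint}
    if api_key:
        kwargs["headers"] = (("authorization", f"Bearer {api_key}"),)
    return kwargs
```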
GareArc
f78b0f1f36
feat(enterprise-telemetry): add ENTERPRISE_OTLP_API_KEY config field 2026-02-09 01:44:21 -08:00
GareArc
f85275e5f9
test(enterprise-telemetry): add unit tests for OTEL bearer auth and insecure flag 2026-02-09 01:35:17 -08:00