Commit Graph

8915 Commits

Author SHA1 Message Date
GareArc 4e3112bd7f
feat(telemetry): add enterprise OTEL telemetry with gateway, traces, metrics, and logs 2026-02-06 01:02:19 -08:00
GareArc 576eca2113
Merge branch '1.12.1-otel-ee' into deploy/enterprise 2026-02-05 23:07:48 -08:00
GareArc 8ded2d73f0
fix(telemetry): move EE guard to gateway routing level
Prevents CE users from enqueueing EE-only events (all METRIC_LOG cases)
to non-existent enterprise_telemetry Celery queue.

- Add _should_drop_ee_only_event() check in emit() before routing
- Remove redundant check from _emit_trace()
- Single guard at gateway level protects both trace and metric/log paths
2026-02-05 22:58:40 -08:00
GareArc 4a9b74f86b
refactor(telemetry): simplify by eliminating TelemetryFacade
**Problem:**
The telemetry system had unnecessary abstraction layers and bad practices
from the last 3 commits introducing the gateway implementation:
- TelemetryFacade class wrapper around emit() function
- String literals instead of SignalType enum
- Dictionary mapping enum → string instead of enum → enum
- Unnecessary ENTERPRISE_TELEMETRY_GATEWAY_ENABLED feature flag
- Duplicate guard checks scattered across files
- Non-thread-safe TelemetryGateway singleton pattern
- Missing guard in ops_trace_task.py causing RuntimeError spam

**Solution:**
1. Deleted TelemetryFacade - replaced with thin emit() function in core/telemetry/__init__.py
2. Added SignalType enum ('trace' | 'metric_log') to enterprise/telemetry/contracts.py
3. Replaced CASE_TO_TRACE_TASK_NAME dict with CASE_TO_TRACE_TASK: dict[TelemetryCase, TraceTaskName]
4. Deleted is_gateway_enabled() and _emit_legacy() - using existing ENTERPRISE_ENABLED + ENTERPRISE_TELEMETRY_ENABLED instead
5. Extracted _should_drop_ee_only_event() helper to eliminate duplicate checks
6. Moved TelemetryGateway singleton to ext_enterprise_telemetry.py:
   - Init once in init_app() for thread-safety
   - Access via get_gateway() function
7. Re-added guard to ops_trace_task.py to prevent RuntimeError when EE=OFF but CE tracing enabled
8. Updated 11 caller files to import 'emit as telemetry_emit' instead of 'TelemetryFacade'

**Result:**
- 322 net lines deleted (533 removed, 211 added)
- All 91 tests pass
- Thread-safe singleton pattern
- Cleaner API surface: from TelemetryFacade.emit() to telemetry_emit()
- Proper enum usage throughout
- No RuntimeError spam in EE=OFF + CE=ON scenario
2026-02-05 22:41:09 -08:00
Xiyuan Chen d7c3ae50dc Update api/services/tools/builtin_tools_manage_service.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-02-06 13:37:37 +08:00
NFish b921711e9e fix: hide invite button if current user is not workspace manager (#31742) 2026-02-06 13:37:37 +08:00
yunlu.wen fb38ad84e1 chore: upgrade deps, see pull #30976 2026-02-06 13:37:33 +08:00
Yunlu Wen 91c854b5be chore: sync enterprise release (#31626)
Co-authored-by: zhsama <torvalds@linux.do>
2026-02-06 13:35:28 +08:00
NFish d35b231941 fix: enterprise CVE 2026 23864 (#31599) 2026-02-06 13:35:22 +08:00
GareArc 849b4b8c40 fix: add TYPE_CHECKING import for Account type annotation 2026-02-06 13:32:20 +08:00
GareArc 990e8feee8 security: fix IDOR and privilege escalation in set_default_provider
- Add tenant_id verification to prevent IDOR attacks
- Add admin check for enterprise tenant-wide default changes
- Preserve non-enterprise behavior (users can set own defaults)
2026-02-06 13:32:18 +08:00
GareArc 53641019b1 fix: remove user_id filter when clearing default provider (enterprise only)
When setting a new default credential in enterprise mode, the code was
only clearing is_default for credentials matching the current user_id.
This caused issues when:
1. Enterprise credential A (synced with system user_id) was default
2. User sets local credential B as default
3. A still had is_default=true (different user_id)
4. Both A and B were considered defaults

The fix removes user_id from the filter only for enterprise deployments,
since enterprise credentials may have different user_id than local ones.
Non-enterprise behavior is unchanged to avoid breaking existing setups.

Fixes EE-1511
2026-02-06 13:31:50 +08:00
GareArc d1f10ff301 feat: add redis mq for account deletion cleanup 2026-02-06 13:31:50 +08:00
Xiyuan Chen c8027e168b feat: implement workspace permission checks for member invitations an… (#31202) 2026-02-06 13:31:46 +08:00
NFish aae3f76999 feat: ee workspace permission control (#30841) 2026-02-06 13:31:26 +08:00
NFish 2860c72b03 feat: ee workspace permission control (#30841) 2026-02-06 13:13:06 +08:00
GareArc 4d47339ce6
feat: Add parent trace context propagation for workflow-as-tool hierarchy
Enables distributed tracing for nested workflows across all trace providers
(Langfuse, LangSmith, community providers). When a workflow invokes another
workflow via workflow-as-tool, the child workflow now includes parent context
attributes that allow trace systems to reconstruct the full execution tree.

Changes:
- Add parent_trace_context field to WorkflowTool
- Set parent context in tool node when invoking workflow-as-tool
- Extract and pass parent context through app generator

This is a community enhancement (ungated) that improves distributed tracing
for all users. Parent context includes: trace_id, node_execution_id,
workflow_run_id, and app_id.
2026-02-05 20:19:29 -08:00
GareArc 6e47e163b8
fix(telemetry): use atomic Redis SET NX for idempotency and register Celery queue 2026-02-05 20:15:34 -08:00
GareArc 1663a7ab4c
feat(telemetry): add gateway diagnostics and verify integration
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-Claude)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-02-05 20:15:13 -08:00
GareArc 51b0c5c89c
feat(telemetry): implement gateway routing and enqueue logic
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-Claude)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-02-05 20:15:13 -08:00
GareArc 752b01ae91
refactor(telemetry): migrate event handlers to gateway-only producers
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-Claude)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-02-05 20:15:12 -08:00
GareArc 3d3e8d75d8
feat(telemetry): add gateway envelope contracts and routing table 2026-02-05 20:15:12 -08:00
GareArc 55c0fe503d
fix(telemetry): correct enterprise-only trace filtering logic
The logic was inverted - we were blocking all CE traces and only allowing
enterprise traces. The correct logic should be:
- Allow all CE traces (workflow, message, tool, etc.)
- Only block enterprise-only traces when enterprise telemetry is disabled

Before: if event.name not in _ENTERPRISE_ONLY_TRACES: return
After: if event.name in _ENTERPRISE_ONLY_TRACES and not is_enterprise_telemetry_enabled(): return
2026-02-05 20:15:12 -08:00
GareArc adadf1ec5f
refactor(telemetry): migrate to type-safe enum-based event routing with centralized enterprise filtering
Changes:
- Change TelemetryEvent.name from str to TraceTaskName enum for type safety
- Remove hardcoded trace_task_name_map from facade (no mapping needed)
- Add centralized enterprise-only filter in TelemetryFacade.emit()
- Rename is_telemetry_enabled() to is_enterprise_telemetry_enabled()
- Update all 11 call sites to pass TraceTaskName enum values
- Remove redundant enterprise guard from draft_trace.py
- Add unit tests for TelemetryFacade.emit() routing (6 tests)
- Add unit tests for TraceQueueManager telemetry guard (5 tests)
- Fix test fixture scoping issue for full test suite compatibility
- Fix tenant_id handling in agent tool callback handler

Benefits:
- 100% type-safe: basedpyright catches errors at compile time
- No string literals: eliminates entire class of typo bugs
- Single point of control: centralized filtering in facade
- All guards removed except facade
- Zero regressions: 4887 tests passing

Verification:
- make lint: PASS
- make type-check: PASS (0 errors, 0 warnings)
- pytest: 4887 passed, 8 skipped
2026-02-05 20:15:12 -08:00
GareArc ed222945aa
refactor(telemetry): introduce TelemetryFacade to centralize event emission
Migrate from direct TraceQueueManager.add_trace_task calls to TelemetryFacade.emit
with TelemetryEvent abstraction. This reduces CE code invasion by consolidating
telemetry logic in core/telemetry/ with a single guard in ops_trace_manager.py.
2026-02-05 20:15:11 -08:00
GareArc 2d60be311d
fix: extract model_provider from model_config in prompt generation trace
The model_provider field in prompt generation traces was being incorrectly
extracted by parsing the model name (e.g., 'deepseek-chat'), which resulted
in an empty string when the model name didn't contain a '/' character.

Now extracts the provider directly from the model_config parameter, with
a fallback to the old parsing logic for backward compatibility.

Changes:
- Update _emit_prompt_generation_trace to accept model_config parameter
- Extract provider from model_config.get('provider') when available
- Update all 6 call sites to pass model_config
- Maintain backward compatibility with fallback logic
2026-02-05 20:15:11 -08:00
GareArc 80ee2e982e
fix(telemetry): prevent UUID validation error for tenant-prefixed storage IDs
- get_ops_trace_instance was trying to query App table with storage_id format "tenant-{uuid}"
- This caused psycopg2.errors.InvalidTextRepresentation when app_id is None
- Added early return for tenant-prefixed storage identifiers to skip App lookup
- Enterprise telemetry still works correctly with these storage IDs
2026-02-05 20:15:11 -08:00
GareArc 5bbc938a0d
fix(telemetry): add prompt generation trace emission for no_variable=false path
- The no_variable=false code path in generate_rule_config was missing trace emission
- Added timing wrapper and _emit_prompt_generation_trace call to ensure metrics/logs are captured
- Trace now emitted on both success and failure cases for consistency with no_variable=true path
2026-02-05 20:15:10 -08:00
GareArc 052f50805f
feat(telemetry): add node_execution_id and app_id support to trace metadata
- Forward kwargs to message_trace to preserve node_execution_id
- Add node_execution_id extraction to all trace methods
- Add app_id parameter to prompt generation API endpoints
- Enable app_id tracing for rule_generate, code_generate, and structured_output operations
2026-02-05 20:15:10 -08:00
GareArc f5043a8ac8
fix(telemetry): enable metrics and logs for standalone prompt generation
Remove app_id parameter from three endpoints and update trace manager to use
tenant_id as storage identifier when app_id is unavailable. This allows
standalone prompt generation utilities to emit telemetry.

Changes:
- controllers/console/app/generator.py: Remove app_id=None from 3 endpoints
  (RuleGenerateApi, RuleCodeGenerateApi, RuleStructuredOutputGenerateApi)
- core/ops/ops_trace_manager.py: Use tenant_id fallback in send_to_celery
  - Extract tenant_id from task.kwargs when app_id is None
  - Use 'tenant-{tenant_id}' format as storage identifier
  - Skip traces only if neither app_id nor tenant_id available

The trace metadata still contains the actual tenant_id, so enterprise
telemetry correctly emits metrics and logs grouped by tenant.
2026-02-05 20:15:10 -08:00
GareArc a4bebbb5b5
fix(telemetry): remove app_id parameter from standalone prompt generation endpoints
Remove app_id=None from three prompt generation endpoints that lack proper
app context. These standalone utilities only have tenant_id available, so
we don't pass app_id at all rather than passing incomplete information.

Affected endpoints:
- /rule-generate (RuleGenerateApi)
- /code-generate (RuleCodeGenerateApi)
- /structured-output-generate (RuleStructuredOutputGenerateApi)
2026-02-05 20:15:10 -08:00
GareArc 22c8d8d772
feat(telemetry): add prompt generation telemetry to Enterprise OTEL
- Add PromptGenerationTraceInfo trace entity with operation_type field
- Implement telemetry for rule-generate, code-generate, structured-output, instruction-modify operations
- Emit metrics: tokens (total/input/output), duration histogram, requests counter, errors counter
- Emit structured logs with model info and operation context
- Content redaction controlled by ENTERPRISE_INCLUDE_CONTENT env var
- Fix user_id propagation in TraceTask kwargs
- Fix latency calculation when llm_result is None

No spans exported - metrics and logs only for lightweight observability.
2026-02-05 20:14:49 -08:00
GareArc e67afa7a5b
feat(telemetry): add input/output token metrics and fix trace cleanup
- Add dify.tokens.input and dify.tokens.output OTEL metrics
- Remove token split from trace log attributes (keep metrics only)
- Emit split token metrics for workflows and node executions
- Gracefully handle trace file deletion failures to prevent task crashes

BREAKING: None
MIGRATION: None
2026-02-05 20:12:30 -08:00
GareArc 8ceb1ed96f
feat(telemetry): add input/output token split to enterprise OTEL traces
- Add PROMPT_TOKENS and COMPLETION_TOKENS to WorkflowNodeExecutionMetadataKey
- Store prompt/completion tokens in node execution metadata JSON (no schema change)
- Calculate workflow-level token split by summing node executions on-the-fly
- Export gen_ai.usage.input_tokens and output_tokens to enterprise telemetry
- Add semantic convention constants for token attributes
- Maintain backward compatibility (historical data shows null)

BREAKING: None
MIGRATION: None (uses JSON metadata, no schema changes)
2026-02-05 20:12:30 -08:00
GareArc 701f02f853
feat(telemetry): add invoked_by user tracking to enterprise OTEL 2026-02-05 20:12:29 -08:00
GareArc 639fb304ca
fix(enterprise): Remove OTEL log export 2026-02-05 20:12:29 -08:00
GareArc df44e79599
feat(enterprise): Add independent metrics export with dedicated MeterProvider
- Create dedicated MeterProvider instance (independent from ext_otel.py)
- Add create_metric_exporter() to _ExporterFactory with HTTP/gRPC support
- Enterprise metrics now work without requiring standard OTEL to be enabled
- Add MeterProvider shutdown to cleanup lifecycle
- Update module docstring to reflect full independence (Tracer, Logger, Meter)
2026-02-05 20:12:29 -08:00
GareArc 0497fd7469
fix(enterprise): Scope log handler to telemetry logger only
Only export structured telemetry logs, not all application logs. The attach_log_handler method now attaches to the 'dify.telemetry' logger instead of the root logger.
2026-02-05 20:12:29 -08:00
GareArc bb3fcbfd5c
feat(enterprise): Add gRPC protocol support for OTLP telemetry
- Add ENTERPRISE_OTLP_PROTOCOL config (http/grpc, default: http)
- Introduce _ExporterFactory class for protocol-agnostic exporter creation
- Support both HTTP and gRPC OTLP endpoints for traces and logs
- Refactor endpoint path handling into factory methods
2026-02-05 20:12:28 -08:00
GareArc 4d7ab24eb1
feat(enterprise): Add OTEL logs export with span_id correlation
- Add ENTERPRISE_OTEL_LOGS_ENABLED and ENTERPRISE_OTLP_LOGS_ENDPOINT config options
- Implement EnterpriseLoggingHandler for log record translation with trace/span ID parsing
- Add LoggerProvider and BatchLogRecordProcessor for OTLP log export
- Correlate telemetry logs with spans via span_id_source parameter
- Attach log handler during enterprise telemetry initialization
2026-02-05 20:12:28 -08:00
GareArc 3461c3a8ef
feat(enterprise): Add OTEL telemetry with slim traces, metrics, and structured logs
- Add EnterpriseOtelTrace handler with span emission for workflows and nodes
- Implement minimal-span strategy: slim spans + detailed companion logs
- Add deterministic span/trace IDs for cross-workflow trace correlation
- Add metric collection at 100% accuracy (counters & histograms)
- Add event handlers for app lifecycle and feedback telemetry
- Add cross-workflow trace linking with parent context propagation
- Add OTEL exporter with configurable sampling and privacy controls
- Wire enterprise telemetry into workflow execution pipeline
- Add telemetry configuration in enterprise configs
2026-02-05 20:12:28 -08:00
wangxiaolei cd03e0a9ef fix: fix delete_draft_variables_batch cycle forever (#31934)
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
2026-02-04 19:42:50 +08:00
zxhlyh df2421d187 fix: auto summary env (#31930) 2026-02-04 19:42:26 +08:00
QuantumGhost 0ba321d840 chore: bump version in docker-compose and package manager to 1.12.1 (#31947) 2026-02-04 19:41:50 +08:00
Stephen Zhou d8402f686e
fix: base url in client (#31902)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-02-04 12:09:22 +08:00
Tomo 8bd8dee767
fix(docker): improve IRIS data persistence with proper Durable %SYS (#31901)
Co-authored-by: Tomo Okuyama <tomo.okuyama@intersystems.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-04 11:39:26 +08:00
Tomo 05f2764d7c
fix(docker): persist IRIS data across container recreation using Durable %SYS (#31899)
Co-authored-by: Tomo Okuyama <tomo.okuyama@intersystems.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
2026-02-04 09:57:46 +08:00
Asuka Minato f5d6c250ed
fix: "refactor: port api/controllers/console/tag/tags.py to ov3" (#31887)
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
2026-02-03 22:18:53 +08:00
niveshdandyan 45daec7541
refactor: replace line-clamp package with native CSS (#31877)
Co-authored-by: OSS Contributor <oss-contributor@example.com>
Co-authored-by: Claude (claude-opus-4-5) <noreply@anthropic.com>
Co-authored-by: niveshdandyan <niveshdandyan@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
2026-02-03 22:14:18 +08:00
盐粒 Yanli c14a8bb437
chore(dev): use strict bash mode for pytest (#31873) 2026-02-03 19:42:42 +08:00