**Problem:**
The telemetry system had unnecessary abstraction layers and bad practices
from the last 3 commits introducing the gateway implementation:
- TelemetryFacade class wrapper around emit() function
- String literals instead of SignalType enum
- Dictionary mapping enum → string instead of enum → enum
- Unnecessary ENTERPRISE_TELEMETRY_GATEWAY_ENABLED feature flag
- Duplicate guard checks scattered across files
- Non-thread-safe TelemetryGateway singleton pattern
- Missing guard in ops_trace_task.py causing RuntimeError spam
**Solution:**
1. Deleted TelemetryFacade - replaced with thin emit() function in core/telemetry/__init__.py
2. Added SignalType enum ('trace' | 'metric_log') to enterprise/telemetry/contracts.py
3. Replaced CASE_TO_TRACE_TASK_NAME dict with CASE_TO_TRACE_TASK: dict[TelemetryCase, TraceTaskName]
4. Deleted is_gateway_enabled() and _emit_legacy() - using existing ENTERPRISE_ENABLED + ENTERPRISE_TELEMETRY_ENABLED instead
5. Extracted _should_drop_ee_only_event() helper to eliminate duplicate checks
6. Moved TelemetryGateway singleton to ext_enterprise_telemetry.py:
- Init once in init_app() for thread-safety
- Access via get_gateway() function
7. Re-added guard to ops_trace_task.py to prevent RuntimeError when EE=OFF but CE tracing enabled
8. Updated 11 caller files to import 'emit as telemetry_emit' instead of 'TelemetryFacade'
**Result:**
- 322 net lines deleted (533 removed, 211 added)
- All 91 tests pass
- Thread-safe singleton pattern
- Cleaner API surface: from TelemetryFacade.emit() to telemetry_emit()
- Proper enum usage throughout
- No RuntimeError spam in EE=OFF + CE=ON scenario
Enables distributed tracing for nested workflows across all trace providers
(Langfuse, LangSmith, community providers). When a workflow invokes another
workflow via workflow-as-tool, the child workflow now includes parent context
attributes that allow trace systems to reconstruct the full execution tree.
Changes:
- Add parent_trace_context field to WorkflowTool
- Set parent context in tool node when invoking workflow-as-tool
- Extract and pass parent context through app generator
This is a community enhancement (ungated) that improves distributed tracing
for all users. Parent context includes: trace_id, node_execution_id,
workflow_run_id, and app_id.
The logic was inverted - we were blocking all CE traces and only allowing
enterprise traces. The correct logic should be:
- Allow all CE traces (workflow, message, tool, etc.)
- Only block enterprise-only traces when enterprise telemetry is disabled
Before: if event.name not in _ENTERPRISE_ONLY_TRACES: return
After: if event.name in _ENTERPRISE_ONLY_TRACES and not is_enterprise_telemetry_enabled(): return
Changes:
- Change TelemetryEvent.name from str to TraceTaskName enum for type safety
- Remove hardcoded trace_task_name_map from facade (no mapping needed)
- Add centralized enterprise-only filter in TelemetryFacade.emit()
- Rename is_telemetry_enabled() to is_enterprise_telemetry_enabled()
- Update all 11 call sites to pass TraceTaskName enum values
- Remove redundant enterprise guard from draft_trace.py
- Add unit tests for TelemetryFacade.emit() routing (6 tests)
- Add unit tests for TraceQueueManager telemetry guard (5 tests)
- Fix test fixture scoping issue for full test suite compatibility
- Fix tenant_id handling in agent tool callback handler
Benefits:
- 100% type-safe: basedpyright catches errors at compile time
- No string literals: eliminates entire class of typo bugs
- Single point of control: centralized filtering in facade
- All guards removed except facade
- Zero regressions: 4887 tests passing
Verification:
- make lint: PASS
- make type-check: PASS (0 errors, 0 warnings)
- pytest: 4887 passed, 8 skipped
Migrate from direct TraceQueueManager.add_trace_task calls to TelemetryFacade.emit
with TelemetryEvent abstraction. This reduces CE code invasion by consolidating
telemetry logic in core/telemetry/ with a single guard in ops_trace_manager.py.
The model_provider field in prompt generation traces was being incorrectly
extracted by parsing the model name (e.g., 'deepseek-chat'), which resulted
in an empty string when the model name didn't contain a '/' character.
Now extracts the provider directly from the model_config parameter, with
a fallback to the old parsing logic for backward compatibility.
Changes:
- Update _emit_prompt_generation_trace to accept model_config parameter
- Extract provider from model_config.get('provider') when available
- Update all 6 call sites to pass model_config
- Maintain backward compatibility with fallback logic
- get_ops_trace_instance was trying to query App table with storage_id format "tenant-{uuid}"
- This caused psycopg2.errors.InvalidTextRepresentation when app_id is None
- Added early return for tenant-prefixed storage identifiers to skip App lookup
- Enterprise telemetry still works correctly with these storage IDs
- The no_variable=false code path in generate_rule_config was missing trace emission
- Added timing wrapper and _emit_prompt_generation_trace call to ensure metrics/logs are captured
- Trace now emitted on both success and failure cases for consistency with no_variable=true path
- Forward kwargs to message_trace to preserve node_execution_id
- Add node_execution_id extraction to all trace methods
- Add app_id parameter to prompt generation API endpoints
- Enable app_id tracing for rule_generate, code_generate, and structured_output operations
Remove app_id parameter from three endpoints and update trace manager to use
tenant_id as storage identifier when app_id is unavailable. This allows
standalone prompt generation utilities to emit telemetry.
Changes:
- controllers/console/app/generator.py: Remove app_id=None from 3 endpoints
(RuleGenerateApi, RuleCodeGenerateApi, RuleStructuredOutputGenerateApi)
- core/ops/ops_trace_manager.py: Use tenant_id fallback in send_to_celery
- Extract tenant_id from task.kwargs when app_id is None
- Use 'tenant-{tenant_id}' format as storage identifier
- Skip traces only if neither app_id nor tenant_id available
The trace metadata still contains the actual tenant_id, so enterprise
telemetry correctly emits metrics and logs grouped by tenant.
- Add PromptGenerationTraceInfo trace entity with operation_type field
- Implement telemetry for rule-generate, code-generate, structured-output, instruction-modify operations
- Emit metrics: tokens (total/input/output), duration histogram, requests counter, errors counter
- Emit structured logs with model info and operation context
- Content redaction controlled by ENTERPRISE_INCLUDE_CONTENT env var
- Fix user_id propagation in TraceTask kwargs
- Fix latency calculation when llm_result is None
No spans exported - metrics and logs only for lightweight observability.
The backend part of the human in the loop (HITL) feature and relevant architecture / workflow engine changes.
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: -LAN- <laipz8200@outlook.com>
Co-authored-by: 盐粒 Yanli <yanli@dify.ai>
Co-authored-by: CrabSAMA <40541269+CrabSAMA@users.noreply.github.com>
Co-authored-by: Stephen Zhou <38493346+hyoban@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: yihong <zouzou0208@gmail.com>
Co-authored-by: Joel <iamjoel007@gmail.com>