mirror of https://github.com/langgenius/dify.git
refactor(telemetry): simplify by eliminating TelemetryFacade
**Problem:**
The telemetry system had unnecessary abstraction layers and bad practices
from the last 3 commits introducing the gateway implementation:
- TelemetryFacade class wrapper around emit() function
- String literals instead of SignalType enum
- Dictionary mapping enum → string instead of enum → enum
- Unnecessary ENTERPRISE_TELEMETRY_GATEWAY_ENABLED feature flag
- Duplicate guard checks scattered across files
- Non-thread-safe TelemetryGateway singleton pattern
- Missing guard in ops_trace_task.py causing RuntimeError spam
**Solution:**
1. Deleted TelemetryFacade - replaced with thin emit() function in core/telemetry/__init__.py
2. Added SignalType enum ('trace' | 'metric_log') to enterprise/telemetry/contracts.py
3. Replaced CASE_TO_TRACE_TASK_NAME dict with CASE_TO_TRACE_TASK: dict[TelemetryCase, TraceTaskName]
4. Deleted is_gateway_enabled() and _emit_legacy() - using existing ENTERPRISE_ENABLED + ENTERPRISE_TELEMETRY_ENABLED instead
5. Extracted _should_drop_ee_only_event() helper to eliminate duplicate checks
6. Moved TelemetryGateway singleton to ext_enterprise_telemetry.py:
- Init once in init_app() for thread-safety
- Access via get_gateway() function
7. Re-added guard to ops_trace_task.py to prevent RuntimeError when EE=OFF but CE tracing enabled
8. Updated 11 caller files to import 'emit as telemetry_emit' instead of 'TelemetryFacade'
**Result:**
- 322 net lines deleted (533 removed, 211 added)
- All 91 tests pass
- Thread-safe singleton pattern
- Cleaner API surface: from TelemetryFacade.emit() to telemetry_emit()
- Proper enum usage throughout
- No RuntimeError spam in EE=OFF + CE=ON scenario
This commit is contained in:
parent
4d47339ce6
commit
4a9b74f86b
|
|
@ -0,0 +1,76 @@
|
|||
# Task 6: Integration Verification & Diagnostics
|
||||
|
||||
## Date: 2026-02-05
|
||||
|
||||
### Diagnostic Implementation
|
||||
|
||||
Added operational diagnostics to `EnterpriseMetricHandler`:
|
||||
|
||||
1. **Diagnostic Counter Method** (`_increment_diagnostic_counter`):
|
||||
- Logs diagnostic events at DEBUG level
|
||||
- Fail-safe: exceptions don't break processing
|
||||
- Counter names: `enterprise_telemetry.handler.{counter_name}`
|
||||
- Labels: optional dict for case-specific tracking
|
||||
|
||||
2. **Counter Points Added**:
|
||||
- `deduped_total`: Incremented when duplicate events are skipped
|
||||
- `processed_total`: Incremented after each case handler (with case label)
|
||||
- `rehydration_failed_total`: Incremented when payload rehydration fails
|
||||
|
||||
3. **Gateway Logging**:
|
||||
- DEBUG log when gateway is disabled (legacy path)
|
||||
- DEBUG log for each routing decision (case, signal_type, ce_eligible)
|
||||
|
||||
### Test Results
|
||||
|
||||
- **Enterprise telemetry tests**: 87/87 PASSED
|
||||
- **Full unit test suite**: 4981/4981 PASSED (excluding pre-existing test_event_handlers.py name collision)
|
||||
- **Lint**: Clean (ruff)
|
||||
- **Type check**: Clean (basedpyright)
|
||||
|
||||
### Key Patterns
|
||||
|
||||
1. **Diagnostic Logging Pattern**:
|
||||
```python
|
||||
def _increment_diagnostic_counter(self, counter_name: str, labels: dict[str, str] | None = None) -> None:
|
||||
try:
|
||||
# Get exporter, log at DEBUG level
|
||||
logger.debug("Diagnostic counter: %s, labels=%s", full_counter_name, labels or {})
|
||||
except Exception:
|
||||
logger.debug("Failed to increment diagnostic counter: %s", counter_name, exc_info=True)
|
||||
```
|
||||
|
||||
2. **Gateway Routing Diagnostics**:
|
||||
```python
|
||||
logger.debug(
|
||||
"Gateway routing: case=%s, signal_type=%s, ce_eligible=%s",
|
||||
case, route.signal_type, route.ce_eligible,
|
||||
)
|
||||
```
|
||||
|
||||
### Pre-existing Issues Noted
|
||||
|
||||
- Test file name collision: `test_event_handlers.py` exists in both:
|
||||
- `tests/unit_tests/enterprise/telemetry/`
|
||||
- `tests/unit_tests/core/workflow/graph_engine/event_management/`
|
||||
- Workaround: exclude one during test runs
|
||||
- Not related to this refactor
|
||||
|
||||
- Type annotation issue in `_on_feedback_created`:
|
||||
- `attrs: dict` should be `attrs: dict[str, Any]`
|
||||
- Pre-existing, not introduced by this task
|
||||
|
||||
### Verification Checklist
|
||||
|
||||
- [x] Diagnostic counters added to metric handler
|
||||
- [x] DEBUG logging added to gateway
|
||||
- [x] All telemetry tests pass
|
||||
- [x] Full unit test suite passes
|
||||
- [x] Lint clean
|
||||
- [x] Type check clean
|
||||
- [x] Feature flag toggle verified (OFF: legacy, ON: gateway)
|
||||
- [x] No regressions
|
||||
|
||||
### Next Steps
|
||||
|
||||
Ready for production deployment with feature flag control.
|
||||
|
|
@ -106,10 +106,10 @@ ignore = [
|
|||
"N803", # invalid-argument-name
|
||||
]
|
||||
"tests/*" = [
|
||||
"F811", # redefined-while-unused
|
||||
"T201", # allow print in tests,
|
||||
"S110", # allow ignoring exceptions in tests code (currently)
|
||||
|
||||
"F811", # redefined-while-unused
|
||||
"T201", # allow print in tests,
|
||||
"S110", # allow ignoring exceptions in tests code (currently)
|
||||
"PT019", # @patch-injected params look like unused fixtures
|
||||
]
|
||||
"controllers/console/explore/trial.py" = ["TID251"]
|
||||
"controllers/console/human_input_form.py" = ["TID251"]
|
||||
|
|
|
|||
|
|
@ -0,0 +1,8 @@
|
|||
{
|
||||
"active_plan": "/Users/gareth/Documents/Code/dify/api/.sisyphus/plans/enterprise-telemetry-gateway-refactor.md",
|
||||
"started_at": "2026-02-06T02:58:22.204Z",
|
||||
"session_ids": [
|
||||
"ses_3cfc17c5fffeBUMFsRxeFEXuNw"
|
||||
],
|
||||
"plan_name": "enterprise-telemetry-gateway-refactor"
|
||||
}
|
||||
|
|
@ -0,0 +1,11 @@
|
|||
# Decisions
|
||||
|
||||
## [2026-02-06T02:58:22Z] Session Start: ses_3cfc17c5fffeBUMFsRxeFEXuNw
|
||||
|
||||
Architectural decisions from planning phase:
|
||||
- Gateway is before-queue, not after-queue
|
||||
- TelemetryFacade fully replaced (deleted), not kept as alias
|
||||
- Two transport paths: trace → TraceQueueManager; metric/log → new enterprise Celery queue
|
||||
- Idempotency via Redis TTL (telemetry:dedup:{tenant_id}:{event_id}, 1h TTL)
|
||||
- Feature flag ENTERPRISE_TELEMETRY_GATEWAY_ENABLED for rollout
|
||||
|
||||
|
|
@ -0,0 +1,6 @@
|
|||
# Issues
|
||||
|
||||
## [2026-02-06T02:58:22Z] Session Start: ses_3cfc17c5fffeBUMFsRxeFEXuNw
|
||||
|
||||
No issues yet.
|
||||
|
||||
|
|
@ -0,0 +1,6 @@
|
|||
# Problems
|
||||
|
||||
## [2026-02-06T02:58:22Z] Session Start: ses_3cfc17c5fffeBUMFsRxeFEXuNw
|
||||
|
||||
No unresolved blockers yet.
|
||||
|
||||
|
|
@ -0,0 +1,795 @@
|
|||
# Enterprise Telemetry Gateway Refactor
|
||||
|
||||
## TL;DR
|
||||
|
||||
> **Quick Summary**: Refactor enterprise telemetry into a unified gateway pattern. Gateway becomes the single entrance for all telemetry data, making two routing decisions (data type + CE eligibility), then dispatching to existing trace pipeline or a new enterprise metric/log pipeline. CE trace path stays completely untouched.
|
||||
>
|
||||
> **Deliverables**:
|
||||
> - `TelemetryGateway` — single entrance, routing decisions, before-queue (handles both EE and CE routing)
|
||||
> - `EnterpriseMetricHandler` — after-queue case processor for metric/log events
|
||||
> - Envelope contracts (Pydantic models) for queue payloads
|
||||
> - Dedicated Celery queue + worker for enterprise metric/log events
|
||||
> - Idempotency store (Redis TTL) for counter deduplication
|
||||
> - Event handlers migrated to enqueue-only producers
|
||||
>
|
||||
> **Estimated Effort**: Medium (multiple PRs, ~2-3 days)
|
||||
> **Parallel Execution**: YES - 2 waves
|
||||
> **Critical Path**: Task 1 → Task 2 → Task 3 → Task 4 → Task 5 → Task 6
|
||||
|
||||
---
|
||||
|
||||
## Context
|
||||
|
||||
### Original Request
|
||||
Refactor enterprise telemetry so that:
|
||||
1. A single gateway is the only entrance for all telemetry data.
|
||||
2. Gateway routes by data type (trace vs metric/log) and CE eligibility.
|
||||
3. Metric/log-only events move off the synchronous request path into async processing.
|
||||
4. Large payloads are handled via pointer+fallback pattern.
|
||||
5. CE trace pipeline remains completely unchanged.
|
||||
|
||||
### Interview Summary
|
||||
**Key Discussions**:
|
||||
- Gateway lives before-queue (producer-facing), not after-queue.
|
||||
- "Gateway" in earlier drafts was actually a case handler; renamed to `EnterpriseMetricHandler`.
|
||||
- `EnterpriseOtelTrace` stays as enterprise trace signal handler (spans + companion logs + counters for trace-shaped events). Gateway does NOT replace it for trace cases.
|
||||
- `TraceQueueManager` stays as dumb transport (no routing logic changes).
|
||||
- Unified enqueue routes to correct queue based on data type classification.
|
||||
- Event handlers (`event_handlers.py`) become enqueue-only producers.
|
||||
- Oracle review confirmed: keep CE dispatch in `process_trace_tasks`, gateway enterprise-only for metric/log path, two transport paths.
|
||||
|
||||
**Research Findings**:
|
||||
- 5 scattered routing checks today across `facade.py`, `ops_trace_manager.py`, `ops_trace_task.py`, `enterprise_trace.py`. Gateway consolidates producer-side decisions to 2 checks in 1 place.
|
||||
- `_ENTERPRISE_ONLY_TRACES`: `DRAFT_NODE_EXECUTION_TRACE`, `NODE_EXECUTION_TRACE`, `PROMPT_GENERATION_TRACE`.
|
||||
- `EnterpriseOtelTrace` (845 lines) has 3 span methods + 7 metric-only methods + shared helpers.
|
||||
- Community trace instances (Langfuse, MLflow, Langsmith, Weave, Opik, Aliyun, Tencent, ArizePhoenix) all extend `BaseTraceInstance`.
|
||||
|
||||
### Metis Review
|
||||
**Identified Gaps** (addressed below):
|
||||
- No rollback strategy → feature flag added to plan
|
||||
- Failure modes undefined → degraded-path handling specified per task
|
||||
- Idempotency spec incomplete → Redis TTL key schema defined
|
||||
- Dual-path events → addressed: trace events go to trace queue where `process_trace_tasks` already dispatches to both enterprise + CE
|
||||
- Blinker handler async safety → validation step added before migration
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
|
||||
### Flow Diagram
|
||||
|
||||
```
|
||||
BEFORE (current):
|
||||
|
||||
Business code → TelemetryFacade.emit() → TraceQueueManager → Celery → process_trace_tasks
|
||||
├── EnterpriseOtelTrace (EE)
|
||||
└── trace_instance (CE)
|
||||
|
||||
event_handlers.py → emit_metric_only_event() + increment_counter() [SYNC in request path]
|
||||
|
||||
|
||||
AFTER (proposed):
|
||||
|
||||
Business code ─┐
|
||||
├──→ TelemetryGateway.emit() (replaces TelemetryFacade entirely)
|
||||
event_handlers ┘ │
|
||||
├── Decision 1: data type
|
||||
│ trace-shaped? ──→ TraceQueueManager (EXISTING, unchanged)
|
||||
│ → Celery ops_trace queue
|
||||
│ → process_trace_tasks
|
||||
│ ├── EnterpriseOtelTrace (EE)
|
||||
│ └── trace_instance (CE)
|
||||
│
|
||||
│ metric/log? ───→ Celery enterprise_telemetry queue (NEW)
|
||||
│ → EnterpriseMetricHandler.handle(envelope)
|
||||
│ → case routing → emit/counter functions
|
||||
│
|
||||
└── Decision 2: CE eligibility (judged before decision 1)
|
||||
enterprise-only + EE disabled → DROP
|
||||
otherwise → enqueue
|
||||
```
|
||||
|
||||
### Component Responsibilities
|
||||
|
||||
```
|
||||
Component Owns Does NOT own
|
||||
─────────────────────── ────────────────────────── ─────────────────────
|
||||
TelemetryGateway routing decisions (2 checks) processing logic
|
||||
(NEW, before-queue) envelope creation OTEL SDK calls
|
||||
queue selection data construction
|
||||
|
||||
EnterpriseMetricHandler case-by-case metric/log policy transport/queue
|
||||
(NEW, after-queue worker) rehydration (ref→data) trace dispatch
|
||||
idempotency enforcement CE routing
|
||||
emit_metric_only_event calls
|
||||
counter/histogram calls
|
||||
|
||||
TraceQueueManager in-process batching routing decisions
|
||||
(UNCHANGED) Celery handoff business policy
|
||||
|
||||
process_trace_tasks enterprise vs CE dispatch routing decisions
|
||||
(UNCHANGED) file cleanup metric-only events
|
||||
|
||||
EnterpriseOtelTrace span emit (workflow/node/draft) metric-only events
|
||||
(UNCHANGED initially) companion logs + trace counters (after handler exists)
|
||||
|
||||
EnterpriseExporter OTEL SDK transport business decisions
|
||||
(UNCHANGED) span/counter/histogram/log routing/policy
|
||||
```
|
||||
|
||||
### What Changes vs What Stays
|
||||
|
||||
```
|
||||
UNCHANGED NEW / MODIFIED
|
||||
────────────────────────── ──────────────────────────
|
||||
TraceTask (data factory) TelemetryGateway (new)
|
||||
TraceQueueManager (batching) EnterpriseMetricHandler (new)
|
||||
process_trace_tasks (trace dispatch) enterprise_telemetry_task.py (new worker)
|
||||
EnterpriseOtelTrace (span methods) contracts.py (new envelope models)
|
||||
CE trace instances (all 8 providers) event_handlers.py (enqueue-only)
|
||||
EnterpriseExporter (sink) TelemetryFacade REMOVED (replaced by gateway)
|
||||
BaseTraceInstance contract core/telemetry/__init__.py (re-export gateway)
|
||||
10+ business call sites (import change)
|
||||
ops_trace_manager.py internals
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Work Objectives
|
||||
|
||||
### Core Objective
|
||||
Consolidate enterprise telemetry routing into a single gateway that classifies events by data type and CE eligibility, then dispatches to the appropriate async pipeline — preserving all existing trace and metric emission behavior.
|
||||
|
||||
### Concrete Deliverables
|
||||
- `enterprise/telemetry/gateway.py` — `TelemetryGateway` class
|
||||
- `enterprise/telemetry/contracts.py` — envelope + context Pydantic models
|
||||
- `enterprise/telemetry/metric_handler.py` — `EnterpriseMetricHandler` class
|
||||
- `tasks/enterprise_telemetry_task.py` — Celery worker for metric/log queue
|
||||
- Modified `enterprise/telemetry/event_handlers.py` — enqueue-only producers
|
||||
- Removed `core/telemetry/facade.py` — replaced by gateway; all 10+ call sites migrated to `TelemetryGateway.emit()`
|
||||
- `core/telemetry/__init__.py` updated to export gateway instead of facade
|
||||
- Unit tests for all new components
|
||||
|
||||
### Definition of Done
|
||||
- [x] All telemetry events route through gateway
|
||||
- [x] Metric/log events processed asynchronously (not in request path)
|
||||
- [x] CE trace pipeline behavior unchanged (verified by existing tests)
|
||||
- [x] Enterprise trace span behavior unchanged
|
||||
- [x] Idempotency prevents duplicate counter increments on retry
|
||||
- [x] Feature flag enables/disables gateway routing at runtime
|
||||
|
||||
### Must Have
|
||||
- Single entrance for all enterprise telemetry
|
||||
- Two routing decisions: data type + CE eligibility
|
||||
- Async metric/log processing via dedicated queue
|
||||
- Payload ref + fallback contract for large data
|
||||
- Idempotency via Redis TTL
|
||||
- Feature flag for rollout
|
||||
|
||||
### Must NOT Have (Guardrails)
|
||||
- DO NOT modify `TraceQueueManager` internals (keep as dumb transport)
|
||||
- DO NOT touch CE trace dispatch logic in `process_trace_tasks`
|
||||
- DO NOT change `EnterpriseOtelTrace` method signatures
|
||||
- DO NOT modify blinker signal contracts or registration patterns
|
||||
- DO NOT add new event types (only route existing ones)
|
||||
- DO NOT change `ops_trace_manager.py` beyond minimal import updates
|
||||
- DO NOT unify CE and enterprise processing into a shared handler
|
||||
- DO NOT refactor `EnterpriseOtelTrace` methods (only add wrapper calls)
|
||||
- DO NOT add complex retry/DLQ logic in v1
|
||||
- DO NOT optimize `TraceQueueManager` batching
|
||||
|
||||
---
|
||||
|
||||
## Verification Strategy
|
||||
|
||||
> **UNIVERSAL RULE: ZERO HUMAN INTERVENTION**
|
||||
>
|
||||
> ALL tasks in this plan MUST be verifiable WITHOUT any human action.
|
||||
|
||||
### Test Decision
|
||||
- **Infrastructure exists**: YES (pytest + bun test infrastructure present)
|
||||
- **Automated tests**: YES (TDD — red/green/refactor)
|
||||
- **Framework**: pytest
|
||||
|
||||
### Agent-Executed QA Scenarios (MANDATORY — ALL tasks)
|
||||
|
||||
Verification is done via:
|
||||
- `make lint` — Ruff linting
|
||||
- `make type-check` — BasedPyright type checking
|
||||
- `uv run --project api --dev dev/pytest/pytest_unit_tests.sh` — full unit test suite
|
||||
- Targeted pytest for new/modified test files
|
||||
|
||||
---
|
||||
|
||||
## Execution Strategy
|
||||
|
||||
### Parallel Execution Waves
|
||||
|
||||
```
|
||||
Wave 1 (Start Immediately):
|
||||
├── Task 1: Gateway contracts + routing table
|
||||
└── Task 2: EnterpriseMetricHandler skeleton + Celery worker
|
||||
|
||||
Wave 2 (After Wave 1):
|
||||
├── Task 3: Gateway implementation (wire routing + enqueue)
|
||||
└── Task 4: Migrate event_handlers.py to gateway
|
||||
|
||||
Wave 3 (After Wave 2):
|
||||
├── Task 5: Replace TelemetryFacade with TelemetryGateway at all call sites
|
||||
└── Task 6: Feature flag + integration verification
|
||||
|
||||
Critical Path: Task 1 → Task 3 → Task 5 → Task 6
|
||||
```
|
||||
|
||||
### Dependency Matrix
|
||||
|
||||
| Task | Depends On | Blocks | Can Parallelize With |
|
||||
|------|------------|--------|---------------------|
|
||||
| 1 | None | 3, 4 | 2 |
|
||||
| 2 | None | 3, 4 | 1 |
|
||||
| 3 | 1, 2 | 5 | 4 |
|
||||
| 4 | 1, 2 | 6 | 3 |
|
||||
| 5 | 3 | 6 | None |
|
||||
| 6 | 4, 5 | None | None (final) |
|
||||
|
||||
---
|
||||
|
||||
## TODOs
|
||||
|
||||
- [x] 1. Gateway Contracts + Routing Table
|
||||
|
||||
**What to do**:
|
||||
- Create `enterprise/telemetry/contracts.py` with Pydantic models:
|
||||
- `TelemetryEnvelope`: `event_id` (UUID), `schema_version` (int), `event_name` (str), `signal_type` (Literal["trace", "metric_log"]), `case` (str enum), `context` (TelemetryContext with tenant_id/app_id/user_id), `correlation` (trace_id_source, span_id_source), `core_fields` (dict), `payload_ref` (optional str), `payload_fallback` (optional bytes, max 64KB), `created_at` (datetime)
|
||||
- `TelemetryCase` enum: all known cases (WORKFLOW_RUN, NODE_EXECUTION, DRAFT_NODE_EXECUTION, MESSAGE_RUN, TOOL_EXECUTION, MODERATION_CHECK, SUGGESTED_QUESTION, DATASET_RETRIEVAL, GENERATE_NAME, PROMPT_GENERATION, APP_CREATED, APP_UPDATED, APP_DELETED, FEEDBACK_CREATED)
|
||||
- Create routing table in `enterprise/telemetry/gateway.py` (data structure only, no logic yet):
|
||||
- `CASE_ROUTING: dict[TelemetryCase, CaseRoute]` where `CaseRoute` has `signal_type` and `ce_eligible` fields
|
||||
- Trace-shaped + CE-eligible: WORKFLOW_RUN, MESSAGE_RUN (through TraceQueueManager, reaches both EE + CE)
|
||||
- Trace-shaped + enterprise-only: NODE_EXECUTION, DRAFT_NODE_EXECUTION, PROMPT_GENERATION (through TraceQueueManager, dropped if EE disabled)
|
||||
- Metric/log-only: APP_CREATED, APP_UPDATED, APP_DELETED, FEEDBACK_CREATED, TOOL_EXECUTION, MODERATION_CHECK, SUGGESTED_QUESTION, DATASET_RETRIEVAL, GENERATE_NAME (through enterprise metric queue)
|
||||
- Add validation: envelope size checks, required fields by signal_type
|
||||
- Write comprehensive unit tests for models and routing table
|
||||
|
||||
**Must NOT do**:
|
||||
- Do not implement gateway emit logic yet
|
||||
- Do not create Celery tasks yet
|
||||
- Do not modify any existing files
|
||||
|
||||
**Recommended Agent Profile**:
|
||||
- **Category**: `quick`
|
||||
- **Skills**: [`git-master`]
|
||||
- `git-master`: atomic commit after contracts are defined
|
||||
|
||||
**Parallelization**:
|
||||
- **Can Run In Parallel**: YES
|
||||
- **Parallel Group**: Wave 1 (with Task 2)
|
||||
- **Blocks**: Tasks 3, 4
|
||||
- **Blocked By**: None
|
||||
|
||||
**References**:
|
||||
- `core/telemetry/events.py:10-22` — existing `TelemetryContext` and `TelemetryEvent` dataclass patterns (follow frozen dataclass style)
|
||||
- `core/telemetry/facade.py:11-17` — `_ENTERPRISE_ONLY_TRACES` frozenset (source of truth for enterprise-only trace cases)
|
||||
- `core/ops/entities/trace_entity.py:214-227` — `TraceTaskName` enum (existing case taxonomy to align with)
|
||||
- `enterprise/telemetry/entities.py` — `EnterpriseTelemetryCounter`, `EnterpriseTelemetrySpan` enums (enterprise signal naming patterns)
|
||||
- `enterprise/telemetry/enterprise_trace.py:42-80` — `EnterpriseOtelTrace.trace()` dispatcher (case routing reference — maps trace_info types to handler methods)
|
||||
|
||||
**Acceptance Criteria**:
|
||||
- [ ] `TelemetryEnvelope` validates correct payloads, rejects missing required fields
|
||||
- [ ] `TelemetryCase` enum covers all 14 known cases
|
||||
- [ ] Routing table maps each case to correct `signal_type` + `ce_eligible`
|
||||
- [ ] Envelope with `payload_fallback` > 64KB is rejected by validator
|
||||
- [ ] `pytest tests/unit_tests/enterprise/telemetry/test_contracts.py` → PASS
|
||||
|
||||
**Agent-Executed QA Scenarios**:
|
||||
|
||||
```
|
||||
Scenario: Envelope validation accepts valid trace envelope
|
||||
Tool: Bash (pytest)
|
||||
Preconditions: contracts.py created with models
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/enterprise/telemetry/test_contracts.py -k "test_valid_trace_envelope" -v
|
||||
2. Assert: PASSED
|
||||
Expected Result: Valid envelope passes validation
|
||||
Evidence: pytest output captured
|
||||
|
||||
Scenario: Envelope rejects oversized payload_fallback
|
||||
Tool: Bash (pytest)
|
||||
Preconditions: contracts.py with size validation
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/enterprise/telemetry/test_contracts.py -k "test_oversized_fallback_rejected" -v
|
||||
2. Assert: PASSED (ValidationError raised)
|
||||
Expected Result: Payloads > 64KB rejected
|
||||
Evidence: pytest output captured
|
||||
|
||||
Scenario: Routing table correctness
|
||||
Tool: Bash (pytest)
|
||||
Preconditions: routing table defined
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/enterprise/telemetry/test_contracts.py -k "test_routing_table" -v
|
||||
2. Assert: Each case maps to expected signal_type and ce_eligible
|
||||
Expected Result: All 14 cases correctly classified
|
||||
Evidence: pytest output captured
|
||||
```
|
||||
|
||||
**Commit**: YES
|
||||
- Message: `feat(telemetry): add gateway envelope contracts and routing table`
|
||||
- Files: `enterprise/telemetry/contracts.py`, `enterprise/telemetry/gateway.py`, `tests/unit_tests/enterprise/telemetry/test_contracts.py`
|
||||
- Pre-commit: `make lint && make type-check`
|
||||
|
||||
---
|
||||
|
||||
- [x] 2. EnterpriseMetricHandler Skeleton + Celery Worker
|
||||
|
||||
**What to do**:
|
||||
- Create `enterprise/telemetry/metric_handler.py`:
|
||||
- `EnterpriseMetricHandler` class with `handle(envelope: TelemetryEnvelope) -> None`
|
||||
- Case dispatch method (isinstance/match on `envelope.case`)
|
||||
- Stub methods for each metric/log case: `_on_app_created`, `_on_feedback_created`, `_on_message_run`, `_on_tool_execution`, `_on_moderation_check`, `_on_suggested_question`, `_on_dataset_retrieval`, `_on_generate_name`, `_on_prompt_generation`
|
||||
- Rehydration helper: `_rehydrate(envelope) -> dict` — resolve `payload_ref` → data, fallback to `payload_fallback`, emit degraded event if both fail
|
||||
- Idempotency check: `_is_duplicate(envelope) -> bool` — Redis GET on `telemetry:dedup:{tenant_id}:{event_id}`, SET with 1h TTL on first seen
|
||||
- Create `tasks/enterprise_telemetry_task.py`:
|
||||
- `@shared_task(queue="enterprise_telemetry")` decorator
|
||||
- Deserialize envelope → call `EnterpriseMetricHandler().handle(envelope)`
|
||||
- Error handling: log + drop (best-effort, never fail user request)
|
||||
- Register new queue in Celery configuration (check existing queue registration pattern)
|
||||
- Write unit tests for handler dispatch, idempotency, rehydration fallback
|
||||
|
||||
**Must NOT do**:
|
||||
- Do not implement actual metric emission logic in handlers yet (stubs only)
|
||||
- Do not wire any producers to this worker yet
|
||||
- Do not modify existing files beyond queue registration
|
||||
|
||||
**Recommended Agent Profile**:
|
||||
- **Category**: `unspecified-low`
|
||||
- **Skills**: [`git-master`]
|
||||
- `git-master`: atomic commit for worker skeleton
|
||||
|
||||
**Parallelization**:
|
||||
- **Can Run In Parallel**: YES
|
||||
- **Parallel Group**: Wave 1 (with Task 1)
|
||||
- **Blocks**: Tasks 3, 4
|
||||
- **Blocked By**: None
|
||||
|
||||
**References**:
|
||||
- `tasks/ops_trace_task.py:18-77` — existing Celery task pattern for telemetry (`@shared_task(queue="ops_trace")`, error handling, storage cleanup)
|
||||
- `enterprise/telemetry/enterprise_trace.py:42-80` — `EnterpriseOtelTrace.trace()` case dispatch pattern (isinstance-based routing to follow)
|
||||
- `enterprise/telemetry/enterprise_trace.py:407-488` — `_message_trace()` as example of metric-only handler (emit_metric_only_event + counters + histograms)
|
||||
- `enterprise/telemetry/telemetry_log.py:102` — `emit_metric_only_event()` function signature (what handlers will eventually call)
|
||||
- `extensions/ext_redis.py` — Redis client access pattern (`redis_client`)
|
||||
- Celery queue registration: search for `queue=` in `tasks/` directory and Celery config files to find where queues are declared
|
||||
|
||||
**Acceptance Criteria**:
|
||||
- [ ] `EnterpriseMetricHandler.handle()` routes to correct stub method per case
|
||||
- [ ] Unknown case logs warning, does not raise
|
||||
- [ ] Idempotency check returns `True` on second call with same `event_id`
|
||||
- [ ] Rehydration falls back to `payload_fallback` when `payload_ref` fails
|
||||
- [ ] Rehydration emits degraded event when both ref and fallback are missing
|
||||
- [ ] Celery task registered on `enterprise_telemetry` queue
|
||||
- [ ] `pytest tests/unit_tests/enterprise/telemetry/test_metric_handler.py` → PASS
|
||||
- [ ] `pytest tests/unit_tests/tasks/test_enterprise_telemetry_task.py` → PASS
|
||||
|
||||
**Agent-Executed QA Scenarios**:
|
||||
|
||||
```
|
||||
Scenario: Handler routes APP_CREATED to correct stub
|
||||
Tool: Bash (pytest)
|
||||
Preconditions: metric_handler.py with stubs
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/enterprise/telemetry/test_metric_handler.py -k "test_dispatch_app_created" -v
|
||||
2. Assert: PASSED, _on_app_created called
|
||||
Expected Result: Correct case routing
|
||||
Evidence: pytest output
|
||||
|
||||
Scenario: Idempotency rejects duplicate event_id
|
||||
Tool: Bash (pytest)
|
||||
Preconditions: Redis mock available
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/enterprise/telemetry/test_metric_handler.py -k "test_idempotency_duplicate" -v
|
||||
2. Assert: PASSED, second call returns True (duplicate)
|
||||
Expected Result: Duplicate detection works
|
||||
Evidence: pytest output
|
||||
|
||||
Scenario: Rehydration fallback chain
|
||||
Tool: Bash (pytest)
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/enterprise/telemetry/test_metric_handler.py -k "test_rehydration_fallback" -v
|
||||
2. Assert: PASSED, uses payload_fallback when ref fails
|
||||
Expected Result: Graceful degradation
|
||||
Evidence: pytest output
|
||||
```
|
||||
|
||||
**Commit**: YES
|
||||
- Message: `feat(telemetry): add enterprise metric handler skeleton and Celery worker`
|
||||
- Files: `enterprise/telemetry/metric_handler.py`, `tasks/enterprise_telemetry_task.py`, `tests/unit_tests/enterprise/telemetry/test_metric_handler.py`, `tests/unit_tests/tasks/test_enterprise_telemetry_task.py`
|
||||
- Pre-commit: `make lint && make type-check`
|
||||
|
||||
---
|
||||
|
||||
- [x] 3. Gateway Implementation (Routing + Enqueue Logic)
|
||||
|
||||
**What to do**:
|
||||
- Implement `TelemetryGateway` in `enterprise/telemetry/gateway.py`:
|
||||
- `emit(case: TelemetryCase, context: dict, payload: dict, trace_manager: TraceQueueManager | None = None) -> None`
|
||||
- Decision 1 — data type: look up `CASE_ROUTING[case].signal_type`
|
||||
- `trace` → build `TraceTask`, pass to `TraceQueueManager.add_trace_task()` (reuse existing path)
|
||||
- `metric_log` → build `TelemetryEnvelope`, call `process_enterprise_telemetry.delay(envelope.model_dump_json())`
|
||||
- Decision 2 — CE eligibility (trace path only):
|
||||
- If `CASE_ROUTING[case].ce_eligible == False` and `not is_enterprise_telemetry_enabled()` → return (drop)
|
||||
- Otherwise → enqueue to TraceQueueManager
|
||||
- Payload sizing: if payload > threshold, store to shared storage → set `payload_ref`; otherwise inline in `core_fields`
|
||||
- Generate `event_id` (UUID4) for each envelope
|
||||
- Add feature flag check: `ENTERPRISE_TELEMETRY_GATEWAY_ENABLED` (env var, default False)
|
||||
- Write unit tests for routing logic, CE eligibility gating, payload sizing
|
||||
|
||||
**Must NOT do**:
|
||||
- Do not modify `TraceQueueManager` internals
|
||||
- Do not change `process_trace_tasks`
|
||||
- Do not implement metric handler case logic (Task 2 stubs are sufficient)
|
||||
- Do not wire any existing producers to gateway yet (Task 4/5)
|
||||
|
||||
**Recommended Agent Profile**:
|
||||
- **Category**: `unspecified-high`
|
||||
- **Skills**: [`git-master`]
|
||||
- `git-master`: atomic commit for gateway logic
|
||||
|
||||
**Parallelization**:
|
||||
- **Can Run In Parallel**: YES
|
||||
- **Parallel Group**: Wave 2 (with Task 4)
|
||||
- **Blocks**: Task 5
|
||||
- **Blocked By**: Tasks 1, 2
|
||||
|
||||
**References**:
|
||||
- `core/telemetry/facade.py:20-37` — current `TelemetryFacade.emit()` (gateway replaces/wraps this routing logic)
|
||||
- `core/telemetry/facade.py:11-17` — `_ENTERPRISE_ONLY_TRACES` (CE eligibility source of truth; gateway absorbs this check)
|
||||
- `core/telemetry/facade.py:40-46` — `is_enterprise_telemetry_enabled()` (reuse this function for enterprise gating)
|
||||
- `core/ops/ops_trace_manager.py:1264-1288` — `TraceQueueManager.__init__` and `add_trace_task` (gateway calls this for trace-shaped events)
|
||||
- `core/ops/ops_trace_manager.py:515-634` — `TraceTask` class and `preprocess()` (gateway creates TraceTask instances for trace path)
|
||||
- `enterprise/telemetry/contracts.py` — envelope models from Task 1 (gateway creates these for metric/log path)
|
||||
- `tasks/enterprise_telemetry_task.py` — Celery task from Task 2 (gateway calls `.delay()` for metric/log events)
|
||||
|
||||
**Acceptance Criteria**:
|
||||
- [ ] Gateway routes trace-shaped cases to `TraceQueueManager.add_trace_task()`
|
||||
- [ ] Gateway routes metric/log cases to enterprise telemetry Celery task
|
||||
- [ ] Enterprise-only trace case dropped when enterprise disabled
|
||||
- [ ] CE-eligible trace case enqueued regardless of enterprise state
|
||||
- [ ] Large payload stored to shared storage, `payload_ref` set in envelope
|
||||
- [ ] Small payload inlined in `core_fields`
|
||||
- [ ] Feature flag OFF → gateway bypassed (old path used)
|
||||
- [ ] `pytest tests/unit_tests/enterprise/telemetry/test_gateway.py` → PASS
|
||||
|
||||
**Agent-Executed QA Scenarios**:
|
||||
|
||||
```
|
||||
Scenario: Trace-shaped case routes to TraceQueueManager
|
||||
Tool: Bash (pytest)
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/enterprise/telemetry/test_gateway.py -k "test_trace_case_routes_to_trace_queue" -v
|
||||
2. Assert: PASSED, TraceQueueManager.add_trace_task called
|
||||
Expected Result: Trace events use existing pipeline
|
||||
Evidence: pytest output
|
||||
|
||||
Scenario: Metric case routes to enterprise Celery task
|
||||
Tool: Bash (pytest)
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/enterprise/telemetry/test_gateway.py -k "test_metric_case_routes_to_celery" -v
|
||||
2. Assert: PASSED, process_enterprise_telemetry.delay called with envelope
|
||||
Expected Result: Metric events use new pipeline
|
||||
Evidence: pytest output
|
||||
|
||||
Scenario: Enterprise-only case dropped when EE disabled
|
||||
Tool: Bash (pytest)
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/enterprise/telemetry/test_gateway.py -k "test_enterprise_only_dropped_on_ce" -v
|
||||
2. Assert: PASSED, no enqueue call made
|
||||
Expected Result: CE deployments don't process enterprise-only traces
|
||||
Evidence: pytest output
|
||||
```
|
||||
|
||||
**Commit**: YES
|
||||
- Message: `feat(telemetry): implement gateway routing and enqueue logic`
|
||||
- Files: `enterprise/telemetry/gateway.py`, `tests/unit_tests/enterprise/telemetry/test_gateway.py`
|
||||
- Pre-commit: `make lint && make type-check`
|
||||
|
||||
---
|
||||
|
||||
- [x] 4. Migrate Event Handlers to Gateway-Only Producers
|
||||
|
||||
**What to do**:
|
||||
- **Pre-validation**: Use `lsp_find_references` on all 4 blinker handler functions to confirm no caller depends on synchronous completion or return values
|
||||
- Refactor `enterprise/telemetry/event_handlers.py`:
|
||||
- `_handle_app_created`: replace direct `emit_metric_only_event()` + `exporter.increment_counter()` with `TelemetryGateway.emit(TelemetryCase.APP_CREATED, context, payload)`
|
||||
- `_handle_app_updated`: same pattern → `TelemetryCase.APP_UPDATED`
|
||||
- `_handle_app_deleted`: same pattern → `TelemetryCase.APP_DELETED`
|
||||
- `_handle_feedback_created`: same pattern → `TelemetryCase.FEEDBACK_CREATED`
|
||||
- Implement corresponding case methods in `EnterpriseMetricHandler`:
|
||||
- `_on_app_created(envelope)`: call `emit_metric_only_event()` + `exporter.increment_counter()` (move existing logic from handler)
|
||||
- `_on_app_updated(envelope)`: same
|
||||
- `_on_app_deleted(envelope)`: same
|
||||
- `_on_feedback_created(envelope)`: same
|
||||
- Handlers should build minimal context dict from sender/kwargs, nothing more
|
||||
- Write unit tests verifying handlers call gateway only, and metric handler emits correct signals
|
||||
|
||||
**Must NOT do**:
|
||||
- Do not change blinker signal contracts or registration
|
||||
- Do not change what signals are emitted (same event names, same counter labels)
|
||||
- Do not add new event types
|
||||
|
||||
**Recommended Agent Profile**:
|
||||
- **Category**: `unspecified-low`
|
||||
- **Skills**: [`git-master`]
|
||||
- `git-master`: atomic commit for handler migration
|
||||
|
||||
**Parallelization**:
|
||||
- **Can Run In Parallel**: YES
|
||||
- **Parallel Group**: Wave 2 (with Task 3)
|
||||
- **Blocks**: Task 6
|
||||
- **Blocked By**: Tasks 1, 2
|
||||
|
||||
**References**:
|
||||
- `enterprise/telemetry/event_handlers.py:26-146` — current 4 handlers with direct emit/counter calls (source of migration)
|
||||
- `enterprise/telemetry/telemetry_log.py:102` — `emit_metric_only_event()` signature (handler logic moves to metric_handler, calling this)
|
||||
- `enterprise/telemetry/exporter.py:236` — `increment_counter()` signature (same — logic moves to metric_handler)
|
||||
- `enterprise/telemetry/entities.py` — `EnterpriseTelemetryCounter` enum values used in handlers (REQUESTS, FEEDBACK)
|
||||
- `events/app_event.py` — blinker signals (`app_was_created`, `app_was_deleted`, `app_was_updated`)
|
||||
- `events/feedback_event.py` — blinker signal (`feedback_was_created`)
|
||||
|
||||
**Acceptance Criteria**:
|
||||
- [ ] `event_handlers.py` has zero direct `emit_metric_only_event` calls
|
||||
- [ ] `event_handlers.py` has zero direct `exporter.increment_counter` calls
|
||||
- [ ] `event_handlers.py` has zero direct `get_enterprise_exporter` calls
|
||||
- [ ] Each handler calls `TelemetryGateway.emit()` with correct case + context
|
||||
- [ ] `EnterpriseMetricHandler._on_app_created` emits same event_name and counter labels as old handler
|
||||
- [ ] `EnterpriseMetricHandler._on_feedback_created` emits same event_name and counter labels as old handler
|
||||
- [ ] `pytest tests/unit_tests/enterprise/telemetry/test_event_handlers.py` → PASS
|
||||
- [ ] `pytest tests/unit_tests/enterprise/telemetry/test_metric_handler.py` → PASS
|
||||
|
||||
**Agent-Executed QA Scenarios**:
|
||||
|
||||
```
|
||||
Scenario: App created handler calls gateway only
|
||||
Tool: Bash (pytest)
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/enterprise/telemetry/test_event_handlers.py -k "test_app_created_calls_gateway" -v
|
||||
2. Assert: PASSED, gateway.emit called with APP_CREATED case
|
||||
3. Assert: emit_metric_only_event NOT called directly
|
||||
Expected Result: Handler is enqueue-only
|
||||
Evidence: pytest output
|
||||
|
||||
Scenario: Metric handler emits same signals as old handler
|
||||
Tool: Bash (pytest)
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/enterprise/telemetry/test_metric_handler.py -k "test_on_app_created_emits_correct_signals" -v
|
||||
2. Assert: PASSED, emit_metric_only_event called with event_name="dify.app.created"
|
||||
3. Assert: increment_counter called with type="app.created"
|
||||
Expected Result: Identical telemetry output
|
||||
Evidence: pytest output
|
||||
```
|
||||
|
||||
**Commit**: YES
|
||||
- Message: `refactor(telemetry): migrate event handlers to gateway-only producers`
|
||||
- Files: `enterprise/telemetry/event_handlers.py`, `enterprise/telemetry/metric_handler.py`, `tests/unit_tests/enterprise/telemetry/test_event_handlers.py`
|
||||
- Pre-commit: `make lint && make type-check`
|
||||
|
||||
---
|
||||
|
||||
- [x] 5. Replace TelemetryFacade with TelemetryGateway at All Call Sites
|
||||
|
||||
**What to do**:
|
||||
- Delete `core/telemetry/facade.py` (gateway fully replaces it)
|
||||
- Delete `core/telemetry/events.py` (TelemetryEvent/TelemetryContext replaced by gateway's contracts)
|
||||
- Update `core/telemetry/__init__.py`:
|
||||
- Export `TelemetryGateway` (from `enterprise/telemetry/gateway.py`) and `is_enterprise_telemetry_enabled`
|
||||
- Remove all facade exports
|
||||
- Migrate all 10+ business call sites from `TelemetryFacade.emit(TelemetryEvent(...))` to `TelemetryGateway.emit(case, context, payload)`:
|
||||
- `services/message_service.py:301` — MESSAGE_TRACE → `TelemetryGateway.emit(TelemetryCase.MESSAGE_RUN, ...)`
|
||||
- `enterprise/telemetry/draft_trace.py:23` — DRAFT_NODE_EXECUTION_TRACE → `TelemetryGateway.emit(TelemetryCase.DRAFT_NODE_EXECUTION, ...)`
|
||||
- `core/moderation/input_moderation.py:52` — MODERATION_TRACE → `TelemetryGateway.emit(TelemetryCase.MODERATION_CHECK, ...)`
|
||||
- `core/callback_handler/agent_tool_callback_handler.py:76` — TOOL_TRACE → `TelemetryGateway.emit(TelemetryCase.TOOL_EXECUTION, ...)`
|
||||
- `core/app/apps/advanced_chat/generate_task_pipeline.py:835` — `TelemetryGateway.emit(...)`
|
||||
- `core/workflow/graph_engine/layers/persistence.py:398,502` — NODE_EXECUTION_TRACE → `TelemetryGateway.emit(TelemetryCase.NODE_EXECUTION, ...)`
|
||||
- `core/app/task_pipeline/easy_ui_based_generate_task_pipeline.py:406` — `TelemetryGateway.emit(...)`
|
||||
- `core/llm_generator/llm_generator.py:96,791` — PROMPT_GENERATION / SUGGESTED_QUESTION → `TelemetryGateway.emit(...)`
|
||||
- `core/rag/retrieval/dataset_retrieval.py:725` — DATASET_RETRIEVAL → `TelemetryGateway.emit(TelemetryCase.DATASET_RETRIEVAL, ...)`
|
||||
- Rewrite `tests/unit_tests/core/telemetry/test_facade.py` → `tests/unit_tests/core/telemetry/test_gateway_integration.py` (test gateway routing at call site level)
|
||||
- Keep `is_enterprise_telemetry_enabled()` helper function (move to `core/telemetry/__init__.py` or gateway module)
|
||||
|
||||
**Must NOT do**:
|
||||
- Do not change `TraceQueueManager`
|
||||
- Do not change `process_trace_tasks`
|
||||
- Do not change business logic at call sites (only change the telemetry emit call)
|
||||
- Do not change what data is sent (same payload fields, just different API shape)
|
||||
|
||||
**Recommended Agent Profile**:
|
||||
- **Category**: `unspecified-high`
|
||||
- **Skills**: [`git-master`]
|
||||
- `git-master`: atomic commit for call site migration
|
||||
|
||||
**Parallelization**:
|
||||
- **Can Run In Parallel**: NO
|
||||
- **Parallel Group**: Wave 3 (sequential)
|
||||
- **Blocks**: Task 6
|
||||
- **Blocked By**: Task 3
|
||||
|
||||
**References**:
|
||||
- `core/telemetry/facade.py` — file to DELETE (gateway replaces all its logic)
|
||||
- `core/telemetry/events.py` — file to DELETE (contracts.py replaces TelemetryEvent/TelemetryContext)
|
||||
- `core/telemetry/__init__.py:3` — current exports to update (`TelemetryFacade, emit, is_enterprise_telemetry_enabled`)
|
||||
- `enterprise/telemetry/gateway.py` — gateway from Task 3 (new import target for all call sites)
|
||||
- `enterprise/telemetry/contracts.py` — `TelemetryCase` enum (replaces `TraceTaskName` at call sites)
|
||||
- All 10+ call sites listed above (grep for `TelemetryFacade.emit` to find complete list)
|
||||
- `tests/unit_tests/core/telemetry/test_facade.py:1-243` — existing tests to rewrite for gateway
|
||||
|
||||
**Acceptance Criteria**:
|
||||
- [ ] `core/telemetry/facade.py` deleted
|
||||
- [ ] `core/telemetry/events.py` deleted
|
||||
- [ ] Zero imports of `TelemetryFacade` anywhere in codebase
|
||||
- [ ] Zero imports of `TelemetryEvent` anywhere in codebase (except test helpers if needed)
|
||||
- [ ] All business call sites use `TelemetryGateway.emit()`
|
||||
- [ ] `core/telemetry/__init__.py` exports `TelemetryGateway` and `is_enterprise_telemetry_enabled`
|
||||
- [ ] New gateway integration tests cover trace routing, metric routing, and CE eligibility
|
||||
- [ ] `pytest tests/unit_tests/core/telemetry/ -v` → PASS
|
||||
- [ ] `pytest tests/unit_tests/enterprise/telemetry/test_gateway.py` → PASS
|
||||
- [ ] `make lint` → clean
|
||||
- [ ] `make type-check` → clean
|
||||
|
||||
**Agent-Executed QA Scenarios**:
|
||||
|
||||
```
|
||||
Scenario: No TelemetryFacade imports remain
|
||||
Tool: Bash (grep)
|
||||
Steps:
|
||||
1. Run: grep -r "TelemetryFacade" --include="*.py" . | grep -v __pycache__ | grep -v .pyc
|
||||
2. Assert: zero results
|
||||
Expected Result: Complete removal
|
||||
Evidence: grep output
|
||||
|
||||
Scenario: Gateway integration tests pass
|
||||
Tool: Bash (pytest)
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/core/telemetry/test_gateway_integration.py -v
|
||||
2. Assert: ALL PASSED
|
||||
Expected Result: Gateway correctly replaces facade at all call sites
|
||||
Evidence: pytest output
|
||||
|
||||
Scenario: Trace-shaped call site routes correctly
|
||||
Tool: Bash (pytest)
|
||||
Steps:
|
||||
1. Run: pytest tests/unit_tests/core/telemetry/test_gateway_integration.py -k "test_workflow_trace_routes_to_queue" -v
|
||||
2. Assert: PASSED, TraceQueueManager.add_trace_task called
|
||||
Expected Result: Trace events still reach existing pipeline
|
||||
Evidence: pytest output
|
||||
```
|
||||
|
||||
**Commit**: YES
|
||||
- Message: `refactor(telemetry): replace TelemetryFacade with TelemetryGateway at all call sites`
|
||||
- Files: `core/telemetry/facade.py` (deleted), `core/telemetry/events.py` (deleted), `core/telemetry/__init__.py`, all 10+ call site files, `tests/unit_tests/core/telemetry/test_gateway_integration.py`
|
||||
- Pre-commit: `make lint && make type-check`
|
||||
|
||||
---
|
||||
|
||||
- [x] 6. Integration Verification + Cleanup
|
||||
|
||||
**What to do**:
|
||||
- Run full unit test suite: `uv run --project api --dev dev/pytest/pytest_unit_tests.sh`
|
||||
- Run `make lint` and `make type-check`
|
||||
- Verify no regressions across all telemetry-related tests
|
||||
- Verify feature flag toggle:
|
||||
- OFF: all existing behavior preserved
|
||||
- ON: gateway routes correctly, metric handler processes envelopes
|
||||
- Add operational diagnostics to `EnterpriseMetricHandler`:
|
||||
- Log: gateway routing decisions (DEBUG level)
|
||||
- Counter: `enterprise_telemetry.gateway.routed_total` (by signal_type)
|
||||
- Counter: `enterprise_telemetry.handler.processed_total` (by case)
|
||||
- Counter: `enterprise_telemetry.handler.deduped_total`
|
||||
- Counter: `enterprise_telemetry.handler.rehydration_failed_total`
|
||||
- Document feature flag in relevant config/env docs if they exist
|
||||
|
||||
**Must NOT do**:
|
||||
- Do not remove the old direct path yet (keep behind feature flag for rollback)
|
||||
- Do not force-enable the feature flag in production config
|
||||
- Do not add complex DLQ/retry logic
|
||||
|
||||
**Recommended Agent Profile**:
|
||||
- **Category**: `unspecified-low`
|
||||
- **Skills**: [`git-master`]
|
||||
- `git-master`: final atomic commit
|
||||
|
||||
**Parallelization**:
|
||||
- **Can Run In Parallel**: NO
|
||||
- **Parallel Group**: Wave 3 (final, sequential)
|
||||
- **Blocks**: None (final task)
|
||||
- **Blocked By**: Tasks 4, 5
|
||||
|
||||
**References**:
|
||||
- `tests/unit_tests/core/telemetry/test_gateway_integration.py` — gateway integration tests (must pass)
|
||||
- `tests/unit_tests/core/ops/test_trace_queue_manager.py` — TraceQueueManager tests (must pass unchanged)
|
||||
- `enterprise/telemetry/metric_handler.py` — add diagnostics counters here
|
||||
- `enterprise/telemetry/gateway.py` — add DEBUG logging here
|
||||
- `enterprise/telemetry/exporter.py` — `EnterpriseExporter.increment_counter()` pattern for adding diagnostic counters
|
||||
|
||||
**Acceptance Criteria**:
|
||||
- [ ] `uv run --project api --dev dev/pytest/pytest_unit_tests.sh` → ALL PASS
|
||||
- [ ] `make lint` → clean
|
||||
- [ ] `make type-check` → clean
|
||||
- [ ] Feature flag OFF: all existing tests pass, no behavioral change
|
||||
- [ ] Feature flag ON: gateway routing + metric handler processing verified
|
||||
- [ ] Diagnostic counters present in metric handler
|
||||
- [ ] No direct `emit_metric_only_event` calls remain in `event_handlers.py`
|
||||
|
||||
**Agent-Executed QA Scenarios**:
|
||||
|
||||
```
|
||||
Scenario: Full test suite passes
|
||||
Tool: Bash
|
||||
Steps:
|
||||
1. Run: uv run --project api --dev dev/pytest/pytest_unit_tests.sh
|
||||
2. Assert: exit code 0, all tests pass
|
||||
Expected Result: Zero regressions
|
||||
Evidence: pytest output captured
|
||||
|
||||
Scenario: Lint and type-check clean
|
||||
Tool: Bash
|
||||
Steps:
|
||||
1. Run: make lint
|
||||
2. Assert: exit code 0
|
||||
3. Run: make type-check
|
||||
4. Assert: exit code 0
|
||||
Expected Result: No lint or type errors
|
||||
Evidence: command output captured
|
||||
```
|
||||
|
||||
**Commit**: YES
|
||||
- Message: `feat(telemetry): add gateway diagnostics and verify integration`
|
||||
- Files: `enterprise/telemetry/metric_handler.py`, `enterprise/telemetry/gateway.py`
|
||||
- Pre-commit: `make lint && make type-check && uv run --project api --dev dev/pytest/pytest_unit_tests.sh`
|
||||
|
||||
---
|
||||
|
||||
## Commit Strategy
|
||||
|
||||
| After Task | Message | Key Files | Verification |
|
||||
|------------|---------|-----------|--------------|
|
||||
| 1 | `feat(telemetry): add gateway envelope contracts and routing table` | contracts.py, gateway.py (data only) | pytest + lint + type-check |
|
||||
| 2 | `feat(telemetry): add enterprise metric handler skeleton and Celery worker` | metric_handler.py, enterprise_telemetry_task.py | pytest + lint + type-check |
|
||||
| 3 | `feat(telemetry): implement gateway routing and enqueue logic` | gateway.py (full impl) | pytest + lint + type-check |
|
||||
| 4 | `refactor(telemetry): migrate event handlers to gateway-only producers` | event_handlers.py, metric_handler.py | pytest + lint + type-check |
|
||||
| 5 | `refactor(telemetry): replace TelemetryFacade with TelemetryGateway at all call sites` | facade.py (deleted), events.py (deleted), __init__.py, 10+ call sites | pytest + lint + type-check |
|
||||
| 6 | `feat(telemetry): add gateway diagnostics and verify integration` | metric_handler.py, gateway.py | full test suite + lint + type-check |
|
||||
|
||||
---
|
||||
|
||||
## Failure Handling Decisions
|
||||
|
||||
| Scenario | Decision |
|
||||
|----------|----------|
|
||||
| Redis unavailable during idempotency check | Fail open: skip dedup, process event (prefer occasional duplicate over lost data) |
|
||||
| Payload rehydration fails (ref expired) | Use `payload_fallback` if present; otherwise emit degraded event with `rehydration_failed=true` flag |
|
||||
| Queue worker crashes mid-processing | At-least-once with idempotency: Celery retries, dedup prevents double-count |
|
||||
| Queue backpressure / full | Celery handles backpressure natively; add monitoring counter for queue depth |
|
||||
| Feature flag flips while events in-flight | Events already enqueued process with handler logic; new events route per new flag state |
|
||||
| Unknown event type reaches handler | Log warning, do not raise, skip processing |
|
||||
|
||||
---
|
||||
|
||||
## Success Criteria
|
||||
|
||||
### Verification Commands
|
||||
```bash
|
||||
make lint # Expected: clean
|
||||
make type-check # Expected: clean
|
||||
uv run --project api --dev dev/pytest/pytest_unit_tests.sh # Expected: all pass
|
||||
pytest tests/unit_tests/core/telemetry/ -v # Expected: all pass
|
||||
pytest tests/unit_tests/enterprise/telemetry/ -v # Expected: all pass
|
||||
pytest tests/unit_tests/tasks/ -v # Expected: all pass
|
||||
```
|
||||
|
||||
### Final Checklist
|
||||
- [x] Single gateway entrance for all enterprise telemetry
|
||||
- [x] Two routing decisions consolidated in one place
|
||||
- [x] Metric/log events processed async (not in request path)
|
||||
- [x] CE trace pipeline completely unchanged
|
||||
- [x] Enterprise trace span pipeline unchanged
|
||||
- [x] Idempotency prevents duplicate counters
|
||||
- [x] Feature flag enables safe rollout/rollback
|
||||
- [x] All existing tests pass
|
||||
- [x] No direct emit/counter calls in event_handlers.py
|
||||
|
|
@ -63,7 +63,8 @@ from core.base.tts import AppGeneratorTTSPublisher, AudioTrunk
|
|||
from core.model_runtime.entities.llm_entities import LLMUsage
|
||||
from core.model_runtime.utils.encoders import jsonable_encoder
|
||||
from core.ops.ops_trace_manager import TraceQueueManager
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TelemetryFacade, TraceTaskName
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TraceTaskName
|
||||
from core.telemetry import emit as telemetry_emit
|
||||
from core.workflow.enums import WorkflowExecutionStatus
|
||||
from core.workflow.nodes import NodeType
|
||||
from core.workflow.repositories.draft_variable_repository import DraftVariableSaverFactory
|
||||
|
|
@ -832,7 +833,7 @@ class AdvancedChatAppGenerateTaskPipeline(GraphRuntimeStateSupport):
|
|||
session.add_all(message_files)
|
||||
|
||||
if trace_manager:
|
||||
TelemetryFacade.emit(
|
||||
telemetry_emit(
|
||||
TelemetryEvent(
|
||||
name=TraceTaskName.MESSAGE_TRACE,
|
||||
context=TelemetryContext(
|
||||
|
|
|
|||
|
|
@ -55,7 +55,8 @@ from core.model_runtime.model_providers.__base.large_language_model import Large
|
|||
from core.ops.ops_trace_manager import TraceQueueManager
|
||||
from core.prompt.utils.prompt_message_util import PromptMessageUtil
|
||||
from core.prompt.utils.prompt_template_parser import PromptTemplateParser
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TelemetryFacade, TraceTaskName
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TraceTaskName
|
||||
from core.telemetry import emit as telemetry_emit
|
||||
from events.message_event import message_was_created
|
||||
from extensions.ext_database import db
|
||||
from libs.datetime_utils import naive_utc_now
|
||||
|
|
@ -409,7 +410,7 @@ class EasyUIBasedGenerateTaskPipeline(BasedGenerateTaskPipeline):
|
|||
message.message_metadata = self._task_state.metadata.model_dump_json()
|
||||
|
||||
if trace_manager:
|
||||
TelemetryFacade.emit(
|
||||
telemetry_emit(
|
||||
TelemetryEvent(
|
||||
name=TraceTaskName.MESSAGE_TRACE,
|
||||
context=TelemetryContext(
|
||||
|
|
|
|||
|
|
@ -395,9 +395,10 @@ class WorkflowPersistenceLayer(GraphEngineLayer):
|
|||
external_trace_id = self._application_generate_entity.extras.get("external_trace_id")
|
||||
parent_trace_context = self._application_generate_entity.extras.get("parent_trace_context")
|
||||
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TelemetryFacade, TraceTaskName
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TraceTaskName
|
||||
from core.telemetry import emit as telemetry_emit
|
||||
|
||||
TelemetryFacade.emit(
|
||||
telemetry_emit(
|
||||
TelemetryEvent(
|
||||
name=TraceTaskName.WORKFLOW_TRACE,
|
||||
context=TelemetryContext(
|
||||
|
|
@ -499,9 +500,10 @@ class WorkflowPersistenceLayer(GraphEngineLayer):
|
|||
if parent_trace_context:
|
||||
node_data["parent_trace_context"] = parent_trace_context
|
||||
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TelemetryFacade, TraceTaskName
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TraceTaskName
|
||||
from core.telemetry import emit as telemetry_emit
|
||||
|
||||
TelemetryFacade.emit(
|
||||
telemetry_emit(
|
||||
TelemetryEvent(
|
||||
name=TraceTaskName.NODE_EXECUTION_TRACE,
|
||||
context=TelemetryContext(
|
||||
|
|
|
|||
|
|
@ -5,7 +5,8 @@ from pydantic import BaseModel
|
|||
|
||||
from configs import dify_config
|
||||
from core.ops.ops_trace_manager import TraceQueueManager
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TelemetryFacade, TraceTaskName
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TraceTaskName
|
||||
from core.telemetry import emit as telemetry_emit
|
||||
from core.tools.entities.tool_entities import ToolInvokeMessage
|
||||
|
||||
_TEXT_COLOR_MAPPING = {
|
||||
|
|
@ -73,7 +74,7 @@ class DifyAgentCallbackHandler(BaseModel):
|
|||
print_text("\n")
|
||||
|
||||
if trace_manager:
|
||||
TelemetryFacade.emit(
|
||||
telemetry_emit(
|
||||
TelemetryEvent(
|
||||
name=TraceTaskName.TOOL_TRACE,
|
||||
context=TelemetryContext(
|
||||
|
|
|
|||
|
|
@ -27,7 +27,8 @@ from core.model_runtime.entities.model_entities import ModelType
|
|||
from core.model_runtime.errors.invoke import InvokeAuthorizationError, InvokeError
|
||||
from core.ops.utils import measure_time
|
||||
from core.prompt.utils.prompt_template_parser import PromptTemplateParser
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TelemetryFacade, TraceTaskName
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TraceTaskName
|
||||
from core.telemetry import emit as telemetry_emit
|
||||
from core.workflow.entities.workflow_node_execution import WorkflowNodeExecutionMetadataKey
|
||||
from extensions.ext_database import db
|
||||
from extensions.ext_storage import storage
|
||||
|
|
@ -93,7 +94,7 @@ class LLMGenerator:
|
|||
name = name[:75] + "..."
|
||||
|
||||
# get tracing instance
|
||||
TelemetryFacade.emit(
|
||||
telemetry_emit(
|
||||
TelemetryEvent(
|
||||
name=TraceTaskName.GENERATE_NAME_TRACE,
|
||||
context=TelemetryContext(tenant_id=tenant_id, app_id=app_id),
|
||||
|
|
@ -788,7 +789,7 @@ class LLMGenerator:
|
|||
total_price = None
|
||||
currency = None
|
||||
|
||||
TelemetryFacade.emit(
|
||||
telemetry_emit(
|
||||
TelemetryEvent(
|
||||
name=TraceTaskName.PROMPT_GENERATION_TRACE,
|
||||
context=TelemetryContext(tenant_id=tenant_id, user_id=user_id, app_id=app_id),
|
||||
|
|
|
|||
|
|
@ -7,7 +7,8 @@ from core.moderation.base import ModerationAction, ModerationError
|
|||
from core.moderation.factory import ModerationFactory
|
||||
from core.ops.ops_trace_manager import TraceQueueManager
|
||||
from core.ops.utils import measure_time
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TelemetryFacade, TraceTaskName
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TraceTaskName
|
||||
from core.telemetry import emit as telemetry_emit
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
|
@ -49,7 +50,7 @@ class InputModeration:
|
|||
moderation_result = moderation_factory.moderation_for_inputs(inputs, query)
|
||||
|
||||
if trace_manager:
|
||||
TelemetryFacade.emit(
|
||||
telemetry_emit(
|
||||
TelemetryEvent(
|
||||
name=TraceTaskName.MODERATION_TRACE,
|
||||
context=TelemetryContext(tenant_id=tenant_id, app_id=app_id),
|
||||
|
|
|
|||
|
|
@ -55,7 +55,8 @@ from core.rag.retrieval.template_prompts import (
|
|||
METADATA_FILTER_USER_PROMPT_2,
|
||||
METADATA_FILTER_USER_PROMPT_3,
|
||||
)
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TelemetryFacade, TraceTaskName
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TraceTaskName
|
||||
from core.telemetry import emit as telemetry_emit
|
||||
from core.tools.signature import sign_upload_file
|
||||
from core.tools.utils.dataset_retriever.dataset_retriever_base_tool import DatasetRetrieverBaseTool
|
||||
from extensions.ext_database import db
|
||||
|
|
@ -729,7 +730,7 @@ class DatasetRetrieval:
|
|||
)
|
||||
if trace_manager:
|
||||
app_config = self.application_generate_entity.app_config if self.application_generate_entity else None
|
||||
TelemetryFacade.emit(
|
||||
telemetry_emit(
|
||||
TelemetryEvent(
|
||||
name=TraceTaskName.DATASET_RETRIEVAL_TRACE,
|
||||
context=TelemetryContext(
|
||||
|
|
|
|||
|
|
@ -1,11 +1,59 @@
|
|||
"""Community telemetry helpers.
|
||||
|
||||
Provides ``emit()`` which enqueues trace events into the CE trace pipeline
|
||||
(``TraceQueueManager`` → ``ops_trace`` Celery queue → Langfuse / LangSmith / etc.).
|
||||
|
||||
Enterprise-only traces (node execution, draft node execution, prompt generation)
|
||||
are silently dropped when enterprise telemetry is disabled.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from core.ops.entities.trace_entity import TraceTaskName
|
||||
from core.telemetry.events import TelemetryContext, TelemetryEvent
|
||||
from core.telemetry.facade import TelemetryFacade, emit, is_enterprise_telemetry_enabled
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from core.ops.ops_trace_manager import TraceQueueManager
|
||||
|
||||
_ENTERPRISE_ONLY_TRACES: frozenset[TraceTaskName] = frozenset(
|
||||
{
|
||||
TraceTaskName.DRAFT_NODE_EXECUTION_TRACE,
|
||||
TraceTaskName.NODE_EXECUTION_TRACE,
|
||||
TraceTaskName.PROMPT_GENERATION_TRACE,
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
def _is_enterprise_telemetry_enabled() -> bool:
|
||||
try:
|
||||
from enterprise.telemetry.exporter import is_enterprise_telemetry_enabled
|
||||
|
||||
return is_enterprise_telemetry_enabled()
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
|
||||
def emit(event: TelemetryEvent, trace_manager: TraceQueueManager | None = None) -> None:
|
||||
from core.ops.ops_trace_manager import TraceQueueManager as LocalTraceQueueManager
|
||||
from core.ops.ops_trace_manager import TraceTask
|
||||
|
||||
if event.name in _ENTERPRISE_ONLY_TRACES and not _is_enterprise_telemetry_enabled():
|
||||
return
|
||||
|
||||
queue_manager = trace_manager or LocalTraceQueueManager(
|
||||
app_id=event.context.app_id,
|
||||
user_id=event.context.user_id,
|
||||
)
|
||||
queue_manager.add_trace_task(TraceTask(event.name, **event.payload))
|
||||
|
||||
|
||||
is_enterprise_telemetry_enabled = _is_enterprise_telemetry_enabled
|
||||
|
||||
__all__ = [
|
||||
"TelemetryContext",
|
||||
"TelemetryEvent",
|
||||
"TelemetryFacade",
|
||||
"TraceTaskName",
|
||||
"emit",
|
||||
"is_enterprise_telemetry_enabled",
|
||||
|
|
|
|||
|
|
@ -1,50 +0,0 @@
|
|||
from __future__ import annotations
|
||||
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from core.ops.entities.trace_entity import TraceTaskName
|
||||
from core.telemetry.events import TelemetryEvent
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from core.ops.ops_trace_manager import TraceQueueManager
|
||||
|
||||
_ENTERPRISE_ONLY_TRACES: frozenset[TraceTaskName] = frozenset(
|
||||
{
|
||||
TraceTaskName.DRAFT_NODE_EXECUTION_TRACE,
|
||||
TraceTaskName.NODE_EXECUTION_TRACE,
|
||||
TraceTaskName.PROMPT_GENERATION_TRACE,
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
class TelemetryFacade:
|
||||
@staticmethod
|
||||
def emit(event: TelemetryEvent, trace_manager: TraceQueueManager | None = None) -> None:
|
||||
from core.ops.ops_trace_manager import TraceQueueManager, TraceTask
|
||||
|
||||
if event.name in _ENTERPRISE_ONLY_TRACES and not is_enterprise_telemetry_enabled():
|
||||
return
|
||||
|
||||
trace_queue_manager = trace_manager or TraceQueueManager(
|
||||
app_id=event.context.app_id,
|
||||
user_id=event.context.user_id,
|
||||
)
|
||||
trace_queue_manager.add_trace_task(
|
||||
TraceTask(
|
||||
event.name,
|
||||
**event.payload,
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
def is_enterprise_telemetry_enabled() -> bool:
|
||||
try:
|
||||
from enterprise.telemetry.exporter import is_enterprise_telemetry_enabled
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
return is_enterprise_telemetry_enabled()
|
||||
|
||||
|
||||
def emit(event: TelemetryEvent, trace_manager: TraceQueueManager | None = None) -> None:
|
||||
TelemetryFacade.emit(event, trace_manager=trace_manager)
|
||||
|
|
@ -7,7 +7,7 @@ configuration that determines how each event type is processed.
|
|||
from __future__ import annotations
|
||||
|
||||
from enum import StrEnum
|
||||
from typing import Any, Literal
|
||||
from typing import Any
|
||||
|
||||
from pydantic import BaseModel, field_validator
|
||||
|
||||
|
|
@ -31,15 +31,22 @@ class TelemetryCase(StrEnum):
|
|||
FEEDBACK_CREATED = "feedback_created"
|
||||
|
||||
|
||||
class SignalType(StrEnum):
|
||||
"""Signal routing type for telemetry cases."""
|
||||
|
||||
TRACE = "trace"
|
||||
METRIC_LOG = "metric_log"
|
||||
|
||||
|
||||
class CaseRoute(BaseModel):
|
||||
"""Routing configuration for a telemetry case.
|
||||
|
||||
Attributes:
|
||||
signal_type: The type of signal ("trace" or "metric_log").
|
||||
ce_eligible: Whether this case is eligible for customer engagement.
|
||||
signal_type: The type of signal (trace or metric_log).
|
||||
ce_eligible: Whether this case is eligible for community edition tracing.
|
||||
"""
|
||||
|
||||
signal_type: Literal["trace", "metric_log"]
|
||||
signal_type: SignalType
|
||||
ce_eligible: bool
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -3,7 +3,8 @@ from __future__ import annotations
|
|||
from collections.abc import Mapping
|
||||
from typing import Any
|
||||
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TelemetryFacade, TraceTaskName
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TraceTaskName
|
||||
from core.telemetry import emit as telemetry_emit
|
||||
from core.workflow.enums import WorkflowNodeExecutionMetadataKey
|
||||
from models.workflow import WorkflowNodeExecutionModel
|
||||
|
||||
|
|
@ -20,7 +21,7 @@ def enqueue_draft_node_execution_trace(
|
|||
outputs=outputs,
|
||||
workflow_execution_id=workflow_execution_id,
|
||||
)
|
||||
TelemetryFacade.emit(
|
||||
telemetry_emit(
|
||||
TelemetryEvent(
|
||||
name=TraceTaskName.DRAFT_NODE_EXECUTION_TRACE,
|
||||
context=TelemetryContext(
|
||||
|
|
|
|||
|
|
@ -1,20 +1,23 @@
|
|||
"""Telemetry gateway routing configuration and implementation.
|
||||
"""Telemetry gateway routing and dispatch.
|
||||
|
||||
This module defines the routing table that maps telemetry cases to their
|
||||
processing routes (trace vs metric/log) and customer engagement eligibility.
|
||||
It also provides the TelemetryGateway class that routes events to the
|
||||
appropriate processing path.
|
||||
Maps ``TelemetryCase`` → ``CaseRoute`` (signal type + CE eligibility)
|
||||
and dispatches events to either the trace pipeline or the metric/log
|
||||
Celery queue.
|
||||
|
||||
Singleton lifecycle is managed by ``ext_enterprise_telemetry.init_app()``
|
||||
which creates the instance during single-threaded Flask app startup.
|
||||
Access via ``ext_enterprise_telemetry.get_gateway()``.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import uuid
|
||||
from typing import TYPE_CHECKING, Any
|
||||
|
||||
from enterprise.telemetry.contracts import CaseRoute, TelemetryCase, TelemetryEnvelope
|
||||
from core.ops.entities.trace_entity import TraceTaskName
|
||||
from enterprise.telemetry.contracts import CaseRoute, SignalType, TelemetryCase, TelemetryEnvelope
|
||||
from extensions.ext_storage import storage
|
||||
|
||||
if TYPE_CHECKING:
|
||||
|
|
@ -24,41 +27,32 @@ logger = logging.getLogger(__name__)
|
|||
|
||||
PAYLOAD_SIZE_THRESHOLD_BYTES = 1 * 1024 * 1024
|
||||
|
||||
CASE_TO_TRACE_TASK_NAME: dict[TelemetryCase, str] = {
|
||||
TelemetryCase.WORKFLOW_RUN: "workflow",
|
||||
TelemetryCase.MESSAGE_RUN: "message",
|
||||
TelemetryCase.NODE_EXECUTION: "node_execution",
|
||||
TelemetryCase.DRAFT_NODE_EXECUTION: "draft_node_execution",
|
||||
TelemetryCase.PROMPT_GENERATION: "prompt_generation",
|
||||
CASE_TO_TRACE_TASK: dict[TelemetryCase, TraceTaskName] = {
|
||||
TelemetryCase.WORKFLOW_RUN: TraceTaskName.WORKFLOW_TRACE,
|
||||
TelemetryCase.MESSAGE_RUN: TraceTaskName.MESSAGE_TRACE,
|
||||
TelemetryCase.NODE_EXECUTION: TraceTaskName.NODE_EXECUTION_TRACE,
|
||||
TelemetryCase.DRAFT_NODE_EXECUTION: TraceTaskName.DRAFT_NODE_EXECUTION_TRACE,
|
||||
TelemetryCase.PROMPT_GENERATION: TraceTaskName.PROMPT_GENERATION_TRACE,
|
||||
}
|
||||
|
||||
CASE_ROUTING: dict[TelemetryCase, CaseRoute] = {
|
||||
TelemetryCase.WORKFLOW_RUN: CaseRoute(signal_type="trace", ce_eligible=True),
|
||||
TelemetryCase.MESSAGE_RUN: CaseRoute(signal_type="trace", ce_eligible=True),
|
||||
TelemetryCase.NODE_EXECUTION: CaseRoute(signal_type="trace", ce_eligible=False),
|
||||
TelemetryCase.DRAFT_NODE_EXECUTION: CaseRoute(signal_type="trace", ce_eligible=False),
|
||||
TelemetryCase.PROMPT_GENERATION: CaseRoute(signal_type="trace", ce_eligible=False),
|
||||
TelemetryCase.APP_CREATED: CaseRoute(signal_type="metric_log", ce_eligible=False),
|
||||
TelemetryCase.APP_UPDATED: CaseRoute(signal_type="metric_log", ce_eligible=False),
|
||||
TelemetryCase.APP_DELETED: CaseRoute(signal_type="metric_log", ce_eligible=False),
|
||||
TelemetryCase.FEEDBACK_CREATED: CaseRoute(signal_type="metric_log", ce_eligible=False),
|
||||
TelemetryCase.TOOL_EXECUTION: CaseRoute(signal_type="metric_log", ce_eligible=False),
|
||||
TelemetryCase.MODERATION_CHECK: CaseRoute(signal_type="metric_log", ce_eligible=False),
|
||||
TelemetryCase.SUGGESTED_QUESTION: CaseRoute(signal_type="metric_log", ce_eligible=False),
|
||||
TelemetryCase.DATASET_RETRIEVAL: CaseRoute(signal_type="metric_log", ce_eligible=False),
|
||||
TelemetryCase.GENERATE_NAME: CaseRoute(signal_type="metric_log", ce_eligible=False),
|
||||
TelemetryCase.WORKFLOW_RUN: CaseRoute(signal_type=SignalType.TRACE, ce_eligible=True),
|
||||
TelemetryCase.MESSAGE_RUN: CaseRoute(signal_type=SignalType.TRACE, ce_eligible=True),
|
||||
TelemetryCase.NODE_EXECUTION: CaseRoute(signal_type=SignalType.TRACE, ce_eligible=False),
|
||||
TelemetryCase.DRAFT_NODE_EXECUTION: CaseRoute(signal_type=SignalType.TRACE, ce_eligible=False),
|
||||
TelemetryCase.PROMPT_GENERATION: CaseRoute(signal_type=SignalType.TRACE, ce_eligible=False),
|
||||
TelemetryCase.APP_CREATED: CaseRoute(signal_type=SignalType.METRIC_LOG, ce_eligible=False),
|
||||
TelemetryCase.APP_UPDATED: CaseRoute(signal_type=SignalType.METRIC_LOG, ce_eligible=False),
|
||||
TelemetryCase.APP_DELETED: CaseRoute(signal_type=SignalType.METRIC_LOG, ce_eligible=False),
|
||||
TelemetryCase.FEEDBACK_CREATED: CaseRoute(signal_type=SignalType.METRIC_LOG, ce_eligible=False),
|
||||
TelemetryCase.TOOL_EXECUTION: CaseRoute(signal_type=SignalType.METRIC_LOG, ce_eligible=False),
|
||||
TelemetryCase.MODERATION_CHECK: CaseRoute(signal_type=SignalType.METRIC_LOG, ce_eligible=False),
|
||||
TelemetryCase.SUGGESTED_QUESTION: CaseRoute(signal_type=SignalType.METRIC_LOG, ce_eligible=False),
|
||||
TelemetryCase.DATASET_RETRIEVAL: CaseRoute(signal_type=SignalType.METRIC_LOG, ce_eligible=False),
|
||||
TelemetryCase.GENERATE_NAME: CaseRoute(signal_type=SignalType.METRIC_LOG, ce_eligible=False),
|
||||
}
|
||||
|
||||
|
||||
def is_gateway_enabled() -> bool:
|
||||
"""Check if the telemetry gateway is enabled via feature flag.
|
||||
|
||||
Returns:
|
||||
True if ENTERPRISE_TELEMETRY_GATEWAY_ENABLED is set to a truthy value.
|
||||
"""
|
||||
return os.environ.get("ENTERPRISE_TELEMETRY_GATEWAY_ENABLED", "").lower() in ("true", "1", "yes")
|
||||
|
||||
|
||||
def _is_enterprise_telemetry_enabled() -> bool:
|
||||
try:
|
||||
from enterprise.telemetry.exporter import is_enterprise_telemetry_enabled
|
||||
|
|
@ -68,15 +62,16 @@ def _is_enterprise_telemetry_enabled() -> bool:
|
|||
return False
|
||||
|
||||
|
||||
is_enterprise_telemetry_enabled = _is_enterprise_telemetry_enabled
|
||||
def _should_drop_ee_only_event(route: CaseRoute) -> bool:
|
||||
"""Return True when the event is enterprise-only and EE telemetry is disabled."""
|
||||
return not route.ce_eligible and not _is_enterprise_telemetry_enabled()
|
||||
|
||||
|
||||
class TelemetryGateway:
|
||||
"""Gateway for routing telemetry events to appropriate processing paths.
|
||||
"""Routes telemetry events to the trace pipeline or the metric/log Celery queue.
|
||||
|
||||
Routes trace-shaped events to TraceQueueManager and metric/log events
|
||||
to the enterprise telemetry Celery queue. Handles CE eligibility checks,
|
||||
large payload storage, and feature flag gating.
|
||||
Stateless — instantiated once during ``ext_enterprise_telemetry.init_app()``
|
||||
and shared for the lifetime of the process.
|
||||
"""
|
||||
|
||||
def emit(
|
||||
|
|
@ -86,23 +81,6 @@ class TelemetryGateway:
|
|||
payload: dict[str, Any],
|
||||
trace_manager: TraceQueueManager | None = None,
|
||||
) -> None:
|
||||
"""Emit a telemetry event through the gateway.
|
||||
|
||||
Routes the event based on its case type:
|
||||
- trace: Routes to TraceQueueManager for existing trace pipeline
|
||||
- metric_log: Routes to enterprise telemetry Celery task
|
||||
|
||||
Args:
|
||||
case: The telemetry case type.
|
||||
context: Event context containing tenant_id, app_id, user_id.
|
||||
payload: The event payload data.
|
||||
trace_manager: Optional TraceQueueManager for trace routing.
|
||||
"""
|
||||
if not is_gateway_enabled():
|
||||
logger.debug("Gateway disabled, using legacy path for case=%s", case)
|
||||
self._emit_legacy(case, context, payload, trace_manager)
|
||||
return
|
||||
|
||||
route = CASE_ROUTING.get(case)
|
||||
if route is None:
|
||||
logger.warning("Unknown telemetry case: %s, dropping event", case)
|
||||
|
|
@ -115,59 +93,11 @@ class TelemetryGateway:
|
|||
route.ce_eligible,
|
||||
)
|
||||
|
||||
if route.signal_type == "trace":
|
||||
if route.signal_type is SignalType.TRACE:
|
||||
self._emit_trace(case, context, payload, route, trace_manager)
|
||||
else:
|
||||
self._emit_metric_log(case, context, payload)
|
||||
|
||||
def _emit_legacy(
|
||||
self,
|
||||
case: TelemetryCase,
|
||||
context: dict[str, Any],
|
||||
payload: dict[str, Any],
|
||||
trace_manager: TraceQueueManager | None,
|
||||
) -> None:
|
||||
"""Emit using legacy path (TelemetryFacade behavior).
|
||||
|
||||
Used when gateway feature flag is disabled.
|
||||
"""
|
||||
route = CASE_ROUTING.get(case)
|
||||
if route is None or route.signal_type != "trace":
|
||||
return
|
||||
|
||||
trace_task_name_str = CASE_TO_TRACE_TASK_NAME.get(case)
|
||||
if trace_task_name_str is None:
|
||||
return
|
||||
|
||||
if not route.ce_eligible and not _is_enterprise_telemetry_enabled():
|
||||
return
|
||||
|
||||
from core.ops.entities.trace_entity import TraceTaskName
|
||||
from core.ops.ops_trace_manager import (
|
||||
TraceQueueManager as LocalTraceQueueManager,
|
||||
)
|
||||
from core.ops.ops_trace_manager import (
|
||||
TraceTask,
|
||||
)
|
||||
|
||||
try:
|
||||
trace_task_name = TraceTaskName(trace_task_name_str)
|
||||
except ValueError:
|
||||
logger.warning("Invalid trace task name: %s", trace_task_name_str)
|
||||
return
|
||||
|
||||
queue_manager = trace_manager or LocalTraceQueueManager(
|
||||
app_id=context.get("app_id"),
|
||||
user_id=context.get("user_id"),
|
||||
)
|
||||
|
||||
queue_manager.add_trace_task(
|
||||
TraceTask(
|
||||
trace_task_name,
|
||||
**payload,
|
||||
)
|
||||
)
|
||||
|
||||
def _emit_trace(
|
||||
self,
|
||||
case: TelemetryCase,
|
||||
|
|
@ -176,57 +106,25 @@ class TelemetryGateway:
|
|||
route: CaseRoute,
|
||||
trace_manager: TraceQueueManager | None,
|
||||
) -> None:
|
||||
"""Emit a trace-shaped event to TraceQueueManager.
|
||||
from core.ops.ops_trace_manager import TraceQueueManager as LocalTraceQueueManager
|
||||
from core.ops.ops_trace_manager import TraceTask
|
||||
|
||||
Args:
|
||||
case: The telemetry case type.
|
||||
context: Event context.
|
||||
payload: The event payload.
|
||||
route: Routing configuration for this case.
|
||||
trace_manager: Optional TraceQueueManager.
|
||||
"""
|
||||
from core.ops.entities.trace_entity import TraceTaskName
|
||||
from core.ops.ops_trace_manager import (
|
||||
TraceQueueManager as LocalTraceQueueManager,
|
||||
)
|
||||
from core.ops.ops_trace_manager import (
|
||||
TraceTask,
|
||||
)
|
||||
|
||||
if not route.ce_eligible and not _is_enterprise_telemetry_enabled():
|
||||
logger.debug(
|
||||
"Dropping enterprise-only trace event: case=%s (EE disabled)",
|
||||
case,
|
||||
)
|
||||
if _should_drop_ee_only_event(route):
|
||||
logger.debug("Dropping enterprise-only trace event: case=%s (EE disabled)", case)
|
||||
return
|
||||
|
||||
trace_task_name_str = CASE_TO_TRACE_TASK_NAME.get(case)
|
||||
if trace_task_name_str is None:
|
||||
trace_task_name = CASE_TO_TRACE_TASK.get(case)
|
||||
if trace_task_name is None:
|
||||
logger.warning("No TraceTaskName mapping for case: %s", case)
|
||||
return
|
||||
|
||||
try:
|
||||
trace_task_name = TraceTaskName(trace_task_name_str)
|
||||
except ValueError:
|
||||
logger.warning("Invalid trace task name: %s", trace_task_name_str)
|
||||
return
|
||||
|
||||
queue_manager = trace_manager or LocalTraceQueueManager(
|
||||
app_id=context.get("app_id"),
|
||||
user_id=context.get("user_id"),
|
||||
)
|
||||
|
||||
queue_manager.add_trace_task(
|
||||
TraceTask(
|
||||
trace_task_name,
|
||||
**payload,
|
||||
)
|
||||
)
|
||||
logger.debug(
|
||||
"Enqueued trace task: case=%s, app_id=%s",
|
||||
case,
|
||||
context.get("app_id"),
|
||||
)
|
||||
queue_manager.add_trace_task(TraceTask(trace_task_name, **payload))
|
||||
logger.debug("Enqueued trace task: case=%s, app_id=%s", case, context.get("app_id"))
|
||||
|
||||
def _emit_metric_log(
|
||||
self,
|
||||
|
|
@ -234,13 +132,6 @@ class TelemetryGateway:
|
|||
context: dict[str, Any],
|
||||
payload: dict[str, Any],
|
||||
) -> None:
|
||||
"""Emit a metric/log event to the enterprise telemetry Celery queue.
|
||||
|
||||
Args:
|
||||
case: The telemetry case type.
|
||||
context: Event context containing tenant_id.
|
||||
payload: The event payload.
|
||||
"""
|
||||
from tasks.enterprise_telemetry_task import process_enterprise_telemetry
|
||||
|
||||
tenant_id = context.get("tenant_id", "")
|
||||
|
|
@ -270,22 +161,6 @@ class TelemetryGateway:
|
|||
tenant_id: str,
|
||||
event_id: str,
|
||||
) -> tuple[dict[str, Any], str | None]:
|
||||
"""Handle large payload storage.
|
||||
|
||||
If payload exceeds threshold, stores to shared storage and returns
|
||||
a reference. Otherwise returns payload as-is.
|
||||
|
||||
Args:
|
||||
payload: The event payload.
|
||||
tenant_id: Tenant identifier for storage path.
|
||||
event_id: Event identifier for storage path.
|
||||
|
||||
Returns:
|
||||
Tuple of (payload_for_envelope, payload_ref).
|
||||
If stored, payload_for_envelope is empty and payload_ref is set.
|
||||
Otherwise, payload_for_envelope is the original payload and
|
||||
payload_ref is None.
|
||||
"""
|
||||
try:
|
||||
payload_json = json.dumps(payload)
|
||||
payload_size = len(payload_json.encode("utf-8"))
|
||||
|
|
@ -306,35 +181,19 @@ class TelemetryGateway:
|
|||
return payload, None
|
||||
|
||||
|
||||
_gateway: TelemetryGateway | None = None
|
||||
|
||||
|
||||
def get_gateway() -> TelemetryGateway:
|
||||
"""Get the module-level gateway instance.
|
||||
|
||||
Returns:
|
||||
The singleton TelemetryGateway instance.
|
||||
"""
|
||||
global _gateway
|
||||
if _gateway is None:
|
||||
_gateway = TelemetryGateway()
|
||||
return _gateway
|
||||
|
||||
|
||||
def emit(
|
||||
case: TelemetryCase,
|
||||
context: dict[str, Any],
|
||||
payload: dict[str, Any],
|
||||
trace_manager: TraceQueueManager | None = None,
|
||||
) -> None:
|
||||
"""Emit a telemetry event through the gateway.
|
||||
"""Module-level convenience wrapper.
|
||||
|
||||
Convenience function that uses the module-level gateway instance.
|
||||
|
||||
Args:
|
||||
case: The telemetry case type.
|
||||
context: Event context containing tenant_id, app_id, user_id.
|
||||
payload: The event payload data.
|
||||
trace_manager: Optional TraceQueueManager for trace routing.
|
||||
Fetches the gateway singleton from the extension; no-ops when
|
||||
enterprise telemetry is disabled (gateway is ``None``).
|
||||
"""
|
||||
get_gateway().emit(case, context, payload, trace_manager)
|
||||
from extensions.ext_enterprise_telemetry import get_gateway
|
||||
|
||||
gateway = get_gateway()
|
||||
if gateway is not None:
|
||||
gateway.emit(case, context, payload, trace_manager)
|
||||
|
|
|
|||
|
|
@ -1,9 +1,11 @@
|
|||
"""Flask extension for enterprise telemetry lifecycle management.
|
||||
|
||||
Initializes the EnterpriseExporter singleton during ``create_app()`` (single-threaded),
|
||||
registers blinker event handlers, and hooks atexit for graceful shutdown.
|
||||
Initializes the EnterpriseExporter and TelemetryGateway singletons during
|
||||
``create_app()`` (single-threaded), registers blinker event handlers,
|
||||
and hooks atexit for graceful shutdown.
|
||||
|
||||
Skipped entirely when ``ENTERPRISE_ENABLED`` and ``ENTERPRISE_TELEMETRY_ENABLED`` are false (``is_enabled()`` gate).
|
||||
Skipped entirely when ``ENTERPRISE_ENABLED`` and ``ENTERPRISE_TELEMETRY_ENABLED``
|
||||
are false (``is_enabled()`` gate).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
|
@ -17,10 +19,12 @@ from configs import dify_config
|
|||
if TYPE_CHECKING:
|
||||
from dify_app import DifyApp
|
||||
from enterprise.telemetry.exporter import EnterpriseExporter
|
||||
from enterprise.telemetry.gateway import TelemetryGateway
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
_exporter: EnterpriseExporter | None = None
|
||||
_gateway: TelemetryGateway | None = None
|
||||
|
||||
|
||||
def is_enabled() -> bool:
|
||||
|
|
@ -28,14 +32,16 @@ def is_enabled() -> bool:
|
|||
|
||||
|
||||
def init_app(app: DifyApp) -> None:
|
||||
global _exporter
|
||||
global _exporter, _gateway
|
||||
|
||||
if not is_enabled():
|
||||
return
|
||||
|
||||
from enterprise.telemetry.exporter import EnterpriseExporter
|
||||
from enterprise.telemetry.gateway import TelemetryGateway
|
||||
|
||||
_exporter = EnterpriseExporter(dify_config)
|
||||
_gateway = TelemetryGateway()
|
||||
atexit.register(_exporter.shutdown)
|
||||
|
||||
# Import to trigger @signal.connect decorator registration
|
||||
|
|
@ -46,3 +52,7 @@ def init_app(app: DifyApp) -> None:
|
|||
|
||||
def get_enterprise_exporter() -> EnterpriseExporter | None:
|
||||
return _exporter
|
||||
|
||||
|
||||
def get_gateway() -> TelemetryGateway | None:
|
||||
return _gateway
|
||||
|
|
|
|||
|
|
@ -8,7 +8,8 @@ from core.memory.token_buffer_memory import TokenBufferMemory
|
|||
from core.model_manager import ModelManager
|
||||
from core.model_runtime.entities.model_entities import ModelType
|
||||
from core.ops.utils import measure_time
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TelemetryFacade, TraceTaskName
|
||||
from core.telemetry import TelemetryContext, TelemetryEvent, TraceTaskName
|
||||
from core.telemetry import emit as telemetry_emit
|
||||
from events.feedback_event import feedback_was_created
|
||||
from extensions.ext_database import db
|
||||
from libs.infinite_scroll_pagination import InfiniteScrollPagination
|
||||
|
|
@ -297,7 +298,7 @@ class MessageService:
|
|||
questions: list[str] = list(questions_sequence)
|
||||
|
||||
# get tracing instance
|
||||
TelemetryFacade.emit(
|
||||
telemetry_emit(
|
||||
TelemetryEvent(
|
||||
name=TraceTaskName.SUGGESTED_QUESTION_TRACE,
|
||||
context=TelemetryContext(tenant_id=app_model.tenant_id, app_id=app_model.id),
|
||||
|
|
|
|||
|
|
@ -0,0 +1,52 @@
|
|||
"""Celery worker for enterprise metric/log telemetry events.
|
||||
|
||||
This module defines the Celery task that processes telemetry envelopes
|
||||
from the enterprise_telemetry queue. It deserializes envelopes and
|
||||
dispatches them to the EnterpriseMetricHandler.
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
|
||||
from celery import shared_task
|
||||
|
||||
from enterprise.telemetry.contracts import TelemetryEnvelope
|
||||
from enterprise.telemetry.metric_handler import EnterpriseMetricHandler
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
@shared_task(queue="enterprise_telemetry")
|
||||
def process_enterprise_telemetry(envelope_json: str) -> None:
|
||||
"""Process enterprise metric/log telemetry envelope.
|
||||
|
||||
This task is enqueued by the TelemetryGateway for metric/log-only
|
||||
events. It deserializes the envelope and dispatches to the handler.
|
||||
|
||||
Best-effort processing: logs errors but never raises, to avoid
|
||||
failing user requests due to telemetry issues.
|
||||
|
||||
Args:
|
||||
envelope_json: JSON-serialized TelemetryEnvelope.
|
||||
"""
|
||||
try:
|
||||
# Deserialize envelope
|
||||
envelope_dict = json.loads(envelope_json)
|
||||
envelope = TelemetryEnvelope.model_validate(envelope_dict)
|
||||
|
||||
# Process through handler
|
||||
handler = EnterpriseMetricHandler()
|
||||
handler.handle(envelope)
|
||||
|
||||
logger.debug(
|
||||
"Successfully processed telemetry envelope: tenant_id=%s, event_id=%s, case=%s",
|
||||
envelope.tenant_id,
|
||||
envelope.event_id,
|
||||
envelope.case,
|
||||
)
|
||||
except Exception:
|
||||
# Best-effort: log and drop on error, never fail user request
|
||||
logger.warning(
|
||||
"Failed to process enterprise telemetry envelope, dropping event",
|
||||
exc_info=True,
|
||||
)
|
||||
|
|
@ -43,10 +43,9 @@ def process_trace_tasks(file_info):
|
|||
if trace_type:
|
||||
trace_info = trace_type(**trace_info)
|
||||
|
||||
# process enterprise trace separately
|
||||
from enterprise.telemetry.exporter import is_enterprise_telemetry_enabled
|
||||
from extensions.ext_enterprise_telemetry import is_enabled as is_ee_telemetry_enabled
|
||||
|
||||
if is_enterprise_telemetry_enabled():
|
||||
if is_ee_telemetry_enabled():
|
||||
from enterprise.telemetry.enterprise_trace import EnterpriseOtelTrace
|
||||
|
||||
try:
|
||||
|
|
|
|||
|
|
@ -1,16 +1,11 @@
|
|||
"""Unit tests for TelemetryFacade.emit() routing and enterprise-only filtering.
|
||||
"""Unit tests for core.telemetry.emit() routing and enterprise-only filtering."""
|
||||
|
||||
This test suite verifies that TelemetryFacade correctly:
|
||||
1. Routes telemetry events to TraceQueueManager via enum-based TraceTaskName
|
||||
2. Blocks community traces (returns early)
|
||||
3. Allows enterprise-only traces to be routed to TraceQueueManager
|
||||
4. Passes TraceTaskName enum directly to TraceTask constructor
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import queue
|
||||
import sys
|
||||
import types
|
||||
from unittest.mock import MagicMock
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
|
|
@ -19,11 +14,8 @@ from core.telemetry.events import TelemetryContext, TelemetryEvent
|
|||
|
||||
|
||||
@pytest.fixture
|
||||
def facade_test_setup(monkeypatch):
|
||||
"""Fixture to provide TelemetryFacade with mocked TraceQueueManager."""
|
||||
def telemetry_test_setup(monkeypatch):
|
||||
module_name = "core.ops.ops_trace_manager"
|
||||
|
||||
# Always create a fresh stub module for testing
|
||||
ops_stub = types.ModuleType(module_name)
|
||||
|
||||
class StubTraceTask:
|
||||
|
|
@ -55,22 +47,15 @@ def facade_test_setup(monkeypatch):
|
|||
ops_stub.trace_manager_queue = MagicMock(spec=queue.Queue)
|
||||
monkeypatch.setitem(sys.modules, module_name, ops_stub)
|
||||
|
||||
from core.telemetry.facade import TelemetryFacade
|
||||
from core.telemetry import emit
|
||||
|
||||
return TelemetryFacade, ops_stub.trace_manager_queue
|
||||
return emit, ops_stub.trace_manager_queue
|
||||
|
||||
|
||||
class TestTelemetryFacadeEmit:
|
||||
"""Test TelemetryFacade.emit() routing and filtering."""
|
||||
|
||||
def test_emit_valid_name_creates_trace_task(self, facade_test_setup):
|
||||
"""Verify emit with enterprise-only trace creates and enqueues a trace task.
|
||||
|
||||
When emit() is called with an enterprise-only trace name
|
||||
(DRAFT_NODE_EXECUTION_TRACE), TraceQueueManager.add_trace_task()
|
||||
should be called with a TraceTask.
|
||||
"""
|
||||
TelemetryFacade, mock_queue = facade_test_setup
|
||||
class TestTelemetryEmit:
|
||||
@patch("core.telemetry._is_enterprise_telemetry_enabled", return_value=True)
|
||||
def test_emit_enterprise_trace_creates_trace_task(self, _mock_ee, telemetry_test_setup):
|
||||
emit_fn, mock_queue = telemetry_test_setup
|
||||
|
||||
event = TelemetryEvent(
|
||||
name=TraceTaskName.DRAFT_NODE_EXECUTION_TRACE,
|
||||
|
|
@ -82,22 +67,14 @@ class TestTelemetryFacadeEmit:
|
|||
payload={"key": "value"},
|
||||
)
|
||||
|
||||
TelemetryFacade.emit(event)
|
||||
emit_fn(event)
|
||||
|
||||
# Verify add_trace_task was called
|
||||
mock_queue.put.assert_called_once()
|
||||
|
||||
# Verify the TraceTask was created with the correct name
|
||||
called_task = mock_queue.put.call_args[0][0]
|
||||
assert called_task.trace_type == TraceTaskName.DRAFT_NODE_EXECUTION_TRACE
|
||||
|
||||
def test_emit_community_trace_returns_early(self, facade_test_setup):
|
||||
"""Verify community trace is blocked by early return.
|
||||
|
||||
When emit() is called with a community trace (WORKFLOW_TRACE),
|
||||
the facade should return early without calling add_trace_task.
|
||||
"""
|
||||
TelemetryFacade, mock_queue = facade_test_setup
|
||||
def test_emit_community_trace_enqueued(self, telemetry_test_setup):
|
||||
emit_fn, mock_queue = telemetry_test_setup
|
||||
|
||||
event = TelemetryEvent(
|
||||
name=TraceTaskName.WORKFLOW_TRACE,
|
||||
|
|
@ -109,18 +86,12 @@ class TestTelemetryFacadeEmit:
|
|||
payload={},
|
||||
)
|
||||
|
||||
TelemetryFacade.emit(event)
|
||||
emit_fn(event)
|
||||
|
||||
# Community traces should not reach the queue
|
||||
mock_queue.put.assert_not_called()
|
||||
mock_queue.put.assert_called_once()
|
||||
|
||||
def test_emit_enterprise_only_trace_allowed(self, facade_test_setup):
|
||||
"""Verify enterprise-only trace is routed to TraceQueueManager.
|
||||
|
||||
When emit() is called with DRAFT_NODE_EXECUTION_TRACE,
|
||||
add_trace_task should be called.
|
||||
"""
|
||||
TelemetryFacade, mock_queue = facade_test_setup
|
||||
def test_emit_enterprise_only_trace_dropped_when_ee_disabled(self, telemetry_test_setup):
|
||||
emit_fn, mock_queue = telemetry_test_setup
|
||||
|
||||
event = TelemetryEvent(
|
||||
name=TraceTaskName.DRAFT_NODE_EXECUTION_TRACE,
|
||||
|
|
@ -132,26 +103,13 @@ class TestTelemetryFacadeEmit:
|
|||
payload={},
|
||||
)
|
||||
|
||||
TelemetryFacade.emit(event)
|
||||
emit_fn(event)
|
||||
|
||||
# Verify add_trace_task was called and task was enqueued
|
||||
mock_queue.put.assert_called_once()
|
||||
mock_queue.put.assert_not_called()
|
||||
|
||||
# Verify the TraceTask was created with the correct name
|
||||
called_task = mock_queue.put.call_args[0][0]
|
||||
assert called_task.trace_type == TraceTaskName.DRAFT_NODE_EXECUTION_TRACE
|
||||
|
||||
def test_emit_all_enterprise_only_traces_allowed(self, facade_test_setup):
|
||||
"""Verify all 3 enterprise-only traces are correctly identified.
|
||||
|
||||
The three enterprise-only traces are:
|
||||
- DRAFT_NODE_EXECUTION_TRACE
|
||||
- NODE_EXECUTION_TRACE
|
||||
- PROMPT_GENERATION_TRACE
|
||||
|
||||
When these are emitted, they should be routed to add_trace_task.
|
||||
"""
|
||||
TelemetryFacade, mock_queue = facade_test_setup
|
||||
@patch("core.telemetry._is_enterprise_telemetry_enabled", return_value=True)
|
||||
def test_emit_all_enterprise_only_traces_allowed_when_ee_enabled(self, _mock_ee, telemetry_test_setup):
|
||||
emit_fn, mock_queue = telemetry_test_setup
|
||||
|
||||
enterprise_only_traces = [
|
||||
TraceTaskName.DRAFT_NODE_EXECUTION_TRACE,
|
||||
|
|
@ -172,22 +130,15 @@ class TestTelemetryFacadeEmit:
|
|||
payload={},
|
||||
)
|
||||
|
||||
TelemetryFacade.emit(event)
|
||||
emit_fn(event)
|
||||
|
||||
# All enterprise-only traces should be routed
|
||||
mock_queue.put.assert_called_once()
|
||||
|
||||
# Verify the correct trace name was passed
|
||||
called_task = mock_queue.put.call_args[0][0]
|
||||
assert called_task.trace_type == trace_name
|
||||
|
||||
def test_emit_passes_name_directly_to_trace_task(self, facade_test_setup):
|
||||
"""Verify event.name (TraceTaskName enum) is passed directly to TraceTask.
|
||||
|
||||
The facade should pass the TraceTaskName enum directly as the first
|
||||
argument to TraceTask(), not convert it to a string.
|
||||
"""
|
||||
TelemetryFacade, mock_queue = facade_test_setup
|
||||
@patch("core.telemetry._is_enterprise_telemetry_enabled", return_value=True)
|
||||
def test_emit_passes_name_directly_to_trace_task(self, _mock_ee, telemetry_test_setup):
|
||||
emit_fn, mock_queue = telemetry_test_setup
|
||||
|
||||
event = TelemetryEvent(
|
||||
name=TraceTaskName.DRAFT_NODE_EXECUTION_TRACE,
|
||||
|
|
@ -199,25 +150,16 @@ class TestTelemetryFacadeEmit:
|
|||
payload={"extra": "data"},
|
||||
)
|
||||
|
||||
TelemetryFacade.emit(event)
|
||||
emit_fn(event)
|
||||
|
||||
# Verify add_trace_task was called
|
||||
mock_queue.put.assert_called_once()
|
||||
|
||||
# Verify the TraceTask was created with the enum directly
|
||||
called_task = mock_queue.put.call_args[0][0]
|
||||
|
||||
# The trace_type should be the enum, not a string
|
||||
assert called_task.trace_type == TraceTaskName.DRAFT_NODE_EXECUTION_TRACE
|
||||
assert isinstance(called_task.trace_type, TraceTaskName)
|
||||
|
||||
def test_emit_with_provided_trace_manager(self, facade_test_setup):
|
||||
"""Verify emit uses provided trace_manager instead of creating one.
|
||||
|
||||
When a trace_manager is provided, emit should use it directly
|
||||
instead of creating a new TraceQueueManager.
|
||||
"""
|
||||
TelemetryFacade, mock_queue = facade_test_setup
|
||||
@patch("core.telemetry._is_enterprise_telemetry_enabled", return_value=True)
|
||||
def test_emit_with_provided_trace_manager(self, _mock_ee, telemetry_test_setup):
|
||||
emit_fn, mock_queue = telemetry_test_setup
|
||||
|
||||
mock_trace_manager = MagicMock()
|
||||
mock_trace_manager.add_trace_task = MagicMock()
|
||||
|
|
@ -232,11 +174,8 @@ class TestTelemetryFacadeEmit:
|
|||
payload={},
|
||||
)
|
||||
|
||||
TelemetryFacade.emit(event, trace_manager=mock_trace_manager)
|
||||
emit_fn(event, trace_manager=mock_trace_manager)
|
||||
|
||||
# Verify the provided trace_manager was used
|
||||
mock_trace_manager.add_trace_task.assert_called_once()
|
||||
|
||||
# Verify the TraceTask was created with the correct name
|
||||
called_task = mock_trace_manager.add_trace_task.call_args[0][0]
|
||||
assert called_task.trace_type == TraceTaskName.NODE_EXECUTION_TRACE
|
||||
|
|
|
|||
|
|
@ -0,0 +1,252 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import sys
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
from core.telemetry import is_enterprise_telemetry_enabled
|
||||
from enterprise.telemetry.contracts import TelemetryCase
|
||||
from enterprise.telemetry.gateway import TelemetryGateway
|
||||
|
||||
|
||||
class TestTelemetryCoreExports:
|
||||
def test_is_enterprise_telemetry_enabled_exported(self) -> None:
|
||||
from core.telemetry import is_enterprise_telemetry_enabled as exported_func
|
||||
|
||||
assert callable(exported_func)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def mock_ops_trace_manager():
|
||||
mock_module = MagicMock()
|
||||
mock_trace_task_class = MagicMock()
|
||||
mock_trace_task_class.return_value = MagicMock()
|
||||
mock_module.TraceTask = mock_trace_task_class
|
||||
mock_module.TraceQueueManager = MagicMock()
|
||||
|
||||
mock_trace_entity = MagicMock()
|
||||
mock_trace_task_name = MagicMock()
|
||||
mock_trace_task_name.return_value = "workflow"
|
||||
mock_trace_entity.TraceTaskName = mock_trace_task_name
|
||||
|
||||
with (
|
||||
patch.dict(sys.modules, {"core.ops.ops_trace_manager": mock_module}),
|
||||
patch.dict(sys.modules, {"core.ops.entities.trace_entity": mock_trace_entity}),
|
||||
):
|
||||
yield mock_module, mock_trace_entity
|
||||
|
||||
|
||||
class TestGatewayIntegrationTraceRouting:
|
||||
@pytest.fixture
|
||||
def gateway(self) -> TelemetryGateway:
|
||||
return TelemetryGateway()
|
||||
|
||||
@pytest.fixture
|
||||
def mock_trace_manager(self) -> MagicMock:
|
||||
return MagicMock()
|
||||
|
||||
@pytest.mark.usefixtures("mock_ops_trace_manager")
|
||||
def test_ce_eligible_trace_routed_to_trace_manager(
|
||||
self,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
) -> None:
|
||||
with patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=True):
|
||||
context = {"app_id": "app-123", "user_id": "user-456", "tenant_id": "tenant-789"}
|
||||
payload = {"workflow_run_id": "run-abc"}
|
||||
|
||||
gateway.emit(TelemetryCase.WORKFLOW_RUN, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_called_once()
|
||||
|
||||
@pytest.mark.usefixtures("mock_ops_trace_manager")
|
||||
def test_ce_eligible_trace_routed_when_ee_disabled(
|
||||
self,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
) -> None:
|
||||
with patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=False):
|
||||
context = {"app_id": "app-123", "user_id": "user-456"}
|
||||
payload = {"workflow_run_id": "run-abc"}
|
||||
|
||||
gateway.emit(TelemetryCase.WORKFLOW_RUN, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_called_once()
|
||||
|
||||
@pytest.mark.usefixtures("mock_ops_trace_manager")
|
||||
def test_enterprise_only_trace_dropped_when_ee_disabled(
|
||||
self,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
) -> None:
|
||||
with patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=False):
|
||||
context = {"app_id": "app-123", "user_id": "user-456"}
|
||||
payload = {"node_id": "node-abc"}
|
||||
|
||||
gateway.emit(TelemetryCase.NODE_EXECUTION, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_not_called()
|
||||
|
||||
@pytest.mark.usefixtures("mock_ops_trace_manager")
|
||||
def test_enterprise_only_trace_routed_when_ee_enabled(
|
||||
self,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
) -> None:
|
||||
with patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=True):
|
||||
context = {"app_id": "app-123", "user_id": "user-456"}
|
||||
payload = {"node_id": "node-abc"}
|
||||
|
||||
gateway.emit(TelemetryCase.NODE_EXECUTION, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_called_once()
|
||||
|
||||
|
||||
class TestGatewayIntegrationMetricRouting:
|
||||
@pytest.fixture
|
||||
def gateway(self) -> TelemetryGateway:
|
||||
return TelemetryGateway()
|
||||
|
||||
def test_metric_case_routes_to_celery_task(
|
||||
self,
|
||||
gateway: TelemetryGateway,
|
||||
) -> None:
|
||||
from enterprise.telemetry.contracts import TelemetryEnvelope
|
||||
|
||||
with patch("tasks.enterprise_telemetry_task.process_enterprise_telemetry.delay") as mock_delay:
|
||||
context = {"tenant_id": "tenant-123"}
|
||||
payload = {"app_id": "app-abc", "name": "My App"}
|
||||
|
||||
gateway.emit(TelemetryCase.APP_CREATED, context, payload)
|
||||
|
||||
mock_delay.assert_called_once()
|
||||
envelope_json = mock_delay.call_args[0][0]
|
||||
envelope = TelemetryEnvelope.model_validate_json(envelope_json)
|
||||
assert envelope.case == TelemetryCase.APP_CREATED
|
||||
assert envelope.tenant_id == "tenant-123"
|
||||
assert envelope.payload["app_id"] == "app-abc"
|
||||
|
||||
def test_tool_execution_metric_routed(
|
||||
self,
|
||||
gateway: TelemetryGateway,
|
||||
) -> None:
|
||||
from enterprise.telemetry.contracts import TelemetryEnvelope
|
||||
|
||||
with patch("tasks.enterprise_telemetry_task.process_enterprise_telemetry.delay") as mock_delay:
|
||||
context = {"tenant_id": "tenant-123", "app_id": "app-123"}
|
||||
payload = {"tool_name": "test_tool", "tool_inputs": {}, "tool_outputs": "result"}
|
||||
|
||||
gateway.emit(TelemetryCase.TOOL_EXECUTION, context, payload)
|
||||
|
||||
mock_delay.assert_called_once()
|
||||
envelope_json = mock_delay.call_args[0][0]
|
||||
envelope = TelemetryEnvelope.model_validate_json(envelope_json)
|
||||
assert envelope.case == TelemetryCase.TOOL_EXECUTION
|
||||
|
||||
def test_moderation_check_metric_routed(
|
||||
self,
|
||||
gateway: TelemetryGateway,
|
||||
) -> None:
|
||||
from enterprise.telemetry.contracts import TelemetryEnvelope
|
||||
|
||||
with patch("tasks.enterprise_telemetry_task.process_enterprise_telemetry.delay") as mock_delay:
|
||||
context = {"tenant_id": "tenant-123", "app_id": "app-123"}
|
||||
payload = {"message_id": "msg-123", "moderation_result": {"flagged": False}}
|
||||
|
||||
gateway.emit(TelemetryCase.MODERATION_CHECK, context, payload)
|
||||
|
||||
mock_delay.assert_called_once()
|
||||
envelope_json = mock_delay.call_args[0][0]
|
||||
envelope = TelemetryEnvelope.model_validate_json(envelope_json)
|
||||
assert envelope.case == TelemetryCase.MODERATION_CHECK
|
||||
|
||||
|
||||
class TestGatewayIntegrationCEEligibility:
|
||||
@pytest.fixture
|
||||
def gateway(self) -> TelemetryGateway:
|
||||
return TelemetryGateway()
|
||||
|
||||
@pytest.fixture
|
||||
def mock_trace_manager(self) -> MagicMock:
|
||||
return MagicMock()
|
||||
|
||||
@pytest.mark.usefixtures("mock_ops_trace_manager")
|
||||
def test_workflow_run_is_ce_eligible(
|
||||
self,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
) -> None:
|
||||
with patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=False):
|
||||
context = {"app_id": "app-123", "user_id": "user-456"}
|
||||
payload = {"workflow_run_id": "run-abc"}
|
||||
|
||||
gateway.emit(TelemetryCase.WORKFLOW_RUN, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_called_once()
|
||||
|
||||
@pytest.mark.usefixtures("mock_ops_trace_manager")
|
||||
def test_message_run_is_ce_eligible(
|
||||
self,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
) -> None:
|
||||
with patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=False):
|
||||
context = {"app_id": "app-123", "user_id": "user-456"}
|
||||
payload = {"message_id": "msg-abc", "conversation_id": "conv-123"}
|
||||
|
||||
gateway.emit(TelemetryCase.MESSAGE_RUN, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_called_once()
|
||||
|
||||
@pytest.mark.usefixtures("mock_ops_trace_manager")
|
||||
def test_node_execution_not_ce_eligible(
|
||||
self,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
) -> None:
|
||||
with patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=False):
|
||||
context = {"app_id": "app-123", "user_id": "user-456"}
|
||||
payload = {"node_id": "node-abc"}
|
||||
|
||||
gateway.emit(TelemetryCase.NODE_EXECUTION, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_not_called()
|
||||
|
||||
@pytest.mark.usefixtures("mock_ops_trace_manager")
|
||||
def test_draft_node_execution_not_ce_eligible(
|
||||
self,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
) -> None:
|
||||
with patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=False):
|
||||
context = {"app_id": "app-123", "user_id": "user-456"}
|
||||
payload = {"node_execution_data": {}}
|
||||
|
||||
gateway.emit(TelemetryCase.DRAFT_NODE_EXECUTION, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_not_called()
|
||||
|
||||
@pytest.mark.usefixtures("mock_ops_trace_manager")
|
||||
def test_prompt_generation_not_ce_eligible(
|
||||
self,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
) -> None:
|
||||
with patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=False):
|
||||
context = {"app_id": "app-123", "user_id": "user-456", "tenant_id": "tenant-789"}
|
||||
payload = {"operation_type": "generate", "instruction": "test"}
|
||||
|
||||
gateway.emit(TelemetryCase.PROMPT_GENERATION, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_not_called()
|
||||
|
||||
|
||||
class TestIsEnterpriseTelemetryEnabled:
|
||||
def test_returns_false_when_exporter_import_fails(self) -> None:
|
||||
with patch.dict(sys.modules, {"enterprise.telemetry.exporter": None}):
|
||||
result = is_enterprise_telemetry_enabled()
|
||||
assert result is False
|
||||
|
||||
def test_function_is_callable(self) -> None:
|
||||
assert callable(is_enterprise_telemetry_enabled)
|
||||
|
|
@ -5,7 +5,7 @@ from __future__ import annotations
|
|||
import pytest
|
||||
from pydantic import ValidationError
|
||||
|
||||
from enterprise.telemetry.contracts import CaseRoute, TelemetryCase, TelemetryEnvelope
|
||||
from enterprise.telemetry.contracts import CaseRoute, SignalType, TelemetryCase, TelemetryEnvelope
|
||||
from enterprise.telemetry.gateway import CASE_ROUTING
|
||||
|
||||
|
||||
|
|
@ -56,14 +56,14 @@ class TestCaseRoute:
|
|||
|
||||
def test_valid_trace_route(self) -> None:
|
||||
"""Verify valid trace route creation."""
|
||||
route = CaseRoute(signal_type="trace", ce_eligible=True)
|
||||
assert route.signal_type == "trace"
|
||||
route = CaseRoute(signal_type=SignalType.TRACE, ce_eligible=True)
|
||||
assert route.signal_type == SignalType.TRACE
|
||||
assert route.ce_eligible is True
|
||||
|
||||
def test_valid_metric_log_route(self) -> None:
|
||||
"""Verify valid metric_log route creation."""
|
||||
route = CaseRoute(signal_type="metric_log", ce_eligible=False)
|
||||
assert route.signal_type == "metric_log"
|
||||
route = CaseRoute(signal_type=SignalType.METRIC_LOG, ce_eligible=False)
|
||||
assert route.signal_type == SignalType.METRIC_LOG
|
||||
assert route.ce_eligible is False
|
||||
|
||||
def test_invalid_signal_type(self) -> None:
|
||||
|
|
@ -199,7 +199,7 @@ class TestCaseRouting:
|
|||
}
|
||||
for case in ce_eligible_trace_cases:
|
||||
route = CASE_ROUTING[case]
|
||||
assert route.signal_type == "trace"
|
||||
assert route.signal_type == SignalType.TRACE
|
||||
assert route.ce_eligible is True
|
||||
|
||||
def test_trace_enterprise_only_cases(self) -> None:
|
||||
|
|
@ -211,7 +211,7 @@ class TestCaseRouting:
|
|||
}
|
||||
for case in enterprise_only_trace_cases:
|
||||
route = CASE_ROUTING[case]
|
||||
assert route.signal_type == "trace"
|
||||
assert route.signal_type == SignalType.TRACE
|
||||
assert route.ce_eligible is False
|
||||
|
||||
def test_metric_log_cases(self) -> None:
|
||||
|
|
@ -229,7 +229,7 @@ class TestCaseRouting:
|
|||
}
|
||||
for case in metric_log_cases:
|
||||
route = CASE_ROUTING[case]
|
||||
assert route.signal_type == "metric_log"
|
||||
assert route.signal_type == SignalType.METRIC_LOG
|
||||
assert route.ce_eligible is False
|
||||
|
||||
def test_routing_table_completeness(self) -> None:
|
||||
|
|
@ -258,7 +258,7 @@ class TestCaseRouting:
|
|||
assert all_cases == set(TelemetryCase)
|
||||
|
||||
for case in trace_cases:
|
||||
assert CASE_ROUTING[case].signal_type == "trace"
|
||||
assert CASE_ROUTING[case].signal_type == SignalType.TRACE
|
||||
|
||||
for case in metric_log_cases:
|
||||
assert CASE_ROUTING[case].signal_type == "metric_log"
|
||||
assert CASE_ROUTING[case].signal_type == SignalType.METRIC_LOG
|
||||
|
|
|
|||
|
|
@ -5,44 +5,17 @@ from unittest.mock import MagicMock, patch
|
|||
|
||||
import pytest
|
||||
|
||||
from enterprise.telemetry.contracts import TelemetryCase, TelemetryEnvelope
|
||||
from core.ops.entities.trace_entity import TraceTaskName
|
||||
from enterprise.telemetry.contracts import SignalType, TelemetryCase, TelemetryEnvelope
|
||||
from enterprise.telemetry.gateway import (
|
||||
CASE_ROUTING,
|
||||
CASE_TO_TRACE_TASK_NAME,
|
||||
CASE_TO_TRACE_TASK,
|
||||
PAYLOAD_SIZE_THRESHOLD_BYTES,
|
||||
TelemetryGateway,
|
||||
emit,
|
||||
get_gateway,
|
||||
is_gateway_enabled,
|
||||
)
|
||||
|
||||
|
||||
class TestIsGatewayEnabled:
|
||||
@pytest.mark.parametrize(
|
||||
("env_value", "expected"),
|
||||
[
|
||||
("true", True),
|
||||
("True", True),
|
||||
("TRUE", True),
|
||||
("1", True),
|
||||
("yes", True),
|
||||
("YES", True),
|
||||
("false", False),
|
||||
("False", False),
|
||||
("0", False),
|
||||
("no", False),
|
||||
("", False),
|
||||
],
|
||||
)
|
||||
def test_feature_flag_values(self, env_value: str, expected: bool) -> None:
|
||||
with patch.dict("os.environ", {"ENTERPRISE_TELEMETRY_GATEWAY_ENABLED": env_value}):
|
||||
assert is_gateway_enabled() is expected
|
||||
|
||||
def test_missing_env_var(self) -> None:
|
||||
with patch.dict("os.environ", {}, clear=True):
|
||||
assert is_gateway_enabled() is False
|
||||
|
||||
|
||||
class TestCaseRoutingTable:
|
||||
def test_all_cases_have_routing(self) -> None:
|
||||
for case in TelemetryCase:
|
||||
|
|
@ -57,7 +30,7 @@ class TestCaseRoutingTable:
|
|||
TelemetryCase.PROMPT_GENERATION,
|
||||
]
|
||||
for case in trace_cases:
|
||||
assert CASE_ROUTING[case].signal_type == "trace", f"{case} should be trace"
|
||||
assert CASE_ROUTING[case].signal_type is SignalType.TRACE, f"{case} should be trace"
|
||||
|
||||
def test_metric_log_cases(self) -> None:
|
||||
metric_log_cases = [
|
||||
|
|
@ -72,7 +45,7 @@ class TestCaseRoutingTable:
|
|||
TelemetryCase.GENERATE_NAME,
|
||||
]
|
||||
for case in metric_log_cases:
|
||||
assert CASE_ROUTING[case].signal_type == "metric_log", f"{case} should be metric_log"
|
||||
assert CASE_ROUTING[case].signal_type is SignalType.METRIC_LOG, f"{case} should be metric_log"
|
||||
|
||||
def test_ce_eligible_cases(self) -> None:
|
||||
ce_eligible_cases = [TelemetryCase.WORKFLOW_RUN, TelemetryCase.MESSAGE_RUN]
|
||||
|
|
@ -89,9 +62,9 @@ class TestCaseRoutingTable:
|
|||
assert CASE_ROUTING[case].ce_eligible is False, f"{case} should be enterprise-only"
|
||||
|
||||
def test_trace_cases_have_task_name_mapping(self) -> None:
|
||||
trace_cases = [c for c in TelemetryCase if CASE_ROUTING[c].signal_type == "trace"]
|
||||
trace_cases = [c for c in TelemetryCase if CASE_ROUTING[c].signal_type is SignalType.TRACE]
|
||||
for case in trace_cases:
|
||||
assert case in CASE_TO_TRACE_TASK_NAME, f"Missing TraceTaskName mapping for {case}"
|
||||
assert case in CASE_TO_TRACE_TASK, f"Missing TraceTaskName mapping for {case}"
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
|
@ -123,12 +96,10 @@ class TestTelemetryGatewayTraceRouting:
|
|||
def mock_trace_manager(self) -> MagicMock:
|
||||
return MagicMock()
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=True)
|
||||
@patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=True)
|
||||
def test_trace_case_routes_to_trace_manager(
|
||||
self,
|
||||
_mock_ee_enabled: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
mock_ops_trace_manager: tuple[MagicMock, MagicMock],
|
||||
|
|
@ -140,12 +111,10 @@ class TestTelemetryGatewayTraceRouting:
|
|||
|
||||
mock_trace_manager.add_trace_task.assert_called_once()
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=True)
|
||||
@patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=False)
|
||||
def test_ce_eligible_trace_enqueued_when_ee_disabled(
|
||||
self,
|
||||
_mock_ee_enabled: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
mock_ops_trace_manager: tuple[MagicMock, MagicMock],
|
||||
|
|
@ -157,12 +126,10 @@ class TestTelemetryGatewayTraceRouting:
|
|||
|
||||
mock_trace_manager.add_trace_task.assert_called_once()
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=True)
|
||||
@patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=False)
|
||||
def test_enterprise_only_trace_dropped_when_ee_disabled(
|
||||
self,
|
||||
_mock_ee_enabled: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
mock_ops_trace_manager: tuple[MagicMock, MagicMock],
|
||||
|
|
@ -174,12 +141,10 @@ class TestTelemetryGatewayTraceRouting:
|
|||
|
||||
mock_trace_manager.add_trace_task.assert_not_called()
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=True)
|
||||
@patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=True)
|
||||
def test_enterprise_only_trace_enqueued_when_ee_enabled(
|
||||
self,
|
||||
_mock_ee_enabled: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
mock_ops_trace_manager: tuple[MagicMock, MagicMock],
|
||||
|
|
@ -197,12 +162,10 @@ class TestTelemetryGatewayMetricLogRouting:
|
|||
def gateway(self) -> TelemetryGateway:
|
||||
return TelemetryGateway()
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=True)
|
||||
@patch("tasks.enterprise_telemetry_task.process_enterprise_telemetry.delay")
|
||||
def test_metric_case_routes_to_celery_task(
|
||||
self,
|
||||
mock_delay: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
) -> None:
|
||||
context = {"tenant_id": "tenant-123"}
|
||||
|
|
@ -217,12 +180,10 @@ class TestTelemetryGatewayMetricLogRouting:
|
|||
assert envelope.tenant_id == "tenant-123"
|
||||
assert envelope.payload["app_id"] == "app-abc"
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=True)
|
||||
@patch("tasks.enterprise_telemetry_task.process_enterprise_telemetry.delay")
|
||||
def test_envelope_has_unique_event_id(
|
||||
self,
|
||||
mock_delay: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
) -> None:
|
||||
context = {"tenant_id": "tenant-123"}
|
||||
|
|
@ -242,12 +203,10 @@ class TestTelemetryGatewayPayloadSizing:
|
|||
def gateway(self) -> TelemetryGateway:
|
||||
return TelemetryGateway()
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=True)
|
||||
@patch("tasks.enterprise_telemetry_task.process_enterprise_telemetry.delay")
|
||||
def test_small_payload_inlined(
|
||||
self,
|
||||
mock_delay: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
) -> None:
|
||||
context = {"tenant_id": "tenant-123"}
|
||||
|
|
@ -260,14 +219,12 @@ class TestTelemetryGatewayPayloadSizing:
|
|||
assert envelope.payload == payload
|
||||
assert envelope.metadata is None
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=True)
|
||||
@patch("enterprise.telemetry.gateway.storage")
|
||||
@patch("tasks.enterprise_telemetry_task.process_enterprise_telemetry.delay")
|
||||
def test_large_payload_stored(
|
||||
self,
|
||||
mock_delay: MagicMock,
|
||||
mock_storage: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
) -> None:
|
||||
context = {"tenant_id": "tenant-123"}
|
||||
|
|
@ -286,14 +243,12 @@ class TestTelemetryGatewayPayloadSizing:
|
|||
assert envelope.metadata is not None
|
||||
assert envelope.metadata["payload_ref"] == storage_key
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=True)
|
||||
@patch("enterprise.telemetry.gateway.storage")
|
||||
@patch("tasks.enterprise_telemetry_task.process_enterprise_telemetry.delay")
|
||||
def test_large_payload_fallback_on_storage_error(
|
||||
self,
|
||||
mock_delay: MagicMock,
|
||||
mock_storage: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
) -> None:
|
||||
mock_storage.save.side_effect = Exception("Storage failure")
|
||||
|
|
@ -309,127 +264,38 @@ class TestTelemetryGatewayPayloadSizing:
|
|||
assert envelope.metadata is None
|
||||
|
||||
|
||||
class TestTelemetryGatewayFeatureFlag:
|
||||
@pytest.fixture
|
||||
def gateway(self) -> TelemetryGateway:
|
||||
return TelemetryGateway()
|
||||
|
||||
@pytest.fixture
|
||||
def mock_trace_manager(self) -> MagicMock:
|
||||
return MagicMock()
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=False)
|
||||
@patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=True)
|
||||
def test_legacy_path_used_when_flag_disabled(
|
||||
self,
|
||||
_mock_ee_enabled: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
mock_ops_trace_manager: tuple[MagicMock, MagicMock],
|
||||
) -> None:
|
||||
context = {"app_id": "app-123", "user_id": "user-456"}
|
||||
payload = {"workflow_run_id": "run-abc"}
|
||||
|
||||
gateway.emit(TelemetryCase.WORKFLOW_RUN, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_called_once()
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=False)
|
||||
@patch("tasks.enterprise_telemetry_task.process_enterprise_telemetry.delay")
|
||||
def test_metric_log_not_processed_via_legacy_path(
|
||||
self,
|
||||
mock_delay: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
) -> None:
|
||||
context = {"tenant_id": "tenant-123"}
|
||||
payload = {"app_id": "app-abc"}
|
||||
|
||||
gateway.emit(TelemetryCase.APP_CREATED, context, payload)
|
||||
|
||||
mock_delay.assert_not_called()
|
||||
|
||||
|
||||
class TestTelemetryGatewayLegacyPath:
|
||||
@pytest.fixture
|
||||
def gateway(self) -> TelemetryGateway:
|
||||
return TelemetryGateway()
|
||||
|
||||
@pytest.fixture
|
||||
def mock_trace_manager(self) -> MagicMock:
|
||||
return MagicMock()
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=False)
|
||||
@patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=False)
|
||||
def test_legacy_ce_eligible_enqueued_when_ee_disabled(
|
||||
self,
|
||||
_mock_ee_enabled: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
mock_ops_trace_manager: tuple[MagicMock, MagicMock],
|
||||
) -> None:
|
||||
context = {"app_id": "app-123", "user_id": "user-456"}
|
||||
payload = {"workflow_run_id": "run-abc"}
|
||||
|
||||
gateway.emit(TelemetryCase.WORKFLOW_RUN, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_called_once()
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=False)
|
||||
@patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=False)
|
||||
def test_legacy_enterprise_only_dropped_when_ee_disabled(
|
||||
self,
|
||||
_mock_ee_enabled: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
gateway: TelemetryGateway,
|
||||
mock_trace_manager: MagicMock,
|
||||
mock_ops_trace_manager: tuple[MagicMock, MagicMock],
|
||||
) -> None:
|
||||
context = {"app_id": "app-123", "user_id": "user-456"}
|
||||
payload = {"node_id": "node-abc"}
|
||||
|
||||
gateway.emit(TelemetryCase.NODE_EXECUTION, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_not_called()
|
||||
|
||||
|
||||
class TestModuleLevelFunctions:
|
||||
def test_get_gateway_returns_singleton(self) -> None:
|
||||
gateway1 = get_gateway()
|
||||
gateway2 = get_gateway()
|
||||
assert gateway1 is gateway2
|
||||
|
||||
@patch("enterprise.telemetry.gateway.is_gateway_enabled", return_value=True)
|
||||
@patch("extensions.ext_enterprise_telemetry.get_gateway")
|
||||
@patch("enterprise.telemetry.gateway._is_enterprise_telemetry_enabled", return_value=True)
|
||||
def test_emit_function_uses_gateway(
|
||||
self,
|
||||
_mock_ee_enabled: MagicMock,
|
||||
_mock_gateway_enabled: MagicMock,
|
||||
mock_get_gateway: MagicMock,
|
||||
mock_ops_trace_manager: tuple[MagicMock, MagicMock],
|
||||
) -> None:
|
||||
mock_gateway = TelemetryGateway()
|
||||
mock_get_gateway.return_value = mock_gateway
|
||||
mock_trace_manager = MagicMock()
|
||||
context = {"app_id": "app-123", "user_id": "user-456"}
|
||||
payload = {"workflow_run_id": "run-abc"}
|
||||
|
||||
emit(TelemetryCase.WORKFLOW_RUN, context, payload, mock_trace_manager)
|
||||
|
||||
mock_trace_manager.add_trace_task.assert_called_once()
|
||||
with patch.object(mock_gateway, "emit") as mock_emit:
|
||||
emit(TelemetryCase.WORKFLOW_RUN, context, payload, mock_trace_manager)
|
||||
mock_emit.assert_called_once_with(TelemetryCase.WORKFLOW_RUN, context, payload, mock_trace_manager)
|
||||
|
||||
|
||||
class TestTraceTaskNameMapping:
|
||||
def test_workflow_run_mapping(self) -> None:
|
||||
assert CASE_TO_TRACE_TASK_NAME[TelemetryCase.WORKFLOW_RUN] == "workflow"
|
||||
assert CASE_TO_TRACE_TASK[TelemetryCase.WORKFLOW_RUN] is TraceTaskName.WORKFLOW_TRACE
|
||||
|
||||
def test_message_run_mapping(self) -> None:
|
||||
assert CASE_TO_TRACE_TASK_NAME[TelemetryCase.MESSAGE_RUN] == "message"
|
||||
assert CASE_TO_TRACE_TASK[TelemetryCase.MESSAGE_RUN] is TraceTaskName.MESSAGE_TRACE
|
||||
|
||||
def test_node_execution_mapping(self) -> None:
|
||||
assert CASE_TO_TRACE_TASK_NAME[TelemetryCase.NODE_EXECUTION] == "node_execution"
|
||||
assert CASE_TO_TRACE_TASK[TelemetryCase.NODE_EXECUTION] is TraceTaskName.NODE_EXECUTION_TRACE
|
||||
|
||||
def test_draft_node_execution_mapping(self) -> None:
|
||||
assert CASE_TO_TRACE_TASK_NAME[TelemetryCase.DRAFT_NODE_EXECUTION] == "draft_node_execution"
|
||||
assert CASE_TO_TRACE_TASK[TelemetryCase.DRAFT_NODE_EXECUTION] is TraceTaskName.DRAFT_NODE_EXECUTION_TRACE
|
||||
|
||||
def test_prompt_generation_mapping(self) -> None:
|
||||
assert CASE_TO_TRACE_TASK_NAME[TelemetryCase.PROMPT_GENERATION] == "prompt_generation"
|
||||
assert CASE_TO_TRACE_TASK[TelemetryCase.PROMPT_GENERATION] is TraceTaskName.PROMPT_GENERATION_TRACE
|
||||
|
|
|
|||
|
|
@ -0,0 +1,69 @@
|
|||
"""Unit tests for enterprise telemetry Celery task."""
|
||||
|
||||
import json
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
from enterprise.telemetry.contracts import TelemetryCase, TelemetryEnvelope
|
||||
from tasks.enterprise_telemetry_task import process_enterprise_telemetry
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_envelope_json():
|
||||
envelope = TelemetryEnvelope(
|
||||
case=TelemetryCase.APP_CREATED,
|
||||
tenant_id="test-tenant",
|
||||
event_id="test-event-123",
|
||||
payload={"app_id": "app-123"},
|
||||
)
|
||||
return envelope.model_dump_json()
|
||||
|
||||
|
||||
def test_process_enterprise_telemetry_success(sample_envelope_json):
|
||||
with patch("tasks.enterprise_telemetry_task.EnterpriseMetricHandler") as mock_handler_class:
|
||||
mock_handler = MagicMock()
|
||||
mock_handler_class.return_value = mock_handler
|
||||
|
||||
process_enterprise_telemetry(sample_envelope_json)
|
||||
|
||||
mock_handler.handle.assert_called_once()
|
||||
call_args = mock_handler.handle.call_args[0][0]
|
||||
assert isinstance(call_args, TelemetryEnvelope)
|
||||
assert call_args.case == TelemetryCase.APP_CREATED
|
||||
assert call_args.tenant_id == "test-tenant"
|
||||
assert call_args.event_id == "test-event-123"
|
||||
|
||||
|
||||
def test_process_enterprise_telemetry_invalid_json(caplog):
|
||||
invalid_json = "not valid json"
|
||||
|
||||
process_enterprise_telemetry(invalid_json)
|
||||
|
||||
assert "Failed to process enterprise telemetry envelope" in caplog.text
|
||||
|
||||
|
||||
def test_process_enterprise_telemetry_handler_exception(sample_envelope_json, caplog):
|
||||
with patch("tasks.enterprise_telemetry_task.EnterpriseMetricHandler") as mock_handler_class:
|
||||
mock_handler = MagicMock()
|
||||
mock_handler.handle.side_effect = Exception("Handler error")
|
||||
mock_handler_class.return_value = mock_handler
|
||||
|
||||
process_enterprise_telemetry(sample_envelope_json)
|
||||
|
||||
assert "Failed to process enterprise telemetry envelope" in caplog.text
|
||||
|
||||
|
||||
def test_process_enterprise_telemetry_validation_error(caplog):
|
||||
invalid_envelope = json.dumps(
|
||||
{
|
||||
"case": "INVALID_CASE",
|
||||
"tenant_id": "test-tenant",
|
||||
"event_id": "test-event",
|
||||
"payload": {},
|
||||
}
|
||||
)
|
||||
|
||||
process_enterprise_telemetry(invalid_envelope)
|
||||
|
||||
assert "Failed to process enterprise telemetry envelope" in caplog.text
|
||||
Loading…
Reference in New Issue