Observability redesign to reduce dependencies and improve flexibility #379
- Introduced new subpackages: `aiqtoolkit-opentelemetry` and `aiqtoolkit-phoenix` for enhanced observability and integration with external services.
- Updated `pyproject.toml` and `uv.lock` to include dependencies for the new packages.
- Implemented telemetry exporters for both OpenTelemetry and Phoenix, allowing for flexible trace exporting.
- Added necessary classes and methods for span management and exporting in the observability module.

This update enhances the toolkit's capabilities for monitoring and observability, providing users with more options for telemetry integration.

Signed-off-by: Matthew Penn <mpenn@nvidia.com>
… manager context manager. Signed-off-by: Matthew Penn <mpenn@nvidia.com>
… type check in StepAdaptor process method Signed-off-by: Matthew Penn <mpenn@nvidia.com>
exporters and removed deprecated of main pyproject.toml Signed-off-by: Matthew Penn <mpenn@nvidia.com>
phoenix Signed-off-by: Matthew Penn <mpenn@nvidia.com>
step output Signed-off-by: Matthew Penn <mpenn@nvidia.com>
I think we are close but there are a few tweaks needed:
- The base exporter should subscribe directly to the event stream or have an abstract method which just accepts raw `IntermediateStep` objects
- We need to discuss the lifetime of exporters. I'm concerned we may be creating/destroying them too frequently.
- I am not sure about the `ExporterRegistry`. We can use factory functions in the registration if needed
…ity-redesign Signed-off-by: Matthew Penn <mpenn@nvidia.com>
implementations, removal of ExporterRegistry, and moving on_complete within ExporterManager.start context manager. Signed-off-by: Matthew Penn <mpenn@nvidia.com>
updated tests Signed-off-by: Matthew Penn <mpenn@nvidia.com>
lighter task scheduling Signed-off-by: Matthew Penn <mpenn@nvidia.com>
Updated the test to check that the "exporter1" instance is not only present but also a subclass of BaseExporter, improving type safety in the telemetry exporter validation. Signed-off-by: Matthew Penn <mpenn@nvidia.com>
workflow observability index.md re: aiq info -t, was missing components command Signed-off-by: Matthew Penn <mpenn@nvidia.com>
…ity-redesign Signed-off-by: Michael Demoret <mdemoret@nvidia.com>
Really nice piece of code. I'm impressed. A couple of general statements:
- Getting type incompatibilities on uses of `register_telemetry_exporter`. Likely need to update the decorator types with the new registrations
- Is this backwards compatible? If not, we need to properly mark and document
packages/aiqtoolkit_opentelemetry/src/aiq/plugins/opentelemetry/register.py
Updated the TelemetryExporterBuildCallableT type to reference BaseExporter instead of SpanExporter. Removed the OpenTelemetry import block as it is no longer necessary. Signed-off-by: Matthew Penn <mpenn@nvidia.com>
…io.gather with a simple for loop. Signed-off-by: Matthew Penn <mpenn@nvidia.com>
…Agent-Toolkit into mpenn_observability-redesign
Very minor feedback related to UUIDs. Feel free to ignore.
packages/aiqtoolkit_opentelemetry/src/aiq/plugins/opentelemetry/otel_span.py
…s for telemetry exporters. Updated the Langfuse, Langsmith, OtelCollector, Patronus, Galileo, Phoenix, and Catalyst telemetry exporters to inherit from BatchTelemetryConfigMixin, consolidating batch size control options into a single mixin. Removed redundant batch size control fields from individual exporters. Signed-off-by: Matthew Penn <mpenn@nvidia.com>
…ors. Signed-off-by: Matthew Penn <mpenn@nvidia.com>
…ion for UUID. Updated SpanExporter to use span_id factory method. Signed-off-by: Matthew Penn <mpenn@nvidia.com>
…ed file management features

- Renamed `FileTelemetryExporter` to `FileTelemetryExporterConfig` and updated its configuration fields to include `output_path`, `mode`, `enable_rolling`, `max_file_size`, `max_files`, and `cleanup_on_init`.
- Introduced `FileMode` enum for file write modes (append/overwrite).
- Updated the `file_telemetry_exporter` function to yield a `FileExporter` instance with the new configuration.
- Renamed `ConsoleLoggingMethod` to `ConsoleLoggingMethodConfig` and updated its registration accordingly.
- Added `FileExportMixin` enhancements to manage file path conflicts and rolling log files, including methods for file cleanup and resource conflict detection.

This refactor improves the flexibility and robustness of file-based telemetry logging, allowing for better management of log files and preventing resource conflicts.

Signed-off-by: Matthew Penn <mpenn@nvidia.com>
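To make the configuration fields in this commit concrete, here is a rough sketch of how they might fit together. Only the field names and the `FileMode` append/overwrite modes come from the commit message; the types, default values, the byte unit for `max_file_size`, and the helpers `should_roll` and `open_flags` are assumptions made for illustration, and the class is deliberately named `FileExporterConfigSketch` so it is not mistaken for the toolkit's own `FileTelemetryExporterConfig`.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class FileMode(str, Enum):
    """File write modes named in the commit message."""
    APPEND = "append"
    OVERWRITE = "overwrite"


@dataclass
class FileExporterConfigSketch:
    """Illustrative stand-in; types and defaults are assumptions."""
    output_path: Path
    mode: FileMode = FileMode.APPEND
    enable_rolling: bool = False
    max_file_size: int = 10 * 1024 * 1024  # assumed to be in bytes
    max_files: int = 5
    cleanup_on_init: bool = False


def should_roll(config: FileExporterConfigSketch, current_size: int, incoming: int) -> bool:
    """Hypothetical helper: True when writing `incoming` bytes would exceed the limit."""
    return config.enable_rolling and current_size + incoming > config.max_file_size


def open_flags(mode: FileMode) -> str:
    """Hypothetical helper: map a FileMode to the flag passed to open()."""
    return "a" if mode is FileMode.APPEND else "w"


config = FileExporterConfigSketch(Path("trace.jsonl"), enable_rolling=True, max_file_size=100)
```

With this configuration, a 20-byte write to a 90-byte file would trigger a roll, while the same write to a 50-byte file would not.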
types (including planned event types) Signed-off-by: Matthew Penn <mpenn@nvidia.com>
- Replaced `BatchTelemetryConfigMixin` with `BatchConfigMixin` and introduced `CollectorConfigMixin` for improved separation of concerns in telemetry exporters.
- Updated `Langfuse`, `Langsmith`, `OtelCollector`, `Patronus`, `Galileo`, `Phoenix`, and `Catalyst` telemetry exporters to utilize the new mixins, streamlining their configuration.
- Removed redundant project name fields from exporters where applicable.

This refactor enhances the clarity and maintainability of the telemetry exporter implementations.

Signed-off-by: Matthew Penn <mpenn@nvidia.com>
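The mixin split described in this commit might look roughly like the sketch below. The mixin names `BatchConfigMixin` and `CollectorConfigMixin` come from the commit message; every field name, default, and the `PhoenixConfigSketch` class are assumptions for illustration, and plain dataclasses are used here rather than whatever base class the toolkit's configs actually use.

```python
from dataclasses import dataclass


@dataclass
class BatchConfigMixin:
    """Batching options shared by exporters (field names are assumed)."""
    batch_size: int = 100
    flush_interval: float = 5.0  # seconds; assumed


@dataclass
class CollectorConfigMixin:
    """Collector options shared by exporters (field names are assumed)."""
    endpoint: str = ""
    project: str = "default"


@dataclass
class PhoenixConfigSketch(BatchConfigMixin, CollectorConfigMixin):
    """Hypothetical exporter config that only declares what is unique to it."""
    timeout: float = 30.0


config = PhoenixConfigSketch(endpoint="collector.example", batch_size=10)
```

The point of the split is that an exporter config inherits batching and collector fields instead of redeclaring them, so a shared field only has to be changed in one place.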
…ity-redesign Signed-off-by: Matthew Penn <mpenn@nvidia.com>
/merge
…NVIDIA#379)

Closes: NVIDIA#378

This PR introduces a complete overhaul of the NeMo Agent toolkit's observability infrastructure, providing a robust, scalable, and extensible framework for monitoring and tracing workflow execution.

## Key Improvements

- **Backwards Compatibility**: Maintains full backward compatibility while adding significant improvements in extensibility
- **Modular Architecture**: Plugin-based system supporting multiple telemetry backends
- **Reduced Dependencies**: Core observability no longer requires OpenTelemetry - now available as optional plugin
- **High Performance**: Copy-on-write isolation enables efficient concurrent execution
- **Extensible**: Type-safe processor pipelines for flexible data transformation
- **Multiple Backends**: Built-in support for OpenTelemetry, Phoenix, and Weave

## New Architecture

### Class/Plugin Hierarchy

```
Exporter (abstract interface)
└── BaseExporter (abstract)
    └── ProcessingExporter (abstract)
        ├── SpanExporter (abstract)
        │   ├── OtelSpanExporter (abstract)
        │   │   ├── PhoenixOtelExporter (concrete)
        │   │   │   └── phoenix (plugin)
        │   │   ├── OTLPSpanAdapterExporter (concrete)
        │   │   │   ├── langfuse (plugin)
        │   │   │   ├── langsmith (plugin)
        │   │   │   ├── otelcollector (plugin)
        │   │   │   ├── patronus (plugin)
        │   │   │   └── galileo (plugin)
        │   │   └── RagaAICatalystExporter (concrete)
        │   │       └── catalyst (plugin)
        │   └── WeaveExporter (concrete)
        │       └── weave (plugin)
        └── RawExporter (abstract)
            └── FileExporter (concrete)
                └── file (plugin)
```

### Core Components

#### ExporterManager

- Manages exporter lifecycle and concurrent execution
- Copy-on-write isolation for concurrent workflows
- Automatic cleanup and resource management

#### Processing Pipeline

Processing pipelines enable flexible data transformation chains where events can be modified, enriched, or reformatted before export. This allows exporters to adapt data for different backends without duplicating transformation logic.

- `Processor[InputT, OutputT]` generic transformation interface
- `BatchingProcessor` for efficient dynamic batching of traces
- Type-safe processor chains

#### Copy-on-Write Isolation

```
ExporterManager
├── Original Exporters (Shared Registry)
│   ├── Shared: HTTP Clients, Auth, Configuration
│   └── create_isolated_exporters()
│       ├── Workflow 1 → Isolated Instance (private tasks, events, buffers)
│       ├── Workflow 2 → Isolated Instance (private tasks, events, buffers)
│       └── Workflow N → Isolated Instance (private tasks, events, buffers)
└── Benefits: Fast (shared resources) + Safe (isolated state)
```

#### Performance Benefits

- **Fast**: Shares expensive resources (HTTP clients, auth)
- **Safe**: Isolates mutable state (tasks, events, buffers)
- **Memory Efficient**: Minimal overhead for isolation
- **Concurrent**: Enables parallel workflow execution

#### Core Plugin Exporters

- **FileExporter**: Local file export for debugging

### Plugin Exporters

- **OpenTelemetry**: Standard OTEL span export with OTLP adapter
- **Phoenix**: AI observability platform integration
- **Weave**: Weights & Biases integration

## Integration

The system integrates seamlessly with existing NAT runtime:

```python
# In AIQRunner
async with self._exporter_manager.start(context_state=self._context_state):
    result = await self._entry_fn.ainvoke(self._input_message, to_type=to_type)

# In WorkflowBuilder
for key, exporter_config in telemetry_config.tracing.items():
    await self.add_exporter(key, exporter_config)
```

Also included are minor bug fixes:

- Fixing `Function.astream` to set final output of `ActiveFunctionContextManager` for streaming requests
- Fixing `Function.astream` to use function runtime instance name to remove ambiguity in traces (previously the config type was used)

## By Submitting this PR I confirm:

- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
  - Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
- Matthew Penn (https://github.com/mpenn)
- Michael Demoret (https://github.com/mdemoret-nv)

Approvers:
- Michael Demoret (https://github.com/mdemoret-nv)

URL: NVIDIA#379
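The processing pipeline named in the description (`Processor[InputT, OutputT]`, `BatchingProcessor`, type-safe chains) can be pictured with a small sketch. This is not the toolkit's code: the class names `Processor` and `BatchingProcessor` are taken from the description, but the `process` method name, the `batch_size` parameter, the `flush` method, the `run_chain` helper, and the two example processors are assumptions chosen for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

InputT = TypeVar("InputT")
OutputT = TypeVar("OutputT")


class Processor(ABC, Generic[InputT, OutputT]):
    """Transforms one item of type InputT into an item of type OutputT."""

    @abstractmethod
    async def process(self, item: InputT) -> OutputT:
        ...


class UppercaseName(Processor[dict, dict]):
    """Hypothetical enrichment step: normalizes the event name."""

    async def process(self, item: dict) -> dict:
        return {**item, "name": item["name"].upper()}


class ToLine(Processor[dict, str]):
    """Hypothetical formatting step: renders an event as one text line."""

    async def process(self, item: dict) -> str:
        return f"{item['name']}:{item['value']}"


class BatchingProcessor(Processor[str, list]):
    """Buffers items; returns a full batch once batch_size is reached, else []."""

    def __init__(self, batch_size: int) -> None:
        self._batch_size = batch_size
        self._buffer = []

    async def process(self, item: str) -> list:
        self._buffer.append(item)
        if len(self._buffer) < self._batch_size:
            return []
        return self.flush()

    def flush(self) -> list:
        """Return whatever is buffered and reset the buffer."""
        batch, self._buffer = self._buffer, []
        return batch


async def run_chain(events, processors):
    """Pass each event through every processor in order; keep non-empty results."""
    exported = []
    for event in events:
        item = event
        for processor in processors:
            item = await processor.process(item)
        if item:
            exported.append(item)
    return exported


events = [
    {"name": "llm_start", "value": 1},
    {"name": "llm_end", "value": 2},
    {"name": "tool_start", "value": 3},
]
batcher = BatchingProcessor(batch_size=2)
batches = asyncio.run(run_chain(events, [UppercaseName(), ToLine(), batcher]))
leftover = batcher.flush()  # items still buffered when the workflow ends
```

Each processor's output type matches the next one's input type (`dict` to `dict`, `dict` to `str`, `str` to `list`), which is the property a type-safe chain would enforce. The final `flush()` call stands in for the end-of-run cleanup an exporter would need so that a partially filled batch is not lost.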
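The copy-on-write isolation diagram in the description can be approximated in a few lines. Only `ExporterManager` and `create_isolated_exporters()` are names from the description; `SharedClient`, `SketchExporter`, `SketchExporterManager`, `create_isolated_instance`, and the endpoint string are hypothetical. The sketch also simplifies the idea: it replaces the mutable state eagerly when the isolated instance is created, rather than deferring the copy until the first write.

```python
import copy


class SharedClient:
    """Stands in for an expensive shared resource such as an HTTP client."""

    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint


class SketchExporter:
    """Hypothetical exporter with one shared resource and private mutable state."""

    def __init__(self, client: SharedClient) -> None:
        self.client = client  # shared by every isolated copy
        self.buffer = []      # private to each copy

    def create_isolated_instance(self) -> "SketchExporter":
        # A shallow copy keeps references to the shared resources...
        clone = copy.copy(self)
        # ...and the per-workflow mutable state is replaced with a fresh object.
        clone.buffer = []
        return clone

    def export(self, event: str) -> None:
        self.buffer.append(event)


class SketchExporterManager:
    """Holds the original exporters and hands out isolated copies per workflow."""

    def __init__(self, exporters: dict) -> None:
        self._exporters = exporters

    def create_isolated_exporters(self) -> dict:
        return {
            name: exporter.create_isolated_instance()
            for name, exporter in self._exporters.items()
        }


original = SketchExporter(SharedClient("collector.example"))
manager = SketchExporterManager({"file": original})

workflow_1 = manager.create_isolated_exporters()
workflow_2 = manager.create_isolated_exporters()
workflow_1["file"].export("step-a")
```

After this runs, only workflow 1's buffer holds `"step-a"`, while both workflows still point at the same `SharedClient` object, which mirrors the "fast (shared resources) + safe (isolated state)" split in the diagram.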