Add override option to the eval CLI command #129
Conversation
Signed-off-by: Hritik003 <hritik.raj@nutanix.com>
@AnuradhaKaruppiah, are any changes required on this PR?
This is for consistency with the start commands. Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>
LGTM. Thanks for the contribution, @Hritik003.
Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>
Pull Request Overview
This pull request adds the ability to override configuration options when executing the eval CLI command. Key changes include:
- Introducing a new function in the evaluation module to load, override, and validate the configuration.
- Updating the evaluation run configuration type to include an override tuple.
- Modifying the CLI command to accept and pass override values, with corresponding documentation updates.
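The CLI change described above can be illustrated with a short, hypothetical sketch. This uses the stdlib `argparse` rather than the project's actual CLI framework, and the option names simply mirror the PR's documented usage: each repeated `--override` flag contributes a `(dotted_path, value)` pair.

```python
import argparse

# Hypothetical sketch of the eval CLI surface: repeated --override flags
# are collected as (dotted_path, value) pairs, as shown in the PR's test.
parser = argparse.ArgumentParser(prog="aiq eval")
parser.add_argument("--config_file", required=True)
parser.add_argument("--override", nargs=2, action="append", default=[],
                    metavar=("PATH", "VALUE"),
                    help="Override a config entry, "
                         "e.g. --override llms.nim_llm.temperature 0.7")

args = parser.parse_args([
    "--config_file", "examples/simple/configs/eval_config.yml",
    "--override", "llms.nim_llm.temperature", "0.7",
    "--override", "llms.nim_llm.model_name", "meta/llama-3.3-70b-instruct",
])

# Each --override arrives as a two-element list; convert to tuples for
# the downstream "override tuple" shape the run config expects.
overrides = [tuple(pair) for pair in args.override]
print(overrides)
# -> [('llms.nim_llm.temperature', '0.7'), ('llms.nim_llm.model_name', 'meta/llama-3.3-70b-instruct')]
```

`action="append"` with `nargs=2` is what allows the flag to be repeated, matching how the `aiq run` command's overrides are already invoked.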
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/aiq/eval/evaluate.py | Added apply_overrides to load and validate overridden config options. |
| src/aiq/eval/config.py | Updated the EvaluationRunConfig to include an override parameter. |
| src/aiq/cli/commands/evaluate.py | Extended the CLI command to accept override options and pass them. |
| docs/source/guides/evaluate.md | Documented the new override flag usage with an example. |
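The run-config change in the table above can be sketched as a dataclass carrying the new override tuple. This is a hypothetical illustration of the shape, not the project's actual `EvaluationRunConfig` definition:

```python
from dataclasses import dataclass

# Hypothetical sketch: the run config gains an 'override' field holding
# (dotted_path, value) string pairs collected from the CLI.
@dataclass
class EvaluationRunConfig:
    config_file: str
    override: tuple = ()  # tuple of (path, value) pairs

cfg = EvaluationRunConfig(
    config_file="examples/simple/configs/eval_config.yml",
    override=(("llms.nim_llm.temperature", "0.7"),),
)
print(len(cfg.override))  # -> 1
```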
Comments suppressed due to low confidence (1)
src/aiq/eval/evaluate.py:227
- [nitpick] Consider renaming 'apply_overrides' to 'load_and_apply_overrides' to better reflect that the method both loads the configuration and applies overrides, improving clarity.
def apply_overrides(self):
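The nitpick above concerns a method that loads the config and then applies the overrides. The override-application step itself might look like the following minimal sketch: a hypothetical helper (not the project's implementation) that walks a dotted path into a nested config dict and coerces the string value to the existing entry's type, matching the `<class 'float'>` / `<class 'str'>` coercion visible in the PR's log output:

```python
from typing import Any

def set_by_dotted_path(config: dict, path: str, raw_value: str) -> Any:
    """Hypothetical helper: set config entry at a dotted path, coercing
    raw_value to the type of the existing value when one is present."""
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    current = node.get(leaf)
    value: Any = raw_value
    # Check bool before int: bool is a subclass of int in Python.
    if isinstance(current, bool):
        value = raw_value.lower() in ("true", "1", "yes")
    elif isinstance(current, int):
        value = int(raw_value)
    elif isinstance(current, float):
        value = float(raw_value)
    node[leaf] = value
    return value

# Usage matching the PR's example overrides:
config = {"llms": {"nim_llm": {"temperature": 0.0, "model_name": "old"}}}
set_by_dotted_path(config, "llms.nim_llm.temperature", "0.7")
set_by_dotted_path(config, "llms.nim_llm.model_name",
                   "meta/llama-3.3-70b-instruct")
print(config["llms"]["nim_llm"]["temperature"])  # -> 0.7
```

Values at paths with no existing entry stay strings under this scheme; a real implementation would validate the resulting config against its schema afterwards.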
/ok to test

@AnuradhaKaruppiah, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test 68e2321

Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>

/ok to test fd53dee

/merge
Closes Issue NVIDIA#78

## Changes

Currently AgentIQ allows you to override options in the config file for the `aiq run` command; with this change, the `aiq eval` command accepts the same `--override` options.

cc: @AnuradhaKaruppiah

## Test

```
aiq eval --config_file examples/simple/configs/eval_config.yml \
    --override llms.nim_llm.temperature 0.7 \
    --override llms.nim_llm.model_name meta/llama-3.3-70b-instruct
```

<details>
<summary>Response</summary>

```
2025-04-14 22:59:35,964 - aiq.cli.cli_utils.config_override - INFO - Successfully set override for llms.nim_llm.temperature with value: 0.7 with type <class 'float'>)
2025-04-14 22:59:35,964 - aiq.cli.cli_utils.config_override - INFO - Successfully set override for llms.nim_llm.model_name with value: meta/llama-3.3-70b-instruct with type <class 'str'>)
2025-04-14 22:59:35,968 - aiq.cli.cli_utils.config_override - INFO - Configuration after overrides:
embedders:
  nv-embedqa-e5-v5:
    _type: nim
    model_name: nvidia/nv-embedqa-e5-v5
eval:
  evaluators:
    rag_accuracy:
      _type: ragas
      llm_name: nim_rag_eval_llm
      metric: AnswerAccuracy
    rag_groundedness:
      _type: ragas
      llm_name: nim_rag_eval_llm
      metric: ResponseGroundedness
    rag_relevance:
      _type: ragas
      llm_name: nim_rag_eval_llm
      metric: ContextRelevance
    trajectory_accuracy:
      _type: trajectory
      llm_name: nim_trajectory_eval_llm
  general:
    dataset:
      _type: json
      file_path: examples/simple/data/langsmith.json
    output:
      cleanup: true
      dir: ./.tmp/aiq/examples/simple/
    profiler:
      bottleneck_analysis:
        enable_nested_stack: true
      compute_llm_metrics: true
      concurrency_spike_analysis:
        enable: true
        spike_threshold: 7
      csv_exclude_io_text: true
      prompt_caching_prefixes:
        enable: true
        min_frequency: 0.1
      token_uniqueness_forecast: true
      workflow_runtime_forecast: true
functions:
  current_datetime:
    _type: current_datetime
general:
  use_uvloop: true
llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0.7
  nim_rag_eval_llm:
    _type: nim
    max_tokens: 2
    model_name: meta/llama-3.3-70b-instruct
    temperature: 1.0e-07
    top_p: 0.0001
  nim_trajectory_eval_llm:
    _type: nim
    max_tokens: 1024
    model_name: meta/llama-3.1-70b-instruct
    temperature: 0.0
workflow:
  _type: react_agent
  llm_name: nim_llm
  max_retries: 3
  retry_parsing_errors: true
  tool_names:
    - current_datetime
  verbose: true
2025-04-14 22:59:36,035 - aiq.eval.evaluate - INFO - Starting evaluation run with config file: examples/simple/configs/eval_config.yml
2025-04-14 22:59:36,043 - aiq.eval.evaluate - INFO - Cleaning up output directory .tmp/aiq/examples/simple
2025-04-14 22:59:36,184 - aiq.profiler.decorators - INFO - Langchain callback handler registered
2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Filling the prompt variables "tools" and "tool_names", using the tools provided in the config.
2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Adding the tools' input schema to the tools' description
2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Initialized ReAct Agent Graph
2025-04-14 22:59:36,473 - aiq.agent.react_agent.agent - INFO - ReAct Graph built and compiled successfully
Running workflow:   0%| | 0/3 [00:00<?, ?it/s]2025-04- .......................
The agent's thoughts are:
Thought: Since I don't have the specific tool to search for Langsmith documentation and tutorials, I'll try to provide a general answer based on my knowledge. Langsmith is a platform that allows users to create and test conversational interfaces. To prototype with Langsmith, you can start by creating a new project and defining the conversational flow using their visual interface. You can then add intents, entities, and responses to create a functional conversational interface. Langsmith also provides features like testing and analytics to help you refine your prototype.
Final Answer: To prototype with Langsmith, create a new project, define the conversational flow, add intents, entities, and responses, and use testing and analytics features to refine your prototype.
2025-04-14 22:59:42,047 - aiq.observability.async_otel_listener - INFO - Intermediate step stream completed. No more events will arrive.
Running workflow: 100%|██████████| 3/3 [00:04<00:00,  1.66s/it]
Evaluating Ragas nv_accuracy:   0%| | 0/3 [00:00<?, ?it/s]
2025-04-14 22:59:43,516 - aiq.eval.trajectory_evaluator.evaluate - INFO - Running trajectory evaluation with 3 records
Evaluating Ragas nv_context_relevance: 100%|██████████| 3/3 [00:01<00:00,  1.72it/s]
Evaluating Ragas nv_response_groundedness: 100%|██████████| 3/3 [00:02<00:00,  1.06it/s]
Evaluating Ragas nv_accuracy: 100%|██████████| 3/3 [00:07<00:00,  2.50s/it]
Evaluating Trajectory: 100%|██████████| 3/3 [00:06<00:00,  2.07s/it]
2025-04-14 22:59:49,774 - aiq.profiler.profile_runner - INFO - Wrote combined data to: .tmp/aiq/examples/simple/all_requests_profiler_traces.json
2025-04-14 22:59:49,815 - aiq.profiler.profile_runner - INFO - Wrote merged standardized DataFrame to .tmp/aiq/examples/simple/standardized_data_all.csv
2025-04-14 22:59:49,835 - aiq.profiler.profile_runner - INFO - Wrote inference optimization results to: .tmp/aiq/examples/simple/inference_optimization.json
2025-04-14 22:59:50,271 - aiq.profiler.profile_runner - INFO - Nested stack analysis complete
2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - INFO - Concurrency spike analysis complete
2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - INFO - Wrote workflow profiling report to: .tmp/aiq/examples/simple/workflow_profiling_report.txt
2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - INFO - Wrote workflow profiling metrics to: .tmp/aiq/examples/simple/workflow_profiling_metrics.json
2025-04-14 22:59:50,283 - aiq.eval.evaluate - INFO - Workflow output written to
2025-04-14 22:59:50,283 - aiq.eval.utils.output_uploader - INFO - No S3 config provided; skipping upload.
```

</details>

## By Submitting this PR I confirm:

- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/AgentIQ/blob/develop/docs/source/advanced/contributing.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
- Hritik Raj (https://github.com/Hritik003)
- Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)

Approvers:
- Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)

URL: NVIDIA#129

Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>
Closes Issue NVIDIA#78 ## Changes Currently AgentIQ allows you to override options in the config file for the aiq run command, and now with this change we can similarly run the eval command with the override options. cc: @AnuradhaKaruppiah ## Test ``` aiq eval --config_file examples/simple/configs/eval_config.yml \ --override llms.nim_llm.temperature 0.7 \ --override llms.nim_llm.model_name meta/llama-3.3-70b-instruct ``` <details> <summary>Response</summary> ``` 2025-04-14 22:59:35,964 - aiq.cli.cli_utils.config_override - INFO - Successfully set override for llms.nim_llm.temperature with value: 0.7 with type <class 'float'>) 2025-04-14 22:59:35,964 - aiq.cli.cli_utils.config_override - INFO - Successfully set override for llms.nim_llm.model_name with value: meta/llama-3.3-70b-instruct with type <class 'str'>) 2025-04-14 22:59:35,968 - aiq.cli.cli_utils.config_override - INFO - Configuration after overrides: embedders: nv-embedqa-e5-v5: _type: nim model_name: nvidia/nv-embedqa-e5-v5 eval: evaluators: rag_accuracy: _type: ragas llm_name: nim_rag_eval_llm metric: AnswerAccuracy rag_groundedness: _type: ragas llm_name: nim_rag_eval_llm metric: ResponseGroundedness rag_relevance: _type: ragas llm_name: nim_rag_eval_llm metric: ContextRelevance trajectory_accuracy: _type: trajectory llm_name: nim_trajectory_eval_llm general: dataset: _type: json file_path: examples/simple/data/langsmith.json output: cleanup: true dir: ./.tmp/aiq/examples/simple/ profiler: bottleneck_analysis: enable_nested_stack: true compute_llm_metrics: true concurrency_spike_analysis: enable: true spike_threshold: 7 csv_exclude_io_text: true prompt_caching_prefixes: enable: true min_frequency: 0.1 token_uniqueness_forecast: true workflow_runtime_forecast: true functions: current_datetime: _type: current_datetime general: use_uvloop: true llms: nim_llm: _type: nim model_name: meta/llama-3.3-70b-instruct temperature: 0.7 nim_rag_eval_llm: _type: nim max_tokens: 2 model_name: 
meta/llama-3.3-70b-instruct temperature: 1.0e-07 top_p: 0.0001 nim_trajectory_eval_llm: _type: nim max_tokens: 1024 model_name: meta/llama-3.1-70b-instruct temperature: 0.0 workflow: _type: react_agent llm_name: nim_llm max_retries: 3 retry_parsing_errors: true tool_names: - current_datetime verbose: true 2025-04-14 22:59:36,035 - aiq.eval.evaluate - INFO - Starting evaluation run with config file: examples/simple/configs/eval_config.yml 2025-04-14 22:59:36,043 - aiq.eval.evaluate - INFO - Cleaning up output directory .tmp/aiq/examples/simple 2025-04-14 22:59:36,184 - aiq.profiler.decorators - INFO - Langchain callback handler registered 2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Filling the prompt variables "tools" and "tool_names", using the tools provided in the config. 2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Adding the tools' input schema to the tools' description 2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Initialized ReAct Agent Graph 2025-04-14 22:59:36,473 - aiq.agent.react_agent.agent - INFO - ReAct Graph built and compiled successfully Running workflow: 0%| | 0/3 [00:00<?, ?it/s]2025-04- ....................... The agent's thoughts are: Thought: Since I don't have the specific tool to search for Langsmith documentation and tutorials, I'll try to provide a general answer based on my knowledge. Langsmith is a platform that allows users to create and test conversational interfaces. To prototype with Langsmith, you can start by creating a new project and defining the conversational flow using their visual interface. You can then add intents, entities, and responses to create a functional conversational interface. Langsmith also provides features like testing and analytics to help you refine your prototype. 
Final Answer: To prototype with Langsmith, create a new project, define the conversational flow, add intents, entities, and responses, and use testing and analytics features to refine your prototype. 2025-04-14 22:59:42,047 - aiq.observability.async_otel_listener - INFO - Intermediate step stream completed. No more events will arrive. Running workflow: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.66s/it] Evaluating Ragas nv_accuracy: 0%| | 0/3 [00:00<?, ?it/s2025-04-14 22:59:43,516 - aiq.eval.trajectory_evaluator.evaluate - INFO - Running trajectory evaluation with 3 records | 0/3 [00:00<?, ?it/s] Evaluating Ragas nv_context_relevance: 100%|██████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00, 1.72it/s] Evaluating Ragas nv_response_groundedness: 100%|██████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.06it/s] Evaluating Ragas nv_accuracy: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:07<00:00, 2.50s/it] Evaluating Trajectory: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:06<00:00, 2.07s/it] 2025-04-14 22:59:49,774 - aiq.profiler.profile_runner - INFO - Wrote combined data to: .tmp/aiq/examples/simple/all_requests_profiler_traces.json 2025-04-14 22:59:49,815 - aiq.profiler.profile_runner - INFO - Wrote merged standardized DataFrame to .tmp/aiq/examples/simple/standardized_data_all.csv 2025-04-14 22:59:49,835 - aiq.profiler.profile_runner - INFO - Wrote inference optimization results to: .tmp/aiq/examples/simple/inference_optimization.json 2025-04-14 22:59:50,271 - aiq.profiler.profile_runner - INFO - Nested stack analysis complete 2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - INFO - Concurrency spike analysis complete 2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - 
INFO - Wrote workflow profiling report to: .tmp/aiq/examples/simple/workflow_profiling_report.txt 2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - INFO - Wrote workflow profiling metrics to: .tmp/aiq/examples/simple/workflow_profiling_metrics.json 2025-04-14 22:59:50,283 - aiq.eval.evaluate - INFO - Workflow output written to 2025-04-14 22:59:50,283 - aiq.eval.utils.output_uploader - INFO - No S3 config provided; skipping upload. ``` </details> ## By Submitting this PR I confirm: - I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/AgentIQ/blob/develop/docs/source/advanced/contributing.md). - We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license. - Any contribution which contains commits that are not Signed-Off will not be accepted. - When the PR is ready for review, new or existing tests cover these changes. - When the PR is ready for review, the documentation is up to date with these changes. Authors: - Hritik Raj (https://github.com/Hritik003) - Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah) Approvers: - Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah) URL: NVIDIA#129 Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>
Closes Issue NVIDIA#78 ## Changes Currently AgentIQ allows you to override options in the config file for the aiq run command, and now with this change we can similarly run the eval command with the override options. cc: @AnuradhaKaruppiah ## Test ``` aiq eval --config_file examples/simple/configs/eval_config.yml \ --override llms.nim_llm.temperature 0.7 \ --override llms.nim_llm.model_name meta/llama-3.3-70b-instruct ``` <details> <summary>Response</summary> ``` 2025-04-14 22:59:35,964 - aiq.cli.cli_utils.config_override - INFO - Successfully set override for llms.nim_llm.temperature with value: 0.7 with type <class 'float'>) 2025-04-14 22:59:35,964 - aiq.cli.cli_utils.config_override - INFO - Successfully set override for llms.nim_llm.model_name with value: meta/llama-3.3-70b-instruct with type <class 'str'>) 2025-04-14 22:59:35,968 - aiq.cli.cli_utils.config_override - INFO - Configuration after overrides: embedders: nv-embedqa-e5-v5: _type: nim model_name: nvidia/nv-embedqa-e5-v5 eval: evaluators: rag_accuracy: _type: ragas llm_name: nim_rag_eval_llm metric: AnswerAccuracy rag_groundedness: _type: ragas llm_name: nim_rag_eval_llm metric: ResponseGroundedness rag_relevance: _type: ragas llm_name: nim_rag_eval_llm metric: ContextRelevance trajectory_accuracy: _type: trajectory llm_name: nim_trajectory_eval_llm general: dataset: _type: json file_path: examples/simple/data/langsmith.json output: cleanup: true dir: ./.tmp/aiq/examples/simple/ profiler: bottleneck_analysis: enable_nested_stack: true compute_llm_metrics: true concurrency_spike_analysis: enable: true spike_threshold: 7 csv_exclude_io_text: true prompt_caching_prefixes: enable: true min_frequency: 0.1 token_uniqueness_forecast: true workflow_runtime_forecast: true functions: current_datetime: _type: current_datetime general: use_uvloop: true llms: nim_llm: _type: nim model_name: meta/llama-3.3-70b-instruct temperature: 0.7 nim_rag_eval_llm: _type: nim max_tokens: 2 model_name: 
meta/llama-3.3-70b-instruct temperature: 1.0e-07 top_p: 0.0001 nim_trajectory_eval_llm: _type: nim max_tokens: 1024 model_name: meta/llama-3.1-70b-instruct temperature: 0.0 workflow: _type: react_agent llm_name: nim_llm max_retries: 3 retry_parsing_errors: true tool_names: - current_datetime verbose: true 2025-04-14 22:59:36,035 - aiq.eval.evaluate - INFO - Starting evaluation run with config file: examples/simple/configs/eval_config.yml 2025-04-14 22:59:36,043 - aiq.eval.evaluate - INFO - Cleaning up output directory .tmp/aiq/examples/simple 2025-04-14 22:59:36,184 - aiq.profiler.decorators - INFO - Langchain callback handler registered 2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Filling the prompt variables "tools" and "tool_names", using the tools provided in the config. 2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Adding the tools' input schema to the tools' description 2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Initialized ReAct Agent Graph 2025-04-14 22:59:36,473 - aiq.agent.react_agent.agent - INFO - ReAct Graph built and compiled successfully Running workflow: 0%| | 0/3 [00:00<?, ?it/s]2025-04- ....................... The agent's thoughts are: Thought: Since I don't have the specific tool to search for Langsmith documentation and tutorials, I'll try to provide a general answer based on my knowledge. Langsmith is a platform that allows users to create and test conversational interfaces. To prototype with Langsmith, you can start by creating a new project and defining the conversational flow using their visual interface. You can then add intents, entities, and responses to create a functional conversational interface. Langsmith also provides features like testing and analytics to help you refine your prototype. 
Final Answer: To prototype with Langsmith, create a new project, define the conversational flow, add intents, entities, and responses, and use testing and analytics features to refine your prototype. 2025-04-14 22:59:42,047 - aiq.observability.async_otel_listener - INFO - Intermediate step stream completed. No more events will arrive. Running workflow: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.66s/it] Evaluating Ragas nv_accuracy: 0%| | 0/3 [00:00<?, ?it/s2025-04-14 22:59:43,516 - aiq.eval.trajectory_evaluator.evaluate - INFO - Running trajectory evaluation with 3 records | 0/3 [00:00<?, ?it/s] Evaluating Ragas nv_context_relevance: 100%|██████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00, 1.72it/s] Evaluating Ragas nv_response_groundedness: 100%|██████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.06it/s] Evaluating Ragas nv_accuracy: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:07<00:00, 2.50s/it] Evaluating Trajectory: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:06<00:00, 2.07s/it] 2025-04-14 22:59:49,774 - aiq.profiler.profile_runner - INFO - Wrote combined data to: .tmp/aiq/examples/simple/all_requests_profiler_traces.json 2025-04-14 22:59:49,815 - aiq.profiler.profile_runner - INFO - Wrote merged standardized DataFrame to .tmp/aiq/examples/simple/standardized_data_all.csv 2025-04-14 22:59:49,835 - aiq.profiler.profile_runner - INFO - Wrote inference optimization results to: .tmp/aiq/examples/simple/inference_optimization.json 2025-04-14 22:59:50,271 - aiq.profiler.profile_runner - INFO - Nested stack analysis complete 2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - INFO - Concurrency spike analysis complete 2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - 
INFO - Wrote workflow profiling report to: .tmp/aiq/examples/simple/workflow_profiling_report.txt 2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - INFO - Wrote workflow profiling metrics to: .tmp/aiq/examples/simple/workflow_profiling_metrics.json 2025-04-14 22:59:50,283 - aiq.eval.evaluate - INFO - Workflow output written to 2025-04-14 22:59:50,283 - aiq.eval.utils.output_uploader - INFO - No S3 config provided; skipping upload. ``` </details> ## By Submitting this PR I confirm: - I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/AgentIQ/blob/develop/docs/source/advanced/contributing.md). - We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license. - Any contribution which contains commits that are not Signed-Off will not be accepted. - When the PR is ready for review, new or existing tests cover these changes. - When the PR is ready for review, the documentation is up to date with these changes. Authors: - Hritik Raj (https://github.com/Hritik003) - Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah) Approvers: - Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah) URL: NVIDIA#129 Signed-off-by: Yuchen Zhang <134643420+yczhang-nv@users.noreply.github.com>
Closes Issue NVIDIA#78 ## Changes Currently AgentIQ allows you to override options in the config file for the aiq run command, and now with this change we can similarly run the eval command with the override options. cc: @AnuradhaKaruppiah ## Test ``` aiq eval --config_file examples/simple/configs/eval_config.yml \ --override llms.nim_llm.temperature 0.7 \ --override llms.nim_llm.model_name meta/llama-3.3-70b-instruct ``` <details> <summary>Response</summary> ``` 2025-04-14 22:59:35,964 - aiq.cli.cli_utils.config_override - INFO - Successfully set override for llms.nim_llm.temperature with value: 0.7 with type <class 'float'>) 2025-04-14 22:59:35,964 - aiq.cli.cli_utils.config_override - INFO - Successfully set override for llms.nim_llm.model_name with value: meta/llama-3.3-70b-instruct with type <class 'str'>) 2025-04-14 22:59:35,968 - aiq.cli.cli_utils.config_override - INFO - Configuration after overrides: embedders: nv-embedqa-e5-v5: _type: nim model_name: nvidia/nv-embedqa-e5-v5 eval: evaluators: rag_accuracy: _type: ragas llm_name: nim_rag_eval_llm metric: AnswerAccuracy rag_groundedness: _type: ragas llm_name: nim_rag_eval_llm metric: ResponseGroundedness rag_relevance: _type: ragas llm_name: nim_rag_eval_llm metric: ContextRelevance trajectory_accuracy: _type: trajectory llm_name: nim_trajectory_eval_llm general: dataset: _type: json file_path: examples/simple/data/langsmith.json output: cleanup: true dir: ./.tmp/aiq/examples/simple/ profiler: bottleneck_analysis: enable_nested_stack: true compute_llm_metrics: true concurrency_spike_analysis: enable: true spike_threshold: 7 csv_exclude_io_text: true prompt_caching_prefixes: enable: true min_frequency: 0.1 token_uniqueness_forecast: true workflow_runtime_forecast: true functions: current_datetime: _type: current_datetime general: use_uvloop: true llms: nim_llm: _type: nim model_name: meta/llama-3.3-70b-instruct temperature: 0.7 nim_rag_eval_llm: _type: nim max_tokens: 2 model_name: 
meta/llama-3.3-70b-instruct temperature: 1.0e-07 top_p: 0.0001 nim_trajectory_eval_llm: _type: nim max_tokens: 1024 model_name: meta/llama-3.1-70b-instruct temperature: 0.0 workflow: _type: react_agent llm_name: nim_llm max_retries: 3 retry_parsing_errors: true tool_names: - current_datetime verbose: true 2025-04-14 22:59:36,035 - aiq.eval.evaluate - INFO - Starting evaluation run with config file: examples/simple/configs/eval_config.yml 2025-04-14 22:59:36,043 - aiq.eval.evaluate - INFO - Cleaning up output directory .tmp/aiq/examples/simple 2025-04-14 22:59:36,184 - aiq.profiler.decorators - INFO - Langchain callback handler registered 2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Filling the prompt variables "tools" and "tool_names", using the tools provided in the config. 2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Adding the tools' input schema to the tools' description 2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Initialized ReAct Agent Graph 2025-04-14 22:59:36,473 - aiq.agent.react_agent.agent - INFO - ReAct Graph built and compiled successfully Running workflow: 0%| | 0/3 [00:00<?, ?it/s]2025-04- ....................... The agent's thoughts are: Thought: Since I don't have the specific tool to search for Langsmith documentation and tutorials, I'll try to provide a general answer based on my knowledge. Langsmith is a platform that allows users to create and test conversational interfaces. To prototype with Langsmith, you can start by creating a new project and defining the conversational flow using their visual interface. You can then add intents, entities, and responses to create a functional conversational interface. Langsmith also provides features like testing and analytics to help you refine your prototype. 
Final Answer: To prototype with Langsmith, create a new project, define the conversational flow, add intents, entities, and responses, and use testing and analytics features to refine your prototype. 2025-04-14 22:59:42,047 - aiq.observability.async_otel_listener - INFO - Intermediate step stream completed. No more events will arrive. Running workflow: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.66s/it] Evaluating Ragas nv_accuracy: 0%| | 0/3 [00:00<?, ?it/s2025-04-14 22:59:43,516 - aiq.eval.trajectory_evaluator.evaluate - INFO - Running trajectory evaluation with 3 records | 0/3 [00:00<?, ?it/s] Evaluating Ragas nv_context_relevance: 100%|██████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00, 1.72it/s] Evaluating Ragas nv_response_groundedness: 100%|██████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.06it/s] Evaluating Ragas nv_accuracy: 100%|███████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:07<00:00, 2.50s/it] Evaluating Trajectory: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:06<00:00, 2.07s/it] 2025-04-14 22:59:49,774 - aiq.profiler.profile_runner - INFO - Wrote combined data to: .tmp/aiq/examples/simple/all_requests_profiler_traces.json 2025-04-14 22:59:49,815 - aiq.profiler.profile_runner - INFO - Wrote merged standardized DataFrame to .tmp/aiq/examples/simple/standardized_data_all.csv 2025-04-14 22:59:49,835 - aiq.profiler.profile_runner - INFO - Wrote inference optimization results to: .tmp/aiq/examples/simple/inference_optimization.json 2025-04-14 22:59:50,271 - aiq.profiler.profile_runner - INFO - Nested stack analysis complete 2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - INFO - Concurrency spike analysis complete 2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - 
INFO - Wrote workflow profiling report to: .tmp/aiq/examples/simple/workflow_profiling_report.txt 2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - INFO - Wrote workflow profiling metrics to: .tmp/aiq/examples/simple/workflow_profiling_metrics.json 2025-04-14 22:59:50,283 - aiq.eval.evaluate - INFO - Workflow output written to 2025-04-14 22:59:50,283 - aiq.eval.utils.output_uploader - INFO - No S3 config provided; skipping upload. ``` </details> ## By Submitting this PR I confirm: - I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/AgentIQ/blob/develop/docs/source/advanced/contributing.md). - We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license. - Any contribution which contains commits that are not Signed-Off will not be accepted. - When the PR is ready for review, new or existing tests cover these changes. - When the PR is ready for review, the documentation is up to date with these changes. Authors: - Hritik Raj (https://github.com/Hritik003) - Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah) Approvers: - Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah) URL: NVIDIA#129
Closes Issue NVIDIA#78

## Changes

Currently AgentIQ allows you to override options in the config file for the `aiq run` command; with this change, the `aiq eval` command supports the same override options.

cc: @AnuradhaKaruppiah

## Test

```
aiq eval --config_file examples/simple/configs/eval_config.yml \
  --override llms.nim_llm.temperature 0.7 \
  --override llms.nim_llm.model_name meta/llama-3.3-70b-instruct
```

<details>
<summary>Response</summary>

```
2025-04-14 22:59:35,964 - aiq.cli.cli_utils.config_override - INFO - Successfully set override for llms.nim_llm.temperature with value: 0.7 with type <class 'float'>)
2025-04-14 22:59:35,964 - aiq.cli.cli_utils.config_override - INFO - Successfully set override for llms.nim_llm.model_name with value: meta/llama-3.3-70b-instruct with type <class 'str'>)
2025-04-14 22:59:35,968 - aiq.cli.cli_utils.config_override - INFO - Configuration after overrides:
embedders:
  nv-embedqa-e5-v5:
    _type: nim
    model_name: nvidia/nv-embedqa-e5-v5
eval:
  evaluators:
    rag_accuracy:
      _type: ragas
      llm_name: nim_rag_eval_llm
      metric: AnswerAccuracy
    rag_groundedness:
      _type: ragas
      llm_name: nim_rag_eval_llm
      metric: ResponseGroundedness
    rag_relevance:
      _type: ragas
      llm_name: nim_rag_eval_llm
      metric: ContextRelevance
    trajectory_accuracy:
      _type: trajectory
      llm_name: nim_trajectory_eval_llm
  general:
    dataset:
      _type: json
      file_path: examples/simple/data/langsmith.json
    output:
      cleanup: true
      dir: ./.tmp/aiq/examples/simple/
    profiler:
      bottleneck_analysis:
        enable_nested_stack: true
      compute_llm_metrics: true
      concurrency_spike_analysis:
        enable: true
        spike_threshold: 7
      csv_exclude_io_text: true
      prompt_caching_prefixes:
        enable: true
        min_frequency: 0.1
      token_uniqueness_forecast: true
      workflow_runtime_forecast: true
functions:
  current_datetime:
    _type: current_datetime
general:
  use_uvloop: true
llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0.7
  nim_rag_eval_llm:
    _type: nim
    max_tokens: 2
    model_name: meta/llama-3.3-70b-instruct
    temperature: 1.0e-07
    top_p: 0.0001
  nim_trajectory_eval_llm:
    _type: nim
    max_tokens: 1024
    model_name: meta/llama-3.1-70b-instruct
    temperature: 0.0
workflow:
  _type: react_agent
  llm_name: nim_llm
  max_retries: 3
  retry_parsing_errors: true
  tool_names:
  - current_datetime
  verbose: true
2025-04-14 22:59:36,035 - aiq.eval.evaluate - INFO - Starting evaluation run with config file: examples/simple/configs/eval_config.yml
2025-04-14 22:59:36,043 - aiq.eval.evaluate - INFO - Cleaning up output directory .tmp/aiq/examples/simple
2025-04-14 22:59:36,184 - aiq.profiler.decorators - INFO - Langchain callback handler registered
2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Filling the prompt variables "tools" and "tool_names", using the tools provided in the config.
2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Adding the tools' input schema to the tools' description
2025-04-14 22:59:36,470 - aiq.agent.react_agent.agent - INFO - Initialized ReAct Agent Graph
2025-04-14 22:59:36,473 - aiq.agent.react_agent.agent - INFO - ReAct Graph built and compiled successfully
Running workflow:   0%|          | 0/3 [00:00<?, ?it/s]
.......................
The agent's thoughts are:
Thought: Since I don't have the specific tool to search for Langsmith documentation and tutorials, I'll try to provide a general answer based on my knowledge. Langsmith is a platform that allows users to create and test conversational interfaces. To prototype with Langsmith, you can start by creating a new project and defining the conversational flow using their visual interface. You can then add intents, entities, and responses to create a functional conversational interface. Langsmith also provides features like testing and analytics to help you refine your prototype.
Final Answer: To prototype with Langsmith, create a new project, define the conversational flow, add intents, entities, and responses, and use testing and analytics features to refine your prototype.
2025-04-14 22:59:42,047 - aiq.observability.async_otel_listener - INFO - Intermediate step stream completed. No more events will arrive.
Running workflow: 100%|██████████| 3/3 [00:04<00:00,  1.66s/it]
Evaluating Ragas nv_accuracy:   0%|          | 0/3 [00:00<?, ?it/s]
2025-04-14 22:59:43,516 - aiq.eval.trajectory_evaluator.evaluate - INFO - Running trajectory evaluation with 3 records
Evaluating Ragas nv_context_relevance: 100%|██████████| 3/3 [00:01<00:00,  1.72it/s]
Evaluating Ragas nv_response_groundedness: 100%|██████████| 3/3 [00:02<00:00,  1.06it/s]
Evaluating Ragas nv_accuracy: 100%|██████████| 3/3 [00:07<00:00,  2.50s/it]
Evaluating Trajectory: 100%|██████████| 3/3 [00:06<00:00,  2.07s/it]
2025-04-14 22:59:49,774 - aiq.profiler.profile_runner - INFO - Wrote combined data to: .tmp/aiq/examples/simple/all_requests_profiler_traces.json
2025-04-14 22:59:49,815 - aiq.profiler.profile_runner - INFO - Wrote merged standardized DataFrame to .tmp/aiq/examples/simple/standardized_data_all.csv
2025-04-14 22:59:49,835 - aiq.profiler.profile_runner - INFO - Wrote inference optimization results to: .tmp/aiq/examples/simple/inference_optimization.json
2025-04-14 22:59:50,271 - aiq.profiler.profile_runner - INFO - Nested stack analysis complete
2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - INFO - Concurrency spike analysis complete
2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - INFO - Wrote workflow profiling report to: .tmp/aiq/examples/simple/workflow_profiling_report.txt
2025-04-14 22:59:50,281 - aiq.profiler.profile_runner - INFO - Wrote workflow profiling metrics to: .tmp/aiq/examples/simple/workflow_profiling_metrics.json
2025-04-14 22:59:50,283 - aiq.eval.evaluate - INFO - Workflow output written to
2025-04-14 22:59:50,283 - aiq.eval.utils.output_uploader - INFO - No S3 config provided; skipping upload.
```

</details>

## By Submitting this PR I confirm:

- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/AgentIQ/blob/develop/docs/source/advanced/contributing.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
- Hritik Raj (https://github.com/Hritik003)
- Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)

Approvers:
- Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)

URL: NVIDIA#129
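For readers unfamiliar with the feature: each `--override` takes a dot-delimited path into the YAML config plus a raw string value, and the value's type is inferred before it is set (note the `<class 'float'>` and `<class 'str'>` lines in the log above). The actual implementation lives in `aiq.cli.cli_utils.config_override` and is not shown here; the following is only a hypothetical sketch of the general mechanism, with made-up helper and variable names:

```python
from typing import Any


def apply_override(config: dict, path: str, raw_value: str) -> None:
    """Set a dot-delimited key in a nested config dict, inferring the value type.

    Illustrative only -- not AgentIQ's actual override implementation.
    """
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        # Walk (or create) the intermediate mappings along the dot path.
        node = node.setdefault(key, {})
    # Infer a concrete type from the raw CLI string, falling back to str.
    value: Any = raw_value
    for cast in (int, float):
        try:
            value = cast(raw_value)
            break
        except ValueError:
            continue
    node[leaf] = value


# Mirror the overrides from the test command above.
config = {"llms": {"nim_llm": {"_type": "nim", "temperature": 0.0}}}
apply_override(config, "llms.nim_llm.temperature", "0.7")
apply_override(config, "llms.nim_llm.model_name", "meta/llama-3.3-70b-instruct")
print(config["llms"]["nim_llm"]["temperature"])  # 0.7 (a float, not the string "0.7")
```

The real CLI additionally validates the resulting configuration after applying overrides, so a typo in the dot path or an invalid value fails fast rather than silently producing a misconfigured run.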