Adjust nginx agent backoff settings and revert request timeout by bjee19 · Pull Request #3820 · nginx/nginx-gateway-fabric · GitHub

Conversation

@bjee19
Contributor

@bjee19 bjee19 commented Aug 29, 2025

Proposed changes

Problem: As a result of NGINX Agent adding a wait for nginx workers to reload before updating nginx configuration, any long-lived connections would result in an excessively long wait between nginx configuration updates. This causes delays in NGF and, on NGINX Plus, could cause dropped traffic if there was a wait between updating the configuration and updating the NGINX Plus upstreams.

Solution: Adjust the NGINX Agent reload configuration to lower the maximum wait time. Reverted the test request timeout duration back to 10 seconds.

Testing: Ran the SnippetsFilter functional test a couple of times and verified that when a long-lived connection was blocking one of the test cases, the wait went from a 27-second timeout to 3 seconds. Also deployed nginx and verified that the agent configuration was present.

Closes #3801

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.


@github-actions github-actions bot added bug Something isn't working tests Pull requests that update tests labels Aug 29, 2025
@codecov

codecov bot commented Aug 29, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.99%. Comparing base (13d2288) to head (7c4d029).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3820   +/-   ##
=======================================
  Coverage   86.99%   86.99%           
=======================================
  Files         128      128           
  Lines       16412    16412           
  Branches       62       62           
=======================================
  Hits        14278    14278           
- Misses       1958     1959    +1     
+ Partials      176      175    -1     


@bjee19 bjee19 force-pushed the bug/nginx-worker-delayed-shutdown branch from c5ec91f to da8852d Compare September 2, 2025 18:23
@bjee19 bjee19 marked this pull request as ready for review September 2, 2025 18:23
@bjee19 bjee19 requested a review from a team as a code owner September 2, 2025 18:23
@bjee19 bjee19 changed the title DRAFT: Adjust nginx agent backoff settings and revert request timeout Adjust nginx agent backoff settings and revert request timeout Sep 2, 2025
@bjee19
Contributor Author

bjee19 commented Sep 2, 2025

Regarding these settings:

initial_interval: .5s
max_interval: 1.5s
max_elapsed_time: 3s
randomization_factor: 0.5
multiplier: 1.5

You can read about them here: https://pkg.go.dev/github.com/cenkalti/backoff/v4#section-readme.

But from my understanding:

initial_interval is the sleep duration before the first retry.
max_interval is the maximum sleep duration between retries.
max_elapsed_time is the maximum total time it will wait before giving up on the backoff.
randomization_factor adds jitter so each sleep isn't the same.
multiplier is how the "base" interval grows, i.e. how much the sleep interval increases after each retry.

Also, all the settings are left at their defaults except for max_interval and max_elapsed_time. I wanted the maximum wait time to be around 2-3 seconds, since most nginx workers should restart in under 1 second, meaning only reloads held up by long-lived connections would take longer than 2-3 seconds.

This wait is used here: https://github.com/nginx/agent/blob/main/internal/resource/nginx_instance_operator.go#L144

@bjee19 bjee19 enabled auto-merge (squash) September 2, 2025 21:06
@bjee19 bjee19 merged commit 918b9d4 into main Sep 2, 2025
44 checks passed
@bjee19 bjee19 deleted the bug/nginx-worker-delayed-shutdown branch September 2, 2025 21:22
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in NGINX Gateway Fabric Sep 2, 2025
bjee19 added a commit that referenced this pull request Sep 2, 2025
bjee19 added a commit that referenced this pull request Sep 2, 2025
#3828)

Cherry-pick #3820


Labels

bug Something isn't working tests Pull requests that update tests

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

NGINX Plus worker process takes extended time to shut down.

4 participants