-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Support http server port reuse #4616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@psybuzz @stephanwlee @nfelt Please review it. |
|
Sorry about delayed review. The team is discussing the PR and is coming up with a response. We will definitely get back to you by next week. Thanks for understanding and your patience. |
Thanks. Looking forward reply ~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @zuston! Thanks for the pull request. The feature that you’re looking
for seems reasonable enough, but we’re hesitant to add more networking
configuration to TensorBoard proper, especially platform-specific code.
Our Python API enables you to provide a custom server class: you can use
the tensorboard.program.TensorBoard class and pass a server_class
argument to the constructor. The program.TensorBoardServer class
specifies the interface that it needs to implement: basically, your
class is given a WSGI app and has complete control over the server.
Have you tried this? If not, could you give it a shot and let us know if
it works for your use case?
|
Thanks reply @wchargin. Actually, some similiar changes are also on TensorFlow. like this tensorflow/tensorflow#38705. Do you have some ideas on it? Thanks again. |
|
@zuston could you elaborate a little more about what constraint requires that you open the port in the parent process? Do you have a link to some technical documentation about ZooKeeper or Yarn that explains how this works? Without knowing more about your situation, it would seem more straightforward to just take whatever mechanism assigned the port number to the parent process and instead pass it directly through to TensorBoard. |
|
Thanks reply @nfelt . We use TensorFlow for model training on Hadoop Yarn, but Yarn does not natively support TensorFlow. Therefore, the TonY framework is used, which is a framework to natively run deep learning jobs on Apache Hadoop. It currently supports TensorFlow, PyTorch, MXNet and Horovod. Simply speaking, TonY's implementation principle is to obtain the resources needed for training from Hadoop Yarn, such as machine nodes, memory, and cpu. After obtaining the machine resource (TonY task executor), TonY will obtain the tensorflow grpc port and tensorboard startup port (only in tensorflow chief role) required by the role in the TonY task executor. But in this way, the tensorboard in the user code needs to specify the pre-allocated port of TonY when it is started. How to get tensorboard port in code? It can use the environment variable of TonY, see the example for details. As we all know that the distributed TensorFlow training needs to specify the tf_config environment configuration for each role. Therefore, when the above operations are completed, the tf_config of each role of TensorFlow is determined. Therefore, tf_config is dynamically determined by TonY and injected into the runtime environment of each node by environment variables. I think this mechanism is similar to kubeflow tf-operator (which provides a Kubernetes custom resource that makes it easy to run distributed or non-distributed TensorFlow jobs on Kubernetes.). Back to this pull request, why do you need port reuse? Because after TonY's task executor determines an available port, it will release the port before starting the user's training code. Task executor will be subject to port conflict where port could be grabbed by other processes on the host before task process launches and after ports are closed. reference: ping @wchargin |
|
Any update on it? @nfelt @wchargin @stephanwlee |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the additional context about why you need the feature. We're willing to accept this PR but in addition to the comments left inline, could you also check that this behaves as expected on Windows? (I don't think Windows has the SO_REUSEPORT option so we'll need code to handle that gracefully, which should be tested.)
Also, the tests are failing; you'll need to add several lines for reuse_port here:
tensorboard/tensorboard/plugins/core/core_plugin_test.py
Lines 46 to 71 in 1597e6d
| class FakeFlags(object): | |
| def __init__( | |
| self, | |
| bind_all=False, | |
| host=None, | |
| inspect=False, | |
| version_tb=False, | |
| logdir="", | |
| logdir_spec="", | |
| event_file="", | |
| db="", | |
| path_prefix="", | |
| generic_data="true", | |
| grpc_data_provider="", | |
| ): | |
| self.bind_all = bind_all | |
| self.host = host | |
| self.inspect = inspect | |
| self.version_tb = version_tb | |
| self.logdir = logdir | |
| self.logdir_spec = logdir_spec | |
| self.event_file = event_file | |
| self.db = db | |
| self.path_prefix = path_prefix | |
| self.generic_data = generic_data | |
| self.grpc_data_provider = grpc_data_provider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate slightly here? E.g. something like
Enables the
SO_REUSEPORToption on the socket opened by TensorBoard's HTTP server, for platforms that support it. This is useful in cases when a parent process has obtained the port already and wants to delegate access to the port to TensorBoard as a subprocess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
tensorboard/program.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave out this comment; it doesn't really add any more information than the code already has.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
tensorboard/program.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code should start below the docstring for the method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
tensorboard/program.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check for the existence of the SO_REUSEPORT attribute before trying to access it, because it's not available on all platforms; see https://stackoverflow.com/questions/14388706/how-do-so-reuseaddr-and-so-reuseport-differ. See examples below using hasattr for how we do this for other socket options. If the option isn't available but the flag was passed anyway, let's raise a TensorBoardServerException saying something like "TensorBoard --reuse_port option is not supported on this platform".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tensorboard/program.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the code you're proposing to add, this method now does more than just IPV4 mapping, so the docstring should be updated. Can you add a new docstring that's more general, something like """Override to set custom options on the socket."""?
And then let's take the existing comment and change it to an inline comment (#-style), like
# Enable IPV4 mapping for IPV6 sockets when desired.
# The main use case for this is so that when no host is specified,
# TensorBoard can listen on all interfaces for both IPv4 and IPv6
# connections, rather than having to choose v4 or v6 and hope the
# browser didn't choose the other one.
socket_is_v6 = ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
With my colleague’s Windows PC, I have checked that it will throw exception as expected. @nfelt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution, will merge. Just FYI, most likely this won't be available in stable TensorBoard until our next release (2.5.0) which may be a month or two away, but it should be in nightly releases after tomorrow.
|
Thank you for your patience~. @nfelt |
ReusePort scenario:
parent process occupies the port, then share the port through service such as ZooKeeper and TonY (TensorFlow on Yarn), and then child process (TensorBoard process) reuse the port.
Technical description of changes
Screenshots of UI changes
Detailed steps to verify changes work correctly (as executed by you)
Alternate designs / implementations considered