Support http server port reuse by zuston · Pull Request #4616 · tensorflow/tensorboard · GitHub

Conversation

@zuston
Contributor

@zuston zuston commented Jan 27, 2021

  • Motivation for features / changes

ReusePort scenario:
The parent process occupies the port and shares it through a service such as ZooKeeper or TonY (TensorFlow on Yarn); the child process (the TensorBoard process) then reuses the port.

  • Technical description of changes

  • Screenshots of UI changes

  • Detailed steps to verify changes work correctly (as executed by you)

  • Alternate designs / implementations considered

@google-cla google-cla bot added the cla: yes label Jan 27, 2021
@zuston
Contributor Author

zuston commented Jan 27, 2021

@psybuzz @stephanwlee @nfelt Please review it.

@stephanwlee
Contributor

Sorry about delayed review. The team is discussing the PR and is coming up with a response. We will definitely get back to you by next week. Thanks for understanding and your patience.

@zuston
Contributor Author

zuston commented Jan 29, 2021

Sorry about delayed review. The team is discussing the PR and is coming up with a response. We will definitely get back to you by next week. Thanks for understanding and your patience.

Thanks. Looking forward to your reply ~

@rmothukuru rmothukuru linked an issue Jan 29, 2021 that may be closed by this pull request
Contributor

@wchargin wchargin left a comment

Hi @zuston! Thanks for the pull request. The feature that you’re looking
for seems reasonable enough, but we’re hesitant to add more networking
configuration to TensorBoard proper, especially platform-specific code.

Our Python API enables you to provide a custom server class: you can use
the tensorboard.program.TensorBoard class and pass a server_class
argument to the constructor. The program.TensorBoardServer class
specifies the interface that it needs to implement: basically, your
class is given a WSGI app and has complete control over the server.

Have you tried this? If not, could you give it a shot and let us know if
it works for your use case?
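
A rough sketch of what wchargin suggests might look like the following (the `ReusePortWSGIServer` name and the wiring in the comments are illustrative, not TensorBoard's actual code; it assumes the `TensorBoardServer` interface exposes `serve_forever` and `get_url` as described):

```python
import socket
from wsgiref.simple_server import WSGIServer, make_server


class ReusePortWSGIServer(WSGIServer):
    """WSGI server that sets SO_REUSEPORT (where available) before binding."""

    def server_bind(self):
        # Set the option before the base class performs the actual bind().
        if hasattr(socket, "SO_REUSEPORT"):
            self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        WSGIServer.server_bind(self)


# With TensorBoard installed, this could be wrapped into a server_class,
# roughly (untested sketch):
#
#   from tensorboard import program
#
#   class ReusePortServer(program.TensorBoardServer):
#       def __init__(self, wsgi_app, flags):
#           self._server = make_server(
#               flags.host or "localhost", flags.port, wsgi_app,
#               server_class=ReusePortWSGIServer)
#       def serve_forever(self):
#           self._server.serve_forever()
#       def get_url(self):
#           host, port = self._server.server_address[:2]
#           return "http://%s:%d/" % (host, port)
#
#   program.TensorBoard(server_class=ReusePortServer)
```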

@zuston
Contributor Author

zuston commented Feb 1, 2021

Thanks for the reply, @wchargin.
It may not work. We provide users with a platform (on Hadoop Yarn) to deploy distributed TensorFlow, so it is hard to rewrite user code to extend the implementation of a TensorBoard server.

Actually, some similar changes have also been made in TensorFlow, like tensorflow/tensorflow#38705.

Do you have some ideas on it? Thanks again.

@nfelt
Contributor

nfelt commented Feb 2, 2021

@zuston could you elaborate a little more about what constraint requires that you open the port in the parent process? Do you have a link to some technical documentation about ZooKeeper or Yarn that explains how this works? Without knowing more about your situation, it would seem more straightforward to just take whatever mechanism assigned the port number to the parent process and instead pass it directly through to TensorBoard.

@zuston
Contributor Author

zuston commented Feb 2, 2021

Thanks for the reply, @nfelt.

We use TensorFlow for model training on Hadoop Yarn, but Yarn does not natively support TensorFlow. Therefore, the TonY framework is used, which is a framework to natively run deep learning jobs on Apache Hadoop. It currently supports TensorFlow, PyTorch, MXNet and Horovod.

Simply put, TonY works by obtaining the resources needed for training from Hadoop Yarn, such as machine nodes, memory, and CPU. After obtaining a machine resource (a TonY task executor), TonY obtains, inside the task executor, the TensorFlow gRPC port and the TensorBoard startup port (only in the TensorFlow chief role) required by the role. But this means the TensorBoard started in user code needs to be given TonY's pre-allocated port. How does the code get the TensorBoard port? It can use TonY's environment variables; see the example for details.

As is well known, distributed TensorFlow training needs the TF_CONFIG environment configuration specified for each role. Once the above steps are complete, the TF_CONFIG of each TensorFlow role is determined; TonY determines TF_CONFIG dynamically and injects it into each node's runtime environment via environment variables. I think this mechanism is similar to the Kubeflow tf-operator (which provides a Kubernetes custom resource that makes it easy to run distributed or non-distributed TensorFlow jobs on Kubernetes).

Back to this pull request: why is port reuse needed? After TonY's task executor finds an available port, it releases the port before starting the user's training code. This leaves a race: another process on the host could grab the port after it is closed but before the task process launches.
A similar problem exists not only for TensorBoard on TonY, but also for TensorFlow gRPC service startup. TonY has solved that case through TensorFlow port reuse; see here for details.
So I think it is necessary to solve the TensorBoard port conflict problem in a similar way.

reference:
https://arxiv.org/abs/1904.01631 (TonY paper)
tony-framework/TonY#456 (TonY related issue)

ping @wchargin
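
The race described above, and how SO_REUSEPORT sidesteps it, can be illustrated with plain sockets (a POSIX-only sketch; `bind_reuseport` is an illustrative helper, not TonY or TensorBoard code):

```python
import socket


def bind_reuseport(port):
    """Bind a TCP socket with SO_REUSEPORT set (POSIX platforms only)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("0.0.0.0", port))
    return s


# The parent (e.g. the TonY task executor) allocates a port and keeps
# holding it open...
parent = bind_reuseport(0)
port = parent.getsockname()[1]

# ...so the child (TensorBoard) can bind the very same port while the
# parent still holds it. There is no close-then-rebind window in which
# another process could grab the port.
child = bind_reuseport(port)
assert child.getsockname()[1] == port
```

Without SO_REUSEPORT, the parent would have to close the socket first, opening exactly the window in which another process can steal the port.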

@zuston
Contributor Author

zuston commented Feb 4, 2021

Any update on it? @nfelt @wchargin @stephanwlee

Contributor

@nfelt nfelt left a comment

Thanks for the additional context about why you need the feature. We're willing to accept this PR but in addition to the comments left inline, could you also check that this behaves as expected on Windows? (I don't think Windows has the SO_REUSEPORT option so we'll need code to handle that gracefully, which should be tested.)

Also, the tests are failing; you'll need to add several lines for reuse_port here:

class FakeFlags(object):
    def __init__(
        self,
        bind_all=False,
        host=None,
        inspect=False,
        version_tb=False,
        logdir="",
        logdir_spec="",
        event_file="",
        db="",
        path_prefix="",
        generic_data="true",
        grpc_data_provider="",
    ):
        self.bind_all = bind_all
        self.host = host
        self.inspect = inspect
        self.version_tb = version_tb
        self.logdir = logdir
        self.logdir_spec = logdir_spec
        self.event_file = event_file
        self.db = db
        self.path_prefix = path_prefix
        self.generic_data = generic_data
        self.grpc_data_provider = grpc_data_provider
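
The addition nfelt asks for would presumably look like this (a sketch; the `False` default mirrors the assumed default of the real `--reuse_port` flag):

```python
# FakeFlags extended with a reuse_port field, as requested in the review.
class FakeFlags(object):
    def __init__(
        self,
        bind_all=False,
        host=None,
        inspect=False,
        version_tb=False,
        logdir="",
        logdir_spec="",
        event_file="",
        db="",
        path_prefix="",
        generic_data="true",
        grpc_data_provider="",
        reuse_port=False,  # assumed default of the new flag
    ):
        self.bind_all = bind_all
        self.host = host
        self.inspect = inspect
        self.version_tb = version_tb
        self.logdir = logdir
        self.logdir_spec = logdir_spec
        self.event_file = event_file
        self.db = db
        self.path_prefix = path_prefix
        self.generic_data = generic_data
        self.grpc_data_provider = grpc_data_provider
        self.reuse_port = reuse_port
```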

Contributor

Could you elaborate slightly here? E.g. something like

Enables the SO_REUSEPORT option on the socket opened by TensorBoard's HTTP server, for platforms that support it. This is useful in cases when a parent process has obtained the port already and wants to delegate access to the port to TensorBoard as a subprocess.

Contributor Author

Done

Contributor

Let's leave out this comment; it doesn't really add any more information than the code already has.

Contributor Author

Done

Contributor

The code should start below the docstring for the method.

Contributor Author

Done

Contributor

We should check for the existence of the SO_REUSEPORT attribute before trying to access it, because it's not available on all platforms; see https://stackoverflow.com/questions/14388706/how-do-so-reuseaddr-and-so-reuseport-differ. See examples below using hasattr for how we do this for other socket options. If the option isn't available but the flag was passed anyway, let's raise a TensorBoardServerException saying something like "TensorBoard --reuse_port option is not supported on this platform".
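
The guard asked for here might look roughly like this (the function name is illustrative, and `RuntimeError` stands in for the `TensorBoardServerException` the real code would raise):

```python
import socket


def maybe_set_reuse_port(sock, reuse_port):
    """Set SO_REUSEPORT if requested, failing loudly where unsupported."""
    if not reuse_port:
        return
    if not hasattr(socket, "SO_REUSEPORT"):
        # The real TensorBoard code would raise TensorBoardServerException.
        raise RuntimeError(
            "TensorBoard --reuse_port option is not supported on this platform"
        )
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
```

The `hasattr` check matters because Python only exposes `socket.SO_REUSEPORT` on platforms whose C headers define it (notably not Windows), so an unconditional access would raise `AttributeError` rather than a helpful error message.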

Contributor Author

Done.

Contributor

With the code you're proposing to add, this method now does more than just IPV4 mapping, so the docstring should be updated. Can you add a new docstring that's more general, something like """Override to set custom options on the socket."""?

And then let's take the existing comment and change it to an inline comment (#-style), like

# Enable IPV4 mapping for IPV6 sockets when desired.
# The main use case for this is so that when no host is specified,
# TensorBoard can listen on all interfaces for both IPv4 and IPv6
# connections, rather than having to choose v4 or v6 and hope the
# browser didn't choose the other one.
socket_is_v6 = ...

Contributor Author

Done

@zuston zuston requested a review from wchargin February 4, 2021 06:19
@zuston
Contributor Author

zuston commented Feb 4, 2021

Thanks for the additional context about why you need the feature. We're willing to accept this PR but in addition to the comments left inline, could you also check that this behaves as expected on Windows? (I don't think Windows has the SO_REUSEPORT option so we'll need code to handle that gracefully, which should be tested.)

On my colleague's Windows PC, I checked that it throws an exception as expected. @nfelt

@zuston
Contributor Author

zuston commented Feb 6, 2021

Please review it. Thanks~ @nfelt @wchargin

Contributor

@nfelt nfelt left a comment

Thanks for the contribution, will merge. Just FYI, most likely this won't be available in stable TensorBoard until our next release (2.5.0) which may be a month or two away, but it should be in nightly releases after tomorrow.

@nfelt nfelt merged commit e93706f into tensorflow:master Feb 6, 2021
@zuston
Contributor Author

zuston commented Feb 7, 2021

Thank you for your patience~. @nfelt



Successfully merging this pull request may close these issues.

Support tensorboard server port reuse

4 participants