Support http server port reuse by zuston · Pull Request #4616 · tensorflow/tensorboard · GitHub

Conversation

@zuston
Contributor

@zuston zuston commented Jan 27, 2021

  • Motivation for features / changes

ReusePort scenario:
The parent process occupies the port and shares it through a service such as ZooKeeper or TonY (TensorFlow on Yarn); the child process (the TensorBoard process) then reuses the port.

  • Technical description of changes

  • Screenshots of UI changes

  • Detailed steps to verify changes work correctly (as executed by you)

  • Alternate designs / implementations considered

@google-cla google-cla bot added the cla: yes label Jan 27, 2021
@zuston
Contributor Author

zuston commented Jan 27, 2021

@psybuzz @stephanwlee @nfelt Please review it.

@stephanwlee
Contributor

Sorry about delayed review. The team is discussing the PR and is coming up with a response. We will definitely get back to you by next week. Thanks for understanding and your patience.

@zuston
Contributor Author

zuston commented Jan 29, 2021

Sorry about delayed review. The team is discussing the PR and is coming up with a response. We will definitely get back to you by next week. Thanks for understanding and your patience.

Thanks. Looking forward to your reply ~

@rmothukuru rmothukuru linked an issue Jan 29, 2021 that may be closed by this pull request
Contributor

@wchargin wchargin left a comment

Hi @zuston! Thanks for the pull request. The feature that you’re looking
for seems reasonable enough, but we’re hesitant to add more networking
configuration to TensorBoard proper, especially platform-specific code.

Our Python API enables you to provide a custom server class: you can use
the tensorboard.program.TensorBoard class and pass a server_class
argument to the constructor. The program.TensorBoardServer class
specifies the interface that it needs to implement: basically, your
class is given a WSGI app and has complete control over the server.

Have you tried this? If not, could you give it a shot and let us know if
it works for your use case?
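
A rough sketch of what wchargin suggests might look like the following (the `ReusePortWSGIServer` name and the wiring in the comments are illustrative, not TensorBoard's actual code; it assumes the `TensorBoardServer` interface exposes `serve_forever` and `get_url` as described):

```python
import socket
from wsgiref.simple_server import WSGIServer, make_server


class ReusePortWSGIServer(WSGIServer):
    """WSGI server that sets SO_REUSEPORT (where available) before binding."""

    def server_bind(self):
        # Set the option before the base class performs the actual bind().
        if hasattr(socket, "SO_REUSEPORT"):
            self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        WSGIServer.server_bind(self)


# With TensorBoard installed, this could be wrapped into a server_class,
# roughly (untested sketch):
#
#   from tensorboard import program
#
#   class ReusePortServer(program.TensorBoardServer):
#       def __init__(self, wsgi_app, flags):
#           self._server = make_server(
#               flags.host or "localhost", flags.port, wsgi_app,
#               server_class=ReusePortWSGIServer)
#       def serve_forever(self):
#           self._server.serve_forever()
#       def get_url(self):
#           host, port = self._server.server_address[:2]
#           return "http://%s:%d/" % (host, port)
#
#   program.TensorBoard(server_class=ReusePortServer)
```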

@zuston
Contributor Author

zuston commented Feb 1, 2021

Thanks for the reply, @wchargin.
It may not work. We provide users with a platform (on Hadoop Yarn) to deploy distributed TensorFlow, so it is hard to rewrite user code to extend the implementation of a TensorBoard server.

Actually, some similar changes have also been made in TensorFlow, like tensorflow/tensorflow#38705.

Do you have some ideas on it? Thanks again.

@nfelt
Contributor

nfelt commented Feb 2, 2021

@zuston could you elaborate a little more about what constraint requires that you open the port in the parent process? Do you have a link to some technical documentation about ZooKeeper or Yarn that explains how this works? Without knowing more about your situation, it would seem more straightforward to just take whatever mechanism assigned the port number to the parent process and instead pass it directly through to TensorBoard.

@zuston
Contributor Author

zuston commented Feb 2, 2021

Thanks for the reply, @nfelt.

We use TensorFlow for model training on Hadoop Yarn, but Yarn does not natively support TensorFlow. Therefore, the TonY framework is used, which is a framework to natively run deep learning jobs on Apache Hadoop. It currently supports TensorFlow, PyTorch, MXNet and Horovod.

Simply put, TonY works by obtaining the resources needed for training from Hadoop Yarn, such as machine nodes, memory, and CPU. After obtaining a machine resource (a TonY task executor), TonY obtains, inside the task executor, the TensorFlow gRPC port and the TensorBoard startup port (only in the TensorFlow chief role) required by the role. But this means the TensorBoard started in user code needs to be given TonY's pre-allocated port. How does the code get the TensorBoard port? It can use TonY's environment variables; see the example for details.

As is well known, distributed TensorFlow training needs the TF_CONFIG environment configuration specified for each role. Once the above steps are complete, the TF_CONFIG of each TensorFlow role is determined; TonY determines TF_CONFIG dynamically and injects it into each node's runtime environment via environment variables. I think this mechanism is similar to the Kubeflow tf-operator (which provides a Kubernetes custom resource that makes it easy to run distributed or non-distributed TensorFlow jobs on Kubernetes).

Back to this pull request: why is port reuse needed? After TonY's task executor finds an available port, it releases the port before starting the user's training code. This leaves a race: another process on the host could grab the port after it is closed but before the task process launches.
A similar problem exists not only for TensorBoard on TonY, but also for TensorFlow gRPC service startup. TonY has solved that case through TensorFlow port reuse; see here for details.
So I think it is necessary to solve the TensorBoard port conflict problem in a similar way.

reference:
https://arxiv.org/abs/1904.01631 (TonY paper)
tony-framework/TonY#456 (TonY related issue)

ping @wchargin
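
The race described above, and how SO_REUSEPORT sidesteps it, can be illustrated with plain sockets (a POSIX-only sketch; `bind_reuseport` is an illustrative helper, not TonY or TensorBoard code):

```python
import socket


def bind_reuseport(port):
    """Bind a TCP socket with SO_REUSEPORT set (POSIX platforms only)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("0.0.0.0", port))
    return s


# The parent (e.g. the TonY task executor) allocates a port and keeps
# holding it open...
parent = bind_reuseport(0)
port = parent.getsockname()[1]

# ...so the child (TensorBoard) can bind the very same port while the
# parent still holds it. There is no close-then-rebind window in which
# another process could grab the port.
child = bind_reuseport(port)
assert child.getsockname()[1] == port
```

Without SO_REUSEPORT, the parent would have to close the socket first, opening exactly the window in which another process can steal the port.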

@zuston
Contributor Author

zuston commented Feb 4, 2021

Any update on it? @nfelt @wchargin @stephanwlee

Contributor

@nfelt nfelt left a comment

Thanks for the additional context about why you need the feature. We're willing to accept this PR but in addition to the comments left inline, could you also check that this behaves as expected on Windows? (I don't think Windows has the SO_REUSEPORT option so we'll need code to handle that gracefully, which should be tested.)

Also, the tests are failing; you'll need to add several lines for reuse_port here:

class FakeFlags(object):
    def __init__(
        self,
        bind_all=False,
        host=None,
        inspect=False,
        version_tb=False,
        logdir="",
        logdir_spec="",
        event_file="",
        db="",
        path_prefix="",
        generic_data="true",
        grpc_data_provider="",
    ):
        self.bind_all = bind_all
        self.host = host
        self.inspect = inspect
        self.version_tb = version_tb
        self.logdir = logdir
        self.logdir_spec = logdir_spec
        self.event_file = event_file
        self.db = db
        self.path_prefix = path_prefix
        self.generic_data = generic_data
        self.grpc_data_provider = grpc_data_provider
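
The addition nfelt asks for would presumably look like this (a sketch; the `False` default mirrors the assumed default of the real `--reuse_port` flag):

```python
# FakeFlags extended with a reuse_port field, as requested in the review.
class FakeFlags(object):
    def __init__(
        self,
        bind_all=False,
        host=None,
        inspect=False,
        version_tb=False,
        logdir="",
        logdir_spec="",
        event_file="",
        db="",
        path_prefix="",
        generic_data="true",
        grpc_data_provider="",
        reuse_port=False,  # assumed default of the new flag
    ):
        self.bind_all = bind_all
        self.host = host
        self.inspect = inspect
        self.version_tb = version_tb
        self.logdir = logdir
        self.logdir_spec = logdir_spec
        self.event_file = event_file
        self.db = db
        self.path_prefix = path_prefix
        self.generic_data = generic_data
        self.grpc_data_provider = grpc_data_provider
        self.reuse_port = reuse_port
```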

Contributor

Could you elaborate slightly here? E.g. something like

Enables the SO_REUSEPORT option on the socket opened by TensorBoard's HTTP server, for platforms that support it. This is useful in cases when a parent process has obtained the port already and wants to delegate access to the port to TensorBoard as a subprocess.

Contributor Author

Done

Contributor

Let's leave out this comment; it doesn't really add any more information than the code already has.

Contributor Author

Done

Contributor

The code should start below the docstring for the method.

Contributor Author

Done

Contributor

We should check for the existence of the SO_REUSEPORT attribute before trying to access it, because it's not available on all platforms; see https://stackoverflow.com/questions/14388706/how-do-so-reuseaddr-and-so-reuseport-differ. See examples below using hasattr for how we do this for other socket options. If the option isn't available but the flag was passed anyway, let's raise a TensorBoardServerException saying something like "TensorBoard --reuse_port option is not supported on this platform".
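
The guard asked for here might look roughly like this (the function name is illustrative, and `RuntimeError` stands in for the `TensorBoardServerException` the real code would raise):

```python
import socket


def maybe_set_reuse_port(sock, reuse_port):
    """Set SO_REUSEPORT if requested, failing loudly where unsupported."""
    if not reuse_port:
        return
    if not hasattr(socket, "SO_REUSEPORT"):
        # The real TensorBoard code would raise TensorBoardServerException.
        raise RuntimeError(
            "TensorBoard --reuse_port option is not supported on this platform"
        )
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
```

The `hasattr` check matters because Python only exposes `socket.SO_REUSEPORT` on platforms whose C headers define it (notably not Windows), so an unconditional access would raise `AttributeError` rather than a helpful error message.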

Contributor Author

Done.

Contributor

With the code you're proposing to add, this method now does more than just IPV4 mapping, so the docstring should be updated. Can you add a new docstring that's more general, something like """Override to set custom options on the socket."""?

And then let's take the existing comment and change it to an inline comment (#-style), like

# Enable IPV4 mapping for IPV6 sockets when desired.
# The main use case for this is so that when no host is specified,
# TensorBoard can listen on all interfaces for both IPv4 and IPv6
# connections, rather than having to choose v4 or v6 and hope the
# browser didn't choose the other one.
socket_is_v6 = ...

Contributor Author

Done

@zuston zuston requested a review from wchargin February 4, 2021 06:19
@zuston
Contributor Author

zuston commented Feb 4, 2021

Thanks for the additional context about why you need the feature. We're willing to accept this PR but in addition to the comments left inline, could you also check that this behaves as expected on Windows? (I don't think Windows has the SO_REUSEPORT option so we'll need code to handle that gracefully, which should be tested.)

On my colleague's Windows PC, I checked that it throws an exception as expected. @nfelt

@zuston
Contributor Author

zuston commented Feb 6, 2021

Please review it. Thanks~ @nfelt @wchargin

Contributor

@nfelt nfelt left a comment

Thanks for the contribution, will merge. Just FYI, most likely this won't be available in stable TensorBoard until our next release (2.5.0) which may be a month or two away, but it should be in nightly releases after tomorrow.

@nfelt nfelt merged commit e93706f into tensorflow:master Feb 6, 2021
@zuston
Contributor Author

zuston commented Feb 7, 2021

Thank you for your patience~. @nfelt



Successfully merging this pull request may close these issues.

Support tensorboard server port reuse

4 participants