[XLA] Add a new XLA mode: XLALite by nouiz · Pull Request #34655 · tensorflow/tensorflow · GitHub

Conversation

@nouiz
Contributor

@nouiz nouiz commented Nov 27, 2019

Add 2 new XLA flags:

  • TF_XLA_FLAGS=--tf_xla_ops_to_cluster=[FUSIBLE,...]
  • TF_XLA_FLAGS=--tf_xla_auto_jit=fusible
    This is a shortcut for TF_XLA_FLAGS="--tf_xla_ops_to_cluster=FUSIBLE --tf_xla_auto_jit=1"

This enables XLA, but only for the subset of TF operations that XLA knows how to fuse together. This allows using XLA's operation-fusion capabilities while avoiding some of the cases where XLA currently causes slowdowns. In most cases where XLA is slower than classic TF, XLALite isn't slower than classic TF. In cases where XLA gives a speedup over classic TF, XLALite gives a good part of that speedup.

The flag tf_xla_ops_to_cluster can accept TF operation names and/or predefined groups of operations. The group FUSIBLE includes all the defined groups. Multiple values can be passed by separating them with commas.
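For reference, a minimal Python sketch of how these flags are set in practice. The flag names come from this PR; the specific groups and op passed in the last line are only illustrative:

```python
import os

# Enable the auto-JIT "fusible" shortcut. This must be set before
# TensorFlow initializes its flags, i.e. before `import tensorflow`.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=fusible"

# Equivalent explicit form: restrict clustering to the FUSIBLE group
# and turn auto-clustering on.
os.environ["TF_XLA_FLAGS"] = (
    "--tf_xla_ops_to_cluster=FUSIBLE --tf_xla_auto_jit=1"
)

# Groups and individual TF op names can be mixed, separated by commas.
os.environ["TF_XLA_FLAGS"] = (
    "--tf_xla_ops_to_cluster=PW,RED,MatMul --tf_xla_auto_jit=1"
)
```

Note that both flags must live in the single TF_XLA_FLAGS variable; setting the variable twice in a shell would simply overwrite the first value.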

@tensorflow-bot tensorflow-bot bot added the size:M CL Change Size: Medium label Nov 27, 2019
@rthadur rthadur self-assigned this Nov 28, 2019
@rthadur rthadur added the comp:xla XLA label Nov 28, 2019
@rthadur rthadur requested a review from sanjoy November 28, 2019 08:40
@thomasjoerg thomasjoerg self-requested a review December 2, 2019 12:20
"things very likely to be improved; 2 = on for everything. "
"things very likely to be improved; 2 = on for everything; "
"fusible = only for Tensorflow operations that XLA knows how to "
"fuse. "
Member

Please add something like (experimental) to indicate the feature may change in backward-incompatible ways going forward.

Contributor Author

done

Contributor

Not done? Also should this be under a different flag if this can change in unpredictable ways?

Contributor Author

Right. I added the experimental mark to the new flags' documentation. Added it here too now.

"If multiple, separate them with commas. Shortcuts: "
" PW: All point-wise operations."
" RED: All reduction operations."
" SMALL: Mixed small operations."
Member

Can you explain what small means in this context? How did you choose the ops that were a good fit for the SMALL category?

Contributor Author

Small is like Shape, Rank, Range, and a few others I didn't know which category to put in. I could split that section into SMALL and MIXED. Do you have a better idea of how to split it?

Member

Since you said you rarely used the subcategories of FUSIBLE anyway, I wouldn't split them further. I was just wondering in what way the TF ops are SMALL. If this is a category for ops that just wouldn't fit anywhere else, I'd call it MISC to make this obvious.

Contributor Author

Done

int32 tf_xla_max_cluster_size;

// If non-empty, limit XLA clustering to the following TF operations.
string tf_xla_supported_ops;
Member

Nit: tf_xla_ops_to_cluster is a little clearer than 'tf_xla_supported_ops'.

Contributor Author

done

"(LRN, LRNGrad)."
" BN: TF FusedBatchNorm* operations."
" FUSIBLE: All TF operations that XLA can fuse (All the above). "
"You can also put any TF operation name."),
Member

Are TF op names expected to be fully-qualified? Please provide an example.

Contributor Author

I added an example. I'm not sure what you mean by fully-qualified name? I think TF operations each have just one unique name, like Add, MatMul, Sum, ... Those are the names that should be used in this version.

return true;
}

const absl::flat_hash_map<string, std::vector<string>> whitelist_table = {
Member

What invariants must hold for this list? Or phrased differently, at what points does this list need to be updated? Is it meant to be complete, i.e. are all point-wise operations supported by XLA:GPU listed here? In your experiments, did you find subsets of FUSIBLE useful?

Contributor Author

Every time the TF-XLA bridge supports a new TF operation, we need to check whether we want to support it by default or not. So we should document that somewhere to help ensure it isn't forgotten. Any idea where we should document that?

I didn't use the shortcuts much. What I used is FUSIBLE plus other operations that I wasn't sure we wanted to include or not. But if someone wants to start playing more with it, I think the shortcuts would be useful.

To make this list, I went over the TF operations that the bridge knows and selected those where I was sure what they do and that XLA could fuse them. I could have missed some. In 2 benchmarks, I found some ops that could maybe be fused that I didn't include. I timed runs including them, and this slowed down XLALite compared to not having them, so I didn't include them. They were:

   "ReadVariableOp", "VarIsInitializedOp", "VariableShape",
   "ResourceApplyCenteredRMSProp", "ResourceApplyRMSProp",
   "ResourceScatterAdd", "ResourceApplyAdam"
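For readers following the discussion, the format of whitelist_table can be sketched in Python. The category names match the flag help text quoted earlier; the op lists below are abbreviated placeholders, not the PR's actual contents:

```python
# Abbreviated sketch of the whitelist table format:
# category name -> list of TF operations in that category.
whitelist_table = {
    "PW": ["Add", "Sub", "Mul", "Relu", "Tanh"],    # point-wise ops
    "RED": ["Sum", "Max", "Min", "Mean", "Prod"],   # reduction ops
    "MISC": ["Shape", "Rank", "Range", "Reshape"],  # mixed small ops
    "BN": ["FusedBatchNorm", "FusedBatchNormV2"],   # batch-norm ops
}

# FUSIBLE is the union of every defined group.
whitelist_table["FUSIBLE"] = sorted(
    {op for ops in whitelist_table.values() for op in ops}
)
```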

Member

@sanjoy Can you recommend a good spot for this documentation? Do you have an opinion regarding the TF Ops Frederic did not include?

Contributor

@sanjoy sanjoy Dec 3, 2019

As for the documentation I would vote for creating a unit test that checks that tf2xla supported ops are either whitelisted or explicitly blacklisted. See resource_operation_table_test.cc.

Do you have an opinion regarding the TF Ops Frederic did not include?

Right, it isn't totally obvious which ops were included. I think we need some comments to describe the "format" of whitelist_table.

Contributor Author

Good idea to have a test that forces all registered XLA/TF operations to be whitelisted or blacklisted. It will guarantee that we make a decision when new operations are supported.
I added this test.
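The invariant such a test enforces can be sketched as follows. This is a Python sketch with placeholder op names and sets; the real test is C++ and lives alongside resource_operation_table_test.cc:

```python
# Sketch of the invariant test: every op the tf2xla bridge registers must
# reflect an explicit clustering decision, either whitelisted or blacklisted.
WHITELISTED = {"Add", "Mul", "Sum", "Relu"}            # placeholder sets
BLACKLISTED = {"ReadVariableOp", "ResourceApplyAdam"}

def all_registered_ops():
    """Stand-in for XlaOpRegistry::GetAllRegisteredOps()."""
    return ["Add", "Mul", "Sum", "Relu", "ReadVariableOp", "ResourceApplyAdam"]

def check_every_op_has_a_decision():
    undecided = [op for op in all_registered_ops()
                 if op not in WHITELISTED and op not in BLACKLISTED]
    assert not undecided, f"No clustering decision for: {undecided}"

check_every_op_has_a_decision()
```

When a new op is added to the bridge without being added to either set, the test fails and forces a decision.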


std::unique_ptr<absl::flat_hash_set<string>> GetWhitelist() {
MarkForCompilationPassFlags* flags = GetMarkForCompilationPassFlags();
auto whitelist = absl::WrapUnique(new absl::flat_hash_set<string>());
Contributor

make_unique?

Contributor Author

As you suggested elsewhere, I now return by copy instead of by reference. As this is small and the compiler probably optimizes it, it makes the code simpler.


if (VLOG_IS_ON(2) && whitelist->size() > 0) {
std::vector<string> vwhitelist(whitelist->begin(), whitelist->end());
std::sort(vwhitelist.begin(), vwhitelist.end());
Contributor

absl::c_sort

Contributor Author

done.

}
}

if (VLOG_IS_ON(2) && whitelist->size() > 0) {
Contributor

!whitelist->empty()

Contributor Author

done

} else if (whitelist_table.contains(s)) {
auto v = whitelist_table.at(s);
whitelist->insert(v.begin(), v.end());
} else if (s.size() > 0) {
Contributor

!s.empty()

Contributor Author

done

whitelist->insert(v.begin(), v.end());
} else if (s.size() > 0) {
// Should be a user provided TF operation.
whitelist->insert(string(s));
Contributor

Should we VLOG(5) here or something, to help catch misspellings?

Contributor Author

I already do a misspelling check on line 1195. It is there to help with error reporting.
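A possible shape for such a validation-with-suggestions check, sketched in Python (a hypothetical helper; the PR's actual check is in C++):

```python
import difflib

def validate_ops(requested, registered):
    """Reject unknown op names, suggesting a close match for likely typos."""
    for op in requested:
        if op not in registered:
            suggestion = difflib.get_close_matches(op, registered, n=1)
            hint = f" Did you mean '{suggestion[0]}'?" if suggestion else ""
            raise ValueError(f"Unknown TF operation '{op}'.{hint}")

registered_ops = ["Add", "MatMul", "Sum", "Relu"]
validate_ops(["Add", "Sum"], registered_ops)  # OK: both ops are registered
```

Checking against the registered-op set at parse time surfaces typos like "Matmul" immediately, instead of silently clustering nothing.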

}
}

if (VLOG_IS_ON(2) && whitelist->size() > 0) {
Contributor

Why not just sort on all codepaths, then this entire branch becomes VLOG(2) << ...

Contributor Author

To avoid the cost of the sort when we do not print.
I tried changing the container to the sorted container absl::btree_set to remove the need for a sort, but btree_set isn't available in my version of absl. I'm not sure that is a good enough reason to bump the absl version TF requires; I have never changed a TF dependency version before.

If I put a VLOG(2) on all code paths, this will make the output verbose even when this feature isn't used. Are you suggesting doing that? I would rather not add extra useless verbose output here. Do you see value in always printing it, even when the feature isn't used?


auto whitelist = GetWhitelist();

auto vall_ops = XlaOpRegistry::GetAllRegisteredOps();
Contributor

Explicit type would be useful here

Contributor Author

done

auto vall_ops = XlaOpRegistry::GetAllRegisteredOps();
absl::flat_hash_set<string> all_ops(vall_ops.begin(), vall_ops.end());
// Check that the user-provided TF operations really exist.
for (auto s = whitelist->begin(); s != whitelist->end(); ++s) {
Contributor

for-each loop instead?

Contributor Author

done

@tensorflow-bot tensorflow-bot bot added the ready to pull PR ready for merge process label Dec 3, 2019
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Dec 3, 2019
@cheshire
Contributor

cheshire commented Dec 3, 2019

@thomasjoerg the issue with the non-trivially destructible global remains; it will break the build.

return true;
}

const absl::flat_hash_map<string, std::vector<string>> whitelist_table = {
Contributor

Please document the format of this table.

We also don't allow non-trivial global destructors because they don't play well with multi-threading. I think a better phrasing is:

absl::flat_hash_map<string, std::vector<string>> *CreateWhitelist() {
  absl::flat_hash_map<string, std::vector<string>>* result = new ...;
  // Use explicit code to populate "result", possibly with comments.
  return result;
}

const absl::flat_hash_map<string, std::vector<string>>& GetOrCreateWhitelist() {
  static absl::flat_hash_map<string, std::vector<string>>* whitelist = CreateWhitelist();
  return *whitelist;
}

Contributor Author

I fixed the non-trivial global destructors.
I also added documentation of the format. If you want more than this, just tell me.

Contributor

I think it's better to have the static in CreateWhitelist, in order to not accidentally create a leak if someone else calls it; also, this seems to be how it is usually done in XLA, e.g. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/transfer_manager.cc#L41
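The create-once, never-destroyed pattern discussed here is C++-specific (a function-local static pointer that is intentionally never deleted, to avoid a non-trivial global destructor), but the intent (one shared table, built lazily on first use and returned to every caller) can be sketched in Python:

```python
import functools

@functools.lru_cache(maxsize=None)
def get_or_create_whitelist():
    """Built once on the first call; every later call returns the same dict."""
    # Placeholder contents; the real table maps category -> TF op names.
    return {"PW": ["Add", "Mul"], "RED": ["Sum", "Max"]}

# Repeated calls hand back the identical object, so nothing is rebuilt.
table = get_or_create_whitelist()
```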


absl::flat_hash_map<string, std::vector<string>>* CreateWhitelist() {
// Table format: category name: {list of TF operations in that category}
absl::flat_hash_map<string, std::vector<string>>* result =
Contributor

Should this be static? At the moment this seems to leak on every invocation.

Contributor Author

I modified it. See other comment: #34655 (comment)

@nouiz
Contributor Author

nouiz commented Dec 9, 2019

I can't reply to the comment, so replying here.
I changed the code (and renamed the function) to be safer against potential future changes, as suggested.

cheshire
cheshire previously approved these changes Dec 9, 2019
@tensorflow-bot tensorflow-bot bot added the kokoro:force-run Tests on submitted change label Dec 9, 2019
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Dec 9, 2019
@rthadur rthadur added ready to pull PR ready for merge process and removed ready to pull PR ready for merge process labels Dec 10, 2019
@thomasjoerg
Member

@rthadur Did CopyBara get stuck somehow? It's been two days since the PR was approved. Can you kick CopyBara?

@cheshire
Contributor

@thomasjoerg we can also force run copybara by adding the kokoro:force-run tag, let me try to do it.

@cheshire cheshire added the kokoro:force-run Tests on submitted change label Dec 11, 2019
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Dec 11, 2019
@thomasjoerg
Member

@thomasjoerg we can also force run copybara by adding the kokoro:force-run tag, let me try to do it.

@cheshire Forcing a Kokoro run did not do the trick. I imported the PR manually.

@nouiz
Contributor Author

nouiz commented Dec 13, 2019

Note, I forgot to update my test after the interface change, so my test is currently broken.
Why didn't the CI fail? Does it run the XLA tests?
I just pushed a fix. I saw somewhere that the TF nightly builds are broken. Could this be related?

 make it easier to find which operation. For example, the node name can be AvgPool2d, while the TF operation is AvgPool.
tensorflow-copybara pushed a commit that referenced this pull request Dec 16, 2019
PiperOrigin-RevId: 285730750
Change-Id: Ib53f29df2e956b8c4904d08af3d6f33f1c419a9f
@nouiz
Contributor Author

nouiz commented Dec 19, 2019

I made a new PR with the last commit that was missing:
#35263

@nouiz nouiz closed this Dec 19, 2019
@nouiz nouiz deleted the xlalite_pr branch December 19, 2019 14:38
