KEMBAR78
[DebuggerV2] Flesh out graph execution data display by caisq · Pull Request #3528 · tensorflow/tensorboard · GitHub
Skip to content

Conversation

@caisq
Copy link
Contributor

@caisq caisq commented Apr 17, 2020

  • Motivation for features / changes
    • Continue developing DebuggerV2 plugin, specifically its GraphExecutionContainer that visualizes the details of the intra-graph execution events.
  • Technical description of changes
    • Add necessary actions, selectors, reducers and effects to support the lazy, paged loading of GraphExecutions
    • Add a cdk-virtual-scroll-viewport to GraphExecutionComponent
    • The lazy loading is triggered by scrolling events on the cdk-virtual-scroll-viewport
    • The displaying detailed debug-tensor summaries such as dtype, rank, shape, and numeric breakdown will be added in the follow PRs. This PR just adds displaying of the tensor name and op type in the cdk-virtual-scroll-viewport.
  • Screenshots of UI changes
    • Loaded state: image
    • Loading state (mat-spinner to be added in follow-up CLs): image
  • Detailed steps to verify changes work correctly (as executed by you)
    • Unit tests added
    • Manual testing against logdirs with real tfdbg2 data of different sizes

@caisq
Copy link
Contributor Author

caisq commented Apr 17, 2020

cc @baileydesign

@caisq caisq marked this pull request as ready for review April 17, 2020 21:14
@caisq caisq requested a review from stephanwlee April 17, 2020 21:14
this.store.select(getGraphExecutionDisplayCount)
),
filter(([, runId, numGraphExecutions]) => {
return runId !== null && numGraphExecutions > 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it a tiny bit readable, can we only select the things we need here? i.e., move L565-566 to L588?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 915 to 920
for (const {dataExists, page3Size, loadingPages} of [
{dataExists: false, page3Size: 0, loadingPages: [3]},
{dataExists: false, page3Size: 0, loadingPages: []},
{dataExists: true, page3Size: 2, loadingPages: []},
]) {
it('triggers GraphExecution loading', fakeAsync(() => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure about jasmine, but for at least mocha, you need to have unique spec name in order for tests to be run as a separate spec. Please generate a unique name for these. Please try changing value of L917 to something faulty and see if it fails.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed.

);

action.next(graphExecutionScrollToIndex({index: newScrollBeginIndex}));
tick(100);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this achieve? Is this to sequentialize promise returned by the data source? Can we instead use proper async/await?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tick(100) is needed because of the debounceTime(100) in the effect created by onGraphExecutionScroll(). This test won't pass without it. 100 is therefore not an arbitrary number.

Using fakeAsync() with tick() is better than async/await because it is simulated passage of time and hence should make the test run in shorter period of time.

Comment on lines +602 to +608
const pageIndex = Math.floor(i / pageSize);
if (graphExecutionDataLoadingPages.indexOf(pageIndex) !== -1) {
graphExecutionDataLoadingPages.splice(
graphExecutionDataLoadingPages.indexOf(pageIndex),
1
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we chose array for graphExecutionDataLoadingPages instead of an object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

graphExecutionDataLoadingPages maintains the pages that are currently being loaded. An object could serve the same purpose, perhaps a little more performantly if the number of pages is large. But the downside is that the key type will be a little arbitrary (perhaps just null?) We don't want to keep track of all the pages, because the number of pages can be large (consider a debugger run with 500k graph executions, which leads to 500k / 200 = 2500 keys in the object if we did that). In practice, the performance shouldn't matter either, because the amount of loading pages should usually be small (the size of the array grows with user scrolling the list, at a debounced rate).

begin: 2,
end: 4,
graph_executions: [
createTestGraphExecution({op_name: 'TestOp_2'}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change this name so I can know whether we are expecting to overwrite or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It is supposed to be overwritten. The test now reflects that.

}

.tensor-name-and-op-type {
direction: rtl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we use rtl in this file? What is this achieving? Is it abusing a11y for UI purposes? If so, how can we make these properly a11y complaint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the rationale for rtl is as follows. We often see very long node/tensor names in tf problems, e.g.,

  • resnet/residual_block_50/conv_30/Conv2D_0
  • resnet/residual_block_50/conv_31/Conv2D_0

When the space is limited, we want to omit some parts of the strings with ellipsis (...). But often the most interest part of a node/tensor name is at the end or toward the end. So we want to omit the beginning part of the name. Using rtl together with text-overflow: ellipsis and white-space: nowrapachieves that purpose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, about that. While I am okay for this to merge as is, I would like to talk about nicer approach. Let's chat offline (I want to show you something)


<!-- TODO(cais): Add cdk-virtual-scroll-viewport for graph executions -->
<cdk-virtual-scroll-viewport
#executionsScroll
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used? I do not see any references to this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. It's unused. Removed.

</div>
<div *ngIf="graphExecutionData[i]; else dataLoading">
<div class="tensor-name-and-op-type">
<div class="tensor-name">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No AI: you probably mean span in a lot of these places. I cannot tell by just reading these though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these cases here I do want div, because I want them to be one separate lines (e.g., tensor-name and op-type) and be nested (e.g., tensor-name-and-op-type and tensor-name). But follow-up CLs will indeed use a lot of spans. Stay tuned.

numGraphExecutions: number | null = null;

@Input()
graphExecutionData: {[index: number]: GraphExecution} = {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

components should not define default values. Prefer to use graphExecutionData!: {[index: number]: GraphExecution};

(it forces container to pass the input or remove unused input in the component)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here and elsewhere in this component file.

it('does not render execs viewport if # execs = 0', fakeAsync(() => {
const fixture = TestBed.createComponent(GraphExecutionsContainer);
store.overrideSelector(getNumGraphExecutions, 0);
fixture.autoDetectChanges();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting but prefer to use store.refreshState() and fixture.detectChanges() without tick and fakeAsync (is this for the virtual scrolling on @angular/material/cdk)?

Nonetheless, 500 in tick(500) feels very arbitrary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the special test patterns you see here (including autoDetectChanges(), tick() and fakeAsync()) are used because of the virtual scrolling used by the component.

  • autoDetectChanges() seems to be required by the cdk scrolling. I couldn't get the test to pass by using refreshState() and detectChanges().
  • As for tick(500), I found it can be replaced with simply tick(). Done.
  • fakeAsync() is required for tick(). See https://angular.io/api/core/testing/tick.

Copy link
Contributor Author

@caisq caisq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

this.store.select(getGraphExecutionDisplayCount)
),
filter(([, runId, numGraphExecutions]) => {
return runId !== null && numGraphExecutions > 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


<!-- TODO(cais): Add cdk-virtual-scroll-viewport for graph executions -->
<cdk-virtual-scroll-viewport
#executionsScroll
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. It's unused. Removed.

</div>
<div *ngIf="graphExecutionData[i]; else dataLoading">
<div class="tensor-name-and-op-type">
<div class="tensor-name">
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these cases here I do want div, because I want them to be one separate lines (e.g., tensor-name and op-type) and be nested (e.g., tensor-name-and-op-type and tensor-name). But follow-up CLs will indeed use a lot of spans. Stay tuned.

numGraphExecutions: number | null = null;

@Input()
graphExecutionData: {[index: number]: GraphExecution} = {};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here and elsewhere in this component file.

it('does not render execs viewport if # execs = 0', fakeAsync(() => {
const fixture = TestBed.createComponent(GraphExecutionsContainer);
store.overrideSelector(getNumGraphExecutions, 0);
fixture.autoDetectChanges();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the special test patterns you see here (including autoDetectChanges(), tick() and fakeAsync()) are used because of the virtual scrolling used by the component.

  • autoDetectChanges() seems to be required by the cdk scrolling. I couldn't get the test to pass by using refreshState() and detectChanges().
  • As for tick(500), I found it can be replaced with simply tick(). Done.
  • fakeAsync() is required for tick(). See https://angular.io/api/core/testing/tick.

}

.tensor-name-and-op-type {
direction: rtl;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the rationale for rtl is as follows. We often see very long node/tensor names in tf problems, e.g.,

  • resnet/residual_block_50/conv_30/Conv2D_0
  • resnet/residual_block_50/conv_31/Conv2D_0

When the space is limited, we want to omit some parts of the strings with ellipsis (...). But often the most interest part of a node/tensor name is at the end or toward the end. So we want to omit the beginning part of the name. Using rtl together with text-overflow: ellipsis and white-space: nowrapachieves that purpose.

Comment on lines 915 to 920
for (const {dataExists, page3Size, loadingPages} of [
{dataExists: false, page3Size: 0, loadingPages: [3]},
{dataExists: false, page3Size: 0, loadingPages: []},
{dataExists: true, page3Size: 2, loadingPages: []},
]) {
it('triggers GraphExecution loading', fakeAsync(() => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed.

);

action.next(graphExecutionScrollToIndex({index: newScrollBeginIndex}));
tick(100);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tick(100) is needed because of the debounceTime(100) in the effect created by onGraphExecutionScroll(). This test won't pass without it. 100 is therefore not an arbitrary number.

Using fakeAsync() with tick() is better than async/await because it is simulated passage of time and hence should make the test run in shorter period of time.

begin: 2,
end: 4,
graph_executions: [
createTestGraphExecution({op_name: 'TestOp_2'}),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It is supposed to be overwritten. The test now reflects that.

Comment on lines +602 to +608
const pageIndex = Math.floor(i / pageSize);
if (graphExecutionDataLoadingPages.indexOf(pageIndex) !== -1) {
graphExecutionDataLoadingPages.splice(
graphExecutionDataLoadingPages.indexOf(pageIndex),
1
);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

graphExecutionDataLoadingPages maintains the pages that are currently being loaded. An object could serve the same purpose, perhaps a little more performantly if the number of pages is large. But the downside is that the key type will be a little arbitrary (perhaps just null?) We don't want to keep track of all the pages, because the number of pages can be large (consider a debugger run with 500k graph executions, which leads to 500k / 200 = 2500 keys in the object if we did that). In practice, the performance shouldn't matter either, because the amount of loading pages should usually be small (the size of the array grows with user scrolling the list, at a debounced rate).

@caisq caisq merged commit cad3ec8 into tensorflow:master Apr 21, 2020
caisq added a commit to caisq/tensorboard that referenced this pull request May 19, 2020
* Motivation for features / changes
  * Continue developing DebuggerV2 plugin, specifically its GraphExecutionContainer that visualizes the details of the intra-graph execution events. 
* Technical description of changes
  * Add necessary actions, selectors, reducers and effects to support the lazy, paged loading of `GraphExecution`s
  * Add a `cdk-virtual-scroll-viewport` to `GraphExecutionComponent`
  * The lazy loading is triggered by scrolling events on the `cdk-virtual-scroll-viewport`
  * The displaying detailed debug-tensor summaries such as dtype, rank, shape, and numeric breakdown will be added in the follow PRs. This PR just adds displaying of the tensor name and op type in the `cdk-virtual-scroll-viewport`.
* Screenshots of UI changes
  * Loaded state: ![image](https://user-images.githubusercontent.com/16824702/79614243-24f6e500-80ce-11ea-8b9f-412cb831a449.png)
  * Loading state (mat-spinner to be added in follow-up CLs): ![image](https://user-images.githubusercontent.com/16824702/79614131-e19c7680-80cd-11ea-8dad-2dfd4b998ada.png)
* Detailed steps to verify changes work correctly (as executed by you)
  * Unit tests added
  * Manual testing against logdirs with real tfdbg2 data of different sizes
caisq added a commit that referenced this pull request May 27, 2020
* Motivation for features / changes
  * Continue developing DebuggerV2 plugin, specifically its GraphExecutionContainer that visualizes the details of the intra-graph execution events. 
* Technical description of changes
  * Add necessary actions, selectors, reducers and effects to support the lazy, paged loading of `GraphExecution`s
  * Add a `cdk-virtual-scroll-viewport` to `GraphExecutionComponent`
  * The lazy loading is triggered by scrolling events on the `cdk-virtual-scroll-viewport`
  * The displaying detailed debug-tensor summaries such as dtype, rank, shape, and numeric breakdown will be added in the follow PRs. This PR just adds displaying of the tensor name and op type in the `cdk-virtual-scroll-viewport`.
* Screenshots of UI changes
  * Loaded state: ![image](https://user-images.githubusercontent.com/16824702/79614243-24f6e500-80ce-11ea-8b9f-412cb831a449.png)
  * Loading state (mat-spinner to be added in follow-up CLs): ![image](https://user-images.githubusercontent.com/16824702/79614131-e19c7680-80cd-11ea-8dad-2dfd4b998ada.png)
* Detailed steps to verify changes work correctly (as executed by you)
  * Unit tests added
  * Manual testing against logdirs with real tfdbg2 data of different sizes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants