-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Aggregate numeric values when slicing (What-If Tool) #2678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Aggregate numeric values when slicing (What-If Tool) #2678
Conversation
This is a new boolean field `isNumeric` in `distanceStats_` to allow us to check is a feature is numeric instead of looking for the existance of `.stdDev`.
Each slice of the dataset is identified by a key. Currently, when slicing by a feature, each value becomes a slice and therefore a key. To facilitate how we handle with generating slices, and particularly to allow us to group multiple values into the same key (buckets for numerical features), we wrap the slice key generation in a separate function. This commit doesn't change any aggregating behaviour but makes way to the next steps.
We iterate over the dataset using the slice key function to attribute a key to each example. This avoids the usage of the `undefined` dummy keys, since all cases will be naturally covered by seeing all examples. Since we are now using a single function to handle slice keys throughout the code, we can change the key attribution logics in a consistent way.
We create buckets for each aggregated feature, so that we can associate each slice of that feature with an interval instead of a value. The getSliceKey_ function now checks if the feature has bucket to be aggregated and returns the slice key of the appropriate interval in the positive case. It uses the value of the feature as key otherwise (just as before).
`numBuckets` controls in how many buckets the range of values for a feature should be divided. This number is also used to avoid aggregating in buckets when we have less unique values than buckets.
When slicing by numeric features aggregated in buckets, displaying the buckets in order is a particularly useful feature. Alphabetical ordering is not enough for this purpose: we need to parse the number in the interval and order it by its value. Also, let's make this ordering the default one when aggregating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for this! really excited for this feature. some minor comments
| numBuckets: { | ||
| type: Number, | ||
| value: 10, | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add this as a numeric paper-input element below the slice by dropdown (when a numeric feature is selected) with a range of 2-10 and default value of 2? can use a <template is="dom-if...> wrapping the element so it only appears in that case. same thing with another dropdown under the secondary slice if that is numeric too. then ensure the buckets are recalculated whenever numPrimaryBuckets or numSecondaryBuckets changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| type: Boolean, | ||
| value: false, | ||
| }, | ||
| featureBucketEdges_: Object, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add short comment on what it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
| const feature1 = this.selectedBreakdownFeature; | ||
| if (feature1 == '') { | ||
| this.selectedSecondBreakdownFeature = ''; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably still want this logic here, which clears out the secondary slice selection when the primary slice selection is cleared. otherwise if you have both slices set, then clear the primary, the view goes back to showing only overall stats, but then when you select a slice again, the secondary slice is already selected as whatever it was last and you don't just see the primary slice you selected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely. Thanks for catching this!
| } | ||
| } | ||
| this.visdata.forEach( | ||
| function(item) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use arrow notation to simplify this, remove need for .bind(this) :
this.visdata.forEach(item => { blah blah.... return blah;});
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes! 👍
| if (edges) { | ||
| for (let i = 1; i < edges.length; i++) { | ||
| if ( | ||
| example[feature] < edges[i] || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this handle when an example is missing the feature value for the selected feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When example[feature] is undefined, this condition is false and we end up returning undefined (after this loop). This results in a key called "undefined" where all these values are grouped.
I think I can make it more explicit and human-friendly, though, by treating it separately.
| }.bind(this) | ||
| ); | ||
| const thresholds = Object.values(thresholdsMap); | ||
| this.set('selectedFeatureSort', 'Alphabetical'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why reset the sort order when slices change? i can imagine wanting to keep viewing slices by something like false positive rate as i change slicing criteria.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. It makes sense.
| * Calculate edges between buckets for aggregating numeric features. | ||
| * We do this beforehand to round numbers and avoid ugly interval labels. | ||
| */ | ||
| calculateFeatureBucketEdges_: function() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:)
* Add explanation to `featureBucketEdges_`. * Treat unselection of slicing feature. * Use arrow function to simplify `forEach` using outer `this`.
When slicing by a feature, group missing values in a '?' key.
When slicing by a numeric feature, values can be aggregated to buckets. We're adding an input field to control in how many buckets the values will be aggregated. This only appears for numeric features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There you go, James :)
| const feature1 = this.selectedBreakdownFeature; | ||
| if (feature1 == '') { | ||
| this.selectedSecondBreakdownFeature = ''; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely. Thanks for catching this!
| type: Boolean, | ||
| value: false, | ||
| }, | ||
| featureBucketEdges_: Object, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
| } | ||
| } | ||
| this.visdata.forEach( | ||
| function(item) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes! 👍
| }.bind(this) | ||
| ); | ||
| const thresholds = Object.values(thresholdsMap); | ||
| this.set('selectedFeatureSort', 'Alphabetical'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. It makes sense.
| * Calculate edges between buckets for aggregating numeric features. | ||
| * We do this beforehand to round numbers and avoid ugly interval labels. | ||
| */ | ||
| calculateFeatureBucketEdges_: function() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:)
| if (edges) { | ||
| for (let i = 1; i < edges.length; i++) { | ||
| if ( | ||
| example[feature] < edges[i] || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When example[feature] is undefined, this condition is false and we end up returning undefined (after this loop). This results in a key called "undefined" where all these values are grouped.
I think I can make it more explicit and human-friendly, though, by treating it separately.
| numBuckets: { | ||
| type: Number, | ||
| value: 10, | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Thanks. Only issue is that the you don't necessary end up with equally-sized intervals when slicing by a numeric feature. Slice by age with 4 buckets and some buckets are width 10 and some width 20 (and the initial one is width 13 due to the min age of 17. Can the logic be adjusted so they all end up of width 10 or 20? Or we would need to document this in the tool to avoid confusion |
Make it slightly more precise than the order of magnitude of the standard deviation for the values of that feature.
|
@nfelt please review or assign for review |
|
@stephanwlee please review or assign for review, thx |
|
@jameswex is there a reason this needs an additional review? I think your approval is sufficient to merge this since it only affects WIT code. |
Historically we've given TB a chance to review our code before we commit to the repo, but I'm ok merging this with just my approval if you are. |
|
Fair enough - I appreciate the notice! But I think going forward it's fine to merge after your review for PRs that only affect the WIT plugin codebase. Of course if there are bigger structural changes or cases where you want specific reviewer expertise you're still welcome to request review, but don't feel like you need to be blocked on a TB core team reviewer for changes like this one. |


When slicing by a numeric feature, aggregate values in intervals instead of creating a slice for each different value.
The current implementation creates one slice for each different value of the feature. This might result in a huge number of slices that become unmanageable (imagine a slicing by features like "age" or "salary", which can have hundreds of unique values).
To work around this issue, we aggregate the values into buckets (like a histogram), then show aggregated statistics and performance results for each bucket.
Nothing changes for categorical features.
For both multi and age demos, wait for data to load and then:
I opted for the equally-sized intervals (as opposed to the equally-populated ones, for example), mainly because they should be simpler to understand and because one of comparative columns is precisely the number of examples in each slice. The number of slices is controlled by a variable that could become a setting if needed.
As for notations, I also considered other a simple
a — b, but the mathematical intervals are particularly precise and seem to be accessible enough.Finally, generating the bucket edges beforehand was important to make the values look ok, since we would oftentimes get things like [17.43333333333, 20] printed otherwise. I tried to simply calculate them ad hoc, but having consistent and simple display turned out to be more complex than expected.
cc @jameswex and @tolga-b