Transient intent addition #1343
It's a transient input because, like screen taps, it's not present all the time. The targetRaySpace represents the ray of the user's intent, for example where the user was looking at the start of the interaction; it shouldn't change over time. The gripSpace is for associated gestures, if applicable, and can be used for manipulating the intended target.
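A minimal TypeScript sketch of the behaviour described above, assuming a hypothetical platform that sees continuous gaze and pinch tracking. The `RawFrame` shape, the pinch trigger and `toTransientReports` are illustrative names, not WebXR API. The sketch only models the semantics: the source exists while the gesture is held, the target ray is captured once at the start, and the grip follows the hand.

```typescript
type Vec3 = [number, number, number];
type Quat = [number, number, number, number]; // x, y, z, w
interface Pose {
  position: Vec3;
  orientation: Quat;
}

// One frame of continuous tracking data seen by the platform (hypothetical).
interface RawFrame {
  pinching: boolean; // is the intent gesture held this frame?
  gaze: Pose; // where the user is looking right now
  pinchPoint: Pose; // where the fingers touch
}

// What a transient input source would expose for one frame.
interface TransientReport {
  targetRay: Pose; // fixed for the whole interaction
  grip: Pose; // follows the gesture
}

// One entry per frame; null means no interaction is in progress, so the
// transient source would not be present in the session's input sources.
function toTransientReports(frames: RawFrame[]): (TransientReport | null)[] {
  const reports: (TransientReport | null)[] = [];
  let capturedRay: Pose | null = null;
  for (const f of frames) {
    if (!f.pinching) {
      capturedRay = null; // interaction ended (or never started)
      reports.push(null);
      continue;
    }
    if (capturedRay === null) {
      capturedRay = f.gaze; // intent captured once, when the gesture begins
    }
    reports.push({ targetRay: capturedRay, grip: f.pinchPoint });
  }
  return reports;
}
```

Because nothing is reported between interactions, ambient gaze and hand motion never reach the page in this model.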
/tpac input discussion
Thank you, Ada, for writing this up! I have a couple of comments but am generally good with the direction of this proposal. Definitely worth discussion at TPAC!
Can we get some clarification on this? My reading is that the target ray would (in the presumed case of eye-tracking-based selection) follow the user's line of sight at the time of selection but stay static for the rest of the selection event chain (begin->end). In a system that uses a hand gesture to initiate the select event, though, the grip space would originate in the hand that did the gesture and follow it? That would allow developers to handle some fairly complex interactions by measuring relative movement of the grip space, but it would also fail for a number of existing drag-and-drop mechanisms (like the speakers in the Positional Audio sample), which generally ignore the grip space and instead track only the target ray. While I think that enabling the more expressive input is great, it would be good to have a fallback that more closely emulates existing behavior to avoid breaking existing content. Perhaps a target ray that is initialized based on eye tracking but then becomes head-locked for the duration of the select event. That way the user could at least do coarse manipulation with existing drag/drop interactions, like an incredibly expensive Google Cardboard. :) It's hard to state that in a way that doesn't become overly device-specific, but I think it's worth addressing.
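The head-locked fallback suggested above might be sketched like this: express the intent ray in viewer space when selection starts, then re-apply the live viewer pose every frame. The `Pose` type and all helper names are hypothetical; in a real session the inputs would come from `XRFrame.getViewerPose()` and `XRFrame.getPose()`.

```typescript
type Vec3 = [number, number, number];
type Quat = [number, number, number, number]; // x, y, z, w (unit quaternion)
interface Pose {
  position: Vec3;
  orientation: Quat;
}

// Hamilton product a * b (b's rotation is applied first, then a's).
function quatMul(a: Quat, b: Quat): Quat {
  const [ax, ay, az, aw] = a;
  const [bx, by, bz, bw] = b;
  return [
    aw * bx + ax * bw + ay * bz - az * by,
    aw * by - ax * bz + ay * bw + az * bx,
    aw * bz + ax * by - ay * bx + az * bw,
    aw * bw - ax * bx - ay * by - az * bz,
  ];
}

// Rotate vector v by unit quaternion q.
function rotate(q: Quat, v: Vec3): Vec3 {
  const [qx, qy, qz, qw] = q;
  const tx = 2 * (qy * v[2] - qz * v[1]);
  const ty = 2 * (qz * v[0] - qx * v[2]);
  const tz = 2 * (qx * v[1] - qy * v[0]);
  return [
    v[0] + qw * tx + (qy * tz - qz * ty),
    v[1] + qw * ty + (qz * tx - qx * tz),
    v[2] + qw * tz + (qx * ty - qy * tx),
  ];
}

// a ∘ b: take pose b (expressed relative to a) into a's parent space.
function compose(a: Pose, b: Pose): Pose {
  const r = rotate(a.orientation, b.position);
  return {
    position: [a.position[0] + r[0], a.position[1] + r[1], a.position[2] + r[2]],
    orientation: quatMul(a.orientation, b.orientation),
  };
}

// Inverse of a rigid transform (the conjugate inverts a unit quaternion).
function invert(p: Pose): Pose {
  const [x, y, z, w] = p.orientation;
  const inv: Quat = [-x, -y, -z, w];
  const t = rotate(inv, p.position);
  return { position: [-t[0], -t[1], -t[2]], orientation: inv };
}

// Capture the intent ray relative to the head when selection starts, then
// re-apply the live head pose each frame so the ray moves with the head.
function makeHeadLockedRay(viewerAtStart: Pose, rayAtStart: Pose): (viewerNow: Pose) => Pose {
  const rayInViewer = compose(invert(viewerAtStart), rayAtStart);
  return (viewerNow: Pose) => compose(viewerNow, rayInViewer);
}
```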
Yes, I think each time I drafted this proposal I picked a new name for the enum. Initially I focused on names related to gaze, since that was the primary focus for this particular use case, but the more we worked on the proposal, the more this input type seemed a better fit for any generic input that can't or shouldn't give continuous input. As well as the inputs from assistive technology mentioned in the spec, it would be a good fit for brain control interfaces, or even for standard tracked inputs that want to run in a mode that further reduces fingerprinting based on ambient user motion.
My honest hope here is that we aren't the only platform that uses it as their default input mechanism, because it would be a really good fit for exposing a simpler model for assistive input technology than full emulation of tracked pointers or hands. But it shouldn't be used on its own just for a11y, otherwise it will reveal that assistive technology is being used.
This is a great point that needs further discussion because it can potentially compromise user privacy. Apple's implementation is privacy-preserving in that websites and apps do not know what the user is looking at until an interaction is started. This includes Safari itself: as a user's gaze moves around the page, the system highlights what the user is looking at with a glow effect (“gaze-glow”), but this information is not available to the page. Apps can declare interactive regions, and the OS itself provides the visual feedback to the user showing which interactive regions are currently being gazed at. I've been thinking that something like gaze-glow could work by utilising the layers API to add a new layer type, so that developers could declare their interactive regions and the OS could then provide the highlights for them, but that would need to be in a different repo and not in this PR.
From my experimentation on the Vision Pro with a prototype implementation of this proposal, no cursor has been required to perform interactions: you pinch and the thing you are looking at gets interacted with. Even small hit targets, such as chess pieces on a regulation-size chess board viewed from a standing or sitting position, are large enough that selecting pieces doesn't present any issues. The developer could provide a cursor themselves when the interaction starts, to show what was initially selected and let the user release the pinch if the wrong thing was targeted or they change their mind about the interaction.
In my more advanced demos I implement cursor retargeting as a developer: I calculate the point that is hit in my scene, then each frame modify that point by the inverse of the starting pose and apply the current pose, so that when the “selectend” event is fired, the event can be ignored if a ray through the new target location no longer connects. This can also be used for doing my own hover effects in the client.
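A sketch of the retargeting approach described above. It assumes the "starting pose" is the gripSpace pose at `selectstart` (the target ray itself stays static) and approximates the selected object by a bounding sphere; `retargetPoint` and `rayHitsSphere` are hypothetical helpers, not part of any library.

```typescript
type Vec3 = [number, number, number];
type Quat = [number, number, number, number]; // x, y, z, w (unit quaternion)
interface Pose {
  position: Vec3;
  orientation: Quat;
}

// Rotate vector v by unit quaternion q.
function rotate(q: Quat, v: Vec3): Vec3 {
  const [qx, qy, qz, qw] = q;
  const tx = 2 * (qy * v[2] - qz * v[1]);
  const ty = 2 * (qz * v[0] - qx * v[2]);
  const tz = 2 * (qx * v[1] - qy * v[0]);
  return [
    v[0] + qw * tx + (qy * tz - qz * ty),
    v[1] + qw * ty + (qz * tx - qx * tz),
    v[2] + qw * tz + (qx * ty - qy * tx),
  ];
}

// Map a point from the pose's local space into its parent space.
function transformPoint(p: Pose, v: Vec3): Vec3 {
  const r = rotate(p.orientation, v);
  return [r[0] + p.position[0], r[1] + p.position[1], r[2] + p.position[2]];
}

// Inverse of a rigid transform (the conjugate inverts a unit quaternion).
function invert(p: Pose): Pose {
  const [x, y, z, w] = p.orientation;
  const inv: Quat = [-x, -y, -z, w];
  const t = rotate(inv, p.position);
  return { position: [-t[0], -t[1], -t[2]], orientation: inv };
}

// Carry the originally hit point along with the gesture: undo the grip pose
// at selectstart, then apply this frame's grip pose.
function retargetPoint(hitAtStart: Vec3, gripAtStart: Pose, gripNow: Pose): Vec3 {
  return transformPoint(gripNow, transformPoint(invert(gripAtStart), hitAtStart));
}

// At selectend: does a ray from `origin` through `through` still hit the target?
// The target is a sphere; an origin inside the sphere is not handled here.
function rayHitsSphere(origin: Vec3, through: Vec3, center: Vec3, radius: number): boolean {
  const d: Vec3 = [through[0] - origin[0], through[1] - origin[1], through[2] - origin[2]];
  const len = Math.hypot(d[0], d[1], d[2]);
  if (len === 0) return false;
  const oc: Vec3 = [center[0] - origin[0], center[1] - origin[1], center[2] - origin[2]];
  // Distance along the ray to the point of closest approach to the center.
  const t = (oc[0] * d[0] + oc[1] * d[1] + oc[2] * d[2]) / len;
  if (t < 0) return false; // center is behind the ray origin
  const distSq = oc[0] * oc[0] + oc[1] * oc[1] + oc[2] * oc[2] - t * t;
  return distSq <= radius * radius;
}
```

A small drift of the hand keeps the ray on the object, while a large one moves it off, so the `selectend` can be ignored.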
Yeah that’s correct. For example if the gesture was initiated by a pinch, the grip space would be set to the point the fingers connect.
The tricky thing is that what the pose really wants to modify, for the best experience, is the point where the selection ray intersects the scene geometry, so we can't do smart things like this without a depth buffer (ideally a depth buffer of just interactable objects), whereas right now it works without one.
I would definitely prefer something based on the gripSpace if it is available, with viewerSpace as a fallback; locking it to the head feels a little weird when the input can naturally be modulated by the attached hand pose. Perhaps, if gripSpace is available, modulating the direction of the targetRaySpace based on a point that sits one meter out from the user along the targetRaySpace would be acceptable.
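One way to read the gripSpace-based modulation suggested above: take the point one meter out along the original target ray, move it by the grip's change in pose since `selectstart`, and aim the ray from the original origin through the moved point. This is a hypothetical sketch with illustrative names; it relies only on the WebXR convention that a target ray points along its space's -Z axis.

```typescript
type Vec3 = [number, number, number];
type Quat = [number, number, number, number]; // x, y, z, w (unit quaternion)
interface Pose {
  position: Vec3;
  orientation: Quat;
}

// Rotate vector v by unit quaternion q.
function rotate(q: Quat, v: Vec3): Vec3 {
  const [qx, qy, qz, qw] = q;
  const tx = 2 * (qy * v[2] - qz * v[1]);
  const ty = 2 * (qz * v[0] - qx * v[2]);
  const tz = 2 * (qx * v[1] - qy * v[0]);
  return [
    v[0] + qw * tx + (qy * tz - qz * ty),
    v[1] + qw * ty + (qz * tx - qx * tz),
    v[2] + qw * tz + (qx * ty - qy * tx),
  ];
}

// Map a point from the pose's local space into its parent space.
function transformPoint(p: Pose, v: Vec3): Vec3 {
  const r = rotate(p.orientation, v);
  return [r[0] + p.position[0], r[1] + p.position[1], r[2] + p.position[2]];
}

// Inverse of a rigid transform (the conjugate inverts a unit quaternion).
function invert(p: Pose): Pose {
  const [x, y, z, w] = p.orientation;
  const inv: Quat = [-x, -y, -z, w];
  const t = rotate(inv, p.position);
  return { position: [-t[0], -t[1], -t[2]], orientation: inv };
}

// Direction of the modulated ray for this frame. The ray origin stays fixed;
// only the anchor point `distance` meters along the original ray moves with the grip.
function modulatedRayDirection(rayAtStart: Pose, gripAtStart: Pose, gripNow: Pose, distance = 1): Vec3 {
  const o = rayAtStart.position;
  const dir = rotate(rayAtStart.orientation, [0, 0, -1]); // WebXR target rays point along -Z
  const anchor: Vec3 = [o[0] + dir[0] * distance, o[1] + dir[1] * distance, o[2] + dir[2] * distance];
  const moved = transformPoint(gripNow, transformPoint(invert(gripAtStart), anchor));
  const v: Vec3 = [moved[0] - o[0], moved[1] - o[1], moved[2] - o[2]];
  const len = Math.hypot(v[0], v[1], v[2]);
  if (len === 0) return dir; // degenerate: keep the original direction
  return [v[0] / len, v[1] / len, v[2] / len];
}
```

The one-meter anchor is what keeps the effect proportionate: moving the hand 10 cm swings the ray by roughly 10 cm at that distance, regardless of where the target actually is.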
I guess I'm not surprised that Safari is treated like any other app in this regard, but I hadn't considered that limitation previously.
For both this and accessibility reasons, yeah. It would be great to explore some sort of object segmentation system! (Agreed that it's a separate PR, though. Maybe a new TPAC subject?) I'd be interested in learning more about how native apps communicate gaze-glow regions to the OS. I'm only familiar with the new CSS APIs that Safari is adding to facilitate it. It would be great if we could figure out a "best effort" method that gave some basic, but possibly imprecise, feedback to the user by default that the developer could then improve with more explicit input.
This is good to hear! I'm a little wary of looking at one platform's high-quality implementation, though, and generalizing to "no user feedback is necessary". Again, sounds like we need further discussion.
Completely agree that the best experience is one that takes the platform norms into account. In this case I'm really just thinking about how best to support apps that were built well before this mode of interaction was devised, and ensuring they have some level of functionality, even if it's not ideal. "Awkward" is better than "broken".
I'd love to hear your conclusions from any prototyping that you do in this area! It's going to be tricky, but I'm optimistic that there's a way to facilitate both the ideal use cases and fallbacks. I also want to make sure that we don't accidentally encourage developers to design apps that ONLY work for a certain type of input, whether that be gaze, hand, or controller. I think our "select" system still does an admirable job at that, but, as with the deprecation discussion, it's probably worth taking another look and making sure our initial assumptions hold up.
I don't think I am prepared to talk about it yet
On an interface that isn't as precise, developers could implement a cursor themselves; I am just saying that for the Vision Pro use case I have found it unnecessary. That said, I think any cursor system would require the WebXR scene to provide a sensible depth map to show the object the cursor is hitting.
I definitely want it to be good enough that it actually works well. The worst-case scenario is that, because it works well enough, developers rely on the updated pose of the targetRaySpace even though it's just a best guess, when attaching the object to the gripSpace would give a really good experience.
It's a bit odd to add a transient input source. Is the intent that this input source is constantly reported, or only right after a system gesture (and only once)? I think this use case would be better solved with an event. Maybe it could even report the space that the user was focusing on at the moment the system gesture happened.
It is reported for the whole duration of the interaction, so potentially for many seconds for a more complex gesture. From my tests it seems to work well in existing frameworks.
TPAC Feedback:
Thanks for this proposal @AdaRoseCannon !
@cabanier my hope is that developers don't need to make any changes to support this specially. So I am not quite sure what I would put in a sample.
Here are some of the proposed enum values; if you have more, please mention them, as it would be good to settle on an appropriate name:
I want to push back a little against "gaze". Although, for the very specific use case of the Vision Pro, this static targetRay is based on the user's gaze direction at the moment the interaction starts, my intention is that it can be used for any interaction that has a momentary intention.
I prefer
I suspect that authors will want to treat this new input source differently, so there will be separate paths for each type.
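If authors do keep separate paths, a dispatch on `targetRayMode` might look like this hypothetical sketch. It uses "transient-pointer", the name the group leaned towards, and applies the preferences voiced in this thread: drag with the grip for transient input (viewer as a fallback) and with the target ray otherwise. `InputSourceLike` and `DragSpace` are illustrative stand-ins, not WebXR types.

```typescript
// "gaze", "tracked-pointer" and "screen" are existing XRTargetRayMode values;
// "transient-pointer" is the name favoured for the new mode in this thread.
type TargetRayMode = "gaze" | "tracked-pointer" | "screen" | "transient-pointer";

// Hypothetical, simplified stand-in for an XRInputSource.
interface InputSourceLike {
  targetRayMode: TargetRayMode;
  hasGripSpace: boolean;
}

// Which space should drive a drag or a held object for this source.
type DragSpace = "grip" | "viewer" | "target-ray";

function dragSpaceFor(source: InputSourceLike): DragSpace {
  switch (source.targetRayMode) {
    case "transient-pointer":
      // The target ray is a static snapshot of intent, so follow the gesture
      // through the grip when there is one, falling back to the viewer.
      return source.hasGripSpace ? "grip" : "viewer";
    case "tracked-pointer":
    case "screen":
    case "gaze":
      // Existing drag-and-drop content typically follows the target ray.
      return "target-ray";
  }
}
```

The switch is exhaustive over the union, so adding another mode later makes the compiler flag this function as incomplete.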
The "poses may be reported" algorithm previously indicated that poses could only be reported if the visibility state was "visible", but that appears to be a mistake originating from the fact that "visible-blurred" was added later. The description of "visible-blurred" mentioned throttled head poses, so the intention was clearly to allow some poses in that state, just not input poses. This change alters the "poses may be reported" algorithm to suppress poses only when the visibility state is "hidden", and adds an explicit step to the "populate the pose" algorithm that prevents input poses from being returned if the visibility state is "visible-blurred".
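The net effect of that change can be modelled as two predicates. This is a simplified sketch that looks only at the visibility state and ignores any other conditions the spec algorithms may check; the function names are illustrative.

```typescript
type VisibilityState = "visible" | "visible-blurred" | "hidden";

// "Poses may be reported" after the change: only "hidden" suppresses poses.
function posesMayBeReported(state: VisibilityState): boolean {
  return state !== "hidden";
}

// The extra step in "populate the pose": input poses are also withheld in
// "visible-blurred", while (possibly throttled) viewer poses may still flow.
function inputPoseMayBeReported(state: VisibilityState): boolean {
  return posesMayBeReported(state) && state !== "visible-blurred";
}
```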
I changed the PR itself to say that
and that the gripSpace if it's not something that is otherwise defined
Apologies, Ada. I had an action item to post alternate names and dropped it on the floor. The list you put up covered my suggestions, though. Thanks! And after thinking about it some more I agree with Rik that
@Manishearth we likely want to wait until @AdaRoseCannon makes a PR with the renamed targetRayMode.
Yeah, I will talk about the changes with the group and we’ll pick a name. Then it’s all good hopefully.
The feel of the room is for "transient-pointer".
I'm happy for this to be merged if everyone else is :)
I'm going to start working on a sample explicitly highlighting that, for this input type, things that should be attached to the hand should use gripSpace, which would be the main unexpected side effect of this particular situation.
SHA: 47d14b6 Reason: push, by cabanier Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This PR adds a new enum value for inputs that don't represent any actual physical device but instead represent a user's intent, derived by the operating system from other sources.
The enum "transient-intent" is certainly up for debate.