I want to find out which interval of the input audio corresponds to each output token. For example, if the audio says "hello", I want a mapping such as:
- H --> [0.2, 0.3] (seconds)
- E --> [0.3, 0.7]
- L --> [0.7, 1.1]
- L --> [0.7, 1.1]
- O --> [1.1, 1.8]
I saw the attention images issue, which might help, since the encoder frames correspond roughly linearly to positions in the input audio. Still, I am not sure whether this is the right way to go or whether ESPnet has other tools for this. Furthermore, I would expect a single alignment map, so why is there more than one attention image (as shown in the issue above)?
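To make the idea concrete, here is a rough sketch of what I have in mind, not something taken from ESPnet. It assumes I already have one attention matrix of shape (output tokens, encoder frames), from a single head or averaged over heads, and that each encoder frame covers a fixed `sec_per_frame` seconds (frame shift times subsampling factor, which I would have to look up for my model). The function name `token_intervals` and all of its parameters are made up for illustration.

```python
import numpy as np

def token_intervals(att, tokens, sec_per_frame):
    """Turn one attention matrix into rough (token, start_sec, end_sec) tuples.

    att: array of shape (num_tokens, num_encoder_frames), assumed to come from
         a single attention head or an average over heads (my assumption).
    sec_per_frame: assumed duration of one encoder frame in seconds.
    """
    att = np.asarray(att, dtype=float)
    num_frames = att.shape[1]

    # Encoder frame that each output token attends to most strongly.
    peaks = att.argmax(axis=1)
    # Force the peaks to be non-decreasing so the intervals stay ordered.
    peaks = np.maximum.accumulate(peaks)

    # Use the centre of each peak frame and cut halfway between neighbours.
    centers = peaks + 0.5
    mids = (centers[:-1] + centers[1:]) / 2.0
    starts = np.concatenate(([0.0], mids))
    ends = np.concatenate((mids, [float(num_frames)]))

    return [
        (tok, float(s * sec_per_frame), float(e * sec_per_frame))
        for tok, s, e in zip(tokens, starts, ends)
    ]
```

With something like `token_intervals(att, list("HELLO"), 0.04)` I would get one interval per character. The boundaries are only as good as the attention peaks, so this is a heuristic rather than a real forced alignment.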
Thanks!