Connection between output tokens and input audio

I want to find out which interval of the input audio corresponds to each output token. For example, if the audio says "hello", I want a mapping such as:
- H --> [0.2, 0.3] (seconds)
- E --> [0.3, 0.7]
- L --> [0.7, 1.1]
- L --> [0.7, 1.1]
- O --> [1.1, 1.8]

I saw this [attention images issue](https://github.com/espnet/espnet/issues/1254), which might help, since there is a roughly linear relationship between input audio time and encoder frame index. Still, I am not sure whether this is the right way to go, or whether ESPnet has other tools for this. Furthermore, I expect a single alignment, so why is there more than one attention image (as shown in the issue above)?
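To make the idea concrete, here is a rough sketch of what I have in mind. It is not ESPnet code: `attention_to_intervals` is a hypothetical helper, and the 10 ms frame shift and 4x encoder subsampling are assumptions that would have to match the actual model configuration. It takes a single attention matrix (one row per output token, one column per encoder frame), picks the peak encoder frame for each token, and converts frame indices to seconds.

```python
from typing import List, Tuple


def attention_to_intervals(
    attention: List[List[float]],
    tokens: List[str],
    frame_shift_s: float = 0.01,  # assumption: 10 ms feature hop
    subsampling: int = 4,         # assumption: encoder subsampling factor
) -> List[Tuple[str, float, float]]:
    """Map each output token to a rough (start, end) interval in seconds."""
    if len(attention) != len(tokens):
        raise ValueError("need exactly one attention row per token")

    seconds_per_enc_frame = frame_shift_s * subsampling

    # Index of the encoder frame with the highest weight for each token.
    peaks = [max(range(len(row)), key=row.__getitem__) for row in attention]

    intervals = []
    for i, (token, peak) in enumerate(zip(tokens, peaks)):
        # A token ends where the next token peaks; it always spans
        # at least one encoder frame (this also covers the last token).
        next_peak = peaks[i + 1] if i + 1 < len(peaks) else peak + 1
        end_frame = max(next_peak, peak + 1)
        start_s = round(peak * seconds_per_enc_frame, 3)
        end_s = round(end_frame * seconds_per_enc_frame, 3)
        intervals.append((token, start_s, end_s))
    return intervals


# Toy attention matrix: 3 tokens over 4 encoder frames.
toy_attention = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]
toy_intervals = attention_to_intervals(toy_attention, ["H", "E", "O"])
```

This only works if the attention is close to monotonic, and it needs a single matrix as input, so I would still have to decide which of the several attention images to use (or how to combine them).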

Thanks!
