Connection between output tokens and input audio · Issue #4278 · espnet/espnet · GitHub

@Daniel-asr

I want to find out which interval of the input audio corresponds to each output token. For example, if the audio contains the word "hello", I want a mapping such as:

  • H --> [0.2, 0.3] (seconds)
  • E --> [0.3, 0.7]
  • L --> [0.7, 1.1]
  • L --> [0.7, 1.1]
  • O --> [1.1, 1.8]

I saw this attention images issue, which might help, since the encoder frames are linearly related to the input audio timing. Still, I am not sure whether this is the right way to go or whether ESPnet has other tools for this. Furthermore, I expect a single alignment mapping, so why is there more than one attention image (as shown in the issue above)?
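
To make the linear relation concrete, here is a minimal sketch of how per-token encoder-frame spans could be turned into the time intervals I am after. It is plain Python and does not call any ESPnet API. The 10 ms frame shift, the 4x encoder subsampling factor, the function names, and the frame spans in the example are all assumptions for illustration, not values taken from a real model.

```python
from typing import List, Tuple


def frames_to_seconds(
    start_frame: int,
    end_frame: int,
    hop_ms: float = 10.0,   # assumed feature frame shift in milliseconds
    subsampling: int = 4,   # assumed encoder subsampling factor
) -> Tuple[float, float]:
    """Convert an encoder-frame span to a (start, end) interval in seconds."""
    ms_per_encoder_frame = hop_ms * subsampling
    return (
        start_frame * ms_per_encoder_frame / 1000.0,
        end_frame * ms_per_encoder_frame / 1000.0,
    )


def map_tokens_to_intervals(
    tokens: List[str],
    frame_spans: List[Tuple[int, int]],
    hop_ms: float = 10.0,
    subsampling: int = 4,
) -> List[Tuple[str, Tuple[float, float]]]:
    """Pair each token with its time interval, given one frame span per token."""
    if len(tokens) != len(frame_spans):
        raise ValueError("need exactly one frame span per token")
    return [
        (token, frames_to_seconds(start, end, hop_ms, subsampling))
        for token, (start, end) in zip(tokens, frame_spans)
    ]


if __name__ == "__main__":
    # Hypothetical frame spans; in practice these would come from an
    # alignment step (e.g. attention weights or a forced alignment).
    spans = [(5, 8), (8, 18), (18, 23), (23, 28), (28, 45)]
    for token, (start, end) in map_tokens_to_intervals(list("HELLO"), spans):
        print(f"{token} --> [{start:.2f}, {end:.2f}]")
```

The open question is how to obtain the per-token frame spans in the first place; the conversion to seconds is the easy part.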

Thanks!
