TinyML - 4: Speech Recognition
Project: Micro-speech Recognition
Phase 2: Deploy to a Microcontroller
A command recognizer: recognize what people said ("yes" / "no").
[Diagram: training vs. inference. Training: .wav data → FFT → features → train the Command Recognizer model. Inference: .wav data → FFT → features → trained model → "Yes". https://bit.ly/2XBdE4q]
Overall flow of this project
[Diagram: overall pipeline. ADC → PCM → FFT and pre-processing (audio_provider → feature_provider: PopulateFeatureData, copy into input tensor) → CNN model: Interpreter Invoke() → softmax → output tensor (silence, unknown, yes, no) → RecognizeCommands::ProcessLatestResults → RespondToCommand]
The audio features themselves are a two-dimensional array, made up of horizontal slices representing the frequencies at one point in time, stacked on top of each other to form a spectrogram showing how those frequencies changed over time.
How to get audio features?
Fourier Transform on sound
Frequencies in sound
The magnitude spectrum of the signal
A magnitude spectrogram is a visualization of the frequencies in sound over time, and it can be useful as a feature for neural network recognition of noise or speech.
Examine the spectrogram "audio images"
The audio spectrum represents the audio features
You can see how the 30-ms sample window is moved forward by 20 ms each time until it has covered the full one-second sample.
[Figure: the 49 × 40 feature buffer covering 1 second of audio]
We combine the results of running the FFT on 49 consecutive 30-ms slices of audio, and this is passed into the model. Each FFT row represents a 30-ms sample of audio split into 40 frequency buckets.
Number of slices = int((length − window_size) / stride) + 1 = int((1000 − 30) / 20) + 1 = 49; the last 30-ms window ends at 30 + 48 × 20 = 990 ms.
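As a quick sanity check, here is a minimal C++ sketch of that slice arithmetic (illustrative only, not part of the example code):

// Slice arithmetic for a 30 ms window stepped by 20 ms over 1 s of audio.
#include <cstdio>

int main() {
  const int length_ms = 1000;  // total audio length
  const int window_ms = 30;    // FFT window size
  const int stride_ms = 20;    // step between windows
  const int num_slices = (length_ms - window_ms) / stride_ms + 1;    // 49
  const int last_end_ms = window_ms + (num_slices - 1) * stride_ms;  // 990
  std::printf("%d slices, last window ends at %d ms\n", num_slices, last_end_ms);
  return 0;
}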
Running an FFT across each 30-ms section of the audio sample data.
Audio Recognition Model (CNN Model)
The CNN model treats 1 second of audio as a 49 × 40 "pixel" image and classifies it into four categories: silence, unknown, yes, no.
Our Model
Input: a 1-second audio spectrogram (49 × 40); tensor shape (1, 49, 40, 1), type int8 (−128 to 127); size (1 × 49 × 40) × 1 byte = 1960 bytes.
Output: tensor shape (1, 4), type int8; indices 0–3 map to silence, unknown, yes, no.
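A minimal C++ sketch of this I/O layout (illustrative, not the example's code; the label order follows kCategoryLabels in the micro_speech example):

// Model I/O layout: int8 input spectrogram, four int8 output scores.
#include <cstdio>

constexpr int kFeatureSliceCount = 49;  // time slices (rows)
constexpr int kFeatureSliceSize = 40;   // frequency buckets (columns)
constexpr int kCategoryCount = 4;
const char* kCategoryLabels[kCategoryCount] = {"silence", "unknown", "yes", "no"};

int main() {
  // int8 input: one byte per element, so (1 x 49 x 40 x 1) = 1960 bytes.
  const int input_bytes = 1 * kFeatureSliceCount * kFeatureSliceSize * 1;
  std::printf("input (1, %d, %d, 1) int8 -> %d bytes\n",
              kFeatureSliceCount, kFeatureSliceSize, input_bytes);
  for (int i = 0; i < kCategoryCount; ++i) {
    std::printf("output[%d] -> %s\n", i, kCategoryLabels[i]);
  }
  return 0;
}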
tensorflow/lite/micro/examples/micro_speech
Project File Structure
main_functions.cc  main program of the TensorFlow Lite framework
recognize_commands.cc  post-processes the inference results
micro_features/model.cc  the TFLite model
XXXX_test.cc  files whose names end in _test.cc are test programs that can be run on the development host
arduino, sparkfun_edge, zephyr_riscv, ...  folders containing hardware-specific files; if TARGET=XXX is given at build time, the files in that folder replace the default ones
├── sparkfun_edge
|   ├── command_responder.cc
|   └── audio_provider.cc  (provides GetAudioSamples())
├── micro_features  (provides GenerateMicroFeatures())
Project Flow
ADC → PCM (sparkfun_edge/audio_provider.cc: GetAudioSamples()) → GenerateMicroFeatures() performs the FFT and returns the audio frequency information → audio spectrum over a 1-second window (kFeatureSliceCount = 49, kFeatureSliceSize = 40, kFeatureElementCount = 49 × 40) → model input, filled by FeatureProvider::PopulateFeatureData() in feature_provider.cc.
The Feature Provider (main_functions.cc, feature_provider.cc)
The feature provider converts raw audio, obtained from the audio provider, into spectrograms that can be fed into our model. It is called during the main loop.
FeatureProvider::PopulateFeatureData(): fills the feature data with information from audio inputs, and returns how many feature slices were updated.
PopulateFeatureData() (feature_provider.cc)
Each call covers one second of audio, but the FFT does not need to be recomputed for the whole window every time: only the newly arrived audio slices need their FFT computed, which saves computation and time, as sketched below.
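A minimal sketch of that incremental update (a hypothetical simplification, not the example's exact code; constant names follow micro_features/micro_model_settings.h):

// Keep the 49-slice feature buffer, shift out expired slices, and run
// the FFT only for slices that are new since the last call.
#include <algorithm>
#include <cstdint>
#include <cstring>

constexpr int kFeatureSliceCount = 49;     // slices per 1-second window
constexpr int kFeatureSliceSize = 40;      // frequency buckets per slice
constexpr int kFeatureSliceStrideMs = 20;  // window step

int8_t feature_data[kFeatureSliceCount * kFeatureSliceSize];

void UpdateFeatures(int32_t last_time_ms, int32_t current_time_ms) {
  const int new_slices = std::min<int>(
      (current_time_ms - last_time_ms) / kFeatureSliceStrideMs,
      kFeatureSliceCount);
  const int kept_slices = kFeatureSliceCount - new_slices;
  // Shift the still-valid slices toward the start of the buffer.
  std::memmove(feature_data, feature_data + new_slices * kFeatureSliceSize,
               kept_slices * kFeatureSliceSize);
  for (int slice = kept_slices; slice < kFeatureSliceCount; ++slice) {
    // Only the new slices pay for an FFT: in the real code this is where
    // GetAudioSamples() and GenerateMicroFeatures() fill the slice.
    std::memset(feature_data + slice * kFeatureSliceSize, 0, kFeatureSliceSize);
  }
}

int main() {
  UpdateFeatures(0, 40);  // 40 ms elapsed -> only 2 new slices recomputed
  return 0;
}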
PopulateFeatureData() (feature_provider.cc), over the 1-second window: for each new slice, it first requests audio for that slice from the audio provider using GetAudioSamples(), and then calls GenerateMicroFeatures() to perform the FFT and return the audio frequency information.
[Diagram: within the 1-second window, each audio_samples block (audio_samples_size: 512) is run through the FFT and written into feature_data_ (feature_provider.cc; constants in micro_features/micro_model_settings.h)]
The Audio Provider (sparkfun_edge/audio_provider.cc)
GetAudioSamples() is expected to return an array of 14-bit pulse-code-modulated (PCM) audio data.
[Diagram: the audio_samples buffer (size: 512) feeding the FFT at 20 ms, 40 ms, 60 ms, 80 ms, 100 ms, ...]
Digital audio format: 14-bit PCM (pulse-code modulation), kAudioSampleFrequency = 16 kHz → 16,000 samples per second = 16 samples per millisecond.
Generating the Sample Rate for the ADC (audio_provider.cc)
Timer trigger frequency: am_hal_ctimer_period_set(3, AM_HAL_CTIMER_TIMERA, 750, 0);
12 MHz / 750 = 16 kHz (sampling rate).
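A quick check of that arithmetic (illustrative sketch, not from the example code):

// Timer A3 at 12 MHz triggering the ADC every 750 ticks gives 16 kHz.
#include <cstdio>

int main() {
  const int timer_clock_hz = 12000000;  // Timer A3 clock per the slide
  const int period_ticks = 750;         // am_hal_ctimer_period_set(..., 750, 0)
  const int sample_rate_hz = timer_clock_hz / period_ticks;  // 16000
  std::printf("sample rate: %d Hz = %d samples per ms\n",
              sample_rate_hz, sample_rate_hz / 1000);
  return 0;
}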
[Block diagram: MIC0/MIC1 → GPIO29/ADC1 and GPIO11/ADC2 → 14-bit ADC (clocked at 12 MHz, set up in repeat-scan mode, triggered periodically by Timer A3) → FIFO (slot number + sample data) → DMA → 32K SRAM]
The microphones feed GPIO29/ADC1 and GPIO11/ADC2. The channel-select bit field specifies which one of the analog multiplexer channels will be used for the conversions requested for an individual slot. When each active slot obtains a sample from the ADC, it is added to the value in its accumulator, and all slots write their accumulated results to the FIFO.
sparkfun_edge/audio_provider.cc: buffer flow
Audio data is transferred by DMA (via ui32TargetAddress) into the double buffers g_ui32ADCSampleBuffer0 and g_ui32ADCSampleBuffer1, each of kAdcSampleBufferSize = 2 slots × 1024 samples per slot. When the ADC interrupt occurs, each entry of ADC data (slot 1 + slot 2; fields ui32Slot and ui32Sample) is copied into the circular capture buffer: g_audio_capture_buffer[g_audio_capture_buffer_start] = temp.ui32Sample;. The capture buffer holds 16,000 samples (one second). GetAudioSamples(int start_ms, int duration_ms) then copies duration_ms worth of PCM audio (e.g. 30 ms) from g_audio_capture_buffer into g_audio_output_buffer (512 samples).
GetAudioSamples() copies the span from start_ms to start_ms + duration_ms out of g_audio_capture_buffer (16,000 samples) into g_audio_output_buffer[kMaxAudioSampleSize], where kMaxAudioSampleSize = 512 (a power of two); see the sketch below.
Time-stamp calculation: each time the ISR fires, the time stamp is incremented by 1; after 16 ISRs, 16 × 1000 samples have been read in total, i.e. roughly one second of audio at 16 kHz.
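A sketch of how GetAudioSamples() maps a time span onto the circular capture buffer (buffer sizes taken from the slides; the real function also takes an error reporter and returns a TfLiteStatus, so treat this as an approximation):

#include <algorithm>
#include <cstdint>

constexpr int kAudioSampleFrequency = 16000;    // samples per second
constexpr int kAudioCaptureBufferSize = 16000;  // 1 second of audio
constexpr int kMaxAudioSampleSize = 512;        // power of two

int16_t g_audio_capture_buffer[kAudioCaptureBufferSize];
int16_t g_audio_output_buffer[kMaxAudioSampleSize];

void GetAudioSamples(int start_ms, int duration_ms,
                     int* audio_samples_size, int16_t** audio_samples) {
  const int start_offset = start_ms * (kAudioSampleFrequency / 1000);
  const int duration_count = std::min(
      duration_ms * (kAudioSampleFrequency / 1000), kMaxAudioSampleSize);
  for (int i = 0; i < duration_count; ++i) {
    // Wrap around the ring buffer.
    const int capture_index = (start_offset + i) % kAudioCaptureBufferSize;
    g_audio_output_buffer[i] = g_audio_capture_buffer[capture_index];
  }
  *audio_samples_size = duration_count;  // 480 samples for a 30 ms request
  *audio_samples = g_audio_output_buffer;
}

int main() {
  int size = 0;
  int16_t* samples = nullptr;
  GetAudioSamples(100, 30, &size, &samples);  // 30 ms starting at t = 100 ms
  return size == 480 ? 0 : 1;
}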
One problem: audio is live streaming.
[Figure: only part of the word "yes" is captured in our window, so a single inference may miss the command.]
[Diagram: CNN model → Interpreter Invoke() → softmax → output tensor (silence, unknown, yes, no) → RecognizeCommands::ProcessLatestResults → RespondToCommand]
RecognizeCommands (recognize_commands.cc) averages the results over time; its behavior is controlled by:
- the length of the averaging window (average_window_duration_ms)
- the minimum average score that counts as a detection (detection_threshold)
- the amount of time we'll wait after hearing a command before recognizing a second one (suppression_ms)
- the minimum number of inferences required in the window for a result to count (3)
A sketch of this averaging idea follows the list.
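A hypothetical sketch of the averaging idea behind RecognizeCommands::ProcessLatestResults() (parameter names from the list above; scores are rescaled to 0–255 for illustration, and suppression_ms is omitted for brevity):

#include <array>
#include <cstdint>
#include <deque>

constexpr int kCategoryCount = 4;
const char* kCategoryLabels[kCategoryCount] = {"silence", "unknown", "yes", "no"};

struct InferenceResult {
  int32_t time_ms;                             // when the inference ran
  std::array<uint8_t, kCategoryCount> scores;  // softmax output per class
};

class RecognizeCommandsSketch {
 public:
  RecognizeCommandsSketch(int32_t average_window_duration_ms = 1000,
                          uint8_t detection_threshold = 200,
                          int32_t minimum_count = 3)
      : window_ms_(average_window_duration_ms),
        threshold_(detection_threshold),
        minimum_count_(minimum_count) {}

  // Returns the detected label, or "silence"/"unknown" when no command
  // clears the averaged threshold.
  const char* ProcessLatestResults(const InferenceResult& latest) {
    results_.push_back(latest);
    // Drop inferences that have fallen out of the averaging window.
    while (!results_.empty() &&
           latest.time_ms - results_.front().time_ms > window_ms_) {
      results_.pop_front();
    }
    if (static_cast<int32_t>(results_.size()) < minimum_count_) {
      return "silence";  // too few inferences in the window to decide
    }
    // Average each category's score across the window; keep the best.
    int best_index = 0;
    int32_t best_average = -1;
    for (int c = 0; c < kCategoryCount; ++c) {
      int32_t sum = 0;
      for (const InferenceResult& r : results_) sum += r.scores[c];
      const int32_t average = sum / static_cast<int32_t>(results_.size());
      if (average > best_average) {
        best_average = average;
        best_index = c;
      }
    }
    return best_average >= threshold_ ? kCategoryLabels[best_index] : "unknown";
  }

 private:
  std::deque<InferenceResult> results_;
  int32_t window_ms_;
  uint8_t threshold_;
  int32_t minimum_count_;
};

int main() { return 0; }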
Hands-on
Generate the flashing image micro_speech_wire.bin and write it to the board.
https://drive.google.com/drive/folders/1FhkMDQ5xZoQS8GLkPZJPoVvT3dD3pk3g
Study: tensorflow/lite/micro/examples/micro_speech
main_functions.cc
feature_provider.cc
recognize_commands.cc
/sparkfun_edge/command_responder.cc
Demo: open a terminal (baud rate: 115200 bps). After connecting the SparkFun Edge to power over USB, you will see a blue LED blinking constantly, which means the board is waiting for voice input; the terminal will output the following messages.