TinyML - 4: Speech Recognition
Project: Micro-speech Recognition
Phase 2: Deploy to a Microcontroller
A command recognizer: recognize what people said ("yes" / "no").
[Diagram: training vs. inference. Training: .wav data → FFT → features → train the Command Recognizer model. Inference: .wav data → FFT → features → trained model → "Yes". https://bit.ly/2XBdE4q]
Overall flow of this project
[Diagram: overall pipeline. ADC → PCM → FFT and pre-processing (audio_provider → feature_provider: PopulateFeatureData, copy into input tensor) → CNN model: Interpreter Invoke() → softmax → output tensor (silence, unknown, yes, no) → RecognizeCommands::ProcessLatestResults → RespondToCommand]
The audio features themselves are a two-dimensional array, made up of horizontal slices representing the frequencies at one point in time, stacked on top of each other to form a spectrogram showing how those frequencies changed over time.
How to get audio features?
Fourier Transform on sound
Frequencies in sound
The magnitude spectrum of the signal
A magnitude spectrogram is a visualization of the frequencies in sound over time, and it can be useful as a feature for neural network recognition of noise or speech.
Examine the spectrogram "audio images"
The audio spectrum represents the audio features
You can see how the 30-ms sample window is moved forward by 20 ms each time until it has covered the full one-second sample.
[Figure: the 49 × 40 feature buffer covering 1 second of audio]
We combine the results of running the FFT on 49 consecutive 30-ms slices of audio, and this is passed into the model. Each FFT row represents a 30-ms sample of audio split into 40 frequency buckets.
Number of slices = int((length − window_size) / stride) + 1 = int((1000 − 30) / 20) + 1 = 49; the last 30-ms window ends at 30 + 48 × 20 = 990 ms.
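As a quick sanity check, here is a minimal C++ sketch of that slice arithmetic (illustrative only, not part of the example code):

// Slice arithmetic for a 30 ms window stepped by 20 ms over 1 s of audio.
#include <cstdio>

int main() {
  const int length_ms = 1000;  // total audio length
  const int window_ms = 30;    // FFT window size
  const int stride_ms = 20;    // step between windows
  const int num_slices = (length_ms - window_ms) / stride_ms + 1;    // 49
  const int last_end_ms = window_ms + (num_slices - 1) * stride_ms;  // 990
  std::printf("%d slices, last window ends at %d ms\n", num_slices, last_end_ms);
  return 0;
}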
Running an FFT across each 30-ms section of the audio sample data.
Audio Recognition Model (CNN Model)
The CNN model treats 1 second of audio as a 49 × 40 "pixel" image and classifies it into four categories: silence, unknown, yes, no.
Our Model
Input: a 1-second audio spectrogram (49 × 40); tensor shape (1, 49, 40, 1), type int8 (−128 to 127); size (1 × 49 × 40) × 1 byte = 1960 bytes.
Output: tensor shape (1, 4), type int8; indices 0–3 map to silence, unknown, yes, no.
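A minimal C++ sketch of this I/O layout (illustrative, not the example's code; the label order follows kCategoryLabels in the micro_speech example):

// Model I/O layout: int8 input spectrogram, four int8 output scores.
#include <cstdio>

constexpr int kFeatureSliceCount = 49;  // time slices (rows)
constexpr int kFeatureSliceSize = 40;   // frequency buckets (columns)
constexpr int kCategoryCount = 4;
const char* kCategoryLabels[kCategoryCount] = {"silence", "unknown", "yes", "no"};

int main() {
  // int8 input: one byte per element, so (1 x 49 x 40 x 1) = 1960 bytes.
  const int input_bytes = 1 * kFeatureSliceCount * kFeatureSliceSize * 1;
  std::printf("input (1, %d, %d, 1) int8 -> %d bytes\n",
              kFeatureSliceCount, kFeatureSliceSize, input_bytes);
  for (int i = 0; i < kCategoryCount; ++i) {
    std::printf("output[%d] -> %s\n", i, kCategoryLabels[i]);
  }
  return 0;
}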
tensorflow/lite/micro/examples/micro_speech
Project File Structure
main_functions.cc  main program of the TensorFlow Lite framework
recognize_commands.cc  post-processes the inference results
micro_features/model.cc  the TFLite model
XXXX_test.cc  files whose names end in _test.cc are test programs that can be run on the development host
arduino, sparkfun_edge, zephyr_riscv, ...  folders containing hardware-specific files; if TARGET=XXX is given at build time, the files in that folder replace the default ones
├── sparkfun_edge
|   ├── command_responder.cc
|   └── audio_provider.cc  (provides GetAudioSamples())
├── micro_features  (provides GenerateMicroFeatures())
Project Flow
ADC → PCM (sparkfun_edge/audio_provider.cc: GetAudioSamples()) → GenerateMicroFeatures() performs the FFT and returns the audio frequency information → audio spectrum over a 1-second window (kFeatureSliceCount = 49, kFeatureSliceSize = 40, kFeatureElementCount = 49 × 40) → model input, filled by FeatureProvider::PopulateFeatureData() in feature_provider.cc.
The Feature Provider (main_functions.cc, feature_provider.cc)
The feature provider converts raw audio, obtained from the audio provider, into spectrograms that can be fed into our model. It is called during the main loop.
FeatureProvider::PopulateFeatureData(): fills the feature data with information from audio inputs, and returns how many feature slices were updated.
PopulateFeatureData() (feature_provider.cc)
Each call covers one second of audio, but the FFT does not need to be recomputed for the whole window every time: only the newly arrived audio slices need their FFT computed, which saves computation and time, as sketched below.
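A minimal sketch of that incremental update (a hypothetical simplification, not the example's exact code; constant names follow micro_features/micro_model_settings.h):

// Keep the 49-slice feature buffer, shift out expired slices, and run
// the FFT only for slices that are new since the last call.
#include <algorithm>
#include <cstdint>
#include <cstring>

constexpr int kFeatureSliceCount = 49;     // slices per 1-second window
constexpr int kFeatureSliceSize = 40;      // frequency buckets per slice
constexpr int kFeatureSliceStrideMs = 20;  // window step

int8_t feature_data[kFeatureSliceCount * kFeatureSliceSize];

void UpdateFeatures(int32_t last_time_ms, int32_t current_time_ms) {
  const int new_slices = std::min<int>(
      (current_time_ms - last_time_ms) / kFeatureSliceStrideMs,
      kFeatureSliceCount);
  const int kept_slices = kFeatureSliceCount - new_slices;
  // Shift the still-valid slices toward the start of the buffer.
  std::memmove(feature_data, feature_data + new_slices * kFeatureSliceSize,
               kept_slices * kFeatureSliceSize);
  for (int slice = kept_slices; slice < kFeatureSliceCount; ++slice) {
    // Only the new slices pay for an FFT: in the real code this is where
    // GetAudioSamples() and GenerateMicroFeatures() fill the slice.
    std::memset(feature_data + slice * kFeatureSliceSize, 0, kFeatureSliceSize);
  }
}

int main() {
  UpdateFeatures(0, 40);  // 40 ms elapsed -> only 2 new slices recomputed
  return 0;
}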
PopulateFeatureData() (feature_provider.cc), over the 1-second window: for each new slice, it first requests audio for that slice from the audio provider using GetAudioSamples(), and then calls GenerateMicroFeatures() to perform the FFT and return the audio frequency information.
[Diagram: within the 1-second window, each audio_samples block (audio_samples_size: 512) is run through the FFT and written into feature_data_ (feature_provider.cc; constants in micro_features/micro_model_settings.h)]
The Audio Provider (sparkfun_edge/audio_provider.cc)
GetAudioSamples() is expected to return an array of 14-bit pulse-code-modulated (PCM) audio data.
[Diagram: the audio_samples buffer (size: 512) feeding the FFT at 20 ms, 40 ms, 60 ms, 80 ms, 100 ms, ...]
Digital audio format: 14-bit PCM (pulse-code modulation), kAudioSampleFrequency = 16 kHz → 16,000 samples per second = 16 samples per millisecond.
Generating the Sample Rate for the ADC (audio_provider.cc)
Timer trigger frequency: am_hal_ctimer_period_set(3, AM_HAL_CTIMER_TIMERA, 750, 0);
12 MHz / 750 = 16 kHz (sampling rate).
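A quick check of that arithmetic (illustrative sketch, not from the example code):

// Timer A3 at 12 MHz triggering the ADC every 750 ticks gives 16 kHz.
#include <cstdio>

int main() {
  const int timer_clock_hz = 12000000;  // Timer A3 clock per the slide
  const int period_ticks = 750;         // am_hal_ctimer_period_set(..., 750, 0)
  const int sample_rate_hz = timer_clock_hz / period_ticks;  // 16000
  std::printf("sample rate: %d Hz = %d samples per ms\n",
              sample_rate_hz, sample_rate_hz / 1000);
  return 0;
}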
[Block diagram: MIC0/MIC1 → GPIO29/ADC1 and GPIO11/ADC2 → 14-bit ADC (clocked at 12 MHz, set up in repeat-scan mode, triggered periodically by Timer A3) → FIFO (slot number + sample data) → DMA → 32K SRAM]
The microphones feed GPIO29/ADC1 and GPIO11/ADC2. The channel-select bit field specifies which one of the analog multiplexer channels will be used for the conversions requested for an individual slot. When each active slot obtains a sample from the ADC, it is added to the value in its accumulator, and all slots write their accumulated results to the FIFO.
sparkfun_edge/audio_provider.cc: buffer flow
Audio data is transferred by DMA (via ui32TargetAddress) into the double buffers g_ui32ADCSampleBuffer0 and g_ui32ADCSampleBuffer1, each of kAdcSampleBufferSize = 2 slots × 1024 samples per slot. When the ADC interrupt occurs, each entry of ADC data (slot 1 + slot 2; fields ui32Slot and ui32Sample) is copied into the circular capture buffer: g_audio_capture_buffer[g_audio_capture_buffer_start] = temp.ui32Sample;. The capture buffer holds 16,000 samples (one second). GetAudioSamples(int start_ms, int duration_ms) then copies duration_ms worth of PCM audio (e.g. 30 ms) from g_audio_capture_buffer into g_audio_output_buffer (512 samples).
GetAudioSamples() copies the span from start_ms to start_ms + duration_ms out of g_audio_capture_buffer (16,000 samples) into g_audio_output_buffer[kMaxAudioSampleSize], where kMaxAudioSampleSize = 512 (a power of two); see the sketch below.
Time-stamp calculation: each time the ISR fires, the time stamp is incremented by 1; after 16 ISRs, 16 × 1000 samples have been read in total, i.e. roughly one second of audio at 16 kHz.
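A sketch of how GetAudioSamples() maps a time span onto the circular capture buffer (buffer sizes taken from the slides; the real function also takes an error reporter and returns a TfLiteStatus, so treat this as an approximation):

#include <algorithm>
#include <cstdint>

constexpr int kAudioSampleFrequency = 16000;    // samples per second
constexpr int kAudioCaptureBufferSize = 16000;  // 1 second of audio
constexpr int kMaxAudioSampleSize = 512;        // power of two

int16_t g_audio_capture_buffer[kAudioCaptureBufferSize];
int16_t g_audio_output_buffer[kMaxAudioSampleSize];

void GetAudioSamples(int start_ms, int duration_ms,
                     int* audio_samples_size, int16_t** audio_samples) {
  const int start_offset = start_ms * (kAudioSampleFrequency / 1000);
  const int duration_count = std::min(
      duration_ms * (kAudioSampleFrequency / 1000), kMaxAudioSampleSize);
  for (int i = 0; i < duration_count; ++i) {
    // Wrap around the ring buffer.
    const int capture_index = (start_offset + i) % kAudioCaptureBufferSize;
    g_audio_output_buffer[i] = g_audio_capture_buffer[capture_index];
  }
  *audio_samples_size = duration_count;  // 480 samples for a 30 ms request
  *audio_samples = g_audio_output_buffer;
}

int main() {
  int size = 0;
  int16_t* samples = nullptr;
  GetAudioSamples(100, 30, &size, &samples);  // 30 ms starting at t = 100 ms
  return size == 480 ? 0 : 1;
}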
One problem: audio is live streaming.
[Figure: only part of the word "yes" is captured in our window, so a single inference may miss the command.]
[Diagram: CNN model → Interpreter Invoke() → softmax → output tensor (silence, unknown, yes, no) → RecognizeCommands::ProcessLatestResults → RespondToCommand]
RecognizeCommands (recognize_commands.cc) averages the results over time; its behavior is controlled by:
- the length of the averaging window (average_window_duration_ms)
- the minimum average score that counts as a detection (detection_threshold)
- the amount of time we'll wait after hearing a command before recognizing a second one (suppression_ms)
- the minimum number of inferences required in the window for a result to count (3)
A sketch of this averaging idea follows the list.
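A hypothetical sketch of the averaging idea behind RecognizeCommands::ProcessLatestResults() (parameter names from the list above; scores are rescaled to 0–255 for illustration, and suppression_ms is omitted for brevity):

#include <array>
#include <cstdint>
#include <deque>

constexpr int kCategoryCount = 4;
const char* kCategoryLabels[kCategoryCount] = {"silence", "unknown", "yes", "no"};

struct InferenceResult {
  int32_t time_ms;                             // when the inference ran
  std::array<uint8_t, kCategoryCount> scores;  // softmax output per class
};

class RecognizeCommandsSketch {
 public:
  RecognizeCommandsSketch(int32_t average_window_duration_ms = 1000,
                          uint8_t detection_threshold = 200,
                          int32_t minimum_count = 3)
      : window_ms_(average_window_duration_ms),
        threshold_(detection_threshold),
        minimum_count_(minimum_count) {}

  // Returns the detected label, or "silence"/"unknown" when no command
  // clears the averaged threshold.
  const char* ProcessLatestResults(const InferenceResult& latest) {
    results_.push_back(latest);
    // Drop inferences that have fallen out of the averaging window.
    while (!results_.empty() &&
           latest.time_ms - results_.front().time_ms > window_ms_) {
      results_.pop_front();
    }
    if (static_cast<int32_t>(results_.size()) < minimum_count_) {
      return "silence";  // too few inferences in the window to decide
    }
    // Average each category's score across the window; keep the best.
    int best_index = 0;
    int32_t best_average = -1;
    for (int c = 0; c < kCategoryCount; ++c) {
      int32_t sum = 0;
      for (const InferenceResult& r : results_) sum += r.scores[c];
      const int32_t average = sum / static_cast<int32_t>(results_.size());
      if (average > best_average) {
        best_average = average;
        best_index = c;
      }
    }
    return best_average >= threshold_ ? kCategoryLabels[best_index] : "unknown";
  }

 private:
  std::deque<InferenceResult> results_;
  int32_t window_ms_;
  uint8_t threshold_;
  int32_t minimum_count_;
};

int main() { return 0; }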
Hands-on
Generate the flashing image micro_speech_wire.bin and write it to the board.
https://drive.google.com/drive/folders/1FhkMDQ5xZoQS8GLkPZJPoVvT3dD3pk3g
Study: tensorflow/lite/micro/examples/micro_speech
main_functions.cc
feature_provider.cc
recognize_commands.cc
/sparkfun_edge/command_responder.cc
Demo: open a terminal (baud rate: 115200 bps). After connecting the SparkFun Edge to power over USB, you will see a blue LED blinking constantly, which means the board is waiting for voice input; the terminal will output the following messages.