
Voice-Control for Bittle

By abhi


Petoi's Bittle is a palm-sized, open-source, programmable robot dog for STEM and fun. Bittle can connect to a Raspberry Pi and is easy to extend.

Use Python to record

I used PyAudio at first, but it is an old library, so I switched to Sounddevice and Soundfile.
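
A minimal recording sketch with the two replacement libraries (the sample rate, duration, and file name are illustrative, not the project's actual values):

```python
import sounddevice as sd
import soundfile as sf

FS = 16000       # 16 kHz mono is a good fit for speech models
SECONDS = 3      # length of one command recording (illustrative)

# sd.rec() returns a NumPy array; sd.wait() blocks until recording ends.
audio = sd.rec(int(SECONDS * FS), samplerate=FS, channels=1)
sd.wait()
sf.write("command.wav", audio, FS)
```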

Command/Key Words Recognition

From a functional point of view, the approaches fall into two categories:

  • Speech to text, then look up the commands in the transcript. The upside is that this combines well with other NLP applications, but it is overkill when all you need is command spotting.
  • Analyze acoustic features directly to detect commands.

DTW (Dynamic Time Warping) (Used)

This belongs to the second category and is similar to template matching. DTW computes the cost of aligning a piece of audio with a template recording; we pick the template with the lowest cost. The method needs no training and still works when new commands are added. The downside is that the computation is time-consuming, but command recordings are short, and we can trim the silence and extract MFCC (Mel-frequency cepstral coefficient) features to reduce the work.
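
A sketch of the idea using librosa's built-in DTW (the commands and file names are illustrative, not the project's actual templates):

```python
import librosa

def mfcc(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def dtw_cost(query, template):
    # D is the accumulated cost matrix; its bottom-right cell holds the
    # total cost of the best alignment between the two MFCC sequences.
    D, _ = librosa.sequence.dtw(X=query, Y=template)
    return D[-1, -1]

# Pick the template with the lowest alignment cost.
query = mfcc("recording.wav")
templates = {"sit": mfcc("sit.wav"), "walk": mfcc("walk.wav")}
best = min(templates, key=lambda name: dtw_cost(query, templates[name]))
print(best)
```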

CNN for Command/Key Word Recognition

This is demoed in Speech Command Recognition with torchaudio (PyTorch Tutorials), an official PyTorch example. The drawback is that the model must be retrained whenever new commands are added.
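
For a sense of the approach, here is a much-simplified sketch in the tutorial's spirit (this is not the tutorial's M5 network; all layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Simplified keyword-spotting CNN operating on raw 16 kHz waveforms.
class KeywordCNN(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=16),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # The output layer is sized to the command set, which is why
        # adding a new command means retraining the model.
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                    # x: (batch, 1, samples)
        return self.fc(self.features(x).squeeze(-1))

logits = KeywordCNN(n_classes=10)(torch.randn(1, 1, 16000))
```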

Voice Activity Detection

I was inspired by the blog post Audio Handling Basics: Process Audio Files In Command-Line or Python (Hacker Noon). It shows that the silent parts of a recording can be removed based on the short-term energy of the audio signal, and the Python library librosa provides functions for this.
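
One way to do this trimming with librosa (the 30 dB threshold is an assumption to tune; librosa's split thresholds amplitude in dB, a close cousin of the short-term-energy approach the blog describes):

```python
import numpy as np
import librosa

y, sr = librosa.load("command.wav", sr=16000)

# Keep only the intervals within 30 dB of the peak amplitude;
# quieter stretches are treated as silence and dropped.
intervals = librosa.effects.split(y, top_db=30)
voiced = np.concatenate([y[start:end] for start, end in intervals])
```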

Speech Recognition

I tried some open-source methods:

  • mozilla/DeepSpeech: offline recognition; provides lightweight tflite models for low-resource devices.
  • SeanNaren/deepspeech.pytorch: no lightweight models; at nearly 900 MB, the models are too big for a Raspberry Pi.
  • Uberi/speech_recognition: wraps multiple backends, such as the Google and Microsoft APIs, but its only offline backend is no longer maintained.
  • alphacep/vosk (Used): provides offline recognition and lightweight models for both English and Chinese, though the documentation is incomplete. A usage sketch follows this list.
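
A minimal offline-recognition sketch with Vosk (the audio file name is illustrative, and the model path must match the folder you unzip into ./models):

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Path must match the folder unzipped into ./models.
model = Model("models/vosk-model-small-en-us")

wf = wave.open("command.wav", "rb")          # 16-bit mono PCM works best
rec = KaldiRecognizer(model, wf.getframerate())

# Feed the audio in chunks; Vosk accumulates partial results internally.
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])
```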

Prepare environment

Simple Version (Run on Pi):

bash rpi_simple.sh

Complete Manual:

  1. Open a terminal on the Pi and update the system: sudo apt-get update && sudo apt-get upgrade
  2. Install portaudio: sudo apt-get install make build-essential libportaudio0 libportaudio2 libportaudiocpp0 portaudio19-dev libatlas-base-dev
  3. Create a new virtual environment with python==3.7.3 and activate it.
  4. Install SciPy: download the pre-built wheel file, then run pip install your_filename
  5. Install librosa:
    1. Deactivate the environment.
    2. sudo apt install libblas-dev llvm llvm-dev
    3. export LLVM_CONFIG=/usr/bin/llvm-config
    4. Reactivate the environment.
    5. Download and unzip the librosa source code, cd into the directory, and install it from source (e.g. pip install .).
  6. Install the remaining dependencies: pip install -r requirements_pi.txt
  7. Download the Vosk models vosk-model-small-en-us and vosk-model-small-cn, then move the unzipped folders into ./models.

Run

  1. cd into my_vosk and run: python main.py
  2. Fine-tune the threshold for wake-word recognition (see the config sketch after this list). In config.yml the value is currently 0 to make debugging easy.
  3. After Petoi wakes up, the program enters command recognition. The pre-defined commands are in cmd_lookup.py.
  4. To use the Chinese model, unzip it into ./models and set vosk_model_path in config.yml accordingly.
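
For reference, a minimal sketch of reading these settings (vosk_model_path is named in the steps above; the threshold key name is a guess, so check config.yml for the real one):

```python
import yaml  # PyYAML

with open("config.yml") as f:
    cfg = yaml.safe_load(f)

model_path = cfg["vosk_model_path"]       # e.g. models/vosk-model-small-cn
wake_threshold = cfg.get("threshold", 0)  # hypothetical key; 0 = debug default
```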

Logic

(A demo video gives an overview of the overall control flow.)

πŸ“¦

Files

SmilngRobo

SmilingRobo, a platfrom of opensource robotics

Opensource Robotics Platform with opensource tools and resources. We are on a journey to advance and democratize robotics through opensource.