
Getting Started with DeepSpeech on AWS

Recently, I’ve been working on a project using Python + DeepSpeech. In this article, I’ll share some considerations for setting up this type of project on AWS, including which instance types to look at. It took quite a bit of trial and error to figure out which one would work best!

What Is DeepSpeech?

DeepSpeech is a speech-to-text engine + model. In other words, it comes with everything you need to get started transcribing audio files to text.

It comes with Python bindings and a client, which you can use as a command-line utility or as an example of how to write your own Python program that uses DeepSpeech.
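As a rough sketch of the bindings in use (this assumes the 0.9.x release and placeholder file names; older releases have a different Model constructor):

    # Minimal sketch: transcribe a 16,000 Hz mono WAV with the DeepSpeech
    # 0.9.x Python bindings. Model, scorer, and audio paths are placeholders.
    import wave

    import numpy as np
    from deepspeech import Model

    model = Model("deepspeech-0.9.3-models.pbmm")
    model.enableExternalScorer("deepspeech-0.9.3-models.scorer")  # optional

    with wave.open("audio.wav", "rb") as wav:
        assert wav.getframerate() == 16000, "model expects 16,000 Hz audio"
        frames = wav.readframes(wav.getnframes())

    audio = np.frombuffer(frames, dtype=np.int16)
    print(model.stt(audio))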

There are some limitations: the model requires WAV audio at 16,000 Hz. The client can use SoX to resample to 16,000 Hz if required, but it’s up to you to make sure the file is in the WAV format. My project uses Pydub to handle preprocessing the audio files.
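For example, here’s the kind of Pydub preprocessing I mean, as a sketch with placeholder file names (Pydub needs ffmpeg available to read non-WAV formats):

    # Sketch: normalize arbitrary input audio to the 16,000 Hz, 16-bit,
    # mono WAV that the DeepSpeech model expects.
    from pydub import AudioSegment

    audio = AudioSegment.from_file("input.mp3")
    audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    audio.export("output.wav", format="wav")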

DeepSpeech on AWS?

There are a few considerations when putting a DeepSpeech project on AWS EC2. At minimum, you need the right CPU and enough memory.

The main requirement of the CPU is that it support the AVX instruction set, which rules out several instance types. Even on instances that do support it, you need to make sure to use an HVM AMI in order to access it.
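If you’re not sure whether a given machine qualifies, here’s a quick sanity check; it just reads /proc/cpuinfo, so it assumes a Linux instance:

    # Sketch: check for the AVX flag on Linux. Without AVX, the prebuilt
    # DeepSpeech packages typically crash with an illegal instruction error.
    with open("/proc/cpuinfo") as f:
        flags = next((line for line in f if line.startswith("flags")), "")
    print("AVX available" if " avx" in flags else "no AVX on this CPU")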

Beyond that, it’s helpful to note that DeepSpeech will only use one core of the CPU, so an instance with a lot of cores only helps if you transcribe multiple files in parallel.
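As a sketch of what that parallelism might look like (again assuming the 0.9.x bindings and placeholder file names):

    # Sketch: work around the single-core limit by running one DeepSpeech
    # model per worker process. Each worker loads its own Model, since the
    # model object is not shareable across processes.
    import multiprocessing as mp
    import wave

    import numpy as np
    from deepspeech import Model

    _model = None  # per-process model, set by the initializer

    def init_worker():
        global _model
        _model = Model("deepspeech-0.9.3-models.pbmm")

    def transcribe(path):
        with wave.open(path, "rb") as wav:
            frames = wav.readframes(wav.getnframes())
        audio = np.frombuffer(frames, dtype=np.int16)
        return path, _model.stt(audio)

    if __name__ == "__main__":
        files = ["a.wav", "b.wav", "c.wav"]
        with mp.Pool(processes=2, initializer=init_worker) as pool:
            for path, text in pool.imap_unordered(transcribe, files):
                print(path, "->", text)

Note that each worker loads its own copy of the model, so memory usage scales with the number of processes.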

Memory requirements also matter, and they depend on the size of your audio files. If you’re only transcribing small files, memory shouldn’t be an issue. Working with 30-45 minute recordings, though, has required some workarounds to keep memory usage reasonable, especially during preprocessing.
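As a sketch of one such workaround, with an arbitrary chunk length and placeholder file names, you can slice a long recording with Pydub and feed the model one chunk at a time:

    # Sketch: keep memory bounded by slicing a long recording into
    # one-minute chunks with Pydub and transcribing each chunk separately,
    # instead of handing the model the whole file at once.
    import numpy as np
    from deepspeech import Model
    from pydub import AudioSegment

    model = Model("deepspeech-0.9.3-models.pbmm")
    audio = AudioSegment.from_wav("long-recording.wav")  # 16 kHz, 16-bit, mono
    chunk_ms = 60_000  # one-minute chunks; tune to your memory budget

    pieces = []
    for start in range(0, len(audio), chunk_ms):  # len() is in milliseconds
        chunk = audio[start:start + chunk_ms]
        samples = np.frombuffer(chunk.raw_data, dtype=np.int16)
        pieces.append(model.stt(samples))

    print(" ".join(pieces))

Naive fixed-length chunking can split a word across a boundary, so in practice you’d want to cut on silence or overlap the chunks slightly.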

So, what instance type am I using? Right now, it’s a t3.small. It has the right kind of CPU and enough memory to do the preprocessing and transcribe small chunks at a time. However, I would need more memory to transcribe a large audio file straight through.

If I were putting this in production, though, I think I would split preprocessing and transcription, putting the former on a C5 instance and the latter on either a C5 or a P3 instance, after testing to see which works best for the requirements.

After picking an instance, installation is fairly easy. Just follow the instructions and it should work fine.

So, AWS experts, did I overlook an option that would suit DeepSpeech even better? Let me know!

–Harrison
