AI/ML has made it possible to analyse audio for business value realization.
Audio analytics has applications in the following areas:
- Audio captured and analysed at kiosks gives insights into customer engagement and sentiment on products and services across the sales cycle.
- Audio analytics at events (such as sports events and street fairs) and in transportation services provides a way to understand the demographics and sentiment of attendees.
Machine Learning is reasonably effective at ASR (Automatic Speech Recognition), KWS (keyword spotting), and gender and spoken-language identification. Speaker diarisation is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. Diarised speech, together with ASR-based speech-to-text conversion and natural-language identification, provides the inputs required for analysis. The analytics are based on the following ML-based inferences (a wiring sketch follows the list):
- Use natural-language and gender identification mechanisms to understand the demographics of people visiting kiosks or events.
- Use audio to understand positive and negative emotions.
- Use speaker diarisation to understand the influencers, for example, the gender most influential in making product purchases at kiosks.
- Speech-to-text provides inputs for text-based analytics as well.
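To make the wiring concrete, the sketch below runs two of the inferences above (speech-to-text and spoken-language identification) with off-the-shelf Hugging Face pipelines; the checkpoint names and the clip path are illustrative placeholders, not the exact production models.

```python
from transformers import pipeline

# Off-the-shelf checkpoints chosen for illustration; the fine-tuned models
# described under Model Architectures below would be swapped in here.
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
lang_id = pipeline("audio-classification", model="facebook/mms-lid-126")

clip = "kiosk_segment_001.wav"  # a 16 kHz mono slice from the captured stream

transcript = asr(clip)["text"]        # speech-to-text for text-based analytics
language = lang_id(clip)[0]["label"]  # top spoken-language prediction

print(language, transcript)
```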
Audio Analytics Application Logical Flow
The on-prem application collects audio streams, slices them into segments, and feeds the segments onward for processing. Model inference and further processing of the inferred data make it suitable for analytics. The generated data is then fed to analytics applications to derive actionable business information. Continual learning is part of the application deployment. A sketch of the slicing step follows.
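A minimal sketch of the capture-and-slice step, assuming fixed-length windows over a recorded stream (the window length and file name are illustrative):

```python
import soundfile as sf

WINDOW_SEC = 5.0  # assumed slice length; the document does not fix one

def slice_stream(path: str, window_sec: float = WINDOW_SEC):
    """Cut a captured audio file into fixed-length mono segments for inference."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:              # down-mix multi-channel capture
        audio = audio.mean(axis=1)
    hop = int(window_sec * sr)
    for start in range(0, len(audio), hop):
        segment = audio[start:start + hop]
        if len(segment) == hop:     # drop the trailing partial window
            yield segment, sr

for segment, sr in slice_stream("captured_stream.wav"):
    pass  # run model inference, post-process, forward results to analytics
```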
Datasets
To start with, datasets from the public domain are used for training. Sample datasets are listed below, with a loading sketch after the list.
- facebook/voxpopuli
- mozilla-foundation/common_voice
- facebook/multilingual_librispeech
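A sketch of loading these datasets with the Hugging Face `datasets` library; the language configs and split are illustrative choices.

```python
from datasets import load_dataset

# Config names are illustrative; Common Voice additionally needs a versioned
# name (e.g. mozilla-foundation/common_voice_11_0) and acceptance of its terms.
voxpopuli = load_dataset("facebook/voxpopuli", "en", split="train")
mls = load_dataset("facebook/multilingual_librispeech", "german", split="train")

sample = voxpopuli[0]
print(sample["audio"]["sampling_rate"], sample["gender"])
```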
As a continual improvement, live input data is curated and fed back for further training.
Model Architectures
We created our solution based on the Wav2Vec2 and ResNet10 model architectures.
Fine-tuning Wav2Vec2 improved accuracy, though the model is larger and inference is slower. Because the Wav2Vec2 architecture first learns from unlabelled data and is then fine-tuned on a small labelled set, it is easy and flexible to adapt it to multiple audio inference tasks.
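A minimal sketch of the fine-tuning setup, assuming a two-class gender head on the pre-trained encoder; the checkpoint, label scheme, and dummy clip are illustrative.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

checkpoint = "facebook/wav2vec2-base"
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Freeze the convolutional feature encoder; only the transformer layers and
# the small classification head are updated on the labelled set.
model.freeze_feature_encoder()

waveform = np.random.randn(16_000).astype(np.float32)  # 1 s stand-in clip
batch = extractor([waveform], sampling_rate=16_000, return_tensors="pt", padding=True)
labels = torch.tensor([1])  # e.g. 0 = male, 1 = female

loss = model(**batch, labels=labels).loss
loss.backward()  # one illustrative gradient step; a real run uses a training loop
```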
On the other hand, ResNet10-based models are less accurate but much smaller and far faster at inference.
Support Vector Machine and Random Forest based models are helpful for some early decisions and for data preprocessing.
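As an illustration of the classical models' role, the sketch below trains a Random Forest on cheap MFCC summary features for an early gating decision; the feature choice and the "contains speech" framing are assumptions, not the documented setup.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def mfcc_features(waveform: np.ndarray, sr: int = 16_000) -> np.ndarray:
    """Cheap fixed-length summary features for the classical models."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Placeholder clips and labels; in practice these come from the curated data,
# and the decision (e.g. "segment contains speech") is one of the early gates.
rng = np.random.default_rng(0)
waveforms = [rng.standard_normal(16_000).astype(np.float32) for _ in range(8)]
y = [0, 1] * 4

X = np.stack([mfcc_features(w) for w in waveforms])
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```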
We combine the models to build our final application.
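The document does not spell out the composition, but one plausible way to combine the models is a confidence cascade: the small ResNet10 answers first, and the fine-tuned Wav2Vec2 is invoked only when the fast model is unsure. Here `resnet10_predict` and `wav2vec2_predict` are hypothetical wrappers around the two models.

```python
CONFIDENCE_GATE = 0.9  # assumed threshold; tune on held-out data

def classify(segment):
    """Cascade: cheap ResNet10 first, the larger Wav2Vec2 only when unsure."""
    label, score = resnet10_predict(segment)   # hypothetical fast path, ~28 ms/clip
    if score >= CONFIDENCE_GATE:
        return label
    return wav2vec2_predict(segment)           # hypothetical slower, more accurate path
```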
Model Performance Analysis
Model performance is given below as a sample for one of the analytics: gender classification.
| Model | Model Size | Test Data | Inference Time (No Batching) | Inference Accuracy |
|---|---|---|---|---|
| ResNet10 | 677k params | facebook/voxpopuli | 28 ms | 95.2% |
| facebook/wav2vec2-base (fine-tuned) | 95 mil. params | facebook/voxpopuli | 1.2 sec | 95.9% |
| facebook/wav2vec2-xls-r-300m (fine-tuned) | 315 mil. params | facebook/voxpopuli | 1.2 sec | 97.6% |
Approach for Continual Improvement
To continually improve performance, some x% of the production samples are analysed for correctness using tools and manual methods. The corrected data is fed back for training and inference. This continual learning mechanism is built into the application.
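A minimal sketch of the sampling side of this loop, assuming a review queue that feeds tool-assisted and manual correction; the fraction, queue, and function names are illustrative.

```python
import random

REVIEW_FRACTION = 0.05  # stands in for the "x%" of production samples

def route_for_review(sample_id: str, prediction: str, review_queue) -> None:
    """Divert a random slice of production inferences for correctness checks."""
    if random.random() < REVIEW_FRACTION:
        review_queue.put((sample_id, prediction))  # hypothetical correction queue

# Corrected (audio, label) pairs returned by the reviewers are appended to the
# training set and picked up by the next fine-tuning round.
```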