AI/ML has made it possible to analyse audio and realise business value from it

Audio analytics has applications in the following areas:

  1. Audio captured and analysed at kiosks gives insight into customer engagement and sentiment about products and services across the product sales cycle.
  2. Audio analytics at events (such as sports events and street fairs) and in transportation services provides a way to understand the demographics and sentiment of attendees.

[Figure: everapptech audio analytics]

Machine Learning is reasonably effective at ASR (Automatic Speech Recognition), KWS (Keyword Spotting), and gender and spoken-language identification. Speaker diarisation is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. Diarised speech, together with ASR-based speech-to-text conversion and natural-language identification, provides the inputs required for analysis. The analytics are based on the ML inferences listed below.
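As an illustration of these building blocks, the sketch below wires up ASR and spoken-language identification with the Hugging Face transformers pipeline API. The checkpoint names and the input file are illustrative assumptions, not necessarily what the production solution uses.

```python
# Minimal sketch of two core inferences: speech-to-text (ASR) and
# spoken-language identification. Checkpoints are public Hugging Face
# models chosen for illustration.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")
lang_id = pipeline("audio-classification",
                   model="facebook/mms-lid-126")

audio_file = "kiosk_clip.wav"  # hypothetical audio slice from a kiosk

transcript = asr(audio_file)["text"]          # speech-to-text
language = lang_id(audio_file)[0]["label"]    # top-scoring language
print(language, "->", transcript)
```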

  1. Use natural-language and gender identification to understand the demographics of people visiting kiosks or events.
  2. Use audio to detect positive and negative emotions.
  3. Use speaker diarisation to identify influencers, for example the gender most influential in product purchase decisions at kiosks (see the sketch after this list).
  4. Speech-to-text output also feeds text-based analytics.
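For item 3, one plausible way to surface influencers is to total each diarised speaker's floor time; the sketch below does this with the pyannote.audio diarisation pipeline. The checkpoint name is an assumption, and the pretrained pipeline may require a Hugging Face access token.

```python
# Sketch: attribute speaking time to diarised speakers so dominant
# voices (candidate influencers) can be identified. Cross-referencing
# with gender/language labels then yields demographic insight.
from collections import defaultdict
from pyannote.audio import Pipeline

# May require use_auth_token=<HF token> depending on the checkpoint
dia = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = dia("kiosk_clip.wav")  # hypothetical audio slice

talk_time = defaultdict(float)  # seconds of speech per speaker label
for turn, _, speaker in diarization.itertracks(yield_label=True):
    talk_time[speaker] += turn.end - turn.start

influencer = max(talk_time, key=talk_time.get)
print(influencer, f"{talk_time[influencer]:.1f}s on the floor")
```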

Audio Analytics Application Logical Flow

The on-prem application collects audio streams, slices them into segments, and feeds the slices downstream for processing (a sketch of the slicing step follows). Model inference and post-processing of the inferred data make it suitable for analytics; the resulting data is fed to analytics applications to derive actionable business information. Continual learning is part of the application deployment.
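A minimal sketch of the slicing step, assuming 16 kHz mono PCM input and a hypothetical 5-second window; both values are assumptions for illustration.

```python
# Slice a captured audio buffer into fixed-length windows that can be
# fed to the inference models downstream.
import numpy as np

SAMPLE_RATE = 16_000   # assumed capture rate (Hz)
SLICE_SECONDS = 5      # assumed window length

def slice_stream(samples: np.ndarray):
    """Yield fixed-length slices of a 1-D audio buffer."""
    step = SAMPLE_RATE * SLICE_SECONDS
    for start in range(0, len(samples) - step + 1, step):
        yield samples[start:start + step]

# Example: 60 seconds of silence stands in for a captured stream
stream = np.zeros(SAMPLE_RATE * 60, dtype=np.float32)
print(sum(1 for _ in slice_stream(stream)), "slices")  # 12 slices
```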

[Figure: application flow]

Datasets

To start with, datasets from the public domain are used for training. Sample datasets used are listed below.

  1. facebook/voxpopuli
  2. mozilla-foundation/common_voice
  3. facebook/multilingual_librispeech

As a continual improvement, live input data is curated and fed back for further training.
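For illustration, the first corpus above can be pulled in with the Hugging Face datasets library as sketched below; the config, split, and streaming flag are assumptions, and exact loading arguments can vary across datasets versions.

```python
# Stream the English portion of VoxPopuli rather than downloading the
# whole corpus up front.
from datasets import load_dataset

voxpopuli = load_dataset("facebook/voxpopuli", "en",
                         split="train", streaming=True)

sample = next(iter(voxpopuli))
audio = sample["audio"]  # dict with raw array and sampling rate
print(audio["sampling_rate"], len(audio["array"]))
```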

Model Architectures

We built our solution on the Wav2Vec2 and ResNet10 model architectures.
Fine-tuning Wav2Vec2 improved accuracy, but the model is large and inference is slow. Because Wav2Vec2 is first pretrained on unlabelled audio and then fine-tuned on a small labelled set, it is easy and flexible to adapt to multiple audio inference tasks. ResNet10-based models, in contrast, are less accurate but much smaller and far faster at inference.
Support Vector Machine and Random Forest based models help with some early decisions and with data preprocessing. We combine these models to build the final application.
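A minimal sketch of the Wav2Vec2 fine-tuning setup for one such task, gender classification: a pretrained backbone with a small classification head. The checkpoint, labels, and dummy input are illustrative assumptions rather than the exact production configuration.

```python
# Pretrained Wav2Vec2 backbone + freshly initialised 2-way classifier
# head; only a modest labelled set is needed to fine-tune it.
import torch
from transformers import (AutoFeatureExtractor,
                          AutoModelForAudioClassification)

checkpoint = "facebook/wav2vec2-base"  # or facebook/wav2vec2-xls-r-300m
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = AutoModelForAudioClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    label2id={"female": 0, "male": 1},
    id2label={0: "female", 1: "male"},
)

# One dummy second of 16 kHz audio stands in for a real example
waveform = torch.zeros(16_000)
inputs = extractor(waveform.numpy(), sampling_rate=16_000,
                   return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax())])
```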

Model Performance Analysis

The table below shows sample model performance for one of the analytics, gender classification.

| Model | Model Size | Test Data | Inference Time (No Batching) | Inference Accuracy |
| --- | --- | --- | --- | --- |
| ResNet10 | 677k params | facebook/voxpopuli | 28 ms | 95.2% |
| facebook/wav2vec2-base (fine-tuned) | 315 mil. params | facebook/voxpopuli | 1.2 s | 95.9% |
| facebook/wav2vec2-xls-r-300m (fine-tuned) | 315 mil. params | facebook/voxpopuli | 1.2 s | 97.6% |
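The inference times above are per-sample wall-clock latencies without batching. The sketch below shows one way such numbers can be measured; the measurement procedure here is an assumption, not the procedure actually used for the table.

```python
# Mean per-sample forward-pass latency in milliseconds, no batching.
# The tiny stand-in model only keeps the snippet runnable; in practice
# the loaded ResNet10 or Wav2Vec2 model would be timed.
import time
import torch

def mean_latency_ms(model, example, runs: int = 50) -> float:
    with torch.no_grad():
        model(example)  # warm-up pass, excluded from timing
        start = time.perf_counter()
        for _ in range(runs):
            model(example)
    return (time.perf_counter() - start) / runs * 1000.0

model = torch.nn.Linear(16_000, 2)   # stand-in classifier
example = torch.zeros(1, 16_000)     # one 1-second 16 kHz clip
print(f"{mean_latency_ms(model, example):.2f} ms")
```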

Approach for Continual Improvement


To continually improve performance, some x% of the production samples are analysed for correctness using tools and manual methods. The corrected data is then fed back for training and subsequent inference. This continual learning mechanism is built into the application.
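A minimal sketch of that human-in-the-loop sampling, assuming a configurable review fraction standing in for the unspecified x% and a simple record layout; both are illustrative assumptions.

```python
# Route a random fraction of production inferences to a review queue;
# corrected records are later appended to the next training set.
import random

REVIEW_FRACTION = 0.05  # stand-in for the unspecified "x%"

def route_for_review(record: dict) -> bool:
    """Randomly select a fraction of production inferences for audit."""
    return random.random() < REVIEW_FRACTION

production_log = [{"clip": f"clip_{i}.wav", "label": "female"}
                  for i in range(1000)]
review_queue = [r for r in production_log if route_for_review(r)]
print(len(review_queue), "records queued for manual/tool-based review")
```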