The VAD in the front end uses a multilayer perceptron (MLP) fed with 6 MFCCs from each of 9 frames of the input speech signal.
The MLP has 50 hidden units and 1 output unit. Each hidden unit has 9*6 = 54 inputs with corresponding weights, a sigmoid activation f(x) = 1/(1 + e^-x), and one output. The output unit produces its result as a weighted combination of the hidden-unit outputs. The MLP is trained using two outputs, speech and silence.
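The forward pass described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the sizes (9 frames, 6 MFCCs, 50 hidden units) come from the text, while the function names, the bias terms, and the random weights are assumptions made for the example.

```python
import math
import random

N_FRAMES = 9                    # context frames (from the text)
N_MFCC = 6                      # MFCCs per frame (from the text)
N_HIDDEN = 50                   # hidden units (from the text)
N_INPUTS = N_FRAMES * N_MFCC    # 54 inputs per hidden unit


def sigmoid(x):
    """Logistic sigmoid f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_layer(inputs, weights, biases):
    """Return one sigmoid activation per hidden unit for a 54-value input."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def output_unit(hidden, weights, bias):
    """Weighted sum of the hidden activations for a single output unit."""
    return sum(w * h for w, h in zip(weights, hidden)) + bias


# Hypothetical random parameters for illustration only (not trained values).
# The bias terms are an assumption; the text mentions only weights.
rng = random.Random(0)
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(N_INPUTS)]
            for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)]

# Stand-in for 9 frames x 6 MFCCs flattened into one vector.
features = [rng.uniform(-1.0, 1.0) for _ in range(N_INPUTS)]

hidden = hidden_layer(features, w_hidden, b_hidden)
score = output_unit(hidden, w_out, 0.0)
```

Each hidden activation lies strictly between 0 and 1 because of the sigmoid, and `score` is the raw (unnormalized) value of one output unit.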
The probability that a given frame is silence is computed as e^silence / (e^silence + e^speech), where silence and speech are the two output values.
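The silence probability is a two-class softmax over the output values. A minimal sketch follows; the function name `silence_probability` is hypothetical, and subtracting the maximum before exponentiating is an added numerical-stability step that leaves the result mathematically unchanged.

```python
import math


def silence_probability(silence, speech):
    """Return e^silence / (e^silence + e^speech) without overflow."""
    # Shifting both values by their maximum cancels out in the ratio
    # but keeps math.exp from overflowing on large inputs.
    m = max(silence, speech)
    e_sil = math.exp(silence - m)
    e_sp = math.exp(speech - m)
    return e_sil / (e_sil + e_sp)
```

When the two output values are equal the result is 0.5, and it approaches 1 as the silence output grows relative to the speech output.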
Notably, the authors apply a low-pass filter before the DCT computation during MFCC extraction.
References
Adami et al., "Qualcomm-ICSI-OGI Features for ASR," ICSLP 2002.



