The VCT library uses Voice Activity Detection (VAD) to detect when someone is talking. Enabling VAD saves bandwidth and reduces CPU load because audio packets are not transmitted while there is no talking. VAD is enabled by default; you can disable it (or enable it again) with the VCT_EnableVAD function.
The following figure provides an overview of VAD features and details about the adjustment parameters.
The segment marked (1) in Figure 4-1 indicates the amount of change in audio volume used for detecting the transition from silence to a frame containing speech. The active gain set with the VCT_SetVADActiveGain function is a ratio (350% by default) that serves as the threshold for detecting this transition from silence to talking. Talking is determined to have started if the average speech power of the current audio frame exceeds the average speech power of the four past frames, taken as of the preceding audio frame, by more than the active gain ratio (3.5-fold by default). Setting a smaller value for the active gain makes it easier to detect the start of talking.
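The start-of-talking comparison described above can be sketched in a few lines of C. This is an illustrative model only, not the VCT library's implementation: the names speech_started and average_power, the use of double for power values, and the HISTORY_FRAMES constant are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The text compares the current frame against four past frames. */
#define HISTORY_FRAMES 4

/* Hypothetical helper: mean power of the past frames. */
static double average_power(const double history[HISTORY_FRAMES])
{
    double sum = 0.0;
    for (size_t i = 0; i < HISTORY_FRAMES; ++i) {
        sum += history[i];
    }
    return sum / HISTORY_FRAMES;
}

/* Hypothetical helper: returns true when the current frame's power exceeds
 * the history average by more than active_gain. A value of 3.5 corresponds
 * to the 350% default mentioned in the text. */
static bool speech_started(double current_power,
                           const double history[HISTORY_FRAMES],
                           double active_gain)
{
    assert(active_gain > 0.0);
    return current_power > active_gain * average_power(history);
}
```

With a history averaging 1.0, a frame of power 3.0 does not trigger at a gain of 3.5 but does trigger at 2.5, which illustrates why a smaller active gain makes the start of talking easier to detect.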
The inactive gain set with the VCT_SetVADInactiveGain function is a ratio (280% by default) that serves as the threshold for detecting the transition from talking to silence. Silence is detected if the average speech power of the four past frames, taken as of the current audio frame, is lower than the average speech power of the four most recent frames at the time talking was last detected, by more than the inactive gain ratio (2.8-fold by default). Setting a larger value for the inactive gain makes it easier to detect silence.
If packet transmission stopped as soon as silence was detected, listeners would hear breaks in the talking, because transmission would be cut off by glottal stops and other extremely short interruptions, and by the talker pausing for breath. For this reason, the VCT library sets a VAD release time (segment (3) in Figure 4-1). VAD does not transition to the silence state unless the conditions for detecting silence continue to be satisfied throughout this release time. The default value is 720 ms (144 ms x 5), and you can change it by using the VCT_SetVADReleaseTime function.
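The release-time behavior can be pictured as a small hangover timer. The following is a minimal sketch under stated assumptions, not the library's code: the VadHangover type and both functions are hypothetical, and the 144 ms step used in the usage note is inferred only from the "144 ms x 5" breakdown of the default.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state for illustrating the release time. */
typedef struct {
    bool talking;     /* true while packets would still be sent */
    int  silent_ms;   /* how long the silence condition has held so far */
    int  release_ms;  /* release time, for example 720 */
} VadHangover;

static void hangover_init(VadHangover *v, int release_ms)
{
    assert(release_ms >= 0);
    v->talking = false;
    v->silent_ms = 0;
    v->release_ms = release_ms;
}

/* Feed one frame's raw decision. Returns whether the state is still
 * "talking" after applying the release time. */
static bool hangover_update(VadHangover *v, bool frame_has_speech, int frame_ms)
{
    if (frame_has_speech) {
        v->talking = true;
        v->silent_ms = 0;              /* any speech restarts the timer */
    } else if (v->talking) {
        v->silent_ms += frame_ms;
        if (v->silent_ms >= v->release_ms) {
            v->talking = false;        /* silence held for the full release time */
            v->silent_ms = 0;
        }
    }
    return v->talking;
}
```

With a release time of 720 ms and 144 ms steps, four consecutive silent steps leave the state at talking, and only the fifth switches it to silence. A short pause for breath therefore does not interrupt transmission.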
Segment (4) in Figure 4-1 represents the average audio power threshold for unconditionally treating a frame as silent, as set by the VCT_SetVADClampGain function. If the current frame's average power is less than or equal to the specified value, the frame is processed as silent because its power is abnormally low. Frames determined to be silent based on this condition are not used in calculating the average audio power of silent frames, which is a value needed for detecting the transitions between talking and silence.
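The effect of the clamp on the silent-frame average can be sketched as follows. This is a simplified illustration, not the library's implementation: SilenceStats and its functions are hypothetical names, the clamp is modeled as a plain power threshold, and the average is kept as a running mean over all recorded frames rather than over a window.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for the average power of silent frames. */
typedef struct {
    double clamp_power;   /* frames at or below this are unconditionally silent */
    double silent_sum;    /* summed power of silent frames that were counted */
    int    silent_count;  /* number of silent frames that were counted */
} SilenceStats;

/* True when the frame is at or below the clamp threshold. */
static bool is_clamped(const SilenceStats *s, double frame_power)
{
    return frame_power <= s->clamp_power;
}

/* Record a frame already classified as silent. Clamped frames are skipped
 * so that abnormally low values do not drag the average down. */
static void record_silent_frame(SilenceStats *s, double frame_power)
{
    if (is_clamped(s, frame_power)) {
        return;
    }
    s->silent_sum += frame_power;
    s->silent_count += 1;
}

/* Mean power of the silent frames that were counted. */
static double silent_average(const SilenceStats *s)
{
    assert(s->silent_count > 0);
    return s->silent_sum / s->silent_count;
}
```

For example, with a clamp of 0.5, silent frames of power 2.0, 0.25, and 4.0 yield an average computed from 2.0 and 4.0 only; the 0.25 frame is treated as silent but left out of the average.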
For more information about the functions for setting these parameters, see the function reference.