Problem definition: R-SE

The figure above shows three typical application scenarios of R-SE.
(a) Angular region: only voices within a direction range (azimuth & elevation) are desired. This is useful when the target speakers are located in a fixed or pre-arranged region and are sufficiently separated in direction from competing speakers.
(b) Sphere: only sounds within a certain radius (source-to-array distance) are retained. This scenario suits cases where all background sounds are treated as interference, or where simultaneous speech arrives from close azimuths. Note that a ring-like spatial region can also be defined as the difference between two spheres with outer and inner radii, respectively.
(c) Cone: the cone constrains both the direction range and the distance threshold.
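As a minimal sketch of how the three region types constrain a source position (the function names and the region encoding below are our own illustration, not the paper's implementation):

```python
def in_region(azimuth, elevation, distance, region):
    """Test whether a source at (azimuth, elevation) degrees and
    `distance` meters from the array lies inside a query region."""
    shape = region["shape"]
    if shape == "angular":                # direction window only
        return (region["az"][0] <= azimuth <= region["az"][1]
                and region["el"][0] <= elevation <= region["el"][1])
    if shape == "sphere":                 # distance threshold only
        return distance <= region["radius"]
    if shape == "cone":                   # direction window AND distance threshold
        return (region["az"][0] <= azimuth <= region["az"][1]
                and distance <= region["radius"])
    raise ValueError(f"unknown region shape: {shape}")

def in_ring(distance, r_inner, r_outer):
    """Ring-like region: the set difference of two concentric spheres."""
    return r_inner <= distance <= r_outer
```

A ring-like region is thus simply a spherical membership test with both an inner and an outer radius.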

Real-recorded data in a conference room

Data Recording

The figure below gives an application scenario of an offline conference. The conversation takes place in a conference room (about 10 x 6 x 4 m). A conference table is placed at the center of the room, with the dual-microphone array put on the table. Two male speakers sit at the table at 90° and 0° relative to the microphone array, respectively, at a distance of about 1.5 m from the array.
An 8-element linear array is used to record the session, but we only use two of its channels to validate our model.

Model — A-ReZero

The azimuth-range-based ReZero (A-ReZero) model was trained on synthetic data and evaluated on real-recorded stereo data.
The spacing between the two recording sensors is 14 cm for the presented demo. Evaluations were also conducted on 2-mic recordings with 10.5 cm and 7 cm spacings.
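With a known spacing, a common way to build a direction feature for a 2-mic setup is to compare the observed inter-channel phase difference (IPD) against the IPD expected from a query azimuth. The sketch below illustrates this under a far-field assumption; it is not necessarily the exact spatial feature used by A-ReZero:

```python
import numpy as np

def target_ipd(azimuth_deg, d=0.14, sr=16000, n_fft=512, c=343.0):
    """Expected inter-channel phase difference (radians) per frequency bin
    for a far-field source at `azimuth_deg`, given mic spacing `d` (m)."""
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)           # 0 .. sr/2 Hz
    tdoa = d * np.cos(np.deg2rad(azimuth_deg)) / c     # inter-mic time delay (s)
    return 2.0 * np.pi * freqs * tdoa

def direction_feature(stft_ch1, stft_ch2, azimuth_deg, d=0.14, sr=16000):
    """Cosine similarity between observed and expected IPD; values close to 1
    indicate energy arriving from near the query azimuth.
    `stft_ch1`/`stft_ch2` are complex STFTs shaped [freq, time]."""
    obs_ipd = np.angle(stft_ch1 * np.conj(stft_ch2))
    n_fft = 2 * (stft_ch1.shape[0] - 1)
    exp_ipd = target_ipd(azimuth_deg, d=d, sr=sr, n_fft=n_fft)[:, None]
    return np.cos(obs_ipd - exp_ipd)
```

Note that with a 14 cm spacing the IPD is spatially aliased above roughly c/(2d) ≈ 1.2 kHz, which is one reason a learned model can outperform purely geometric processing on such arrays.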
Session 3
Session 3: Two male speakers, loud game character voice, music, B-Box

Time Stamp      Recording (stereo)   Query [45°, 100°]   Query [-30°, 30°]   Query [-90°, -30°]
[9.0-38.0]      S3_m1                S3_q11              S3_q12              S3_q13
[39.0-61.0]     S3_m2                S3_q21              S3_q22              S3_q23
[68.0-94.0]     S3_m3                S3_q31              S3_q32              S3_q33
[94.0-109.0]    S3_m5                S3_q51              S3_q52              S3_q53
[145.5-160.0]   S3_m4                S3_q41              S3_q42              S3_q43

Different model sizes

To examine the feasibility of the proposed framework under light or even super-light configurations, we compare the performance of A-ReZero across different model sizes and computational complexities.

  • BSRNN-M: R=8, H=48, P=16, bs1 scheme
  • BSRNN-XXS: R=6, H=24, P=16, bs2 scheme (the spectrogram is split into 5×200 Hz and 6×500 Hz subbands, followed by 4×1 kHz subbands)
  • BSRNN-XXXS: R=4, H=16, P=12, bs2 scheme
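For concreteness, the bs2 scheme above can be mapped to STFT-bin ranges. The sketch below assumes a 16 kHz sampling rate and a 512-point FFT (257 frequency bins), under which the listed subband widths exactly cover the 0-8 kHz band; the exact bin mapping in our implementation may differ:

```python
def band_split_bs2(n_freq=257, sr=16000):
    """Frequency-bin ranges for the bs2 band-split scheme:
    5 x 200 Hz, 6 x 500 Hz, then 4 x 1 kHz subbands (covering 0-8 kHz)."""
    hz_per_bin = (sr / 2) / (n_freq - 1)            # 31.25 Hz per STFT bin
    widths_hz = [200] * 5 + [500] * 6 + [1000] * 4  # 15 subbands, 8 kHz total
    edges_hz = [0]
    for w in widths_hz:
        edges_hz.append(edges_hz[-1] + w)
    edges_bin = [round(h / hz_per_bin) for h in edges_hz]
    edges_bin[-1] = n_freq                          # include the Nyquist bin
    # half-open (start_bin, end_bin) ranges, one per subband
    return list(zip(edges_bin[:-1], edges_bin[1:]))
```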

Time Stamp     Recording   Query [60°, 100°] / [-15°, 15°]
                           BSRNN-M (6G/s, 3M)   BSRNN-XXS (550M/s, 760K)   BSRNN-XXXS (235M/s, 430K)
[9.0-38.0]
[39.0-61.0]
[68.0-94.0]

Different target audio types

To examine the generalization capability of the proposed framework to different target audio types, i.e., not limited to a particular type such as speech, we simply switch the target audio type to instrumental music (without vocals). In this case, both speech signals and common noises (e.g., Gaussian noise, sound events) are treated as interference to be removed.
Session 5: The recording scene is the same as in Session 3 above. A computer playing instrumental music, together with a male speaker, is placed at about -30° (equivalent to 30° for a linear array). A female singer is located at about 90°, while another male singer is located at about 180° (equivalent to 0° for a linear array).
Note: here we adopt two small causal BSRNN models to separately extract the instrumental music and the speech/vocals. For the task of music source separation, larger BSRNN models (20×+ complexity), a higher sampling rate (44.1 kHz), and a finer band-split scheme are preferred to produce more promising results.

Time Stamp      Recording   Music [15°, 45°]   Speech / Vocal [-15°, 100°]
[10.0-33.0]
[32.0-42.0] (failed: too low SNR)
[108.0-118.0]


Real-recorded data with a smartphone

Data Recording

The figure below gives an application scenario of smartphone communication. Two male speakers were located at (90°, 0.5 m) and (270°, 1.5 m) relative to the smartphone, respectively. A female speaker was located at (240°, ?m) relative to the smartphone.

Model — A-ReZero & C-ReZero

Both the azimuth-range-based ReZero (A-ReZero) and the conical ReZero (C-ReZero) models were trained on synthetic data and evaluated on real-recorded stereo data.
The dual microphones are located at the top and bottom of the smartphone, respectively, with unknown spacing.
Recording (stereo)   Angular Query { [75°, 105°] }   Conical Query { [75°, 105°], 1.0 m }
S3_m1                S3_q11                          S3_q12
S3_m2                S3_q21                          S3_q22

Real-recorded data in a vehicle

Data Recording

The data is recorded in a moving 4-seat car, as shown in the figure below.
Two microphones are symmetrically located near the car navigation unit, between the driver and the co-driver. Another two microphones are symmetrically distributed beside the two rear seats. Compared to the small microphone arrays used in the paper and in the other demos, the spacings between the in-car microphones are relatively large (~1 m).

ReZero in car

The in-car region shape, a rectangular box, is not among the shapes defined in the paper. In our implementation, we sample the vertices and the center of the box to calculate the corresponding spatial features.
Data simulation is also quite different for the in-car scenario, including RIRs, noise, echo, speaker arrangements, etc. We have addressed these changes under the ReZero framework.
The target output of the model is the separated speech signal of each of the four regions. The model used here is fully causal.
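A minimal sketch of the vertex-and-center sampling described above (the coordinate convention and the box parametrization are illustrative assumptions, not the paper's exact implementation):

```python
import itertools
import numpy as np

def box_sample_points(x_range, y_range, z_range):
    """Return the 8 vertices plus the center of a rectangular box region
    (e.g. the space around one seat) in array-centered coordinates (m).
    Spatial features are then computed at each of these 9 points."""
    # Cartesian product of the (min, max) extents gives the 8 vertices.
    vertices = np.array(list(itertools.product(x_range, y_range, z_range)), float)
    center = np.array([np.mean(x_range), np.mean(y_range), np.mean(z_range)])
    return np.vstack([vertices, center])            # shape: (9, 3)
```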
                       Session: co-driver + rear2 + loud echo   Session: very low SNR with strong wind noise
Recording (ref. ch.)   S3_m2                                    S3_q21
ReZero output          S3_m2, S3_m2                             S3_q21

2D Angular R-SE

The 2-D angular region takes both the azimuth and elevation windows into consideration.
In simulation, the microphone array is placed at a relatively high position to allow the full elevation range. The width of the elevation window is set in the range [30°, 60°] during both training and evaluation.

The 2-D angular region is evenly divided into an 8×6 2-D mesh, where each mesh point corresponds to one sampled azimuth and elevation. The 2-D angular region feature is then computed by aggregating the direction features at all mesh points with the RNN-Loop method.
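The mesh sampling can be sketched as follows; the helper below only produces the (azimuth, elevation) mesh points, while the per-point direction features and the RNN aggregation are not shown:

```python
import numpy as np

def angular_mesh(az_range, el_range, n_az=8, n_el=6):
    """Evenly sample an azimuth/elevation window into an n_az x n_el mesh.
    Each row of the result is one (azimuth, elevation) point in degrees."""
    az = np.linspace(az_range[0], az_range[1], n_az)
    el = np.linspace(el_range[0], el_range[1], n_el)
    azg, elg = np.meshgrid(az, el, indexing="ij")
    return np.stack([azg.ravel(), elg.ravel()], axis=1)   # shape: (n_az*n_el, 2)
```

For the 8×6 mesh used here, `angular_mesh` yields 48 points whose direction features are aggregated into a single region feature.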

The mixture sample contains two speakers with very close azimuths but different elevations.

ID   Azm. range       Ele. range     Mixture (1st ch.)   Ground Truth   AE-ReZero Output
1    [-150°, -100°]   [-90°, -50°]
2    [-150°, -100°]   [-50°, 0°]
3    [-150°, -120°]   [-50°, 0°]
4    [-120°, -80°]    [-90°, -50°]
5    [-115°, -80°]    [-50°, 0°]

Spherical R-SE

An example in which the distance-range-based ReZero (D-ReZero) produces different extraction results as the query distance varies is illustrated in the figure above.
The mixture sample contains two speakers, two point noises, and two isotropic noises, where the source-to-array distances of speaker 1 and speaker 2 are 0.36 m and 0.94 m, respectively. The azimuths of the two speakers are very close, so direction features can hardly distinguish between them.

Distance   Mixture (1st ch.)   Ground Truth   D-ReZero Output
0.1 m
0.5 m
0.9 m
1.5 m

Conical R-SE

An example showing that C-ReZero is capable of universal region-customized speech extraction is illustrated in the figure below.
The mixture contains two speakers and two point noises. The adopted microphone array is illustrated in (a).
We design 7 query regions for one mixture sample, covering different region shapes and ranges. The details of the query regions are elaborated in the table and illustrated in (b).
The results demonstrate the great potential of ReZero to produce desirable extraction results for different query regions.

Region ID   Region shape   Region scope (θ, d)             Q   Mixture (1st ch.)   Ground Truth   C-ReZero Output
1           angular        [-270°, -110°], [0, ∞]          2
2           sphere         [-180°, 180°], [0, 0.5 m]       1
3           ring           [-180°, 180°], [0.5 m, 1.1 m]   1
4           cone           [-150°, -90°], [0, 1.5 m]       1
5           cone           [90°, 120°], [0, 1.0 m]         1
6           cone           [-130°, -60°], [0, 0.6 m]       0
7           cone           [-30°, 30°], [0, 1.0 m]         0

Stay tuned! Something new is coming :)