ReZero

ReZero: Region-customizable Sound Extraction

Rongzhi Gu, Yi Luo

Abstract:

We introduce region-customizable sound extraction (ReZero), a general and flexible framework for the multi-channel region-wise sound extraction (R-SE) task. The R-SE task aims to extract all active target sounds (e.g., human speech) within a specific, user-defined spatial region, which differs from conventional tasks where blind separation or a fixed, predefined spatial region is typically assumed. The spatial region can be defined as an angular window, a sphere, a cone, or other geometric patterns. As a solution to the R-SE task, the proposed ReZero framework includes (1) definitions of different types of spatial regions, (2) methods for region feature extraction and aggregation, and (3) a multi-channel extension of the band-split RNN (BSRNN) model tailored to the R-SE task. We design experiments for different microphone array geometries and different types of spatial regions, together with comprehensive ablation studies on different system configurations. Experimental results on both simulated and real-recorded data demonstrate the effectiveness of ReZero.

[PDF]

Problem definition: R-SE

The figure above shows 3 typical application scenarios of R-SE.
(a) Angular region: Only the voices within the direction range (azimuth and elevation) are desired. This is useful when the target speakers are located in a fixed or pre-arranged region and are separated in direction from competing speakers.
(b) Sphere: Only sounds within a certain radius (source-to-array distance) are desired. This scenario is suitable when all background sounds are regarded as interference or when simultaneous speech comes from close azimuths. Note that a ring-like spatial region can also be defined as the difference between two spheres with outer and inner radii, respectively.
(c) Cone: A cone constrains both the direction range and the distance threshold.
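The region types above can be read as constraints on a source's azimuth and distance. The following is a minimal sketch of that reading, not the paper's implementation: the `Region` class and its field names are hypothetical, elevation is omitted for brevity, and angles are assumed to be in degrees.

```python
import math
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical region: an azimuth window plus a distance window."""
    az_min_deg: float = -180.0
    az_max_deg: float = 180.0
    d_min_m: float = 0.0
    d_max_m: float = math.inf

    def contains(self, az_deg: float, dist_m: float) -> bool:
        width = self.az_max_deg - self.az_min_deg
        if width >= 360.0:
            in_az = True  # full circle: no direction constraint
        else:
            # Measure the azimuth relative to the window start, wrapped to
            # [0, 360), so windows that cross +/-180 degrees also work.
            in_az = (az_deg - self.az_min_deg) % 360.0 <= width
        return in_az and self.d_min_m <= dist_m <= self.d_max_m


angular = Region(az_min_deg=-30, az_max_deg=30)          # direction only
sphere = Region(d_max_m=0.5)                             # distance only
ring = Region(d_min_m=0.5, d_max_m=1.1)                  # between two spheres
cone = Region(az_min_deg=75, az_max_deg=105, d_max_m=1.0)  # both constraints

print(cone.contains(90, 0.5))  # True: inside both windows
print(cone.contains(90, 1.5))  # False: right direction, too far away
```

In this reading, the angular region and the sphere are special cases of the cone in which one of the two constraints is left open.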

Real-recorded data in conference room

Data Recording

The figure below shows an application scenario of an offline conference. The conversation takes place in a conference room (about 10 x 6 x 4 m). A conference table is placed at the center of the room, with the dual-mic array on the table. Two male speakers sit by the table at 90° and 0° relative to the microphone array, respectively. Their distance to the array is about 1.5 meters.
An 8-element linear array is used to record the session, but we only use two of its microphones to validate our model.

Model — A-ReZero

The azimuth-range-based ReZero (A-ReZero) model was trained on synthetic data and evaluated on real-recorded stereo data.
The microphone spacing between two recording sensors is 14 cm for the presented demo. The evaluation is also conducted on 2-mic recordings with 10.5 cm and 7 cm spacings.

Session 3: Two male speakers, loud game character voice, music, B-Box

Time Stamp	Recording (stereo)	Query [45, 100°]	Query [-30, 30°]	Query [-90, -30°]
[9.0-38.0]
[39.0-61.0]
[68.0-94.0]
[94.0-109.0]
[145.5-160.0]

Different model sizes

To examine the feasibility of the proposed framework under light or even super-light configurations, we compare the performance of A-ReZero across different model sizes and computational complexities.

BSRNN-M: R=8, H=48, P=16, bs1 scheme
BSRNN-XXS: R=6, H=24, P=16, bs2 scheme (the spectrogram is split into 5×200 Hz, 6×500 Hz, followed by 4×1 kHz subbands)
BSRNN-XXXS: R=4, H=16, P=12, bs2 scheme
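As a worked example of the bs2 scheme, the sketch below expands the three groups of subbands into explicit frequency ranges. It assumes the subbands are contiguous and start at 0 Hz; the helper name `band_edges` is hypothetical. The widths sum to 8 kHz, which would match the Nyquist frequency of 16 kHz audio, although the sampling rate is not stated here.

```python
def band_edges(scheme):
    """Return (low_hz, high_hz) per subband, given (count, width_hz) groups."""
    edges = []
    low = 0
    for count, width in scheme:
        for _ in range(count):
            edges.append((low, low + width))
            low += width
    return edges


# bs2 as described above: 5 x 200 Hz, then 6 x 500 Hz, then 4 x 1 kHz
BS2 = [(5, 200), (6, 500), (4, 1000)]
bands = band_edges(BS2)

print(len(bands))  # 15 subbands in total
print(bands[0])    # (0, 200)
print(bands[-1])   # (7000, 8000)
```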

Time Stamp	Recording	[60, 100°] / [-15, 15°]
Time Stamp	Recording	BSRNN-M (6G/s, 3M)	BSRNN-XXS (550M/s, 760K)	BSRNN-XXXS (235M/s, 430K)
[9.0-38.0]
[39.0-61.0]
[68.0-94.0]

Different target audio types

To examine how well the proposed framework generalizes to different target audio types, i.e., beyond a particular type such as speech, we simply changed the target audio type to instrumental music (without vocals). In this case, both speech signals and common noises (e.g., Gaussian noise, sound events) are treated as interference that needs to be removed.
Session 5: The recording scene is the same as Session 3 illustrated above. A computer, together with a male speaker, is placed at about -30° (same as 30° for a linear array) and plays instrumental music. The female singer is located at about 90°, while the other male singer is located at about 180° (same as 0° for a linear array).
Note: Here we adopt two small causal BSRNN models to separately extract the instrumental music and the speech/vocals. For music source separation, larger BSRNN models (20x+ complexity), a higher sample rate (44.1 kHz), and a finer band-split scheme are preferred to produce better results.

Time Stamp	Recording	Music [15°, 45°]	Speech / Vocal [-15°, 100°]
[10.0-33.0]
[32.0-42.0]		failed (too low SNR)
[108.0-118.0]

Real-recorded data with smart phone

Data Recording

The figure below shows an application scenario of smart-phone communication. Two male speakers were located at (90°, 0.5m) and (270°, 1.5m) relative to the smart phone, respectively. A female speaker was located at (240°, ?m) relative to the smart phone.

Model — A-ReZero & C-ReZero

Both the azimuth-range-based ReZero (A-ReZero) and conical ReZero (C-ReZero) models were trained on synthetic data and evaluated on real-recorded stereo data.
The dual microphones are located at the top and bottom of the smart phone, respectively, with unknown spacing.

Recording (stereo)	Angular Query { [75, 105°] }	Conical Query { [75, 105°], 1.0m }

Real-recorded data in vehicle

Data Recording

The data was recorded in a moving 4-seat car, as shown in the figure below.
Two microphones are placed symmetrically near the car navigation system, between the driver and the co-driver. Another two microphones are placed symmetrically beside the two rear seats. Compared with the small microphone arrays used in the paper and presented in the other demos, the spacing between the microphones in the car is relatively large (~1 m).

ReZero in car

The region shape used in the car, a rectangular box, is not defined in the paper. In our implementation, we sample the vertices and the center of the box to calculate the corresponding spatial features.
Data simulation can be quite different for the in-car scenario, including RIRs, noise, echo, speaker arrangements, etc. We have addressed these changes under the ReZero framework.
The model outputs separate speech signals for four different regions. The model used here is fully causal.
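The box sampling described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: the box is taken to be axis-aligned, all coordinates are made up, and the time difference of arrival (TDOA) between two microphones stands in for the spatial features, which are not specified here. The function names `sample_box` and `tdoa` are hypothetical.

```python
import itertools
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed


def sample_box(lo, hi):
    """Return the 8 vertices and the center of an axis-aligned box, shape (9, 3)."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    # Each vertex picks either the low or the high bound along every axis.
    vertices = np.array(list(itertools.product(*zip(lo, hi))))
    center = (lo + hi) / 2.0
    return np.vstack([vertices, center])


def tdoa(points, mic_a, mic_b):
    """TDOA in seconds between two microphones for each sampled point."""
    dist_a = np.linalg.norm(points - np.asarray(mic_a, float), axis=1)
    dist_b = np.linalg.norm(points - np.asarray(mic_b, float), axis=1)
    return (dist_a - dist_b) / SPEED_OF_SOUND


# Made-up seat region (meters) and a made-up microphone pair 1 m apart.
points = sample_box(lo=(0.0, 0.0, 0.0), hi=(1.0, 1.0, 1.0))
delays = tdoa(points, mic_a=(0.0, 0.5, 1.0), mic_b=(1.0, 0.5, 1.0))

print(points.shape)  # (9, 3)
print(delays.shape)  # (9,)
```

With microphones about 1 m apart, the delays differ noticeably across the nine sampled points, which is one reason a single direction may describe such a region poorly.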

	Session: co-driver + rear2 + loud echo	Session: very low SNR with strong wind noise
Recording (ref. ch.)
ReZero output

2D Angular R-SE

A 2-D angular region takes both azimuth and elevation windows into account.
In simulation, the microphone array is placed at a relatively high position to allow for the full elevation range. The width of the elevation window is set in the range of [30°, 60°] during both training and evaluation.

The 2-D angular region is evenly divided into an 8×6 2-D mesh, where each mesh point indicates a sampled azimuth and elevation. The 2-D angular region feature is then computed by aggregating the direction features at each mesh point using the RNN-Loop method.
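A minimal sketch of the mesh construction is given below. It assumes that the mesh points include the window boundaries, which is not stated above, and the function name `angular_mesh` is hypothetical. The RNN-Loop aggregation step is not shown.

```python
import numpy as np


def angular_mesh(az_range, el_range, n_az=8, n_el=6):
    """Evenly sample an azimuth/elevation window.

    Returns an array of shape (n_az, n_el, 2) holding (azimuth, elevation)
    in degrees at each mesh point.
    """
    az = np.linspace(az_range[0], az_range[1], n_az)
    el = np.linspace(el_range[0], el_range[1], n_el)
    az_grid, el_grid = np.meshgrid(az, el, indexing="ij")
    return np.stack([az_grid, el_grid], axis=-1)


# Example window: azimuth [-150, -100] degrees, elevation [-50, 0] degrees.
mesh = angular_mesh((-150, -100), (-50, 0))

print(mesh.shape)  # (8, 6, 2)
```

Each of the 48 mesh points would then yield one direction feature, and the aggregation step combines them into a single region feature.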

The mixture sample contains two speakers with very close azimuths but different elevations.

ID	Azm. range	Ele. range
1	[-150, -100]	[-90, -50]
2	[-150, -100]	[-50, 0]
3	[-150, -120]	[-50, 0]
4	[-120, -80]	[-90, -50]
5	[-115, -80]	[-50, 0]

Spherical R-SE

The figure above illustrates an example in which the distance-range-based ReZero (D-ReZero) produces different extraction results as the query distance is varied.
The mixture sample contains two speakers, two point noises and two isotropic noises, where the source-to-array distances of speaker 1 and speaker 2 are 0.36 m and 0.94 m, respectively. The azimuths of these two speakers are very close, which means direction features can hardly distinguish between them.

Distance	Mixture (1st ch.)	Ground Truth	D-ReZero Output
0.1m
0.5m
0.9m
1.5m
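For reference, the sketch below lists which speakers lie inside each query sphere, based only on the source-to-array distances stated above. It describes what an ideal extractor would keep, not the actual D-ReZero output; the noise sources are left out because their distances are not given.

```python
speaker_distances_m = {"speaker 1": 0.36, "speaker 2": 0.94}
query_distances_m = [0.1, 0.5, 0.9, 1.5]


def speakers_in_sphere(radius_m):
    """Names of the speakers whose distance to the array is within the radius."""
    return [name for name, d in speaker_distances_m.items() if d <= radius_m]


for radius in query_distances_m:
    print(radius, speakers_in_sphere(radius))
# 0.1 []
# 0.5 ['speaker 1']
# 0.9 ['speaker 1']
# 1.5 ['speaker 1', 'speaker 2']
```

The 0.9 m query is the tightest case: speaker 2 sits only 4 cm outside the sphere.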

Conical R-SE

The figure below illustrates an example of C-ReZero performing universal region-customized speech extraction.
The mixture contains two speakers and two point noises. The adopted microphone array is illustrated in (a).
We design 7 query regions for one mixture sample, covering different region shapes and ranges. The query regions are detailed in the table and illustrated in (b).
These examples demonstrate the potential of ReZero to produce desirable extraction results for different query regions.

Region ID	Region shape	Region scope (θ, d)	Q
1	angular	[-270°,-110°], [0, ∞]	2
2	sphere	[-180°, 180°], [0, 0.5m]	1
3	ring	[-180°, 180°], [0, 0.5m - 1.1m]	1
4	cone	[-150°, -90°], [0, 1.5m]	1
5	cone	[90°, 120°], [0, 1.0m]	1
6	cone	[-130°, -60°], [0, 0.6m]	0
7	cone	[-30°, 30°], [0, 1.0m]	0
