Singing voice separation method using multi-stage progressive gated convolutional network
-
Abstract: Current singing voice separation methods based on convolutional neural networks (CNNs) suffer from semantic gaps when high- and low-level features are fused, and they overlook the latent value of speech features along the channel dimension. To address these problems, a stacked multi-stage progressive gated convolutional network is proposed for singing voice separation. First, a gated adaptive convolution (GAC) unit is designed in each subnetwork stage to fully learn and extract the time-frequency features of songs and to strengthen competition and cooperation among feature channels. Second, to reduce the semantic errors that arise when shallow and deep network information is fused, a gated attention mechanism is introduced between the encoder and decoder layers of each subnetwork. Finally, a supervised attention (SA) module is placed between adjacent subnetwork stages to selectively pass effective information and realize progressive learning across the multi-stage network. Comprehensive comparative experiments on two publicly available datasets (one large, one small) show that, compared with representative models of recent years, the proposed method has clear advantages in separating both the singing voice and the accompaniment.
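The layer dimensions of Conv2d_1 through Conv2d_3 and SA in Table 1 below (64→2→64→64 channels) suggest a supervised-attention design in which each stage emits a two-channel vocal/accompaniment estimate that gates the features handed to the next stage. A minimal PyTorch sketch under that assumption (not the authors' released code):

```python
import torch
import torch.nn as nn

class SupervisedAttention(nn.Module):
    """Hypothetical SA module between subnetwork stages.

    Conv2d_1 maps 64 feature channels to a 2-channel estimate
    (vocal / accompaniment magnitudes), Conv2d_2 lifts that estimate
    back to 64 channels, and Conv2d_3 produces per-pixel attention
    that gates the features passed to the next stage. Channel sizes
    follow Table 1; the wiring is an assumption.
    """

    def __init__(self, channels: int = 64, n_sources: int = 2):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, n_sources, kernel_size=1)  # Conv2d_1: 64 -> 2
        self.conv2 = nn.Conv2d(n_sources, channels, kernel_size=1)  # Conv2d_2: 2 -> 64
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=1)   # Conv2d_3: 64 -> 64

    def forward(self, feats: torch.Tensor):
        est = self.conv1(feats)                            # stage-level source estimate (512x64x2)
        attn = torch.sigmoid(self.conv3(self.conv2(est)))  # attention derived from the estimate
        out = feats * attn + feats                         # gated features for the next stage
        return out, est                                    # est is supervised by the stage loss
```

During training, each stage's `est` would be compared with the ground-truth vocal and accompaniment magnitude spectrograms; this per-stage supervision is what would make the attention "supervised".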
-
Key words:
- singing voice separation
- neural network
- multi-stage progressive
- adaptive
- supervised attention
-
Table 1. Subnetwork structure parameters
| Layer | Parameters | Input size | Output size |
| --- | --- | --- | --- |
| Initial conv layer | $d_c = 7$, 1, 1, $s_c = 1$ | 512×64×1 | 512×64×64 |
| En_GAC_1 | $\varepsilon = 10^{-5}$, $d_c = 3$, $s_c = 1$, $p = 2$ | 512×64×64 | 512×32×64 |
| En_GAC_2 | $\varepsilon = 10^{-5}$, $d_c = 3$, $s_c = 1$, $p = 2$ | 256×32×64 | 128×16×128 |
| En_GAC_3 | $\varepsilon = 10^{-5}$, $d_c = 3$, $s_c = 1$, $p = 2$ | 128×16×128 | 64×8×192 |
| En_GAC_4 | $\varepsilon = 10^{-5}$, $d_c = 3$, $s_c = 1$, $p = 2$ | 64×8×192 | 32×4×256 |
| GCT+Conv2d | $d_c = 3$, $s_c = 1$, $\varepsilon = 10^{-5}$ | 32×4×256 | 32×4×320 |
| De_GAC_4 | $d_c = 3$, $s_c = 1$, $\varepsilon = 10^{-5}$ | 32×4×320 | 64×8×256 |
| De_GAC_3 | $d_c = 3$, $s_c = 1$, $\varepsilon = 10^{-5}$ | 64×8×256 | 128×16×192 |
| De_GAC_2 | $d_c = 3$, $s_c = 1$, $\varepsilon = 10^{-5}$ | 128×16×192 | 256×32×128 |
| De_GAC_1 | $d_c = 3$, $s_c = 1$, $\varepsilon = 10^{-5}$ | 256×32×128 | 512×64×64 |
| Output conv layer | $d_c = 3$, 1, $s_c = 1$ | 512×64×64 | 512×64×64 |
| Conv2d_1 | $d_c = 1$, $s_c = 1$ | 512×64×64 | 512×64×2 |
| Conv2d_2 | $d_c = 1$, $s_c = 1$ | 512×64×2 | 512×64×64 |
| Conv2d_3 | $d_c = 1$, $s_c = 1$ | 512×64×64 | 512×64×64 |
| SA | $d_c = 1$, $s_c = 1$ | 512×64×64 | 512×64×64 |
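The $\varepsilon = 10^{-5}$ entries in Table 1 appear to be the numerical-stability constant of the gated channel transformation (GCT) layer from ref. [18], which presumably sits inside each GAC unit. A minimal PyTorch sketch of GCT as I read ref. [18] (not the authors' released code):

```python
import torch
import torch.nn as nn

class GCT(nn.Module):
    """Gated channel transformation (after Yang et al., ref. [18])."""

    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global embedding: per-channel l2 norm over the time-frequency plane.
        embed = self.alpha * x.pow(2).sum(dim=(2, 3), keepdim=True).add(self.eps).sqrt()
        # Channel normalization, so channels compete for gate magnitude.
        norm = embed / embed.pow(2).mean(dim=1, keepdim=True).add(self.eps).sqrt()
        # tanh gating can both amplify (cooperation) and suppress (competition) channels.
        return x * (1.0 + torch.tanh(self.gamma * norm + self.beta))
```

The tanh gate is what lets channels be encouraged or suppressed relative to one another, matching the abstract's "competition and cooperation between feature channels".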
Table 2. Effect of the number of subnetworks on separation performance
Unit: dB

| Subnetworks | GNSDR (vocals) | GNSDR (accomp.) | GSIR (vocals) | GSIR (accomp.) | GSAR (vocals) | GSAR (accomp.) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 12.02 | 10.63 | 18.41 | 14.76 | 13.55 | 12.83 |
| 2 | 12.14 | 10.67 | 18.62 | 14.78 | 13.61 | 12.96 |
| 3 | 12.21 | 10.68 | 18.76 | 14.81 | 13.68 | 13.17 |
| 4 | 12.24 | 10.73 | 18.85 | 14.88 | 13.68 | 13.31 |
| 5 | 12.22 | 10.70 | 18.86 | 14.86 | 13.60 | 13.30 |
| 6 | 12.13 | 10.65 | 18.71 | 14.72 | 13.50 | 13.11 |
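Table 2 peaks at four subnetworks, which matches the configuration used in the other tables. A hedged sketch of how the stages could chain through SA modules (reusing the hypothetical `SupervisedAttention` above; `SubNet` is a stand-in for the Table 1 encoder-decoder, not the authors' implementation):

```python
import torch
import torch.nn as nn

class SubNet(nn.Module):
    """Placeholder for one encoder-decoder stage of Table 1 (assumption)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.inp = nn.Conv2d(1, channels, kernel_size=7, padding=3)           # initial conv layer
        self.body = nn.Conv2d(channels, channels, kernel_size=3, padding=1)   # stands in for the GAC hourglass

    def forward(self, mix_mag, feats=None):
        x = self.inp(mix_mag)
        if feats is not None:   # fuse the features gated by the previous SA module
            x = x + feats
        return self.body(x)

class MultiStageNet(nn.Module):
    """Four-stage progressive stack; the chaining is an assumption."""

    def __init__(self, n_stages: int = 4, channels: int = 64):
        super().__init__()
        self.stages = nn.ModuleList([SubNet(channels) for _ in range(n_stages)])
        self.sa = nn.ModuleList([SupervisedAttention(channels) for _ in range(n_stages - 1)])
        self.head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)  # final vocal/accompaniment estimate

    def forward(self, mix_mag):
        estimates, feats = [], None
        for i, stage in enumerate(self.stages):
            feats = stage(mix_mag, feats)        # each stage re-reads the mixture spectrogram
            if i < len(self.sa):
                feats, est = self.sa[i](feats)
                estimates.append(est)            # intermediate estimates get their own loss term
        estimates.append(self.head(feats))
        return estimates
```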
Table 3. Effects of different modules on network separation performance
Unit: dB

| Model | GNSDR (vocals) | GNSDR (accomp.) | GSIR (vocals) | GSIR (accomp.) | GSAR (vocals) | GSAR (accomp.) |
| --- | --- | --- | --- | --- | --- | --- |
| SHN[13] | 10.51 | 9.88 | 16.01 | 14.24 | 12.53 | 12.36 |
| GAC-SHN | 11.98 | 10.65 | 18.35 | 14.73 | 13.45 | 12.78 |
| GA-SHN | 10.89 | 10.32 | 17.64 | 14.36 | 12.76 | 12.66 |
| GAC-GA-SHN | 12.01 | 10.69 | 18.25 | 14.78 | 13.66 | 12.91 |
| GAC-GA-SA-SHN (proposed) | 12.24 | 10.73 | 18.85 | 14.88 | 13.64 | 13.31 |
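The GA rows in Table 3 refer to the gated attention mechanism on the encoder-decoder skip connections. A minimal sketch of one plausible attention-gate design (an assumption based on the abstract's description, not the paper's released code):

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Hypothetical gate on an encoder-decoder skip connection.

    The deep (decoder) feature decides, bin by bin, how much of the
    shallow (encoder) feature to let through, reducing the semantic
    gap when the two are fused. Assumes the decoder feature has
    already been upsampled to the encoder's resolution.
    """

    def __init__(self, enc_ch: int, dec_ch: int, mid_ch: int):
        super().__init__()
        self.w_enc = nn.Conv2d(enc_ch, mid_ch, kernel_size=1)
        self.w_dec = nn.Conv2d(dec_ch, mid_ch, kernel_size=1)
        self.psi = nn.Conv2d(mid_ch, 1, kernel_size=1)

    def forward(self, enc_feat, dec_feat):
        a = torch.relu(self.w_enc(enc_feat) + self.w_dec(dec_feat))
        gate = torch.sigmoid(self.psi(a))   # (B, 1, F, T) gate in [0, 1]
        return enc_feat * gate              # gated skip passed to the decoder
```

Table 3 is consistent with this reading: GA alone gives a modest gain over SHN, while GAC contributes most of the improvement and SA adds the rest.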
Table 4. Comparison of separation performance of different methods on the MIR-1K dataset[9,14-16,19-21]
Unit: dB

| Model | GNSDR (vocals) | GNSDR (accomp.) | GSIR (vocals) | GSIR (accomp.) | GSAR (vocals) | GSAR (accomp.) |
| --- | --- | --- | --- | --- | --- | --- |
| MLRR[19] | 3.85 | 4.19 | 5.63 | 7.80 | 10.70 | 8.22 |
| DRNN[20] | 7.45 | – | 13.08 | – | 9.68 | – |
| ModGD[21] | 7.50 | – | 13.73 | – | 9.45 | – |
| U-Net[9] | 7.43 | 7.45 | 11.79 | 11.43 | 10.42 | 10.38 |
| U-Net-SE[16] | 7.49 | 7.46 | 11.78 | 11.42 | 10.38 | 10.41 |
| FC-Net[16] | 10.85 | 9.91 | 16.95 | 14.09 | 12.54 | 12.64 |
| TSMS-G-CS[16] | 10.93 | 9.96 | 17.02 | 14.13 | 12.59 | 12.68 |
| SHN[15] | 10.51 | 9.88 | 16.01 | 14.24 | 12.53 | 12.36 |
| SE-SHN[16] | 10.48 | 9.86 | 15.99 | 14.21 | 12.52 | 12.35 |
| SA-SHN[14] | 11.66 | 10.60 | 17.65 | 14.92 | 13.38 | 13.17 |
| P-SHN[15] | 10.83 | 9.89 | 16.54 | 14.01 | 12.67 | 12.65 |
| Proposed | 12.24 | 10.73 | 18.85 | 14.88 | 13.68 | 13.31 |
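GNSDR, GSIR, and GSAR in Tables 2 through 4 are the usual length-weighted global metrics for MIR-1K. A sketch of how they could be computed with the real `mir_eval` package (the length weighting and the NSDR baseline follow the standard convention, assumed here rather than quoted from the paper):

```python
import numpy as np
import mir_eval.separation as sep

def global_metrics(refs, ests, mixes):
    """Length-weighted GNSDR/GSIR/GSAR over a test set (assumed protocol).

    refs, ests: lists of (2, n_samples) arrays (vocals, accompaniment);
    mixes:      list of (n_samples,) mixture signals.
    NSDR = SDR(estimate, reference) - SDR(mixture, reference).
    """
    nsdr, sir, sar, total = np.zeros(2), np.zeros(2), np.zeros(2), 0.0
    for ref, est, mix in zip(refs, ests, mixes):
        sdr_c, sir_c, sar_c, _ = sep.bss_eval_sources(ref, est)
        # SDR of the unprocessed mixture serves as the NSDR baseline.
        sdr_0, _, _, _ = sep.bss_eval_sources(ref, np.stack([mix, mix]))
        w = ref.shape[1]            # weight each clip by its length
        nsdr += w * (sdr_c - sdr_0)
        sir += w * sir_c
        sar += w * sar_c
        total += w
    return nsdr / total, sir / total, sar / total  # GNSDR, GSIR, GSAR
```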
Table 5. MOS scores of singing voice and accompaniment for five music segments from the MIR-1K dataset
| Song | Source | SHN (M) | SHN (F) | SHN (avg) | Proposed (M) | Proposed (F) | Proposed (avg) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| annar_5_03 | Vocals | 3.32 | 3.31 | 3.31 | 3.51 | 3.53 | 3.52 |
| annar_5_03 | Accompaniment | 3.43 | 3.42 | 3.42 | 3.45 | 3.42 | 3.43 |
| Bobon_2_06 | Vocals | 3.25 | 3.28 | 3.26 | 3.42 | 3.43 | 3.42 |
| Bobon_2_06 | Accompaniment | 3.39 | 3.40 | 3.39 | 3.40 | 3.41 | 3.40 |
| Kenshin_3_08 | Vocals | 3.26 | 3.23 | 3.24 | 3.36 | 3.34 | 3.35 |
| Kenshin_3_08 | Accompaniment | 3.46 | 3.42 | 3.44 | 3.43 | 3.47 | 3.45 |
| Khair_6_06 | Vocals | 3.36 | 3.34 | 3.35 | 3.48 | 3.50 | 3.49 |
| Khair_6_06 | Accompaniment | 3.32 | 3.36 | 3.33 | 3.32 | 3.33 | 3.32 |
| Yifen_1_07 | Vocals | 3.38 | 3.40 | 3.39 | 3.56 | 3.58 | 3.57 |
| Yifen_1_07 | Accompaniment | 3.43 | 3.45 | 3.44 | 3.42 | 3.45 | 3.43 |

M/F denote male and female raters.
-
[1] RAFII Z, PARDO B. Repeating pattern extraction technique (REPET): a simple method for music/voice separation[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2013, 21(1): 73-84. doi: 10.1109/TASL.2012.2213249
[2] HUANG P S, CHEN S D, SMARAGDIS P, et al. Singing-voice separation from monaural recordings using robust principal component analysis[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2012: 57-60.
[3] GRAIS E M, ERDOGAN H. Single channel speech music separation using nonnegative matrix factorization with sliding windows and spectral masks[C]//Proceedings of the Interspeech 2011. Copenhagen: ISCA, 2011: 1773-1776.
[4] UHLICH S, GIRON F, MITSUFUJI Y. Deep neural network based instrument extraction from music[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2015: 2135-2139.
[5] SPRECHMANN P, BRUNA J, LECUN Y. Audio source separation with discriminative scattering networks[C]//Proceedings of the Latent Variable Analysis and Signal Separation. Berlin: Springer, 2015: 259-267.
[6] ZHANG T Q, XIONG M, ZHANG T, et al. A separation method of singing and accompaniment combining discriminative training deep neural network[J]. Acta Acustica, 2019, 44(3): 393-400 (in Chinese).
[7] CHEN J T, WANG D L. Long short-term memory for speaker generalization in supervised speech separation[J]. The Journal of the Acoustical Society of America, 2017, 141(6): 4705. doi: 10.1121/1.4986931
[8] ZHANG T. Research on separation method of vocal accompaniment in single channel music signal[D]. Chongqing: Chongqing University of Posts and Telecommunications, 2020: 43-57 (in Chinese).
[9] JANSSON A, HUMPHREY E, MONTECCHIO N, et al. Singing voice separation with deep U-Net convolutional networks[C]//Proceedings of the 18th International Society for Music Information Retrieval Conference. [S.l.]: DBLP, 2017: 745-751.
[10] STOLLER D, EWERT S, DIXON S. Wave-U-Net: a multi-scale neural network for end-to-end audio source separation[EB/OL]. (2018-06-08)[2023-06-01]. http://arxiv.org/abs/1806.03185v1.
[11] DÉFOSSEZ A, USUNIER N, BOTTOU L, et al. Demucs: deep extractor for music sources with extra unlabeled data remixed[EB/OL]. (2019-09-03)[2023-06-01]. http://arxiv.org/abs/1909.01174v1.
[12] IBTEHAZ N, RAHMAN M S. MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation[J]. Neural Networks, 2020, 121: 74-87. doi: 10.1016/j.neunet.2019.08.025
[13] PARK S, KIM T, LEE K, et al. Music source separation using stacked hourglass networks[EB/OL]. (2018-06-22)[2023-06-01]. http://arxiv.org/abs/1805.08559v2.
[14] YUAN W T, WANG S B, LI X R, et al. A skip attention mechanism for monaural singing voice separation[J]. IEEE Signal Processing Letters, 2019, 26(10): 1481-1485. doi: 10.1109/LSP.2019.2935867
[15] BHATTARAI B, PANDEYA Y R, LEE J. Parallel stacked hourglass network for music source separation[J]. IEEE Access, 2020, 8: 206016-206027. doi: 10.1109/ACCESS.2020.3037773
[16] MAI F. Research on music source separation algorithm based on deep convolution neural network and its application[D]. Chengdu: University of Electronic Science and Technology of China, 2021: 20-70 (in Chinese).
[17] SUBAKAN C, RAVANELLI M, CORNELL S, et al. Attention is all you need in speech separation[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2021: 21-25.
[18] YANG Z X, ZHU L C, WU Y, et al. Gated channel transformation for visual recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 11791-11800.
[19] YANG Y H. Low-rank representation of both singing voice and music accompaniment via learned dictionaries[C]//Proceedings of the 14th International Society for Music Information Retrieval Conference. Curitiba: [s.n.], 2013: 427-432.
[20] HUANG P S, KIM M, HASEGAWA-JOHNSON M, et al. Joint optimization of masks and deep recurrent neural networks for monaural source separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(12): 2136-2147. doi: 10.1109/TASLP.2015.2468583
[21] SEBASTIAN J, MURTHY H A. Group delay based music source separation using deep recurrent neural networks[C]//Proceedings of the International Conference on Signal Processing and Communications. Piscataway: IEEE Press, 2016: 1-5.
[22] DÉFOSSEZ A, USUNIER N, BOTTOU L, et al. Music source separation in the waveform domain[EB/OL]. (2021-04-28)[2023-06-01]. http://arxiv.org/abs/1911.13254v2.
[23] YUAN W T, DONG B F, WANG S B, et al. Evolving multi-resolution pooling CNN for monaural singing voice separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 807-822. doi: 10.1109/TASLP.2021.3051331
[24] TAKAHASHI N, MITSUFUJI Y, TAKAHASHI N, et al. D3Net: densely connected multidilated DenseNet for music source separation[EB/OL]. (2021-05-27)[2023-06-01]. http://arxiv.org/abs/2010.01733v4.
[25] LAI W H, WANG S L. RPCA-DRNN technique for monaural singing voice separation[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2022, 2022: 4. doi: 10.1186/s13636-022-00236-9
[26] NI X, REN J. FC-U2-Net: a novel deep neural network for singing voice separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 489-494. doi: 10.1109/TASLP.2022.3140561
-

