
Singing voice separation method using multi-stage progressive gated convolutional network

LUO Qingyu, ZHANG Tianqi, XIONG Tian

Citation: LUO Q Y, ZHANG T Q, XIONG T. Singing voice separation method using multi-stage progressive gated convolutional network[J]. Journal of Beijing University of Aeronautics and Astronautics, 2025, 51(9): 2872-2881 (in Chinese). doi: 10.13700/j.bh.1001-5965.2023.0419


doi: 10.13700/j.bh.1001-5965.2023.0419

Corresponding author:

    E-mail: zhangtq@cqupt.edu.cn

  • CLC number: TN912.3


Funds: 

National Natural Science Foundation of China (61771085); Natural Science Foundation of Chongqing, China (cstc2021jcyj-msxmX0836); Research Project of Chongqing Educational Commission (KJ1600427, KJ1600429)

  • Abstract:

    To address the semantic discrepancy that arises when existing convolutional neural network (CNN)-based singing voice separation methods fuse high- and low-level features, and their neglect of the latent value of vocal features along the channel dimension, a stacked multi-stage progressive gated convolutional network is proposed for singing voice separation. In each stage's subnetwork, a gated adaptive convolution (GAC) unit is designed to fully learn and extract the time-frequency features of a song and to strengthen the competitive and cooperative relationships among feature channels. To reduce the semantic error when fusing information from shallow and deep layers, a gated attention mechanism is introduced between the encoder and decoder layers of each subnetwork. Between stages, a supervised attention (SA) module is proposed to selectively pass on effective information flow and to realize progressive learning across the multi-stage network. Comprehensive comparative experiments on two public datasets show that the proposed method offers consistent advantages over representative recent models in separating both the singing voice and the accompaniment.
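    Figure 3 only depicts the GAC unit schematically; as a rough PyTorch illustration of the gating idea it builds on, the sketch below pairs a feature branch with a sigmoid gate branch that modulates each time-frequency feature per channel. This is a minimal sketch of a generic gated convolution, not the authors' exact GAC design; the class and layer names are hypothetical.

```python
# Minimal sketch of a gated convolution block (hypothetical names; the
# paper's GAC unit additionally adapts the gating per channel, which is
# not reproduced here).
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.ELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The sigmoid branch decides, per channel and time-frequency bin,
        # how much of the feature branch's output passes through.
        return self.norm(self.act(self.feature(x)) * torch.sigmoid(self.gate(x)))
```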

     

  • Figure 1.  Multi-stage progressive gated convolutional network

    Figure 2.  Gated attention mechanism

    Figure 3.  Gated adaptive convolution unit

    Figure 4.  Supervised attention module

    Figure 5.  Gated convolutional subnetwork

    Figure 6.  Spectrograms

    Figure 7.  Comparison of parameter quantities

    Table 1.  Subnetwork structure parameters

    Layer    Layer parameters    Input dimensions    Output dimensions
    Initial conv layer    $d_c = 7$, 1, 1, $s_c = 1$    512×64×1    512×64×64
    En_GAC_1    $\varepsilon = 10^{-5}$, $d_c = 3$, $s_c = 1$, $p = 2$    512×64×64    512×32×64
    En_GAC_2    $\varepsilon = 10^{-5}$, $d_c = 3$, $s_c = 1$, $p = 2$    256×32×64    128×16×128
    En_GAC_3    $\varepsilon = 10^{-5}$, $d_c = 3$, $s_c = 1$, $p = 2$    128×16×128    64×8×192
    En_GAC_4    $\varepsilon = 10^{-5}$, $d_c = 3$, $s_c = 1$, $p = 2$    64×8×192    32×4×256
    GCT+Conv2d    $d_c = 3$, $s_c = 1$, $\varepsilon = 10^{-5}$    32×4×256    32×4×320
    De_GAC_4    $d_c = 3$, $s_c = 1$, $\varepsilon = 10^{-5}$    32×4×320    64×8×256
    De_GAC_3    $d_c = 3$, $s_c = 1$, $\varepsilon = 10^{-5}$    64×8×256    128×16×192
    De_GAC_2    $d_c = 3$, $s_c = 1$, $\varepsilon = 10^{-5}$    128×16×192    256×32×128
    De_GAC_1    $d_c = 3$, $s_c = 1$, $\varepsilon = 10^{-5}$    256×32×128    512×64×64
    Output conv layer    $d_c = 3$, 1, $s_c = 1$    512×64×64    512×64×64
    Conv2d_1    $d_c = 1$, $s_c = 1$    512×64×64    512×64×2
    Conv2d_2    $d_c = 1$, $s_c = 1$    512×64×2    512×64×64
    Conv2d_3    $d_c = 1$, $s_c = 1$    512×64×64    512×64×64
    SA    $d_c = 1$, $s_c = 1$    512×64×64    512×64×64
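    The last four rows of Table 1 (Conv2d_1 through SA) trace the inter-stage path: 64-channel features are projected to a 2-channel intermediate estimate and then back to 64 channels to gate what flows into the next subnetwork. The sketch below is one plausible reading of that wiring, assuming an MPRNet-style supervised attention design; the paper's exact SA module may differ, and all names are hypothetical.

```python
# Minimal sketch of a supervised attention (SA) module consistent with the
# Conv2d_1/Conv2d_2/Conv2d_3 rows of Table 1 (hypothetical names).
import torch
import torch.nn as nn

class SupervisedAttention(nn.Module):
    def __init__(self, feat_ch: int = 64, est_ch: int = 2):
        super().__init__()
        self.to_est = nn.Conv2d(feat_ch, est_ch, 1)    # Conv2d_1: 64 -> 2
        self.to_attn = nn.Conv2d(est_ch, feat_ch, 1)   # Conv2d_2: 2 -> 64
        self.to_feat = nn.Conv2d(feat_ch, feat_ch, 1)  # Conv2d_3: 64 -> 64

    def forward(self, feats: torch.Tensor):
        # Intermediate 2-channel estimate (e.g., vocal/accompaniment masks),
        # which the training loss can supervise directly at this stage.
        est = self.to_est(feats)
        # Attention maps derived from the estimate gate the stage's features,
        # so only effective information flows into the next subnetwork.
        attn = torch.sigmoid(self.to_attn(est))
        out = self.to_feat(feats) * attn + feats
        return out, est
```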

    Table 2.  Effect of the number of subnetworks on network separation performance (unit: dB)

    Number of subnetworks    GNSDR (vocals, accomp.)    GSIR (vocals, accomp.)    GSAR (vocals, accomp.)
    1 12.02 10.63 18.41 14.76 13.55 12.83
    2 12.14 10.67 18.62 14.78 13.61 12.96
    3 12.21 10.68 18.76 14.81 13.68 13.17
    4 12.24 10.73 18.85 14.88 13.68 13.31
    5 12.22 10.70 18.86 14.86 13.60 13.30
    6 12.13 10.65 18.71 14.72 13.50 13.11
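    For context, the global metrics reported in Tables 2 to 4 (GNSDR, GSIR, GSAR) are conventionally computed with BSS Eval per clip and then averaged with clip-length weights, where NSDR is the SDR of the estimate minus the SDR obtained when the raw mixture itself is used as the estimate [20]. Below is a minimal sketch using the mir_eval package; this is an assumption about tooling, since the paper does not state its evaluation code.

```python
# Sketch of the conventional GNSDR/GSIR/GSAR evaluation (assumes mir_eval).
import numpy as np
import mir_eval

def clip_metrics(ref_vocal, ref_accomp, est_vocal, est_accomp, mixture):
    """NSDR/SIR/SAR for one clip, sources ordered (vocal, accompaniment)."""
    refs = np.stack([ref_vocal, ref_accomp])
    ests = np.stack([est_vocal, est_accomp])
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(refs, ests)
    # Baseline SDR: score the unprocessed mixture against each reference.
    sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(
        refs, np.stack([mixture, mixture]))
    return sdr - sdr_mix, sir, sar  # NSDR, SIR, SAR (each of shape (2,))

def global_metric(per_clip_values, clip_lengths):
    """Length-weighted mean over the test set (the 'G' in GNSDR/GSIR/GSAR)."""
    w = np.asarray(clip_lengths, dtype=float)
    v = np.asarray(per_clip_values, dtype=float)
    return float(np.sum(w * v) / np.sum(w))
```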

    Table 3.  Effects of different modules on network separation performance (unit: dB)

    Network model    GNSDR (vocals, accomp.)    GSIR (vocals, accomp.)    GSAR (vocals, accomp.)
    SHN[13] 10.51 9.88 16.01 14.24 12.53 12.36
    GAC-SHN 11.98 10.65 18.35 14.73 13.45 12.78
    GA-SHN 10.89 10.32 17.64 14.36 12.76 12.66
    GAC-GA-SHN 12.01 10.69 18.25 14.78 13.66 12.91
    GAC-GA-SA-SHN (proposed) 12.24 10.73 18.85 14.88 13.64 13.31

    Table 4.  Comparison of separation performance of different methods on the MIR-1K dataset[9,14-16,19-21] (unit: dB)

    Model    GNSDR (vocals, accomp.)    GSIR (vocals, accomp.)    GSAR (vocals, accomp.)
    MLRR[19] 3.85 4.19 5.63 7.80 10.70 8.22
    DRNN[20] 7.45 - 13.08 - 9.68 -
    ModGD[21] 7.50 - 13.73 - 9.45 -
    U-Net[9] 7.43 7.45 11.79 11.43 10.42 10.38
    U-Net-SE[16] 7.49 7.46 11.78 11.42 10.38 10.41
    FC-Net[16] 10.85 9.91 16.95 14.09 12.54 12.64
    TSMS-G-CS[16] 10.93 9.96 17.02 14.13 12.59 12.68
    SHN[15] 10.51 9.88 16.01 14.24 12.53 12.36
    SE-SHN[16] 10.48 9.86 15.99 14.21 12.52 12.35
    SA-SHN[14] 11.66 10.60 17.65 14.92 13.38 13.17
    P-SHN[15] 10.83 9.89 16.54 14.01 12.67 12.65
    Proposed 12.24 10.73 18.85 14.88 13.68 13.31

    Table 5.  MOS scores of singing voice and accompaniment for 5 music segments from the MIR-1K dataset

    Song name    SHN: vocals    SHN: accompaniment    Proposed: vocals    Proposed: accompaniment
    (each group lists two sub-scores followed by their average, Avg)
    annar_5_03 3.32 3.31 3.31 | 3.43 3.42 3.42 | 3.51 3.53 3.52 | 3.45 3.42 3.43
    Bobon_2_06 3.25 3.28 3.26 | 3.39 3.40 3.39 | 3.42 3.43 3.42 | 3.40 3.41 3.40
    Kenshin_3_08 3.26 3.23 3.24 | 3.46 3.42 3.44 | 3.36 3.34 3.35 | 3.43 3.47 3.45
    Khair_6_06 3.36 3.34 3.35 | 3.32 3.36 3.33 | 3.48 3.50 3.49 | 3.32 3.33 3.32
    Yifen_1_07 3.38 3.40 3.39 | 3.43 3.45 3.44 | 3.56 3.58 3.57 | 3.42 3.45 3.43

    Table 6.  Comparison of separation performance of different advanced methods on the MUSDB18 dataset[22-26] (unit: dB)

    Model    NSDR (vocals, accomp.)    SIR (vocals, accomp.)    SAR (vocals, accomp.)
    Conv-Tasnet[22] 6.81 12.69 14.30 17.61 6.87 13.77
    E-MRP-CNN[23] 6.36 12.99 13.68 16.18 6.60 14.41
    D3Net[24] 7.24 13.52 - - - -
    RPCA-DRNN[25] 6.41 8.70 19.53 24.77 6.87 15.78
    FC-U2-Net[26] 8.22 13.39 20.70 21.26 8.33 14.57
    Proposed 8.37 12.81 19.83 21.13 8.45 15.81
  • [1] RAFII Z, PARDO B. Repeating pattern extraction technique (REPET): a simple method for music/voice separation[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2013, 21(1): 73-84. doi: 10.1109/TASL.2012.2213249
    [2] HUANG P S, CHEN S D, SMARAGDIS P, et al. Singing-voice separation from monaural recordings using robust principal component analysis[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2012: 57-60.
    [3] GRAIS E M, ERDOGAN H. Single channel speech music separation using nonnegative matrix factorization with sliding windows and spectral masks[C]//Proceedings of the Interspeech 2011. Copenhagen: ISCA, 2011: 1773-1776.
    [4] UHLICH S, GIRON F, MITSUFUJI Y. Deep neural network based instrument extraction from music[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2015: 2135-2139.
    [5] SPRECHMANN P, BRUNA J, LECUN Y. Audio source separation with discriminative scattering networks[C]//Proceedings of the Latent Variable Analysis and Signal Separation. Berlin: Springer, 2015: 259-267.
    [6] ZHANG T Q, XIONG M, ZHANG T, et al. A separation method of singing and accompaniment combining discriminative training deep neural network[J]. Acta Acustica, 2019, 44(3): 393-400 (in Chinese).
    [7] CHEN J T, WANG D L. Long short-term memory for speaker generalization in supervised speech separation[J]. The Journal of the Acoustical Society of America, 2017, 141(6): 4705. doi: 10.1121/1.4986931
    [8] ZHANG T. Research on separation method of vocal accompaniment in single channel music signal[D]. Chongqing: Chongqing University of Posts and Telecommunications, 2020: 43-57 (in Chinese).
    [9] JANSSON A, HUMPHREY E, MONTECCHIO N, et al. Singing voice separation with deep U-Net convolutional networks[C]//Proceedings of the 18th International Society for Music Information Retrieval Conference. [S.l.]: DBLP, 2017: 745-751.
    [10] STOLLER D, EWERT S, DIXON S. Wave-U-Net: a multi-scale neural network for end-to-end audio source separation[EB/OL]. (2018-06-08)[2023-06-01]. http://arxiv.org/abs/1806.03185v1.
    [11] DÉFOSSEZ A, USUNIER N, BOTTOU L, et al. Demucs: deep extractor for music sources with extra unlabeled data remixed[EB/OL]. (2019-09-03)[2023-06-01]. http://arxiv.org/abs/1909.01174v1.
    [12] IBTEHAZ N, RAHMAN M S. MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation[J]. Neural Networks, 2020, 121: 74-87. doi: 10.1016/j.neunet.2019.08.025
    [13] PARK S, KIM T, LEE K, et al. Music source separation using stacked hourglass networks[EB/OL]. (2018-06-22)[2023-06-01]. http://arxiv.org/abs/1805.08559v2.
    [14] YUAN W T, WANG S B, LI X R, et al. A skip attention mechanism for monaural singing voice separation[J]. IEEE Signal Processing Letters, 2019, 26(10): 1481-1485. doi: 10.1109/LSP.2019.2935867
    [15] BHATTARAI B, PANDEYA Y R, LEE J. Parallel stacked hourglass network for music source separation[J]. IEEE Access, 2020, 8: 206016-206027. doi: 10.1109/ACCESS.2020.3037773
    [16] MAI F. Research on music source separation algorithm based on deep convolution neural network and its application[D]. Chengdu: University of Electronic Science and Technology of China, 2021: 20-70 (in Chinese).
    [17] SUBAKAN C, RAVANELLI M, CORNELL S, et al. Attention is all you need in speech separation[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2021: 21-25.
    [18] YANG Z X, ZHU L C, WU Y, et al. Gated channel transformation for visual recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 11791-11800.
    [19] YANG Y H. Low-rank representation of both singing voice and music accompaniment via learned dictionaries[C]//Proceedings of the 14th International Society for Music Information Retrieval Conference. Curitiba: [s.n.], 2013: 427-432.
    [20] HUANG P S, KIM M, HASEGAWA-JOHNSON M, et al. Joint optimization of masks and deep recurrent neural networks for monaural source separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(12): 2136-2147. doi: 10.1109/TASLP.2015.2468583
    [21] SEBASTIAN J, MURTHY H A. Group delay based music source separation using deep recurrent neural networks[C]//Proceedings of the International Conference on Signal Processing and Communications. Piscataway: IEEE Press, 2016: 1-5.
    [22] DÉFOSSEZ A, USUNIER N, BOTTOU L, et al. Music source separation in the waveform domain[EB/OL]. (2021-04-28)[2023-06-01]. http://arxiv.org/abs/1911.13254v2.
    [23] YUAN W T, DONG B F, WANG S B, et al. Evolving multi-resolution pooling CNN for monaural singing voice separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 807-822. doi: 10.1109/TASLP.2021.3051331
    [24] TAKAHASHI N, MITSUFUJI Y, TAKAHASHI N, et al. D3Net: densely connected multidilated DenseNet for music source separation[EB/OL]. (2021-05-27)[2023-06-01]. http://arxiv.org/abs/2010.01733v4.
    [25] LAI W H, WANG S L. RPCA-DRNN technique for monaural singing voice separation[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2022, 2022: 4. doi: 10.1186/s13636-022-00236-9
    [26] NI X, REN J. FC-U2-Net: a novel deep neural network for singing voice separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 489-494. doi: 10.1109/TASLP.2022.3140561
Publication history
  • Received: 2023-06-28
  • Accepted: 2023-11-03
  • Available online: 2023-11-24
  • Issue published: 2025-09-30
