




Advances in Multi-Speaker Diarization and Its Applications — March 2024

Outline
1. Research background
2. Industrial version — modular system
3. Improvements
4. Deployed applications

1. Research Background
Multi-speaker diarization (speaker diarization): given a recording in which several people speak in turns, the system must determine who is speaking in each time interval. (Diagram: audio → multi-speaker diarization system → segmentation information.)
Application scenarios: meeting minutes, multi-speaker transcription, intelligent customer service, call-recording quality inspection, etc.
Terminal devices: voice recorders, smartphones, personal computers. Supporting vendors: iFLYTEK (smart office notebook), (AI meeting minutes), 聲云 (speech transcription).

Competitions and datasets (timeline, 2000–2023): CALLHOME, Rich Transcription (RT), AMI, MIXER6, DIHARD (I, II, III), CHiME-6, VoxSRC (2020–2023), AliMeeting, M2MeT / AISHELL-4, M2MeT 2.0 / CHiME-7. Architectures have shifted from modular to end-to-end.
Research trend: simple scenarios → complex scenarios. Challenges: noise interference, unknown number of speakers, overlapped speech, etc. Applications: offline → online, single microphone → microphone arrays, adaptation to new scenarios.

1. Research Background — Modular Systems
Clustering methods: AHC [1], SC [2,3], VB/VBx [4,5], UIS-RNN [6], DNC [7]
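Diarization output, as described above, is commonly represented as (start, end, speaker) segments. As a minimal illustration (toy data, not taken from any of the systems cited here), frame-level speaker decisions can be merged into such segments:

```python
def frames_to_segments(labels, frame_dur=0.5):
    """Merge runs of identical frame-level speaker labels into
    (start_sec, end_sec, speaker) segments; None marks non-speech."""
    segments = []
    run_start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[run_start]:
            if labels[run_start] is not None:  # skip non-speech runs
                segments.append((run_start * frame_dur, i * frame_dur,
                                 labels[run_start]))
            run_start = i
    return segments

# Toy frame labels at 0.5 s resolution: spk1, spk1, silence, spk2, spk2, spk1
print(frames_to_segments(["spk1", "spk1", None, "spk2", "spk2", "spk1"]))
# → [(0.0, 1.0, 'spk1'), (1.5, 2.5, 'spk2'), (2.5, 3.0, 'spk1')]
```

The same (start, end, speaker) triples are what standard formats such as RTTM store, one segment per line.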
[1] K. C. Gowda and G. Krishna, "Agglomerative Clustering Using the Concept of Mutual Nearest Neighbourhood," Pattern Recognition, vol. 10, pp. 105–112, 1978.
[2] U. von Luxburg, "A Tutorial on Spectral Clustering," Statistics and Computing, vol. 17, pp. 395–416, 2007.
[3] T. Park, K. J. Han, M. Kumar, and S. S. Narayanan, "Auto-tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap," IEEE Signal Processing Letters, vol. 27, pp. 381–385, 2020.
[4] M. Diez, L. Burget, S. Wang, J. Rohdin, and H. Cernocky, "Bayesian HMM Based x-vector Clustering for Speaker Diarization," in Interspeech, 2019, pp. 346–350.
[5] M. Diez, L. Burget, F. Landini, and J. Cernocky, "Analysis of Speaker Diarization Based on Bayesian HMM with Eigenvoice Priors," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 355–368, 2020.
[6] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, "Fully Supervised Speaker Diarization," in ICASSP, 2019.
[7] Q. J. Li, F. L. Kreyssig, C. Zhang, and P. C. Woodland, "Discriminative Neural Clustering for Speaker Diarisation," in IEEE Spoken Language Technology Workshop (SLT 2021), Shenzhen, China, Jan. 2021.

1. Research Background — End-to-End Systems
EEND [1]: end-to-end model based on Bi-LSTM
SA-EEND [2]: end-to-end model based on the Transformer
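EEND, listed above, is trained with a permutation-free objective: because the order of the output speakers is arbitrary, the loss is the minimum binary cross-entropy over all permutations of the reference speaker labels. A toy NumPy sketch of that objective (the two-speaker data are invented for illustration and this is a simplification of the actual training setup):

```python
import itertools
import numpy as np

def bce(pred, ref, eps=1e-7):
    """Frame-wise binary cross-entropy; pred and ref are
    (frames, speakers) arrays of speech-activity probabilities/labels."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(ref * np.log(pred) + (1 - ref) * np.log(1 - pred)))

def permutation_free_loss(pred, ref):
    """Minimum BCE over all permutations of the reference speaker
    columns -- the permutation-free objective used to train EEND."""
    n_spk = ref.shape[1]
    return min(bce(pred, ref[:, list(perm)])
               for perm in itertools.permutations(range(n_spk)))

# Toy reference: 4 frames x 2 speakers; the last frame is overlapped speech.
ref = np.array([[1, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
# A prediction that is correct up to a swap of the two speakers.
pred = ref[:, [1, 0]] * 0.9 + 0.05
print(permutation_free_loss(pred, ref))  # small: best permutation matches
```

Note that the handling of overlapped frames falls out naturally: each speaker column is an independent binary activity, so frames where several columns are 1 need no special casing.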
[1] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, "End-to-End Neural Speaker Diarization with Permutation-free Objectives," in Interspeech, 2019, pp. 4300–4304.
[2] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, "End-to-End Neural Speaker Diarization with Self-Attention," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 2019, pp. 296–303.
[3] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, "End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors," in Interspeech, 2020, pp. 269–273.
[4] I. Medennikov, M. Korenevsky, et al., "Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario," arXiv:2005.07272, 2020.

1. Research Background — Summary of Clustering Algorithms

Algorithm | Training                | Input feature     | Overlap detection | Number of speakers
AHC       | unsupervised clustering | x-vector          | not supported     | threshold
VB        | unsupervised clustering | i-vector          | not supported     | VB initialization tuning
VBx       | unsupervised clustering | x-vector          | not supported     | initialization tuning
SC        | unsupervised clustering | x-vector          | not supported     | threshold / NME
UIS-RNN   | supervised clustering   | d-vector          | not supported     | output nodes
DNC       | supervised clustering   | d-vector          | not supported     | output nodes
EEND      | supervised clustering   | acoustic features | supported         | suited to 2 speakers
TS-VAD    | supervised clustering   | i-vector          | supported         | output nodes

Online versions: research has mainly focused on the EEND [1,2] and UIS-RNN [3,4] frameworks.
Microphone-array versions: multi-channel TS-VAD [5], or joint front-end/back-end optimization.
Specific scenarios: different scenarios call for different strategies [6].
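AHC, as summarized in the table, repeatedly merges the most similar pair of clusters and stops when the best similarity falls below a threshold; that threshold is what implicitly controls the predicted number of speakers. A minimal average-linkage sketch on toy embeddings (illustrative only, not the production implementation; the threshold value is arbitrary):

```python
import numpy as np

def ahc(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering with cosine similarity.
    Merging stops when the best pair's similarity drops below threshold,
    so the threshold implicitly decides the number of speakers."""
    x = np.asarray(embeddings, dtype=float)
    x /= np.linalg.norm(x, axis=1, keepdims=True)      # unit-normalize rows
    sim = x @ x.T                                      # cosine similarity
    clusters = [[i] for i in range(len(x))]
    while len(clusters) > 1:
        # find the pair of clusters with the best average-linkage similarity
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = sim[np.ix_(clusters[a], clusters[b])].mean()
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] += clusters.pop(b)                 # merge b into a
    labels = np.empty(len(x), dtype=int)
    for lab, members in enumerate(clusters):
        labels[members] = lab
    return labels

# Two well-separated toy "speakers" in embedding space
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(ahc(emb, threshold=0.5))  # → [0 0 1 1]
```

Raising the threshold splits the recording into more speakers; lowering it merges speakers — which is why threshold tuning is the main knob of the unsupervised methods in the table.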
[1] Y. Xue, S. Horiguchi, Y. Fujita, S. Watanabe, P. Garcia, and K. Nagamatsu, "Online End-to-End Neural Diarization with Speaker-Tracing Buffer," in IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 841–848.
[2] E. Han, C. Lee, and A. Stolcke, "BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers," in ICASSP, 2021.
[3] E. Fini and A. Brutti, "Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data," in ICASSP, 2020, pp. 7134–7138.
[4] X. Wan, K. Liu, and H. Zhou, "Online Speaker Diarization Equipped with Discriminative Modeling and Guided Inference," in Interspeech, 2021.
[5] I. Medennikov, M. Korenevsky, et al., "Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario," arXiv:2005.07272, 2020.
[6] Y.-X. Wang, J. Du, M.-K. He, S.-T. Niu, L. Sun, and C.-H. Lee, "Scenario-Dependent Speaker Diarization for the DIHARD-III Challenge," in Interspeech, 2021.

2. Industrial Version — Modular System
2.1 Audio Segmentation
Function: turns diarization into a clustering problem by cutting the detected speech into uniform segments.

2.2 Speaker Embedding Extraction
Function: extracts a segment-level speaker embedding (vector) for every segment.

2.3 Clustering — Agglomerative Hierarchical Clustering (AHC)
Function: groups segments from the same speaker together using AHC [1].

First-Generation Product (combined with ASV-Subtools*)
Pipeline: original audio → voice activity detection (VAD) → speaker diarization (SD) → speech recognition (ASR) → recognition post-processing → per-speaker output (speaker 1, speaker 2, speaker 3, speaker 4).
* /Snowdar/asv-subtools
Algorithm flow: VAD → uniform segmentation → x-vector extraction with Subtools → PCA dimensionality reduction → cosine scoring → AHC clustering.
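The scoring stage of the algorithm flow above — PCA reduction followed by cosine scoring — can be sketched in a few lines (toy random vectors stand in for the Subtools x-vectors; dimensions and component count are arbitrary):

```python
import numpy as np

def pca_reduce(x, n_components):
    """Project rows of x onto the top principal components via SVD."""
    x = np.asarray(x, dtype=float)
    x_centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    return x_centered @ vt[:n_components].T

def cosine_score_matrix(x):
    """Pairwise cosine similarity between row vectors."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

# Toy stand-ins for segment-level x-vectors (4 segments, 8 dimensions)
rng = np.random.default_rng(0)
xvectors = rng.normal(size=(4, 8))
scores = cosine_score_matrix(pca_reduce(xvectors, n_components=2))
print(np.round(scores, 2))  # symmetric matrix, diagonal is 1.0
```

The resulting score matrix is exactly what the AHC stage consumes: each entry says how likely two segments are to come from the same speaker.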
Remaining Problem — Overlapped Speech
Speaker overlap: has speech overlap occurred in the target region, and who overlaps with whom?
Figures and audio: https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all

3. Improvements — Neural Segmentation
Solution: split the audio into chunks and use a neural network to determine the speakers within each chunk, with at most 3 speakers per chunk. Chunks are 5 seconds long with a 2.5-second window shift.
When extracting x-vectors, overlapped speech is removed and speech from the same speaker within a chunk is merged.
H. Bredin, "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe."
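The chunking scheme above — 5-second windows moved by 2.5 seconds — can be enumerated as follows (window and shift values are from the slide; the function name and end-clipping behavior are illustrative assumptions):

```python
def sliding_windows(duration, win=5.0, shift=2.5):
    """Enumerate (start, end) windows covering `duration` seconds,
    using a `win`-second window moved by `shift` seconds each step.
    The final window is clipped to the end of the audio."""
    windows, start = [], 0.0
    while start < duration:
        windows.append((start, min(start + win, duration)))
        if start + win >= duration:
            break
        start += shift
    return windows

print(sliding_windows(12.0))
# → [(0.0, 5.0), (2.5, 7.5), (5.0, 10.0), (7.5, 12.0)]
```

With a 2.5-second shift, every instant of audio is covered by two overlapping windows, which is what lets the pipeline reconcile local per-chunk speaker decisions when stitching chunks back together.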