Advances in Multi-Speaker Diarization Technology and Applications (March 2024)

Outline
1. Research Background
2. Industrial Version: Modular System
3. Improvements
4. Deployed Applications

1. Research Background
Multi-speaker separation (speaker diarization): given a recording in which several people speak in turn, the system must decide who is speaking in each time interval. (Diagram: audio in → multi-speaker separation system → segmentation information out.)

1. Research Background
Application scenarios: meeting minutes, multi-speaker transcription, intelligent customer service, call-recording quality inspection, and so on.
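To make "who spoke when" concrete, here is a minimal illustration of what a diarization result looks like as data and in RTTM, the usual exchange format; all times, speaker labels, and the file name are made-up example values, not output of any system described here:

```python
# A minimal illustration of the diarization task: the input is audio,
# the output is "who spoke when" as labelled time segments.
# All numbers below are made-up example values.

segments = [
    # (start_sec, end_sec, speaker_label)
    (0.0, 3.2, "spk1"),
    (3.2, 7.9, "spk2"),
    (7.9, 9.5, "spk1"),
]

for start, end, spk in segments:
    # RTTM, the common exchange format, stores <start, duration, label>.
    print(f"SPEAKER meeting1 1 {start:.2f} {end - start:.2f} <NA> <NA> {spk} <NA> <NA>")
```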

Terminal devices: voice recorders, smartphones, personal computers. Supporting vendors: iFLYTEK (smart office notebook), (AI meeting minutes), 聲云 (speech transcription), and others.

1. Research Background
Competitions / datasets, 2000-2023 (timeline figure): CALLHOME, Rich Transcription (RT), AMI, MIXER6, DIHARD (I, II, III), CHiME-6, VoxSRC (2020-2023), AliMeeting, AISHELL-4, M2MeT, M2MeT 2.0, CHiME-7. Over this period, systems have moved from modular architectures toward end-to-end architectures.
Research trend: from simple scenarios to complex ones. Challenges: noise, unknown numbers of speakers, overlapped speech, and so on. Applications: offline => online, single microphone => microphone arrays, adaptation to new scenarios.

1. Research Background — Modular Systems
Clustering methods: AHC [1], SC [2,3], VB/VBx [4,5], UIS-RNN [6], DNC [7] (a minimal AHC sketch follows below).
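As an illustration of this clustering stage, the sketch below runs AHC over pre-computed segment embeddings using cosine distance and a hand-tuned stopping threshold. The embeddings are random placeholders, and the threshold value is an assumption one would tune on development data:

```python
# Minimal AHC sketch: group segment-level speaker embeddings by cosine
# distance. `embeddings` is a placeholder (n_segments, dim) array that a
# front-end (e.g. an x-vector extractor) is assumed to have produced.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 128))     # placeholder embeddings

dists = pdist(embeddings, metric="cosine")  # pairwise cosine distances
tree = linkage(dists, method="average")     # agglomerative merge tree
# Cut the tree at a distance threshold; the threshold is tuned on
# development data and directly determines the predicted speaker count.
labels = fcluster(tree, t=0.5, criterion="distance")
print(labels)                               # cluster id per segment
```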

References:
[1] K. C. Gowda and G. Krishna, "Agglomerative Clustering Using the Concept of Mutual Nearest Neighbourhood," Pattern Recognition, vol. 10, pp. 105-112, 1978.
[2] U. von Luxburg, "A Tutorial on Spectral Clustering," Statistics and Computing, vol. 17, pp. 395-416, 2007.
[3] T. Park, K. J. Han, M. Kumar, and S. S. Narayanan, "Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap," IEEE Signal Processing Letters, vol. 27, pp. 381-385, 2020.
[4] M. Diez, L. Burget, S. Wang, J. Rohdin, and H. Cernocky, "Bayesian HMM Based x-vector Clustering for Speaker Diarization," in Interspeech, 2019, pp. 346-350.
[5] M. Diez, L. Burget, F. Landini, and J. Cernocky, "Analysis of Speaker Diarization Based on Bayesian HMM with Eigenvoice Priors," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 355-368, 2020.
[6] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, "Fully Supervised Speaker Diarization," in ICASSP, 2019.
[7] Q. J. Li, F. L. Kreyssig, C. Zhang, and P. C. Woodland, "Discriminative Neural Clustering for Speaker Diarisation," in IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 2021.

1. Research Background — End-to-End Systems
EEND [1]: end-to-end model based on Bi-LSTMs.
SA-EEND [2]: end-to-end model based on a Transformer encoder.
EDA-EEND [3]: EEND variant that can predict the number of speakers.
…
TS-VAD [4]: voice activity detection model for target speakers.
(These models share a permutation-free training objective, sketched below.)
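EEND-style models are trained with a permutation-free objective [1]: since the order of speakers in the output is arbitrary, the loss is the minimum binary cross-entropy over all permutations of the reference speakers. A minimal NumPy sketch with made-up shapes, not claiming to match any paper's exact implementation:

```python
# Permutation-free (PIT) loss sketch for EEND-style training.
# y_pred: frame-wise speech activity per output slot, shape (T, S)
# y_true: reference activity per speaker, shape (T, S)
# The loss is the best (lowest) BCE over all slot<->speaker permutations.
from itertools import permutations
import numpy as np

def bce(p, y, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def pit_loss(y_pred, y_true):
    S = y_true.shape[1]
    # Evaluate BCE under every column permutation and keep the minimum;
    # feasible because S is small (2-4 speakers in typical EEND setups).
    return min(bce(y_pred[:, list(perm)], y_true)
               for perm in permutations(range(S)))

T, S = 100, 2
rng = np.random.default_rng(0)
print(pit_loss(rng.uniform(size=(T, S)), rng.integers(0, 2, size=(T, S))))
```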

References:
[1] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, "End-to-End Neural Speaker Diarization with Permutation-Free Objectives," in Interspeech, 2019, pp. 4300-4304.
[2] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, "End-to-End Neural Speaker Diarization with Self-Attention," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 2019, pp. 296-303.
[3] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, "End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors," in Interspeech, 2020, pp. 269-273.
[4] I. Medennikov, M. Korenevsky, et al., "Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario," arXiv preprint arXiv:2005.07272, 2020.

1. Research Background — Clustering Algorithms at a Glance

Algorithm | Training                | Input features    | Overlap detection | Speaker-count prediction
AHC       | unsupervised clustering | x-vector          | not supported     | threshold
SC        | unsupervised clustering | x-vector          | not supported     | threshold / NME
VB        | unsupervised clustering | i-vector          | not supported     | initialization tuning
VBx       | unsupervised clustering | x-vector          | not supported     | initialization tuning
UIS-RNN   | supervised clustering   | d-vector          | not supported     | supported
DNC       | supervised clustering   | d-vector          | not supported     | output nodes
EEND      | supervised clustering   | acoustic features | supported         | output nodes (original version suits 2 speakers)
TS-VAD    | supervised clustering   | i-vector          | supported         | output nodes

Online versions: research concentrates on the EEND [1,2] and UIS-RNN [3,4] frameworks (a toy online-assignment sketch follows below).
Microphone-array versions: multi-channel-input TS-VAD [5] or joint front-end/back-end optimization.
Specific scenarios: different scenarios call for different strategies [6].
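To illustrate what "online" means here, below is a deliberately simplified greedy assignment loop, not UIS-RNN or online EEND themselves: each arriving segment embedding is matched to the closest running speaker centroid, or opens a new speaker when similarity falls below a threshold. The threshold and all names are illustrative assumptions:

```python
# Simplified online diarization sketch: greedy assignment of streaming
# segment embeddings to speaker centroids. This is a toy baseline, not
# UIS-RNN or online EEND; those replace this heuristic with learned models.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

centroids, counts = [], []          # running mean embedding per speaker

def assign(emb, threshold=0.6):     # threshold is an illustrative value
    scores = [cosine(emb, c) for c in centroids]
    if scores and max(scores) >= threshold:
        k = int(np.argmax(scores))
        counts[k] += 1              # update the matched centroid online
        centroids[k] += (emb - centroids[k]) / counts[k]
    else:
        k = len(centroids)          # unseen voice: open a new speaker
        centroids.append(emb.copy())
        counts.append(1)
    return k                        # speaker id for this segment

rng = np.random.default_rng(0)
for emb in rng.normal(size=(6, 128)):
    print(assign(emb))
```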

References:
[1] Y. Xue, S. Horiguchi, Y. Fujita, S. Watanabe, P. Garcia, and K. Nagamatsu, "Online End-to-End Neural Diarization with Speaker-Tracing Buffer," in IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 841-848.
[2] E. Han, C. Lee, and A. Stolcke, "BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers," in ICASSP, 2021.
[3] E. Fini and A. Brutti, "Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data," in ICASSP, 2020, pp. 7134-7138.
[4] X. Wan, K. Liu, and H. Zhou, "Online Speaker Diarization Equipped with Discriminative Modeling and Guided Inference," in Interspeech, 2021.
[5] I. Medennikov, M. Korenevsky, et al., "Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario," arXiv preprint arXiv:2005.07272, 2020.
[6] Y.-X. Wang, J. Du, M.-K. He, S.-T. Niu, L. Sun, and C.-H. Lee, "Scenario-Dependent Speaker Diarization for DIHARD-III Challenge," in Interspeech, 2021.

2. Industrial Version — Modular System

2.1 Audio segmentation
Function: cutting the audio into short segments turns diarization into a clustering problem (a sketch follows below).
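A minimal sketch of the uniform segmentation step: speech regions (assumed to come from a VAD stage) are cut into fixed-length sub-segments, each of which later receives one embedding. The window and shift values are illustrative:

```python
# Uniform segmentation sketch: cut VAD speech regions into fixed-length
# sub-segments so that each sub-segment can be embedded and clustered.
def uniform_segments(speech_regions, win=1.5, shift=0.75):
    """speech_regions: sorted list of (start_sec, end_sec) from a VAD front-end."""
    out = []
    for start, end in speech_regions:
        t = start
        while t + win <= end:
            out.append((t, t + win))
            t += shift
        if not out or out[-1][1] < end:      # keep the leftover tail
            out.append((max(start, end - win), end))
    return out

print(uniform_segments([(0.0, 4.0), (6.2, 7.0)]))
```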

2.2 Speaker-representation extraction (segment → vector)
Function: extract a segment-level speaker embedding for each segment.

2.3 Clustering — agglomerative hierarchical clustering (AHC)
Function: cluster the segments that belong to the same speaker.
K. C. Gowda and G. Krishna, "Agglomerative Clustering Using the Concept of Mutual Nearest Neighbourhood," Pattern Recognition, vol. 10, pp. 105-112, 1978.

2. Industrial Version — Modular System
First-generation product (integrated with ASV-Subtools*): original audio → voice activity detection (VAD) → speaker diarization (SD) → speech recognition (ASR) → recognition post-processing, yielding per-speaker turns (speaker 1, speaker 2, speaker 3, speaker 4).
Algorithm pipeline: VAD -> uniform segmentation -> x-vector extraction with Subtools -> PCA dimensionality reduction -> cosine scoring -> AHC clustering (a sketch of this back-end follows below).
* https://github.com/Snowdar/asv-subtools
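A compact sketch of the first-generation back-end, assuming VAD and segmentation have already produced segments. `extract_xvector` is a stand-in for the ASV-Subtools extractor, not its real API, and the PCA dimension and AHC threshold are illustrative values:

```python
# First-generation modular back-end sketch:
# segments -> x-vectors -> PCA -> cosine distances -> AHC labels.
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def extract_xvector(segment_audio):
    # Stand-in for the ASV-Subtools x-vector extractor (not its real API).
    rng = np.random.default_rng(abs(hash(segment_audio)) % 2**32)
    return rng.normal(size=512)

def diarize(segments, n_components=32, threshold=0.5):
    X = np.stack([extract_xvector(s) for s in segments])
    X = PCA(n_components=n_components).fit_transform(X)  # reduce dimension
    d = pdist(X, metric="cosine")                        # cosine scoring
    labels = fcluster(linkage(d, method="average"),      # AHC clustering
                      t=threshold, criterion="distance")
    return labels                                        # speaker id per segment

print(diarize([f"seg{i}" for i in range(40)]))
```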

2. Industrial Version — Modular System
Open problem — overlapped speech. Speaker overlap: did overlapped speech occur in the target region? Who overlaps with whom? (Figure and audio: https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all)

3. Improvements — Neural Segmentation
Solution: split the recording into chunks and, within each chunk, use a neural network to decide which speakers (at most 3) are active. Chunks are 5 seconds long with a 2.5-second shift. When extracting x-vectors, overlapped speech is removed first and each speaker's remaining segments are merged (a sketch follows below). (Figures: same URL as above.)
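A sketch of how the overlapping chunk outputs can be stitched into frame-level decisions, and how overlapped frames are then excluded before embedding extraction. The segmentation model is stubbed with random activities, and pyannote's actual aggregation is more careful (it must also align speaker slots across chunks), so this only conveys the idea:

```python
# Sketch: average per-chunk speaker activities (5 s windows, 2.5 s shift,
# up to 3 local speakers) into frame-level decisions, then keep only
# non-overlapped frames for x-vector extraction.
import numpy as np

FPS = 100                 # frames per second (illustrative resolution)
WIN, SHIFT, SPK = 5.0, 2.5, 3

def run_segmentation(window_audio):
    # Stub for the neural segmentation model: (frames, SPK) activities.
    return np.random.default_rng(0).uniform(size=(int(WIN * FPS), SPK))

def aggregate(duration_sec):
    frames = int(duration_sec * FPS)
    acc = np.zeros((frames, SPK)); cnt = np.zeros((frames, 1))
    t = 0.0
    while t + WIN <= duration_sec:        # tail not covered by a full
        a, b = int(t * FPS), int((t + WIN) * FPS)   # window is left empty
        acc[a:b] += run_segmentation(None)  # real code would also align
        cnt[a:b] += 1                       # speaker slots across chunks
        t += SHIFT
    return acc / np.maximum(cnt, 1)         # averaged activities

act = aggregate(20.0) > 0.5                  # binarize with a 0.5 threshold
overlap = act.sum(axis=1) > 1                # >1 active speaker = overlap
clean = act & ~overlap[:, None]              # frames kept for x-vectors
print(int(overlap.sum()), "overlapped frames removed;",
      int(clean.any(axis=1).sum()), "clean speech frames kept")
```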

H. Bredin, "pyannote.audio 2.1 Speaker Diarization Pipeline: Principle, Benchmark, and Recipe," in Interspeech, 2023.
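For completeness, a usage sketch of the pyannote.audio 2.x pipeline that implements this segmentation-based recipe; the checkpoint name follows pyannote's published models, while the access token and audio file name are placeholders to be replaced:

```python
# Usage sketch of the pyannote.audio 2.x diarization pipeline
# (pretrained checkpoint name per pyannote's model hub; the token and
# audio path below are placeholders).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization@2.1",
    use_auth_token="YOUR_HF_TOKEN",       # placeholder Hugging Face token
)
diarization = pipeline("meeting.wav")      # placeholder audio file

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```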
