Mixtures of Deep Neural Experts for Automated Speech Scoring

Abstract

Task : automatic assessment of second language proficiency

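- One of the CALL (computer-assisted language learning) tasks

- The paper's topic is the automatic assessment of second language proficiency

- Of the possible modalities, the experiments use spoken responses

- Several experts are built as neural networks and then combined, with the aim of outperforming existing methods

- German (the language of the written prompts)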

Two main modules of the paper's approach

1) an automatic speech recognition system

2) a multiple classifier system

- The multiple classifier system operates on the text produced by speech recognition

- The input is spoken responses, but because the system ultimately works on the ASR transcripts, the paper leans toward an NLP approach rather than an acoustic one.

* key points

- Various textual features : a reference grammar, outputs of probabilistic (statistical) language models, several word embeddings, two bag-of-words models

- Two combination methods
1) probabilistic pseudo-joint model : the point of comparison
2) neural mixture of experts : the method proposed in this paper

- dataset : Spoken CALL Shared Task challenge

1. Introduction

Automatic proficiency scoring in L2 learning

- CALL : Computer Assisted Language Learning

- Measure L2 proficiency relying on ground truth by human experts : this field mostly evaluates against labels assigned by human experts

Specific task in this paper

- Dataset provided by the Spoken CALL Shared Task Challenge(2019)

- Automatic scoring of sentences uttered by Swiss German teenagers learning English

- Prompt-Response pairs
--> prompt : written questions in German
--> response : spoken utterance in English
--> a German question is shown as text and answered aloud in English, so the data consists of <German question, English answer> pairs

Previous works

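- Feed hand-crafted features, extracted from the audio signal and from the automatic transcriptions produced by ASR, into a traditional classifier (for example, logistic regression)

- Methods based on word embeddings

- Compute sentence similarity between the ASR output and the reference grammar, comparing at several levels

--> reference grammar : the list of expected answers for each prompt

- Methods using word embeddings obtained from BERT and ELMo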

This paper

- Mix the methods introduced so far : combine the outputs of several scoring systems

- Two-step modeling

1) Speech recognition – obtain transcripts of the noisy responses

- Extract a text transcript from the noisy responses

2) FFNN/RNN deep learning – model different representations of the transcription

- Use feed-forward or recurrent deep learning models to build representations of the transcription

- Representations by DNN-based “Experts”

--> each DNN model that produces a representation is called an "expert"

Role of each DNN (the idea is to combine all of them and outperform earlier approaches)

- scores yielded by a reference grammar : the reference grammar expert

- likelihoods estimated by a number of probabilistic LMs : the statistical language model expert

- sequences of word embeddings of different type : the word embedding expert

- two variants of the bag-of-words representation : the bag-of-words expert

Combining multiple DNN-based experts

- instead of applying individual models, combine into a higher-level, more robust classifier

- combination methods : pseudo-joint probability criterion over the individual DNN estimates, a mixture of DNNs

Earlier work built on the same idea (mixtures of multiple DNNs for speech scoring)

- Uses two DNNs

- Lexical DNN : encodes the output of a pre-trained model such as GloVe

- Acoustic DNN : encodes word-level acoustic features, for example a spectrogram

- Linear regression for the DNN combination

2. Task and systems description

Dataset : 3rd Spoken CALL Shared Task challenge (2019)

prompt-response pairs

- prompt : written questions in German

- response : spoken utterance in English

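Each response is tagged by human annotators with an accepted/rejected label

- < language = correct > & < meaning = correct > --> accepted

- meaning score : correct when the question was understood and answered appropriately

- An off-topic answer, that is, even one incorrect judgement, makes the response rejected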
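
The labeling rule above can be written down in a couple of lines. This is only a sketch of the rule as described in these notes; the function name `final_label` is made up for illustration and is not from the challenge tooling.

```python
def final_label(language_correct: bool, meaning_correct: bool) -> str:
    """Accept a response only when both human judgements are 'correct'."""
    if language_correct and meaning_correct:
        return "accepted"
    # Any single incorrect judgement leads to rejection.
    return "rejected"


print(final_label(True, True))
print(final_label(True, False))
```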

data distributions

- training set (merged sets from 2017 and 2018) : 11,919 utterances

- dev set (test set from 2017) : 995 utterances

- eval set (test set from 2018) : 1,000 utterances

- test set : 1,000 utterances (the actual 2019 test set)

Acoustic Model

- Trained using a popular Kaldi recipe based on a time-delay NN, optimized with the lattice-free MMI (maximum mutual information) criterion

- Datasets used for training

--> In addition to the CALL Shared Task training set above : part of PF-STAR (English text read aloud by German children) and the ISLE corpus (English spoken by intermediate-level learners)

Language Model

- 3-gram stochastic LM

- 7.5% WER on Eval set
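
For reference, WER (word error rate) is the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the number of reference words. Below is a minimal sketch of that computation; the function name and the example sentences are made up for illustration and are not taken from the paper or from Kaldi.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One missing word out of five reference words.
print(word_error_rate("i would like a ticket", "i would like ticket"))  # 0.2
```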

2.1. FBK baseline system

The winner of the 2019 challenge

- Last year's winning team took part again, and this work was written up as an Interspeech 2020 paper.

Settings

- standard : num of words, of content words, num and % of OOV words

- reference : 5 features from the reference grammar and the edit error

- LMs : log-probability, OOVs, num. of back-offs; 12 LMs in total (1-gram to 4-gram, each computed on 3 datasets)

- Generic – around 3M words from English TED talks

- TrainRejRec/TrainAccRec : ASR outputs bounded by labels _start_ and _end_ corresponding to the incorrect/correct utterances of Training Set

Classification

- Several FFNNs used. Majority voting to the most promising classification outputs

- The 2019 system serves as the baseline
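
Majority voting over several classifiers' decisions can be sketched as follows. The notes do not say how ties are handled or how the "most promising" outputs are chosen, so the `tie_label` parameter and the function name are assumptions made for this example.

```python
from collections import Counter


def majority_vote(decisions, tie_label="rejected"):
    """Return the label chosen by most classifiers.

    `tie_label` is an assumption for this sketch: the notes do not
    describe what the baseline does when the votes are split evenly.
    """
    counts = Counter(decisions)
    accepted = counts["accepted"]
    rejected = counts["rejected"]
    if accepted == rejected:
        return tie_label
    return "accepted" if accepted > rejected else "rejected"


# Three hypothetical FFNN outputs for the same response.
print(majority_vote(["accepted", "rejected", "accepted"]))  # accepted
```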

2.2. Stand-alone DNNs and textual features used

Before combining the DNNs, the individual DNNs are introduced

LSTM-W2V

An LSTM combined with word2vec
- Trained over sequences of 300-dimensional real-valued Word2Vec embeddings

- W2V embeddings of Prompt-Response pairs are concatenated to form a single sequence of vectors
: the prompt comes first and the response follows, forming one sequence of vectors

LSTM-W2V-L

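Motivated by a shortcoming of the loss function.
- Classification tasks are usually trained with Cross Entropy Loss, but the challenge is scored on the binary accepted/rejected decisions. The training criterion and the evaluation measure are computed differently, which creates a mismatch

- Ad hoc version of the loss function devised to account for the mismatch between the usual training criterion (Cross Entropy Loss) and the evaluation metric

- The new loss is therefore designed to resemble the evaluation metric, expressing whether a decision is correct or not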

LSTM-W2V-M

- Interpose an EOS (end-of-sentence) marker in the sequence of embeddings, at the end of the Prompt sequence
: the EOS tells the network that the prompt ends here and the response is about to start

- Additional 0/1 marker component added to the original 300-dimensional input
: a marker component that is either 0 or 1 is appended, so the input grows from 300 to 301 dimensions
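
The input construction for LSTM-W2V-M can be sketched roughly as below. The notes only say that an EOS marker separates prompt from response and that a 0/1 component is appended; the contents of the EOS vector (zeros in the embedding part, 1 in the marker slot) and the function name are assumptions for this sketch, not details confirmed by the paper.

```python
def build_marked_sequence(prompt_vecs, response_vecs):
    """Concatenate prompt and response embeddings with an EOS marker between them.

    Every vector gets one extra component: 0 for ordinary words,
    1 for the EOS marker (assumed layout).
    """
    if not prompt_vecs or not response_vecs:
        raise ValueError("both prompt and response need at least one vector")
    dim = len(prompt_vecs[0])

    # Assumption: the EOS vector is all zeros except for the marker slot.
    eos = [0.0] * dim + [1.0]

    sequence = [list(v) + [0.0] for v in prompt_vecs]
    sequence.append(eos)
    sequence.extend(list(v) + [0.0] for v in response_vecs)
    return sequence


# Dummy 300-dimensional embeddings: 2 prompt words, 3 response words.
prompt = [[0.1] * 300, [0.2] * 300]
response = [[0.3] * 300, [0.4] * 300, [0.5] * 300]
seq = build_marked_sequence(prompt, response)
print(len(seq), len(seq[0]))  # 6 301
```

The same function would give 769-dimensional vectors when fed 768-dimensional BERT embeddings, matching the LSTM-BERT variant.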

LSTM-BERT

- Replace Word2Vec with BERT embeddings (768-dimensional real-valued vectors)
: BERT embeddings are used in place of W2V

- Add the 0/1 EOS marker component = total dims : 769

Deep FFNN on BOW / TF-IDF

Deep FFNN on BOW : a feed-forward network over a bag-of-words representation

- Bag-Of-Words (BOW) : considered as a plain counter of the occurrences of individual words in the text
-- a simple counter: it records how often each word occurs in the text

Term Frequency-Inverse Document Frequency (TF-IDF)

- The importance of a target term in a specific document among a group of documents
: a numerical measure of how important a target word (term) is to one particular document within a collection

BOW and TF-IDF representations

- 1,020 word vocabulary (German prompts – English responses)

- 2,040-dimensional input fed to the FFNN : BOW 1,020 + TF-IDF 1,020 = 2,040 dimensions
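
A rough sketch of how the concatenated BOW + TF-IDF input could be built. The exact TF-IDF variant used in the paper is not given in these notes, so the smoothed IDF formula, the treatment of a prompt-response pair as one document, and all function names and toy tokens here are assumptions.

```python
import math
from collections import Counter


def bow_vector(tokens, vocab):
    """Plain occurrence counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocab]


def idf_table(documents, vocab):
    """Smoothed inverse document frequency (assumed variant)."""
    n_docs = len(documents)
    table = {}
    for w in vocab:
        doc_freq = sum(1 for doc in documents if w in doc)
        table[w] = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
    return table


def tfidf_vector(tokens, vocab, idf):
    """Relative term frequency weighted by IDF."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total * idf[w] for w in vocab]


def ffnn_input(tokens, vocab, idf):
    """Concatenate BOW and TF-IDF: 2 * len(vocab) dimensions."""
    return bow_vector(tokens, vocab) + tfidf_vector(tokens, vocab, idf)


# Toy prompt-response pairs (German prompt tokens + English response tokens).
docs = [
    ["frag", "nach", "zimmer", "i", "want", "a", "room"],
    ["frag", "nach", "ticket", "i", "want", "a", "ticket"],
]
vocab = sorted({w for doc in docs for w in doc})
idf = idf_table(docs, vocab)
features = ffnn_input(docs[1], vocab, idf)
print(len(vocab), len(features))  # 8 16
```

With the 1,020-word vocabulary from the notes, the same concatenation yields the 2,040-dimensional input.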

2.3. Combining multiple DNNs

Pseudo-joint probability

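- The first of the two ways of combining DNNs. It is a joint probability, so it combines at least two models.

ψ_1, ψ_2 : a model independent of any other individual models
(each model is assumed independent and not influenced by the others; ψ_1 is the first model, ψ_2 the second)

ω_1, ..., ω_k : different classes involved in the classification task
(one symbol per class; with k classes they run from ω_1 to ω_k)

“pseudo” : models are not actually independent of each other in real world

- Each model can be interpreted as individual DNN-based experts

-- The joint probability is the probability of a particular class given two models. It is hard to compute directly and needs an assumption, namely that the models are independent. That holds only in theory; in practice the models may be correlated for reasons we cannot observe, hence the "pseudo" in pseudo-joint probability.

- Using the chain rule and Bayes' theorem, the joint probability can be rewritten as a fraction
- Numerator : the probability under model 1 and the probability under model 2, each of which can be computed. Denominator : the class probability, which can also be computed
== Under the independence assumption it is simple to evaluate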
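
Read literally, the description above (per-model probabilities in the numerator, class probability in the denominator) corresponds to the familiar product rule P(ω_i | ψ_1, ψ_2) ∝ P(ω_i | ψ_1) · P(ω_i | ψ_2) / P(ω_i). The sketch below implements that reading; it is a reconstruction from these notes, so the paper's exact formula may differ. The extension to more than two experts (dividing by the prior once per extra expert) and the function name are assumptions.

```python
def pseudo_joint(posteriors, priors):
    """Combine per-expert class posteriors under an independence assumption.

    posteriors: one list per expert, each holding P(class_i | expert).
    priors:     P(class_i) for each class.
    Returns normalized combined scores, one per class.
    """
    n_experts = len(posteriors)
    scores = []
    for i, prior in enumerate(priors):
        product = 1.0
        for expert_posterior in posteriors:
            product *= expert_posterior[i]
        # For two experts this is P(w|psi1) * P(w|psi2) / P(w).
        scores.append(product / prior ** (n_experts - 1))
    total = sum(scores)
    return [s / total for s in scores]


# Two hypothetical experts scoring [accepted, rejected] with equal priors.
combined = pseudo_joint([[0.8, 0.2], [0.6, 0.4]], priors=[0.5, 0.5])
print([round(p, 3) for p in combined])  # [0.857, 0.143]
```

When both experts lean the same way, the combined score is more confident than either expert alone, which is the intended effect of the product rule.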

Hard mixture of experts

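-- The second of the two ways of combining DNNs

- Each DNN is trained on its own features; collecting all of those features gives the overall feature space

- The overall feature space is partitioned into non-overlapping sub feature spaces (drawn as dividing lines inside the overall feature space in the figure)

-- Each expert is trained on its own sub feature space
-- Given the experts' outputs y1, y2, y3, y4, a gating network decides whether each one gets a 0 or a 1

-- 0 means the expert is ignored for the current input, 1 means its output is used

-- For an input x, each of the k experts' outputs is multiplied by its 0/1 coefficient alpha, and the products are summed over j = 1 to k to give the final output z.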
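
In symbols, the description above amounts to z(x) = Σ_{j=1..k} α_j(x) · y_j(x) with every α_j(x) in {0, 1}. A minimal sketch of that combination follows. In the paper the gate is a network; here `toy_gate` is a hand-written rule and the two experts return fixed scores, all of which are placeholders invented for this example.

```python
def hard_mixture(x, experts, gate):
    """Return z = sum_j alpha_j(x) * y_j(x) with every alpha_j in {0, 1}."""
    alphas = gate(x)
    if len(alphas) != len(experts):
        raise ValueError("gate must return one coefficient per expert")
    if any(a not in (0, 1) for a in alphas):
        raise ValueError("hard gating only allows coefficients 0 or 1")
    return sum(a * expert(x) for a, expert in zip(alphas, experts))


# Placeholder experts: each returns a fixed acceptance score.
def short_answer_expert(x):
    return 0.9


def long_answer_expert(x):
    return 0.2


# Placeholder gate: picks exactly one expert from a made-up feature.
def toy_gate(x):
    return [1, 0] if x["num_words"] <= 5 else [0, 1]


experts = [short_answer_expert, long_answer_expert]
print(hard_mixture({"num_words": 3}, experts, toy_gate))  # 0.9
print(hard_mixture({"num_words": 9}, experts, toy_gate))  # 0.2
```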

3. Experiments and results

- "wc" presumably stands for word count.
- The models are listed with their number of layers and the number of neurons per layer

Stand-alone DNNs

Results for each individual DNN

- LSTM-W2V-M : the model with the explicit marker separating prompt from response gave the best performance

Mixtures of DNNs

Results for the combined DNNs

- PJ : pseudo-joint combination

- MIX : Mixture DNNs

- Among the mixtures, the one combining the best 3 (the three best-performing models) performed best, judged by F1.

- The mixture models are shown to outperform the stand-alone models above.

4. Conclusions

- Two combination techniques are proposed and both achieved significant improvements over the SOTA of the 2019 challenge
: two combination techniques are introduced, and each improves over the stand-alone models

- Different text representations and DNNs capture diverse aspects of the linguistic phenomena, complementing each other effectively
: each DNN acts as an expert that learns the text representation from its own viewpoint; the experts complement one another and performance improves

- Future plan
-- aim to address more complex tasks in speech scoring
: move beyond binary accept/reject scoring to a more complex form of speech scoring
-- extend the multiple classifiers to consider DNN experts trained on acoustic features
: keep the DNN experts but also train them on acoustic features

This is a review of Papi et al. (2020), < Mixtures of Deep Neural Experts for Automated Speech Scoring >, presented at Interspeech 2020. These notes are based on a review presentation by 이주영 of the SLP lab at Seoul National University.
