Active speech synthesis based on listener perceptual modeling

This work was supported by JSPS KAKENHI Grant Number 18J22090 (Apr. 2018 -- Mar. 2021).


Research Project 1: Generative-adversarial-network-based text-to-speech synthesis using auditory-sensitive features

Research highlights:

We propose GAN-based TTS training. Average pooling is introduced into the training to obtain low-dimensional features that represent spectral envelopes, which are important in human speech perception.
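The pooling step can be illustrated with a minimal sketch. It only shows how average pooling along the frequency axis reduces one spectral frame to a coarser, lower-dimensional envelope. The function name `average_pool`, the non-overlapping windows, and the window width are assumptions made for illustration and are not taken from the project's implementation.

```python
from statistics import fmean


def average_pool(frame, width):
    """Average-pool one spectral frame over non-overlapping frequency windows.

    `frame` is a list of per-bin spectral values; `width` is the (hypothetical)
    number of adjacent bins averaged into one output value.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Trailing bins that do not fill a complete window are dropped.
    n_out = len(frame) // width
    return [fmean(frame[i * width:(i + 1) * width]) for i in range(n_out)]


# Toy 8-bin frame reduced to 2 values (illustrative numbers only).
frame = [0.0, 2.0, 4.0, 6.0, 1.0, 1.0, 3.0, 3.0]
pooled = average_pool(frame, width=4)
print(pooled)
```

In a GAN setting, such pooled features would presumably be what the discriminator sees instead of the full-resolution spectrum, but how the project wires this in is not stated here.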

Datasets:

Speech samples (in Japanese, DNN-based TTS):

  1. NAT: Natural speech.
  2. MSE: Minimizing mean squared error.
  3. GAN: Proposed GAN-based training.

Method | Sample 1 | Sample 2 | Sample 3
NAT:
MSE:
GAN:

References:


Research Project 2: Speaker embedding using crowdsourced subjective inter-speaker similarity

Research highlights:

Speaker similarity graph (upper) and speaker similarity matrices (lower) representing crowdsourced subjective inter-speaker similarity.
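As a rough illustration of what such a similarity matrix could contain, the sketch below averages pairwise listener scores into a symmetric speaker-by-speaker matrix. The rating format, the score scale, and the function name `similarity_matrix` are hypothetical. The project's actual crowdsourcing protocol and aggregation are not described here.

```python
from collections import defaultdict


def similarity_matrix(speakers, ratings):
    """Average crowdsourced pair scores into a symmetric matrix.

    `ratings` is a list of (speaker_a, speaker_b, score) tuples (assumed format).
    Pairs that were never rated are left as None.
    """
    index = {name: i for i, name in enumerate(speakers)}
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, b, score in ratings:
        # Treat (a, b) and (b, a) as the same unordered pair.
        key = tuple(sorted((index[a], index[b])))
        sums[key] += score
        counts[key] += 1

    n = len(speakers)
    matrix = [[None] * n for _ in range(n)]
    for (i, j), total in sums.items():
        mean = total / counts[(i, j)]
        matrix[i][j] = mean
        matrix[j][i] = mean
    return matrix


# Toy example with three speakers and made-up scores.
speakers = ["A", "B", "C"]
ratings = [("A", "B", 2.0), ("B", "A", 3.0), ("A", "C", -1.0)]
matrix = similarity_matrix(speakers, ratings)
print(matrix)
```

Leaving unrated pairs as `None` keeps missing judgments distinct from a score of zero, which matters if only a subset of speaker pairs is used later.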

Datasets:

Speech samples (in English, VAE-based speech modeling using mel-cepstral coefficients (mcep)):

  1. NAT: Natural speech.
  2. D-VEC: Speech-recognition-based embedding.
  3. SIM-VEC: Similarity vector embedding.
  4. SIM-MAT: Similarity matrix embedding using ALL speaker pairs.
  5. SIM-MAT-RE: Similarity matrix embedding using ONLY SIMILAR speaker pairs.

Method | Speaker 1 | Speaker 2 | Speaker 3
NAT:
D-VEC:
SIM-VEC:
SIM-MAT:
SIM-MAT-RE:

References: