Journals and conferences on speech
CCF-A | NeurIPS AAAI IJCAI ACMMM |
CCF-B | ICASSP COLING SpeechCom TSLP TASLP JSLHR TMM TOMCCAP ICME |
CCF-C | INTERSPEECH ICPR |
other | ICLR |
General TTS
2022
1 | DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs | |
2 | The MSXF TTS System for ICASSP 2022 ADD Challenge | |
3 | MHTTS: Fast multi-head text-to-speech for spontaneous speech with imperfect transcription | |
4 | Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance | |
5 | ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech | |
6 | Unsupervised word-level prosody tagging for controllable speech synthesis | |
7 | FAAG: Fast Adversarial Audio Generation through Interactive Attack Optimisation | |
8 | Building Synthetic Speaker Profiles in Text-to-Speech Systems | |
9 | Revisiting Over-Smoothness in Text to Speech | |
10 | A Multi-Scale Time-Frequency Spectrogram Discriminator for GAN-based Non-Autoregressive TTS | |
11 | A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis | |
12 | A3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing | |
13 | Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis | |
14 | BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis | |
15 | Differentiable Duration Modeling for End-to-End Text-to-Speech | |
16 | DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning | |
17 | ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis | |
18 | Improve few-shot voice cloning using multi-modal learning | |
19 | JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech | |
20 | Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech | |
21 | Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation | |
22 | Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition | |
23 | Variational Auto-Encoder based Mandarin Speech Cloning | |
24 | Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise | |
25 | vTTS: visual-text to speech | |
26 | WavThruVec: Latent speech representation as intermediate features for neural speech synthesis | |
27 | Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss | |
28 | SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech | |
29 | Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech | |
30 | Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis | |
31 | Simple and Effective Unsupervised Speech Synthesis | |
32 | AILTTS: Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech | |
33 | VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature | |
34 | Universal Adaptor: Converting Mel-Spectrograms Between Different Configurations for Speech Synthesis | |
35 | NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality | |
36 | Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech | |
37 | Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History | |
38 | Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech | |
39 | NatiQ: An End-to-end Text-to-Speech System for Arabic | |
40 | R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS | |
41 | TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder | |
42 | UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder | |
43 | Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models | |
44 | Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation | |
45 | Diffsound: Discrete Diffusion Model for Text-to-sound Generation | |
46 | LIP: Lightweight Intelligent Preprocessor for meaningful text-to-speech | |
47 | ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech | |
48 | DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders | |
49 | Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech | |
50 | SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate | |
51 | BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model | |
52 | Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS) | |
53 | Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS) | |
54 | Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need | |
55 | Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks | |
56 | Visualising Model Training via Vowel Space for Text-To-Speech Systems | |
57 | A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis | |
58 | EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models | |
59 | A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS | |
60 | Controllable Accented Text-to-Speech Synthesis | |
61 | Deep Speech Synthesis from Articulatory Representations | |
62 | AudioGen: Textually Guided Audio Generation |
2021
1 | Triple M: A Practical Neural Text-to-speech System With Multi-guidance Attention And Multi-band Multi-time Lpcnet | |
2 | VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention | |
3 | LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search | |
4 | Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech | |
5 | AdaSpeech: Adaptive Text to Speech for Custom Voice | |
6 | Building Multilingual TTS using Cross-Lingual Voice Conversion | |
7 | Supervised and Unsupervised Approaches for Controlling Narrow Lexical Focus in Sequence-to-Sequence Speech Synthesis | |
8 | Mixture Density Network for Phone-Level Prosody Modelling in Speech Synthesis | |
9 | Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input | |
10 | Data-Efficient Training Strategies for Neural TTS Systems | |
11 | Multilingual Byte2Speech Text-To-Speech Models Are Few-shot Spoken Language Learners | |
12 | Text-to-speech for the hearing impaired | |
13 | Continual Speaker Adaptation for Text-to-Speech Synthesis | |
14 | Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling | |
15 | PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS | |
16 | SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model | |
17 | Fast DCTTS: Efficient Deep Convolutional Text-to-Speech | |
18 | Diff-TTS: A Denoising Diffusion Model for Text-to-Speech | |
19 | Multi-rate attention architecture for fast streamable Text-to-speech spectrum modeling | |
20 | Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features | |
21 | Speech Resynthesis from Discrete Disentangled Self-Supervised Representations | |
22 | Dependency Parsing based Semantic Representation Learning with Graph Neural Network for Enhancing Expressiveness of Text-to-Speech | |
23 | Review of end-to-end speech synthesis technology based on deep learning | |
26 | TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction | |
27 | Signal Representations for Synthesizing Audio Textures with Generative Adversarial Networks | |
28 | SpeechNet: A Universal Modularized Model for Speech Processing Tasks | pdf blog |
29 | How do Voices from Past Speech Synthesis Challenges Compare Today? | pdf blog |
30 | Learning Robust Latent Representations for Controllable Speech Synthesis | |
31 | MASS: Multi-task Anthropomorphic Speech Synthesis Framework | |
32 | VQCPC-GAN: Variable-length Adversarial Audio Synthesis using Vector-Quantized Contrastive Predictive Coding | |
34 | Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech | |
35 | ItôTTS and ItôWave: Linear Stochastic Differential Equation Is All You Need For Audio Generation | |
36 | Diverse and Controllable Speech Synthesis with GMM-Based Phone-Level Prosody Modelling | |
37 | A learned conditional prior for the VAE acoustic space of a TTS system | |
38 | A Survey on Neural Speech Synthesis | |
39 | An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis | |
40 | Byakto Speech: Real-time long speech synthesis with convolutional neural network: Transfer learning from English to Bangla | |
41 | Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech | |
42 | Controllable Context-aware Conversational Speech Synthesis | |
43 | Multi-Scale Spectrogram Modelling for Neural Text-to-Speech | |
44 | Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis | |
45 | FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis | |
46 | GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis | |
47 | Hierarchical Context-Aware Transformers for Non-Autoregressive Text to Speech | |
48 | Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows | |
49 | Non-native English lexicon creation for bilingual speech synthesis | |
50 | Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech | |
51 | Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis | |
52 | Speech BERT Embedding For Improving Prosody in Neural TTS | |
53 | WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis | |
54 | Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance | |
55 | VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis | |
56 | Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm | |
57 | Federated Learning with Dynamic Transformer for Text to Speech | |
58 | Effective and Differentiated Use of Control Information for Multi-speaker Speech Synthesis | |
59 | End to End Bangla Speech Synthesis | |
60 | Perceptually Guided End-to-End Text-to-Speech With MOS Prediction | |
61 | One TTS Alignment To Rule Them All | |
62 | Combining speakers of multiple languages to improve quality of neural voices | |
63 | DeepEigen: Learning-based Modal Sound Synthesis with Acoustic Transfer Maps | |
64 | Neural HMMs are all you need (for high-quality attention-free TTS) | |
65 | PortaSpeech: Portable and High-Quality Generative Text-to-Speech | |
66 | Nana-HDR: A Non-attentive Non-autoregressive Hybrid Model for TTS | |
67 | Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network | |
68 | An Audio Synthesis Framework Derived from Industrial Process Control | |
69 | On-device neural speech synthesis | |
70 | fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit | |
71 | A study on the efficacy of model pre-training in developing neural text-to-speech system | |
72 | DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021 | |
73 | Discrete acoustic space for an efficient sampling in neural text-to-speech | |
74 | EdiTTS: Score-based Editing for Controllable Text-to-Speech | |
75 | Emphasis control for parallel neural TTS | |
76 | Environment Aware Text-to-Speech Synthesis | |
77 | ESPnet2-TTS: Extending the Edge of TTS Research | |
78 | FedSpeech: Federated Text-to-Speech with Continual Learning | |
79 | Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS | |
80 | Mixer-TTS: non-autoregressive, fast and compact text-to-speech model conditioned on language model embeddings | |
81 | Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge | |
82 | On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis | |
83 | PAMA-TTS: Progression-Aware Monotonic Attention for Stable Seq2Seq TTS With Accurate Phoneme Duration Control | |
84 | Prosody-TTS: An end-to-end speech synthesis system with prosody control | |
86 | Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video | |
87 | Guided-TTS: Text-to-Speech with Untranscribed Speech | |
88 | Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control | |
89 | Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data | |
90 | More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech | |
91 | Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis | |
92 | RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses | |
93 | Speaker Generation |
2020
1 | Interactive Text-to-Speech via Semi-Supervised Style Transfer Learning | |
2 | SqueezeWave: Extremely Lightweight Vocoders for On-Device Speech Synthesis | pdf demo code |
3 | Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis | |
4 | End-to-End Adversarial Text-to-Speech | pdf demo |
5 | FastSpeech 2: Fast and High-Quality End-to-End Text to Speech | |
6 | Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends | |
7 | Flowtron: An Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis | |
8 | JDI-T: Jointly Trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment | |
9 | FastPitch: Parallel Text-to-Speech with Pitch Prediction | |
10 | Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search | |
11 | Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow | |
12 | SpeedySpeech: Efficient Neural Speech Synthesis | |
14 | Controllable Neural Prosody Synthesis | |
15 | Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling | |
16 | Exploring TTS without T Using Biologically/Psychologically Motivated Neural Network Modules (ZeroSpeech 2020) | |
17 | From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint | |
18 | Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning | |
19 | Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit | |
20 | Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages | |
21 | Speaking Speed Control of End-to-End Speech Synthesis using Sentence-Level Conditioning | |
22 | Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning | |
23 | Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling | |
24 | Parallel Tacotron: Non-Autoregressive and Controllable TTS | |
25 | TTS-by-TTS: TTS-Driven Data Augmentation for Fast and High-Quality Speech Synthesis | |
26 | Speech Synthesis and Control Using Differentiable DSP | |
27 | FeatherTTS: Robust and Efficient Attention-Based Neural TTS | |
28 | GraphSpeech: Syntax-Aware Graph Attention Network for Neural Speech Synthesis | |
29 | Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis | |
30 | DeviceTTS: A Small-Footprint, Fast, Stable Network for On-Device Text-to-Speech | |
31 | Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis | |
32 | Fast and Lightweight On-Device TTS with Tacotron2 and LPCNet | |
2019
2019 | isca 2019 speech | papers |
1 | Deep Text-to-Speech System with Seq2Seq Model | |
2 | FastSpeech: Fast, Robust and Controllable Text to Speech | |
3 | Neural Speech Synthesis with Transformer Network | pdf ppt demo |
4 | Parallel Neural Text-to-Speech | |
5 | Exploiting Syntactic Features in a Parsed Tree to Improve End-to-End TTS | |
6 | LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech | |
7 | Forward-Backward Decoding for Regularizing End-to-End TTS | |
8 | Self-attention Based Prosodic Boundary Prediction for Chinese Speech Synthesis | |
9 | Guide to Speech Synthesis with Deep Learning | ppt |
10 | tts tutorial part1 part2 | ppt1 ppt2 |
11 | Maximizing Mutual Information for Tacotron | |
12 | DurIAN: Duration Informed Attention Network for Multimodal Synthesis | |
13 | Non-Autoregressive Neural Text-to-Speech | |
14 | Tacotron-based acoustic model using phoneme alignment for practical neural text-to-speech systems |
2018
2018 | isca 2018 speech | papers |
1 | Deep voice 3: Scaling text-to-speech with convolutional sequence learning | |
2 | ClariNet Parallel Wave Generation in End-to-End Text-to-Speech | |
3 | Linear Networks Based Speaker Adaptation For Speech Synthesis |
2017
2017 | isca 2017 speech | papers |
1 | Tacotron: Towards End-to-End Speech Synthesis | pdf page |
2 | Char2Wav: End-to-End Speech Synthesis | |
3 | Deep Voice: Real-time Neural Text-to-Speech | |
4 | Deep Voice 2: Multi-Speaker Neural Text-to-Speech | |
5 | VoiceLoop voice fitting and synthesis via a phonological loop | |
6 | Attention Is All You Need |
2016
2016 | isca 2016 speech | papers |
1 | Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices | |
2 | Merlin: An Open Source Neural Network Speech Synthesis System |
2015
1 | Acoustic Modeling in Statistical Parametric Speech Synthesis: From HMM to LSTM-RNN | pdf ppt |
2 | Effective Approaches to Attention-based Neural Machine Translation | |
3 | htkbook-3.5 | |
5 | A study of speaker adaptation for DNN-based speech synthesis |
2014
1 | TTS Synthesis with Bidirectional LSTM based Recurrent Neural Networks |
2013
1 | Statistical Parametric Speech Synthesis Using Deep Neural Networks | |
Vocoder
2022
1 | ItôWave: Itô Stochastic Differential Equation Is All You Need For Wave Generation | |
2 | End-to-end LPCNet: A Neural Vocoder With Fully-Differentiable LPC Estimation | |
3 | Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet | |
4 | Phase Vocoder Done Right | |
5 | It's Raw! Audio Generation with State-Space Models | |
6 | InferGrad: Improving Diffusion Models for Vocoder by Considering Inference in Training | |
7 | A Neural Vocoder Based Packet Loss Concealment Algorithm | |
8 | AdaVocoder: Adaptive Vocoder for Custom Voice | |
9 | Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge | |
10 | HiFi++: a Unified Framework for Neural Vocoding, Bandwidth Extension and Speech Enhancement | |
11 | iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform | |
12 | Neural Vocoder is All You Need for Speech Super-resolution | |
13 | SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping | |
14 | Parallel Synthesis for Autoregressive Speech Generation | |
15 | Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation | |
16 | FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis | |
17 | Streamable Neural Audio Synthesis With Non-Causal Convolutions | |
18 | A Post Auto-regressive GAN Vocoder Focused on Spectrum Fracture | |
19 | BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis | |
20 | Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation | |
21 | cMelGAN: An Efficient Conditional Generative Model Based on Mel Spectrograms | |
22 | Avocodo: Generative Adversarial Network for Artifact-free Vocoder | |
23 | BigVGAN: A Universal Neural Vocoder with Large-Scale Training | |
24 | GoodBye WaveNet -- A Language Model for Raw Audio with Context of 1/2 Million Samples | |
25 | WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis | |
26 | End-to-End Binaural Speech Synthesis | |
27 | Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer | |
28 | Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0 | |
29 | DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation | |
30 | Mel Spectrogram Inversion with Stable Pitch | |
31 | An Initial study on Birdsong Re-synthesis Using Neural Vocoders |
2021
1 | GAN Vocoder: Multi-Resolution Discriminator Is All You Need | |
2 | Improved parallel WaveGAN vocoder with perceptually weighted spectrogram loss | |
3 | Universal Neural Vocoding with Parallel WaveNet | |
4 | LVCNet: Efficient Condition-Dependent Modeling Network for Waveform Generation | |
5 | High-Quality Vocoding Design with Signal Processing for Speech Synthesis and Voice Conversion | |
6 | Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains | |
7 | Improve GAN-based Neural Vocoder using Pointwise Relativistic LeastSquare GAN | |
8 | Unified Source-Filter GAN: Unified Source-filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN | |
9 | Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders | |
10 | High-Fidelity and Low-Latency Universal Neural Vocoder based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling | |
11 | A Generative Model for Raw Audio Using Transformer Architectures | |
12 | WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution | |
13 | Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters | |
14 | Basis-MelGAN: Efficient Neural Vocoder Based on Audio Decomposition | |
15 | Catch-A-Waveform: Learning to Generate Audio from a Single Short Example | |
16 | Continuous Wavelet Vocoder-based Decomposition of Parametric Speech Waveform Synthesis | |
17 | CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis | |
18 | Fre-GAN: Adversarial Frequency-consistent Audio Synthesis | |
19 | Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder For High Fidelity Flow-based Speech Synthesis | |
20 | Improving the expressiveness of neural vocoding with non-affine Normalizing Flows | |
21 | Mathematical Vocoder Algorithm : Modified Spectral Inversion for Efficient Neural Speech Synthesis | |
22 | Relational Data Selection for Data Augmentation of Speaker-dependent Multi-band MelGAN Vocoder | |
23 | UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation | |
26 | Neural Waveshaping Synthesis | |
28 | DarkGAN: Exploiting Knowledge Distillation for Comprehensible Audio Synthesis with GANs | |
29 | ItôTTS and ItôWave: Linear Stochastic Differential Equation Is All You Need For Audio Generation | |
31 | A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate | |
32 | FlowVocoder: A small Footprint Neural Vocoder based Normalizing flow for Speech Synthesis | |
33 | MSR-NV: Neural vocoder using multiple sampling rates | |
34 | Chunked Autoregressive GAN for Conditional Waveform Synthesis | |
35 | Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations | |
36 | Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet | |
37 | Neural Synthesis of Footsteps Sound Effects with Generative Adversarial Networks | |
38 | The Mirrornet : Learning Audio Synthesizer Controls Inspired by Sensorimotor Interaction | |
39 | Towards Universal Neural Vocoding with a Multi-band Excited WaveNet | |
40 | High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency | |
41 | RAVE: A variational autoencoder for fast and high-quality neural audio synthesis | |
42 | VocBench: A Neural Vocoder Benchmark for Speech Synthesis | |
43 | DiffWave: A Versatile Diffusion Model for Audio Synthesis |
2020
1 | Multi-band MelGAN | |
2 | FeatherWave: An Efficient High-Fidelity Neural Vocoder with Multi-band Linear Prediction | |
3 | Parallel WaveGAN | |
4 | VocGAN | |
5 | WaveGrad: Estimating Gradients for Waveform Generation | |
6 | Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram | |
8 | A Cyclical Post-filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-speech Systems | |
9 | Bunched LPCNet: Vocoder for Low-Cost Neural Text-To-Speech Systems | |
10 | Quasi-Periodic Parallel WaveGAN Vocoder: A Non-Autoregressive Pitch-Dependent Dilated Convolution Model for Parametric Speech Generation | |
11 | Neural Text-to-Speech with a Modeling-by-Generation Excitation Vocoder | |
12 | Improving Opus Low Bit Rate Quality with Neural Speech Synthesis | |
13 | WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU | |
14 | Vocoder-Based Speech Synthesis from Silent Videos | |
15 | Ultrasound-based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis | |
16 | Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions | |
17 | Gaussian LPCNet for Multisample Speech Synthesis | |
18 | Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains | |
19 | Improving LPCNet-based Text-to-Speech with Linear Prediction-structured Mixture Density Network | |
20 | What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS | |
21 | Lightweight LPCNet-based Neural Vocoder with Tensor Decomposition | |
22 | HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis |
2019
1 | High quality lightweight and adaptable TTS using LPCNet | |
2 | A Neural Vocoder with Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis | |
3 | RawNet: Fast End-to-End Neural Vocoder | |
4 | A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet | |
5 | Lpcnet improving neural speech synthesis through linear prediction | pdf demo code |
6 | WaveGlow | |
7 | MelGAN | |
8 | An Investigation of Subband WaveNet Vocoder Covering Entire Audible Frequency Range with Limited Acoustic Features | |
9 | A Comparison of Recent Neural Vocoders for Speech Signal Reconstruction |
2018
1 | Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (Tacotron 2) | pdf code |
2 | Efficient Neural Audio Synthesis (WaveRNN) | |
3 | Improving FFTNet vocoder with noise shaping and subband approaches | |
4 | FFTNET: A REAL-TIME SPEAKER-DEPENDENT NEURAL VOCODER | |
2017
1 | Parallel WaveNet: Fast High-Fidelity Speech Synthesis |
2016
1 | WaveNet: A Generative Model for Raw Audio | pdf demo code |
2 | Fast WaveNet Generation Algorithm | |
Adaptation & Multi-speaker & Multilingual
2022
1 | Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training | |
2 | Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention | |
3 | nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-shot Multi-speaker Text-to-Speech | |
4 | Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module | |
5 | Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features | |
6 | Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis | |
7 | Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus | |
8 | VoiceMe: Personalized voice generation in TTS | |
9 | Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech | |
10 | Data-augmented cross-lingual synthesis in a teacher-student framework | |
11 | Fine-grained Noise Control for Multispeaker Speech Synthesis | |
12 | Self supervised learning for robust voice cloning | |
13 | Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis | |
14 | AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios | |
15 | Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding | |
16 | Pronunciation Dictionary-Free Multilingual Speech Synthesis by Combining Unsupervised and Supervised Phonetic Representations | |
17 | SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech | |
18 | CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer | |
19 | Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech | |
20 | AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation | |
21 | Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data | |
22 | TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS | |
23 | Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS | |
24 | When Is TTS Augmentation Through a Pivot Language Useful? | |
25 | A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System | |
26 | Decoupled Pronunciation and Prosody Modeling in Meta-Learning-Based Multilingual Speech Synthesis | |
27 | ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS | |
28 | Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech |
2021
1 | Building Multilingual TTS using Cross-Lingual Voice Conversion | |
2 | AdaSpeech: Adaptive Text to Speech for Custom Voice | |
3 | Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech | |
4 | Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning | |
5 | CUHK-EE voice cloning system for ICASSP 2021 M2VoC challenge | |
6 | Real-time Timbre Transfer and Sound Synthesis using DDSP | |
7 | The Multi-speaker Multi-style Voice Cloning Challenge 2021 | |
8 | The AS-NU System for the M2VoC Challenge | |
9 | Exploring Disentanglement with Multilingual and Monolingual VQ-VAE | |
10 | Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation | |
11 | Speaker Adaptation with Continuous Vocoder-based DNN-TTS | |
12 | GC-TTS: Few-shot Speaker Adaptation with Geometric Constraints | |
13 | Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration | |
14 | Adapting TTS models For New Speakers using Transfer Learning | |
15 | Cloning one's voice using very limited data in the wild | |
17 | Applying Phonological Features in Multilingual Text-To-Speech | |
18 | Exploring Timbre Disentanglement in Non-Autoregressive Cross-Lingual Text-to-Speech | |
19 | Improve Cross-lingual Voice Cloning Using Low-quality Code-switched Data | |
20 | Revisiting IPA-based Cross-lingual Text-to-speech | |
21 | Towards Lifelong Learning of Multilingual Text-To-Speech Synthesis | |
22 | Cross-lingual Low Resource Speaker Adaptation Using Phonological Features | |
23 | Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech | |
24 | V2C: Visual Voice Cloning |
2020
1 | Cross-lingual Multi-speaker Text-to-Speech under Limited Data Scenario | |
2 | Efficient Neural Speech Synthesis for Low-Resource Languages through Multilingual Modeling | |
3 | End-to-End Code-Switching TTS with Cross-Lingual Language Model | |
4 | Generating Multilingual Voices Using Speaker Space Translation Based on Bilingual Speaker Data | |
5 | One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech | |
6 | Speaker Adaptation of a Multilingual Acoustic Model for Cross-Language Synthesis | |
7 | Multilingual speech synthesis | |
8 | Domain-adversarial training of multi-speaker TTS | |
9 | Focusing on Attention: Prosody Transfer and Adaptative Optimization Strategy for Multi-Speaker End-to-End Speech Synthesis | |
10 | Zero-Shot Multi-Speaker Text-to-Speech with State-of-the-Art Neural Speaker Embeddings | |
11 | Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS | |
12 | Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes | |
13 | Phonological Features for 0-shot Multilingual Speech Synthesis | |
14 | Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation | |
15 | Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion | |
16 | Using IPA-Based Tacotron for Data Efficient Cross-Lingual Speaker Adaptation and Pronunciation Enhancement |
2019
1 | Cross-lingual Multi-speaker Text-to-Speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers | |
2 | Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning | |
3 | A Study on Different Embedding Methods of Speaker Features in Personalized Speech Synthesis (个性化语音合成中说话人特征不同嵌入方式的研究) | |
4 | Cross-lingual Multi-speaker Text-to-Speech Synthesis Using Neural Speaker Embedding | |
5 | Automatic Multispeaker Voice Cloning | pdf code |
6 | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | pdf demo |
7 | Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora |
2017
1 | Speaker adaptation in DNN-based speech synthesis using d-vectors |
2016
6 | Speaker Representations for Speaker Adaptation in Multiple Speakers' BLSTM-RNN-based Speech Synthesis |
2015
6 | Multi-speaker Modeling and Speaker Adaptation for DNN-based TTS Synthesis |
Expressive TTS
2022
1 | Disentangling Style and Speaker Attributes for TTS Style Transfer | |
2 | MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis | |
3 | Distribution augmentation for low-resource expressive text-to-speech | |
4 | Cross-speaker style transfer for text-to-speech using data augmentation | |
5 | Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis | |
6 | Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation | |
7 | Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis | |
8 | StyleWaveGAN: Style-based synthesis of drum sounds with extensive controls using generative adversarial networks | |
9 | StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis | |
10 | GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis | |
11 | End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue | |
12 | Expressive, Variable, and Controllable Duration Modelling in TTS | |
13 | iEmoTTS: Toward Robust Cross-Speaker Emotion Transfer and Control for Speech Synthesis based on Disentanglement between Prosody and Timbre | |
14 | Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems | |
15 | Self-supervised Context-aware Style Representation for Expressive Speech Synthesis | |
16 | Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody | |
17 | Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis | |
18 | Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS | |
19 | PoeticTTS -- Controllable Poetry Reading for Literary Studies | |
20 | Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis | |
21 | Speech Synthesis with Mixed Emotions | |
22 | Towards Cross-speaker Reading Style Transfer on Audiobook Dataset | |
23 | The Role of Voice Persona in Expressive Communication: An Argument for Relevance in Speech Synthesis Design |
2021
1 | Whispered and Lombard Neural Speech Synthesis | |
2 | Expressive Neural Voice Cloning | |
3 | Model architectures to extrapolate emotional expressions in DNN-based text-to-speech | |
4 | Analysis and Assessment of Controllability of an Expressive Deep Learning-based TTS system | |
5 | STYLER: Style Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech | |
6 | Expressive Text-to-Speech using Style Tag | |
7 | Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability | |
8 | Towards Multi-Scale Style Control for Expressive Speech Synthesis | |
9 | AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data | |
10 | Exploring emotional prototypes in a high dimensional TTS latent space | |
11 | Global Rhythm Style Transfer Without Text Transcriptions | |
12 | Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS | |
13 | Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech | |
14 | Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis | |
15 | UniTTS: Residual Learning of Unified Embedding Space for Speech Style Control | |
16 | Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis | |
17 | AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style | |
18 | Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis | |
19 | Information Sieve: Content Leakage Reduction in End-to-End Prosody For Expressive Speech Synthesis | |
20 | Enhancing audio quality for expressive Neural Text-to-Speech | |
21 | Emotional Speech Synthesis for Companion Robot to Imitate Professional Caregiver Speech | |
22 | Controllable cross-speaker emotion transfer for end-to-end speech synthesis | |
23 | Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech | |
24 | GANtron: Emotional Speech Synthesis with Generative Adversarial Networks | |
25 | Improving Emotional Speech Synthesis by Using SUS-Constrained VAE and Text Encoder Aggregation | |
26 | StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis | |
27 | Fine-grained style control in Transformer-based Text-to-speech Synthesis | |
28 | Using multiple reference audios and style embedding constraints for speech synthesis | |
29 | Emotional Prosody Control for Speech Generation | |
30 | Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning | |
31 | Word-Level Style Control for Expressive, Non-attentive Speech Synthesis | |
32 | Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios | |
33 | Multi-speaker Emotional Text-to-speech Synthesizer |
2020
1 | Controllable Neural Prosody Synthesis | |
2 | Fully-Hierarchical Fine-Grained Prosody Modeling for Interpretable Speech Synthesis | |
3 | Flowtron: An Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis | |
4 | Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion | |
5 | Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis | |
6 | Controllable Emotion Transfer For End-to-End Speech Synthesis | |
7 | Fine-grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis |
2019
1 | Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency | |
2 | Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis | |
3 | Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens |
2018
1 | Hierarchical Generative Modeling for Controllable Speech Synthesis | |
2 | Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron | |
3 | Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | |
4 | Predicting Expressive Speaking Style from Text in End-to-End Speech Synthesis |
Voice Conversion
2022
1 | Invertible Voice Conversion | |
2 | Emotion Intensity and its Control for Emotional Voice Conversion | |
3 | IQDUBBING: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion | |
4 | Noise-robust voice conversion with domain adversarial training | |
5 | DRVC: A Framework of Any-to-Any Voice Conversion with Self-Supervised Learning | |
6 | AVQVC: One-shot Voice Conversion by Vector Quantization with applying contrastive learning | |
7 | An Overview & Analysis of Sequence-to-Sequence Emotional Voice Conversion | |
8 | Analysis of Voice Conversion and Code-Switching Synthesis Using VQ-VAE | |
9 | DGC-vector: A new speaker embedding for zero-shot voice conversion | |
10 | Disentangling Content and Fine-grained Prosody Information via Hybrid ASR Bottleneck Features for Voice Conversion | |
12 | Efficient Non-Autoregressive GAN Voice Conversion using VQWav2vec Features and Dynamic Convolution | |
13 | Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE | |
14 | HiFi-VC: High Quality ASR-Based Voice Conversion | |
15 | Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion | |
16 | SpeechSplit 2.0: Unsupervised Speech Disentanglement for Voice Conversion without Tuning Autoencoder Bottlenecks | |
17 | Text-free non-parallel many-to-many voice conversion using normalising flows | |
18 | Enhanced exemplar autoencoder with cycle consistency loss in any-to-one voice conversion | |
19 | Time Domain Adversarial Voice Conversion for ADD 2022 | |
21 | Towards Improved Zero-shot Voice Conversion with Conditional DSVAE | |
22 | End-to-End Zero-Shot Voice Style Transfer with Location-Variable Convolutions | |
23 | An Evaluation of Three-Stage Voice Conversion Framework for Noisy and Reverberant Conditions | |
24 | End-to-End Voice Conversion with Information Perturbation | |
25 | Identifying Source Speakers for Voice Conversion based Spoofing Attacks on Speaker Verification Systems | |
26 | Speak Like a Dog: Human to Non-human creature Voice Conversion | |
27 | Speak Like a Professional: Increasing Speech Intelligibility by Mimicking Professional Announcer Voice with Voice Conversion | |
28 | Streaming non-autoregressive model for any-to-many voice conversion | |
29 | Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion | |
30 | A Comparative Study of Self-supervised Speech Representation Based Voice Conversion | |
31 | Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion | |
32 | GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion | |
33 | Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers | |
34 | Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion | |
35 | TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training | |
36 | ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Rhythm | |
37 | Boosting Star-GANs for Voice Conversion with Contrastive Discriminator | |
38 | DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion | |
39 | Investigation into Target Speaking Rate Adaptation for Voice Conversion |
2021
1 | EmoCat: Language-Agnostic Emotional Voice Conversion | |
2 | Building Multilingual TTS using Cross-Lingual Voice Conversion | |
3 | High-Quality Vocoding Design with Signal Processing for Speech Synthesis and Voice Conversion | |
4 | Hierarchical Disentangled Representation Learning for Singing Voice Conversion | |
5 | Adversarially learning disentangled speech representations for robust multi-factor voice conversion | |
6 | Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram | |
7 | Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion | |
8 | crank: An Open-Source Software for Nonparallel Voice Conversion Based on Vector-Quantized Variational Autoencoder | |
9 | MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames | |
10 | Axial Residual Networks for CycleGAN-based Voice Conversion | |
11 | Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning | |
12 | CycleDRUMS: Automatic Drum Arrangement For Bass Lines Using CycleGAN | |
13 | Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques | |
14 | S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations | |
15 | StarGAN-based Emotional Voice Conversion for Japanese Phrases | |
16 | NoiseVC: Towards High Quality Zero-Shot Voice Conversion | |
17 | Non-autoregressive sequence-to-sequence voice conversion | |
18 | FastS2S-VC: Streaming Non-Autoregressive Sequence-to-Sequence Voice Conversion | |
21 | Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss | |
22 | Towards end-to-end F0 voice conversion based on Dual-GAN with convolutional wavelet kernels | |
23 | An Adaptive Learning based Generative Adversarial Network for One-To-One Voice Conversion | |
24 | Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction | |
25 | Voice Conversion Based Speaker Normalization for Acoustic Unit Discovery | |
26 | DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion | |
27 | Emotional Voice Conversion: Theory, Databases and ESD | |
28 | Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance | |
29 | A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion | |
30 | Enriching Source Style Transfer in Recognition-Synthesis based Non-Parallel Voice Conversion | |
31 | Improving robustness of one-shot voice conversion with deep discriminative speaker encoder | |
32 | NVC-Net: End-to-End Adversarial Voice Conversion | |
33 | Pathological voice adaptation with autoencoder-based voice conversion | |
34 | Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments | |
35 | VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion | |
36 | StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion | |
37 | On Prosody Modeling for ASR+TTS based Voice Conversion | |
38 | An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation | |
39 | Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder | |
40 | Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer | |
42 | StarGAN-VC+ASR: StarGAN-based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition | |
43 | Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning | |
44 | Noisy-to-Noisy Voice Conversion Framework with Denoising Model | |
45 | Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion | |
46 | Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme | |
47 | Decoupling Speaker-Independent Emotions for Voice Conversion Via Source-Filter Networks | |
48 | MediumVC: Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features | |
49 | S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations | |
50 | Sequence-To-Sequence Voice Conversion using F0 and Time Conditioning and Adversarial Learning | |
51 | Speech Enhancement-assisted Stargan Voice Conversion in Noisy Environments | |
52 | Toward Degradation-Robust Voice Conversion | |
53 | Towards Identity Preserving Normal to Dysarthric Voice Conversion | |
54 | A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion | |
55 | AC-VC: Non-parallel Low Latency Phonetic Posteriorgrams Based Voice Conversion | |
56 | Attention-Guided Generative Adversarial Network for Whisper to Normal Speech Conversion | |
57 | CycleTransGAN-EVC: A CycleGAN-based Emotional Voice Conversion Model with Transformer | |
58 | Direct Noisy Speech Modeling for Noisy-to-Noisy Voice Conversion | |
59 | One-shot Voice Conversion For Style Transfer Based On Speaker Adaptation | |
60 | SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System for Both Human Beings and Machines | |
61 | Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features | |
62 | Conditional Deep Hierarchical Variational Autoencoder for Voice Conversion | |
63 | YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone |
2020
1 | Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data | |
2 | An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning | |
3 | Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion | |
4 | Seen and Unseen Emotional Style Transfer for Voice Conversion with a New Emotional Speech Dataset | |
5 | Any-to-One Sequence-to-Sequence Voice Conversion Using Self-Supervised Discrete Speech Representations | |
6 | GAZEV: GAN-Based Zero-Shot Voice Conversion over Non-parallel Speech Corpus | |
7 | Towards Low-Resource StarGAN Voice Conversion Using Weight Adaptive Instance Normalization | |
8 | CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion | |
9 | Accent and Speaker Disentanglement in Many-to-many Voice Conversion |
2019
1 | AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss | |
2 | An Overview of Voice Conversion Systems | |
3 | Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion | |
4 | Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations |
2017
1 | An Overview of Voice Conversion Systems |
Singing Voice Synthesis
2022
1 | Improving Adversarial Waveform Generation based Singing Voice Conversion with Harmonic Signals | |
2 | partitura: A Python Package for Handling Symbolic Musical Data | |
3 | FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control | |
4 | Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis | |
5 | MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder | |
6 | Quantized GAN for Complex Music Generation from Dance Videos | |
7 | Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder | |
8 | Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher | |
9 | Music Generation Using an LSTM | |
10 | SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy | |
11 | U-Singer: Multi-Singer Singing Voice Synthesizer that Controls Emotional Intensity | |
12 | WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses | |
13 | Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis | |
14 | Deep Performer: Score-to-Audio Music Performance Synthesis | |
15 | Learning the Beauty in Songs: Neural Singing Voice Beautifier | |
16 | SUSing: SU-net for Singing Voice Synthesis | |
17 | Muskits: an End-to-End Music Processing Toolkit for Singing Voice Synthesis | |
18 | A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion | |
19 | Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis | |
20 | Multi-instrument Music Synthesis with Spectrogram Diffusion | |
21 | HouseX: A Fine-grained House Music Dataset and its Potential in the Music Industry | |
22 | WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training | |
23 | What is missing in deep music generation? A study of repetition and structure in popular music | |
24 | A New Corpus for Computational Music Research and A Novel Method for Musical Structure Analysis | |
25 | MeloForm: Generating Melody with Musical Form based on Expert Systems and Neural Networks | |
26 | Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer | |
27 | Musika! Fast Infinite Waveform Music Generation | |
28 | Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN | |
29 | musicaiz: A Python Library for Symbolic Music Generation, Analysis and Visualization | |
30 | Domain Adversarial Training on Conditional Variational Auto-Encoder for Controllable Music Generation | |
31 | SongDriver: Real-time Music Accompaniment Generation without Logical Latency nor Exposure Bias | |
2021
1 | Anyone GAN Sing | |
2 | Latent Space Explorations of Singing Voice Synthesis using DDSP | |
3 | Learning to Generate Music With Sentiment | |
4 | Hierarchical Disentangled Representation Learning for Singing Voice Conversion | |
5 | Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis | |
6 | DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis | |
7 | LoopNet: Musical Loop Synthesis Conditioned On Intuitive Musical Parameters | |
9 | Music Generation using Deep Learning | |
10 | MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis | |
11 | N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement | |
12 | Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System | |
13 | A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis | |
14 | An Empirical Study on End-to-End Singing Voice Synthesis with Encoder-Decoder Architectures | |
15 | A Melody-Unsupervision Model for Singing Voice Synthesis | |
16 | A Survey on Recent Deep Learning-driven Singing Voice Synthesis Systems | |
17 | DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding | |
18 | Enhanced Memory Network: The novel network structure for Symbolic Music Generation | |
19 | KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE using Mel-spectrograms | |
20 | KaraTuner: Towards End-to-End Natural Pitch Correction for Singing Voice in Karaoke | |
21 | Pitch Preservation In Singing Voice Synthesis | |
22 | SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation | |
23 | Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding | |
26 | A-Muze-Net: Music Generation by Composing the Harmony based on the Generated Melody | |
27 | Learning To Generate Piano Music With Sustain Pedals | |
28 | Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control | |
29 | Symbolic Music Loop Generation with VQ-VAE | |
30 | Video Background Music Generation with Controllable Music Transformer | |
31 | Zero-shot Singing Technique Conversion | |
32 | Evaluating Deep Music Generation Methods Using Data Augmentation | |
33 | Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus | |
34 | EmotionBox: a music-element-driven emotional music generation system using Recurrent Neural Network |
2020
1 | HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis | |
2 | ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders | |
3 | DurIAN-SC: Duration Informed Attention Network Based Singing Voice Conversion System | |
4 | Jukebox: A Generative Model for Music | |
5 | XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System | |
6 | Speech-to-Singing Conversion Based on Boundary Equilibrium GAN | |
7 | A Comprehensive Survey on Deep Music Generation: Multi-level Representations, Algorithms, Evaluations, and Future Directions |
2019
1 | Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens |
Talking Head
2022
1 | Multi-modal data fusion of Voice and EMG data for Robotic Control | |
2 | Stitch it in Time: GAN-Based Facial Editing of Real Videos | |
3 | Audio-Driven Talking Face Video Generation with Dynamic Convolution Kernels | |
4 | DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering | |
5 | Improving Cross-lingual Speech Synthesis with Triplet Training Scheme | |
6 | VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion | |
7 | CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations | |
8 | Recent Advances and Challenges in Deep Audio-Visual Correlation Learning | |
9 | Freeform Body Motion Generation from Speech | |
10 | Transformer-based Multimodal Information Fusion for Facial Expression Analysis | |
11 | Talking Head Generation Driven by Speech-Related Facial Action Units and Audio Based on Multimodal Representation Fusion | |
12 | VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices | |
13 | Lip to Speech Synthesis with Visual Context Attentional GAN | |
14 | Residual-guided Personalized Speech Synthesis based on Face Image | |
15 | Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video | |
16 | Text/Speech-Driven Full-Body Animation | |
17 | Talking Face Generation with Multilingual TTS | |
18 | A Novel Speech-Driven Lip-Sync Model with CNN and LSTM | |
19 | FlexLip: A Controllable Text-to-Lip System | |
20 | Learning Speaker-specific Lip-to-Speech Generation | |
21 | VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection | |
22 | Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models | |
23 | Audio Input Generates Continuous Frames to Synthesize Facial Video Using Generative Adversarial Networks | |
24 | FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis | |
25 | StableFace: Analyzing and Improving Motion Stability for Talking Face Generation | |
26 | Facial Landmark Predictions with Applications to Metaverse | |
27 | AutoLV: Automatic Lecture Video Generator | |
28 | Continuously Controllable Facial Expression Editing in Talking Face Videos | |
29 | TIMIT-TTS: a Text-to-Speech Dataset for Multimodal Synthetic Media Detection | |
30 | Talking Head from Speech Audio using a Pre-trained Image Generator | |
31 | Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild |
2021
1 | Generating coherent spontaneous speech and gesture from text | |
2 | Creating Song From Lip and Tongue Videos With a Convolutional Vocoder | |
3 | Speak with Your Hands: Using Continuous Hand Gestures to Control Articulatory Speech Synthesizer | |
4 | What is Multimodality? | |
5 | MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement | |
6 | Voice2Mesh: Cross-Modal 3D Face Model Generation from Voices | |
7 | Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary | |
8 | Recent Advances and Trends in Multimodal Deep Learning: A Review | |
9 | Rethinking the constraints of multimodal fusion: case study in Weakly-Supervised Audio-Visual Video Parsing | |
10 | Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking | |
11 | LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization | |
12 | NWT: Towards natural audio-to-video generation with representation learning | |
13 | Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text | |
14 | Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion | |
15 | A Survey on Audio Synthesis and Audio-Visual Multimodal Processing | |
16 | Integrated Speech and Gesture Synthesis | |
17 | AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person | |
18 | Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates | |
19 | Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation | |
20 | Audio-to-Image Cross-Modal Generation | |
21 | Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor | |
22 | Talking Head Generation with Audio and Speech Related Facial Action Units | |
23 | LiMuSE: Lightweight Multi-modal Speaker Extraction | |
24 | Metric-based multimodal meta-learning for human movement identification via footstep recognition | |
25 | FaceFormer: Speech-Driven 3D Facial Animation with Transformers | |
26 | Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation | |
27 | PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound |
2020
1 | What comprises a good talking head video generation? A Survey and Benchmark | pdf code |
2 | A Novel Face-tracking Mouth Controller and its Application to Interacting with Bioacoustic Models | |
3 | Large-scale multilingual audio visual dubbing |
2019
1 | (talking head) Text-based Editing of Talking-head Video | pdf video |
2 | Talking Face Generation by Adversarially Disentangled Audio-Visual Representation | pdf code demo |
Robust TTS
2020
1 | Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS | |
2 | Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement | |
3 | Data Efficient Voice Cloning from Noisy Samples with Domain Adversarial Training |
2019
1 | Neural Text to Speech Adaptation from Low Quality Public Recordings |
2018
1 | Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization |
Front End
2022
1 | Neural Grapheme-to-Phoneme Conversion with Pre-trained Grapheme Models | |
2 | Polyphone disambiguation and accent prediction using pre-trained language models in Japanese TTS front-end | |
3 | An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer | |
4 | g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin | |
5 | Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization | |
6 | A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation | |
7 | Automatic Prosody Annotation with Pre-Trained Text-Speech Model | |
8 | SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation | |
9 | A Polyphone BERT for Polyphone Disambiguation in Mandarin Chinese | |
10 | Detection of Prosodic Boundaries in Speech Using Wav2Vec 2.0 | |
11 | Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech |
2021
1 | Polyphone Disambiguation in Mandarin Chinese with Semi-Supervised Learning | |
2 | Grapheme-to-Phoneme Transformer Model for Transfer Learning Dialects | |
3 | Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems | |
4 | Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis | |
5 | A Unified Transformer-based Framework for Duplex Text Normalization |
2020
1 | A Unified Sequence-to-Sequence Front-End Model for Mandarin Text-to-Speech Synthesis | |
2 | A Hybrid Text Normalization System Using Multi-Head Self-Attention for Mandarin | |
3 | A Mask-based Model for Mandarin Chinese Polyphone Disambiguation | |
4 | Unified Mandarin TTS Front-end Based on Distilled BERT Model |
2019
1 | A Mandarin Prosodic Boundary Prediction Model Based on Multi Task Learning | |
2 | Token Level Ensemble Distillation for Grapheme to Phoneme Conversion | |
3 | Pre trained Text Representations for Improving Front End Text Processing in Mandarin Text to Speech Synthesis |
2018
1 | Mandarin Prosody Prediction Based on Attention Mechanism and Multi-model Ensemble |
2016
1 | Improving Prosodic Boundaries Prediction for Mandarin Speech Synthesis by Using Enhanced Embedding Feature and Model Fusion Approach |
2015
1 | Automatic Prosody Prediction for Chinese Speech Synthesis Using BLSTM-RNN and Embedding Features
Alignment
2022
2021
1 | Triple M: A Practical Neural Text-to-speech System With Multi-guidance Attention And Multi-band Multi-time Lpcnet | |
2 | Multi-rate attention architecture for fast streamable Text-to-speech spectrum modeling |
2020
1 | Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis | |
2 | Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding | |
3 | Peking Opera Synthesis via Duration Informed Attention Network | |
4 | Understanding Self-Attention of Self-Supervised Audio Transformers |
2019
1 | Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments | |
2 | Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS |
2018
1 | Monotonic Chunkwise Attention | |
2 | Forward Attention in Sequence-to-Sequence Acoustic Modeling for Speech Synthesis |
2017
1 | Online and Linear-Time Attention by Enforcing Monotonic Alignments | |
2 | Attention Is All You Need |
Dual Learning
2022
2021
1 | Exploring Machine Speech Chain for Domain Adaptation and Few-Shot Speaker Adaptation |
2020
1 | LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition | |
2 | Almost Unsupervised Text to Speech and Automatic Speech Recognition |
2018
1 | Machine Speech Chain with One-shot Speaker Adaptation | |
2 | Listening while Speaking: Speech Chain by Deep Learning |
EEG
2022
2021
1 | On Interfacing the Brain with Quantum Computers: An Approach to Listen to the Logic of the Mind |
2020
1 | Advancing Speech Synthesis using EEG | |
2 | Speech Synthesis using EEG | |
3 | Predicting Different Acoustic Features from EEG and towards direct synthesis of Audio Waveform from EEG |
S2S
2022
1 | CVSS Corpus and Massively Multilingual Speech-to-Speech Translation | |
2 | Creating Speech-to-Speech Corpus from Dubbed Series | |
3 | Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation | |
4 | Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation | |
5 | Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation | |
6 | TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation |
2021
1 | Assessing Evaluation Metrics for Speech-to-Speech Translation | |
2 | Direct simultaneous speech to speech translation | |
3 | Incremental Speech Synthesis For Speech-To-Speech Translation | |
4 | Textless Speech-to-Speech Translation on Real Data |
Other
2022
1 | J-MAC: Japanese multi-speaker audiobook corpus for speech synthesis | |
2 | KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus With More Data, Speakers, and Topics | |
3 | Residual-Guided Non-Intrusive Speech Quality Assessment | |
4 | Robotic Speech Synthesis: Perspectives on Interactions, Scenarios, and Ethics | |
5 | STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent | |
6 | The VoiceMOS Challenge 2022 | |
7 | Improving Self-Supervised Learning-based MOS Prediction Networks | |
8 | LibriS2S: A German-English Speech-to-Speech Translation Corpus | |
9 | Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch | |
10 | Fusion of Self-supervised Learned Models for MOS Prediction | |
11 | Karaoker: Alignment-free singing voice synthesis with speech training data | |
12 | Arabic Text-To-Speech (TTS) Data Preparation | |
13 | DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores | |
14 | SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis | |
15 | A Comparison of Deep Learning MOS Predictors for Speech Synthesis Quality | |
16 | UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022 | |
17 | MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment | |
18 | Into-TTS : Intonation Template based Prosody Control System | |
19 | Merkel Podcast Corpus: A Multimodal Dataset Compiled from 16 Years of Angela Merkel's Weekly Video Podcasts | |
20 | Macedonian Speech Synthesis for Assistive Technology Applications | |
21 | TuGeBiC: A Turkish German Bilingual Code-Switching Corpus | |
22 | Audio Similarity is Unreliable as a Proxy for Audio Quality | |
23 | Comparison of Speech Representations for the MOS Prediction System | |
24 | SAQAM: Spatial Audio Quality Assessment Metric | |
25 | Speech Quality Assessment through MOS using Non-Matching References | |
26 | The ZevoMOS entry to VoiceMOS Challenge 2022 | |
27 | Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities | |
28 | EEG2Mel: Reconstructing Sound from Brain Responses to Music | |
29 | BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus | |
30 | DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech | |
31 | Evaluating generative audio systems and their metrics | |
32 | Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks | |
33 | MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline | |
34 | ESPnet-ONNX: Bridging a Gap Between Research and Production | |
35 | Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset |
2021
1 | MBNet: MOS Prediction for Synthesized Speech with Mean-Bias Network | |
2 | Hi-Fi Multi-Speaker English TTS Dataset | |
3 | ProsoBeast Prosody Annotation Tool | |
4 | KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset | |
5 | Deep Learning Based Assessment of Synthetic Speech Naturalness | |
6 | Speaker disentanglement in video-to-speech conversion | |
7 | Voice of Your Brain: Cognitive Representations of Imagined Speech, Overt Speech, and Speech Perception Based on EEG | |
8 | ADEPT: A Dataset for Evaluating Prosody Transfer | |
9 | EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model | |
10 | HUI-Audio-Corpus-German: A high quality TTS dataset | |
11 | Mixtures of Deep Neural Experts for Automated Speech Scoring | |
12 | RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis | |
13 | Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging | |
14 | Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging | |
15 | Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input | |
16 | An Objective Evaluation Framework for Pathological Speech Synthesis | |
17 | Digital Einstein Experience: Fast Text-to-Speech for Conversational AI | |
18 | Translatotron 2: Robust direct speech-to-speech translation | |
19 | Direct speech-to-speech translation with discrete units | |
20 | Fighting Game Commentator with Pitch and Loudness Adjustment Utilizing Highlight Cues | |
21 | RW-Resnet: A Novel Speech Anti-Spoofing Model Using Raw Waveform | |
22 | "Hello, It's Me": Deep Learning-based Speech Synthesis Attacks in the Real World | |
23 | FMFCC-A: A Challenging Mandarin Dataset for Synthetic Speech Detection | |
24 | AQP: An Open Modular Python Platform for Objective Speech and Audio Quality Metrics | |
25 | Generalization Ability of MOS Prediction Networks | |
26 | LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech | |
27 | Objective Measures of Perceptual Audio Quality Reviewed: An Evaluation of Their Application Domain Dependence | |
28 | How Deep Are the Fakes? Focusing on Audio Deepfake: A Survey | |
29 | Cross-lingual Low Resource Speaker Adaptation Using Phonological Features | |
30 | Visualising and Explaining Deep Learning Models for Speech Quality Prediction |