Journals and conferences on speech
CCF-A | NeurIPS AAAI IJCAI ACMMM |
CCF-B | ICASSP COLING SpeechCom TSLP TASLP JSLHR TMM TOMCCAP ICME |
CCF-C | INTERSPEECH ICPR |
other | ICLR |
Hybrid & General ASR
2022
1 | Improving End-to-End Contextual Speech Recognition with Fine-grained Contextual Knowledge Selection | |
2 | Sentiment-Aware Automatic Speech Recognition pre-training for enhanced Speech Emotion Recognition | |
3 | Internal language model estimation through explicit context vector learning for attention-based encoder-decoder ASR | |
4 | Synthesizing Dysarthric Speech Using Multi-talker TTS for Dysarthric Speech Recognition | |
5 | Dual-Decoder Transformer For end-to-end Mandarin Chinese Speech Recognition with Pinyin and Character | |
6 | Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition | |
7 | Human and Automatic Speech Recognition Performance on German Oral History Interviews | |
8 | Recent Progress in the CUHK Dysarthric Speech Recognition System | |
9 | The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition | |
10 | Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR | |
11 | Ask2Mask: Guided Data Selection for Masked Speech Modeling | |
12 | The PCG-AIID System for L3DAS22 Challenge: MIMO and MISO convolutional recurrent Network for Multi Channel Speech Enhancement and Speech Recognition | |
13 | Non-Autoregressive ASR with Self-Conditioned Folded Encoders | |
14 | MLP-ASR: Sequence-length agnostic all-MLP architectures for speech recognition | |
15 | Conversational Speech Recognition By Learning Conversation-level Characteristics | |
16 | The RoyalFlush System of Speech Recognition for M2MeT Challenge | |
17 | Visual Speech Recognition for Multiple Languages in the Wild | |
18 | Spanish and English Phoneme Recognition by Training on Simulated Classroom Audio Recordings of Collaborative Learning Environments | |
19 | Wav2Vec2.0 on the Edge: Performance Evaluation | |
20 | 4-bit Conformer with Native Quantization Aware Training for Speech Recognition | |
21 | A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings | |
22 | Chain-based Discriminative Autoencoders for Speech Recognition | |
23 | CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR | |
24 | Enhancing Speech Recognition Decoding via Layer Aggregation | |
25 | Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR | |
26 | Locality Matters: A Locality-Biased Linear Attention for Automatic Speech Recognition | |
27 | Shifted Chunk Encoder for Transformer Based Streaming End-to-End ASR | |
28 | Similarity and Content-based Phonetic Self Attention for Speech Recognition | |
29 | Speaker recognition by means of a combination of linear and nonlinear predictive models | |
30 | STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation | |
31 | Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings | |
32 | Transformer-based Streaming ASR with Cumulative Attention | |
33 | Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models | |
34 | Variational Auto-Encoder Based Variability Encoding for Dysarthric Speech Recognition | |
35 | Improved far-field speech recognition using Joint Variational Autoencoder | |
36 | E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR | |
37 | Self-critical Sequence Training for Automatic Speech Recognition | |
38 | 3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition | |
39 | A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition | |
40 | Text-To-Speech Data Augmentation for Low Resource Speech Recognition | |
41 | Multiple Confidence Gates For Joint Training Of SE And ASR | |
42 | End-to-End Multi-speaker ASR with Independent Vector Analysis | |
43 | Filter-based Discriminative Autoencoders for Children Speech Recognition | |
44 | Global Normalization for Streaming Speech Recognition in a Modular Framework | |
45 | Heterogeneous Reservoir Computing Models for Persian Speech Recognition | |
46 | PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit | |
47 | Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition | |
48 | Minimising Biasing Word Errors for Contextual ASR with the Tree-Constrained Pointer Generator | |
49 | Unified Modeling of Multi-Domain Multi-Device ASR Systems | |
50 | Conformer with dual-mode chunked attention for joint online and offline ASR | |
51 | Context-based out-of-vocabulary word recovery for ASR systems in Indian languages | |
52 | Improving the Training Recipe for a Robust Conformer-based Hybrid Model | |
53 | Nextformer: A ConvNeXt Augmented Conformer For End-To-End Speech Recognition | |
54 | Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition | |
55 | Squeezeformer: An Efficient Transformer for Automatic Speech Recognition | |
56 | Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training | |
57 | Learning a Dual-Mode Speech Recognition Model via Self-Pruning | |
58 | Improving Mandarin Speech Recognition with Block-augmented Transformer | |
59 | Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities | |
60 | Online Continual Learning of End-to-End Speech Recognition Models | |
61 | Intermediate-layer output Regularization for Attention-based Speech Recognition with Shared Decoder | |
62 | Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies | |
63 | Compute Cost Amortized Transformer for Streaming ASR | |
64 | Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition | |
65 | Comparison and Analysis of New Curriculum Criteria for End-to-End ASR | |
66 | Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition | |
67 | Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition | |
68 | Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition | |
69 | Attention Enhanced Citrinet for Speech Recognition | |
70 | Deep Sparse Conformer for Speech Recognition |
2021
1 | The History of Speech Recognition to the Year 2030 | |
2 | Multilingual Speech Recognition using Knowledge Transfer across Learning Processes | |
3 | Efficient domain adaptation of language models in ASR systems using Prompt-tuning | |
4 | Word Order Does Not Matter For Speech Recognition | |
5 | Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition | |
6 | Interactive Feature Fusion for End-to-End Noise-Robust Speech Recognition | |
7 | Personalized Automatic Speech Recognition Trained on Small Disordered Speech Datasets | |
8 | A Study of Low-Resource Speech Commands Recognition based on Adversarial Reprogramming | |
9 | AequeVox: Automated Fairness Testing of Speech Recognition Systems | |
10 | An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition | |
11 | Asynchronous Decentralized Distributed Training of Acoustic Models | |
12 | Beyond Lp clipping: Equalization-based Psychoacoustic Attacks against ASRs | |
13 | Continual learning using lattice-free MMI for speech recognition | |
14 | Explaining the Attention Mechanism of End-to-End Speech Recognition Using Decision Trees | |
15 | Exploring Heterogeneous Characteristics of Layers in ASR Models for More Efficient Training | |
16 | FastCorrect 2: Fast Error Correction on Multiple Candidates for Automatic Speech Recognition | |
17 | Improving Character Error Rate Is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-box Acoustic Model | |
18 | Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition | |
19 | Improving Pseudo-label Training For End-to-end Speech Recognition Using Gradient Mask | |
20 | Integrating Categorical Features in End-to-End ASR | |
21 | Multi-Modal Pre-Training for Automated Speech Recognition | |
22 | Omni-sparsity DNN: Fast Sparsity Optimization for On-Device Streaming E2E ASR via Supernet | |
23 | Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding | |
24 | Parallel Composition of Weighted Finite-State Transducers | |
25 | SCaLa: Supervised Contrastive Learning for End-to-End Automatic Speech Recognition | |
26 | Speech Pattern based Black-box Model Watermarking for Automatic Speech Recognition | |
27 | Speech Technology for Everyone: Automatic Speech Recognition for Non-Native English with Transfer Learning | |
28 | Spell my name: keyword boosted speech recognition | |
29 | Towards efficient end-to-end speech recognition with biologically-inspired neural networks | |
30 | Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR | |
31 | Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition | |
32 | Voice Conversion Can Improve ASR in Very Low-Resource Settings | |
33 | Towards Building ASR Systems for the Next Billion Users | |
34 | Scaling ASR Improves Zero and Few Shot Learning | |
35 | Romanian Speech Recognition Experiments from the ROBIN Project | |
36 | Retrieving Speaker Information from Personalized Acoustic Models for Speech Recognition | |
37 | Recent Advances in End-to-End Automatic Speech Recognition | |
38 | Privacy attacks for automatic speech recognition acoustic models in a federated learning framework | |
39 | Multi-Channel Multi-Speaker ASR Using 3D Spatial Feature | |
40 | Mixed Precision DNN Quantization for Overlapped Speech Separation and Recognition | |
41 | Integrated Semantic and Phonetic Post-correction for Chinese Speech Recognition | |
42 | Effect of noise suppression losses on speech distortion and ASR performance | |
43 | Do We Still Need Automatic Speech Recognition for Spoken Language Understanding? | |
44 | Conformer-based Hybrid ASR System for Switchboard Dataset | |
45 | A comparison of streaming models and data augmentation methods for robust speech recognition | |
46 | Are E2E ASR models ready for an industrial usage? | |
47 | Voice Quality and Pitch Features in Transformer-Based Speech Recognition | |
48 | Investigation of Densely Connected Convolutional Networks with Domain Adversarial Learning for Noise Robust Speech Recognition | |
49 | Continual Learning for Monolingual End-to-End Automatic Speech Recognition | |
50 | Domain Prompts: Towards memory and compute efficient domain adaptation of ASR systems | |
51 | Speech frame implementation for speech analysis and recognition | |
52 | Improving Speech Recognition on Noisy Speech via Speech Enhancement with Multi-Discriminators CycleGAN | |
53 | Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech | |
54 | Revisiting the Boundary between ASR and NLU in the Age of Conversational Dialog Systems | |
55 | Training end-to-end speech-to-text models on mobile phones | |
56 | Robust Speech Representation Learning via Flow-based Embedding Regularization | |
57 | A Mixture of Expert Based Deep Neural Network for Improved ASR | |
58 | A higher order Minkowski loss for improved prediction ability of acoustic model in ASR | |
59 | X-Vector based voice activity detection for multi-genre broadcast speech-to-text |
2020
1 | On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition | |
2 | Conformer: Convolution-augmented Transformer for Speech Recognition | |
3 | ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | |
4 | Improved Noisy Student Training for Automatic Speech Recognition | |
5 | CIF: Continuous Integrate-And-Fire for End-To-End Speech Recognition | |
6 | A Comparison of Label-Synchronous and Frame-Synchronous End-to-End Models for Speech Recognition | |
7 | Imputer: Sequence modelling via imputation and dynamic programming | |
8 | Automatic Speech Recognition Errors Detection and Correction: A Review | |
9 | A review of on-device fully neural end-to-end automatic speech recognition algorithms |
2018
1 | Accelerating recurrent neural network language model based online speech recognition system | |
2 | Towards Language-Universal End-to-End Speech Recognition |
2017
1 | Reducing Bias in Production Speech Models | |
2 | Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition |
RNN-T
2022
1 | Improving the fusion of acoustic and text representations in RNN-T | |
2 | A Study of Transducer based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies | |
3 | A Likelihood Ratio based Domain Adaptation Method for E2E Models | |
4 | Integrating Text Inputs For Training and Adapting RNN Transducer ASR Models | |
5 | Memory-Efficient Training of RNN-Transducer with Sampled Softmax | |
6 | Streaming parallel transducer beam search with fast-slow cascaded encoders | |
7 | Efficient Training of Neural Transducer for Speech Recognition | |
8 | An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition | |
9 | A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes | |
10 | On the Prediction Network Architecture in RNN-T for ASR | |
11 | Pruned RNN-T for fast, memory-efficient ASR training | |
12 | Multiple-hypothesis RNN-T Loss for Unsupervised Fine-tuning and Self-training of Neural Transducer | |
13 | Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition | |
14 | Composing RNNs and FSTs for Small Data: Recovering Missing Characters in Old Hawaiian Text | |
15 | VQ-T: RNN Transducers using Vector-Quantized Prediction Network States | |
16 | ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition | |
17 | Streaming Target-Speaker ASR with Neural Transducer |
2021
1 | Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer | |
2 | Streaming End-to-End Multi-Talker Speech Recognition | |
3 | A Better and Faster End-to-End Model for Streaming ASR | |
4 | Tied & Reduced RNN-T Decoder | |
5 | Tiny Transducer: A Highly-efficient Speech Recognition Model on Edge Devices | |
6 | Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter | |
7 | On Language Model Integration for RNN Transducer based Speech Recognition | |
8 | A Unified Speaker Adaptation Approach for ASR | |
9 | Factorized Neural Transducer for Efficient Language Model Adaptation | |
10 | Input Length Matters: An Empirical Study Of RNN-T And MWER Training For Long-form Telephony Speech Recognition | |
11 | Knowledge Distillation for Neural Transducers from Large Self-Supervised Pre-trained Models | |
12 | Streaming Transformer Transducer Based Speech Recognition Using Non-Causal Convolution | |
13 | Word-level confidence estimation for RNN transducers | |
14 | Sequence Transduction with Graph-based Supervision | |
15 | Joint AEC AND Beamforming with Double-Talk Detection using RNN-Transformer | |
16 | Context-Aware Transformer Transducer for Speech Recognition | |
17 | Deliberation of Streaming RNN-Transducer by Non-autoregressive Decoding | |
18 | Multi-turn RNN-T for streaming recognition of multi-party speech | |
19 | Investigation of Training Label Error Impact on RNN-T |
2020
1 | RNN-T For Latency Controlled ASR With Improved Beam Search | |
2 | Transformer Transducer: A Streamable Speech Recognition Model With Transformer Encoders And RNN-T Loss | |
3 | A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency | |
4 | Towards Fast And Accurate Streaming E2E ASR | |
5 | Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-end Speech Recognition | |
6 | Transfer Learning Approaches for Streaming End-to-End Speech Recognition System | |
7 | Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer | |
8 | Alignment Restricted Streaming Recurrent Neural Network Transducer | |
9 | Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR | |
10 | Improving RNN transducer with normalized jointer network | |
11 | Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer | |
12 | Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data | |
13 | FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization | |
14 | Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition |
2019
1 | Self-Attention Transducers for End-to-End Speech Recognition |
2018
1 | Streaming E2E Speech Recognition For Mobile Devices |
CTC
2022
1 | Improved Mispronunciation detection system using a hybrid CTC-ATT based approach for L2 English speakers | |
2 | Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer | |
3 | Improving CTC-based speech recognition via knowledge transferring from pre-trained language models | |
4 | Multistream neural architectures for cued-speech recognition using a pre-trained visual feature extractor and constrained CTC decoding | |
5 | Adding Connectionist Temporal Summarization into Conformer to Improve Its Decoder Efficiency For Speech Recognition | |
6 | Better Intermediates Improve CTC Inference | |
7 | Multi-sequence Intermediate Conditioning for CTC-based ASR | |
8 | InterAug: Augmenting Noisy Intermediate Predictions for CTC-based ASR | |
9 | Improving CTC-based ASR Models with Gated Interlayer Collaboration | |
10 | A CTC Triggered Siamese Network with Spatial-Temporal Dropout for Speech Recognition | |
11 | Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM | |
12 | Distilling the Knowledge of BERT for CTC-based ASR |
2021
1 | Why does CTC result in peaky behavior? | |
2 | Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input | |
3 | CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition | |
4 | Improved Mask-CTC for Non-Autoregressive End-to-End ASR | |
5 | An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR | |
6 | Back from the future: bidirectional CTC decoding using future information in speech recognition | |
7 | CTC Variations Through New WFST Topologies | |
8 | Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular Subword Units |
2020
1 | Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict |
2019
1 | Automatic Spelling Correction with Transformer for CTC-based End-to-End Speech Recognition |
2018
1 | An improved hybrid CTC-Attention model for speech recognition |
2017
1 | Residual Convolutional CTC Networks for Automatic Speech Recognition | |
2 | Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling |
AED
2022
1 | Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR | |
2 | USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder | |
3 | Towards Contextual Spelling Correction for Customization of End-to-end Speech Recognition Systems | |
4 | Supervised Attention in Sequence-to-Sequence Models for Speech Recognition | |
5 | LegoNN: Building Modular Encoder-Decoder Models |
2021
1 | SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition | |
2 | K-Wav2vec 2.0: Automatic Speech Recognition based on Joint Decoding of Graphemes and Syllables | |
3 | Attention based end to end Speech Recognition for Voice Search in Hindi and English | |
4 | A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement and Speech Separation | |
5 | Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI |
2020
1 | Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition | |
2 | High Performance Sequence-to-Sequence Model for Streaming Speech Recognition | |
3 | Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition | |
4 | Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory | |
5 | CTC-synchronous Training for Monotonic Attention Model | |
6 | Low Latency End-to-End Streaming Speech Recognition with a Scout Network | |
7 | Synchronous Transformers For E2E Speech Recognition | |
8 | Transformer Online CTC/Attention E2E Speech Recognition Architecture | |
9 | Streaming Automatic Speech Recognition With The Transformer Model | |
10 | Minimum Latency Training Strategies For Streaming seq-to-seq ASR | |
11 | Enhancing Monotonic Multihead Attention for Streaming ASR | |
12 | Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition | |
13 | Insertion-Based Modeling for End-to-End Automatic Speech Recognition | |
14 | Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition | |
15 | Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition | |
16 | Lightweight and Efficient End-to-End Speech Recognition Using Low-Rank Transformer |
2019
1 | Streaming Transformer ASR with Blockwise Synchronous Inference | |
2 | Triggered Attention for End-to-End Speech Recognition | |
3 | Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition | |
4 | Spelling Correction Model For E2E Speech Recognition | |
5 | An Empirical Study Of Efficient ASR Rescoring With Transformers | |
6 | Correction of Automatic Speech Recognition with Transformer Sequence-To-Sequence Model |
2018
1 | State-of-the-art Speech Recognition With Sequence-to-Sequence Models | |
2 | Monotonic Chunkwise Attention |
2017
1 | Multilingual Speech Recognition With A Single End-To-End Model | |
2 | Attention-Based End-to-End Speech Recognition in Mandarin | |
3 | Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping |
2016
1 | Wav2Letter: an End-to-End ConvNet-based Speech Recognition System |
2015
1 | Listen, attend and spell: A neural network for large vocabulary conversational speech recognition |
Unified & Rescoring
2022
1 | Two-Pass End-to-End ASR Model Compression | |
2 | Korean Tokenization for Beam Search Rescoring in Speech Recognition | |
3 | WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit | |
4 | RescoreBERT: Discriminative Speech Recognition Rescoring with BERT | |
5 | On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode | |
6 | Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems |
2021
1 | Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition | |
2 | One In A Hundred: Select The Best Predicted Sequence from Numerous Candidates for Streaming Speech Recognition | |
3 | Have best of both worlds: two-pass hybrid and E2E cascading framework for speech recognition | |
4 | WeNet: Production oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit | |
5 | U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition | |
6 | An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR | |
7 | ASR Rescoring and Confidence Estimation with ELECTRA | |
8 | Rescoring Sequence-to-Sequence Models for Text Line Recognition with CTC-Prefixes | |
9 | Lattention: Lattice-attention in ASR rescoring | |
10 | Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model | |
11 | GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio |
2020
1 | Transformer Transducer: One Model Unifying Streaming And Non-Streaming Speech Recognition | |
2 | Universal ASR: Unify And Improve Streaming ASR With Full-Context Modeling | |
3 | Cascaded encoders for unifying streaming and non-streaming ASR | |
4 | Dynamic latency speech recognition with asynchronous revision | |
5 | Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition |
2018
1 | Hybrid CTC-Attention based End-to-End Speech Recognition using Subword Units |
2017
1 | Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM |
Data Aug
2022
1 | Investigation of Data Augmentation Techniques for Disordered Speech Recognition | |
2 | LPC Augment: An LPC-Based ASR Data Augmentation Algorithm for Low and Zero-Resource Children's Dialects | |
3 | Spectral Modification Based Data Augmentation For Improving End-to-End ASR For Children's Speech | |
4 | Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations | |
5 | Auditory-Based Data Augmentation for End-to-End Automatic Speech Recognition | |
6 | Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation | |
7 | Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition | |
8 | Improving Data Driven Inverse Text Normalization using Data Augmentation | |
9 | Data Augmentation for Low-Resource Quechua ASR Improvement | |
10 | Non-Parallel Voice Conversion for ASR Augmentation |
2021
1 | MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition | |
2 | Data Augmentation with Locally-time Reversed Speech for Automatic Speech Recognition | |
3 | Significance of Data Augmentation for Improving Cleft Lip and Palate Speech Recognition | |
4 | Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition | |
5 | Data Augmentation for Speech Recognition in Maltese: A Low-Resource Perspective | |
6 | Data Augmentation based Consistency Contrastive Pre-training for Automatic Speech Recognition | |
7 | PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-modeling Unit Training for Robust Uyghur E2E Speech Recognition |
2020
1 |
2019
1 | SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition |
LM
2022
1 | Neural-FST Class Language Model for End-to-End Speech Recognition | |
2 | Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model | |
3 | Language technology practitioners as language managers: arbitrating data bias and predictive bias in ASR | |
4 | Knowledge Transfer from Large-scale Pretrained Language Models to End-to-end Speech Recognizers | |
5 | A practical framework for multi-domain speech recognition and an instance sampling method to neural language modeling | |
6 | An Empirical Study of Language Model Integration for Transducer based Speech Recognition | |
7 | Improving Speech Recognition for Indic Languages using Language Model | |
8 | Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition | |
9 | Detecting Unintended Memorization in Language-Model-Fused ASR | |
10 | Improving Rare Word Recognition with LM-aware MWER Training | |
11 | Effect and Analysis of Large-scale Language Model Rescoring on Competitive ASR Systems | |
12 | Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems | |
13 | Distilling a Pretrained Language Model to a Multilingual ASR Model | |
14 | Residual Language Model for End-to-end Speech Recognition | |
15 | ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks | |
16 | Bayesian Neural Network Language Modeling for Speech Recognition | |
17 | Bangla-Wave: Improving Bangla Automatic Speech Recognition Utilizing N-gram Language Models | |
18 | SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data |
2021
1 | Private Language Model Adaptation for Speech Recognition | |
2 | Disambiguation-BERT for N-best Rescoring in Low-Resource Conversational ASR | |
3 | Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition | |
4 | Learning Domain Specific Language Models for Automatic Speech Recognition through Machine Translation | |
5 | ViraPart: A Text Refinement Framework for ASR and NLP Tasks in Persian | |
6 | Conversational speech recognition leveraging effective fusion methods for cross-utterance language modeling | |
7 | Mixed Precision of Quantization of Transformer Language Models for Speech Recognition | |
8 | Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition |
Unsupervised
2022
1 | A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition | |
2 | Robust Self-Supervised Audio-Visual Speech Recognition | |
3 | Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction | |
4 | IQDUBBING: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion | |
5 | Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition | |
6 | Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition | |
7 | Self-supervised Learning with Random-projection Quantizer for Speech Recognition | |
8 | The CORAL++ Algorithm for Unsupervised Domain Adaptation of Speaker Recognition | |
9 | Autoregressive Co-Training for Learning Discrete Speech Representations | |
10 | Language Adaptive Cross-lingual Speech Representation Learning with Sparse Sharing Sub-networks | |
11 | Learning Audio Representations with MLPs | |
12 | Privacy-Preserving Speech Representation Learning using Vector Quantization | |
13 | Probing phoneme, language and speaker information in unsupervised speech representations | |
14 | TRILLsson: Distilled Universal Paralinguistic Speech Representations | |
15 | XTREME-S: Evaluating Cross-lingual Speech Representations | |
16 | A Brief Overview of Unsupervised Neural Speech Representation Learning | |
17 | Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition | |
18 | Audio Self-supervised Learning: A Survey | |
19 | Federated Domain Adaptation for ASR with Full Self-Supervision | |
20 | Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment | |
21 | Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition | |
22 | Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition | |
23 | LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT | |
24 | Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data | |
25 | Towards Representative Subset Selection for Self-Supervised Speech Recognition | |
26 | Unsupervised Word Segmentation using K Nearest Neighbors | |
27 | Masked Spectrogram Prediction For Self-Supervised Audio Pre-Training | |
28 | Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition? | |
29 | Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation | |
30 | ATST: Audio Representation Learning with Teacher-Student Transformer | |
31 | Improving Self-Supervised Speech Representations by Disentangling Speakers | |
32 | BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations | |
33 | HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition | |
34 | Can Self-Supervised Learning solve the problem of child speech recognition? | |
35 | Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction | |
36 | Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning | |
37 | Federated Self-supervised Speech Representations: Are We There Yet? | |
38 | Towards End-to-end Unsupervised Speech Recognition | |
39 | Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation | |
40 | Disentangled Speech Representation Learning Based on Factorized Hierarchical Variational Autoencoder with Self-Supervised Objective | |
41 | Unsupervised Data Selection via Discrete Speech Representation for ASR | |
42 | Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices | |
43 | Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition | |
44 | A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems | |
45 | Contrastive Siamese Network for Semi-supervised Speech Recognition | |
46 | Joint Training of Speech Enhancement and Self-supervised Model for Noise-robust ASR | |
47 | Deploying self-supervised learning in the wild for hybrid automatic speech recognition | |
48 | Self-Supervised Speech Representation Learning: A Review | |
49 | SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization | |
50 | Improved Consistency Training for Semi-Supervised Sequence-to-Sequence ASR via Speech Chain Reconstruction and Self-Transcribing | |
51 | Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages | |
52 | Boosting Cross-Domain Speech Recognition with Self-Supervision | |
53 | Censer: Curriculum Semi-supervised Learning for Speech Recognition Based on Self-supervised Pre-training | |
55 | DRAFT: A Novel Framework to Reduce Domain Shifting in Self-supervised Learning and Its Application to Children's ASR | |
56 | FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised Learning Features in Robust End-to-end Speech Recognition | |
57 | Investigation of Ensemble features of Self-Supervised Pretrained Models for Automatic Speech Recognition | |
58 | Joint Encoder-Decoder Self-Supervised Pre-training for ASR | |
59 | Predicting within and across language phoneme recognition performance of self-supervised learning speech pre-trained models | |
60 | Wav2Vec-Aug: Improved self-supervised training with limited data | |
61 | Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network | |
62 | Domain Specific Wav2vec 2.0 Fine-tuning For The SE&R 2022 Challenge | |
63 | Unsupervised data selection for Speech Recognition with contrastive loss ratios | |
64 | Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data | |
65 | Improving Low-Resource Speech Recognition with Pretrained Speech Models: Continued Pretraining vs. Semi-Supervised Training | |
66 | FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning | |
67 | Thai Wav2Vec2.0 with CommonVoice V8 | |
68 | Large vocabulary speech recognition for languages of Africa: multilingual modeling and self-supervised learning | |
69 | Watch What You Pretrain For: Targeted, Transferable Adversarial Examples on Self-Supervised Speech Recognition models | |
70 | Unsupervised domain adaptation for speech recognition with unsupervised error correction | |
71 | An Automatic Speech Recognition System for Bengali Language based on Wav2Vec2 and Transfer Learning | |
72 | Applying wav2vec2 for Speech Recognition on Bengali Common Voices Dataset |
2021
1 | Private Language Model Adaptation for Speech Recognition | |
2 | Analyzing the Robustness of Unsupervised Speech Recognition | |
3 | Combining Unsupervised and Text Augmented Semi-Supervised Learning for Low Resourced Autoregressive Speech Recognition | |
4 | Large-scale ASR Domain Adaptation using Self- and Semi-supervised Learning | |
5 | Wav2vec-S: Semi-Supervised Pre-Training for Speech Recognition | |
6 | WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | |
7 | Unsupervised Speech Enhancement with speech recognition embedding and disentanglement losses | |
8 | Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages | |
9 | Self-Supervised Learning for speech recognition with Intermediate layer supervision |
Multilingual
2022
1 | Reducing language context confusion for end-to-end code-switching automatic speech recognition | |
2 | Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition | |
3 | Data and knowledge-driven approaches for multilingual training to improve the performance of speech recognition systems of Indian languages | |
4 | A Survey of Multilingual Models for Automatic Speech Recognition | |
5 | Code Switched and Code Mixed Speech Recognition for Indic languages | |
6 | Frequency-Directional Attention Model for Multilingual Automatic Speech Recognition | |
7 | Hierarchical Softmax for End-to-End Low-resource Multilingual Speech Recognition | |
8 | Adaptive Activation Network For Low Resource Multilingual Speech Recognition | |
9 | Bilingual End-to-End ASR with Byte-Level Subwords | |
10 | LAE: Language-Aware Encoder for Monolingual and Multilingual ASR | |
11 | Language-specific Characteristic Assistance for Code-switching Speech Recognition | |
12 | Internal Language Model Estimation based Language Model Fusion for Cross-Domain Code-Switching Speech Recognition | |
13 | Investigating the Impact of Cross-lingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition | |
14 | A Language Agnostic Multilingual Streaming On-Device ASR System | |
15 | Investigating data partitioning strategies for crosslinguistic low-resource ASR evaluation | |
16 | Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification | |
17 | Learning ASR pathways: A sparse multilingual ASR model | |
18 | Multilingual Transformer Language Model for Speech Recognition in Low-resource Languages | |
19 | ASR2K: Speech Recognition for Around 2000 Languages without Audio |
2021
1 | GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio | |
2 | Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0 | |
3 | Mandarin-English Code-switching Speech Recognition with Self-supervised Speech Representation Models | |
4 | Minimum word error training for non-autoregressive Transformer-based code-switching ASR | |
5 | Multilingual Speech Recognition using Knowledge Transfer across Learning Processes | |
6 | Joint Unsupervised and Supervised Training for Multilingual ASR | |
7 | Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization | |
8 | Bilingual Speech Recognition by Estimating Speaker Geometry from Video Data | |
9 | Integrating Knowledge in End-to-End Automatic Speech Recognition for Mandarin-English Code-Switching | |
10 | Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition |
Personal
2022
1 | ProtoSound: A Personalized and Scalable Sound Recognition System for Deaf and Hard-of-Hearing Users | |
2 | Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition | |
3 | Domain Adaptation of low-resource Target-Domain models using well-trained ASR Conformer Models | |
4 | End-to-end contextual ASR based on posterior distribution adaptation for hybrid CTC/attention system | |
5 | Curriculum optimization for low-resource speech recognition | |
6 | Enhancing ASR for Stuttered Speech with Limited Data Using Detect and Pass | |
7 | Improving Automatic Speech Recognition for Non-Native English with Transfer Learning and Language Model Decoding | |
8 | Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition | |
9 | PADA: Pruning Assisted Domain Adaptation for Self-Supervised Speech Representations | |
10 | Using Adapters to Overcome Catastrophic Forgetting in End-to-End Automatic Speech Recognition | |
11 | Speaker adaptation for Wav2vec2 based dysarthric ASR | |
12 | Contextual Adapters for Personalized Speech Recognition in Neural Transducers | |
13 | Adaptive multilingual speech recognition with pretrained models | |
14 | A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data | |
15 | Confidence Score Based Conformer Speaker Adaptation for Speech Recognition |
2021
1 | GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio | |
2 | Fast Contextual Adaptation with Neural Associative Memory for On-Device Personalized Speech Recognition | |
3 | Personalized Automatic Speech Recognition Trained on Small Disordered Speech Datasets | |
4 | Personalizing ASR with limited data using targeted subset selection | |
5 | Prompt-tuning in ASR systems for efficient domain-adaptation |
Accent
2022
1 | Investigation of Deep Neural Network Acoustic Modelling Approaches for Low Resource Accented Mandarin Speech Recognition | |
2 | Layer-wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition | |
3 | Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents | |
4 | Cleanformer: A microphone array configuration-invariant, streaming, multichannel neural enhancement frontend for ASR | |
5 | Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data | |
6 | A Highly Adaptive Acoustic Model for Accurate Multi-Dialect Speech Recognition | |
7 | Performance Disparities Between Accents in Automatic Speech Recognition |
2021
1 | GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio | |
2 | Accent-Robust Automatic Speech Recognition Using Supervised and Unsupervised Wav2vec Embeddings | |
3 | Multi-Dialect Arabic Speech Recognition |
Dataset
2022
1 | The Norwegian Parliamentary Speech Corpus | |
2 | CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition | |
3 | Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset | |
4 | Finnish Parliament ASR corpus - Analysis, benchmarks and statistics | |
5 | Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some benchmarks | |
6 | Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset | |
7 | GigaST: A 10,000-hour Pseudo Speech Translation Corpus | |
8 | GWA: A Large High-Quality Acoustic Dataset for Audio Processing | |
9 | SDS-200: A Swiss German Speech to Standard German Text Corpus | |
10 | Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi | |
11 | Bengali Common Voice Speech Dataset for Automatic Speech Recognition | |
12 | TALCS: An Open-Source Mandarin-English Code-Switching Corpus and a Speech Recognition Baseline | |
13 | The Makerere Radio Speech Corpus: A Luganda Radio Corpus for Automatic Speech Recognition | |
14 | Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition | |
15 | UserLibri: A Dataset for ASR Personalization Using Only Text |
2021
1 | GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio | |
2 | Building a Noisy Audio Dataset to Evaluate Machine Learning Approaches for Automatic Speech Recognition Systems | |
3 | CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese | |
4 | WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition | |
5 | Towards Measuring Fairness in Speech Recognition: Casual Conversations Dataset Transcriptions | |
6 | The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage | |
7 | JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification |
Robust
2022
1 | A Conformer Based Acoustic Model for Robust Automatic Speech Recognition | |
2 | Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition | |
3 | Noise-robust Speech Recognition with 10 Minutes Unparalleled In-domain Data | |
4 | RED-ACE: Robust Error Detection for ASR using Confidence Embeddings | |
5 | Speech-enhanced and Noise-aware Networks for Robust Speech Recognition | |
6 | Mask scalar prediction for improving robust automatic speech recognition | |
7 | Hear No Evil: Towards Adversarial Robustness of Automatic Speech Recognition via Multi-Task Learning | |
8 | Calibrate and Refine! A Novel and Agile Framework for ASR-error Robust Intent Detection | |
9 | Speaker Reinforcement Using Target Source Extraction for Robust Automatic Speech Recognition | |
10 | Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping | |
11 | ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding | |
12 | pMCT: Patched Multi-Condition Training for Robust Speech Recognition | |
13 | DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition | |
14 | Analyzing Robustness of End-to-End Neural Models for Automatic Speech Recognition |
2021
1 | Robustifying automatic speech recognition by extracting slowly varying features | |
2 | Perceptual Loss with Recognition Model for Single-Channel Enhancement and Robust ASR | |
3 | Sequential Randomized Smoothing for Adversarially Robust Speech Recognition |
Speaker Diarization
2022
1 | ASR-Aware End-to-end Neural Diarization | |
2 | Royalflush Speaker Diarization System for ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge | |
3 | The CUHK-TENCENT speaker diarization system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge | |
4 | EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers | |
5 | Multi-scale Speaker Diarization with Dynamic Scale Weighting | |
6 | Multi-Target Filter and Detector for Speaker Diarization | |
7 | Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios | |
8 | Improving the Naturalness of Simulated Conversations for End-to-End Neural Diarization | |
9 | Robust End-to-end Speaker Diarization with Generic Neural Clustering | |
10 | Self-supervised Speaker Diarization | |
11 | From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization | |
12 | Multimodal Clustering with Role Induced Constraints for Speaker Diarization | |
13 | Bi-LSTM Scoring Based Similarity Measurement with Agglomerative Hierarchical Clustering (AHC) for Speaker Diarization | |
14 | PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification | |
15 | Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker Diarization | |
16 | Online Neural Diarization of Unlimited Numbers of Speakers | |
17 | Utterance-by-utterance overlap-aware neural diarization with Graph-PIT | |
18 | Online Target Speaker Voice Activity Detection for Speaker Diarization | |
19 | Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription | |
20 | Speaker Diarization and Identification from Single-Channel Classroom Audio Recording Using Virtual Microphones | |
21 | Target Speaker Voice Activity Detection with Transformers and Its Integration with End-to-End Neural Diarization | |
22 | Chronological Self-Training for Real-Time Speaker Diarization | |
23 | Robust Acoustic Domain Identification with its Application to Speaker Diarization | |
24 | Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting |
MultiChannel
2022
1 | Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge | |
2 | The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge | |
3 | Royalflush Speaker Diarization System for ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge | |
4 | The Volcspeech system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge | |
5 | Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study |
MultiModal
2022
1 | Improved Meta Learning for Low Resource Speech Recognition | |
2 | A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection | |
3 | End-to-End Multi-Person Audio/Visual Automatic Speech Recognition | |
4 | AVATAR: Unconstrained Audiovisual Speech Recognition | |
5 | Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition | |
6 | Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment | |
7 | SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning | |
8 | Towards Generalisable Audio Representations for Audio-Visual Navigation | |
9 | Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition | |
10 | Kaggle Competition: Cantonese Audio-Visual Speech Recognition for In-car Commands | |
11 | Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR | |
12 | Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception |
Speech translation
2022
1 | Who Are We Talking About? Handling Person Names in Speech Translation | |
2 | Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation | |
3 | Efficient yet Competitive Speech Translation: FBK@IWSLT2022 | |
4 | Cross-modal Contrastive Learning for Speech Translation | |
5 | ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks | |
6 | Non-Parametric Domain Adaptation for End-to-End Speech Translation | |
7 | On the Impact of Noises in Crowd-Sourced Data for Speech Translation | |
8 | Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation | |
9 | Revisiting End-to-End Speech-to-Text Translation From Scratch | |
10 | The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task | |
11 | M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation | |
12 | A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation | |
13 | Direct Speech Translation for Automatic Subtitling |
Other
2022
1 | Endpoint Detection for Streaming End-to-End Multi-talker ASR | |
2 | How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR | |
3 | Comparative Study of Acoustic Echo Cancellation Algorithms for Speech Recognition System in Noisy Environment | |
4 | Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition | |
5 | Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition | |
6 | Cross-Modal ASR Post-Processing System for Error Correction and Utterance Rejection | |
7 | Towards Better Meta-Initialization with Task Augmentation for Kindergarten-aged Speech Recognition | |
8 | Adversarial Attacks on Speech Recognition Systems for Mission-Critical Applications: A Survey | |
9 | VADOI: Voice-Activity-Detection Overlapping Inference For End-to-end Long-form Speech Recognition | |
10 | Mitigating Closed-model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition | |
11 | ASRPU: A Programmable Accelerator for Low-Power Automatic Speech Recognition | |
12 | A two-step approach to leverage contextual data: speech recognition in air-traffic communications | |
13 | Semantic-aware Speech to Text Transmission with Redundancy Removal | |
14 | Joint Speech Recognition and Audio Captioning | |
15 | Error Correction in ASR using Sequence-to-Sequence Models | |
16 | Visualizing Automatic Speech Recognition -- Means for a Better Understanding? | |
17 | BEA-Base: A Benchmark for ASR of Spontaneous Hungarian | |
18 | Language Dependencies in Adversarial Attacks on Speech Recognition Systems | |
19 | Analysis of EEG frequency bands for Envisioned Speech Recognition | |
20 | Attacks as Defenses: Designing Robust Audio CAPTCHAs Using Attacks on Automatic Speech Recognition Systems | |
21 | Automatic Speech Recognition for Speech Assessment of Preschool Children | |
22 | Building Robust Spoken Language Understanding by Cross Attention between Phoneme Sequence and ASR Hypothesis | |
23 | Computing Optimal Location of Microphone for Improved Speech Recognition | |
24 | Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition | |
25 | Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition | |
26 | How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications | |
27 | Impact of Dataset on Acoustic Models for Automatic Speech Recognition | |
28 | indic-punct: An automatic punctuation restoration and inverse text normalization framework for Indic languages | |
29 | Integrate Lattice-Free MMI into End-to-End Speech Recognition | |
30 | Is Word Error Rate a good evaluation metric for Speech Recognition in Indic Languages? | |
31 | Measuring the Impact of Individual Domain Factors in Self-Supervised Pre-Training | |
32 | Mel Frequency Spectral Domain Defenses against Adversarial Attacks on Speech Recognition Systems | |
33 | Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no direct access to speech data | |
34 | Neural Predictor for Black-Box Adversarial Attacks on Speech Recognition | |
35 | Recent improvements of ASR models in the face of adversarial attacks | |
36 | Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors | |
37 | Seq-2-Seq based Refinement of ASR Output for Spoken Name Capture | |
38 | Spatial Processing Front-End For Distant ASR Exploiting Self-Attention Channel Combinator | |
39 | Towards Privacy-Preserving Speech Representation for Client-Side Data Sharing | |
40 | Vakyansh: ASR Toolkit for Low Resource Indic languages | |
41 | Disappeared Command: Spoofing Attack On Automatic Speech Recognition Systems with Sound Masking | |
42 | Extracting Targeted Training Data from ASR Models, and How to Mitigate It | |
43 | ASR in German: A Detailed Error Analysis | |
44 | Unified Speech-Text Pre-training for Speech Translation and Recognition | |
45 | Building an ASR Error Robust Spoken Virtual Patient System in a Highly Class-Imbalanced Scenario Without Speech Data | |
46 | Exploiting Hidden Representations from a DNN-based Speech Recogniser for Speech Intelligibility Prediction in Hearing-impaired Listeners | |
47 | Defense against Adversarial Attacks on Hybrid Speech Recognition using Joint Adversarial Fine-tuning with Denoiser | |
48 | Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition | |
49 | Successes and critical failures of neural networks in capturing human-like speech recognition | |
50 | Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition | |
51 | End-to-end multi-talker audio-visual ASR using an active speaker attention module | |
52 | End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation | |
53 | Zero-Shot Cross-lingual Aphasia Detection using Automatic Speech Recognition | |
54 | An Investigation on Applying Acoustic Feature Conversion to ASR of Adult and Child Speech | |
55 | FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech | |
56 | Content-Context Factorized Representations for Automated Speech Recognition | |
57 | Insights on Neural Representations for End-to-End Speech Recognition | |
58 | Streaming Noise Context Aware Enhancement For Automatic Speech Recognition in Multi-Talker Environments | |
59 | SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation | |
60 | Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection | |
61 | Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech | |
62 | Hearing voices at the National Library -- a speech corpus and acoustic model for the Swedish language | |
63 | Challenges and Opportunities in Multi-device Speech Processing | |
64 | Conformer Based Elderly Speech Recognition System for Alzheimer's Disease Detection | |
65 | Decoupled Federated Learning for ASR with Non-IID Data | |
66 | Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech | |
67 | FedNST: Federated Noisy Student Training for Automatic Speech Recognition | |
68 | Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition | |
69 | TEVR: Improving Speech Recognition by Token Entropy Variance Reduction | |
70 | The THUEE System Description for the IARPA OpenASR21 Challenge | |
71 | Towards Green ASR: Lossless 4-bit Quantization of a Hybrid TDNN System on the 300-hr Switchboard Corpus | |
72 | Transformer-based Automatic Speech Recognition of Formal and Colloquial Czech in MALACH Project | |
73 | Knowledge-driven Subword Grammar Modeling for Automatic Speech Recognition in Tamil and Kannada | |
74 | Subword Dictionary Learning and Segmentation Techniques for Automatic Speech Recognition in Tamil and Kannada | |
75 | Implementation Of Tiny Machine Learning Models On Arduino 33 BLE For Gesture And Speech Recognition | |
76 | ASR Error Detection via Audio-Transcript entailment | |
77 | Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition | |
78 | Towards Transfer Learning of wav2vec 2.0 for Automatic Lyric Transcription | |
79 | ILASR: Privacy-Preserving Incremental Learning for Automatic Speech Recognition at Production Scale | |
80 | Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation | |
81 | Sotto Voce: Federated Speech Recognition with Differential Privacy Guarantees | |
82 | Position Prediction as an Effective Pretraining Strategy | |
83 | Efficient spike encoding algorithms for neuromorphic speech recognition | |
84 | RSD-GAN: Regularized Sobolev Defense GAN Against Speech-to-Text Adversarial Attacks | |
85 | End-to-end speech recognition modeling from de-identified data | |
86 | Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding | |
87 | Generating gender-ambiguous voices for privacy-preserving speech recognition | |
88 | Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism | |
89 | Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition | |
90 | Swiss German Speech to Text system evaluation | |
91 | Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models | |
92 | Towards Disentangled Speech Representations | |
93 | Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages | |
94 | Low-Level Physiological Implications of End-to-End Learning of Speech Recognition | |
95 | Improving Hypernasality Estimation with Automatic Speech Recognition in Cleft Palate Speech | |
96 | ASR Error Correction with Constrained Decoding on Operation Prediction | |
97 | Adversarial Attacks on ASR Systems: An Overview | |
98 | DENT-DDSP: Data-efficient noisy speech generator using differentiable digital signal processors for explicit distortion modelling and noise-robust speech recognition | |
99 | Multi-stage Progressive Compression of Conformer Transducer for On-device Speech Recognition | |
100 | Blind Signal Dereverberation for Machine Speech Recognition | |
101 | On the Impact of Speech Recognition Errors in Passage Retrieval for Spoken Question Answering | |
102 | Assessing ASR Model Quality on Disordered Speech using BERTScore | |
103 | ESPnet-ONNX: Bridging a Gap Between Research and Production | |
104 | A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation | |
105 | Modeling Dependent Structure for Utterances in ASR Evaluation | |
106 | VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition | |
107 | Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model |
2021
1 | Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric | |
2 | Speech recognition for air traffic control via feature learning and end-to-end training | |
3 | A study on native American English speech recognition by Indian listeners with varying word familiarity level | |
4 | Blackbox Untargeted Adversarial Testing of Automatic Speech Recognition Systems |