ISCSLP@INTERSPEECH 2014 - The 9th International Symposium on Chinese Spoken Language Processing
12-14 September 2014, Singapore

Keynote Speakers



Friday, 12 Sep 2014

Dr Michiel Bacchiani

Saturday, 13 Sep 2014

Prof Tanja Schultz

Sunday, 14 Sep 2014

Dr Yifan Gong

Friday, 12 September 2014
Dr Michiel Bacchiani
Google, USA
Large Scale Neural Network Optimization for Mobile Speech Recognition Applications
Recent years have shown a large-scale adoption of speech recognition by the public, in particular around mobile devices. Google, with its Android operating system, has integrated speech recognition as a key input modality. The decade's worth of speech that our recognizer processes each day is a clear indication of the popularity of this technology with the public. This talk will describe the current mobile speech applications in more detail. In particular, it will provide a more detailed description of the Deep Neural Network (DNN) technology that is used as the acoustic model in this system and its distributed, asynchronous training infrastructure. Since a DNN is a static classifier, it is ill matched to the speech recognition sequence classification problem. The asynchrony that is inherent to our distributed training infrastructure further complicates the optimization of such models. Our recent research efforts have focused on the optimization of the DNN model, matched to the speech recognition problem. This has resulted in three related algorithmic improvements. First, a novel way to bootstrap the training of a DNN model. Second, the use of a sequence-based rather than a frame-based optimization metric. Third, we have succeeded in applying a recurrent neural network structure to our large-scale, large-vocabulary application. These novel algorithms have proven effective even in light of the asynchrony in our training infrastructure. The algorithms have reduced the error rate of our system by 10% or more over DNNs well optimized with a frame-based objective. This trend holds across all 48 languages where we support speech recognition as an input modality.
Biography: Michiel Bacchiani has been an active speech researcher for over 20 years. Although he has worked in various areas of speech, his main focus has been on acoustic modeling for automatic speech recognition. He currently manages the acoustic modeling team of the speech group at Google. His team is responsible for developing novel algorithms and training infrastructure for the acoustic models for all speech recognition applications backing Google services. These systems include its flagship voice search application which is currently fielded in more than 48 languages. At Google, he previously led the efforts around voicemail transcription fielded in the Google Voice application and led the group that produced the transcription component for the YouTube automatic captioning system.

Before joining Google, Michiel Bacchiani worked as a member of technical staff at IBM Research where he was responsible for the recognition component of the IBM system entered in the European Union funded TC-STAR speech-to-speech translation evaluation. Before that, he was a technical staff member at AT&T Labs Research where he co-developed the Scanmail voicemail transcription and navigation prototype and developed the transcription system underlying the AT&T Spoken Document Retrieval system entered in the TREC8 DARPA evaluation.

Michiel Bacchiani received the "ingenieur" (ir.) degree from the Technical University of Eindhoven, The Netherlands and the Ph.D. degree from Boston University, both in electrical engineering. He has authored numerous scientific publications. He has repeatedly been elected a member of the IEEE speech technical committee and has served as a member of various conference and workshop technical committees. He has served as a board member of Speech Communication and has repeatedly served as an area chair for the ICASSP and Interspeech international conferences.


Saturday, 13 Sep 2014
Prof Tanja Schultz
Karlsruhe Institute of Technology (KIT), Germany
Multilingual Automatic Speech Recognition for Code-switching Speech
The performance of speech and language processing technologies has improved dramatically over recent years, with an increasing number of systems being deployed in a variety of languages and applications. Unfortunately, recent methods and models rely heavily on the availability of massive amounts of resources which only become available in languages spoken by a large number of people in countries of great economic interest, and populations with immediate information technology needs. Furthermore, today's speech processing systems target monolingual scenarios for speakers who are assumed to use one single language while interacting via voice. However, I believe that today’s globalized world requires truly multilingual speech processing systems which support phenomena of multilingualism such as code-switching and accented speech. As these are spoken phenomena, methods are required which perform reliably even if only a few resources are available.

In my talk I will present ongoing work at the Cognitive Systems Lab on applying concepts of Multilingual Speech Recognition to rapidly adapt systems to yet unsupported or under-resourced languages. Based on these concepts, I will describe the challenges of building a code-switching speech recognition system using the example of Singaporean speakers code-switching between Mandarin and English. Proposed solutions include the sharing of data and models across both languages to build truly multilingual acoustic models, dictionaries, and language models. Furthermore, I will describe the web-based Rapid Language Adaptation Toolkit (RLAT), which lowers the overall costs for system development by automating the system building process, leveraging crowdsourcing, and reducing the data needs without suffering significant performance losses. The toolkit enables native language experts to build speech recognition components without requiring detailed technology expertise. Components can be evaluated in an end-to-end system allowing for iterative improvements. By keeping the users in the developmental loop, RLAT can learn from the users’ expertise to constantly adapt and improve. This will hopefully revolutionize the system development process for yet under-resourced languages.
Biography: Tanja Schultz received her Ph.D. and Master's in Computer Science from University of Karlsruhe, Germany in 2000 and 1995 respectively and passed the German state examination for teachers of Mathematics, Sports, and Educational Science from Heidelberg University, in 1990. She joined Carnegie Mellon University in 2000 and became a Research Professor at the Language Technologies Institute. Since 2007 she has been a Full Professor at the Department of Informatics of the Karlsruhe Institute of Technology (KIT) in Germany. She directs the Cognitive Systems Lab, where research activities focus on human-machine interfaces with a particular area of expertise on multilingual speech processing and rapid adaptation of speech processing systems to new domains and languages. She co-edited a book on this subject and received several awards for this work, such as the FZI prize for an outstanding Ph.D. thesis in 2001, the Allen Newell Medal for Research Excellence from Carnegie Mellon and the ISCA best paper award in 2002. In 2005 she received the Carnegie Mellon Language Technologies Institute Junior Faculty Chair. Her recent research work on silent speech interfaces based on myoelectric signals received best demo and paper prizes in 2006, 2008, 2009, and 2013 and was awarded the Alcatel-Lucent Research Award for Technical Communication in 2012. Tanja Schultz is the author of more than 280 articles published in books, journals, and proceedings. She regularly serves on many committees and is a member of the Society of Computer Science (GI), the IEEE Computer Society, and the International Speech Communication Association (ISCA), where she currently serves as elected president.


Sunday, 14 Sep 2014
Dr Yifan Gong
Microsoft, USA
Selected Challenges and Solutions for DNN Acoustic Modeling
Acoustic modeling with DNNs (Deep Neural Networks) has been shown to deliver high speech recognition accuracy on a broad range of application scenarios. DNNs are increasingly used in commercial speech recognition products, on either server or device-based computing platforms. This creates opportunities for developing algorithms and engineering solutions for DNN-based modeling.

For large-scale speech recognition applications, this presentation focuses on several recent techniques to make DNNs more effective, including reducing sparseness and run-time cost with SVD-based training, improving robustness to the acoustic environment with i-vector based DNN modeling, adapting to speakers based on a small number of free parameters, increasing language capability by reusing speech training material across languages, parameter tying for multi-style DNN training, reducing word error rate by adding a large amount of untranscribed data, and boosting the accuracy of small DNNs with behavior-transferring training.

The presentation will also identify and elaborate on the limitations of current DNNs in acoustic modeling, illustrated by experimental results from various applications, and discuss some future directions in DNNs for speech recognition.
Biography: Yifan Gong is a Principal Science Manager in the areas of speech modeling core technology, acoustic modeling computing infrastructure, and speech model and language development for Microsoft speech recognition products. His research interests include automatic speech recognition/interpretation, signal processing, algorithm development, and engineering process/infrastructure and management.

He received the B.Sc. from the Department of Communication Engineering, Southeast University, China, the M.Sc. in electrical engineering and instrumentation from the Department of Electronics, University of Paris, France, and the Ph.D. in computer science from the Department of Mathematics and Computer Science, University of Henri Poincaré, France.

He served the National Scientific Research Center (CNRS) and INRIA-Lorraine, France, as Research Engineer and then joined CNRS as Senior Research Scientist. As Associate Lecturer, he taught computer programming and digital signal processing at the Department of Computer Science, University of Henri Poincaré. He was a Visiting Research Fellow at the Communications Research Center of Canada. As Senior Member of Technical Staff, he worked for Texas Instruments at the Speech Technologies Lab, where he developed speech modeling technologies robust against noisy environments, designed systems, algorithms, and software for speech and speaker recognition, and delivered memory- and CPU-efficient recognizers for mobile devices. He joined Microsoft in 2004.

Yifan Gong has authored over 130 publications in journals, IEEE Transactions, books, and conferences. He has been awarded over 30 U.S. patents. His specific contributions to speech recognition include stochastic trajectory modeling, source normalization HMM training, joint compensation of additive and convolutional noises, variable parameter HMM, and “Speech recognition in noisy environments: A survey” [Speech Communication 16 (3), 261-291]. In these areas, he has given tutorials and other invited presentations at international conferences. He has served as a technical committee member and session chair for many international conferences, and with the IEEE Signal Processing Spoken Language Technical Committees from 1998 to 2002 and since 2013.




Copyright © 2013-2014 Chinese and Oriental Languages Information Processing Society
Conference managed by Meeting Matters International