ISCSLP@INTERSPEECH 2014 - The 9th International Symposium on Chinese Spoken Language Processing
12-14 September 2014, Singapore


Saturday, 13 September 2014

The ISCSLP 2014 Organising Committee is pleased to announce the following four tutorials, presented by distinguished speakers, to be offered on Saturday, 13 September 2014. Each tutorial will be two (2) hours in duration, and registration is free for ISCSLP 2014 delegates.

The tutorial handouts will be provided electronically, ahead of the tutorials. Please download and print at your convenience, as we will not be providing hard copies of these at the conference.


1330 – 1530


Adaptation Techniques for Statistical Speech Recognition
- Kai Yu


Emotion and Mental State Recognition: Features, Models, System Applications and Beyond
- Chung-Hsien Wu, Hsin-Min Wang, Julien Epps and Vidhyasaharan Sethu


1600 – 1800


Unsupervised Speech and Language Processing via Topic Models
- Jen-Tzung Chien


Deep Learning for Speech Generation and Synthesis
- Yao Qian and Frank K. Soong


Title: Adaptation Techniques for Statistical Speech Recognition
Presenters: Kai Yu (Shanghai Jiao Tong University, Shanghai)

Abstract: Adaptation is a technique for making better use of existing models on test data from new acoustic or linguistic conditions. It is an important and challenging research area of statistical speech recognition. This tutorial gives a systematic review of the fundamental theories as well as an introduction to state-of-the-art adaptation techniques, covering both acoustic and language model adaptation. Following a simple example of acoustic model adaptation, the basic concepts, procedures and categories of adaptation will be introduced. Then, a number of advanced adaptation techniques will be discussed, such as discriminative adaptation, Deep Neural Network adaptation, adaptive training and the relationship to noise robustness. After the detailed review of acoustic model adaptation, an introduction to language model adaptation, such as topic adaptation, will also be given. The tutorial concludes with a summary and a discussion of future research directions.

Biography: Kai Yu is a research professor in the Computer Science and Engineering Department of Shanghai Jiao Tong University, China. He obtained his Bachelor and Master degrees from Tsinghua University, Beijing, China and his Ph.D. from Cambridge University. He has published over 50 peer-reviewed journal and conference publications on speech recognition, synthesis and dialogue systems. He was a key member of the Cambridge team that built state-of-the-art LVCSR systems in the DARPA-funded EARS and GALE projects. He has also managed the design and implementation of a large-scale real-world ASR cloud. He is a senior member of IEEE and a member of ISCA and the IET. He was the area chair for speech recognition and processing for INTERSPEECH 2009 and EUSIPCO 2011, the publication chair for IEEE ASRU 2011 and the area chair of spoken dialogue systems for INTERSPEECH 2014. He was selected for the "1000 Overseas Talent Plan (Young Talent)" by the Chinese central government in 2012. He was also selected for the Programme for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning.



Title: Emotion and Mental State Recognition: Features, Models, System Applications and Beyond
Presenters: Chung-Hsien Wu (National Cheng Kung University, Tainan City), Hsin-Min Wang (Academia Sinica, Taipei), Julien Epps (The University of New South Wales, Australia) and Vidhyasaharan Sethu (The University of New South Wales, Australia)

Abstract: Emotion recognition is the ability to identify what a person is feeling from moment to moment and to understand the connection between feelings and expressions. In today’s world, human-computer interaction (HCI) interfaces undoubtedly play an important role in our daily life. Toward harmonious HCI interfaces, the automated analysis and recognition of human emotion has attracted increasing attention from researchers in multidisciplinary research fields. A specific area of current interest that also has key implications for HCI is the estimation of cognitive load (mental workload), research into which is still at an early stage. Technologies for processing daily activities, including speech, text and music, have expanded the interaction modalities between humans and computer-supported communication artifacts.

In this tutorial, we will present theoretical and practical work offering new and broad views of the latest research in emotional awareness from audio and speech. We cover several topics spanning a variety of theoretical backgrounds and applications, ranging from salient emotional features and emotional-cognitive models, through compensation methods for variability due to speaker and linguistic content, to machine learning approaches applicable to emotion recognition. For each topic, we will review the state of the art by introducing current methods and presenting several applications. In particular, the application to cognitive load estimation will be discussed, from its psychophysiological origins to system design considerations. Eventually, technologies developed in different areas will be combined for future applications, so in addition to a survey of future research challenges, we will envision a few scenarios in which affective computing can make a difference.

Biography: Prof. Chung-Hsien Wu received the Ph.D. degree in electrical engineering from National Cheng Kung University, Tainan, Taiwan, R.O.C., in 1991. Since August 1991, he has been with the Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan. He became professor and distinguished professor in August 1997 and August 2004, respectively. From 1999 to 2002, he served as the Chairman of the Department. Currently, he is the deputy dean of the College of Electrical Engineering and Computer Science, National Cheng Kung University. He also worked at the Computer Science and Artificial Intelligence Laboratory of the Massachusetts Institute of Technology (MIT), Cambridge, MA, in summer 2003 as a visiting scientist. He received the Outstanding Research Award of the National Science Council in 2010 and the Distinguished Electrical Engineering Professor Award of the Chinese Institute of Electrical Engineering, Taiwan, in 2011. He is currently an associate editor of IEEE Transactions on Audio, Speech and Language Processing, IEEE Transactions on Affective Computing, ACM Transactions on Asian Language Information Processing, and the Subject Editor on Information Engineering of the Journal of the Chinese Institute of Engineers (JCIE). His research interests include affective speech recognition, expressive speech synthesis, and spoken language processing. Dr. Wu is a senior member of IEEE and a member of the International Speech Communication Association (ISCA). He was the President of the Association for Computational Linguistics and Chinese Language Processing (ACLCLP) from 2009 to 2011. He was the Chair of the IEEE Tainan Signal Processing Chapter and has been the Vice Chair of the IEEE Tainan Section since 2009.

Biography: Dr. Hsin-Min Wang received the B.S. and Ph.D. degrees in electrical engineering from National Taiwan University in 1989 and 1995, respectively. In October 1995, he joined the Institute of Information Science, Academia Sinica, where he is now a research fellow and deputy director. He was an adjunct associate professor with National Taipei University of Technology and National Chengchi University. He currently serves as the president of the Association for Computational Linguistics and Chinese Language Processing (ACLCLP), a managing editor of Journal of Information Science and Engineering, and an editorial board member of International Journal of Computational Linguistics and Chinese Language Processing. His major research interests include spoken language processing, natural language processing, multimedia information retrieval, and pattern recognition. Dr. Wang received the Chinese Institute of Engineers (CIE) Technical Paper Award in 1995 and the ACM Multimedia Grand Challenge First Prize in 2012. He is a senior member of IEEE, a member of ISCA and ACM, and a life member of Asia Pacific Signal and Information Processing Association (APSIPA), ACLCLP, and Institute of Information & Computing Machinery (IICM).

Biography: Dr Julien Epps received the BE and PhD degrees in Electrical Engineering from the University of New South Wales, Australia, in 1997 and 2001 respectively. After an appointment as a Postdoctoral Fellow at the University of New South Wales, he worked on speech recognition and speech processing research firstly as a Research Engineer at Motorola Labs and then as a Senior Researcher at National ICT Australia. He was appointed as a Senior Lecturer in the UNSW School of Electrical Engineering and Telecommunications in 2007 and then as an Associate Professor in 2013. Dr Epps has also held visiting academic and research appointments at The University of Sydney and the A*STAR Institute for Infocomm Research (Singapore). He has authored or co-authored around 150 publications, which have been collectively cited more than 1500 times. He has served as a reviewer for most major speech processing journals and conferences and as a Guest Editor for the EURASIP Journal on Advances in Signal Processing Special Issue on Emotion and Mental State Recognition from Speech. He has also co-organised or served on the committees of key workshops related to this tutorial, such as the ACM ICMI Workshop on Inferring Cognitive and Emotional States from Multimodal Measures (2011), ASE/IEEE Int. Conf. on Social Computing Workshop on Wide Spectrum Social Signal Processing (2012), 4th International Workshop on Corpora for Research on Emotion, Sentiment and Social Signals (Satellite of LREC 2012), Audio/Visual Emotion Challenge and Workshop AVEC 2011 (part of the Int. Conf. on Affective Computing and Intelligent Interaction), AVEC 2012 (part of ACM ICMI) and AVEC 2013 (part of ACM Multimedia). His research interests include applications of speech modelling to emotion and mental state classification and speaker verification.

Biography: Dr Vidhyasaharan Sethu received his BE degree from Anna University, India, and his MEngSc (Signal Processing) degree from the University of New South Wales, Australia. He was awarded his PhD in 2010 for his work on Automatic Emotion Recognition, by the University of New South Wales (UNSW). Following this, he worked as a Postdoctoral Research Fellow at the speech research group at UNSW on the joint modelling of linguistic and paralinguistic information in speech with a focus on emotion recognition. He is currently a Lecturer in Signal Processing at the School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, Australia. He teaches courses on speech processing, signal processing and electrical system design in the school and is a reviewer for a number of journals including Speech Communication and EURASIP Journal on Audio, Speech and Music Processing and IEEE Transactions on Education. His research interests include emotion recognition, speaker recognition, language identification and the application of machine learning in speech processing.



Title: Unsupervised Speech and Language Processing via Topic Models
Presenters: Jen-Tzung Chien (National Chiao Tung University, Hsinchu)

Abstract: In this tutorial, we will present state-of-the-art machine learning approaches for speech and language processing, with a focus on unsupervised methods for structural learning from unlabeled sequential patterns. In general, speech and language processing involves extensive knowledge of statistical models. A flexible, scalable and robust system is required to meet heterogeneous and nonstationary environments in the era of big data. This tutorial starts with an introduction to unsupervised speech and language processing based on factor analysis and independent component analysis. Unsupervised learning is then generalized to a latent variable model known as the topic model. The evolution of topic models from latent semantic analysis to the hierarchical Dirichlet process, from non-Bayesian parametric models to Bayesian nonparametric models, and from single-layer models to hierarchical tree models will be surveyed in an organized fashion. Inference approaches based on variational Bayes and Gibbs sampling are introduced. We will also present several case studies on topic modeling for speech and language applications, including language models, document models, retrieval models, segmentation models and summarization models. Finally, we will point out new trends in topic models for speech and language processing.

Biography: Jen-Tzung Chien received his Ph.D. degree in electrical engineering from National Tsing Hua University, Hsinchu, in 1997. From 1997 to 2012, he was with National Cheng Kung University, Tainan. Since 2012, he has been with the Department of Electrical and Computer Engineering, National Chiao Tung University (NCTU), Hsinchu, where he is currently a Distinguished Professor. He serves as an adjunct professor in the Department of Computer Science, NCTU. He has held visiting researcher positions at Panasonic Technologies Inc., Santa Barbara, CA; the Tokyo Institute of Technology, Tokyo, Japan; the Georgia Institute of Technology, Atlanta, GA; Microsoft Research Asia, Beijing, China; and the IBM T. J. Watson Research Center, Yorktown Heights, NY. His research interests include machine learning, speech recognition, information retrieval and blind source separation. He served as an associate editor of the IEEE Signal Processing Letters from 2008 to 2011, a guest editor of the IEEE Transactions on Audio, Speech and Language Processing in 2012, an organization committee member of ICASSP 2009, and an area coordinator of Interspeech 2012. He was appointed an APSIPA Distinguished Lecturer for 2012-2013. He received the Distinguished Research Award from the National Science Council in 2006 and 2010, and was a co-recipient of the Best Paper Award of the IEEE Automatic Speech Recognition and Understanding Workshop in 2011. Dr. Chien has served as a tutorial speaker for ICASSP 2012 in Kyoto, Interspeech 2013 in Lyon, and APSIPA 2013 in Kaohsiung.




Title: Deep Learning for Speech Generation and Synthesis
Presenters: Yao Qian and Frank K. Soong (Microsoft Research Asia, Beijing)

Abstract: Deep learning, which can represent high-level abstractions in data with an architecture of multiple non-linear transformations, has made a huge impact on automatic speech recognition (ASR) research, products and services. However, deep learning for speech generation and synthesis (i.e., text-to-speech), the inverse process of speech recognition (i.e., speech-to-text), has not yet generated momentum similar to that seen in ASR. Recently, motivated by the success of Deep Neural Networks in speech recognition, several neural-network-based research attempts have succeeded in improving the performance of statistical parametric speech generation/synthesis. In this tutorial, we focus on deep learning approaches to problems in speech generation and synthesis, especially Text-to-Speech (TTS) synthesis and voice conversion.

First, we review the current mainstream of statistical parametric speech generation and synthesis, i.e., GMM-HMM based speech synthesis and GMM-based voice conversion, with emphasis on analyzing the major factors responsible for the quality problems in GMM-based voice synthesis/conversion and the intrinsic limitations of decision-tree based contextual state clustering and state-based statistical distribution modeling. We then present the latest deep learning algorithms for feature parameter trajectory generation, in contrast to deep learning for recognition or classification. We cover common technologies in Deep Neural Networks (DNN) and improved DNNs: Mixture Density Networks (MDN), Recurrent Neural Networks (RNN) with Bidirectional Long Short-Term Memory (BLSTM), and Conditional RBMs (CRBM). Finally, we share our research insights and hands-on experience in building speech generation and synthesis systems based upon deep learning algorithms.

Biography: Yao Qian is a Lead Researcher in the Speech Group, Microsoft Research Asia. She received her Ph.D. from the Department of Electronic Engineering, The Chinese University of Hong Kong, in 2005, and joined Microsoft Research Asia in September 2005, right after receiving her Ph.D. Her research interests are in spoken language processing, including TTS speech synthesis and automatic speech recognition. Her recent research projects include speech synthesis, voice transformation, prosody modeling and Computer-Assisted Language Learning (CALL). She has over 50 publications in international journals and conference proceedings, and ten U.S. patent applications, five of which have been issued. She has been recognized within Microsoft and in the speech research community for her contributions to TTS and many other speech technologies. She is a senior member of IEEE and a member of ISCA.


Biography: Frank K. Soong is a Principal Researcher in the Speech Group, Microsoft Research Asia (MSRA), Beijing, China, where he works on fundamental research on speech and its practical applications. His professional research career spans over 30 years, first with Bell Labs, US, then with ATR, Japan, before joining MSRA in 2004. At Bell Labs, he worked on stochastic modeling of speech signals, optimal decoder algorithms, speech analysis and coding, and speech and speaker recognition. He was responsible for developing the recognition algorithm that went into voice-activated mobile phone products rated by Mobile Office Magazine (Apr. 1993) as "outstandingly the best". He is a co-recipient of the Bell Labs President Gold Award for developing the Bell Labs Automatic Speech Recognition (BLASR) software package.

He has served as a member of the Speech and Language Technical Committee of the IEEE Signal Processing Society and in other society functions, including as an Associate Editor of the IEEE Transactions on Speech and Audio Processing and as chair of IEEE international workshops. He has published extensively, with more than 200 papers, and co-edited the widely used reference book Automatic Speech and Speaker Recognition: Advanced Topics (Kluwer, 1996). He is a visiting professor of The Chinese University of Hong Kong (CUHK) and a few other top-rated universities in China, and the co-Director of the MSRA-CUHK Joint Research Lab. He received his BS, MS and PhD degrees from National Taiwan University, the University of Rhode Island, and Stanford University, respectively, all in Electrical Engineering. He is an IEEE Fellow.




Copyright © 2013-2014 Chinese and Oriental Languages Information Processing Society
Conference managed by Meeting Matters International