Xavier Anguera, Ph.D. - List of publications
2011
Abstract - Content-based video copy detection (CBCD) algorithms focus on detecting video segments that are identical or transformed versions of segments in a known video. In recent years some systems have proposed the combination of orthogonal modalities (e.g. derived from audio and video) to improve detection performance, although not always achieving consistent results. In this paper we propose a fusion algorithm that is able to combine as many modalities as available at the decision level. The algorithm is based on the weighted sum of the normalized scores, which are modified depending on how well they rank in each modality. This leads to a virtually parameter-free fusion algorithm. We performed several tests using the 2010 TRECVID VCD datasets and obtained up to 46% relative improvement in min-NDCR while also improving the F1 metric on the fused results in comparison to just using the best single modality.
- "Multimodal fusion for video copy detection",
Xavier Anguera, Juan Manuel Barrios, Tomasz Adamek and Nuria Oliver, in Proc. ACM Multimedia 2011. pdf
Abstract - Achieving accurate speaker modeling is a crucial step in any speaker-related algorithm. Many statistical speaker modeling techniques that deviate from the classical GMM/UBM approach and can accurately discriminate between speakers have been proposed for some time now. However, many of them require the evaluation of high-dimensional feature vectors and represent a speaker with a single vector, and therefore do not use any temporal information. In addition, they place most emphasis on modeling the most recurrent acoustic events, instead of less frequent speaker-discriminant information. In this paper we explain the main benefits of our recently proposed binary speaker modeling technique and show its benefits in two particular applications, namely speaker recognition and speaker diarization. Both applications achieve near state-of-the-art results while benefiting from performing most processing in the binary space.
- "Speaker modeling using local binary decisions",
Jean-Francois Bonastre, Xavier Anguera, Gabriel H. Sierra and Pierre-Michel Bousquet, in Proc. Interspeech 2011. pdf
Abstract - This notebook paper summarizes the algorithms behind Telefonica Research participation in the NIST-TRECVID 2011 evaluation on the Video Copy Detection task. This year we have 1) improved the image-based matching system to better process video files; 2) implemented and tested a novel audio local fingerprint; and 3) improved the multimodality fusion algorithm from last year. For this year we have submitted 4 runs in total, whose main characteristics are described below: a) TID.m.[BALANCED/NOFA].multimodal: These correspond to our main submissions, both for the no false alarm and balanced profiles. They are based on the fusion between the local audio and local video monomodal systems. b) TID.m.BALANCED.mask: This submission is based on the monomodal audio-based system, which this year uses a novel audio fingerprint called MASK. c) TID.m.BALANCED.joint: This submission is the fusion (at decision level) of our two monomodal system outputs with the output from the PRISMA group video-only system. This submission resulted in our best results for the evaluation. Overall, we are very pleased with the results for this year’s evaluation. On the one hand, our video-based system is reaching maturity, using local image descriptors (DART) developed by Telefonica. On the other hand, we have developed and applied to the evaluation novel audio local features called MASK. Even though we did not spend much time tuning the new feature to the TRECVID copy detection datasets, we are very pleased with its results. In addition, we have improved the fusion algorithm from last year and have shown that it works well to fuse results from multiple outputs, improving on the results obtained by either one of our systems and those from the PRISMA submission.
- "Telefonica Research at TRECVID 2011 Content-Based Copy Detection",
Xavier Anguera, Tomasz Adamek, Daru Xu and Juan Manuel Barrios, NIST-TRECVID workshop 2011. pdf
Abstract - Most current Video Copy Detection (VCD) systems perform a multimodal detection by dividing the system into subsystems. Each subsystem performs a copy detection using a different feature (either visual or audio), and the sets of candidates are combined (fused) to create the final result. We present a VCD system that fuses visual and audio descriptors at the similarity search level. The system produces the copy candidates by comparing video segments using visual and audio descriptors instead of fusing copy candidates from independent subsystems. We submitted four Runs to the TRECVID 2011 CCD task: a) PRISMA.m.balanced.EhdGry: a combination of two visual global descriptors. Two detection candidates per query. b) PRISMA.m.balanced.EhdRgbAud: a combination of two visual global descriptors and one audio descriptor. Two detection candidates per query. c) PRISMA.m.nofa.EhdGry: a combination of two visual global descriptors. One detection candidate per query. d) PRISMA.m.nofa.EhdRgbAud: a combination of two visual global descriptors and one audio descriptor. One detection candidate per query. Our Runs achieve good detection effectiveness, especially for the NoFA profile, and they are among the fastest Runs. To the best of our knowledge, this is the first VCD system that successfully fuses audio and visual descriptors at an earlier stage than decision level. Additionally, we have performed a joint submission with the Telefonica Research team, under the name Telefonica-research.m.balanced.joint, which tests the combination at the decision level of Telefonica’s local descriptor, audio descriptor, and PRISMA’s EhdRgb global descriptors.
- "Combining Features at Search Time: PRISMA at Video Copy Detection Task",
Juan Manuel Barrios, Benjamin Bustos and Xavier Anguera, NIST-TRECVID workshop 2011. pdf
Abstract - This working paper describes the system proposed by Telefonica Research for the task of spoken voice search within the MediaEval benchmarking evaluation campaign in 2011. The proposed system is based exclusively on a pattern matching approach, which is able to perform a query-by-example search with no prior knowledge of the acoustics or language being spoken. The system’s main contribution is the use of a novel method to obtain speaker-independent acoustic features, on which the matching is then performed with a DTW-like matching algorithm. The results obtained are promising and show, in our opinion, the potential of this class of techniques for this task.
- "Telefonica System for the Spoken Web Search Task at Mediaeval 2011",
Xavier Anguera, MediaEval Workshop, November 2011, Pisa, Italy. pdf
Abstract - Speaker diarization is the task of determining "who spoke when?" in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become an important key technology for many tasks, such as navigation, retrieval, or higher-level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, from broadcast news, to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper we review the current state-of-the-art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research.
- "Speaker Diarization: a review of recent research",
Xavier Anguera, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals, accepted for publication in Transactions on Audio, Speech and Language Processing (TASLP), special issue on New Frontiers in Rich Transcription. pdf
Abstract - The speaker diarization system developed at the International Computer Science Institute (ICSI) has played a prominent role in the speaker diarization community, and many researchers in the Rich Transcription community have adopted methods and techniques developed for the ICSI speaker diarization engine. Although there have been many related publications over the years, previous articles only presented changes and improvements rather than a description of the full system. Attempting to replicate the ICSI speaker diarization system as a complete entity would require an extensive literature review, and might ultimately fail due to component description version mismatches. This article therefore presents the first full conceptual description of the ICSI speaker diarization system as presented to the National Institute of Standards and Technology Rich Transcription 2009 (NIST RT-09) evaluation, which consists of online and offline subsystems, multi-stream and single-stream implementations, and audio and audio-visual approaches. Some of the components, such as the online system, have not been previously described. The article also includes all necessary preprocessing steps, such as Wiener filtering, speech activity detection and beamforming.
- "The ICSI RT-09 Speaker Diarization System",
Gerald Friedland, Adam Janin, David Imseng, Xavier Anguera, Luke Gottlieb, Marijn Huijbregts, Mary Tai Knox and Oriol Vinyals, accepted for publication in Transactions on Audio, Speech and Language Processing (TASLP), special issue on New Frontiers in Rich Transcription, July 2011. pdf
Abstract - With the constant improvements in the technical capabilities and bandwidth available to mobile phones, mobile audio and video streaming services are booming. This allows music enthusiasts to watch their favourite music videos instead of only listening to the audio tracks, thus augmenting their listening experience to be multimodal. Usually the highly compressed audio tracks in these videos result in poorer quality music, compared to music the user might already have locally on their phone or playing through another sound source. In this paper we present MuViSync Mobile, a mobile phone application that synchronises real-time high quality music, either stored locally or input through the microphone, with the corresponding streaming music video. We extend previous work on music to music video synchronisation by proposing an alternative algorithm for higher efficiency with similar alignment accuracy, which we tested with music video examples including simulated noise. This algorithm correctly aligns 90% of the audio frames to within 100 ms of the known alignment. We also describe its implementation on an iPhone.
- "Real-Time synchronization of multimedia streams in a mobile device",
Robert Macrae, Joachim Neumann, Xavier Anguera, Nuria Oliver and Simon Dixon, to appear in Proc. ADMIRE Workshop within ICME 2011, Barcelona, Spain. pdf
Abstract - The e-book industry is starting to flourish due, in part, to the availability of affordable and user-friendly e-book readers. As users are increasingly moving from traditional paper books to e-books, there is an opportunity to reinvent and enhance their reading experience, for example, by leveraging the multimedia capabilities of these devices in order to turn the act of reading into a real multimedia experience. In this paper, we focus on the augmentation of the written text with its associated audiobook, so that users can listen to the book they are (currently) reading. We propose an audiobook-to-ebook alignment system by applying a Text-to-Speech (TTS)-based text to audio alignment algorithm, and enhance it with a silence filtering algorithm to cope with the difference in reading style between the TTS output and the speakers in the ebook environment. Experiments done using 12 five-minute excerpts of 6 different audiobooks (read by men and women) yield usable word alignment errors below 120 ms for 90% of the words. Finally, we also show a user interface implementation on the iPad for synchronized e-book reading while listening to the associated audiobook.
- "Automatic Synchronization of Electronic and Audio Books via TTS Alignment and Silence Filtering",
Xavier Anguera, Néstor Pérez, Andreu Urruela and Nuria Oliver, in Proc. Hot Topics in Multimedia within ICME 2011, Barcelona, Spain. pdf
Abstract - The automatic summarization of speech recordings is typically carried out as a two-step process: the speech is first decoded using an automatic speech recognition system and the resulting text transcripts are processed to create a summary. However, this approach might not be suitable in adverse acoustic conditions or when applied to languages with limited training resources. In order to address these limitations, in this paper we propose an automatic speech summarization method that is based on the automatic discovery of recurrent patterns in the speech: recurrent acoustic patterns are first extracted from the audio and then are clustered and ranked according to the number of repetitions, creating an approximate acoustic summary of what was spoken. This approach allows us to build what we call a "Spoken WordCloud", named for its similarity to text-based word clouds. We present an algorithm that achieves a cluster purity of up to 90% and an inverse purity of 71% in preliminary experiments using a small dataset of connected spoken words.
- "Spoken Wordcloud: clustering recurrent patterns in speech",
Remy Flamary, Xavier Anguera and Nuria Oliver, in Proc. CBMI 2011, Madrid, Spain. pdf
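For readers unfamiliar with the two figures quoted in the abstract above, the following is a minimal sketch of cluster purity and inverse purity as they are commonly defined in the clustering literature. It assumes those standard definitions (the paper itself may compute them differently), and the function names and the toy labels are hypothetical.

```python
from collections import Counter


def purity(assigned, reference):
    """Fraction of items that belong to the majority reference class
    of the cluster they were assigned to (standard definition, assumed)."""
    groups = {}
    for cluster_id, ref_label in zip(assigned, reference):
        groups.setdefault(cluster_id, Counter())[ref_label] += 1
    return sum(max(counts.values()) for counts in groups.values()) / len(assigned)


def inverse_purity(assigned, reference):
    """Same measure with the roles of clusters and reference classes swapped."""
    return purity(reference, assigned)


# Hypothetical toy data: four items, each placed in its own cluster,
# while the reference says there are only two distinct words.
assigned = [0, 1, 2, 3]
reference = ["a", "a", "b", "b"]
print(purity(assigned, reference), inverse_purity(assigned, reference))
```

Over-splitting keeps purity high but lowers inverse purity, which is why the two numbers are usually reported together.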
Abstract - Splitting a speech signal into speakers is the main goal of a speaker diarization system, which has become an important building block in many speech processing algorithms. Current state of the art systems are able to obtain good diarization error rates, but most of them are rather slow, which is a strong handicap in applications that require overall faster than real-time processing. In this paper we present a novel speaker diarization system which is built following a bottom-up agglomerative clustering approach and based on speaker binary keys, recently proposed for speaker modeling. After initialization, processing is entirely done over binary vectors and using exclusively binary metrics, which makes the system very fast. On tests performed using all conference meetings datasets released for the NIST RT evaluation campaigns we achieve diarization error rates just slightly worse than a classic acoustic-based system while running over 10 times faster.
- "Fast Speaker Diarization Based on Binary Keys",
Xavier Anguera and Jean-Francois Bonastre, in Proc. ICASSP 2011, Prague, Czech Republic. pdf
Abstract - In the supervector UBM/GMM paradigm, each acoustic file is represented by the mean parameters of a GMM model. This supervector space is used as a data representation space, which has a high dimensionality. Moreover, this space is not intrinsically discriminant and a complete speech segment is represented by only one vector, largely removing the possibility of taking temporal or sequential information into account. This work proposes a new approach where each acoustic frame is represented in a discriminant binary space. The proposed approach relies on a UBM to structure the acoustic space in regions. Each region is then populated with a set of Gaussian models, denoted as "specificities", able to emphasize speaker-specific information. Each acoustic frame is mapped into the discriminant binary space, turning “on” or “off” all the specificities to create a large binary vector. All the following steps (speaker reference extraction, likelihood estimation and decision) take place in this binary space. Even though this work is a first step in this avenue, the experiments based on the NIST SRE 2008 framework demonstrate the potential of the proposed approach. Moreover, this approach opens the opportunity to rethink all the classical processes using a discrete, binary view.
- "Discriminant Binary Data Representation for Speaker Recognition",
Jean-Francois Bonastre, Xavier Anguera Miro, Pierre-Michel Bousquet, Driss Matrouf, in Proc. ICASSP 2011, Prague, Czech Republic. pdf
Abstract - In this paper, the use of closed-form expressions is compared to the BIC approximation, with respect to speaker clustering. We first show that the particular BIC setting which is commonly used in this task, namely the approximation of the marginal - with respect to the model parameters - and conditional - with respect to the latent variables - likelihood, belongs to an exponential family, and hence admits a closed-form expression by attaching conjugate priors. We then formalize the role of the tuning parameter as a hyperparameter of the prior and finally we explain the several proposed settings - global, local and segmental - based on the strength of the prior. Experiments are carried out for the speaker clustering task and improvement over the BIC approximation is reported.
- "Closed-Form Expressions vs. BIC: a Comparison for Speaker Clustering",
Themos Stafylakis, Xavier Anguera, Vassilis Katsouros, George Carayannis, in Proc. ICASSP 2011, Prague, Czech Republic. pdf
2010
Abstract - This notebook paper presents the participation of Telefonica Research in the task of Video Copy Detection in TRECVID 2010. This is our second participation and, for this year, we have developed two local-based monomodal systems that we then combine using a score-based fusion to obtain a multimodal system output. We submitted 4 runs in total, whose main characteristics are described below:
- TID.m.[BALANCED/NOFA].fusion: These correspond to our main submission, both for the no false alarm and balanced profiles. They are based on the fusion between the local audio and local video monomodal systems.
- TID.m.BALANCED.videoonly: This submission is based on the monomodal video-based system using DART local features and with a temporal consistency postprocessing.
- TID.m.BALANCED.audioonly: This submission is based on the monomodal audio-based system using frequency-based audio local features.
Of the four systems submitted, two process only monomodal information (audio or video), and the fusion system takes the output of those two to produce a fused result. Results for the monomodal systems in terms of NDCR are far from optimal, mainly due to an excess of false alarms that our monomodal systems still output. Results for F1 scores are very good in all cases. When combining the monomodal systems in the fusion, the NDCR scores improve considerably, as most false alarms are eliminated. The proposed fusion turned out to work very well for combining our two monomodal systems. We will investigate it further to improve it for future evaluations.
- "Telefonica Research at TRECVID 2010 Content-Based Copy Detection",
Ehsan Younessian, Xavier Anguera, Tomasz Adamek, Nuria Oliver and David Marimon, NIST Trecvid Workshop notebook paper. pdf
Abstract - The approach presented in this paper represents voice recordings by a novel acoustic key composed only of binary values. Except for the process being used to extract such keys, there is no need for acoustic modeling and processing in the approach proposed, as all the other elements in the system are based on the binary vectors. We show that this binary key is able to effectively model a speaker's voice and to distinguish it from other speakers. Its main properties are its small size compared to current speaker modeling techniques and its low computational cost when comparing different speakers as it is limited to obtaining a similarity metric between two binary vectors. Furthermore, the binary key vector extraction process does not need any hard threshold and offers the opportunity to set the decision steps in a well defined binary domain where scores and decisions are easy to interpret and implement.
- "Novel binary key representation for biometric speaker recognition",
Xavier Anguera and Jean-François Bonastre, in Proc. Interspeech 2010, Makuhari, Japan. pdf
- "System output combination for improved speaker diarization",
Simon Bozonnet, Nicholas Evans, Xavier Anguera, Oriol Vinyals, Gerald Friedland and Corinne Fredouille, in Proc. Interspeech 2010, Makuhari, Japan.
Abstract - This paper discusses a set of modifications regarding the use of the Bayesian Information Criterion (BIC) for the speaker diarization task. We focus on the specific variant of the BIC that deploys models of equal - or roughly equal - statistical complexity under partitions with different numbers of speakers and we examine three modifications. The first is a way to deal with the permutation-invariance property of the estimators when dealing with mixture models, while the second is derived by attaching a weakly informative prior over the space of speaker-level state sequences. Finally, based on the recently proposed segmental-BIC approach, we examine its effectiveness when mixtures of Gaussians are used to model the emission probabilities of a speaker. The experiments are carried out using the NIST Rich Transcription evaluation campaign for meeting data and show improvement over the baseline setting.
- "Improvements to the equal-parameter BIC for Speaker Diarization",
Themos Stafylakis, Xavier Anguera, in Proc. Interspeech 2010, Makuhari, Japan. pdf
Abstract - In recent years, the popularity of compressible music files and online music downloads has increased dramatically. Today’s users own large digital collections of high quality music on their computers and portable devices to be played in their homes or on the go. In addition, music videos are being offered online both for free and through monthly subscriptions, opening up the opportunity to turn the music listening activity into a multimedia experience. The work presented in this paper addresses the challenge of music audio and music video synchronisation. In particular, we have developed a prototype, named MuViSync, to automatically synchronise music videos to the songs that users are listening to in real-time. At the core of the MuViSync prototype are novel audio synchronization algorithms to tackle the differences in tempo, pitch, sampling rates, structure, and introductions and endings that are common in the various digital recordings of the same song in modern music. The music and the music video are initially aligned and then kept in sync within the limits of human perception. In our experiments with 320 matching pairs of audio files and associated music videos, the proposed algorithms are able to successfully synchronise music and video to within 100 milliseconds of each other in over 90% of the cases.
- "MuViSync: Realtime Music Video Alignment",
Robert Macrae, Xavier Anguera and Nuria Oliver, in Proc. ICME 2010 pdf
Abstract - Mood annotation of music is challenging as it concerns not only audio content but also extra-musical information. It is a representative research topic about how to traverse the well known semantic gap. In this paper, we propose a new music-mood-specific ontology. Novel ontology-based semantic reasoning methods are applied to effectively bridge content-based information with web-based resources. Also, the system can automatically discover closely relevant semantics for music mood and thus a novel weighting method is proposed for mood propagation. Experiments show that the proposed method outperforms purely content based methods and significantly enhances the mood prediction accuracy. Furthermore, evaluations show the system's accuracy could be promisingly increased with the enrichment of metadata.
- "Enriching Music Mood Annotation by Semantic Association Reasoning",
Jun Wang, Xavier Anguera, Xiaoou Chen and Deshun Yang, in Proc. AdMiRe Workshop, in ICME 2010 pdf
Abstract - Before the advent of Hidden Markov Model (HMM)-based speech recognition, many speech applications were built using pattern matching algorithms like the Dynamic Time Warping (DTW) algorithm, which are generally robust to noise and easy to implement. The standard DTW algorithm usually suffers from a lack of flexibility in its start-end matching points and has high computational costs. Although some DTW-based algorithms have been proposed over the years to solve either one of these problems, none is able to discover multiple alignment paths with low computational costs. In this paper, we present an “unbounded” version of DTW (UDTW for short) that is computationally lightweight and allows for total flexibility on where the matching segment occurs. Results on a word matching database show very competitive performance in both accuracy and processing time compared to existing alternatives.
- "Partial Sequence Matching Using an Unbounded Dynamic Time Warping Algorithm",
Xavier Anguera, Robert Macrae and Nuria Oliver, in Proc. ICASSP 2010 pdf
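As background for the abstract above, here is a minimal sketch of the classic DTW recurrence that the paper takes as its starting point. This is the textbook algorithm with fixed start and end points, not the unbounded variant proposed in the paper; the function name, the scalar inputs and the absolute-difference local cost are assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Classic DTW cost between two sequences of numbers.

    Both sequences must be matched in full (fixed start and end points),
    which is the limitation the abstract attributes to standard DTW.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the best accumulated cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # step in a only
                cost[i][j - 1],      # step in b only
                cost[i - 1][j - 1],  # step in both
            )
    return cost[n][m]


# A time-stretched copy of the same pattern aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Filling the full n-by-m table is what makes the standard algorithm costly for long sequences.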
Abstract - Mobile phones have become truly multimedia devices. It is common to observe users capturing and consuming photos and videos on their mobile phones on a regular basis. As the amount of digital multimedia content expands, it becomes increasingly difficult to find specific images on the device. Therefore, novel mobile multimedia search applications and algorithms are needed. In previous work, we have presented MAMI (Multimodal Automatic Mobile Indexing), a lightweight mobile phone application that allows users to annotate, index and/or search for digital photos on their phones via a combination of speech, text or image input. When using speech annotations, MAMI uses a Dynamic Time Warping (DTW)-based pattern matching algorithm in order to find pictures that are annotated with a matching acoustic query. Such an approach is language and vocabulary independent, and lightweight enough to be run in real time on the mobile phone. The main drawback of using DTW is that only full acoustic sequences can be matched, which constrains the type of acoustic annotations that can be used for tagging and search. In this paper we expand the acoustic search and retrieval capabilities of MAMI by enabling unrestricted acoustic sentence matching. We propose replacing DTW with the recently proposed Unbounded Dynamic Time Warping (U-DTW). Given a spoken query, U-DTW finds all database acoustic annotations with matching acoustic segments, regardless of their start-end points and length. In addition, we propose a number of speed-ups to U-DTW that make it suitable for mobile applications like MAMI.
- "Unrestricted Voice Annotations and Search of Personal Photographs in a Mobile Phone",
Xavier Anguera, Mauro Cherubini and Nuria Oliver, in Proc. of Spoken Query 2010 Workshop on voice search, in ICASSP 2010 pdf
2009
Abstract - This notebook paper describes the systems presented by Telefonica Research within the MESH team for the task of Video Copy Detection in TRECVID 2009. We participated in the Video-only, Audio-only and Audio+Video tasks. Our main contribution is the combination (when possible) of audio and video features within the same system by using global features extracted both from the reference videos and the queries. We also experimented with SIFT-based search methods and are aiming at building a hybrid search system. This is our first participation year and results are far from optimal, but some of them indicate the potential of the presented systems.
- "Telefonica Research Content-Based Copy Detection TRECVID Submission",
Xavier Anguera, Pere Obrador, Tomasz Adamek, David Marimon and Nuria Oliver, NIST Trecvid 2009 Workshop notebook paper pdf
Abstract - The accuracy levels achieved by state-of-the-art Speaker Verification systems are high enough for the technology to be used in real-life applications. Unfortunately, the transfer from the lab to the field is not as straightforward as it could be: the best performing systems can be computationally expensive to run and need large speaker model footprints. In this paper, we compare two speaker verification algorithms (GMM-SVM Supervectors and Kharroubi's GMM-SVM vectors) and propose an improvement of Kharroubi's system that: (a) achieves up to 17% relative performance improvement when compared to the Supervectors algorithm; (b) is 24% faster in run time and (c) makes use of speaker models that are 94% smaller than those needed by the Supervectors algorithm.
- "MiniVectors: an Improved GMM-SVM Approach for Speaker Verification",
Xavier Anguera, in Proc. Interspeech 2009 pdf
Abstract - Reliable content-based copy detection algorithms (CBCD) are at the core of effective multimedia data management and copyright enforcement systems. CBCD techniques focus on detecting videos that are identical to or transformed versions of an original video. The fast growth of online video sharing services challenges state-of-the-art copy detection algorithms as they need to be: able to deal with vast amounts of data, computationally efficient and robust to a wide range of image and audio transformations. In this paper, we present two related multimodal CBCD algorithms that effectively fuse audio and video information by means of a compact multimodal signature based on audio and video global descriptors. We validate our algorithms with a benchmark database (MUSCLE-VCD) and obtain over a 14% relative improvement with respect to state-of-the-art systems. In addition, we illustrate the performance of our approach in a video view-count re-ranking task with YouTube data.
- "Multimodal video copy detection of social media",
Xavier Anguera, Pere Obrador and Nuria Oliver, in Proc. first SIGMM Workshop on Social Media (WSM2009) at ACM MM09 pdf
Abstract - In recent years, there has been a proliferation of consumer digital photographs taken and stored in both personal and online repositories. As the amount of user-generated digital photos increases, there is a growing need for efficient ways to search for relevant images to be shared with friends and family. Text-query based search approaches rely heavily on the similarity between the input textual query and the tags added by users to the digital content. Unfortunately, text-query based search results might include a large number of relevant photos, all of them containing very similar tags, but with varying levels of image quality and aesthetic appeal. In this paper we introduce an image re-ranking algorithm that takes into account the aesthetic appeal of the images retrieved by a consumer image sharing site search engine (Google's Picasa Web Album). In order to do so, we extend a state-of-the-art image aesthetic appeal algorithm by incorporating a set of features aimed at consumer photographs. The results of a controlled user study with 37 participants reveal that image aesthetics play a varying role on the selected images depending on the query type and on the user preferences.
- "The role of tags and image aesthetics in social image search",
Pere Obrador, Xavier Anguera, Rodrigo de Oliveira and Nuria Oliver, in Proc. first SIGMM Workshop on Social Media (WSM2009) at ACM MM09 pdf
Abstract - Speech and typed text are two common input modalities for mobile phones. However, little research has compared them in their ability to support annotation and retrieval of digital pictures on mobile devices. In this paper, we report the results of a month-long field study in which participants took pictures with their camera phones and had the choice of adding annotations using speech, typed text, or both. Subsequently, the same subjects participated in a controlled experiment where they were asked to retrieve images based on annotations as well as retrieve annotations based on images in order to study the ability of each modality to effectively support users' recall of the previously captured pictures. Results demonstrate that each modality has advantages and shortcomings for the production of tags and retrieval of pictures. Several guidelines are suggested for designing tagging applications for portable devices.
- "Text versus Speech: A Comparison of Tagging Input Modalities for Camera Phones",
M. Cherubini, X. Anguera, N. Oliver and R. de Oliveira, in Proc. MobileHCI, Bonn, Germany, September 2009, (best paper award nominee) pdf
Abstract - This paper presents a simple, near real-time system for detecting highlighted events in soccer game broadcasts and generating their video summaries. The proposed detection algorithm is based on two acoustic features of the audio track: the block energy and the acoustic repetition index. To the authors' knowledge, the acoustic repetition index has not been used previously in similar applications. This index represents the correlation between a narrow acoustic section and the seconds just before and after it, in order to detect sections of audio where repetitions occur. The system has been validated on a corpus of UEFA EURO competition games, achieving good scores in goal recall.
- "Audio-Based Soccer Game Summarization",
Helenca Duxans, Xavier Anguera and David Conejero, in Proc. IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB09) pdf
Abstract - Although TV commercial identification and clustering are suitable applications for automatic multimedia indexing technology, they remain unsolved problems. Most current systems either require a large computational load and therefore cannot be executed online, or just perform detection, without clustering or identification. In this paper two advertisement indexing approaches are presented: an off-line detection and clustering system and an on-line identification system, both based only on audio features for computational reasons. For the off-line clustering two metrics are evaluated, and an initial commercial boundary detection algorithm, based on identifying energy drop points which are also acoustic change boundaries, is presented. For the on-line system we analyze the trade-off between response time and identification scores. Experiments performed on real data validate both off-line and on-line implementations and show that audio-only features are discriminant enough to detect and classify TV commercials.
- "Audio-Based Automatic Management of TV Commercials",
H. Duxans, D. Conejero and X. Anguera, in Proc. ICASSP 2009, Taipei, Taiwan. April 2009 pdf
2008
Abstract - Detection and clustering of commercial advertisements plays an important role in multimedia indexing as well as in the creation of personalized user content. It aims at detecting individual commercials within a broadcast and grouping together all repetitions of the same commercial over time. Several algorithms are found in the literature to tackle the detection task using either video and audio or only video cues, but none has been found for clustering. In this paper we present an acoustic-only system to perform both the detection and clustering of commercials. On the one hand, detection is done in three steps, incrementally refining an initial coarse energy detection. On the other hand, clustering is later done over all previously detected commercials to find out how many times each commercial appears. Our detection system achieves 82% precision and recall using only acoustic information. For the clustering step, three algorithms are compared, obtaining best results using a modified Dynamic Time Warping approach, which achieves 100% recall and 99% precision.
- "TV advertisements detection and clustering based on acoustic information",
D. Conejero and X. Anguera, in Proc. International Conference on Computational Intelligence for Modelling, Control and Automation - CIMCA08, Vienna, Austria, December 2008 pdf
Abstract - We present MAMI (i.e. Multimodal Automatic Mobile Indexing), a mobile-phone prototype that allows users to annotate and search for digital photos on their camera phone via speech input. MAMI is implemented as a mobile application that runs in real-time on the phone. Users can add speech annotations at the time of capturing photos or at a later time. Additional metadata is also stored with the photos, such as location, user identification, date and time of capture and image-based features. Users can search for photos in their personal repository by means of speech. MAMI does not need connectivity to a server. Hence, instead of full-fledged speech recognition, we propose using a Dynamic Time Warping-based metric to determine the distance between the speech input and all other existing speech annotations. We present our preliminary results with the MAMI prototype and outline our future directions of research, including the integration of additional metadata in the search.
- "MAMI: Multimodal Annotations on a Camera Phone",
X. Anguera and N. Oliver, in Proc. MobileHCI, Amsterdam, September 2008 pdf
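Illustrative sketch (not taken from the paper): the abstract above describes matching a spoken query against stored speech annotations with a Dynamic Time Warping-based distance. The textbook form of that idea could look roughly as follows. The function names, the Euclidean frame distance, and the toy one-dimensional "features" are assumptions made for illustration only; the paper's actual features and DTW variant are not stated in the abstract.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two sequences of feature vectors.

    Assumption: frames are compared with Euclidean distance and the
    standard (match, insert, delete) step pattern is used.
    """
    def frame_dist(x, y):
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def nearest_annotation(query, annotations):
    """Return the key of the stored annotation closest to the query."""
    return min(annotations, key=lambda name: dtw_distance(query, annotations[name]))


# Hypothetical toy data: each frame is a 1-dimensional feature vector.
stored = {
    "beach": [[0.0], [1.0], [2.0]],
    "dog": [[5.0], [5.0], [5.0]],
}
query = [[0.0], [1.0], [1.0], [2.0]]  # "beach" spoken a little more slowly
best = nearest_annotation(query, stored)
```

Because DTW lets one sequence stretch against the other, the slower query still aligns with "beach" at zero cost, which is the property that makes it usable without a full speech recognizer.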
Abstract - The large amount of multimedia material generated today makes it difficult for users to easily find what they are looking for. Automatic audio/video indexing is an area attracting great interest, since it makes it possible to explore the content of multimedia documents and extract relevant information. Most existing systems can analyze a limited number of features, but they are not designed to interact with other complementary systems to enrich the indexing output. In this publication we present an indexing architecture that allows the implementation of specialized modules that interact with one another to obtain a result that is accessible via the web.
- "Sistema de Indexación Automática de Contenidos Multimedia",
U. Urdapilleta, D. Conejero, X. Anguera, D. Cacenabes and F.J. Caminero, in Proc. XVIII Jornadas Telecom I+D, Bilbao, Spain pdf
Abstract - Mobile phones have become multimedia devices. Therefore it is not uncommon to observe users capturing photos and videos on their mobile phones. As the amount of digital multimedia content expands, it becomes increasingly difficult to find specific images in the device. In this paper, we present our experience with MAMI, a mobile phone prototype that allows users to annotate and search for digital photos on their camera phone via speech input. MAMI is implemented as a mobile application that runs in real-time on the phone. Users can add speech annotations at the time of capturing photos or at a later time. Additional metadata is also stored with the photos, such as location, user identification, date and time of capture and image-based features. Users can search for photos in their personal repository by means of speech without the need of connectivity to a server. In this paper, we focus on our findings from a user study aimed at comparing the efficacy of the search and the ease-of-use and desirability of the MAMI prototype when compared to the standard image browser available on mobile phones today.
- "Multimodal and Mobile Personal Image Retrieval: A User Study",
X. Anguera, N. Oliver and M. Cherubini, in Proc. Workshop on Mobile Information Retrieval, MOBIR'08, Singapore pdf
Abstract - Mobile phones are becoming multimedia devices. It is common to observe users capturing photos and videos on their mobile phones on a regular basis. As the amount of digital multimedia content expands, it becomes increasingly difficult to find specific images in the device. In this paper, we present a multimodal and mobile image retrieval prototype named MAMI (Multimodal Automatic Mobile Indexing). It allows users to annotate, index and search for digital photos on their phones via speech or image input. Speech annotations can be added at the time of capturing photos or at a later time. Additional metadata such as location, user identification, date and time of capture is stored in the phone automatically. A key advantage of MAMI is that it is implemented as a stand-alone application which runs in real-time on the phone. Therefore, users can search for photos in their personal archives without the need of connectivity to a server. In this paper, we compare multimodal and monomodal approaches for image retrieval and we propose a novel algorithm named the Multimodal Redundancy Reduction (MR2) Algorithm. In addition to describing in detail the proposed approaches, we present our experimental results and compare the retrieval accuracy of monomodal versus multimodal algorithms.
- "Multimodal Photo Annotation and Retrieval on a Mobile Phone",
X. Anguera, J. Xu and N. Oliver, in Proc. ACM Intl. Conference on Multimedia Information Retrieval, Vancouver, Canada. 2008 pdf
2007
Abstract - Human-machine interaction in meetings requires the localization and identification of the speaker that interacts with the system and the recognition of the message spoken. A seminal phase towards this goal is the so-called rich transcription research, which covers speaker diarization together with the annotation of sentence boundaries and elimination of speaker disfluencies. The subarea of speaker diarization intends to identify the number of participants in a meeting and create a list of speech time intervals for each such participant. In this paper we analyze the correlation between signals coming from different microphones and propose an improved method to do speaker diarization for meetings with multiple distant microphones. The proposed algorithm makes use of acoustic information and information about the delays between signals coming from all the sources. With this procedure, we have been able to achieve the best performance in the Spring 2006 National Institute of Standards and Technology (NIST) Rich Transcription Evaluation, improving the Diarization Error Rate (DER) by 15% to 20% relative compared to previous systems.
- "Speaker Diarization For Multiple-Distant-Microphone Meetings Using Several Sources of Information",
Jose M. Pardo, Xavier Anguera and Chuck Wooters, IEEE Transactions on Computers, September 2007, volume 56, number 9, pp. 1189-1224. pdf
Abstract - When performing speaker diarization on recordings from meetings, multiple microphones of different qualities are usually available and distributed around the meeting room. Although several approaches have been proposed in recent years to take advantage of multiple microphones, they are either too computationally expensive and not easily scalable or they cannot outperform the simpler case of using the best single microphone. In this work the use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain. New techniques we present include blind reference-channel selection, two-step Time Delay of Arrival (TDOA) Viterbi postprocessing, and a dynamic output signal weighting algorithm, together with using such TDOA values in the diarization to complement the acoustic information. Tests on speaker diarization show a 25% relative improvement on the test set compared to using a single most centrally located microphone. Additional experimental results show improvements using these techniques in a speech recognition task.
- "Acoustic beamforming for speaker diarization of meetings",
Xavier Anguera, Chuck Wooters and Javier Hernando, IEEE Transactions on Audio, Speech and Language Processing, September 2007, volume 15, number 7, pp. 2011-2023. pdf
Abstract - Accurate modeling of speaker clusters is important in the task of speaker diarization. Creating accurate models involves both selection of the model complexity and optimum training given the data. Using models with fixed complexity and trained using the standard EM algorithm poses a risk of overfitting, which can lead to a reduction in diarization performance. In this paper a technique proposed by the author to estimate the complexity of a model is combined with a novel training algorithm called “Cross-Validation EM” to control the number of training iterations. This combination leads to more robust speaker modeling and results in an increase in speaker diarization performance. Tests on the NIST RT (MDM) datasets for meetings show a relative improvement of 10.6% on the test set.
- "Model Complexity Selection and Cross-Validation EM Training for Robust Speaker Diarization",
Xavier Anguera, Takahiro Shinozaki, Chuck Wooters and Javier Hernando, ICASSP, Hawaii, USA, April 2007. pdf
Abstract - In the task of speaker diarization for meetings it has been shown in previous work that it is useful to use the Time Delay of Arrival (TDOA) between the different audio channels in the meeting room as an extra source of information in addition to the acoustic features. When combining feature streams, we use a weight to control the relative contributions of the streams. In the past, this weight was determined using development data and the same weight value was applied to all meetings. In this paper we present a method for automatically determining the weight. A metric derived from the Bayesian Information Criterion (BIC) computed for each feature stream estimates the weight for each meeting on the initial clustering iteration and adapts its value throughout the diarization process. By using this technique we achieve a more robust system and up to 18.2% relative improvement over the method of tuning the weight on development data.
- "Automatic Weighting for the Combination of TDOA and Acoustic Features in Speaker Diarization for Meetings",
Xavier Anguera, Chuck Wooters, J.M. Pardo and Javier Hernando, ICASSP, Hawaii, USA, April 2007. pdf
Abstract - In this paper the authors present the UPC speaker diarization system for the NIST Rich Transcription Evaluation (RT07s) [1] conducted on the conference environment. The presented system is based on the ICSI RT06s system, which employs agglomerative clustering with a modified Bayesian Information Criterion (BIC) measure to decide which pairs of clusters to merge and to determine when to stop merging clusters [2]. This is the first participation of the UPC in the RT Speaker Diarization Evaluation and the purpose of this work has been the consolidation of a baseline system which can be used in the future for further research in the field of diarization. We have introduced, as prior modules before the diarization system, a speech/non-speech detection module based on a Support Vector Machine from UPC and Wiener filtering from an implementation of the QIO front-end. In the speech parameterization, a Frequency Filtering (FF) of the filter-bank energies is applied instead of the classical Discrete Cosine Transform in the Mel-Cepstrum analysis. In addition, small changes are introduced in the complexity selection algorithm, along with a new post-processing technique which processes the shortest clusters at the end of each Viterbi segmentation.
- "Speaker Diarization for Conference Room: The UPC RT07s Evaluation System",
Jordi Luque, Xavier Anguera, Andrey Temko, and Javier Hernando, RT07s Rich Transcription evaluation workshop, Washington, May 2007 pdf
Abstract - We describe the latest version of the SRI-ICSI meeting and lecture recognition system, as was used in the NIST RT-07 evaluations, highlighting improvements made over the last year. Changes in the acoustic preprocessing include updated beamforming software for processing of multiple distant microphones, and various adjustments to the speech segmenter for close-talking microphones. Acoustic models were improved by the combined use of neural-net-estimated phone posterior features, discriminative feature transforms trained with fMPE-MAP, and discriminative Gaussian estimation using MPE-MAP, as well as model adaptation specifically to nonnative and non-American speakers. The net effect of these enhancements was a 14-16% relative error reduction on distant microphones, and a 16-17% error reduction on close-talking microphones. Also, for the first time, we report results on a new "coffee break" meeting genre, and on a new NIST metric designed to evaluate combined speech diarization and recognition.
- "The SRI-ICSI Spring 2007 Meeting and Lecture Recognition System",
Andreas Stolcke, Xavier Anguera, Kofi Boakye, Ozgur Çetin, Adam Janin, Mathew Magimai-Doss, Chuck Wooters, and Jing Zheng, RT07s Rich Transcription evaluation workshop, Washington, May 2007 pdf
2006
Abstract - When performing speaker diarization, it is common to use an agglomerative clustering approach where the acoustic data is first split in small pieces and then pairs are merged until reaching a stopping point. When using a purely agglomerative clustering technique, one cluster cannot be split into two. Therefore, errors caused by multiple speakers being assigned to one cluster can be common. Furthermore, clusters often contain non-speech frames, creating problems when deciding which two clusters to merge and when to stop the clustering. In this paper, we present two algorithms that aim to purify the clusters. The first assigns conflicting speech segments to a new cluster, and the second detects and eliminates non-speech frames when comparing two clusters. We show improvements of over 18% relative using three datasets from the most current Rich Transcription (RT) evaluations.
- "Purity Algorithms for Speaker Diarization of Meetings Data",
Xavier Anguera, Chuck Wooters and Javier Hernando. ICASSP 2006, Toulouse, France, May 2006. pdf
Abstract - We present a method to extract speaker turn segmentation from multiple distant microphones (MDM) using only delay values found via a cross-correlation between the available channels. The method is robust against the number of speakers (which is unknown to the system), the number of channels, and the acoustics of the room. The delays between channels are processed and clustered to obtain a segmentation hypothesis. We have obtained a 31.2% diarization error rate (DER) for NIST's RT05s MDM conference room evaluation set. For an MDM subset of NIST's RT04s development set, we have obtained 36.93% DER and 35.73% DER. Comparing those results with the ones presented by Ellis and Liu [8], who also used between-channel differences for the same data, we have obtained 43% relative improvement in the error rate.
- "Speaker Diarization for Multi-Microphone Meetings Using only Between-Channel Differences",
Jose M. Pardo, Xavier Anguera, Chuck Wooters, In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interaction: Third International Workshop (MLMI 2006), Lecture Notes in Computer Science. Springer pdf
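Illustrative sketch (not taken from the paper): the abstract above relies on delay values obtained by cross-correlating microphone channels. A minimal way to estimate one such delay is to pick the lag that maximizes the plain time-domain cross-correlation, as below. The function name, the brute-force search, and the toy signals are assumptions for illustration; the abstract does not say which cross-correlation variant, windowing, or post-processing the authors used.

```python
def estimate_delay(ref, sig, max_lag):
    """Estimate how many samples `sig` lags behind `ref`.

    Assumption: plain (unnormalized) cross-correlation searched by brute
    force over lags in [-max_lag, max_lag]. A positive result means the
    sound reached the `sig` microphone later than the `ref` microphone.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(ref):
            k = n + lag
            if 0 <= k < len(sig):
                score += r * sig[k]
        if score > best_score:  # ties keep the earliest lag
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical toy channels: the same pulse, arriving 2 samples later on mic_b.
mic_a = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]
delay_ab = estimate_delay(mic_a, mic_b, max_lag=4)
delay_ba = estimate_delay(mic_b, mic_a, max_lag=4)
```

Computed over short windows and for every channel pair, delays of this kind form a vector per window; speakers sitting in different places tend to produce different vectors, which is what makes clustering them into speaker turns plausible.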
Abstract - The goal of speaker diarization is to determine where each participant speaks in a recording. One of the most commonly used techniques is agglomerative clustering, where some number of initial models are grouped into the number of present speakers. The choice of complexity, topology, and the number of initial models is vital to the final outcome of the clustering algorithm. In prior systems, these parameters were directly assigned based on development data, and were the same for all recordings. In this paper we present three techniques to select the parameters individually for each case, obtaining a system that is more robust to changes in the data. Although the choice of these values depends on tunable parameters, they are less sensitive to changes in the acoustic data and to how the algorithm distributes data among the different clusters. We show that by using the three techniques, we achieve an improvement of up to 8% relative in the development set and 19% relative in the test set over prior systems.
- "Automatic Cluster Complexity and Quantity Selection: Towards Robust Speaker Diarization",
Xavier Anguera, Chuck Wooters, Javier Hernando, In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interaction: Third International Workshop (MLMI 2006), Lecture Notes in Computer Science. Springer pdf
Abstract - In this paper we present the ICSI speaker diarization system submitted for the NIST Rich Transcription evaluation (RT06s) [1] conducted on the meetings environment. The presented system is based on the RT05s system, which uses agglomerative clustering with a modified Bayesian Information Criterion (BIC) measure to decide which pairs of clusters to merge and to determine when to stop merging clusters. In this year’s system we have eliminated any remaining need for training data, therefore increasing robustness. In our primary system we have introduced several improvements over last year's. First, we use a new training-free speech/non-speech detection algorithm. Second, we introduce a new algorithm for system initialization. The third improvement is the use of a frame purification algorithm to increase cluster discriminability. Finally, we describe the use of inter-channel delays as features. We explain each of these improvements and show our system’s results on the official evaluation data using hand-aligned references and forced-alignments. We also analyze some of the results and propose improvements.
- "Robust Speaker Diarization for Meetings: ICSI RT06s Meetings Evaluation System",
Xavier Anguera, Chuck Wooters and Jose M. Pardo, In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interaction: Third International Workshop (MLMI 2006), Lecture Notes in Computer Science. Springer pdf
Abstract - Speaker diarization is often performed as a first step to speaker or speech recognition systems, which work better when the input signal is split into its speakers. When performing speaker diarization, it is common to use an agglomerative clustering approach in which the acoustic data is first split in small pieces and then pairs are merged until reaching a stopping point. The speaker clusters often contain non-speech frames that jeopardize discrimination between speakers, creating problems when deciding which two clusters to merge and when to stop the clustering. In this paper, we present one algorithm that aims to purify the clusters, eliminating the non-discriminant frames, selected using a likelihood-based metric, when comparing two clusters. We show improvements of over 15.5% relative using three datasets from the most current Rich Transcription (RT) evaluations.
- "Frame Purification for Cluster Comparison in Speaker Diarization",
Xavier Anguera, Chuck Wooters, Javier Hernando, MMUA 2006, Toulouse, France, May 2006. pdf
Abstract - When performing speaker diarization, it is common practice to use an agglomerative clustering approach where the acoustic data is first split in small segments and then pairs of these segments are merged until a particular stopping point is reached. The diarization performance can be greatly improved by the use of a speech/non-speech detector, which prevents non-speech frames from “confusing” both the merging and the stopping processes. Over the years there has been extensive research on speech/non-speech detectors. Often, speech/non-speech detectors require training data and their accuracy is strongly dependent on setting various thresholds correctly. In this work we present a hybrid speech/non-speech detector for use in our speaker diarization system within the meetings domain. Our proposed speech/non-speech system runs in two stages. The first stage performs an energy-based detection. The second stage performs a model-based decoding using the previous stage’s data as a bootstrap for the acoustic models, thus avoiding the need for any outside training data. We show an improvement of 14% and 10% relative on a development and a test set, respectively.
- "Hybrid Speech/Non-Speech Detector Applied to Speaker Diarization of Meetings",
Xavier Anguera, Mateu Aguilo, Chuck Wooters, Climent Nadeu and Javier Hernando, Speaker Odyssey 2006, San Juan de Puerto Rico, USA, June 2006. pdf
Abstract - The task of speaker diarization consists of answering the question “Who spoke when?”. The most commonly used technique consists of an agglomerative clustering of multiple initial clusters into the optimum number of speakers present in the recording. Even though the initial clustering is greatly modified by iterative cluster merging and possibly multiple resegmentations of the data, the initialization algorithm is a key module for system performance and robustness. In this paper we present a novel approach that obtains a desired initial number of clusters in three steps. It first obtains possible speaker change points via a standard technique based on the Bayesian Information Criterion (BIC). It then classifies the resulting segments into friend and enemy groups and creates an initial set of clusters for the system to run on. We test this algorithm with the dataset used in the RT05s evaluation, where we show a 13% relative improvement in diarization error rate and a 2.5% absolute cluster purity improvement with respect to the previously used algorithm.
- "Friends and Enemies: A Novel Initialization for Speaker Diarization",
Xavier Anguera, Chuck Wooters and Javier Hernando, ICSLP06, Pittsburgh, Pennsylvania, USA, September 2006. pdf
Abstract - In this paper we present the ICSI speaker diarization system submitted for the NIST Rich Transcription evaluation (RT06s) [1] conducted on the meetings environment. This is a set of yearly evaluations which in the last two years have included the speaker diarization of two distinct kinds of meetings: the conference room and the lecture room. The system presented focuses on being robust to changes in the meeting conditions by not using any training data. In this paper we introduce four of the main improvements to the system from last year’s submission. The first is a new training-free speech/non-speech detection algorithm. The second is the introduction of a new algorithm for the system initialization. The third is the use of a frame purification algorithm to increase cluster differentiability. The last improvement is the use of inter-channel delays as features, greatly improving performance. We show the diarization error rate (DER) score of this system on all available meetings datasets to date for the multiple distant microphone (MDM) and single distant microphone (SDM) conditions.
- "Robust Speaker Diarization for Meetings: ICSI RT06s evaluation system",
Xavier Anguera, Chuck Wooters and Jose M. Pardo, ICSLP06, Pittsburgh, Pennsylvania, USA, September 2006. pdf
Abstract - We describe the development of the ICSI-SRI speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2006 Meeting Rich Transcription (RT-06S) evaluation, highlighting improvements made since last year, including improvements to the delay-and-sum algorithm, the nearfield segmenter, language models, posterior-based features, HMM adaptation methods, and adapting to a small amount of new lecture data. Results are reported on RT-05S and RT-06S meeting data. Compared to the RT-05S conference system, we achieved an overall improvement of 4% relative in the MDM and SDM conditions, and 11% relative in the IHM condition. On lecture data, we achieved an overall improvement of 8% relative in the SDM condition, 12% on MDM, 14% on ADM, and 15% on IHM.
- "The ICSI-SRI Spring 2006 Meeting Recognition System",
A. Janin, A. Stolcke, X. Anguera, K. Boakye, O. Cetin, J. Frankel, and J. Zheng, In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interaction: Third International Workshop (MLMI 2006), Lecture Notes in Computer Science. Springer pdf
Abstract - In the context of speech and speaker recognition systems, it is well known that the combination of different feature streams can significantly improve their performance. However, the application of multi-stream (MS) techniques to speaker diarization systems has not been extensively studied. In this paper, we address this issue: we formulate different MS techniques, such as feature combination, probability combination and selection, for their specific application to the segmentation and clustering modules of a speaker diarization system. We evaluate the different methods proposed for the meetings domain (RT04s database) and two different pairs of streams: first, MFCC and PLP and second, MFCC and prosodic features. For both types of multi-streams, results show that the MS probability combination approach applied to the segmentation stage clearly outperforms the single-stream, MS feature combination and MS selection systems.
- "Multi-Stream Speaker Diarization Systems for the Meetings Domain",
Ascension Gallardo, Xavier Anguera and Chuck Wooters, ICSLP06, Pittsburgh, Pennsylvania, USA, September 2006. pdf
Abstract - Speaker diarization for recordings made in meetings consists of identifying the number of participants in each meeting and creating a list of speech time intervals for each participant. In recently published work [7] we presented some experiments using only TDOA values (Time Delay Of Arrival for different channels) applied to this task. We demonstrated that information in those values can be used to segment the speakers. In this paper we have developed a method to mix the TDOA values with the acoustic values by calculating a combined log-likelihood between both sets of vectors. Using this method we have been able to reduce the DER by 16.34% (relative) for the NIST RT05s set (scored without overlap and manually transcribed references), the DER for our devel06s set (scored with overlap and force-aligned references) by 21% (relative), and the DER for the NIST RT06s (scored with overlap and manually transcribed references) by 15% (relative).
- "Speaker Diarization for Multiple Distant Microphone Meetings: Mixing Acoustic Features And Inter-Channel Time Differences",
Jose M. Pardo, Xavier Anguera and Chuck Wooters, ICSLP06, Pittsburgh, Pennsylvania, USA, September 2006. pdf
2005
Abstract - In this paper a novel probability-based measure is presented that shows good results for real-time blind speaker segmentation. In such a task there is no prior information about the identity or the number of speakers. Similar to the Bayesian Information Criterion (BIC), the proposed measure indicates the similarity between two speech segments on either side of a given test point. By computing cross probabilities between both segments, an abrupt decrease of the measure value indicates the existence of an acoustic change point. A scrolling window implementation, similar to the one used in metric-based techniques, is shown to give better results regarding speed and change detection. This measure allows building real-time systems. Tests with broadcast news data show a general improvement compared to the commonly used BIC method.
- "XBIC: Real-Time Cross Probabilities Measure for Speaker Segmentation",
Xavier Anguera. International Computer Science Institute Technical Report TR-05-008. pdf
Abstract - In this paper we describe the ICSI-SRI entry in the Rich Transcription 2005 Spring Meeting Recognition Evaluation. The current system is based on the ICSI-SRI clustering system for Broadcast News (BN), with extra modules to process the different meetings tasks in which we participated. Our base system uses agglomerative clustering with a modified Bayesian Information Criterion (BIC) measure to determine when to stop merging clusters and to decide which pairs of clusters to merge. This approach does not require any pre-trained models, thus increasing robustness and simplifying the port from BN to the meetings domain. For the meetings domain, we have added several features to our baseline clustering system, including a “purification” module that tries to keep the clusters acoustically homogeneous throughout the clustering process, and a delay-and-sum beamforming algorithm which enhances signal quality for the multiple distant microphones (MDM) sub-task. In post-evaluation work we further improved the delay-and-sum algorithm, experimented with a new speech/non-speech detector and proposed a new system for the lecture room environment.
- "Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System",
Xavier Anguera, Chuck Wooters, Barbara Peskin and Mateu Aguilo. In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interaction: Second International Workshop (MLMI 2005), Lecture Notes in Computer Science. Springer pdf
Abstract - We describe the development of our speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2005 Meeting Rich Transcription (RT-05S) evaluation, highlighting improvements made since last year [1]. The system is based on the SRI-ICSI-UW RT-04F conversational telephone speech (CTS) recognition system, with meeting-adapted models and various audio preprocessing steps. This year’s system features better delay-sum processing of distant microphone channels and energy-based crosstalk suppression for close-talking microphones. Acoustic modeling is improved by virtue of various enhancements to the background (CTS) models, including added training data, decision-tree based state tying, and the inclusion of discriminatively trained phone posterior features estimated by multilayer perceptrons. In particular, we make use of adaptation of both acoustic models and MLP features to the meeting domain. For distant microphone recognition we obtained considerable gains by combining and cross-adapting narrow-band (telephone) acoustic models with broadband (broadcast news) models. Language models (LMs) were improved with the inclusion of new meeting and web data. In spite of a lack of training data, we created effective LMs for the CHIL lecture domain. Results are reported on RT-04S and RT-05S meeting data. Measured on RT-04S conference data, we achieved an overall improvement of 17% relative in both MDM and IHM conditions compared to last year’s evaluation system. Results on lecture data are comparable to the best reported results for that task.
- "Further Progress in Meeting Recognition: The ICSI-SRI Spring 2005 Speech-to-Text Evaluation System",
Andreas Stolcke, Xavier Anguera, Kofi Boakye, Ozgur Cetin, Frantisek Grezl, Adam Janin, Arindam Mandal, Barbara Peskin, Chuck Wooters and Jing Zheng. In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interaction: Third International Workshop (MLMI 2005), Lecture Notes in Computer Science. Springer pdf
Abstract - One of the sub-tasks of the Spring 2004 and Spring 2005 NIST Meetings evaluations requires segmenting multi-party meetings into speaker-homogeneous regions using data from multiple distant microphones (the "MDM" sub-task). One approach to this task is to run a speaker segmentation system on each of the microphone channels separately, and then merge the results. This can be thought of as a many-to-one post-processing approach. In this paper we propose an alternative approach in which we use delay-and-sum beamforming techniques to fuse the signals from each of the multiple distant microphones into a single enhanced signal. This approach can be thought of as a many-to-one pre-processing approach. In the pre-processing approach we propose, the time delay of arrival (TDOA) between each of the multiple distant channels and a reference channel is computed incrementally using a window that steps through the signals from each of the multiple microphones. No information about the locations or setup of the microphones is required. Using the TDOA information, the channels are first aligned and then summed, and the resulting "enhanced" signal is clustered using our standard speaker diarization system. We test our approach on the 2004 and 2005 NIST meetings evaluation databases and show that the technique performs very well.
- "Speaker Diarization for Multi-Party Meetings Using Acoustic Fusion",
Xavier Anguera, Chuck Wooters and Javier Hernando. Automatic Speech Recognition and Understanding (ASRU). Puerto Rico, November 2005. pdf
- "PETRA: Advanced Oral Interfaces for Unified Messaging Applications",
David Hernando, Javier Hernando and Xavier Anguera. Buran magazine, IEEE Barcelona student branch. Number 22, September 2005.
2004
Abstract - When performing blind speaker segmentation, one of the main problems is not knowing how many speakers appear in a conversation and whether they appear once or more than once. In this paper, an iterative method based on the EvolutiveHMM is presented. Two main improvements to this system are introduced. On the one hand, a generic repository speaker is used to model all utterances, and all speaker models are derived from it iteratively. Different normalizations of the scores are applied to the repository and the speakers to emphasize speaker changes. On the other hand, in all cases we use Gaussian Mixture Models (GMM) for their flexibility compared to an HMM structure. This method has been successfully tested using multi-speaker speech sequences generated by concatenation of speech segments from Speecon.
- "Evolutive Speaker Segmentation using a Repository System",
Xavier Anguera and Javier Hernando. ICSLP, Korea 2004. pdf
Abstract - The evolution of the information society has brought a steady increase in the audiovisual content that is constantly broadcast in Catalan by local and national television channels and radio stations. These broadcasts are normally archived in multimedia databases so that they can be consulted later, but because of the large amount of stored data, accessing this information is difficult, if not impossible, and very costly. In this communication we present the existing techniques for automatic indexing of audio material that we are working on in the Department of Signal Theory and Communications at UPC. Automatic indexing of the databases makes it possible to run specific searches and retrieve documents much more quickly. We place special emphasis on indexing the identity of the people who appear in the database and the time intervals in which they speak. We present a measure called XBIC, created within our group, for detecting speaker changes within a speech signal. Results of this new technique are shown on a database collected in Catalan.
- "Segmentació de locutor per a la indexació automàtica de bases de dades multimèdia en català",
Xavier Anguera, Mireia Farrús, Javier Hernando and Alberto Abad. II Congrés d'enginyeria en llengua catalana, Andorra 2004. pdf
- "Els sistemes de reconeixement de veu i traduccio automatica en catala: present i futur",
Mireia Farrús, Jan Anguita, Xavier Anguera, Josep M. Crego, Adria de Gispert, Javier Hernando, Climent Nadeu. II Congrés d'enginyeria en llengua catalana, Andorra 2004.
- "XBIC: Nueva Medida para Segmentación de Locutor hacia el Indexado Automático de la Señal de Voz",
Xavier Anguera, Javier Hernando and Jan Anguita. III Jornadas en Tecnología del Habla, Valencia, 17-19 Nov 2004.
Abstract - We describe the ICSI-SRI entry in the Fall 2004 DARPA EARS Metadata Evaluation. The current system was derived from ICSI’s Fall 2003 Speaker-attributed STT system. Our system is an agglomerative clustering system that uses a BIC-like measure to determine when to stop merging clusters and to decide which pairs of clusters to merge. The main advantage of this approach is that it does not require pre-trained acoustic models, providing robustness and portability. Changes for this year’s system include: different front-end features, the addition of SRI’s Broadcast News speech/non-speech detector, and modifications to the segmentation routine. In post-evaluation work, we found further improvement by changing the stopping criterion from the BIC-like measure to a Viterbi measure. Additionally, we have explored issues related to pruning and improved initialization.
- "Towards Robust Speaker Segmentation: The ICSI-SRI Fall 2004 Diarization System",
Chuck Wooters, James Fung, Barbara Peskin and Xavier Anguera. EARS Program RT-04 Workshop, Nov 7-10 2004. pdf
Copyright note:
The documents distributed in this page are provided as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that the works are offered here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be distributed without the explicit permission of the copyright holder.