This paper presents and evaluates an inverse filtering technique for speech signals based on Stabilized Weighted Linear Prediction (SWLP). SWLP emphasizes the speech samples that fit the underlying speech production model well by applying temporal weighting to the square of the residual signal. The performance of SWLP is compared to conventional Linear Prediction based inverse filtering techniques, such as the Autocorrelation and Closed Phase Covariance methods. All the inverse filtering approaches are evaluated on a database of speech signals generated by a physical model of the voice production system. Results show that the glottal flows estimated using SWLP are closer to the original glottal flow than those estimated by the Autocorrelation approach, while its performance is comparable to that of the Closed Phase Covariance approach.
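The temporal weighting mentioned above can be written compactly as a weighted least-squares criterion. The following is only a sketch under stated assumptions: the symbols (signal x_n, predictor coefficients a_k, order p, weight w_n, window length M) are illustrative, and the short-time-energy form of the weight is a common choice in the weighted linear prediction literature rather than something the abstract states. The stabilization step that gives SWLP its name is not shown.

```latex
E = \sum_{n} w_n \, e_n^{2},
\qquad
e_n = x_n - \sum_{k=1}^{p} a_k \, x_{n-k},
\qquad
w_n = \sum_{i=1}^{M} x_{n-i}^{2} \quad \text{(assumed short-time-energy weight)}
```

Setting w_n = 1 for all n reduces the criterion to the conventional linear prediction error.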
In this paper, we present an extension of a recently developed AM-FM decomposition algorithm, which will be referred to as the extended adaptive Quasi-Harmonic Model (eaQHM). It was previously shown that the adaptive Quasi-Harmonic Model (aQHM) is an efficient AM-FM decomposition algorithm with applications in speech analysis. In this paper, we show that a simple extension of the aQHM algorithm to include not only frequency but also amplitude adaptation results in higher performance in terms of Signal-to-Reconstruction-Error Ratio (SRER). To support our hypothesis, eaQHM is tested both on synthetic signals and on a subset of the ARCTIC database of speech. Overall, compared with aQHM, eaQHM improves the SRER by more than 2 dB, on average.
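For readers unfamiliar with the SRER, the sketch below computes it as 20 log10 of the ratio between the standard deviation of the signal and that of the reconstruction error. This is the definition commonly used in the quasi-harmonic modeling literature and is assumed here, since the abstract does not spell it out; the function name `srer_db` and the test signal are illustrative only.

```python
import math

def srer_db(x, x_hat):
    """Signal-to-Reconstruction-Error Ratio in dB (assumed definition):
    20 * log10(std(x) / std(x - x_hat))."""
    if len(x) != len(x_hat):
        raise ValueError("signals must have the same length")

    def std(v):
        mean = sum(v) / len(v)
        return math.sqrt(sum((s - mean) ** 2 for s in v) / len(v))

    residual = [a - b for a, b in zip(x, x_hat)]
    err = std(residual)
    if err == 0.0:
        return math.inf  # perfect reconstruction
    return 20.0 * math.log10(std(x) / err)

# Illustrative example: a reconstruction that captures 90% of the amplitude
x = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
x_hat = [0.9 * s for s in x]
print(round(srer_db(x, x_hat), 2))  # 20.0
```

Under this definition, a 2 dB gain in SRER corresponds to a reconstruction-error standard deviation that is smaller by a factor of about 1.26.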
This paper presents the performance of recently proposed adaptive signal models in modeling voiceless stop sounds in speech. Stop sounds are transient parts of speech that are highly non-stationary in time. State-of-the-art sinusoidal models fail to model them accurately and efficiently, thus introducing an artifact known as the pre-echo effect. The adaptive QHM (aQHM) and the extended adaptive QHM (eaQHM) are tested against this effect, and it is shown that highly accurate, pre-echo-free representations of stop sounds are possible using adaptive schemes. Results on a large database of voiceless stops show that, on average, eaQHM improves the Signal to Reconstruction Error Ratio (SRER) obtained by the standard sinusoidal model by 100%.
In this paper, a simple method for time-scale modification of speech is presented, based on a recently suggested model for AM-FM decomposition of speech signals. This model is referred to as the adaptive Harmonic Model (aHM). A full-band speech analysis/synthesis system based on the aHM representation is built, without the need to separate a deterministic and/or a stochastic component from the speech signal. The aHM models speech as a sum of harmonically related sinusoids that can adapt to the local characteristics of the signal and provide accurate instantaneous amplitude, frequency, and phase trajectories. Because of its high quality representation and reconstruction of speech, aHM can provide high quality time-scale modifications. Informal listening tests show that the synthetic time-scaled waveforms are natural and free of some common artifacts encountered in other state-of-the-art models, such as “metallic quality”, chorusing, or musical noise.
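To make the "sum of harmonically related sinusoids" concrete, here is a minimal synthesis sketch. It is not the aHM analysis procedure, which estimates the parameters adaptively from speech; it only shows how a signal can be rebuilt once per-sample f0 and harmonic amplitude trajectories are available. The function name, the zero initial phase, and the example trajectories are all assumptions made for illustration.

```python
import math

def synthesize_harmonic(f0, amps, fs):
    """Sum of harmonics k*f0[n] with per-sample amplitudes amps[k-1][n].

    The fundamental phase is a running sum (discrete integral) of the
    instantaneous f0, so every harmonic follows the f0 curve.
    Assumes zero initial phase for all harmonics.
    """
    n_samples = len(f0)
    out = [0.0] * n_samples
    phase = 0.0  # phase of the fundamental, in radians
    for n in range(n_samples):
        for k, a_k in enumerate(amps, start=1):
            out[n] += a_k[n] * math.cos(k * phase)
        phase += 2.0 * math.pi * f0[n] / fs
    return out

# Illustrative example: two harmonics on a glide from 100 Hz to 120 Hz
fs = 8000
N = 800
f0 = [100.0 + 20.0 * n / (N - 1) for n in range(N)]
amps = [[1.0] * N, [0.5] * N]
y = synthesize_harmonic(f0, amps, fs)
```

In a representation of this kind, one plausible route to time-scaling is to resample the f0 and amplitude trajectories onto a stretched time axis before resynthesis, which changes duration without shifting pitch.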
Percussive musical instrument sounds figure among the most challenging to model using sinusoids, particularly due to the characteristic attack that features a sharp onset and transients. Attack transients present a highly nonstationary inharmonic behaviour that is very difficult to model with traditional sinusoidal models, which use slowly varying sinusoids and commonly introduce an artifact known as pre-echo. In this work, we use an adaptive sinusoidal model dubbed eaQHM to model percussive sounds from musical instruments such as plucked strings or percussion, and investigate how eaQHM handles the sharp onsets and the nonstationary inharmonic nature of the attack transients. We show that adaptation renders a virtually perceptually identical sinusoidal representation of percussive sounds from different musical instruments, improving the Signal to Reconstruction Error Ratio (SRER) obtained with a traditional sinusoidal model. The result of a listening test revealed that the percussive sounds modeled with eaQHM were considered perceptually closer to the original sounds than their traditional-sinusoidal-modeled counterparts. Most listeners reported that they used the attack as a cue.
Nowadays, sinusoidal modeling commonly includes a residual obtained by subtracting the sinusoidal model from the original sound. This residual signal is often further modeled as filtered white noise. In this work, we evaluate how well filtered white noise models the residual from sinusoidal modeling of musical instrument sounds for several sinusoidal algorithms. We compare how well each sinusoidal model captures the oscillatory behavior of the partials by looking into how “noisy” their residuals are. We performed a listening test to evaluate the perceptual similarity between the original residual and the modeled counterpart. Then we further investigate whether the result of the listening test can be explained by the fine structure of the residual magnitude spectrum. The results presented here have the potential to support improvements in residual modeling.
In this paper, a simple method for pitch-scale modification of speech is presented, based on a recently suggested model for AM-FM decomposition of speech signals. This model is referred to as the adaptive Harmonic Model (aHM). The aHM models speech as a sum of harmonically related sinusoids that can adapt to the local characteristics of the signal. It was shown that this model provides high quality reconstruction of speech and thus can also provide high quality pitch-scale modifications. For the latter, the amplitude envelope is estimated using the Discrete All-Pole (DAP) method, and the phase envelope estimation is performed by utilizing the concept of relative phase. Formal listening tests on a database of several languages show that the synthetic pitch-scaled waveforms are natural and free of some common artefacts encountered in other state-of-the-art models, such as HNM and STRAIGHT.
Recent advances in speech analysis have shown that voiced speech can be very well represented using quasi-harmonic frequency tracks and local parameter adaptivity to the underlying signal. In this paper, we revisit the quasi-harmonicity approach through the extended adaptive Quasi-Harmonic Model (eaQHM), and we show that the application of a continuous f0 estimation method plus an adaptivity scheme can yield high resolution quasi-harmonic analysis and perceptually indistinguishable resynthesized speech. This method assumes an initial harmonic model which successively converges to quasi-harmonicity. Formal listening tests showed that eaQHM is robust against f0 estimation artefacts and can provide higher quality in resynthesizing speech, compared to a recently developed model, called the adaptive Harmonic Model (aHM), and the classic Sinusoidal Model (SM).
Processing of emotional (or expressive) speech has gained attention over recent years in the speech community due to its numerous applications. In this paper, an adaptive sinusoidal model (aSM), dubbed the extended adaptive Quasi-Harmonic Model (eaQHM), is employed to analyze emotional speech in terms of accurate, robust, continuous, time-varying parameters (amplitude, frequency, and phase). It is shown that these parameters can adequately and accurately represent emotional speech content. Using a well-known database of narrowband expressive speech (SUSAS), we show that very high Signal-to-Reconstruction-Error Ratio (SRER) values can be obtained, compared to the standard sinusoidal model (SM). Formal listening tests on a smaller wideband speech database show that the eaQHM outperforms SM from a perceptual resynthesis quality point of view. Finally, preliminary emotion classification tests show that the parameters obtained from the adaptive model lead to a higher classification score, compared to the standard SM parameters.
Automatic classification of emotional speech is a challenging task with applications in synthesis and recognition. In this paper, an adaptive sinusoidal model (aSM), called the extended adaptive Quasi-Harmonic Model (eaQHM), is applied to emotional speech analysis for classification purposes. The parameters of the model (amplitude and frequency) are used as features for the classification. Using a well-known database of narrowband expressive speech (SUSAS), we develop two separate Vector Quantizers (VQ) for the classification, one for the amplitude and one for the frequency features. It is shown that the eaQHM can outperform the standard Sinusoidal Model in classification scores. However, single feature classification is inappropriate for higher-rate classification. Thus, we suggest a combined amplitude-frequency classification scheme, where the classification scores of each VQ are weighted and ranked, and the decision is made based on the minimum value of this ranking. Experiments show that the proposed scheme achieves higher performance when the features are obtained from eaQHM. Future work can be directed to different classifiers, such as HMMs or GMMs, and ultimately to emotional speech transformations and synthesis.
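One possible reading of the combined amplitude-frequency scheme is sketched below: each VQ yields a distortion per class, the classes are ranked within each VQ, the two rankings are combined with weights, and the class with the minimum combined value wins. The abstract does not give the exact weighting or ranking rule, so the function names, the weights, the class labels, and the distortion values are all hypothetical.

```python
def rank_classes(distortions):
    """Map each class label to its rank (1 = smallest VQ distortion)."""
    ordered = sorted(distortions, key=distortions.get)
    return {label: rank for rank, label in enumerate(ordered, start=1)}

def combined_decision(amp_dist, freq_dist, w_amp=0.5, w_freq=0.5):
    """Pick the class with the minimum weighted sum of the two rankings.

    Both dictionaries are expected to share the same class labels.
    """
    amp_rank = rank_classes(amp_dist)
    freq_rank = rank_classes(freq_dist)
    combined = {c: w_amp * amp_rank[c] + w_freq * freq_rank[c] for c in amp_dist}
    return min(combined, key=combined.get), combined

# Made-up distortions from the amplitude VQ and the frequency VQ
amp_dist = {"neutral": 0.42, "angry": 0.31, "loud": 0.35}
freq_dist = {"neutral": 0.50, "angry": 0.20, "loud": 0.15}
label, scores = combined_decision(amp_dist, freq_dist, w_amp=0.4, w_freq=0.6)
print(label)  # loud
```

In this toy case the amplitude VQ alone would pick "angry" and the frequency VQ alone would pick "loud"; the weights decide which feature dominates when the two disagree.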
Nonstationary oscillations are ubiquitous in music and speech, ranging from the fast transients in the attack of musical instruments and consonants to amplitude and frequency modulations in expressive variations present in vibrato and prosodic contours. Modeling nonstationary oscillations with sinusoids remains one of the most challenging problems in signal processing because the fit also depends on the nature of the underlying sinusoidal model. For example, frequency modulated sinusoids are more appropriate to model vibrato than fast transitions. In this paper, we propose to model nonstationary oscillations with adaptive sinusoids from the extended adaptive quasi-harmonic model (eaQHM). We generated synthetic nonstationary sinusoids with different amplitude and frequency modulations and compared the modeling performance of adaptive sinusoids estimated with eaQHM, exponentially damped sinusoids estimated with ESPRIT, and log-linear-amplitude quadratic-phase sinusoids estimated with frequency reassignment. The adaptive sinusoids from eaQHM outperformed frequency reassignment for all nonstationary sinusoids tested and presented performance comparable to exponentially damped sinusoids.
In this paper, a recently proposed high-resolution Sinusoidal Model, dubbed the extended adaptive Quasi-Harmonic Model (eaQHM), is applied to modeling unvoiced speech sounds. Unvoiced speech sounds are parts of speech that are highly non-stationary in the time-frequency plane. Standard sinusoidal models fail to model them accurately and efficiently, thus introducing artefacts, while the reconstructed signals do not attain the quality and naturalness of the originals. Motivated by recently proposed non-stationary transforms, such as the Fan-Chirp Transform (FChT), eaQHM is tested against these effects, and it is shown that highly accurate, artefact-free representations of unvoiced sounds are possible using the non-stationary properties of the model. Experiments on databases of unvoiced sounds show that, on average, eaQHM improves the Signal to Reconstruction Error Ratio (SRER) obtained by the standard Sinusoidal Model (SM) by 93%. Moreover, modeling superiority is also supported via informal listening tests with two other models, namely the SM and the well-known STRAIGHT method.
The paper investigates acoustic properties of the Greek voiceless plosives /p, t, k/, including the palatal allophone [c], by examining absolute and relative VOT and closure duration, relative burst intensity and spectral moments. Variability due to place of articulation, vowel context, gender and age is examined. The speech material comprised C1VC2V real words (C1=/p, t, k/, V=/i, a/, C2=dental/alveolar). Data from 12 adult speakers and 12 children (6 male and 6 female in each group) were analysed. Results showed that relative closure duration decreased and relative VOT duration increased in the order /p/, /t/, /k/ showing the anticipated inverse relationship reported in the literature. VOT was longer in the high vowel context for /t, k/. All spectral moments were significantly affected by place of articulation. Relative burst intensity was greater for the velar. Effects of gender and age were variable. Results are discussed in relation to theory and crosslinguistic evidence.
Sinusoids are widely used to represent the oscillatory modes of musical instrument sounds in both analysis and synthesis. However, musical instrument sounds feature transients and instrumental noise that are poorly modeled with quasi-stationary sinusoids, requiring spectral decomposition and further dedicated modeling. In this work, we propose a full-band representation that fits sinusoids across the entire spectrum. We use the extended adaptive Quasi-Harmonic Model (eaQHM) to iteratively estimate amplitude- and frequency-modulated (AM–FM) sinusoids able to capture challenging features such as sharp attacks, transients, and instrumental noise. We use the signal-to-reconstruction-error ratio (SRER) as the objective measure for the analysis and synthesis of 89 musical instrument sounds from different instrumental families. We compare against quasi-stationary sinusoids and exponentially damped sinusoids. First, we show that the SRER increases with adaptation in eaQHM. Then, we show that full-band modeling with eaQHM captures partials at the higher frequency end of the spectrum that are neglected by spectral decomposition. Finally, we demonstrate that a frame size equal to three periods of the fundamental frequency results in the highest SRER with AM–FM sinusoids from eaQHM. A listening test confirmed that the musical instrument sounds resynthesized from full-band analysis with eaQHM are virtually perceptually indistinguishable from the original recordings.
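The frame-size finding above translates into a simple calculation: a frame of three fundamental periods spans 3 · fs / f0 samples. The helper below is only an illustration of that arithmetic; the function name and the rounding to the nearest integer are assumptions, not details taken from the paper.

```python
def frame_length_samples(f0_hz, fs_hz, periods=3):
    """Number of samples spanning `periods` fundamental periods."""
    if f0_hz <= 0 or fs_hz <= 0:
        raise ValueError("f0 and sampling rate must be positive")
    return round(periods * fs_hz / f0_hz)

# A 440 Hz note sampled at 44.1 kHz: 3 * 44100 / 440 is about 300.7 samples
print(frame_length_samples(440.0, 44100))  # 301
```

Lower-pitched sounds therefore need proportionally longer frames for the same number of periods.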
We propose a fast speech analysis method which simultaneously performs high-resolution voiced/unvoiced detection (VUD) and accurate estimation of glottal closure and glottal opening instants (GCIs and GOIs, respectively). The proposed algorithm exploits the structure of the glottal flow derivative in order to estimate GCIs and GOIs only in voiced speech using simple time-domain criteria. We compare our method with well-known GCI/GOI methods, namely, the dynamic programming projected phase-slope algorithm (DYPSA), the yet another GCI/GOI algorithm (YAGA) and the speech event detection using the residual excitation and a mean-based signal (SEDREAMS). Furthermore, we examine the performance of the aforementioned methods when combined with state-of-the-art VUD algorithms, namely, the robust algorithm for pitch tracking (RAPT) and the summation of residual harmonics (SRH). Experiments conducted on the APLAWD and SAM databases show that the proposed algorithm outperforms the state-of-the-art combinations of VUD and GCI/GOI algorithms with respect to almost all evaluation criteria for clean speech. Experiments on speech contaminated with several noise types (white Gaussian, babble, and car-interior) are also presented and discussed. The proposed algorithm outperforms the state-of-the-art combinations in most evaluation criteria for signal-to-noise ratio greater than 10 dB.
Source code is available here
Sinusoidal Modeling is one of the most widely used parametric methods for speech and audio signal processing. The accurate estimation of sinusoidal parameters (amplitudes, frequencies, and phases) is a critical task for close representation of the analyzed signal. In this thesis, based on recent advances in sinusoidal analysis, we propose high resolution adaptive sinusoidal models for analysis, synthesis, and modification systems for speech. Our goal is to provide systems that represent speech in a highly accurate and compact way. Inspired by the recently introduced adaptive Quasi-Harmonic Model (aQHM) and adaptive Harmonic Model (aHM), we overview the theory of adaptive Sinusoidal Modeling and we propose a model named the extended adaptive Quasi-Harmonic Model (eaQHM), which is a non-parametric model able to adjust the instantaneous amplitudes and phases of its basis functions to the underlying time-varying characteristics of the speech signal, thus significantly alleviating the so-called local stationarity hypothesis. The eaQHM is shown to outperform aQHM in analysis and resynthesis of voiced speech. Based on the eaQHM, a hybrid analysis/synthesis system of speech is presented (eaQHNM), along with a hybrid version of the aHM (aHNM). Moreover, we present motivation for a full-band representation of speech using the eaQHM, that is, representing all parts of speech as high resolution AM-FM sinusoids. Experiments show that adaptation and quasi-harmonicity are sufficient to provide transparent quality in unvoiced speech resynthesis. The full-band eaQHM analysis and synthesis system is presented next, which outperforms state-of-the-art systems, hybrid or full-band, in speech reconstruction, providing transparent quality confirmed by objective and subjective evaluations. Regarding applications, the eaQHM and the aHM are applied to speech modifications (time and pitch scaling).
The resulting modifications are of high quality and follow very simple rules, compared to other state-of-the-art modification systems. Results show that harmonicity is preferred over quasi-harmonicity in speech modifications due to the embedded simplicity of representation. Moreover, the full-band eaQHM is applied to the problem of modeling audio signals, and specifically musical instrument sounds. The eaQHM is evaluated and compared to state-of-the-art systems, and is shown to outperform them in terms of resynthesis quality, successfully representing the attack, transient, and stationary parts of a musical instrument sound. Finally, another application is suggested, namely the analysis and classification of emotional speech. The eaQHM is applied to the analysis of emotional speech, providing its instantaneous parameters as features that can be used in recognition and Vector-Quantization-based classification of the emotional content of speech. Although sinusoidal models are not commonly used in such tasks, results are promising.
This book presents the fundamental principles of continuous and discrete time signal processing and system analysis in a simple and comprehensible way. Emphasis is placed on intuitive analysis and interpretation of its subjects, accompanied by mathematical rigor where necessary. Fourier, Laplace, and Z Transforms play a major role, along with LTI system analysis and their applications. Each chapter of the book emerges from the unsolved problems and requirements of previous chapters, thus providing a flow that helps the reader develop the thinking skills of an engineer. The book contains a variety of images and figures and includes many carefully selected solved examples. Moreover, the reader can find a number of exercises at the end of each chapter. Finally, the theory is supported by selected implementations in MATLAB, and the source code of each example is provided on a dedicated webpage, along with supplementary files.