
Machine listening

We can acquire data from various sources. In a visualization context, however, we may need to control some elements of an animation with respect to a particular characteristic of a signal, for example, its amplitude or its frequency. Yet these kinds of information are attributes of the signal rather than parts of it. In other words, we need something to happen not with respect to existing data (that is, our signal in this context) but with respect to certain characteristics of a data flow.

Consider that an audio signal is completely unaware of how loud it is or of what its frequency is. Remember that audio signals are merely streams of numbers and that sounds are merely fluctuations of air pressure. The reason we understand sounds as having loudness or pitch is that our auditory apparatus analyzes them and provides the brain with information on certain sonic qualities. Further, more sophisticated perceptual and cognitive processes perform additional kinds of analyses to extract as well as attribute information and meanings, so that we perceptually decipher what we hear. Likewise, we can say that a signal is periodic and has a certain frequency only if we somehow analyze it. Remember that the output of a sinusoidal oscillator at a frequency of 200 Hz is just a flow of numbers between -1 and 1. The datum 200 is not part of this signal, so the only way to make something happen with respect to this number is to actually generate it by analyzing the audio signal for its frequency. The task of retrieving statistical and other kinds of information from audio signals is generally referred to as machine listening.
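As a minimal sketch of this point, the following language-side snippet computes the first few samples of a 200 Hz sinusoid by hand. The sampling rate of 44100 and the variable names are assumptions made purely for illustration. The posted array should contain only values between -1 and 1, and the number 200 appears nowhere in it:

```supercollider
(
var sampleRate = 44100; // assumed sampling rate
var freq = 200; // frequency of the sinusoid in Hz
// the first eight samples of the wave, computed on the language side
var samples = Array.fill(8, { |i| sin(2pi * freq * i / sampleRate) });
samples.postln; // just a list of numbers; the frequency itself is not among them
)
```

To recover the value 200 from such a list of numbers, we would have to analyze it, which is exactly what the UGens discussed in this article do.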

Machine listening is, in essence, the analysis of signals in order to generate information that represents certain qualities of these signals. To properly understand and evaluate the kind of information we may get from a machine listening algorithm, it is worth briefly distinguishing the different kinds of properties a signal may have. Acoustic properties refer to physical properties of sound, and consequently of audio signals, particularly qualities such as amplitude, frequency, and spectrum. Psychoacoustic properties refer to low-level perceptual properties of audio signals, such as loudness, pitch, and timbre. Psychoacoustic properties are fundamentally different from their acoustic equivalents, the former being intrinsically linked to perception. For instance, loudness refers to how loud something sounds, while amplitude stands for the actual amount of displacement of the air particles in physical space. It has to be stressed that the various psychoacoustic qualities do relate to and depend upon the acoustic properties of sound; nonetheless, the relationships are very complex and not as straightforward as they may appear to be. For example, loudness does not depend exclusively upon amplitude; it also depends upon frequency, spectral content, and even upon a series of psychological and other factors. We can also speak of several families of higher-level perceptual properties, such as musical ones (scale, tonality, rhythm, genre, expressivity, and so on), cognitive ones (semantics, symbolic signification, and so on), and psychological ones (irritability, entertainability, ability to cause relaxation, and so on). Again, such properties may depend on or relate to some extent to the acoustic or psychoacoustic qualities of sound; yet the interrelationships may be extremely complex and, in certain cases, not even fully understood.

Machine listening algorithms are not limited to simple acoustic properties of a signal; sophisticated algorithms have also been proposed for more complex problems, such as musical style recognition and rhythm extraction. As far as musical qualities are concerned, the more specialized term music information retrieval is sometimes encountered too. In SuperCollider, we can easily perform basic audio analyses to retrieve information on both physical and certain perceptual properties of audio signals using the available machine listening UGens, the most important of which are discussed next.

Music Information Retrieval (MIR) is an interdisciplinary field of science dealing with how to retrieve and classify information from music.

Tracking amplitude and loudness

To follow the amplitude of a signal, we can use the Amplitude UGen. We can also use the Peak UGen, which tracks the maximum peak amplitude and resets whenever it receives a trigger, or the PeakFollower UGen, which smoothly decays from the maximum value over some specified decay time. To track the minimum or the maximum value of a signal, we can use the RunningMin or RunningMax UGens. To track Root Mean Square (RMS) amplitude, we can use the RunningSum UGen. The following example shows how to use these UGens:

( // tracking amplitude
{
    var sound = SinOsc.ar(mul: LFNoise2.kr(1).range(0, 1)); // source
    RunningSum.rms(sound, 100).poll(label: 'rms'); // rms
    Amplitude.kr(sound).poll(label: 'peak'); // peak
    Peak.kr(sound, Impulse.kr(1)).poll(label: 'peak_trig'); // peak when triggered
    PeakFollower.kr(sound).poll(label: 'peak_dec'); // peak with decay
    RunningMin.kr(sound).poll(label: 'min'); // minimum
    RunningMax.kr(sound).poll(label: 'max'); // maximum
    Out.ar(0, sound); // write to output
}.play;
)
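To make the difference between peak and RMS amplitude more concrete, here is a small language-side sketch of the arithmetic involved. It is not how the UGens are implemented; the 100-sample cycle and the variable names are assumptions chosen for illustration only:

```supercollider
(
// one full cycle of a sinusoid, sampled at 100 points (an arbitrary choice)
var cycle = Array.fill(100, { |i| sin(2pi * i / 100) });
// peak amplitude: the largest absolute sample value
var peak = cycle.abs.maxItem;
// RMS amplitude: square every sample, take the mean, then the square root
var rms = (cycle.squared.sum / cycle.size).sqrt;
("peak: " ++ peak).postln;
("rms: " ++ rms).postln;
)
```

For a pure sinusoid the RMS value should come out at roughly 0.707 times the peak value, which is why the 'rms' and 'peak' readings in the preceding example are not expected to match.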

Sometimes we may want something to happen when a signal is silent, or at least when it is below a certain level. In such cases, we can use DetectSilence. There is also a Loudness UGen, which estimates loudness in sones (a unit of perceived loudness). It is designed to analyze spectra and requires an FFT window of size 1024 for sampling rates of 44100 or 48000, and of size 2048 for 88200 or 96000. For example:

( // track loudness
{
    var sound, chain, loudness;
    sound = SinOsc.ar(LFNoise2.ar(1).range(100, 10000), mul: LFNoise0.ar(1).range(0, 1)); // source
    chain = FFT(LocalBuf(1024), sound); // sampling rates of 44.1/48K
    // chain = FFT(LocalBuf(2048), sound); // sampling rates of 88.2/96K
    loudness = Loudness.kr(chain).poll(label: 'loudness');
    Out.ar(0, sound);
}.play;
)
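DetectSilence, mentioned above, can be sketched in a similar way. In the following minimal example, the threshold of 0.001, the hold time of half a second, and the percussive test tone are all assumptions made for illustration; here the thing that "happens" on silence is simply that the synth frees itself:

```supercollider
(
// a percussive tone that frees its own synth once it has faded out
{
    var sound = SinOsc.ar(440, mul: EnvGen.kr(Env.perc(0.01, 2, 0.3)));
    // once the signal has stayed below 0.001 for half a second,
    // doneAction 2 frees the enclosing synth
    DetectSilence.ar(sound, amp: 0.001, time: 0.5, doneAction: 2);
    Out.ar(0, sound);
}.play;
)
```

Note that DetectSilence only starts looking for silence after the input has first risen above the threshold, so a signal that is silent from the very beginning will not trigger it.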

Tracking frequency

As far as frequency is concerned, there are a number of relevant UGens, each implemented differently. The simplest one is ZeroCrossing, which estimates the frequency by keeping track of how often an input signal crosses the horizontal axis, which represents 0 in terms of amplitude. Pitch is a more accurate frequency tracker, which also allows for some tweaking. Note that, despite its name, it performs frequency tracking rather than pitch tracking; the latter also depends on a series of other factors. More advanced frequency trackers are Tartini (which is based on the method used in the open source pitch tracker of the same name) and Qitch (which has to be used along with one of the special auxiliary WAV files it is distributed with). Tartini and Qitch are not included in the standard SuperCollider distribution but in the SC3Plugins extension bundle (available at http://sc3-plugins.sourceforge.net/). Pitch, Tartini, and Qitch all return an array of OutProxy instances containing both the estimated frequency and a flag of 1 or 0 that denotes whether a frequency was successfully tracked. When attempting to track frequency, we should always bear in mind that, this being a complicated process, not all trackers work equally well for all kinds of signals. For example:

( // frequency tracking
// path to the auxiliary WAV file for Qitch
var qitchBuffer = Buffer.read(Server.default, "/Users/marinos/Library/Application Support/SuperCollider/Extensions/SC3plugins/PitchDetection/extraqitchfiles/QspeckernN2048SR44100.wav");
{
    // a complex signal
    var sound = Saw.ar(LFNoise2.ar(1).range(500, 1000).poll(label: 'ActualFrequency')) + WhiteNoise.ar(0.4);
    ZeroCrossing.ar(sound).poll(label: 'ZeroCross');
    Pitch.kr(sound).poll(label: 'Pitch');
    Tartini.kr(sound).poll(label: 'Tartini');
    Qitch.kr(sound, qitchBuffer).poll(label: 'Qitch');
    Out.ar(0, sound!2);
}.play;
)

For this signal, Qitch is probably the most reasonable choice, judging by the output on my machine:

ActualFrequency: 864.222
ZeroCross: 6368.27
Pitch: 171.704
Pitch: 1
Tartini: 95.0917
Tartini: 1
Qitch: 845.466
Qitch: 1
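The ZeroCross reading is far above the actual frequency, which is plausible given how zero-crossing estimation works. The following language-side sketch illustrates the principle only (it is not the ZeroCrossing UGen's implementation): it counts the upward zero crossings in one second of a clean sinusoid. The sampling rate, the 200 Hz test frequency, and the small phase offset are assumptions made for illustration:

```supercollider
(
var sampleRate = 44100; // assumed sampling rate
var freq = 200; // frequency of the clean test signal in Hz
// one second of a sinusoid, with a small phase offset so that no sample sits exactly on zero
var samples = Array.fill(sampleRate, { |i| sin((2pi * freq * i / sampleRate) + 0.1) });
var crossings = 0;
// count every transition from a negative sample to a non-negative one
samples.doAdjacentPairs { |a, b|
    if ((a < 0) and: { b >= 0 }) { crossings = crossings + 1 };
};
("upward crossings per second: " ++ crossings).postln;
)
```

For a clean sinusoid, the count should be very close to the frequency in Hz. Any added noise introduces extra crossings and inflates the estimate, which is consistent with the reading obtained for the noisy sawtooth above.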

Timbre analysis and feature detection

Timbre is a psychoacoustic quality and refers to what makes sounds distinct even if they have the same loudness and pitch. Of course, this is a broad oversimplification of a very complex subject; in reality there isn't even a consensus on what exactly timbre stands for. While timbre has been proposed to depend on several qualities, in a machine listening context timbre recognition is almost exclusively based on analyzing spectra. Herein, we will focus on how to broadly detect several spectral features, rather than timbre per se, which is a rather indefinite quality. By the term feature, we refer to anything that could be characteristic of a signal's spectrum.

In SuperCollider there is a plethora of relevant UGens, both in the standard distribution and in extension libraries. Among the most useful are SpecCentroid and SpecFlatness, used to calculate the spectral centroid and the spectral flatness, respectively. The former roughly stands for the most perceptually prominent frequency range in our signal, while the latter is an indicator of how noise-like our signal is (for example, for a sinusoid it would be close to 0, while for white noise it would be close to 1). The SpecPcile UGen calculates the cumulative distribution of a spectrum and, given a fraction of spectral energy as an argument, returns the frequency below which that fraction of the spectral energy lies. In the SC3Plugins extensions bundle, we will also find the FFTCrest UGen, which calculates the spectral crest of a signal (in short, an indication of how flat or peaky its spectrum is), and the SensoryDissonance UGen, which attempts to calculate how dissonant a signal is (with 1 being totally dissonant and 0 being totally consonant). The FFTSpread UGen measures the spectral spread of a signal, that is, how wide or narrow its spectrum is, and FFTSlope calculates the slope of the linear correlation line derived from the spectral magnitudes. Finally, the Goertzel UGen calculates the magnitude and phase at a single specified frequency. For example:

( // feature extraction
{
    // a complex signal
    var sound = SinOsc.ar(240, mul: 0.5)
        + Resonz.ar(ClipNoise.ar, 2000, 0.6, mul: SinOsc.kr(0.05).range(0, 0.5))
        + Saw.ar(2000, mul: SinOsc.kr(0.1).range(0, 0.3));
    var fft = FFT(LocalBuf(2048), sound);
    SpecCentroid.kr(fft).poll(label: 'Centroid');
    SpecFlatness.kr(fft).poll(label: 'Flatness');
    SpecPcile.kr(fft, 0.8).poll(label: 'Percentile');
    FFTCrest.kr(fft, 1800, 2200).poll(label: 'Crest');
    SensoryDissonance.kr(fft).poll(label: 'Dissonance');
    Out.ar(0, sound!2);
}.play;
)
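To give a rough idea of what the centroid and flatness measures stand for, the following language-side sketch applies their conventional textbook definitions to a tiny, made-up spectrum. The four frequency bins and their magnitudes are hypothetical values, and the actual UGens may differ in detail (for instance, in how the bins are weighted):

```supercollider
(
// a hypothetical four-bin spectrum: bin frequencies in Hz and their magnitudes
var freqs = [100, 200, 300, 400];
var mags = [1.0, 0.5, 0.25, 0.125];
// spectral centroid: the magnitude-weighted mean of the bin frequencies
var centroid = (freqs * mags).sum / mags.sum;
// spectral flatness: geometric mean of the power spectrum divided by its arithmetic mean
var power = mags.squared;
var flatness = power.log.mean.exp / power.mean;
("centroid: " ++ centroid).postln;
("flatness: " ++ flatness).postln;
)
```

Because most of the energy in this made-up spectrum sits in the lowest bins, the centroid lands towards the low end of the 100 to 400 Hz range. Setting all four magnitudes to the same value would give a flatness of 1, the value associated with a perfectly flat spectrum.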


In this article, we discussed machine listening techniques and ways to retrieve information from audio signals.
