audio feature-extraction audeering-opensmile

How to create custom config files in OpenSMILE

I am trying to extract some features from an audio sample using OpenSMILE, but I'm realizing how difficult it is to set up a config file.

The documentation is not very helpful. The best I could do was run some of the sample config files that are provided, see what came out, and then go into the config file and try to determine where the feature was specified. Here's what I did:

I used the default feature set from the INTERSPEECH 2010 Paralinguistic Challenge (IS10_paraling.conf).

I ran it over a sample audio file.

I looked at what came out. Then I read the config file in depth, trying to find out where the feature was specified.

Here's a little markdown table showing the results of my exploration:

| Feature generated | instruction in the conf file                            |
|-------------------|---------------------------------------------------------|
| pcm_loudness      | I see: 'loudness=1'                                     |
| mfcc              | I see a section: [mfcc:cMfcc]                           |
| lspFreq           | no matches for the text 'lspFreq' anywhere              |
| F0finEnv          | I see F0finalEnv = 1 under [pitchSmooth:cPitchSmoother] |

What I see is 4 different features, each generated by a different kind of instruction in the config file. For one of them, I could not find any discernible instruction in the config file at all. With no pattern, intuitive syntax, or apparent system, I have no idea how to specify the features I want to generate myself.

There are no tutorials, YouTube videos, StackOverflow questions, or blog posts explaining how this could be done, which is really surprising since this is obviously a huge part of using OpenSMILE.

If anyone finds this, please, can you advise me on how to create custom config files of OpenSMILE? Thanks!

Solution

Thanks for your interest in openSMILE and your eagerness to build your own configuration files.

Most users in the scientific community actually use openSMILE for its pre-defined config files for the baseline feature sets, which in version 2.3 are even more flexible to use (more command-line options to output to different file formats etc.).

I admit that the documentation provided is not as good as it could be. However, openSMILE is a very complex piece of software with a lot of functionality, of which only the most important parts are currently well documented.

The best starting point would be to read the openSMILE book and the SIG'MM tutorials, all referenced at http://opensmile.audeering.com/ . The book contains a section on how to write configuration files. The next important element is the online help of the binary:

- `SMILExtract -L` lists the available components.
- `SMILExtract -H cComponentName` lists all options which a given component supports (and thus also the features it can extract), with a short description for each.
- `SMILExtract -configDflt cComponentName` gives you a template configuration section for the component, with all options listed and defaults set.
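
If you want to collect this help text from a script, the three commands above can be wrapped as in the following minimal Python sketch. It assumes the binary is called `SMILExtract` and is on your `PATH`; the helper names `help_commands` and `run_help` are made up for this example, and both output streams are captured because I am not asserting which one the help text is written to.

```python
import shutil
import subprocess

SMILEXTRACT = "SMILExtract"  # assumption: binary name as found on PATH


def help_commands(component):
    """Return the three help invocations for one component name."""
    return [
        [SMILEXTRACT, "-L"],                      # list available components
        [SMILEXTRACT, "-H", component],           # options of one component
        [SMILEXTRACT, "-configDflt", component],  # template config section
    ]


def run_help(component):
    """Run each help command; return the captured texts, or None if the binary is missing."""
    if shutil.which(SMILEXTRACT) is None:
        return None
    outputs = []
    for cmd in help_commands(component):
        # Capture both streams and concatenate them.
        result = subprocess.run(cmd, capture_output=True, text=True)
        outputs.append(result.stdout + result.stderr)
    return outputs


if __name__ == "__main__":
    texts = run_help("cMfcc")
    if texts is None:
        print("SMILExtract not found on PATH")
    else:
        print(texts[2])  # template configuration section for cMfcc
```

Pasting the `-configDflt` output into your config file and then deleting the options you do not need is usually quicker than writing a section from scratch.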

Due to the architecture of openSMILE, which is centered on incremental processing of all audio features, there is no easy syntax (at least not yet) to define the features you want. Rather, you define the processing chain by adding components:

- data sources read in data (from audio files, csv files, or a microphone, for example),
- data processors do signal processing and feature extraction in individual steps, with one data processor per step (for example, extracting MFCC involves windowing, window function, FFT, magnitudes, mel-spectrum, and cepstral coefficients),
- data sinks write data to output files, send results to a server, etc.

You connect the components via the "reader.dmLevel" and "writer.dmLevel" options. These define the name of a data memory level that the components use to exchange data. Only one component may write to a given level, i.e. writer.dmLevel=levelName defines the level and may appear only once. Multiple components can read from this level by setting reader.dmLevel=levelName.
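
To make the wiring concrete, here is a rough Python sketch that embeds a small configuration (wave file, then frames, then energy, then CSV) as a string and checks the single-writer rule described above. The instance names, level names, file names, and the `rms`/`log` options are illustrative assumptions rather than a tested config, and `check_wiring` is a hypothetical helper, not part of openSMILE. It also assumes that several input levels in `reader.dmLevel` are separated by `;`.

```python
from collections import defaultdict

# Illustrative chain: wave file -> frames -> energy -> CSV.
# Instance names, level names and component options are assumptions.
MINIMAL_CONF = """
[componentInstances:cComponentManager]
instance[dataMemory].type=cDataMemory
instance[waveIn].type=cWaveSource
instance[frame].type=cFramer
instance[energy].type=cEnergy
instance[csvOut].type=cCsvSink

[waveIn:cWaveSource]
writer.dmLevel=wave
filename=input.wav

[frame:cFramer]
reader.dmLevel=wave
writer.dmLevel=frames
frameSize=0.025
frameStep=0.010

[energy:cEnergy]
reader.dmLevel=frames
writer.dmLevel=energy
rms=1
log=1

[csvOut:cCsvSink]
reader.dmLevel=energy
filename=output.csv
"""


def check_wiring(conf_text):
    """Return a list of problems found in the reader/writer level wiring."""
    writers = defaultdict(list)  # level name -> sections that write to it
    readers = []                 # (section, level name) pairs
    section = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";") or line.startswith("//"):
            continue  # skip blank lines and comment lines
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif "=" in line and section is not None:
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "writer.dmLevel":
                writers[value].append(section)
            elif key == "reader.dmLevel":
                # Assumed: several input levels are separated by ';'.
                for level in value.split(";"):
                    if level.strip():
                        readers.append((section, level.strip()))
    problems = []
    for level, sections in writers.items():
        if len(sections) > 1:
            problems.append(f"level '{level}' has {len(sections)} writers: {sections}")
    for sec, level in readers:
        if level not in writers:
            problems.append(f"{sec} reads undefined level '{level}'")
    return problems


print(check_wiring(MINIMAL_CONF))  # an empty list means the chain is consistent
```

Each section header has the form `[instanceName:cComponentType]`, and every instance must also be registered in the `componentInstances` section. Reading the chain top to bottom by following the level names is the easiest way to understand one of the shipped config files.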

In each component you then set the options to enable computation of features and set parameters for this. To answer your question about lspFreq: This is probably enabled by default in the cLsp component, so you don't see an explicit option for it. For future versions of openSMILE the practice of setting all options explicitly will and should be followed more tightly.

The names of the features in the output are defined automatically by the components. Often each component adds a part to the name, so you can infer the full chain of processing from the name. The options nameAppend and copyInputName (available to most data processors) control this behaviour, although some components might internally override them or change the behaviour a bit.

To see the names (and other info) for each data memory level, including e.g. which features a component in the configuration produces, you can set the option "printLevelStats=5" in the section of componentInstances:cComponentManager.

As everything in openSMILE is built for real-time incremental processing, each data memory level has a buffer, which by default is a ring buffer to keep the memory footprint constant when the application runs for a longer time. Sometimes you might want to summarise features over a window of a given length (e.g. with the cFunctionals component). In this case you must ensure that the buffer size of the input level to this component is large enough to hold the full window. You do this via the following options:

- `writer.levelconf.isRb = 1/0`: sets the type of buffer to ring buffer (1) or fixed-size buffer (0).
- `writer.levelconf.growDyn = 1/0`: sets the buffer to grow dynamically if more data is written to it (1).
- `writer.levelconf.nT`: sets the size of the buffer in frames. Alternatively, you can use `bufferSizeSec=x` to set the size in seconds, which is converted to frames automatically.

In most cases the sizes will be set correctly automatically. Subsequent levels also inherit the configuration from the previous levels. The exceptions are when you set a cFunctionals component to read the full input (e.g. to produce only one feature vector at the end of the file), in which case you must use growDyn=1 on the level that the functionals component reads from, or when you use a variable framing mode (see below).
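
As a worked example of the sizing rule, the following Python sketch estimates how many frames a level must hold for a summary window of a given length. It assumes the level receives one frame per frame step; `frames_needed` and `buffer_holds_window` are hypothetical helpers, and openSMILE's own seconds-to-frames conversion may round differently.

```python
import math


def frames_needed(window_sec, frame_step_sec):
    """Frames required to hold window_sec of data at one frame per frame_step_sec."""
    # round() guards against floating-point noise in the division.
    return math.ceil(round(window_sec / frame_step_sec, 9))


def buffer_holds_window(n_t, window_sec, frame_step_sec):
    """True if a level with n_t frames can hold the whole window."""
    return n_t >= frames_needed(window_sec, frame_step_sec)


# A 5 s summary window over 10 ms frames needs 500 frames on the input level.
print(frames_needed(5.0, 0.010))
# A level with only 100 frames would be too small for that window.
print(buffer_holds_window(100, 5.0, 0.010))
```

If the check fails for your setup, either raise nT on the level being read or switch that level to growDyn=1.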

The cFunctionals component provides frameMode, frameSize, and frameStep options, where frameMode can be **full** (one vector produced at the end of the input/file), **list** (specify a list of frames), **var** (receive messages, e.g. from a cTurnDetector component, that define frames on the fly), or **fix** (fixed-length window). Only in the case of fix does frameSize set the size of this window and frameStep the rate at which the window is shifted forward. In the case of fix, the buffer size of the input level is set correctly automatically; in the other cases you have to set it manually.
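
To illustrate how frameSize and frameStep interact in fix mode, here is a small Python sketch that lists the windows a fixed framing would cover. It assumes both values are given in seconds and that a trailing partial window is simply dropped; how openSMILE actually treats the end of the input is not covered here, and `fixed_windows` is a hypothetical helper.

```python
def fixed_windows(total_sec, frame_size, frame_step):
    """Return (start, end) times of every full window that fits in total_sec."""
    windows = []
    i = 0
    # The small tolerance keeps a window that ends exactly at total_sec.
    while i * frame_step + frame_size <= total_sec + 1e-9:
        start = i * frame_step
        windows.append((round(start, 6), round(start + frame_size, 6)))
        i += 1
    return windows


# 10 s of input, 4 s windows shifted by 2 s: one output vector per window.
print(fixed_windows(10.0, 4.0, 2.0))
# -> [(0.0, 4.0), (2.0, 6.0), (4.0, 8.0), (6.0, 10.0)]
```

With frameStep smaller than frameSize the windows overlap, so each input frame contributes to more than one output vector.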

I hope this helps you to get started! With every new openSMILE release we at audEERING are trying to document things a bit better and unify things through various components.

We also welcome contributions from the community (e.g. anybody willing to write a graphical configuration file editor where you drag/drop components and connect them graphically? ;)) - although we know that more documentation will make this easier. Until then, you always have the source code to read ;)

Cheers, Florian