The EchoNest Analyzer Documentation states the following regarding timbre:
timbre is the quality of a musical note or sound that distinguishes different types of musical instruments, or voices. It is a complex notion also referred to as sound color, texture, or tone quality, and is derived from the shape of a segment’s spectro-temporal surface, independently of pitch and loudness. The Echo Nest Analyzer’s timbre feature is a vector that includes 12 unbounded values roughly centered around 0. Those values are high level abstractions of the spectral surface, ordered by degree of importance. For completeness however, the first dimension represents the average loudness of the segment; second emphasizes brightness; third is more closely correlated to the flatness of a sound; fourth to sounds with a stronger attack; etc. See an image below representing the 12 basis functions (i.e. template segments). The actual timbre of the segment is best described as a linear combination of these 12 basis functions weighted by the coefficient values: timbre = c1 x b1 + c2 x b2 + ... + c12 x b12, where c1 to c12 represent the 12 coefficients and b1 to b12 the 12 basis functions as displayed below. Timbre vectors are best used in comparison with each other.
My understanding is that the b vector ({b1...b12}) is what is being returned by your API's getTimbre method. But then where are the {c1...c12} coefficients coming from? I don't understand how to acquire a scalar timbre from a vector timbre (primarily because your analysis API is closed source). Can you help me out with this?
Note that answers on this website come from volunteers. To get official support for the library, you need to contact the publisher directly.
b1 … b12 are not the result of the audio analysis; they are merely descriptive of what the analysis does. They are fixed constants, as shown in the diagram:
The vector of scalars c1 … c12 is what the analyzer produces. Of course, a sound cannot be perfectly described by only 12 numbers. Multiplying the scalars by the basis functions and summing them won't reproduce the original audio, because too much information has been discarded; it's only an approximation. Possibly, though, you'll get a similar "mood" from each segment, so it could be interesting to try and listen.
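To make the linear combination concrete, here is a minimal sketch in Python/NumPy. Note the assumptions: the real basis surfaces b1 … b12 are internal to the analyzer and not exposed by the API, so the `basis` array below is just a random placeholder, and the shape I give it (time bins x frequency bins) is made up for illustration. Only the per-segment coefficient vector corresponds to something the API actually returns.

```python
import numpy as np

N_BASES = 12
TIME_BINS, FREQ_BINS = 50, 23   # assumed shape of each basis surface (illustrative only)

# Placeholder stand-ins for the analyzer's fixed basis functions b1 .. b12.
rng = np.random.default_rng(0)
basis = rng.normal(size=(N_BASES, TIME_BINS, FREQ_BINS))

def reconstruct_surface(coeffs, basis):
    """Approximate a segment's spectro-temporal surface as the weighted sum
    timbre = c1*b1 + c2*b2 + ... + c12*b12."""
    coeffs = np.asarray(coeffs)
    # Sum over the 12 bases: result has the shape of one basis surface.
    return np.tensordot(coeffs, basis, axes=1)

# Example 12-value coefficient vector for one segment (made-up numbers,
# in the spirit of what the analyzer returns per segment).
segment_timbre = [24.7, 11.3, -5.2, 3.1, 0.4, -2.2, 1.8, -0.9, 0.3, 1.1, -0.6, 0.2]

surface = reconstruct_surface(segment_timbre, basis)
print(surface.shape)  # (50, 23): an approximate spectro-temporal surface
```

As the documentation says, the coefficient vectors are most useful in comparison with each other, for example by taking the Euclidean distance between two segments' 12-value vectors rather than trying to reconstruct audio from them.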