I know that cosine similarity can be used to measure how similar two images or audio clips are.
But I don't understand how an image can be represented as an N-dimensional vector. For a text document d, each i-th dimension represents the term t_i, and its scalar component represents that term's frequency inside the document. The problem is that I cannot figure out the same "mapping" for an image (or audio) file.
The only solution that crosses my mind is a vector in M dimensions, where M is the number of pixels in the image (millions of dimensions? That's insane!), and each value is how dark the pixel is, with a maximum value representing white. But I strongly suspect this is not the solution actually used, and I have no idea how this could be done for an audio file.
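The naive pixel mapping described above can at least be sketched directly: flatten the pixel grid into one long vector, then take the cosine similarity of two such vectors. A minimal NumPy illustration (the tiny 2x2 grayscale "images" here are made up for demonstration):

```python
import numpy as np

# Two hypothetical 2x2 grayscale images (0 = black, 255 = white),
# flattened into 4-dimensional vectors.
a = np.array([[0, 128], [128, 255]], dtype=float).ravel()
b = np.array([[10, 120], [140, 250]], dtype=float).ravel()

def cosine_similarity(u, v):
    # dot product divided by the product of the vector norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

sim = cosine_similarity(a, b)
print(sim)  # close to 1.0, since the images differ only slightly
```

The same idea extends to any fixed-length vector representation, which is why the answers below focus on how to build that vector.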
Hilbert Curve ... a space-filling curve which maps a 2D image onto a 1D line ... each pixel is visited once and only once, in a spatial pattern that preserves locality and nicely handles changes to image resolution ... at each pixel the intensity is recorded ... the resulting 1D line is your vector, ready for a dot product (and hence cosine similarity) with a line generated from another source image using the same technique
use this formula to compute pixel intensity (luma Y) from the source image's RGB values:
Y = 0.2126 * R + 0.7152 * G + 0.0722 * B
So for each pixel in the source image we compute its Y value and use it to populate the corresponding position in our 1D vector (where the pixel order is given by our Hilbert curve traversal of the image), repeating this across every pixel in the source image
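The steps above can be sketched in Python. The `d2xy` routine is the standard iterative Hilbert-curve decoding (it requires the side length `n` to be a power of two); the n x n RGB image array is a hypothetical input, not anything from the answer:

```python
import numpy as np

def d2xy(n, d):
    """Convert distance d along a Hilbert curve covering an n x n grid
    (n a power of two) into (x, y) grid coordinates."""
    x = y = 0
    s = 1
    t = d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:              # rotate the quadrant when needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def image_to_vector(rgb, n):
    """Walk the Hilbert curve over an n x n RGB image, recording the
    luma Y of each pixel, to produce a 1D vector of length n*n."""
    vec = np.empty(n * n)
    for d in range(n * n):
        x, y = d2xy(n, d)
        r, g, b = rgb[y, x]
        vec[d] = 0.2126 * r + 0.7152 * g + 0.0722 * b
    return vec
```

Because the curve visits every pixel exactly once, two images of the same size always yield vectors of the same length, which is what the cosine similarity comparison needs.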
Let's say our image is 16 by 16, so we have 256 pixels represented in our line by 256 equally spaced points ... if we choose to generate audio from our image we could place a sine wave oscillator at each of these 256 points and drive the volume of each oscillator by that point's pixel intensity measurement ( Y ) ... concomitantly we drive each oscillator's frequency by its position in the line ... low to high across a band of the human hearing spectrum (say 200 Hz to 2 kHz) over the length of the line ... introduce time by generating audio for a short while ... at each instant of time add together the curve height across all oscillators and divide by 256 to get our audio samples ... this audio is the sonic mapping of our source image ... this transformation is reversible ... we can just as easily start with audio and generate an image ... with our 1D vector of ( Y ) values as the intermediator
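The oscillator bank can be sketched as follows. The sample rate, clip duration, and exact frequency band are my assumptions, chosen only to match the ranges suggested above:

```python
import numpy as np

def line_to_audio(y_values, sr=44100, duration=0.5, f_lo=200.0, f_hi=2000.0):
    """One sine oscillator per point on the line: amplitude comes from
    the pixel intensity Y, frequency from the point's position (low to
    high across the line). Sum all oscillators at each sample instant
    and divide by the point count."""
    n = len(y_values)
    t = np.arange(int(sr * duration)) / sr          # time axis in seconds
    freqs = np.linspace(f_lo, f_hi, n)              # one frequency per point
    audio = np.zeros_like(t)
    for amp, f in zip(y_values, freqs):
        audio += amp * np.sin(2.0 * np.pi * f * t)
    return audio / n
```

Dividing by the oscillator count keeps the summed signal within the amplitude range of a single oscillator, so the output never clips regardless of the image content.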
Here is an excellent clip on this idea https://www.youtube.com/watch?v=DuiryHHTrjU
Importantly, this technique is entirely reversible ... if we start with audio we can generate an image, and in so doing we gain access to the intermediator vector ... do a Fast Fourier Transform (FFT) on a short audio clip to transform it from the time domain into its frequency-domain counterpart ... this results in a set of frequencies, each with an amplitude value ... each frequency is mapped to a position in our intermediator vector to represent an output pixel ... the output pixel's intensity value is driven by the FFT amplitude at that frequency ... then run the Hilbert Curve in reverse to map our 1D vector line into an output 2D image
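A sketch of the audio-to-vector step with NumPy's FFT. Interpolating the magnitude spectrum at the oscillator frequencies is my assumption about how FFT bins map back to line positions; the band 200 Hz to 2 kHz matches the one suggested above:

```python
import numpy as np

def audio_to_line(audio, n_points, sr=44100, f_lo=200.0, f_hi=2000.0):
    """FFT the clip, then read the magnitude at each oscillator
    frequency to recover the 1D intensity vector."""
    spectrum = np.abs(np.fft.rfft(audio))           # magnitude per FFT bin
    bin_freqs = np.fft.rfftfreq(len(audio), 1.0 / sr)
    target_freqs = np.linspace(f_lo, f_hi, n_points)
    # FFT bins rarely land exactly on the oscillator frequencies,
    # so interpolate the magnitude spectrum at the frequencies we want
    return np.interp(target_freqs, bin_freqs, spectrum)
```

Feeding the recovered vector back through the Hilbert-curve mapping in reverse (position d on the line back to pixel (x, y)) then yields the output 2D image.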