So, I have finally found some time to brush up on Python and its visualization features, with a particular focus on sound. Below is my attempt at putting together a waterfall spectrogram using plot_surface; it is partly based on chunks of code found on the web.
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal # spectrogram function
from matplotlib import cm # colour map
import pyaudio as pa
import time
import struct
CHUNK = 1024
FORMAT = pa.paInt16
CHANNELS = 1
RATE = 44100 # in Hz
LENGTH = 2 # in seconds
p = pa.PyAudio()
stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    output=True,
    frames_per_buffer=CHUNK
)
# 3d plot
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.set_xlim(RATE/2,0)
# ---- OPTION QUESTION COMING UP BELOW
# this works better for option 1.
ax.set_zlim(-40,40)
# this works better for option 2. Why?
#ax.set_zlim(0,100)
# ---- END OF OPTION QUESTION
totaldata = bytearray(RATE*LENGTH) # this zeros out the entire array
sf = None
while 1:
    data = stream.read(CHUNK)
    # ---- 2 OPTIONS COMING UP BELOW (UNCOMMENT ONE OR THE OTHER, AND MATCH IT WITH THE ONE ABOVE)
    # OPTION 1 (CHUNK of data only--this works fine)
    # prepare just this chunk for display
    numpydata = np.frombuffer(data, dtype=np.int16)
    ## # OPTION 2 (1 second of data--this has huge walls against the edges of each chunk?)
    ## # pop the first CHUNK-sized piece of data
    ## totaldata = totaldata[((CHUNK*2)-1):]
    ## # append the new data
    ## totaldata[((RATE*LENGTH)-(CHUNK*2)):] = data
    ## # load the data (option 1, one CHUNK)
    ## #numpydata = np.frombuffer(data, dtype=np.int16)
    ## # (option 2, 1 second)
    ## numpydata = np.frombuffer(totaldata, dtype=np.int16)
    # ---- END OF TWO OPTIONS
    # analyze the data: frequency bins, segment timestamps, and the power spectrum
    freq_bins, timestamps, spec = signal.spectrogram(numpydata, RATE)
    # remove the previously drawn surface (it does not exist yet on the first pass)
    if sf:
        sf.remove()
    # plot the new surface, with power converted to dB
    sf = ax.plot_surface(freq_bins[:, None], timestamps[None, :], 10.0*np.log10(spec), cmap='inferno')
    plt.show(block=False)
    plt.pause(0.01)
Please note there are two options annotated in the code, with the first option enabled by default. The first option takes only the single chunk of audio just captured from the microphone and immediately visualizes it. This technically looks like a waterfall, albeit a microscopic one, and it works fine (pictured below):
However, the second option (note the two annotated places in the code where you need to comment the first option out and uncomment the second: the first is the set_zlim call, the second is inside the while loop) keeps a 1-second rolling buffer by popping a chunk-sized piece of old data from the front and appending the new data at the end. With this option enabled, I get tons of spikes surrounding each window. If you look at the surface directly from the bottom or top, you can see that inside the valleys the data still shows correctly, but the surrounding edges are so tall that it is impossible to see the same data from other angles. Looking from the side, you can see how each chunk is surrounded by these same spikes. Please see the images below showing the issue:
Isometric view:
Side view:
View from the bottom showing how you can still see a capture of my whistling, showing there is a steady frequency buried between the spiky mountains:
I tried changing nfft and nperseg and, perhaps most importantly, assigned different windowing functions (e.g. hamming). While this affects the overall look of the spectrogram, none of it fixes the spiky appearance (a sketch of how I was passing these parameters is included further below). So, here are several questions:
1. What can I do to smooth out these spiky edges in the spectrogram?
2. Is this potentially because I am interrupting the reads from the microphone with calculations and redraws, which may result in skipped samples that in turn cause these spiky walls?
3. Why does the z axis span such profoundly different ranges for a single audio chunk versus 1 second of data (see the first commented OPTION question in the code)?
4. Why is the microphone chunk actually twice the requested chunk size? There is one channel of audio and the requested chunk is 1024 samples, yet when I check the length of the returned data I get 2048.
5. By extension, why is the graphed length of the y axis (time) half as long as the requested size of totaldata (RATE times LENGTH)?
It seems to me this is a windowing issue and/or an issue of skipped samples in the captured stream (which are then seamlessly appended inside the totaldata bytearray), but I have no idea how to address it. Question 1 is the main question; the remaining questions are related but potentially independent of the first.
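For reference, this is roughly how I was calling signal.spectrogram while experimenting with those parameters; the specific window, nperseg, noverlap, and nfft values below are only illustrative examples, not the settings used for the screenshots above:
# example of the parameter experiments mentioned above (values are illustrative only)
freq_bins, timestamps, spec = signal.spectrogram(
    numpydata,
    fs=RATE,
    window='hamming',   # also tried other windows besides the default ('tukey', 0.25)
    nperseg=256,
    noverlap=128,
    nfft=512
)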
Thank you for your help.
So, I finally figured out the answer to the 1st question. The offending line is:
totaldata = totaldata[((CHUNK*2)-1):]
Early on I thought I would need to subtract 1 from the CHUNK*2 pop at the front of the totaldata buffer, given that array indices start at [0]. It turns out that removing the -1 from that one line solves questions 1-3, resulting in a (mostly) correct spectrogram. The answer to question 2 is therefore "no, it is not because of those potential reasons", and the answer to question 3 is that the erroneous data was causing the aforementioned problems. That still leaves questions 4 and 5, though.
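For clarity, the corrected pair of lines inside the loop (with the -1 removed) looks like this:
# slice the oldest CHUNK*2 bytes off the front of the rolling buffer...
totaldata = totaldata[(CHUNK*2):]
# ...and append the newest CHUNK*2 bytes of microphone data at the end
totaldata[((RATE*LENGTH)-(CHUNK*2)):] = data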
I do now occasionally see a vertical artifact (as if there were a split second of broadband noise spanning 0 to RATE/2 Hz). I am thinking this may have to do with windowing, overlap, and possibly imperfections in the calculation. I would love to hear what others have to say about that...