python, scipy, data-analysis, curve-fitting, gaussian

How to regress multiple Gaussian peaks from a spectrum using scipy?


I have this code that aims to fit the data available here: DriveFilelink

The function read_numeric_data is used to extract the information in a structured manner. However, I'm encountering problems with the Gaussian fit, evident from the discrepancies observed in the Gaussian fit image. The issue appears to stem from the constraints imposed on certain parameters: the means of all the Gaussians and three out of the four amplitudes are fixed. These constraints are nevertheless necessary, as they are based on available information.

Does anyone know how I can fix this problem in order to get a better fit?

The code is the following:

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
from scipy.optimize import curve_fit
from google.colab import drive
import os
# Mount Google Drive
drive.mount('/content/drive')


# Function to read numeric data from a file
def read_numeric_data(file_path):
    with open(file_path, 'r', encoding='latin1') as f:
        data = []
        live_time = None
        real_time = None
        for line in f:
            if 'LIVE_TIME' in line:
                live_time = line.strip()
            elif 'REAL_TIME' in line:
                real_time = line.strip()
            elif all(char.isdigit() or char in ".-+e" for char in line.strip()):
                row = [float(x) for x in line.split()]
                data.extend(row)
    return np.array(data), live_time, real_time



file_path = '/content/drive/MyDrive/ProjetoXRF_Manuel/April2024/20240402_In.mca'
data, _, _ = read_numeric_data(file_path)

a = -0.0188026396003431
b = 0.039549044037714
Data = data





# Model: a sum of four Gaussians with fixed means and fixed amplitude
# ratios. Note that only params[0] (the first amplitude) and the four
# sigmas (params[2], params[5], params[8], params[11]) are actually used.
def multi_gaussian(x, *params):
    eps = 1e-10
    y = np.zeros_like(x)
    amplitude_relations = [1, 0.5538, 0.1673, 0.1673 * 0.5185]
    meanvalues = [24.210, 24.002, 27.276, 27.238]
    amplitude = params[0]

    for i in range(4):
        sigma = params[i * 3 + 2]  # fitted standard deviation of this Gaussian
        y += amplitude * amplitude_relations[i] * np.exp(-(x - meanvalues[i])**2 / (2 * sigma**2 + eps))
    return y


# Accumulators filled on each call of plot_convolved_spectrum
sigma = []
area = []

# Function to plot the convolved energy spectrum
def plot_convolved_spectrum(Data, a, b, i, ax=None):

    maxim = np.max(Data)
    workingarray = np.squeeze(Data)

    # Find the peaks
    peaks, _ = find_peaks(workingarray / maxim, height=0.1)
    peak_values = workingarray[peaks] / maxim
    peak_indices = peaks

    # Compute the energy values corresponding to the peaks
    energy_spectrum = a + b * peak_indices

    # Prepare the data for the fit
    data = workingarray[:885] / maxim
    data_y = data / data.max()
    data_x = a + b * np.arange(len(data_y))  # channel index -> energy (keV)

    # Fit multiple Gaussians to the energy spectrum
    peak_indices2, _ = find_peaks(data_y, height=0.1)
    peak_amplitudes = [1,0.5538, 0.1673 , 0.1673*0.5185 ]
    peak_means = [24.210, 24.002 , 27.276 , 27.238]
    peak_sigmas = [0.1] * 4

    params_init = list(zip(peak_amplitudes, peak_means, peak_sigmas))
    params_init = np.concatenate(params_init)

    # Fit the curve
    params, params_cov = curve_fit(multi_gaussian, data_x, data_y, p0=params_init)

    # Get a high-resolution interpolation of the fit
    x_fine = np.linspace(data_x.min(), data_x.max(), num=20000)

    y_fit = multi_gaussian(x_fine, *params)

    # Plot the data with Poisson (sqrt(N)) error bands
    ax.scatter(data_x, data_y, color='black', marker='o', s=20, label='Data')
    y_data_error = np.sqrt(workingarray[:885])
    ax.plot(data_x, data_y + y_data_error / maxim, color='black', linestyle='--')
    ax.plot(data_x, data_y - y_data_error / maxim, color='black', linestyle='--')

    # Plot the fit

    ax.plot(x_fine, y_fit, label="Fit", linewidth=1.5)



    # Extract the standard deviations
    sigmas_array = params[2::3]
    # Compute the average sigma
    sigma.append(np.mean(sigmas_array))

    # Plot configuration
    ax.set_xlabel('Energy (keV)')
    ax.set_ylabel('Normalized Data')
    ax.legend()
    ax.set_title('Convolved Energy Spectrum')

    # Print information
    print("Standard deviations:", sigmas_array)






fig, ax = plt.subplots()
plot_convolved_spectrum(Data, a, b, 1, ax=ax)
ax.set_xlim(22, 28)
plt.show()

Note I've experimented with setting bounds and refining initial guesses to improve the Gaussian fit. However, working with bounds isn't equivalent to fixing parameters. My aim is to keep just two kinds of free parameters: the amplitude of the first Gaussian and the standard deviation of the Gaussians. While exploring these adjustments, I'm striving to strike a balance between constraint and flexibility to achieve the desired fit accuracy.
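
For reference, this is roughly what the bounds-based attempt looks like. Here multi_gaussian_free stands for a hypothetical variant of multi_gaussian in which all 12 parameters (amplitude, mean, sigma per peak) are actually used; the means are pinned with a near-degenerate bounds interval, which only approximates fixing them:

# Sketch (hypothetical): pinning the means via near-degenerate bounds.
# curve_fit still treats the pinned values as free parameters internally.
means = [24.210, 24.002, 27.276, 27.238]
eps = 1e-9
lower, upper = [], []
for m in means:
    lower += [0.0, m - eps, 1e-3]    # amplitude, mean, sigma
    upper += [np.inf, m + eps, 2.0]
params, _ = curve_fit(multi_gaussian_free, data_x, data_y,
                      p0=params_init, bounds=(lower, upper))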

Note This is the model: f(x) = sum_i A_i * exp(-(x - x0_i)^2 / (2 * sigma_i^2))

The parameters of the Gaussian function are defined as follows: A represents the amplitude, x_0 denotes the mean value, and sigma represents the standard deviation. In the context of analyzing energy spectra, I possess prior knowledge about the mean values and the relationships governing the amplitudes. Consequently, I aim to constrain certain parameters (such as A and the means) based on this known information. Specifically, I intend to fit only the first amplitude and then use the known relationships, such as A2 = 0.1673 * A1 for the second peak, and then fit the corresponding standard deviations.
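
A minimal sketch of what I mean, assuming a single sigma shared by all four Gaussians (the name constrained_model is illustrative; data_x and data_y are the arrays built inside plot_convolved_spectrum):

# Sketch: only A1 and one shared sigma are free; means and amplitude
# ratios are fixed from prior knowledge.
def constrained_model(x, A1, s):
    ratios = [1, 0.5538, 0.1673, 0.1673 * 0.5185]
    means = [24.210, 24.002, 27.276, 27.238]
    y = np.zeros_like(x)
    for r, m in zip(ratios, means):
        y += A1 * r * np.exp(-(x - m)**2 / (2 * s**2))
    return y

popt, pcov = curve_fit(constrained_model, data_x, data_y, p0=[1.0, 0.1])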

Why use a sum of Gaussians?

Although the first peak in the plot looks like a single peak, it is not a single Gaussian: it actually consists of two Gaussians summed together. This complexity arises because the energy emission at that level comprises two distinct values, 24.002 and 24.210 keV, and due to the resolution of the experiment we are not able to distinguish them just by looking at the plot.
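
A quick numerical check of this (using the sigma = 0.1 initial guess from the code above): the two components are only 0.208 keV apart, about two sigma, so their sum has a single maximum:

# Sketch: two Gaussians 0.208 keV apart blend into one apparent peak
# when sigma ~ 0.1 keV (means and amplitude ratio taken from the model).
x = np.linspace(23.5, 24.7, 500)
g1 = 1.0 * np.exp(-(x - 24.210)**2 / (2 * 0.1**2))
g2 = 0.5538 * np.exp(-(x - 24.002)**2 / (2 * 0.1**2))
plt.plot(x, g1, '--', label='24.210 keV component')
plt.plot(x, g2, '--', label='24.002 keV component')
plt.plot(x, g1 + g2, label='sum (what is actually measured)')
plt.legend()
plt.show()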

For all these reasons, I am trying to fix some parameters.

If anyone has issues with the data, here it is:

extracted data.txt


Solution

  • Here is an MCVE showing how to regress a variable number of peaks from the significant part of your signal.

    import itertools
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy import signal, stats, optimize
    

    We load your data and post-process it to get physical series:

    data = pd.read_csv("20240402_In.mca", encoding="latin-1", names=["y"]).loc[12:2058]
    data["y"] = pd.to_numeric(data["y"])
    data["y"] /= data["y"].max()
    data["x"] = -0.0188026396003431 + 0.039549044037714 * data.index
    
    data = data.loc[(20. <= data["x"]) & (data["x"] <= 30.), :]
    

    We create the model, which accepts a variable number of peaks (driven by the find_peaks result):

    def peak(x, A, s, x0):
        # Normalized so that A is the height of the peak at x = x0
        law = stats.norm(loc=x0, scale=s)
        return A * law.pdf(x) / law.pdf(x0)
    
    def model(x, *parameters):
        n = len(parameters) // 3
        y = np.zeros_like(x)
        for i in range(n):
            y += peak(x, *parameters[(i * 3):((i + 1) * 3)])
        return y
    

    We identify peaks:

    indices, bases = signal.find_peaks(data.y, prominence=0.0125)
    #(array([ 67, 118, 195, 210]),
    # {'prominences': array([0.01987281, 0.99920509, 0.18918919, 0.03338633]),
    #  'left_bases': array([  1,   1, 179, 205]),
    #  'right_bases': array([ 76, 160, 219, 219])})
    

    This is the key point: from the identification, we create our educated guess:

    x0s = data.x.values[indices]
    As = bases["prominences"]
    ss = (data.x.values[bases["right_bases"]] - data.x.values[bases["left_bases"]]) / 8.
    p0 = list(itertools.chain(*zip(As, ss, x0s)))
    #[0.01987281399046105,
    # 0.37077228785356864,
    # 22.682348638047493,
    # 0.9992050874403816,
    # 0.7860372502495658,
    # 24.699349883970907,
    # 0.1891891891891892,
    # 0.19774522018856988,
    # 27.744626274874886,
    # 0.033386327503974564,
    # 0.06921082706599968,
    # 28.337861935440593]
    

    Now we regress the model with respect to your dataset:

    popt, pcov = optimize.curve_fit(model, data.x, data.y, p0=p0)
    #array([1.03735804e-02, 9.61270732e-01, 2.29030214e+01, 9.92651381e-01,
    #   1.59694755e-01, 2.46561911e+01, 1.85332645e-01, 1.21882422e-01,
    #   2.77807838e+01, 3.67805911e-02, 1.21890416e-01, 2.83583849e+01])
    

    We can now estimate the model:

    yhat = model(data.x, *popt)
    

    Graphically, it leads to:

    (Figure: normalized spectrum with the fitted four-peak model overlaid.)

    This is quite satisfactory for the unconstrained fit, but it certainly does not fulfill the constraints you want to enforce.
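
    If you want to enforce your constraints within this framework, here is a sketch, assuming a single shared width s together with your fixed means and amplitude ratios (the extra peaks found near 22.7 and 28.3 keV would have to be added to the model or excluded from the fit window):

    # Sketch: reuse the normalized peak() above, exposing only two free
    # parameters: the first amplitude A1 and a shared width s.
    RATIOS = [1., 0.5538, 0.1673, 0.1673 * 0.5185]  # fixed amplitude ratios
    MEANS = [24.210, 24.002, 27.276, 27.238]        # fixed means (keV)

    def constrained(x, A1, s):
        return sum(peak(x, A1 * r, s, m) for r, m in zip(RATIOS, MEANS))

    popt_c, pcov_c = optimize.curve_fit(constrained, data.x, data.y, p0=[1., 0.15])
    yhat_c = constrained(data.x, *popt_c)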

    Could you publish your data ready for coding as arrays in your post?