How to create a DataFrame from GC-MS data with retention time, mass values, and intensities in Python with correct alignment?

I am processing gas chromatography - mass spectroscopy (GC-MS) data stored in a NetCDF file in Python, and I’m having trouble aligning the data correctly to create a structured DataFrame. My goal is to build a matrix where:

Rows: Retention times (from scan_acquisition_time). Columns: Rounded mass values (from mass_values). Cells: Intensity values (from intensity_values) corresponding to the retention time and mass.

The problem that I am facing now is that for the life of me, I cannot understand why some of the intensity values are not aligned/positioned in the correct order in the dataframe. I know that they are not positioned correctly because I have a reference of the same dataframe (output from another programming software which has been verified to be correct).

The placement of intensity values in the dataframe is very important for what I am trying to do; and I have tried different ways such us doing a pivot table or by chunking the intensity values according to the size of mass values, but all of the methods has failed so far.

this was how I extracted the raw data:

` import xarray as xr import pandas as pd import numpy as np

 # Extract necessary data
 retention_time = data['scan_acquisition_time'].values.squeeze()
 intensity_values = data['intensity_values'].values.squeeze()
 mass_values = data['mass_values'].values.squeeze()
 scan_index = data['scan_index'].values.squeeze()
 rounded_mass = np.round(mass_values).astype(int)  # Round mass values to integers
 mass_min = np.round(min(mass_values)).astype(int)
 mass_max = np.round(max(mass_values)).astype(int)
 ideal_mass_values = list(range(mass_min, mass_max + 1))

the following is an example of my data structure:

Shapes

#retention times per scan scan_acquisition_time: (4825,)

#Starting indices for scan's data scan_index: (4825,)

#mass to charge ratio mass_values: (2903174,)

#Intensities corresponding to each mass intensity_values: (2903174,)

sample data

scan_acquisition_time = np.array([ 5.903 6.546 7.188 7.83 8.472 9.115 9.757 10.399 11.041 11.684 12.326 12.968 13.61 14.253 4.895 15.537 16.18 16.822 17.464 18.106 18.749 19.391 20.033 20.675 1.318 21.96 22.602 23.244 23.887 24.529 25.171 25.813 26.456 27.098 27.74 28.383 29.025 29.667 30.309 30.952 31.594 32.236 32.878 33.521 4.163 34.805 35.447 36.09 36.732 37.374])

scan_index = np.array([ 0 624 1249 1878 2492 3127 3742 4366 4982 5618 6246 6863 7469 8098 8715 9320 9936 10565 11191 11821 12449 13061 13680 14316 14931 15573 16208 16824 17448 18055 18680 19301 19928 20551 21186 21794 22419 23051 23674 24295 24919 25543 26164 26786 27414 28041 28664 29285 29904 30517])

mass_values = np.array([19.79999924 20.29999924 20.89999962 1.60000038 22.29999924 22.79999924 23.20000076 24.20000076 24.89999962 26. 27.20000076 28.10000038 29. 29.79999924 31.20000076 32.09999847 32.90000153 33.29999924 34.09999847 35. 35.40000153 36. 36.79999924 37.79999924 38.79999924 40.09999847 41.20000076 41.90000153 43. 44. 45.29999924 46.09999847 47.09999847 47.90000153 48.79999924 50. 50.29999924 51. 52.20000076 52.90000153 53.90000153 54.29999924 55. 56.09999847 56.79999924 57.09999847 58.09999847 59.09999847 60.20000076 60.79999924])

intensity_values = np.array([ 506. 574. 465. 411. 412. 562. 590. 598. 541. 1480. 762. 63016. 726. 576. 799. 23904. 98. 246. 471. 216. 291. 220. 222. 674. 887. 2035. 1982. 631. 501. 8382. 712. 469. 520. 743. 290. 585. 568. 1137. 907. 763. 236. 191. 811. 556. 340. 348. 425. 354. 327. 430.])

and then to build the dataframe, I did this:

 if 'point_count' in data.variables:
     point_count = data['point_count'].values.squeeze()
 else:
     raise ValueError("The dataset does not have the 'point_count' variable,    necessary to map scans to intensity data")

 # Repeat retention times for each point in each scan
 retention_time_repeated = np.repeat(retention_time, point_count)

 # Ensure lengths match
 assert len(retention_time_repeated) == len(intensity_values), "Mismatch in  retention_time and intensity_values length"

 # Get unique retention times and define matrix dimensions
 unique_retention_times, inverse_indices = np.unique(retention_time_repeated, return_inverse=True)
 unique_masses = ideal_mass_values

 # Initialize a zero-filled intensity matrix
 intensity_matrix = np.zeros((len(unique_retention_times), len(unique_masses)))

 # Create mass index mapping
 mass_to_index = {mass: j for j, mass in enumerate(unique_masses)}

 # Get indices for the intensity matrix
 mass_indices = np.array([mass_to_index[mass] for mass in rounded_mass])

 # Populate the intensity matrix in a vectorized manner
 intensity_matrix[inverse_indices, mass_indices] += intensity_values

 # Convert the matrix to a DataFrame for easier inspection
 matrix_df = pd.DataFrame(
     intensity_matrix,
     index=unique_retention_times,
     columns=unique_masses
 )

 # Display part of the matrix for verification
 print(matrix_df.head())

My questions:

How can I ensure that intensities are correctly aligned with retention times and rounded mass values?
Should I incorporate scan_index to help with the alignment? If so, how should I approach it?
Are there alternative, better ways to structure or process this data in Python to avoid alignment issues?

Is there anything else I can do? Please help. Thank you :')

For reference, this is what I am trying to get:

Intensity matrix for sample.CDF:

I managed to get it out but like mentioned, some of the values came out differently.

Solution

In your approach retention_time_repeated is created using np.repeat. This means that you are assuming a perfect match between retention times and the intensity values. But this is not the case. The actual mapping depends on scan_index which determines where each scan's data starts and ends. So, you need to calculate rounded_mass using np.round and ensure a consistent rounding method and have an explicit mapping mass_to_index. This is created to align each rounded mass value with a specific column in the intensity matrix. When populating the matrix use mass_to_index to place each intensity value into the correct column, ensuring alignment with the rounded masses.

THis can be done this way (here I created sample data since you didn't provide any):

import numpy as np
import pandas as pd

sample_data = {
    'scan_acquisition_time': np.array([1.0, 1.1, 1.2, 1.3, 1.4]), 
    'scan_index': np.array([0, 4, 9, 14, 18, 21]), 
    'mass_values': np.array([50.1, 51.2, 52.0, 53.3, 50.0, 51.1, 52.2, 54.1, 55.0, 50.2, 
                             51.5, 52.3, 53.8, 54.6, 50.3, 51.0, 52.1, 53.7, 50.5, 52.0, 53.0]),
    'intensity_values': np.array([100, 200, 150, 50, 120, 180, 160, 90, 80, 110, 
                                  210, 140, 70, 60, 130, 190, 170, 100, 125, 175, 155])
}

retention_time = sample_data['scan_acquisition_time']
scan_index = sample_data['scan_index']

rounded_mass = np.round(sample_data['mass_values']).astype(int)
intensity_values = sample_data['intensity_values']

mass_min = np.min(rounded_mass)
mass_max = np.max(rounded_mass)
ideal_mass_values = list(range(mass_min, mass_max + 1))

intensity_matrix = np.zeros((len(retention_time), len(ideal_mass_values)))

mass_to_index = {mass: i for i, mass in enumerate(ideal_mass_values)}

for i in range(len(scan_index) - 1):  
    start_idx = scan_index[i]
    end_idx = scan_index[i + 1]
    current_masses = rounded_mass[start_idx:end_idx]
    current_intensities = intensity_values[start_idx:end_idx]
    
    for m, intensity in zip(current_masses, current_intensities):
        col_idx = mass_to_index.get(m, None)
        if col_idx is not None:  
            intensity_matrix[i, col_idx] += intensity

last_start_idx = scan_index[-1]
current_masses = rounded_mass[last_start_idx:]
current_intensities = intensity_values[last_start_idx:]
for m, intensity in zip(current_masses, current_intensities):
    col_idx = mass_to_index.get(m, None)
    if col_idx is not None:
        intensity_matrix[-1, col_idx] += intensity

matrix_df = pd.DataFrame(
    intensity_matrix,
    index=retention_time,
    columns=ideal_mass_values
)

print(matrix_df.head())

which I believe returns what you would expect with your actual data:

        50     51     52     53     54    55
1.0  100.0  200.0  150.0   50.0    0.0   0.0
1.1  120.0  180.0  160.0    0.0   90.0  80.0
1.2  110.0    0.0  350.0    0.0   70.0  60.0
1.3  130.0  190.0  170.0    0.0  100.0   0.0
1.4  125.0    0.0  175.0  155.0    0.0   0.0

EDIT: SOLUTION WITH YOUR DATA:

import numpy as np
import pandas as pd

scan_acquisition_time = np.array([5.903, 6.546, 7.188, 7.83, 8.472, 9.115, 9.757, 10.399, 11.041, 11.684, 12.326, 12.968, 13.61, 14.253, 4.895, 15.537, 16.18, 16.822, 17.464, 18.106, 18.749, 19.391, 20.033, 20.675, 1.318, 21.96, 22.602, 23.244, 23.887, 24.529, 25.171, 25.813, 26.456, 27.098, 27.74, 28.383, 29.025, 29.667, 30.309, 30.952, 31.594, 32.236, 32.878, 33.521, 4.163, 34.805, 35.447, 36.09, 36.732, 37.374])

scan_index = np.array([0, 624, 1249, 1878, 2492, 3127, 3742, 4366, 4982, 5618, 6246, 6863, 7469, 8098, 8715, 9320, 9936, 10565, 11191, 11821, 12449, 13061, 13680, 14316, 14931, 15573, 16208, 16824, 17448, 18055, 18680, 19301, 19928, 20551, 21186, 21794, 22419, 23051, 23674, 24295, 24919, 25543, 26164, 26786, 27414, 28041, 28664, 29285, 29904, 30517])

mass_values = np.array([19.79999924, 20.29999924, 20.89999962, 1.60000038, 22.29999924, 22.79999924, 23.20000076, 24.20000076, 24.89999962, 26., 27.20000076, 28.10000038, 29., 29.79999924, 31.20000076, 32.09999847, 32.90000153, 33.29999924, 34.09999847, 35., 35.40000153, 36., 36.79999924, 37.79999924, 38.79999924, 40.09999847, 41.20000076, 41.90000153, 43., 44., 45.29999924, 46.09999847, 47.09999847, 47.90000153, 48.79999924, 50., 50.29999924, 51., 52.20000076, 52.90000153, 53.90000153, 54.29999924, 55., 56.09999847, 56.79999924, 57.09999847, 58.09999847, 59.09999847, 60.20000076, 60.79999924])

intensity_values = np.array([506., 574., 465., 411., 412., 562., 590., 598., 541., 1480., 762., 63016., 726., 576., 799., 23904., 98., 246., 471., 216., 291., 220., 222., 674., 887., 2035., 1982., 631., 501., 8382., 712., 469., 520., 743., 290., 585., 568., 1137., 907., 763., 236., 191., 811., 556., 340., 348., 425., 354., 327., 430.])

rounded_mass = np.round(mass_values).astype(int)
mass_min = np.min(rounded_mass)
mass_max = np.max(rounded_mass)
ideal_mass_values = list(range(mass_min, mass_max + 1))

intensity_matrix = np.zeros((len(scan_acquisition_time), len(ideal_mass_values)))
mass_to_index = {mass: i for i, mass in enumerate(ideal_mass_values)}

for i in range(len(scan_index) - 1):
    start_idx = scan_index[i]
    end_idx = scan_index[i + 1]
    current_masses = rounded_mass[start_idx:end_idx]
    current_intensities = intensity_values[start_idx:end_idx]
    
    for m, intensity in zip(current_masses, current_intensities):
        col_idx = mass_to_index.get(m, None)
        if col_idx is not None:
            intensity_matrix[i, col_idx] += intensity

last_start_idx = scan_index[-1]
current_masses = rounded_mass[last_start_idx:]
current_intensities = intensity_values[last_start_idx:]
for m, intensity in zip(current_masses, current_intensities):
    col_idx = mass_to_index.get(m, None)
    if col_idx is not None:
        intensity_matrix[-1, col_idx] += intensity

matrix_df = pd.DataFrame(
    intensity_matrix,
    index=scan_acquisition_time,
    columns=ideal_mass_values
)

which gives

          2    3    4    5    6    7    8    9    10   11  ...     52     53  \
5.903  411.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  907.0  763.0   
6.546    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0    0.0   
7.188    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0    0.0   
7.830    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0    0.0   
8.472    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0    0.0   

          54     55     56     57     58     59     60     61  
5.903  427.0  811.0  556.0  688.0  425.0  354.0  327.0  430.0  
6.546    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  
7.188    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  
7.830    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  
8.472    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0