I have a RegEx problem with a string pattern that i used to extract data from them by breaking it. I am using Python. The string lines are, simplified, something like this (there are some lines in between but these are the problematic ones):
Compound Name,RT,Score\n2,4,6-Cycloheptatrien-1-one, 3,5-bis-trimethylsilyl-,8.705,626,Area,386902.453,tR (min),8.705
['Compound Name', 'RT', 'Score', '2', '4', '6-Cycloheptatrien-1-one', ' 3', '5-bis-trimethylsilyl-', '8.705', '626', 'Area', '386902.453', 'tR (min)', '8.705', '']
Compound Name,RT,Score\nBenzene, 4-ethenyl-1,2-dimethyl-,11.682,600,Area,16289.239,tR (min),11.682
['Compound Name', 'RT', 'Score', 'Benzene', ' 4-ethenyl-1', '2-dimethyl-', '11.682', '600', 'Area', '16289.239', 'tR (min)', '11.682', '']
Compound Name,RT,Score\n2-(p-Tolyl)propan-2-ol, TMS,11.921,605,Area,17976.558,tR (min),11.921
['Compound Name', 'RT', 'Score', '2-(p-Tolyl)propan-2-ol', ' TMS', '11.921', '605', 'Area', '17976.558', 'tR (min)', '11.921', '']
Compound Name,RT,Score\nTris(tert-butyldimethylsilyloxy)arsane,12.036,685,Area,76702.743,tR (min),12.036
['Compound Name', 'RT', 'Score', 'Tris(tert-butyldimethylsilyloxy)arsane', '12.036', '685', 'Area', '76702.743', 'tR (min)', '12.036', '']
Compound Name,RT,Score\nPhenyl-pentamethyl-disiloxane,13.646,656,Area,84633.281,tR (min),13.646
['Compound Name', 'RT', 'Score', 'Phenyl-pentamethyl-disiloxane', '13.646', '656', 'Area', '84633.281', 'tR (min)', '13.646', '']
Compound Name,RT,Score\n**[1,1'-Bicyclopropyl]-2-octanoic acid, 2'-hexyl-, methyl est**,16.642,605,Area,566.512,tR (min),16.642
['Compound Name', 'RT', 'Score', '[1', "1'-Bicyclopropyl]-2-octanoic acid", " 2'-hexyl-", ' methyl est', '16.642', '605', 'Area', '566.512', 'tR (min)', '16.642', '']
Compound Name,RT,Score\n**(t-Butyl-dimethylsilyl)[2-methyl-2-(4-methyl-pent-3-enyl)-c**,19.853,650,Area,10141.071,tR (min),19.853
The bold area are the chemical compound names which change the most and are the main cause of problems, and straightaway my Python result. These chemical names should be treated as a whole string which may happen in some cases. Everything else between commas should be separated.
The point is first substracting those chemical compounds as characters on their own. I had a first pattern that used to work with simpler compound names:
pattern = r',(?=(?:[^"]*"[^"]*")*[^"]*$)|\n(?=(?:[^"]*"[^"]*")*[^"]*$)'
which put all compound names between single quotes into double quotes to treat them as a single character string. However now it will split if ,
is found and as you can see in the examples the compounds aren't between single quotes.
I believe it is way easier to tackle from behind: get the ,8.705,626,Area,386902.453,tR (min),8.705 portion first and treat everything before as a single string. Here Area
and tR(min)
won't change, however their numbers accompanying will.
Any idea how to proceed to separate both sides of the line, the compound and the rest? Shall i use RegEx ^ & $ special characters?
CODE:
import os
import io # for utf8
import sys
import re
from pprint import pprint
import xlsxwriter # for the data, to write them into xlsx.
import tkinter as tk # for GUI
from tkinter import filedialog # for GUI
from collections import Counter # for another thing
file_path=filedialog.askopenfilename()
file=io.open(file_path,'r',encoding="utf-8")
filecsvname=os.path.basename(file.name)
lines=file.readlines()
file.close()
filename=filecsvname.strip('.csv')
liness=[]
seqnames=[]
for i in range(len(lines)-1):
i
lines[i]
if lines[i].startswith('Compound')==True and lines[i+1].startswith('A')==False:
liness.append(lines[i]+lines[i+1])
elif lines[i].startswith('Area')==True:
liness.append(lines[i]+',Compound Name, ')
elif lines[i].startswith('Sample Name')==True:
liness.append(lines[i])
elif lines[i].startswith('Sequence Name')==True:
seqnames.append(lines[i])
liness.append(lines[i])
# some irrelevant extra code for seqnames.
pattern = r',(?=(?:[^"]*"[^"]*")*[^"]*$)|\n(?=(?:[^"]*"[^"]*")*[^"]*$)'
datalines=[]
for i in range(len(liness)):
lineofdata=liness[i].rstrip('\t')
splittedlines=re.split(pattern,lineofdata)
datalines.append(splittedlines)
Minimal Example Extract of the input *.csv file. Link to file: https://drive.google.com/file/d/1rXanJdnZP-ey0D8dU7XQ6TrHLZK6HLo2/view?usp=sharing
The main idea here is to make it prettier for Excel automatically via the script.
Sequence Name,20240603_S24031_anonymous
Sample Name,Blank
Area,15906292.428,tR (min),3.696
Area,10164232.248,tR (min),3.783
Area,5541647.716,tR (min),4.091
Area,203969.547,tR (min),6.514
Area,542692.442,tR (min),6.579
Area,1039445.806,tR (min),6.788
Area,4011338.408,tR (min),7.249
Area,504687.529,tR (min),8.177
Area,108363.994,tR (min),8.420
Compound Name,RT,Score
2,4,6-Cycloheptatrien-1-one, 3,5-bis-trimethylsilyl-,8.705,626,Area,386902.453,tR (min),8.705
Area,368393.389,tR (min),8.850
Area,91574.405,tR (min),9.447
Area,899294.259,tR (min),9.651
Area,193389.843,tR (min),10.041
Area,56853.148,tR (min),10.812
Compound Name,RT,Score
Benzene, 4-ethenyl-1,2-dimethyl-,11.682,600,Area,16289.239,tR (min),11.682
Compound Name,RT,Score
2-(p-Tolyl)propan-2-ol, TMS,11.921,605,Area,17976.558,tR (min),11.921
Area,78966.216,tR (min),12.031
Compound Name,RT,Score
Tris(tert-butyldimethylsilyloxy)arsane,12.036,685,Area,76702.743,tR (min),12.036
Area,534.128,tR (min),12.285
Compound Name,RT,Score
Phenyl-pentamethyl-disiloxane,13.646,656,Area,84633.281,tR (min),13.646
Area,11464.379,tR (min),13.696
Compound Name,RT,Score
Cyclopentene, 3,3-dimethyl-4-methylene-1,2-bis(trimethylsil,14.543,645,Area,402308.085,tR (min),14.543
This is a common example of bad exported CSV, if you don't have any mean to replace the column separator then you will indeed post process the files.
Anyway it is sufficiently structured to recover the information without the need for regular expression.
tokens = [
"Compound Name,RT,Score\n2,4,6-Cycloheptatrien-1-one, 3,5-bis-trimethylsilyl-,8.705,626,Area,386902.453,tR (min),8.705",
"Compound Name,RT,Score\nBenzene, 4-ethenyl-1,2-dimethyl-,11.682,600,Area,16289.239,tR (min),11.682",
"Compound Name,RT,Score\n2-(p-Tolyl)propan-2-ol, TMS,11.921,605,Area,17976.558,tR (min),11.921",
"Compound Name,RT,Score\nTris(tert-butyldimethylsilyloxy)arsane,12.036,685,Area,76702.743,tR (min),12.036",
"Compound Name,RT,Score\nPhenyl-pentamethyl-disiloxane,13.646,656,Area,84633.281,tR (min),13.646",
"Compound Name,RT,Score\n**[1,1'-Bicyclopropyl]-2-octanoic acid, 2'-hexyl-, methyl est**,16.642,605,Area,566.512,tR (min),16.642",
"Compound Name,RT,Score\n**(t-Butyl-dimethylsilyl)[2-methyl-2-(4-methyl-pent-3-enyl)-c**,19.853,650,Area,10141.071,tR (min),19.853",
]
def process(token):
lines = token.split("\n")
fields = lines[-1].split(",")
return {
"compound": ",".join(fields[:-6]),
"area": fields[-3],
"rt": fields[-1],
}
results = [process(token) for token in tokens]
The result gives three columns for each chromatographic peak.
compound area rt
0 2,4,6-Cycloheptatrien-1-one, 3,5-bis-trimethyl... 386902.453 8.705
1 Benzene, 4-ethenyl-1,2-dimethyl- 16289.239 11.682
2 2-(p-Tolyl)propan-2-ol, TMS 17976.558 11.921
3 Tris(tert-butyldimethylsilyloxy)arsane 76702.743 12.036
4 Phenyl-pentamethyl-disiloxane 84633.281 13.646
5 **[1,1'-Bicyclopropyl]-2-octanoic acid, 2'-hex... 566.512 16.642
6 **(t-Butyl-dimethylsilyl)[2-methyl-2-(4-methyl... 10141.071 19.853
Based on the file sample you provided the following function should import the format:
import pandas as pd
def process(filename):
with open(filename, "r") as file:
raw = file.read()
results = []
for sequence_id, sequence in enumerate(raw.split("\n\n")):
lines = sequence.split("\n")
sequence_name = lines[0].split(",")[-1]
for line in lines:
fields = line.split(",")
data = {}
if "Area," in line:
data |= {"area": float(fields[-3]), "rt": float(fields[-1])}
if ",Area," in line:
data |= {"compound": ",".join(fields[:-6])}
if data:
results.append({
"sequence_id": sequence_id,
"sequence_name": sequence_name,
} | data)
return pd.DataFrame(results)
It returns:
sequence_id sequence_name area rt \
0 0 20240603_S24031_anonymous 1.590629e+07 3.696
1 0 20240603_S24031_anonymous 1.016423e+07 3.783
2 0 20240603_S24031_anonymous 5.541648e+06 4.091
3 0 20240603_S24031_anonymous 2.039695e+05 6.514
4 0 20240603_S24031_anonymous 5.426924e+05 6.579
... ... ... ... ...
1675 37 20240604_S24032_BelénFlórez 2.436482e+07 19.036
1676 37 20240604_S24032_BelénFlórez 5.640226e+06 21.963
1677 37 20240604_S24032_BelénFlórez 5.021575e+05 23.108
1678 37 20240604_S24032_BelénFlórez 1.940947e+06 23.592
1679 37 20240604_S24032_BelénFlórez 9.665506e+05 24.550
compound
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
1675 Hexanoic acid
1676 Octanoic acid
1677 2H-Pyran-2-one, tetrahydro-6-pentyl-
1678 n-Decanoic acid
1679 Hexanediamide, N,N'-di-benzoyloxy-