I parsed a PDF file using AI models and got markdown output, which is saved in the variable doc_parsed.
Below is a sample of its contents, printed with print(doc_parsed[2].text[:1000]):
# Details
|Name|Mr. XYZ|
|---|---|
|Age/Sex|XX YRS/X|
|Id.|01x40xxxxx|
|Refered By|Self|
|Collection On|xx/Aug/20xx 0x:x0AM|
|Collected By|xxxxxxx|
|Sample Rec. On|xx/Aug/20xx xx:x0 AM|
|Collection Mode|HOME COLLECTION|
|Reporting On|xx/Aug/20xx 0x:xx PM|
|BarCode|xxxxxx|
# Test Results
|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|
|Electrolyte Profile, Serum| | | |
|SODIUM (Na+)|136.2|136 - 145|mmol/L|
|POTASSIUM (K+)|4.23|3.5 - 5.5|mmol/L|
|CHLORIDE(Cl-)|106.24|98.0 - 107|mmol/L|
|TOTAL CALCIUM (Ca)|9.00|8.6-10.2|mg/dL|
|IONIZED CALCIUM|4.52|4.4 - 5.4|mg/dl|
|NON-IONIZED CALCIUM|4.49|4.4 - 5.4|mg/dl|
|pH.(Method : ISE Direct)|7.39|7.35 - 7.45| |
ISSUE: I have tried several ways to split this into DataFrame columns with | as the delimiter, using pd.read_csv() and pd.read_table(), but none worked.
import pandas as pd
import io
pd.read_table(doc_parsed[2].text[:1000], sep="|")
ValueError: Invalid file path or buffer object type: <class 'llama_index.core.schema.Document'>
import io
input_text = io.StringIO(print(doc_parsed[2].text[:1000]))
pd.read_csv(input_text,header=None, delimiter="|",
usecols = ["Parameter Name", "Result","Unit","Reference Range"])
EmptyDataError: No columns to parse from file
pd.read_csv(input_text,header=None, delimiter="|")
EmptyDataError: No columns to parse from file
Appreciate any help here.
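The two errors most likely have different causes. The ValueError says pd.read_table() was given a Document object rather than a file path or file-like buffer. The EmptyDataError comes from io.StringIO(print(...)): print() returns None, so the buffer is empty and pandas has nothing to parse. Beyond that, the markdown may not be perfectly formatted, or may contain extra rows (headings, the |---| separator) that pandas doesn't handle well by default.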
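A small check of the EmptyDataError case may help here. This is only a sketch on a hypothetical two-line sample (the name text and its contents are made up for illustration). It shows that print() hands None to io.StringIO, while passing the string itself gives pandas something to parse:

```python
import io

import pandas as pd

# Hypothetical two-line sample, not the real document
text = "|Test Name|Result|\n|SODIUM (Na+)|136.2|\n"

returned = print(text)                 # prints the text, but returns None
empty_buffer = io.StringIO(returned)   # same as io.StringIO(None): an empty buffer

try:
    pd.read_csv(empty_buffer, header=None, delimiter="|")
    error_name = None
except pd.errors.EmptyDataError as exc:
    error_name = type(exc).__name__    # the error reported in the question

# Wrapping the string itself works
df = pd.read_csv(io.StringIO(text), header=None, delimiter="|")
```

The leading and trailing pipes still produce an empty first and last column, so df has four columns here rather than two.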
Potential solution: you can try the following approach.
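import io
import pandas as pd
# sample_text is assumed to hold the markdown of one table as a plain string
input_text = io.StringIO(sample_text)
df = pd.read_csv(input_text, sep="|", skipinitialspace=True)
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]  # drop empty edge columns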
Using this method, I was able to successfully parse the markdown into a DataFrame.
I hope this gives you a clean DataFrame with your data nicely organized into columns.
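Since the sample in the question contains two tables with different column counts, plus heading lines and |---| separator rows, a single read_csv call over the whole text may not be enough. Below is a rough sketch of one way to split the text first. The helper name markdown_tables and the shortened sample_text are hypothetical, and it assumes every table sits under a line starting with #:

```python
import io

import pandas as pd

# Hypothetical, shortened sample mirroring the structure shown in the question
sample_text = """# Details
|Name|Mr. XYZ|
|---|---|
|Age/Sex|XX YRS/X|
# Test Results
|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|
|SODIUM (Na+)|136.2|136 - 145|mmol/L|
|POTASSIUM (K+)|4.23|3.5 - 5.5|mmol/L|
|pH.(Method : ISE Direct)|7.39|7.35 - 7.45| |
"""


def markdown_tables(text):
    """Return {heading: DataFrame} for each pipe table found under a heading."""
    sections, heading = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            # A heading starts a new section
            heading = line.lstrip("#").strip()
            sections[heading] = []
        elif line.startswith("|") and heading is not None:
            # Skip separator rows such as |---|---|
            if set(line) <= set("|-: "):
                continue
            sections[heading].append(line)

    tables = {}
    for name, rows in sections.items():
        if not rows:
            continue
        buffer = io.StringIO("\n".join(rows))
        df = pd.read_csv(buffer, sep="|", skipinitialspace=True)
        # Leading and trailing pipes create empty "Unnamed" columns; drop them
        df = df.loc[:, ~df.columns.str.contains("^Unnamed")]
        tables[name] = df
    return tables


tables = markdown_tables(sample_text)
results = tables["Test Results"]
```

With the real data, markdown_tables(doc_parsed[2].text) would be the equivalent call. Note that the first row of each table is used as the header, so for the key/value style Details table the first pair (Name, Mr. XYZ) ends up as column names.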