pythonpandasdataframeformattabula

Splitting a merged column by space...but some of the data has spaces between the values


This is a bit of a weird one -

I have scraped some data out of a pdf to create a table using tabula.

The table doesn't come out perfect and it merges some of the columns together. As an example 3 of the columns all get put into 1 column. No problem to split that with a delimiter, I can do something like -

columns_split = 'A B C'
df_split = df[columns_split].str.split(' ', expand=True)

The problem - one of the columns that gets merged has values that may go up to 4 digits. For some reason the values in the pdf table that are 4 digits long have a space in them so instead of '1000' the values are '1 000'. This consequently gets honoured by tabula when it reads the pdf. The merged column that comes in looks a bit like this; where instead of three separate columns A, B and C they are all ploughed in to one.


df['A B C']
Out[24]: 
1      A B C
2    1 000 60,0 300,0
3    1 000 60,0 200,0
4      999 60,0 200,0
5      888 70,0 100,0
6    2 777 80,0 100,0
7       666 80,0 90,0
8      555 50,0 100,0
9      111 50,0 100,0
Name: A B C, dtype: object

The knock on effect is when I run the split command it will split the columns where it finds a space in those values and so I end up with 4 columns with some values being knocked along. My initial thought was to ignore the first first delimiter, but that wouldn't apply to the rows that don't have the issue.


columns_split = 'A B C'
df_split = df[columns_split].str.split(' ', expand=True)    

df_split
Out[26]: 
      0     1        2      3
1     A     B        C   None
2     1   000     60,0  300,0
3     1   000     60,0  200,0
4   999  60,0    200,0   None
5   888  70,0    100,0   None
6     2   777     80,0  100,0
7   666  80,0     90,0   None
8   555  50,0    100,0   None
9   111  50,0    100,0   None

Hopefully someone out there has an inspirational idea how to get around this. Editing the pdf sadly is not an option, the above is for illustrative purposed and alas this is not the only pdf I need to run this on.

Thanks in advance for any help.


Solution

  • If only the first column has those incorrect spaces, use str.rsplit with a max of 2 splits:

    df['A B C'].str.rsplit(n=2, expand=True)
    

    Output:

           0     1      2
    1      A     B      C
    2  1 000  60,0  300,0
    3  1 000  60,0  200,0
    4    999  60,0  200,0
    5    888  70,0  100,0
    6  2 777  80,0  100,0
    7    666  80,0   90,0
    8    555  50,0  100,0
    9    111  50,0  100,0