pythonpandasdatetimeheaderread-csv

How to best use read_csv parameters when headers are on different rows, and then make 1st column datetime index


Ive been having trouble reading and updating a csv from yfinance, due to the data in the first few rows of the downloaded csv:

The downloaded csv (formatted) looks like this:

Price Adj Close Close High Low Open Volume
Ticker ^BVSP ^BVSP ^BVSP ^BVSP ^BVSP ^BVSP
Date
2014-01-02 50341.0 50341.0 51656.0 50246.0 51522.0 3476300
2014-01-03 50981.0 50981.0 50981.0 50269.0 50348.0 7360400
2014-01-06 50974.0 50974.0 51002.0 50451.0 50980.0 3727800
2014-01-07 50430.0 50430.0 51478.0 50429.0 50982.0 3339500

The raw .csv file looks like this:

Price,Adj Close,Close,High,Low,Open,Volume
Ticker,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP
Date,,,,,,,
2014-01-02,50341.0,50341.0,51656.0,50246.0,51522.0,3476300
2014-01-03,50981.0,50981.0,50981.0,50269.0,50348.0,7360400
2014-01-06,50974.0,50974.0,51002.0,50451.0,50980.0,3727800
2014-01-07,50430.0,50430.0,51478.0,50429.0,50982.0,3339500

Once read, I want the df to look like this, where 'Date' is datetime index:

Date Adj Close Close High Low Open Volume
2014-01-02 50341.0 50341.0 51656.0 50246.0 51522.0 3476300
2014-01-03 50981.0 50981.0 50981.0 50269.0 50348.0 7360400
2014-01-06 50974.0 50974.0 51002.0 50451.0 50980.0 3727800
2014-01-07 50430.0 50430.0 51478.0 50429.0 50982.0 3339500

I'm using this code, which works, but it seems clumsy.

idx_df = pd.read_csv(
            f'{data_folder}/INDEX_{idx_code}.csv',
            header=None,
            skiprows=3,  #  data starts on row 4
            names=['Date', 'Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume'],
            index_col='Date'
        )
        idx_df.index = pd.to_datetime(idx_df.index, errors='coerce')  

My questions:

  1. Is there a simpler/more elegant way, perhaps using one line of code, and using the "header" parameter, even though "Date" is at position (2, 0) and the others are at (0, 1:6)?
  2. Is there a way to set the index as datetime within the "read_csv" instruction, avoiding the "idx_df.index =" line?

Thanks


Solution

  • Example

    import pandas as pd
    import io
    
    csv1 = '''Price,Adj Close,Close,High,Low,Open,Volume
    Ticker,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP
    Date,,,,,,,
    2014-01-02,50341.0,50341.0,51656.0,50246.0,51522.0,3476300
    2014-01-03,50981.0,50981.0,50981.0,50269.0,50348.0,7360400
    2014-01-06,50974.0,50974.0,51002.0,50451.0,50980.0,3727800
    2014-01-07,50430.0,50430.0,51478.0,50429.0,50982.0,3339500
    '''
    

    Code

    use skiprows, parse_dates, index_col parameter

    df = pd.read_csv(
        io.StringIO(csv1), # file path
        skiprows=[1, 2], # skip junk rows
        parse_dates=['Price'], # convert Price column to datetime 
        index_col=0 # set Price column as index
    ).rename_axis('Date') # rename Price -> Date
    

    df

                Adj Close    Close     High      Low     Open   Volume
    Date                                                              
    2014-01-02    50341.0  50341.0  51656.0  50246.0  51522.0  3476300
    2014-01-03    50981.0  50981.0  50981.0  50269.0  50348.0  7360400
    2014-01-06    50974.0  50974.0  51002.0  50451.0  50980.0  3727800
    2014-01-07    50430.0  50430.0  51478.0  50429.0  50982.0  3339500