
Pandas not recognising older dates (pre-1600)


I'm entering data in csv format. The majority of dates are after 1900, but some are earlier than this. The oldest that I've seen so far is 1518.

The 1518 date actually came up with an out of bounds error. I know that pandas timestamps can only cover a span of around 584 years, so a date that old falls outside the range and the error is expected. This limitation isn't the issue.

Here is an example of my data:

Index,Dates
00457,01/01/1981
134535,22/12/1977
3015,15/11/1889
00458,01/01/1981
00459,01/01/1981
134774,10/01/1978
00461,01/01/1981
00764,01/01/2000
00462,01/01/1981
00899,23/09/1518
00063,01/01/1981
00464,01/01/1981

After reading the file in using:

DF = pd.read_csv(sourceFile5,parse_dates=['Dates'], dayfirst=True, index_col="cNumber", skipinitialspace = True)

The formatting is fine, but when I try to filter through the results using

newDF.append(DF[ DF["Dates"] > one_month_ago])

(Please be aware that one_month_ago is a variable defined by my script)

None of the entries are recognised (even those from 1900 onwards). I know that the filter command works because I have used it with other .csv files that don't contain such old dates and there has been no issue.

For this reason, I added the extra step:

DF["Dates"] = pd.to_datetime(DF["Dates"], dayfirst = True, format = "%d/%m/%Y", errors = "coerce")

The post-1900 dates return fine, but the earlier dates return as YYYY-MM-DD. Even so, neither is recognised during the filter stage I mention above, even after this additional step. The column appears to come back as a series of strings rather than datetimes.

I'm at a loss as to why this is. Can anybody help?


Solution

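  • According to the documentation, there is a limitation: timestamps are stored as 64-bit integers at nanosecond resolution, so the representable time span is limited to approximately 584 years (roughly 1677 to 2262). A date in 1518 falls outside that range.

    You can represent out-of-bounds dates using Periods and still do computation on them: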

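    import pandas as pd

    def conv(x):
        # Split a "dd/mm/yyyy" string and build a daily Period from its parts
        day, month, year = map(int, x.split("/"))
        return pd.Period(year=year, month=month, day=day, freq="D")
    
    
    df = pd.read_csv("your_file.csv")
    # Replace the date strings with Period objects
    df["Dates"] = df["Dates"].apply(conv)
    print(df["Dates"])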
    

    Prints:

    0     1981-01-01
    1     1977-12-22
    2     1889-11-15
    3     1981-01-01
    4     1981-01-01
    5     1978-01-10
    6     1981-01-01
    7     2000-01-01
    8     1981-01-01
    9     1518-09-23
    10    1981-01-01
    11    1981-01-01
    Name: Dates, dtype: period[D]
    

    EDIT: After removing the 1518-09-23 row, you can load the file normally (passing the day-first format explicitly so that 10/01/1978 is read as 10 January):

    df = pd.read_csv("your_file.csv")
    df["Dates"] = pd.to_datetime(df["Dates"], format="%d/%m/%Y")
    print(df["Dates"])
    

    Prints:

    0    1981-01-01
    1    1977-12-22
    2    1889-11-15
    3    1981-01-01
    4    1981-01-01
    5    1978-01-10
    6    1981-01-01
    7    2000-01-01
    8    1981-01-01
    9    1981-01-01
    10   1981-01-01
    Name: Dates, dtype: datetime64[ns]
    

    Note that the dtype is now datetime64[ns], so the column holds real timestamps rather than strings.
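
The question's filter step can also be applied to the Period column. The following is a minimal sketch, not a definitive implementation: it inlines a few rows from the question's sample data instead of reading a file, the helper name to_period is hypothetical, and cutoff is an assumed stand-in for the question's one_month_ago variable.

```python
import io

import pandas as pd

# A few sample rows copied from the question (day/month/year strings).
csv_text = """Index,Dates
00457,01/01/1981
3015,15/11/1889
134774,10/01/1978
00764,01/01/2000
00899,23/09/1518
"""


def to_period(text):
    """Convert a 'dd/mm/yyyy' string to a daily Period."""
    day, month, year = map(int, text.split("/"))
    return pd.Period(year=year, month=month, day=day, freq="D")


# Read Index as text so the leading zeros are kept.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Index": str})
df["Dates"] = df["Dates"].apply(to_period)

# Hypothetical cutoff standing in for the question's one_month_ago variable.
cutoff = pd.Period("1980-01-01", freq="D")

# Compare Period with Period; the 1518 row is filtered out like any other old date.
recent = df[df["Dates"] > cutoff]
print(recent)
```

If one_month_ago is a Timestamp, it would likely need converting first, for example with one_month_ago.to_period("D"), so that both sides of the comparison are Periods.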