
Open/edit UTF-8 FITS header in Python (pyfits)


I have to deal with some FITS files that contain UTF-8 text in their headers. As a result, basically none of the functions in the pyfits package work on them. Calling .decode does not work either, because the FITS header is a class, not a list. Does anyone know how to decode the header so I can process the data? The actual content is not important, so something like ignoring the offending letters would be fine. My current code looks like this:

hdulist = fits.open('Jupiter.FIT')
hdu = hdulist[0].header
hdu.decode('ascii', errors='ignore')

And I get: AttributeError: 'Header' object has no attribute 'decode'

Functions like:

print (hdu)

return:

ValueError: FITS header values must contain standard printable ASCII characters; "'Uni G\xf6ttingen, Institut f\xfcr Astrophysik'" contains characters/bytes that do not represent printable characters in ASCII.

I thought about overwriting the entry so I don't need to care about it. However, I can't even find out which entry contains the bad characters, and I would like a batch solution, as I have some hundred files.


Solution

  • As anatoly techtonik pointed out, non-ASCII characters in FITS headers are outright invalid and make for invalid FITS files. That said, it would be nice if astropy.io.fits could at least read the invalid entries. Support for that is currently broken and needs a champion to fix it, but nobody has done so because the problem is infrequent: most people encounter it in one or two files, fix those files, and move on. I would love for someone to tackle the problem, though.

    In the meantime, since you know exactly which string this file is hiccupping on, I would just open the file in raw binary mode and replace the string. If the FITS file is very large, you could read it a block at a time and do the replacement on those blocks. FITS files (especially headers) are written in 2880-byte blocks, and each header card is 80 bytes, so a card never straddles a block boundary. Wherever that string appears in a header, it sits entirely within one block, and you don't have to do any parsing of the header format beyond that. Just make sure that the string you replace it with is no longer than the original string, and that if it's shorter it is right-padded with spaces to the original length. FITS headers are a fixed-width format, and anything that changes the length of a header will corrupt the entire file. For this particular case, then, I would try something like this:

    bad_str = 'Uni Göttingen, Institut für Astrophysik'.encode('latin1')
    good_str = 'Uni Gottingen, Institut fur Astrophysik'.encode('ascii')
    # In this case I already know the replacement is the same length, so I'm not worried about it
    # A more general solution would require fixing the header parser to deal with non-ASCII bytes
    # in some consistent manner; I'm also looking for the full string instead of the individual
    # characters so that I don't corrupt binary data in the non-header blocks
    in_filename = 'Jupiter.FIT'
    out_filename = 'Jupiter-fixed.fits'
    
    with open(in_filename, 'rb') as inf, open(out_filename, 'wb') as outf:
        while True:
            block = inf.read(2880)
            if not block:
                break
            block = block.replace(bad_str, good_str)
            outf.write(block)
    

    This is ugly, and for a very large file it might be slow, but it's a start. I can think of better solutions, but they are harder to understand and probably not worth the time if you just have a handful of files to fix.
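If the replacement turns out to be shorter than the original, the padding rule described above can be handled by a small helper. This is only a minimal sketch: `pad_replacement` is a hypothetical name, not part of pyfits or astropy, and the shorter spelling `Uni Goettingen` is just an illustrative choice.

```python
def pad_replacement(bad: bytes, good: bytes) -> bytes:
    """Hypothetical helper: right-pad `good` with spaces to the length of `bad`.

    Refuses a longer replacement, because that would shift every byte after
    it and break the fixed-width FITS header layout.
    """
    if len(good) > len(bad):
        raise ValueError("replacement is longer than the original string")
    return good.ljust(len(bad), b" ")


bad_str = 'Uni Göttingen, Institut für Astrophysik'.encode('latin1')
# Illustrative shorter replacement; the result has exactly len(bad_str) bytes.
good_str = pad_replacement(bad_str, b'Uni Goettingen')
```

The padded `good_str` can then be passed to `block.replace(bad_str, good_str)` exactly as in the loop above. Because the closing quote of the header value stays where it was, the extra spaces end up as trailing blanks inside the string value.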

    Once that's done, please give the originator of the file a stern talking to--they should not be publishing corrupt FITS files.
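For a batch of files where the offending string is not known in advance, one possible approach is to walk the primary header card by card and overwrite every non-printable-ASCII byte with `?`, which keeps the length unchanged. The sketch below is an assumption-laden illustration rather than a tested tool: `sanitize_card` and `sanitize_primary_header` are hypothetical names, it uses only the standard library, it cleans only the first (primary) header, and it assumes that header is terminated by a proper `END` card.

```python
import shutil

CARD_LEN = 80      # every FITS header card is 80 bytes
BLOCK_LEN = 2880   # 36 cards per block


def sanitize_card(card: bytes) -> bytes:
    """Replace each byte outside printable ASCII (0x20-0x7E) with '?'."""
    return bytes(b if 0x20 <= b <= 0x7E else ord("?") for b in card)


def sanitize_primary_header(in_filename, out_filename):
    """Hypothetical helper: copy a FITS file, cleaning only the primary header.

    Returns a list of (card_index, original_card_bytes) for every card that
    had to be changed, so the bad entries can be identified afterwards.
    """
    changed = []
    with open(in_filename, "rb") as inf, open(out_filename, "wb") as outf:
        card_index = 0
        end_seen = False
        while not end_seen:
            block = inf.read(BLOCK_LEN)
            if len(block) < BLOCK_LEN:
                # Truncated file or no END card found: write what is left, stop.
                outf.write(block)
                break
            fixed = []
            for start in range(0, BLOCK_LEN, CARD_LEN):
                card = block[start:start + CARD_LEN]
                clean = sanitize_card(card)
                if clean != card:
                    changed.append((card_index, card))
                if card[:8] == b"END     ":
                    end_seen = True  # finish this block, then stop parsing
                fixed.append(clean)
                card_index += 1
            outf.write(b"".join(fixed))
        # Everything after the primary header is copied through untouched.
        shutil.copyfileobj(inf, outf)
    return changed
```

Calling it in a loop over something like `glob.glob('*.FIT')` would cover the few hundred files, and printing the returned list shows which cards were invalid, which is handy when reporting the problem to whoever produced the files. Headers of extension HDUs are not touched by this sketch.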