pythoncsvastropyfits

Extracting only few columns from a FITS file that is freely available online to download using python


I'm working on a model of universe for which I'm using data available on Sloan Digital Sky Survey site. Problem is some files are more than 4GB large(total more than 50GB) and I know those files contain a lot of data columns but I want data only from few columns. I have heard about web scraping so I thought to search about how to do it but it didn't help as all the tutorials explained how to download the whole file using python. I want know that is there any way through which I can extract only few columns from that file so that I only have the data I need and I won't have to download the whole larges file just for a small fraction of its data? Sorry, my question is just words and no codes because I'm not that pro in python. I just searched online and learned how to do basic web-scraping but it didn't solve my problem. It will be even more helpful if you could suggest me some more ways to reduce the size of data I'll have to download.

Here is the URL to download FITS files: https://data.sdss.org/sas/dr12/boss/lss/

I only want to extract columns that have coordinates(ra, dec), distance, velocity and redshifts from the files.

Also, is there a way to do the same thing with CSV files or a general way to do it with any file?


Solution

  • I'm afraid what you're asking is generally not possible, at least not with significant effort and software support both on the client and server side.

    First of all, the way FITS tables are stored in binary is row-oriented meaning if you wanted to stream a portion of a FITS table you can read it one row at a time. But to read individual columns you need to make partial reads of each row for every single row in the table. Some web servers support what are called "range requests" meaning you can request only a few ranges of bytes from a file, instead of the whole file. The web server has to have this enabled, and not all servers do. If FITS tables were stored column-oriented this could be feasible, as you could download just the header of the file to determine the ranges of the columns, and then download just the ranges for those columns.

    Unfortunately, since FITS tables are row-oriented, if you wanted to load say 3 columns from it, and the table contains a million rows, that would involve 3 million range requests which would likely involve enough overhead that you wouldn't gain anything from it (and I'm honestly not sure what limits web servers place on how many ranges you can request in a single request but I suspect most won't allow something so extreme.

    There are other astronomy data formats (e.g. I think CASA Tables) that can store tables in a column-oriented format, and so are more feasible for this kind of use case.

    Further, even if the HTTP limitations could be overcome, you would need software support for loading the file in this manner. This has been discussed to a limited extent here but for the reasons discussed above it would mostly be useful for a limited set of cases, such as loading one HDU at a time (not so helpful in your case if the entire table is in one HDU) or possibly some other specialized cases such as sections of tile-compressed images.

    As mentioned elsewhere, Dask supports loading binary arrays from various cloud-based filesystems, but when it comes to streaming data from arbitrary HTTP servers it runs into similar limitations.

    Worse still, I looked at the link you provided and all the files there are gzip-compressed, so it is especially difficult to deal with since you can't know what ranges of them to request without decompressing them first.

    As an aside, since you asked, you will have the same problem with CSV, only worse since CSV fields are not typically in fixed-width format, so there is no way to know how to extract individual columns without downloading the whole file.

    For FITS maybe it would be helpful to develop a web service capable of serving arbitrary extracts from larger FITS files. If such a thing already exists I don't know, but I don't think it exists in a very general sense. So this would a) have to be developed, and b) you would have to ask anyone hosting the files you want to access to host such a service.

    Your best bet is to just download the whole file, extract the data you need from it, and delete the original file assuming you no longer need it. It's possible the information you need is also already accessible through some online database.