python, python-3.x, csv, opencsv

Pulling data from specific columns of CSV files in Python


I need some quick help with reading CSV files using Python and storing the data in a PY file, so that I can use the data later to create graphs after storing all the data in different files.

I have searched, but in all the cases I found, the data had headers. My data does not have a header row, and the values are tab-separated. I also need to store only specific columns of the data. For example:

12345601 2345678@abcdef 1 2 365 places

In this case, as an example, I would want to store only "2345678@abcdef" and "365" in the new Python file in order to use them later to create a graph.

Also, I have more than one CSV file in a folder, and I need to do this for each of them. The sources I found did not cover that case and only referred to:

# open csv file
with open(csv_file, 'rb') as csvfile:

Could anyone refer me to an already-answered question or help me out with this?


Solution

  • . . . and storing it in a PY file to use the data to graph after storing all the data in different files . . .

    . . . I would want to store only "2345678@abcdef" and "365" in the new python file . . .

    Are you sure that you want to store the data in a Python file? Python files are supposed to hold Python code, and they should be executable by the Python interpreter. It would be a better idea to store your data in a plain data file (say, preprocessed_data.csv).

    To get a list of files matching a pattern, you can use Python's built-in glob module.

    Here's an example of how you could read multiple csv files in a directory and extract the desired columns from each one:

    import glob
    
    # indices of the columns you want to preserve
    desired_columns = [1, 4]
    # change this to the directory that holds your data files
    csv_directory = '/path/to/csv/files/*.csv'
    
    # iterate over the files holding data
    extracted_data = []
    for file_name in glob.glob(csv_directory):
        with open(file_name, 'r') as data_file:
            for line in data_file:
                # split the line on whitespace (tabs included)
                tokens = line.split()
                # skip blank lines
                if not tokens:
                    continue
                # only grab the columns we care about
                desired_data = [tokens[i] for i in desired_columns]
                extracted_data.append(desired_data)
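
    Since the data is tab-separated, the reading side can also be done with the standard csv module instead of split(). A minimal sketch, assuming the fields really are separated by single tab characters:

    import csv
    import glob
    
    desired_columns = [1, 4]
    csv_directory = '/path/to/csv/files/*.csv'
    
    extracted_data = []
    for file_name in glob.glob(csv_directory):
        # newline='' is the recommended way to open files for the csv module
        with open(file_name, 'r', newline='') as data_file:
            # delimiter='\t' matches the tab-separated input
            for tokens in csv.reader(data_file, delimiter='\t'):
                # blank lines come back as empty lists; skip them
                if not tokens:
                    continue
                extracted_data.append([tokens[i] for i in desired_columns])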
    

    It would be easy to write the extracted data to a new file. The following example shows how you might save the data to a csv file.

    # build one big string with one comma-separated line per row
    output_string = ''
    for row in extracted_data:
        output_string += ','.join(row) + '\n'
    
    with open('./preprocessed_data.csv', 'w') as csv_file:
        csv_file.write(output_string)
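
    The csv module can take over the writing side as well, handling the commas and newlines for you. A sketch that assumes the extracted_data list built above:

    import csv
    
    with open('./preprocessed_data.csv', 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        # each inner list becomes one comma-separated row
        writer.writerows(extracted_data)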
    

    Edit:

    If you don't want to combine all the csv files, here's a version that can process one at a time:

    import glob
    
    def process_file(input_path, output_path, selected_columns):
        # read the input file and pull out the selected columns
        extracted_data = []
        with open(input_path, 'r') as in_file:
            for line in in_file:
                tokens = line.split()
                # skip blank lines
                if not tokens:
                    continue
                extracted_data.append([tokens[i] for i in selected_columns])
    
        # write the extracted columns as comma-separated lines
        output_string = ''
        for row in extracted_data:
            output_string += ','.join(row) + '\n'
    
        with open(output_path, 'w') as out_file:
            out_file.write(output_string)
    
    # whenever you need to process a file:
    process_file(
        '/path/to/input.csv', 
        '/path/to/processed/output.csv',
        [1, 4])
    
    # if you want to process every file in a directory:
    target_directory = '/path/to/my/files/*.csv'
    for file_name in glob.glob(target_directory):
        process_file(file_name, file_name + '.out', [1, 4])
    

    Edit 2:

    The following example will process every file in a directory and write the results to a similarly-named output file in another directory:

    import os
    import glob
    
    input_directory = '/path/to/my/files/*.csv'
    output_directory = '/path/to/output'
    for input_file in glob.glob(input_directory):
        output_name = os.path.basename(input_file) + '.out'
        output_file = os.path.join(output_directory, output_name)
        process_file(input_file, output_file, [1, 4])
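
    One caveat: open() raises a FileNotFoundError if the output directory doesn't exist yet, so it's worth creating it before the loop:

    import os
    
    # create the output directory if it isn't there already
    os.makedirs(output_directory, exist_ok=True)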
    

    If you want to add headers to the output, then process_file could be modified like this:

    def process_file(input_path, output_path, selected_columns, column_headers=()):
        extracted_data = []
        with open(input_path, 'r') as in_file:
            for line in in_file:
                tokens = line.split()
                # skip blank lines
                if not tokens:
                    continue
                extracted_data.append([tokens[i] for i in selected_columns])
    
        # write the header line first, but only if headers were given
        output_string = ','.join(column_headers) + '\n' if column_headers else ''
        for row in extracted_data:
            output_string += ','.join(row) + '\n'
    
        with open(output_path, 'w') as out_file:
            out_file.write(output_string)
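
    A hypothetical call to the header-aware version; the header names here are made up for illustration, since the original data doesn't name its columns:

    process_file(
        '/path/to/input.csv',
        '/path/to/output.csv',
        [1, 4],
        ['user_id', 'days'])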