java csv apache-commons-csv

How can I use Commons CSV to remove duplicate columns from a CSV file in Java?


I have a CSV file that contains several duplicate columns, and I am trying to remove these duplicates using Java. I found the Apache Commons CSV library, which some people use to remove duplicate rows. How can I use it to remove or skip duplicate columns?

For example, my CSV header is:

ID Name Email Email

So far my code is:

Reader reader = Files.newBufferedReader(Paths.get("user.csv"));

// read csv file
Iterable<CSVRecord> records = CSVFormat.DEFAULT.withFirstRecordAsHeader()
        .withIgnoreHeaderCase()
        .withTrim()
        .parse(reader);

for (CSVRecord record : records) {
    System.out.println("Record #: " + record.getRecordNumber());
    System.out.println("ID: " + record.get("ID"));
    System.out.println("Name: " + record.get("Name"));
    System.out.println("Email: " + record.get("Email"));
}

// close the reader
reader.close();



Solution

  • Your code is close to what you need - you just need to use CSVPrinter to write out your data to a new file.

    import java.io.IOException;
    import java.io.Reader;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVPrinter;
    import org.apache.commons.csv.CSVRecord;
    
    public class App {
    
        public static void main(String[] args) throws IOException {
    
            // open both files in one try-with-resources so each is closed automatically
            try (final Reader reader = Files.newBufferedReader(Paths.get("source.csv"),
                    StandardCharsets.UTF_8);
                 final Writer writer = Files.newBufferedWriter(Paths.get("target.csv"),
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING)) { // creates the file, or empties an existing one
    
                try (final CSVPrinter printer = CSVFormat.DEFAULT
                        .withHeader("ID", "Name", "Email")
                        .print(writer)) {
                    
                    // read each input file record:
                    Iterable<CSVRecord> records = CSVFormat.DEFAULT
                            .withFirstRecordAsHeader()
                            .withIgnoreHeaderCase()
                            .withTrim()
                            .parse(reader);
                    
                    // write each output file record
                    for (CSVRecord record : records) {
                        printer.print(record.get("ID"));
                        printer.print(record.get("Name"));
                        printer.print(record.get("Email"));
                        printer.println();
                    }
                }
            }
        }
    }
    

    This transforms the following source file:

    ID,Name,Email,Email
    1,Albert,foo@bar.com,foo@bar.com
    2,Brian,baz@bat.com,baz@bat.com
    

    To this target file:

    ID,Name,Email
    1,Albert,foo@bar.com
    2,Brian,baz@bat.com
    

    Some points to note:

    1. I was wrong in my comment. You do not need to use column indexes - you can use headings (as I do above) in your specific case.

    2. Whenever reading and writing a file, it is recommended to provide the character encoding. In my case, I use UTF-8. (This assumes the original file was created as a UTF-8 file, of course - or is compatible with that encoding.)

    3. When opening the reader and the writer I use "try-with-resources" statements. These mean I do not have to explicitly close the file resources - Java takes care of that for me.
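
If the header names are not known in advance, the duplicates could instead be found by position. The following is only a minimal sketch under stated assumptions: `DuplicateColumns`, `firstOccurrenceIndexes` and `project` are hypothetical names (they are not part of Commons CSV), and header names are treated as equal after trimming and ignoring case. It uses plain Java collections only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Commons CSV.
final class DuplicateColumns {

    // Returns the indexes of the columns to keep: the first occurrence of
    // each header name, compared after trimming and ignoring case.
    static List<Integer> firstOccurrenceIndexes(List<String> header) {
        Set<String> seen = new HashSet<>();
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < header.size(); i++) {
            String key = header.get(i).trim().toLowerCase(Locale.ROOT);
            if (seen.add(key)) { // add() returns false for a name already seen
                keep.add(i);
            }
        }
        return keep;
    }

    // Copies only the kept columns of one row, in their original order.
    static List<String> project(List<String> row, List<Integer> keep) {
        List<String> out = new ArrayList<>(keep.size());
        for (int index : keep) {
            out.add(row.get(index));
        }
        return out;
    }
}
```

With Commons CSV, one way to apply this would be to read the header line as an ordinary record, compute the indexes once, and then print `record.get(index)` for each kept index of every later record. Note that this compares header names only, so two columns with the same name but different values would still be reduced to the first one.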