I am trying to compare two CSV files that have the same data but columns in different orders. How can I tweak the following code, which works when the column orders match, so that it also works when the column orders don't match between the CSV files?
Set<String> source = new HashSet<>(org.apache.commons.io.FileUtils.readLines(new File(sourceFile)));
Set<String> target = new HashSet<>(org.apache.commons.io.FileUtils.readLines(new File(targetFile)));
return source.containsAll(target) && target.containsAll(source);
For example, the above test passes when the source file and target file look like this:
source file:
a,b,c
1,2,3
4,5,6
target file:
a,b,c
1,2,3
4,5,6
However, with the same source file, the test fails if the target file looks like this:
target file:
a,c,b
1,3,2
4,6,5
A Set relies on a properly functioning .equals method for comparison, whether detecting duplicates or comparing its elements to those in another Collection. When I saw this question, my first thought was to create a new class for Objects to put into your Set Objects, replacing the String Objects. But, at the time, it was easier and faster to produce the code in my previous answer.
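To make the problem concrete, here is a small sketch of why comparing whole lines as String Objects fails once the columns are reordered. The class name LineSetDemo and the method sameLines are made up for this illustration; the comparison itself is the one from your question, applied to arrays instead of files.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LineSetDemo {
    // Compares lines the way the question's code does: each line is one String.
    static boolean sameLines(String[] sourceLines, String[] targetLines) {
        Set<String> source = new HashSet<>(Arrays.asList(sourceLines));
        Set<String> target = new HashSet<>(Arrays.asList(targetLines));
        return source.containsAll(target) && target.containsAll(source);
    }

    public static void main(String[] args) {
        String[] source = {"a,b,c", "1,2,3", "4,5,6"};
        String[] sameOrder = {"a,b,c", "1,2,3", "4,5,6"};
        String[] reordered = {"a,c,b", "1,3,2", "4,6,5"};

        // Same column order: every line String has an equal counterpart.
        System.out.println(sameLines(source, sameOrder)); // true

        // Reordered columns: "1,2,3".equals("1,3,2") is false, so no line matches.
        System.out.println(sameLines(source, reordered)); // false
    }
}
```

String.equals compares character by character, so it has no way to know that "1,3,2" under the header "a,c,b" carries the same data as "1,2,3" under "a,b,c".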
Here is another solution, which is closer to my first thought. To start, I created a Pair class, which overrides .hashCode () and .equals (Object other).
package comparecsv1;

import java.util.Objects;

public class Pair <T, U> {
    private final T t;
    private final U u;

    Pair (T aT, U aU) {
        this.t = aT;
        this.u = aU;
    }

    @Override
    public int hashCode() {
        int hash = 3;
        hash = 59 * hash + Objects.hashCode(this.t);
        hash = 59 * hash + Objects.hashCode(this.u);
        return hash;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) { return true; }
        if (obj == null) { return false; }
        if (getClass() != obj.getClass()) { return false; }
        final Pair<?, ?> other = (Pair<?, ?>) obj;
        if (!Objects.equals(this.t, other.t)) {
            return false;
        }
        return Objects.equals(this.u, other.u);
    } // end equals
} // end class pair
The .equals (Object obj) and .hashCode () methods were auto-generated by the IDE. As you know, .hashCode() should always be overridden when .equals is overridden. Also, some Collection Objects, such as HashMap and HashSet, rely on proper .hashCode() methods.
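As a rough illustration of why HashSet needs both methods, the sketch below uses two made-up classes: EqualsOnly overrides only .equals, while EqualsAndHash overrides both. Neither class is part of the answer's code; they exist only for this demonstration.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class HashCodeDemo {
    // Hypothetical class: overrides equals but NOT hashCode.
    static final class EqualsOnly {
        final String value;
        EqualsOnly(String value) { this.value = value; }
        @Override
        public boolean equals(Object o) {
            return o instanceof EqualsOnly
                    && Objects.equals(value, ((EqualsOnly) o).value);
        }
    }

    // Hypothetical class: overrides both, so equal objects share a hash code.
    static final class EqualsAndHash {
        final String value;
        EqualsAndHash(String value) { this.value = value; }
        @Override
        public boolean equals(Object o) {
            return o instanceof EqualsAndHash
                    && Objects.equals(value, ((EqualsAndHash) o).value);
        }
        @Override
        public int hashCode() { return Objects.hashCode(value); }
    }

    // An equal element is found because both objects hash to the same bucket.
    static boolean foundWithHashCode() {
        Set<EqualsAndHash> set = new HashSet<>();
        set.add(new EqualsAndHash("1"));
        return set.contains(new EqualsAndHash("1"));
    }

    // Uses the default identity-based hashCode, so the lookup usually
    // searches a different bucket and is not guaranteed to find the element.
    static boolean foundWithoutHashCode() {
        Set<EqualsOnly> set = new HashSet<>();
        set.add(new EqualsOnly("1"));
        return set.contains(new EqualsOnly("1"));
    }

    public static void main(String[] args) {
        System.out.println("With hashCode:    " + foundWithHashCode());
        System.out.println("Without hashCode: " + foundWithoutHashCode());
    }
}
```

The first lookup always succeeds. The second is unreliable, which is exactly the kind of silent failure that would break a comparison built on HashSet.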
After creating class Pair<T,U>, I created class CompareCSV1. The idea here is to use a Set<Set<Pair<String, String>>> where you have Set<String> in your code.
A Pair<String, String> pairs a value from a column with the header for the column in which it appears.
A Set<Pair<String, String>> represents one row.
A Set<Set<Pair<String, String>>> represents all the rows.
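The effect of pairing each value with its header can be sketched on a single row. To keep the snippet self-contained, it uses java.util.AbstractMap.SimpleImmutableEntry as a stand-in for the Pair class, with the header as the key and the value as the entry value; RowSetDemo and row are names invented for this example, and the "," separator is assumed.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RowSetDemo {
    // Builds one row as a set of (header, value) pairs.
    // Values beyond the number of headers are ignored in this sketch.
    static Set<Map.Entry<String, String>> row(String headerLine, String dataLine) {
        String[] headers = headerLine.split(",");
        String[] values = dataLine.split(",");
        Set<Map.Entry<String, String>> result = new HashSet<>();
        for (int i = 0; i < headers.length && i < values.length; i++) {
            result.add(new SimpleImmutableEntry<>(headers[i], values[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        // Same data, different column order: the sets of pairs are equal.
        System.out.println(row("a,b,c", "1,2,3").equals(row("a,c,b", "1,3,2"))); // true

        // Same line text under different headers: b and c now hold other values.
        System.out.println(row("a,b,c", "1,2,3").equals(row("a,c,b", "1,2,3"))); // false
    }
}
```

Because a Set ignores insertion order, the order of the columns stops mattering once every value carries its own header.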
package comparecsv1;

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class CompareCSV1 {
    private final Set<Set<Pair<String, String>>> theSet;
    private final String [] columnHeader;

    private CompareCSV1 (String columnHeadings, String headerSplitRegex) {
        columnHeader = columnHeadings.split (headerSplitRegex);
        theSet = new HashSet<> ();
    }

    // Converts one data line into a set of (value, header) pairs.
    // Assumes the line has no more columns than the header line.
    private Set<Pair<String, String>> createLine
            (String columnSource, String columnSplitRegex) {
        String [] column = columnSource.split (columnSplitRegex);
        Set<Pair<String, String>> lineSet = new HashSet<> ();
        int i = 0;
        for (String columnValue: column) {
            lineSet.add (new Pair<> (columnValue, columnHeader [i++]));
        }
        return lineSet;
    }

    public Set<Set<Pair<String, String>>> getSet () { return theSet; }

    public String [] getColumnHeaders () {
        return Arrays.copyOf (columnHeader, columnHeader.length);
    }

    // The first element of theData is the header line; the rest are data lines.
    public static CompareCSV1 createFromData (List<String> theData
            , String headerSplitRegex, String columnSplitRegex) {
        CompareCSV1 result =
                new CompareCSV1 (theData.get(0), headerSplitRegex);
        for (int i = 1; i < theData.size(); ++i) {
            result.theSet.add(result.createLine(theData.get(i), columnSplitRegex));
        }
        return result;
    }

    public static void main(String[] args) {
        String [] sourceData = {"a,b,c,d,e", "6,7,8,9,10", "1,2,3,4,5"
                ,"11,12,13,14,15", "16,17,18,19,20"};
        String [] targetData = {"c,b,e,d,a", "3,2,5,4,1", "8,7,10,9,6"
                ,"13,12,15,14,11", "18,17,20,19,16"};
        List<String> source = Arrays.asList(sourceData);
        List<String> target = Arrays.asList (targetData);
        CompareCSV1 sourceCSV = createFromData (source, ",", ",");
        CompareCSV1 targetCSV = createFromData (target, ",", ",");
        System.out.println ("Source contains target? "
                + sourceCSV.getSet().containsAll (targetCSV.getSet())
                + ". Target contains source? "
                + targetCSV.getSet().containsAll (sourceCSV.getSet())
                + ". Are equal? " + targetCSV.getSet().equals (sourceCSV.getSet()));
    } // end main
} // end class CompareCSV1
This code has some things in common with the code in my first answer: it uses String [] Objects, with calls to the Arrays.asList method, as substitutes for your data sources. I hard-coded "," as the String split expression in main, but the new methods allow the String split expression to be passed in. They also allow separate String split expressions for the column header line and the data lines.
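Since the split expressions are handed to String.split, they are regular expressions rather than plain delimiters. The sketch below shows what that means in practice, using an invented case where the header line is pipe-delimited and the data lines are comma-delimited with stray spaces. SplitRegexDemo and splitLine are hypothetical names for this example only.

```java
import java.util.Arrays;

public class SplitRegexDemo {
    // Thin wrapper around String.split, which treats its argument as a regex.
    static String[] splitLine(String line, String splitRegex) {
        return line.split(splitRegex);
    }

    public static void main(String[] args) {
        // "|" is a regex metacharacter, so it has to be escaped.
        String[] headers = splitLine("a|b|c", "\\|");

        // This expression also swallows whitespace around each comma.
        String[] values = splitLine("1, 2 ,3", "\\s*,\\s*");

        System.out.println(Arrays.toString(headers)); // [a, b, c]
        System.out.println(Arrays.toString(values));  // [1, 2, 3]
    }
}
```

Note that a plain split like this does not handle quoted fields that contain the delimiter; files like that would need a real CSV parser.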