This question is not the same as "How to print only the unique lines in BASH?", because that one suggests removing all copies of the duplicated lines, while this one is about eliminating only their duplicates, i.e., changing 1, 2, 3, 3 into 1, 2, 3 instead of just 1, 2.
This question is hard to phrase in the abstract, but the example is straightforward. If I have a file like this:
1
2
2
3
4
After parsing the file and erasing the duplicated lines, it should become this:
1
3
4
I know some Python, and this is a Python script I wrote to do it. Create a file called clean_duplicates.py with the following content and run it as shown in its header comment:
import sys

#
# To run it use:
# python clean_duplicates.py < input.txt > clean.txt
#
def main():
    lines = sys.stdin.readlines()
    # print( lines )
    clean_duplicates( lines )

#
# It only removes adjacent duplicated lines, so you need to sort them
# case-sensitively before running it.
#
def clean_duplicates( lines ):
    linesCount = len( lines )

    # If the file is empty, there is nothing to print
    if linesCount == 0:
        return

    lastLine = lines[ 0 ]
    nextLine = None
    currentLine = None

    # If it is a one-line file, print it and stop the algorithm
    if linesCount == 1:
        sys.stdout.write( lines[ 0 ] )
        return

    # Print the first line if it differs from the second one
    if lines[ 0 ] != lines[ 1 ]:
        sys.stdout.write( lines[ 0 ] )

    # Print the middle lines; range( 0, 2 ) creates the list [0, 1]
    for index in range( 1, linesCount - 1 ):
        currentLine = lines[ index ]
        nextLine = lines[ index + 1 ]

        # Skip a line equal to the previous one
        if currentLine == lastLine:
            continue

        lastLine = lines[ index ]

        # Skip a line equal to the next one
        if currentLine == nextLine:
            continue

        sys.stdout.write( currentLine )

    # Print the last line if it differs from the one before it
    # (this also covers two-line files)
    if lines[ linesCount - 2 ] != lines[ linesCount - 1 ]:
        sys.stdout.write( lines[ linesCount - 1 ] )

if __name__ == "__main__":
    main()
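The same adjacent-run logic can be sketched more compactly with `itertools.groupby`, which groups consecutive equal items. This is only an illustrative alternative, not the script above; the function name `drop_repeated_runs` is made up for this example, and the input is assumed to be sorted already (or at least to have its duplicates adjacent).

```python
from itertools import groupby


def drop_repeated_runs(lines):
    """Yield only the lines whose run of adjacent identical lines has length 1."""
    for line, run in groupby(lines):
        # groupby yields one (key, iterator) pair per run of equal adjacent items.
        if sum(1 for _ in run) == 1:
            yield line


if __name__ == "__main__":
    # Sample data taken from the question; a real script would read sys.stdin.
    sample = ["1\n", "2\n", "2\n", "3\n", "4\n"]
    print("".join(drop_repeated_runs(sample)), end="")
```

Because it consumes the input lazily, this sketch could also be fed `sys.stdin` directly instead of a list built with `readlines()`.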
However, while searching for how to remove duplicate lines, it seems easier to use tools such as grep, sort, sed, and uniq:
You may use uniq with the -u/--unique option. As per the uniq man page:
-u/--unique
Don't output lines that are repeated in the input.
Print only lines that are unique in the INPUT.
For example:
cat /tmp/uniques.txt | uniq -u
Or, as mentioned in UUOC (Useless Use of Cat), a better way is:
uniq -u /tmp/uniques.txt
Both of these commands return:
1
3
4
where /tmp/uniques.txt holds the numbers from the question, i.e.:
1
2
2
3
4
Note: uniq requires the content of the file to be sorted, because it only compares adjacent lines. As mentioned in the documentation:
By default, uniq prints the unique lines in a sorted file, i.e., it discards all but one of identical successive input lines, so that the OUTPUT contains unique lines.
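To make the adjacency requirement concrete, here is a rough Python sketch of that default behavior (keeping one line per run of identical successive lines). The helper name `collapse_adjacent` is hypothetical, and this is an approximation of the idea, not uniq's actual implementation.

```python
def collapse_adjacent(lines):
    """Keep the first line of every run of identical adjacent lines."""
    previous = object()  # sentinel that never compares equal to a real line
    for line in lines:
        if line != previous:
            yield line
        previous = line


if __name__ == "__main__":
    # On sorted input the duplicate "2" is adjacent, so it is collapsed.
    print(list(collapse_adjacent(["1", "2", "2", "3", "4"])))
    # On unsorted input the two "2" lines are not adjacent, so both survive.
    print(list(collapse_adjacent(["2", "1", "2", "3", "4"])))
```

The second call shows why unsorted input gives surprising results: lines are only compared with their immediate predecessor.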
In case the file is not sorted, you need to sort the content first and then use uniq over the sorted content:
sort /tmp/uniques.txt | uniq -u
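If sorting is undesirable, a similar result can be sketched in Python by counting occurrences instead. This is an assumption-laden illustration rather than an equivalent of the pipeline above: the function name `lines_occurring_once` is made up, the whole input is held in memory, and the output keeps the original line order rather than sorted order.

```python
from collections import Counter


def lines_occurring_once(lines):
    """Return the lines that appear exactly once anywhere in the input."""
    lines = list(lines)
    counts = Counter(lines)  # maps each distinct line to its number of occurrences
    return [line for line in lines if counts[line] == 1]


if __name__ == "__main__":
    # Unsorted variant of the question's data; no prior sort is needed.
    print(lines_occurring_once(["2", "1", "2", "3", "4"]))
```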