2 ways to remove duplicate lines from Linux files
There are many ways to remove duplicate lines from a text file on Linux, but here are two that involve the awk and uniq commands and that offer slightly different results.
Remove duplicate lines with awk
The first command we’ll examine in this post is a very unusual awk command that systematically removes every line in the file that is encountered more than once. It leaves the first instance of the line intact, but “remembers” it and removes any duplicates encountered afterwards.
Here’s an example. Initially, the file looks like this:
Once upon a time, there was a lovely princess with a foul temper.
Whenever she went for a walk, she left her castle smiling,
but if she ran into anyone frowning or arguing with someone else,
she stopped and made an angry face.
Continue reading
If the princess ran into a friend who didn't want to chat with her,
she stopped and made an angry face.      <== will be removed
Continue reading                         <== will be removed
The awk command that does this work looks like this:
$ awk '!x[$0]++' grouchy_princess
Once upon a time, there was a lovely princess with a foul temper.
Whenever she went for a walk, she left her castle smiling,
but if she ran into anyone frowning or arguing with someone else,
she stopped and made an angry face.
Continue reading
If the princess ran into a friend who didn't want to chat with her,
Note that each of the duplicated lines is now displayed only once and in its initial position.
In fact, if you simply want to see any duplicated lines, you only need to change the command in a minor way. Just remove the exclamation point (signifying “not”) and you will see only the duplicated lines:
$ awk 'x[$0]++' grouchy_princess
she stopped and made an angry face.
Continue reading
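To see why these one-liners behave as they do, it can help to spell them out. The following is only a sketch of equivalent logic: the function names dedupe and show_dups and the array name seen are made up for illustration, and the input is invented.

```shell
#!/bin/bash
# Longer spelling of awk '!x[$0]++'. The array name "seen" is arbitrary,
# just as "x" is in the one-liner.
dedupe() {
    awk '{
        if (seen[$0] == 0) {   # unset entries compare equal to 0
            print $0           # first time this line has appeared
        }
        seen[$0]++             # remember it so later copies are skipped
    }' "$@"
}

# Longer spelling of awk 'x[$0]++': print only the repeat occurrences.
show_dups() {
    awk '{
        if (seen[$0] > 0) {
            print $0
        }
        seen[$0]++
    }' "$@"
}

# Made-up input for illustration
printf '%s\n' apple banana apple cherry banana | dedupe
echo "---"
printf '%s\n' apple banana apple cherry banana | show_dups
```

The first call prints apple, banana and cherry; the second prints apple and banana. Note that the second form prints a line once for every repeat, so a line that appears three times in the file is shown twice.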
The only problem with the awk '!x[$0]++' command is that it’s not all that easy to remember. On the other hand, it’s also not that hard to turn the command into a simple script. Mine looks like this:
$ cat rmdups
#!/bin/bash

awk '!x[$0]++' "$1"
The awk command removes duplicate lines from whatever file is provided as an argument. If you want to save the output to a file instead of displaying it, make it look like this:
#!/bin/bash
awk '!x[$0]++' "$1" > "$1-new"
Once the script is executable (chmod +x rmdups) and sits in a directory on your search path, you can run it using a command like “rmdups addresses”. If you use the second version, a file with “-new” added to the original file name will contain the output.
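If the goal is to replace the original file rather than create a “-new” copy, one possible variation is sketched below. The function name rmdups_inplace is hypothetical and not part of the script above; the sketch assumes the mktemp command is available, as it is on most Linux systems.

```shell
#!/bin/bash
# Hypothetical helper (not from the original script): rewrite a file
# without its duplicate lines.
rmdups_inplace() {
    local file=$1 tmp
    if [ ! -f "$file" ]; then
        echo "usage: rmdups_inplace FILE" >&2
        return 1
    fi
    tmp=$(mktemp) || return 1
    # Deduplicate into a temporary file first; redirecting straight back
    # to "$file" would empty it before awk had read it.
    if awk '!x[$0]++' "$file" > "$tmp"; then
        cat "$tmp" > "$file"    # overwrite contents, keep the file's permissions
    fi
    rm -f "$tmp"
}
```

Sourcing this in a shell and running “rmdups_inplace addresses” would leave addresses with only the first copy of each line. Since the original is overwritten, keeping a backup while testing is a sensible precaution.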
Remove duplicate lines with uniq
If you don’t need to preserve the order of the lines in the file, using the sort and uniq commands will do what you need in a very straightforward way. The sort command sorts the lines in alphanumeric order. The uniq command ensures that sequential identical lines are reduced to one.
$ sort grouchy_princess | uniq
but if she ran into anyone frowning or arguing with someone else,
Continue reading
If the princess ran into a friend who didn't want to chat with her,
Once upon a time, there was a lovely princess with a foul temper.
she stopped and made an angry face.
Whenever she went for a walk, she left her castle smiling,
In addition, if sorting the contents of your file is helpful, this approach may be ideal. While this technique doesn’t work all that well with fairy tales, it works just fine for lists of meeting attendees, grocery shopping lists, etc.
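The reason sort has to come first is that uniq only compares each line with the one directly before it. A small sketch with a made-up list of names shows the difference:

```shell
#!/bin/bash
# Made-up sample list with a repeated name that is not on adjacent lines
list=$(printf '%s\n' carol alice carol bob bob)

# uniq alone: only the adjacent "bob" lines collapse; both "carol" lines remain
printf '%s\n' "$list" | uniq

echo "---"

# sort first, then uniq: every repeated line is now adjacent and collapses
printf '%s\n' "$list" | sort | uniq
```

The first command prints carol, alice, carol and bob; the second prints alice, bob and carol.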
Because the file name sits between the sort and uniq commands, this pipeline can’t be turned into a simple alias, but it can be turned into a simple script like this:
#!/bin/bash

if [ $# -eq 1 ]; then
    if [ -f "$1" ]; then
        sort "$1" | uniq
    fi
fi
The script verifies that an argument was provided and that it’s an existing file before it sorts it and sends the output to the uniq command.
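If a single command is preferred, the sort command’s -u option gives the same result as piping sort into uniq when whole lines are compared. The sketch below uses a made-up grocery list, and the alias name dedup is only an example.

```shell
#!/bin/bash
# Made-up sample file so the example is self-contained
sample=$(mktemp)
printf '%s\n' milk eggs milk bread eggs > "$sample"

# sort -u sorts the lines and keeps one copy of each
sort -u "$sample"

# Same result as the two-command pipeline
sort "$sample" | uniq

# Because the file name now comes last, an alias is possible in an
# interactive shell ("dedup" is a made-up name):
#   alias dedup='sort -u'
#   dedup grocery_list

rm -f "$sample"
```

Both commands print bread, eggs and milk. The script version still has the advantage of checking that the argument is an existing file before doing anything.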
Wrap-Up
Commands like those shown can be very helpful in cleaning up or verifying the content of text files, particularly lists in which you don’t want any line to show up multiple times. Turning the commands into a script makes it convenient to call on them whenever they might be helpful.
Copyright © 2022 IDG Communications, Inc.