2 ways to remove duplicate lines from Linux files


There are many ways to remove duplicate lines from a text file on Linux, but here are two that involve the awk and uniq commands and that offer slightly different results.

Remove duplicate lines with awk

The first command we’ll examine in this post is a compact awk one-liner that removes every repeated occurrence of a line, no matter how far apart the copies are in the file. It leaves the first instance of each line intact, but “remembers” it and removes any duplicates encountered afterwards.

Here’s an example. Initially, the file looks like this:

Once upon a time, there was a lovely princess with a foul temper.
Whenever she went for a walk, she left her castle smiling,
but if she ran into anyone frowning or arguing with someone else,
she stopped and made an angry face.
Continue reading
If the princess ran into a friend who didn't want to chat with her,
she stopped and made an angry face.		<== will be removed
Continue reading				<== will be removed

The awk command that does this work looks like this:

$ awk '!x[$0]++' grouchy_princess
Once upon a time, there was a lovely princess with a foul temper.
Whenever she went for a walk, she left her castle smiling,
but if she ran into anyone frowning or arguing with someone else,
she stopped and made an angry face.
Continue reading
If the princess ran into a friend who didn't want to chat with her,

Note that each of the duplicated lines is now displayed only once and in its initial position.
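
If the one-liner looks cryptic, it may help to see the same logic spelled out longhand. The sketch below is an equivalent but more verbose form; the array name "seen" and the fruit-list sample file are illustrative choices of mine, not part of the command above.

```shell
#!/bin/bash
# Build a small sample file (hypothetical data, for illustration only).
sample=$(mktemp)
printf '%s\n' apple banana apple cherry banana > "$sample"

# Longhand equivalent of: awk '!x[$0]++'
# seen[$0] counts how many times the current line ($0) has appeared so far.
deduped=$(awk '{
    if (seen[$0] == 0) {   # first time this exact line is encountered
        print $0           # keep it, in its original position
    }
    seen[$0]++             # remember it so later copies are skipped
}' "$sample")

echo "$deduped"
rm -f "$sample"
```

This prints apple, banana and cherry, one per line. The short form works the same way: x[$0]++ yields the count before it is incremented, and the exclamation point turns a zero count into "true", which tells awk to print the line.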

In fact, if you simply want to see any duplicated lines, you only need to change the command in a minor way. Just remove the exclamation point (signifying “not”) and you will see only the duplicated lines:

$ awk 'x[$0]++' grouchy_princess
she stopped and made an angry face.
Continue reading
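
One thing to keep in mind is that this form prints a line every time it repeats, so a line that appears three times would be listed twice. If you would rather see each duplicated line just once, a small variation should do it. This is a sketch using made-up sample data:

```shell
#!/bin/bash
# Hypothetical sample: "a" appears three times, "b" twice, "c" once.
sample=$(mktemp)
printf '%s\n' a b a c a b > "$sample"

# x[$0]++ evaluates to the count *before* incrementing, so it equals 1
# only on the second occurrence of a line; later repeats are skipped.
dups_once=$(awk 'x[$0]++ == 1' "$sample")

echo "$dups_once"
rm -f "$sample"
```

Here the output is just a and b, each listed once, in the order in which their first repeats were found.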

The only problem with the awk '!x[$0]++' command is that it’s not all that easy to remember. On the other hand, it’s also not that hard to turn the command into a simple script. Mine looks like this:

$ cat rmdups
#!/bin/bash
awk '!x[$0]++' "$1"

The awk command removes duplicate lines from whatever file is provided as an argument. If you want to save the output to a file instead of displaying it, make it look like this:

#!/bin/bash
awk '!x[$0]++' "$1" > "$1-new"

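After making the script executable (chmod +x rmdups) and placing it in a directory on your search path, you can run it using a command like “rmdups addresses”. If you use the second version, the output is saved in a file named after the original with “-new” appended (addresses-new in this case).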
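
File names that contain spaces are a common stumbling block for scripts like this, and putting double quotes around $1 is the usual way to guard against it. The sketch below tries the save-to-file logic on such a name; the function name rmdups_save and the "guest list" file are hypothetical, used only so the example can run on its own.

```shell
#!/bin/bash
# Hypothetical demo: a file whose name contains a space.
workdir=$(mktemp -d)
file="$workdir/guest list"
printf '%s\n' Ann Bob Ann Cy Bob > "$file"

# Same logic as the save-to-file version of the script, written as a
# function so it can be tried without creating a script file. The quotes
# around "$1" keep "guest list" together as one file name.
rmdups_save() {
    awk '!x[$0]++' "$1" > "$1-new"
}

rmdups_save "$file"
saved=$(cat "$file-new")
echo "$saved"
rm -rf "$workdir"
```

Without the quotes, the shell would split the name into two words and awk would go looking for two files that don't exist.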

Remove duplicate lines with uniq

If you don’t need to preserve the order of the lines in the file, using the sort and uniq commands will do what you need in a very straightforward way. The sort command sorts the lines in alphanumeric order, which brings identical lines together. The uniq command then reduces each run of adjacent identical lines to one, which is why the sort has to come first.

$ sort grouchy_princess | uniq
but if she ran into anyone frowning or arguing with someone else,
Continue reading
If the princess ran into a friend who didn't want to chat with her,
Once upon a time, there was a lovely princess with a foul temper.
she stopped and made an angry face.
Whenever she went for a walk, she left her castle smiling,

In addition, if sorting the contents of your file is helpful, this approach may be ideal. While this technique doesn’t work all that well with fairy tales, it works just fine for lists of meeting attendees, grocery lists, etc.
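
As a side note, sort has a -u option of its own that drops repeated lines while sorting, so the pipe to uniq can often be skipped. The sketch below compares the two forms on a made-up shopping list. I set LC_ALL=C so that both commands compare lines byte by byte; in other locales the two could, in principle, disagree about which lines count as identical.

```shell
#!/bin/bash
# Hypothetical sample list with two repeated entries.
list=$(mktemp)
printf '%s\n' pears apples pears kiwis apples > "$list"

# LC_ALL=C forces plain byte-order comparison so both commands agree
# on what counts as "identical" regardless of the system locale.
piped=$(LC_ALL=C sort "$list" | LC_ALL=C uniq)
single=$(LC_ALL=C sort -u "$list")

echo "$single"
rm -f "$list"
```

Both variables end up holding apples, kiwis and pears, one per line.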

Because the file name sits in the middle of this pipeline, between sort and uniq, the command can’t be turned into an alias (an alias can only have arguments appended to its end), but it could be turned into a simple script like this:

#!/bin/bash

if [ $# -eq 1 ]; then
  if [ -f "$1" ]; then
    sort "$1" | uniq
  fi
fi

The script verifies that an argument was provided and that it’s an existing file before it sorts it and sends the output to the uniq command.
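
A shell function is another option when an alias won’t do, since a function can put its argument anywhere in the command. Here is a minimal sketch that also reports a problem instead of silently doing nothing; the name sortuniq and the usage message are my own inventions, not standard commands.

```shell
#!/bin/bash
# sortuniq is a hypothetical name. Unlike an alias, a function can place
# its argument in the middle of the pipeline.
sortuniq() {
    if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
        echo "usage: sortuniq FILE" >&2   # report the problem on stderr
        return 1
    fi
    sort "$1" | uniq
}

# Try it on a made-up grocery list.
list=$(mktemp)
printf '%s\n' milk eggs milk bread eggs > "$list"
sorted=$(sortuniq "$list")
echo "$sorted"
rm -f "$list"
```

This prints bread, eggs and milk, one per line. Adding the function definition to a startup file such as ~/.bashrc would make it available in every new shell.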

Wrap-Up

Commands like those shown can be very helpful in cleaning up or verifying the content of text files, particularly lists in which you don’t want any line to show up multiple times. Turning the commands into a script makes it convenient to call on them whenever they might be helpful.

Copyright © 2022 IDG Communications, Inc.


