Searching through compressed files on Linux


There are quite a few ways to search through compressed text files on Linux systems without having to uncompress them first. Depending on the format of the files, you can view entire files, extract specific text, page through the contents looking for content of interest, and sometimes even edit them.

First, to show you how this works, I compressed the words file on one of my Linux systems (/usr/share/dict/words) using these commands:

$ cp /usr/share/dict/words .
$ 7z a words.7z words
$ bzip2 -k words
$ gzip -k words
$ xz -k words
$ zip words.zip words
 

The -k options used with the bzip2, gzip, and xz commands kept these commands from removing the original file, which they would by default. The resultant files then looked like this:

$ ls -l
total 9164
-rw-r--r--. 1 shs shs 4953598 Oct 27 16:11 words
-rw-r--r--. 1 shs shs 1230545 Oct 27 16:14 words.7z
-rw-r--r--. 1 shs shs 1712421 Oct 27 16:11 words.bz2
-rw-r--r--. 1 shs shs 1476067 Oct 27 16:11 words.gz
-rw-r--r--. 1 shs shs 1230236 Oct 27 16:11 words.xz
-rw-r--r--. 1 shs shs 1476203 Oct 28 12:42 words.zip

Viewing compressed-file content

To view the entire content of a compressed file while leaving the compressed file intact, you can use any of these commands:

  • for 7z:  7z x -so words.7z
  • for bz2:  bzcat words.bz2
  • for gz:  zcat words.gz
  • for xz:  xzcat words.xz
  • for zip:  zcat words.zip (single-file archives only) or unzip -p words.zip

For example:

$ bzcat words.bz2 | head -5        $ 7z x -so words.7z | head -5
1080                               1080
10-point                           10-point
10th                               10th
11-point                           11-point
12-point                           12-point

You can also pipe the output to commands like more or grep, or simply watch it scroll rapidly down your screen.

$ 7z x -so words.7z | grep overclever
overclever
overcleverly
overcleverness
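
If you work with a mix of formats, you could wrap the viewing commands above in a small shell function. The sketch below is only an illustration: the name ccat is made up for this example, it picks a tool purely from the file extension, and it uses unzip -p for zip files (an alternative to zcat that also copes with archives holding more than one file).

```shell
# ccat FILE...
# Write the uncompressed content of each file to standard output,
# choosing the tool from the file extension (ccat is a hypothetical name).
ccat() {
    for f in "$@"; do
        case "$f" in
            *.7z)  7z x -so "$f" ;;
            *.bz2) bzcat "$f" ;;
            *.gz)  zcat "$f" ;;
            *.xz)  xzcat "$f" ;;
            *.zip) unzip -p "$f" ;;
            *)     cat "$f" ;;      # anything else is treated as plain text
        esac || return 1            # stop at the first file that fails
    done
}
```

With that function loaded, ccat words.xz | head -5 or ccat words.7z | grep overclever should behave like the commands shown earlier.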

Browsing with less

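You can browse some types of compressed files (bz2, gz and xz) using the less command. This relies on less's input preprocessor (the LESSOPEN setting, usually pointing at lesspipe), which many distributions configure by default. If you see binary data instead of text, the zless, bzless, and xzless commands are an alternative.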

$ less words.bz2        $ less words.gz         $ less words.xz
1080                    1080                    1080
10-point                10-point                10-point
10th                    10th                    10th
11-point                11-point                11-point
12-point                12-point                12-point
...                     ...                     ...

Searching for text in 7z files

The 7z command can list the files included in an archive, but searching their contents requires the extract command (x, with no leading dash). Adding the -so option tells 7z to write the extracted data to standard output rather than to disk, so a command like the one below leaves the compressed file intact and doesn't create an uncompressed copy.

$ 7z x -so words.7z | grep clever | column
clever          cleverest       cleverly        overcleverly    uncleverness
cleverality     clever-handed   cleverness      overcleverness
clever-clever   cleverish       clevernesses    unclever
cleverer        cleverishly     overclever      uncleverly

There doesn’t seem to be a grep-like command for 7z files, but commands like this work very well.
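
If you find yourself typing that pipeline often, a tiny wrapper function can stand in for the missing grep-like command. This is just a sketch: sevenzgrep is a hypothetical name, and it takes the archive first so that everything after it can be handed to grep unchanged.

```shell
# sevenzgrep ARCHIVE [grep options] PATTERN
# Streams the archive's contents to standard output and filters them with
# grep. Nothing is written to disk and the archive is left untouched.
# (sevenzgrep is a made-up name, not a command that ships with 7z.)
sevenzgrep() {
    if [ "$#" -lt 2 ]; then
        echo "usage: sevenzgrep ARCHIVE [grep options] PATTERN" >&2
        return 2
    fi
    archive=$1
    shift
    7z x -so "$archive" | grep "$@"
}
```

For example, sevenzgrep words.7z -c clever would count the matching lines. Keep in mind that the contents of every file in the archive are streamed together, so matches aren't labeled by file.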

Searching for text in other types of compressed files

To search for specific text in compressed files, you can use commands like these:

$ bzgrep overclever words.bz2
$ zgrep overclever words.gz
$ xzgrep overclever words.xz
$ zipgrep overclever words.zip

Each of these commands should print the same three matching words from its compressed file:

overclever
overcleverly
overcleverness
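
Because these commands hand their options on to grep, the usual grep switches should work too. Here's a short sketch using zgrep with -c (count), -i (ignore case), and -n (line numbers). It builds its own tiny sample file in a temporary directory so it can run anywhere; the sample variable and the four words in it are made up for the demonstration.

```shell
# Create a small gzip-compressed sample in a scratch directory.
sample=$(mktemp -d)/sample.gz
printf 'clever\nCleverly\noverclever\nplain\n' | gzip -c > "$sample"

# Count the lines containing "clever" (case-sensitive).
zgrep -c clever "$sample"

# Show matches regardless of case, each with its line number.
zgrep -i -n clever "$sample"
```

The first command should report two matching lines and the second should list three. bzgrep and xzgrep should accept the same options for bz2 and xz files.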

Editing compressed files

Using vi or vim, you can actually edit some compressed files (bz2, gz and xz files) to add, change, or remove content. (This works through the gzip plugin that ships with vim, so a minimal vi build may not offer it.) The files remain compressed on disk, though their sizes will change to reflect your edits.

$ xzcat words.xz | tail -3
Zz
zZt
ZZZ
$ vi words.xz
$ xzcat words.xz | tail -3
zZt
ZZZ
I added this line!
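
If you'd rather script a change than open an editor, much the same result can be had by uncompressing to a temporary file, changing it, and compressing it again. The function below is a rough sketch of that idea and not a description of what vim does internally; append_line is a made-up name, and it handles only the three formats mentioned above.

```shell
# append_line FILE TEXT
# Adds TEXT as a new last line of a .gz, .bz2, or .xz file.
# (append_line is a hypothetical helper written for this example.)
append_line() {
    file=$1
    text=$2
    case "$file" in
        *.gz)  tool=gzip ;;
        *.bz2) tool=bzip2 ;;
        *.xz)  tool=xz ;;
        *)     echo "append_line: unsupported file type: $file" >&2
               return 1 ;;
    esac

    plain=$(mktemp) || return 1
    packed=$(mktemp) || { rm -f "$plain"; return 1; }
    status=1

    # Uncompress, append, and recompress into scratch files first, so the
    # original is only overwritten if every step succeeded.
    if "$tool" -dc "$file" > "$plain" &&
       printf '%s\n' "$text" >> "$plain" &&
       "$tool" -c "$plain" > "$packed"
    then
        cat "$packed" > "$file"     # keeps the original file's permissions
        status=$?
    fi

    rm -f "$plain" "$packed"
    return "$status"
}
```

Running append_line words.xz 'I added this line!' should leave the file looking like the example above.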

Wrap-Up

Given all the ways that you can browse and select content from compressed files, it might be a good time to exercise your “overcleverness” and see how helpful the methods described in this post might be.


Copyright © 2021 IDG Communications, Inc.


