Splitting files on Linux by context


The csplit command is unusual in that it lets you split text files into pieces based on their content. You specify a pattern, and csplit uses the lines that match it as the boundaries between the chunks it saves as separate files.

As an example, if you wanted to separate diary entries into a series of files each with a single entry, you might do something like this.

$ csplit -z diary '/^Dear/' '{*}'
153
123
136

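In this example, “diary” is the name of the file to be split. The command is looking for lines that begin with the word “Dear” as in “Dear Diary” to determine where each chunk begins. The '{*}' argument tells csplit to repeat the pattern as many times as it can; without it, the file would be split only at the first match. The -z option tells csplit to not bother saving files that would be empty.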
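
To try the command without a diary of your own, the following sketch builds a small sample file in a scratch directory and splits it. The sample text and the use of mktemp are assumptions made for illustration; only the csplit invocation comes from the example above.

```shell
# Work in a scratch directory so no existing xx* files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical three-entry diary; each entry starts with "Dear Diary,".
cat > diary <<'EOF'
Dear Diary,

Today was a difficult day.

Dear Diary,

Today was better.

Dear Diary,

Today it rained.
EOF

# Split before every line that begins with "Dear"; -z drops empty pieces.
csplit -z diary '/^Dear/' '{*}'

# List the resulting pieces.
ls xx*
```

Because the first line of the sample already matches the pattern, the piece that would precede it is empty, which is exactly the case -z is meant to handle.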

You can list the files that were just created with a command like the following, which limits the output of the ls command to the most recently modified files. The three numbers that csplit printed above are the sizes, in bytes, of the three files it created.

$ ls -ltr | tail -3
-rw-r--r--.  1 shs  shs        136 Jan  1 15:02 xx02
-rw-r--r--.  1 shs  shs        123 Jan  1 15:02 xx01
-rw-r--r--.  1 shs  shs        153 Jan  1 15:02 xx00
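
One quick sanity check, not part of the original example, is to confirm that the pieces add up to the original file. The sketch below assumes a hypothetical two-entry sample file and uses the -s option to suppress the byte counts; concatenating the pieces in name order should reproduce the input exactly.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample diary with two entries.
printf 'Dear Diary,\nFirst entry.\n\nDear Diary,\nSecond entry.\n' > diary

# -s (--quiet) suppresses the byte counts csplit normally prints.
csplit -s -z diary '/^Dear/' '{*}'

# The glob expands in name order (xx00, xx01, ...), so the pieces are
# concatenated in their original sequence; cmp is silent when they match.
cat xx* | cmp - diary && echo "pieces match the original"
```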

You could also use the full phrase for the separator line:

$ csplit -z diary '/^Dear Diary,/' '{*}'

In either case, the xx00 file will look like this:

$ cat xx00
Dear Diary,

Today was a difficult day. I dragged a dozen bags of trash to the transfer
station and came home to find a dozen more waiting on my porch.

The xx00, xx01, xx02, etc. file naming is the default. If you split another file in the same directory, these output files will be overwritten by the newer ones unless you use the -f or --prefix option to replace “xx” with something more meaningful, as in the example below, in which the word “diary” is used to name the files.

$ csplit -zf diary diary '/^Dear/' '{*}'
153
123
136
$ ls -ltr | tail -3
-rw-r--r--.  1 shs  shs        123 Jan  1 15:11 diary01
-rw-r--r--.  1 shs  shs        153 Jan  1 15:11 diary00
-rw-r--r--.  1 shs  shs        136 Jan  1 15:11 diary02
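
As a small illustration of why the prefix matters, the sketch below splits two files in the same directory without either set of pieces overwriting the other. The file names, prefixes and contents are hypothetical; -f and --prefix are the short and long forms of the same option.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two hypothetical diaries that would both produce xx00, xx01, ... by default.
printf 'Dear Diary,\nWork notes.\n\nDear Diary,\nMore work notes.\n' > work
printf 'Dear Diary,\nTravel notes.\n\nDear Diary,\nMore travel notes.\n' > travel

# Give each split its own prefix so the second run does not overwrite the first.
csplit -s -z -f work_ work '/^Dear/' '{*}'
csplit -s -z --prefix=travel_ travel '/^Dear/' '{*}'

# List both sets of pieces.
ls work_* travel_*
```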

If the entries in the file you want to split each begin with a date, you might try a command like this that looks for a portion of the date field:

$ csplit -zf diary diary '/, 202/' '{*}'
166
136
149
$ cat diary00
Dec 11, 2021
Dear Diary,

Today was a difficult day. I dragged a dozen bags of trash to the transfer
station and came home to find a dozen more waiting on my porch.
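
A short pattern such as ', 202' will also match any line in the body of an entry that happens to contain that text. One possible refinement, sketched below with a hypothetical sample file, anchors the pattern to a line consisting only of a date in the "Dec 11, 2021" style. The pattern is an assumption about how the dates are formatted, and it sticks to basic regular expression syntax.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical diary; the first entry's body also contains ", 202".
cat > diary <<'EOF'
Dec 11, 2021
Dear Diary,

We plan to move on Jan 5, 2022 if all goes well.

Dec 12, 2021
Dear Diary,

A quieter day.
EOF

# Match only lines that consist entirely of a date such as "Dec 11, 2021".
csplit -s -z -f diary diary '/^[A-Z][a-z][a-z] [0-9][0-9]*, [0-9][0-9][0-9][0-9]$/' '{*}'

# List the pieces (diary00, diary01, ...), not the original file.
ls diary[0-9]*
```

With the looser ', 202' pattern, the same sample would be cut in the middle of the first entry, at the line that mentions the move.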

If you want to add a file extension to your output files, you can specify it with the -b (--suffix-format) option, as in the command shown below, which uses “.txt” as the file extension. The %02d specifies that two zero-padded digits are to be used. This is the default, but if you want four digits, just change the 2 to a 4.

$ csplit -z -b "%02d.txt" diary '/, 20/' '{*}'
10
166
136
149
$ ls -ltr | tail -4
-rw-r--r--.  1 shs  shs        149 Jan  1 15:53 xx03.txt
-rw-r--r--.  1 shs  shs        136 Jan  1 15:53 xx02.txt
-rw-r--r--.  1 shs  shs        166 Jan  1 15:53 xx01.txt
-rw-r--r--.  1 shs  shs         10 Jan  1 15:53 xx00.txt
$ cat xx01.txt
Dec 11, 2021
Dear Diary,

Today was a difficult day. I dragged a dozen bags of trash to the transfer
station and came home to find a dozen more waiting on my porch.
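
The four-digit variant mentioned above might look like the following sketch, which uses a hypothetical sample file and made-up prefixes. It also shows the -n (--digits) option, which changes the digit count on its own when no extension is needed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical two-entry diary.
printf 'Dear Diary,\nFirst entry.\n\nDear Diary,\nSecond entry.\n' > diary

# -b takes a printf-style format: four zero-padded digits plus ".txt".
csplit -s -z -f entry -b '%04d.txt' diary '/^Dear/' '{*}'

# Without an extension, -n alone changes the number of digits.
csplit -s -z -f plain -n 4 diary '/^Dear/' '{*}'

# List both sets of pieces.
ls entry* plain*
```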

Wrap-Up

The csplit command can make splitting files into pieces based on meaningful breaks fairly easy and includes enough options to help you get exactly the result you want.


Copyright © 2022 IDG Communications, Inc.


