Splitting files on Linux by context
The csplit command is unusual in that it allows you to split text files into pieces based on their content. You specify a contextual string (a regular expression), and csplit uses it as a delimiter to identify the chunks to be saved as separate files.
As an example, if you wanted to separate diary entries into a series of files each with a single entry, you might do something like this.
$ csplit -z diary '/^Dear/' '{*}'
153
123
136
In this example, “diary” is the name of the file to be split. The command is looking for lines that begin with the word “Dear” as in “Dear Diary” to determine where each chunk begins. The -z option tells csplit to not bother saving files that would be empty.
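To try the command without touching a real diary, a minimal sketch along these lines may help. The scratch directory and the three sample entries are invented for illustration; only the csplit invocation comes from the example above.

```shell
# Work in a scratch directory so the xx* output files stay out of the way.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: three short entries, each starting with "Dear Diary,".
cat > diary <<'EOF'
Dear Diary,
Entry one.
Dear Diary,
Entry two.
Dear Diary,
Entry three.
EOF

# Split before every line that begins with "Dear". '{*}' repeats the pattern
# as often as it matches, and -z drops the empty piece that would otherwise
# precede the first match.
csplit -z diary '/^Dear/' '{*}'

# List the pieces that were created.
ls xx*
```

csplit should report one byte count per piece, and the listing should show one xx file for each entry.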
You can list the files that were just created by using a command like the following that limits the output of the ls command to the most recent files. The three numbers that csplit printed (153, 123 and 136) are the sizes, in bytes, of the three files it created, and they match the sizes in the listing.
$ ls -ltr | tail -3
-rw-r--r--. 1 shs shs 136 Jan 1 15:02 xx02
-rw-r--r--. 1 shs shs 123 Jan 1 15:02 xx01
-rw-r--r--. 1 shs shs 153 Jan 1 15:02 xx00
You could also use the full phrase for the separator line:
$ csplit -z diary '/^Dear Diary,/' '{*}'
In either case, the xx00 file will look like this:
$ cat xx00
Dear Diary,
Today was a difficult day. I dragged a dozen bags of trash to the transfer station and came home to find a dozen more waiting on my porch.
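One way to reassure yourself that no text was lost in the split is to concatenate the pieces and compare the result with the original. The following is a sketch under stated assumptions: the sample file is made up, and the -s option (which silences the byte counts) is used only to keep the output short.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical two-entry diary, invented for this check.
printf 'Dear Diary,\nfirst entry\nDear Diary,\nsecond entry\n' > diary

csplit -s -z diary '/^Dear/' '{*}'

# The shell expands the glob in sorted order (xx00, xx01, ...), so cat
# rebuilds the input; cmp -s compares it with the original silently.
if cat xx[0-9][0-9] | cmp -s - diary; then
    echo "pieces match the original"
else
    echo "pieces differ from the original"
fi
```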
The xx00, xx01, xx02, etc. file naming is the default. If you split an additional file, these output files will be overwritten by the newer files unless you use the -f or --prefix option to replace “xx” with something more meaningful, as in the example below, in which the word “diary” is used to name the files.
$ csplit -zf diary diary '/^Dear/' '{*}'
153
123
136
$ ls -ltr | tail -3
-rw-r--r--. 1 shs shs 123 Jan 1 15:11 diary01
-rw-r--r--. 1 shs shs 153 Jan 1 15:11 diary00
-rw-r--r--. 1 shs shs 136 Jan 1 15:11 diary02
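Giving each input its own prefix keeps the pieces from one split from overwriting those of another. Here is a small sketch of that idea; the file names "work" and "home" and their contents are hypothetical, and --prefix is simply the long form of -f.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two hypothetical input files, invented for this sketch.
printf 'Dear Diary,\nwork entry 1\nDear Diary,\nwork entry 2\n' > work
printf 'Dear Diary,\nhome entry 1\nDear Diary,\nhome entry 2\n' > home

# A distinct prefix per input means neither split touches the other's files.
# -s (--quiet) suppresses the byte counts.
csplit -s -z --prefix=work_ work '/^Dear/' '{*}'
csplit -s -z --prefix=home_ home '/^Dear/' '{*}'

ls
```

With the default "xx" prefix, the second command would have replaced xx00 and xx01 from the first.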
If the entries in the file you want to split are separated by dates, you might try a command like this that looks for a portion of the date field:
$ csplit -zf diary diary '/, 202/' '{*}'
166
136
149
$ cat diary00
Dec 11, 2021
Dear Diary,
Today was a difficult day. I dragged a dozen bags of trash to the transfer station and came home to find a dozen more waiting on my porch.
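A short pattern such as ', 202' will also match a date that happens to appear in the middle of an entry. If that is a concern, one option is to anchor the pattern to the whole line. The sketch below assumes date lines of the form "Dec 11, 2021" on a line by themselves; the sample entries and the "entry" prefix are made up, and the expression sticks to brackets and * so that it stays within basic regular expression syntax.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical diary in which one entry mentions a date in its text.
cat > diary <<'EOF'
Dec 11, 2021
Dear Diary,
Trash day.
Dec 12, 2021
Dear Diary,
Mentioned Jan 3, 2022 in passing.
Jan 1, 2022
Dear Diary,
Happy new year.
EOF

# Match only lines that consist entirely of a date like "Dec 11, 2021":
# three-letter month, a day number, a comma, and a year starting with 20.
csplit -s -z -f entry diary '/^[A-Z][a-z][a-z] [0-9][0-9]*, 20[0-9][0-9]$/' '{*}'

ls entry*
```

With this input, the anchored pattern should produce three pieces, one per dated entry, whereas the looser ', 202' pattern would also split at the line that mentions a date in passing.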
If you want to add a file extension to your output files, you can specify it with the -b (--suffix-format) option, as in the command shown below that uses “.txt” as the file extension. The %02d in the format specifies that two digits are to be used. Two digits is the default, but if you want 4 digits, just change the 2 to a 4.
$ csplit -z -b "%02d.txt" diary '/, 20/' '{*}'
10
166
136
149
$ ls -ltr | tail -4
-rw-r--r--. 1 shs shs 149 Jan 1 15:53 xx03.txt
-rw-r--r--. 1 shs shs 136 Jan 1 15:53 xx02.txt
-rw-r--r--. 1 shs shs 166 Jan 1 15:53 xx01.txt
-rw-r--r--. 1 shs shs 10 Jan 1 15:53 xx00.txt
$ cat xx01.txt
Dec 11, 2021
Dear Diary,
Today was a difficult day. I dragged a dozen bags of trash to the transfer station and came home to find a dozen more waiting on my porch.
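The width of the counter can be changed in two ways, which the following sketch compares. The sample file and the "entry" and "plain" prefixes are hypothetical. The -n (--digits) option is a GNU csplit option that sets the number of digits without a format string, though it cannot add an extension.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical two-entry diary, invented for this sketch.
printf 'Dear Diary,\none\nDear Diary,\ntwo\n' > diary

# Four-digit counter plus a .txt extension, set through the format string.
csplit -s -z -f entry -b '%04d.txt' diary '/^Dear/' '{*}'

# -n widens the counter as well, but the names get no extension.
csplit -s -z -f plain -n 4 diary '/^Dear/' '{*}'

ls
```

The listing should show entry0000.txt and entry0001.txt from the first command and plain0000 and plain0001 from the second.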
Wrap-Up
The csplit command can make splitting files into pieces based on meaningful breaks fairly easy and includes enough options to help you get exactly the result you want.
Copyright © 2022 IDG Communications, Inc.