Extracting substrings on Linux
There are many ways to extract substrings from lines of text using Linux and doing so can be extremely useful when preparing scripts that may be used to process large amounts of data. This post describes ways you can take advantage of the commands that make extracting substrings easy.
Using bash parameter expansion
When using bash parameter expansion, you specify the starting position (offset) and the number of characters that you want to extract. For example, you can create a variable by assigning it a value and then use syntax like that shown below to select a portion of it.
$ string="Happy days are here again"
$ echo ${string:1:10}
appy days
$ echo ${string:0:10}
Happy days
The example above shows that this technique numbers positions starting at 0. In the next example, the 7 is therefore the offset of the eighth character, and the -2 in the length position means to stop two characters before the end of the string. As a result, the first substring below is a single character and the second is everything but the last two characters.
$ string="1234567890"
$ echo ${string:7:-2}
8
$ echo ${string:0:-2}
12345678
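As a minimal sketch of the offset and length forms described above, the following script stores a few substrings in variables instead of echoing them directly. The variable names are illustrative, and the negative-length form assumes bash 4.2 or later.

```shell
#!/usr/bin/env bash
# Sketch: offset/length forms of bash substring expansion.
string="Happy days are here again"

first_word=${string:0:5}   # offset 0, length 5
last_word=${string: -5}    # last 5 characters; the space before -5 is required
middle=${string:6:-6}      # from offset 6 up to 6 characters before the end

# Brackets make any leading or trailing blanks visible.
printf '[%s]\n' "$first_word" "$last_word" "$middle"
```

Quoting the expansions in the printf call preserves any blanks inside the substring, which an unquoted echo would collapse.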
In this next example, we first set the positional parameter $1 using “set --” and then use echo to display its eighth and ninth characters. In other words, the expansion starts at the eighth character (offset 7) and displays two characters.
$ set -- 01234567890abcdef
$ echo ${1:7:2}
78
NOTE: You could display the string created with the set command by simply using the command “echo $1”. This is what is referenced by the “1” in the example above.
$ set -- 01234567890abcdef
$ echo $1
01234567890abcdef
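The same expansion works on the positional parameters of a function, which is one way to reuse it in a script. The helper below is hypothetical (the name substring and its argument order are assumptions made for illustration, not a standard command).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print LEN characters of STR starting at zero-based OFFSET.
substring() {
    local str=$1 offset=$2 len=$3
    # Offset and length are evaluated arithmetically, so bare variable names work.
    printf '%s\n' "${str:offset:len}"
}

substring 01234567890abcdef 7 2
```

Called as shown, the function prints the same two characters as the set example above.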
Using cut
The cut command can be used in several ways to yank substrings from text. The -c option allows you to select the character positions to be displayed. For cut, character numbering starts at 1.
$ echo "12345" | cut -c 1-3
123
In this next example, we select the last two words by character position. If you select more characters than are available, it doesn’t affect the output.
$ echo "Have some fun" | cut -c 6-13
some fun
$ cut -c 6-13 <<< "Have some fun"
some fun
$ echo "Have some fun" | cut -c 6-20
some fun
In addition, you can pipe text to the cut command or use the cut command to work with text in a file. Just be sure that the positions work for every line.
$ cat myfile
Have some fun
Grab your lunch
Take nice nap
$ cut -c 6-15 myfile
some fun
your lunch
nice nap
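When line lengths vary, open-ended ranges can be easier than guessing an end position. The sketch below shows three range forms that cut accepts; the variable names are only for illustration.

```shell
#!/usr/bin/env bash
line="Have some fun"

# "6-" means character 6 through the end of the line.
rest=$(printf '%s\n' "$line" | cut -c 6-)
# "-4" means the first character through character 4.
start=$(printf '%s\n' "$line" | cut -c -4)
# Comma-separated ranges are joined in the output.
picked=$(printf '%s\n' "$line" | cut -c 1-4,10-13)

printf '[%s]\n' "$rest" "$start" "$picked"
```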
The cut command can also work with delimiters and this often makes it a lot easier to use with files in which the words or fields don’t line up precisely. To work with a file of mailing addresses, for example, you could do this to pull out the third field in the comma-separated addresses:
$ cat addresses
6803 Gravel Road,Hurlock,MD
121 Blueberry Drive,Outback,VA
1427 N 12th Street,Reading,PA
2001 Turtle Road,Baker,WV
264 Dakota Street,Groton,CT
111 Mindless Circle,Celery,TX
1089 Plymouth Drive,Rahway,NJ
949 Endless Lane,Hoboken,NJ
2001 Turtle Road,Outback,VA
$ cut -d, -f3 addresses
MD
VA
PA
WV
CT
TX
NJ
NJ
VA
You can select multiple fields by specifying a range (e.g., “2-3”) or a sequence (e.g., “2,3”) as shown below.
$ cut -d, -f2-3 addresses
Hurlock,MD
Outback,VA
Reading,PA
Baker,WV
Groton,CT
Celery,TX
Rahway,NJ
Hoboken,NJ
Outback,VA
$ cut -d, -f2,3 addresses
Hurlock,MD
Outback,VA
Reading,PA
Baker,WV
Groton,CT
Celery,TX
Rahway,NJ
Hoboken,NJ
Outback,VA
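One quirk worth knowing when selecting fields: cut prints a line unchanged if it contains no delimiter at all, unless you add the -s option. The sketch below uses a made-up second line to show the difference.

```shell
#!/usr/bin/env bash
# Two sample lines; the second has no comma (invented for illustration).
data=$(printf '%s\n' "121 Blueberry Drive,Outback,VA" "no delimiter on this line")

# Without -s, the delimiter-free line passes through whole.
without_s=$(printf '%s\n' "$data" | cut -d, -f3)
# With -s, lines lacking the delimiter are suppressed.
with_s=$(printf '%s\n' "$data" | cut -d, -f3 -s)

printf 'without -s:\n%s\n' "$without_s"
printf 'with -s:\n%s\n' "$with_s"
```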
Using awk
The awk command can also be used to extract substrings. Here’s an example of pulling text from a supplied phrase:
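$ awk '{print substr($0,6,8)}' <<< "Wash your car"
your car
The $0 represents the complete input line. The 6 is the starting position, which awk counts from 1, and the 8 is the number of characters to print.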
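If you don't know the positions in advance, awk can compute them. The following sketch relies on two standard awk behaviors: substr with only two arguments runs to the end of the string, and index returns the 1-based position of a substring (or 0 if it is absent). The variable names are illustrative.

```shell
#!/usr/bin/env bash
phrase="Wash your car"

# Two-argument substr: everything from position 6 onward.
from_six=$(printf '%s\n' "$phrase" | awk '{ print substr($0, 6) }')
# Locate "car" with index(), then print from that position.
from_car=$(printf '%s\n' "$phrase" | awk '{ p = index($0, "car"); if (p) print substr($0, p) }')

printf '[%s]\n' "$from_six" "$from_car"
```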
To work with a file with delimited fields, use the -F (field delimiter) option. In this case, the delimiter is a comma. Use -F':' if the file is colon-delimited.
$ awk -F',' '{print $3}' addresses | sort | uniq
CT
MD
NJ
PA
TX
VA
WV
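If your fields are separated by a comma followed by a space, that is no problem for awk. Just specify both characters as the delimiter, as in this command (the output shown assumes a version of the addresses file with a space after each comma):
$ awk -F', ' '{print $3}' addresses | sort | uniq
CT
MD
NJ
PA
TX
VA
WV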
In fact, if you want the awk command to work regardless of whether fields are separated with just commas or both commas and blanks, you can do this:
$ awk -F', ?' '{print $3}' addresses | sort | uniq
CT
MD
NJ
PA
TX
VA
WV
Using awk, you can also display two fields by using syntax like this:
$ awk -F',' '{print $2,$3}' addresses | sort | uniq
Baker WV
Celery TX
Groton CT
Hoboken NJ
Hurlock MD
Outback VA
Rahway NJ
Reading PA
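Notice that awk separates the two fields with a space in its output rather than the original comma. If you want to keep the comma, one option is to set the output field separator (OFS), as in this small sketch that uses a single line from the addresses file; sort -u is shown as a shorter equivalent of sort | uniq.

```shell
#!/usr/bin/env bash
line="6803 Gravel Road,Hurlock,MD"

# -v OFS=',' makes the comma in the print statement emit a literal comma.
joined=$(printf '%s\n' "$line" | awk -F',' -v OFS=',' '{ print $2, $3 }')
echo "$joined"

# sort -u sorts and removes duplicate lines in one step.
states=$(printf '%s\n' NJ VA NJ | sort -u)
echo "$states"
```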
Using expr
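To use the expr command, type “expr substr” followed by your string, the start position (counting from 1) and the length of the substring. Note that substr is a GNU extension to expr, so it is available on Linux but not necessarily on other Unix systems.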
$ expr substr "Have some fun" 6 8
some fun
$ str="Have some fun"
$ expr substr "$str" 6 8
some fun
Wrap-Up
There are lots of ways to extract substrings on Linux, but each of the commands you might use has its own quirks and its own advantages.
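One quirk that is easy to trip over is the numbering: bash parameter expansion counts from 0, while cut, awk and expr count from 1. As a rough side-by-side sketch (assuming bash and GNU expr, with illustrative variable names), the four commands below all extract the same eight characters.

```shell
#!/usr/bin/env bash
str="Have some fun"

via_bash=${str:5:8}                                               # zero-based offset, length
via_cut=$(printf '%s\n' "$str" | cut -c 6-13)                     # one-based start and end
via_awk=$(printf '%s\n' "$str" | awk '{ print substr($0,6,8) }')  # one-based start, length
via_expr=$(expr substr "$str" 6 8)                                # one-based start, length (GNU expr)

printf '%s\n' "$via_bash" "$via_cut" "$via_awk" "$via_expr"
```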
Copyright © 2022 IDG Communications, Inc.