From the course: Red Hat Certified System Administrator (EX200) Cert Prep: 1 Deploy, Configure, and Manage (2021)

Use grep and regular expressions to analyze text

- One of the most efficient ways of searching for data in a file is to use grep. Grep shows the lines in a file that match the provided search criteria. The syntax is grep, followed by options, then the search criteria, and lastly the name of the file to search through.

Grep has many options, but the ones I like to use most are -i for case-insensitive searches; -v for inverted searches, which show the lines that do not match the search criteria; -c for the number of lines that matched; -o to show only the characters that matched, not the entire line; -r to grep recursively, which searches through all files in a directory and its subdirectories; and -E to use extended regular expressions. This is the same as using egrep. There are many more options; view the grep man page for more information.

This example of using options shows a case-insensitive search for the word root in /etc/passwd. You can also pipe the output of any command into grep. In this example, I'm using the find command to search the entire operating system for files with names ending in .txt. This output is then piped into grep, which searches it for the word apache. This is a very common use for grep.

To make our search criteria more strict, we can employ anchors. The caret anchor ties the match to the beginning of the line. The dollar sign anchor ties it to the end. A search pattern of ^$ (caret, dollar sign) might not seem that useful. It anchors to the beginning and the end, with nothing allowed in the middle. This very simple pattern finds blank lines. You might wonder why we would want to do that, but if you combine it with the -v invert option, it will only show non-blank lines. It's very clever.

We have criteria for matching characters. To match one character of any type, we use the dot. To match more characters, we use a modifier. The asterisk is a common modifier. It says to match zero or more of the previous character.
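The options, the pipe, the anchors, and the asterisk modifier described so far can be tried with a rough sketch like the one below. The sample file contents and the variable names are made up for illustration, and the find command from the talk is simulated with printf so the sketch does not scan the whole system.

```shell
# Hypothetical sample data, written to a temporary file.
sample=$(mktemp)
printf 'root:x:0:0\n\nROOT backup\ndaemon:x:1:1\n\nchroot helper\n' > "$sample"

# -i: case-insensitive search (matches root, ROOT, and chroot).
insensitive=$(grep -i 'root' "$sample")

# -c: print the number of matching lines instead of the lines themselves.
count=$(grep -c 'root' "$sample")

# -o: print only the matched text, one match per output line.
only=$(grep -o -i 'root' "$sample")

# ^ anchors the match to the start of the line.
starts=$(grep '^root' "$sample")

# ^$ matches blank lines; -v inverts that, leaving the non-blank lines.
nonblank=$(grep -v '^$' "$sample")

# Piping into grep. On a real system this might look like:
#   find / -name '*.txt' 2>/dev/null | grep 'apache'
piped=$(printf '/srv/apache/readme.txt\n/home/me/notes.txt\n' | grep 'apache')

# b* means zero or more of the previous character, b.
star=$(printf 'ac\nabc\nabbbc\nadc\n' | grep '^ab*c$')

printf '%s\n' "$nonblank"
rm -f "$sample"
```

Wrapping each pattern in single quotes keeps the shell from interpreting characters such as the asterisk or the dollar sign before grep sees them.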
For instance, we use a dot, which matches any one character, and follow it with an asterisk modifier, which matches zero or more of the previous character. The result, .* (dot asterisk), matches any number of any characters. This is equivalent to the asterisk by itself in file globs.

Character sets match exactly one character, just like the dot, but we can be specific about which character to match by including a list of possible characters, such as [abc]. The one character that matches can be a, b, or c. A set can also contain ranges of characters, such as 0-9 or a-z. Just remember that the set will only match one character, but that one character can be any of the items in the list. To match everything except what is in the character set, make sure the first character inside the brackets is a caret. This will negate the match. (The exclamation mark does this job in shell file globs, not in regular expressions.)

A character class is a reliable, POSIX-compliant way of matching certain types of characters, such as numbers ([:digit:]), uppercase and lowercase letters ([:upper:] and [:lower:]), or both at the same time ([:alpha:]), as well as letters and numbers together ([:alnum:]). They can also match whitespace, which includes spaces, tabs, and newlines ([:space:]). They can match printable characters not including spaces ([:graph:]), printable characters including spaces ([:print:]), punctuation ([:punct:]), and even non-printable control characters ([:cntrl:]) or hexadecimal digits ([:xdigit:]). These are very difficult to match using standard sets.

You can place character classes inside of sets, so user[0-9] is the same as user[[:digit:]], and either will match user0, user1, user2, and so on, up to user9. You can also add more than one character class inside of a set. In this example, it will match user0 through user9, or user with a trailing space but no number.

Negating a character class is exactly the same as negating items in a set. I've included it because it looks odd.
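As a minimal sketch of sets and character classes, the commands below run against a made-up list of user names (the sample data and variable names are assumptions, not the on-screen examples from the course). Note that grep negates a set with a caret as the first character inside the brackets.

```shell
# Hypothetical sample data: one candidate name per line.
users=$(mktemp)
printf 'user0\nuser7\nuserA\nuser \nuser_x\n' > "$users"

# A range and the equivalent POSIX class match the same lines.
by_range=$(grep 'user[0-9]' "$users")
by_class=$(grep 'user[[:digit:]]' "$users")

# Two classes in one set: user followed by a digit or by whitespace.
digit_or_space=$(grep 'user[[:digit:][:space:]]' "$users")

# Negated class: user followed by any one character that is not a digit.
not_digit=$(grep 'user[^[:digit:]]' "$users")

printf '%s\n' "$by_class"
rm -f "$users"
```

The doubled brackets are easy to misread: the outer pair is the set, and [:digit:] is the class named inside it.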
Just place the caret right before the character class, as the first character inside the set, and it will negate the character class. In this example, it will find user followed by anything but a digit.

There are two types of regular expressions, basic and extended. Extended regular expressions do everything that basic regular expressions do, and I'd recommend using them whenever possible, as the syntax is actually simpler than that of basic regular expressions. Not every tool supports them, but you can use them with sed -r, egrep, awk, and bash's built-in regular expression operator (=~).

In brief, one dot matches one character of any type, just like before. An asterisk modifier matches zero or more of the previous character, a question mark modifier matches zero or one of the previous character, and a plus modifier matches one or more of the previous character. If we want to specify exactly how many matches, we can place a number inside of curly braces. For instance, for two of the previous character, we place a 2 inside of curly braces: {2}. You can do this with basic regular expressions with GNU tools, but you need to escape the curly braces with backslashes. To specify a range, such as two, three, or four of the previous character, we can use {2,4}.

We can also group matches if we want a modifier to apply to more than one character. For instance, if we wanted to match more than one occurrence of the characters ab together, we could put ab in parentheses and follow the group with a modifier to specify the number of occurrences. (ab){2} would match two occurrences of ab, such as abab.

Extended regular expressions also support alternation. By creating a group and placing two items in it, separated by a pipe, as in (cat|dog), we can match cat or dog. This is really the one big advantage of extended over basic regular expressions.

Let's look at an example. The caret anchors to the beginning of the line, so the pattern will only show lines that start with http.
The dot matches any one character, and the asterisk after it modifies it to any number of characters, so .* will match any number of any characters. The letters tcp are literal, the .* again matches any number of any characters, and lastly, the word service, followed by the dollar sign, anchors to the end. This will only match lines that end with the word service.

Using grep -E, or egrep, we can use alternation: we can match either tcp or udp by creating a group with both patterns separated by a pipe. This will match either tcp or udp in this line.
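The extended modifiers, grouping, and the final example can be sketched roughly as follows. The pattern ^http.*tcp.*service$ is reconstructed from the spoken description rather than copied from the course slides, and the sample lines only imitate the layout of a services file; both are assumptions for illustration.

```shell
# Modifiers and groups with extended regular expressions (-E).
optional=$(printf 'color\ncolour\ncolouur\n' | grep -E '^colou?r$')   # ? is zero or one
one_plus=$(printf 'gle\ngogle\ngoogle\n' | grep -E '^go+gle$')        # + is one or more
two_to_four=$(printf 'a\naa\naaaa\naaaaa\n' | grep -E '^a{2,4}$')     # {2,4} is two to four
two_ab=$(printf 'ab\nabab\nababab\n' | grep -E '^(ab){2}$')           # group repeated twice

# The same group in a basic regular expression needs backslashes.
two_ab_basic=$(printf 'ab\nabab\nababab\n' | grep '^\(ab\)\{2\}$')

# Hypothetical sample data laid out like a services file.
services=$(mktemp)
cat > "$services" <<'EOF'
http-alt   8008/tcp   # example http service
http-alt   8008/udp   # example http service
https      443/tcp    # secure web
ftp        21/tcp     # file transfer service
EOF

# Starts with http, contains tcp, ends with service.
tcp_count=$(grep -c '^http.*tcp.*service$' "$services")

# Alternation: accept either tcp or udp in the middle.
either_count=$(grep -c -E '^http.*(tcp|udp).*service$' "$services")

printf 'tcp only: %s, tcp or udp: %s\n' "$tcp_count" "$either_count"
rm -f "$services"
```

Anchoring both ends is what excludes the https and ftp lines here: one does not end with service, and the other does not start with http.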
