A beginner's guide to regular expressions with grep


A regular expression (also called regex or regexp) is a rule that a computer can use to match characters or groups of characters within a larger body of text. For example, using regular expressions, you can find all instances of the word cat in a document, or all instances of a word that begins with c and ends with t.

Using regular expressions in the real world can be much more complex and powerful than that. For example, suppose you need to write code to verify that the entire body of an HTTP POST request is free from script injection attacks. Malicious code can appear in many ways, but you know that injected script code will always appear between the HTML tags <script></script>. You can apply the regular expression <script>.*</script>, which matches any block of text bracketed by <script> tags, to the body of the HTTP request as part of your search for injected script code.
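As a rough sketch of that idea, the following commands apply the same regular expression with grep, the tool used throughout this article. The request body is a made-up value for illustration, and the example assumes a GNU grep build that supports the -P option; real injection checks need far more than one pattern.

```shell
# Hypothetical request body, invented for this sketch.
body='name=Bob&comment=<script>alert("hi")</script>&page=2'

# -o prints only the text matched by the regular expression.
# "|| true" keeps the script going if grep finds nothing.
injected=$(echo "$body" | grep -Po '<script>.*</script>' || true)

echo "$injected"
```

Only the text bracketed by the <script> tags is printed; the rest of the body is ignored.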

This example is just one of many uses of regular expressions. In this series, you will learn more about how the syntax of this and other regular expressions works.

As just demonstrated, a regex can be a powerful tool for finding text according to a particular pattern in a variety of situations. Once mastered, regular expressions provide developers with the ability to locate text patterns in source code and design-time documentation. You can also apply regular expressions to text that is subject to algorithmic processing at run time, such as content in HTTP requests or event messages.

Regular expressions are supported by many programming languages, as well as classic command-line applications such as awk, sed, and grep, which were developed for Unix many decades ago and are now offered on GNU/Linux.

This article examines the basics of using regular expressions under grep. The article shows how you can use a regular expression to declare a pattern that you want to match and describes the essential building blocks of regular expressions, with many examples. This article does not assume any prior knowledge of regular expressions, but you should know how to work with the Linux operating system at the command line.

What are regular expressions and what is grep?

As we have noted, a regular expression is a rule used to match characters in text. These rules are declarative, meaning they are immutable: once declared, they do not change. But a single rule can be applied to any variety of situations.

Regular expressions are written in a special language. Although this language has been standardized, dialects vary from one regular expression engine to another. For example, JavaScript has its own regex dialect, as do C++, Java, and Python.

This article uses the regular expression dialect that accompanies the Linux grep command, with an extension to support more powerful features. grep is a binary executable that filters the contents of a file or the output of other commands (stdout). Regular expressions are fundamental to grep: The re in the middle of the name means “regular expression”.

This article uses grep because you do not need to set up a particular coding environment or write any code to work with the regular expression examples shown in this article. All you need to do is copy and paste an example into the command line of a Linux terminal and you will see the results immediately. The grep command can be used in any shell.

Because this article focuses on regular expressions as a language and not on file manipulation, the examples pipe sample text to grep instead of reading input files.

How to use grep against the contents of a file

To print lines in a file that match a regular expression, use the following syntax:

$ grep -options <regular_expression> /paths/to/files

In this command syntax:

  • -options, if specified, controls the behavior of the command.
  • <regular_expression> indicates the regular expression to run on the files.
  • /paths/to/files indicates one or more files on which the regular expression will run.

The options used in this article are:

  • -P: Interprets regular expressions in the style of the Perl programming language. This option, which is specific to GNU/Linux, is used in the article to unlock powerful features that grep does not recognize by default. There is nothing Perl-specific about the regular expressions used in this article; the same features can be found in many programming languages.
  • -i: Matches in a case-insensitive way.
  • -o: Prints only the characters that match the regular expression. By default, the entire line containing the corresponding string is printed.
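The syntax and options above can be tried against a real file with a short sketch like the following. The temporary file and its three lines are invented for illustration, and the example assumes GNU grep with -P support.

```shell
# Create a throwaway sample file; its contents are made up for this sketch.
sample_file=$(mktemp)
printf 'The cat sat.\nA Cat naps.\nNo felines here.\n' > "$sample_file"

# -i: print every line that contains "cat", ignoring case.
matching_lines=$(grep -Pi 'cat' "$sample_file")
echo "$matching_lines"

# -o added: print only the matched characters, one match per line.
matched_text=$(grep -Pio 'cat' "$sample_file")
echo "$matched_text"

# Clean up the temporary file.
rm -f "$sample_file"
```

The first command prints the two whole lines that contain a match; the second prints only cat and Cat.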

How to pipe content to a regular expression

As mentioned earlier, you can also use a regular expression to filter stdout output. The following example uses the pipe symbol (|) to feed the output of an echo command to grep.

$ echo "I like to use regular expressions." | grep -Po 'r.*ar'

The command produces the following output:

regular

Why does grep return the characters regular as the match for the regular expression specified here? We will explore the reasons in the following sections of this article.

Regular characters, metacharacters, and patterns: The building blocks of regular expressions

You’ll use three basic building blocks when working with regular expressions: regular characters, metacharacters, and patterns. Regular characters and metacharacters are used to create a regular expression, and that regular expression represents a matching pattern that the regex engine applies to some content.

You can think of a metacharacter as a placeholder symbol. For example, the . metacharacter (a dot or period) represents “any character.” The \d metacharacter represents any single digit, from 0 to 9.

The * metacharacter is shorthand for the instruction “match the preceding character zero or more times.” (You’ll see how to work with the * metacharacter in the next few sections.)
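To make “zero or more times” concrete, here is a small sketch. The sample words are invented for illustration, and the example assumes GNU grep with -P support.

```shell
# In 'ca*t' the * applies to the a: a "c", then zero or more "a", then a "t".
star_matches=$(echo "ct cat caat cut" | grep -Po 'ca*t')

echo "$star_matches"
```

The words ct (zero a characters), cat (one), and caat (two) are printed, each on its own line. The word cut is not, because u is not an a.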

Regular expressions support many metacharacters, each worthy of one or two pages of description. For now, the important thing to understand is that a metacharacter is a reserved symbol used by the regex engine to describe a character generically. In addition, certain metacharacters are an abbreviation for a search instruction.

You can combine regular characters with metacharacters to declare rules that define search patterns. For example, consider the following short regular expression:

.t

This matches a pattern that consists of two characters. The first character can be any character, as declared by the . (period) metacharacter, but the second character must be t. Therefore, applying the regular expression .t to the string I like cats but not rats matches four two-character strings: the at in cats, the ut in but, the ot in not, and the at in rats.
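You can check the .t example yourself with a sketch along these lines, assuming GNU grep with -P support:

```shell
# -o prints each two-character match of '.t' on its own line.
dot_t_matches=$(echo "I like cats but not rats" | grep -Po '.t')

echo "$dot_t_matches"
```

Every t that has a character in front of it produces a match, so the words but and not contribute matches along with cats and rats.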

You can do a lot using only the basic metacharacters to create regular expressions with grep. The following sections provide several useful examples.

Running Basic Regular Expressions

The following subsections show several examples of regular expressions. The examples are presented as two commands to enter into a Linux terminal. The first command creates a variable named teststr that contains an example string. The second executes the echo command against teststr and pipes the result of the echo command to grep. The grep command then filters the input based on the associated regular expression.

How to declare an exact pattern match using normal characters

The following example shows how to find a string based on the regular character pattern, Fido. The search statement is case-sensitive:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po 'Fido'

The result is:

Fido

How to declare an exact pattern match that is not case sensitive

The following example shows how to find a string according to a regular character pattern, fido. The search is not case-sensitive, as indicated by the -i option in the grep command. Therefore, the regex engine will match occurrences such as FIDO, as well as fido or fiDo.

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Poi 'fido'

The result is:

Fido
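To see the effect of -i on other capitalizations, the following sketch feeds three made-up variants to the same regular expression, one per line. It assumes GNU grep with -P support.

```shell
# Three capitalizations of the same word, one per line.
variants=$(printf 'FIDO\nfido\nfiDo\n' | grep -Poi 'fido')

echo "$variants"
```

All three lines are printed exactly as they were written, because -i changes how the pattern is compared, not what is returned.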

How to declare a logical pattern match

The following example uses the | metacharacter to search according to an either/or condition, that is, a condition that can be satisfied by either of the regular expressions on the two sides of |. In this case, the regular expression matches occurrences of the regular character f or g:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po 'f|g'

The grep command identifies each occurrence that satisfies the rule declared in the regular expression. Conceptually, the regular expression says: Return any character that is an f or a g. We are leaving the search case-sensitive, as is the default. Therefore, the identified characters are the two f characters in Jeff, the two lowercase g characters in Gregg, and the g in dog. The uppercase G in Gregg and the uppercase F in Fido are not matched.

Because each character is identified and returned one by one, the output sent to the terminal window is:

f
f
g
g
g

How to find a character at the beginning of a line

The following example uses the ^ metacharacter to find the beginning of a line of text. Conceptually, the ^ metacharacter matches the beginning of a line.

The example runs the regular expression ^J. This regular expression looks for a match that satisfies two conditions. The first condition is to find the beginning of the line; the second is to find the regular character J in that position.

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '^J'

The regular expression matches only the J at the start of Jeff, because that J is the first character of the line.

The result returned to the terminal is:

J

How to find a character at the end of a line

The following example uses the $ metacharacter to find the end of a line of text.

The example runs the regular expression \.$. The regular expression declares a matching rule that has two conditions. First, the regular expression looks for an occurrence of the regular character . (period). The regular expression then checks whether the end of the line comes next. Therefore, if the . character comes at the end of the line, it is considered a match.

The regular expression includes a backslash (\) as the “escape” metacharacter before the period. The escape metacharacter is necessary to override the normal meaning of the period as a metacharacter. Remember that the . (period) metacharacter means any character. With the escape character, the period is treated like a regular character, so it matches itself:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po '\.$'

The regular expression matches the final period of the text, the one that follows Tweety.

The result is just the period:

.

Suppose you used an unescaped period in the regular expression:

$ echo $teststr | grep -Po '.$'

You would get the same result as with the escaped period, but a different logic is running. That logic is: match any character that is the last character before the end of the line. Therefore, the regular expression would match any line that contains at least one character. Using the escape character to identify a character as a regular character is a subtle distinction in this case, but an important one.
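The difference becomes visible when a line does not end with a period. The sketch below uses a made-up sentence with no final period and assumes GNU grep with -P support.

```shell
# A sample line that does NOT end with a period.
no_period="The dog is named Fido"

# Escaped period: requires a literal "." at the end of the line, so nothing matches.
# "|| true" keeps the script going when grep finds no match.
escaped=$(echo "$no_period" | grep -Po '\.$' || true)

# Unescaped period: matches whatever character happens to be last.
unescaped=$(echo "$no_period" | grep -Po '.$' || true)

echo "escaped:   [$escaped]"
echo "unescaped: [$unescaped]"
```

The escaped version prints nothing between the brackets, while the unescaped version returns the final o of Fido.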

How to find multiple characters at the end of a line

The following example searches the string assigned to the teststr variable for the characters ty. when they appear at the end of a line:

$ teststr="Jeff and the pet Lucky. Gregg and the dog Fido. Chris has 1 bird named Tweety."
$ echo $teststr | grep -Po 'ty\.$'

The result is:

ty.

Again, notice the use of the escape metacharacter (\) to declare the . (period) character as a regular character.

How to find occurrences of a character using a metacharacter to match numbers

The following example uses the \d metacharacter to create a regular expression that matches any digit in a given piece of text.

$ teststr="There are 9 cats and 2 dogs in a box."
$ echo $teststr | grep -Po '\d'

Because each digit is matched and returned one by one, the output sent to the terminal is:

9
2
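Because \d stands for a single digit, a number with several digits comes back one digit at a time. The following sketch shows this with a made-up sentence; it assumes GNU grep with -P support, which the \d metacharacter requires.

```shell
# "42" is a single number, but \d matches one digit per match.
digits=$(echo "Order 42 has shipped." | grep -Po '\d')

echo "$digits"
```

The output is 4 and 2 on separate lines rather than 42.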

How to find a string using metacharacters for a number and a space

The following example uses the \d and \s metacharacters together with regular characters to create a regular expression that matches the text according to the following logic: Match any digit that is followed by a space and then the regular characters cats.

The \d metacharacter matches a digit and the \s metacharacter matches a whitespace character (a space, a tab, or one of a few other rarely used characters):

$ teststr="There are 9 cats and 2 dogs in a box."
$ echo $teststr | grep -Po '\d\scats'

The result is:

9 cats

How to combine metacharacters to create a complex regular expression

The following example uses the \d metacharacter to match a digit, \s to match a space, and . (period) to match any character. The regular expression uses the * metacharacter to say, Match zero or more successive occurrences of the previous character.

The logic expressed in the regular expression is as follows: Find a text string that begins with a digit followed by a space character and the regular characters cats. Then continue, matching any character until you reach another digit followed by a space character and the regular characters dogs:

$ teststr="There are 9 cats and 2 dogs in a box."
$ echo $teststr | grep -Po '\d\scats.*\d\sdogs'

The result is:

9 cats and 2 dogs

How to traverse a line of text to a stopping point

The following example uses the . (period) metacharacter and * together with the regular characters cats to create a regular expression with the following logic: Match any character zero or more times until you reach the characters cats:

$ teststr="There are 9 cats and 2 dogs in a box."
$ echo $teststr | grep -Po '.*cats'

The result is:

There are 9 cats

The interesting thing about this regular expression is that starting from the beginning of the line is implicit. The ^ metacharacter could be used to indicate the beginning of a line, but because the regular expression matches any character until it reaches the characters cats, it is not necessary to explicitly declare the beginning of the line using ^. The regular expression starts processing from the beginning of the line by default.
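A quick way to confirm this is to run the regular expression with and without the ^ metacharacter and compare the results. This is only a sketch, assuming GNU grep with -P support.

```shell
teststr="There are 9 cats and 2 dogs in a box."

# Without ^: the match still starts at the first character of the line.
implicit=$(echo "$teststr" | grep -Po '.*cats')

# With ^: the start of the line is declared explicitly.
explicit=$(echo "$teststr" | grep -Po '^.*cats')

echo "$implicit"
echo "$explicit"
```

Both commands print There are 9 cats.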

Regular expressions discover patterns in text

Regular expressions provide an efficient yet concise way to perform complex text filtering. You can use them in programming languages such as JavaScript, Python, Perl, and C++, and directly in a Linux terminal to process files and text using the grep command, as shown in this article.

Becoming familiar with regular expressions takes time. Mastering the complexities of working with metacharacters alone can be daunting. Fortunately, the learning curve is developmental. You don’t have to master all regular expressions to work with them usefully as a beginner. You can start with the basics, and as you learn more, you can do more. Just being able to match patterns using the basic examples shown in this article can provide an immediate benefit.

More information

The next article in this series explains features of regular expressions that are even more powerful. Read it here: Regex how-to: Quantifiers, pattern collections, and word boundaries

Last updated: April 19, 2023
