Tuesday, December 11, 2018

Regular Expressions in Python

A regular expression is a sequence of characters used mainly to find and replace patterns in a string or file. It lets us specify a pattern of text to search for. Regular expressions are huge time-savers, not just for software users but also for programmers. Their applications are widespread (the term is sometimes shortened to regexp, regex, or re), but they can be fairly complex, so when contemplating a regex for a certain task, think about simpler alternatives first, and come to regexes as a last resort.


Regular expressions use two types of characters:


a) Metacharacters: As the name suggests, these characters have a special meaning, similar to * in a wildcard pattern.

b) Literals (like a, b, 1, 2…)


The most common uses of regular expressions are:

  •     Search a string (search and match)
  •     Find all occurrences of a pattern (findall)
  •     Break a string into substrings (split)
  •     Replace part of a string (sub)


Python's re module provides full support for regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression. The most commonly used functions in this module are:
  •     re.match()
  •     re.search()
  •     re.findall()
  •     re.split()
  •     re.sub()
  •     re.compile()

Let's understand these functions with the help of examples:


1. re.match(pattern, string)


This function attempts to match the RE pattern against a string, with optional flags. It finds a match only if the pattern occurs at the start of the string. The syntax for this function is: re.match(pattern, string, flags=0) where -


pattern is the regular expression to be matched, string is the string that is checked for the pattern at its beginning, and flags are optional modifiers, which can be combined using bitwise OR (|).


The re.match function returns a match object on success, None on failure.


See the code below:


import re

output = re.match( r'covri','covri communications and management solutions')

print(output)


When we run this program we get the following output:


<_sre.SRE_Match object; span=(0, 5), match='covri'>


------------------
(program exited with code: 0)

Press any key to continue . . .


The “r” at the start of the pattern string denotes a raw string, in which Python does not treat backslashes as escape characters, so they reach the regex engine unchanged. Now try this code:


import re

output = re.match( r'management','covri communications and management solutions')

print(output)

When we run this program we get the following output:


 None


The reason is that match() finds a match only if it occurs at the start of the string, and our pattern 'management' isn't at the start.
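
The flags argument described earlier can be illustrated with a small sketch. The sample string here is made up for the illustration; re.IGNORECASE and re.DOTALL are two of the standard flags in the re module, combined with bitwise OR:

```python
import re

# Hypothetical sample text: upper-case name, then a newline.
text = 'COVRI\ncommunications and management solutions'

# Without flags the case differs, so nothing matches at the start.
no_flags = re.match(r'covri', text)
print(no_flags)

# re.IGNORECASE makes the comparison case-insensitive.
ignore_case = re.match(r'covri', text, re.IGNORECASE)
print(ignore_case.group())

# '.' does not match a newline unless re.DOTALL is also given.
without_dotall = re.match(r'covri.communications', text, re.IGNORECASE)
combined = re.match(r'covri.communications', text, re.IGNORECASE | re.DOTALL)
print(without_dotall)
print(repr(combined.group()))
```

Note that flags change how the pattern is compared, not where match() looks: it still only tries the start of the string.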

We can query the match object for information about the matching string. Match object instances also have several methods and attributes; the most important ones are:
  • group()  Return the string matched by the RE
  • start()  Return the starting position of the match
  • end()  Return the ending position of the match
  • span()  Return a tuple containing the (start, end) positions of the match
Let's incorporate these methods in our program and see the output:

import re

output = re.match( r'covri','covri communications and management solutions')

print(output.group())
print(output.start())
print(output.end())
print(output.span())


The output of this program is:


covri
0
5
(0, 5)


group() returns the substring that was matched by the RE. start() and end() return the starting and ending indexes of the match (the ending index is exclusive, as in slicing). span() returns both indexes in a single tuple. Since re.match() only checks whether the RE matches at the start of the string, start() will always be zero here.
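
Because match() returns None on failure, calling group() on the result directly would raise an AttributeError when nothing matched. A minimal sketch of a guard, where describe_match is a hypothetical helper name and not part of the re module:

```python
import re

def describe_match(pattern, text):
    """Return (matched_text, span), or None if the pattern
    does not match at the start of text."""
    result = re.match(pattern, text)
    if result is None:          # no match: avoid calling methods on None
        return None
    return result.group(), result.span()

sentence = 'covri communications and management solutions'
found = describe_match(r'covri', sentence)
missing = describe_match(r'management', sentence)
print(found)
print(missing)
```

Checking for None before using the match object is the usual pattern for both match() and search().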


2. re.search(pattern, string)

The search() function scans through the string, looking for the first location where this RE matches. The syntax for this function is:

re.search(pattern, string, flags=0)

where pattern is the regular expression to be matched, string is the string that is searched for the pattern at any position, and flags are optional modifiers, which can be combined using bitwise OR (|).

The re.search function returns a match object on success, None on failure. Now try this code:

import re

output = re.search( r'management','covri communications and management solutions')

print(output)


The output of this program is:


<_sre.SRE_Match object; span=(25, 35), match='management'>





We can query the match object returned by search() for information about the matching string. Match object instances also have several methods and attributes; the most important ones are:


  • group()  Return the string matched by the RE
  • start()  Return the starting position of the match
  • end()  Return the ending position of the match
  • span()  Return a tuple containing the (start, end) positions of the match


Let's incorporate these methods in our program and see the output:

import re

output = re.search( r'management','covri communications and management solutions')

print(output.group())
print(output.start())
print(output.end())
print(output.span())


The output of this program is:


management
25
35
(25, 35)

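
To put the difference between match() and search() side by side, here is a short sketch using the same sentence as the examples above (the variable names are chosen for the illustration only):

```python
import re

sentence = 'covri communications and management solutions'

at_start = re.match(r'management', sentence)    # only tries index 0
anywhere = re.search(r'management', sentence)   # scans the whole string
first_only = re.search(r'o', sentence)          # reports the first hit only

print(at_start)
print(anywhere.span())
print(first_only.start())
```

search() stops at the first location that matches, so the many later occurrences of 'o' are not reported.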


3. re.findall(pattern, string)

It returns a list of all matches of the pattern. It is not constrained to the start or end of the string: findall() finds all non-overlapping substrings where the RE matches, scanning from left to right, and returns them as a list. See the example below:

import re

output = re.findall( r'covri','covri communications and management solutions aka covri solutions')

print(output)

The output of this program is:

['covri', 'covri']

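
findall() returns plain strings, so the positions of the matches are lost. As a sketch of one way around that, re.finditer() (also part of the re module) yields a match object per occurrence. The second half shows how findall() behaves when the pattern contains a capturing group:

```python
import re

text = 'covri communications and management solutions aka covri solutions'

# finditer yields one match object per occurrence, so spans are available.
positions = [m.span() for m in re.finditer(r'covri', text)]
print(positions)

# With one capturing group, findall returns the group's text,
# not the whole match.
after_covri = re.findall(r'covri (\w+)', text)
print(after_covri)
```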


4. re.split(pattern, string, [maxsplit=0])

This function splits a string at each occurrence of the given pattern. See the program below:

import re

output = re.split( r'i','covrisolutions')

print(output)


The output of this program is:

['covr', 'solut', 'ons']


In this program we are splitting the string 'covrisolutions' by 'i'. As seen in the output the string is divided into three parts and stored in a list.

To limit the number of splits made on the string, we can use the maxsplit argument of split() as shown below:

import re

output = re.split( r'i','covrisolutions',maxsplit=1)

print(output)

The output of this program is:

['covr', 'solutions']


The string covrisolutions was split only at the first occurrence of 'i'; the second occurrence of 'i' was left untouched.
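
Splitting on a single letter could also be done with the plain str.split() method; re.split() becomes useful when the separator is itself a pattern. A small sketch, with a made-up record string as the input:

```python
import re

# Hypothetical input with mixed separators.
record = 'covri, solutions;India   2018'

# Any run of commas, semicolons, or whitespace counts as one separator.
fields = re.split(r'[,;\s]+', record)
print(fields)

# Wrapping the separator in a capturing group keeps it in the result.
with_separators = re.split(r'(i)', 'covrisolutions')
print(with_separators)
```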


5. re.sub()


sub() searches for a pattern and replaces each match with a new substring. If the pattern is not found, the string is returned unchanged. The syntax is:

re.sub(pattern, repl, string)

where pattern is the regular expression to be searched for, repl is the string which will replace each match of pattern, and string is the input string which is searched for the pattern.

See the code below:


import re


output = re.sub( r'India','World','covrisolutions is a leading solutions provider in India')


print(output)


The output of this program will be:

covrisolutions is a leading solutions provider in World


As we can see in the output, the word India has been replaced by World.
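
sub() accepts a few more options than the basic syntax shows. The following sketch, using a sentence from this post as input, illustrates the optional count argument, a backreference in the replacement, and the related re.subn() function:

```python
import re

text = 'covri solutions is based in India and covri is a leading solutions provider in India'

# count limits how many matches are replaced (0, the default, means all).
first_only = re.sub(r'India', 'World', text, count=1)
print(first_only)

# \1 in the replacement refers to the first capturing group of the pattern.
swapped = re.sub(r'(\w+) solutions', r'solutions by \1', 'covri solutions')
print(swapped)

# subn returns the new string together with the number of replacements.
replaced, how_many = re.subn(r'India', 'World', text)
print(how_many)
```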


6. re.compile()


We can compile a regular expression pattern into a pattern object, which can then be used for pattern matching. This also lets us search with the same pattern again without rewriting it. The syntax is:

re.compile(pattern) where pattern is the string which will be converted to a pattern object. See the example below:

import re

pattern = re.compile('covri')

output = pattern.findall('covri solutions is based in India and covri is a leading solutions provider in India')


print(output)


The output of the program is shown below:

['covri', 'covri']



The compile method converted the string into the pattern object which was then used with findall() to print all occurrences of the string stored in the object.
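
A compiled pattern object offers the same operations as the module-level functions, as methods. A minimal sketch of reusing one object for several of them (the sample sentence and the choice of re.IGNORECASE are assumptions made for the illustration):

```python
import re

# Compile once, optionally with flags, then reuse the pattern object.
word = re.compile(r'covri', re.IGNORECASE)

text = 'Covri solutions is based in India and covri is a leading solutions provider'

first = word.match(text)        # same idea as re.match
second = word.search(text, 5)   # optional second argument: start position
every = word.findall(text)      # same idea as re.findall
upper = word.sub('COVRI', text) # same idea as re.sub

print(first.group())
print(second.group())
print(every)
print(upper)
```

Flags are passed once to compile() rather than to each call, which keeps repeated searches consistent.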


Operators used with Regular expressions

Regular expressions can specify patterns, not just fixed characters. Here are the most commonly used operators that help build an expression to represent the required characters in a string or file. They are commonly used in web scraping and text mining to extract required information.

.       Matches any single character except a newline ('\n').

?       Matches 0 or 1 occurrence of the pattern to its left.

+       Matches 1 or more occurrences of the pattern to its left.

*       Matches 0 or more occurrences of the pattern to its left.

\w      Matches an alphanumeric character or underscore, whereas \W (upper case W) matches any other character.

\d      Matches a digit ([0-9] for ASCII text) and \D (upper case D) matches a non-digit.

\s      Matches a single whitespace character (space, newline, carriage return, tab, form feed) and \S (upper case S) matches any non-whitespace character.

\b      Matches the boundary between a word character and a non-word character, and \B is the opposite of \b.

[..]    Matches any single character listed in the square brackets, and [^..] matches any single character not listed in them.

\       Escapes a character that has a special meaning, as in \. to match a period or \+ to match a plus sign.

^ and $ Match the start and the end of the string respectively.

{n,m}   Matches at least n and at most m occurrences of the preceding expression; written as {,m}, it matches from 0 up to m occurrences.

a|b     Matches either a or b.

( )     Groups part of a regular expression and captures the text it matches.

\t, \n, \r  Match a tab, a newline, and a carriage return.
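
As a quick tour of a few rows of the table, here is a sketch that applies several operators with findall() and search(). The sample strings are invented for the illustration and are not from any real data:

```python
import re

# Hypothetical sample line.
sample = 'Order 66 shipped on 2018-12-11 to a.b@covri.in, cost: 1500 INR'

numbers = re.findall(r'\d+', sample)                # runs of digits
dates = re.findall(r'\d{4}-\d{2}-\d{2}', sample)    # exact repetition counts
codes = re.findall(r'\b[A-Z]{3}\b', sample)         # three capitals as a whole word
spellings = re.findall(r'colou?r', 'color colour')  # '?' makes the 'u' optional
animals = re.findall(r'cat|dog', 'cat dog bird')    # alternation
starts_with_order = re.search(r'^Order', sample) is not None  # '^' anchors at the start
dot_in = re.findall(r'\.in', sample)                # '\.' is a literal period

for result in (numbers, dates, codes, spellings, animals, starts_with_order, dot_in):
    print(result)
```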


Let's try to use some of these operators in our programs. Suppose I want to print each word character (letters, digits, and underscore, so spaces are skipped) of my input string 'covri solutions is based in India and covri is a leading solutions provider in India'. See the code below:

import re

output = re.findall(r'\w','covri solutions is based in India and covri is a leading solutions provider in India')

print(output)


The output of the program is shown below:

['c', 'o', 'v', 'r', 'i', 's', 'o', 'l', 'u', 't', 'i', 'o', 'n', 's', 'i', 's',

 'b', 'a', 's', 'e', 'd', 'i', 'n', 'I', 'n', 'd', 'i', 'a', 'a', 'n', 'd', 'c',

 'o', 'v', 'r', 'i', 'i', 's', 'a', 'l', 'e', 'a', 'd', 'i', 'n', 'g', 's', 'o',

 'l', 'u', 't', 'i', 'o', 'n', 's', 'p', 'r', 'o', 'v', 'i', 'd', 'e', 'r', 'i',

 'n', 'I', 'n', 'd', 'i', 'a']



Now suppose I want to print each word of my input string 'covri solutions is based in India and covri is a leading solutions provider in India'. See the code below:

import re

output = re.findall(r'\w+','covri solutions is based in India and covri is a leading solutions provider in India')

print(output)

The output of the program is shown below:

['covri', 'solutions', 'is', 'based', 'in', 'India', 'and', 'covri', 'is', 'a',

'leading', 'solutions', 'provider', 'in', 'India']

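
Building on the \w+ example, the sketch below combines character classes, repetition counts, and word boundaries on the same sentence. It also shows why the r prefix matters: without it, Python reads '\b' as a backspace character before the regex engine ever sees it.

```python
import re

text = 'covri solutions is based in India and covri is a leading solutions provider in India'

# Capitalised words: one upper-case letter followed by lower-case letters.
capitalised = re.findall(r'\b[A-Z][a-z]+\b', text)

# Words that are exactly two characters long.
two_letter = re.findall(r'\b\w{2}\b', text)

# Non-raw string: '\b' is a backspace character, so nothing matches.
no_raw = re.findall('\bin\b', text)

# Raw string: '\b' reaches the regex engine as a word boundary.
raw = re.findall(r'\bin\b', text)

print(capitalised)
print(two_letter)
print(no_raw)
print(raw)
```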


Write some programs using these operators to understand how they work. Here we end today's discussion, so till we meet next, keep practicing and learning Python, as Python is easy to learn!



































































































































































