Wednesday, May 1, 2019

Pandas - 27 (String Manipulation)

Most string manipulation can be done easily with the built-in functions provided by Python. For more complex cases of matching and manipulation, it is necessary to use regular expressions. Let's look at some of the built-in functions first and then at regular expressions.

1. Built-in Methods for String Manipulation

a. The split() function

It is used when we want to separate a composite string into its parts and then assign them to the correct variables. The split() function separates the parts of the text, taking a separator as the reference point. See the following program:

import pandas as pd
import numpy as np

text_to_be_separated = '101 Baquer complex , Hyderabad'

print(text_to_be_separated.split(','))#produces string with a space character at the end

separated_text = [s.strip() for s in text_to_be_separated.split(',')]

print('\nThe formatted text \n')
print(separated_text)


In the above program the split() function separates the parts of the text, taking a comma as the separator. The output of the program is shown below:

['101 Baquer complex ', ' Hyderabad']

The formatted text

['101 Baquer complex', 'Hyderabad']
------------------
(program exited with code: 0)

Press any key to continue . . .


In the first output the first string has a space character at the end and the second has one at the start. To overcome this common problem, we use the split() function along with the strip() function, which trims the whitespace (including newlines) from both ends of each part. The list comprehension then collects the trimmed parts into a list of strings.

We can also store the result in variables, provided the number of elements is small and always the same. See the following program:

import pandas as pd
import numpy as np

text_to_be_separated = '101 Baquer complex , Hyderabad'

print(text_to_be_separated.split(','))#produces string with a space character at the end

address, city = [s.strip() for s in text_to_be_separated.split(',')]

print('\nAddress \n')
print(address)

print('\nCity \n')
print(city)

print('\nComplete address \n')
print((address+','+city))


The output of the program is shown below:

['101 Baquer complex ', ' Hyderabad']

Address

101 Baquer complex

City

Hyderabad

Complete address

101 Baquer complex,Hyderabad
------------------
(program exited with code: 0)

Press any key to continue . . .
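
Unpacking into two variables fails with a ValueError if the text contains more than one comma. A minimal sketch of one way around this, assuming we only care about the first comma: pass a maxsplit of 1 as the second argument of split(). The sample text with the extra comma is made up for illustration.

```python
# Hypothetical input with two commas; splitting on every comma would
# give three parts and break the two-variable unpacking.
text_with_extra_comma = '101 Baquer complex , Abids , Hyderabad'

# The second argument (maxsplit=1) splits only at the first comma,
# so exactly two parts come back.
address, rest = [s.strip() for s in text_with_extra_comma.split(',', 1)]

print(address)
print(rest)
```

Here address holds the text before the first comma and rest holds everything after it, with the second comma left in place.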


b. The join() function

String concatenation with + is practical only when there are two or three strings to be concatenated. For a larger number of strings the join() function should be used. It is called on the separator string and receives the list of strings we want to join. See the following program:

import pandas as pd
import numpy as np

strings_to_be_joined = ['101', 'Baquer complex','Chapel','Road' ,'Abids', 'Hyderabad','500001']

print(','.join(strings_to_be_joined))


The output of the program is shown below:

101,Baquer complex,Chapel,Road,Abids,Hyderabad,500001
------------------
(program exited with code: 0)

Press any key to continue . . .
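
One thing to keep in mind is that join() accepts only strings; a list containing integers raises a TypeError. A small sketch of the usual workaround, converting each element with str() first. The mixed list below is a made-up example.

```python
# Hypothetical list that mixes integers and strings.
mixed_parts = [101, 'Baquer complex', 'Hyderabad', 500001]

# Convert every element to a string before joining.
joined = ', '.join(str(part) for part in mixed_parts)

print(joined)
```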


c. index() and find() functions

These functions can be used on a string to search for pieces of text in it, i.e., substrings. Python also provides the in keyword, which is the simplest way of detecting whether a substring is present. See the following program:

import pandas as pd
import numpy as np

text = '101 Baquer complex , Hyderabad'

print('Hyderabad' in text)#Using in keyword
print(text.index('Hyderabad'))#Using index()
print(text.find('Hyderabad'))#Using find()
print(text.find('Delhi'))#Using find()
print(text.index('Delhi'))#Using index()



The output of the program is shown below:

True
21
21
-1
Traceback (most recent call last):
  File "chap6.py", line 10, in <module>
    print(text.index('Delhi'))#Using index()
ValueError: substring not found
------------------
(program exited with code: 1)

Press any key to continue . . . 


As seen from the output, Python's in keyword returns a Boolean value, whereas the index() and find() functions return the position in the text at which the substring starts. The difference in the behavior of these two functions can be seen, however, when the substring is not found. When we searched for Delhi, find() returned -1, while index() raised a ValueError and stopped the program.
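
To keep the program running when the substring is missing, we can either test the result of find() against -1 or wrap index() in a try/except block. The following is only a sketch; the helper name locate() is made up for this example.

```python
text = '101 Baquer complex , Hyderabad'

def locate(substring, text):
    """Return the start position of substring, or None if it is absent.

    locate() is a hypothetical helper, not a built-in function.
    """
    try:
        return text.index(substring)
    except ValueError:
        return None

print(locate('Hyderabad', text))
print(locate('Delhi', text))

# The same check with find(), which signals "not found" with -1.
position = text.find('Delhi')
if position == -1:
    print('Delhi not found')
```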

d. The count() function

Using the count() function we can know how many times a character or combination of characters (substring) occurs within the text. See the following program:

import pandas as pd
import numpy as np

text = '101 Baquer complex , Hyderabad'

print(text.count('e'))


The output of the program is shown below:

3
------------------
(program exited with code: 0)

Press any key to continue . . . 
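
Two details of count() are worth a quick sketch: it counts non-overlapping occurrences, and it accepts an optional start position. The value 21 below is the position where Hyderabad starts in our text; the string 'aaaa' is just an illustration.

```python
text = '101 Baquer complex , Hyderabad'

# Count over the whole string.
total_a = text.count('a')

# Count only from position 21 onwards, i.e. inside 'Hyderabad'.
a_in_city = text.count('a', 21)

# Occurrences are counted without overlap.
pairs = 'aaaa'.count('aa')

print(total_a, a_in_city, pairs)
```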


e. The replace() function

Using the replace() function we can replace or eliminate a substring (or a single character) in our string. Replacing a substring with an empty string is equivalent to eliminating it from the text. Note that replace() changes every occurrence it finds. See the following program:

import pandas as pd
import numpy as np

text = '101 Baquer complex , Hyderabad'

print(text.replace('Baquer','Covri'))
print(text.replace('1',''))


The output of the program is shown below:

101 Covri complex , Hyderabad
0 Baquer complex , Hyderabad
------------------
(program exited with code: 0)

Press any key to continue . . .
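
If only the first occurrence should change, replace() takes an optional third argument that limits the number of replacements. A short sketch, which also shows that the original string is left untouched because strings are immutable:

```python
text = '101 Baquer complex , Hyderabad'

# The third argument limits the number of replacements to one.
first_only = text.replace('1', '', 1)

print(first_only)
print(text)  # the original string is unchanged
```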


2. Regular Expressions

Regular expressions provide a very flexible way to search and match string patterns within text. A single expression, generically called a regex, is a string formed according to the regular expression language. Python has a built-in module called re, which handles regex operations and must be imported in programs that use them. It provides functions for:

• Pattern matching
• Substitution
• Splitting
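
The three kinds of operation listed above can be sketched briefly with re.search(), re.sub(), and re.split(). The sample string and the patterns are made up for illustration only.

```python
import re

# Hypothetical sample text with mixed separators and a double space.
sample = 'Flat 101;Baquer complex, Hyderabad  500001'

# Pattern matching: find the first run of digits.
first_number = re.search(r'\d+', sample).group()

# Substitution: collapse every run of whitespace into a single space.
tidy = re.sub(r'\s+', ' ', sample)

# Splitting: break at a semicolon or comma, plus any spaces after it.
parts = re.split(r'[;,]\s*', sample)

print(first_number)
print(tidy)
print(parts)
```

The patterns are written as raw strings (the r prefix) so that Python passes the backslashes to the re module unchanged.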

Let's use the split() function provided by the re module. It performs the same operation as the string split(), only it accepts a regex pattern as the separator, which makes it considerably more flexible. We'll use \s+, which is the regex for a sequence of one or more whitespace characters. See the following program:

import pandas as pd
import numpy as np
import re

text = '101 Baquer complex , Hyderabad'

print(re.split(r'\s+', text))#raw string keeps the backslash intact
print(text.split(','))


In our program we have used the split() function used before as well as the one provided by the re module. The output of the program is shown below:

['101', 'Baquer', 'complex', ',', 'Hyderabad']
['101 Baquer complex ', ' Hyderabad']
------------------
(program exited with code: 0)

Press any key to continue . . .
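
A regex separator can also do the work of split() and strip() in one step. The sketch below assumes we want to split at the comma and drop any whitespace around it; the pattern \s*,\s* is our own choice for this example.

```python
import re

text = '101 Baquer complex , Hyderabad'

# \s*,\s* matches a comma together with any whitespace on either side,
# so the parts come back already trimmed.
parts = re.split(r'\s*,\s*', text)

print(parts)
```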


When we call the re.split() function, the regular expression is first compiled, and then its split() function is called on the text argument. We can also compile the regex ourselves with the re.compile() function, obtaining a reusable regex object and saving the repeated compilation step.
This matters most when the same pattern is applied over and over, for example when searching every string in a set or an array of strings. Once we have a regex object from the compile() function, we can call split() directly on it, as shown in the following program:

import pandas as pd
import numpy as np
import re

text = '101 Baquer complex , Hyderabad'

regex = re.compile(r'\s+')

print(regex.split(text))


The output of the program is shown below:

['101', 'Baquer', 'complex', ',', 'Hyderabad']

------------------
(program exited with code: 0)

Press any key to continue . . .
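
The benefit of a compiled regex shows up when the same pattern is applied to many strings. A minimal sketch, assuming a small made-up list of strings:

```python
import re

# Hypothetical list of strings with irregular whitespace.
addresses = ['101 Baquer complex', 'Chapel   Road', 'Abids\tHyderabad']

# Compile once, then reuse the same object for every string.
whitespace = re.compile(r'\s+')

split_addresses = [whitespace.split(address) for address in addresses]

print(split_addresses)
```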


To find every substring in the text that matches a regex pattern, we can use the findall() function. It returns a list of all the substrings in the text that meet the requirements of the regex. See the following program:

import pandas as pd
import numpy as np
import re

text = 'COVRI Solutions, 101 Baquer complex ,Chapel Road, Hyderabad'

print(re.findall(r'C\w+',text))
print(re.findall(r'c\w+',text))
print(re.findall(r'[Cc]\w+',text))#[Cc] matches an upper or lower case c


The output of the program is shown below:

['COVRI', 'Chapel']
['complex']
['COVRI', 'complex', 'Chapel']
------------------
(program exited with code: 0)

Press any key to continue . . .
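
A word of caution about square brackets: inside a character class a comma is an ordinary character, not a separator. The sketch below compares three patterns on the same text to show the effect; re.IGNORECASE is a standard flag of the re module.

```python
import re

text = 'COVRI Solutions, 101 Baquer complex ,Chapel Road, Hyderabad'

# [C,c] means "C, a comma, or c", so a comma can start a match.
with_comma = re.findall(r'[C,c]\w+', text)

# [Cc] means "C or c" only.
without_comma = re.findall(r'[Cc]\w+', text)

# The IGNORECASE flag gives the same result without listing both cases.
ignore_case = re.findall(r'c\w+', text, flags=re.IGNORECASE)

print(with_comma)
print(without_comma)
print(ignore_case)
```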

Besides findall(), there are two other functions: match() and search(). While findall() returns all matches in a list, the search() function returns only the first match. Furthermore, what it returns is not a string but a match object. This object gives the start and end positions of the match within the string through start() and end(), and the matched text itself through group().

The match() function performs matching only at the beginning of the string; if the pattern does not match there, it searches no farther within the string. If no match is found, it returns None. If match() does find a match, it returns an object identical to what you saw for the search() function. See the following program:

import pandas as pd
import numpy as np
import re

text = 'COVRI Solutions, 101 Baquer complex ,Chapel Road, Hyderabad'

print(re.findall(r'[Cc]\w+',text))
print(re.search(r'[Cc]\w+',text))

search = re.search(r'[Cc]\w+',text)
print(search.start())
print(search.end())
print(text[search.start():search.end()])

print(re.match(r'[Aa]\w+',text))

print(re.match(r'C\w+',text))

match = re.match(r'C\w+',text)

print(text[match.start():match.end()])


The output of the program is shown below:

['COVRI', 'complex', 'Chapel']
<re.Match object; span=(0, 5), match='COVRI'>
0
5
COVRI
None
<re.Match object; span=(0, 5), match='COVRI'>
COVRI
------------------
(program exited with code: 0)

Press any key to continue . . .
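
Since search() and match() return None when nothing matches, it is safer to check the result before using it. The sketch below also uses group() and span(), two standard methods of the match object, instead of slicing the text by hand. The \d+ pattern is our own choice for this example.

```python
import re

text = 'COVRI Solutions, 101 Baquer complex ,Chapel Road, Hyderabad'

# search() scans the whole string for the first run of digits.
found = re.search(r'\d+', text)
if found is not None:
    print(found.group())  # the matched text
    print(found.span())   # (start, end) as a tuple

# match() looks only at the beginning, where there are no digits.
missing = re.match(r'\d+', text)
if missing is None:
    print('No digits at the start of the text')
```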


Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!