Tuesday, April 9, 2019

Pandas - 11 (Using RegExp to Parse TXT Files)

Sometimes the files to be parsed do not use a well-defined separator such as a comma or a semicolon. In these cases, regular expressions are helpful. We can specify a regexp in the read_table() function through the sep option.

Let's use a regexp in a program that parses a TXT file whose values are separated by an unpredictable mix of spaces and tabs. Our TXT file mydata4.txt is shown below:

red    blue    yellow    green
1         5           2          3
2         7           8          5
3         3           6          7

The following program shows how to use the regexp:

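import pandas as pd

# sep=r'\s+' splits on one or more whitespace characters (spaces or tabs);
# the raw string keeps Python from treating \s as a string escape
frame1 = pd.read_table('mydata4.txt', sep=r'\s+', engine='python')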

print('\nThe dataframe\n')
print(frame1)


The output of the program is shown below:

The dataframe

   red  blue  yellow  green
0    1     5       2      3
1    2     7       8      5
2    3     3       6      7
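If you do not have mydata4.txt at hand, the same idea can be tried on an in-memory string. The snippet below is only a sketch: the raw text is a made-up stand-in for the file, with tabs and spaces mixed on purpose, and io.StringIO lets pandas read it as if it were a file.

```python
import io
import pandas as pd

# Hypothetical stand-in for mydata4.txt: tabs and spaces mixed at random.
raw = (
    "red \t blue\tyellow    green\n"
    "1\t\t5   2 \t 3\n"
    "2   7\t8\t5\n"
    "3 \t3    6  \t7\n"
)

# r'\s+' treats any run of whitespace as a single separator.
frame = pd.read_table(io.StringIO(raw), sep=r'\s+')

print(frame)
print(frame.dtypes)
```

However the whitespace is mixed, each line should still split into the same four columns.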


The output shows a well-formed dataframe with every value in the correct column. We used the pattern \s+, in which the wildcard \s matches any whitespace character (a space or a tab, among others) and + means one or more occurrences. To match only tabs, use \t instead. Some of the common wildcards are:

.        Single character, except newline
\d       Digit
\D       Non-digit character
\s       Whitespace character
\S       Non-whitespace character
\n       New line character
\t       Tab character
\uxxxx   Unicode character specified by the hexadecimal number xxxx
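To get a feel for these wildcards outside pandas, here is a small sketch using Python's built-in re module. The sample strings are made up for illustration. re.split() breaks a string wherever the pattern matches, which is roughly what the sep option does to each line of a file.

```python
import re

# \s+ : one or more whitespace characters (spaces, tabs, ...)
line = "1 \t  5\t\t2    3"
fields = re.split(r'\s+', line)
print(fields)   # ['1', '5', '2', '3']

# \D+ : one or more non-digit characters
fused = "000END123AAA122"
numbers = re.split(r'\D+', fused)
print(numbers)  # ['000', '123', '122']
```

Note that re.split() returns strings; it is pandas that later converts them to numbers.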

Our next program extracts the numeric parts from a TXT file, mydata5.txt, in which numerical values and literal characters are fused together without any separator. The file mydata5.txt is shown below:

000END123AAA122
001END124BBB321
002END125CCC333

See the following program, which treats every run of non-digit characters (\D+) as a separator:

import pandas as pd

# sep=r'\D+' splits on one or more non-digit characters;
# header=None because the file has no row of column names
frame1 = pd.read_table('mydata5.txt', sep=r'\D+', header=None, engine='python')

print('\nThe dataframe\n')
print(frame1)


The output of the program is shown below: 

The dataframe

   0    1    2
0  0  123  122
1  1  124  321
2  2  125  333


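As you may have noticed, we set the header option to None. This is needed whenever the TXT file contains no column headings; pandas then numbers the columns starting from 0. Note also that the values 000, 001 and 002 are read as the integers 0, 1 and 2.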
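If numbered columns are not descriptive enough, the names option can supply labels. The sketch below is an illustration under assumptions: the data is an in-memory copy of the mydata5.txt lines shown above, and the column names 'id', 'first' and 'second' are invented for the example, since the post does not say what the numbers mean.

```python
import io
import pandas as pd

# In-memory copy of the three lines of mydata5.txt.
raw = "000END123AAA122\n001END124BBB321\n002END125CCC333\n"

# header=None alone: pandas labels the columns 0, 1, 2.
unnamed = pd.read_table(io.StringIO(raw), sep=r'\D+',
                        header=None, engine='python')

# names=[...]: supply our own labels (hypothetical names).
named = pd.read_table(io.StringIO(raw), sep=r'\D+', header=None,
                      names=['id', 'first', 'second'], engine='python')

print(unnamed)
print(named)
```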

Our next program shows how to exclude header lines or unnecessary comments contained in a file, mydata6.txt, using the skiprows option. The file is shown below:

########### LOG FILE ############
This file has been generated by automatic system
white,red,blue,green,animal
12-Feb-2015: Counting of animals inside the house
1,5,2,3,cat
2,7,8,5,dog
13-Feb-2015: Counting of animals outside the house
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse

See the following program:

import pandas as pd

# skiprows takes 0-based line numbers: 0 and 1 are the banner lines,
# 3 and 6 are the dated comment lines
frame1 = pd.read_table('mydata6.txt', sep=',', skiprows=[0, 1, 3, 6])

print('\nThe dataframe\n')
print(frame1)



The output of the program is shown below:

The dataframe

   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
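The skiprows option also accepts a function that is called with each 0-based line number and returns True for the lines to skip. The sketch below uses an in-memory copy of the mydata6.txt contents shown above and compares both forms; the variable names are only for illustration.

```python
import io
import pandas as pd

# In-memory copy of mydata6.txt.
raw = """########### LOG FILE ############
This file has been generated by automatic system
white,red,blue,green,animal
12-Feb-2015: Counting of animals inside the house
1,5,2,3,cat
2,7,8,5,dog
13-Feb-2015: Counting of animals outside the house
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse
"""

# List form: skip the lines with these 0-based numbers.
frame = pd.read_csv(io.StringIO(raw), sep=',', skiprows=[0, 1, 3, 6])

# Callable form: skip a line when the function returns True for its number.
unwanted = {0, 1, 3, 6}
same = pd.read_csv(io.StringIO(raw), sep=',',
                   skiprows=lambda i: i in unwanted)

print(frame)
print(frame.equals(same))
```

The list form is easier to read when only a few known lines have to go; the callable form may be handier when the lines to skip follow a rule.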



Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!