Thursday, April 25, 2019

Pandas - 22 (Concatenating)

Concatenation is another type of data combination, and NumPy provides the concatenate() function to perform it on arrays. See the following program:

import pandas as pd
import numpy as np

array1 = np.arange(9).reshape((3,3))
print('Array 1\n')
print(array1)

array2 = np.arange(9).reshape((3,3))+6
print('\nArray 2\n')
print(array2)

print('\nConcatenated array axis=1\n')
print(np.concatenate([array1,array2],axis=1))

print('\nConcatenated array axis=0\n')
print(np.concatenate([array1,array2],axis=0))



The output of the program is shown below:

Array 1

[[0 1 2]
 [3 4 5]
 [6 7 8]]

Array 2

[[ 6  7  8]
 [ 9 10 11]
 [12 13 14]]

Concatenated array axis=1

[[ 0  1  2  6  7  8]
 [ 3  4  5  9 10 11]
 [ 6  7  8 12 13 14]]

Concatenated array axis=0

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]]
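
A point worth keeping in mind with np.concatenate() is that the arrays must agree on every axis except the one being joined. The sketch below uses two small made-up arrays (a and b are illustrative names, not part of the program above) to show one join that works and one that fails:

```python
import numpy as np

a = np.arange(6).reshape((2, 3))   # shape (2, 3)
b = np.arange(9).reshape((3, 3))   # shape (3, 3)

# Joining along axis=0 only needs the column counts to match (3 and 3).
rows = np.concatenate([a, b], axis=0)
print(rows.shape)

# Along axis=1 the row counts differ (2 vs 3), so NumPy raises ValueError.
mismatch_error = None
try:
    np.concatenate([a, b], axis=1)
except ValueError as exc:
    mismatch_error = exc
    print('axis=1 failed:', type(exc).__name__)
```

In the program above both arrays are 3x3, which is why the join succeeds along either axis.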


The pandas library, whose data structures such as series and dataframe have labeled axes, lets you generalize the concatenation of arrays further. pandas provides the concat() function for this kind of operation. See the following program:

import pandas as pd
import numpy as np

ser1 = pd.Series(np.random.rand(4), index=[1,2,3,4])
print('Series 1\n')
print(ser1)

ser2 = pd.Series(np.random.rand(4), index=[5,6,7,8])
print('\nSeries 2\n')
print(ser2)

print('\nConcatenated series axis=1\n')
print(pd.concat([ser1,ser2], axis=1))

print('\nConcatenated series axis=0\n')
print(pd.concat([ser1,ser2]))


By default, the concat() function works along axis=0 and returns a series. If you set axis=1, the result is a dataframe instead. The output of the program is shown below:

Series 1

1    0.936029
2    0.194529
3    0.448288
4    0.952875
dtype: float64

Series 2

5    0.392544
6    0.978594
7    0.453258
8    0.661619
dtype: float64

Concatenated series axis=1

          0         1
1  0.936029       NaN
2  0.194529       NaN
3  0.448288       NaN
4  0.952875       NaN
5       NaN  0.392544
6       NaN  0.978594
7       NaN  0.453258
8       NaN  0.661619

Concatenated series axis=0

1    0.936029
2    0.194529
3    0.448288
4    0.952875
5    0.392544
6    0.978594
7    0.453258
8    0.661619
dtype: float64
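
The NaN values in the axis=1 result come from the two indexes having no labels in common: concat() aligns on the union of the indexes by default. As a rough sketch with made-up series (s1 and s2 are illustrative and overlap on purpose), the join option of concat() can be set to 'inner' to keep only the shared labels:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=[1, 2, 3])
s2 = pd.Series([10.0, 20.0, 30.0], index=[2, 3, 4])

# axis=0 (default): a longer series; labels 2 and 3 now appear twice.
stacked = pd.concat([s1, s2])

# axis=1: a dataframe aligned on the union of the indexes (1, 2, 3, 4).
outer = pd.concat([s1, s2], axis=1)

# join='inner': only labels present in both series (2 and 3) are kept.
inner = pd.concat([s1, s2], axis=1, join='inner')

print(stacked)
print(outer)
print(inner)
```

Note that concatenating along axis=0 does not check for duplicate labels, so the stacked result here has a non-unique index.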


After a plain concatenation, the parts that were combined are not identifiable in the result: along axis=0 the rows simply run together, and along axis=1 the columns are just numbered 0 and 1. Suppose we want to create a hierarchical index on the axis of concatenation so that each part stays labeled. To do this, we use the keys option, as shown in the following program:

import pandas as pd
import numpy as np

ser1 = pd.Series(np.random.rand(4), index=[1,2,3,4])
print('Series 1\n')
print(ser1)

ser2 = pd.Series(np.random.rand(4), index=[5,6,7,8])
print('\nSeries 2\n')
print(ser2)

print('\nConcatenated series using the keys option\n')
print(pd.concat([ser1,ser2], keys=[1,2]))

print('\nConcatenated series using the keys option along axis=1\n')
print(pd.concat([ser1,ser2], axis=1, keys=[1,2]))


The output of the program is shown below:

Series 1

1    0.034474
2    0.984395
3    0.912107
4    0.543064
dtype: float64

Series 2

5    0.864616
6    0.231658
7    0.875177
8    0.400951
dtype: float64

Concatenated series using the keys option

1  1    0.034474
   2    0.984395
   3    0.912107
   4    0.543064
2  5    0.864616
   6    0.231658
   7    0.875177
   8    0.400951
dtype: float64

Concatenated series using the keys option along axis=1

          1         2
1  0.034474       NaN
2  0.984395       NaN
3  0.912107       NaN
4  0.543064       NaN
5       NaN  0.864616
6       NaN  0.231658
7       NaN  0.875177
8       NaN  0.400951


As you may have noticed, when series are combined along axis=1 the keys become the column headers of the dataframe.
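
One practical use of the keys option is pulling a single part back out of the combined result. The following is a minimal sketch with made-up series and string keys ('first' and 'second' are illustrative labels, chosen instead of 1 and 2 so they cannot be confused with index values):

```python
import pandas as pd

s1 = pd.Series([0.1, 0.2], index=[1, 2])
s2 = pd.Series([0.3, 0.4], index=[3, 4])

# Along axis=0 the keys form the outer level of a hierarchical index.
labeled = pd.concat([s1, s2], keys=['first', 'second'])
second_part = labeled.loc['second']   # only the rows that came from s2
print(second_part)

# Along axis=1 the same keys become the column headers.
wide = pd.concat([s1, s2], axis=1, keys=['first', 'second'])
print(list(wide.columns))
```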

Just like series, dataframes can be concatenated. The following program shows concatenation applied to dataframes:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.random.rand(9).reshape(3,3), index=[1,2,3], columns=['A','B','C'])
print('Frame 1\n')
print(frame1)

frame2 = pd.DataFrame(np.random.rand(9).reshape(3,3), index=[4,5,6], columns=['A','B','C'])
print('\nFrame 2\n')
print(frame2)

print('\nConcatenated frames\n')
print(pd.concat([frame1, frame2]))

print('\nConcatenated frames along axis=1\n')
print(pd.concat([frame1, frame2], axis=1))



The output of the program is shown below:

Frame 1

          A         B         C
1  0.216094  0.206833  0.565031
2  0.278919  0.311937  0.410026
3  0.262882  0.487224  0.489479

Frame 2

          A         B         C
4  0.660482  0.491644  0.411970
5  0.511529  0.394583  0.475184
6  0.638702  0.849363  0.190679

Concatenated frames

          A         B         C
1  0.216094  0.206833  0.565031
2  0.278919  0.311937  0.410026
3  0.262882  0.487224  0.489479
4  0.660482  0.491644  0.411970
5  0.511529  0.394583  0.475184
6  0.638702  0.849363  0.190679

Concatenated frames along axis=1

          A         B         C         A         B         C
1  0.216094  0.206833  0.565031       NaN       NaN       NaN
2  0.278919  0.311937  0.410026       NaN       NaN       NaN
3  0.262882  0.487224  0.489479       NaN       NaN       NaN
4       NaN       NaN       NaN  0.660482  0.491644  0.411970
5       NaN       NaN       NaN  0.511529  0.394583  0.475184
6       NaN       NaN       NaN  0.638702  0.849363  0.190679
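
The NaN blocks in the axis=1 result appear because the two frames share no index labels. As a small sketch with made-up frames (left and right are illustrative names), frames that share an index line up without gaps, and ignore_index=True is one way to renumber the rows when the original labels do not matter:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=[1, 2])
right = pd.DataFrame({'C': [5, 6], 'D': [7, 8]}, index=[1, 2])

# Same index on both sides: columns are placed side by side, no NaN.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)

# Stacking rows and discarding the old labels in favour of 0..n-1.
renumbered = pd.concat([left, left], ignore_index=True)
print(renumbered)
```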


Now consider a scenario in which the two datasets have indexes that overlap entirely or at least partially, and we want to fill in one from the other. This kind of combination cannot be obtained with either merging or concatenation. One function applicable to series is combine_first(), which performs this operation along with data alignment: it keeps the values of the calling series and takes values from the argument only where the caller has no value for a label. See the following program:

import pandas as pd
import numpy as np

ser1 = pd.Series(np.random.rand(5),index=[1,2,3,4,5])
print('Series 1\n')
print(ser1)

ser2 = pd.Series(np.random.rand(4),index=[2,4,5,6])
print('\nSeries 2\n')
print(ser2)

print('\nCombined series with ser2 as an argument\n')
print(ser1.combine_first(ser2))

print('\nCombined series with ser1 as an argument\n')
print(ser2.combine_first(ser1))

print('\nCombined series with partial overlap\n')
print(ser1[:3].combine_first(ser2[:3]))



The output of the program is shown below:

Series 1

1    0.546086
2    0.855131
3    0.975251
4    0.159282
5    0.778717
dtype: float64

Series 2

2    0.420990
4    0.883285
5    0.483201
6    0.848290
dtype: float64

Combined series with ser2 as an argument

1    0.546086
2    0.855131
3    0.975251
4    0.159282
5    0.778717
6    0.848290
dtype: float64

Combined series with ser1 as an argument

1    0.546086
2    0.420990
3    0.975251
4    0.883285
5    0.483201
6    0.848290
dtype: float64

Combined series with partial overlap

1    0.546086
2    0.855131
3    0.975251
4    0.883285
5    0.483201
dtype: float64
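
The same method also fills gaps inside the calling series, not only labels it lacks. A minimal sketch, assuming two made-up series (primary and backup are illustrative names) where the caller holds a NaN:

```python
import numpy as np
import pandas as pd

primary = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
backup = pd.Series([20.0, 30.0, 40.0], index=['b', 'c', 'd'])

# Values from primary win wherever they exist; backup supplies the
# NaN at 'b' and the label 'd' that primary does not have.
combined = primary.combine_first(backup)
print(combined)
```

Here 'c' keeps the value 3.0 from primary even though backup also has an entry for it, which matches the behavior seen in the program above.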


Here I am ending today’s post. Until we meet again, keep practicing and learning Python, as Python is easy to learn!