Saturday, April 23, 2022

Reading the Data using Spark

Let's continue from where we left off in the previous post.
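As a quick refresher, here is a minimal sketch of the setup assumed from the previous post (the app name is an illustrative choice; spark_session is simply the variable name used throughout):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name here is illustrative
spark_session = SparkSession.builder.appName('california_housing').getOrCreate()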

df_spark = spark_session.read.csv('sample_data/california_housing_train.csv')

In the above code, Spark uses the SparkSession variable to call the read.csv() function to read the data when it is in CSV format. If you remember, when we need to read a CSV file in pandas, we call read_csv() instead.

In my opinion, when we are learning something new that builds on previous learning, it is good to compare the two, so in this article we will also compare pandas' data processing with Spark's.
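For reference, here is a minimal sketch of the pandas equivalent, assuming pandas is installed and the file sits at the same path:

import pandas as pd

# pandas reads the whole CSV eagerly and, unlike Spark's defaults,
# treats the first row as the header and infers column dtypes
df_pandas = pd.read_csv('sample_data/california_housing_train.csv')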

df_spark

Output:

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string, _c8: string]

Looking at the output, we can see that it returned a DataFrame with a dictionary-like structure, where (_c0, _c1, _c2, ..., _c8) are auto-generated column names and the type shown next to each column is string. Every column comes back as string because Spark does not infer column types from a CSV by default.
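If you want to inspect this schema directly, PySpark's printSchema() prints it as a tree:

# Print the schema as a tree; every field shows up as string because
# Spark does not infer CSV column types by default
df_spark.printSchema()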

PySpark’s show() function

df_spark.show()

Output:


From the above output, we can compare PySpark's show() function with pandas' head() function.

  • head() returns the top 5 records by default (unless we specify otherwise in the arguments), whereas show() returns the top 20 records, which is also mentioned at the bottom of its output. A short comparison sketch follows this list.
  • Another difference we can notice is the appearance of the tabular output; you can compare the two, as I used the head() function at the start of the article.
  • One more major difference, which is also a drawback of PySpark's show() here: the column names appear as (_c0, _c1, ..., _c8), and the actual column names show up as the first row of records. We can fix this issue as well.
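Here is the comparison sketch mentioned above; df_pandas is the pandas DataFrame from the earlier snippet:

# pandas: first 5 rows by default; pass a number to change it
df_pandas.head()      # 5 rows
df_pandas.head(10)    # 10 rows

# PySpark: first 20 rows by default; pass n to change it
df_spark.show()       # 20 rows
df_spark.show(5)      # 5 rows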

So let’s fix it!

df_spark_col = spark_session.read.option('header', 'true').csv('sample_data/california_housing_train.csv')

df_spark_col

Output:

DataFrame[longitude: string, latitude: string, housing_median_age: string, total_rooms: string, total_bedrooms: string, population: string, households: string, median_income: string, median_house_value: string]

Okay! Just by looking at the output, we can say that we have fixed the problem (we will still confirm it below): instead of (_c0, _c1, ..., _c8) in place of the column names, we can now see the actual names of the columns.
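As a side note, the same result can be achieved by passing header directly to csv() as a keyword argument; a minimal sketch:

# header=True is equivalent to .option('header', 'true')
df_spark_col = spark_session.read.csv('sample_data/california_housing_train.csv', header=True)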

Let’s confirm it by looking at the complete data with records.

df_spark = spark_session.read.option('header', 'true').csv('sample_data/california_housing_train.csv')

# Note: show() returns None, so we call it separately instead of
# chaining it onto the assignment
df_spark.show()

Output:

Now we can see the actual column names in place of the indexed (_c0, _c1, ...) formatting.

