Friday, April 22, 2022

PySpark using Python

Apache Spark is an engine for running data analysis, data engineering, and machine learning workloads, both in the cloud and on a local machine. It can run either on a single machine or on a cluster of machines, i.e. a distributed system.


There are already several tools on the market that can perform data engineering tasks, so in this section we will discuss why we might choose Apache Spark over the alternatives.


Features of Spark


  1. Streaming data: Spark processes streaming data as a series of micro-batches, which lets us work with our data in real time using our preferred programming language (see the sketch after this list).
  2. Increasing data science scalability: Apache Spark is one of the most widely used engines for scalable computing, so for data science tasks that require high computational power it is a natural first choice.
  3. Handling Big Data projects: Because of that computational power, Spark can also handle Big Data projects in the cloud, where it runs on distributed clusters rather than a single local machine.
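
To make the streaming feature a little more concrete, here is a minimal sketch of Spark Structured Streaming that reads lines from a socket source and prints each micro-batch to the console. It is for illustration only: the host and port are placeholder assumptions (you could feed it locally with "nc -lk 9999"), and it uses the SparkSession that we will create later in this article.

from pyspark.sql import SparkSession

# Minimal Structured Streaming sketch (assumes a text source on localhost:9999)
spark = SparkSession.builder.appName('streaming_sketch').getOrCreate()

# Read a stream of lines from the socket source
lines = (spark.readStream
         .format('socket')
         .option('host', 'localhost')
         .option('port', 9999)
         .load())

# Print each micro-batch to the console as it arrives
query = lines.writeStream.format('console').start()
query.awaitTermination()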
Installing PySpark Library using “pip”

pip install pyspark

Output:



After successfully installing PySpark


Importing PySpark Library

import pyspark
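
As a quick sanity check (a small addition to the walkthrough), we can print the installed version right after importing the library:

import pyspark

# Confirm the import works and check which version was installed (e.g. 3.2.1)
print(pyspark.__version__)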

Reading the Dataset

Before reading the dataset, let me mention that we will be working with the California Housing Prices dataset. Since I am using Google Colab for the PySpark operations, I picked this dataset up from Colab's sample_data folder.

import pandas as pd
data = pd.read_csv('sample_data/california_housing_test.csv')
data.head()

Output:


Now that we have imported the dataset and had a look at it, let's start working with PySpark. Before we can do any real work with PySpark, we have to start a Spark session, which involves the steps mentioned below.

  1. Import SparkSession from PySpark's pyspark.sql module.
  2. Build the Spark session using the builder attribute of the SparkSession class.
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName('PySpark_article').getOrCreate()

As we can see, with the help of the builder we first call the appName() method to name our session (here I have given *”PySpark_article”* as the session name), and finally we call the getOrCreate() method to create the session and store it in a variable named spark_session.

spark_session

Output:


When we inspect the spark session, it returns the output above, which has the following components:

  1. About the Spark session: In memory
  2. Spark context:
  • Version: the version of Spark we are using – v3.2.1.
  • Master: interestingly, when working in the cloud we might have a whole cluster behind this entry – a master node with worker nodes branching off it in a tree-like structure – but since we are working on a local system rather than a distributed one, it simply returns local.
  • AppName: the name of the app (Spark session) that we gave when declaring it.
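
For what it's worth, the same details shown in the session summary can also be read programmatically from the session object; a small sketch:

# Read the session details directly from the SparkSession object
print(spark_session.version)                # Spark version, e.g. 3.2.1
print(spark_session.sparkContext.master)    # e.g. 'local[*]' on a local machine
print(spark_session.sparkContext.appName)   # 'PySpark_article'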
Next we'll see how to read data using Spark.