Big data is becoming more prevalent as a growing number of computers and sensors create data at an ever-increasing rate. In this post, we talk about using the cloud to access these large databases with Python, pandas, and BigQuery, which is a Google API.
BigQuery is based on a RESTful web service: you contact a URL that asks a specific question in a standard format and then get a response back, just as a browser requests and receives a web page. A number of libraries for Python and other languages hide the complexity of the queries going back and forth.
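To make the "URL plus standard format" idea concrete, here is a minimal sketch of what such a request might look like underneath the libraries. It only builds the request and never sends it. The endpoint path is an assumption based on the public BigQuery v2 REST API and should be checked against the current API reference; the project ID and token are hypothetical placeholders.

```python
import json
from urllib.request import Request

# Assumption: base URL of the BigQuery v2 REST API.
BASE_URL = "https://bigquery.googleapis.com/bigquery/v2"


def build_query_request(project_id, sql, access_token):
    """Build (but do not send) an HTTP request that asks BigQuery to run SQL."""
    url = f"{BASE_URL}/projects/{project_id}/queries"
    # The question itself travels as a small JSON document.
    body = json.dumps({"query": sql, "useLegacySql": False}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical project ID and token, for illustration only.
request = build_query_request("my-project-id", "SELECT 1 AS answer", "TOKEN")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In practice you would rarely write this by hand; the Python client library takes care of the URL, the JSON body, and the authentication header for you.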
Abstraction is key to making big software systems workable and reasonable to program. For example, although a web browser uses HTML to display web pages, there are layers and layers of software under that, doing things like transmitting IP packets or manipulating bits. These lower layers differ depending on whether you are using a wired network or a WiFi network. The cool thing about abstraction here is that up at the web page level, we don’t care. We just use it.
BigQuery follows a serverless model. This means that BigQuery has about the highest level of abstraction in the cloud community, relieving the user of worrying about spinning up VMs (bringing new virtual machines online in the cloud), RAM, numbers of CPUs, and so on. You can scale from one to thousands of CPUs in a matter of seconds, paying only for the resources you actually use.
BigQuery has a large number of public big-data datasets, such as those from Medicare and NOAA (the National Oceanic and Atmospheric Administration). We make use of these datasets in our examples below.
One of the most interesting features of BigQuery is the ability to stream data into BigQuery on the order of millions of rows (data samples) per second, data you can start to analyze almost immediately.
We will be using BigQuery with the Python library pandas. The google.cloud Python package provides a library that maps BigQuery data into our friendly pandas DataFrames.
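As a rough sketch of that mapping, the snippet below builds a small SQL query and defines a helper that would return the result as a DataFrame once the credentials described later in this post are in place. It assumes the google-cloud-bigquery and pandas packages are installed (recent versions may also need extras such as db-dtypes). The table name is an assumption used for illustration, so substitute any public table you can see in the BigQuery console; `key_path` is a hypothetical path to your JSON key file.

```python
def build_sample_query(table, limit=10):
    """Return SQL that fetches a few rows from `table`."""
    return f"SELECT * FROM `{table}` LIMIT {int(limit)}"


def query_to_dataframe(sql, key_path):
    """Run `sql` on BigQuery and return the result as a pandas DataFrame."""
    # Imported here so this file still loads before the package is installed.
    from google.cloud import bigquery

    client = bigquery.Client.from_service_account_json(key_path)
    return client.query(sql).to_dataframe()


# Assumed table name, for illustration only.
sample_sql = build_sample_query("bigquery-public-data.noaa_gsod.gsod2019", limit=5)
print(sample_sql)

# Usage, once you have a key file (hypothetical file name):
# df = query_to_dataframe(sample_sql, "MedicareProject-key.json")
# print(df.head())
```

Keeping a LIMIT on exploratory queries is a reasonable habit for inspecting results, although it does not necessarily reduce the amount of data BigQuery scans.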
Signing up on Google for BigQuery
Go to cloud.google.com and sign up for your free trial. Although Google requires a credit card to prove you are not a robot, it will not charge you, even after your trial is over, unless you manually switch to a paid account. If you exceed $300 during your trial (which you shouldn’t), Google will notify you but will not charge you.
The $300 trial limit should be plenty for running a bunch of queries and learning the BigQuery cloud platform.
Setting up your project and authentication
To access the Google cloud, you need to set up a project and then obtain authentication credentials from Google. The following steps show you how to do this:
1. Go to https://console.developers.google.com/ and sign in using the account name and password you created earlier.
2. Next, click the My First Project button in the upper-left corner of the screen. A screen like the one in the figure below appears:
3. Click the New Project button on the pop-up screen.
4. Fill out your project name as MedicareProject and click Create.
5. Next, select your project, MedicareProject, from the upper-left menu button. Make sure you change this from the default “My Project” selection to MedicareProject; otherwise you will be setting up the APIs and authentication for the wrong project. This is an easy mistake to make.
6. After you have selected MedicareProject, click on the “+” button near the top to enable the BigQuery API.
7. When the API selection screen comes up, search for BigQuery and select the BigQuery API. Then click Enable.
8. Now we get our authentication credentials. In the left-hand menu, choose Credentials. A screen like the one in the figure below comes up.
9. Select the BigQuery API and then click the No, I’m Not Using Them option in the Are You Planning to Use This API with the App Engine or Compute Engine? section.
10. Click the What Credentials Do I Need? button to get to our last screen, as shown in the figure below.
11. Type MedicareProject into the Service Account Name textbox and then select Project➪ Owner in the Role menu.
12. Leave the JSON radio button selected and click Continue. A message appears saying that the service account and key have been created. A file called something similar to “MedicareProject-1223xxxxx413.json” is downloaded to your computer.
13. Copy the downloaded file into the directory where you will build your Python program.
In the next post, we will write a program that reads one of the public Medicare datasets and grabs some data for analysis.
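Once the key file is in place, a quick check like the following sketch can confirm that Python can find it. It looks in a directory for a JSON file whose "type" field is "service_account", which is how Google's service account key files identify themselves, and then points the GOOGLE_APPLICATION_CREDENTIALS environment variable at it. The function name is hypothetical, and the code assumes it runs from the directory that holds the key.

```python
import json
import os
from pathlib import Path


def find_service_account_key(directory="."):
    """Return the first JSON file in `directory` that looks like a service account key."""
    for path in sorted(Path(directory).glob("*.json")):
        try:
            info = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or not valid JSON; skip it
        if isinstance(info, dict) and info.get("type") == "service_account":
            return path
    return None


key_path = find_service_account_key()
if key_path is None:
    print("No service account key found in this directory.")
else:
    # Google's client libraries look for this environment variable by default.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(key_path.resolve())
    print(f"Using credentials from {key_path.name}")
```

Treat the key file like a password: keep it out of version control and do not share it.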