Monday, July 8, 2019

First big-data program

Our program reads one of the public data Medicare datasets and grabs some data for analysis. I'll be using the inpatient_charges_2015dataset. We use a SQL query to select the information from the dataset that we want to look at and eventually analyze. The figure below shows all the columns in the inpatient_charges_2015 dataset:


The code for our program is shown below:

import pandas as pd
from google.cloud import bigquery


# set up the query

QUERY = """
        SELECT provider_city, provider_state, drg_definition,
        average_total_payments, average_medicare_payments
        FROM `bigquery-public-data.cms_medicare.inpatient_charges_2015`
        WHERE provider_city = "GREAT FALLS" AND provider_state = "MT"
        ORDER BY provider_city ASC
        LIMIT 1000
        """

client = bigquery.Client.from_service_account_json(
            'MedicareProject2-1223283ef413.json')


query_job = client.query(QUERY)
df = query_job.to_dataframe()

print ("Records Returned: ", df.shape )
print ()
print ("First 3 Records")
print (df.head(3))


First, we import our libraries. Note the google.cloud library and the bigquery import:

import pandas as pd
from google.cloud import bigquery


Next we set up the SQL query used to fetch the data we are looking for into a pandas DataFrame for us to analyze:

# set up the query

QUERY = """
SELECT provider_city, provider_state, drg_definition,
average_total_payments, average_medicare_payments
FROM `bigquery-public-data.cms_medicare.inpatient_charges_2015`
WHERE provider_city = "GREAT FALLS" AND provider_state = "MT"
ORDER BY provider_city ASC
LIMIT 1000
"""


We SELECT the columns that we want which are given in figure above FROM the database bigquery-public-data.cms_medicare.inpatient_charges_2015 only WHERE the provider_city is GREAT FALLS and theprovider_state is MT. Finally we tell the system to order the results by ascending alphanumeric order by the provider_city. Which, since we only selected one city, is somewhat redundant.

Remember to replace the json filename below with your authentication file (which you copied into the program directory earlier).. This one won’t work for you .

client = bigquery.Client.from_service_account_json(
'MedicareProject2-122xxxxxef413.json')


Make sure you have google.cloud library installed, if you don’t have the google.cloud library installed, type this into your terminal window :

pip3 install google-cloud-bigquery

Now we'll fire the query off to the BigQuery cloud:

query_job = client.query(QUERY)

Then we translate the results to our good friend the Pandas DataFrame:

df = query_job.to_dataframe()

Finally we fetch and print a few results to see what we got back:

print ("Records Returned: ", df.shape )
print ()
print ("First 3 Records")
print (df.head(3))


When we run this program we get the following output:

Records Returned: (112, 5)
First 3 Records
provider_city provider_state
drg_ definition average_total_payments average_medicare_payments
0 GREAT FALLS MT 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL
INFA... 11997.11 11080.32
1 GREAT FALLS MT 039 - EXTRACRANIAL PROCEDURES W/O
CC/MCC 7082.85 5954.81
2 GREAT FALLS MT 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL
INFA... 7140.80 6145.38
Visualizing your Data



We found 112 records from Great Falls. You can go back and change the query in your program to select your own city and state.

If you get an authentication error, then go back and make sure you put the correct authentication file into your directory. And if necessary, repeat the whole generate-an-authentication-file routine again, paying special attention to the project name selection.

Try to run this program successfully and in the next post we'll set up another query. Till we meet again keep practicing!

Share:

0 comments:

Post a Comment