Thursday, September 26, 2019

Downloading data sets from online sources

We can find an incredible variety of data online, much of which hasn’t been examined thoroughly. The ability to analyze this data allows us to discover patterns and connections that no one else has found. Much of this data is stored in two common formats, CSV and JSON. In this post, we’ll work with a weather data set downloaded from an online source and create working visualizations of that data. We’ll use Python’s csv module to process weather data stored in the CSV (comma-separated values) format and analyze high and low temperatures over time in two different locations. Then we’ll use Matplotlib to generate a chart from the downloaded data that displays variations in temperature in two dissimilar environments: Sitka, Alaska, and Death Valley, California.

A simple way to store data in a text file is to write the data as a series of values separated by commas, which is called comma-separated values. The resulting files are called CSV files. For example, here’s a chunk of weather data in CSV format:

"INW00025333","DELHI AIRPORT, ND IN","2018-01-01","0.45",,"48","38"

This is an excerpt of weather data from January 1, 2018, recorded at the Delhi Airport station. It includes the day’s high and low temperatures, as well as a number of other measurements from that day. CSV files can be tricky for humans to read, but they’re easy for programs to process and extract values from, which speeds up the data analysis process.
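
To see why a CSV-aware parser helps, here is a minimal sketch that feeds the line above to csv.reader through an in-memory io.StringIO buffer (a stand-in for a real file, used here only for illustration). The station name contains a comma inside its quotes, so a plain split(',') breaks it into two pieces, while csv.reader keeps it as one value and returns the empty field as an empty string:

```python
import csv
import io

# The sample line from above, wrapped in a file-like object.
line = '"INW00025333","DELHI AIRPORT, ND IN","2018-01-01","0.45",,"48","38"\n'

# csv.reader understands quoting, so the comma inside the name is preserved.
row = next(csv.reader(io.StringIO(line)))
print(row)

# A plain string split treats every comma as a separator.
naive = line.strip().split(',')
print(len(row), len(naive))
```

csv.reader yields seven values for this line, while the plain split yields eight.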

We’ll begin with a small set of CSV-formatted weather data recorded in Sitka, Alaska (US), which can be downloaded from https://ncdc.noaa.gov/cdo-web/. Make a folder called data inside the folder where you’re saving your programs, and copy the file sitka_weather_07-2018_simple.csv into this new folder.

Let's start by parsing the CSV file headers, as shown in the following program:

import csv

filename = 'data/sitka_weather_07-2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    print(header_row)

After importing the csv module, we assign the name of the file we’re working with to filename. We then open the file and assign the resulting file object to f. Next, we call csv.reader() and pass it the file object as an argument to create a reader object associated with that file. We assign the reader object to reader.

Python’s built-in next() function returns the next line in the file when passed the reader object. In the preceding listing, we call next() only once, so we get the first line of the file, which contains the file headers. We store the data that’s returned in header_row. As you can see in the output below, header_row contains meaningful, weather-related headers that tell us what information each line of data holds:

['STATION', 'NAME', 'DATE', 'PRCP', 'TAVG', 'TMAX', 'TMIN']

The reader object processes the first line of comma-separated values in the file and stores each as an item in a list.
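
The same steps can be tried without the downloaded file. The sketch below is self-contained: it writes a tiny CSV file to a temporary folder and reads it back with next(). The header matches the Sitka file, but the data row is an illustrative placeholder rather than a real reading, and the file name sample.csv is made up for this example. Passing newline='' to open() is what the csv module documentation recommends for files handed to csv.reader.

```python
import csv
import tempfile
from pathlib import Path

# Illustrative content only: the header matches the Sitka file, the row is made up.
sample = (
    '"STATION","NAME","DATE","PRCP","TAVG","TMAX","TMIN"\n'
    '"STATION_CODE","EXAMPLE AIRPORT, XX US","2018-07-01","0.00",,"60","50"\n'
)

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / 'sample.csv'
    path.write_text(sample)

    with open(path, newline='') as f:
        reader = csv.reader(f)
        header_row = next(reader)        # first call: the header line
        first_data_row = next(reader)    # second call: the first data line
        leftover = next(reader, None)    # the default is returned once rows run out

print(header_row)
print(first_data_row)
print(leftover)
```

Each call to next() advances the reader by one line, which is why a loop over the reader after reading the header starts at the first data row.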

The header STATION represents the code for the weather station that recorded this data. The position of this header tells us that the first value in each line will be the weather station code. The NAME header indicates that the second value in each line is the name of the weather station that made the recording. The rest of the headers specify what kinds of information were recorded in each reading. The data we’re most interested in for now are the date, the high temperature (TMAX), and the low temperature (TMIN). This is a simple data set that contains only precipitation and temperature-related data. When you download your own weather data, you can choose to include a number of other measurements relating to wind speed, direction, and more detailed precipitation data.

To make it easier to understand the file header data, we print each header and its position in the list:

import csv

filename = 'data/sitka_weather_07-2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
  
    for index, column_header in enumerate(header_row):
        print(index, column_header)


The enumerate() function returns both the index and the value of each item as we loop through a list. The program prints each header alongside its index:

0 STATION
1 NAME
2 DATE
3 PRCP
4 TAVG
5 TMAX
6 TMIN


In the output above, we see that the dates and their high temperatures are stored in columns 2 and 5. To explore this data, we’ll process each row of data in sitka_weather_07-2018_simple.csv and extract the values with the indexes 2 and 5.
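
Hard-coding 2 and 5 works for this file, but the positions could differ in a file downloaded with other columns selected. As a hedged alternative, the indexes can be looked up by header name. This sketch uses the header row printed above; the names positions, date_index, and high_index are my own choices for illustration:

```python
# Header row as printed by the program above.
header_row = ['STATION', 'NAME', 'DATE', 'PRCP', 'TAVG', 'TMAX', 'TMIN']

# Map each header name to its position in the row.
positions = {name: index for index, name in enumerate(header_row)}

date_index = positions['DATE']
high_index = positions['TMAX']
print(date_index, high_index)
```

For this header row the lookup gives 2 and 5, the same indexes we read off the output by hand.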

Now that we know which columns of data we need, let’s read in some of that data. First, we’ll read in the high temperature for each day:

import csv

filename = 'data/sitka_weather_07-2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    

    # Get high temperatures from this file.
    highs = []
    for row in reader:
        high = int(row[5])
        highs.append(high)

print(highs)


We make an empty list called highs and then loop through the remaining rows in the file. The reader object continues from where it left off in the CSV file and automatically returns each line following its current position. Because we’ve already read the header row, the loop will begin at the second line where the actual data begins. On each pass through the loop, we pull the data from index 5, which corresponds to the header TMAX, and assign it to the variable high. We use the int() function to convert the data, which is stored as a string, to a numerical format so we can use it. We then append this value to highs.

The output of the program shows the data now stored in highs:

[62, 58, 70, 70, 67, 59, 58, 62, 66, 59, 56, 63, 65, 58, 56, 59, 64, 60, 60, 61, 65, 65, 63, 59, 64, 65, 68, 66, 64, 67, 65]
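
The int() call assumes every row has a high temperature. Weather files sometimes contain empty fields (the fifth value in the Delhi excerpt earlier is empty, for example), and int('') raises a ValueError. Here is a minimal sketch of one way to guard against that. The rows are made up for illustration rather than real readings, and read_highs is a hypothetical helper name, not part of the csv module:

```python
def read_highs(rows, high_index=5):
    """Return the TMAX values as ints, skipping rows where the value is missing."""
    highs = []
    for row in rows:
        try:
            high = int(row[high_index])
        except ValueError:
            # An empty or non-numeric field cannot be converted; report and skip it.
            print(f"Missing high temperature for {row[2]}")
        else:
            highs.append(high)
    return highs


# Made-up rows for illustration; the second one has no TMAX value.
rows = [
    ['CODE', 'EXAMPLE STATION', '2018-07-01', '0.00', '', '60', '50'],
    ['CODE', 'EXAMPLE STATION', '2018-07-02', '0.00', '', '', '51'],
    ['CODE', 'EXAMPLE STATION', '2018-07-03', '0.00', '', '65', '52'],
]
print(read_highs(rows))
```

Skipping a row is only one option; depending on the analysis, you might prefer to stop with an error or record the gap explicitly.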


Now that we have the high temperature for each date, let's create a visualization of this data. We’ll first create a simple plot of the daily highs using Matplotlib, as shown in the following program:

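import matplotlib.pyplot as plt

# Plot the high temperatures.
# Matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8'.
style = 'seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn'
plt.style.use(style)
fig, ax = plt.subplots()
ax.plot(highs, c='red')

# Format plot.
plt.title("Daily high temperatures, July 2018", fontsize=24)
plt.xlabel('', fontsize=16)
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()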


We pass the list of highs to plot() and pass c='red' to draw the line in red. We then specify a few other formatting details, such as the title, font size, and labels. As we have yet to add the dates, we pass an empty string to plt.xlabel() so the x-axis has no label; the call to plt.tick_params() enlarges the tick labels to make them more readable. The complete program and the output plot are shown below:

import csv

import matplotlib.pyplot as plt

filename = 'data/sitka_weather_07-2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)

    # Get high temperatures from this file.
    highs = []
    for row in reader:
        high = int(row[5])
        highs.append(high)

# Plot the high temperatures.
# Matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8'.
style = 'seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn'
plt.style.use(style)
fig, ax = plt.subplots()
ax.plot(highs, c='red')

# Format plot.
plt.title("Daily high temperatures, July 2018", fontsize=24)
plt.xlabel('', fontsize=16)
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()




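Here is the output plot:

[Figure: line chart titled "Daily high temperatures, July 2018", with temperature (F) on the y-axis]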



I'll end today's post here. In the next post, we'll see how to add the dates to our graph.



