Monday, September 30, 2019

Mapping Global Data Sets: JSON Format

In the previous post we used data sets stored in the CSV format. In this post we'll work with data stored in the JSON format. First we'll download a data set representing all the earthquakes that have occurred in the world during the previous 24 hours. Then we'll make a map showing the location of these earthquakes and how significant each one was. Because the data is stored in the JSON format, we'll work with it using the json module. Using Plotly's beginner-friendly mapping tool for location-based data, we'll create visualizations that clearly show the global distribution of earthquakes.

Download the file eq_data_1_day_m1.json from https://earthquake.usgs.gov/earthquakes/feed/ and save it to the data folder in your project directory. This file includes data for all earthquakes with a magnitude of M1 or greater that took place in the last 24 hours.

When we open eq_data_1_day_m1.json, we'll see that it's very dense and hard to read:

{"type":"FeatureCollection","metadata":{"generated":1550361461000,...
{"type":"Feature","properties":{"mag":1.2,"place":"11km NNE of Nor...
{"type":"Feature","properties":{"mag":4.3,"place":"69km NNW of Ayn...
{"type":"Feature","properties":{"mag":3.6,"place":"126km SSE of Co...
{"type":"Feature","properties":{"mag":2.1,"place":"21km NNW of Teh...
{"type":"Feature","properties":{"mag":4,"place":"57km SSW of Kakto...
--snip--

This file is formatted more for machines than it is for humans. But we can see that the file contains some dictionaries, as well as information that we’re interested in, such as earthquake magnitudes and
locations. The json module provides a variety of tools for exploring and working with JSON data. Some of these tools will help us reformat the file so we can look at the raw data more easily before we begin to work with it programmatically.

In the following program we'll load the data and display it in a format that’s easier to read. This is a long data file, so instead of printing it, we’ll rewrite the data to a new file. Then we can open that file and scroll back and forth easily through the data. See the code below:

import json

# Explore the structure of the data.
filename = 'data/eq_data_1_day_m1.json'
with open(filename) as f:
    all_eq_data = json.load(f)

readable_file = 'data/readable_eq_data.json'
with open(readable_file, 'w') as f:
    json.dump(all_eq_data, f, indent=4)


We first import the json module so we can load the data properly from the file, and then store the entire data set in all_eq_data. The json.load() function converts the data into a dictionary. Next we write this same data to a new file in a more readable format. The json.dump() function takes a JSON data object and a file object, and writes the data to that file.

The indent=4 argument tells dump() to format the data using indentation that matches the data’s
structure. When you look in your data directory and open the file readable_eq_data.json, here's what you'll see:

{
    "type": "FeatureCollection",
    "metadata": {
        "generated": 1550361461000,
        "url": "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/1.0_day.geojson",
        "title": "USGS Magnitude 1.0+ Earthquakes, Past Day",
        "status": 200,
        "api": "1.7.0",
        "count": 158
    },

"features": [
        {
            "type": "Feature",

---snip---
     "bbox": [
        -176.7088,
        -30.7399,
        -1.16,
        164.5151,
        69.5346,
        249.4
    ]
}


The first part of the file includes a section with the key "metadata". This tells us when the data file was generated and where we can find the data online. It also gives us a human-readable title and the number of earthquakes included in this file: in this 24-hour period, 158 earthquakes were recorded. This GeoJSON file has a structure that's helpful for location-based data. The information is stored in a list associated with the key "features".

Because this file contains earthquake data, the data is in list form, where every item in the list corresponds to a single earthquake. This structure might look confusing, but it's quite powerful: it allows geologists to store as much information as they need about each earthquake in a dictionary, and then collect all those dictionaries into one big list.
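To make this structure concrete, here's the access pattern on a tiny inline sample that mirrors the file (the values are taken from the listings above; the real file has 158 feature dictionaries):

```python
# A tiny inline sample mirroring the structure of eq_data_1_day_m1.json.
all_eq_data = {
    'metadata': {'title': 'USGS Magnitude 1.0+ Earthquakes, Past Day', 'count': 158},
    'features': [
        {'type': 'Feature', 'properties': {'mag': 0.96}},
        {'type': 'Feature', 'properties': {'mag': 1.2}},
    ],
}

print(all_eq_data['metadata']['count'])                 # 158
print(len(all_eq_data['features']))                     # 2 (158 in the real file)
print(all_eq_data['features'][0]['properties']['mag'])  # 0.96
```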

Here is the dictionary representing a single earthquake:

 "features": [
        {
            "type": "Feature",
            "properties": {
                "mag": 0.96,
                "place": "8km NE of Aguanga, CA",
                "time": 1550360775470,
                "updated": 1550360993593,
                "tz": -480,
                "url": "https://earthquake.usgs.gov/earthquakes/eventpage/ci37532978",
                "detail": "https://earthquake.usgs.gov/earthquakes/feed/v1.0/detail/ci37532978.geojson",
                "felt": null,
                "cdi": null,
                "mmi": null,
                "alert": null,
                "status": "automatic",
                "tsunami": 0,
                "sig": 14,
                "net": "ci",
                "code": "37532978",
                "ids": ",ci37532978,",
                "sources": ",ci,",
                "types": ",geoserve,nearby-cities,origin,phase-data,",
                "nst": 32,
                "dmin": 0.02648,
                "rms": 0.15,
                "gap": 37,
                "magType": "ml",
                "type": "earthquake",
                "title": "M 1.0 - 8km NE of Aguanga, CA"
            },
            "geometry": {
                "type": "Point",
                "coordinates": [
                    -116.7941667,
                    33.4863333,
                    3.22
                ]
            },
            "id": "ci37532978"
        },


The key "properties" contains a lot of information about each earthquake. We’re mainly interested in the magnitude of each quake, which is associated with the key "mag". We’re also interested in the title of each earthquake, which provides a nice summary of its magnitude and location.

The key "geometry" helps us understand where the earthquake occurred. We’ll need this information to map each event. We can find the longitude and the latitude for each earthquake in a list associated
with the key "coordinates".
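Pulling the title and coordinates out of a single earthquake dictionary follows the same nested-key pattern. Here's a quick sketch using an inline copy of the Aguanga dictionary shown above, trimmed to the keys we care about:

```python
# One earthquake dictionary, trimmed to the keys discussed above.
eq_dict = {
    'properties': {'mag': 0.96, 'title': 'M 1.0 - 8km NE of Aguanga, CA'},
    'geometry': {'type': 'Point', 'coordinates': [-116.7941667, 33.4863333, 3.22]},
}

title = eq_dict['properties']['title']
lon, lat = eq_dict['geometry']['coordinates'][:2]
print(title)     # M 1.0 - 8km NE of Aguanga, CA
print(lon, lat)  # -116.7941667 33.4863333
```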

This file contains more nesting than we'd use in most of the code we write, so don't worry if it looks confusing: Python will handle most of the complexity. We'll only be working with one or two nesting levels at a time. We'll start by pulling out a dictionary for each earthquake that was recorded in the 24-hour time period.

The following program makes a list that contains all the information about every earthquake that occurred:

import json

# Explore the structure of the data.
filename = 'data/eq_data_1_day_m1.json'
with open(filename) as f:
    all_eq_data = json.load(f)

all_eq_dicts = all_eq_data['features']

print(len(all_eq_dicts)) 


We take the data associated with the key 'features' and store it in all_eq_dicts. We know this file contains records about 158 earthquakes, and the output shown below verifies that we’ve captured all of the earthquakes in the file:

158
------------------
(program exited with code: 0)

Press any key to continue . . .


Using the list containing data about each earthquake, we can loop through that list and extract any information we want. The next program pulls the magnitude of each earthquake. Add the code shown below to the previous program:

mags = []
for eq_dict in all_eq_dicts:
    mag = eq_dict['properties']['mag']
    mags.append(mag)

print(mags[:10])


We make an empty list to store the magnitudes, and then loop through the list all_eq_dicts. Inside this loop, each earthquake is represented by the dictionary eq_dict. Each earthquake's magnitude is stored in the 'properties' section of this dictionary under the key 'mag'. We store each magnitude in the variable mag, and then append it to the list mags.

We print the first 10 magnitudes, so we can see whether we’re getting the correct data. The output is shown below:

[0.96, 1.2, 4.3, 3.6, 2.1, 4, 1.06, 2.3, 4.9, 1.8]
------------------
(program exited with code: 0)

Press any key to continue . . .
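Because the magnitudes now live in a plain Python list, the usual list tools apply. For example, using the ten values printed above:

```python
# The first ten magnitudes from the output above.
mags = [0.96, 1.2, 4.3, 3.6, 2.1, 4, 1.06, 2.3, 4.9, 1.8]

# The smallest and largest magnitudes among these readings.
print(min(mags), max(mags))  # 0.96 4.9
```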


Now we’ll pull the location data for each earthquake, and then we can make a map of the earthquakes. The location data is stored under the key "geometry". Inside the geometry dictionary is a "coordinates" key, and the first two values in this list are the longitude and latitude. The following program pulls the location data:

mags, lons, lats = [], [], []
for eq_dict in all_eq_dicts:
    mag = eq_dict['properties']['mag']
    lon = eq_dict['geometry']['coordinates'][0]
    lat = eq_dict['geometry']['coordinates'][1]
    mags.append(mag)
    lons.append(lon)
    lats.append(lat)

print(mags[:10])
print(lons[:5])
print(lats[:5])


We make empty lists for the longitudes and latitudes. The code eq_dict['geometry'] accesses the dictionary representing the geometry element of the earthquake. The second key, 'coordinates', pulls the list of values associated with 'coordinates'. Finally, the index 0 asks for the first value in that list, which corresponds to an earthquake's longitude.

We print the first 10 magnitudes and the first five longitudes and latitudes; the following output shows that we're pulling the correct data:

[0.96, 1.2, 4.3, 3.6, 2.1, 4, 1.06, 2.3, 4.9, 1.8]
[-116.7941667, -148.9865, -74.2343, -161.6801, -118.5316667]
[33.4863333, 64.6673, -12.1025, 54.2232, 35.3098333]
------------------
(program exited with code: 0)

Press any key to continue . . .


Here I am ending this post. In the next post we'll use this data and move on to mapping each earthquake.

Sunday, September 29, 2019

Handling issues arising due to missing/corrupt data

The program we made in the previous post can use data for any location. But some weather stations collect different data than others, and some occasionally malfunction and fail to collect some of the data they're supposed to. Missing data can result in exceptions that crash our programs unless we handle them properly.

As an example, let's see what happens when we attempt to generate a temperature plot for Death Valley, California, for which the data file is death_valley_2018_simple.csv. This file should also be copied to the data folder in our program directory.

As we did before, we'll first check the headers included in this data file; the following program does this:

import csv
filename = 'data/death_valley_2018_simple.csv'
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
   
    for index, column_header in enumerate(header_row):
        print(index, column_header)


The output of the program is shown below:

0 STATION
1 NAME
2 DATE
3 PRCP
4 TMAX
5 TMIN
6 TOBS
------------------
(program exited with code: 0)

Press any key to continue . . .

The date is in the same position at index 2. But the high and low temperatures are at indexes 4 and 5, so we’d need to change the indexes in our code to reflect these new positions. Instead of including an average temperature reading for the day, this station includes TOBS, a reading for a specific observation time.

I removed one of the temperature readings from this file to show what happens when some data is missing from a file. Let's make a program to generate a graph for Death Valley using the indexes we just noted, and see what happens:

import csv
import matplotlib.pyplot as plt
from datetime import datetime
filename = 'data/death_valley_2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
       
    # Get dates and high temperatures from this file.
    dates, highs, lows = [], [],[]
       
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        high = int(row[4])
        low = int(row[5])
        dates.append(current_date)
        highs.append(high)
        lows.append(low)
       
       
# Plot the high and low temperatures.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates,highs, c='red',alpha=0.5)
ax.plot(dates,lows, c='green',alpha=0.5)
plt.fill_between(dates, highs, lows, facecolor='blue', alpha=0.3)

# Format plot.
title = "Daily high and low temperatures - 2018\nDeath Valley, CA"
plt.title(title, fontsize=20)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()


When we run this program we get the following output:

Traceback (most recent call last):
  File "pasing_CSV.py", line 15, in <module>
    high = int(row[4])
ValueError: invalid literal for int() with base 10: ''
------------------
(program exited with code: 1)

Press any key to continue . . . 


The traceback tells us that Python can't process the high temperature for one of the dates because it can't turn an empty string ('') into an integer. Rather than looking through the data to find out which reading is missing, we'll just handle cases of missing data directly.

We’ll run error-checking code when the values are being read from the CSV file to handle exceptions that might arise. The following program shows how that works:

import csv
import matplotlib.pyplot as plt
from datetime import datetime
filename = 'data/death_valley_2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
       
    # Get dates and high temperatures from this file.
    dates, highs, lows = [], [],[]
       
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        try:
            high = int(row[4])
            low = int(row[5])
        except ValueError:
            print(f"Missing data for {current_date}")
        else:
            dates.append(current_date)
            highs.append(high)
            lows.append(low)
       
       
# Plot the high and low temperatures.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates,highs, c='red',alpha=0.5)
ax.plot(dates,lows, c='green',alpha=0.5)
plt.fill_between(dates, highs, lows, facecolor='blue', alpha=0.3)

# Format plot.
title = "Daily high and low temperatures - 2018\nDeath Valley, CA"
plt.title(title, fontsize=20)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

Each time we examine a row, we try to extract the date and the high and low temperature. If any data is missing, Python will raise a ValueError and we handle it by printing an error message that includes the date of the missing data. After printing the error, the loop will continue processing the next row. If all data for a date is retrieved without error, the else block will run and the data will be appended to the appropriate lists.

When we run the program now, we’ll see that only one date had missing data:

Missing data for 2018-02-18 00:00:00

Because the error is handled appropriately, our code is able to generate a plot that skips over the missing data. The figure below shows the resulting plot:


Here we used a try-except-else block to handle missing data. Sometimes you’ll use continue to skip over some data or use remove() or del to eliminate some data after it’s been extracted. Use any approach that works, as long as the result is a meaningful, accurate visualization.
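As a sketch of the continue approach, here's the same row-processing logic rewritten to skip incomplete rows; the sample rows are made up, but they follow the Death Valley layout (date at index 2, high at index 4, low at index 5):

```python
from datetime import datetime

# Made-up sample rows in the Death Valley layout: date at index 2,
# high at index 4, low at index 5; the second row has missing readings.
rows = [
    ['X1', 'DEATH VALLEY, CA US', '2018-02-17', '0.00', '73', '50', ''],
    ['X1', 'DEATH VALLEY, CA US', '2018-02-18', '0.00', '', '', ''],
]

dates, highs, lows = [], [], []
for row in rows:
    if not row[4] or not row[5]:
        continue  # skip rows with a missing high or low reading
    dates.append(datetime.strptime(row[2], '%Y-%m-%d'))
    highs.append(int(row[4]))
    lows.append(int(row[5]))

print(len(dates), highs, lows)  # 1 [73] [50]
```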

Comparing this graph to the Sitka graph, we can see that Death Valley is warmer overall than southeast Alaska, as we expect. Also, the range of temperatures each day is greater in the desert. The height of the shaded region makes this clear.

Here our discussion comes to an end. In the next post we'll start with mapping data sets using the JSON format.








Friday, September 27, 2019

Adding dates to our graph using The datetime Module

In the previous post we drew a simple plot of temperature data. Let's add dates to our graph to make it more useful. The first date from the weather data file is in the second row of the file:

"USW00025333","SITKA AIRPORT, AK US","2018-07-01","0.25",,"62","50"

The data will be read in as a string, so we need a way to convert the string "2018-07-01" to an object representing this date. We can construct an object representing July 1, 2018 using the strptime() method from the datetime module. Let’s see how strptime() works with the help of the following program:

from datetime import datetime
first_date = datetime.strptime('2019-09-26', '%Y-%m-%d')
print(first_date)


We first import the datetime class from the datetime module. Then we call the method strptime() with the string containing the date we want to work with as its first argument. The second argument tells Python how the date is formatted. In this example, Python interprets '%Y-' to mean the part of the string before the first dash is a four-digit year; '%m-' means the part of the string between the two dashes is a number representing the month; and '%d' means the last part of the string is the day of the month, from 1 to 31.

When we run the program we get the following:

2019-09-26 00:00:00
------------------
(program exited with code: 0)

Press any key to continue . . .


The strptime() method can take a variety of format codes to determine how to interpret the date. Some of these codes are:

%A    Weekday name, such as Monday
%B    Month name, such as January
%m    Month, as a number (01 to 12)
%d    Day of the month, as a number (01 to 31)
%Y    Four-digit year, such as 2019
%y    Two-digit year, such as 19
%H    Hour, in 24-hour format (00 to 23)
%I    Hour, in 12-hour format (01 to 12)
%p    AM or PM
%M    Minutes (00 to 59)
%S    Seconds (00 to 61)
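Here are a couple of these codes in action; the date strings are just examples:

```python
from datetime import datetime

d = datetime.strptime('2018-07-01', '%Y-%m-%d')
print(d.year, d.month, d.day)  # 2018 7 1

# The same date written day-first with a two-digit year.
d2 = datetime.strptime('01/07/18', '%d/%m/%y')
print(d2)  # 2018-07-01 00:00:00
```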
Now let's improve our temperature data plot by extracting dates for the daily highs and passing those highs and dates to plot(). See the following program:

import csv
import matplotlib.pyplot as plt
from datetime import datetime
filename = 'data/sitka_weather_07-2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
       
    # Get dates and high temperatures from this file.
    dates, highs = [], []
       
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        high = int(row[5])
        dates.append(current_date)
        highs.append(high)
      
# Plot the high temperatures.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates,highs, c='red')

# Format plot.
plt.title("Daily high temperatures, July 2018", fontsize=24)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

We create two empty lists to store the dates and high temperatures from the file. We then convert the data containing the date information (row[2]) to a datetime object and append it to dates. We pass the dates and the high temperature values to plot(). The call to fig.autofmt_xdate() draws the date labels diagonally to prevent them from overlapping. The figure shown below shows the improved graph:


If we add more data, we can get a more complete picture of the weather in Sitka. The file sitka_weather_2018_simple.csv contains a full year's worth of weather data for Sitka, so we'll use this file now. Just copy it into the same folder where you saved the previous data source file. The following code generates a graph for the entire year's weather:

import csv
import matplotlib.pyplot as plt
from datetime import datetime
filename = 'data/sitka_weather_2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
       
    # Get dates and high temperatures from this file.
    dates, highs = [], []
       
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        high = int(row[5])
        dates.append(current_date)
        highs.append(high)       
       
# Plot the high temperatures.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates,highs, c='red')

# Format plot.
plt.title("Daily high temperatures - 2018", fontsize=24)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()


We've only changed the data source file name and the plot title. The output is as shown below:


So far our plots have used only the high temperature values. To make this graph even more informative, we can include the low temperatures. To do so, we need to extract the low temperatures from the data file and then add them to our graph, as shown in the following program:

import csv
import matplotlib.pyplot as plt
from datetime import datetime
filename = 'data/sitka_weather_2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
       
    # Get dates and high temperatures from this file.
    dates, highs, lows = [], [],[]
       
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        high = int(row[5])
        low = int(row[6])
        dates.append(current_date)
        highs.append(high)
        lows.append(low)
       
       
# Plot the high and low temperatures.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates,highs, c='red')
ax.plot(dates,lows, c='green')

# Format plot.
plt.title("Daily high and low temperatures - 2018", fontsize=24)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

In this program we add an empty list, lows, to hold the low temperatures, and then extract and store the low temperature for each date from the seventh position in each row (row[6]). Next we add a call to plot() for the low temperatures and color these values green. Finally, we update the title of the plot. The output chart is shown below:


Our next step will be to add a finishing touch to the graph by using shading to show the range between each day’s high and low temperatures. To do so, we’ll use the fill_between() method, which takes a series of x-values and two series of y-values, and fills the space between the two y-value series as shown in the program below:

import csv
import matplotlib.pyplot as plt
from datetime import datetime
filename = 'data/sitka_weather_2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
       
    # Get dates and high temperatures from this file.
    dates, highs, lows = [], [],[]
       
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        high = int(row[5])
        low = int(row[6])
        dates.append(current_date)
        highs.append(high)
        lows.append(low)
       
       
# Plot the high and low temperatures.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates,highs, c='red',alpha=0.5)
ax.plot(dates,lows, c='green',alpha=0.5)
plt.fill_between(dates, highs, lows, facecolor='blue', alpha=0.3)

# Format plot.
plt.title("Daily high and low temperatures - 2018", fontsize=24)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

The alpha argument controls a color's transparency. An alpha value of 0 is completely transparent, and 1 (the default) is completely opaque. By setting alpha to 0.5, we make the red and green plot lines appear lighter. Next we pass fill_between() the list dates for the x-values and then the two y-value series highs and lows. The facecolor argument determines the color of the shaded region; we give it a low alpha value of 0.3 so the filled region connects the two data series without distracting from the information they represent. The following figure shows the plot with the shaded region between the highs and lows:



The shading helps make the range between the two data sets immediately apparent.

Here I am ending this post. In the next post we'll consider what happens when missing data results in exceptions that crash our programs.




Thursday, September 26, 2019

Downloading data sets from online sources

We can find an incredible variety of data online, much of which hasn't been examined thoroughly. The ability to analyze this data allows us to discover patterns and connections that no one else has found. In this post and the ones that follow, we'll access and visualize data stored in two common formats, CSV and JSON. We'll download a weather data set from an online source and create working visualizations of it using Python's csv module, which can process weather data stored in the CSV (comma-separated values) format, and analyze high and low temperatures over time in two different locations. Then, using Matplotlib, we'll generate a chart based on our downloaded data to display variations in temperature in two dissimilar environments: Sitka, Alaska, and Death Valley, California.

A simple way to store data in a text file is to write it as a series of values separated by commas; the resulting files are called CSV (comma-separated values) files. For example, here's a chunk of weather data in CSV format:

"INW00025333","DELHI AIRPORT, ND IN","2018-01-01","0.45",,"48","38"

This is an excerpt of some weather data from January 1, 2018 for Delhi, India. It includes the day's high and low temperatures, as well as a number of other measurements from that day. CSV files can be tricky for humans to read, but they're easy for programs to process and extract values from, which speeds up the data analysis process.
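To see how the csv module splits such a line, we can feed it the sample line directly; csv.reader() accepts any iterable of lines, including a list containing one string:

```python
import csv

line = '"INW00025333","DELHI AIRPORT, ND IN","2018-01-01","0.45",,"48","38"'
row = next(csv.reader([line]))
print(row)
# ['INW00025333', 'DELHI AIRPORT, ND IN', '2018-01-01', '0.45', '', '48', '38']
```

Notice that the comma inside the quoted station name survives intact, which a naive split(',') would get wrong.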

We'll begin with a small set of CSV-formatted weather data recorded in Sitka, Alaska, which can be downloaded from https://ncdc.noaa.gov/cdo-web/. Make a folder called data inside the folder where you're saving your programs, and copy the file sitka_weather_07-2018_simple.csv into this new folder.

Let's start with parsing the CSV File Headers as shown in the following program:

import csv

filename = 'data/sitka_weather_07-2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    print(header_row)

After importing the csv module, we assign the name of the file we're working with to filename. We then open the file and assign the resulting file object to f. Next, we call csv.reader() and pass it the file object as an argument to create a reader object associated with that file, which we assign to reader.

The next() function returns the next line in the file when passed the reader object. In the preceding listing, we call next() only once, so we get the first line of the file, which contains the file headers. We store the data that's returned in header_row. As you can see in the output below, header_row contains meaningful, weather-related headers that tell us what information each line of data holds:

['STATION', 'NAME', 'DATE', 'PRCP', 'TAVG', 'TMAX', 'TMIN']
------------------
(program exited with code: 0)

Press any key to continue . . .

The reader object processes the first line of comma-separated values in the file and stores each as an item in a list.

The header STATION represents the code for the weather station that recorded this data. The position of this header tells us that the first value in each line will be the weather station code. The NAME header indicates that the second value in each line is the name of the weather station that made the recording. The rest of the headers specify what kinds of information were recorded in each reading. The data we’re most interested in for now are the date, the high temperature (TMAX), and the low temperature (TMIN). This is a simple data set that contains only precipitation and temperature-related data. When you download your own weather data, you can choose to include a number of other measurements relating to wind speed, direction, and more detailed precipitation data.

To make it easier to understand the file header data, we print each header and its position in the list:

import csv

filename = 'data/sitka_weather_07-2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
  
    for index, column_header in enumerate(header_row):
        print(index, column_header)


The enumerate() function returns both the index and the value of each item as we loop through a list. The output below shows the index of each header:

0 STATION
1 NAME
2 DATE
3 PRCP
4 TAVG
5 TMAX
6 TMIN
------------------
(program exited with code: 0)

Press any key to continue . . .


In the output above, we see that the dates and their high temperatures are stored in columns 2 and 5. To explore this data, we’ll process each row of data in sitka_weather_07-2018_simple.csv and extract the values with the indexes 2 and 5.

As we know which columns of data we need, let’s read in some of that data. First, we’ll read in the high temperature for each day:

import csv

filename = 'data/sitka_weather_07-2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    

    # Get high temperatures from this file.
    highs = []
   
    for row in reader:
        high = int(row[5])
        highs.append(high)
       
print(highs)


We make an empty list called highs and then loop through the remaining rows in the file. The reader object continues from where it left off in the CSV file and automatically returns each line following its current position. Because we’ve already read the header row, the loop will begin at the second line where the actual data begins. On each pass through the loop, we pull the data from index 5, which corresponds to the header TMAX, and assign it to the variable high. We use the int() function to convert the data, which is stored as a string, to a numerical format so we can use it. We then append this value to highs.
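To see why the int() conversion matters, here's a quick sketch using a row like the ones in this file; every value comes out of the reader as a string:

```python
# A sample row as the reader returns it: every value is a string.
row = ['USW00025333', 'SITKA AIRPORT, AK US', '2018-07-01', '0.25', '', '62', '50']
print(type(row[5]).__name__)  # str

high = int(row[5])  # convert '62' to the number 62
print(high + 1)     # 63 -- now usable in arithmetic
```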

The following listing, the output of the program, shows the data now stored in highs:

[62, 58, 70, 70, 67, 59, 58, 62, 66, 59, 56, 63, 65, 58, 56, 59, 64, 60, 60, 61, 65, 65, 63, 59, 64, 65, 68, 66, 64, 67, 65]
------------------
(program exited with code: 0)

Press any key to continue . . .


Now that we have the high temperature for each date, let's create a visualization of this data. To visualize the temperature data, we'll first create a simple plot of the daily highs using Matplotlib, as shown in the following plotting code:

# Plot the high temperatures.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(highs, c='red')

# Format plot.
plt.title("Daily high temperatures, July 2018", fontsize=24)
plt.xlabel('', fontsize=16)
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()


We pass the list of highs to plot() and pass c='red' to plot the points in red. We then specify a few other formatting details, such as the title, font size, and labels. As we have yet to add the dates, we won't label the x-axis, but plt.xlabel() does modify the font size to make the default labels more readable. The complete program and the output plot are shown below:

import csv
import matplotlib.pyplot as plt
filename = 'data/sitka_weather_07-2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    # Get high temperatures from this file.
    highs = []
     
    for row in reader:
        high = int(row[5])
        highs.append(high)
              
# Plot the high temperatures.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(highs, c='red')

# Format plot.
plt.title("Daily high temperatures, July 2018", fontsize=24)
plt.xlabel('', fontsize=16)
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()




Here is the output plot:



Here I am ending today's post. In the next post we'll see how to add the dates to our graph. 





Wednesday, September 25, 2019

Rolling Two Dice

In the previous post we saw how to roll a single D6 die. Now we'll roll two dice together. Rolling two dice results in larger numbers and a different distribution of results. The following code creates two D6 dice to simulate rolling a pair of dice:

from die import Die
from plotly.graph_objs import Bar, Layout
from plotly import offline

# Create two D6 dice.
die_1 = Die()
die_2 = Die()

# Make some rolls, and store results in a list.
results = []

for roll_num in range(1000):
    result = die_1.roll() + die_2.roll()
    results.append(result)

# Analyze the results.
frequencies = []
max_result = die_1.num_sides + die_2.num_sides
for value in range(2, max_result+1):
    frequency = results.count(value)
    frequencies.append(frequency)
       
# Visualize the results.

x_values = list(range(2, max_result+1))
data = [Bar(x=x_values, y=frequencies)]
x_axis_config = {'title': 'Result', 'dtick': 1}
y_axis_config = {'title': 'Frequency of Result'}

my_layout = Layout(title='Results of rolling two D6 dice 1000 times',
        xaxis=x_axis_config, yaxis=y_axis_config)
offline.plot({'data': data, 'layout': my_layout}, filename='d6_d6.html')


In our program each time we roll the pair, we’ll add the two numbers (one from each die) and store the sum in results. After creating two instances of Die, we roll the dice and calculate the sum of the two dice for each roll. The largest possible result (12) is the sum of the largest number on both dice, which we store in max_result. The smallest possible result (2) is the sum of the smallest number on both dice. When we analyze the results, we count the number of results for each value between 2 and max_result.

When creating the chart, we include the dtick key in the x_axis_config dictionary. This setting controls the spacing between tick marks on the x-axis. Now that we have more bars on the histogram, Plotly’s default settings will only label some of the bars. The 'dtick': 1 setting tells Plotly to label every tick mark.

When we run this program, the following output is shown in the browser:

 


The above graph shows the approximate results you’re likely to get when you roll a pair of D6 dice. As you can see, you’re least likely to roll a 2 or a 12 and most likely to roll a 7. This happens because there are six ways to roll a 7, namely: 1 and 6, 2 and 5, 3 and 4, 4 and 3, 5 and 2, or 6 and 1.
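We can verify these counts by enumerating all 36 equally likely combinations of two D6 dice, a quick sketch:

```python
from itertools import product

# Count how many of the 36 equally likely (die_1, die_2)
# combinations produce each sum.
counts = {}
for a, b in product(range(1, 7), repeat=2):
    counts[a + b] = counts.get(a + b, 0) + 1

print(counts[7])              # 6 ways to roll a 7
print(counts[2], counts[12])  # 1 1
```

Since a 7 can be made six times as many ways as a 2 or a 12, its bar in the histogram is roughly six times as tall.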

What if we roll a pair of dice of different sizes? Let's take a D6 and a D10 and see what happens when
we roll them 50,000 times. The code is shown below:

from plotly.graph_objs import Bar, Layout
from plotly import offline

from die import Die

# Create a D6 and a D10.
die_1 = Die()
die_2 = Die(10)

# Make some rolls, and store results in a list.
results = []
for roll_num in range(50_000):
    result = die_1.roll() + die_2.roll()
    results.append(result)
   
# Analyze the results.
frequencies = []
max_result = die_1.num_sides + die_2.num_sides
for value in range(2, max_result+1):
    frequency = results.count(value)
    frequencies.append(frequency)
   
# Visualize the results.
x_values = list(range(2, max_result+1))
data = [Bar(x=x_values, y=frequencies)]

x_axis_config = {'title': 'Result', 'dtick': 1}
y_axis_config = {'title': 'Frequency of Result'}
my_layout = Layout(title='Results of rolling a D6 and a D10 50000 times',
        xaxis=x_axis_config, yaxis=y_axis_config)
offline.plot({'data': data, 'layout': my_layout}, filename='d6_d10.html')

To make a D10, we pass the argument 10 when creating the second Die instance and change the first loop to simulate 50,000 rolls instead of 1000. The output of this program is shown below:



Instead of one most likely result, there are five. This happens because there’s still only one way to roll the smallest value (1 and 1) and the largest value (6 and 10), but the smaller die limits the number of ways you can generate the middle numbers: there are six ways to roll a 7, 8, 9, 10, and 11. Therefore, these are the most common results, and you’re equally likely to roll any one of these numbers.
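Again, we can confirm the plateau by enumerating all 60 equally likely combinations of a D6 and a D10:

```python
from itertools import product

# Enumerate all 60 equally likely (D6, D10) combinations.
counts = {}
for a, b in product(range(1, 7), range(1, 11)):
    counts[a + b] = counts.get(a + b, 0) + 1

# Values 7 through 11 can each be made six ways,
# which produces the flat top of the histogram.
print([counts[v] for v in range(7, 12)])  # [6, 6, 6, 6, 6]
print(counts[2], counts[16])              # 1 1
```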

As an exercise, try to write a program that rolls three dice and plots the results. See you soon with another topic, which will be related to downloading data sets from online sources.



