Sunday, September 29, 2019

Handling issues arising due to missing/corrupt data

The program we made in the previous post can use data for any location. Sometimes weather stations collect different data than others, and some occasionally malfunction and fail to collect some of the data they’re supposed to. Missing data can result in exceptions that crash our programs unless we handle them properly.

As an example, let’s see what happens when we attempt to generate a temperature plot for Death Valley, California. The data file for this location is death_valley_2018_simple.csv, which should also be copied to the data folder in our program directory.

As we did before, we'll first check which headers are included in this data file. The following program does this:

import csv
filename = 'data/death_valley_2018_simple.csv'
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
   
    for index, column_header in enumerate(header_row):
        print(index, column_header)


The output of the program is shown below:

0 STATION
1 NAME
2 DATE
3 PRCP
4 TMAX
5 TMIN
6 TOBS

The date is in the same position at index 2. But the high and low temperatures are at indexes 4 and 5, so we’d need to change the indexes in our code to reflect these new positions. Instead of including an average temperature reading for the day, this station includes TOBS, a reading for a specific observation time.
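Rather than hard-coding positions, one option is to look up each column by name in the header row. The sketch below assumes the header names shown in the output above. The two data rows, including the station ID and name, are made-up placeholder values so that the snippet runs on its own.

```python
import csv
import io

# Made-up sample rows for illustration; the header matches the output above.
sample = io.StringIO(
    "STATION,NAME,DATE,PRCP,TMAX,TMIN,TOBS\n"
    'STATION_ID,"SAMPLE STATION",2018-01-01,0.00,65,34,43\n'
    'STATION_ID,"SAMPLE STATION",2018-01-02,0.00,61,38,42\n'
)

reader = csv.reader(sample)
header_row = next(reader)

# Find each column's position from its name instead of hard-coding it.
date_index = header_row.index('DATE')
high_index = header_row.index('TMAX')
low_index = header_row.index('TMIN')

print(date_index, high_index, low_index)
for row in reader:
    print(row[date_index], row[high_index], row[low_index])
```

If a station doesn't report one of these readings, header_row.index() raises a ValueError for the missing name, which makes the problem obvious before any rows are processed.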

I removed one of the temperature readings from this file to show what happens when some data is missing from a file. Let's make a program to generate a graph for Death Valley using the indexes we just noted, and see what happens:

import csv
import matplotlib.pyplot as plt
from datetime import datetime
filename = 'data/death_valley_2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
       
    # Get dates, and high and low temperatures, from this file.
    dates, highs, lows = [], [],[]
       
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        high = int(row[4])
        low = int(row[5])
        dates.append(current_date)
        highs.append(high)
        lows.append(low)
       
       
# Plot the high and low temperatures.
# Note: in Matplotlib 3.6 and later, this style is named 'seaborn-v0_8'.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates,highs, c='red',alpha=0.5)
ax.plot(dates,lows, c='green',alpha=0.5)
plt.fill_between(dates, highs, lows, facecolor='blue', alpha=0.3)

# Format plot.
title = "Daily high and low temperatures - 2018\nDeath Valley, CA"
plt.title(title, fontsize=20)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()


When we run this program we get the following output:

Traceback (most recent call last):
  File "pasing_CSV.py", line 15, in <module>
    high = int(row[4])
ValueError: invalid literal for int() with base 10: ''


The traceback tells us that Python can’t process the high temperature for one of the dates because it can’t turn an empty string ('') into an integer. Rather than looking through the data to find out which reading is missing, we’ll just handle cases of missing data directly.
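The failure is easy to reproduce in isolation. This small sketch, which is not part of the plotting program, shows int() accepting a numeric string and rejecting an empty one:

```python
# int() converts a numeric string, but an empty string raises ValueError.
print(int('62'))

try:
    int('')
except ValueError as error:
    message = str(error)
    print(f"ValueError: {message}")
```

The message printed here is the same one that appears on the last line of the traceback.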

We’ll run error-checking code when the values are being read from the CSV file to handle exceptions that might arise. The following program shows how that works:

import csv
import matplotlib.pyplot as plt
from datetime import datetime
filename = 'data/death_valley_2018_simple.csv'

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
       
    # Get dates, and high and low temperatures, from this file.
    dates, highs, lows = [], [],[]
       
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        try:
            high = int(row[4])
            low = int(row[5])
        except ValueError:
            print(f"Missing data for {current_date}")
        else:
            dates.append(current_date)
            highs.append(high)
            lows.append(low)
       
       
# Plot the high and low temperatures.
# Note: in Matplotlib 3.6 and later, this style is named 'seaborn-v0_8'.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates,highs, c='red',alpha=0.5)
ax.plot(dates,lows, c='green',alpha=0.5)
plt.fill_between(dates, highs, lows, facecolor='blue', alpha=0.3)

# Format plot.
title = "Daily high and low temperatures - 2018\nDeath Valley, CA"
plt.title(title, fontsize=20)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

Each time we examine a row, we first parse the date and then try to convert the high and low temperatures to integers. If either temperature is missing, Python raises a ValueError, and we handle it by printing an error message that includes the date of the missing data. After printing the message, the loop continues with the next row. If both values are converted without error, the else block runs and the data is appended to the appropriate lists.

When we run the program now, we’ll see that only one date had missing data:

Missing data for 2018-02-18 00:00:00

Because the error is handled appropriately, our code is able to generate a plot that skips over the missing data. The figure below shows the resulting plot:


Here we used a try-except-else block to handle missing data. Sometimes you’ll use continue to skip over some data or use remove() or del to eliminate some data after it’s been extracted. Use any approach that works, as long as the result is a meaningful, accurate visualization.
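As a rough sketch of the continue alternative, the loop below checks for empty fields before converting them and skips any incomplete row. The rows are made-up sample values in the same column layout as the Death Valley file, included only so that the snippet runs on its own.

```python
from datetime import datetime

# Made-up rows in the layout STATION, NAME, DATE, PRCP, TMAX, TMIN, TOBS.
rows = [
    ['STATION_ID', 'SAMPLE STATION', '2018-02-17', '0.00', '70', '41', '50'],
    ['STATION_ID', 'SAMPLE STATION', '2018-02-18', '0.00', '', '40', '51'],
    ['STATION_ID', 'SAMPLE STATION', '2018-02-19', '0.00', '68', '39', '49'],
]

dates, highs, lows = [], [], []
for row in rows:
    current_date = datetime.strptime(row[2], '%Y-%m-%d')
    # Skip the row if either temperature field is empty.
    if not row[4] or not row[5]:
        print(f"Missing data for {current_date}")
        continue
    dates.append(current_date)
    highs.append(int(row[4]))
    lows.append(int(row[5]))

print(len(dates), highs, lows)
```

Unlike the try-except-else version, this check only catches empty fields. A non-numeric value, such as a stray letter, would still raise a ValueError.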

Comparing this graph to the Sitka graph, we can see that Death Valley is warmer overall than southeast Alaska, as we expect. Also, the range of temperatures each day is greater in the desert. The height of the shaded region makes this clear.

Here our discussion comes to an end. In the next post we'll start mapping data sets using the JSON format.






