Climate Data Visualization

sqlite
python
visualizations
Author

Jun Ryu

Published

January 10, 2023

In this post, we will be working with the data from National Oceanic and Atmospheric Administration (NOAA) to create interesting, interactive geographical visualizations.

1. Creating a Database


First, we will connect to a database and create three separate tables (named temperatures, countries, and stations).

  • The temperatures table will contain station ID, year of measurement, month of measurement, and average temperature.
  • The countries table will contain country names and their corresponding country codes.
  • The stations table will contain station ID, latitude, longitude, elevation, and the name of the station.
from plotly import express as px
import numpy as np
import pandas as pd
import sqlite3

conn = sqlite3.connect("temps.db") # connect to database
cursor = conn.cursor() 
df1 = pd.read_csv("temps_stacked.csv")
#df1.to_sql("temperatures", conn, index=False)
df2 = pd.read_csv("countries.csv")
#df2.to_sql("countries", conn, index=False)
df3 = pd.read_csv("station-metadata.csv")
#df3.to_sql("stations", conn, index=False)

conn.close() # close connection
df1.head() # temperatures
ID Year Month Temp
0 ACW00011604 1961 1 -0.89
1 ACW00011604 1961 2 2.36
2 ACW00011604 1961 3 4.72
3 ACW00011604 1961 4 7.73
4 ACW00011604 1961 5 11.28
df2.head() # countries
FIPS 10-4 ISO 3166 Name
0 AF AF Afghanistan
1 AX - Akrotiri
2 AL AL Albania
3 AG DZ Algeria
4 AQ AS American Samoa
df3.head() # stations
ID LATITUDE LONGITUDE STNELEV NAME
0 ACW00011604 57.7667 11.8667 18.0 SAVE
1 AE000041196 25.3330 55.5170 34.0 SHARJAH_INTER_AIRP
2 AEM00041184 25.6170 55.9330 31.0 RAS_AL_KHAIMAH_INTE
3 AEM00041194 25.2550 55.3640 10.4 DUBAI_INTL
4 AEM00041216 24.4300 54.4700 3.0 ABU_DHABI_BATEEN_AIR

2. Query Function


Now, we will write a query function query_climate_database(), which filters and returns a new dataframe based on the user’s input. The user will specify the country, starting year, ending year, and the month. Then, the user will receive a new dataframe matching all the inputted information as well as the station name, latitude, longitude, and the average temperature at the station during the specific year and month.

def query_climate_database(country, year_begin, year_end, month):
    """
    Query function to filter data
    Args:
    country: the country of interest
    year_begin: earliest year for which should be returned
    year_end: latest year for which should be returned
    month: a specific month of the year
    Return:
    a dataframe containing all matches
    """
    
    conn = sqlite3.connect("temps.db")
    # use f-string
    # left-join stations and countries using country codes
    # left-join stations and temperatures using station IDs
    cmd = f"SELECT S.NAME, S.LATITUDE, S.LONGITUDE, C.Name Country, T.Year, T.Month, T.Temp \
    FROM stations S \
    LEFT JOIN countries C on SUBSTRING(S.id, 1, 2) = C.'FIPS 10-4' \
    LEFT JOIN temperatures t on  S.id = T.id \
    WHERE C.Name = '{country}' AND T.Year BETWEEN {year_begin} AND {year_end} AND T.Month = {month}"
    df = pd.read_sql_query(cmd, conn)
    
    conn.close()
    return df

An example with the function is shown below with the inputs "India", 1980, 2020, and 1.

query_climate_database(country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)
NAME LATITUDE LONGITUDE Country Year Month Temp
0 PBO_ANANTAPUR 14.583 77.633 India 1980 1 23.48
1 PBO_ANANTAPUR 14.583 77.633 India 1981 1 24.57
2 PBO_ANANTAPUR 14.583 77.633 India 1982 1 24.19
3 PBO_ANANTAPUR 14.583 77.633 India 1983 1 23.51
4 PBO_ANANTAPUR 14.583 77.633 India 1984 1 24.81
... ... ... ... ... ... ... ...
3147 DARJEELING 27.050 88.270 India 1983 1 5.10
3148 DARJEELING 27.050 88.270 India 1986 1 6.90
3149 DARJEELING 27.050 88.270 India 1994 1 8.10
3150 DARJEELING 27.050 88.270 India 1995 1 5.60
3151 DARJEELING 27.050 88.270 India 1997 1 5.70

3152 rows × 7 columns

3. Geographic Scatter Plots


We will use the above function to write a new function temperature_coefficient_plot(), which will address the question:

How does the average yearly change in temperature vary within a given country?

The new function will accept the same four arguments as query_climate_database(), but will have an added explicit argument min_obs, which will be used to filter out stations that have less years worth of data than this specified number. Also, we will add **kwargs for the user to pass onto px.scatter_mapbox(), which is what we will use to create the final visualization.

In order to compute the estimates of yearly increase in temperature, we will use a linear regression model for each station and grab all the corresponding first coefficients.

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    """
    Create a plot containing estimates of yearly temperature increase
    Args:
    country: the country of interest
    year_begin: earliest year for which should be returned
    year_end: latest year for which should be returned
    month: a specific month of the year
    min_obs: minimum required number of years of data for any given station
    **kwargs: additional keyword arguments
    Return:
    a scatter_mapbox plot
    """
    
    # pull the correct data using the query function
    df = query_climate_database(country = country, year_begin = year_begin, 
                                year_end = year_end, month = month)
    
    # filter out stations based on min_obs argument
    df = df.groupby('NAME').filter(lambda x : x['Year'].count() >= min_obs)
    
    # fit the linear model for each station
    lin_model = px.get_trendline_results(px.scatter(df, x="Year", y="Temp", facet_col="NAME", 
                                                 trendline="ols", facet_col_wrap = 12, facet_row_spacing = 0.001))
    result = lin_model["px_fit_results"]
    coef = [list(result.iloc[i].params[[1]]) for i in range(len(result))]
    coef = list(np.round(np.concatenate(coef), 3)) # round the results
    diff = dict(zip(df["NAME"].unique(),coef))
    df["Estimated Yearly Increase"] = df["NAME"].map(diff) # map the results onto a new column
    
    months = {
        1: "January", 2: "February", 3: "March", 4: "April",
        5: "May", 6: "June", 7: "July", 8: "August",
        9: "September", 10: "October", 11: "November", 12: "December"
    }
    
    # create plot
    fig = px.scatter_mapbox(df, lon = "LONGITUDE", lat = "LATITUDE", color = "Estimated Yearly Increase",
                           color_continuous_midpoint = 0, hover_name = "NAME", height = 600,
                           hover_data = ["LATITUDE", "LONGITUDE", "Estimated Yearly Increase"],
                           title = "Estimates of Yearly Increase in Temperature in " + months.get(month) 
                            + "<br>for stations in " + str(country) + ", years " + str(year_begin) + " - " + str(year_end),
                           **kwargs
                           )
    fig.layout.coloraxis.colorbar.title = 'Estimated Yearly <br> Increase (°C)' 
    fig.update_layout(margin=dict(l=50, r=50, t=100, b=50)) 
    return fig

Let’s demonstrate the above function! We will again use the inputs "India", 1980, 2020, and 1 from Part 2, and add on 10 for the argument min_obs. We will also add extra arguments to pass onto px.scatter_mapbox(). If everything goes right, the resulting visualization should contain all stations in India with at least 10 years worth of data between the years 1980 and 2020 for the month of January.

The station data will then be plotted in the form of a scatter plot on a world map. Each data point will contain information about the corresponding station’s estimate of yearly temperature increase, which can be explicitly found by hovering over the data point or estimated by looking at its color.

color_map = px.colors.diverging.RdGy_r # produce a color map 
fig = temperature_coefficient_plot("India", 1980, 2020, 1, 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale = color_map)
fig.show()

Cool! The above gives us a neat visualization created with Plotly Express that answers our central question. Let’s try passing on different arguments.

color_map = px.colors.diverging.RdGy_r  
fig = temperature_coefficient_plot("France", 1993, 2012, 5, 5,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale = color_map)
fig.show()