In this post, we will be working with the data from National Oceanic and Atmospheric Administration (NOAA) to create interesting, interactive geographical visualizations.

1. Creating a Database

First, we will connect to a database and create three separate tables (named temperatures, countries, and stations).

The temperatures table will contain station ID, year of measurement, month of measurement, and average temperature.
The countries table will contain country names and their corresponding country codes.
The stations table will contain station ID, latitude, longitude, elevation, and the name of the station.

from plotly import express as px
import numpy as np
import pandas as pd
import sqlite3

conn = sqlite3.connect("temps.db") # connect to database
cursor = conn.cursor() 
df1 = pd.read_csv("temps_stacked.csv")
#df1.to_sql("temperatures", conn, index=False)
df2 = pd.read_csv("countries.csv")
#df2.to_sql("countries", conn, index=False)
df3 = pd.read_csv("station-metadata.csv")
#df3.to_sql("stations", conn, index=False)

conn.close() # close connection

df1.head() # temperatures

	ID	Year	Month	Temp
0	ACW00011604	1961	1	-0.89
1	ACW00011604	1961	2	2.36
2	ACW00011604	1961	3	4.72
3	ACW00011604	1961	4	7.73
4	ACW00011604	1961	5	11.28

df2.head() # countries

	FIPS 10-4	ISO 3166	Name
0	AF	AF	Afghanistan
1	AX	-	Akrotiri
2	AL	AL	Albania
3	AG	DZ	Algeria
4	AQ	AS	American Samoa

df3.head() # stations

	ID	LATITUDE	LONGITUDE	STNELEV	NAME
0	ACW00011604	57.7667	11.8667	18.0	SAVE
1	AE000041196	25.3330	55.5170	34.0	SHARJAH_INTER_AIRP
2	AEM00041184	25.6170	55.9330	31.0	RAS_AL_KHAIMAH_INTE
3	AEM00041194	25.2550	55.3640	10.4	DUBAI_INTL
4	AEM00041216	24.4300	54.4700	3.0	ABU_DHABI_BATEEN_AIR

2. Query Function

Now, we will write a query function query_climate_database(), which filters and returns a new dataframe based on the user’s input. The user will specify the country, starting year, ending year, and the month. Then, the user will receive a new dataframe matching all the inputted information as well as the station name, latitude, longitude, and the average temperature at the station during the specific year and month.

def query_climate_database(country, year_begin, year_end, month):
    """
    Query function to filter data
    Args:
    country: the country of interest
    year_begin: earliest year for which should be returned
    year_end: latest year for which should be returned
    month: a specific month of the year
    Return:
    a dataframe containing all matches
    """
    
    conn = sqlite3.connect("temps.db")
    # use f-string
    # left-join stations and countries using country codes
    # left-join stations and temperatures using station IDs
    cmd = f"SELECT S.NAME, S.LATITUDE, S.LONGITUDE, C.Name Country, T.Year, T.Month, T.Temp \
    FROM stations S \
    LEFT JOIN countries C on SUBSTRING(S.id, 1, 2) = C.'FIPS 10-4' \
    LEFT JOIN temperatures t on  S.id = T.id \
    WHERE C.Name = '{country}' AND T.Year BETWEEN {year_begin} AND {year_end} AND T.Month = {month}"
    df = pd.read_sql_query(cmd, conn)
    
    conn.close()
    return df

An example with the function is shown below with the inputs "India", 1980, 2020, and 1.

query_climate_database(country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)

	NAME	LATITUDE	LONGITUDE	Country	Year	Month	Temp
0	PBO_ANANTAPUR	14.583	77.633	India	1980	1	23.48
1	PBO_ANANTAPUR	14.583	77.633	India	1981	1	24.57
2	PBO_ANANTAPUR	14.583	77.633	India	1982	1	24.19
3	PBO_ANANTAPUR	14.583	77.633	India	1983	1	23.51
4	PBO_ANANTAPUR	14.583	77.633	India	1984	1	24.81
...	...	...	...	...	...	...	...
3147	DARJEELING	27.050	88.270	India	1983	1	5.10
3148	DARJEELING	27.050	88.270	India	1986	1	6.90
3149	DARJEELING	27.050	88.270	India	1994	1	8.10
3150	DARJEELING	27.050	88.270	India	1995	1	5.60
3151	DARJEELING	27.050	88.270	India	1997	1	5.70

3152 rows × 7 columns

3. Geographic Scatter Plots

We will use the above function to write a new function temperature_coefficient_plot(), which will address the question:

How does the average yearly change in temperature vary within a given country?

The new function will accept the same four arguments as query_climate_database(), but will have an added explicit argument min_obs, which will be used to filter out stations that have less years worth of data than this specified number. Also, we will add **kwargs for the user to pass onto px.scatter_mapbox(), which is what we will use to create the final visualization.

In order to compute the estimates of yearly increase in temperature, we will use a linear regression model for each station and grab all the corresponding first coefficients.

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    """
    Create a plot containing estimates of yearly temperature increase
    Args:
    country: the country of interest
    year_begin: earliest year for which should be returned
    year_end: latest year for which should be returned
    month: a specific month of the year
    min_obs: minimum required number of years of data for any given station
    **kwargs: additional keyword arguments
    Return:
    a scatter_mapbox plot
    """
    
    # pull the correct data using the query function
    df = query_climate_database(country = country, year_begin = year_begin, 
                                year_end = year_end, month = month)
    
    # filter out stations based on min_obs argument
    df = df.groupby('NAME').filter(lambda x : x['Year'].count() >= min_obs)
    
    # fit the linear model for each station
    lin_model = px.get_trendline_results(px.scatter(df, x="Year", y="Temp", facet_col="NAME", 
                                                 trendline="ols", facet_col_wrap = 12, facet_row_spacing = 0.001))
    result = lin_model["px_fit_results"]
    coef = [list(result.iloc[i].params[[1]]) for i in range(len(result))]
    coef = list(np.round(np.concatenate(coef), 3)) # round the results
    diff = dict(zip(df["NAME"].unique(),coef))
    df["Estimated Yearly Increase"] = df["NAME"].map(diff) # map the results onto a new column
    
    months = {
        1: "January", 2: "February", 3: "March", 4: "April",
        5: "May", 6: "June", 7: "July", 8: "August",
        9: "September", 10: "October", 11: "November", 12: "December"
    }
    
    # create plot
    fig = px.scatter_mapbox(df, lon = "LONGITUDE", lat = "LATITUDE", color = "Estimated Yearly Increase",
                           color_continuous_midpoint = 0, hover_name = "NAME", height = 600,
                           hover_data = ["LATITUDE", "LONGITUDE", "Estimated Yearly Increase"],
                           title = "Estimates of Yearly Increase in Temperature in " + months.get(month) 
                            + "<br>for stations in " + str(country) + ", years " + str(year_begin) + " - " + str(year_end),
                           **kwargs
                           )
    fig.layout.coloraxis.colorbar.title = 'Estimated Yearly <br> Increase (°C)' 
    fig.update_layout(margin=dict(l=50, r=50, t=100, b=50)) 
    return fig

Let’s demonstrate the above function! We will again use the inputs "India", 1980, 2020, and 1 from Part 2, and add on 10 for the argument min_obs. We will also add extra arguments to pass onto px.scatter_mapbox(). If everything goes right, the resulting visualization should contain all stations in India with at least 10 years worth of data between the years 1980 and 2020 for the month of January.

The station data will then be plotted in the form of a scatter plot on a world map. Each data point will contain information about the corresponding station’s estimate of yearly temperature increase, which can be explicitly found by hovering over the data point or estimated by looking at its color.

color_map = px.colors.diverging.RdGy_r # produce a color map 
fig = temperature_coefficient_plot("India", 1980, 2020, 1, 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale = color_map)
fig.show()

Cool! The above gives us a neat visualization created with Plotly Express that answers our central question. Let’s try passing on different arguments.

color_map = px.colors.diverging.RdGy_r  
fig = temperature_coefficient_plot("France", 1993, 2012, 5, 5,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale = color_map)
fig.show()