Data science I

pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language (pandas-website).

Let’s say we want to investigate the weather data recorded at the Frankfurt International Airport from the year 2015. The data is available in the tabular CSV format:

Read data

First, we want to read the data:

import pandas as pd
tabelle = pd.read_csv("../data/frankfurt_weather.csv", parse_dates=["time"], index_col="time", sep=",", na_values="")

The pandas function pd.read_csv(...) opens the csv file and reads its contents.

The parse_dates argument tells Python to interpret the values in the time column as dates
The index_col argument tells it to use this column as index.
The sep argument tells it which character it should use for separating columns
The na_values argument tells pandas, which values should be interpreted as NaN (missing) values. In our case, missing values are just blank cells ("")

pandas has many more readers for a variety of file types, e.g. Excel, SQL, HDF, HTML, etc. The full list of supported file types and the respective readers can be found here.

# HIDECELL
pd.set_option('display.max_rows', 5)

DataFrame

All data is stored in the tabelle object. This object is a so-called DataFrame, the basic pandas object. You can think of it like a spreadsheet.

To be more precise, it is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

The DataFrame object offers many functions that allow you to work with the data conveniently.

For a first glimps at the data, you can call the head() function which prints out the first 5 rows of the table:

tabelle.head()

	visibility	air_temperature	dewpoint	wind_direction	wind_speed	air_pressure	cloud_height	cloud_cover
time
2015-01-01 00:20:00	2800	1.0	1.0	0.0	0.0	1036.0	200.0	OVC
2015-01-01 00:50:00	1500	1.0	1.0	0.0	0.0	1036.0	100.0	OVC
2015-01-01 01:20:00	1000	1.0	1.0	0.0	0.0	1036.0	100.0	OVC
2015-01-01 01:50:00	700	1.0	1.0	0.0	0.0	1036.0	NaN	NaN
2015-01-01 02:20:00	600	1.0	1.0	0.0	0.0	1036.0	NaN	NaN

To get an overview of the value distributions of the variables within tabelle, you can call the describe() function. It generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values:

tabelle.describe()

	visibility	air_temperature	dewpoint	wind_direction	wind_speed	air_pressure	cloud_height
count	17520.000000	17507.000000	17507.00000	15615.000000	17519.000000	17507.000000	10210.000000
mean	9421.269977	11.679785	6.00674	169.353186	7.120212	1018.367967	2142.938296
...	...	...	...	...	...	...	...
75%	10000.000000	17.000000	10.00000	230.000000	10.000000	1024.000000	3100.000000
max	10000.000000	39.000000	20.00000	360.000000	40.000000	1041.000000	8000.000000

8 rows × 7 columns

The dtypes field of tabelle shows you all column names and the data type of the respective column:

tabelle.dtypes

visibility           int64
air_temperature    float64
                    ...   
cloud_height       float64
cloud_cover         object
Length: 8, dtype: object

If you want to know, how many columns and lines the table consists of, use the shape attribute:

tabelle.shape

(17520, 8)

Let’s say, you want to sort this table by visibility values in descending order.

You can sort a pandas DataFrame by the values of a specific column, using the sort_values() function:

tabelle.sort_values(by="visibility", ascending=False).head()

	visibility	air_temperature	dewpoint	wind_direction	wind_speed	air_pressure	cloud_height	cloud_cover
time
2015-07-02 12:20:00	10000	34.0	15.0	90.0	6.0	1019.0	NaN	NaN
2015-08-13 23:20:00	10000	24.0	17.0	240.0	21.0	1010.0	NaN	NaN
2015-08-13 16:50:00	10000	34.0	15.0	20.0	14.0	1009.0	NaN	NaN
2015-08-13 17:20:00	10000	34.0	16.0	10.0	12.0	1009.0	NaN	NaN
2015-08-13 17:50:00	10000	33.0	16.0	30.0	11.0	1009.0	NaN	NaN

Selection by label

Let’s assume, we are not interested in the complete year 2015 but only in August. We can select the respective rows with the loc attribute.

When the index of the DataFrame is a range of dates (like in this case), you can select a row ( = measurement date) by specifying a date (as string):

tabelle.loc["2015-08"].head() # select all measurements of August 2015.

	visibility	air_temperature	dewpoint	wind_direction	wind_speed	air_pressure	cloud_height	cloud_cover
time
2015-08-01 00:20:00	10000	16.0	5.0	30.0	11.0	1015.0	NaN	NaN
2015-08-01 00:50:00	10000	16.0	5.0	20.0	11.0	1014.0	NaN	NaN
2015-08-01 01:20:00	10000	16.0	5.0	20.0	10.0	1014.0	NaN	NaN
2015-08-01 01:50:00	10000	16.0	5.0	30.0	10.0	1014.0	NaN	NaN
2015-08-01 02:20:00	10000	16.0	5.0	20.0	10.0	1014.0	NaN	NaN

If you want to select a specific date range, you can do so by providing a start and end date (notice that the specific dates do not have to match the date values in the “time” column):

tabelle.loc["2015-08-01 15:55":"2015-08-01 19:20"] # selects all rows between 2015-08-01 15:55 and 2015-08-01 19:20

	visibility	air_temperature	dewpoint	wind_direction	wind_speed	air_pressure	cloud_height	cloud_cover
time
2015-08-01 16:20:00	10000	27.0	6.0	190.0	9.0	1013.0	NaN	NaN
2015-08-01 16:50:00	10000	27.0	7.0	180.0	9.0	1013.0	NaN	NaN
...	...	...	...	...	...	...	...	...
2015-08-01 18:50:00	10000	22.0	10.0	210.0	4.0	1014.0	NaN	NaN
2015-08-01 19:20:00	10000	21.0	10.0	190.0	5.0	1015.0	NaN	NaN

7 rows × 8 columns

What if we are only interested in the measurements between 12:00 and 15:00 for each day? Unfortunately, for this task we cannot use the loc attribute. However, the between_time() function helps us:

tabelle.between_time("12:00", "15:00")

	visibility	air_temperature	dewpoint	wind_direction	wind_speed	air_pressure	cloud_height	cloud_cover
time
2015-01-01 12:20:00	1600	3.0	2.0	220.0	3.0	1036.0	300.0	FEW
2015-01-01 12:50:00	2000	3.0	2.0	NaN	2.0	1036.0	500.0	BKN
...	...	...	...	...	...	...	...	...
2015-12-31 14:20:00	3800	5.0	4.0	NaN	1.0	1024.0	400.0	BKN
2015-12-31 14:50:00	3500	5.0	4.0	NaN	2.0	1024.0	300.0	BKN

2190 rows × 8 columns

Now we know, how to select rows in time series (=measurements). But what if we are only interested in one specific variable, e.g. cloud height?

You can select single columns by:

tabelle.loc[:, "cloud_height"]

time
2015-01-01 00:20:00    200.0
2015-01-01 00:50:00    100.0
                       ...  
2015-12-31 23:20:00      NaN
2015-12-31 23:50:00    100.0
Name: cloud_height, Length: 17520, dtype: float64

Note that selecting a single column returns a Series object and not just the column as a list for example. In principle, a Series is a one-dimensional DataFrame and it provides more or less the same functionality.

For example, you can use the count() method to count the values within a column/Series:

tabelle.loc[:, "cloud_height"].count()

As you can see, the returned value of 10210 is smaller than the total table row count of 17520 because NaN values are excluded automatically.

There are many more methods provided with the Series class but we will not go into further detail here.

If you are interested to get a simple list of the values of the column use the .to_list() method but most of the time it is better to use the .values attribute which returns the values as an array (don’t worry about what an array is right now. We will cover that in the next chapter):

tabelle.loc[:, "cloud_height"].values

array([200., 100., 100., ...,  nan,  nan, 100.])

You can select multiple columns by:

tabelle.loc[:, ["visibility", "air_temperature"]]

	visibility	air_temperature
time
2015-01-01 00:20:00	2800	1.0
2015-01-01 00:50:00	1500	1.0
...	...	...
2015-12-31 23:20:00	500	5.0
2015-12-31 23:50:00	500	5.0

17520 rows × 2 columns

You can also select columns and rows together within one call:

tabelle.loc["2015-11-01 15:55":"2015-11-01 19:20", ["visibility", "air_temperature"]]

	visibility	air_temperature
time
2015-11-01 16:20:00	250	5.0
2015-11-01 16:50:00	200	6.0
...	...	...
2015-11-01 18:50:00	100	6.0
2015-11-01 19:20:00	100	5.0

7 rows × 2 columns

Selection by position (index)

Just as with the label selection, you can select values via their index using iloc:

tabelle.iloc[3:5, 0:2]

	visibility	air_temperature
time
2015-01-01 01:50:00	700	1.0
2015-01-01 02:20:00	600	1.0

Task

Download the following file to your local machine

https://github.com/python-kurs/sommersemester_2019/blob/master/content/data/frankfurt_weather.csv

Import pandas, open the file and set the time column as index
Clip the table to February 01, 2015 and the variables “visibility”, “air_temperature”, and “wind_speed”.

Solution

import pandas as pd
tabelle = pd.read_csv("path/to/frankfurt_weather.csv", parse_dates=["time"], index_col="time", sep=",")

tabelle.loc["2015-02-01", ["visibility", "air_temperature", "wind_speed"]]

Boolean indexing

What if you don’t want to clip the table to specific times but to other conditions? Let’s assume you want to select all rows, where visibility was measured to be below 1000m.

You can do this easily by specifying the conditions within the [ ] brackets:

tabelle[tabelle.visibility < 1000] # Selects all rows where visibility < 1000

	visibility	air_temperature	dewpoint	wind_direction	wind_speed	air_pressure	cloud_height	cloud_cover
time
2015-01-01 01:50:00	700	1.0	1.0	0.0	0.0	1036.0	NaN	NaN
2015-01-01 02:20:00	600	1.0	1.0	0.0	0.0	1036.0	NaN	NaN
...	...	...	...	...	...	...	...	...
2015-12-31 23:20:00	500	5.0	5.0	NaN	2.0	1026.0	NaN	NaN
2015-12-31 23:50:00	500	5.0	5.0	240.0	3.0	1026.0	100.0	BKN

299 rows × 8 columns

You can also use multiple columns to select data after specific criteria. Let’s say we are only interested in rows with visibility values below 1000m and temperature values above 5 °C:

tabelle[(tabelle.visibility < 1000) & (tabelle.air_temperature>5)] # selects all rows where vis < 1000 and temp > 5

	visibility	air_temperature	dewpoint	wind_direction	wind_speed	air_pressure	cloud_height	cloud_cover
time
2015-08-29 02:20:00	500	15.0	15.0	60.0	4.0	1022.0	100.0	BKN
2015-10-05 01:50:00	400	9.0	9.0	10.0	4.0	1016.0	100.0	SCT
...	...	...	...	...	...	...	...	...
2015-12-24 05:20:00	300	6.0	6.0	NaN	1.0	1026.0	100.0	FEW
2015-12-24 06:20:00	700	6.0	6.0	NaN	1.0	1025.0	600.0	SCT

32 rows × 8 columns

Operations

Besides its many selection options, the DataFrame class provides many functions for calculating statistical measures, e.g. the average of each column:

tabelle.mean()

visibility         9421.269977
air_temperature      11.679785
                      ...     
air_pressure       1018.367967
cloud_height       2142.938296
Length: 7, dtype: float64

You can do the same for single columns (Series):

tabelle.visibility.mean()

9421.26997716895

And of course, you can combine all of the above:

tabelle[tabelle.visibility < 1000].wind_speed.median() # Calculate the median wind speed for times when the visibility was below 1000m

3.0

If you want to apply specific operations to your data that are not covered by the standard functions of pandas, you can do so by using the apply() method. This works for single Series or along an axis of a DataFrame object:

print(tabelle.air_temperature)

# First we define the function
def do_sth(x):
    return (x + 5) ** 2

# then we apply it to the temperature Series:
tabelle.air_temperature.apply(do_sth)

time
2015-01-01 00:20:00    1.0
2015-01-01 00:50:00    1.0
                      ... 
2015-12-31 23:20:00    5.0
2015-12-31 23:50:00    5.0
Name: air_temperature, Length: 17520, dtype: float64

time
2015-01-01 00:20:00     36.0
2015-01-01 00:50:00     36.0
                       ...  
2015-12-31 23:20:00    100.0
2015-12-31 23:50:00    100.0
Name: air_temperature, Length: 17520, dtype: float64

But what if we want to do an operation on multiple columns at the same time and save the result in a new column?

Let’s say we want to calculate the difference between air temperature and dewpoint values and save the results to a new column named “temp_dewpt_diff”.

We can achieve this easily via:

tabelle["temp_dewpt_diff"] = tabelle.air_temperature - tabelle.dewpoint
tabelle.head()

	visibility	air_temperature	dewpoint	wind_direction	wind_speed	air_pressure	cloud_height	cloud_cover	temp_dewpt_diff
time
2015-01-01 00:20:00	2800	1.0	1.0	0.0	0.0	1036.0	200.0	OVC	0.0
2015-01-01 00:50:00	1500	1.0	1.0	0.0	0.0	1036.0	100.0	OVC	0.0
2015-01-01 01:20:00	1000	1.0	1.0	0.0	0.0	1036.0	100.0	OVC	0.0
2015-01-01 01:50:00	700	1.0	1.0	0.0	0.0	1036.0	NaN	NaN	0.0
2015-01-01 02:20:00	600	1.0	1.0	0.0	0.0	1036.0	NaN	NaN	0.0

Treating missing (NaN) values

All of the methods shown above exclude missing values in their calculations by default. If you do not want to exclude NaNs, you can do that with the skipna or dropna keywords (in most methods).

For a complete overview of how pandas treats missing data and where you should be aware of implicit behavior, have a look at the pandas docs.

Task

Calculate what percentage of the wind came from north-east (between 0° and 90°) and how much came from south-west (between 180° and 270°).
Calculate the average wind speed for times when the horizontal visibility (vis) was below/above 1000m.
Which windspeed average is higher? Do you have an idea, why?

Solution

NE_count = tabelle[(tabelle.wind_direction>=0.) & (tabelle.wind_direction<=90.0)].wind_direction.count()
SW_count = tabelle[(tabelle.wind_direction>=180.) & (tabelle.wind_direction<=270.0)].wind_direction.count()
total_count = len(tabelle)
NE_perc = NE_count/total_count*100
SW_perc = SW_count/total_count*100
print("North-East: {0:2.2f}%\nSouth-West: {1:2.2f}%".format(NE_perc,SW_perc))

ws_av_below1000 = tabelle[tabelle.visibility<1000.].wind_speed.mean()
ws_av_above1000 = tabelle[tabelle.visibility>1000.].wind_speed.mean()
print("Average windspeed below 1000m: {0:2.2f}m/s\nAverage windspeed above 1000m: {1:2.2f}m/s".format(ws_av_below1000,ws_av_above1000))

Average wind speed is lower during low visibility situations because low visibility is mainly caused by fog. And in Frankfurt, fog is more likely to form during stable and calm conditions.

Resampling

Another often needed functionality when working with time series data is provided by the resample() method. You can use it to resample your data to different temporal resolutions.

The resample() function returns a Resampler object which in turn provides many functions for the specific resampling method.

For example, we can average the Frankfurt Airport data on a 3 hour interval:

tabelle.resample("3H").mean()

	visibility	air_temperature	dewpoint	wind_direction	wind_speed	air_pressure	cloud_height	temp_dewpt_diff
time
2015-01-01 00:00:00	1200.000000	1.0	1.0	0.000000	0.000000	1036.0	133.333333	0.0
2015-01-01 03:00:00	916.666667	1.0	1.0	0.000000	0.000000	1036.0	200.000000	0.0
...	...	...	...	...	...	...	...	...
2015-12-31 18:00:00	500.000000	5.0	5.0	252.500000	2.333333	1025.0	100.000000	0.0
2015-12-31 21:00:00	533.333333	5.0	5.0	206.666667	2.000000	1025.5	100.000000	0.0

2920 rows × 8 columns

But we are not limited to calculating the mean over these intervals. We can also calculate the sum or any other function we like:

tabelle.resample("3H").sum()

	visibility	air_temperature	dewpoint	wind_direction	wind_speed	air_pressure	cloud_height	temp_dewpt_diff
time
2015-01-01 00:00:00	7200	6.0	6.0	0.0	0.0	6216.0	400.0	0.0
2015-01-01 03:00:00	5500	6.0	6.0	0.0	0.0	6216.0	800.0	0.0
...	...	...	...	...	...	...	...	...
2015-12-31 18:00:00	3000	30.0	30.0	1010.0	14.0	6150.0	200.0	0.0
2015-12-31 21:00:00	3200	30.0	30.0	620.0	12.0	6153.0	300.0	0.0

2920 rows × 8 columns

.sum() and .mean() are convenient to calculate the same statistic for all columns of the data frame. Let’s say we want to calculate the mean of the temperature but the maximum of the cloud height. Instead of selecting each column into a new variable and applying the corresponding function to each we can use the .agg() method of the data frame.

tabelle.loc[:, ["air_temperature", "cloud_height"]].resample("3H").agg({"air_temperature": "mean", "cloud_height": "max"}).head()

	air_temperature	cloud_height
time
2015-01-01 00:00:00	1.000000	200.0
2015-01-01 03:00:00	1.000000	200.0
2015-01-01 06:00:00	1.000000	2700.0
2015-01-01 09:00:00	2.166667	300.0
2015-01-01 12:00:00	3.000000	900.0

Info

to compute more than one statistic for all columns just give .agg() a list of the statistics to calculate
this can also be done with the dictionary method to compute multiple statistics for different columns
- e.g.: .agg({"air_temperature": ["mean", "min", "max"], "cloud_height": ["min", "max"]}

We can also resample to a higher temporal resolution than the original data set and provide an interpolation technique for time steps between known measurement points:

tabelle.resample("10Min").interpolate("linear")

	visibility	air_temperature	dewpoint	wind_direction	wind_speed	air_pressure	cloud_height	cloud_cover	temp_dewpt_diff
time
2015-01-01 00:20:00	2800.000000	1.0	1.0	0.000000	0.000000	1036.0	200.000000	OVC	0.0
2015-01-01 00:30:00	2366.666667	1.0	1.0	0.000000	0.000000	1036.0	166.666667	NaN	0.0
...	...	...	...	...	...	...	...	...	...
2015-12-31 23:40:00	500.000000	5.0	5.0	253.333333	2.666667	1026.0	100.000000	NaN	0.0
2015-12-31 23:50:00	500.000000	5.0	5.0	240.000000	3.000000	1026.0	100.000000	BKN	0.0

52558 rows × 9 columns

Write data

If you want to save your data, you can use one of the many writers of pandas. A full list of available writers is given here.

x = tabelle[(tabelle.visibility < 1000) & (tabelle.air_temperature < 5)]
x.to_html("../data/output.html")

Task

Resample the table to 1 hour intervals (averages)
Select all rows from the resampled table where the cloud height was below 1000m and the air pressure was above 1000hPa.
Write the result to the file “my_data.html”
Open the file in your browser. Which column is missing and why?

Solution

tab_res = tabelle.resample("1H").mean()

tab_res_1000 = tab_res[(tab_res.cloud_height<1000) & (tab_res.air_pressure>1000)]

```
tab_res_1000.to_html("my_data.html")
```
The cloumn “cloud_cover” is missing. It was automatically dropped during resampling as the computer doesn’t know how to “average” the listed string values.

So far, you have learned how to read, manipulate and write data in tabular shape. But what if you want to get a visual representation of your data?

Although pandas has the ability to plot the data in graphs as well, we use specific plotting libraries for this purpose. There are many plotting libraries for virtually any data visualization need. Some of these libraries can be used regardless of the specific application, but many of them are intensely focused on accomplishing a particular task. A good overview is given on the FusionCharts Blog.

Here we will focus on the most basic (and oldest) plotting library called “matplotlib”.

matplotlib

is a very powerfull plotting library. This brings some complexity with it, which sometimes can be a little confusing.

So before we get to the actual plotting we have a look at how a plot is structured in general and the vocabulary associated with its elements.

For more details on matplotlib basics have a look at the Usage Guide.

The tutorial and examples pages are also worth looking at.

Source: https://www.machinelearningplus.com/wp-content/uploads/2019/01/99_matplotlib_structure-1.png

Source:https://matplotlib.org/_images/anatomy.png

Line plots

Let’s start with a simple example. We want to take a look at the time series of the temperature data in a simple line plot:

import matplotlib.pyplot as plt

plt.plot(tabelle.air_temperature)
plt.show()

png

This gives us a nice little line plot with the temperature time series depicted as blue line. However, the plot is a little small, the axes are not labeled and there is no legend. Let’s fix that…

plt.rcParams["font.size"] = 18

plt.figure(figsize=(20,5))
plt.plot(tabelle.air_temperature,label="Air temperature at Frankfurt Int. Airport in 2015")
plt.xlim(("2015-01-01","2015-12-31"))
plt.xticks(["2015-{:02d}-15".format(x) for x in range(1,13,1)],["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"])
plt.legend()
plt.ylabel("Temperature (°C)")
plt.show()

png

This looks better. We can see that we do not have to worry about the details of the axes and axis of the plot since pyplot takes care of it for us. Each time we do a function of pyplot it just reuses the axes from the first plot call.

However, let’s say we are not interested in the daily temperature fluctuations.

Additionally while manually constructing the xticks labels works fine, it is a little annoying. Especially since the index of the dataframe (which is used for the x-axis) already contains all information we need and just the formatting is wrong. Fortunately, matplotlib has some convenience functions available in its date submodule we can use if we switch to the “manual-axes” approach.

So let’s resample the data to daily means, format the dates and plot it again…

import matplotlib.dates as mdates # we need this for proper date formatting

temp_resampled = tabelle.air_temperature.resample("1d").mean()

# We asign the figure to a variable to be able to manipulate it later
fig = plt.figure(figsize=(20,5))

# Add a new axes to the plot manually. 
# The argument to the add_subplot function is used for positioning of the plot
# First two numbers are number of rows/columns in the figure
# Third number is the position of the plot to create
ax = fig.add_subplot(111)

# Plot the data
ax.plot(temp_resampled,label="Air temperature at Frankfurt Int. Airport in 2015")

# Plot the legend
ax.legend()

# Set the Y-Axis label
ax.set_ylabel("Temperature (°C)")

# Set the limits (range) of the X-Axis
ax.set_xlim(("2015-01-01","2015-12-31"))

# instead of creating the month labels manually (as we did before) we use matplotlib's built-in date locators and formatters
# set up the locator to find the 15th of every month (even though not every month is the same length this is close enough)
days = mdates.DayLocator(bymonthday=15)
# set the date format to the month shortname
monthFmt = mdates.DateFormatter("%b")

# apply locator and formatter to the ticks
ax.xaxis.set_major_locator(days)
ax.xaxis.set_major_formatter(monthFmt)

plt.show()

png

You can also put multiple lines into one plot:

tabelle_resampled = tabelle.loc[:, ["air_temperature", "dewpoint"]].resample("1d").mean()

fig = plt.figure(figsize=(20,5))

ax = fig.add_subplot(111)

ax.plot(tabelle_resampled.loc[:, "air_temperature"],label="Air temperature")

# Plot a second line in the same figure
ax.plot(tabelle_resampled.loc[:, "dewpoint"],label="Dewpoint")

ax.legend()
ax.set_ylabel("Temperature (°C)")
ax.set_xlim(("2015-01-01","2015-12-31"))

ax.xaxis.set_major_locator(days)
ax.xaxis.set_major_formatter(monthFmt)

plt.show()

png

And if we don’t like the colors or the line types, we can choose different ones by providing the c and linestyle arguments to the plotting function:

fig = plt.figure(figsize=(20,5))
plt.rcParams["font.size"] = 18

ax = fig.add_subplot(111)

ax.plot(tabelle_resampled.loc[:, "air_temperature"], label="Air temperature", c="k", linestyle="-.")
ax.plot(tabelle_resampled.loc[:, "dewpoint"], label="Dewpoint", c="r", linestyle=":")

ax.legend()
ax.set_ylabel("Temperature (°C)")
ax.set_xlim(("2015-01-01","2015-12-31"))

ax.xaxis.set_major_locator(days)
ax.xaxis.set_major_formatter(monthFmt)

plt.show()

png

What if we want to plot 2 variables in the same plot that have different units, e.g. temperature and air pressure values?

For this, we can use the twinx() function. It creates a second subplot that shares the Y-Axis with the first subplot:

tabelle_resampled = tabelle.loc[:, ["air_temperature", "air_pressure"]].resample("1d").mean()

fig = plt.figure(figsize=(20,5))

ax2 = fig.add_subplot(111)
ax1 = ax2.twinx()

ax1.plot(tabelle_resampled.loc[:, "air_temperature"], c="r", label="Temperature")
ax2.plot(tabelle_resampled.loc[:, "air_pressure"], c="b", label="Air pressure")

ax1.set_ylabel("Temperature (°C)")
ax2.set_ylabel("Air pressure (m)")

ax1.legend()
ax2.legend()

ax1.set_xlim(("2015-01-01","2015-12-31"))

ax1.xaxis.set_major_locator(days)
ax1.xaxis.set_major_formatter(monthFmt)
plt.show()

png

Normally, subplots are, however, used to plot multiple figures next to each other in one plot. For example, we can create 2 subplots “ax1” and “ax2”, that depict temperature and wind speeds next to each other:

tabelle_resampled = tabelle.loc[:, ["air_temperature", "wind_speed"]].resample("1d").mean()

fig = plt.figure(figsize=(20,5))
plt.rcParams["font.size"] = 18

ax1 = fig.add_subplot(1,2,1) # add a subplot at index 1 into a plot that has 1 row and 2 columns
ax2 = fig.add_subplot(1,2,2) # add a subplot at index 2 into a plot that has 1 row and 2 columns

ax1.plot(tabelle_resampled.loc[:, "air_temperature"], c="r", label="Temperature")
ax2.plot(tabelle_resampled.loc[:, "wind_speed"], c="b", label="Wind speed")

ax1.legend()
ax2.legend()

ax1.set_xlim(("2015-01-01","2015-12-31"))
ax2.set_xlim(("2015-01-01","2015-12-31"))

ax1.xaxis.set_major_locator(days)
ax1.xaxis.set_major_formatter(monthFmt)

ax2.xaxis.set_major_locator(days)
ax2.xaxis.set_major_formatter(monthFmt)

plt.show()

png

If you want to save your figure, you can simply call the savefig() function instead of show():

fig = plt.figure(figsize=(20,5))

plt.plot(tabelle.air_temperature.resample("1d").mean(), c="r", label="Temperature")

plt.savefig("images/S01E04/beispiel_output.png")

png

Histograms

To get an overview of the distribution of temperature values, we can also plot a histogram of the data:

plt.hist(tabelle.air_temperature.dropna())
plt.show()

png

Again, this plot is nice and simple but we can do better…

temperatur = tabelle.air_temperature.dropna()

plt.figure(figsize=(20,7))

plt.hist(temperatur,bins=sorted(temperatur.unique()-0.5), color="#1aa8a8")
plt.xlabel("Temperature (°C)")
plt.ylabel("Count")
plt.show()

png

We can also plot multiple histograms within one chart, e.g. the temperature values separated into January and August:

temperatur_jan = tabelle.air_temperature["2015-01"].dropna()
temperatur_aug = tabelle.air_temperature["2015-08"].dropna()

plt.figure(figsize=(20,7))
plt.hist(temperatur_jan,bins=sorted(tabelle.air_temperature.dropna().unique()-0.5), color="#00b1ff", alpha=0.5, label="January")
plt.hist(temperatur_aug,bins=sorted(tabelle.air_temperature.dropna().unique()-0.5), color="#ff0000", alpha=0.5, label="August")
plt.xlabel("Temperature (°C)")
plt.ylabel("Count")
plt.legend()
plt.show()

png

And we can draw multiple plot types within one plot. Let’s plot the winter and summer temperature counts as lines and the total temperature counts as bars:

temperatur_jan = tabelle.air_temperature["2015-01"].dropna()
temperatur_aug = tabelle.air_temperature["2015-08"].dropna()

n_w, bins_w, _ = plt.hist(tabelle.air_temperature["2015-01":"2015-03"].dropna(), bins=sorted(tabelle.air_temperature.dropna().unique()-0.5))
n_s, bins_s, _ = plt.hist(tabelle.air_temperature["2015-07":"2015-09"].dropna(), bins=sorted(tabelle.air_temperature.dropna().unique()-0.5))
plt.close()

plt.figure(figsize=(20,7))

plt.hist(tabelle.air_temperature.dropna(), bins=sorted(tabelle.air_temperature.dropna().unique()-0.5), color="#78C179")
plt.plot(bins_w[:-1]+0.5, n_w, label="Winter", c="b")
plt.plot(bins_s[:-1]+0.5, n_s, label="Summer", c="r")

plt.xlabel("Temperature (°C)")
plt.ylabel("Count")
plt.legend()
plt.show()

png

Scatterplots

Another often used plot is the scatterplot.

Let’s use it to plot temperature vs. air pressure values:

plt.figure(figsize=(9,9))
plt.rcParams["font.size"] = 14

plt.scatter(tabelle.air_temperature, tabelle.air_pressure)
plt.show()

png

We can also plot multiple scatterplots on top of each other to depict more information within one plot.

If we do this with temperature/pressure measurements separately for January and August, we can see that air pressures below 1000 hPa were only prevalent during January but not during August and that the air pressure value range was much larger during January than during August:

plt.figure(figsize=(9,9))

plt.scatter(tabelle.air_temperature, tabelle.air_pressure, marker=".", c="grey", alpha=0.1, label="")
plt.scatter(tabelle.air_temperature["2015-01"], tabelle.air_pressure["2015-01"], marker=".", c="#35A3FF", alpha=1, label="January")
plt.scatter(tabelle.air_temperature["2015-08"], tabelle.air_pressure["2015-08"], marker=".", c="#FF824C", alpha=1, label="August")

plt.xlabel("Temperature (°C)")
plt.ylabel("Pressure (hPa)")

plt.legend()
plt.show()

png

There are many more options that you can tweak in order to get a nice looking ready to print plot. And of course, there are many more plot types (e.g. bar plots, box plots, 3d plots, …) We won’t cover all of them here but you can have a look at the matplotlib documentation.

Also, there are many more plotting libraries out there which cover all kinds of useful plotting routines, e.g. for plotting wind data…

Windroses

Using the windrose extension, you can easily plot the wind data in a typical wind rose chart:

from windrose import WindroseAxes

ax = WindroseAxes.from_ax()
ax.bar(tabelle.wind_direction, tabelle.wind_speed, normed=True, opening=0.8, edgecolor="white")
ax.set_legend()
plt.show()

png

Task

Create a simple line plot of the wind speed values of February 2015.
Create a simple histogram of the same data.

Solution

plt.plot(tabelle.loc["2015-02"].wind_speed)

plt.hist(tabelle.loc["2015-02"].wind_speed)

Info

Actually the plotting functions in pandas use matplotlib in the background too. It just makes smart default assumptions to hide many of the cumbersome matplotlib tasks.
There are other high level plotting libraries like:
- Holoviews
- hvplot (which is based on holoviews)
They wrap matplotlib and many of the libraries you can see on the FusionCharts Blog. They make it easy and fast to explore data interactively because you don’t need all the “trouble” of plot configuration like with matplotlib.

So if you are still in the knowledge discovery and exploration phase using one of the high level libraries instead of matplotlib can save you a lot of time, but if you need to configure every detail of your plot there is often no way around the latter.

Exercise 4

Complete the fourth assignment and push your results until tuesday 14:00 next week
Assignment link: https://classroom.github.com/a/j4cFzXy8