Chapter 2 Plotting
The function ggplot_dt
takes a data table containing data and will plot the data as a map. The data table needs to contain columns named lon
and lat
,
and the gridpoints should be regularly spaced (e.g. half-degrees):
data("chirps_monthly")
ggplot_dt(chirps_monthly)
## Warning in modify_dt_map_plotting(dt = dt, data_col = data_col): The data table contains multiple levels per lon/lat. The following selection is plotted:
## month = 11
## year = 1991
The function returns a warning that the data contains multiple levels per coordinate. The reason is that chirps_monthly
contains precipitation-observations for many years and two months, so the plotting function has to select a month and year for plotting. By default, it selects the coordinates that first appear in the data table. Additionally, it prints out a warning telling you which subselection it shows. If we want to plot a different year and month, we can use the data.table subsetting syntax (see last chapter):
ggplot_dt(chirps_monthly[year == 2020 & month == 11])
Let us look at the data for plotting:
print(chirps_monthly)
## Index: <month__year>
## lon lat prec month year terc_cat
## <num> <num> <num> <int> <int> <int>
## 1: 22.0 -12.0 199.68 11 1991 1
## 2: 22.0 -11.5 198.87 11 1991 1
## 3: 22.0 -11.0 157.86 11 1991 0
## 4: 22.0 -10.5 144.15 11 1991 -1
## 5: 22.0 -10.0 153.96 11 1991 -1
## ---
## 209036: 51.5 21.0 7.20 12 2020 -1
## 209037: 51.5 21.5 6.54 12 2020 -1
## 209038: 51.5 22.0 6.15 12 2020 -1
## 209039: 51.5 22.5 4.86 12 2020 -1
## 209040: 51.5 23.0 4.17 12 2020 -1
This data has four dimension variables (lon, lat, year, month), and two value variables (prec
and terc_cat
). By default, ggplot_dt
plots the first column that contains a value variable. You can see which columns are considered dimension variables by runing dimvars(chirps_monthly)
. A full list of column names considered dimension variables is returned if you just type dimvars()
. Everything that is not considered dimension variable is considered value variable.
If we want to change the column that is plotted we can specify the name of the plotting column as second argument:
ggplot_dt(chirps_monthly,'terc_cat')
## Warning in modify_dt_map_plotting(dt = dt, data_col = data_col): The data table contains multiple levels per lon/lat. The following selection is plotted:
## month = 11
## year = 1991
## Check out the function tercile_cat() for plotting tercile-categories.
Note that the function selects different default-colors for the terc_cat
-column.
Of course you can also control the color scales manually, more on this later.
Certain column names are recognized by the package, e.g. all names listed in tc_cols()
are recognized as column names for tercile categories.
For these, the color scale of the plot is adjusted to match the color of verification maps. Moreover, the function kindly points out that there is a separate function for tercile categories,
namely tercile_plot()
, which we will discuss in Plots for the Greater Horn of Africa.
The plotting window is automatically adjusted to the data. If you have data covering the entire earth, the entire earth will be plotted. If you want to focus the plot to a specific region, you can do this by subsetting the data table:
# a region containing Tanzania
ggplot_dt(chirps_monthly[lon %between% c(28,43) & lat %between% c(-12,1)])
## Warning in modify_dt_map_plotting(dt = dt, data_col = data_col): The data table contains multiple levels per lon/lat. The following selection is plotted:
## month = 11
## year = 1991
The function has further optional arguments (also recall that you can access the function documentation summarizing all of this by typing ?ggplot_dt
):
ggplot_dt(chirps_monthly,
mn = 'November 2021 precipitation', # add a title to the plot
rr = c(0,100), # fix the limits of the color scale
name = 'mm') # change the legend label
## Warning in modify_dt_map_plotting(dt = dt, data_col = data_col): The data table contains multiple levels per lon/lat. The following selection is plotted:
## month = 11
## year = 1991
In this example we set the lower limit of the color scale to 0 and the upper limit to 100. By default the data is truncated at the ends of the color scale, so every pixel with precipitation of above 100 mm is now shown in the same dark blue. Setting the range of the color scale is useful for making several plots comparable or to force symmetry around 0, e.g. when plotting correlations or anomalies:
= copy(chirps_monthly) # this is necessary before modifying the example data.
dt := mean(prec), by = .(lon,lat,month)] # add a climatology column to dt
dt[,clim := prec - clim] # add an anomaly column to dt
dt[,ano print(dt)
## Index: <month__year>
## lon lat prec month year terc_cat clim ano
## <num> <num> <num> <int> <int> <int> <num> <num>
## 1: 22.0 -12.0 199.68 11 1991 1 157.920 41.760
## 2: 22.0 -11.5 198.87 11 1991 1 162.781 36.089
## 3: 22.0 -11.0 157.86 11 1991 0 165.031 -7.171
## 4: 22.0 -10.5 144.15 11 1991 -1 169.285 -25.135
## 5: 22.0 -10.0 153.96 11 1991 -1 183.343 -29.383
## ---
## 209036: 51.5 21.0 7.20 12 2020 -1 7.636 -0.436
## 209037: 51.5 21.5 6.54 12 2020 -1 7.258 -0.718
## 209038: 51.5 22.0 6.15 12 2020 -1 6.459 -0.309
## 209039: 51.5 22.5 4.86 12 2020 -1 5.503 -0.643
## 209040: 51.5 23.0 4.17 12 2020 -1 4.890 -0.720
ggplot_dt(dt[month == 11 & year == 2020], 'ano', rr = c(-90,90))
2.1 Controlling the colorscale
By default, ggplot_dt
selects reasonable color scales if it recognizes the name of the column it plots. For example, above we saw that it selected reasonable colors for columns named prec
, ano
(positive values in blue, zero in white, negative values in red). We saw above that different colors were selected for the column terc_cat
(which are more in line with verification maps). We can also customize the selected colors with the arguments low
,mid
, and high
. An overview over available color names can be found here.
ggplot_dt(dt[month == 11 & year == 2020], 'ano',
mn = 'November 2020 rainfall anomaly',
rr = c(-90,90),
low = 'salmon', mid = 'gold', high = 'darkgreen',
name = 'mm')
Note that we set the range argument to c(-90,90)
. Fixing the range mostly makes sense when the range is known (e.g. for correlation plots), or when you want to compare several plots (e.g. for comparing mean square error of different NWP models, all plots should have the same range). If we leave the range argument rr
free, the range is determined from the data.
Additionally, we can use the midpoint
argument to determine the midpoint of the color scale (the value that centers the middle color, white by default).
Finally, the function allows to discretize the color scale. To this end the argument discrete_cs
is set to TRUE
. We can control the breaks of the discrete colorscale
by one of the arguments binwidth
, n.breaks
, or breaks
(the latter takes a vector containing all breakpoints).
Using breaks
or binwidth
is recommended: The argument n.breaks
(which is passed to the function ggplot2::scale_fill_steps2
)
tries to find ‘nice’ breaks and does not work reliably. To revisit the anomaly plot from above:
ggplot_dt(dt[month == 11 & year == 2020], 'ano',
mn = 'November 2021 rainfall anomaly',
discrete_cs = TRUE, binwidth = 60,
low = 'red', mid = 'white', high = 'darkgreen',
midpoint = 0,
name = 'mm')
For saving a created plot, we can either use the function ggplot2::ggsave()
or any of R
s graphical devices, e.g.
pdf(file = '<path to file and filename>.pdf', width = ...,height = ...)
print(pp)
dev.off()
This creates a .pdf file, but you can print .png and some other file formats similarly, see ?Devices
for an overview.
2.2 Plotting values for selected countries
Above, we have already seen an option how to restrict a plot to a particular country: by manually subsetting the data to a rectangle of longitudes and latitudes containing that specific country. This is of course quite tedious, and to make our lives easier we can use the restrict_to_country
-function that takes a data table and a country name, and subsets the data table to only contain gridpoints in the specified country. Country names should be capitalized and in English: e.g. Burundi, Eritrea, Ethiopia, Kenya, Rwanda, Somalia, South Sudan, Sudan, Tanzania, Uganda.
= restrict_to_country(dt[month == 11 & year == 2020],'Kenya')
dt_new ggplot_dt(dt_new, 'ano',
mn = 'November 2020 rainfall anomaly',
discrete_cs = TRUE, midpoint = 0, binwidth = 60)
As we can see, the function restricts the data to all gridcells for which the centerpoint lies within the specified country. This is useful, for example, for calculating mean scores for the specified country. However, it is not optimal for plotting, since all grid cells past the border are censored, even though the original data table contained values there. To this end, the restrict_to_country
function has a rectangle
-argument that you can set to TRUE
for plotting:
= restrict_to_country(dt[month == 11 & year == 2020],'Kenya', rectangle = TRUE)
dt_new ggplot_dt(dt_new, 'ano',
mn = 'November 2020 rainfall anomaly',
discrete_cs = TRUE, midpoint = 0, binwidth = 60)
Instead of a single country name, you can also pass multiple country names in a vector to the function. Moreover, when you use rectangle = TRUE
, you can specify a tolerance tol
in order to widen the plotting window:
= restrict_to_country(dt[month == 11 & year == 2020],
dt_new c('Kenya','Tanzania'),
rectangle = TRUE,tol = 2)
ggplot_dt(dt_new, 'ano',
mn = 'November 2020 rainfall anomaly',
discrete_cs = TRUE, midpoint = 0, binwidth = 50)
The tol = 2
argument means that the function will include a buffer zone of 2 degrees lon/lat outside the specified countries (i.e. 4 gridpoints to each side). Note that the buffer to the south of Tanzania is smaller, because the original data table dt
does not contain any data further south.
2.3 Customized plots
The function ggplot_dt
is, as its name suggests, based on the package ggplot2
. This is a widely-used package and there are many books and tutorials available for getting familiar with the syntax, e.g. https://ggplot2-book.org/. Besides printing out a plot, the function also returns this plot as a gg-object, which can be further modified. In ggplot2
, plots are composed out of multiple layers, allowing for successive adding of layers. This can help us to generate highly customized plots. As an example, let’s revisit the anomaly plot from above and add the location of Nairobi an Addis Abbaba to it:
library(ggplot2)
# get locations as data table:
= data.table(name = c('Addis Abbaba','Nairobi'),
loc lon = c(38.77,36.84),lat = c(9,-1.28))
print(loc)
## name lon lat
## <char> <num> <num>
## 1: Addis Abbaba 38.77 9.00
## 2: Nairobi 36.84 -1.28
= ggplot_dt(dt[month == 11 & year == 2020], 'ano',
pp mn = 'November 2020 rainfall anomaly',
low = 'red', mid = 'white', high = 'darkgreen',
midpoint = 0,
name = 'mm') +
geom_point(data = loc,mapping = aes(x = lon,y = lat)) +
geom_text(data = loc,mapping = aes(x = lon,y = lat,label = name),vjust = 1.5)
print(pp)
Here, we added two layers to the original plot, the first one being the geom_point
-layer that creates the two points at the locations of the cities, and the second being the geom_text
-layer that labels the points by the city names.
A frequently required operation is the changing of the font sizes of title and labels. The easiest way to do this is the command
theme_set(theme_bw(base_size = 16)) # defaults to 12
print(pp)
We can also use ggplots adding-layer-syntax to overwrite existing layers, for example if we want a fully customized colorscale:
library(viridis) # the viridis package contains some nice color scales
= pp + scale_fill_viridis(breaks = seq(-150,150,by = 20),
pp_new guide = guide_colorbar(title = 'my personal color scale',
title.position = 'top',
barwidth = 20,
direction = 'horizontal')) +
xlab('lon') + ylab('lat') + # label axis
theme(panel.background = element_rect(fill = 'salmon'), # change background color (used for missing values) to something whackey
axis.ticks = element_line(), # add ticks...
axis.text = element_text(), # ... and labels for the axis, i.e. some lons and lats.
legend.position = 'bottom')
print(pp_new)
For comparing multiple plots (potentially all of them with the same legend), the function ggarrange
from the package ggpubr
is useful:
library(ggpubr)
# compare 2019 October anomaly to 2020 anomaly:
= c(-150,150) # force color scale to be identical for the plots
rr
= ggplot_dt(dt[month == 11 & year == 2019], 'ano',
pp1 rr = rr,
mn = 'October 2019 rainfall anomaly',
low = 'red', mid = 'white', high = 'darkgreen',
guide = guide_colorbar(barwidth = 20,barheight = 1,direction = 'horizontal'),
midpoint = 0,
name = 'mm') +
geom_point(data = loc,mapping = aes(x = lon,y = lat)) +
geom_text(data = loc,mapping = aes(x = lon,y = lat,label = name),vjust = 1.5)
= ggplot_dt(dt[month == 11 & year == 2020], 'ano',
pp2 rr = rr,
mn = 'October 2020 rainfall anomaly',
low = 'red', mid = 'white', high = 'darkgreen',
midpoint = 0,
name = 'mm') +
geom_point(data = loc,mapping = aes(x = lon,y = lat)) +
geom_text(data = loc,mapping = aes(x = lon,y = lat,label = name),vjust = 1.5)
ggarrange(pp1,pp2,ncol = 2,common.legend = TRUE,legend = 'bottom')
Similar functionality is provided by the patchwork package which we use in the next section.
2.4 Plots for the Greater Horn of Africa
This package was developed in collaboration with ICPAC. Three times a year, ICPAC hosts the Greater Horn of Africa Climate Outlook Forum (GHACOF),
discussing weather forecasts for the coming season with various stakeholders. For aiding this dissemination, SeaVal contains separate plotting function for the Greater Horn of Africa (GHA) region.
These functions use a higher-resolution map, which is part of the SeaVal package itself. The function gha_plot()
works exactly as ggplot_dt
, except that it uses this higher-resolution map:
gha_plot(chirps_monthly[year == 2020 & month == 12])
In this plot it looks a bit stupid that the data outside of the GHA-borders is shown. As discussed [Plotting values for selected countries][above], one way to solve this would be to use the restrict_to_country()
-function,
by passing it the names of all countries in the GHA region. Luckily, we do not have to do this manually, but can use the shorthand restrict_to_GHA()
:
= restrict_to_GHA(chirps_monthly[year == 2020 & month == 12])
temp gha_plot(temp)
Internally, the gha_plot()
-function calls ggplot_dt()
, just changing the map. Therefore, everything discussed in the previous sections works exactly the same for gha_plot()
. The main advantage of ggplot_dt()
is that it uses a global map, so you can use it for visualizing data anywhere.
As already noted above, the chirps_monthly
data has a terc_cat
column, containing tercile categories for any given year. We can produce a separate plot for these using the tercile_plot()
-function:
tercile_plot(temp)
Also for this function you have a lot of options for finetuning your plots, e.g. if you prefer different colors for the three categories. See ?tercile_plot
for details.
Finally, a special plot is available for showing tercile forecasts as a map. Such forecasts take the form of three probabilities - for the three categories ‘below normal’, ‘normal’ and above normal.
SeaVal
expects you to store these probabilities in columns named below
, normal
, and above
, respectively. It also expects you to use probabilities between 0 and 1 (not percent).
We discuss how to derive such forecasts using SeaVal
in Section Getting your data into the right format. For the exposition here we can rely on the example dataset ecmwf_monthly
which contains tercile probabilities predicted by ECMWFs SEAS5 system:
= ecmwf_monthly[year == 2020 & month == 12 & member == 1]
tercile_forecasts print(tercile_forecasts)
## lon lat year month prec member below normal above
## <num> <num> <int> <int> <num> <int> <num> <num> <num>
## 1: 22.0 13.0 2020 12 -0.06 1 0.48 0.36 0.16
## 2: 22.5 12.5 2020 12 -0.03 1 0.48 0.36 0.16
## 3: 22.5 13.0 2020 12 -0.06 1 0.44 0.36 0.20
## 4: 22.5 13.5 2020 12 -0.03 1 0.32 0.36 0.32
## 5: 22.5 14.0 2020 12 0.00 1 0.36 0.36 0.28
## ---
## 2064: 50.5 11.0 2020 12 13.26 1 0.36 0.36 0.28
## 2065: 50.5 11.5 2020 12 20.55 1 0.36 0.24 0.40
## 2066: 51.0 10.5 2020 12 28.47 1 0.32 0.40 0.28
## 2067: 51.0 11.0 2020 12 17.94 1 0.36 0.40 0.24
## 2068: 51.0 11.5 2020 12 25.23 1 0.36 0.28 0.36
We see that the data table contains columns below
, normal
and above
, so one way of visualizing the prediction is by plotting three maps for these three values.
Here, we use the patchwork package to combine several plots into a single plot:
library(patchwork)
= gha_plot(tercile_forecasts, 'below')
p1 = gha_plot(tercile_forecasts, 'normal')
p2 = gha_plot(tercile_forecasts, 'above')
p3
= p1 + p2 + p3 # this syntax is enabled by patchwork
pp plot(pp)
Note how gha_plot()
automatically selects different color scales for the three plots, because it recognizes the column names above
, normal
, and below
. In particular, all scales are centered around 0.333
(=white),
which is the climatological forecast.
Looking at three plots can be somewhat overwhelming. Therefore, at GHACOFs these are usually compressed into a single plot, which shows for each gridpoint only the most probable category (and its probability).
These plots can be generated using the function tfc_gha_plot()
(for tercile-forecast-GHA-plot):
tfc_gha_plot(tercile_forecasts)
Again, the default behavior of this function (e.g. using discrete color scales) can be flexibly modified, see ?tfc_gha_plot
for details. There is also a version of this function using the global map of ggplot_dt
, namely tfc_plot
.