Individual assignment for the visual analytics project.
Exploratory Data Analysis (EDA) is an approach that uses data visualisation to analyse data sets and to summarise their characteristics. It enables users to draw useful information and insights of the data sets with the use of graphical and non-graphical methods to reveal the distribution and statistics of selected variables. Many Data Science experts and literature state that EDA is a critical first step in data analysis as it helps to detect mistakes, check assumptions, determine relationships among the independent variables, and assess the direction and magnitude of relationships between the dependent and independent variablea (Seltman, 2018).
This assignment explores the appropriate tidyverse methods, as well as the ExPanDaR package for EDA.
Instead of the conventional way of installing the required packages right away, the code chunk below checks whether the required packages have already been installed and subsequently install those that are not before loading the packages.
packages = c('tidyverse', 'ExPanDaR', 'kableExtra', 'ggstatsplot', 'corrplot', 'DT')
for(p in packages){ 'exclude FOR loop for Shiny>'
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The data sets for this Visual Analytics Project were sourced from Eurostat Climate Change Database. Through the preliminary literature review conducted, the data sets used were narrowed down, wrangled, and subsequently merged into a single panel or cross-sectional time-series data where behavior of entities, in this case countries, are observed and recorded across time.
The list of literature reviewed and details of the data preparation could be found here.
GHG <- read_csv("data/GHGproj_data.csv")
-- Column specification ----------------------------------------------
cols(
Country = col_character(),
Year = col_double(),
GHG_emissions = col_double(),
CO2_emissions = col_double(),
CH4_emissions = col_double(),
NO2_emissions = col_double(),
GDP = col_double(),
Final_EI = col_double(),
Fuel_mix = col_double(),
Carbon_intensity = col_double(),
Renewables_share = col_double(),
Envt_taxes = col_double(),
Liquid_biofuels = col_double(),
Solar_thermal = col_double(),
Heat_pumps = col_double()
)
str(GHG)
spec_tbl_df[,15] [324 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Country : chr [1:324] "European Union - 27 countries (from 2020)" "European Union - 27 countries (from 2020)" "European Union - 27 countries (from 2020)" "European Union - 27 countries (from 2020)" ...
$ Year : num [1:324] 2010 2011 2012 2013 2014 ...
$ GHG_emissions : num [1:324] 4188244 4075283 3996645 3912623 3776464 ...
$ CO2_emissions : num [1:324] 3443914 3338307 3262594 3181485 3046141 ...
$ CH4_emissions : num [1:324] 427118 419634 416827 411161 406015 ...
$ NO2_emissions : num [1:324] 219323 216187 213789 214272 216858 ...
$ GDP : num [1:324] 10977298 11321811 11388518 11517119 11781640 ...
$ Final_EI : num [1:324] 92.2 85.9 84.3 83.6 77.4 ...
$ Fuel_mix : num [1:324] 2.8 2.88 2.78 2.67 2.62 2.57 2.54 2.53 2.4 NA ...
$ Carbon_intensity: num [1:324] 91.9 91.5 90.9 89.5 88.3 88.7 88 86.7 85.2 NA ...
$ Renewables_share: num [1:324] 14.4 14.6 16 16.7 17.5 ...
$ Envt_taxes : num [1:324] 259023 271901 278204 284121 291030 ...
$ Liquid_biofuels : num [1:324] 29689 28711 28422 29751 29152 ...
$ Solar_thermal : num [1:324] 34488 37949 40587 43246 45425 ...
$ Heat_pumps : num [1:324] 31200 43941 48603 53298 79990 ...
- attr(*, "spec")=
.. cols(
.. Country = col_character(),
.. Year = col_double(),
.. GHG_emissions = col_double(),
.. CO2_emissions = col_double(),
.. CH4_emissions = col_double(),
.. NO2_emissions = col_double(),
.. GDP = col_double(),
.. Final_EI = col_double(),
.. Fuel_mix = col_double(),
.. Carbon_intensity = col_double(),
.. Renewables_share = col_double(),
.. Envt_taxes = col_double(),
.. Liquid_biofuels = col_double(),
.. Solar_thermal = col_double(),
.. Heat_pumps = col_double()
.. )
From the inspection, it could be observed that there are several countries that have very long name. To find out the unique values in Country:
unique(GHG$Country)
[1] "European Union - 27 countries (from 2020)"
[2] "European Union - 28 countries (2013-2020) and Iceland under the Kyoto Protocol"
[3] "European Union - 28 countries (2013-2020)"
[4] "Belgium"
[5] "Bulgaria"
[6] "Czechia"
[7] "Denmark"
[8] "Germany (until 1990 former territory of the FRG)"
[9] "Estonia"
[10] "Ireland"
[11] "Greece"
[12] "Spain"
[13] "France"
[14] "Croatia"
[15] "Italy"
[16] "Cyprus"
[17] "Latvia"
[18] "Lithuania"
[19] "Luxembourg"
[20] "Hungary"
[21] "Malta"
[22] "Netherlands"
[23] "Austria"
[24] "Poland"
[25] "Portugal"
[26] "Romania"
[27] "Slovenia"
[28] "Slovakia"
[29] "Finland"
[30] "Sweden"
[31] "Iceland"
[32] "Liechtenstein"
[33] "Norway"
[34] "Switzerland"
[35] "United Kingdom"
[36] "Turkey"
From the above, it is noticed that the variable Country consist of groups of countries that are not needed in the analysis. Also, Germany has a rather long name.
G <- GHG %>%
filter(!Country %in% c("European Union - 27 countries (from 2020)",
"European Union - 28 countries (2013-2020)",
"European Union - 28 countries (2013-2020) and Iceland under the Kyoto Protocol")) %>%
mutate(Country=recode(Country, "Germany (until 1990 former territory of the FRG)" = "Germany"))
In the propose Shiny application, the data table will be made available to users who not only want to have any overview of the data structure, but also the actual data set.
datatable(
data = G,
extensions = 'FixedHeader',
options =
list(
scrollX = TRUE,
scrollY = 500,
scrollCollapse = TRUE,
paging = FALSE
)
)
t <- prepare_descriptive_table(G)
t$kable_ret %>%
kable_styling("condensed", full_width = F, position = "center")
N | Mean | Std. dev. | Min. | 25 % | Median | 75 % | Max. | |
---|---|---|---|---|---|---|---|---|
Year | 297 | 2,014.000 | 2.586 | 2,010.000 | 2,012.000 | 2,014.000 | 2,016.000 | 2,018.00 |
GHG_emissions | 297 | 152,056.937 | 207,515.414 | 181.080 | 20,923.490 | 59,569.730 | 135,117.650 | 942,338.00 |
CO2_emissions | 297 | 123,775.451 | 175,144.754 | 143.750 | 17,684.980 | 46,231.650 | 114,561.030 | 832,669.74 |
CH4_emissions | 297 | 16,075.657 | 19,472.298 | 17.990 | 3,332.560 | 7,238.570 | 18,438.370 | 63,767.39 |
NO2_emissions | 297 | 8,369.881 | 10,847.787 | 8.970 | 1,877.410 | 4,463.950 | 7,852.580 | 41,871.46 |
GDP | 294 | 487,470.776 | 725,371.404 | 4,812.400 | 43,223.100 | 188,265.850 | 469,861.475 | 3,356,410.00 |
Final_EI | 279 | 91.874 | 36.440 | 42.970 | 73.575 | 85.580 | 96.825 | 274.92 |
Fuel_mix | 279 | 2.757 | 3.966 | 0.000 | 0.825 | 1.400 | 3.095 | 20.17 |
Carbon_intensity | 279 | 88.397 | 11.960 | 44.400 | 83.600 | 88.500 | 94.850 | 124.90 |
Renewables_share | 270 | 22.261 | 17.012 | 0.979 | 10.600 | 16.558 | 28.668 | 76.69 |
Envt_taxes | 297 | 11,718.864 | 16,783.704 | 33.860 | 1,125.280 | 4,977.240 | 10,712.560 | 63,868.08 |
Liquid_biofuels | 278 | 996.403 | 1,618.882 | 0.000 | 25.000 | 295.500 | 1,144.306 | 8,907.00 |
Solar_thermal | 279 | 2,076.456 | 4,416.113 | 0.000 | 71.924 | 401.498 | 1,458.500 | 20,200.00 |
Heat_pumps | 279 | 4,273.957 | 13,332.143 | 0.000 | 0.000 | 0.000 | 1,371.537 | 126,842.03 |
To find out the missing values in the data:
missing.values <- G %>%
gather(key = "key", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
summarise(num.missing = n()) %>%
filter(is.missing==T) %>%
select(-is.missing) %>%
arrange(desc(num.missing))
missing.values %>% kable()
key | num.missing |
---|---|
Renewables_share | 27 |
Liquid_biofuels | 19 |
Carbon_intensity | 18 |
Final_EI | 18 |
Fuel_mix | 18 |
Heat_pumps | 18 |
Solar_thermal | 18 |
GDP | 3 |
Visualising the Missing Values:
prepare_missing_values_graph(G, ts_id = "Country")
This section explore different visualisation of EDA and application of interactivity.
E <- select(G, -c(Country,Year))
F <- E %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap( ~key, ncol=4, scales="free") +
geom_histogram()
F
H <- select(G, GHG_emissions)
ggplot(H, aes(GHG_emissions))+
geom_histogram()
Since the independent variables are of continuous data type, to review the trend overtime, the ggplot function of ggplot2 package is used along with geom_line to plot the line graph.
t <- ggplot(G, aes(x=Year, y=GHG_emissions)) +
geom_line()+
facet_wrap(~ Country)
t
Having all countries included in a single plot.
t <- ggplot(G, aes(x=Year, y=GHG_emissions, color=Country)) +
geom_line()
t
From the graph above, even with colours assigned to each country, it is hard for users to differentiate the countries. Also, at value below 250000, there are many countries overlapping. The way to improve this visualisation is to transform the y-axis to log10 scale.
u <- t +
scale_y_log10()
u
While the overlapping of lines has slightly improved with the log10 scale, users will still have difficulty matching the color to legend. Since the legend is not useful, it could be removed using the theme() function of ggplot2, which enables the customisation of non-data components of plots.
v <- u +
geom_text(data=subset(G, Year==2018), aes(label=Country, color=Country)) +
theme(legend.position = "none")
v
In the plot above, there are overlapping between the countries texts, and also the line. To fix this, the hjust parameter of the aesthetics setting of geom_text() could be adjusted. check_overlap could also be turned on. The coord_cartesian() allows clipping to be turned off so that text could extend beyond the plot and not be truncated.
v <- u +
geom_text(data=subset(G, Year==2018), aes(label=Country, color=Country, hjust = -0.1), check_overlap=TRUE) +
coord_cartesian(clip = 'off') +
theme(legend.position = "none")
v
The plot above though better, if far from perfect. Due to the turning off of text over-lapping, many of the labels have been removed. To improve on the aesthetics and users’ experience:
Finally, a grouped time trend graph will be shown to the users.
In this session, the ggcorrmat() function of the ggstatsplot package, multiple functions of corrplot package, and the prepare_correlation_graph() function of ExPanDaR package are explored to find the most appropriate visualisation.
From the visualisation of the missing values, the situation of the missing values in the dataset is not too bad, and it is very much limited to only Switzerland and Liechtenstein. However, coorplote supports only data without missing values. Hence, to create a data set to be used for the corrplot correlation plots, the missing values are removed.
M <- select(G, -c(Country, Year))
N <- M[complete.cases(M), ]
O <- cor(N)
corrplot(O, method = "ellipse")
O <- cor(N)
corrplot(O, method = "number")
O <- cor(N)
corrplot.mixed(O)
ggcorrmat(data = M)
$df_corr
GHG_emissions CO2_emissions CH4_emissions
GHG_emissions 1.00000000 0.99784541 0.930140816
CO2_emissions 0.99413874 1.00000000 0.905829005
CH4_emissions 0.97144524 0.95078334 1.000000000
NO2_emissions 0.94710402 0.92372630 0.959508233
GDP 0.86271907 0.87141830 0.832277410
Final_EI -0.37694381 -0.34604142 -0.467943876
Fuel_mix 0.18760737 0.17048453 0.228648231
Carbon_intensity 0.02357443 0.02642678 0.001257642
Renewables_share -0.25004005 -0.23094422 -0.261702456
Envt_taxes 0.87809209 0.88611949 0.852467040
Liquid_biofuels 0.77081143 0.76697476 0.706991817
Solar_thermal 0.74724643 0.74321289 0.727001177
Heat_pumps 0.42831149 0.44521129 0.380857932
NO2_emissions GDP Final_EI Fuel_mix
GHG_emissions 0.91822300 0.90916053 -0.2662011 0.273841014
CO2_emissions 0.89605922 0.90638984 -0.2541530 0.258173601
CH4_emissions 0.93835851 0.82096507 -0.3192199 0.394317966
NO2_emissions 1.00000000 0.82762517 -0.2778211 0.351788621
GDP 0.84440412 1.00000000 -0.2653626 -0.044843685
Final_EI -0.41022136 -0.43546422 1.0000000 -0.017495209
Fuel_mix 0.23057069 0.03065245 0.1367399 1.000000000
Carbon_intensity -0.02506221 -0.10495727 0.0654970 0.219588269
Renewables_share -0.22918867 -0.13486515 0.2779467 -0.019695971
Envt_taxes 0.86451982 0.98096907 -0.4442150 0.002396859
Liquid_biofuels 0.69792002 0.72783664 -0.2058724 -0.089409346
Solar_thermal 0.67147366 0.71421697 -0.4921314 -0.065272694
Heat_pumps 0.34748022 0.54374122 -0.2023139 0.048261441
Carbon_intensity Renewables_share Envt_taxes
GHG_emissions 0.10939055 -0.29550085 0.89278054
CO2_emissions 0.11426435 -0.28842393 0.88690010
CH4_emissions 0.09580844 -0.32566188 0.83248701
NO2_emissions 0.07032943 -0.27349823 0.80918834
GDP 0.02835471 -0.21459569 0.96482276
Final_EI -0.42439312 0.57321622 -0.28184944
Fuel_mix 0.13743471 -0.11295737 -0.01662560
Carbon_intensity 1.00000000 -0.46972038 0.03546597
Renewables_share -0.30397325 1.00000000 -0.22539529
Envt_taxes -0.10770795 -0.13058739 1.00000000
Liquid_biofuels -0.13226131 -0.19134578 0.71586859
Solar_thermal -0.04256105 -0.28566499 0.74522454
Heat_pumps -0.18348267 0.02471386 0.48184550
Liquid_biofuels Solar_thermal Heat_pumps
GHG_emissions 0.80888395 0.71707654 0.37944853
CO2_emissions 0.81528203 0.72483913 0.35691727
CH4_emissions 0.67251692 0.60414885 0.43118429
NO2_emissions 0.71922124 0.71680647 0.42243572
GDP 0.81285907 0.54850567 0.53355859
Final_EI -0.19482993 -0.21674724 -0.16845508
Fuel_mix -0.04718531 0.33365143 -0.12384495
Carbon_intensity 0.02811054 0.09999862 -0.08765757
Renewables_share -0.23486629 -0.13953268 -0.04609222
Envt_taxes 0.72717110 0.52967393 0.55750720
Liquid_biofuels 1.00000000 0.51491939 0.34638084
Solar_thermal 0.65844575 1.00000000 0.09847440
Heat_pumps 0.44471160 0.36127767 1.00000000
$df_prob
GHG_emissions CO2_emissions CH4_emissions
GHG_emissions 0.000000e+00 0.000000e+00 2.214261e-130
CO2_emissions 4.589804e-287 0.000000e+00 4.725595e-112
CH4_emissions 2.357611e-186 3.808932e-152 0.000000e+00
NO2_emissions 1.201835e-147 5.784735e-125 2.305553e-164
GDP 1.841234e-88 2.542659e-92 8.610783e-77
Final_EI 7.533660e-11 2.875435e-09 1.377506e-16
Fuel_mix 1.646245e-03 4.292795e-03 1.165846e-04
Carbon_intensity 6.950160e-01 6.602926e-01 9.833155e-01
Renewables_share 3.242710e-05 1.286207e-04 1.322064e-05
Envt_taxes 1.930319e-96 1.554219e-100 4.362829e-85
Liquid_biofuels 5.591005e-56 4.117373e-55 2.029889e-43
Solar_thermal 4.418228e-51 2.889804e-50 3.896660e-47
Heat_pumps 7.102909e-14 5.467934e-15 4.617453e-11
NO2_emissions GDP Final_EI
GHG_emissions 1.107023e-120 4.178441e-113 6.541791e-06
CO2_emissions 4.710304e-106 2.724632e-111 1.732054e-05
CH4_emissions 3.952887e-138 4.861730e-73 4.998240e-08
NO2_emissions 0.000000e+00 3.243922e-75 2.443410e-06
GDP 3.876546e-81 0.000000e+00 7.011285e-06
Final_EI 9.476106e-13 2.441471e-14 0.000000e+00
Fuel_mix 1.017025e-04 6.101790e-01 2.234259e-02
Carbon_intensity 6.768192e-01 8.010140e-02 2.755920e-01
Renewables_share 1.451789e-04 2.669751e-02 3.524004e-06
Envt_taxes 3.887780e-90 6.644798e-210 6.385613e-15
Liquid_biofuels 6.623125e-42 3.976896e-47 5.519562e-04
Solar_thermal 6.246854e-38 8.019620e-45 2.005984e-18
Heat_pumps 2.447790e-09 7.237810e-23 6.754436e-04
Fuel_mix Carbon_intensity Renewables_share
GHG_emissions 3.441234e-06 6.808287e-02 7.657793e-07
CO2_emissions 1.258195e-05 5.661516e-02 1.434689e-06
CH4_emissions 8.154235e-12 1.103087e-01 4.341741e-08
NO2_emissions 1.504001e-09 2.416352e-01 5.104848e-06
GDP 4.556373e-01 6.372227e-01 3.833793e-04
Final_EI 7.710988e-01 1.261549e-13 5.538967e-25
Fuel_mix 0.000000e+00 2.166373e-02 6.382450e-02
Carbon_intensity 2.185072e-04 0.000000e+00 3.185907e-16
Renewables_share 7.473258e-01 3.531281e-07 0.000000e+00
Envt_taxes 9.682081e-01 7.245808e-02 3.195221e-02
Liquid_biofuels 1.370110e-01 2.745437e-02 1.616725e-03
Solar_thermal 2.772432e-01 4.789210e-01 1.824181e-06
Heat_pumps 4.219898e-01 2.089752e-03 6.860157e-01
Envt_taxes Liquid_biofuels Solar_thermal
GHG_emissions 3.574286e-104 1.300381e-65 2.499206e-45
CO2_emissions 5.984541e-101 1.912263e-67 9.797314e-47
CH4_emissions 1.234026e-77 5.925337e-38 3.811122e-29
NO2_emissions 4.225961e-70 1.488791e-45 2.791854e-45
GDP 1.835520e-171 9.634015e-67 2.569499e-23
Final_EI 1.718263e-06 1.093975e-03 2.647071e-04
Fuel_mix 7.821836e-01 4.332584e-01 1.112717e-08
Carbon_intensity 5.552413e-01 6.407311e-01 9.551676e-02
Renewables_share 1.880067e-04 1.007874e-04 2.182794e-02
Envt_taxes 0.000000e+00 5.287208e-47 1.401543e-21
Liquid_biofuels 5.878084e-45 0.000000e+00 3.171027e-20
Solar_thermal 1.137672e-50 6.262780e-36 0.000000e+00
Heat_pumps 1.262439e-17 6.610859e-15 5.011830e-10
Heat_pumps
GHG_emissions 5.511675e-11
CO2_emissions 8.341815e-10
CH4_emissions 4.639549e-14
NO2_emissions 1.676178e-13
GDP 6.268999e-22
Final_EI 4.782774e-03
Fuel_mix 3.870624e-02
Carbon_intensity 1.441776e-01
Renewables_share 4.506933e-01
Envt_taxes 3.469327e-24
Liquid_biofuels 2.956242e-09
Solar_thermal 1.007032e-01
Heat_pumps 0.000000e+00
$df_n
GHG_emissions CO2_emissions CH4_emissions
GHG_emissions 297 297 297
CO2_emissions 297 297 297
CH4_emissions 297 297 297
NO2_emissions 297 297 297
GDP 294 294 294
Final_EI 279 279 279
Fuel_mix 279 279 279
Carbon_intensity 279 279 279
Renewables_share 270 270 270
Envt_taxes 297 297 297
Liquid_biofuels 278 278 278
Solar_thermal 279 279 279
Heat_pumps 279 279 279
NO2_emissions GDP Final_EI Fuel_mix Carbon_intensity
GHG_emissions 297 294 279 279 279
CO2_emissions 297 294 279 279 279
CH4_emissions 297 294 279 279 279
NO2_emissions 297 294 279 279 279
GDP 294 294 279 279 279
Final_EI 279 279 279 279 279
Fuel_mix 279 279 279 279 279
Carbon_intensity 279 279 279 279 279
Renewables_share 270 270 270 270 270
Envt_taxes 297 294 279 279 279
Liquid_biofuels 278 278 278 278 278
Solar_thermal 279 279 279 279 279
Heat_pumps 279 279 279 279 279
Renewables_share Envt_taxes Liquid_biofuels
GHG_emissions 270 297 278
CO2_emissions 270 297 278
CH4_emissions 270 297 278
NO2_emissions 270 297 278
GDP 270 294 278
Final_EI 270 279 278
Fuel_mix 270 279 278
Carbon_intensity 270 279 278
Renewables_share 270 270 269
Envt_taxes 270 297 278
Liquid_biofuels 269 278 278
Solar_thermal 270 279 278
Heat_pumps 270 279 278
Solar_thermal Heat_pumps
GHG_emissions 279 279
CO2_emissions 279 279
CH4_emissions 279 279
NO2_emissions 279 279
GDP 279 279
Final_EI 279 279
Fuel_mix 279 279
Carbon_intensity 279 279
Renewables_share 270 270
Envt_taxes 279 279
Liquid_biofuels 278 278
Solar_thermal 279 279
Heat_pumps 279 279
Comparing the correlation plots above, it is clear that ggcormat() of ggstateplot package and prepare_correlation_graph() function of ExPanDaR package are more robust as they are able to generate correlation plot of dataset with missing values, which is suitable for the panel data project as the dataset has a small number of missing values. A downside of the plot from prepare_correlation_graph() is that it returns the dataframe below the plot that cannot be disabled. Therefore, the ggcormat() plot is preferred as it has the correlation coefficients embedded within the visualisation, and those variables pairs that are statistically insignificant at ρ < 0.05 are also crossed out
Using the prepare_scatter_plot() function of the ExPanDaR package, bivariate analysis between variables could be shown so as to allow users to determine the empirical relationship between variables.
prepare_scatter_plot(G, x="GHG_emissions", y="Final_EI", color="Country", loess = 1)
Each point below represents a unique tab within the EDA module of the Shiny application
Distill is a publication format for scientific and technical writing, native to the web.
Learn more about using Distill at https://rstudio.github.io/distill.