DataViz in R | 05. Bar Chart - Playing with Facet
Published on May 31, 2023
ggplot

Let's play with facet and theming

DataViz in R | 05. Bar Chart - Playing with Facet

Target result

http://www.datavisualisation-r.com/pdf/barcharts_multiple_all_grouped.pdf

Long time no see.Today’s target seems weird but I would like to challenge myself to become proficient in ggplot.

#Load extra font first to avoid geom_text font error
library(extrafont)
#Load neccessary lib
library(ggplot2)
library(dplyr)
#Setting width and height. This graph has very long height

options(repr.plot.width=10, repr.plot.height=12)
theme_set(theme_minimal(base_family = "Lato Light"))

The data originate from a survey on the international comparison of school performance study by the Programme for International Student Assessment (PISA) from 2009. This survey, conducted on behest of the Organisation for Economic Co- operation and Development (OECD), recorded the abilities of 15-year-old students in the areas of reading competency, mathematical competency, and scientific competency in the OECD states and in 33 OECD partner countries. Since 2000, the survey has been conducted every 3 years.

Searching through the official website of OECD does not return any downloadable dataset. Fortunately, someone made it as a package since 11 years ago: https://github.com/jbryer/pisa. We take the pisa.student.rda dataset and start the work.

Note: We can dive into PISA 2018 data at here: https://www.oecd.org/pisa/data/2018database/

We tried to use readRDS function but it returns unknown input format. Thus we use the traditional load function. However, unlike readRDS which convert directly to a dataframe, load() does not return the object.

It produces a side effect where the objects saved in the file are loaded into the environment. Assigning the load to a variable and we have the data loaded pisa.student and pisa.catalog.student.

#Load data
load("./myData/pisa-master/data/pisa.student.rda")
#Calling our dataframe
head(pisa.student)
A data.frame: 6 × 305
CNTCOUNTRYOECDSUBNATIOSCHOOLIDStIDStdST01Q01ST02Q01ST03Q02ST03Q03PV4READPV5READPV1SCIEPV2SCIEPV3SCIEPV4SCIEPV5SCIEW_FSTUWTTESTLANGCNTFAC
fctfctfctfctfctfctfctfctfctfctdbldbldbldbldbldbldbldblfctdbl
1AlbaniaAlbaniaNon-OECDAlbania11NANA3NA291.86242.16359.7280.44323.33383.01364.365.4037ALBANIAN0.0293
2AlbaniaAlbaniaNon-OECDAlbania12NANA1NA216.02320.24277.08227.66261.23291.07341.425.4037ALBANIAN0.0293
3AlbaniaAlbaniaNon-OECDAlbania13NANA6NA399.6418.04420.68461.71414.16442.13421.625.4037ALBANIAN0.0293
4AlbaniaAlbaniaNon-OECDAlbania14NANA9NA446.51460.81454.81448.28420.31439.89445.495.7039ALBANIAN0.0293
5AlbaniaAlbaniaNon-OECDAlbania15NANA7NA372.95395.99428.7411.92394.2413.78415.655.7039ALBANIAN0.0293
6AlbaniaAlbaniaNon-OECDAlbania16NANA9NA506.63505.83549.37585.73585.73535.38509.275.4037ALBANIAN0.0293
dim(pisa.student)

475460 · 305

unique(pisa.student$COUNTRY)
Output
Albania · Argentina · Australia · Austria · Azerbaijan · Belgium · Bulgaria · Brazil · Canada · Switzerland · Chile · Colombia · Czech · Republic · Germany · Denmark · Spain · Estonia · Finland · France · United · Kingdom · Greece · Hong Kong-China · Croatia · Hungary · Indonesia · Ireland · Iceland · Israel · Italy · Jordan · Japan · Kazakhstan · Kyrgyzstan · Korea · Liechtenstein · Lithuania · Luxembourg · Latvia · Macao-China · Mexico · Montenegro · Netherlands · Norway · New Zealand · Panama · Peru · Poland · Portugal · Dubai (UAE) · Qatar · Shanghai-China · Romania · Russian · Federation · Singapore · Serbia · Slovak · Republic · Slovenia · Sweden · Chinese · Taipei · Thailand · Trinidad and Tobago · Tunisia · Turkey · Uruguay · United States
Levels:
‘Albania ’· ‘Azerbaijan ’· ‘Argentina ’· ‘Australia ’· ‘Austria ’· ‘Belgium ’· ‘Brazil ’· ‘Bulgaria ’· ‘Canada ’· ‘Chile ’· ‘Shanghai-China ’· ‘Chinese Taipei ’· ‘Colombia ’· ‘Croatia ’· ‘Czech Republic ’· ‘Denmark ’· ‘Estonia ’· ‘Finland ’· ‘France ’· ‘Germany ’· ‘Greece ’· ‘Hong Kong-China ’· ‘Hungary ’· ‘Iceland ’· ‘Indonesia ’· ‘Ireland ’· ‘Israel ’· ‘Italy ’· ‘Japan ’· ‘Kazakhstan ’· ‘Jordan ’· ‘Korea ’· ‘Kyrgyzstan ’· ‘Latvia ’· ‘Liechtenstein ’· ‘Lithuania ’· ‘Luxembourg ’· ‘Macao-China ’· ‘Mexico ’· ‘Montenegro ’· ‘Netherlands ’· ‘New Zealand ’· ‘Norway ’· ‘Panama ’· ‘Peru ’· ‘Poland ’· ‘Portugal ’· ‘Qatar ’· ‘Romania ’· ‘Russian Federation ’· ‘Serbia ’· ‘Singapore ’· ‘Slovak Republic ’· ‘Slovenia ’· ‘Spain ’· ‘Sweden ’· ‘Switzerland ’· ‘Thailand ’· ‘Trinidad and Tobago ’· ‘Dubai (UAE) ’· ‘Tunisia ’· ‘Turkey ’· ‘United Kingdom ’· ‘United States ’· ‘Uruguay’

Because our graph only represents the data for the USA, Canada and Mexico (USMCA), we filtered it and compared the total number of respondents in the book (66,690). In addition, there are more than 300 questions but our graph extracted only the items of question group 28, i.e variables whose name starts with “ST24Q” (or 11 questions based on the targeted result image).

pisa <- pisa.student %>% 
    dplyr::filter(COUNTRY %in% c('United States', 'Canada', 'Mexico')) %>%
    select(2,5, starts_with("ST24Q"))
dim(pisa)
head(pisa)

66690 · 13

A data.frame: 6 × 13
COUNTRYSCHOOLIDST24Q01ST24Q02ST24Q03ST24Q04ST24Q05ST24Q06ST24Q07ST24Q08ST24Q09ST24Q10ST24Q11
fctfctfctfctfctfctfctfctfctfctfctfctfct
1Canada1DisagreeStrongly agreeStrongly agreeStrongly disagreeStrongly agreeStrongly disagreeAgreeDisagreeStrongly disagreeAgreeAgree
2Canada1AgreeStrongly disagreeStrongly disagreeStrongly agreeStrongly disagreeAgreeStrongly disagreeAgreeStrongly agreeStrongly disagreeStrongly disagree
3Canada1Strongly agreeStrongly disagreeStrongly disagreeAgreeStrongly disagreeStrongly agreeStrongly disagreeAgreeDisagreeDisagreeStrongly disagree
4Canada1DisagreeDisagreeAgreeStrongly disagreeDisagreeDisagreeAgreeStrongly disagreeStrongly disagreeAgreeAgree
5Canada1Strongly disagreeDisagreeStrongly disagreeDisagreeStrongly disagreeDisagreeDisagreeAgreeAgreeAgreeStrongly disagree
6Canada1AgreeStrongly disagreeStrongly disagreeAgreeStrongly disagreeAgreeStrongly disagreeAgreeStrongly agreeStrongly disagreeStrongly disagree
load("./myData/pisa-master/data/pisa.catalog.student.rda")
#Get the desc of questions
#Check type of `pisa.catalog.student`
dim(pisa.catalog.student)
is.vector(pisa.catalog.student)

NULL TRUE

#Using grep as regex searching with ^
#unname return the value of vector only
pisa.catalog.student[grep("^ST24Q", names(pisa.catalog.student))]
Output
ST24Q01:
‘Read Attitude - Only if I have to’
ST24Q02:
‘Read Attitude - Favourite hobbies’
ST24Q03:
‘Read Attitude - Talk about books’
ST24Q04:
‘Read Attitude - Hard to finish’
ST24Q05:
‘Read Attitude - Happy as present’
ST24Q06:
‘Read Attitude - Waste of time’
ST24Q07:
‘Read Attitude - Enjoy library’
ST24Q08:
‘Read Attitude - Need information’
ST24Q09:
‘Read Attitude - Cannot sit still’
ST24Q10:
‘Read Attitude - Express opinions’
ST24Q11:
‘Read Attitude - Exchange
#But the question description for our plot need to be written in a full insightful sentence
#So the above code is for the purpose of checking only

question_desc <- c("I read only if I have to.",
                        "Reading is one of my favorite hobbies.",
                        "I like talking about books with other people.",
                        "I find it hard to finish books.",
                        "I feel happy if I receive a book as a present.",
                        "For me, reading is a waste of time.",
                        "I enjoy going to a bookstore or a library.",
                        "I read only to get information that I need.",
                        "I cannot sit still and read for more than a few minutes.",
                        "I like to express my opinions about books I have read.",
                        "I like to exchange books with my friends.")
names(question_desc) <- paste0("ST24Q", sprintf("%02d", 1:11))

question_desc
Output
ST24Q01
‘I read only if I have to.’
ST24Q02
‘Reading is one of my favorite hobbies.’
ST24Q03
‘I like talking about books with other people.’
ST24Q04
‘I find it hard to finish books.’
ST24Q05
‘I feel happy if I receive a book as a present.’
ST24Q06
‘For me, reading is a waste of time.’
ST24Q07
‘I enjoy going to a bookstore or a library.’
ST24Q08
‘I read only to get information that I need.’
ST24Q09
‘I cannot sit still and read for more than a few minutes.’
ST24Q10
‘I like to express my opinions about books I have read.’
ST24Q11
‘I like to exchange books with my friends.’

Now we need to make our data readable by ggplot, or switch it into a longer form using pivot_longer()

library(tidyr)
pisa_longer <- pisa %>%
    pivot_longer(cols=c(-1,-2), names_to = "Question", values_to = "Answer") %>%
    #We dropped SCHOOLID here
    group_by(COUNTRY, Question, Answer, .add=T) %>%
    #the default .add=FALSE in group_by() will override existing groups.
    summarize(Count = n(), .groups = 'drop') %>%
    #We dropped NA Answer here
    filter(Answer != 'NA')

head(pisa_longer, 10)
A tibble: 10 × 4
COUNTRYQuestionAnswerCount
fctchrfctint
CanadaST24Q01Strongly disagree5807
CanadaST24Q01Disagree7938
CanadaST24Q01Agree5623
CanadaST24Q01Strongly agree3229
CanadaST24Q02Strongly disagree6052
CanadaST24Q02Disagree7953
CanadaST24Q02Agree5568
CanadaST24Q02Strongly agree3028
CanadaST24Q03Strongly disagree5697
CanadaST24Q03Disagree7154
#Create percentage column
pisa_617 <- pisa_longer %>%
    group_by(COUNTRY, Question, .add=T) %>%
    mutate(Perc = Count / sum(Count))

head(pisa_617, 10)
A grouped_df: 10 × 5
COUNTRYQuestionAnswerCountPerc
fctchrfctintdbl
CanadaST24Q01Strongly disagree58070.256981
CanadaST24Q01Disagree79380.3512856
CanadaST24Q01Agree56230.2488383
CanadaST24Q01Strongly agree32290.1428951
CanadaST24Q02Strongly disagree60520.2677758
CanadaST24Q02Disagree79530.3518871
CanadaST24Q02Agree55680.2463608
CanadaST24Q02Strongly agree30280.1339764
CanadaST24Q03Strongly disagree56970.2522917
CanadaST24Q03Disagree71540.316815
#The main idea is using the facet element

pisa_617 %>%
ggplot(aes(x=Perc, y=COUNTRY)) + 
    geom_col(mapping=aes(fill=Answer), position = position_stack(reverse = TRUE)) +
    facet_grid(Question ~ .)

png

#Create custom color vector based on origin graph(using eye-dropper)
color_pisa <- c("Strongly agree" = "#8b8878", 
               "Agree" = "#fff8dc", 
               "Disagree" = "#ffb6c1", 
               "Strongly disagree" = "#8b475d")
pisa_617 %>%
ggplot(aes(x=Perc, y=COUNTRY)) + 
    geom_col(mapping=aes(fill=Answer), position = position_stack(reverse = TRUE)) +
    scale_fill_manual(values=color_pisa) +
    scale_y_discrete(labels=function(x) paste0(x, "-")) +
    facet_grid(Question ~ .)

png

We try to edit the y - label by creating a new column.

pisa_test <- pisa_617 %>%
    group_by(COUNTRY, Question, .add=T) %>%
    mutate(neg_perc = sprintf("%.0f%%",sum(Count[Answer %in% c("Strongly disagree", "Disagree")]) / sum(Count) * 100)
          )
head(pisa_test,10)
A grouped_df: 10 × 6
COUNTRYQuestionAnswerCountPercneg_perc
fctchrfctintdblchr
CanadaST24Q01Strongly disagree58070.25698161%
CanadaST24Q01Disagree79380.351285661%
CanadaST24Q01Agree56230.248838361%
CanadaST24Q01Strongly agree32290.142895161%
CanadaST24Q02Strongly disagree60520.267775862%
CanadaST24Q02Disagree79530.351887162%
CanadaST24Q02Agree55680.246360862%
CanadaST24Q02Strongly agree30280.133976462%
CanadaST24Q03Strongly disagree56970.252291757%
CanadaST24Q03Disagree71540.31681557%
pisa_test %>%
ggplot(aes(x=Perc, y=COUNTRY)) + 
    geom_col(mapping=aes(fill=Answer), position = position_stack(reverse = TRUE)) +
    scale_fill_manual(values=color_pisa) +
    geom_text(data=pisa_test, mapping=aes(x=-0.05, y=COUNTRY, label=neg_perc), size=3, hjust=1) +
    facet_grid(Question ~ .)

png

The problem arises, because the neg_perc was drawn 4 times for each country in each question. It makes the text seems bolder and aliasing. The proposed solution I found from ChatGPT is to create another column contains the desired labels (eg. Canada - 61%), but it was plotted 4 times repeatedly again.

In general, I think those are not a elegant solutions and waste of memory. So I will try to use the setting scale_y_discrete with custom function as below.

my_function <- function(breaks) {
    result <- pisa_test %>% 
    group_by(COUNTRY, Question) %>%
    summarise(Percentage_Disagree = round(sum(Count[Answer %in% c("Strongly disagree", "Disagree")]) / sum(Count) * 100,0),
              .groups = 'drop')
    labels <- paste0(breaks, " - ", result$Percentage_Disagree,"%")
  return(labels)
}
my_function(c('Canada', 'Mexico'))
'Canada - 61%''Mexico - 62%''Canada - 57%''Mexico - 72%''Canada - 50%''Mexico - 76%''Canada - 48%''Mexico - 61%''Canada - 76%''Mexico - 47%''Canada - 60%''Mexico - 59%''Canada - 52%''Mexico - 53%''Canada - 61%''Mexico - 43%''Canada - 88%''Mexico - 54%''Canada - 44%''Mexico - 77%''Canada - 37%''Mexico - 52%''Canada - 50%''Mexico - 70%''Canada - 59%''Mexico - 69%''Canada - 62%''Mexico - 74%''Canada - 47%''Mexico - 53%''Canada - 71%''Mexico - 49%''Canada - 66%'

However, the problem with breaks parameter in scale_y_discrete(labels = my_function) is:

  • For each facet Question, breaks will return a vector naming the label y - COUNTRY. And that’s all.
  • So for all Question, we have the same breaks vector, which is c('United States', 'Mexico', 'Canada')
  • Regardless of what we try in the custom my_function, our function will never know what exactly the Question are, given the useless of Mr. breaks
  • As a result, we can not concatenate the specific percentage of each facet as in the targeted result.

So, I think we should move to solution using geom_text with axis.text.y = element_blank() in theme to turn of the labels of y-axis.

pisa_label <- pisa_617 %>%
    group_by(COUNTRY, Question, .add=T) %>%
    summarize(neg_perc = sum(Count[Answer %in% c("Strongly disagree", "Disagree")]) / sum(Count) * 100,
             .groups = "drop") %>%
    mutate(right_labels = sprintf("%.0f%%", 100 - neg_perc),
           left_labels = paste0(COUNTRY, " - ", sprintf("%.0f%%", neg_perc)))
head(pisa_label)
A tibble: 6 × 5
COUNTRYQuestionneg_percright_labelsleft_labels
fctchrdblchrchr
CanadaST24Q0160.8266639%Canada - 61%
CanadaST24Q0261.9662838%Canada - 62%
CanadaST24Q0356.9106843%Canada - 57%
CanadaST24Q0471.7719428%Canada - 72%
CanadaST24Q0550.1396150%Canada - 50%
CanadaST24Q0675.7233424%Canada - 76%
p1 <-
pisa_617 %>%
ggplot(aes(x=Perc, y=COUNTRY)) + 
    #geom_col creates bar and customize bar color
    geom_col(mapping=aes(fill=Answer), color = "black", linewidth = 0.2, width = 0.8, position = position_stack(reverse = TRUE)) +
    scale_fill_manual(values=color_pisa) +
    #break into facet, but use facet_wrap instead of facet_grid to modify position of strip text
    facet_wrap(Question ~ ., nrow = 11, strip.position = "top",
               labeller = as_labeller(question_desc)) +
    
    #custom label, on the left and the right
    geom_text(data=pisa_label, mapping=aes(x=-0.01, y=COUNTRY, label=left_labels), size=3, hjust=1, family = "Lato Light") +
    geom_text(data=pisa_label, mapping=aes(x=1.01, y=COUNTRY, label=right_labels), size=3, hjust=0, family = "Lato Light") +
       
    #Add custom axis by geom_segment
    geom_segment(data=data.frame(x = 0, xend = 1, y = 0, yend = 0, Question = "ST24Q11"),
               aes(x=x,y=y,yend=yend,xend=xend), inherit.aes=FALSE)+

    #To move the panel area to the right, we increase the xlim while limiting the display label
    coord_cartesian(xlim=c(-0.1,1), clip="off")
                       
p1

png

#Formating
p1  +
    #Title
    labs(x=NULL, y=NULL,
         title="Reading attitude",
         subtitle = "How much do you disagree or agree with these statements about reading?",
         caption= "Source: PISA 2009 Assessment Framework - Key Competencies in Reading, Mathematics, and Science
                    \n © OECD 2009, Data: bryer.org") +
    #Labelling x-axis 
    scale_x_continuous(breaks = seq(0, 1, 0.25), labels= function(x) round(abs(x)*100,0)) +
    #Add padding between bar and x-axis
    scale_y_discrete(expand = expansion(add = .8)) +
    #Theme element
    theme(axis.text.y = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          #strip text aligned with the bar. 
          #Don't know the relationship btw l=30 and l=50
          strip.text.x = element_text(hjust = 0, size = 3 *14/5,
                                      margin = margin(l=90)),
          legend.position = "bottom",
          legend.title = element_blank(),
          plot.caption = element_text(face="italic"),
          plot.title.position = "plot",
          plot.title = element_text(size = 20, family="Lato Black"),
          #Add some margin to final svg output
          plot.margin = margin(t=10, l=10),
          #Show the x-axis line and ticks
          axis.ticks.x = element_line(color = "black"),
          legend.spacing.x = unit(50, "pt"),
          #Increase the distance between legend keys
          legend.text = element_text(margin = margin(l = -40)),
         )

png

Note: I found some useful tips in below links:

#As we set the option of IRKernel at 10 x 12 inches, we set the same para for ggsave
ggsave("6.1.7 Bar char Grouping all responses.svg", last_plot(), device=svg, width = 10, height = 12, units="in")

SVG image result (Open in new tab to zoom in) Barchart Facet Starter

TODO:

  • Change geom_text to annotate for global font: The Answer is NO.