DataViz in R | 03. Load EVS Dataset
Published on Apr 29, 2023
ggplot

Data wrangling with the EVS dataset from GESIS


Data source

In previous posts, I complained that the dataset ZA4753 - European Values Study 2008: Germany could not be directly downloaded.

So I did some research and went from this site (it appeared at the top when I searched for “EVS 2008 dataset”) to the site below:

https://search.gesis.org/research_data/ZA4800

Fortunately, this dataset is now downloadable, simply by registering an account.

Downloads:
ZA4800_v5-0-0.dta.zip Stata (Dataset) 10.03 MB
ZA4800_v5-0-0.sav.zip SPSS (Dataset) 12.91 MB

Availability: A - Data and documents are released for academic research and teaching.

Select the SPSS dataset and we can continue to dive in.

# Use the haven library to read the .sav file

library(haven)
read_sav("./myData/EVS_2008/ZA4800_v5-0-0.sav")
Error: Failed to parse C:/Users/minhd/OneDrive/Learning/Dataviz-R/myData/EVS_2008/ZA4800_v5-0-0.sav: Unable to convert string to the requested encoding (invalid byte sequence).
Traceback:


1. read_sav("./myData/EVS_2008/ZA4800_v5-0-0.sav")

2. df_parse_sav_file(spec, encoding, user_na, cols_skip, n_max, 
 .     skip, name_repair = .name_repair)

Error when handling SPSS data with Unicode characters

I found several posts online describing this error when read_sav reads the file as UTF-8 (with the default, encoding = NULL, read_sav uses the encoding declared in the file).

Error: Failed to parse "./myData/EVS_2008/ZA4800_v5-0-0.sav"
Unable to convert string to the requested encoding (invalid byte sequence)

It probably comes from how SPSS stores strings in the data file. A quick fix is to change the encoding to latin1. However, a second problem remains: the negative values become NA.

Therefore, I switched to the Stata (.dta) format and used read_dta instead.

za4800 <- read_dta("./myData/EVS_2008/ZA4800_v5-0-0.dta")
dim(za4800)

#We need to extract data for Germany and for certain questions only
#Read the source for description of each variable (there are 132 variables)
  1. 66280
  2. 477
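As an aside on the .sav route: the negative codes turning into NA is probably because the SPSS file declares them as user-defined missing values, which read_sav converts to NA by default (user_na = FALSE). I have not verified this on the EVS file. The sketch below writes a tiny made-up variable to a temporary .sav file and reads it back both ways; the name v_demo, its values, and its labels are assumptions for illustration only.

```r
library(haven)

# Hypothetical toy variable: -1 and -3 are declared as user-defined missings
v_demo <- labelled_spss(
  c(1, 2, -1, -3),
  labels = c("don't know" = -1, "not applicable" = -3,
             "agree strongly" = 1, "agree" = 2),
  na_values = c(-1, -3)
)
toy <- tibble::tibble(v_demo = v_demo)

# Write the toy data to a temporary SPSS file
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)

# Default (user_na = FALSE): user-defined missings are converted to NA
default_read <- read_sav(tmp)

# user_na = TRUE: the original codes are kept
kept_read <- read_sav(tmp, user_na = TRUE)

# Count the NAs in the default read and pull the raw codes from the other
n_na_default <- sum(is.na(default_read$v_demo))
codes_kept <- as.vector(unclass(kept_read$v_demo))
```

If this is indeed the cause, a call such as read_sav(path, encoding = "latin1", user_na = TRUE) might keep the negative codes in the EVS file as well, but that is untested here.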
unique(za4800$country)
<labelled<double>[46]>: country code
 [1]   8  40  51  56  70 100 112 191 196 197 203 208 233 246 250 268 276 300 348
[20] 352 372 380 428 440 442 470 498 499 528 578 616 620 642 643 688 703 705 724
[39] 752 756 792 804 807 826 909 915

Labels:
 value              label
     8            Albania
    31         Azerbaijan
    40            Austria
    51            Armenia
    56            Belgium
    70 Bosnia Herzegovina
   100           Bulgaria
   112            Belarus
   191            Croatia
   196             Cyprus
   197    Northern Cyprus
   203     Czech Republic
   208            Denmark
   233            Estonia
   246            Finland
   250             France
   268            Georgia
   276            Germany
   300             Greece
   ...
   915             Kosovo
# The country code for Germany is 276
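Instead of reading the code off the printed table, it can also be looked up by label, because haven stores the value labels as a named vector in the "labels" attribute. This is a minimal sketch on a toy stand-in for za4800$country; the toy vector holds only three of the codes listed above.

```r
library(haven)

# Toy stand-in for za4800$country (an assumption, not the real column)
country <- labelled(
  c(8, 40, 276, 276),
  labels = c(Albania = 8, Austria = 40, Germany = 276),
  label = "country code"
)

# The value labels are a named numeric vector
country_labels <- attr(country, "labels")

# Look up the numeric code by its label
germany_code <- country_labels[["Germany"]]

# Count the rows that carry this code
n_germany <- sum(country == germany_code)
```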

evs_ger <- dplyr::filter(za4800, country==276)
dim(evs_ger)
  1. 2075
  2. 477
# This matches the number of observations in the book. Great!
# Now we select only 8 questions for learning purposes

evs_ger <- dplyr::select(evs_ger, v106,v159,v160,v161,v162,v163,v164,v165)
head(evs_ger)
A tibble: 6 × 8
v106      v159      v160      v161      v162      v163      v164      v165
<dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
2         4         1         4         3         3         4         3
2         3         2         2         2         2         4         4
-3        1         3         3         3         1         1         1
-3        1         4         2         3         1         1         2
-3        3         1         1         2         3         2         3
2         2         2         2         2         2         2         1
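The filter and select steps can also be chained, and the run of question columns can be picked by name with the tidyselect helper num_range() instead of listing each one. This is only a sketch on made-up data; the toy tibble and its values are assumptions, not EVS records.

```r
library(dplyr)

# Toy stand-in for za4800 with made-up values
toy <- tibble(
  country = c(276, 276, 40),
  v106 = c(2, -3, 1),
  v159 = c(4, 1, 2),
  v160 = c(1, 3, 2),
  v161 = c(4, 3, 2),
  v162 = c(3, 3, 2),
  v163 = c(3, 1, 2),
  v164 = c(4, 1, 2),
  v165 = c(3, 1, 4),
  v300 = c(9, 9, 9)
)

# Keep the German rows, then v106 plus v159 to v165 (selected by name)
toy_ger <- toy %>%
  filter(country == 276) %>%
  select(v106, num_range("v", 159:165))
```

num_range() matches on column names, so it does not pick up unrelated columns that happen to sit between v159 and v165 in the file.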
#write the RDS file for learning purpose
saveRDS(evs_ger, file="./myData/EVS_2008/germany.Rda")
#Test load my data
my_df <- readRDS("./myData/EVS_2008/germany.Rda")
head(my_df)
dim(my_df)
A tibble: 6 × 8
v106      v159      v160      v161      v162      v163      v164      v165
<dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
2         4         1         4         3         3         4         3
2         3         2         2         2         2         4         4
-3        1         3         3         3         1         1         1
-3        1         4         2         3         1         1         2
-3        3         1         1         2         3         2         3
2         2         2         2         2         2         2         1
  1. 2075
  2. 8
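Comparing head() and dim() by eye works, but the round trip can also be checked programmatically. The sketch below uses a toy labelled column and a temporary file, both assumptions for illustration, and compares the saved and restored objects with identical().

```r
library(haven)

# Toy stand-in for evs_ger with one labelled column
toy <- tibble::tibble(
  v162 = labelled(
    c(4, 2, 3),
    labels = c("agree strongly" = 1, "agree" = 2,
               "disagree" = 3, "disagree strongly" = 4)
  )
)

# Save to a temporary file and read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, file = tmp)
restored <- readRDS(tmp)

# The whole object, including the value labels, should survive unchanged
round_trip_ok <- identical(toy, restored)
labels_ok <- identical(attr(restored$v162, "labels"),
                       attr(toy$v162, "labels"))
```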

attributes(my_df[[5]])$labels
label               value
other missing          -5
question not asked     -4
not applicable         -3
no answer              -2
don't know             -1
agree strongly          1
agree                   2
disagree                3
disagree strongly       4
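These labels show that the negative codes are different kinds of missing answers. One possible next step, sketched here on a made-up vector, is to set the negative codes to NA and convert the rest to a factor with haven's as_factor(). Whether dropping all negative codes is appropriate depends on the analysis; the vector v_demo and its values are assumptions.

```r
library(haven)

# The value labels shown above
labs <- c(
  "other missing" = -5, "question not asked" = -4, "not applicable" = -3,
  "no answer" = -2, "don't know" = -1,
  "agree strongly" = 1, "agree" = 2, "disagree" = 3, "disagree strongly" = 4
)

# Made-up answers for illustration
v_demo <- labelled(c(3, 2, -3, 1, -1), labels = labs)

# Work on the raw codes and treat every negative code as missing
codes <- as.vector(unclass(v_demo))
codes[codes < 0] <- NA

# Rebuild the labelled vector with the substantive labels only
v_clean <- labelled(codes, labels = labs[labs > 0])

# Convert to a factor whose levels are the labels
v_factor <- as_factor(v_clean, levels = "labels")
```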