Data wrangling with the EVS dataset from GESIS
Data source
In previous posts, I complained that the dataset ZA4753 - European Values Study 2008: Germany
could not be downloaded directly.
So I did some research and went from this site (which appeared at the top when I searched for “EVS 2008 dataset”) to the site below:
https://search.gesis.org/research_data/ZA4800
Fortunately, this dataset can now be downloaded after registering an account.
Downloads:
ZA4800_v5-0-0.dta.zip Stata (Dataset) 10.03 MB
ZA4800_v5-0-0.sav.zip SPSS (Dataset) 12.91 MB
Availability: A - Data and documents are released for academic research and teaching.
Select the SPSS dataset and we can continue to dive in.
# Use the haven package to read the .sav file
library(haven)
read_sav("./myData/EVS_2008/ZA4800_v5-0-0.sav")
Error: Failed to parse C:/Users/minhd/OneDrive/Learning/Dataviz-R/myData/EVS_2008/ZA4800_v5-0-0.sav: Unable to convert string to the requested encoding (invalid byte sequence).
Traceback:
1. read_sav("./myData/EVS_2008/ZA4800_v5-0-0.sav")
2. df_parse_sav_file(spec, encoding, user_na, cols_skip, n_max,
. skip, name_repair = .name_repair)
Error when handling SPSS data with Unicode characters
I found several posts on the Internet about this error, which appears when read_sav tries to decode the file as UTF-8
(read_sav defaults to encoding = NULL, which uses the encoding declared in the file).
Error: Failed to parse "./myData/EVS_2008/ZA4800_v5-0-0.sav"
Unable to convert string to the requested encoding (invalid byte sequence)
It may come from the way SPSS stores strings in the data file. A quick fix is to change the encoding to latin1. However, a problem remains: the negative values become NA.
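A possible explanation for the NA values, offered as an assumption I have not verified against this file: read_sav has a user_na argument that defaults to FALSE, and in that mode values declared as user-defined missing in SPSS are converted to NA. The sketch below combines the latin1 workaround with user_na = TRUE. The helper name read_evs_sav is made up for illustration, and the path is the one used earlier in this post.

```r
# Sketch only: re-read the SPSS file while keeping user-defined missing codes.
# Assumes the haven package is installed and the .sav file sits at this path.
sav_path <- "./myData/EVS_2008/ZA4800_v5-0-0.sav"

# Hypothetical helper wrapping haven::read_sav with the two relevant options
read_evs_sav <- function(path) {
  haven::read_sav(
    path,
    encoding = "latin1",  # work around the invalid byte sequence error
    user_na  = TRUE       # keep codes declared as missing instead of turning them into NA
  )
}

# Only run when both the package and the file are available
if (requireNamespace("haven", quietly = TRUE) && file.exists(sav_path)) {
  za4800_sav <- read_evs_sav(sav_path)
  # If user_na is the cause, the range should now include negative codes
  print(range(unclass(za4800_sav$v106), na.rm = TRUE))
}
```

If the negative codes show up with this call, the .sav file could be used directly. I stayed with the Stata file in the rest of this post.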
Therefore, I switched to the dta format and used read_dta instead.
za4800 <- read_dta("./myData/EVS_2008/ZA4800_v5-0-0.dta")
dim(za4800)
#We need to extract data for Germany and for certain questions only
#Read the source for description of each variable (there are 132 variables)
- 66280
- 477
unique(za4800$country)
<labelled<double>[46]>: country code
[1] 8 40 51 56 70 100 112 191 196 197 203 208 233 246 250 268 276 300 348
[20] 352 372 380 428 440 442 470 498 499 528 578 616 620 642 643 688 703 705 724
[39] 752 756 792 804 807 826 909 915
Labels:
value label
8 Albania
31 Azerbaijan
40 Austria
51 Armenia
56 Belgium
70 Bosnia Herzegovina
100 Bulgaria
112 Belarus
191 Croatia
196 Cyprus
197 Northern Cyprus
203 Czech Republic
208 Denmark
233 Estonia
246 Finland
250 France
268 Georgia
276 Germany
300 Greece
...
915 Kosovo
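Instead of scanning the printed label table by eye, the code can be looked up by name from the "labels" attribute, which is where haven keeps value labels. The sketch below uses a small toy vector as a stand-in for za4800$country, so the data in it is illustrative and not taken from the real file.

```r
# Toy stand-in for za4800$country: values plus a "labels" attribute.
# Only a few countries are included.
country <- structure(
  c(8, 40, 276, 276, 300),
  labels = c(Albania = 8, Austria = 40, Germany = 276, Greece = 300)
)

# Look up the numeric code by its label
country_labels <- attr(country, "labels")
germany_code <- country_labels[["Germany"]]
print(germany_code)

# Count how many toy rows carry that code
n_germany <- sum(country == germany_code)
print(n_germany)
```

On the real data the same lookup would be attr(za4800$country, "labels")[["Germany"]].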
# The country code for Germany is 276
evs_ger <- dplyr::filter(za4800, country==276)
dim(evs_ger)
- 2075
- 477
# This matches the number of observations in the book. Great!
# Now we select only 8 questions for learning purposes
evs_ger <- dplyr::select(evs_ger, v106,v159,v160,v161,v162,v163,v164,v165)
head(evs_ger)
v106 | v159 | v160 | v161 | v162 | v163 | v164 | v165 |
---|---|---|---|---|---|---|---|
<dbl+lbl> | <dbl+lbl> | <dbl+lbl> | <dbl+lbl> | <dbl+lbl> | <dbl+lbl> | <dbl+lbl> | <dbl+lbl> |
2 | 4 | 1 | 4 | 3 | 3 | 4 | 3 |
2 | 3 | 2 | 2 | 2 | 2 | 4 | 4 |
-3 | 1 | 3 | 3 | 3 | 1 | 1 | 1 |
-3 | 1 | 4 | 2 | 3 | 1 | 1 | 2 |
-3 | 3 | 1 | 1 | 2 | 3 | 2 | 3 |
2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 |
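# Write the subset to an RDS file for later use
# (note: .rds is the usual extension for saveRDS(); .Rda is conventionally used with save()/load())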
saveRDS(evs_ger, file="./myData/EVS_2008/germany.Rda")
# Test loading the data back
my_df <- readRDS("./myData/EVS_2008/germany.Rda")
head(my_df)
dim(my_df)
v106 | v159 | v160 | v161 | v162 | v163 | v164 | v165 |
---|---|---|---|---|---|---|---|
<dbl+lbl> | <dbl+lbl> | <dbl+lbl> | <dbl+lbl> | <dbl+lbl> | <dbl+lbl> | <dbl+lbl> | <dbl+lbl> |
2 | 4 | 1 | 4 | 3 | 3 | 4 | 3 |
2 | 3 | 2 | 2 | 2 | 2 | 4 | 4 |
-3 | 1 | 3 | 3 | 3 | 1 | 1 | 1 |
-3 | 1 | 4 | 2 | 3 | 1 | 1 | 2 |
-3 | 3 | 1 | 1 | 2 | 3 | 2 | 3 |
2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 |
- 2075
- 8
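The reload check above compares the output by eye. A stricter check, sketched here on a toy data frame and a temporary file so it runs anywhere, is to compare the object before and after the round trip with identical(), which also covers attributes such as value labels. The names toy and toy_back are placeholders for evs_ger and my_df.

```r
# Toy data frame standing in for evs_ger
toy <- data.frame(v159 = c(4, 3, 1), v160 = c(1, 2, 3))
attr(toy$v159, "labels") <- c("agree strongly" = 1, "disagree strongly" = 4)

# Write to a temporary .rds file and read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, file = tmp)
toy_back <- readRDS(tmp)

# identical() compares values and attributes, including the labels
round_trip_ok <- identical(toy, toy_back)
print(round_trip_ok)

# Clean up the temporary file
unlink(tmp)
```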
# Inspect the value labels of the fifth column (v162)
attributes(my_df[[5]])$labels
label | value |
---|---|
other missing | -5 |
question not asked | -4 |
not applicable | -3 |
no answer | -2 |
don't know | -1 |
agree strongly | 1 |
agree | 2 |
disagree | 3 |
disagree strongly | 4 |
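The labels show that all negative codes stand for some kind of missing answer, so they should not enter a mean or a correlation as if they were real scores. A minimal sketch of one way to handle this, using a toy item in place of a real column: treating every negative code as NA is my assumption about what the analysis needs, not something the dataset documentation prescribes.

```r
# Toy stand-in for one survey item; the codes mirror the labels listed above
item <- structure(
  c(2, 3, -1, 4, -2, 1, 3),
  labels = c(
    "other missing" = -5, "question not asked" = -4, "not applicable" = -3,
    "no answer" = -2, "don't know" = -1,
    "agree strongly" = 1, "agree" = 2, "disagree" = 3, "disagree strongly" = 4
  )
)

# Drop the attributes, then treat every negative code as missing
item_clean <- as.numeric(item)
item_clean[item_clean < 0] <- NA

# Mean over substantive answers only
item_mean <- mean(item_clean, na.rm = TRUE)
print(item_mean)
```

The same two lines could be applied to each column of my_df, for example inside lapply().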