Printed on 17 May 2024

path | size | modified |
---|---|---|
data_raw/postSA_YSJ_2023_3+February+2024_21.40.sav | 7.29M | 2024-02-03 21:41:12 |
data_raw/preSA_YSJ_2023_12+April+2024_12.08.sav | 23.44M | 2024-04-12 12:08:56 |
data_raw/Study Abroad Expectations_September 11, 2023_17.06.sav | 43.24M | 2023-09-12 00:09:39 |
data_raw/Study+Abroad+Expectations+–+External_September+11,+2023_17.09.sav | 23.82M | 2023-09-12 00:09:19 |
study_design/MAXOUT-SA_Codeplan.xlsx | 55.82K | 2024-04-18 16:43:18 |
study_design/MAXOUT-SA_Interviewees.xlsx | 12.83K | 2024-04-25 13:21:27 |
study_design/postSA_survey_labels.csv | 11.85K | 2024-05-16 06:23:26 |
study_design/preSA_survey_labels.csv | 20.87K | 2024-05-16 06:23:25 |
Data files
The raw data files were downloaded from Qualtrics™ into a folder called `data_raw` with their default Qualtrics™ names, which include the survey name plus the date and time of the download. The data was exported from Qualtrics™ as SPSS `.sav` data files with the "extra long labels" option.
The Qualtrics™ questionnaire was based on a design codeplan saved in an Excel `.xlsx` file, which is stored in a folder named `study_design`. The same folder also contains a spreadsheet with details about the survey participants who also took part in the follow-up qualitative interview phase of the data collection.
The `data_raw` and `study_design` folders contain the files listed above.
The code below sets up functional links to these files in R:
#### File paths ----------------------------------------------------------------------------------
(datafiles   <- list.files("data_raw", pattern = "\\.sav"))      # List `.sav` files
(designfiles <- list.files("study_design", pattern = "\\.xlsx")) # List `.xlsx` files
## 2020 pre-SA
preSA20_ysj_path <- file.path("data_raw", grep("Study Abroad Expectations", datafiles, value = TRUE)) # 2020 YSJ student data
preSA20_ext_path <- file.path("data_raw", grep("External", datafiles, value = TRUE))                  # 2020 Non-YSJ student data

## 2023 pre-SA
preSA23_ysj_path <- file.path("data_raw", grep("preSA_YSJ_2023", datafiles, value = TRUE))            # 2023 YSJ pre-SA data

## 2023 post-SA
postSA23_ysj_path <- file.path("data_raw", grep("postSA_YSJ_2023", datafiles, value = TRUE))          # 2023 YSJ post-SA data

## Design
codeplan_path     <- file.path("study_design", grep("Codeplan", designfiles, value = TRUE))           # Excel survey codebook
interviewees_path <- file.path("study_design", grep("Interviewees", designfiles, value = TRUE))       # Excel list of interviewees
Pre-SA datasets
Questionnaire/variable differences
The 2020 pilot data collection consists of a YSJ and an external dataset. The difference between the two questionnaires was a single item that asked YSJ respondents whether they would also be interested in participating in a qualitative interview study. Qualitative data was not collected from external respondents:
In YSJ but not in External data:
[1] "ysj_interview"
[1] "Accept to participate in an interview"
The position of the variable in the dataset is:
[1] 303
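The report above can be produced by comparing the two datasets' variable names with `setdiff()`. A self-contained sketch on toy data frames (the real check would run on `preSA20_ysj` and `preSA20_ext` after import):

```r
## Toy data frames standing in for the YSJ and External datasets
ysj <- data.frame(Random_ID = 1, age = 20, ysj_interview = "Yes")
ext <- data.frame(Random_ID = 2, age = 21)

## Variables present in the YSJ data but not in the External data
setdiff(names(ysj), names(ext))
#> [1] "ysj_interview"

## Position of the extra variable in the YSJ dataset
which(names(ysj) == "ysj_interview")
#> [1] 3
```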
The 2023 pre-SA survey had several differences compared to the 2020 questionnaire:
2020 | 2023 |
---|---|
In which academic year do you expect to go on a Study Abroad year? [ | In which year do you expect to go on a Study Abroad year/semester? [ |
Who do you expect to socialize with most while on Study Abroad? [ | Who do you expect to socialize with most while on Study Abroad? [ |
Block of 16 questions on “imagined self” was not asked | |
The email question asked for “university email address” specifically | |
The `sayr` and `expect_socialise` variables from 2023 were given the `_23` suffix to their variable names (in the Codeplan document).
Data import
The code below imports into R the pre-SA data (`.sav` files), the variable information (names, labels) from the codeplan document (`study_design/MAXOUT-SA_Codeplan.xlsx`), and information about which respondents also participated in qualitative follow-up interviews (`study_design/MAXOUT-SA_Interviewees.xlsx`):
#### Import from raw ----------------------------------------------------------------------------------------
## Pre-SA Codeplan
codeplan_pre <- read_excel(codeplan_path, sheet = "preSAvars")

## Interviewees
interviewees <- read_excel(interviewees_path) |> data_select(c("Random_ID", "interviewed_preSA", "interviewed_postSA"))

## 2020 pre-SA YSJ
preSA20_ysj <- read_spss(preSA20_ysj_path)                            # Import from spss
names(preSA20_ysj) <- codeplan_pre$varname_pre20                      # Assign variable names
sjlabelled::set_label(preSA20_ysj) <- codeplan_pre$varlabel_pre20     # Assign variable labels

## 2020 pre-SA External
preSA20_ext <- read_spss(preSA20_ext_path)
names(preSA20_ext) <- codeplan_pre$varname_pre20[-303]                # Assign variable names removing YSJ-specific var
sjlabelled::set_label(preSA20_ext) <- codeplan_pre$varlabel_pre20[-303] # Assign variable labels removing YSJ-specific var

## 2023 pre-SA YSJ
preSA23_ysj <- read_spss(preSA23_ysj_path)
names(preSA23_ysj) <- na.omit(codeplan_pre$varname_pre23)             # Assign variable names
sjlabelled::set_label(preSA23_ysj) <- na.omit(codeplan_pre$varlabel_pre23) # Assign variable labels
Data management variables
Before merging the datasets, we create an additional `cohort` column which records the academic year of the pre-SA data collection. The survey for the YSJ study was kept open for several months, spanning the second semester of the 2019/2020 academic year and the first semester of the 2020/2021 AY, so the `preSA20_ysj` dataset contains responses from two student cohorts (2019/2020 and 2020/2021). The `preSA20_ext` dataset should only contain responses from the 2019/2020 student cohort because the outbreak of the Covid-19 pandemic interfered with the data collection: international travel, and with it Study Abroad years, were put on hold. However, there is one response that was submitted in March 2021. This response will be removed from the dataset:
#### Create `cohort` column ------------------------------------------------------------------------------------
preSA20_ysj <- preSA20_ysj |>
  mutate(cohort = case_when(StartDate < as.POSIXct("2020-09-01") ~ "19/20",
                            StartDate >= as.POSIXct("2020-09-01") ~ "20/21"))

preSA20_ext <- preSA20_ext |>
  mutate(cohort = "19/20") |>
  dplyr::filter(StartDate < as.POSIXct("2020-09-01")) # remove response dating "2021-03-20 16:11:51"

preSA23_ysj <- preSA23_ysj |>
  mutate(cohort = "23/24")
Merging the pre-SA datasets
Merging the three datasets should therefore yield `ncol(preSA20_ysj) + 3` = 312 variables.
The code below merges the datasets and checks the dimensions of the result:
#### Merge all pre-SA datasets from 2020 and 2023 ---------------------------------------------------------------
preSA <- sjmisc::add_rows(preSA20_ysj,  # `sjmisc::add_rows` keeps the `label` attribute but not other non-relevant attributes
                          preSA20_ext,  ## `dplyr::bind_rows` removes the variable label attributes
                          preSA23_ysj)  ## `datawizard::data_merge` keeps all SPSS-specific attributes (`display_width`, `format.spss`)

### Check dimensions of merged dataframe
dim(preSA)

[1] 243 312
Number of columns as expected: TRUE
Replacing piped text
The Qualtrics™ questionnaire included piped text for Japanese and Korean language students, and these appear as non-human-readable shortcodes in the variable and value labels, so we replace these characters with the phrase “JP/KO”. For example, see the value label of the `A1_comjpko` variable before and after the replacement:
Speaks with JP/KO friends in JP/KO (A1_comjpko) | |||||
---|---|---|---|---|---|
Value | Label | N | Raw % | Valid % | Cum. % |
1 | ${lm://Field/1} | 38 | 15.64 | 100 | 100 |
Speaks with JP/KO friends in JP/KO (A1_comjpko) | |||||
---|---|---|---|---|---|
Value | Label | N | Raw % | Valid % | Cum. % |
1 | JP/KO | 38 | 15.64 | 100 | 100 |
The code below makes the replacements across all the value labels in the dataset:
#### Replace shortcodes for "Japanese" and "Korean" -----------------------------------------------------
## Get all value labels as list
labs <- sjlabelled::get_labels(preSA)

## Change all the value labels in all the variables in the list
labs <- lapply(labs, function(x) str_replace_all(x,
                                                 '\\$[^\\}]*\\}',
                                                 "JP/KO"))

## Apply changed labels to dataset; keep labels as attribute (don't do `as_label(as.numeric)` beforehand)
preSA <- sjlabelled::set_labels(preSA, labels = labs, force.labels = TRUE)
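The pattern `'\\$[^\\}]*\\}'` matches a literal `$`, then any run of characters up to and including the next `}`, i.e. a whole `${...}` piped-text shortcode. A quick self-contained check with base R's `gsub()` (the pipeline itself uses `stringr::str_replace_all()`):

```r
## The regex replaces a whole `${...}` piped-text shortcode
gsub('\\$[^\\}]*\\}', "JP/KO", "Speaks with ${lm://Field/1} friends")
#> [1] "Speaks with JP/KO friends"
```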
Converting categorical variables
We convert the values of all labelled factor (categorical) variables to their labels, so that later we can manipulate the values directly as text.
#### Convert labelled factor variables ------------------------------
## This keeps the unused labels as well
preSA <- preSA |>
  mutate(across(where(is.factor), sjlabelled::as_numeric),
         across(everything(), sjlabelled::as_label))

## This keeps only the labels of categories that had valid responses
# preSA_alt <- preSA |>
#   mutate(across(where(is.factor), labels_to_levels))
Combining Japanese and Korean versions of variables
The survey questions were broken down by language studied (Japanese/Korean), and we have duplicate variables coding the same question (prefixed with “A1_” for Japanese and “A2_” for Korean). With the code below we combine these variables:
#### Unify variables split by language ---------------------------------------------------------
korean <- preSA |>
  dplyr::filter(language == "Korean") |>
  select(!starts_with("A1")) |>
  rename_with(stringr::str_replace,
              pattern = "A2_", replacement = "",
              matches("A2_"))

japanese <- preSA |>
  dplyr::filter(language == "Japanese") |>
  select(!starts_with("A2")) |>
  rename_with(stringr::str_replace,
              pattern = "A1_", replacement = "",
              matches("A1_"))

missing <- preSA |>
  dplyr::filter(is.na(language)) |>  # 13 missing answers to language
  datawizard::remove_empty_columns() # remove all empty columns

preSA <- sjmisc::add_rows(japanese, korean, missing)
Removing incomplete responses
There were 13 responses with missing data on language. Since the language studied was a core compulsory-answer item, these 13 cases were also unfinished responses. Of the 243 responses in the pre-SA dataset, 183 were finished and submitted. We keep only finished cases:
#### Keep only completed and submitted responses ---------------------------
preSA <- preSA |>
  dplyr::filter(Finished == "True")
Removing duplicates
E-mail addresses were requested primarily for contacting students who opted in to a follow-up qualitative interview and/or future (post-SA) rounds of data collection, and for contacting the winner of the randomly selected participation prize. Respondent e-mail and IP addresses are also helpful for identifying data reliability issues, such as duplicate responses (n.b. the IPAddress collected by Qualtrics™ is “external”, so respondents connecting through the same network share an IP; for this reason, selecting on IP address is not useful here).
We find four email addresses with duplicate responses:
No. of duplicate emails | ID_first | ID_second |
---|---|---|
2 | 9419 | 8436 |
2 | 9514 | 8175 |
2 | 95339 | 70145 |
2 | 84953 | 44422 |
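A table like the one above can be built by flagging e-mail addresses that occur more than once. A self-contained base-R sketch on toy data (the real check runs on the `email` column of `preSA`):

```r
## Toy data: one e-mail address used for two submissions
resp <- data.frame(Random_ID = c("9419", "8436", "1111"),
                   email     = c("a@x.com", "a@x.com", "b@y.com"))

## Keep every row whose e-mail occurs more than once
dupes <- resp[resp$email %in% resp$email[duplicated(resp$email)], ]
dupes$Random_ID
#> [1] "9419" "8436"
```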
We will keep the earlier responses, since the later responses could be contaminated by practice effects from having already completed the survey once. Incidentally, the earlier responses also have fewer missing answers (albeit marginally, a difference of one in two cases):
#### Will delete the later responses (incidentally, these also have fewer NAs) -----------------
`%not_in%` <- Negate(`%in%`)

preSA <- preSA |>
  data_filter(Random_ID %not_in% c("8436", "8175", "70145", "44422")) # keeps original "rownames"; `rownames(preSA) <- NULL` to renumber
This leaves us with 179 responses/cases/rows.
We can also check whether any Random_ID numbers have been allocated multiple times (unfortunately, Qualtrics™ doesn’t have a system to fine-tune the randomisation of numbers…). We find that the `Random_ID` number `3591` has been allocated twice:
Random_ID | cohort | uni | language |
---|---|---|---|
3591 | 20/21 | York St John University | Japanese |
3591 | 19/20 | Cardiff University | Japanese |
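Repeated IDs can be spotted with `table()`. A self-contained sketch (the real check runs on `preSA$Random_ID`):

```r
## Toy IDs in which "3591" has been allocated twice
ids <- c("3591", "1234", "3591", "9876")
tab <- table(ids)
names(tab[tab > 1])  # IDs occurring more than once
#> [1] "3591"
```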
One was allocated to an external participant, so we replace it by adding a suffix consisting of two `0`s to it:
#### Fix identical Random_IDs -------------------------------------------------------------
preSA$Random_ID[preSA$uni == "Cardiff University" & preSA$Random_ID == "3591"] <- "359100"
Variable selection
Select out sensitive data
We can select out variables that contain more sensitive information to store separately from the main analysis dataset:
### Select out meta- and safeguarded variables ---------------------------------------------
preSA_sensitive <- preSA |>
  select(Status:Progress, Finished:UserLanguage, postcode, followup, ysj_interview, email, Random_ID)
Select out textual data
We can also select out variables that contain text entered in open-ended survey questions. These were identified with the `_txt` suffix in the Codeplan. We create and select the most useful variables to keep in the textual dataset:
### Get names of textual variables -----------------------------------------------------------------------------------------------
text_variables <- str_subset(names(preSA), pattern = "_txt")

### Select and create variables to keep in the textual dataset -------------------------------------------------------------------
preSA_textual <- preSA |>
  # create text variable concatenating all school types attended
  data_unite(select = contains("school_"), new_column = "schools_combined", remove_na = TRUE, append = TRUE, separator = ", ") |>
  # data_unite() doesn't want to exclude NAs (bug in the code), so remove them manually
  data_modify(.at = "schools_combined", .modify = function(x) {text_remove(x, ", NA")}) |>
  data_modify(.at = "schools_combined", .modify = function(x) {text_remove(x, "NA, ")}) |>
  # select variables to keep
  data_select(c(Random_ID, uni, cohort, language, sayr, sayr_23,
                gender, age, intstudnt, bornuk, pargrad, schools_combined, text_variables))
Recode textual data
It is more useful to keep a numeric version of the textual variables, which records the number of words in the answers provided, rather than the answers themselves. To distinguish these from the original variables, we add the suffix `_nwords` to their names:
### Function to count all "word" characters, first converting empty strings to NA
wordcounts <- function(x) {
  label <- get_label(x)            # save var labels
  x |> convert_to_na(na = "") |>   # convert to NA to avoid 0 values
    str_count('\\w+') |>           # count all "words"
    set_label(label)               # reassign the saved labels
}

#### Recode textual variables to wordcount numeric variables; add suffix to var name -----
preSA <- preSA |>
  data_modify(.at = text_variables, .modify = wordcounts) |>
  data_addsuffix(pattern = "_nwords", select = text_variables)
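The counting logic itself can be illustrated with a base-R one-liner (`gregexpr()` standing in for `stringr::str_count('\\w+')` so the snippet is self-contained; note the real `wordcounts()` also converts empty strings to `NA` and preserves variable labels):

```r
## Count runs of "word" characters in each element of a character vector
count_words <- function(x) lengths(regmatches(x, gregexpr("\\w+", x)))
count_words(c("two words", "", "one"))
#> [1] 2 0 1
```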
Remove, add, relocate and label variables
Final variable preparation tasks. We exclude the sensitive data from the dataset, as well as the PIS variable and topic header variables; we add a variable counting the number of missing answers for each respondent; we add two additional variables recording whether the respondent has also participated in a qualitative in-depth interview at the pre-SA and post-SA stage; we reorder and label variables:
### Remove, add and relocate variables ---------------------------------------------------------
# Add variables on interviewees
preSA <- data_merge(preSA, interviewees, id = "Random_ID")
preSA_textual <- data_merge(preSA_textual, interviewees, id = "Random_ID")

preSA <- preSA |>
  # remove variables
  select(!c(StartDate, EndDate, Status, IPAddress, Progress, RecordedDate:UserLanguage,
            postcode, email, followup, ysj_interview,
            Finished, pis, contains("Topics"))) |>
  # add count of missing answers
  rowwise() |>
  mutate(missing_answers = sum(is.na(across(everything())))) |>
  ungroup() |>
  # relocate
  relocate(c(Random_ID, Duration, missing_answers, interviewed_preSA, interviewed_postSA, uni, cohort)) |>
  relocate(sayr_23, .after = sayr) |>
  relocate(expect_socialise_23, .after = expect_socialise)

# attr(preSA2023$missing_answers, "label") <- "Number of unanswered items"
preSA <- preSA |> var_labels(missing_answers = "Number of unanswered items",
                             interviewed_preSA = "Interviewed before Study Abroad",
                             interviewed_postSA = "Re-interviewed after Study Abroad",
                             cohort = "Student cohort")
Analysis dataset check
The final analysis dataset contains 179 responses and 172 variables. 143 responses are from York St John University students. There are 2 variables containing only `NA` values: `proglength_txt_nwords` and `sib4occ_study_txt_nwords`. The minimum number of missing answers across the dataset is 26 and the maximum is 62, with a median of 40.
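Checks like these take only a few one-liners. A self-contained sketch on a toy data frame (the real checks run on `preSA`):

```r
## Toy data frame standing in for the analysis dataset
d <- data.frame(a = c(1, 2, NA), b = c(NA, NA, NA), missing_answers = c(2, 1, 3))

dim(d)                             # responses x variables
names(d)[colSums(!is.na(d)) == 0]  # variables containing only NA values
#> [1] "b"
summary(d$missing_answers)         # min / median / max of missing answers
```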
Post-SA dataset
The aim of the post-SA data collection was to provide a sense of how opinions changed following the Study Abroad year. Many survey questions were repeated (almost) exactly as first asked in the pre-SA data collection, and a more limited number of new items were also introduced. Variables that have an equivalent in the pre-SA data received the same name stub as in the pre-SA dataset, followed by the `_post` suffix. New variables without an equivalent in the pre-SA dataset received the `post_` prefix. A number of variables that were programmatically included in the Qualtrics™ survey have the same name as in the pre-SA dataset and will be excluded or modified, keeping only the shared variables needed for matching and merging the responses from the same individuals.
Data import
The code below imports the post-SA data (`.sav` files) and the variable information (names, labels) from the Codeplan spreadsheet (`study_design/MAXOUT-SA_Codeplan.xlsx`) into R:
#### Import from raw -----------------------------------------------------------------------------------
## Post-SA codeplan
codeplan_post <- read_excel(codeplan_path, sheet = "postSAvars")

## 2023 post-SA YSJ
postSA23_ysj <- read_spss(postSA23_ysj_path)
names(postSA23_ysj) <- na.omit(codeplan_post$varname_post23)              # Assign variable names
sjlabelled::set_label(postSA23_ysj) <- na.omit(codeplan_post$varlabel_post23) # Assign variable labels
The raw `postSA23_ysj` dataset has 34 answers and 117 variables.
Data cleansing
We fix Qualtrics™ shortcodes in labels and convert categorical variable types. We keep only the 32 completed responses. We also remove several variables that were reused/pre-filled automatically from the pre-SA survey, keeping only the pre-filled `Random_ID` variable for merging. One of the pre-filled `Random_ID`s corresponds to one of the duplicates that were deleted from the pre-SA dataset; we replace this ID with that of the case kept in the analysis for the purpose of merging. We remove sensitive data.
#### Replace shortcodes for "Japanese" and "Korean" -----------------------------------------------------------
labs <- sjlabelled::get_labels(postSA23_ysj)
labs <- lapply(labs, function(x) str_replace_all(x,
                                                 '\\$[^\\}]*\\}',
                                                 "JP/KO"))
postSA23_ysj <- postSA23_ysj |>
  sjlabelled::set_labels(labels = labs, force.labels = TRUE)
#### Extract sensitive data ------------------------------------------------------------------------------------
postSA_sensitive <- postSA23_ysj |>
  dplyr::select(c(post_qual_1, post_qual_2, # PIS and eligibility vars
                  # Sensitive
                  post_email_uni, post_aftergrad_followup, post_email_personal,
                  Random_ID))

#### Select cases and variables --------------------------------------------------------------------------------
postSA <- postSA23_ysj |>
  ## Convert labelled factor variables
  mutate(across(where(is.factor), sjlabelled::as_numeric),
         across(everything(), sjlabelled::as_label)) |>
  ## Select valid responses
  dplyr::filter(Finished_post == "True" &
                post_qual_1 == "I agree to take part" &
                post_qual_2 == "I have done a full year of study abroad") |>
  ## Select useful variables
  dplyr::select(!c(StartDate_post:Progress_post, Finished_post:UserLanguage_post, # Qualtrics variables
                   uni_post, pre_course, course:SAcountry,                        # Prefilled from Pre-SA survey
                   post_qual_1, post_qual_2,                                      # PIS and eligibility vars
                   post_email_uni, post_aftergrad_followup, post_email_personal)) |> # Sensitive
  ## add count of missing answers
  rowwise() |>
  mutate(missing_answers_post = sum(is.na(across(everything())))) |>
  ungroup() |>
  ## move to the top
  data_relocate(c(Random_ID, missing_answers_post)) |>
  ## add label to new variable
  var_labels(missing_answers_post = "Number of unanswered items")

#### Replace a `Random_ID`
postSA$Random_ID[postSA$Random_ID == "8436"] <- "9419"
Select out and recode textual data
In the postSA dataset, variables that contain textual data entered in open-ended survey questions are identified in two ways: variables unique to the post-SA survey (which carry the `post_` prefix) are suffixed with `_txt`, whereas variables repeated across the two surveys take the whole suffix `_txt_post`. There are 11 such variables.
### Get names of textual variables -----------------------------------------------------------------------------------------------
text_variables_post <- str_subset(names(postSA), pattern = "_txt")

### Select and create variables to keep in the textual dataset -------------------------------------------------------------------
postSA_textual <- postSA |>
  data_select(c(Random_ID, post_course_same, post_SAuni_satisfaction,
                text_variables_post))
It is more useful to keep a numeric version of the textual variables, which records the number of words in the answers provided, rather than the answers themselves. To distinguish these from the original variables, we add the suffix `_nwords` to their names:
#### Recode textual variables to wordcount numeric variables; add suffix to var name -----
#### use the same `wordcounts()` function defined above
postSA <- postSA |>
  data_modify(.at = text_variables_post, .modify = wordcounts) |>
  data_addsuffix(pattern = "_nwords", select = text_variables_post)
Merging pre-SA and post-SA data
We merge the `postSA23_ysj` dataset with the responses from the same individuals in the `preSA` dataset:
postSA <- data_merge(preSA, postSA, join = "inner", by = "Random_ID")
We also merge the textual datasets:
SA_textual <- data_merge(preSA_textual, postSA_textual, join = "inner", by = "Random_ID") # join type assumed; "inner" mirrors the merge above
Export datasets
We export the analysis datasets with the names `preSA2023` and `postSA2023` in SPSS `.sav` format to a new folder called `data_in`. We export the sensitive data to another folder called `data_lock`. We also export the textual data to an Excel sheet, including variable names, labels and their concatenated version as additional rows to make them more informative as column headers in Excel:
## Export main datasets ---------------------------------------------------------------------------------
fs::dir_create("data_in")
sjlabelled::write_spss(preSA, "data_in/preSA2023.sav")
sjlabelled::write_spss(postSA, "data_in/postSA2023.sav")

fs::dir_create("data_lock")
sjlabelled::write_spss(preSA_sensitive, "data_lock/preSA2023_sensitive.sav")
sjlabelled::write_spss(postSA_sensitive, "data_lock/postSA2023_sensitive.sav")

## Export qualitative dataset ----------------------------------------------------------------------
# For this we use a function I wrote that modifies the behaviour of datawizard::data_write() to allow variable labels to be saved as
# the first row in the exported text file. This is achieved with an additional optional setting `labels_to_row = TRUE`.
# If `labels_to_row` is not specified, the function does the same as datawizard::data_write()

# Import the function from GitHub Gist
devtools::source_gist("https://gist.github.com/CGMoreh/a706954fb56cf8cc4a1ddc53ac1a4737", filename = "my_data_write.R")

data_write(SA_textual, "data_in/SA_textual.xlsx", labels_to_row = TRUE)