# Just in case we get errors asking that the package repositories be explicitly set when installing new packages:
options(repos = list(CRAN = "http://cran.rstudio.com/"))
# Install and load required packages
# We can use the {pacman} package to easily install and load several packages:
# ({pacman} itself may need installing if not yet installed)
pacman::p_load(
  tidyverse, sjlabelled, easystats,
  ggformula, ggeffects, marginaleffects,
  modelsummary, gtsummary, sjPlot,  # for general hierarchical modelling
  survey, sandwich, lmtest, lme4,   # for time-series hierarchical data structures
  panelr, plm, pglm)
Week 7 Worksheet Notes

Temporalities: Time series, panel and longitudinal data analysis

Aims

By the end of the session you will learn how to:
- fit random-intercepts and random-coefficients models to hierarchical data that contains a temporal dimension
- use several R packages for modelling time-series data

R packages
Example 1: basic tabulations of hierarchical and time-series structures
Let’s import this simulated toy dataset:
test <- data_read("https://cgmoreh.github.io/HSS8005-24/data/test.dta")
Reading data...
Variables where all values have associated labels are now converted into
factors. If this is not intended, use `convert_factors = FALSE`.
1 out of 9 variables were fully labelled and converted into factors.
We can peek at the data to get a feel for the variables:
datawizard::data_peek(test) |> print_md()
Variable | Type | Values |
---|---|---|
teacher | numeric | 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … |
student | numeric | 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, … |
age | numeric | 14, 15, 16, 17, 18, 14, 15, 16, 17, 18, 14, … |
cage | numeric | -2, -1, 0, 1, 2, -2, -1, 0, 1, 2, -2, -1, 0, … |
sex | factor | Male, Male, Male, Male, Male, Male, … |
score | numeric | 64, 68, 69, 71, 75, 58, 60, 71, 69, 72, 54, … |
pass | numeric | 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, … |
experience | numeric | 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, … |
zero | numeric | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … |
We can also check the first few and last few observations in a dataframe using the head() and tail() functions included in base R. The {gt} package also has a nice function, gt_preview(), that can do the same more flexibly and output it as a single table:
gt::gt_preview(test, top_n = 3, bottom_n = 3, incl_rownums = TRUE)
teacher | student | age | cage | sex | score | pass | experience | zero | |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 14 | -2 | Male | 64 | 0 | 5 | 0 |
2 | 1 | 1 | 15 | -1 | Male | 68 | 0 | 5 | 0 |
3 | 1 | 1 | 16 | 0 | Male | 69 | 0 | 5 | 0 |
4..372 | |||||||||
373 | 3 | 75 | 16 | 0 | Female | 81 | 1 | 8 | 0 |
374 | 3 | 75 | 17 | 1 | Female | 85 | 1 | 8 | 0 |
375 | 3 | 75 | 18 | 2 | Female | 86 | 1 | 8 | 0 |
This allows us to see more clearly how teachers, students and ages relate to each other and how they are nested.
To check more directly the number of students and teachers in the dataset, we could use the n_unique() function from {dplyr} in the {tidyverse}:
n_unique(test$student)
[1] 75
n_unique(test$teacher)
[1] 3
We can also check the distribution of the variables; we can break the sex variable (a categorical variable) into its two constituent categories and treat them as numeric for this purpose:
test |>
  datawizard::to_numeric() |>
  datawizard::describe_distribution()
Variable | Mean | SD | IQR | Range | Skewness | Kurtosis | n | n_Missing
-----------------------------------------------------------------------------------------
teacher | 2.00 | 0.82 | 2 | [1.00, 3.00] | 0.00 | -1.50 | 375 | 0
student | 38.00 | 21.68 | 38 | [1.00, 75.00] | 0.00 | -1.20 | 375 | 0
age | 16.00 | 1.42 | 2 | [14.00, 18.00] | 0.00 | -1.30 | 375 | 0
cage | 0.00 | 1.42 | 2 | [-2.00, 2.00] | 0.00 | -1.30 | 375 | 0
sex.Female | 0.51 | 0.50 | 1 | [0.00, 1.00] | -0.03 | -2.01 | 375 | 0
sex.Male | 0.49 | 0.50 | 1 | [0.00, 1.00] | 0.03 | -2.01 | 375 | 0
score | 68.71 | 7.20 | 10 | [51.00, 87.00] | 0.14 | -0.28 | 375 | 0
pass | 0.39 | 0.49 | 1 | [0.00, 1.00] | 0.46 | -1.80 | 375 | 0
experience | 6.33 | 1.25 | 3 | [5.00, 8.00] | 0.38 | -1.50 | 375 | 0
zero | 0.00 | 0.00 | 0 | [0.00, 0.00] | | | 375 | 0
Finally, it’s worth looking at the timeline of how test scores change across time (i.e. ages) for each student:
ggplot(test, aes(x = age, y = score)) +
  theme_bw() +
  geom_line(aes(group = student), color = "blue")
Example 3: Longer panel data estimation with {plm}
Panel data gather information about several individuals (cross-sectional units) over several periods. The panel is balanced if all units are observed in all periods; if some units are missing in some periods, the panel is unbalanced. A wide panel has the cross-sectional dimension (N) much larger than the longitudinal dimension (T); when the opposite is true, we have a long panel. Normally, the same units are observed in all periods; when this is not the case and each period samples mostly different units, the result is not proper panel data but a pooled cross-sections model.
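As a minimal base-R sketch (using hypothetical toy data, not the dataset below), a panel is balanced exactly when every unit is observed the same number of times in the same periods:

```r
# Toy data: unit "B" is missing from 2021, so this panel is unbalanced
toy <- data.frame(
  id   = c("A", "A", "A", "B", "B"),
  year = c(2020, 2021, 2022, 2020, 2022)
)

obs_per_unit <- as.vector(table(toy$id))   # observations per cross-sectional unit
balanced <- length(unique(obs_per_unit)) == 1
balanced
#> [1] FALSE
```

The {plm} function pdim() used later in this worksheet reports the same information (n, T, N and balancedness) directly for a panel data frame.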
In the previous examples our panels were fairly short. In this exercise, we use longer panel data to also demonstrate some additional functionalities in the {plm}
package. The dataset is more typical of econometric data and originates from Hill et al.’s book “Principles of Econometrics”.
We can load the dataset from the module website and inspect its contents:
nls <- data_read("https://cgmoreh.github.io/HSS8005-24/data/nls_panel.rda")
Reading data...
Variables where all values have associated labels are now converted into
factors. If this is not intended, use `convert_factors = FALSE`.
0 out of 18 variables were fully labelled and converted into factors.
We can peek at the data to get a feel for the variables:
datawizard::data_peek(nls) |> print_md()
Variable | Type | Values |
---|---|---|
id | numeric | 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, … |
year | numeric | 82, 83, 85, 87, 88, 82, 83, 85, 87, 88, 82, … |
lwage | numeric | 1.808289, 1.863417, 1.789367, 1.84653, … |
hours | numeric | 38, 38, 38, 40, 40, 48, 43, 35, 42, 42, 48, … |
age | numeric | 30, 31, 33, 35, 37, 36, 37, 39, 41, 43, 35, … |
educ | numeric | 12, 12, 12, 12, 12, 17, 17, 17, 17, 17, 12, … |
collgrad | numeric | 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, … |
msp | numeric | 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … |
nev_mar | numeric | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … |
not_smsa | numeric | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … |
c_city | numeric | 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … |
south | numeric | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … |
black | numeric | 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … |
union | numeric | 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, … |
exper | numeric | 7.666667, 8.583333, 10.17949, 12.17949, … |
exper2 | numeric | 58.77777, 73.67361, 103.622, 148.3399, … |
tenure | numeric | 7.666667, 8.583333, 1.833333, 3.75, 5.25, … |
tenure2 | numeric | 58.77777, 73.67361, 3.361111, 14.0625, … |
We can also check the first few and last few observations in a dataframe using the head() and tail() functions included in base R. The {gt} package also has a nice function, gt_preview(), that can do the same more flexibly and output it as a single table:
gt::gt_preview(nls, top_n = 3, bottom_n = 3, incl_rownums = TRUE)
id | year | lwage | hours | age | educ | collgrad | msp | nev_mar | not_smsa | c_city | south | black | union | exper | exper2 | tenure | tenure2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 82 | 1.808289 | 38 | 30 | 12 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 7.666667 | 58.77777 | 7.666667 | 58.777770 |
2 | 1 | 83 | 1.863417 | 38 | 31 | 12 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 8.583333 | 73.67361 | 8.583333 | 73.673610 |
3 | 1 | 85 | 1.789367 | 38 | 33 | 12 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 10.179490 | 103.62200 | 1.833333 | 3.361111 |
4..3577 | ||||||||||||||||||
3578 | 716 | 85 | 1.427116 | 40 | 39 | 12 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 15.435900 | 238.26690 | 6.083333 | 37.006950 |
3579 | 716 | 87 | 1.494368 | 30 | 41 | 12 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 17.435900 | 304.01050 | 8.166667 | 66.694450 |
3580 | 716 | 88 | 1.341422 | 40 | 42 | 12 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 18.858970 | 355.66090 | 9.583333 | 91.840270 |
To check more directly the number of ids and years in the dataset, we could use the n_unique() function from {dplyr} in the {tidyverse}:
n_unique(nls$id)
[1] 716
n_unique(nls$year)
[1] 5
We can also check the distribution of the variables:
datawizard::describe_distribution(nls)
Variable | Mean | SD | IQR | Range | Skewness | Kurtosis | n | n_Missing
---------------------------------------------------------------------------------------------
id | 358.50 | 206.72 | 358.50 | [1.00, 716.00] | 0.00 | -1.20 | 3580 | 0
year | 85.00 | 2.28 | 4.00 | [82.00, 88.00] | 0.00 | -1.57 | 3580 | 0
lwage | 1.92 | 0.46 | 0.64 | [0.14, 4.25] | 0.26 | 0.49 | 3580 | 0
hours | 38.03 | 7.97 | 2.00 | [1.00, 80.00] | -1.20 | 3.85 | 3580 | 0
age | 35.88 | 3.89 | 6.00 | [28.00, 46.00] | 0.11 | -0.62 | 3580 | 0
educ | 13.02 | 2.44 | 3.00 | [4.00, 18.00] | 0.24 | 0.15 | 3580 | 0
collgrad | 0.23 | 0.42 | 0.00 | [0.00, 1.00] | 1.28 | -0.36 | 3580 | 0
msp | 0.66 | 0.47 | 1.00 | [0.00, 1.00] | -0.67 | -1.55 | 3580 | 0
nev_mar | 0.14 | 0.35 | 0.00 | [0.00, 1.00] | 2.07 | 2.30 | 3580 | 0
not_smsa | 0.30 | 0.46 | 1.00 | [0.00, 1.00] | 0.88 | -1.23 | 3580 | 0
c_city | 0.30 | 0.46 | 1.00 | [0.00, 1.00] | 0.86 | -1.25 | 3580 | 0
south | 0.42 | 0.49 | 1.00 | [0.00, 1.00] | 0.31 | -1.91 | 3580 | 0
black | 0.28 | 0.45 | 1.00 | [0.00, 1.00] | 0.97 | -1.06 | 3580 | 0
union | 0.26 | 0.44 | 1.00 | [0.00, 1.00] | 1.07 | -0.86 | 3580 | 0
exper | 12.03 | 3.86 | 5.35 | [1.06, 27.19] | 0.02 | -0.13 | 3580 | 0
exper2 | 159.60 | 95.46 | 128.86 | [1.12, 739.42] | 0.92 | 1.32 | 3580 | 0
tenure | 6.95 | 5.17 | 7.83 | [0.00, 24.75] | 0.64 | -0.41 | 3580 | 0
tenure2 | 75.01 | 93.66 | 101.83 | [0.00, 612.56] | 1.74 | 2.91 | 3580 | 0
Panel datasets can be organized in two main forms: the long form has a column for each variable and a row for each individual-period; the wide form has a column for each variable-period and a row for each individual. Most panel data methods require the long form, but many data sources provide one wide-form table for each variable; assembling the data from different sources into a long-form data frame is often not a trivial matter.
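As a sketch of that wide-to-long conversion, base R's reshape() can stack per-period columns into long form (the wage columns here are hypothetical, chosen only to mirror the structure of the nls data):

```r
# Hypothetical wide table: one column per variable-period (wage in years 82, 83)
wide <- data.frame(id = 1:2,
                   wage.82 = c(1.8, 2.1),
                   wage.83 = c(1.9, 2.2))

# direction = "long" splits "wage.82"/"wage.83" (at the default sep = ".")
# into a 'wage' value column and a 'year' time column,
# giving one row per individual-period
long <- reshape(wide, direction = "long",
                varying = c("wage.82", "wage.83"),
                idvar = "id", timevar = "year")
long[order(long$id, long$year), ]
```

tidyr's pivot_longer() offers the same transformation with a tidyverse interface; reshape() is shown here only because it needs no extra packages.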
The next code sequence creates a panel structure for the dataset using the function pdata.frame() from the {plm} package:
nlspd <- pdata.frame(nls, index = c("id", "year"))
The function pdim()
extracts the dimensions of the panel data:
pdim(nlspd)
Balanced Panel: n = 716, T = 5, N = 3580
In the following, we will fit a series of models estimating wage as a function of education, experience (plus a second-order polynomial for experience) and race (“black” vs. “white”), with the aim of demonstrating functions in the {plm} package. We could also fit these models using the hierarchical modelling functions from previous examples and exercises.
Pooled model
A pooled model does not allow for intercept or slope differences among individuals. Such a model can be estimated with plm() using the specification model = "pooling":
wage.pooled <- plm(lwage ~ educ + exper + I(exper^2) + black,
                   model = "pooling",
                   data = nlspd)
summary(wage.pooled)
Pooling Model
Call:
plm(formula = lwage ~ educ + exper + I(exper^2) + black, data = nlspd,
model = "pooling")
Balanced Panel: n = 716, T = 5, N = 3580
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-1.7043039 -0.2469447 -0.0062387 0.2421563 2.6619636
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.42706205 0.05691619 7.5033 7.807e-14 ***
educ 0.07451334 0.00274756 27.1198 < 2.2e-16 ***
exper 0.06272218 0.00797946 7.8605 5.026e-15 ***
I(exper^2) -0.00122307 0.00032263 -3.7910 0.0001525 ***
black -0.13616209 0.01485167 -9.1681 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 772.56
Residual Sum of Squares: 550.24
R-Squared: 0.28777
Adj. R-Squared: 0.28697
F-statistic: 361.111 on 4 and 3575 DF, p-value: < 2.22e-16
model_parameters(wage.pooled)
# Fixed Effects
Parameter | Coefficient | SE | 95% CI | t(3575) | p
------------------------------------------------------------------------
(Intercept) | 0.43 | 0.06 | [ 0.32, 0.54] | 7.50 | < .001
educ | 0.07 | 2.75e-03 | [ 0.07, 0.08] | 27.12 | < .001
exper | 0.06 | 7.98e-03 | [ 0.05, 0.08] | 7.86 | < .001
exper^2 | -1.22e-03 | 3.23e-04 | [ 0.00, 0.00] | -3.79 | < .001
black | -0.14 | 0.01 | [-0.17, -0.11] | -9.17 | < .001
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
Fixed effects model
The fixed effects model takes into account individual differences, translated into different intercepts of the regression line for different individuals. Variables that change little or not at all over time, such as some individual characteristics, should not be included in a fixed effects model because they produce collinearity with the fixed effects.
To estimate a fixed effects model with {plm}, we can use the option model = "within"; dummy variables for individuals are then created automatically in the background and included in the model:
wage.within <- plm(lwage ~ educ + exper + I(exper^2) + black,
                   model = "within",
                   data = nlspd)
summary(wage.within)
Oneway (individual) effect Within Model
Call:
plm(formula = lwage ~ educ + exper + I(exper^2) + black, data = nlspd,
model = "within")
Balanced Panel: n = 716, T = 5, N = 3580
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-1.1798141 -0.0693879 0.0052198 0.0789143 1.9969132
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
exper 0.05521048 0.00580433 9.5120 < 2.2e-16 ***
I(exper^2) -0.00105655 0.00022838 -4.6263 3.888e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 126.95
Residual Sum of Squares: 110.37
R-Squared: 0.13057
Adj. R-Squared: -0.087246
F-statistic: 214.901 on 2 and 2862 DF, p-value: < 2.22e-16
model_parameters(wage.within)
# Fixed Effects
Parameter | Coefficient | SE | 95% CI | t(2862) | p
----------------------------------------------------------------------
exper | 0.06 | 5.80e-03 | [ 0.04, 0.07] | 9.51 | < .001
exper^2 | -1.06e-03 | 2.28e-04 | [ 0.00, 0.00] | -4.63 | < .001
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
As we can see in the results above, the “race” variable was excluded from the model because it does not vary over time (it stays constant “within” an individual). This is one major limitation of so-called fixed effects (“within”) models.
The within method is equivalent to including the individual “dummies” in the model.
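That equivalence can be checked with a short base-R sketch on simulated data (hypothetical variables, not the nls dataset): regressing within-unit demeaned variables on each other gives exactly the same slope as including the unit dummies explicitly:

```r
# Simulated panel: 50 units, 4 periods each, unit-specific intercepts
set.seed(1)
d <- data.frame(id = factor(rep(1:50, each = 4)), x = rnorm(200))
d$y <- 0.5 * d$x + rep(rnorm(50), each = 4) + rnorm(200, sd = 0.1)

# (1) Least-squares dummy variable (LSDV): explicit unit dummies
lsdv_slope <- coef(lm(y ~ x + id, data = d))["x"]

# (2) "Within" transformation: demean x and y within each unit
d$x_dm <- d$x - ave(d$x, d$id)
d$y_dm <- d$y - ave(d$y, d$id)
within_slope <- coef(lm(y_dm ~ 0 + x_dm, data = d))["x_dm"]

all.equal(unname(lsdv_slope), unname(within_slope))
#> [1] TRUE
```

This is an instance of the Frisch–Waugh–Lovell theorem; {plm}'s model = "within" uses the demeaning route, which avoids estimating hundreds of dummy coefficients.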
Random effects
The random effects model elaborates on the fixed effects model by recognizing that, since the individuals in the panel are randomly selected, their characteristics, measured by the intercept \(\beta_{1i}\), should also be random. Random effects estimators are reliable under the assumption that individual characteristics (heterogeneity) are exogenous, that is, independent of the regressors in the random effects equation. We have already fit random effects models using the {lme4} package previously.
In {plm}, the same function we used for fixed effects can be used for random effects, but setting the argument model = "random" and selecting random.method as one of four possibilities: "swar" (the default), "amemiya", "walhus", or "nerlove":
wage.random <- plm(lwage ~ educ + exper + I(exper^2) + black,
                   model = "random",
                   random.method = "swar",
                   data = nlspd)
summary(wage.random)
Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)
Call:
plm(formula = lwage ~ educ + exper + I(exper^2) + black, data = nlspd,
model = "random", random.method = "swar")
Balanced Panel: n = 716, T = 5, N = 3580
Effects:
var std.dev share
idiosyncratic 0.03857 0.19638 0.25
individual 0.11592 0.34047 0.75
theta: 0.7502
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-1.2666036 -0.0928955 0.0057325 0.0984554 2.0978278
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 0.47585529 0.08084927 5.8857 3.964e-09 ***
educ 0.07488688 0.00547943 13.6669 < 2.2e-16 ***
exper 0.05638773 0.00560490 10.0604 < 2.2e-16 ***
I(exper^2) -0.00108166 0.00022148 -4.8838 1.041e-06 ***
black -0.13627562 0.02973398 -4.5832 4.580e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 167.23
Residual Sum of Squares: 137.85
R-Squared: 0.17566
Adj. R-Squared: 0.17474
Chisq: 761.809 on 4 DF, p-value: < 2.22e-16
model_parameters(wage.random)
# Fixed Effects
Parameter | Coefficient | SE | 95% CI | z | df | p
-----------------------------------------------------------------------------
(Intercept) | 0.48 | 0.08 | [ 0.32, 0.63] | 5.89 | 3575 | < .001
educ | 0.07 | 5.48e-03 | [ 0.06, 0.09] | 13.67 | 3575 | < .001
exper | 0.06 | 5.60e-03 | [ 0.05, 0.07] | 10.06 | 3575 | < .001
exper^2 | -1.08e-03 | 2.21e-04 | [ 0.00, 0.00] | -4.88 | 3575 | < .001
black | -0.14 | 0.03 | [-0.19, -0.08] | -4.58 | 3575 | < .001
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald z-distribution approximation.
Tabulating all the models:
We can now collate the models into a summary table to compare them more closely:
modelsummary::modelsummary(list(wage.pooled, wage.within, wage.random))
(1) | (2) | (3) | |
---|---|---|---|
(Intercept) | 0.427 | 0.476 | |
(0.057) | (0.081) | ||
educ | 0.075 | 0.075 | |
(0.003) | (0.005) | ||
exper | 0.063 | 0.055 | 0.056 |
(0.008) | (0.006) | (0.006) | |
I(exper^2) | −0.001 | −0.001 | −0.001 |
(0.000) | (0.000) | (0.000) | |
black | −0.136 | −0.136 | |
(0.015) | (0.030) | ||
Num.Obs. | 3580 | 3580 | 3580 |
R2 | 0.288 | 0.131 | 0.176 |
R2 Adj. | 0.287 | −0.087 | 0.175 |
AIC | 3467.1 | −2290.1 | −1488.2 |
BIC | 3504.2 | −2271.6 | −1451.1 |
RMSE | 0.39 | 0.18 | 0.20 |
We can compare these results with those obtained using other hierarchical modelling functions, such as the lmer() function we used in earlier examples and exercises.
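As a sketch of that comparison (assuming the nls data and the packages loaded earlier in this worksheet), the random-effects specification above corresponds to a random-intercept model in {lme4}, where (1 | id) adds a random intercept per individual:

```r
# Random-intercept counterpart of wage.random, fitted with {lme4}
# (assumes `nls` has been loaded as above; this is a sketch, not run here)
wage.lmer <- lme4::lmer(lwage ~ educ + exper + I(exper^2) + black + (1 | id),
                        data = nls)
lme4::fixef(wage.lmer)  # fixed-effect coefficients, comparable to the plm estimates
```

The point estimates will be close to, but not identical with, the "swar" random-effects results, since lmer() uses (RE)ML estimation rather than the feasible GLS transformation applied by {plm}.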