Title: | Miscellaneous Functions for Panel Data, Quantiles, and Printing Results |
---|---|
Description: | These are miscellaneous functions for working with panel data, quantiles, and printing results. For panel data, the package includes functions for making a panel data set balanced (that is, dropping individuals that have missing observations in any time period), converting id numbers to row numbers, and treating repeated cross sections as panel data under the assumption of rank invariance. For quantiles, there are functions to make distribution functions from a set of data points (this is particularly useful when a distribution function is created in several steps), to combine distribution functions based on some external weights, and to invert distribution functions. Finally, there are several other miscellaneous functions for obtaining weighted means, weighted distribution functions, and weighted quantiles; for generating summary statistics and their differences for two groups; and for adding or dropping covariates from formulas. |
Authors: | Brantly Callaway [aut, cre] |
Maintainer: | Brantly Callaway <[email protected]> |
License: | GPL-2 |
Version: | 1.4.7 |
Built: | 2024-11-11 06:31:43 UTC |
Source: | https://github.com/bcallaway11/bmisc |
addCovToFormla
adds some covariates to a formula; covs should be a list of variable names
addCovToFormla(covs, formla)
covs |
should be a list of variable names |
formla |
which formula to add covariates to |
formula
formla <- y ~ x
addCovToFormla(list("w", "z"), formla)
formla <- ~x
addCovToFormla("z", formla)
make draws of all observations with the same id in a panel data context. This is useful for bootstrapping with panel data.
blockBootSample(data, idname)
data |
data.frame from which you want to bootstrap |
idname |
column in data which contains an individual identifier |
data.frame bootstrapped from the original dataset; this data.frame will contain new ids
data("LaborSupply", package = "plm")
bbs <- blockBootSample(LaborSupply, "id")
nrow(bbs)
head(bbs$id)
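The idea behind a block bootstrap for panel data can be sketched in base R. This is only an illustration of the description above, not the internals of blockBootSample: it draws ids with replacement and stacks every row of each drawn id under a fresh id. The helper name `block_boot_sketch` and the toy data are hypothetical.

```r
# Hypothetical base-R sketch of a panel block bootstrap (not BMisc's own code)
block_boot_sketch <- function(data, idname) {
  ids <- unique(data[[idname]])
  drawn <- sample(ids, length(ids), replace = TRUE)  # resample units, not rows
  pieces <- lapply(seq_along(drawn), function(i) {
    block <- data[data[[idname]] == drawn[i], , drop = FALSE]
    block[[idname]] <- i  # assign a new id so repeated draws stay distinct
    block
  })
  do.call(rbind, pieces)
}

set.seed(1)
toy <- data.frame(id = rep(1:5, each = 3), t = rep(1:3, 5), y = rnorm(15))
boot <- block_boot_sketch(toy, "id")
nrow(boot)       # same number of rows as toy, since the panel is balanced
table(boot$id)   # every new id carries all 3 periods
```

Resampling whole units keeps each unit's time series intact, which is what makes the draw suitable for panel data.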
A function to check if treatment is staggered in a panel data set.
check_staggered(df, idname, treatname)
df |
the data.frame used in the function |
idname |
name of column that holds the unit id |
treatname |
name of column with the treatment indicator |
a logical indicating whether treatment is staggered
The check function used for optimizing to get quantiles
checkfun(a, tau)
a |
vector to compute quantiles for |
tau |
a value between 0 and 1; e.g., 0.5 gives the median |
numeric value
x <- rnorm(100)
x[which.min(checkfun(x, 0.5))] ## should be around 0
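To show why a check function is useful for quantiles, the sketch below uses the textbook check (pinball) loss and confirms that minimizing its sum over candidate values recovers the sample median. The function `check_loss` is a hypothetical stand-in written for this illustration; the package's checkfun may treat the boundary at zero differently.

```r
# Textbook check (pinball) loss: rho_tau(u) = u * (tau - 1{u < 0})
# (hypothetical helper, not BMisc's checkfun)
check_loss <- function(u, tau) u * (tau - (u < 0))

x <- c(3, 1, 4, 1, 5, 9, 2, 6, 5)
tau <- 0.5
# Total loss at each candidate value q taken from the data
loss <- sapply(x, function(q) sum(check_loss(x - q, tau)))
q_hat <- x[which.min(loss)]
q_hat       # the minimizer of the summed check loss
median(x)   # agrees with the sample median for tau = 0.5
```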
Combines distribution functions using the weights given by pstrat
combineDfs(y.seq, dflist, pstrat = NULL, ...)
y.seq |
sequence of possible y values |
dflist |
list of distribution functions to combine |
pstrat |
a vector of weights to put on each distribution function; if weights are not provided then equal weight is given to each distribution function |
... |
additional arguments that can be passed to BMisc::makeDist |
ecdf
x <- rnorm(100)
y <- rnorm(100, 1, 1)
Fx <- ecdf(x)
Fy <- ecdf(y)
both <- combineDfs(seq(-2, 3, 0.1), list(Fx, Fy))
plot(Fx, col = "green")
plot(Fy, col = "blue", add = TRUE)
plot(both, add = TRUE)
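Combining distribution functions with weights amounts to a pointwise mixture. The base-R sketch below illustrates that idea with unequal weights; it is an assumption about what the combination computes, not the package's implementation, and the weights in `pstrat` are made up for the example.

```r
set.seed(1)
x <- rnorm(100)
y <- rnorm(100, 1, 1)
Fx <- ecdf(x)
Fy <- ecdf(y)
y.seq <- seq(-2, 3, 0.1)
pstrat <- c(0.25, 0.75)  # hypothetical weights that sum to 1
# Pointwise mixture on the grid: F(v) = 0.25 * Fx(v) + 0.75 * Fy(v)
mix_vals <- pstrat[1] * Fx(y.seq) + pstrat[2] * Fy(y.seq)
# Right-continuous step function through the grid values
mix <- stepfun(y.seq, c(0, mix_vals))
mix(c(-5, 0.05, 10))
```

Because each component is nondecreasing and the weights are nonnegative, the mixture is again a valid distribution function on the grid.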
compareBinary
takes in a variable (e.g., union) and runs a bivariate regression of x on treatment (for summary statistics)
compareBinary( x, on, dta, w = rep(1, nrow(dta)), report = c("diff", "levels", "both") )
x |
variables to run regression on |
on |
binary variable |
dta |
the data to use |
w |
weights |
report |
which type of report to make; diff is the difference between the two variables by group |
matrix of results
Turn repeated cross sections data into panel data by imposing rank invariance; does not require that the inputs have the same length
cs2panel(cs1, cs2, yname)
cs1 |
data frame, the first cross section |
cs2 |
data frame, the second cross section |
yname |
the name of the variable to calculate difference for (should be the same in each dataset) |
the change in outcomes over time
A function to check for multicollinearity and drop collinear terms from a matrix
drop_collinear(matrix)
matrix |
a matrix for which the function will remove collinear columns |
a matrix with collinear columns removed
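One common way to detect and drop collinear columns is a pivoted QR decomposition. The sketch below uses base R's `qr()` for this; whether drop_collinear works the same way is an assumption, and the toy matrix is hypothetical.

```r
set.seed(1)
a <- rnorm(20)
b <- rnorm(20)
X <- cbind(a = a, b = b, ab = a + b)  # third column is an exact linear combination
qrX <- qr(X)
# The first `rank` pivot entries index columns judged linearly independent
keep <- qrX$pivot[seq_len(qrX$rank)]
X_reduced <- X[, sort(keep), drop = FALSE]
colnames(X_reduced)
```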
dropCovFromFormla
drops some covariates from a formula; covs should be a list of variable names
dropCovFromFormla(covs, formla)
covs |
should be a list of variable names |
formla |
which formula to drop covariates from |
formula
formla <- y ~ x + w + z
dropCovFromFormla(list("w", "z"), formla)
dropCovFromFormla("z", formla)
This is a function that takes in two matrices of dimension nxB and nxk and returns a Bxk matrix that comes from element-wise multiplication of every column in the first matrix times the entire second matrix and then averaging over the n-dimension. It is equivalent to (but faster than) the following R code: 'sapply(1:biters, function(b) sqrt(n)*colMeans(Umat[,b]*inf.func))'. This function is particularly useful for fast computations using the multiplier bootstrap.
element_wise_mult(U, inf_func)
U |
nxB matrix (e.g., this could be a matrix of Rademacher weights for B bootstrap iterations using the multiplier bootstrap) |
inf_func |
nxk matrix (e.g., this could be a matrix containing the influence function for different parameter estimates) |
a Bxk matrix
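The R expression quoted in the description can be checked against a single matrix product in base R. This sketch does not call the package; the dimensions and the simulated inputs are hypothetical. Note that `sapply()` returns a k x B matrix here, so it is transposed to match the documented B x k shape.

```r
set.seed(1)
n <- 50; biters <- 4; k <- 3
# Rademacher draws: +1 or -1 with equal probability
Umat <- matrix(sample(c(-1, 1), n * biters, replace = TRUE), n, biters)
inf.func <- matrix(rnorm(n * k), n, k)

# Loop form from the description, transposed to B x k
loop_version <- t(sapply(1:biters, function(b) sqrt(n) * colMeans(Umat[, b] * inf.func)))
# Same quantity as one matrix product: t(Umat) %*% inf.func / sqrt(n)
matrix_version <- crossprod(Umat, inf.func) / sqrt(n)

all.equal(loop_version, matrix_version)
dim(matrix_version)
```

The identity holds because sqrt(n) times a column mean equals the column sum divided by sqrt(n).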
A function that calculates the first difference in a panel data setting. If the data.frame that is passed in has nxT rows, the resulting vector will also have nxT elements with one element for each unit set to be NA.
get_first_difference(df, idname, yname, tname)
df |
the data.frame used in the function |
idname |
name of column that holds the unit id |
yname |
name of column containing the outcome (or other variable) for which to calculate the first difference |
tname |
name of column that holds the time period |
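A within-unit first difference of this kind can be sketched in base R with `ave()`. This is an illustration of the described output (one NA per unit, same length as the data), not the package's implementation; the toy data frame is hypothetical.

```r
df <- data.frame(
  id = rep(1:3, each = 3),
  t  = rep(1:3, times = 3),
  y  = c(1, 4, 9, 2, 2, 5, 10, 7, 3)
)
df <- df[order(df$id, df$t), ]  # sort so diff() runs within unit over time
# First difference within each unit; the first period has no predecessor, hence NA
df$dy <- ave(df$y, df$id, FUN = function(v) c(NA, diff(v)))
df
```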
A function to calculate a unit's group in a panel data setting with a binary treatment and staggered treatment adoption and where there is a column in the data indicating whether or not a unit is treated
get_group(df, idname, tname, treatname)
df |
the data.frame used in the function |
idname |
name of column that holds the unit id |
tname |
name of column that holds the time period |
treatname |
name of column with the treatment indicator |
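Under staggered adoption, a unit's group is usually taken to be the first period in which it is treated. The base-R sketch below assumes that convention, with 0 for never-treated units (consistent with the group==0 convention mentioned elsewhere in this manual). The column names and data are hypothetical.

```r
df <- data.frame(
  id      = rep(1:3, each = 4),
  period  = rep(1:4, times = 3),
  treated = c(0, 0, 1, 1,   # unit 1: first treated in period 3
              0, 0, 0, 0,   # unit 2: never treated
              0, 1, 1, 1)   # unit 3: first treated in period 2
)
# Group = first treated period; 0 if never treated (assumed convention)
first_treated <- sapply(split(df, df$id), function(d) {
  if (any(d$treated == 1)) min(d$period[d$treated == 1]) else 0
})
df$G <- unname(first_treated[as.character(df$id)])
unique(df[, c("id", "G")])
```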
A function that calculates lagged outcomes in a panel data setting. If the data.frame that is passed in has nxT rows, the resulting vector will also have nxT elements with one element for each unit set to be NA
get_lagYi(df, idname, yname, tname, nlags = 1)
df |
the data.frame used in the function |
idname |
name of column that holds the unit id |
yname |
name of column containing the outcome (or other variable) for which to calculate the lag |
tname |
name of column that holds the time period |
nlags |
The number of periods to lag. The default is 1, which computes the lag from the previous period. |
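Lagging within units can be sketched the same way in base R. The example assumes the data are sorted by unit and time and that `nlags` is smaller than the number of periods per unit; it illustrates the described behavior rather than reproducing the package's code.

```r
df <- data.frame(
  id = rep(1:2, each = 4),
  t  = rep(1:4, times = 2),
  y  = c(10, 11, 13, 16, 5, 4, 6, 9)
)
df <- df[order(df$id, df$t), ]
nlags <- 2
# Shift each unit's series by nlags; the first nlags periods have no lag available
df$lag_y <- ave(df$y, df$id, FUN = function(v) c(rep(NA, nlags), head(v, -nlags)))
df
```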
A function to calculate unit-specific principal components, given panel data
get_principal_components( xformula, data, idname, tname, n_components = NULL, ret_wide = FALSE, ret_id = FALSE )
xformula |
a formula specifying the variables to use in the principal component analysis |
data |
a data.frame containing the panel data |
idname |
the name of the column containing the unit id |
tname |
the name of the column containing the time period |
n_components |
the number of principal components to retain, the default is NULL which will result in all principal components being retained |
ret_wide |
whether to return the data in wide format (where the number of rows is equal to n = length(unique(data[[idname]])) or long format (where the number of rows is equal to nT = nrow(data)). The default is FALSE, so that long data is returned by default. |
ret_id |
whether to return the id column in the output data.frame. The default is FALSE. |
a data.frame containing the original data with the principal components appended
A function to calculate outcomes for units in the first time period that is available in a panel data setting (this function can also be used to recover covariates, etc. in the first period).
get_Yi1(df, idname, yname, tname, gname)
df |
the data.frame used in the function |
idname |
name of column that holds the unit id |
yname |
name of column containing the outcome (or other variable) for which to recover the value in the first available period |
tname |
name of column that holds the time period |
gname |
name of column containing the unit's group |
A function to calculate the average outcome across all time periods separately for each unit in a panel data setting (this function can also be used to recover covariates, etc.).
get_Yibar(df, idname, yname)
df |
the data.frame used in the function |
idname |
name of column that holds the unit id |
yname |
name of column containing the outcome (or other variable) for which to calculate the unit-level average |
A function to calculate average outcomes for units in their pre-treatment periods (this function can also be used to recover pre-treatment averages of covariates, etc.). For units that do not participate in the treatment (and therefore have group==0), the function calculates their overall average outcome.
get_Yibar_pre(df, idname, yname, tname, gname)
df |
the data.frame used in the function |
idname |
name of column that holds the unit id |
yname |
name of column containing the outcome (or other variable) for which to calculate the pre-treatment average |
tname |
name of column that holds the time period |
gname |
name of column containing the unit's group |
A function to calculate outcomes for units in the period right before they become treated (this function can also be used to recover covariates, etc. in the period right before a unit becomes treated). For units that do not participate in the treatment (and therefore have group==0), they are assigned their outcome in the last period.
get_YiGmin1(df, idname, yname, tname, gname)
df |
the data.frame used in the function |
idname |
name of column that holds the unit id |
yname |
name of column containing the outcome (or other variable) for which to calculate its outcome in the immediate pre-treatment period |
tname |
name of column that holds the time period |
gname |
name of column containing the unit's group |
A function to calculate outcomes for units in a particular time period 'tp' in a panel data setting (this function can also be used to recover covariates, etc. in that period).
get_Yit(df, tp, idname, yname, tname)
df |
the data.frame used in the function |
tp |
The time period for which to get the outcome |
idname |
name of column that holds the unit id |
yname |
name of column containing the outcome (or other variable) for which to recover the value in period tp |
tname |
name of column that holds the time period |
a vector of outcomes in period tp; the vector has length nT (i.e., a value is returned for each row of the panel, not only for rows from that period)
a function to take a list and get a particular part out of each element in the list
getListElement(listolists, whichone = 1)
listolists |
a list |
whichone |
which item to get out of each list (can be numeric or name) |
list of all the elements 'whichone' from each list
len <- 100 # number elements in list
lis <- lapply(1:len, function(l) list(x = (-l), y = l^2)) # create list
getListElement(lis, "x")[1] # should be equal to -1
getListElement(lis, 1)[1] # should be equal to -1
Get a distribution function from a vector of values after applying some weights
getWeightedDf(y, y.seq = NULL, weights = NULL, norm = TRUE)
y |
a vector to compute the distribution function for |
y.seq |
an optional vector of values to compute the distribution function for; the default is to use all unique values of y |
weights |
the vector of weights; if NULL, the unweighted distribution function is returned |
norm |
normalize the weights so that they have mean of 1, default is to normalize |
ecdf
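A weighted distribution function puts mass on each observation in proportion to its weight. The base-R sketch below illustrates this with weights rescaled to sum to 1, which is a simplification chosen for the example (the `norm` argument above is described as normalizing to mean 1). The data and weights are hypothetical.

```r
y <- c(1, 2, 3, 4)
w <- c(1, 1, 1, 5)         # hypothetical weights
w <- w / sum(w)            # rescale so the weights sum to 1
y.seq <- sort(unique(y))
# Weighted distribution function: F(v) = total weight on observations with y <= v
Fw <- sapply(y.seq, function(v) sum(w * (y <= v)))
cbind(y.seq, Fw)
weighted.mean(y, w)        # the corresponding weighted mean
```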
Get the mean applying some weights
getWeightedMean(y, weights = NULL, norm = TRUE)
y |
a vector to compute the mean for |
weights |
the vector of weights; if NULL, the unweighted mean is returned |
norm |
normalize the weights so that they have mean of 1, default is to normalize |
the weighted mean
Finds multiple quantiles by repeatedly calling getWeightedQuantile
getWeightedQuantiles(tau, cvec, weights = NULL, norm = TRUE)
tau |
a vector of values between 0 and 1 |
cvec |
a vector to compute quantiles for |
weights |
the weights; weighted.checkfun normalizes the weights to sum to 1 |
norm |
normalize the weights so that they have mean of 1, default is to normalize |
vector of quantiles
ids2rownum takes a vector of ids and converts it to the right row number in the dataset; ids should be unique in the dataset (that is, don't pass the function panel data in which the same id appears in multiple rows)
ids2rownum(ids, data, idname)
ids |
vector of ids |
data |
data frame |
idname |
unique id |
vector of row numbers
ids <- seq(1, 1000, length.out = 100)
ids <- ids[order(runif(100))]
df <- data.frame(id = ids)
ids2rownum(df$id, df, "id")
take an ecdf object and invert it to get a step-quantile function
invertEcdf(df)
df |
an ecdf object |
stepfun object that contains the quantiles of the df
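Inverting an ecdf means finding, for each probability p, the smallest value whose distribution function is at least p. The base-R sketch below writes that generalized inverse as a plain function; it is an illustration only, and unlike invertEcdf it does not return a stepfun object. The helper name `Q` is hypothetical.

```r
y <- c(1, 2, 3, 4)
Fy <- ecdf(y)
vals <- knots(Fy)     # jump points of the ecdf
probs <- Fy(vals)     # ecdf evaluated at those points
# Generalized inverse: Q(p) = smallest value v with F(v) >= p
Q <- function(p) sapply(p, function(pp) min(vals[probs >= pp]))
Q(c(0.25, 0.5, 0.6, 1))
```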
Take a formula and return a vector of the variables on the left hand side; returns NULL for a one sided formula
lhs.vars(formla)
formla |
a formula |
vector of variable names
ff <- yvar ~ x1 + x2
lhs.vars(ff)
This function drops observations from a data.frame that are not part of a balanced panel data set.
makeBalancedPanel(data, idname, tname, return_data.table = FALSE)
data |
data.frame used in function |
idname |
unique id |
tname |
time period name |
return_data.table |
if TRUE, makeBalancedPanel will return a data.table rather than a data.frame. Default is FALSE. |
data.frame that is a balanced panel
id <- rep(seq(1, 100), each = 2) # individual ids for setting up a two period panel
t <- rep(seq(1, 2), 100) # time periods
y <- rnorm(200) # outcomes
dta <- data.frame(id = id, t = t, y = y) # make into data frame
dta <- dta[-7, ] # drop the 7th row from the dataset (which creates an unbalanced panel)
dta <- makeBalancedPanel(dta, idname = "id", tname = "t")
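The balancing step in the example above can be mimicked in base R by counting rows per unit. This sketch assumes each (id, period) pair appears at most once and is not the package's implementation.

```r
set.seed(1)
id <- rep(seq(1, 100), each = 2)
t <- rep(seq(1, 2), 100)
dta <- data.frame(id = id, t = t, y = rnorm(200))
dta <- dta[-7, ]                             # unit 4 now has only one period
n_periods <- length(unique(dta$t))
n_obs <- ave(dta$t, dta$id, FUN = length)    # number of rows observed for each unit
balanced <- dta[n_obs == n_periods, ]        # keep units seen in every period
nrow(balanced)
4 %in% balanced$id
```

Dropping row 7 removes one period for unit 4, so the balanced panel should lose both of that unit's rows.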
Turn vectors of values and their distribution function values into an ecdf. The vectors should be the same length and both increasing.
makeDist( x, Fx, sorted = FALSE, rearrange = FALSE, force01 = FALSE, method = "constant" )
x |
vector of values |
Fx |
vector of the distribution function values |
sorted |
boolean indicating whether or not x is already sorted; computation is somewhat faster if already sorted |
rearrange |
boolean indicating whether or not to monotonize the distribution function |
force01 |
boolean indicating whether or not to force the values of the distribution function (i.e. Fx) to be between 0 and 1 |
method |
which method to pass to |
ecdf
y <- rnorm(100)
y <- y[order(y)]
u <- runif(100)
u <- u[order(u)]
F <- makeDist(y, u)
A function that takes in an influence function (an nxk matrix) and the number of bootstrap iterations and returns a Bxk matrix of bootstrap results. This function uses Rademacher weights.
multiplier_bootstrap(inf_func, biters)
inf_func |
nxk matrix (e.g., this could be a matrix containing the influence function for different parameter estimates) |
biters |
the number of bootstrap iterations |
a Bxk matrix
This function multiplies a matrix by a vector and returns a numeric vector.
mv_mult(A, v)
A |
an nxk matrix. |
v |
a vector (can be stored as numeric or as a kx1 matrix) |
A numeric vector resulting from the multiplication of the matrix by the vector.
A <- matrix(1:9, nrow = 3, ncol = 3)
v <- c(2, 4, 6)
mv_mult(A, v)
A helper function to switch from original time periods to "new" time periods (which are just time periods going from 1 to total number of available periods). This allows for periods not being exactly spaced apart by 1.
orig2t(orig, original_time.periods)
orig |
a vector of original time periods to convert to new time periods. |
original_time.periods |
vector containing all original time periods. |
new time period converted from original time period
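The mapping between original periods and consecutive "new" periods can be sketched with `match()` in base R. This illustrates the idea for both directions and is not the package's code; the period values are hypothetical.

```r
original_time.periods <- c(2001, 2004, 2010)  # unevenly spaced periods
orig <- c(2010, 2001, 2004, 2004)
periods <- sort(unique(original_time.periods))
# Original period -> position in the sorted list of periods (1, 2, 3, ...)
new_t <- match(orig, periods)
new_t
# And back again: position -> original period
back <- periods[new_t]
back
```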
panel2cs takes a 2 period dataset and turns it into a cross sectional dataset. The data includes the change in time varying variables between the time periods. The default functionality is to keep all the variables from period 1 and add all the variables listed by name in timevars from period 2 to those.
panel2cs(data, timevars, idname, tname)
data |
data.frame used in function |
timevars |
vector of names of variables to keep |
idname |
unique id |
tname |
time period name |
data.frame
panel2cs2 takes a 2 period dataset and turns it into a cross sectional dataset; i.e., long to wide. This function considers a particular case where there is some outcome whose value can change over time. It returns the dataset from the first period with the outcome in the second period and the change in outcomes over time appended to it
panel2cs2(data, yname, idname, tname, balance_panel = TRUE)
data |
data.frame used in function |
yname |
name of outcome variable that can change over time |
idname |
unique id |
tname |
time period name |
balance_panel |
whether to ensure that panel is balanced. Default is TRUE, but code runs somewhat faster if this is set to be FALSE. |
data from first period with .y0 (outcome in first period), .y1 (outcome in second period), and .dy (change in outcomes over time) appended to it
Take a formula and return the right hand side of the formula
rhs(formla)
formla |
a formula |
a one sided formula
ff <- yvar ~ x1 + x2
rhs(ff)
Take a formula and return a vector of the variables on the right hand side
rhs.vars(formla)
formla |
a formula |
vector of variable names
ff <- yvar ~ x1 + x2
rhs.vars(ff)
ff <- y ~ x1 + I(x1^2)
rhs.vars(ff)
Source all the files in a folder
source_all(fldr)
fldr |
path to a folder |
returns a subsample of a panel data set; in particular, drops all observations that are not in keepids. If it is not set, randomly keeps nkeep observations.
subsample(dta, idname, tname, keepids = NULL, nkeep = NULL)
dta |
a data.frame which is a balanced panel |
idname |
the name of the id variable |
tname |
the name of the time variable |
keepids |
which ids to keep |
nkeep |
how many ids to keep (only used if |
a data.frame that contains a subsample of dta
data("LaborSupply", package = "plm")
nrow(LaborSupply)
unique(LaborSupply$year)
ss <- subsample(LaborSupply, "id", "year", nkeep = 100)
nrow(ss)
A helper function to switch from "new" t values to original t values. This allows for periods not being exactly spaced apart by 1.
t2orig(t, original_time.periods)
t |
a vector of time periods to convert back to original time periods. |
original_time.periods |
vector containing all original time periods. |
original time period converted from new time period
This function takes a time-invariant variable and repeats it for each period in a panel data set.
time_invariant_to_panel(x, df, idname, balanced_panel = TRUE)
x |
a vector of length equal to the number of unique ids in df. |
df |
the data.frame used in the function |
idname |
name of column that holds the unit id |
balanced_panel |
a logical indicating whether the panel is balanced. If TRUE, the function will optimize the repetition process. Default is TRUE. |
a vector of length equal to the number of rows in df.
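Expanding a unit-level vector to the rows of a panel can be sketched with `match()` in base R. The sketch assumes that `x` follows the order of `unique(df$id)`, which is an assumption made for this illustration and not something the manual states; the data are hypothetical.

```r
df <- data.frame(id = rep(c(10, 20, 30), each = 2), t = rep(1:2, times = 3))
x <- c(0.5, 1.5, 2.5)  # one value per unit, assumed to follow unique(df$id)
# Look up each row's unit position and repeat the unit-level value accordingly
x_long <- x[match(df$id, unique(df$id))]
cbind(df, x_long)
```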
take a name for a y variable and a vector of names for x variables and turn them into a formula
toformula(yname, xnames)
yname |
the name of the y variable |
xnames |
vector of names for x variables |
a formula
toformula("yvar", c("x1", "x2"))
toformula("yvar", rhs.vars(~1)) ## should return yvar ~ 1
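For comparison, base R offers `reformulate()` and `all.vars()` for building formulas from names and reading variable names back out. The sketch below uses only those base functions; it is not a description of how toformula, lhs.vars, or rhs.vars are implemented.

```r
# Build a formula from a response name and a vector of term labels
f <- reformulate(c("x1", "x2"), response = "yvar")
f
all.vars(f[[2]])   # left-hand-side variable name
all.vars(f[[3]])   # right-hand-side variable names
# Intercept-only formula
f0 <- reformulate("1", response = "yvar")
f0
```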
A function to replace NA's with FALSE in vector of logicals
TorF(cond, use_isTRUE = FALSE)
cond |
a vector of conditions to check |
use_isTRUE |
whether or not to use a vectorized version of isTRUE. This is generally slower but covers more cases. |
logical vector
Weights the check function
weighted.checkfun(q, cvec, tau, weights)
q |
the value to check |
cvec |
vector of data to compute quantiles for |
tau |
a value between 0 and 1; e.g., 0.5 gives the median |
weights |
the weights; weighted.checkfun normalizes the weights to sum to 1 |
numeric
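A weighted quantile can be found by minimizing a weighted check-function objective. The base-R sketch below evaluates that objective at each data point and picks the minimizer. It uses the textbook check loss in a hypothetical helper `check_loss`, and searching only over the data points is a simplification chosen for this illustration rather than the package's optimization routine.

```r
# Textbook check loss (hypothetical helper, not BMisc's checkfun)
check_loss <- function(u, tau) u * (tau - (u < 0))
cvec <- c(1, 2, 3)
w <- c(1, 1, 4)
w <- w / sum(w)        # normalize weights to sum to 1
tau <- 0.5
# Weighted check-function objective evaluated at each candidate q in the data
obj <- sapply(cvec, function(q) sum(w * check_loss(cvec - q, tau)))
q_w <- cvec[which.min(obj)]
q_w                    # weighted median; the heavy weight on 3 pulls it there
```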