Lalonde example
Throughout the user's guide, we'll use the Lalonde dataset, in particular the "Dehejia-Wahba Sample" on that web page. It originally comes from Robert LaLonde, "Evaluating the Econometric Evaluations of Training Programs with Experimental Data," American Economic Review, Vol. 76, pp. 604-620, 1986.
The output in this example was generated by converting an IJulia notebook to Markdown for the user's guide. This doesn't look great, but you can find a better-looking version rendered by nbviewer.
Getting the data set
To download the example data set, run
include(joinpath(Pkg.dir("TargetedLearning"), "examples", "fetchdata.jl"))
fetchdata(ds="lalonde_dw")
This only needs to be run once.
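If you're not sure whether you've already downloaded it, a quick check like the following avoids re-fetching (just a sketch, using the same file path that appears below):
#only fetch the data if the CSV isn't already on disk
dsfname = joinpath(Pkg.dir("TargetedLearning"), "examples", "data", "lalonde_dw.csv")
if !isfile(dsfname)
    include(joinpath(Pkg.dir("TargetedLearning"), "examples", "fetchdata.jl"))
    fetchdata(ds="lalonde_dw")
end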
Loading the data set
Using readcsv
The simplest way to read the data set is with readcsv. The CSV file we have has a header, so we'll pass header=true, and we'll get back a tuple containing a matrix with the numeric data in it (readcsv can figure out that it's all Float64s automatically) and a matrix with one row containing the column names.
dsfname = joinpath(Pkg.dir("TargetedLearning"), "examples", "data", "lalonde_dw.csv")
dat, colnames = readcsv(dsfname, header=true)
#let's inspect colnames
colnames
1x10 Array{String,2}:
"treatment" "age" "education" … "nodegree" "RE74" "RE75" "RE78"
#convert colnames to a vector instead of a matrix with one row
colnames = reshape(colnames, size(colnames, 2))
10-element Array{String,1}:
"treatment"
"age"
"education"
"black"
"hispanic"
"married"
"nodegree"
"RE74"
"RE75"
"RE78"
treatment is obviously the treatment variable. The outcome variable is RE78 (earnings in 1978), and RE74 and RE75 are earnings values prior to treatment. The others are potential baseline confounders. Check the link above for more information about the dataset. We want to slice the matrix dat to extract the treatment and outcome variables. Julia uses square brackets for indexing. The first dimension of a matrix is rows and the second is columns, as in R. If you want everything along one dimension, you put : (whereas in R you can leave that dimension empty). You can index with booleans, integers, vectors of integers, ranges, and some other things. Here are some examples:
#we know column 1 is treatment, so get it directly, and ask for all rows with :
treatment = dat[:, 1]
445-element Array{Float64,1}:
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
⋮
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
#suppose instead we want to find the position in colnames that has the value
#"treatment", but we don't know it's the first one. There are a couple of ways
#to do that.
#(.== and operators starting with `.` in general indicate that we want to do
# an element-wise operation)
#(the ; at the end of the line suppresses output)
treatment_take2 = dat[:, colnames .== "treatment"];
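Another option is to look up the column's integer index first. This sketch assumes the findfirst(collection, value) method available in the version of Julia this guide was written for:
#findfirst returns the index of the first element of colnames equal to "treatment"
treatment_col = findfirst(colnames, "treatment")
treatment_take3 = dat[:, treatment_col];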
#the last column is the outcome so we can use the keyword `end`
outcome = dat[:, end]
#we can also use `end` in arithmetic, e.g.
outcome_prebaseline = dat[:, end-1]
445-element Array{Float64,1}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
⋮
9160.69
9210.45
9311.94
9319.44
10033.9
10598.7
10857.2
12357.2
13371.3
16341.2
16946.6
23032.0
#baseline covariates are everything between the first and last column, so we could use a range:
baseline_covars = dat[:, 2:end-1]
445x8 Array{Float64,2}:
37.0 11.0 1.0 0.0 1.0 1.0 0.0 0.0
22.0 9.0 0.0 1.0 0.0 1.0 0.0 0.0
30.0 12.0 1.0 0.0 0.0 0.0 0.0 0.0
27.0 11.0 1.0 0.0 0.0 1.0 0.0 0.0
33.0 8.0 1.0 0.0 0.0 1.0 0.0 0.0
22.0 9.0 1.0 0.0 0.0 1.0 0.0 0.0
23.0 12.0 1.0 0.0 0.0 0.0 0.0 0.0
32.0 11.0 1.0 0.0 0.0 1.0 0.0 0.0
22.0 16.0 1.0 0.0 0.0 0.0 0.0 0.0
33.0 12.0 0.0 0.0 1.0 0.0 0.0 0.0
19.0 9.0 1.0 0.0 0.0 1.0 0.0 0.0
21.0 13.0 1.0 0.0 0.0 0.0 0.0 0.0
18.0 8.0 1.0 0.0 0.0 1.0 0.0 0.0
⋮ ⋮
29.0 9.0 1.0 0.0 0.0 1.0 9268.94 9160.69
28.0 9.0 1.0 0.0 1.0 1.0 10222.4 9210.45
30.0 11.0 1.0 0.0 1.0 1.0 0.0 9311.94
25.0 10.0 1.0 0.0 1.0 1.0 13520.0 9319.44
28.0 11.0 1.0 0.0 1.0 1.0 824.389 10033.9
22.0 10.0 0.0 0.0 0.0 1.0 27864.4 10598.7
44.0 9.0 1.0 0.0 1.0 1.0 12260.8 10857.2
21.0 9.0 1.0 0.0 0.0 1.0 31886.4 12357.2
28.0 11.0 1.0 0.0 0.0 1.0 17491.5 13371.3
29.0 9.0 0.0 1.0 0.0 1.0 9594.31 16341.2
25.0 9.0 1.0 0.0 1.0 1.0 24731.6 16946.6
22.0 10.0 0.0 0.0 1.0 1.0 25720.9 23032.0
Using DataFrame
The readtable function returns a DataFrame, which allows for indexing by column names.
using DataFrames #`using` is like `library` in R
dsfname = joinpath(Pkg.dir("TargetedLearning"), "examples", "data", "lalonde_dw.csv")
df = readtable(dsfname)
#check the column names:
names(df)
10-element Array{Symbol,1}:
:treatment
:age
:education
:black
:hispanic
:married
:nodegree
:RE74
:RE75
:RE78
Data frames in Julia are indexed by symbols instead of strings. Symbols are not entirely unlike strings, and are created in Julia with :symbolname or Symbol("symbolname").
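For example, these two spellings produce the same symbol:
#the two ways of writing a symbol are equivalent
:treatment == Symbol("treatment")
true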
Now let's get the treatment and outcome variables out of df.
treatment = df[:treatment]
outcome = df[:RE78]
445-element DataArray{Float64,1}:
9930.05
3595.89
24909.5
7506.15
289.79
4056.49
0.0
8472.16
2164.02
12418.1
8173.91
17094.6
0.0
⋮
0.0
1239.84
3982.8
0.0
0.0
7094.92
12359.3
0.0
0.0
16900.3
7343.96
5448.8
We see that variables in a data frame are actually DataArrays instead of regular Julia Arrays. The functions in TargetedLearning.jl currently only work with Arrays of floating-point numbers, so we'll convert them.
treatment = convert(Array{Float64}, treatment)
outcome = convert(Array{Float64}, outcome)
typeof(outcome)
Array{Float64,1}
We can also index into the data frame using ranges, just like regular matrices (but we only index on columns). Let's get the baseline covariates. When you get more than one column out of a data frame, you get back another data frame. For whatever reason, convert won't work for us here, but array will get us what we need (a Julia Array instead of a DataFrame).
baseline_covars = array(df[2:end-1])
445x8 Array{Real,2}:
37 11 1 0 1 1 0.0 0.0
22 9 0 1 0 1 0.0 0.0
30 12 1 0 0 0 0.0 0.0
27 11 1 0 0 1 0.0 0.0
33 8 1 0 0 1 0.0 0.0
22 9 1 0 0 1 0.0 0.0
23 12 1 0 0 0 0.0 0.0
32 11 1 0 0 1 0.0 0.0
22 16 1 0 0 0 0.0 0.0
33 12 0 0 1 0 0.0 0.0
19 9 1 0 0 1 0.0 0.0
21 13 1 0 0 0 0.0 0.0
18 8 1 0 0 1 0.0 0.0
⋮ ⋮
29 9 1 0 0 1 9268.94 9160.69
28 9 1 0 1 1 10222.4 9210.45
30 11 1 0 1 1 0.0 9311.94
25 10 1 0 1 1 13520.0 9319.44
28 11 1 0 1 1 824.389 10033.9
22 10 0 0 0 1 27864.4 10598.7
44 9 1 0 1 1 12260.8 10857.2
21 9 1 0 0 1 31886.4 12357.2
28 11 1 0 0 1 17491.5 13371.3
29 9 0 1 0 1 9594.31 16341.2
25 9 1 0 1 1 24731.6 16946.6
22 10 0 0 1 1 25720.9 23032.0
Formulas
One nice thing about DataFrames.jl is that it has support for formulas. They are similar to R's formulas, but there are some differences. Some packages, like GLM.jl, take data frames with formulas as input. Those packages can be used for computing initial estimates, but TargetedLearning.jl does not support formulas and DataFrames directly. However, you can use DataFrames.jl's functionality to take a DataFrame and a formula and get a numeric design matrix based on that formula.
For example, suppose we'd like to include an age-squared term and an interaction term between marital status and having no degree. Polynomial terms don't appear to be implemented currently, so you'll have to create those terms manually.
#create age squared
df[:age2] = df[:age] .* df[:age];
#if you want to suppress the intercept, add -1 to the formula
# by default the first column will be a column of 1s, which is what we want here.
fm = treatment ~ age + age2 + education + black + hispanic + married + nodegree + RE74 + RE75 + married&nodegree
Formula: treatment ~ age + age2 + education + black + hispanic + married + nodegree + RE74 + RE75 + married & nodegree
#take the field named `m` from the created ModelMatrix object
expanded_baseline_covars = ModelMatrix(ModelFrame(fm, df)).m
445x11 Array{Float64,2}:
1.0 37.0 1369.0 11.0 1.0 0.0 1.0 1.0 0.0 0.0 1.0
1.0 22.0 484.0 9.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
1.0 30.0 900.0 12.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 27.0 729.0 11.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
1.0 33.0 1089.0 8.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
1.0 22.0 484.0 9.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
1.0 23.0 529.0 12.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 32.0 1024.0 11.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
1.0 22.0 484.0 16.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 33.0 1089.0 12.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1.0 19.0 361.0 9.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
1.0 21.0 441.0 13.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 18.0 324.0 8.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
⋮ ⋮ ⋮
1.0 29.0 841.0 9.0 1.0 0.0 0.0 1.0 9268.94 9160.69 0.0
1.0 28.0 784.0 9.0 1.0 0.0 1.0 1.0 10222.4 9210.45 1.0
1.0 30.0 900.0 11.0 1.0 0.0 1.0 1.0 0.0 9311.94 1.0
1.0 25.0 625.0 10.0 1.0 0.0 1.0 1.0 13520.0 9319.44 1.0
1.0 28.0 784.0 11.0 1.0 0.0 1.0 1.0 824.389 10033.9 1.0
1.0 22.0 484.0 10.0 0.0 0.0 0.0 1.0 27864.4 10598.7 0.0
1.0 44.0 1936.0 9.0 1.0 0.0 1.0 1.0 12260.8 10857.2 1.0
1.0 21.0 441.0 9.0 1.0 0.0 0.0 1.0 31886.4 12357.2 0.0
1.0 28.0 784.0 11.0 1.0 0.0 0.0 1.0 17491.5 13371.3 0.0
1.0 29.0 841.0 9.0 0.0 1.0 0.0 1.0 9594.31 16341.2 0.0
1.0 25.0 625.0 9.0 1.0 0.0 1.0 1.0 24731.6 16946.6 1.0
1.0 22.0 484.0 10.0 0.0 0.0 1.0 1.0 25720.9 23032.0 1.0
It's clunky, but it gets the job done. More detailed documentation is available in the DataFrames.jl documentation.
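If you expand formulas against data frames more than once, a small helper keeps the incantation in one place (a sketch; design_matrix is just a name made up here, not part of any package):
#hypothetical helper: expand a formula against a data frame and return the numeric design matrix
design_matrix(fm::Formula, df::DataFrame) = ModelMatrix(ModelFrame(fm, df)).m

expanded_baseline_covars = design_matrix(fm, df);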
Missing data and categorical data
DataFrames.jl and DataArrays.jl have ways of handling both missing data and categorical data. TargetedLearning.jl does not, so you'll have to deal with those issues ahead of time. The documentation for those packages has more information on both.
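For instance, with the DataFrames.jl API current when this was written, something along these lines would drop incomplete rows and mark a column as categorical before converting to plain arrays (a sketch; check the DataFrames.jl documentation for the exact functions in your version):
#keep only rows with no NA values (modifies df in place)
complete_cases!(df)
#treat a column as categorical (a PooledDataArray), e.g. if you wanted a formula to expand it into indicator columns
pool!(df, :education)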
TMLE for the average treatment effect
To use TargetedLearning.jl, we first need to scale the outcome to be between 0 and 1.
scaled_outcome = (outcome .- minimum(outcome)) ./ (maximum(outcome) - minimum(outcome))
extrema(scaled_outcome)
(0.0,1.0)
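In symbols, the scaling used here is the linear map $Y^* = (Y - \min(Y)) / (\max(Y) - \min(Y))$. Because the map is linear and the ATE is a difference of means, the additive shift by $\min(Y)$ cancels, which is why multiplying an estimate by $\max(Y) - \min(Y)$ later is enough to return to the original earnings scale.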
We now need to estimate $\bar{Q}_0$ and $g_0$. We'll use logistic regression for both, regressing on the columns created using the formula above.
using TargetedLearning
n = size(expanded_baseline_covars, 1)
#estimate \bar{Q} by regressing on baseline covars and treatment
Qfit = lreg([expanded_baseline_covars treatment], scaled_outcome)
#compute estimates of logit(\bar{Q}_n(a,W)) for a=1 and a=0
#linpred returns the linear part of the prediction, which is in the logit scale for logistic regression
logitQnA1 = linpred(Qfit, [expanded_baseline_covars ones(n)])
logitQnA0 = linpred(Qfit, [expanded_baseline_covars zeros(n)])
#estimate g by regressing treatment on W
gfit = lreg(expanded_baseline_covars, treatment)
#compute estimated probabilities of treatment
gn1 = predict(gfit, expanded_baseline_covars);
Given initial estimates $\bar{Q}_n$ and $g_n$, we just need to call the tmle function:
scaled_estimate = tmle(logitQnA1, logitQnA0, gn1, treatment, scaled_outcome, param=ATE(), weightedfluc=false)
Estimate
Estimate Std. Error Lower 95% CL Upper 95% CL
ATE 0.0271417 0.0112603 0.00507195 0.0492115
The tmle function returns an estimate along with an estimated variance based on the efficient influence curve (EIC). It's important to note that this variance estimate is asymptotically consistent if both $\bar{Q}_n$ and $g_n$ are consistent, and conservative if $g_n$ is consistent but $\bar{Q}_n$ is not. If $g_n$ is not consistent, the EIC-based variance estimate is not reliable.
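For reference, the usual influence-curve-based standard error is computed from the sample variance of the estimated EIC (this is the standard TMLE construction; the package's exact implementation may differ in small details): $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \hat{IC}(O_i)^2$, with 95% confidence interval $\psi_n \pm 1.96\sqrt{\hat{\sigma}^2/n}$, where $\hat{IC}$ is the estimated EIC and $O_i = (W_i, A_i, Y_i)$ is the $i$-th observation.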
The outcome in the original data set represented earnings, but this estimate is for the ATE of an outcome scaled between 0 and 1. To get it back to the original scale, we can just multiply by the scaling factor.
estimate_orig_scale = scaled_estimate * (maximum(outcome) - minimum(outcome))
Estimate
Estimate Std. Error Lower 95% CL Upper 95% CL
psi 1636.86 679.085 305.879 2967.84
After transforming the original estimate, the estimate is named "psi" instead of "ATE". To make the output look nicer, we can rename it:
name!(estimate_orig_scale, "ATE on earnings")
Estimate
Estimate Std. Error Lower 95% CL Upper 95% CL
ATE on earnings 1636.86 679.085 305.879 2967.84
CTMLE for the average treatment effect
To use CTMLE to estimate the ATE, we'll use the same initial estimate of $\mbox{logit}\bar{Q}_n$ as above, stored in logitQnA1 and logitQnA0. We'll also use the same covariates created by expanding the original variables in the data set via a formula, with a first column of ones.
First, we create a CV plan that we can reuse across multiple calls to ctmle, so results don't change because of different CV folds. By default, the CV plan is generated inside the ctmle function, and it can differ between runs if you don't reset the random seed before each one.
srand(20) #set the random seed
#collect the fold index sets from the StratifiedKfold object
cvplan = collect(StratifiedKfold(treatment, 10)); #10-fold CV stratified by treatment
ctmle_est = ctmle(logitQnA1, logitQnA0, expanded_baseline_covars, treatment, scaled_outcome, cvplan=cvplan)
Estimate
Estimate Std. Error Lower 95% CL Upper 95% CL
ATE 0.027733 0.0108727 0.00642292 0.0490431
To see what covariates were added in fluctuations, call flucinfo on the estimate:
flucinfo(ctmle_est)
Fluctuation info:
Covariates added in 1 steps to 1 fluctuations.
Fluc 1 covars added: [1]
In this case, there is only one fluctuation, and $g_0$ is only estimated using the intercept. As above, we can rescale the estimate back to the original scale.
scaled_estimate * (maximum(outcome) - minimum(outcome))
Estimate
Estimate Std. Error Lower 95% CL Upper 95% CL
psi 1636.86 679.085 305.879 2967.84
Let's see what happens with an unadjusted initial $\bar{Q}_n$. Here we'll just use the mean of the outcome by treatment.
unadjusted_logitQnA1 = fill(logit(mean(scaled_outcome[treatment.==1])), n)
unadjusted_logitQnA0 = fill(logit(mean(scaled_outcome[treatment.==0])), n)
unadjusted_ctmle_est = ctmle(unadjusted_logitQnA1,
unadjusted_logitQnA0,
expanded_baseline_covars,
treatment,
scaled_outcome,
cvplan=cvplan)
Estimate
Estimate Std. Error Lower 95% CL Upper 95% CL
ATE 0.0275032 0.0115604 0.00484526 0.0501612
flucinfo(unadjusted_ctmle_est)
Fluctuation info:
Covariates added in 10 steps to 3 fluctuations.
Fluc 1 covars added: [1,8,10,2,11]
Fluc 2 covars added: [6,4,9,7]
Fluc 3 covars added: [5]
So here we see three fluctuations, with 10 of the 11 possible covariates (including the intercept) used to estimate $g_0$. The estimate is nearly the same as the previous one, with a slightly larger standard error.