In the interest of replication I like to keep a codebook with meta data for each data frame. A data codebook is:
a written or computerized list that provides a clear and comprehensive description of the variables that will be included in the database. Marczyk et al (2010)
I like to document the following attributes of a variable:
- name
- description (label, format, scale, etc)
- source (e.g. World bank)
- source media (url and date accessed, CD and ISBN, or whatever)
- file name of the source data on disk (helps when merging codebooks)
- notes
For example, this is what I am implementing to document the variables in data frame mydata1 with 8 variables:
code.book.mydata1 <- data.frame(variable.name=c(names(mydata1)),
label=c("Label 1",
"State name",
"Personal identifier",
"Income per capita, thousand of US$, constant year 2000 prices",
"Unique id",
"Calendar year",
"blah",
"bah"),
source=rep("unknown",length(mydata1)),
source_media=rep("unknown",length(mydata1)),
filename = rep("unknown",length(mydata1)),
notes = rep("unknown",length(mydata1))
)
I write a different codebook for each data set I read. When I merge data frames I will also merge the relevant aspects of their associated codebook, to document the final database. I do this by essentially copy pasting the code above and changing the arguments.