How to create, structure, maintain and update data codebooks in R?

Asked 17/3, 2011 at 0:3 Answered 14/2, 2020 at 8:47

In the interest of replication I like to keep a codebook with meta data for each data frame. A data codebook is:

a written or computerized list that provides a clear and comprehensive description of the variables that will be included in the database. Marczyk et al (2010)

I like to document the following attributes of a variable:

name

description (label, format, scale, etc)

source (e.g. World bank)

source media (url and date accessed, CD and ISBN, or whatever)

file name of the source data on disk (helps when merging codebooks)

notes

For example, this is what I am implementing to document the variables in data frame mydata1 with 8 variables:

code.book.mydata1 <- data.frame(
    variable.name = names(mydata1),
    label = c("Label 1",
              "State name",
              "Personal identifier",
              "Income per capita, thousands of US$, constant year 2000 prices",
              "Unique id",
              "Calendar year",
              "blah",
              "bah"),
    # placeholders, one per variable, to be filled in later
    source       = rep("unknown", ncol(mydata1)),
    source_media = rep("unknown", ncol(mydata1)),
    filename     = rep("unknown", ncol(mydata1)),
    notes        = rep("unknown", ncol(mydata1))
)

I write a different codebook for each data set I read. When I merge data frames I also merge the relevant parts of their associated codebooks, to document the final database. I do this by essentially copying and pasting the code above and changing the arguments.
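To make the copy-and-paste workflow concrete, here is a minimal sketch of how it could be wrapped in a function and how two codebooks could be combined after a merge. The helper make_codebook, the toy data frames gdp and pop, and the file names are all made up for illustration.

```r
# Hypothetical helper: build a codebook with one row per variable.
make_codebook <- function(data, label, source = "unknown",
                          source_media = "unknown", filename = "unknown",
                          notes = "") {
  stopifnot(length(label) == ncol(data))
  data.frame(
    variable.name = names(data),
    label         = label,
    source        = source,
    source_media  = source_media,
    filename      = filename,
    notes         = notes,
    stringsAsFactors = FALSE
  )
}

# Toy data standing in for two source files (invented values).
gdp <- data.frame(state = c("A", "B"), year = 2000L, income = c(31.2, 28.7))
pop <- data.frame(state = c("A", "B"), year = 2000L, population = c(5.1, 3.4))

cb_gdp <- make_codebook(gdp, c("State name", "Calendar year", "Income per capita"),
                        source = "Source 1", filename = "gdp.csv")
cb_pop <- make_codebook(pop, c("State name", "Calendar year", "Population, millions"),
                        source = "Source 2", filename = "pop.csv")

merged <- merge(gdp, pop, by = c("state", "year"))

# Stack the codebooks, keep the first entry for each shared key variable,
# and order the rows like the columns of the merged data.
cb_merged <- rbind(cb_gdp, cb_pop)
cb_merged <- cb_merged[!duplicated(cb_merged$variable.name), ]
cb_merged <- cb_merged[match(names(merged), cb_merged$variable.name), ]
```

Keeping the first entry for the key variables is a simplification; if the keys come from different sources, both rows may be worth keeping.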

Hasseman answered 17/3, 2011 at 0:3 Comment(1)

A similar question was asked here – Hasseman 18/3, 2011 at 3:27

You could add any special attribute to any R object with the attr function. E.g.:

x <- cars
attr(x,"source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

And see the given attribute in the structure of the object:

> str(x)
'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
 - attr(*, "source")= chr "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

You can also retrieve the attribute with the same attr function:

> attr(x, "source")
[1] "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."

If you only add new cases to your data frame, the attribute is not affected (see str(rbind(x,x))), while altering the structure will erase it (see str(cbind(x,x))).

UPDATE: based on comments

If you want to list all non-standard attributes, check the following:

setdiff(names(attributes(x)),c("names","row.names","class"))

This will list all non-standard attributes (standard are: names, row.names, class in data frames).

Based on that, you could write a short function to list all non-standard attributes and also the values. The following does work, though not in a neat way... You could improve it and make up a function :)

First, define the unique (= non-standard) attributes:

uniqueattrs <- setdiff(names(attributes(x)),c("names","row.names","class"))

And make a matrix which will hold the names and values:

attribs <- matrix(0,0,2)

Loop through the non-standard attributes and save in the matrix the names and values:

# seq_along() avoids iterating over 1:0 when there are no custom attributes
for (i in seq_along(uniqueattrs)) {
    attribs <- rbind(attribs, c(uniqueattrs[i], attr(x,uniqueattrs[i])))
}

Convert the matrix to a data frame and name the columns:

attribs <- as.data.frame(attribs)
names(attribs) <- c('name', 'value')

And save in any format, e.g.:

write.csv(attribs, 'foo.csv')
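The steps above could be folded into one function along these lines. This is only a sketch: list_custom_attrs is a made-up name, the "accessed" attribute is invented for the example, and multi-element attribute values are simply collapsed into one string.

```r
# Hypothetical helper: return the non-standard attributes of an object
# as a two-column data frame (name, value).
list_custom_attrs <- function(x, standard = c("names", "row.names", "class")) {
  custom <- setdiff(names(attributes(x)), standard)
  data.frame(
    name  = custom,
    # Collapse multi-element values so each attribute fills one cell.
    value = vapply(custom,
                   function(a) paste(as.character(attr(x, a, exact = TRUE)),
                                     collapse = "; "),
                   character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

x <- cars
attr(x, "source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_.  Wiley."
attr(x, "accessed") <- "2011-03-17"   # invented attribute for illustration

attribs <- list_custom_attrs(x)
write.csv(attribs, file.path(tempdir(), "foo.csv"), row.names = FALSE)
```

Unlike the loop, this also returns an empty data frame rather than failing when the object has no custom attributes.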

To your question about the variable labels, check the read.spss function from package foreign, as it does exactly what you need: it saves the labels in the attributes of the imported data frame. The main idea is that an attribute can be a vector, a data frame or any other object, so you do not need to make a separate attribute for every variable; make only one (e.g. named "variable.labels") and save all the information there. You could then call attr(x, "variable.labels")['foo'], where 'foo' stands for the required variable name. But check the function cited above and also the imported data frames' attributes for more details.
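A minimal sketch of that single-attribute idea, using the built-in cars data. The label texts are taken from the cars help page; the attribute name "variable.labels" is just the convention mentioned above.

```r
x <- cars

# One named character vector holds the labels for all variables.
attr(x, "variable.labels") <- c(
  speed = "Speed (mph)",
  dist  = "Stopping distance (ft)"
)

# Look up a single label by variable name.
attr(x, "variable.labels")["dist"]

# Turn the labels into a small codebook data frame.
labels <- attr(x, "variable.labels")
codebook <- data.frame(
  variable.name = names(labels),
  label         = unname(labels),
  stringsAsFactors = FALSE
)
```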

I hope these could help you to write the required functions in a lot neater way than I tried above! :)

Nea answered 17/3, 2011 at 0:11 Comment(8)

@Nea Cool! Many thanks! If I may follow up: How would one modify your statement attr(x, "source") such that it prints both the attribute name (source) and the attribute value ("Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.") side by side and export to .csv file – Hasseman 17/3, 2011 at 0:22

One more thing, attr appears to be data frame specific not variable specific. Useful for adding a source, source media, filename and other attributes common to all variables in the data frame but not so obvious how to add variable specific label and notes without adding one attribute per variable (e.g. attr(x,"var1_label") <- "A label" and so on). Maybe that is ok... – Hasseman 17/3, 2011 at 0:30

@Fred: I added more details to my answer, I hope that could be helpful to you. As being an autodidact R learner with limited knowledge, my answer will not fulfill all your needs, but I hope that it could get you closer to the goal. – Nea 17/3, 2011 at 0:57

You can actually add attributes to each variable in a data frame: attr(x$var1, "foo") <- "label" – Westward 17/3, 2011 at 0:59

Thanks for UPDATE daroczig. This is very useful. It certainly helps me think in a more structured way about the sort of function I need and some possibilities. – Hasseman 17/3, 2011 at 1:4

Using attr has one definite advantage: data and meta data are linked together. But will have to test how stable/convenient this link is versus having a separate meta data frame as I do now. I am newbie in R so don't have a prior either way right now. Others more experienced may have more insight on this. – Hasseman 17/3, 2011 at 1:10

@Fred: by linking data and metadata together, saving the R object via save instead of third-party formats will guarantee that your metadata is accessible at all times. Though if your colleagues do not work in R, there is no gain. In that case I would collect all metadata in a database outside of any stat. software (e.g. in an online groupware solution) and always import those data in parallel with the exact datasets. But anyway: does it make sense in this situation to link (or even load) the metadata into any stat. software? It may only be interesting to load the metadata while writing the essay. – Nea 17/3, 2011 at 23:56

@Nea Good point. But, for better or worse, R is becoming my only tool for creating, documenting, and sharing data, including meta data as CSV files. – Hasseman 18/3, 2011 at 4:0

A more advanced version would be to use S4 classes. For example, in Bioconductor the ExpressionSet class is used to store microarray data together with its associated experimental metadata.

The MIAME object described in Section 4.4 looks very similar to what you are after:

experimentData <- new("MIAME", name = "Pierre Fermat",
          lab = "Francis Galton Lab", contact = "[email protected]",
          title = "Smoking-Cancer Experiment", abstract = "An example ExpressionSet",
          url = "www.lab.not.exist", other = list(notes = "Created from text files"))

Zolly answered 17/3, 2011 at 13:19 Comment(1)

There is now also the memisc package, which appears to implement just this: S4 classes for survey and codebook metadata. – Dagenham 9/8, 2016 at 18:55

The comment() function might be useful here. It can set and query a comment attribute on an object, and has the advantage over normal attributes of not being printed.

dat <- data.frame(A = 1:5, B = 1:5, C = 1:5)
comment(dat$A) <- "Label 1"
comment(dat$B) <- "Label 2"
comment(dat$C) <- "Label 3"
comment(dat) <- "data source is, sampled on 1-Jan-2011"

which gives:

> dat
  A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
> dat$A
[1] 1 2 3 4 5
> comment(dat$A)
[1] "Label 1"
> comment(dat)
[1] "data source is, sampled on 1-Jan-2011"

Example of merging:

> dat2 <- data.frame(D = 1:5)
> comment(dat2$D) <- "Label 4"
> dat3 <- cbind(dat, dat2)
> comment(dat3$D)
[1] "Label 4"

but that loses the comment on dat:

> comment(dat3)
NULL

so those sorts of operations would need handling explicitly. To truly do what you want, you'll probably need to write special versions of the functions you use, so that they maintain the comments/metadata during extraction/merge operations. Alternatively, you might want to look into producing your own class of objects, say a list with a data frame and other components holding the metadata, and then write methods for the functions you want that preserve the metadata.

An example along these lines is the zoo package, which builds a time series object that carries the ordering and time/date info alongside the data, but still works like a normal object from the point of view of subsetting etc., because the authors have provided methods for functions like [.
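As a rough sketch of the "own class" idea, assuming a made-up S3 class called documented that is a list with a data frame and a codebook. The function combine is hypothetical too, and only column subsetting and column-wise combination are handled.

```r
# Hypothetical S3 class: a list holding a data frame and its codebook.
documented <- function(data, codebook) {
  stopifnot(identical(names(data), codebook$variable.name))
  structure(list(data = data, codebook = codebook), class = "documented")
}

# Column subsetting that keeps the codebook in step with the data.
"[.documented" <- function(x, j) {
  data <- x$data[, j, drop = FALSE]
  codebook <- x$codebook[match(names(data), x$codebook$variable.name), ]
  documented(data, codebook)
}

# Column-wise combination that also stacks the two codebooks.
combine <- function(a, b) {
  documented(cbind(a$data, b$data), rbind(a$codebook, b$codebook))
}

dat <- documented(
  data.frame(A = 1:5, B = 1:5),
  data.frame(variable.name = c("A", "B"), label = c("Label 1", "Label 2"),
             stringsAsFactors = FALSE)
)
dat2 <- documented(
  data.frame(D = 1:5),
  data.frame(variable.name = "D", label = "Label 4", stringsAsFactors = FALSE)
)

dat3 <- combine(dat, dat2)
part <- dat3[c("A", "D")]
part$codebook$label   # labels follow the selected columns
```

A real implementation would need more methods (print, merge, row subsetting and so on), which is where most of the work lies.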

Quench answered 17/3, 2011 at 15:25 Comment(1)

Thanks! That merging will lose the comment on dat is to be expected, even desired: the merged data may have 2 different sources. One angle of attack is to approach it like melt, that is, populating variable-level comments with data frame-level comments before merging. In other software I have atomized the documentation down to the observation level, which is useful when a record for a unit has been spliced or gap-filled with records from other sources (but generally this is overkill). – Hasseman 17/3, 2011 at 15:35
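A sketch of that "push the comment down before merging" idea, assuming a made-up helper push_down_comment and invented source strings:

```r
# Hypothetical helper: append the data-frame-level comment to the comment
# of every column, so the source travels with each variable through cbind().
push_down_comment <- function(data) {
  src <- comment(data)
  if (is.null(src)) return(data)
  for (v in names(data)) {
    comment(data[[v]]) <- c(comment(data[[v]]), src)
  }
  data
}

dat <- data.frame(A = 1:5, B = 1:5)
comment(dat$A) <- "Label 1"
comment(dat) <- "source X"      # invented source string

dat2 <- data.frame(D = 1:5)
comment(dat2$D) <- "Label 4"
comment(dat2) <- "source Y"     # invented source string

dat3 <- cbind(push_down_comment(dat), push_down_comment(dat2))
comment(dat3$A)   # variable label followed by its source
comment(dat3$D)
```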

As of 2020, there are R packages directly dedicated to codebooks that may fit your needs.

The codebook package is a comprehensive package that can generate codebooks (with common attributes plus descriptive statistics) in different formats. It has a website and a paper (Arslan, 2019, How to Automatically Document Data With the codebook Package to Facilitate Data Reuse). The paper also has, in Figure 1, a comparison of different approaches.
Here is an example.
The dataspice package (featured by rOpenSci) is particularly dedicated to generating metadata that can be found by search engines on the web. It has a website.
Here is an example.
The dataMaid package can generate a report containing metadata and descriptive statistics, and it can perform certain checks. It's on CRAN and GitHub, and it has a JSS paper (Petersen & Ekstrøm, 2019, dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R).
Here is an example.
The memisc package has a lot of functionality for working with survey data and also comes with a codebook function. It has a website.
Here is an example.
There is also a blog post by Marta Kołczyńska with a lightweight function that generates a data frame with metadata (which can be exported, e.g., to an Excel file).
Here is an example.

Sgraffito answered 14/2, 2020 at 8:47 Comment(0)

How I do this is a little different and markedly less technical. I generally follow the guiding principle that if text is not designed to be meaningful to the computer and only meaningful to humans, then it belongs in comments in the source code.

This may feel rather "low tech" but there are some good reasons to do this:

When someone else picks up your code in the future, it is intuitive that the comments are unambiguously intended for them to read. Parameters set in unusual places within data structures may not be obvious to the future user.
Keeping track of parameters set inside of abstract objects requires a fair bit of discipline. Creating code comments requires discipline as well, but the absence of the comment is immediately obvious. If descriptions are being carried along as part of the object, glancing at the code does not make this obvious. The code then becomes less "literate" in the "literate programming" sense of the word.
Carrying descriptions of the data inside the data object can easily result in descriptions that are incorrect. This can happen if, for example, a column containing a measurement in kg is multiplied by 2.2 to convert the units to pounds. It would be very easy to overlook the need to update the metadata.
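
The last point is easy to reproduce. A small sketch, assuming the unit is stored in an attribute called "unit" (an invented convention) on made-up data:

```r
weights <- data.frame(id = 1:3, weight = c(60, 72.5, 81))
attr(weights$weight, "unit") <- "kg"

# Convert to pounds. Arithmetic keeps the attributes of its operand,
# so the unit label is carried over unchanged.
weights$weight <- weights$weight * 2.2

attr(weights$weight, "unit")   # still "kg", although the values are now pounds

# The metadata has to be updated by hand.
attr(weights$weight, "unit") <- "lb"
```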

Obviously there are some real advantages to carrying metadata along with the objects. And if your workflow makes the above points less germane, then it may make a lot of sense to create a metadata attachment to your data structure. My intent was only to share some reasons why a "lower tech" comment based approach might be considered.

Khaki answered 17/3, 2011 at 15:28 Comment(4)

Thanks! These are all good points. I don't document all my work so carefully, but for large collaborative projects, for publication, etc. it is useful. Some of my collaborators would not touch R with a 6ft pole. Having codebooks and data in flat files helps collaborative work. Finally, in Aremos I would document the raw data. Any data created from there is automatically labeled with the formula used to create it, so you can always go back up the chain and see what you have (e.g. creating y <- x*z would create a label field for y that reads "y <- x*z"). – Hasseman 17/3, 2011 at 15:46

That self documenting feature of Aremos is pretty neat. I didn't realize it did that. Everything old is new again! My answer was certainly not an answer to your question and was more of a "something to consider" comment. Thanks for taking it in that context. – Khaki 17/3, 2011 at 15:55

I guess your approach might work better in combination with Sweave, in the sense that any relevant comments in your code ought to be reflected in the final document. The advantage here is that other collaborators don't need to read the R script. The disadvantage is that the LaTeX doc is usually prepared at the end of the process, whereas documentation starts from the beginning. So codebook + Sweave might be ideal (if laborious)... – Hasseman 17/3, 2011 at 16:57

BTW raw data files and the R script may work as replication files. But having replicated some published work I find authors only make available their final "analysis" database. Replicating the latter is nearly impossible as most original data providers don't have a version control system. Moreover, the analysis data is typically put together by some RA whose scripts are long lost. All the author has is the analysis script - with little info on where data came from - and data that is often poorly documented if at all. Not sure putting those two together counts as replication. – Hasseman 18/3, 2011 at 4:5
