Data Management for Data Doubles

Running the Data Doubles project can be a challenge at times because the team has to coordinate activities across eight different universities. Some of the challenge is university bureaucracy (doing that many IRB applications!), but some of it is internal to keeping the project running smoothly. One of the keys to Data Doubles’ success is good data management: making sure that we all use standardized ways of naming, organizing, and processing the data that funnels in from each of the campuses.

Data Doubles team member Kristin Briney is our resident data management expert and leads this work for the project. She uses several strategies to keep the team’s data organized, findable, and computable: data management plans, naming conventions, codes, and a data dictionary.

Data Management Plans

Data Doubles is a three-year, three-phase project, and each phase gets its own data management plan. This may seem like overkill, but it’s necessary because the data collected in each phase is very different and goes through a different workflow. Each data management plan lays out what the data workflow is for that phase, how files will be named, what codes will be used, and where files will be stored. Because all of this is spelled out in one master document, every team member knows what to do with their campus’s data, which makes finding files from any campus easy.

Naming Conventions

Naming conventions are one of the most important parts of managing the Data Doubles project data. By naming files in a consistent way, the team is able to track progress and find specific files quickly. For example, here’s the naming convention we used for phase one data:

  • THEME_SITEID_YYYYMMDD_TYPE
    • THEME is the three-letter code for one of the interview themes
    • SITE is the two-letter code for the institution
    • ID is the interview subject’s ID number; IDs are two-digit numbers from 01 to 03 (because there are only three interviews per theme per site)
    • YYYYMMDD is the date of the interview
    • TYPE is the code for the data type/analysis stage
  • Examples:
    • PRI_IN02_20180222_CaseSummary.pdf
    • PRO_BL03_20180222_Audio.mp3
    • AWA_MK01_20180222_Notes.pdf
    • LLA_LB02_20180222_OriginalTranscript.pdf
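
Because the convention is this regular, file names can be checked and taken apart by a short script. The following is a minimal sketch in Python, not the team's own tooling. The function name parse_filename and the exact pattern are assumptions based only on the examples above (uppercase codes, a letters-only TYPE, a file extension at the end).

```python
import re
from datetime import datetime

# Pattern for THEME_SITEID_YYYYMMDD_TYPE plus a file extension.
# Assumption: codes are uppercase letters and TYPE is letters only,
# as in the examples above; the real plan may allow more.
FILENAME_PATTERN = re.compile(
    r"^(?P<theme>[A-Z]{3})_"      # three-letter theme code
    r"(?P<site>[A-Z]{2})"         # two-letter site code
    r"(?P<subject>0[1-3])_"       # subject ID, 01 to 03
    r"(?P<date>\d{8})_"           # interview date, YYYYMMDD
    r"(?P<type>[A-Za-z]+)"        # data type / analysis stage
    r"\.(?P<ext>[A-Za-z0-9]+)$"   # file extension
)


def parse_filename(name):
    """Return the parts of a phase one file name, or None if it does not fit."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    try:
        # Reject impossible dates such as 20180230.
        parts["date"] = datetime.strptime(parts["date"], "%Y%m%d").date()
    except ValueError:
        return None
    return parts
```

With this sketch, parse_filename("AWA_MK01_20180222_Notes.pdf") returns a dictionary with theme "AWA", site "MK", and subject "01", while a name with subject ID 04 returns None.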

This convention packed a lot of information into the file name! Using the convention consistently helped us track any single interview’s processing from audio recording to transcript to case summary. We could also keep track of interview completion across each interview theme and on each campus. With 15 interviews at each of eight sites (and then multiple files for each interview as it was processed), consistent file naming was critical in phase one.
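
The kind of progress tracking described above could be sketched roughly as follows. This is an illustration in Python, not the team's actual process. It assumes every file name already follows the convention, and it treats three TYPE codes from the example file names (Audio, OriginalTranscript, CaseSummary) as the processing stages. The function names and the file listing are hypothetical.

```python
from collections import defaultdict

# Assumption: these three TYPE codes (taken from the example file names)
# stand in for the processing stages; the real list lives in the plan.
EXPECTED_STAGES = ("Audio", "OriginalTranscript", "CaseSummary")


def stages_by_interview(filenames):
    """Map each interview key (e.g. 'PRI_IN02') to the set of TYPE codes seen."""
    seen = defaultdict(set)
    for name in filenames:
        stem = name.rsplit(".", 1)[0]  # drop the file extension
        # Assumes a well-formed THEME_SITEID_YYYYMMDD_TYPE name.
        theme, site_id, _date, file_type = stem.split("_")
        seen[f"{theme}_{site_id}"].add(file_type)
    return dict(seen)


def missing_stages(filenames):
    """List the expected stages that each interview has not reached yet."""
    return {
        interview: [s for s in EXPECTED_STAGES if s not in types]
        for interview, types in stages_by_interview(filenames).items()
    }


# Hypothetical folder listing, for illustration only.
files = [
    "PRI_IN02_20180222_Audio.mp3",
    "PRI_IN02_20180222_OriginalTranscript.pdf",
    "PRO_BL03_20180222_Audio.mp3",
]
```

For this listing, missing_stages(files) reports that PRI_IN02 still needs a CaseSummary and that PRO_BL03 still needs both an OriginalTranscript and a CaseSummary.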

Codes

You’ll notice that the file names above use codes heavily. Codes are a good compromise: they stay human readable without taking up many characters. In particular, the team regularly uses codes to represent the different campuses (e.g. UW-Madison is “UW”), both in file names and in the survey data. Codes are handy tags for our data (and are always documented in the data management plan).
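
One way to keep codes and their documentation in step is a small lookup table that refuses anything undocumented. The sketch below is in Python and is only an illustration. The "UW" entry is the one code given in this post; the names SITE_CODES and decode_site are hypothetical, and a real table would be filled in from the data management plan.

```python
# Only the "UW" entry comes from the text above; a real table would list
# every code documented in the data management plan.
SITE_CODES = {"UW": "UW-Madison"}


def decode_site(code):
    """Return the institution for a site code, or fail loudly if undocumented."""
    try:
        return SITE_CODES[code]
    except KeyError:
        raise ValueError(f"Undocumented site code: {code!r}") from None
```

Failing on an unknown code, rather than passing it through, means a typo in a file name or survey export surfaces immediately.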

Data Dictionary

The survey data in phase two lives in one giant spreadsheet and the team is using a data dictionary to both process and interpret this data. If you’re not familiar, a data dictionary describes each variable in a spreadsheet and gives further context to that variable. In our case, the data dictionary lists for each survey question:

  • The default Qualtrics question ID
  • The question’s thematic category
  • The assigned question ID (which is both human readable and computable)
  • The question type (e.g., a Yes/No question)
  • The question text

All this information helps team members contextualize the survey data spreadsheet (which is formatted as a header row of question IDs followed by rows and rows of data).
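
To make the structure concrete, here is a minimal sketch of a data dictionary stored as a small CSV and used to look up the context for one spreadsheet column. It is written in Python for illustration. Every column name, ID, category, and question text below is a placeholder, not the project's real dictionary.

```python
import csv
import io

# Hypothetical data dictionary with the five fields listed above.
# Column names, IDs, and question text are placeholders, not the project's.
DICTIONARY_CSV = """qualtrics_id,category,question_id,question_type,question_text
Q1,privacy,priv_01,YesNo,Placeholder yes/no question
Q2,privacy,priv_02,FreeText,Placeholder open-ended question
"""


def load_dictionary(text):
    """Index data dictionary rows by their assigned question ID."""
    return {row["question_id"]: row for row in csv.DictReader(io.StringIO(text))}


def describe(dictionary, question_id):
    """Give one line of context for a column header in the survey spreadsheet."""
    entry = dictionary[question_id]
    return (
        f'{question_id} [{entry["category"]}, {entry["question_type"]}]: '
        f'{entry["question_text"]}'
    )
```

Calling describe(load_dictionary(DICTIONARY_CSV), "priv_01") yields a one-line summary of that column's category, type, and question text.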

Our data dictionary is extra special, though, because we use it to automatically process and clean our survey data in R. Our R code reads in the data dictionary and uses it to assign the proper variable IDs, delete unnecessary variables (i.e., the Qualtrics default outputs), and recode data from text to numbers (e.g., switch “Yes” to 1 and “No” to 0 as documented in the data management plan) based on the question type listed in the data dictionary. Driving these steps from the data dictionary lets us write code that does them automatically instead of doing everything by hand. It also makes adjustments easy: just update the data dictionary!
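
The team's R code is not reproduced here. As a rough illustration of the same idea, the Python sketch below renames, drops, and recodes columns using nothing but a data dictionary. The IDs, question types, and sample rows are made up, and it assumes that any column missing from the dictionary is a default output that should be dropped; the project's actual rule may differ.

```python
# Hypothetical data dictionary rows: default ID, assigned ID, question type.
DATA_DICTIONARY = [
    {"qualtrics_id": "Q1", "question_id": "priv_01", "question_type": "YesNo"},
    {"qualtrics_id": "Q2", "question_id": "priv_02", "question_type": "FreeText"},
]

# Recoding rules keyed by question type. Only the Yes/No mapping comes
# from the text above; other types are left unchanged in this sketch.
RECODES = {"YesNo": {"Yes": 1, "No": 0}}


def clean_rows(rows, dictionary=DATA_DICTIONARY, recodes=RECODES):
    """Rename, drop, and recode survey columns as the data dictionary directs."""
    by_default_id = {entry["qualtrics_id"]: entry for entry in dictionary}
    cleaned = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            entry = by_default_id.get(column)
            if entry is None:
                # Assumption: columns absent from the dictionary are dropped.
                continue
            mapping = recodes.get(entry["question_type"], {})
            # Use the assigned ID as the new column name; recode if a rule exists.
            new_row[entry["question_id"]] = mapping.get(value, value)
        cleaned.append(new_row)
    return cleaned


# Made-up export rows; StartDate is one of the Qualtrics default columns.
raw = [
    {"StartDate": "2019-03-01 10:00:00", "Q1": "Yes", "Q2": "Some text"},
    {"StartDate": "2019-03-02 11:30:00", "Q1": "No", "Q2": ""},
]
```

Here clean_rows(raw) drops StartDate, renames Q1 and Q2 to priv_01 and priv_02, and turns "Yes" and "No" into 1 and 0. Changing a question's ID or type then only requires editing the dictionary, not the function.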

Data Manager

The final necessary piece of the puzzle is having a designated data manager to oversee the development of the data management plans and data conventions, as well as to periodically clean up messy files.

In Conclusion

All of these structures have made a huge difference in managing the mountain of data that the Data Doubles project collects and uses in each project phase. It is only through effective data management that the Data Doubles team has been able to work remotely across eight different campuses.
