Create a long-form dataset with tidyr or reshape2

At least two R packages let you reshape data into long format. This is one step in the process of getting tidy data. Many functions need data in this format, for example ggplot2 when you draw a chart with multiple lines.

When do you need this?

When your data is stacked in wide form, with one column per treatment:

year 1 2
2013 54 65
2014 34 90
2015 89 100

This form violates the first two of Hadley Wickham’s three rules for tidy data:

  1. Each variable forms a column

  2. Each observation forms a row

  3. Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits)

The same data in the best-practice long form:

year treatment result
2013 1 54
2013 2 65
2014 1 34
2014 2 90
2015 1 89
2015 2 100

Now, every variable has its column, every observation is a row.
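
As a rough illustration of why the long form matters for ggplot2, the sketch below draws one line per treatment from the long table above. The data frame is typed in by hand from that table; the object names data_long and p are only illustrative.

```r
library(ggplot2)

# The long-form table from above, entered by hand
data_long <- data.frame(
  year      = c(2013, 2013, 2014, 2014, 2015, 2015),
  treatment = c(1, 2, 1, 2, 1, 2),
  result    = c(54, 65, 34, 90, 89, 100)
)

# One column per aesthetic: year on x, result on y, treatment as colour.
# factor() makes treatment discrete, so ggplot2 draws one line per treatment.
p <- ggplot(data_long, aes(x = year, y = result, colour = factor(treatment))) +
  geom_line() +
  scale_x_continuous(breaks = unique(data_long$year))

print(p)
```

With the wide table, each treatment column would need its own geom_line() call; in long form a single call covers any number of treatments.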

Two packages to get long-form data

  1. reshape2
  2. tidyr


The central function in reshape2 is called melt()

Example with dataset above:

library(reshape2)
data_long <- melt(data, id.vars = "year", measure.vars = c("1", "2"), variable.name = "treatment", value.name = "result")

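id.vars: which columns identify an observation and stay as they are
measure.vars: which columns are melted into one column
variable.name: what the new column holding the old column names is called (default: variable)
value.name: what the value column is called (default: value)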
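
A minimal runnable sketch of melt() on the example table, in case you want to try it. Building the data frame by hand and using check.names = FALSE are assumptions added here, so that the columns keep the names "1" and "2" instead of becoming "X1" and "X2". The final sort is optional and only reproduces the row order of the long table shown earlier.

```r
library(reshape2)

# The wide example table; check.names = FALSE keeps "1" and "2" as column names
data <- data.frame(
  year = c(2013, 2014, 2015),
  `1`  = c(54, 34, 89),
  `2`  = c(65, 90, 100),
  check.names = FALSE
)

# Name variable.name and value.name explicitly: value.name comes after "..."
# in melt()'s argument list, so it cannot be matched by position.
data_long <- melt(
  data,
  id.vars       = "year",
  measure.vars  = c("1", "2"),
  variable.name = "treatment",
  value.name    = "result"
)

# melt() stacks the measure columns one after another, so sort by year
# to get the same row order as the long table above
data_long <- data_long[order(data_long$year, data_long$treatment), ]
print(data_long)
```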

Learn more about reshape2 in a tutorial: An Introduction to reshape2


The corresponding function in tidyr is gather()

library(tidyr)
data_long <- gather(data, treatment, result, `1`:`2`)
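
A self-contained sketch of gather() on the example table. The first argument is the data frame, followed by the name of the new key column, the name of the new value column, and the columns to gather. The data frame construction is added here only so the example runs on its own.

```r
library(tidyr)

# The wide example table; check.names = FALSE keeps "1" and "2" as column names
data <- data.frame(
  year = c(2013, 2014, 2015),
  `1`  = c(54, 34, 89),
  `2`  = c(65, 90, 100),
  check.names = FALSE
)

# key   = new column that holds the old column names ("1", "2")
# value = new column that holds the cell values
# -year = gather every column except year
data_long <- gather(data, key = "treatment", value = "result", -year)
print(data_long)
```

Like melt(), gather() stacks the gathered columns one after another, so all rows for treatment 1 come before those for treatment 2.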

Suppose you have 5 different treatments and the header row of your stacked data looks like this:

year 1 2 3 4 5

The code is then:

data_long <- gather(data, treatment, result, `1`:`5`)
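
A small sketch of the five-treatment case. The values in wide5 are placeholders invented purely so the code runs; they are not real results. It also shows that selecting the range `1`:`5` and excluding the id column with -year pick the same columns here.

```r
library(tidyr)

# Placeholder values, invented only to make the example runnable
wide5 <- data.frame(
  year = c(2013, 2014),
  `1` = c(1, 6), `2` = c(2, 7), `3` = c(3, 8), `4` = c(4, 9), `5` = c(5, 10),
  check.names = FALSE
)

# Select the treatment columns as a range; backticks are needed because
# the column names are not valid R names
long_range <- gather(wide5, key = "treatment", value = "result", `1`:`5`)

# Alternative: gather everything except the id column
long_minus <- gather(wide5, key = "treatment", value = "result", -year)

print(long_range)
```

Excluding the id column tends to be more convenient when the number of treatment columns changes between datasets.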

A tutorial for tidyr: Data Processing with dplyr & tidyr

Both packages are by Hadley Wickham and can do much more than turn stacked data into long form.