Data Transformation with dplyr

May 18, 2020
dplyr

So it’s been about more than two months since my fist blog post. The lockdown during this coronavirus kept me a little bit busy emotionally at the beginning but I’m now already used to this normality and enjoying the current state. The lockdown has been great for productivity! I learned how to use the dplyr() package, did the Statistical Inferece course from the Data Science specialization on Coursera and now I’m about to finish Regression Models from Coursera as well.

In this post, I’ll talk about what I’ve learned from the dplyr package. The Statistical Inferece and Regression Models course are more about stats than R, so maybe I’ll leave that for another post.

I first learned about dplyr in the Getting and Cleaning Data on Coursera. However, in my opinion, the course doesn’t provided enough exercises to really internalize and digest the different functions in dplyr. I ended up picking up R for Data Science (Wickham & Grolemund, 2016). The book is pretty good for learning the tools for data science and it provides a good amount of exercises to practice. I can really recommend using this book as a complement to the course on Coursera.

Dplyr

dplyr is described as “grammar of data manipulation”. Indeed, the dplr package consist of 5 verbs that will help you with data manipulation. As in the book, I’ll use the nycflights13 package to explain the dplyr package. The nycflights13 package contains information about all flights departing from New York in 2013.

library(dplyr)
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.0.2

From the nycflights13 package, we’ll be mainly using the flights data frame. Let’s have a look at it.

flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

fights is a tibble, i.e. a data frame that works better with dplyr. A tibble provides many advantages. First, we observe that when we call flights, it only prints out the first 10 rows of the data frame instead of the whole data frame (we don’t need to have a look at every observation of the data frame). Secondly, a tibble shows the type of each variable.

dplyr has 5 basic verbs: (i)filter(), (ii) arrange(), (iii) select(), (iv) mutate() and (v) summarize().

1. filter()

filter() allows you to filter rows based on the values specified in the argument of the function. So let’s say you want to filter all flights that either departed on women’s international day, March 8th.

filter(flights, month==3, day==8)
## # A tibble: 979 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     3     8        1           2355         6      431            440
##  2  2013     3     8        5           2020       225      226           2318
##  3  2013     3     8        8           2145       143      240             34
##  4  2013     3     8        9           2155       134      238             48
##  5  2013     3     8       11           2230       101      437            312
##  6  2013     3     8       12           2245        87      127           2356
##  7  2013     3     8       13           2140       153      144           2305
##  8  2013     3     8       17           1918       299      237           2216
##  9  2013     3     8       22           2359        23      447            438
## 10  2013     3     8       22           2150       152      258             42
## # ... with 969 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Filtering flights that departed with more than 1 hour delay.

filter(flights, (arr_delay>=60)) %>% 
    print(width=Inf)
## # A tibble: 28,317 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      811            630       101     1047            830
##  2  2013     1     1      848           1835       853     1001           1950
##  3  2013     1     1      957            733       144     1056            853
##  4  2013     1     1     1114            900       134     1447           1222
##  5  2013     1     1     1120            944        96     1331           1213
##  6  2013     1     1     1255           1200        55     1451           1330
##  7  2013     1     1     1301           1150        71     1518           1345
##  8  2013     1     1     1337           1220        77     1649           1531
##  9  2013     1     1     1342           1320        22     1617           1504
## 10  2013     1     1     1356           1259        57     1538           1438
##    arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
##        <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
##  1       137 MQ        4576 N531MQ  LGA    CLT        118      544     6     30
##  2       851 MQ        3944 N942MQ  JFK    BWI         41      184    18     35
##  3       123 UA         856 N534UA  EWR    BOS         37      200     7     33
##  4       145 UA        1086 N76502  LGA    IAH        248     1416     9      0
##  5        78 EV        4495 N16561  EWR    SAV        117      708     9     44
##  6        81 MQ        4601 N518MQ  LGA    BNA        139      764    12      0
##  7        93 MQ        4646 N542MQ  LGA    MSP        170     1020    11     50
##  8        78 B6         673 N636JB  JFK    LAX        352     2475    12     20
##  9        73 EV        3832 N13969  EWR    STL        194      872    13     20
## 10        60 UA          32 N17128  EWR    ORD        135      719    12     59
##    time_hour          
##    <dttm>             
##  1 2013-01-01 06:00:00
##  2 2013-01-01 18:00:00
##  3 2013-01-01 07:00:00
##  4 2013-01-01 09:00:00
##  5 2013-01-01 09:00:00
##  6 2013-01-01 12:00:00
##  7 2013-01-01 11:00:00
##  8 2013-01-01 12:00:00
##  9 2013-01-01 13:00:00
## 10 2013-01-01 12:00:00
## # ... with 28,307 more rows

Filtering flights that have Miami or Atlanta as destination.

filter(flights, dest %in% c("MIA", "ATL")) %>% 
    print(width=Inf)
## # A tibble: 28,943 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      554            600        -6      812            837
##  3  2013     1     1      600            600         0      837            825
##  4  2013     1     1      606            610        -4      858            910
##  5  2013     1     1      606            610        -4      837            845
##  6  2013     1     1      607            607         0      858            915
##  7  2013     1     1      615            615         0      833            842
##  8  2013     1     1      623            610        13      920            915
##  9  2013     1     1      655            700        -5     1002           1020
## 10  2013     1     1      658            700        -2      944            939
##    arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
##        <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
##  1        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
##  2       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
##  3        12 MQ        4650 N542MQ  LGA    ATL        134      762     6      0
##  4       -12 AA        1895 N633AA  EWR    MIA        152     1085     6     10
##  5        -8 DL        1743 N3739P  JFK    ATL        128      760     6     10
##  6       -17 UA        1077 N53442  EWR    MIA        157     1085     6      7
##  7        -9 DL         575 N326NB  EWR    ATL        120      746     6     15
##  8         5 AA        1837 N3EMAA  LGA    MIA        153     1096     6     10
##  9       -18 DL        2003 N997DL  LGA    MIA        161     1096     7      0
## 10         5 DL        1547 N6703D  LGA    ATL        126      762     7      0
##    time_hour          
##    <dttm>             
##  1 2013-01-01 05:00:00
##  2 2013-01-01 06:00:00
##  3 2013-01-01 06:00:00
##  4 2013-01-01 06:00:00
##  5 2013-01-01 06:00:00
##  6 2013-01-01 06:00:00
##  7 2013-01-01 06:00:00
##  8 2013-01-01 06:00:00
##  9 2013-01-01 07:00:00
## 10 2013-01-01 07:00:00
## # ... with 28,933 more rows

For illustration issues, I included %>% print(width=Inf) so that you can see all the columns of the data frame. The line is not necessary to filter the rows and you can leave the part out if you don’t want to have all columns visible.

2. arrange()

arrange allows you to rearrange the order of the rows. So, let’s say you want to arrange the flights with a descending air time:

arrange(flights, desc(arr_time))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1     2209           2155        14     2400           2337
##  2  2013     1     5     2116           2130       -14     2400             18
##  3  2013     1    13     2243           2129        74     2400           2224
##  4  2013     1    16     2138           2107        31     2400           2322
##  5  2013     1    17     2256           2249         7     2400           2357
##  6  2013     1    22     2212           2055        77     2400           2250
##  7  2013     1    22     2249           2125        84     2400           2250
##  8  2013     1    25     2055           1725       210     2400           1933
##  9  2013     1    28     2303           2250        13     2400           2354
## 10  2013     1    30     2155           1915       160     2400           2137
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
arrange(flights, desc(arr_time, dest))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1     2209           2155        14     2400           2337
##  2  2013     1     5     2116           2130       -14     2400             18
##  3  2013     1    13     2243           2129        74     2400           2224
##  4  2013     1    16     2138           2107        31     2400           2322
##  5  2013     1    17     2256           2249         7     2400           2357
##  6  2013     1    22     2212           2055        77     2400           2250
##  7  2013     1    22     2249           2125        84     2400           2250
##  8  2013     1    25     2055           1725       210     2400           1933
##  9  2013     1    28     2303           2250        13     2400           2354
## 10  2013     1    30     2155           1915       160     2400           2137
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Additional columns included in the argument helps to break ties.

3. select()

This function allows you to obtain a subset of the data frame by just selecting the columns you’re interested in. Let’s say you want a subset of the data with only the information about the tail number and the carrier of the flights dataframe.

select(flights, tailnum, carrier)
## # A tibble: 336,776 x 2
##    tailnum carrier
##    <chr>   <chr>  
##  1 N14228  UA     
##  2 N24211  UA     
##  3 N619AA  AA     
##  4 N804JB  B6     
##  5 N668DN  DL     
##  6 N39463  UA     
##  7 N516JB  B6     
##  8 N829AS  EV     
##  9 N593JB  B6     
## 10 N3ALAA  AA     
## # ... with 336,766 more rows

Or if you want to select all columns except of some of them:

select(flights, -(carrier:air_time))
## # A tibble: 336,776 x 13
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 5 more variables: arr_delay <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

You can also use select() in combination with everything() to move some columns at the beginning of the data frame:

select(flights, air_time, distance, hour, everything())
## # A tibble: 336,776 x 19
##    air_time distance  hour  year month   day dep_time sched_dep_time dep_delay
##       <dbl>    <dbl> <dbl> <int> <int> <int>    <int>          <int>     <dbl>
##  1      227     1400     5  2013     1     1      517            515         2
##  2      227     1416     5  2013     1     1      533            529         4
##  3      160     1089     5  2013     1     1      542            540         2
##  4      183     1576     5  2013     1     1      544            545        -1
##  5      116      762     6  2013     1     1      554            600        -6
##  6      150      719     5  2013     1     1      554            558        -4
##  7      158     1065     6  2013     1     1      555            600        -5
##  8       53      229     6  2013     1     1      557            600        -3
##  9      140      944     6  2013     1     1      557            600        -3
## 10      138      733     6  2013     1     1      558            600        -2
## # ... with 336,766 more rows, and 10 more variables: arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, minute <dbl>, time_hour <dttm>

4. mutate()

Mutate allows you to add new columns or variables to the dataframe. The new variable will be inserted at the end of the dataframe. Alternatively, you can use transmute() to only keep the new column created.

mutate(flights, dist_in_km=distance*1.6093) %>% 
    print(width=Inf)
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
##    arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
##        <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
##  1        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
##  2        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
##  3        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
##  4       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
##  5       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
##  6        12 UA        1696 N39463  EWR    ORD        150      719     5     58
##  7        19 B6         507 N516JB  EWR    FLL        158     1065     6      0
##  8       -14 EV        5708 N829AS  LGA    IAD         53      229     6      0
##  9        -8 B6          79 N593JB  JFK    MCO        140      944     6      0
## 10         8 AA         301 N3ALAA  LGA    ORD        138      733     6      0
##    time_hour           dist_in_km
##    <dttm>                   <dbl>
##  1 2013-01-01 05:00:00      2253.
##  2 2013-01-01 05:00:00      2279.
##  3 2013-01-01 05:00:00      1753.
##  4 2013-01-01 05:00:00      2536.
##  5 2013-01-01 06:00:00      1226.
##  6 2013-01-01 05:00:00      1157.
##  7 2013-01-01 06:00:00      1714.
##  8 2013-01-01 06:00:00       369.
##  9 2013-01-01 06:00:00      1519.
## 10 2013-01-01 06:00:00      1180.
## # ... with 336,766 more rows
transmute(flights, dist_in_km=distance*1.6093)
## # A tibble: 336,776 x 1
##    dist_in_km
##         <dbl>
##  1      2253.
##  2      2279.
##  3      1753.
##  4      2536.
##  5      1226.
##  6      1157.
##  7      1714.
##  8       369.
##  9      1519.
## 10      1180.
## # ... with 336,766 more rows

5. summarize()

Finally, we have summarize() which is usually used with the pipe operator %>% and group_by(). The pipe operator allows you to save the intermediate output of the command and use it on the next line. So, instead of creating several different variables that saves the intermediate result, you can use the pipe operator and just write all the commands in a “chain”. You can notice that I’ve already used the pipe operator in some of the code before. It saves the result of the code on the left hand side of the pipe operator and uses this result on the next line. I would recommend you learning the keyboard shortcut for the pipe operator. In windows, it is Ctrl + Shift + M.

By using group_by(), you apply functions on a group basis instead of on the whole data frame. Also, it is important to note that group_by() doesn’t change the data frame at all, it still have the same amount of columns and rows.

Following the examples of the book, let us investigate the average arrival delay of the flights.

summarize(flights, delay=mean(arr_delay, na.rm=TRUE))
## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  6.90

The output we obtain is the average delay over all observaitons. What if we want to know the average arrival delay per day? We’ll need to group the data frame by days!

group_by(flights, year, month, day) %>% 
    summarize(avg_delay_by_day=mean(dep_delay, na.rm=TRUE))
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day avg_delay_by_day
##    <int> <int> <int>            <dbl>
##  1  2013     1     1            11.5 
##  2  2013     1     2            13.9 
##  3  2013     1     3            11.0 
##  4  2013     1     4             8.95
##  5  2013     1     5             5.73
##  6  2013     1     6             7.15
##  7  2013     1     7             5.42
##  8  2013     1     8             2.55
##  9  2013     1     9             2.28
## 10  2013     1    10             2.84
## # ... with 355 more rows

What I am doing with group_by() is grouping the flights dataframe by year, month and day. The last column of this new output shows us the average delay per day instead the average delay over all observations.

We can also calcualte the average arrival delay per month. For this, we would have to group the data by year and month.

group_by(flights, year, month) %>% 
    summarize(avg_delay_by_month=mean(arr_delay, na.rm=TRUE))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
## # A tibble: 12 x 3
## # Groups:   year [1]
##     year month avg_delay_by_month
##    <int> <int>              <dbl>
##  1  2013     1              6.13 
##  2  2013     2              5.61 
##  3  2013     3              5.81 
##  4  2013     4             11.2  
##  5  2013     5              3.52 
##  6  2013     6             16.5  
##  7  2013     7             16.7  
##  8  2013     8              6.04 
##  9  2013     9             -4.02 
## 10  2013    10             -0.167
## 11  2013    11              0.461
## 12  2013    12             14.9

Or maybe you think it makes more sense to group by destination:

group_by(flights, dest) %>% 
    summarize(by_dest=mean(arr_delay, na.rm=TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 105 x 2
##    dest  by_dest
##    <chr>   <dbl>
##  1 ABQ      4.38
##  2 ACK      4.85
##  3 ALB     14.4 
##  4 ANC     -2.5 
##  5 ATL     11.3 
##  6 AUS      6.02
##  7 AVL      8.00
##  8 BDL      7.05
##  9 BGR      8.03
## 10 BHM     16.9 
## # ... with 95 more rows

In this post, I provide a quick overview of dplyr. It naturally takes some time to understand everything and all the commands, but you’ll eventually get it! I can really recommend checking out the book and do the exercises.