Data Transformation

Session 2b

Author

Affiliation

Zixi Chen, PhD

NYU-Shanghai

Published

September 25, 2025

“Transformation includes narrowing in on observations of interest (like all people in one city or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).”

–https://r4ds.hadley.nz/intro

In tidyverse framework, dplyr is the main package used for data transformation. The following goes through a list of popular functions that help you manipulate your data. Using the Twitter example data we imported last week, we will start with wrangling the columns (i.e., variables) and then switch to rows (i.e., observations.)

Let’s first load tidyverse and import the drob_tweets data that we’ve imported last week.

library(tidyverse)

drob_tweets<-read_delim(file="https://raw.githubusercontent.com/dgrtwo/tidy-text-mining/master/data/tweets_dave.csv", 
                        col_types = cols(tweet_id =col_character(),
                                         in_reply_to_status_id = col_character(),
                                         in_reply_to_user_id = col_character(),
                                         timestamp = col_datetime(format = "%Y-%m-%d %H:%M:%S %z"),
  source = col_character(),
  text = col_character(),
  retweeted_status_id = col_character(),
  retweeted_status_user_id = col_character(),
  retweeted_status_timestamp = col_datetime("%Y-%m-%d %H:%M:%S %z"),
  expanded_urls = col_character()))

glimpse(drob_tweets)

Rows: 4,174
Columns: 10
$ tweet_id                   <chr> "816108564720812032", "815892641820901376",…
$ in_reply_to_status_id      <chr> NA, NA, NA, NA, NA, "815320862660366336", N…
$ in_reply_to_user_id        <chr> NA, NA, NA, NA, NA, "3230388598", NA, "1425…
$ timestamp                  <dttm> 2017-01-03 02:26:59, 2017-01-02 12:08:59, …
$ source                     <chr> "<a href=\"http://twitter.com/download/ipho…
$ text                       <chr> "RT @ParkerMolloy: 2017 is off to quite a s…
$ retweeted_status_id        <chr> "816082588137881600", "776081137722650628",…
$ retweeted_status_user_id   <chr> "634734888", "2842614819", "69133574", "756…
$ retweeted_status_timestamp <dttm> 2017-01-03 00:43:45, 2016-09-14 15:32:16, …
$ expanded_urls              <chr> "https://twitter.com/ParkerMolloy/status/81…

1 Manipulating columns

1.1 Create new columns

Fig.1 Artwork retrieved from allisonhorst.com

We use mutate() to create new variables that are functions of existing variables. The original dataset will not be altered.

Let’s create a few time related variables from the existing timestamp variable.

drob_tweets_1<- mutate(drob_tweets, # original data
                       year=year(timestamp), # new vars. = f(existing vars.)
                       month=month(timestamp), 
                       month2=month(timestamp, label = T),
                       day=day(timestamp),
                       timestamp.ymd=as_date(timestamp))

Activity 1

Can you think of any alternative ways to create the year, month, day, and timestamp.ymd variables.

1.2 Select and remove columns

Now, I only want to examine the time variables so I use select() to keep these variables.

drob_tweets_YM<-select(drob_tweets_1, 
                       c(timestamp, year, month, month2, day, timestamp.ymd))

glimpse(drob_tweets_YM)

Rows: 4,174
Columns: 6
$ timestamp     <dttm> 2017-01-03 02:26:59, 2017-01-02 12:08:59, 2017-01-01 19…
$ year          <dbl> 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016, 20…
$ month         <dbl> 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…
$ month2        <ord> Jan, Jan, Jan, Jan, Jan, Dec, Dec, Dec, Dec, Dec, Dec, D…
$ day           <int> 3, 2, 1, 1, 1, 31, 31, 30, 29, 29, 29, 29, 29, 29, 29, 2…
$ timestamp.ymd <date> 2017-01-03, 2017-01-02, 2017-01-01, 2017-01-01, 2017-01…

Of course, we can also use select() to remove variables.

drob_tweets_lite<-select(drob_tweets_1, 
                       -c(in_reply_to_user_id, # use minus to de-select
                          retweeted_status_id:expanded_urls)) #   use colon to specify a sequence of variables
glimpse(drob_tweets_lite)

Rows: 4,174
Columns: 10
$ tweet_id              <chr> "816108564720812032", "815892641820901376", "815…
$ in_reply_to_status_id <chr> NA, NA, NA, NA, NA, "815320862660366336", NA, "8…
$ timestamp             <dttm> 2017-01-03 02:26:59, 2017-01-02 12:08:59, 2017-…
$ source                <chr> "<a href=\"http://twitter.com/download/iphone\" …
$ text                  <chr> "RT @ParkerMolloy: 2017 is off to quite a start.…
$ year                  <dbl> 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, …
$ month                 <dbl> 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 12, 12, 12, 1…
$ month2                <ord> Jan, Jan, Jan, Jan, Jan, Dec, Dec, Dec, Dec, Dec…
$ day                   <int> 3, 2, 1, 1, 1, 31, 31, 30, 29, 29, 29, 29, 29, 2…
$ timestamp.ymd         <date> 2017-01-03, 2017-01-02, 2017-01-01, 2017-01-01,…

Activity 2

Working with GenAI

Note that, by default, select() reorders columns to match the order you specify and drops unselected columns. You can use everything() to retain unselected columns in their original positions after the columns you explicitly reorder. Can you explain what the following line of code do?

drob_tweets_lite<-select(drob_tweets_1,month2, day,everything())

A similar function of select() would be pull(), which will pull out columns for fast examination.

Can you pull out the year and month variables?

What are the differences between pull() and select()? Ask GenAI to see if your thoughts aligned.

1.3 Rename columns

Fig.2 Artwork retrieved from allisonhorst.com

We use rename() to assign new names to existing variables. Here, I rename month2 to letter.month.

drob_tweets_2<-rename(drob_tweets_1, 
                      letter.month=month2, # new name = old name
                      tweet=text)


colnames(drob_tweets_2)

 [1] "tweet_id"                   "in_reply_to_status_id"     
 [3] "in_reply_to_user_id"        "timestamp"                 
 [5] "source"                     "tweet"                     
 [7] "retweeted_status_id"        "retweeted_status_user_id"  
 [9] "retweeted_status_timestamp" "expanded_urls"             
[11] "year"                       "month"                     
[13] "letter.month"               "day"                       
[15] "timestamp.ymd"

Note

Working with GenAI Ask GenAI: “Why is it important to use descriptive variable names like”timestamp.ymd” and “letter.month” instead of “ts” and “month2”?”

1.4 Relocate column positions

Fig.3 Artwork retrieved from allisonhorst.com

For a better view of the data set, I move the newly created time variables to be together with timestamp and right after tweet. In this task, we userelocate().

drob_tweets_3<- relocate(drob_tweets_2, 
                         c(year, month, letter.month, day, timestamp, timestamp.ymd), 
                         .after = tweet)

colnames(drob_tweets_3)

 [1] "tweet_id"                   "in_reply_to_status_id"     
 [3] "in_reply_to_user_id"        "source"                    
 [5] "tweet"                      "year"                      
 [7] "month"                      "letter.month"              
 [9] "day"                        "timestamp"                 
[11] "timestamp.ymd"              "retweeted_status_id"       
[13] "retweeted_status_user_id"   "retweeted_status_timestamp"
[15] "expanded_urls"

2 Pipes

You might find something tedious here. We created many new but unnecessary data sets whenever we executed a new function. Also, if we want to perform multiple functions simultaneously, we must use many parentheses, making your script hard to read. See the following ridiculous codes.

drob_tweets_4<-relocate(rename(mutate(drob_tweets, # original data
                       year=year(timestamp), # new vars. = f(existing vars.)
                       month=month(timestamp), 
                       month2=month(timestamp, label = T),
                       day=day(timestamp),
                       timestamp.ymd=as_date(timestamp)), 
                       
                       letter.month=month2, # rename vars.
                       tweet=text), 
                       
                       c(year, month, letter.month, timestamp), # relocating vars. 
                         .after = tweet)

Luckily, this issue can be easily solved by using a pipe operator %>% from the magrittr package of the tidyverse family. You may read it as “and then,” which pipes an object forward to a function or call expression. Let’s see how we can use it by only creating one new date set while executing all the above functions.

drob_tweets.df<-drob_tweets %>% 
  mutate(year=year(timestamp), # new vars. = f(existing vars.)
         month=month(timestamp), 
         month2=month(timestamp, label = T),
         day=day(timestamp),
         timestamp.ymd=as_date(timestamp)) %>% 
  rename(letter.month=month2, # rename vars.
         tweet=text) %>% 
  relocate(c(year, month, letter.month, day, timestamp,timestamp.ymd), # relocate vars. 
           .after = tweet) %>% 
  select(-c(in_reply_to_user_id, 
            retweeted_status_id:expanded_urls))

glimpse(drob_tweets.df)

Rows: 4,174
Columns: 10
$ tweet_id              <chr> "816108564720812032", "815892641820901376", "815…
$ in_reply_to_status_id <chr> NA, NA, NA, NA, NA, "815320862660366336", NA, "8…
$ source                <chr> "<a href=\"http://twitter.com/download/iphone\" …
$ tweet                 <chr> "RT @ParkerMolloy: 2017 is off to quite a start.…
$ year                  <dbl> 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, …
$ month                 <dbl> 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 12, 12, 12, 1…
$ letter.month          <ord> Jan, Jan, Jan, Jan, Jan, Dec, Dec, Dec, Dec, Dec…
$ day                   <int> 3, 2, 1, 1, 1, 31, 31, 30, 29, 29, 29, 29, 29, 2…
$ timestamp             <dttm> 2017-01-03 02:26:59, 2017-01-02 12:08:59, 2017-…
$ timestamp.ymd         <date> 2017-01-03, 2017-01-02, 2017-01-01, 2017-01-01,…

Now, we can remove the other date sets created earlier.

rm(list = ls(pattern="drob_tweets_"))

See how %>% can help improve your coding efficiency and script readability. You may also see |> somewhere else, a native pipe operator added to base R since the recent R 4.1.0 version. %>% and |> work similarly while with some differences. You may read more on this here. There are other pip operators from magrittr. We will see one more example later very soon.

3 Working on rows

Here, we start with filtering tweets that are 1) from 2012 to 2015 and 2) are not retweets. The function filter() subsets a data frame satisfying the specified conditions.

Fig.4 Artwork retrieved from allisonhorst.com

We then arrange the tweet from the newest year to the oldest while keeping the year and day in ascending order using the arrange() function.

drob_tweets.df2 <-drob_tweets.df %>%
  filter(year>=2012 & year<=2015) %>% # keep tweets from 2012-2015 
  filter(!is.na(in_reply_to_status_id)) %>% # keep the reply tweets (defined soly based on this variable)
  arrange(desc(year), month, day) # re-order the tweets sequence by time

View(drob_tweets.df2)

Note

Working with GenAI

Put the following line of code in your console, does it work and why?

drob_tweets.df %>% filter(year=2015)

You can ask GenAI for debugging, for example: “Why does filter(year = 2015) give an error in R?”

4 Many more useful functions…

dplyr has many more useful functions to conduct a variety of tasks. In the following example, I further show a few commonly used functions, particular those ones are useful in manipulating data by groups and computing aggregated statistics.

Example: Retweeting events

We might want to examine users’ retweeting behavior (as it might be a factor contributing to the users’ follower/following network size.)

To know a user’s activeness of retweeting, we may calculate how many retweets per month in each year.

drob_tweets.YMgroup<-drob_tweets.df2 %>%
  group_by(year, month) %>% 
  mutate(reply.tweet.count.YM=n(),
         reply.daysYM=n_distinct(timestamp.ymd)) %>% 
  ungroup() %>% # always ungroup when a grouping task is finished
  group_by(year) %>% 
  mutate(reply.monthsY=n_distinct(month)) %>% 
  ungroup() %>% 
  select(year, month, timestamp.ymd,
         reply.tweet.count.YM,reply.daysYM,reply.monthsY)  

View(drob_tweets.YMgroup)

Now we can tell how many months that this user has active retweeting events each year.

drob_tweets.YMgroup %>% 
  distinct(reply.monthsY, year)

# A tibble: 4 × 2
  reply.monthsY  year
          <int> <dbl>
1            12  2015
2             5  2014
3             4  2013
4             2  2012

Further, we can show the most active retweeting months each year. Let’s take a look at the most active top 3 months.

drob_tweets.YMtop3<-drob_tweets.YMgroup %>%
  distinct(year, month, .keep_all=T) %>% # keep distinct rows at year-month level and keep all variables.
  group_by(year)%>% 
  slice_max(order_by=reply.tweet.count.YM, # sort by number of retweets
            n=3, # specify that we only want the top 3 
            with_ties=T) %>%  # keep the rows of ties
  ungroup()%>% 
  select(-timestamp.ymd) # as we are computing aggregated statistics at the year and month levels, we disregard the day level variable for clarity.

View(drob_tweets.YMtop3)

What can we tell from the above tables?

You can see that there are only two lines of data entry for 2012. That’s because the user only retweeted in two months (i.e., Jan. and Jun.) in 2012. See the corresponding reply.monthsY value is 2.
In 2013, four months shown up since we specified that we want to include the ties. Thus, the four months have retweets and each has only one retweet are all “sliced”.
In 2014, the user has a little bit more engagement in retweeting during five months. Notably, December had the highest number of replied tweets, with September and February following closely.
At last, the number of retweets significantly increased in 2015. I wonder what happened this year. Also, Dec. is an active retweeting month. Now, take a wild guess: in which year did this user have a larger tweeting network?

Finally, we can use summarize() to compute active retweeting statistics at the day level.

drob_tweets.group<-drob_tweets.df2 %>%
  group_by(timestamp.ymd) %>%
  summarise(reply.tweet.daily=n()) %>% 
  ungroup() %>% 
  mutate(mean.reply.tweet.daily=mean(reply.tweet.daily),
         sd.reply.tweet.daily=sd(reply.tweet.daily),
         reply.days=n_distinct(timestamp.ymd)) %>% 
  arrange(desc(reply.tweet.daily))

Activity 3

Can you explain these statistics that we just computed?

Activity 4

What if we move mutate(reply.days=n_distinct(timestamp.ymd)) before the summarise() function? Run the following chunk to see the output.

drob_tweets.group2<-drob_tweets.df2 %>%
  mutate(reply.days=n_distinct(timestamp.ymd)) %>% 
  group_by(timestamp.ymd) %>%
  summarise(reply.tweet.count.daily=n()) %>% 
  ungroup()

5 Other tools for cleaning dirty data

In addition to dplyr and lubridate that we have explored so far for data cleaning and manipulation, janitor is another handy tool to deal with dirty data. Take a look of its vignettes, see how it can help with cleaning column names, removing rows and columns, exploring duplication patterns, cleaning datetime data, and so much more.

6 Dealing with BIG data

So far, we have used two example datasets to demonstrate data importing and wrangling. These two datasets are small in terms of size and are snapshots of penguins and tweets. We worked with these data in data frames, which are processed through our computer’s Random Access Memory (RAM).

As you progress in your CSS journey, we would encounter with large datasets that exceeding the RAM limit. In such instances, managing big data is better handled through databases and database management systems (DBMS). The resources provided below offer introductory strategies for utilizing R in dealing with big data. However, comprehensive training on handling big data goes beyond the scope of this class.

Importing big data from database using DBI and dbplyr.
Officially recommended strategies for working with big data in R.

7 Reflection: working with GenAI

As we wrap up this session, take a moment to reflect on your experience using GenAI as a learning tool while working through the activities. GenAI can be powerful aids when used thoughtfully, but they are most effective when paired with your critical thinking and coding practice.

Take 5–10 minutes to jot down your thoughts on your experience so far. Here are some example questions to guide your reflection:

1) Did GenAI help you understand any concepts more clearly (e.g., debugging, understanding mutate() or group_by())?

2) Were there moments where you felt relying on GenAI might have hindered your learning? How did you address this? Relevently, how did you balance using GenAI with your own problem-solving skills?

3) Did you notice any inaccuracies or limitations in the responses provided by GenAI? How did you verify the information it gave you?

4) What strategies will you use in future sessions to make your interactions with GenAI more effective and aligned with your learning goals?