Detecting Duplicates (base R vs. dplyr)

Posted on July 22, 2021 by R | Fahim Ahmad in R bloggers

[This article was first published on R | Fahim Ahmad, and kindly contributed to R-bloggers.]


Sometimes you may encounter duplicated values in your data, which can cause problems depending on how you plan to use the data. In this post, I provide an overview of the duplicated() function from base R and the distinct() function from the dplyr package for detecting and removing duplicates.

I will be using the following data frame as an example in this post.

set.seed(1000)

df <- data.frame(
  ID = sample(paste0("ID-00", 1:7), 10, replace = T),
  value_1 = sample(1:7, 10, replace = T),
  value_2 = sample(2:5, 10, replace = T)
)

df

##        ID value_1 value_2
## 1  ID-004       5       2
## 2  ID-003       6       5
## 3  ID-006       1       3
## 4  ID-003       1       4
## 5  ID-005       5       5
## 6  ID-003       2       3
## 7  ID-005       2       2
## 8  ID-002       4       3
## 9  ID-006       7       2
## 10 ID-006       2       3

The duplicated() function returns a logical vector in which TRUE marks the rows of the data frame that are duplicates.

For instance, duplicated(df["ID"]) returns the following vector.

## [1] FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE

Note: the duplicated() function preserves the first occurrence when identifying duplicate values. If you want to consider the duplication from the reverse side, set the fromLast argument to TRUE.
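For example, applied to the df created above, flagging duplicates from the reverse side should mark the earlier occurrences instead (a quick sketch; the output follows from the data frame shown earlier):

# flag duplicates scanning from the last row upward
duplicated(df["ID"], fromLast = TRUE)

## [1] FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE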

Let’s store the default (fromLast = FALSE) vector and use it to exclude the duplicated rows.

duplicate_rows <- duplicated(df["ID"])

df[!duplicate_rows, ]

##       ID value_1 value_2
## 1 ID-004       5       2
## 2 ID-003       6       5
## 3 ID-006       1       3
## 5 ID-005       5       5
## 8 ID-002       4       3

An alternative way to select unique values is the dplyr::distinct() function, which yields a similar result.

dplyr::distinct(df, ID, .keep_all = TRUE)

##       ID value_1 value_2
## 1 ID-004       5       2
## 2 ID-003       6       5
## 3 ID-006       1       3
## 4 ID-005       5       5
## 5 ID-002       4       3

Note: The .keep_all argument is used to retain all other columns in the output data frame.
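Without it (.keep_all defaults to FALSE), distinct() returns only the column(s) used for deduplication; on the same df this would be:

dplyr::distinct(df, ID)

##       ID
## 1 ID-004
## 2 ID-003
## 3 ID-006
## 4 ID-005
## 5 ID-002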

The above chunks of code remove duplicated rows based on a single column. What if we want to remove duplicates based on more than one column?

One way is to concatenate the columns in which you want to check for duplicates. For example, let’s remove the rows where the combination of value_1 and value_2 is duplicated.

df <- dplyr::mutate(df, value_1_2 = paste(value_1, value_2))

df[!duplicated(df[c("value_1_2")]), ]

##       ID value_1 value_2 value_1_2
## 1 ID-004       5       2       5 2
## 2 ID-003       6       5       6 5
## 3 ID-006       1       3       1 3
## 4 ID-003       1       4       1 4
## 5 ID-005       5       5       5 5
## 6 ID-003       2       3       2 3
## 7 ID-005       2       2       2 2
## 8 ID-002       4       3       4 3
## 9 ID-006       7       2       7 2

Below is a more efficient way of detecting duplicates based on a combination of columns, without first concatenating the values of the columns in which we want to identify the duplicated values:

# using the duplicated() function
df[!duplicated(df[c("value_1", "value_2")]), ]

# using the distinct() function
dplyr::distinct(df, value_1, value_2, .keep_all = TRUE)

##       ID value_1 value_2 value_1_2
## 1 ID-004       5       2       5 2
## 2 ID-003       6       5       6 5
## 3 ID-006       1       3       1 3
## 4 ID-003       1       4       1 4
## 5 ID-005       5       5       5 5
## 6 ID-003       2       3       2 3
## 7 ID-005       2       2       2 2
## 8 ID-002       4       3       4 3
## 9 ID-006       7       2       7 2

Both approaches work well. The advantage of the duplicated() function from base R, however, is that it returns a logical vector identifying the duplicated rows, which can be used either to drop those rows or to keep only them for further investigation. The distinct() function, by contrast, removes the duplicated rows directly without indicating which rows were duplicates.
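Because duplicated() returns a logical vector, it can also be reused for quick checks such as counting the flagged rows. A small sketch based on the df above:

dups <- duplicated(df["ID"])

# how many rows are flagged as duplicates (first occurrences are not counted)
sum(dups)
## [1] 5

df[!dups, ]  # drop the duplicated rows
df[dups, ]   # keep only the duplicated rows for inspection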

For instance, let’s keep the duplicated ID numbers only.

df[duplicated(df["ID"], fromLast = F), ]

##        ID value_1 value_2 value_1_2
## 4  ID-003       1       4       1 4
## 6  ID-003       2       3       2 3
## 7  ID-005       2       2       2 2
## 9  ID-006       7       2       7 2
## 10 ID-006       2       3       2 3

As mentioned above, the duplicated() function does not mark the first occurrence as a duplicate when identifying duplicated values. Thus, to capture every row whose ID appears more than once, we also need to count backward and consider the duplication from the reverse side, by combining the result with a second call in which the fromLast argument is set to TRUE.

df[duplicated(df["ID"], fromLast = F) | duplicated(df["ID"], fromLast = T), ]

##        ID value_1 value_2 value_1_2
## 2  ID-003       6       5       6 5
## 3  ID-006       1       3       1 3
## 4  ID-003       1       4       1 4
## 5  ID-005       5       5       5 5
## 6  ID-003       2       3       2 3
## 7  ID-005       2       2       2 2
## 9  ID-006       7       2       7 2
## 10 ID-006       2       3       2 3
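For completeness, a comparable dplyr approach (not shown in the original post, so treat it as an assumption on my part) is to group by ID and keep only groups with more than one row; it should return the same eight rows as above:

library(dplyr)

df %>%
  group_by(ID) %>%     # group rows by the ID column
  filter(n() > 1) %>%  # keep IDs that occur more than once
  ungroup()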

