How to Remove Duplicate Rows in R (2024)

How do you remove duplicate rows from an R DataFrame (data.frame)? There are several ways to do it: you can drop rows that are duplicated across all columns, across a single column, or across a selected set of columns. In this article, I will explain all these cases with examples that use functions from R base, dplyr, and data.table.

Remove duplicates using R base functions
Remove duplicate rows using dplyr
Remove duplicate rows using data.table

1. Quick Examples of Removing Duplicate Rows

The following are quick examples of how to remove duplicate rows from an R DataFrame (data.frame).

# Quick Examples
# Remove duplicate rows
df2 <- df[!duplicated(df), ]
# Remove duplicates by single column
df2 <- df[!duplicated(df$id), ]
# Remove duplicates on selected columns
df2 <- unique(df[, c('id', 'pages', 'chapters', 'price')])
# Using dplyr
# Remove duplicate rows (all columns)
library(dplyr)
df2 <- df %>% distinct()
# Remove duplicates on specific column
df2 <- df %>% distinct(id, .keep_all = TRUE)
# Remove duplicates on selected columns
df2 <- df %>% distinct(id, pages, .keep_all = TRUE)
# Using data.table
library(data.table)
dt <- data.table(df)
# Remove duplicates on specific column
dt2 <- unique(dt, by = "id")

Let’s create an R DataFrame, run these examples, and explore the output. If you already have data in CSV you can easily import CSV files to R DataFrame. Also, refer to Import Excel File into R.

# Create dataframe
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)
df

This yields the output below. Note that the first two rows are identical in every column, while the last two rows match on id, pages, chapters, and price but differ on name.

# Output
  id pages  name chapters price
1 11    32 spark       76   144
2 11    32 spark       76   144
3 33    33     R       11   321
4 44    22  java       15   567
5 44    22   jsp       15   567

2. Remove Duplicates using R Base Functions

R base provides the duplicated() and unique() functions to remove duplicates in an R DataFrame (data.frame). By using these two functions we can delete duplicate rows by considering all columns, a single column, or selected columns.

2.1 Remove Duplicate Rows

duplicated() is an R base function that takes a vector or data.frame as input and returns a logical vector that is TRUE for every element or row that repeats an earlier one. Negating that vector and using it as a row index keeps only the first occurrence of each row. For example, the first two rows of my data frame above are identical, so running the example below drops the second one and returns the data frame with unique rows.

# Remove duplicate rows
df2 <- df[!duplicated(df), ]
df2
# Output
#   id pages  name chapters price
# 1 11    32 spark       76   144
# 3 33    33     R       11   321
# 4 44    22  java       15   567
# 5 44    22   jsp       15   567

To remove duplicates based on a single column, pass that column (for example, df$id) to duplicated() instead of the whole data frame.
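As a minimal sketch of the single-column case, the example below rebuilds the sample data frame from this article and passes only the id column to duplicated(). The names dup_id and df_by_id are illustrative choices, not part of any API.

```r
# Sample data frame used throughout this article
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# duplicated() on a vector is TRUE for every repeat of an earlier value
dup_id <- duplicated(df$id)
dup_id
# [1] FALSE  TRUE FALSE FALSE  TRUE

# Keep only the first row seen for each id (original rows 1, 3 and 4)
df_by_id <- df[!dup_id, ]
df_by_id
```

Because only id is compared, row 5 (jsp) is dropped even though its name differs from row 4.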

2.2 Remove Duplicates on Selected Columns

Use the unique() function to remove duplicates based on several selected columns of the R data frame. The following example selects the columns id, pages, chapters, and price and returns the unique combinations of their values. Note that the result contains only the selected columns.

# Remove duplicates on selected columns
df2 <- unique(df[, c('id', 'pages', 'chapters', 'price')])
df2
# Output
#   id pages chapters price
# 1 11    32       76   144
# 3 33    33       11   321
# 4 44    22       15   567

3. Remove Duplicate Rows using dplyr

The dplyr package provides the distinct() function to remove duplicates. To use it, load the library with library("dplyr"). In case you don’t have this package, install it using install.packages("dplyr").

For bigger data sets, the dplyr functions are often faster than their base R equivalents because much of the work runs in compiled code. This article's rough estimate is about 30% faster, but the actual gain depends on your data.

3.1 Use distinct() to Remove Duplicates

The distinct() function selects unique rows from a data frame by removing all duplicates. It is similar to the R base unique() function but tends to perform better on large datasets, so prefer it when performance matters.

# Using dplyr
# Remove duplicate rows (all columns)
library(dplyr)
df2 <- df %>% distinct()
df2
# Output
#   id pages  name chapters price
# 1 11    32 spark       76   144
# 2 33    33     R       11   321
# 3 44    22  java       15   567
# 4 44    22   jsp       15   567

Here, we use the infix operator %>% from magrittr, which passes the left-hand side of the operator as the first argument of the function on the right-hand side. For example, x %>% f(y) is converted into f(x, y), so the result from the left-hand side is “piped” into the right-hand side.
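To make the pipe rewrite concrete, here is a small sketch that calls distinct() both ways on the sample data frame and compares the results. It assumes dplyr is installed; the names piped and direct are illustrative only.

```r
library(dplyr)

# Sample data frame used throughout this article
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# Piped form: df becomes the first argument of distinct()
piped <- df %>% distinct()

# Direct form: the same call written without the pipe
direct <- distinct(df)

# Both forms should produce the same data frame
identical(piped, direct)
```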

3.2 Remove Duplicates on Specific column

Similarly, you can use distinct() to remove duplicates based on a single column. Here, I am using the optional argument .keep_all = TRUE, which keeps all variables in .data. If a combination of the columns passed in ... is not distinct, this keeps the first row of values.

# Remove duplicates on specific column
df2 <- df %>% distinct(id, .keep_all = TRUE)
df2
# Output
#   id pages  name chapters price
# 1 11    32 spark       76   144
# 2 33    33     R       11   321
# 3 44    22  java       15   567

3.3 Get Unique Rows on Selected Columns

If you want to get unique rows on selected columns of the R data.frame, just pass the columns as arguments to this distinct() function.

# Remove duplicates on selected columns
df2 <- df %>% distinct(id, pages, .keep_all = TRUE)
df2
# Output
#   id pages  name chapters price
# 1 11    32 spark       76   144
# 2 33    33     R       11   321
# 3 44    22  java       15   567
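For comparison, the sketch below shows what happens when .keep_all is left at its default of FALSE: distinct() then returns only the columns you name. It assumes dplyr is installed, and the result names are illustrative.

```r
library(dplyr)

# Sample data frame used throughout this article
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# Default (.keep_all = FALSE): only id and pages are returned
keys_only <- df %>% distinct(id, pages)
names(keys_only)
# [1] "id"    "pages"

# .keep_all = TRUE: same rows are kept, but with every column
all_cols <- df %>% distinct(id, pages, .keep_all = TRUE)
names(all_cols)
# [1] "id"       "pages"    "name"     "chapters" "price"
```

Both results have three rows; the choice only affects which columns come back.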

4. Remove Duplicate Rows using data.table

Use the unique() function from the data.table package to eliminate duplicates. data.table is a package for working with tabular data in the R Programming Language. It provides the efficient data.table object, which is an enhanced, better-performing version of the default data.frame.

# Using data.table
library(data.table)
dt <- data.table(df)
# Remove duplicates on specific column
dt2 <- unique(dt, by = "id")
dt2
# Output
#    id pages  name chapters price
# 1: 11    32 spark       76   144
# 2: 33    33     R       11   321
# 3: 44    22  java       15   567
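The by argument also accepts a character vector of several column names. The sketch below, which assumes data.table is installed, compares deduplicating on id and pages with deduplicating on id and name; the result names are illustrative.

```r
library(data.table)

# Sample data used throughout this article, as a data.table
dt <- data.table(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# Rows are compared on id and pages only; all columns are returned
dt_id_pages <- unique(dt, by = c("id", "pages"))
nrow(dt_id_pages)
# [1] 3

# Adding name to the key keeps both id 44 rows (java and jsp)
dt_id_name <- unique(dt, by = c("id", "name"))
nrow(dt_id_name)
# [1] 4
```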

Frequently Asked Questions on Removing Duplicate Rows

How do I identify and count duplicate rows in a data frame in R?

You can use the duplicated() function to identify duplicates and sum(duplicated(df)) to count them in a data frame df.
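A short base R sketch of identifying and counting duplicates on the sample data frame follows. The names n_dup, dup_rows, and all_copies are illustrative.

```r
# Sample data frame used throughout this article
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# Count rows that repeat an earlier row in every column
n_dup <- sum(duplicated(df))
n_dup
# [1] 1

# Show the repeated rows themselves (here, original row 2)
dup_rows <- df[duplicated(df), ]

# Show every copy, including the first occurrence (original rows 1 and 2)
all_copies <- df[duplicated(df) | duplicated(df, fromLast = TRUE), ]
```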

What function is commonly used to remove duplicate rows in R?

The unique() function is commonly used to remove duplicate rows from a data frame in R.

How can I remove duplicates based on specific columns in R?

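Pass only the columns of interest to duplicated() and use the negated result to filter the full data frame, for example df[!duplicated(df[, c('id', 'pages')]), ]. This compares rows on id and pages only but keeps every column. With dplyr, distinct(id, pages, .keep_all = TRUE) does the same.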
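One way to do this in base R, sketched on the sample data frame, is to index the key columns inside duplicated(). The names key_cols and df_unique are illustrative.

```r
# Sample data frame used throughout this article
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# Columns that define what counts as a duplicate
key_cols <- c("id", "pages")

# Compare rows on the key columns only, but keep all columns in the result
df_unique <- df[!duplicated(df[, key_cols]), ]
df_unique
#   id pages  name chapters price
# 1 11    32 spark       76   144
# 3 33    33     R       11   321
# 4 44    22  java       15   567
```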

How do I keep the first occurrence of each unique row and remove subsequent duplicates in R?

By default, duplicated() marks every occurrence after the first, so df[!duplicated(df), ] already keeps the first occurrence of each row and removes the later ones. Setting fromLast = TRUE scans from the end instead, which keeps the last occurrence.
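The sketch below compares the default behavior of duplicated() with fromLast = TRUE on the id column of the sample data frame. The names keep_first and keep_last are illustrative.

```r
# Sample data frame used throughout this article
df <- data.frame(
  id = c(11, 11, 33, 44, 44),
  pages = c(32, 32, 33, 22, 22),
  name = c("spark", "spark", "R", "java", "jsp"),
  chapters = c(76, 76, 11, 15, 15),
  price = c(144, 144, 321, 567, 567)
)

# Default (fromLast = FALSE): the first row for each id is kept
keep_first <- df[!duplicated(df$id), ]
keep_first$name
# [1] "spark" "R"     "java"

# fromLast = TRUE: the last row for each id is kept
keep_last <- df[!duplicated(df$id, fromLast = TRUE), ]
keep_last$name
# [1] "spark" "R"     "jsp"
```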

Conclusion

In this article, you have learned how to remove duplicate rows in R by using the R base functions duplicated() and unique(), the dplyr function distinct(), and the unique() function from data.table. If performance matters, use the functions from either dplyr or data.table.
