Identify and Remove Duplicate Data in R
Easy
30 mins
Data Manipulation in R
This tutorial describes how to identify and remove duplicate data in R.
You will learn how to use the following R base and dplyr functions:
- R base functions: duplicated() for identifying duplicated elements, and unique() for extracting unique elements;
- distinct() [dplyr package] to remove duplicate rows in a data frame.
Contents:
- Required packages
- Demo dataset
- Find and drop duplicate elements
- Extract unique elements
- Remove duplicate rows in a data frame
- Summary
Required packages
Load the tidyverse packages, which include dplyr:
library(tidyverse)
Demo dataset
We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.
my_data <- as_tibble(iris)
my_data
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 144 more rows
Find and drop duplicate elements
The R function duplicated() returns a logical vector where TRUE indicates which elements of a vector or data frame are duplicates.
Given the following vector:
x <- c(1, 1, 4, 5, 4, 6)
- To find which elements of x are duplicated, use:
duplicated(x)
## [1] FALSE TRUE FALSE FALSE TRUE FALSE
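Since duplicated() returns a logical vector rather than positions, you can wrap it in which() if you need the integer indices of the duplicated elements (a small base R sketch):

```r
x <- c(1, 1, 4, 5, 4, 6)
# which() converts the logical vector returned by duplicated()
# into the integer positions of the duplicated elements
which(duplicated(x))
## [1] 2 5
```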
- Extract duplicate elements:
x[duplicated(x)]
## [1] 1 4
- If you want to remove duplicated elements, use !duplicated(), where ! is a logical negation:
x[!duplicated(x)]
## [1] 1 4 5 6
- In the same way, you can remove duplicate rows from a data frame based on the values in a given column, as follows:
# Remove duplicates based on the Sepal.Width column
my_data[!duplicated(my_data$Sepal.Width), ]
## # A tibble: 23 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 17 more rows
Here, !duplicated() keeps only the rows that are not duplicates.
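Note that duplicated() marks only the second and later occurrences of a value; the first copy is never flagged. If you want to flag every element that has a duplicate anywhere in the vector, including its first occurrence, a common base R idiom is to combine a forward and a backward pass:

```r
x <- c(1, 1, 4, 5, 4, 6)
# duplicated(x) flags later copies; duplicated(x, fromLast = TRUE)
# flags earlier copies; their union flags all copies
x[duplicated(x) | duplicated(x, fromLast = TRUE)]
## [1] 1 1 4 4
```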
Extract unique elements
Given the following vector:
x <- c(1, 1, 4, 5, 4, 6)
You can extract unique elements as follows:
unique(x)
## [1] 1 4 5 6
It’s also possible to apply unique() to a data frame to remove duplicated rows, as follows:
unique(my_data)
## # A tibble: 149 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 143 more rows
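As a quick sanity check, unique() and subsetting with !duplicated() agree on data frames: iris contains a single fully duplicated row, so both approaches return 149 of the 150 original rows (assuming my_data is the tibble created above):

```r
# Both approaches drop the one fully duplicated row in iris
nrow(unique(my_data))
## [1] 149
nrow(my_data[!duplicated(my_data), ])
## [1] 149
```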
Remove duplicate rows in a data frame
The function distinct() [dplyr package] can be used to keep only unique/distinct rows of a data frame. If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().
Remove duplicate rows based on all columns:
my_data %>% distinct()
## # A tibble: 149 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 143 more rows
Remove duplicate rows based on certain columns (variables):
# Remove duplicated rows based on Sepal.Length
my_data %>% distinct(Sepal.Length, .keep_all = TRUE)
## # A tibble: 35 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 29 more rows
# Remove duplicated rows based on
# Sepal.Length and Petal.Width
my_data %>% distinct(Sepal.Length, Petal.Width, .keep_all = TRUE)
## # A tibble: 110 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 104 more rows
The option .keep_all is used to keep all variables in the data.
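Before removing duplicates, it can also be useful to inspect how many copies of each value exist. One possible sketch with dplyr's count() and filter() (not part of the original lesson):

```r
# Count the rows sharing each Sepal.Length value,
# then keep only the values that occur more than once
my_data %>%
  count(Sepal.Length) %>%
  filter(n > 1)
```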
Summary
In this chapter, we describe key functions for identifying and removing duplicate data:
- Remove duplicate rows based on one or more column values:
my_data %>% dplyr::distinct(Sepal.Length)
- R base function to extract unique elements from vectors and data frames:
unique(my_data)
- R base function to determine duplicate elements:
duplicated(my_data)
Comments (20)
Abouelela
19 Dec 2018
you are missing a comma here after the row x[duplicated(x)]. It should be like this x[duplicated(x), ]
Kassambara
23 Dec 2018
x is a vector, so you don’t need to add a comma
Sergio Vicente
07 Apr 2019
Anyway, your comment was very useful to me, ’cause I am working with a data frame (in my case). Tks a lot.
Gal Inbar
18 Apr 2019
I am using data.table, and it’s also very useful!!
Thanks!!
Kassambara
18 Apr 2019
Thank you for your positive feedback. Highly appreciated!
Phat
03 Jun 2019
Error: Length of logical index vector for `[` must equal number of columns (or 1):
* `.data` has 1348 columns
* Index vector has length 1191
Kassambara
03 Jun 2019
Please clarify your question and provide a reproducible example.
Julián
21 Jun 2019
Thanks, it is a simple and useful tutorial.
Kassambara
21 Jun 2019
Thank you Julián for your feedback!
Stonemonroy
03 Jul 2019
What an excellent tutorial, simple and to the point. I have used it several times.
Kassambara
03 Jul 2019
Thank you for your positive feedback!
Robyn
19 Jul 2019
hi I’m trying to KEEP ONLY duplicate rows based on a column. I first tested for unique:
unique(Jan_19)
# A tibble: 178,492 x 22
then the number of duplicates based on my CON column:
Jan_19[duplicated(Jan_19$CON), ]
# A tibble: 251 x 22
then tried to drop the rows where CON was not duplicated:
Jan_19 %>% !distinct(CON, .keep_all = TRUE)
Error in distinct(CON, .keep_all = TRUE) : object ‘CON’ not found
any advice? Thanks for the codes, quite useful
Kassambara
23 Jul 2019
You can use the following R code:
library(dplyr)
Jan_19 %>% distinct(CON, .keep_all = TRUE)
Andreas Rybicki
20 Jan 2020
Kassambara,
the lesson “Identify and Remove Duplicate Data in R” was extremely helpful for my task,
Question:
two dataframes like “iris”, say iris for Country A and B,
the dataframes are quite large, up to 1 mio rows and > 10 columns,
I’d like to check, whether a row in B contains the same input in A.
E.g. in ‘iris’ row 102 == 143;
let’s assume row 102 is in iris country_A and row 143 in iris…._B. How could I identify any duplicates in these two DF’s?
I searched in stackexchange but didn’t find any helpful solution.
Thks
Zbig
15 Apr 2020
Now I have a slightly harder task:
what to do if I want to remove only subsequent, immediate duplicates, but if they are divided by something I want to preserve them.
Example: you have a data frame with object id, time and the place where it happened:
df <- data.frame(id=c(1,1,1,2,2,2), time=rep(1:3, 2), place=c(1,2,1,1,1,2))
and I would like to extract paths of these object – for example object 1 was at place 1, then 2, then back to 1 – and I would like to preserve that in data so that later I can see that it moved from 1 to 2 and then from 2 to 1
any ideas?
Kassambara
15 Apr 2020
If you want to keep distinct rows based on multiple columns, you can go as follow:
library(dplyr)
df <- data.frame(id=c(1,1,1,2,2,2), time=rep(1:3, 2), place=c(1,2,1,1,1,2))
df %>% distinct(id, time, place, .keep_all = TRUE)
Moses
20 May 2020
You are always on point! A quick one…
What are the major check points in data management? I know there are duplicates, missing data, …
Kassambara
20 May 2020
Thank you for your feedback, highly appreciated. There is also outlier identification.
Hasan Ayouby
21 May 2020
How to permanently remove the duplicates? because once i used this function, it acts only like a filter. but the original table stays intact.
Kassambara
22 May 2020
To overwrite your original data, type this:
my_data = my_data %>% distinct(Sepal.Length, .keep_all = TRUE)
Teacher
Alboukadel Kassambara
Role : Founder of Datanovia
- Website : https://www.datanovia.com/en
- Experience : >10 years
- Specialist in : Bioinformatics and Cancer Biology