Identify and Remove Duplicate Data in R - Datanovia (2024)

Identify and Remove Duplicate Data in R

Easy

30 mins

Data Manipulation in R

This tutorial describes how to identify and remove duplicate data in R.

You will learn how to use the following R base and dplyr functions:

  1. R base functions
    • duplicated(): for identifying duplicated elements and
    • unique(): for extracting unique elements,
  2. distinct() [dplyr package] to remove duplicate rows in a data frame.

Contents:

  • Required packages
  • Demo dataset
  • Find and drop duplicate elements
  • Extract unique elements
  • Remove duplicate rows in a data frame
  • Summary

Required packages

Load the tidyverse packages, which include dplyr:

library(tidyverse)

Demo dataset

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

my_data <- as_tibble(iris)
my_data
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 144 more rows

Find and drop duplicate elements

The R function duplicated() returns a logical vector in which TRUE marks the elements of a vector, or the rows of a data frame, that repeat an earlier one. The first occurrence of each value is marked FALSE.

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)
  • To flag which elements of x are duplicates, use this:
duplicated(x)
## [1] FALSE TRUE FALSE FALSE TRUE FALSE
  • Extract duplicate elements:
x[duplicated(x)]
## [1] 1 4
  • If you want to remove duplicated elements, use !duplicated(), where ! is a logical negation:
x[!duplicated(x)]
## [1] 1 4 5 6
  • In the same way, you can remove duplicate rows from a data frame based on the values of one column, as follows:
# Remove duplicates based on the Sepal.Width column
my_data[!duplicated(my_data$Sepal.Width), ]
## # A tibble: 23 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 17 more rows

! is a logical negation. !duplicated() means that we don’t want duplicate rows.
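
Because duplicated() marks only the second and later occurrences, it can help to see how to get their positions and how to flag every copy of a repeated value. The following is a minimal sketch using base R only; the names dup_positions and all_dups are illustrative and not part of the original tutorial.

```r
x <- c(1, 1, 4, 5, 4, 6)

# Positions of the second and later occurrences
dup_positions <- which(duplicated(x))
dup_positions
## [1] 2 5

# Flag every occurrence of a value that appears more than once:
# scan once from the start and once from the end, then combine
all_dups <- duplicated(x) | duplicated(x, fromLast = TRUE)
x[all_dups]
## [1] 1 1 4 4
```

The same all_dups pattern, applied to a column such as my_data$Sepal.Width, would keep only the rows whose value in that column is repeated somewhere in the data.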

Extract unique elements

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)

You can extract unique elements as follows:

unique(x)
## [1] 1 4 5 6

It’s also possible to apply unique() to a data frame to remove duplicated rows, as follows:

unique(my_data)
## # A tibble: 149 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 143 more rows
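
To connect unique() with the duplicated() approach from the previous section, here is a small sketch on the plain iris data frame (base R only, no tibble conversion). It counts the fully duplicated rows and checks that both approaches keep the same rows. The names n_dup, by_unique and by_filter are illustrative.

```r
# iris ships with base R (datasets package)

# Number of rows that repeat an earlier row across all columns
n_dup <- sum(duplicated(iris))
n_dup
## [1] 1

# unique() on a data frame keeps the same rows as filtering with !duplicated()
by_unique <- unique(iris)
by_filter <- iris[!duplicated(iris), ]
nrow(by_unique)
## [1] 149
identical(by_unique, by_filter)
```

Checking sum(duplicated(...)) first is a cheap way to see how many rows would be dropped before actually removing anything.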

Remove duplicate rows in a data frame

The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame. If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().

Remove duplicate rows based on all columns:

my_data %>% distinct()
## # A tibble: 149 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 143 more rows

Remove duplicate rows based on certain columns (variables):

# Remove duplicated rows based on Sepal.Length
my_data %>% distinct(Sepal.Length, .keep_all = TRUE)
## # A tibble: 35 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 29 more rows
# Remove duplicated rows based on
# Sepal.Length and Petal.Width
my_data %>% distinct(Sepal.Length, Petal.Width, .keep_all = TRUE)
## # A tibble: 110 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
## # ... with 104 more rows

The option .keep_all = TRUE keeps all variables (columns) in the output. Without it, distinct() returns only the columns named in the call.
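
The effect of .keep_all is easiest to see by comparing the shape of the two results. This is a minimal sketch that assumes dplyr is installed; it loads dplyr directly rather than the whole tidyverse, and the names only_key and first_rows are illustrative.

```r
library(dplyr)

my_data <- as_tibble(iris)

# Without .keep_all: only the column used for de-duplication is returned
only_key <- my_data %>% distinct(Sepal.Length)
dim(only_key)
## [1] 35  1

# With .keep_all = TRUE: the first row for each distinct Sepal.Length,
# with all five columns kept
first_rows <- my_data %>% distinct(Sepal.Length, .keep_all = TRUE)
dim(first_rows)
## [1] 35  5
```

Both calls find the same 35 distinct values; the only difference is whether the remaining columns of the first matching row come along.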

Summary

In this chapter, we describe key functions for identifying and removing duplicate data:

  • Remove duplicate rows based on one or more column values: my_data %>% dplyr::distinct(Sepal.Length) (add .keep_all = TRUE to keep the other columns)
  • R base function to extract unique elements from vectors and data frames: unique(my_data)
  • R base function to determine duplicate elements: duplicated(my_data)


Comments ( 20 )

      Kassambara

      23 Dec 2018

      x is a vector, so you don’t need to add a comma

      Sergio Vicente

      07 Apr 2019

      Any way, your comment was very useful to me, ’cause I am working with a data frame (in my case). Tks a lot.

    Gal Inbar

    18 Apr 2019

    Iam using Data table , and also very useful !!
    Thanks !!

      Kassambara

      18 Apr 2019

      Thank you for your positive feedback. Highly appreciated!

    Phat

    03 Jun 2019

    Error: Length of logical index vector for `[` must equal number of columns (or 1):
    * `.data` has 1348 columns
    * Index vector has length 1191

      Kassambara

      03 Jun 2019

      please, clarify your question and provide reproducible example

    Julián

    21 Jun 2019

    Thanks, it is a simple and useful tutorial.

      Kassambara

      21 Jun 2019

      Thank you Juliàn for your feedback!

    Stonemonroy

    03 Jul 2019

    What an excellent tutorial, simple and straightforward, but to the point. I have used it several times.

      Kassambara

      03 Jul 2019

      Thank you for your positive feedback!

    Robyn

    19 Jul 2019

    hi I’m trying to KEEP ONLY duplicate rows base on a column. I first tested for unique;
    unique(Jan_19)
    # A tibble: 178,492 x 22

    then the number of duplicates base on my CON column
    Jan_19[duplicated(Jan_19$CON), ]
    # A tibble: 251 x 22

    then tried to drop the rows where CON was not duplicated
    Jan_19 %>% !distinct(CON, .keep_all = TRUE)
    Error in distinct(CON, .keep_all = TRUE) : object ‘CON’ not found

    any advise? Thanks for the codes, quite useful

      Kassambara

      23 Jul 2019

      You can use the following R code:

      library(dplyr)
      Jan_19 %>% distinct(CON, .keep_all = TRUE)

    Andreas Rybicki

    20 Jan 2020

    Kassambara,

    the lesson “Identify and Remove Duplicate Data in R” was extremely helpful for my task,

    Question:
    two dataframes like “iris”, say iris for Country A and B,
    the dataframes are quite large, up to 1 mio rows and > 10 columns,
    I’d like to check, whether a row in B contains the same input in A.
    E.g. in ‘iris’ row 102 == 143;
    let’s assume row 102 is in iris country_A and row 143 in iris…._B. How could I identify any duplicates in these two DF’s?
    I searched in stackexchange but didn’t find any helpful solution.
    Thks

    Zbig

    15 Apr 2020

    Now I have a slightly harder task:
    what to do if I want to remove only subsequent, immediate duplicates, but if they are divided by something I want to preserve them.
    Example: you have a data frame with object id, time and the place where it happened:
    df <- data.frame(id=c(1,1,1,2,2,2), time=rep(1:3, 2), place=c(1,2,1,1,1,2))
    and I would like to extract paths of these object – for example object 1 was at place 1, then 2, then back to 1 – and I would like to preserve that in data so that later I can see that it moved from 1 to 2 and then from 2 to 1
    any ideas?

      Kassambara

      15 Apr 2020

      If you want to keep distinct rows based on multiple columns, you can proceed as follows:

      library(dplyr)
      df <- data.frame(id=c(1,1,1,2,2,2), time=rep(1:3, 2), place=c(1,2,1,1,1,2))
      df %>% distinct(id, time, place, .keep_all = TRUE)

    Moses

    20 May 2020

    You are always on point! A quick one…
    What are the major check points in data management? I know there are duplicates, missing data, ….

      Kassambara

      20 May 2020

      Thank you for your feedback, highly appreciated. There is also Outlier identifications

    Hasan Ayouby

    21 May 2020

    How to permanently remove the duplicates? because once i used this function, it acts only like a filter. but the original table stays intact.

      Kassambara

      22 May 2020

      To overwrite your original data, type this:

      my_data = my_data %>% distinct(Sepal.Length, .keep_all = TRUE)

Teacher

Alboukadel Kassambara
Role : Founder of Datanovia
  • Website : https://www.datanovia.com/en
  • Experience : >10 years
  • Specialist in : Bioinformatics and Cancer Biology
