Remove duplicate records based on multiple columns?

Asked 2/1, 2013 at 15:0 Answered 23/1, 2023 at 15:8

Solved ruby-on-rails-3 activerecord duplicates destroy

I'm using Heroku to host my Ruby on Rails application and for one reason or another, I may have some duplicate rows.

Is there a way to delete duplicate records based on 2 or more criteria but keep just 1 record of that duplicate collection?

In my use case, I have a Make and Model relationship for cars in my database.

Make      Model
---       ---
Name      Name
          Year
          Trim
          MakeId

I'd like to delete all Model records that have the same Name, Year and Trim but keep 1 of those records (meaning, I need the record but only once). I'm using Heroku console so I can run some active record queries easily.

Any suggestions?

Transmit answered 2/1, 2013 at 15:0 Comment(0)

155

class Model

  def self.dedupe
    # find all models and group them on keys which should be common
    grouped = all.group_by{|model| [model.name,model.year,model.trim,model.make_id] }
    grouped.values.each do |duplicates|
      # the first one we want to keep right?
      first_one = duplicates.shift # or pop for last one
      # if there are any more left, they are duplicates
      # so delete all of them
      duplicates.each{|double| double.destroy} # duplicates can now be destroyed
    end
  end

end

Model.dedupe

Find All
Group them on keys which you need for uniqueness
Loop on the grouped model's values of the hash
remove the first value because you want to retain one copy
delete the rest

Hydrolytic answered 2/1, 2013 at 15:15 Comment(4)

This is in the Model model? – Anguished 3/7, 2015 at 17:18

@meetalexjohnson it should be in whatever activerecord model you have. – Hydrolytic 4/7, 2015 at 0:35

Interesting method but a little inefficient with large set of records. Wondering if there is a way to do it with active record it self. – Teratogenic 30/12, 2015 at 16:58

Works but extremely inefficient for large data sets. A much faster way is to use this algo to collect the ids in an array first and then use one DELETE FROM sql statement to delete the array of ids. – Supposition 2/6, 2016 at 14:44

If your User table data like below

User.all =>
[
    #<User id: 15, name: "a", email: "[email protected]", created_at: "2013-08-06 08:57:09", updated_at: "2013-08-06 08:57:09">, 
    #<User id: 16, name: "a1", email: "[email protected]", created_at: "2013-08-06 08:57:20", updated_at: "2013-08-06 08:57:20">, 
    #<User id: 17, name: "b", email: "[email protected]", created_at: "2013-08-06 08:57:28", updated_at: "2013-08-06 08:57:28">, 
    #<User id: 18, name: "b1", email: "[email protected]", created_at: "2013-08-06 08:57:35", updated_at: "2013-08-06 08:57:35">, 
    #<User id: 19, name: "b11", email: "[email protected]", created_at: "2013-08-06 09:01:30", updated_at: "2013-08-06 09:01:30">, 
    #<User id: 20, name: "b11", email: "[email protected]", created_at: "2013-08-06 09:07:58", updated_at: "2013-08-06 09:07:58">] 
1.9.2p290 :099 >

Email id's are duplicate, so our aim is to remove all duplicate email ids from user table.

Step 1:

To get all distinct email records id.

ids = User.select("MIN(id) as id").group(:email,:name).collect(&:id)
=> [15, 16, 18, 19, 17]

Step 2:

To remove duplicate id's from user table with distinct email records id.

Now the ids array holds the following ids.

[15, 16, 18, 19, 17]
User.where("id NOT IN (?)",ids)  # To get all duplicate records
User.where("id NOT IN (?)",ids).destroy_all

** RAILS 4 **

ActiveRecord 4 introduces the .not method which allows you to write the following in Step 2:

User.where.not(id: ids).destroy_all

Footworn answered 6/8, 2013 at 9:11 Comment(1)

This is dangerous: running it again when you don't have dups will delete more than you want because the logic is "delete everything except D". I think the better logic is "delete everything in D", where D is the list of ids of duplicated rows. – Stowage 29/4, 2020 at 19:13

Similar to @Aditya Sanghi 's answer, but this way will be more performant because you are only selecting the duplicates, rather than loading every Model object into memory and then iterating over all of them.

# returns only duplicates in the form of [[name1, year1, trim1], [name2, year2, trim2],...]
duplicate_row_values = Model.select('name, year, trim, count(*)').group('name, year, trim').having('count(*) > 1').pluck(:name, :year, :trim)

# load the duplicates and order however you wantm and then destroy all but one
duplicate_row_values.each do |name, year, trim|
  Model.where(name: name, year: year, trim: trim).order(id: :desc)[1..-1].map(&:destroy)
end

Also, if you truly don't want duplicate data in this table, you probably want to add a multi-column unique index to the table, something along the lines of:

add_index :models, [:name, :year, :trim], unique: true, name: 'index_unique_models'

Sanious answered 12/1, 2016 at 7:41 Comment(0)

You could try the following: (based on previous answers)

ids = Model.group('name, year, trim').pluck('MIN(id)')

to get all valid records. And then:

Model.where.not(id: ids).destroy_all

to remove the unneeded records. And certainly, you can make a migration that adds a unique index for the three columns so this is enforced at the DB level:

add_index :models, [:name, :year, :trim], unique: true

Leapfrog answered 6/4, 2018 at 17:33 Comment(2)

Am I missing something? Wouldn't the second code block here just clear the whole table except for the ids found in the first code block? – Caisson 20/4, 2020 at 17:7

That's what the OP was looking for, deleting all duplicates - the first method gets you all of the non-dupes – Digital 29/4, 2020 at 7:24

To run it on a migration I ended up doing like the following (based on the answer above by @aditya-sanghi)

class AddUniqueIndexToXYZ < ActiveRecord::Migration
  def change
    # delete duplicates
    dedupe(XYZ, 'name', 'type')

    add_index :xyz, [:name, :type], unique: true
  end

  def dedupe(model, *key_attrs)
    model.select(key_attrs).group(key_attrs).having('count(*) > 1').each { |duplicates|
      dup_rows = model.where(duplicates.attributes.slice(key_attrs)).to_a
      # the first one we want to keep right?
      dup_rows.shift

      dup_rows.each{ |double| double.destroy } # duplicates can now be destroyed
    }
  end
end

Stretchy answered 22/3, 2016 at 14:25 Comment(1)

You could add model.unscoped to queries to avoid being caught on default scope not being present on the current group query. – Pomatum 29/10, 2018 at 17:10

Based on @aditya-sanghi's answer, with a more efficient way to find duplicates using SQL.

Add this to your ApplicationRecord to be able to deduplicate any model:

class ApplicationRecord < ActiveRecord::Base
  # …

  def self.destroy_duplicates_by(*columns, order: :id)
    groups = select(columns).group(columns).having(Arel.star.count.gt(1))
    groups.each do |duplicate|
      records = where(duplicate.attributes.symbolize_keys.slice(*columns))
      records.limit(nil).order(order).offset(1).destroy_all
    end
  end
end

You can then call destroy_duplicates_by to destroy all records (except the first) that have the same values for the given columns. For example:

Model.destroy_duplicates_by(:name, :year, :trim)

Iz answered 20/5, 2020 at 11:34 Comment(0)

I chose a slightly safer route (IMHO). I started by getting all the unique records.

    ids = Model.where(other_model_id: 1).uniq(&:field).map(&:id)

Then I got all the ids

    all_ids = Model.where(other_model_id: 1).map(&:id)

This allows me to do a matrix subtraction for the duplicates

    dups = all_ids - ids

I then map over the duplicate ids and fetch the model because I want to ensure I have the records I am interested in.

    records = dups.map do |id| Model.find(id) end

When I am sure I want to delete, I iterate again to delete.

    records.map do |record| record.delete end

When deleting duplicate records on a production system, you want to be very sure you are not deleting important live data, so in this process, I can double-check everything.

So in the case above:

    all_ids =  Model.all.map(&:ids)

    uniq_ids = Model.all.group_by do |model|
      [model.name, model.year, model.trim] 
    end.values.map do |duplicates|
      duplicates.first.id
    end

    dups = all_ids - uniq_ids

    records = dups.map { |id| Model.find(id) }

    records.map { |record| record.delete }

or something like this.

Showbread answered 23/1, 2023 at 15:8 Comment(0)

-3

You can try this sql query, to remove all duplicate records but latest one

DELETE FROM users USING users user WHERE (users.name = user.name AND users.year = user.year AND users.trim = user.trim AND users.id < user.id);

Smriti answered 1/6, 2015 at 7:41 Comment(1)

This will remove all. – Distract 7/12, 2017 at 13:4

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags