
Calculating similarity between two records with Ruby and Elasticsearch
I recently had to find similar records in a dataset in order to detect potential duplicates, for example:

    "John Doe 123456789"
    "John Foe 123123123"

After considering a couple of options, I decided to go with Elasticsearch, as it was already integrated into the project I was working on. The Ruby client for Elasticsearch provides a useful method on search results, records.each_with_hit, that I could abuse for this situation:

    File.open("some_file_path", "w") do |file|
      User.all.each do |u|
        # Wildcard query on the full name; the hits include u itself.
        response = User.search "*#{u.full_name}*"
        file.write("Searching for: #{u.id_number}-#{u.full_name}\n")

        response.records.each_with_hit do |record, hit|
          # Skip the record we searched for.
          next if record.id_number == u.id_number
          # Keep only hits that score above the threshold.
          next unless hit._score.to_f > 1.2

          file.write(" #{record.id_number}-#{record.full_name}: #{hit._score}\n")
        end

        file.write("***********************************\n\n")
      end
    end

The 1.2 value in the script is a minimum relevance score and can be adjusted to your needs. It roughly represents how close two values must be before they are reported. Decrease it if you want your results to cover a wider range of records, or increase it if you are only interested in values that look very similar. Elasticsearch scores are not normalized, so a suitable cutoff depends on your own index and query.

Here is a sample output for the script: ...
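To get a feel for how the threshold changes the result set without running Elasticsearch, here is a minimal sketch in plain Ruby. The Candidate struct, the likely_duplicates helper, and the scores are all hypothetical and invented for illustration; in the real script the record and score pairs come from each_with_hit.

```ruby
# Hypothetical stand-in for a (record, hit) pair; the scores below are
# made up and are not real Elasticsearch output.
Candidate = Struct.new(:id_number, :full_name, :score)

# Returns the candidates scoring above min_score, skipping the record
# that was searched for (matched by id_number).
def likely_duplicates(candidates, own_id:, min_score:)
  candidates
    .reject { |c| c.id_number == own_id }
    .select { |c| c.score > min_score }
end

candidates = [
  Candidate.new("123456789", "John Doe", 2.4),
  Candidate.new("123123123", "John Foe", 1.5),
  Candidate.new("987654321", "Jane Roe", 0.6)
]

# A lower threshold lets more (and less similar) records through.
loose  = likely_duplicates(candidates, own_id: "123456789", min_score: 0.5)
# A higher threshold keeps only the closest matches.
strict = likely_duplicates(candidates, own_id: "123456789", min_score: 1.2)

puts loose.map(&:full_name).inspect
puts strict.map(&:full_name).inspect
```

With these invented scores, the loose threshold reports both other users, while the 1.2 threshold reports only the closest one.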