
Duplicate data is a ubiquitous problem in the data world. It often appears when data from different silos are consolidated, and it can be an issue in an analytic project based on data aggregated from various sources. The training data for a machine learning model may also contain duplicates, and unless they are removed they will have an adverse impact on model performance.

In this post we will go through a simple feed forward neural network based solution for finding duplicates. It is applicable to any structured data, whether a relational table or JSON data. The solution is available in the open source project avenir and could easily be reimplemented with TensorFlow.

The general approach is to pair a record with every other record and compute the similarity of the pair. If the similarity is above some threshold, the pair of records is considered duplicates. There are 2 steps in the similarity calculation:

1. Find the pairwise field similarities between the 2 records
2. Aggregate the field similarities into a record similarity

For numeric fields, the field similarity could simply be a normalized difference. For text field similarity there are various options. Edit distance algorithms are computationally intensive and do not scale well for long text. There are also vectorization based approaches, such as term (i.e. word) frequency inverse document frequency, or TF-IDF. I have chosen a character ngram based approach, which is more appropriate when the data is characterized by small typographical errors; a sketch of this field similarity step is shown after the list below. Compared to TF-IDF it has the following advantages:

- The vocabulary size is fixed and depends only on the number of characters in the ngram
- It is more tolerant of typographical errors in the data
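To make the field similarity step concrete, here is a minimal Python sketch, not the avenir implementation: it computes a normalized difference for numeric fields and a character ngram based cosine similarity for text fields. The function names, the trigram default and the `numeric_ranges` argument are illustrative assumptions.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Bag of character ngrams for a text value (illustrative trigram default)."""
    text = text.lower().strip()
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))

def text_field_similarity(a, b, n=3):
    """Cosine similarity between the character ngram count vectors of two text values."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def numeric_field_similarity(a, b, value_range):
    """1 minus the absolute difference normalized by the expected value range."""
    return 1.0 - min(abs(a - b) / value_range, 1.0)

def field_similarities(rec1, rec2, numeric_ranges):
    """Pairwise field similarities between two records given as dicts with the same keys."""
    sims = []
    for field in rec1:
        if field in numeric_ranges:
            sims.append(numeric_field_similarity(rec1[field], rec2[field], numeric_ranges[field]))
        else:
            sims.append(text_field_similarity(str(rec1[field]), str(rec2[field])))
    return sims

# Example: two near duplicate customer records with a small typo in the name
r1 = {"name": "John Smith", "email": "jsmith@example.com", "age": 42}
r2 = {"name": "Jon Smith", "email": "john.smith@gmail.com", "age": 42}
print(field_similarities(r1, r2, numeric_ranges={"age": 100}))
```

A single typo only perturbs a few ngrams, so the cosine similarity degrades gracefully, which is why the character ngram representation is more tolerant of typographical errors than word level TF-IDF.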

Field Similarity Aggregation with Machine Learning

The second task in finding record similarity is aggregation of the field similarities. There are various techniques and popular metrics for this aggregation, but they all treat the fields equally, which may not be appropriate for highly nuanced business data. We would like to assign varying weights to the different fields; for example, the email ID may differ for the same customer when the data comes from different sources, so the email field should not count as heavily as the others. Instead of guessing the weights, we can train a neural network for regression. The training data consists of pairs of records and a target value: if the two records are duplicates the target value is 1, and 0 otherwise. For a positive case, the two records are near duplicates.
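As an illustration of this regression based aggregation, here is a minimal sketch of a small feed forward network in Keras (the post notes the solution could be reimplemented with TensorFlow). It takes the vector of field similarities for a record pair and outputs a record similarity between 0 and 1. The layer sizes, activations and optimizer are illustrative assumptions, not the configuration used in avenir.

```python
import tensorflow as tf

def build_similarity_model(num_fields):
    """Feed forward regression network: field similarity vector in,
    record similarity score in [0, 1] out."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(num_fields,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Training data: X holds one row of field similarities per record pair,
# y holds the target value, 1 for duplicate (or near duplicate) pairs and 0 otherwise.
# model = build_similarity_model(num_fields=X.shape[1])
# model.fit(X, y, epochs=50, batch_size=32, validation_split=0.2)
```

The sigmoid output keeps the predicted similarity in the same 0 to 1 range as the target values, and the learned weights of the first layer take the place of the hand picked per field weights discussed above.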

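Finally, a rough sketch of how the pieces could fit together at prediction time, following the pairing and thresholding approach described earlier. It reuses the illustrative `field_similarities` helper and a trained model from the previous sketches, and the 0.8 threshold is an arbitrary assumption.

```python
from itertools import combinations
import numpy as np

def find_duplicates(records, model, numeric_ranges, threshold=0.8):
    """Pair every record with every other record, predict the record similarity
    for each pair, and flag pairs above the threshold as likely duplicates."""
    # field_similarities() is the helper from the field similarity sketch above
    pairs = list(combinations(range(len(records)), 2))
    X = np.array([field_similarities(records[i], records[j], numeric_ranges)
                  for i, j in pairs], dtype=np.float32)
    scores = model.predict(X, verbose=0).ravel()
    return [(i, j, float(s)) for (i, j), s in zip(pairs, scores) if s >= threshold]
```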