Modak Analytics

Data fingerprinting

Accelerating Data Mapping and Unification using Fingerprints

In big data projects, we try to standardize column names, and we usually do so by applying fuzzy matching to the metadata. Metadata alone is not always reliable, though, because the same data can appear under different names in different tables. We therefore have to drill down into the data itself. To standardize the column names and unify the data into the right columns, data fingerprinting becomes important.
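As a rough illustration of fuzzy matching on metadata, the sketch below scores a column name against a list of standard names using Python's standard-library `difflib`. The function names, the sample column names, and the 0.6 threshold are assumptions made for this example; they are not taken from any particular product.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0-to-1 similarity ratio between two column names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_standard_name(column, standard_names, threshold=0.6):
    """Pick the closest standard name, or None if nothing is close enough.

    The threshold of 0.6 is an arbitrary, illustrative cut-off.
    """
    scored = [(name_similarity(column, s), s) for s in standard_names]
    score, name = max(scored)
    return name if score >= threshold else None


standard = ["customer_name", "order_date"]
print(best_standard_name("cust_name", standard))  # closest standard name
print(best_standard_name("xyz", standard))        # no usable match
```

A name such as "xyz" gives the matcher nothing to work with, which is the situation where looking at the data itself becomes necessary.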




Why is data fingerprinting useful?

Data fingerprinting compares column values across different tables and generates a hash code for each column. Regardless of how a column is labelled in each table, if the columns share the same data, an algorithm produces a score from 0 to 1 that indicates how much of the data matches. Based on that score, the columns are mapped to each other and their data is merged.

For example, suppose different tables label a column as “col”, “column”, and “col1”, but the data in those columns is the same. The data is checked, a hash is generated for each column, a score between 0 and 1 is computed, and the columns are then mapped to each other and merged.
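The text does not say which hashing or scoring algorithm is used, so the following is only a minimal sketch under stated assumptions: each column's fingerprint is taken to be the set of SHA-256 digests of its normalized values, and the score is the Jaccard similarity of two fingerprints. The function names and sample values are hypothetical.

```python
import hashlib


def fingerprint(values):
    """Set of SHA-256 digests of the normalized, non-null values in a column.

    Normalization (trim + lowercase) is an assumption for this sketch.
    """
    return {
        hashlib.sha256(str(v).strip().lower().encode("utf-8")).hexdigest()
        for v in values
        if v is not None
    }


def match_score(fp_a, fp_b):
    """Jaccard similarity: 0.0 means no shared values, 1.0 means identical sets."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Three differently labelled columns from three hypothetical tables.
col = ["alice", "bob", "carol"]
column = ["Alice", "bob ", "carol"]
col1 = ["alice", "bob", "dave"]

print(match_score(fingerprint(col), fingerprint(column)))  # 1.0
print(match_score(fingerprint(col), fingerprint(col1)))    # 0.5
```

Comparing sets of digests rather than raw values keeps the comparison independent of column names, and columns whose score exceeds a chosen cut-off could then be treated as candidates for mapping and merging.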