Calculation of match quality

To compute the quality of a match, ONTRAM compares the two texts using the procedure below. Matches with a quality below the 50% limit are not considered. A quality limit of at least 60% is recommended.

Determination of the Weighted Total Length

  1. The text is split into small logical units.
  2. Each unit is assigned a type with a weighting (see table "weighting").
  3. The character length of each unit is multiplied by the weighting.
  4. The values from step three are summed over all units to give the weighted text length.
  5. Steps one to four are repeated for the second text.
  6. The weighted text lengths of text one and text two are added together.

Result: the weighted total length of both texts has been calculated.

Example

Texts to be compared:

  • Text one: 80kg flour
  • Text two: 80kg sugar

Step  Result
1     80, kg, flour
2     80 = number (75), kg = unit (70), flour = word (100)
3     2*75, 2*70, 5*100
4     Weighted text length of text one = 150 + 140 + 500 = 790
5     Weighted text length of text two = 2*75 + 2*70 + 5*100 = 790
6     Weighted total length = 790 + 790 = 1580
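
The six steps above can be sketched in a few lines of Python. This is only an illustration, not ONTRAM's implementation: the splitting and typing of units (steps one and two) is assumed to have happened already, so each text is given as a hand-written list of (unit, type) pairs. The names WEIGHTS and weighted_length are hypothetical, and both "flour" and "sugar" are counted with their five characters.

```python
# Subset of the default weightings that this example needs.
WEIGHTS = {"number": 75, "unit": 70, "word": 100}

def weighted_length(units):
    """Steps 3 and 4: character length times weighting, summed over all units."""
    return sum(len(text) * WEIGHTS[kind] for text, kind in units)

# Steps 1 and 2 are assumed to be done: units are already split and typed.
text_one = [("80", "number"), ("kg", "unit"), ("flour", "word")]
text_two = [("80", "number"), ("kg", "unit"), ("sugar", "word")]

# Steps 5 and 6: repeat for the second text and add both lengths.
total = weighted_length(text_one) + weighted_length(text_two)
print(total)  # 1580
```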

Calculation of the change length

  1. Both texts are split into small logical units.
  2. Each unit is assigned a type with a weighting.
  3. The difference between corresponding units of the two texts is calculated as a percentage, using an algorithm based on the Levenshtein distance.
  4. The difference is multiplied by the combined weighted character length of the respective units from both texts.
  5. The values from step four are added together.

Result: the change length has been calculated.

Example

Texts to be compared:

  • Text one: 80kg flour
  • Text two: 80kg sugar

Step  Result

1
  • 80
  • 80
  • kg
  • kg
  • flour
  • sugar
2
  • 80 = number (75)
  • 80 = number (75)
  • kg = unit (70)
  • kg = unit (70)
  • flour = word (100)
  • sugar = word (100)
3
  • 80 : 80 = Levenshtein distance 0, difference 0%
  • kg : kg = Levenshtein distance 0, difference 0%
  • flour : sugar = Levenshtein distance 4, difference 80%
4
  • 0% * (2*75 + 2*75) = 0
  • 0% * (2*70 + 2*70) = 0
  • 80% * (5*100 + 5*100) = 800
5
  • Change length = 0 + 0 + 800 = 800
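
As a rough sketch of steps three to five, the following Python computes the change length for the example. The text only says that the algorithm is "based on" the Levenshtein distance, so the normalisation used here (distance divided by the length of the longer unit) is an assumption, as is pairing units by position. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def difference(a, b):
    """Assumed normalisation: distance divided by the longer unit's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def change_length(pairs):
    """pairs holds (unit_one, unit_two, weighting), already aligned by position."""
    return sum(difference(a, b) * (len(a) + len(b)) * weight
               for a, b, weight in pairs)

pairs = [("80", "80", 75), ("kg", "kg", 70), ("flour", "sugar", 100)]
print(levenshtein("flour", "sugar"))  # 4
print(change_length(pairs))           # 800.0
```

With this normalisation, "flour" and "sugar" differ by 4 edits over 5 characters, which gives the 80% used in the example.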

Determination of match quality

The match quality is calculated using the formula: (weighted total length - change length) / weighted total length

Example

Match quality = (1580 - 800) / 1580 ≈ 49%

This is below the 50% limit, so the match would not be considered.
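
The formula translates directly into code. The sketch below is illustrative only: the function names, the explicit 50% constant and the handling of a zero total length are assumptions and are not taken from ONTRAM.

```python
MINIMUM_QUALITY = 0.50  # matches below this limit are not considered

def match_quality(weighted_total_length, change_length):
    """(weighted total length - change length) / weighted total length."""
    if weighted_total_length == 0:
        # Assumption: two empty texts have no defined quality.
        raise ValueError("weighted total length must be greater than zero")
    return (weighted_total_length - change_length) / weighted_total_length

def is_considered(quality, limit=MINIMUM_QUALITY):
    """True if the match reaches the quality limit."""
    return quality >= limit

quality = match_quality(1580, 800)
print(f"{quality:.0%}")        # 49%
print(is_considered(quality))  # False
```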

Weighting (default settings)

Type          Weighting
Others        20
Word          100
Word numbers  110
Ambiguity     100
URL           50
Email         50
Stop word     80
Abbreviation  75
Number        75
Unit          70
Punctuation   30
Day           10