Calculation of match quality
To compute the quality of a match, ONTRAM compares texts according to the following procedure. Matches with a quality below the 50% limit are not considered. The recommended quality limit is at least 60%.
Determination of the Weighted Total Length
- The text is split into small logical units.
- Each unit is assigned a type with a weighting (see table "weighting").
- The character length of each unit is multiplied by the weighting.
- The values from step three are added up across all units to give the weighted text length of the text.
- Steps one to four are now carried out for the second text.
- The weighted text lengths of text one and text two are added together.
Result: the weighted total length of both texts has been calculated.
Example
Texts to be compared:
- Text one: 80kg flour
- Text two: 80kg sugar
| Step | Result |
| --- | --- |
| 1 | 80, kg, flour |
| 2 | 80 = number (75), kg = unit (70), flour = word (100) |
| 3 | 2*75, 2*70, 5*100 |
| 4 | Weighted text length text one = 150 + 140 + 500 = 790 |
| 5 | Weighted text length text two = 2*75 + 2*70 + 5*100 = 790 |
| 6 | Weighted total length = 790 + 790 = 1580 |
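The steps above can be sketched in Python as follows. This is a minimal illustration, not ONTRAM's implementation: the tokenizer `split_units`, the classifier `classify`, and the `KNOWN_UNITS` set are hypothetical simplifications, and only the three weightings needed for the example are included. Counting the characters of the English words (5 each for flour and sugar) gives a weighted length of 790 per text and a weighted total length of 1580.

```python
import re

# Default weightings for the three unit types used in the example
# (taken from the weighting table; other types are omitted here).
WEIGHTS = {"number": 75, "unit": 70, "word": 100}

# Hypothetical unit list; the real classification rules are an assumption.
KNOWN_UNITS = {"kg", "g", "l", "ml"}


def split_units(text):
    """Split text into runs of digits and runs of letters (assumed tokenizer)."""
    return re.findall(r"\d+|[^\W\d_]+", text)


def classify(unit):
    """Assign a type to a unit using simplified, assumed rules."""
    if unit.isdigit():
        return "number"
    if unit.lower() in KNOWN_UNITS:
        return "unit"
    return "word"


def weighted_length(text):
    """Sum of character length times weighting over all units of the text."""
    return sum(len(u) * WEIGHTS[classify(u)] for u in split_units(text))


length_one = weighted_length("80kg flour")
length_two = weighted_length("80kg sugar")
weighted_total = length_one + length_two
print(length_one, length_two, weighted_total)
```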
Calculation of the change length
- The text is split into small logical units.
- Each unit is assigned a type with a weighting.
- The difference between the corresponding units of both texts is calculated as a percentage, using an algorithm based on the Levenshtein distance.
- The difference is multiplied by the combined weighted character length of the respective units of both texts.
- The values are added together.
Result: the change length has been calculated.
Example
Texts to be compared:
- Text one: 80kg flour
- Text two: 80kg sugar
| Step | Result |
| --- | --- |
| 1 | Text one: 80, kg, flour; text two: 80, kg, sugar |
| 2 | 80 = number (75), 80 = number (75); kg = unit (70), kg = unit (70); flour = word (100), sugar = word (100) |
| 3 | 80 : 80 = Levenshtein distance 0, difference 0%; kg : kg = Levenshtein distance 0, difference 0%; flour : sugar = Levenshtein distance 4, difference 80% |
| 4 | 0% * (2*75 + 2*75) = 0; 0% * (2*70 + 2*70) = 0; 80% * (5*100 + 5*100) = 800 |
| 5 | Change length = 0 + 0 + 800 = 800 |
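The change length calculation can be sketched as follows. The text only says that the algorithm is based on the Levenshtein distance, so the normalization used here (distance divided by the length of the longer unit) is an assumption, as is the position-by-position pairing of units; the function names are illustrative. For flour and sugar the plain Levenshtein distance is 4, which under this assumption gives the 80% difference used in the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def difference(a, b):
    """Assumed normalization: distance divided by the longer unit's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def change_length(units_one, units_two):
    """Units are (text, weighting) pairs, paired position by position (assumed)."""
    total = 0.0
    for (text_a, weight_a), (text_b, weight_b) in zip(units_one, units_two):
        combined = len(text_a) * weight_a + len(text_b) * weight_b
        total += difference(text_a, text_b) * combined
    return total


text_one = [("80", 75), ("kg", 70), ("flour", 100)]
text_two = [("80", 75), ("kg", 70), ("sugar", 100)]
print(change_length(text_one, text_two))
```

Texts with a different number of units would need an alignment step before pairing, which this sketch leaves out.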
Determination of match quality
The match quality is calculated using the formula: (weighted total length - change length) / weighted total length
Example
Match quality = (1580 - 800) / 1580 ≈ 49%
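As a short sketch, the formula and the quality limits mentioned at the start can be written as below. The constant and function names are illustrative, and the handling of two empty texts is an assumption. With the example values the quality is just under the 50% limit, so this match would not be considered.

```python
MINIMUM_QUALITY = 0.50      # matches below this limit are not considered
RECOMMENDED_QUALITY = 0.60  # recommended quality limit


def match_quality(weighted_total, change):
    """(weighted total length - change length) / weighted total length."""
    if weighted_total == 0:
        return 1.0  # assumption: two empty texts count as identical
    return (weighted_total - change) / weighted_total


quality = match_quality(1580, 800)
print(f"{quality:.1%}")            # 49.4%
print(quality >= MINIMUM_QUALITY)  # False
```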
Weighting (default settings)
| Type | Weighting |
| --- | --- |
| Others | 20 |
| Word | 100 |
| Word numbers | 110 |
| Ambiguity | 100 |
| URL | 50 |
| email | 50 |
| Stop word | 80 |
| Abbreviation | 75 |
| Number | 75 |
| Unit | 70 |
| Punctuation | 30 |
| Day | 10 |