Appendix C
This appendix describes the problems found in the source data, fragment by fragment.
EduRS.xml Fragment 1 - Figure 18
- [F18H0] The middle name is entirely in lower case. If all name parts are converted to a single case (all upper or all lower) before comparison, this difference no longer matters.
- [F18H1] The MAJOR data field is empty. This field is needed to label the EducationalTraining instance properly, so this row will be rejected.
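The two checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ingest code itself: a row is assumed to be available as a dictionary keyed by field name, and the helper names names_match and accept_row are hypothetical.

```python
def names_match(a: str, b: str) -> bool:
    """Compare two name parts while ignoring letter case."""
    # casefold() is a more aggressive form of lower() intended for comparisons.
    return a.casefold() == b.casefold()


def accept_row(row: dict) -> bool:
    """Return False for a row whose MAJOR field is missing or blank."""
    # Assumption: a row is a dict keyed by field name, e.g. {"MAJOR": "Physics"}.
    major = row.get("MAJOR") or ""
    return bool(major.strip())
```

For example, names_match("lee", "LEE") is true, while accept_row({"MAJOR": ""}) is false, so that row would be rejected.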
EduRS.xml Fragment 2 - Figure 19
- [F19H0] The INSTITUTION field contains mixed-case characters, which will cause a mismatch if character-for-character equality is required for organizational matching. If all characters are converted to the same case before comparison, this is not a problem.
- [F19H1] The NETID field is missing. This causes uncertainty when comparing names. Because it is not uncommon for several distinct people to have exactly the same name parts, a token such as netid is well suited to disambiguating people: it is guaranteed to be unique even when a person’s name has changed due to choice, marriage or divorce. A missing netid weakens the association between a degree record and a person.
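The effect of a missing netid can be illustrated with a small Python sketch. It assumes that records and people are dictionaries, that the name fields are called FIRST, MIDDLE and LAST, and that match strength is reported as the labels "strong" and "weak". All of these names are hypothetical and chosen only for illustration.

```python
from typing import Optional

NAME_KEYS = ("FIRST", "MIDDLE", "LAST")  # hypothetical field names


def match_strength(record: dict, person: dict) -> Optional[str]:
    """Return 'strong' for a netid match, 'weak' for a name-only match, else None."""
    rec_netid = (record.get("NETID") or "").strip()
    per_netid = (person.get("NETID") or "").strip()
    if rec_netid and per_netid:
        # Both sides carry a netid, so it alone decides the match.
        return "strong" if rec_netid == per_netid else None
    # A netid is missing on at least one side: fall back to a
    # case-insensitive comparison of the name parts.
    same_name = all(
        (record.get(k) or "").strip().casefold()
        == (person.get(k) or "").strip().casefold()
        for k in NAME_KEYS
    )
    return "weak" if same_name else None
```

A "weak" result means only that the name parts agree. Since distinct people may share a name, such a match should be treated with less confidence than one based on netid.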
EduRS.xml Fragment 3 - Figure 20
- [F20H0] See Figure 18, highlight 0.
- [F20H1] The value contains extra embedded whitespace. The XPath function normalize-space deals effectively with this sort of issue.
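For readers working outside XPath, the behavior of normalize-space can be approximated in Python as follows. This is a minimal sketch: the function name normalize_space is our own, and it treats only the four XML whitespace characters (space, tab, carriage return, line feed) as whitespace, as XPath 1.0 does.

```python
import re

# Runs of XML whitespace: space, tab, carriage return, line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text: str) -> str:
    """Collapse whitespace runs to one space and trim both ends."""
    # After the substitution, any leading or trailing whitespace
    # has become a single space, which strip(" ") removes.
    return _XML_WS.sub(" ", text).strip(" ")
```

The same function also removes leading and trailing whitespace, so it covers values that differ only by a trailing space or newline.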
EduRS.xml Fragment 4 - Figure 21
- [F21H0] Trailing whitespace.
- [F21H1] Missing netid. See Figure 19, highlight 1.
EduRS.xml Fragment 5 - Figure 22
- [F22H0] Same as [F18H1]. This row will also be rejected.
- [F22H1] Same as [F19H1].
EduRS.xml Fragment 6 - Figure 23