Semantic MediaWiki's unconstrained schema approach allows users to define properties freely. With that freedom, conceptually identical or near-duplicate properties (similar properties) can arise and be used for value annotations without being detected by an agent engaged in a data curation[1] task.
Several methods can help prevent syntactic similarity issues in the first place, such as:
- Use of templates to formalize user input
- Use of #REDIRECT to build a pool of synonyms around a canonical property and allow them to be merged[2] into a coherent extension of a property's semantics.
Syntactically similar properties that turn out not to differ from each other should be cleaned up and removed during the task called semantic gardening. See an example of this on the sandbox wiki.[3] Semantic MediaWiki 2.5.0 introduced syntactic property similarity evaluation together with the special page "PropertyLabelSimilarity", which lists syntactic similarities between property labels and thereby assists with semantic gardening.[4]
Exemption
Configuration parameter $smwgSimilarityLookupExemptionProperty sets the property used to exclude a property from being evaluated during similarity checks. Annotating a property with it declares an exemption condition, meaning the annotated property is excluded from syntactic similarity evaluation against the stated property. By default this property is called "owl:differentFrom".
For example, on the property page "Governance level" one may annotate [[owl:differentFrom::Governance level of]], which suppresses the similarity lookup for the properties "Governance level" and "Governance level of" when they are compared to each other. The annotation makes it explicit that the two properties are syntactically similar but conceptually different, so they are not shown on special page "PropertyLabelSimilarity". See the respective example on the sandbox wiki.[5]
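The effect of such an exemption can be illustrated with a minimal Python sketch. It is not Semantic MediaWiki's implementation: the function name `similar_pairs`, the threshold of 0.8, and the use of `difflib.SequenceMatcher` as a stand-in similarity measure are all assumptions made for illustration, as are the "Has population" labels. Exemptions are treated as unordered pairs, mirroring the statement that the lookup is suppressed for both properties when compared to each other.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(labels, exemptions=(), threshold=0.8):
    """Return label pairs whose similarity ratio reaches the threshold.

    `exemptions` is an iterable of (label, other_label) pairs, standing in
    for "owl:differentFrom" annotations. Hypothetical helper, not SMW code.
    """
    # An exemption applies in both directions, so store it as an unordered pair.
    exempt = {frozenset(pair) for pair in exemptions}
    report = []
    for a, b in combinations(sorted(labels), 2):
        if frozenset((a, b)) in exempt:
            continue  # declared different: skip the similarity lookup entirely
        # Stand-in measure from the standard library, compared case-insensitively.
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            report.append((a, b))
    return report


labels = [
    "Governance level",
    "Governance level of",
    "Has population",
    "Has populations",
]

# Without exemptions both near-duplicate pairs are reported.
print(similar_pairs(labels))

# With the exemption declared, only the remaining pair is reported.
print(similar_pairs(labels, [("Governance level", "Governance level of")]))
```

Storing each exemption as a `frozenset` keeps the check symmetric, so it does not matter on which of the two property pages the annotation was made.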
Syntactic vs. semantic similarity
Syntactic similarity is understood as a function that "analyzes the syntactic similarity of a pair of tags" using the "Levenshtein Distance, the Cosine Similarity, the Jaccard Similarity, the Jaro Distance",[6] while semantic similarity analyzes the "semantic relations defined between tags as well as their frequency".[6]
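Two of the measures named in the quotation can be sketched in a few lines of Python. This is an illustrative sketch only and makes no claim about which measure or implementation Semantic MediaWiki uses internally; the function names and the choice of whitespace-separated word tokens for the Jaccard similarity are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased, whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty labels are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(levenshtein("Governance level", "Governance level of"))  # 3
print(jaccard("Governance level", "Governance level of"))
```

For the two labels from the exemption example, the edit distance is 3 (the appended " of") and the token sets share two of three distinct words, so both measures would flag the labels as syntactically close even though the properties differ conceptually.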
Example
- Property similarities and the resulting property similarity report on <sandbox.semantic-mediawiki.org>
See also
- Help page on semantic gardening
- Help page on property declaration
- Help page on property naming
- Help page on property uniqueness
- Help page on special page "PropertyLabelSimilarity"
- Help page on configuration parameter $smwgSimilarityLookupExemptionProperty
Notes
- Mikhail Bilenko, Raymond J Mooney. "Adaptive duplicate detection using learnable string similarity measures". ACM (2003): 39--48.
- Kenji Sagae, Andrew S Gordon. "Clustering words by syntactic similarity improves dependency parsing of predicate-argument structures". ACM (2009): 192--201.
- Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka. "Measuring semantic similarity between words using web search engines". 7 (2007): 757--766.
References
- [1] "...term used to indicate processes and activities related to the organization and integration of data collected from various sources, annotation of the data, and publication and presentation of the data..." from Data curation. (2017, February 13). In Wikipedia, The Free Encyclopedia. Retrieved 21:51, March 25, 2017.
- [2] Iulia Dănăilă, Liviu P Dinu, Vlad Niculae, Octavia-Maria Sulea. "String Distances for Near-duplicate Detection". Instituto Politécnico Nacional, Centro de Innovación y Desarrollo Tecnológico en Cómputo (2012): 21--25.
- [3] Semantic MediaWiki: GitHub pull request #2244 example #1
- [4] Semantic MediaWiki: GitHub pull request #2244
- [5] Semantic MediaWiki: GitHub pull request #2244 example #2
- [6] Nik Bessis, Fatos Xhafa (eds.); Richard Mordinyi, Eva Kühn (auth.). "Next Generation Data Technologies for Collective Computational Intelligence". Springer-Verlag Berlin Heidelberg (2011).