$smwgFulltextLanguageDetection
From semantic-mediawiki.org
Configuration parameter details:
Name | $smwgFulltextLanguageDetection
Description | Sets which languages to detect for the full-text search from an indexable text
Default setting | []
Software | Semantic MediaWiki
Since version | 2.5.0
Until version | still available
Configuration | Full-text search · Experimental
Keyword | full-text search · data store · relational database · sql store · sql database · experimental
$smwgFulltextLanguageDetection is a configuration parameter that sets which languages to detect for the full-text search from an indexable text. It was introduced in Semantic MediaWiki 2.5.0 (released on 14 March 2017 and compatible with MW 1.23.0 - 1.29.x).[1]
- The feature controlled by this configuration parameter is experimental.
- This setting only takes effect if the full-text search feature is enabled.
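As a rough illustration of that prerequisite, enabling the feature in "LocalSettings.php" might look like the following. The parameter name $smwgEnabledFulltextSearch is not stated on this page; it is assumed here from the full-text search help page and should be verified against your Semantic MediaWiki version.

```php
<?php
// Assumption: $smwgEnabledFulltextSearch is the switch for the full-text
// search feature; it is not named on this page, so check the help page on
// full-text search before relying on it.
$smwgEnabledFulltextSearch = true;

// Language detection is only consulted once the feature above is enabled.
$smwgFulltextLanguageDetection = [
    'TextCatLanguageDetector' => [ 'en', 'de' ]
];
```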
Default setting
$smwgFulltextLanguageDetection = [];
This means that by default language detection is disabled.
Available language detectors
TextCatLanguageDetector
: Allows for "N-Gram-Based Text Categorization" via TextCat[2] and relies on the "wikimedia-textcat" utility.[3]
CdbNGramLanguageDetector
: Allows for "N-Gram-Based Text Categorization"[2] via the "constant database".[4]
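The examples on this page only show TextCatLanguageDetector. A minimal sketch of selecting the other detector, assuming it accepts the same shape (detector name mapped to a list of language codes), could look like this. The language list is purely illustrative.

```php
<?php
// Assumption: CdbNGramLanguageDetector takes the same array structure as
// TextCatLanguageDetector; this page does not show an example for it.
$smwgFulltextLanguageDetection = [
    'CdbNGramLanguageDetector' => [
        'en',
        'de'
    ]
];
```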
Changing the default setting
Changing the content of this configuration parameter requires running the maintenance script "rebuildFulltextSearchTable.php", which rebuilds the full-text search data table.
To change this configuration parameter, add one of the following settings to your "LocalSettings.php" file after the enableSemantics() call:
- Allow major Western European languages to be detected
$smwgFulltextLanguageDetection = [
    'TextCatLanguageDetector' => [
        'de',
        'en',
        'es',
        'fr',
        'pt'
    ]
];
- Allow major East Asian languages to be detected (MySQL 5.7+)
$smwgFulltextLanguageDetection = [
    'TextCatLanguageDetector' => [
        'ja',
        'zh',
        'ko'
    ]
];
- A long list of languages slows down language detection on free text, so add languages sparingly.
- This configuration parameter should only hold one language detector at a time.
- Stopwords are only applied after language detection has been enabled.
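The first two notes above can be turned into a quick sanity check. The sketch below is a hypothetical helper, not part of Semantic MediaWiki: the function name checkLanguageDetectionConfig() and the $maxLanguages threshold are assumptions made for illustration, and the threshold value is arbitrary.

```php
<?php
// Hypothetical helper (not part of Semantic MediaWiki): returns a list of
// warnings for an array shaped like $smwgFulltextLanguageDetection.
function checkLanguageDetectionConfig( array $config, int $maxLanguages = 5 ): array {
    $warnings = [];

    // The parameter should hold only one language detector at a time.
    if ( count( $config ) > 1 ) {
        $warnings[] = 'Only one language detector should be configured.';
    }

    foreach ( $config as $detector => $languages ) {
        // Each detector is expected to map to a list of language codes.
        if ( !is_array( $languages ) ) {
            $warnings[] = "Detector '$detector' should map to an array of language codes.";
            continue;
        }

        // $maxLanguages is an arbitrary, illustrative threshold.
        if ( count( $languages ) > $maxLanguages ) {
            $warnings[] = "Detector '$detector' lists " . count( $languages )
                . ' languages; long lists slow down detection.';
        }
    }

    return $warnings;
}

// Usage: print any warnings for the Western European example from this page.
$warnings = checkLanguageDetectionConfig( [
    'TextCatLanguageDetector' => [ 'de', 'en', 'es', 'fr', 'pt' ],
] );

foreach ( $warnings as $warning ) {
    echo $warning, PHP_EOL;
}
```

Such a check could be run once before rebuilding the full-text search table. It only inspects the array shape and does not verify that a detector or language code is actually supported.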
See also
- Help page on full-text search
References
- [1] Semantic MediaWiki: GitHub pull request gh:smw:1481
- [2] TextCat, an implementation of the text categorization algorithm presented in Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization"
- [3] "wikimedia-textcat", a TextCat language guesser utility ported to PHP
- [4] CDB, short for "constant database", refers to a very fast and highly reliable database system which uses a simple file with key-value pairs