This is a comparison table, or cheat sheet, of the matching strategies available, depending on the search engine and configuration used:
- SMW with the standard setup of SQLStore. The documentation in 'Selecting pages' and child pages usually refers to the standard setup.
- SMW with Full-Text Search (FTS) enabled for the SQLStore (still considered experimental). A separate guide is available.
- SMW with ElasticStore, using Elasticsearch as the search engine. A guide is available on GitHub.
- Of these three, ElasticStore (ES) is the most flexible, allowing for a wider range of configurations, e.g. different analyzers for different data types. Some alternative options may not have been taken into account yet. For the default settings and examples of other possibilities, see DefaultSettings.php and these JSON files.
Not yet included in this comparison is SPARQLStore, details for which are currently lacking.
Feature | Standard SQLStore | Full-Text Search | ElasticStore |
---|---|---|---|
Properties indexed | All, incl. type Text, Page and URL | Configurable. Default: user-defined properties of type Text and URL.[1] | All, incl. type Text, Page and URL |
Regular | |||
LIKE/NOT LIKE operators | ~ / !~ and like: / nlike: (standard); like: / nlike: (fallback if FTS is enabled)[2] | ~ / !~ [3] | ~ / !~ and like: / nlike: |
Wide proximity operators | Unsupported | ~~ / !~~ (two tildes, not one) | ~~ / !~~ ; also supported by in: and phrase: |
Wildcards | standalone + (any value); * (0 or more)[4]; ? (any single character) | standalone + (any value); * (0 or more); ? (any single character) | standalone + (any value); * (0 or more; see also in: ); ? (any single character) |
in: (shorthand) | Unsupported | Unsupported | Equivalent to ~* ... * (with named property), or ~~* ... * (wide proximity) |
Tokenisation | No | Yes | Yes (different tokenizers available) |
Maximum searchable string length | First 255 characters per value (type Page); first 40 or 72, or 300 characters (type Text)[5] | ? per token[6] | Maximum token length configurable (Length token filter) |
Supported wildcard (* / ? ) positions | start, middle, end (of value) | end (of token) only[7] | start, middle, end (of token) |
Match only at beginning/end of property value | Supported | Unsupported (tokens or phrases only) | Unsupported (tokens or phrases only) except with e.g. Keyword tokenizer.[8] |
Meaning of whitespace between characters | part of string | token delimiter | token delimiter (usually, but see Keyword tokenizer) |
Case folding | No | Yes | Possible (lowercase filter) |
Accent folding[9] | No | Yes | Possible (asciifolding filter) |
Features unique to tokenisation | |||
Minimum token length | - | Configurable (default: 3)[10] | Configurable (Length token filter) |
Boolean operators (+ / - ) | - | Yes. Caveat: do not apply them to strings below the minimum token length (Runtime Exception). | Yes |
Stopword filter | - | Yes | Yes (Stop token filter) |
CJK support | - | Yes (onoi/tesa; see documentation) | Yes (CJK language analyzer) |
Phrase matching | |||
Phrase matching | Already the default. Matching is exact. | Use double quotes (" ... " ). Case/accent-insensitive. Does not support wildcards. | Use double quotes, or phrase: (see below). Level of precision depends on case/accent-sensitivity. Does not support wildcards. |
phrase: (shorthand) | Unsupported | Unsupported | Equivalent to ~" ... " , or ~~" ... " (wide proximity) |
Other features | |||
Search highlighting with #-hl | Yes | Yes | Yes |
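
The difference between untokenised matching (standard SQLStore) and tokenised matching (Full-Text Search, ElasticStore) summarised in the table can be illustrated with a small Python sketch. It is a simplified model only, not the code used by SMW, the SQL backend, or Elasticsearch. The function names, the `\w+` tokenizer and the default minimum token length of 3 are assumptions chosen for illustration, and the token matcher models only the FTS column, where a wildcard is honoured at the end of a token.

```python
import re
from typing import List


def wildcard_to_regex(pattern: str) -> str:
    """Translate * (0 or more characters) and ? (one character) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def match_whole_value(value: str, pattern: str) -> bool:
    """Untokenised model: the pattern must cover the entire stored value.

    Whitespace is part of the string, matching is case-sensitive, and
    wildcards may appear at the start, middle, or end of the value.
    """
    regex = wildcard_to_regex(pattern)
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def tokenize(text: str, min_token_length: int = 3) -> List[str]:
    """Hypothetical tokenizer: lowercase, split on non-word characters,
    and drop tokens shorter than the minimum token length."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if len(t) >= min_token_length]


def match_tokens(value: str, term: str, min_token_length: int = 3) -> bool:
    """Tokenised model: the term is compared against individual tokens.

    A trailing * turns the term into a token prefix; a wildcard in any
    other position is not interpreted (it is treated literally).
    """
    term = term.lower()
    tokens = tokenize(value, min_token_length)
    if term.endswith("*"):
        prefix = term[:-1]
        return any(t.startswith(prefix) for t in tokens)
    return term in tokens


value = "Semantic MediaWiki search"
print(match_whole_value(value, "*Media*"))   # wildcard at both ends of the value
print(match_whole_value(value, "media*"))    # anchored at the start of the value
print(match_tokens(value, "mediawiki"))      # case-folded token match
print(match_tokens(value, "media*"))         # wildcard at the end of a token
print(match_tokens(value, "*wiki"))          # leading wildcard is not interpreted
```

In this model, whitespace acts as a token delimiter only in `match_tokens`, which is why a lower-case single word can match there but not in `match_whole_value` without surrounding wildcards.
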
- ↑ The data types to be indexed are set in $smwgFulltextSearchIndexableDataTypes. The setting defaults to the constants SMW_FT_BLOB (used for type Text) and SMW_FT_URI (used for type URL). Specific properties can be exempted from indexing.
- ↑ No such fallback is available for ElasticStore.
- ↑ However, behaviour falls back to regular SQL if the property is not indexed.
- ↑ In a very early version of SMW, * may have meant 1 or more.
- ↑ See the documentation on search operators.
- ↑ Tokenisation comes with the benefit that length restrictions pertaining to the full string no longer apply. Even Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu (85 characters) should be fine.
- ↑ The CONTAINS predicate does not allow for other positions.
- ↑ The tokenizer does record the "order or position of each term", but it is unclear if this information is or can be used for anything other than "phrase and word proximity queries".
- ↑ Accent folding is a common technique that maps Unicode characters to their ASCII equivalents so that a query can be agnostic of any diacritics or accents being used.
- ↑ $smwgFulltextSearchMinTokenSize
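
As an illustration of the accent folding and case folding mentioned in the table and footnotes, the following Python sketch strips diacritics by decomposing characters (Unicode NFKD) and dropping the combining marks. This is one common approach under stated assumptions, not the implementation used by Full-Text Search or by Elasticsearch's asciifolding filter; the function names are hypothetical.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Remove diacritics: decompose each character, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold(text: str) -> str:
    """Accent folding followed by case folding, applied to values and queries alike."""
    return fold_accents(text).casefold()


print(fold_accents("Café Zürich"))
print(fold("CAFÉ") == fold("cafe"))
```

Characters that have no Unicode decomposition, such as ø, pass through this sketch unchanged; a dedicated folding filter typically relies on an explicit mapping table to cover such cases.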