User:Dennis Groenewegen/Cheatsheet

This is a comparison table, or cheatsheet, of the matching strategies available, depending on the search engine and configuration used:

SMW with the standard setup of SQLStore. The documentation in 'Selecting pages' and child pages usually refers to the standard setup.
SMW with Full-Text Search (FTS) enabled for the SQLStore (still considered experimental). A guide is separately available here.
SMW with ElasticStore using Elasticsearch as a search engine. A guide is available from GitHub.
- Of these three, ES is the most flexible, allowing for a wider range of configurations, e.g. different analyzers for different data types. Some alternative options may not have been taken into account yet. For the default settings and examples of other possibilities, see DefaultSettings.php and these JSON files.

Not yet included in this comparison is SPARQLStore, details for which are currently lacking.

	Standard SQLStore	Full-Text Search	ElasticStore
Properties indexed	All, incl. type Text, Page and URL	Configurable. Default: user-defined properties of type Text and URL.^[1]	All, incl. type Text, Page and URL
Regular LIKE/NOT LIKE operators	`~`/`!~` and `like:`/`nlike:` (standard); `like:`/`nlike:` (fallback if FTS is enabled)^[2]	`~`/`!~`^[3]	`~`/`!~` and `like:`/`nlike:`
Wide proximity operators	Unsupported	`~~`/`!~~` (two tildes, not one)	`~~`/`!~~`; also supported by `in:` and `phrase:`
Wildcards	standalone `+` (any value); `*` (0 or more characters)^[4]; `?` (any single character)	standalone `+` (any value); `*` (0 or more characters); `?` (any single character)	standalone `+` (any value); `*` (0 or more characters; see also `in:`); `?` (any single character)
`in:` (shorthand)	Unsupported	Unsupported	Equivalent to `~* ... ` (with named property), or `~~ ... *` (wide proximity)
Tokenisation	No	Yes	Yes (different tokenizers available)
Maximum searchable string length	first 255 characters per value (type Page); first 40 or 72, or 300 characters (type Text)^[5]	? per token^[6]	maximum token length configurable (Length token filter)
Supported wildcard (`*`/`?`) positions	start, middle, end (of value)	end (of token) only^[7]	start, middle, end (of token)
Match only at beginning/end of property value	Supported	Unsupported (tokens or phrases only)	Unsupported (tokens or phrases only) except with e.g. Keyword tokenizer.^[8]
Meaning of whitespace between characters	part of string	token delimiter	token delimiter (usually, but see Keyword tokenizer)
Case folding	No	Yes	Possible (lowercase filter)
Accent folding^[9]	No	Yes	Possible (asciifolding filter)
Features unique to tokenisation
Minimum token length	-	Configurable (default: 3)^[10]	Configurable (Length token filter)
Boolean operators (`+`/`-`)	-	Yes. Caveat: they do not apply to strings below the minimum token length (Runtime Exception)	Yes
Stopword filter	-	Yes	Yes (Stop token filter)
CJK support	-	Yes (onoi/tesa; see documentation)	Yes (CJK language analyzer)
Phrase matching
Phrase matching	Already the default. Matching is exact.	Use double quotes (`" ... "`). Case/accent-insensitive. Does not support wildcards.	Use double quotes, or `phrase:` (see below). Level of precision depends on case/accent-sensitivity. Does not support wildcards.
`phrase:` (shorthand)	Unsupported	Unsupported	Equivalent to `~" ... "`, or `~~" ... "` (wide proximity)
Other features
Search highlighting with #-hl	Yes	Yes	Yes
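
The difference between matching against a whole property value (standard SQLStore) and matching against tokens (FTS, ElasticStore) can be illustrated with a small Python sketch. It is not how SMW or any of the search engines implement matching; the function names are hypothetical, the tokenizer is assumed to split on whitespace and lowercase only, and the engine-specific restrictions on wildcard positions listed in the table are not modelled.

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate `*` (0 or more characters) and `?` (exactly one character)
    into a regular expression; all other characters are taken literally."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def match_value(pattern: str, value: str) -> bool:
    """Untokenised matching: the pattern must cover the whole value.
    Whitespace is part of the string and case is significant."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None


def match_tokens(pattern: str, value: str) -> bool:
    """Tokenised matching (assumed: lowercase, split on whitespace).
    The pattern matches if it covers any single token."""
    regex = wildcard_to_regex(pattern.lower())
    return any(regex.fullmatch(token) for token in value.lower().split())


value = "Lorem ipsum dolor"
print(match_value("Lorem*", value))   # anchored at the beginning of the value
print(match_value("ipsum", value))    # no wildcards, so the whole value must match
print(match_tokens("ipsum", value))   # matches the second token
print(match_tokens("IPS*", value))    # case folding plus a trailing wildcard
```

In this sketch a bare `ipsum` fails against the whole value but succeeds against the token list, which mirrors the table rows on whitespace and on matching at the beginning or end of a property value.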

[1] The data types to be indexed are set in $smwgFulltextSearchIndexableDataTypes. The setting defaults to the constants SMW_FT_BLOB (used for type Text) and SMW_FT_URI (used for type URL). Specific properties can be exempted from indexing.

[2] No such fallback is available for ElasticStore.

[3] However, behaviour falls back to regular SQL if the property is not indexed.

[4] In a very early version of SMW, * may have meant 1 or more.

[5] See the documentation on search operators.

[6] Tokenisation comes with the benefit that length restrictions pertaining to the full string no longer apply. Even Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu (85 characters) should be fine.

[7] The CONTAINS predicate does not allow for other positions.

[8] The tokenizer does record the "order or position of each term", but it is unclear if this information is or can be used for anything other than "phrase and word proximity queries".

[9] Accent folding is a common technique that maps Unicode characters to their ASCII equivalents so that a query can be agnostic of any diacritics or accents being used.

[10] $smwgFulltextSearchMinTokenSize
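
Notes 9 and 10 can be made concrete with a short Python sketch of accent folding combined with a minimum token size. This is an illustration under stated assumptions, not the code used by SMW, onoi/tesa or Elasticsearch: the function names are hypothetical, folding is approximated by Unicode decomposition (so characters without a decomposition, such as `ø`, are left untouched), and the default of 3 simply mirrors the default minimum token length given in the table.

```python
import unicodedata
from typing import List


def fold_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks,
    so that e.g. 'é' becomes 'e' and 'ü' becomes 'u'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_tokens(text: str, min_token_size: int = 3) -> List[str]:
    """Fold accents and case, split on whitespace, and discard tokens
    shorter than the minimum token size."""
    folded = fold_accents(text).lower()
    return [token for token in folded.split() if len(token) >= min_token_size]


print(fold_accents("Café Zürich"))
print(index_tokens("Événement à Zürich"))
```

With such an index, a query for `zurich` and a query for `Zürich` would be reduced to the same token, while the one-letter word `à` would not be searchable at all.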