User:Dennis Groenewegen/Cheatsheet

From semantic-mediawiki.org

This cheatsheet compares the matching strategies available, depending on the search engine and configuration used. SPARQLStore is not yet included because details for it are currently lacking.

| Feature | Standard SQLStore | Full-Text Search | ElasticStore |
|---|---|---|---|
| Properties indexed | All, incl. type Text, Page and URL | Configurable. Default: user-defined properties of type Text and URL.[1] | All, incl. type Text, Page and URL |
| Regular LIKE/NOT LIKE operators | ~/!~ and like:/nlike: (standard); like:/nlike: (fallback if FTS is enabled)[2] | ~/!~[3] | ~/!~ and like:/nlike: |
| Wide proximity operators | Unsupported | ~~/!~~ (two tildes, not one) | ~~/!~~; also supported by in: and phrase: |
| Wildcards | standalone + (any value); * (0 or more)[4]; ? (any single character) | standalone + (any value); * (0 or more); ? (any single character) | standalone + (any value); * (0 or more; see also in:); ? (any single character) |
| in: (shorthand) | Unsupported | Unsupported | Equivalent to ~* ... * (with named property), or ~~* ... * (wide proximity) |
| Tokenisation | No | Yes | Yes (different tokenizers available) |
| Maximum searchable string length | first 255 per value (type Page); first 40 or 72, or 300 (type Text)[5] | ? per token[6] | maximum token length configurable (Length token filter) |
| Supported wildcard (*/?) positions | start, middle, end (of value) | end (of token) only[7] | start, middle, end (of token) |
| Match only at beginning/end of property value | Supported | Unsupported (tokens or phrases only) | Unsupported (tokens or phrases only), except with e.g. Keyword tokenizer.[8] |
| Meaning of whitespace between characters | part of string | token delimiter | token delimiter (usually, but see Keyword tokenizer) |
| Case folding | No | Yes | Possible (lowercase filter) |
| Accent folding[9] | No | Yes | Possible (asciifolding filter) |
| Features unique to tokenisation | | | |
| Minimum token length | - | Configurable (default: 3)[10] | Configurable (Length token filter) |
| Boolean operators (+/-) | - | Yes. Caveat: do not apply to strings below minimum token length (Runtime Exception) | Yes |
| Stopword filter | - | Yes | Yes (Stop token filter) |
| CJK support | - | Yes (onoi/tesa; see documentation) | Yes (CJK language analyzer) |
| Phrase matching | | | |
| Phrase matching | Already the default. Matching is exact. | Use double quotes (" ... "). Case/accent-insensitive. Does not support wildcards. | Use double quotes, or phrase: (see below). Level of precision depends on case/accent-sensitivity. Does not support wildcards. |
| phrase: (shorthand) | Unsupported | Unsupported | Equivalent to ~" ... ", or ~~" ... " (wide proximity) |
| Other features | | | |
| Search highlighting with #-hl | Yes | Yes | Yes |
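
The difference between matching against a whole property value and matching against tokens can be sketched in a few lines of Python. This only illustrates the semantics listed in the table and is not how any of the stores is implemented. The function names are hypothetical, the tokenizer is assumed to be a plain whitespace split, and no case or accent folding is applied. The sketch accepts wildcards in any position, which corresponds to the ElasticStore column; Full-Text Search accepts them at the end of a token only.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate * (0 or more characters) and ? (exactly one character) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            # Everything else, including whitespace, is matched literally.
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def matches_value(pattern: str, value: str) -> bool:
    """Untokenised matching: the pattern must cover the whole value,
    so whitespace is simply part of the string."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None


def matches_any_token(pattern: str, value: str) -> bool:
    """Tokenised matching (assumption: tokens are split on whitespace):
    the pattern must cover one complete token."""
    rx = wildcard_to_regex(pattern)
    return any(rx.fullmatch(token) is not None for token in value.split())
```

With the value "Semantic MediaWiki search", the pattern `Media*` fails under `matches_value` because the value does not begin with "Media", but succeeds under `matches_any_token` because it covers the token "MediaWiki". Conversely, `*Wiki search` matches the whole value but no single token, since whitespace acts as a delimiter once the value is tokenised.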


  1. The data types to be indexed are set in $smwgFulltextSearchIndexableDataTypes. The setting defaults to the constants SMW_FT_BLOB (used for type Text) and SMW_FT_URI (used for type URL). Specific properties can be exempted from indexing.
  2. No such fallback is available for ElasticStore.
  3. However, behaviour falls back to regular SQL if the property is not indexed.
  4. In a very early version of SMW, * may have meant 1 or more.
  5. See the documentation on search operators.
  6. Tokenisation comes with the benefit that length restrictions pertaining to the full string no longer apply. Even Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu (85 characters) should be fine.
  7. The CONTAINS predicate does not allow for other positions.
  8. The tokenizer does record the "order or position of each term", but it is unclear if this information is or can be used for anything other than "phrase and word proximity queries".
  9. Accent folding is a common technique that maps Unicode characters to their ASCII equivalents so that a query can be agnostic of any diacritics or accents being used.
  10. The minimum token length is set in $smwgFulltextSearchMinTokenSize.
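
Case and accent folding, as described in note 9, can be approximated with Python's standard library. This is a minimal sketch using Unicode decomposition; it is not the algorithm used by Full-Text Search or by the Elasticsearch asciifolding filter, and the function name `fold` is hypothetical.

```python
import unicodedata


def fold(text: str) -> str:
    """Approximate accent and case folding for comparison purposes."""
    # Decompose characters such as "é" into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case so that "Cafe" and "cafe" compare equal.
    return stripped.casefold()
```

Comparing `fold(query)` with `fold(value)` makes a query agnostic of diacritics and letter case. Letters without a Unicode decomposition, such as "ø", pass through unchanged, so this approach is less thorough than a dedicated ASCII folding table.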