SQLStore update

From semantic-mediawiki.org
Jump to: navigation, search

This page collects design notes for the new SQLStore3. It was a development document that should not be confused with the actual documentation of what is implemented now.

Discussions are done at the talk page. Also see Tasks page for the process.

Issues and concerns

Here we collect the things that we would like to improve:

  • The code is too complex and thus hard to optimise.
  • The code is not really aware of the possible kinds of dataitems, but uses another way of describing data.
  • The table layout is historical and makes some operations harder (e.g., one cannot find out the type of a value from its table)
  • Data is written too often, even if it did not change.
  • Statistical information as needed for the special pages is very expensive to calculate.
  • String data has unnecessary length restrictions bugzilla:34511
  • ...

Unordered ideas

  • Align the table layout with the types of dataitems to make everything clearer
  • Use hashes to avoid unnecessary write operations
  • Split the code in multiple files based on functionality (ID management, writing, simple reading, statistics, queries)
  • ...

Store Interface

Reading methods

The following table shows all reading methods that a store must implement. The query pattern describes input and output: bold parts are given as an input, parts with ? are left variable, underlined variables are returned as an output. All methods that take properties as inputs will also accept inverse properties: in this case subject and value are swapped and the according methods is called. For example, calling getPropertyValues with an inverse property leads to a call of getPropertySubjects.

Method Query pattern Implementation and Notes Usage
getSemanticData subject ?property ?value Iterate over all value tables; retrieve all data for subject; return SMWSemanticData
  • A "filter" parameter can be used to limit the retrieval to some tables.
  • Results are cached for the PHP session.
  • The returned SMWSemanticData holds raw data and creates DI objects only on demand.
  • fetchSemanticData used to construct SQL query
Everywhere
getPropertyValues subject property ?value Call getSemanticData with filter to retrieve data only from relevant table; post-process results (apply order etc.)
  • caching based on caching for getSemanticData
  • post-processed output not cached
Everywhere
getAllPropertyValues* ?subject property ?value
  • fetchSemanticData used to construct SQL query
  • no caching
Unused in SMW?
getPropertySubjects ?subject property value
  • direct implementation, not using semantic query code
  • no caching
browsing special pages, property pages, backlinks (RDF), update jobs
getAllPropertySubjects ?subject property ?value Shares most code with getPropertySubjects (just need to leave value selection part away) property pages, update jobs
getProperties subject ?property ?value Cycle through all tables to find properties, merge results in memory (ordering, filtering)
  • no caching
Unused in SMW?
getInProperties ?subject ?property value Similar to getProperties
  • only implemented for values of type SMWDIWikiPage now
  • no caching
Unused in SMW?

(*) There is no method getAllPropertyValues in SMWStore yet, but one can call getPropertyValues with NULL as subject to get the same result.

Sorting and filtering results

All reading methods with one return variable support SMWRequestOptions that can be used to sort or to filter the list of return values. Filtering involves settings like Limit and Offset.

Sorting will sort values according to their natural order (if any), usually numerical or lexicographic. Geographic coordinates do not have a natural linear order to use for sorting. However, every kind of data must be sorted in some way (even if the order is not useful in practice). Otherwise the order of results could change between two requests, which would make it impossible to go through all results using Limit and Offset.

Properties are handled like wiki pages and sorted by their title. Wikipages and properties require a JOIN with the smw_ids table for sorting. If no sorting is needed, it is supposed to be cheaper to fetch their ID separately and to use it to select the right values.

Some types of data have multiple versions of each value stored in the database, one for sorting and one for storage. For example, numbers are stored as strings (reliable precision) and as floats (efficient sorting). In this case, it is likely that only the numerical version is needed for selecting values, so no index should be required on the other column.

Sorting and filtering can be done efficiently by the database management system in many cases. However, in order to use data that was cached, it might be necessary to sort and filter in PHP as well. SMWSQLStore2 has methods for both cases: one for translating SMWRequestOptions into SQL options, and one for applying SMWRequestOptions to a list of values.

Table layout

Here we document all database tables and related questions.

Table smw_ids

This table is used to store integer IDs for SMWDIWikiPage objects. These objects can represent wiki pages (possibly with some external interwiki prefix), or subobjects of wiki pages. IDs are also kept for properties (identified by a wiki page) and for special properties (no wiki page).

Columns:

  • smw_id (integer NOT NULL KEY AUTO_INCREMENT): the integer id
  • smw_namespace (integer NOT NULL): namespace of the wiki page
  • smw_title (titletext NOT NULL): title of the wiki page
  • smw_iw (interwikitext NOT NULL): interwiki prefix of the wiki page (or empty string)
  • smw_subobject (titletext NOT NULL): name for subobjects of a page (empty string if the entry is about the page)
  • smw_sortkey (titletext NOT NULL): string used for sorting
  • smw_data_hash (varchar NULL): string used to store hash value of the Semantic Data for this page to compare with newly generated ones on page edits

Indexes:

  • smw_id: primary key
  • smw_title,smw_namespace,smw_iw: selection by wiki page
  • smw_title,smw_namespace,smw_iw,smw_subobject: selection by the full data stored in SMWDIWikiPage
  • smw_sortkey: selection by sortkey (relevant, e.g., when < and > comparators are used with page-type properties in #ask queries)

There are also some special objects that are stored here. These are marked by special interwiki prefixes that cannot occur in real pages:

  • SMW_SQL2_SMWREDIIW: entries for wiki pages that are redirects. Otherwise, such rows are similar to normal entries for (internal) wiki pages.
  • SMW_SQL2_SMWPREDEFIW: predefined entries, usually for predefined properties that have user-visible name (and a wiki page)
  • SMW_SQL2_SMWINTDEFIW: predefined entries for objects that have an Id but no user-visible name (internal objects)
  • SMW_SQL2_SMWBORDERIW: this is used in one row to mark the border between predefined ids (rows that are reserved for hardcoded ids built into SMW) and normal entries. This row is no object, but makes sure that SQL's auto increment counter is high enough to not add any objects before that marked "border".

Predefined entries for predefined properties are created on setup. SMW's internal predefined properties have fixed IDs to avoid lookups. Having them in the table is useful so that they can be found by functions that cannot not check this (e.g., statistical functions that show all properties). It is also important that no other object has the same ID, so it makes sense to occupy the space in the table.

Table smw_property_ids

Similar to table smw_ids but has some new fields to store important information about that property

Columns:

  • prop_id (integer NOT NULL KEY AUTO_INCREMENT): the integer id
  • prop_name (titletext NOT NULL): name of the property
  • prop_table (titletext NOT NULL): name of the property-value table for this property (we are planning to have dedicated tables for some properties)

Indexes:

  • prop_id: primary key
  • prop_title: selection by property name

Property-Value tables

Various tables store property-value pairs that are assigned to a given page. All of these tables have the following two columns:

  • s_id (integer): ID of the subject, that is, the page that has the value
  • p_id (integer): ID of the property to which the value was assigned

All IDs refer to the entries in smw_ids. The columns have the following indexes:

  • s_id: selection by subject
  • p_id: selection by property

For different types of values, there are different further columns as shown next.

SMWDINumber: smw_di_number

This table will store property-values of type SMWDINumber.

Value columns:

  • value: the value of the property(Datatype:varchar).
  • sort_value: value used for sorting purpose(Datatype:double)

Indexes:

  • value: selection by value.

SMWDIBool: smw_di_bool

This table will store property-values of type SMWDIBool.

Value columns:

  • value: the value of the property(Datatype: Boolean).

SMWDITime: smw_di_time

This table will store property-values of type SMWDITime.

Value columns:

  • value_smwformat : the time stored as is now in smw_atts2.value_xsd column.(Datatype:string)
  • value_timestampformat: the time stored as a SMW timestamp(used for sorting and other purposes).(Datatype: double)

Indexes:

  • time_timestampformat: selection by value.

SMWDIString: smw_di_string

This table will store property-values of type SMWDIString.

Value columns:

  • value: the value of the property as blob.(Datatype:Blob)
  • hash_value: Hash value used for sorting and selection.(Datatype:varchar)

Indexes:

  • hash_value: Indexing strings by Hashing will allow effective searching and selection.


SMWDIWikiPage: smw_di_wikipage

This table stores property values of type SMWDIWikiPage. This table is also used to store "compound values". Such values are encoded as subobjects that act as "internal wiki pages", which can have further property values.

Value columns:

  • o_id (integer): ID of the object (value) that was assigned.(Datatype:int)

Indexes:

  • o_id: selection by value

SMWDIGeoCoord: smw_di_geocoord

This table stores property values of type SMWDIGeoCoord.

Value Columns:

  • lat (float): latitude of a coordinate
  • lon (float): longitude of a coordinate
  • alt (float): altitude of a coordinate

Indexes:

  • lat: selection by latitude
  • lon: selection by longitude
  • alt: selection by altitude
Questions and Answers
  • Shouldn't there be an index for lat,lon,alt (whole value)?
  • Do we use the indexes for the individual components anywhere?
  • The current database layout can only store limited precision numbers, so many user inputs are not going to be preserved accurately when storing them. We probably should use varchars to store accurate values. Should these be the only representations or do we also need the floats for anything? (this relates to the previous question)

Maintenance Scripts

  • Property Table migration - This script will make a new table for a property (and put that table name in the property_table field of table smw_property_ids) and move all the property-value tuples into this new table.