Similar Items

Technical documentation about how to set up and configure Similar Items.

Overview

The following tables are used for similar items:

  • [ml].[similar_item]
  • [ml].[similar_item_setting]

The following procedures are used for similar items:

  • [ml].[similar_item_run]
  • [ml].[similar_item_create]

Configure [ml].[similar_item_setting]

To compute the similarity between items, several settings need to be configured.
Those settings are inserted into the [ml].[similar_item_setting] table:

  • id (int - required)

    • Unique identifier for this similar item setting. This id is referenced by @control_table_id in core.stg_element (see below).
  • input_item_selection (nvarchar - not required)

    • By default, the algorithm computes similar items for all open items found in the dbo.items table.
      To create similar items only for a subset of items, enter the name of a custom item selection table in this field.
      This custom item selection table needs to have a primary key {id} as item_id.
  • input_features (nvarchar - required)

    • The item features you would like to use when comparing items, given as a comma-separated list with no spaces between entries.
    • These features should be listed by their column name as found in dbo.data_element_ref_columns.
    • The item_id column needs to be listed as part of the features.
    • All features must be of type nvarchar, as that is the only supported type (except for the item_id column). Columns of type varchar, int, decimal, etc. cannot be used.
    • Example of input_features: item_id,name,description,product_group_name_level_1,product_group_name_level_2
  • language_stop_words (nvarchar - required)

    • The language chosen should be the language that most of the features (e.g. item name and description) are written in.
    • The language is used for preprocessing techniques, which generally leads to more accurate results.
    • However, not all languages are supported. If the appropriate language is not on the list below, please enter "other" as the input parameter.
    • The supported languages are:
      • arabic, bulgarian, catalan, czech, danish, dutch, english, finnish, french, german, hungarian,
        indonesian, italian, norwegian, polish, portuguese, romanian, russian, spanish, swedish, turkish, ukrainian.
  • no_of_similar_items (int - required)

    • The number of similar items to generate for each item in the item range.
    • E.g. if no_of_similar_items = 5, at most 5 similar items are generated for each item. If the algorithm finds more than 5 similar items, it yields only the 5 with the highest similarity score. This setting therefore limits the number of similar items; to include all similar items, enter a high number in this field.
  • min_similarity_score (decimal - required)

    • The similarity score takes values from 0 to 1, where 1 means the two items are identical with respect to the data that was fed into the algorithm.
      To include only similar items with a similarity score higher than 0.5, enter 0.5 in this field.
      To include all similar items, regardless of their similarity score, enter 0 in this field.
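The constraints listed above can be checked before a row is inserted. The following is a minimal sketch, not part of the product: the function name validate_setting and the dictionary layout are hypothetical, and the rule that no_of_similar_items must be at least 1 is an assumption that the source does not state.

```python
# Languages listed as supported in this document.
SUPPORTED_LANGUAGES = {
    "arabic", "bulgarian", "catalan", "czech", "danish", "dutch", "english",
    "finnish", "french", "german", "hungarian", "indonesian", "italian",
    "norwegian", "polish", "portuguese", "romanian", "russian", "spanish",
    "swedish", "turkish", "ukrainian",
}


def validate_setting(setting):
    """Return a list of problems found in one settings row (empty if none).

    Hypothetical helper: it only mirrors the rules described in the text.
    """
    problems = []

    features = setting.get("input_features", "")
    if " " in features:
        problems.append("input_features must not contain spaces")
    if "item_id" not in features.split(","):
        problems.append("input_features must include item_id")

    language = setting.get("language_stop_words")
    if language != "other" and language not in SUPPORTED_LANGUAGES:
        problems.append("language_stop_words must be a supported language or 'other'")

    count = setting.get("no_of_similar_items")
    # Assumption: a limit below 1 is treated as a configuration mistake.
    if not isinstance(count, int) or count < 1:
        problems.append("no_of_similar_items must be a positive integer")

    score = setting.get("min_similarity_score")
    if not isinstance(score, (int, float)) or not 0 <= score <= 1:
        problems.append("min_similarity_score must be between 0 and 1")

    return problems


example = {
    "id": 1,
    "input_item_selection": None,  # not required: use all open items
    "input_features": "item_id,name,description,product_group_name_level_1,product_group_name_level_2",
    "language_stop_words": "english",
    "no_of_similar_items": 5,
    "min_similarity_score": 0.5,
}
print(validate_setting(example))
```

The check does not verify that each feature column exists or is of type nvarchar; that would require a lookup against dbo.data_element_ref_columns.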

ml.similar_item_setting

In most cases, there is only one line in this setting table, but it is possible to insert more lines.
For example, to create similar items based on different features or a different similarity score, add a new line with your preferred settings and then change the @control_table_id used in core.stg_element to the id of that line. The similar items are then generated from the features and settings in that line.


Configure [core].[stg_element]

In [core].[stg_element] we specify which setting from [ml].[similar_item_setting] should be used when generating the similar items.
So, for the stg_element with name = ml_run_similar_item, the parameters field should contain @control_table_id = 1, where 1 is the id of the setting line from [ml].[similar_item_setting].

The parameters field should look like this: @batch_id = *BATCH_ID*, @origin_id = *ORIGIN_ID*, @parent_id = *PARENT_ID*, @control_table_id = 1
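When switching between setting lines, only the final value in the parameters string changes. As a small illustration, the string could be assembled as follows; build_parameters is a hypothetical helper name, and the placeholders *BATCH_ID*, *ORIGIN_ID* and *PARENT_ID* are copied verbatim from the example above.

```python
def build_parameters(control_table_id):
    """Build the parameters string for the ml_run_similar_item stg_element.

    Hypothetical helper; control_table_id is the id of a line in
    [ml].[similar_item_setting].
    """
    if not isinstance(control_table_id, int):
        raise TypeError("control_table_id must be an integer id")
    # The *...* placeholders are kept exactly as they appear in the parameters field.
    return (
        "@batch_id = *BATCH_ID*, @origin_id = *ORIGIN_ID*, "
        "@parent_id = *PARENT_ID*, "
        f"@control_table_id = {control_table_id}"
    )


# Point the run at the setting line with id 2 instead of 1.
print(build_parameters(2))
```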

ml_run_similar_item


How it works

If you would like to learn more about the similarity algorithm, you can take a look at this article.

  • [ml].[similar_item_create] [procedure] - contains the Python algorithm that finds the similar items.
    • This procedure should not be changed at any time.
  • [ml].[similar_item_run] [procedure] - takes into account the settings from [ml].[similar_item_setting] and runs the create procedure.

The similarity algorithm works at the item_id level, comparing all items based on the features provided.
All items are therefore compared within each location; it is not possible to exclude individual locations.
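This document does not describe the algorithm inside [ml].[similar_item_create]. To illustrate how text features, stop words, min_similarity_score and no_of_similar_items could interact, here is a minimal sketch that swaps in a plain bag-of-words cosine similarity. The stop-word list, the function names and the sample items are all hypothetical, locations are not modelled, and the strict "higher than" comparison at the threshold is an assumption.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list; a real list would depend on language_stop_words.
STOP_WORDS = {"the", "a", "and", "with"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def cosine(a, b):
    """Cosine similarity between two token-count vectors (0 to 1)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_items(items, no_of_similar_items, min_similarity_score):
    """Return {item_id: [(other_id, score), ...]} limited by both settings."""
    vectors = {
        item_id: Counter(tokenize(" ".join(features)))
        for item_id, features in items.items()
    }
    result = {}
    for item_id, vector in vectors.items():
        scored = [
            (other_id, cosine(vector, other_vector))
            for other_id, other_vector in vectors.items()
            if other_id != item_id
        ]
        # Assumption: only scores strictly above the threshold are kept.
        kept = [pair for pair in scored if pair[1] > min_similarity_score]
        # Highest score first, then cap at no_of_similar_items.
        kept.sort(key=lambda pair: pair[1], reverse=True)
        result[item_id] = kept[:no_of_similar_items]
    return result


# Made-up items: item_id -> list of text feature values.
items = {
    "A": ["red cotton shirt", "apparel"],
    "B": ["blue cotton shirt", "apparel"],
    "C": ["garden hose", "outdoor"],
}
print(similar_items(items, no_of_similar_items=5, min_similarity_score=0.5))
```

In this sample, items A and B share three of their four tokens and score 0.75 against each other, while C shares no tokens with either and therefore gets no similar items at a threshold of 0.5.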