Using machine learning (ML) assisted data stewardship
Machine learning (ML) assisted data stewardship offers several benefits, such as accelerating manual tasks, shortening review cycles, and improving data quality. The ML-based product categorization feature lets data stewards spend their time on more meaningful activities than deciding the right category for each product. The feature is based on lightweight open source Python libraries, which carry no additional licensing overhead.
Product Auto Categorization
Fix Pack 10 and later, Fix Pack 7 Interim Fix 4
In Product Master, machine learning (ML) is used to automatically assign products to the appropriate product categories. The trained ML model assigns categories to products that are imported into Product Master. To train the model, you must provide an initial data file that contains two or more columns of product attribute data and a column of expected product categories. The training data must represent product attribute data from every category, with enough samples that the model learns the possible variations of product attribute data in each category. The training data features should correlate with their mapped category. The training file should contain at least 100 rows of data for each category. While training is in progress for a specific catalog, do not trigger any other training until the current training completes. The latest three versions of a model for a specific catalog are retained, and you can specify the version to use for prediction through the 'Version' attribute of the 'Machine Learning Threshold and Version Lookup' table.
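The training-file requirements above can be checked before a training run is triggered. The following is a minimal sketch, assuming the training data is available as a CSV file with a header row. The function name `check_training_data`, the column names, and the sample data are hypothetical and are not part of Product Master.

```python
import csv
import io
from collections import Counter

MIN_ROWS_PER_CATEGORY = 100   # guideline stated above
MIN_ATTRIBUTE_COLUMNS = 2     # guideline stated above


def check_training_data(csv_file, category_column, expected_categories):
    """Return a list of problems found in a training file (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    if category_column not in columns:
        return [f"missing category column: {category_column}"]

    problems = []
    # Every column other than the category column is treated as attribute data.
    attribute_columns = [c for c in columns if c != category_column]
    if len(attribute_columns) < MIN_ATTRIBUTE_COLUMNS:
        problems.append(
            f"found {len(attribute_columns)} attribute column(s), "
            f"need at least {MIN_ATTRIBUTE_COLUMNS}"
        )

    # Count the rows per category and compare against the minimum.
    counts = Counter(row[category_column] for row in reader)
    for category in expected_categories:
        rows = counts.get(category, 0)
        if rows < MIN_ROWS_PER_CATEGORY:
            problems.append(
                f"category {category!r} has {rows} row(s), "
                f"need at least {MIN_ROWS_PER_CATEGORY}"
            )
    return problems


# Illustrative data only: 100 rows for one category, 5 for another.
sample = "name,brand,category\n"
sample += "".join(f"shoe {i},Acme,Shoes\n" for i in range(100))
sample += "".join(f"hat {i},Acme,Hats\n" for i in range(5))
for problem in check_training_data(io.StringIO(sample), "category", ["Shoes", "Hats"]):
    print(problem)
```

A check like this covers only the countable requirements. Whether the attribute data correlates with its mapped category still needs a manual review of the samples.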
Fix Pack 9 and earlier
In Product Master, ML is used to automatically assign product names to the appropriate product categories. The trained ML model assigns categories to products that are imported into Product Master. To train the model, you must provide an initial data file that contains product names and the expected product categories. The training data must represent product names from every category, with enough samples that the model learns the possible variations of product names in each category.
With this feature, subsequent sets of products that are imported into the system do not need their categories specified at import time; ML assigns the appropriate categories. The model can be improved by passing in feedback from the manual review of identified misclassifications.
- Suggested Categories/Name
- Suggested Categories/Confidence Score
The spec should have a Feedback attribute of Boolean type. The Feedback attribute can be present in the primary spec or in a secondary spec, and it is used in retraining. If you manually update the mapping of an item in the collaboration area, the value of the Feedback attribute is set to True in the workflow.
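The role of the Feedback attribute described above can be illustrated with a small sketch. It assumes that each reviewed item carries a suggested category, a confidence score, a final category, and the Boolean Feedback flag. The `ReviewedItem` class and both helper functions are hypothetical stand-ins for illustration, not the Product Master workflow API.

```python
from dataclasses import dataclass


@dataclass
class ReviewedItem:
    """Hypothetical stand-in for an item in the collaboration area."""
    name: str
    suggested_category: str   # corresponds to Suggested Categories/Name
    confidence: float         # corresponds to Suggested Categories/Confidence Score
    final_category: str
    feedback: bool = False    # corresponds to the Boolean Feedback attribute


def apply_manual_mapping(item, new_category):
    """Mimic the described behavior: a manual mapping change sets Feedback to True."""
    if new_category != item.final_category:
        item.final_category = new_category
        item.feedback = True
    return item


def retraining_rows(items):
    """Collect (name, corrected category) pairs for items flagged as feedback."""
    return [(item.name, item.final_category) for item in items if item.feedback]


# Illustrative values only.
runner = ReviewedItem("Trail runner", "Sandals", 0.41, "Sandals")
beanie = ReviewedItem("Wool beanie", "Hats", 0.93, "Hats")
apply_manual_mapping(runner, "Running Shoes")   # a steward corrects the mapping
print(retraining_rows([runner, beanie]))
```

Only the corrected item is returned, which reflects the idea that retraining consumes the manually reviewed mappings rather than every prediction.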
Product Attributes Enrichment
In Product Master, the product description is used to identify attribute values for the underlying data model. This feature enriches a product from its description alone, with no need to set the values explicitly. To use this feature, you must provide the list of categories and their products, with their attributes and values, in a Microsoft Excel or CSV file. Internally, the feature uses probabilistic data structures to store the data model and, at run time, assigns the best value to each attribute from a given product description.
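The idea of deriving attribute values from a description can be sketched as follows. This is a simplified illustration that uses plain dictionaries and whole-phrase matching in place of the probabilistic data structures mentioned above. The function names `build_value_index` and `enrich`, and the sample categories and values, are assumptions made for the example.

```python
import re
from collections import defaultdict


def build_value_index(rows):
    """Index known values by category and attribute.

    rows: iterable of (category, attribute, value) tuples, as they might be
    read from the reference Excel or CSV file.
    """
    index = defaultdict(lambda: defaultdict(set))
    for category, attribute, value in rows:
        index[category][attribute].add(value.lower())
    return index


def enrich(description, category, index):
    """Return {attribute: value} for every known value found in the description."""
    text = description.lower()
    result = {}
    for attribute, values in index.get(category, {}).items():
        # Keep only the values that occur as whole words or phrases.
        matches = [
            v for v in values
            if re.search(r"\b" + re.escape(v) + r"\b", text)
        ]
        if matches:
            # Prefer the longest match, so "dark red" wins over "red".
            result[attribute] = max(matches, key=len)
    return result


# Illustrative reference data only.
index = build_value_index([
    ("Shirts", "Color", "red"),
    ("Shirts", "Color", "dark red"),
    ("Shirts", "Material", "cotton"),
])
print(enrich("Dark red cotton shirt, slim fit", "Shirts", index))
```

Choosing the longest match is one simple way to pick a "best" value. The actual feature may rank candidates differently.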
Product data standardization
To address data quality issues, IBM® Product Master uses ML to standardize product attributes and fix evident misspellings. The ML standardization service can be trained on contextual information about product attributes, which helps it identify and replace misspelled words. To train the ML service, you must provide a Microsoft Excel workbook that contains correct data for the product attributes. You can train on the data of multiple attributes by using a workbook with multiple columns, where each column contains the correct data for a specific product attribute. The first row of each column must contain the full path of the attribute whose data is in that column. Numbers are not considered in training or prediction for product data standardization.
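A rough sketch of the standardization idea follows. It is not the ML service itself: it replaces context-based learning with simple string-similarity matching from the Python standard library (`difflib`). It assumes the workbook columns have already been read into a dictionary keyed by the attribute path from the first row. The attribute path, function names, and cutoff value are hypothetical.

```python
import difflib
import re

WORD = re.compile(r"[A-Za-z]+")   # numbers are deliberately ignored


def build_vocabulary(columns):
    """Build a per-attribute vocabulary from known-correct values.

    columns: {attribute_path: [correct values]}, one entry per workbook column.
    Returns {attribute_path: {lowercase word: word as spelled in the data}}.
    """
    vocabulary = {}
    for path, values in columns.items():
        words = {}
        for value in values:
            for word in WORD.findall(value):
                words.setdefault(word.lower(), word)
        vocabulary[path] = words
    return vocabulary


def standardize(value, path, vocabulary, cutoff=0.8):
    """Replace words that closely resemble a known word for this attribute."""
    known = vocabulary.get(path, {})

    def fix(match):
        word = match.group(0)
        if word.lower() in known:
            return word
        close = difflib.get_close_matches(
            word.lower(), sorted(known), n=1, cutoff=cutoff
        )
        # Leave the word unchanged when nothing is similar enough.
        return known[close[0]] if close else word

    return WORD.sub(fix, value)


# Illustrative training column only.
vocabulary = build_vocabulary({
    "Product Spec/Material": ["Stainless Steel", "Cotton"],
})
print(standardize("Stainles Steel 2 pack", "Product Spec/Material", vocabulary))
```

Keeping a separate vocabulary per attribute path mirrors the one-column-per-attribute layout of the training workbook, so a word that is valid for one attribute is not forced onto another.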
Sample machine learning workflow project
Right-click the following ZIP file link and select the save option to download the sample workflow code to your computer.
- Fix Pack 10 and later, Fix Pack 7 Interim Fix 4
- Fix Pack 9 and earlier