PMML-Association rule

Posted: June 29th, 2010 | Author: laomi | Filed under: Paper reading notes

PMML (Predictive Model Markup Language) is the leading standard for statistical and data mining models and is supported by over 20 vendors and organizations. Several mainstream data analysis and mining tools, such as SPSS and Weka, already support it. With PMML, it is straightforward to develop a model on one system using one application and deploy the model on another system using another application.

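A model is like a black box: you give it data, and it gives you back the data you want. Take a classifier as an example. Once the classifier has been trained, you feed a new record into the model and it returns an output telling you which class that record belongs to. PMML is a way of describing such statistical and data mining models so that several pieces of software can use them: once you have trained a model in one tool, any software or application that supports PMML can reuse it. That reuse is one of the goals of PMML.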

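I titled this post PMML-Association rule because I have recently been writing an example of how to express association rules in PMML and how to extend PMML. Every model has inputs and outputs, and PMML already defines how to express them, so I only skimmed that part and did not dig further into its schema. The focus here is the description of the Association Rule Model. The following is excerpted from the description on the official PMML website: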

Association rule Model

Element:
Extension
MiningSchema
Output
ModelStats
Item
Itemset
AssociationRule
Attribute:
modelName
functionName
algorithmName
numberOfTransactions: The number of transactions contained in the input data.
maxNumberOfItemsPerTA: The number of items contained in the largest transaction.
avgNumberOfItemsPerTA: The average number of items contained in a transaction.
minimumSupport: The minimum relative support value (#supporting transactions / #total transactions) satisfied by all rules.
minimumConfidence: The minimum confidence value satisfied by all rules. Confidence is calculated as (support(rule) / support(antecedent)).
lengthLimit: The maximum number of items contained in a rule which was used to limit the number of rules.
numberOfItems: The number of different items contained in the input data.
numberOfItemsets: The number of itemsets contained in the model.
numberOfRules: The number of rules contained in the model.
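
To get a feel for the counting attributes above, here is a minimal sketch that derives them from raw transactions. The transaction data and the helper name `model_statistics` are my own assumptions for illustration; nothing here comes from the PMML specification itself.

```python
# Toy transactions (hypothetical data, not taken from the PMML documentation).
transactions = [
    {"Cracker", "Coke", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Coke", "Water"},
]


def model_statistics(transactions):
    """Derive the counting attributes of an association model from transactions.

    Assumes a non-empty list of transactions, each given as a set of items.
    """
    sizes = [len(t) for t in transactions]
    distinct_items = set().union(*transactions)
    return {
        "numberOfTransactions": len(transactions),
        "maxNumberOfItemsPerTA": max(sizes),
        "avgNumberOfItemsPerTA": sum(sizes) / len(sizes),
        "numberOfItems": len(distinct_items),
    }


stats = model_statistics(transactions)
print(stats)
```

The remaining attributes (minimumSupport, minimumConfidence, lengthLimit, numberOfItemsets, numberOfRules) depend on how the mining algorithm was configured and what it found, so they are not derived here.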

Item

Element:
Extension
Attribute:
id: An identification to uniquely identify an item.
value: The value of the item as in the input data.
mappedValue: Optional, a value to which the original item value is mapped. For instance, this could be a product name if the original value is an EAN code.
weight: The weight of the item. For example, the price or value of an item.

Itemset

Element:
ItemRef: Item references point to elements of type Item
Attribute:
id: An identification to uniquely identify an Itemset.
support: The relative support of the Itemset.
support(set) = (number of transactions containing the set) / (total number of transactions)
numberOfItems: The number of Items contained in this Itemset
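
The support formula above is easy to check by hand. The following is a small sketch under my own assumptions: the transactions are toy data and the function name `support` is just a convenient label, not something defined by PMML.

```python
def support(itemset, transactions):
    """Relative support: fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    containing = sum(1 for t in transactions if itemset <= set(t))
    return containing / len(transactions)


# Toy transactions (hypothetical data).
transactions = [
    {"Cracker", "Coke", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Coke", "Water"},
]

# 2 of the 4 transactions contain Coke.
support_coke = support({"Coke"}, transactions)
# All 4 transactions contain both Cracker and Water.
support_cracker_water = support({"Cracker", "Water"}, transactions)
print(support_coke, support_cracker_water)
```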

ItemRef

Element:
Extension
Attribute:
itemRef: Contains the identification of an item
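
To see how Item, Itemset and ItemRef fit together, here is a rough sketch that builds the three element types with Python's standard `xml.etree.ElementTree` and then resolves an itemset back to its item values. It is deliberately simplified: I leave out the PMML namespace, the enclosing PMML document, and the other attributes a valid model would carry, and the ids, values and the helper `resolve_itemset` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified fragment: no namespace and no MiningSchema, so this is NOT a
# complete, schema-valid PMML model.
model = ET.Element("AssociationModel", functionName="associationRules")

# Items carry an id and the value as it appears in the input data.
for item_id, value in [("1", "Cracker"), ("2", "Coke"), ("3", "Water")]:
    ET.SubElement(model, "Item", id=item_id, value=value)

# An Itemset does not repeat the item values; it points at Items via ItemRef.
itemset = ET.SubElement(model, "Itemset", id="1", support="1.0", numberOfItems="2")
for ref in ("1", "3"):
    ET.SubElement(itemset, "ItemRef", itemRef=ref)


def resolve_itemset(model, itemset_id):
    """Return the item values referenced by the Itemset with the given id."""
    items = {item.get("id"): item.get("value") for item in model.findall("Item")}
    for node in model.findall("Itemset"):
        if node.get("id") == itemset_id:
            return [items[ref.get("itemRef")] for ref in node.findall("ItemRef")]
    raise KeyError(itemset_id)


resolved = resolve_itemset(model, "1")
print(resolved)
print(ET.tostring(model, encoding="unicode"))
```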

AssociationRule

Element:
Extension
Attribute:
antecedent: The id value of the itemset which is the antecedent of the rule. We represent the itemset by the letter A.
consequent: The id value of the itemset which is the consequent of the rule. We represent the itemset by the letter C.
support: The support of the rule, that is, the relative frequency of transactions that contain A and C.
support(A->C) = support(A+C)
confidence: The confidence of the rule.
confidence(A->C) = support(A+C) / support(A)
lift: The lift value of the rule. If the XML attribute is specified explicitly in the rule, the following equation must hold true.
lift(A->C) = confidence(A->C) / support(C)
id: An identification to uniquely identify an association rule.
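
The three formulas above chain together, so a tiny worked example helps. This is only a sketch: the transactions are toy data, the names `support` and `rule_measures` are my own, and it assumes that both the antecedent and the consequent occur at least once (otherwise the divisions fail).

```python
def support(itemset, transactions):
    """Relative support: fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def rule_measures(antecedent, consequent, transactions):
    """Compute support, confidence and lift for the rule A -> C."""
    a, c = set(antecedent), set(consequent)
    rule_support = support(a | c, transactions)        # support(A+C)
    confidence = rule_support / support(a, transactions)
    lift = confidence / support(c, transactions)
    return {"support": rule_support, "confidence": confidence, "lift": lift}


# Toy transactions (hypothetical data).
transactions = [
    {"Cracker", "Coke", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Coke"},
    {"Water"},
]

# Rule {Coke} -> {Cracker}:
#   support(A+C) = 2/4, support(A) = 2/4, support(C) = 3/4
measures = rule_measures({"Coke"}, {"Cracker"}, transactions)
print(measures)
```

In this toy data every transaction with Coke also has Cracker, so the confidence is 1, and the lift is above 1 because Cracker appears in only three of the four transactions overall.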

Scoring procedure: this part is implemented through three different mechanisms:

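  • recommendation: a rule is selected whenever its antecedent is contained in the input itemset. With this mechanism, the consequent of a selected rule may also appear in the input itemset.
  • exclusiveRecommendation: unlike recommendation, this requires not only that the rule's antecedent is contained in the input itemset, but also that the rule's consequent is not.
  • ruleAssociation: this requires both the antecedent and the consequent of the rule to be contained in the input itemset.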
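
The three mechanisms above differ only in the test applied to the consequent, which the following sketch tries to make concrete. The rules, the input itemset and the function `select_rules` are hypothetical. I also assume that "contained in the input" means a subset test and that "not contained" means "not a subset"; please check the official documentation for the exact wording.

```python
# Hypothetical rules; antecedent and consequent are given directly as item sets.
rules = [
    {"id": "r1", "antecedent": {"Cracker"}, "consequent": {"Water"}},
    {"id": "r2", "antecedent": {"Water"}, "consequent": {"Coke"}},
    {"id": "r3", "antecedent": {"Coke"}, "consequent": {"Cracker"}},
]


def select_rules(rules, input_items, algorithm):
    """Return the ids of the rules selected for `input_items` by `algorithm`."""
    input_items = set(input_items)
    selected = []
    for rule in rules:
        antecedent_in = rule["antecedent"] <= input_items
        consequent_in = rule["consequent"] <= input_items
        if algorithm == "recommendation":
            keep = antecedent_in
        elif algorithm == "exclusiveRecommendation":
            keep = antecedent_in and not consequent_in
        elif algorithm == "ruleAssociation":
            keep = antecedent_in and consequent_in
        else:
            raise ValueError(f"unknown algorithm: {algorithm}")
        if keep:
            selected.append(rule["id"])
    return selected


basket = {"Cracker", "Water"}
recommended = select_rules(rules, basket, "recommendation")
exclusive = select_rules(rules, basket, "exclusiveRecommendation")
associated = select_rules(rules, basket, "ruleAssociation")
print(recommended, exclusive, associated)
```

For the basket of Cracker and Water, r3 is never selected because its antecedent (Coke) is missing. r1 has both sides in the basket, and r2 has only its antecedent in the basket, which is why the three mechanisms return different subsets.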

The above is only my excerpt of the association rule model from the official PMML documentation. For more details, see http://www.dmg.org/v4-0/GeneralStructure.html
