Sunday, August 2, 2020

Association Rule mining with Weka



Data set selection

I selected the Adult data set from the UCI data set collection (https://archive.ics.uci.edu/ml/datasets/Adult). The data set was donated by Ronny Kohavi and Barry Becker of Data Mining and Visualization, Silicon Graphics. It is a multivariate data set with 32561 instances and 15 attributes (the 14 attributes listed below plus the income class). I chose this data set to identify and understand the factors that affect a person's income.

 

Data set attribute information

Attribute: Information

age: continuous

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: continuous

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous

capital-loss: continuous

hours-per-week: continuous

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

 

 

Data Pre-processing

1.     Remove missing values

Before applying any rules, the data set has to be pre-processed. The first step is to check for missing values. In my data set a few values were missing for the attributes workclass, occupation and native-country. I opened the data set in Microsoft Excel and used its Filter function to remove the rows with missing values (in this case, cells with the value “?”). Filtering removed 2399 instances, so 30162 instances were used for processing.
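The same filtering can also be scripted instead of done by hand in Excel. The following is only a minimal sketch in plain Python: the function name `drop_rows_with_missing` and the four sample rows are made up for illustration, and only the “?” marker comes from the data set itself.

```python
MISSING = "?"  # marker used for unknown values in the Adult data file


def drop_rows_with_missing(rows, marker=MISSING):
    """Keep only the rows in which no cell equals the missing-value marker."""
    return [row for row in rows
            if all(cell.strip() != marker for cell in row)]


# Made-up sample rows (age, workclass, occupation, native-country, income).
sample = [
    ["39", "State-gov", "Adm-clerical", "United-States", "<=50K"],
    ["54", "?", "?", "United-States", ">50K"],
    ["32", "Private", "Sales", "?", "<=50K"],
    ["28", "Private", "Prof-specialty", "Cuba", "<=50K"],
]

clean = drop_rows_with_missing(sample)
print(len(sample) - len(clean), "rows removed,", len(clean), "rows kept")
```

On the real file the counts should match the ones above: 32561 - 2399 = 30162 rows kept.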

2.     Remove un-necessary attributes

I examined the data set closely and identified the attributes that would not be used for processing. Removing them reduces the complexity of the data set and makes it easier to apply rules. I identified 4 attributes (fnlwgt, education-num, capital-gain and capital-loss) that are less useful for processing and removed them from the data set. Now there are 11 attributes in the data set.
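As a quick check of the attribute count, the removal can be sketched in Python. The attribute names come from the data set description; the helper `drop_attributes` and the example record are hypothetical and only illustrate the step.

```python
ALL_ATTRIBUTES = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]
DROPPED = {"fnlwgt", "education-num", "capital-gain", "capital-loss"}


def drop_attributes(record, dropped=DROPPED):
    """Return a copy of one record (a dict) without the dropped attributes."""
    return {name: value for name, value in record.items()
            if name not in dropped}


kept = [name for name in ALL_ATTRIBUTES if name not in DROPPED]
print(len(kept), kept)

# Hypothetical record, used only to show the helper in action.
example = {"age": "39", "fnlwgt": "77516", "income": "<=50K"}
print(drop_attributes(example))
```

With these names, "age" ends up in position 1 and "hours-per-week" in position 9, which are the two numeric attributes that still have to be discretized.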

3.     Discretization

Apriori works on nominal attributes, so every numeric attribute has to be converted to a categorical one before mining. I used the unsupervised Discretize filter in Weka to convert the numeric attributes “age” and “hours-per-week” to categorical data. 3 bins were created for each of the two attributes, and the filter automatically assigned every value to its bin. Now all 11 attributes are nominal and my data set is ready for rule mining.

Weka created 3 bins for “age” as {'\'(-inf-41.333333]\'','\'(41.333333-65.666667]\'','\'(65.666667-inf)\''} and 3 bins for “hours-per-week” as {'\'(-inf-33.666667]\'','\'(33.666667-66.333333]\'','\'(66.333333-inf)\''}. To make the data set and the mined rules easier to read, I replaced the labels of “age” with {'0_41','42_65','66_MAX'} and the labels of “hours-per-week” with {'0_33','34_66','67_MAX'}.
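The cut points above are what equal-width binning with 3 bins would produce. The sketch below is not Weka's code. It assumes the observed ranges are 17 to 90 for age and 1 to 99 for hours-per-week; those ranges are assumptions that are not stated in this post, chosen because they reproduce the reported cut points. The function names are made up for illustration.

```python
def equal_width_cut_points(lo, hi, bins=3):
    """Split the range [lo, hi] into `bins` intervals of equal width."""
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def assign_label(value, cut_points, labels):
    """Return the label of the first interval (a, b] that contains value."""
    for cut, label in zip(cut_points, labels):
        if value <= cut:
            return label
    return labels[-1]


# Assumed minimum and maximum values (not given in the post).
age_cuts = equal_width_cut_points(17, 90)
hours_cuts = equal_width_cut_points(1, 99)
print([round(c, 6) for c in age_cuts])
print([round(c, 6) for c in hours_cuts])

AGE_LABELS = ["0_41", "42_65", "66_MAX"]
HOURS_LABELS = ["0_33", "34_66", "67_MAX"]
print(assign_label(41, age_cuts, AGE_LABELS))
print(assign_label(40, hours_cuts, HOURS_LABELS))
```

Because both attributes hold whole numbers, the cut point 41.333333 separates ages up to 41 from ages 42 and above, which is why the readable labels use 0_41 and 42_65.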

I selected income (which has the values >50K and <=50K) as the class variable.

 

Applying Association rules (Apriori Algorithm)

In the Associate tab of Weka I selected the Apriori algorithm. In the Apriori configuration I set the algorithm to mine class association rules and changed the number of rules to 40. The other settings were left at their defaults.
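A class association rule is a rule whose right-hand side is always a value of the class attribute, here income. To show the idea on a small scale, the sketch below counts every attribute combination by brute force on six made-up records. It is not Weka's Apriori implementation (there is no candidate pruning), and the function name, the toy records and the thresholds are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_class_rules(records, class_attr, min_support, min_conf, max_len=2):
    """Brute-force class association rules: antecedent items ==> class value."""
    n = len(records)
    antecedent_counts = Counter()
    rule_counts = Counter()
    for rec in records:
        items = sorted((a, v) for a, v in rec.items() if a != class_attr)
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                antecedent_counts[combo] += 1
                rule_counts[(combo, rec[class_attr])] += 1
    rules = []
    for (combo, label), both in rule_counts.items():
        support = both / n                       # share of all records
        conf = both / antecedent_counts[combo]   # share of matching records
        if support >= min_support and conf >= min_conf:
            rules.append((combo, label, both, antecedent_counts[combo], conf))
    # Highest confidence first, then highest count.
    rules.sort(key=lambda r: (-r[4], -r[2]))
    return rules


# Made-up toy records, not taken from the Adult data set.
toy = [
    {"age": "0_41", "marital-status": "Never-married", "income": "<=50K"},
    {"age": "0_41", "marital-status": "Never-married", "income": "<=50K"},
    {"age": "0_41", "marital-status": "Married", "income": ">50K"},
    {"age": "42_65", "marital-status": "Married", "income": ">50K"},
    {"age": "0_41", "marital-status": "Never-married", "income": "<=50K"},
    {"age": "42_65", "marital-status": "Never-married", "income": ">50K"},
]

rules = mine_class_rules(toy, "income", min_support=0.4, min_conf=0.9)
for combo, label, both, total, conf in rules:
    lhs = " ".join(f"{a}={v}" for a, v in combo)
    print(f"{lhs} {total} ==> income={label} {both}    conf:({conf:.2f})")
```

The printed line follows the layout Weka uses: the first number is how many records match the left-hand side, and the second is how many of those also have the class value.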

 

Results

=== Run information ===

 

Scheme:       weka.associations.Apriori -N 40 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -A -c -1

Relation:     dataset-weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R1-precision6-weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R9-precision6

Instances:    30162

Attributes:   11

              age

              workclass

              education

              marital-status

              occupation

              relationship

              race

              sex

              hours-per-week

              native-country

              income

=== Associator model (full training set) ===

 

 

Apriori

=======

 

Minimum support: 0.15 (4524 instances)

Minimum metric <confidence>: 0.9

Number of cycles performed: 17

 

Generated sets of large itemsets:

 

Size of set of large itemsets L(1): 21

Size of set of large itemsets L(2): 70

Size of set of large itemsets L(3): 104

Size of set of large itemsets L(4): 61

Size of set of large itemsets L(5): 19

Size of set of large itemsets L(6): 3

 

Best rules found:

 

 1. age=0_41 workclass= Private marital-status= Never-married 7358 ==> income= <=50K 7125    conf:(0.97)

 2. age=0_41 workclass= Private marital-status= Never-married native-country= United-States 6674 ==> income= <=50K 6457    conf:(0.97)

 3. age=0_41 workclass= Private marital-status= Never-married race= White 6149 ==> income= <=50K 5944    conf:(0.97)

 4. age=0_41 workclass= Private marital-status= Never-married race= White native-country= United-States 5703 ==> income= <=50K 5505    conf:(0.97)

 5. age=0_41 marital-status= Never-married 8733 ==> income= <=50K 8414    conf:(0.96)

 6. age=0_41 marital-status= Never-married native-country= United-States 7963 ==> income= <=50K 7666    conf:(0.96)

 7. age=0_41 marital-status= Never-married race= White 7230 ==> income= <=50K 6954    conf:(0.96)

 8. age=0_41 marital-status= Never-married race= White native-country= United-States 6737 ==> income= <=50K 6473    conf:(0.96)

 9. workclass= Private marital-status= Never-married 8025 ==> income= <=50K 7706    conf:(0.96)

10. workclass= Private marital-status= Never-married native-country= United-States 7270 ==> income= <=50K 6972    conf:(0.96)

11. workclass= Private marital-status= Never-married race= White 6688 ==> income= <=50K 6403    conf:(0.96)

12. age=0_41 workclass= Private marital-status= Never-married hours-per-week=34_66 5136 ==> income= <=50K 4915    conf:(0.96)

13. workclass= Private marital-status= Never-married race= White native-country= United-States 6194 ==> income= <=50K 5920    conf:(0.96)

14. age=0_41 marital-status= Never-married sex= Male 4904 ==> income= <=50K 4684    conf:(0.96)

15. marital-status= Never-married 9726 ==> income= <=50K 9256    conf:(0.95)

16. age=0_41 marital-status= Never-married hours-per-week=34_66 6165 ==> income= <=50K 5861    conf:(0.95)

17. marital-status= Never-married native-country= United-States 8876 ==> income= <=50K 8435    conf:(0.95)

18. age=0_41 marital-status= Never-married hours-per-week=34_66 native-country= United-States 5571 ==> income= <=50K 5289    conf:(0.95)

19. marital-status= Never-married race= White 8036 ==> income= <=50K 7622    conf:(0.95)

20. age=0_41 marital-status= Never-married race= White hours-per-week=34_66 5065 ==> income= <=50K 4803    conf:(0.95)

21. marital-status= Never-married race= White native-country= United-States 7489 ==> income= <=50K 7092    conf:(0.95)

22. workclass= Private marital-status= Never-married hours-per-week=34_66 5693 ==> income= <=50K 5390    conf:(0.95)

23. workclass= Private marital-status= Never-married hours-per-week=34_66 native-country= United-States 5103 ==> income= <=50K 4820    conf:(0.94)

24. marital-status= Never-married sex= Male 5414 ==> income= <=50K 5107    conf:(0.94)

25. marital-status= Never-married sex= Male native-country= United-States 4900 ==> income= <=50K 4612    conf:(0.94)

26. marital-status= Never-married hours-per-week=34_66 6994 ==> income= <=50K 6552    conf:(0.94)

27. marital-status= Never-married hours-per-week=34_66 native-country= United-States 6330 ==> income= <=50K 5916    conf:(0.93)

28. marital-status= Never-married race= White hours-per-week=34_66 5727 ==> income= <=50K 5339    conf:(0.93)

29. marital-status= Never-married race= White hours-per-week=34_66 native-country= United-States 5296 ==> income= <=50K 4925    conf:(0.93)

30. age=0_41 workclass= Private sex= Female 5279 ==> income= <=50K 4859    conf:(0.92)

31. age=0_41 relationship= Not-in-family 5038 ==> income= <=50K 4622    conf:(0.92)

32. age=0_41 sex= Female native-country= United-States 5853 ==> income= <=50K 5325    conf:(0.91)

33. age=0_41 sex= Female 6382 ==> income= <=50K 5805    conf:(0.91)

34. workclass= Private relationship= Not-in-family 5899 ==> income= <=50K 5343    conf:(0.91)

35. workclass= Private sex= Female 7642 ==> income= <=50K 6921    conf:(0.91)

36. age=0_41 workclass= Private education= HS-grad 5076 ==> income= <=50K 4591    conf:(0.9)

37. workclass= Private sex= Female native-country= United-States 6926 ==> income= <=50K 6264    conf:(0.9)

38. workclass= Private relationship= Not-in-family native-country= United-States 5397 ==> income= <=50K 4872    conf:(0.9)

39. workclass= Private relationship= Not-in-family race= White 5130 ==> income= <=50K 4627    conf:(0.9)

40. age=0_41 race= White sex= Female 5156 ==> income= <=50K 4649    conf:(0.9)
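The minimum support of 0.15 in the output above was not set by hand. Apriori in Weka starts from the upper bound (-U 1.0) and lowers the support in steps of delta (-D 0.05) until it finds the requested number of rules or reaches the lower bound (-M 0.1). The arithmetic below is a small check of the reported numbers under that reading. Rounding to the nearest whole instance is my assumption about how the count is displayed.

```python
upper_bound = 1.0   # -U, upper bound for minimum support
delta = 0.05        # -D, step by which the support is lowered
lower_bound = 0.1   # -M, lower bound for minimum support
cycles = 17         # "Number of cycles performed" in the output
instances = 30162   # "Instances" in the output

min_support = round(upper_bound - cycles * delta, 2)
min_count = round(min_support * instances)

print("minimum support:", min_support)
print("minimum instances:", min_count)

# Smallest rule count in the list of 40 rules (rule 36).
smallest_rule_count = 4591
print("all rules above threshold:", smallest_rule_count >= min_count)
```

The search stopped at 0.15, above the lower bound of 0.1, because 40 rules with confidence of at least 0.9 had already been found.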

 

 

Interesting rules

·        age=0_41 workclass= Private marital-status= Never-married native-country= United-States 6674 ==> income= <=50K 6457    conf:(0.97)

·        age=0_41 marital-status= Never-married sex= Male 4904 ==> income= <=50K 4684    conf:(0.96)

·         age=0_41 workclass= Private education= HS-grad 5076 ==> income= <=50K 4591    conf:(0.9)

·         age=0_41 workclass= Private sex= Female 5279 ==> income= <=50K 4859    conf:(0.92)

 

 

Rule Evaluation

Analyzing the selected rules shows that younger people (the 0_41 age bin) tend to have an income of $50K or less, and that young people who are educated up to high school level and work in the private sector also earn $50K or less. The 4th selected rule says with 92% confidence that young females who work in the private sector have an income of $50K or less.
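The confidence values can be recomputed from the two counts Weka prints for each rule: the number of instances matching the left-hand side, and the number of those that also have income <=50K. The sketch below does this for the four selected rules. The counts are copied from the results, while the function names are made up for illustration.

```python
TOTAL_INSTANCES = 30162

# (left-hand side, instances matching it, of which income <=50K)
selected = [
    ("age=0_41 workclass=Private marital-status=Never-married "
     "native-country=United-States", 6674, 6457),
    ("age=0_41 marital-status=Never-married sex=Male", 4904, 4684),
    ("age=0_41 workclass=Private education=HS-grad", 5076, 4591),
    ("age=0_41 workclass=Private sex=Female", 5279, 4859),
]


def confidence(lhs_count, rule_count):
    """Share of the instances matching the left-hand side that fit the rule."""
    return rule_count / lhs_count


def support(rule_count, total=TOTAL_INSTANCES):
    """Share of all instances that fit the whole rule."""
    return rule_count / total


for lhs, lhs_count, rule_count in selected:
    print(f"{lhs}: conf={confidence(lhs_count, rule_count):.4f} "
          f"support={support(rule_count):.4f}")

confidences = [round(confidence(a, r), 2) for _, a, r in selected]
```

For the 4th rule this gives 4859 / 5279, which rounds to the 0.92 quoted above.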

 

 

Use of rules

 

These rules suggest that young employees need capacity-building programs to enhance their work experience and earn more income at a young age. Female employees also need encouragement to build successful careers.

