


Data set selection
I selected the Adult data set from the UCI data set collection (https://archive.ics.uci.edu/ml/datasets/Adult). The donors of this data set are
Ronny Kohavi and Barry Becker of Data Mining and Visualization, Silicon
Graphics. It is a multivariate data set with 32561 instances and 15
attributes. I chose this data set to identify and understand
the factors that affect a person's income.
Data set attribute information
Attribute | Information |
age | continuous |
workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. |
fnlwgt | continuous |
education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. |
education-num | continuous |
marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. |
occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. |
relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. |
race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. |
sex | Female, Male. |
capital-gain | continuous |
capital-loss | continuous |
hours-per-week | continuous |
native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. |
Data Pre-processing
1. Remove missing values
Before applying any rules to the data set, it has to be
pre-processed. The first step is checking for missing values. In my data
set there were a few missing values for some attributes, namely workclass, occupation and
native-country. I opened the data set in Microsoft Excel and used its
Filter function to remove rows with missing values (in this case, cells with
the value “?”). Filtering removed 2399 instances, so 30162 instances
were used for processing.
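The same filtering step can also be scripted instead of done by hand in Excel. The following is a minimal sketch, not the procedure used above: it assumes the data is in the comma-separated layout of the Adult data file, and the function name and the three sample rows are made up for illustration.

```python
import csv
import io

MISSING = "?"  # marker the Adult data set uses for a missing value


def drop_rows_with_missing(rows):
    """Keep only the rows in which no cell equals the missing-value marker."""
    return [row for row in rows if MISSING not in (cell.strip() for cell in row)]


# Tiny made-up sample (shortened rows), standing in for the real file.
sample = io.StringIO(
    "39, State-gov, Bachelors, United-States, <=50K\n"
    "54, ?, HS-grad, United-States, >50K\n"
    "27, Private, Masters, ?, <=50K\n"
)
rows = list(csv.reader(sample))
clean = drop_rows_with_missing(rows)
print(len(rows) - len(clean), "rows removed,", len(clean), "kept")

# Sanity check of the counts reported in the text.
print(32561 - 2399)
```

On the full data set, the number of removed rows should match the 2399 reported above if the same “?” marker is the only form of missing value.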
2. Remove unnecessary attributes
I closely
examined the data set and identified attributes that would not be used for
processing. Removing them reduces the complexity of the data set and makes it easier to
apply rules. In the data set above I identified 4 attributes (fnlwgt, education-num, capital-gain
and capital-loss) that are less useful for processing and removed them from the
data set. The data set now has 11 attributes.
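As a rough sketch of this step, the four attributes can be dropped by column name. The column order below follows the attribute table plus the income class. The helper name and the sample row are illustrative assumptions, not part of the original workflow.

```python
# Column order of the Adult data set: the 14 attributes listed above plus income.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]
DROPPED = {"fnlwgt", "education-num", "capital-gain", "capital-loss"}


def drop_columns(row, columns=COLUMNS, dropped=DROPPED):
    """Return the row without the values of the dropped attributes."""
    return [value for name, value in zip(columns, row) if name not in dropped]


kept_names = [name for name in COLUMNS if name not in DROPPED]

# Illustrative row in the same column order.
row = ["39", "State-gov", "77516", "Bachelors", "13", "Never-married",
       "Adm-clerical", "Not-in-family", "White", "Male", "2174", "0", "40",
       "United-States", "<=50K"]
print(len(kept_names), kept_names)
print(drop_columns(row))
```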
3. Discretization
Because Apriori association rule mining in Weka works only on nominal attributes,
all numeric attributes have to be converted. I used the Discretize filter in
Weka to convert the numeric attributes “age” and “hours-per-week” in the data set
to categorical data. 3 bins were created for each of the “age” and “hours-per-week”
attributes, and Weka automatically assigned each value to the relevant
bin. All 11 attributes are now nominal, and
the data set is ready for applying rules.
Weka created 3 bins for “age”, {'\'(-inf-41.333333]\'','\'(41.333333-65.666667]\'','\'(65.666667-inf)\''},
and 3 bins for “hours-per-week”, {'\'(-inf-33.666667]\'','\'(33.666667-66.333333]\'','\'(66.333333-inf)\''}.
To make the data set and the resulting association rules easier to read,
I replaced the labels of
“age” with {'0_41','42_65','66_MAX'} and the labels of “hours-per-week”
with {'0_33','34_66','67_MAX'}.
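The cut points above are consistent with equal-width binning, which divides the range between the smallest and largest value into 3 intervals of the same width. The sketch below reproduces them under the assumption that age ranges from 17 to 90 and hours-per-week from 1 to 99. These ranges are inferred from the cut points, not stated in the Weka output, and the function names are illustrative.

```python
def equal_width_cut_points(lo, hi, bins):
    """Return the bins-1 interior cut points that split [lo, hi] into equal widths."""
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def label_for(value, cut_points, labels):
    """Map a numeric value to a label; intervals are closed on the right, e.g. (a, b]."""
    for cut, label in zip(cut_points, labels):
        if value <= cut:
            return label
    return labels[-1]


AGE_RANGE = (17, 90)    # assumed min/max of "age"
HOURS_RANGE = (1, 99)   # assumed min/max of "hours-per-week"

age_cuts = equal_width_cut_points(*AGE_RANGE, bins=3)
hours_cuts = equal_width_cut_points(*HOURS_RANGE, bins=3)
print([round(c, 6) for c in age_cuts])
print([round(c, 6) for c in hours_cuts])

AGE_LABELS = ["0_41", "42_65", "66_MAX"]
HOURS_LABELS = ["0_33", "34_66", "67_MAX"]
print(label_for(38, age_cuts, AGE_LABELS), label_for(40, hours_cuts, HOURS_LABELS))
```

Because both attributes hold whole numbers, the readable labels such as 0_41 and 42_65 cover the same values as the original intervals.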
I selected income (which has the values >50K and <=50K) as the class
variable.
Applying Association rules (Apriori Algorithm)
In Weka's
Associate tab I selected the Apriori algorithm. In the Apriori configuration I set the
algorithm to mine class association rules and changed the number of rules to
40. The other settings were left at their defaults.
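With these settings, Weka's Apriori does not use one fixed minimum support. It starts near the upper bound (-U 1.0) and lowers the minimum support by the delta (-D 0.05) on each cycle, until it has found the requested number of rules (-N 40) or reaches the lower bound (-M 0.1). The sketch below only reproduces that arithmetic. It is an assumption about how the values reported in the run below (17 cycles, minimum support 0.15, 4524 instances) come about, not a reimplementation of Weka.

```python
UPPER_BOUND = 1.0   # -U: upper bound for the minimum support
DELTA = 0.05        # -D: amount the minimum support is lowered by per cycle
LOWER_BOUND = 0.1   # -M: lower bound for the minimum support
INSTANCES = 30162   # instances left after removing missing values


def support_schedule(cycles, upper=UPPER_BOUND, delta=DELTA):
    """Minimum support tried in each cycle, assuming one delta step per cycle."""
    return [round(upper - delta * k, 2) for k in range(1, cycles + 1)]


schedule = support_schedule(17)
final_support = schedule[-1]
min_instances = int(final_support * INSTANCES)

print(schedule[0], "...", final_support)
print(min_instances, "instances needed for a frequent itemset")
```

The search stopped above the lower bound of 0.1 because 40 rules with confidence of at least 0.9 had already been found at a minimum support of 0.15.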
Results
=== Run information ===

Scheme: weka.associations.Apriori -N 40 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -A -c -1
Relation: dataset-weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R1-precision6-weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R9-precision6
Instances: 30162
Attributes: 11
age
workclass
education
marital-status
occupation
relationship
race
sex
hours-per-week
native-country
income

=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.15 (4524 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 21
Size of set of large itemsets L(2): 70
Size of set of large itemsets L(3): 104
Size of set of large itemsets L(4): 61
Size of set of large itemsets L(5): 19
Size of set of large itemsets L(6): 3

Best rules found:

1. age=0_41 workclass= Private marital-status= Never-married 7358 ==> income= <=50K 7125 conf:(0.97)
2. age=0_41 workclass= Private marital-status= Never-married native-country= United-States 6674 ==> income= <=50K 6457 conf:(0.97)
3. age=0_41 workclass= Private marital-status= Never-married race= White 6149 ==> income= <=50K 5944 conf:(0.97)
4. age=0_41 workclass= Private marital-status= Never-married race= White native-country= United-States 5703 ==> income= <=50K 5505 conf:(0.97)
5. age=0_41 marital-status= Never-married 8733 ==> income= <=50K 8414 conf:(0.96)
6. age=0_41 marital-status= Never-married native-country= United-States 7963 ==> income= <=50K 7666 conf:(0.96)
7. age=0_41 marital-status= Never-married race= White 7230 ==> income= <=50K 6954 conf:(0.96)
8. age=0_41 marital-status= Never-married race= White native-country= United-States 6737 ==> income= <=50K 6473 conf:(0.96)
9. workclass= Private marital-status= Never-married 8025 ==> income= <=50K 7706 conf:(0.96)
10. workclass= Private marital-status= Never-married native-country= United-States 7270 ==> income= <=50K 6972 conf:(0.96)
11. workclass= Private marital-status= Never-married race= White 6688 ==> income= <=50K 6403 conf:(0.96)
12. age=0_41 workclass= Private marital-status= Never-married hours-per-week=34_66 5136 ==> income= <=50K 4915 conf:(0.96)
13. workclass= Private marital-status= Never-married race= White native-country= United-States 6194 ==> income= <=50K 5920 conf:(0.96)
14. age=0_41 marital-status= Never-married sex= Male 4904 ==> income= <=50K 4684 conf:(0.96)
15. marital-status= Never-married 9726 ==> income= <=50K 9256 conf:(0.95)
16. age=0_41 marital-status= Never-married hours-per-week=34_66 6165 ==> income= <=50K 5861 conf:(0.95)
17. marital-status= Never-married native-country= United-States 8876 ==> income= <=50K 8435 conf:(0.95)
18. age=0_41 marital-status= Never-married hours-per-week=34_66 native-country= United-States 5571 ==> income= <=50K 5289 conf:(0.95)
19. marital-status= Never-married race= White 8036 ==> income= <=50K 7622 conf:(0.95)
20. age=0_41 marital-status= Never-married race= White hours-per-week=34_66 5065 ==> income= <=50K 4803 conf:(0.95)
21. marital-status= Never-married race= White native-country= United-States 7489 ==> income= <=50K 7092 conf:(0.95)
22. workclass= Private marital-status= Never-married hours-per-week=34_66 5693 ==> income= <=50K 5390 conf:(0.95)
23. workclass= Private marital-status= Never-married hours-per-week=34_66 native-country= United-States 5103 ==> income= <=50K 4820 conf:(0.94)
24. marital-status= Never-married sex= Male 5414 ==> income= <=50K 5107 conf:(0.94)
25. marital-status= Never-married sex= Male native-country= United-States 4900 ==> income= <=50K 4612 conf:(0.94)
26. marital-status= Never-married hours-per-week=34_66 6994 ==> income= <=50K 6552 conf:(0.94)
27. marital-status= Never-married hours-per-week=34_66 native-country= United-States 6330 ==> income= <=50K 5916 conf:(0.93)
28. marital-status= Never-married race= White hours-per-week=34_66 5727 ==> income= <=50K 5339 conf:(0.93)
29. marital-status= Never-married race= White hours-per-week=34_66 native-country= United-States 5296 ==> income= <=50K 4925 conf:(0.93)
30. age=0_41 workclass= Private sex= Female 5279 ==> income= <=50K 4859 conf:(0.92)
31. age=0_41 relationship= Not-in-family 5038 ==> income= <=50K 4622 conf:(0.92)
32. age=0_41 sex= Female native-country= United-States 5853 ==> income= <=50K 5325 conf:(0.91)
33. age=0_41 sex= Female 6382 ==> income= <=50K 5805 conf:(0.91)
34. workclass= Private relationship= Not-in-family 5899 ==> income= <=50K 5343 conf:(0.91)
35. workclass= Private sex= Female 7642 ==> income= <=50K 6921 conf:(0.91)
36. age=0_41 workclass= Private education= HS-grad 5076 ==> income= <=50K 4591 conf:(0.9)
37. workclass= Private sex= Female native-country= United-States 6926 ==> income= <=50K 6264 conf:(0.9)
38. workclass= Private relationship= Not-in-family native-country= United-States 5397 ==> income= <=50K 4872 conf:(0.9)
39. workclass= Private relationship= Not-in-family race= White 5130 ==> income= <=50K 4627 conf:(0.9)
40. age=0_41 race= White sex= Female 5156 ==> income= <=50K 4649 conf:(0.9)

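Each rule line carries two counts: the number of instances matching the premise (before ==>) and the number of those that also match the consequent. The confidence is the second count divided by the first. The sketch below parses a rule line and recomputes confidence and support from the counts. The regular expression and function names are illustrative assumptions about the line layout shown above, not part of Weka.

```python
import re

INSTANCES = 30162  # total instances reported in the run information

# Optional rule number, premise, premise count, "==>", consequent,
# consequent count, and the confidence Weka printed.
RULE_PATTERN = re.compile(
    r"^(?:\d+\.\s+)?(?P<premise>.+?)\s+(?P<premise_count>\d+)\s+==>\s+"
    r"(?P<consequent>.+?)\s+(?P<rule_count>\d+)\s+conf:\((?P<conf>[\d.]+)\)$"
)


def parse_rule(line):
    """Split one rule line into its parts and recompute confidence and support."""
    match = RULE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a rule line: " + line)
    premise_count = int(match["premise_count"])
    rule_count = int(match["rule_count"])
    return {
        "premise": match["premise"],
        "consequent": match["consequent"],
        "confidence": rule_count / premise_count,  # P(consequent | premise)
        "support": rule_count / INSTANCES,         # share of all instances
        "reported_confidence": float(match["conf"]),
    }


rule = parse_rule(
    "1. age=0_41 workclass= Private marital-status= Never-married 7358 "
    "==> income= <=50K 7125 conf:(0.97)"
)
print(rule["premise"], "->", rule["consequent"])
print(round(rule["confidence"], 2), round(rule["support"], 3))
```

For rule 1 this gives 7125 / 7358, which rounds to the 0.97 that Weka reports, and a support above the 0.15 minimum.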
Interesting rules
· age=0_41 workclass= Private marital-status= Never-married native-country= United-States 6674 ==> income= <=50K 6457 conf:(0.97)
· age=0_41 marital-status= Never-married sex= Male 4904 ==> income= <=50K 4684 conf:(0.96)
· age=0_41 workclass= Private education= HS-grad 5076 ==> income= <=50K 4591 conf:(0.9)
· age=0_41 workclass= Private sex= Female 5279 ==> income= <=50K 4859 conf:(0.92)
Rule Evaluation
By analyzing the rules selected above, we can see that younger people (the 0_41 age group)
earn $50K or less, and that young people who are educated up to
high school level and work in the private sector earn
$50K or less. The 4th selected rule says that, with 92% confidence, young
females who work in the private sector earn $50K or less.
Use of rules
These rules suggest that young employees need capacity-building
programs to enhance their work experience and earn more at
a young age. Also, female employees need to be encouraged to achieve
successful careers.