Monday, September 13, 2021

Data recovery of CCTV DVR systems which have Proprietary OS and Proprietary file systems - Literature Review

Abstract

The purpose of this study is to explore different ways of extracting data from closed-circuit television (CCTV) digital video recorder (DVR) systems which have proprietary operating systems and proprietary file systems. CCTV DVRs commonly have a built-in capability to export stored video files to optical storage media. In cases where a DVR is damaged, or where forensic examiners have only the hard disk of the DVR, its contents cannot be easily exported. This renders the forensically sound recovery of proprietary-formatted video files from a DVR hard disk an expensive and challenging exercise.

Introduction

Recently, a large amount of video content has been produced in line with the widespread adoption of surveillance cameras, digital video recorders and automobile black boxes. Video surveillance and closed-circuit television (CCTV) systems serve as deterrents to crime, and can be used to gather evidence, monitor the behavior of known offenders and reduce the fear of crime. CCTV systems can be broadly categorized into analog, digital and Internet Protocol (IP) based systems. Analog systems have limited abilities to store, replicate and process large amounts of video data, and the quality of their images and video is generally quite low. Digital CCTV systems use digital cameras and hard disk storage media. IP-based CCTV systems stream digital camera video using network protocols. In digital CCTV forensics, it is extremely challenging to recover evidence in a forensically sound manner without data recovery expertise across a wide range of storage media with different file systems and video formats. The challenge is compounded if the storage media of a digital video recorder (DVR) is damaged or forensic examiners have only the hard disk of the DVR. The variety of digital CCTV systems further complicates the digital forensic process, as many systems use proprietary technologies. Therefore, digital forensic practitioners need an intimate understanding of digital CCTV systems.

Identifying Video File Formats

CCTV DVRs usually include a mechanism for video compression. Although the OS and file systems of CCTV DVRs are most often proprietary, they use standard video compression techniques. Video compression is the process of using a codec to reduce or eliminate unnecessary frames in a video file (a video file is a sequence of still images called frames). This makes video files smaller and saves storage on a CCTV DVR's hard disks. Two common compression formats are H.264 and MJPEG, while MPEG-4 is an older standard. Each compression format has its own file structure and attributes. Gloe et al. in 2014 [1] extended the idea of file format forensics to popular digital video container formats. Their study identifies manufacturer- and model-specific video file format characteristics and points to traces left by processing software. Such traces can be used to authenticate digital video streams and to attribute recordings of unknown or questionable provenance to (groups of) video camera models. They identified the header attributes, footer attributes and all other segments in the AVI and MP4 video container formats and constructed attribute structure diagrams of each format, including the standard order of each attribute, the case of each attribute name, and the purpose of each attribute. This knowledge can be used to identify and extract video files from CCTV DVRs which use the AVI or MP4 container formats. However, the study did not cover identifying and extracting video files in other container formats, and if a video file saved on a CCTV DVR hard disk is corrupted or partially overwritten, the proposed method cannot be used to extract the remaining video data. The method also relies on file structure internals, and tools exist that would allow users with advanced programming skills to forge such information.
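As a concrete illustration of the container-format analysis in [1], the sketch below parses the top-level RIFF header of an AVI file. This is a minimal Python sketch rather than a full attribute-structure parser; only the RIFF signature, the little-endian size field and the form type are read, and the sample buffer is synthetic.

```python
import struct

def parse_riff_header(buf):
    """Parse the top-level RIFF header of an AVI file.

    Returns (form_type, declared_size) or None if the buffer does not
    start with a RIFF signature.
    """
    if len(buf) < 12 or buf[:4] != b"RIFF":
        return None
    # RIFF sizes are stored little-endian and exclude the 8-byte header.
    (size,) = struct.unpack("<I", buf[4:8])
    form_type = buf[8:12]
    return form_type, size

# Synthetic 12-byte AVI header: RIFF container with form type 'AVI '.
sample = b"RIFF" + struct.pack("<I", 4) + b"AVI "
print(parse_riff_header(sample))  # (b'AVI ', 4)
```

The same chunk-walking idea extends to the nested LIST chunks ('hdrl', 'movi') that [1] uses to fingerprint specific camera models.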

Video file restoration using the meta-information

Conventional techniques for video file restoration use the meta-information of the file system to recover a video file stored on a CCTV hard disk. The file system meta-information contains information such as the address and links of a video file that can be used for file restoration. Carrier in 2005 [4] proposed a file restoration approach based on the file system, implemented in a software toolkit, The Sleuth Kit [5]. This approach is based on the information in the file and directory structure of a storage file system. Video file restoration may not be possible with this solution when the file system meta-information is not available or video files are corrupted or partially overwritten.

Analyzing the Hex Dumps of the Videos

Although most CCTV DVR hard disks have proprietary operating systems and file systems, we can perform byte-level analysis (Fig. 1) using tools such as WinHex [8]. This kind of byte-level analysis helps forensic examiners identify the video files saved on a CCTV hard disk in hexadecimal form. Ariffin, Slay and Choo [2], at the 9th IFIP WG 11.9 International Conference on Digital Forensics (Orlando, FL, USA, January 28-30, 2013), explained a hex-based solution to retrieve videos from proprietary-formatted CCTV hard disks. They extended McKemmish's [3] digital forensic framework and analyzed the cloned hard disk, examining the video stream, byte storage method, format, codec, channel and timestamp to identify the file signatures used for searching and carving video files with timestamps. They determined whether the byte storage method is little-endian or big-endian and derived the file signatures. This information can be used to correlate each file signature to the channel video that captured the scenes, with timestamps.

Fig 1 - Proprietary file format

Ariffin, Slay and Choo [2] performed a search for repetitive hexadecimal patterns in the channel video (from one to the number of channels) interleaved with timestamp tracks. They were able to identify the header file signatures for each channel, the footer file signature and the hexadecimal timestamps. They then used WinHex to search for and carve out the digital video evidence according to the channel (video header and footer) and timestamp signatures. Although their proposed solution enables video files with timestamps to be carved without referring to the file system, it depends on a player capable of playing the carved video files to verify the findings. In proprietary-formatted CCTV hard disks, a repetitive hexadecimal pattern often cannot be easily obtained, and corrupted or partially overwritten videos cannot be recovered by this method [2].
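The search-and-carve procedure described above can be sketched outside WinHex as a simple signature scan. The header and footer byte patterns below are hypothetical placeholders; a real DVR uses vendor-specific signatures that must first be identified in the hex dump, as in [2].

```python
def carve_by_signature(buf, header, footer):
    """Carve every byte run that starts with `header` and ends at the
    next occurrence of `footer` (footer included)."""
    segments = []
    pos = 0
    while True:
        start = buf.find(header, pos)
        if start == -1:
            break
        end = buf.find(footer, start + len(header))
        if end == -1:
            break
        segments.append(buf[start:end + len(footer)])
        pos = end + len(footer)
    return segments

# Hypothetical channel-1 signatures embedded in filler bytes.
disk = b"\x00\x11CH1HDRframe-dataCH1END\x22CH1HDRmoreCH1ENDtrailing"
clips = carve_by_signature(disk, b"CH1HDR", b"CH1END")  # two carved segments
```

Per-channel carving then just repeats the scan with each channel's header signature.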

File carving in video file recovery

Casey and Zoun in 2014 [6] presented various design tradeoffs in video recovery techniques. They identified the practical problems in video recovery and described the tradeoffs that developers must consider when creating file carving tools for identifying and reassembling fragmented AVI, MPEG, and 3GP video files. They also explain that if the locations of individual video frames can be detected directly within a video container using the relevant specifications, one would be less dependent on the availability of indexes from container formats, and the video frame locations could then be determined more locally. Such location information could be used to generate an appropriate container video file index for a partial file.

According to Casey and Zoun's [6] study, because of the complexities encountered in real-world hex dumps of CCTV data, there is no single approach that can identify fragmented video files and render them playable.

Gi-Hyun Na, Kyu-Sun Shim, Ki Woong Moon, Seong G. Kong and Joong Lee in 2014 [7] proposed a frame-based method to recover corrupted video files using the video codec specification. Video data consists of a sequence of video frames, the minimum meaningful units of a video file. They propose a technique to restore video data on a frame-by-frame basis from corrupted versions where the data has been significantly fragmented or partly overwritten on the CCTV hard disk.

The proposed method identifies, collects, and connects isolated video frames from the non-overwritten portions of the video data, using the video codec specifications, to restore a corrupted video file. The technique consists of an extraction phase and a connection phase of relevant video frames (Fig. 2).

Fig. 2 - Processing steps of the proposed frame-based video file restoration technique

The extraction phase uses the video codec specifications to extract a set of video frames from the CCTV hard disk. In the connection phase, the extracted frames are grouped and connected into relevant sequences using the specifications of the video file format used.

The proposed method was tested on three kinds of video files encoded with the MPEG-4 Visual, H.264_start and H.264_Length codecs. The recovery rates of video files decrease as the number of fragments increases, and the degree of overwriting also significantly affects the restoration rate. In their approach, a human expert must go through the video header to identify the video codec specifications, which is not a simple task.
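To make the extraction phase concrete, the sketch below scans a raw byte stream for the 4-byte H.264 Annex B start code and pulls out the NAL units between codes, which is the spirit of the H.264_start variant discussed in [7]. This is a simplified, assumption-laden sketch: the connection phase, 3-byte start codes and codec-specific frame validation are omitted, and the stream is synthetic.

```python
def extract_nal_units(buf):
    """Extraction-phase sketch: split a raw byte stream on the 4-byte
    H.264 Annex B start code and return the NAL units found after it."""
    start_code = b"\x00\x00\x00\x01"
    parts = buf.split(start_code)
    # Bytes before the first start code are not part of any NAL unit.
    return [p for p in parts[1:] if p]

stream = (b"unrelated-bytes"
          + b"\x00\x00\x00\x01" + b"\x67sps"
          + b"\x00\x00\x00\x01" + b"\x65idr-frame")
nals = extract_nal_units(stream)  # two NAL-unit payloads
```

A real connection phase would then inspect each NAL unit's type byte and codec parameters to decide which extracted frames belong together.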

Conclusion

Table 1 lists the reviewed studies, their main area of concern and their limitations in successful video recovery in CCTV DVRs which have proprietary OS and proprietary file systems.

Research | Main area of concern | Identified potential limitations

Forensic analysis of video file formats (2014) [1]
  • Focused on the structures of common video file formats
  • Identified metadata, header and footer attributes of some video file formats used in CCTV DVR systems
  • Relies on metadata information, which can be manipulated
  • Cannot be used to recover partially overwritten video files
  • Cannot be used to recover corrupted video files
File System Forensic Analysis (2005) [4]
  • Proposes a file restoration tool based on the information in the file and directory structure of a storage file system
  • Video file restoration may not be possible when the file system meta-information is not available or video files are corrupted or partially overwritten
Design tradeoffs for developing fragmented video carving tools (2014) [6]
  • Presented various design tradeoffs in video recovery techniques
  • Identified the practical problems in video recovery and the tradeoffs developers must consider when creating file carving tools
  • Because of the complexities of real-world hex dumps of CCTV data, there is no single approach to identifying fragmented video files
Frame-Based Recovery of Corrupted Video Files Using Video Codec Specifications (2014) [7]
  • Recovers corrupted video files on a frame basis using the video codec specification
  • Recovery rates decrease as the number of fragments increases
  • The degree of overwriting also significantly affects the restoration rate
  • Time is needed to recognize the correct frames and reassemble the video
Table 1

Large video files are often fragmented and overwritten. Many existing file-based techniques cannot restore partially overwritten video files, whereas the frame-based recovery technique increases the restoration ratio. Recovery time is also an important practical concern: the time taken to identify frames and reassemble the video increases with the size of the CCTV hard disk.

REFERENCES

[1] Thomas Gloe, André Fischer and Matthias Kirchner, “Forensic analysis of video file formats”, Digital Investigation 11 (2014) S68–S76.
[2] Aswami Ariffin, Jill Slay and Kim-Kwang Choo, 9th IFIP WG 11.9 International Conference on Digital Forensics, Orlando, FL, USA, January 28-30, 2013.
[3] R. McKemmish, “What is forensic computing?”, Trends and Issues in Crime and Criminal Justice, no. 118, 1999.
[4] B. Carrier, File System Forensic Analysis. Boston, MA, USA: Addison-Wesley, 2005.
[5] B. Carrier. (2005). The Sleuth Kit [Online]. Available: http://www.sleuthkit.org/sleuthkit/
[6] Eoghan Casey and Rikkert Zoun, “Design Tradeoffs for Developing Fragmented Video Carving Tools”, 2014 Digital Forensics Research Workshop (DFRWS), Elsevier Ltd.
[7] Gi-Hyun Na, Kyu-Sun Shim, Ki Woong Moon, Seong G. Kong, Eun-Soo Kim and Joong Lee, “Frame-Based Recovery of Corrupted Video Files Using Video Codec Specifications”, IEEE Transactions on Image Processing, Vol. 23, No. 2, February 2014.
[8] (2004). WinHex [Online]. Available: http://www.x-ways.net/winhex/Index-m.html

Thursday, September 9, 2021

Digital Watermarking

Introduction

With the rapid growth of the internet and the World Wide Web, more and more digital media are transmitted and shared across networks, so commercial organizations tend to apply digital watermarks to their digital media such as audio, video and images to protect ownership rights. As Bryan Smith (Overview of Digital Watermark—For Images and Files, SANS Institute, 2000-2002) explained, digital watermarking is a means to protect ownership of files and images: the idea is to create a unique identifier and embed it in a file or image in order to be able to prove that the file or image is the property of the rightful owner. Digital watermarking and steganography are closely related terms, but watermarking hides a message related to the actual content of the digital signal, while in steganography the digital signal has no relation to the message and is merely used as a cover to hide its existence. Other than digital copyright management, digital watermarking has several other important applications such as broadcast monitoring, owner identification, transaction tracking and copy control.
There are two main watermark embedding strategies available. They are

  1. Blind embedding
  2. Informed embedding

Blind embedding does not exploit the statistics of the original image to embed a message; detection is done using linear correlation. Informed embedding is an extension of the blind embedding system that aims to provide 100% effectiveness in watermark detection: the watermark is shaped by the original image before it is added to the image. Once the watermarked digital media is transmitted over a channel, noise is added, and this noise significantly affects the watermark decoding process. The noise may be Gaussian noise, shot noise, salt-and-pepper noise, etc.
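A minimal numeric sketch of blind embedding and linear-correlation detection follows. Plain Python lists stand in for pixels; the reference pattern, cover statistics and alpha value are illustrative assumptions, not the exact setup used later in this post.

```python
import random

def linear_correlation(c, wr):
    """Blind detector statistic: z_lc = (1/N) * sum_i c[i] * wr[i]."""
    return sum(x * w for x, w in zip(c, wr)) / len(c)

random.seed(7)
N = 1000
wr = [random.gauss(0, 1) for _ in range(N)]     # reference pattern (unit variance)
c0 = [random.gauss(128, 40) for _ in range(N)]  # cover work pixel values
alpha = 1.0

# Blind embedding: wa = alpha * wm, with wm = +wr for m = 1 and -wr for m = 0.
cw1 = [c + alpha * w for c, w in zip(c0, wr)]
cw0 = [c - alpha * w for c, w in zip(c0, wr)]

lc1 = linear_correlation(cw1, wr)
lc0 = linear_correlation(cw0, wr)
# The two detection values are separated by about 2*alpha (since E[wr^2] = 1),
# shifted by the chance correlation between the cover and the pattern.
```

Thresholding these detection values against upper and lower cutoffs is exactly the decision step carried out in the activity below.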

Activity

In this tutorial I used the Labeled Faces in the Wild dataset from the University of Massachusetts, Amherst. This dataset contains 13,000 labeled faces that can be used for machine learning tasks involving facial recognition, and I used it to evaluate my basic watermarking model. I divided the image set into three sets containing 4000, 4000 and 5000 images, then embedded a 1-bit watermark (0 and 1) into image sets 1 and 2. For that, I derived a reference pattern (W_r) based on the Gaussian distribution and, from that reference pattern, derived the message pattern (W_m). With the alpha value set to 1 and W_m, I calculated the added pattern (W_a) and embedded W_a into the cover work (C_o), which represents all the images in sets 1 and 2, to obtain the watermarked work (C_w). I generated Gaussian and salt-and-pepper noise and added them to the watermarked and non-watermarked (image set 3) images, one noise type at a time. Then, using a blind watermark detector, I calculated the linear correlation between each image and the reference pattern, and plotted the detection values against the number of images at each detection value. From the plots I identified upper and lower decision thresholds for the blind watermark detector for each type of channel noise and calculated the false positive rate for each. Finally, I converted the blind watermark embedder into an informed watermark embedder using an embedding strength parameter of 0.4.

All the algorithms were implemented using C++ and OpenCV.

  • Downloaded the “All images as gzipped tar file” (173MB, md5sum a17d05bd522c52d84eca14327a23d494) and copied the first 4000 images into a folder named 1, the next 4000 images into a folder named 2 and the next 5000 images into a folder named 3.
  • Derived a reference pattern (W_r) based on the Gaussian distribution. All downloaded images were 250×250, so the reference pattern also needs to be 250×250. Created a Mat with 32-bit float values and one channel, generated a uniform distribution with min value 0 and max value 10, and assigned the values to the Mat. Converted the uniform distribution into a normal distribution by subtracting the mean of the distribution from each value and dividing each by the standard deviation. Then converted the 1-channel image to a 3-channel image by copying the channel-1 values to each channel.
  • Using the reference pattern, derived the message pattern (W_m). For a 1-bit message (1 or 0), W_m = W_r when m = 1 and W_m = -W_r when m = 0.
  • Computed the added pattern (W_a) from W_m and an alpha value initially set to 1: W_a = α * W_m. Since α = 1, W_a = W_m.
  • Embedded W_a into the cover work (C_o) to obtain the watermarked work (C_w). The cover work represents all the images in sets 1 and 2. Adding the added patterns for m = 0 and m = 1 to the original image C_o yields the watermarked images C_w_0 and C_w_1.
  • Added noise to C_w to represent external effects on the watermark due to processing or malicious intent. The noise functions are:
    • Gaussian noise - Created a Mat with 32-bit float values and one channel, generated a uniform distribution with min value 0 and max value 5, and assigned the values to the Mat. Converted the uniform distribution into a normal distribution by subtracting the mean from each value and dividing by the standard deviation, then converted the 1-channel image to a 3-channel image by copying the channel-1 values to each channel.
      The Mat noise now represents noise based on the Gaussian distribution.
    • Salt-and-pepper noise - Created a Mat with 32-bit float values and three channels filled with zeros. Selected a random pixel from the empty Mat; the selected pixel's B, G and R values were each randomly set to 0 or 255. The number of noise pixels is given, and that many pixel values are changed by this algorithm.
      The spNoise Mat now contains salt-and-pepper noise. In a salt-and-pepper disturbance the value of a pixel may change to its highest value, 255, or its lowest value, 0. For a multi-channel pixel, the noise may affect each channel equally or differently; in the code I assumed that the noise disturbs the BGR values of a pixel at different levels, so for a particular pixel all, or only some, of the BGR values become either 0 or 255.
  • Using a blind watermark detector with linear correlation, scanned the entire set of images for the watermarked message (0 or 1), or “no watermark” when nothing was embedded, for both Gaussian noise and salt-and-pepper noise.
    • With Gaussian noise - Added the calculated Gaussian noise to the watermarked images.
    • Calculated the linear correlation between the watermarked images C_w_0 and C_w_1 and the reference pattern W_r: performed an element-wise multiplication of C_w and W_r using the OpenCV mul() function, averaged the B, G and R values of each pixel, then summed all the average pixel values and divided by the number of pixels.
    • Computed the linear correlations of the watermarked messages (0 and 1) and “no watermark” messages for all 13000 images, divided into 4000 images for m = 0, 4000 for m = 1 and 5000 with no watermark. Obtained the results as three comma-separated strings.
    • Using a Jupyter Notebook and Matplotlib, plotted the results as histograms, connecting the bin heads of each histogram to obtain a clear curve.
    • Upper and lower decision thresholds with Gaussian noise - Calculated the detection values where the plotted curves cross.
      With Gaussian channel noise, the upper decision threshold is 0.4500 and the lower decision threshold is -0.6475.
    • False positive rate with Gaussian noise - Calculated the total number of images detected as watermarked which actually have no watermark, and the images not detected as watermarked which actually do.
    • Wrote a simple Java program to read the text files created by the C++ program and sum the false positives based on the thresholds found above.
      Received counts :
      • Within the range 0 to 0.45, 265 images identified as false positives
      • Within the range -0.6475 to 0, 328 images identified as false positives
      • Within the range 0.45 to +infinity, 267 images identified as false positives
      • Within the range -infinity to -0.6475, 174 images identified as false positives
      Calculations :
      • Total False positives when gaussian noise present = 265+328+267+174 = 1034
      • False positive rate (FPR) when gaussian noise present = 1034/13000 = 0.079
      • As a percentage FPR when gaussian noise present = 7.9%
    • With salt-and-pepper noise - Added the generated salt-and-pepper noise to the watermarked images.
    • Calculated the linear correlation as in the Gaussian noise section above and plotted a graph with the obtained values.
    • Upper and lower decision thresholds with salt-and-pepper noise - Calculated the detection values where the plotted curves cross.
      With salt-and-pepper channel noise, the upper decision threshold is 0.4000 and the lower decision threshold is -0.6450.
    • False positive rate with salt-and-pepper noise - Calculated the total number of images detected as watermarked which actually have no watermark, and the images not detected as watermarked which actually do.
    • Wrote a simple Java program to read the text files created by the C++ program and sum the false positives based on the thresholds found above.
      Received counts :
      • Within the range 0 to 0.4, 192 images identified as false positives
      • Within the range -0.645 to 0, 360 images identified as false positives
      • Within the range 0.4 to +infinity, 347 images identified as false positives
      • Within the range -infinity to -0.645, 176 images identified as false positives
      Calculations :
      • Total False positives when Salt and pepper noise present = 192+360+347+176 = 1075
      • False positive rate (FPR) when Salt and pepper noise present = 1075/13000 = 0.082
      • As a percentage FPR when Salt and pepper noise present = 8.2%
    • From the above results we can conclude that Gaussian noise over a channel affects the transmitted media less than salt-and-pepper noise: the gap between the upper and lower decision thresholds with Gaussian noise is larger than the gap with salt-and-pepper noise, and the false positive rate is lower when Gaussian noise occurs in the channel than when salt-and-pepper noise does.
  • Converted the blind watermark embedder written above into an informed watermark embedder, using 0.4 as the embedding strength parameter. As in the activities above, generated a Gaussian-type reference pattern, embedded the single-bit (0 and 1) watermark into two sets of 4000 images, and used those two watermarked image sets plus the 5000-image set to calculate linear correlations. No noise was added and the alpha value was taken as 1. Plotted graphs using the detection values retrieved.
  • Since the embedding strength parameter (β) is given as 0.4, used the two upper and lower decision threshold values as T_c (T_c = 0.6475 for m = 0 and T_c = 0.45 for m = 1) to calculate alpha (α).
    Used the reference pattern and the linear correlation algorithm from the activity above.
    With that function, calculated the alpha values for T_c = 0.6475 and T_c = 0.45; based on those alpha values, obtained the added pattern (W_a) and the watermarked images, then calculated the detection values and plotted them.
  • From the above results we can conclude that when the linear correlation value is above 0.85, the message is 1; when it is below -1.0475, the message is 0; and when it is between -1.0475 and 0.85, the image has no watermark embedded. There are no overlaps between images that carry a 1, a 0 or no watermark; they are clearly separated.
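The informed-embedding step in the last two bullets can be read as solving for α analytically: because linear correlation is linear in its first argument, the α that drives the detection value exactly to T_c + β can be computed in closed form. The sketch below assumes that target (consistent with the observed final thresholds 0.45 + 0.4 = 0.85 and 0.6475 + 0.4 = 1.0475); the pixel data are synthetic stand-ins, not the LFW images.

```python
import random

def linear_correlation(c, wr):
    """Detector statistic: (1/N) * sum_i c[i] * wr[i]."""
    return sum(x * w for x, w in zip(c, wr)) / len(c)

def informed_alpha(c0, wm, wr, tc, beta):
    """Solve lc(c0 + alpha*wm, wr) = tc + beta for alpha, using the
    linearity of the correlation detector:
        lc(c0 + a*wm, wr) = lc(c0, wr) + a * lc(wm, wr)
    """
    target = tc + beta
    return (target - linear_correlation(c0, wr)) / linear_correlation(wm, wr)

random.seed(3)
N = 1000
wr = [random.gauss(0, 1) for _ in range(N)]     # reference pattern
c0 = [random.gauss(128, 40) for _ in range(N)]  # synthetic cover work
wm = wr                                         # message m = 1

a = informed_alpha(c0, wm, wr, tc=0.45, beta=0.4)
cw = [c + a * w for c, w in zip(c0, wm)]
# The detection value now lands exactly on Tc + beta = 0.85.
```

This per-image α is why informed embedding separates the three detection clusters with no overlap, as observed above.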


Sunday, August 2, 2020

Association Rule mining with Weka



Data set selection

I selected the Adult data set from the UCI data set collection (https://archive.ics.uci.edu/ml/datasets/Adult). The donors of this data set are Ronny Kohavi and Barry Becker from Data Mining and Visualization, Silicon Graphics. It is a multivariate data set with 32561 instances and 15 attributes. The purpose of selecting this data set is to identify and understand the factors affecting a person's income.

 

Data set attribute information

age: continuous
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt: continuous
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education-num: continuous
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex: Female, Male
capital-gain: continuous
capital-loss: continuous
hours-per-week: continuous
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands

Data Pre-processing

1.     Remove missing values

Before applying any rules to the data set, we have to pre-process it. The first step is checking for missing values. In my data set I found a few missing values for some attributes such as workclass, occupation and native-country. I opened the data set in Microsoft Excel and used Excel's filter functions to remove rows with missing values (in this case, cells with the value “?”). After filtering, 2399 instances were removed, leaving 30162 instances for processing.
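The Excel filtering step can equally be scripted. The sketch below assumes each row is a list of string fields and that missing fields are marked with "?" (in the raw Adult file the marker can carry surrounding whitespace, so stripping fields first may be needed):

```python
def drop_missing(rows, missing="?"):
    """Keep only rows where no field equals the missing-value marker."""
    return [r for r in rows if missing not in r]

rows = [
    ["39", "State-gov", "Bachelors"],
    ["52", "?", "HS-grad"],          # missing workclass -> dropped
    ["31", "Private", "Masters"],
]
clean = drop_missing(rows)  # 2 rows remain
```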

2.     Remove un-necessary attributes

I closely examined the data set and identified attributes which will not be used for processing; removing them reduces the complexity of the data set and makes it easier to apply rules. I identified 4 attributes (fnlwgt, education-num, capital-gain and capital-loss) as less useful for processing and removed them from the data set, leaving 11 attributes.

3.     Discretization

For association rule mining, all numeric attributes must be converted to nominal (categorical) attributes. I used the Discretize filter in Weka to convert the numeric attributes “age” and “hours-per-week” to categorical data. Three bins were created for each of the “age” and “hours-per-week” attributes, and Weka automatically assigned each value to the relevant bin. Now all 11 attributes are nominal and the data set is ready for applying rules.

Weka created 3 bins for “age” as {'\'(-inf-41.333333]\'','\'(41.333333-65.666667]\'','\'(65.666667-inf)\''} and 3 bins for “hours-per-week” as {'\'(-inf-33.666667]\'','\'(33.666667-66.333333]\'','\'(66.333333-inf)\''}. To increase the readability of the data set and the readability of the results after applying the association rules to the dataset, replaced the labels of the “age” with {'0_41','42_65','66_MAX'} and replaced the labels of “hours-per-week” with {'0_33','34_66','67_MAX'}.
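Weka's unsupervised Discretize filter uses equal-width binning, which reproduces the cut points above if the observed minima and maxima are 17/90 for age and 1/99 for hours-per-week (an assumption about this particular cleaned file):

```python
def equal_width_bins(lo, hi, k=3):
    """Cut points for k equal-width bins over [lo, hi], as computed by
    an unsupervised equal-width discretizer."""
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

# Assumed observed ranges: age 17..90, hours-per-week 1..99.
age_cuts = equal_width_bins(17, 90)   # ~[41.333, 65.667]
hours_cuts = equal_width_bins(1, 99)  # ~[33.667, 66.333]

def label(value, cuts, labels):
    """Map a value to its renamed bin label."""
    for cut, lab in zip(cuts, labels):
        if value <= cut:
            return lab
    return labels[-1]

print(label(35, age_cuts, ["0_41", "42_65", "66_MAX"]))  # 0_41
```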

I selected income (which has the values >50K and <=50K) as the class variable.

 

Applying Association rules (Apriori Algorithm)

In Weka's Associate tab I selected the Apriori algorithm. In the Apriori configuration I set the algorithm to mine class association rules and changed the number of rules to 40; the other settings were left at their defaults.
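Apriori's two key metrics can be checked by hand from the rule counts Weka prints. For a rule reported as "A <n_A> ==> B <n_AB>", support is n_AB over the total instance count and confidence is n_AB / n_A. Using the counts from the first rule of the run output (7358 antecedent matches, 7125 of them also matching income <=50K, out of 30162 instances):

```python
def support_confidence(n_antecedent, n_rule, n_total):
    """Support = P(antecedent and consequent); confidence = P(consequent | antecedent)."""
    return n_rule / n_total, n_rule / n_antecedent

sup, conf = support_confidence(7358, 7125, 30162)
# Support (~0.236) clears the 0.15 minimum; confidence rounds to the
# reported 0.97.
```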

 

Results

=== Run information ===

 

Scheme:       weka.associations.Apriori -N 40 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -A -c -1

Relation:     dataset-weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R1-precision6-weka.filters.unsupervised.attribute.Discretize-B3-M-1.0-R9-precision6

Instances:    30162

Attributes:   11

              age

              workclass

              education

              marital-status

              occupation

              relationship

              race

              sex

              hours-per-week

              native-country

              income

=== Associator model (full training set) ===

 

 

Apriori

=======

 

Minimum support: 0.15 (4524 instances)

Minimum metric <confidence>: 0.9

Number of cycles performed: 17

 

Generated sets of large itemsets:

 

Size of set of large itemsets L(1): 21

Size of set of large itemsets L(2): 70

Size of set of large itemsets L(3): 104

Size of set of large itemsets L(4): 61

Size of set of large itemsets L(5): 19

Size of set of large itemsets L(6): 3

 

Best rules found:

 

 1. age=0_41 workclass= Private marital-status= Never-married 7358 ==> income= <=50K 7125    conf:(0.97)

 2. age=0_41 workclass= Private marital-status= Never-married native-country= United-States 6674 ==> income= <=50K 6457    conf:(0.97)

 3. age=0_41 workclass= Private marital-status= Never-married race= White 6149 ==> income= <=50K 5944    conf:(0.97)

 4. age=0_41 workclass= Private marital-status= Never-married race= White native-country= United-States 5703 ==> income= <=50K 5505    conf:(0.97)

 5. age=0_41 marital-status= Never-married 8733 ==> income= <=50K 8414    conf:(0.96)

 6. age=0_41 marital-status= Never-married native-country= United-States 7963 ==> income= <=50K 7666    conf:(0.96)

 7. age=0_41 marital-status= Never-married race= White 7230 ==> income= <=50K 6954    conf:(0.96)

 8. age=0_41 marital-status= Never-married race= White native-country= United-States 6737 ==> income= <=50K 6473    conf:(0.96)

 9. workclass= Private marital-status= Never-married 8025 ==> income= <=50K 7706    conf:(0.96)

10. workclass= Private marital-status= Never-married native-country= United-States 7270 ==> income= <=50K 6972    conf:(0.96)

11. workclass= Private marital-status= Never-married race= White 6688 ==> income= <=50K 6403    conf:(0.96)

12. age=0_41 workclass= Private marital-status= Never-married hours-per-week=34_66 5136 ==> income= <=50K 4915    conf:(0.96)

13. workclass= Private marital-status= Never-married race= White native-country= United-States 6194 ==> income= <=50K 5920    conf:(0.96)

14. age=0_41 marital-status= Never-married sex= Male 4904 ==> income= <=50K 4684    conf:(0.96)

15. marital-status= Never-married 9726 ==> income= <=50K 9256    conf:(0.95)

16. age=0_41 marital-status= Never-married hours-per-week=34_66 6165 ==> income= <=50K 5861    conf:(0.95)

17. marital-status= Never-married native-country= United-States 8876 ==> income= <=50K 8435    conf:(0.95)

18. age=0_41 marital-status= Never-married hours-per-week=34_66 native-country= United-States 5571 ==> income= <=50K 5289    conf:(0.95)

19. marital-status= Never-married race= White 8036 ==> income= <=50K 7622    conf:(0.95)

20. age=0_41 marital-status= Never-married race= White hours-per-week=34_66 5065 ==> income= <=50K 4803    conf:(0.95)

21. marital-status= Never-married race= White native-country= United-States 7489 ==> income= <=50K 7092    conf:(0.95)

22. workclass= Private marital-status= Never-married hours-per-week=34_66 5693 ==> income= <=50K 5390    conf:(0.95)

23. workclass= Private marital-status= Never-married hours-per-week=34_66 native-country= United-States 5103 ==> income= <=50K 4820    conf:(0.94)

24. marital-status= Never-married sex= Male 5414 ==> income= <=50K 5107    conf:(0.94)

25. marital-status= Never-married sex= Male native-country= United-States 4900 ==> income= <=50K 4612    conf:(0.94)

26. marital-status= Never-married hours-per-week=34_66 6994 ==> income= <=50K 6552    conf:(0.94)

27. marital-status= Never-married hours-per-week=34_66 native-country= United-States 6330 ==> income= <=50K 5916    conf:(0.93)

28. marital-status= Never-married race= White hours-per-week=34_66 5727 ==> income= <=50K 5339    conf:(0.93)

29. marital-status= Never-married race= White hours-per-week=34_66 native-country= United-States 5296 ==> income= <=50K 4925    conf:(0.93)

30. age=0_41 workclass= Private sex= Female 5279 ==> income= <=50K 4859    conf:(0.92)

31. age=0_41 relationship= Not-in-family 5038 ==> income= <=50K 4622    conf:(0.92)

32. age=0_41 sex= Female native-country= United-States 5853 ==> income= <=50K 5325    conf:(0.91)

33. age=0_41 sex= Female 6382 ==> income= <=50K 5805    conf:(0.91)

34. workclass= Private relationship= Not-in-family 5899 ==> income= <=50K 5343    conf:(0.91)

35. workclass= Private sex= Female 7642 ==> income= <=50K 6921    conf:(0.91)

36. age=0_41 workclass= Private education= HS-grad 5076 ==> income= <=50K 4591    conf:(0.9)

37. workclass= Private sex= Female native-country= United-States 6926 ==> income= <=50K 6264    conf:(0.9)

38. workclass= Private relationship= Not-in-family native-country= United-States 5397 ==> income= <=50K 4872    conf:(0.9)

39. workclass= Private relationship= Not-in-family race= White 5130 ==> income= <=50K 4627    conf:(0.9)

40. age=0_41 race= White sex= Female 5156 ==> income= <=50K 4649    conf:(0.9)

 

 

Interesting rules

·        age=0_41 workclass= Private marital-status= Never-married native-country= United-States 6674 ==> income= <=50K 6457    conf:(0.97)

·        age=0_41 marital-status= Never-married sex= Male 4904 ==> income= <=50K 4684    conf:(0.96)

·        age=0_41 workclass= Private education= HS-grad 5076 ==> income= <=50K 4591    conf:(0.9)

·        age=0_41 workclass= Private sex= Female 5279 ==> income= <=50K 4859    conf:(0.92)

 

 

Rule Evaluation

By analyzing the selected rules above, we can see that younger people (age bucket 0_41) tend to earn $50K or less, and that young people educated up to high-school level who work in the private sector also tend to earn $50K or less. The fourth selected rule states, with 92% confidence, that young females who work in the private sector earn an income of $50K or less.
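As a sanity check, the confidence values Weka reports can be re-derived by hand: confidence is simply the number of instances matching both sides of a rule divided by the number matching the antecedent. The short Python sketch below reproduces the figures for the four selected rules; the counts are copied directly from the Weka output above, and the abbreviated rule labels are just informal shorthand.

```python
# Confidence = count(antecedent AND consequent) / count(antecedent).
# The (antecedent, both) counts are copied from the Weka Apriori output above.
selected = {
    "rule 2  (young, Private, Never-married, US)": (6674, 6457),
    "rule 14 (young, Never-married, Male)":        (4904, 4684),
    "rule 36 (young, Private, HS-grad)":           (5076, 4591),
    "rule 30 (young, Private, Female)":            (5279, 4859),
}

for name, (antecedent, both) in selected.items():
    print(f"{name}: conf = {both / antecedent:.2f}")
# rule 2  -> 0.97, rule 14 -> 0.96, rule 36 -> 0.90, rule 30 -> 0.92
```

The results match Weka's printed confidences to two decimal places, confirming the rule counts are internally consistent.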

 

 

Use of rules

 

Using these rules, we can identify that young employees need capacity-building programs to enhance their work experience and earn higher incomes early on. The rules also suggest that female employees should be encouraged and supported to achieve successful careers.
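For readers without Weka, the level-wise Apriori procedure used above can be sketched in plain Python: generate frequent itemsets one size at a time, then emit high-confidence rules with a single-item consequent, as Weka prints them. The toy dataset below is a hypothetical miniature stand-in for the census data, not the real records, and this sketch omits Apriori's candidate-pruning optimizations.

```python
from itertools import combinations


def apriori_rules(transactions, min_support, min_conf):
    """Minimal Apriori sketch: level-wise frequent itemset search,
    then rules of the form antecedent ==> {single consequent}."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)

    def support(itemset):
        return sum(1 for t in sets if itemset <= t) / n

    # Level 1: frequent single items.
    items = {i for t in sets for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    all_frequent = set(frequent)

    # Join frequent k-itemsets into (k+1)-candidates, keep the frequent ones.
    while frequent:
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == len(a) + 1}
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent |= frequent

    # Emit rules with a single-item consequent, as in the Weka output.
    rules = []
    for itemset in all_frequent:
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            conf = support(itemset) / support(antecedent)
            if conf >= min_conf:
                rules.append((antecedent, consequent, conf))
    return rules


# Hypothetical mini-census: each row is one person's attribute values.
data = [
    ["young", "private", "never-married", "<=50K"],
    ["young", "private", "never-married", "<=50K"],
    ["young", "gov", "never-married", "<=50K"],
    ["old", "private", "married", ">50K"],
    ["old", "private", "married", ">50K"],
]

for ante, cons, conf in apriori_rules(data, min_support=0.4, min_conf=0.9):
    print(set(ante), "==>", cons, f"conf={conf:.2f}")
```

On the toy data this finds, among others, {never-married} ==> <=50K with confidence 1.0, the same kind of marital-status rule that dominates the Weka output above.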

