Handling Noisy Data Using Attribute Selection and Smart Tokens


Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality. Many errors are introduced while integrating multiple data warehouses, or while loading a single data warehouse, through data entry problems, and one of the main errors in a data warehouse is noisy data. Real-world data is noisy and often suffers from corruption or incomplete values that can impact the models built from it. Noisy data results from the misuse of abbreviations, data entry mistakes, duplicate records and spelling errors [16]. The proposed algorithm handles noisy data efficiently by expanding abbreviations, removing unimportant characters and eliminating duplicates. An attribute selection algorithm is applied before token formation; together, the attribute selection and token formation algorithms reduce the complexity of the data cleaning process and clean the data flexibly and effortlessly. This paper uses smart tokens to increase the speed of the mining process and improve the quality of the data.

The term data mining refers to the step in the knowledge discovery process in which special algorithms are employed in the hope of identifying interesting patterns in the data. These patterns are then analyzed to yield knowledge. Data cleaning is the process of identifying and resolving expected problems when integrating data from different sources or from a single source. Many errors are introduced while integrating data warehouses, or while loading a single data warehouse, through data entry problems, and one of the main errors in a data warehouse is noisy data. Noise is random error or variance in a measured variable. Noisy attribute values may be due to:

Faulty data collection instruments.

Data entry problems.

Data transmission problems.

Technology limitations.

Inconsistencies in naming conventions.

Attribute selection as a preprocessing step to learning generally involves a combination of search and attribute utility estimation. When the evaluation of the selected features with respect to learning algorithms is also considered, this leads to a large number of possible permutations. Attribute selection is very important for reducing the time taken by the data cleaning process: an attribute selection algorithm is effective in reducing the number of attributes, removing irrelevant attributes, increasing the speed of the data cleaning process, and improving the clarity of the result.

CLASSIFICATION OF ATTRIBUTES

The attributes are classified as unpredictive, predictive-and-predictable, and predictive-but-unpredictable. Unpredictive attributes are futile for, or irrelevant to, predicting the class and can be discarded by feature selection methods prior to learning. Predictive attributes are useful for predicting the class; among these, predictable attributes can be predicted from the class and the other attributes, while unpredictable ones cannot.

2. STEPS TO CLEAN THE DATA WAREHOUSE

a) The first step is to scrub dirty data fields. This step attempts to remove typographical errors and abbreviations in the data.

b) The second step is to sort tokens in data fields. Characters in a string can be grouped into meaningful pieces, called tokens, which are then sorted.

c) The third step is to sort records. This step sorts the records based on the token value.

d) The fourth step is to compare records. A window of fixed size is moved through the sorted records to limit the comparisons made when matching records (a sketch follows this list).

e) The final step is to merge matching records. Matching records are treated as partial sources of information and merged to obtain a record with more complete information.
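The window comparison of step d) can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: it assumes records are dictionaries carrying an id and an already-formed token, and uses exact token equality as the match test.

    # Sketch of the fixed-size window comparison over token-sorted records.
    def window_matches(records, window_size=3):
        """Return candidate duplicate pairs, comparing each record only with
        its neighbours inside a sliding window over the token-sorted list."""
        records = sorted(records, key=lambda r: r["token"])   # step c): sort records
        pairs = []
        for i, rec in enumerate(records):
            # step d): compare only with the next window_size - 1 records
            for other in records[i + 1 : i + window_size]:
                if rec["token"] == other["token"]:            # assumed match test
                    pairs.append((rec["id"], other["id"]))
        return pairs

    print(window_matches([
        {"id": 1, "token": "410 EAD"},
        {"id": 2, "token": "7464 SKSH"},
        {"id": 3, "token": "7464 SKSH"},
    ]))   # [(2, 3)]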

3. ATTRIBUTE SELECTION CRITERIA

Attribute selection is a process that chooses the best attributes according to a certain criterion. The attribute selection algorithm is used to increase the speed and improve the accuracy of the data cleaning process by removing redundant or irrelevant attributes from the data warehouse [2], [3]. Three criteria are used to identify relevant attributes for the subsequent data cleaning process.

i) Identifying key attributes

The key is an attribute, or a set of attributes, that uniquely identifies a specific instance of the table. Every table in the data model must have a primary key whose values uniquely identify instances of the entity. The key may be a primary key, candidate key, foreign key or composite key; a rough sketch of candidate-key detection is given below.
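As a rough illustration, an attribute (or attribute combination) can be flagged as a candidate key when its values are unique across all rows; this small sketch assumes rows arrive as Python dictionaries.

    def is_candidate_key(rows, attributes):
        """Return True when the attribute combination uniquely identifies
        every row, i.e. it is a candidate key."""
        seen = {tuple(row[a] for a in attributes) for row in rows}
        return len(seen) == len(rows)

    rows = [{"id": 1, "name": "CC"}, {"id": 2, "name": "CC"}]
    print(is_candidate_key(rows, ["id"]))    # True: every id is unique
    print(is_candidate_key(rows, ["name"]))  # False: name repeats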

ii) Classifying distinct and missing values

Missing character values are treated identically whether they are expressed as one blank or as several blanks. Distinct is used to retrieve the number of rows that have unique values for each attribute. The accuracy of the result will be poor for attributes with a low distinct count and a high missing count.

Identification power of attribute j:

ip_j = (number of distinct equivalence classes) / (total number of records)

The distinct value is used to calculate the identification power of an attribute. The identification power ip_j evaluates the discriminating power of record attributes; a minimal sketch is given below.
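In this sketch, treating None and blank strings of any length as missing is an assumption consistent with the note on blanks above.

    def identification_power(values):
        """ip_j = number of distinct non-missing values / total number of records.
        None and all-blank strings are treated as missing (one class)."""
        total = len(values)
        non_missing = {v.strip() for v in values if v is not None and v.strip()}
        return len(non_missing) / total if total else 0.0

    print(identification_power(["CC", "CM", "GJ", "AM", "PR"]))   # 1.0: fully distinct
    print(identification_power(["CC", "CC", None, "   ", "CC"]))  # 0.2: one distinct value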

iii) Classifying types of attributes

There are four types of attributes: nominal, ordinal, interval and ratio. Different criteria are given for each attribute type, and the measurement type of the values is also considered during attribute selection. Data cleaning on numeric data is not effective; categorical data suits the data cleaning process better.

4. ATTRIBUTE SELECTION ALGORITHM AND ANALYSIS

An attribute selection algorithm works according to the specified constraints to select the attributes for the data cleaning process.

Attribute selection algorithm

Input: N attributes, number of tuples n, relation instance r

Initialize: L - temporary relation instance

Output: S - selected attributes with high threshold values σ

begin

For each attribute x_i, i ∈ {1, 2, …, N}

i) Select the key attributes and put them into L.

ii) Select attributes that collectively form a key (super key) and put them into L.

iii) Calculate the threshold σ from (σ: D / M / MT / S):

a. Distinct (D) value of attribute x_i: the number of tuples t_j, j = 1 … n, holding a distinct value.

b. Missing (M) value of attribute x_i: the number of tuples t_j, j = 1 … n, with t_j = NULL.

c. Measurement type (MT) of attribute x_i (nominal, ordinal, interval, or ratio).

d. Size (S) of attribute x_i.

Put the results into L.

iv) Select the attributes with a high threshold value σ_i and put them into L.

v) Ignore attributes with a low threshold value.

end

The attribute selection algorithm first selects the relation schema R comprising N attributes. It then chooses the relation instance (table) r of the relation schema R and selects the attributes A_i (A_1 … A_N) of R. The algorithm builds a temporary relation L holding, for each attribute, its name, type, size, missing value, distinct value, measurement type and threshold value. For each attribute, it reads the tuples (records) from the selected relation instance r, counts the missing values of attribute A_i and calculates the percentage; these missing-value percentages are stored in the temporary relation instance L. It then counts the distinct values of attribute A_i and calculates the percentage; these distinct-value percentages are likewise stored in L.

Finally, it determines the measurement type of each attribute A_i and puts it in the temporary relation instance L. The threshold values are calculated for every target attribute A_i based on its missing values, distinct values and measurement type, and stored per attribute in L. The attributes S are then selected from the temporary relation L, based on the threshold values, for the next step of the data cleaning process; a compact sketch of this pass is given below.
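A compact sketch of this selection pass, assuming the relation instance r is a pandas DataFrame. The exact scoring formula combining distinct values, missing values and measurement type is an illustrative assumption, since the paper does not fix one.

    import pandas as pd

    def select_attributes(r: pd.DataFrame, keys=(), sigma_min=0.5):
        """Build the temporary relation L (missing %, distinct %, threshold σ
        per attribute) and keep key attributes plus attributes with σ >= sigma_min."""
        n = len(r)
        L = []
        for col in r.columns:
            missing = r[col].isna().mean()           # fraction of missing values
            distinct = r[col].nunique() / n          # identification power ip_j
            categorical = r[col].dtype == object     # categorical data cleans better
            sigma = distinct * (1 - missing) * (1.0 if categorical else 0.5)
            L.append({"attribute": col, "missing": missing,
                      "distinct": distinct, "sigma": sigma})
        S = [row["attribute"] for row in L
             if row["attribute"] in keys or row["sigma"] >= sigma_min]
        return S, pd.DataFrame(L)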

5. ALGORITHM FOR TOKEN FORMATION

A token is formed for each selected attribute field having the highest rank. The following steps must be taken to obtain the best token key before the token is formed.

The steps are:

i) Remove unimportant tokens

The first step in token formation is removing unimportant characters, in order to obtain a smart, or best, token for the subsequent data cleaning process. Unimportant tokens include special characters, shortcut or ordinal forms, common or stop words, and title or salutation tokens. The common unimportant tokens are listed below; a small sketch of this step follows the list.

UNIMPORTANT CHARACTERS

a. Special characters: ` , ' " < > - % + _ ( ) . * $ # ! [ ] ^ @ : ; = ? | { } ~ etc.

b. Title or salutation tokens: 'Rev', 'Dr', 'Mr', 'Miss', 'Master', 'Madam', 'Sir', 'Chief', 'Ms', 'Mister', 'Shri', 'Drs', 'Dres', 'Dr.', 'Mistress', 'Sis', 'Sri', 'Dear', 'Judge', 'Justice', 'Sister'

c. Ordinal forms: 'st', 'nd', 'rd', and 'th'

d. Common abbreviations: 'Pvt', 'Ltd', 'Co', 'Rd', 'St', 'Ave', 'Blk', 'Apt', 'Univ', 'Sch', 'Corp', etc.

e. Common words: 'and', 'the', 'of', 'it', 'as', 'may', 'than', 'an', 'a', 'off', 'to', 'be', 'or', 'not', 'I', 'about', 'are', 'at', 'by', 'bom', 'de', 'en', 'for', 'from', 'how', 'in', 'is', 'la', 'on', 'that', 'this', 'was', 'what', 'when', 'where', 'who', 'will', 'with', etc.
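A minimal sketch of this step, using abbreviated stop-word and salutation lists drawn from the table above:

    import re

    STOP_WORDS = {"and", "the", "of", "it", "as", "to", "in", "on", "for"}   # abridged
    SALUTATIONS = {"mr", "mrs", "ms", "dr", "rev", "sir", "madam", "miss"}   # abridged
    ORDINAL_SUFFIXES = ("st", "nd", "rd", "th")

    def remove_unimportant(text):
        """Strip special characters, salutations, ordinal suffixes, and
        common stop words before token formation."""
        text = re.sub(r"[^\w\s]", " ", text)          # drop special characters
        kept = []
        for word in text.split():
            lower = word.lower()
            if lower in STOP_WORDS or lower in SALUTATIONS:
                continue
            # turn ordinal forms such as 3rd into the bare number 3
            if lower.endswith(ORDINAL_SUFFIXES) and lower[:-2].isdigit():
                word = word[:-2]
            kept.append(word)
        return " ".join(kept)

    print(remove_unimportant("Dr. John Smith, 3rd Ave."))   # John Smith 3 Ave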

ii) Expand abbreviations using Reference table

The use of abbreviations causes problems in token formation, so expanding abbreviations is an important part of it. Some common abbreviations are listed in the table below. These abbreviations are stored in a log or reference table, which the token formation step consults; a sketch of the look-up follows the table.

Reference Table with sample data

S. No.   Shortcut   Full form
1        a/c        Account
2        advt       Advertisement
3        Apr.       April
4        Ave        Avenue
5        Co.        Company
6        Dept.      Department
7        Dep.       Departure
8        Est.       Established
9        Gov        Government
10       H.O        Head Office
11       Pvt        Private
12       Ltd        Limited
13       Rd         Road
14       Blk        Block
15       Apt        Apartment
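A minimal sketch of the look-up, holding the reference table above as an in-memory dictionary keyed on the lower-cased shortcut:

    # Reference table held as a dictionary (shortcuts lower-cased, dots stripped).
    REFERENCE = {
        "a/c": "account", "advt": "advertisement", "apr": "April",
        "ave": "Avenue", "co": "Company", "dept": "Department",
        "dep": "Departure", "est": "Established", "gov": "Government",
        "h.o": "Head Office", "pvt": "Private", "ltd": "Limited",
        "rd": "Road", "blk": "Block", "apt": "Apartment",
    }

    def expand_abbreviations(text):
        """Replace every word found in the reference table by its full form."""
        expanded = []
        for word in text.split():
            key = word.lower().rstrip(".,")
            expanded.append(REFERENCE.get(key, word))
        return " ".join(expanded)

    print(expand_abbreviations("Blk 4, Scott Rd"))   # Block 4, Scott Road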

iii) Formation of Tokens

This step uses the selected attribute field values to form a token. Tokens can be created for a single attribute field value or for combined attributes. For example, when the contact name attribute is selected for further cleaning, it is split into first name, middle name and last name, and the first name and last name are combined to form the token. Tokens are formed from numeric, alphanumeric and alphabetic values: the field values are split and unimportant elements are removed. Numeric tokens comprise only digits [0-9]. Alphabetic tokens consist of letters (a-z, A-Z); the first character of each word in the field is taken and the characters are sorted. Alphanumeric tokens comprise both numeric and alphabetic parts; a given alphanumeric element is decomposed into its numeric and alphabetic components. This step eliminates the need to compare entire string records over multiple passes for duplicate identification, and it solves the similarity computation problem in a large database by forming a token key from a few selected fields [8], [9].

ALGORITHM FOR TOKENS

Input: Tables with dirty data, Reference table, Selected Attributes

Output: LOG table with tokens

begin

For attribute i = 1 to last attribute m

For row j = 1 to last row n

i) Remove special characters.

ii) Remove shortcut forms and ordinal forms.

iii) Remove common or stop words.

iv) Remove title or salutation tokens.

v) Remove remaining unimportant characters.

vi) Expand abbreviations using the reference table.

vii) If row(j) is numeric then

a. convert the string into a number

b. sort the digits in order

c. form a token, then put it into the LOG table

viii) If row(j) is alphanumeric then

a. separate the numeric and alphanumeric elements

b. split each alphanumeric element into its numeric and alphabetic parts

c. sort the numeric and alphabetic parts separately

d. form a token, then put it into the LOG table

ix) If row(j) is alphabetic then

a. select the first character of each word

b. sort these letters in a specific order and string them together

c. if only one word is present, take its first three characters as the token and sort them

d. form a token, then put it into the LOG table

end
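A runnable sketch of the three token rules follows. Note that the sample address table below keeps the first letters in their original word order (e.g. 7464 SKSH), while step ix) sorts them, so letter sorting is left optional here; everything else is a straightforward reading of the algorithm.

    import re

    def form_token(value, sort_letters=False):
        """Form a token under the numeric, alphanumeric and alphabetic rules.
        sort_letters=False matches the sample address table below."""
        words = re.sub(r"[^\w\s]", " ", value).split()
        if not words:
            return ""
        digits = [w for w in words if w.isdigit()]
        alphas = [w for w in words if w.isalpha()]
        if digits and not alphas:                       # rule vii): numeric
            return " ".join(sorted(digits, key=int))
        if alphas and not digits:                       # rule ix): alphabetic
            if len(alphas) == 1:                        # one word: first three chars, sorted
                return "".join(sorted(alphas[0][:3].upper()))
            letters = [w[0].upper() for w in alphas]
            return "".join(sorted(letters) if sort_letters else letters)
        # rule viii): alphanumeric - handle the parts separately, then combine
        numeric_part = " ".join(sorted(digits, key=int))
        alpha_part = form_token(" ".join(alphas), sort_letters)
        return (numeric_part + " " + alpha_part).strip()

    print(form_token("7464 South Kingsway, Sterling Heights"))   # 7464 SKSH
    print(form_token("410 Eighth Avenue, DeKalb"))               # 410 EAD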

The table below shows token keys produced for the address field using the alphanumeric token rule: the alphanumeric value is first split into its numeric and alphabetic parts, the alphabetic rule is applied, and the parts are then combined into the token key.

Formation of Tokens for the address field

Customer ID   Address                                  Token key
1             7464 South Kingsway, Sterling Heights    7464 SKSH
2             410 Eighth Avenue, DeKalb                410 EAD
3             7429 Arbutus Boulevard, Blacklick        7429 ABB
4             8287 Scott Road, Huntsville              8287 SRH

iv) Maintaining LOG Table

The proposed token formation algorithm forms a token for each selected attribute. The formed tokens are stored in the LOG table, a temporary table that holds the tokens of the selected attribute field values. The comparison of records to find duplicates takes place on the LOG table. A sample LOG table with smart tokens is shown below:

LOG Table with Smart Tokens

Customer ID   Contact name key   Customer name key   Address key   Postal key
1             CC                 CC                  7464 SKSH     48358
2             CM                 P                   410 EAD       60148
3             GJ                 AABH                7429 ABB      43005
4             AM                 PC                  8287 SRH      35818
5             PR                 ISW                 480 GWSD      9215

6. CONCLUSION

This article has described the attribute selection algorithm and the token formation algorithm. The attribute selection algorithm selects attributes before the data cleaning process; the token formation algorithm then forms smart tokens for data cleaning and is suitable for numeric, alphanumeric and alphabetic data, with a separate rule for each of the three data types. Token-based data cleaning removes noisy data efficiently, and both the attribute selection and the token-based approach reduce the time taken, since comparing entire strings takes more time than comparing tokens. The formed tokens are stored in the LOG table and will be used as the blocking key in the subsequent data cleaning process, so forming the best, smartest token is very important. Future work should consider applying this token-based cleaning technique to similarity functions as well as to blocking methods. The approach can also be applied in sequential data cleaning to improve the quality of the data and increase the speed of the data cleaning process.
