The World Wide Web is an enormous repository of web pages and links, and a massive amount of data is generated on it daily. Much of this information is recorded in log files, which contain records of client-server interactions such as the IP address, timestamp, and number of bytes transferred. This paper focuses on three parts: understanding the format of the access log file, the pre-processing phase, and identifying distinct users. Identifying distinct users from the log is a challenging task. The paper identifies distinct users based on parameters such as user ID, session, referrer, and user agent (browser). It also classifies the data into interested and non-interested users by applying an existing decision tree algorithm. Analysis of the log file reveals user navigation behavior that can be used to personalize the system.

Keywords: Log Data, Preprocessing, Decision Tree, Access Log, Common Log Format

1. Introduction

Log files list the activities that arise from interactions between the client and the server. They are stored on the web server, the computer that responds to requests made by clients.
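
To make the notion of a log record concrete, the following is a minimal sketch of how a single access log entry in Common Log Format might be split into fields such as IP address, timestamp, and bytes transferred. The regular expression, the function name parse_clf_line, and the sample line are illustrative assumptions and are not taken from the data set used in this paper.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
#   host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps look like 10/Oct/2000:13:55:36 -0700 (English month names assumed).
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no bytes were returned in the response body.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Hypothetical sample entry, used only for illustration.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
entry = parse_clf_line(sample)
```

Lines that do not follow the assumed layout yield None, so malformed records can be discarded during pre-processing rather than causing the parse to fail.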

