# Using Algorithms On Web Server Log Computer Science

The first step is to import the web log, which serves as the input to the experiment. The next step is to filter the log to remove irrelevant information. The resulting log is then fed into either the threshold based algorithm or the clustering algorithm for session identification. The output of the experiment is the session details, such as Toff, Ton, Tarr and the number of connections. The experiment was carried out using the following development technologies:

- Language: C#
- Front end: Visual Studio 2010
- Back end: SQL Server 2008
- Microsoft Office Excel 2010

**The Data Sets:**

The experiments were based on web logs. The log file was extracted from the NASA website and can be freely downloaded from http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html. Figure 6.2 shows part of the log file, which was used as input for the experiments.

Fig. 6.2. A text file with logs.

**Data cleaning:**

Data cleaning, which removes irrelevant elements, is an important step in web log analysis. There are three main kinds of irrelevant information to be removed: embedded files, error requests, and requests made with the HEAD method.

Embedded files are files embedded in a web page, usually multimedia files such as graphics and sound that are inserted into the page and referenced by a base file; an image is a typical example. Such files are not explicitly requested by the user: a single request to view a page usually produces several log entries, because embedded files such as graphics and scripts are downloaded along with the page. File requests that the user did not explicitly make are therefore unnecessary, and they can be eliminated by checking the suffix of the URL. Entries with type extensions such as .GIF, .JPEG and .JPG are removed.

The second kind of irrelevant information to be removed is error requests, which can be identified by checking the status code of each request. There are four classes of status code:

- success (200 series),
- redirect (300 series),
- failure (400 series), and
- server error (500 series).

Entries whose status code is not equal to 200 are removed.

The last kind of irrelevant information to be removed is entries made with the 'HEAD' method.
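The three cleaning rules can be sketched as follows. This is a minimal Python illustration (the original experiment was implemented in C#); the dictionary fields stand in for a parsed log line and are assumptions of this sketch, not part of the original code.

```python
# Extensions of embedded files that the user did not explicitly request
# (the text names .GIF, .JPEG and .JPG; the tuple is easily extended).
EMBEDDED_EXTENSIONS = (".gif", ".jpeg", ".jpg")

def clean_log(entries):
    """Keep only entries that survive the three cleaning rules."""
    kept = []
    for e in entries:
        if e["url"].lower().endswith(EMBEDDED_EXTENSIONS):
            continue  # embedded file (graphics, sound) downloaded with the page
        if e["status"] != 200:
            continue  # error or redirect request
        if e["method"] == "HEAD":
            continue  # entries made with the 'HEAD' method
        kept.append(e)
    return kept

log = [
    {"url": "/index.html", "status": 200, "method": "GET"},
    {"url": "/logo.gif",   "status": 200, "method": "GET"},   # embedded file
    {"url": "/miss.html",  "status": 404, "method": "GET"},   # error request
    {"url": "/index.html", "status": 200, "method": "HEAD"},  # HEAD method
]
print([e["url"] for e in clean_log(log)])  # only /index.html survives
```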

**User identification:**

Unique users must be identified. For user identification, each IP address is treated as a single user. To consider only significant users, those with more than 800 TCP connections are selected for the experiment.

**Session identification using clustering techniques:**

To identify web user sessions, two algorithms, hierarchical and partitional, are combined so as to take advantage of both while avoiding the limitations of each. Figure 6.3 shows the steps of the clustering algorithm for identifying web user sessions.

Fig. 6.3. Steps in the clustering algorithm: (1) a partitional clustering algorithm is run to obtain k initial clusters (k being smaller than the total number of items); (2) a hierarchical agglomerative clustering algorithm is run over the k clusters to obtain the actual number of clusters using a quality indicator function; (3) a final partitional clustering algorithm is run over the original data, with k set to the number of clusters obtained from the hierarchical agglomerative step.

First, a partitional algorithm is used to obtain an initial clustering. This is performed by selecting a large number of clusters, which must nevertheless be less than the number of samples. The next step is to apply a hierarchical agglomerative clustering algorithm to the result of the first step; this technique combines clusters in order to obtain a good estimate of the final number of clusters. Finally, to obtain a fine definition of the clusters, a partitional clustering algorithm is used again. A detailed description of each step is given below:

**STEP I: Initial selection of cluster (Partitional clustering algorithm)**

The first step starts with the k-means clustering method with k clusters. k is chosen as large as possible, but it must be less than the number of samples. The distance between any two adjacent samples is then calculated, and the farthest (k-1) couples according to the distance metric determine k intervals. Let T_inf be the inferior bound and T_sup the superior bound of an interval. The centroid of each cluster is then calculated as

centroid = (T_sup + T_inf) / 2

The partitional algorithm is then run iteratively to obtain the k initial clusters. Each cluster C is represented by a small subset R(C) of samples; |R(C)| may be less than or equal to 2, since the metric space is R. The representative samples R(C) may be obtained either from the cluster centroid or by a single-linkage procedure.
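The interval construction above can be sketched as follows (illustrative Python; the original implementation was in C#). Sorted one-dimensional samples are split at the (k-1) largest gaps between adjacent samples, and the midpoint (T_sup + T_inf)/2 of each resulting interval serves as an initial centroid:

```python
def initial_centroids(samples, k):
    """Split sorted 1-D samples at the (k-1) largest gaps between adjacent
    samples; return the midpoint of each interval as an initial centroid."""
    s = sorted(samples)
    # indices of adjacent pairs, ordered by gap size (largest first)
    by_gap = sorted(range(len(s) - 1), key=lambda i: s[i + 1] - s[i], reverse=True)
    cuts = sorted(by_gap[:k - 1])          # the (k-1) farthest couples
    centroids, start = [], 0
    for end in cuts + [len(s) - 1]:
        t_inf, t_sup = s[start], s[end]    # interval bounds
        centroids.append((t_sup + t_inf) / 2)
        start = end + 1
    return centroids

print(initial_centroids([0, 1, 2, 10, 11, 20, 21, 22], k=3))  # [1.0, 10.5, 21.0]
```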

**STEP II: Hierarchical agglomerative clustering algorithm**

The hierarchical agglomerative algorithm is run iteratively, using the representative samples to calculate the distance between two clusters. Since the procedure starts with k initial clusters, the number of steps is k. At each step, the procedure merges the two closest clusters and then recomputes the distances between clusters. This process is repeated and ends after k iterations.

The clustering quality indicator function, which measures the distance between the two closest clusters, is used to select the best clustering among those determined in the iterative process. The quality indicator function is denoted by γ(s). At each step s, the clustering quality is calculated to determine whether the optimal number of clusters has been found. Let Cj(s) denote the jth cluster of step s. At each step, the procedure evaluates the quality indicator function γ(s):

Where

A sharp increase in the value of γ(s) indicates that the merging procedure is artificially merging two clusters which are too far apart. The optimal number of clusters N_C is then determined by:

N_C = N - (argmax_s (γ(s) - γ(s-1)) - 1)
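A small sketch of this step on one-dimensional data, in illustrative Python (the original work used C#). The exact formula for γ(s) is not reproduced in the text above, so this sketch uses the single-linkage merge distance at each step as a stand-in for γ(s); the N_C formula itself follows the equation above.

```python
def agglomerate(points):
    """Single-linkage agglomerative merging of sorted 1-D points.
    Returns gamma(s) for each merge step s, taken here as the distance
    of the pair being merged (a stand-in for the text's gamma)."""
    clusters = [[p] for p in sorted(points)]
    gamma = []
    while len(clusters) > 1:
        # on sorted 1-D data the closest pair is always adjacent
        dists = [clusters[i + 1][0] - clusters[i][-1]
                 for i in range(len(clusters) - 1)]
        i = min(range(len(dists)), key=dists.__getitem__)
        gamma.append(dists[i])
        clusters[i] = clusters[i] + clusters.pop(i + 1)
    return gamma

def optimal_cluster_count(points):
    """N_C = N - (argmax_s (gamma(s) - gamma(s-1)) - 1), steps 1-based."""
    gamma = agglomerate(points)
    N = len(points)
    jumps = [gamma[s] - gamma[s - 1] for s in range(1, len(gamma))]
    s_star = jumps.index(max(jumps)) + 2   # convert list index to step number
    return N - (s_star - 1)

# Three well-separated groups -> the sharp jump in gamma yields N_C = 3.
print(optimal_cluster_count([0.0, 0.1, 0.2, 5.0, 5.1, 10.0, 10.1, 10.2]))
```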

Fig. 6.4. Sample plot of the quality indicator function.

The graph of the quality indicator function is shown in Figure 6.4. It was plotted using the NASA log file available online at http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html. The graph shows that for about 795 steps the aggregation of the two closest clusters is clearly beneficial in terms of clustering quality. Then the aggregation process joins two clusters which are too far apart, forcing a sudden increase in the merging distance from step s-1 to step s, and therefore in γ(s). From this it can be concluded that up to step 794, clusters are merged over small distances. At step 795, a sharp increase in the quality indicator function indicates the merging of two clusters separated by a large distance. At this point the best number of clusters is obtained, i.e. 794 clusters for the given data.

**STEP III: Final clustering creation (Final partitional Clustering algorithm)**

The last step is to find a fine definition of the N_C clusters using a partitional clustering algorithm. With the number of clusters and the cluster representatives from the previous step, the partitional algorithm is run over the original data containing all samples. A fixed number of iterations of the partitional clustering algorithm is then run to obtain a final refinement of the clustering.

**Session identification using threshold based algorithm:**

To identify web user sessions using the threshold based algorithm, a threshold value is first selected; based on this value, the algorithm identifies the sessions. The steps of the threshold based algorithm are as follows:

Choose a threshold value.

Calculate the inter-arrival time between two connections.

If the inter-arrival time is less than the threshold value, the two connections belong to the same session.

If the inter-arrival time is greater than the threshold value, the first connection is the last element of the current session and the second connection is the first element of the next session.
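The steps above can be sketched as follows (illustrative Python; the original implementation was in C#). The arrival times are assumed to be sorted connection timestamps of a single user:

```python
def split_sessions(arrival_times, threshold):
    """Threshold based session identification: consecutive connections
    whose inter-arrival time is below the threshold share a session;
    a larger gap closes the current session and opens the next one."""
    if not arrival_times:
        return []
    sessions = [[arrival_times[0]]]
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        if cur - prev < threshold:
            sessions[-1].append(cur)   # same session
        else:
            sessions.append([cur])     # start of the next session
    return sessions

# Arrival times in seconds, threshold of 80 seconds:
print(split_sessions([0, 10, 30, 200, 210, 500], threshold=80))
# [[0, 10, 30], [200, 210], [500]]
```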

**Chapter - 7**

**PERFORMANCE ANALYSIS**

This chapter describes the correctness properties of the clustering scheme and compares them with those of the threshold-based mechanism. The percentage of misidentified sessions is used as the performance metric. Two types of errors may occur: (i) erroneously splitting one session into two or more clusters, or (ii) merging two or more distinct sessions. The percentage of errors is defined as 100 times the total number of observed errors divided by the total number of connection arrival times.

Fig. 7.1. Percentage of errors with respect to Toff [10].

The above graph was taken from the paper "Web user-session inference by means of clustering techniques", in which the authors conclude that the performance of both the clustering and the optimal threshold based algorithm improves as Toff increases. The clustering graph is almost parallel to the optimal threshold graph. So, in this experiment, if the x-axis is taken to represent the optimal threshold graph, then the graph corresponding to the clustering algorithm should be parallel to the x-axis.

In this work, the sessions obtained by running the optimal threshold are considered the accurate sessions. If the clustering algorithm is accurate, then its graph (i.e. the Toff versus error-percentage graph) must be parallel to the x-axis when the x-axis is taken as the optimal threshold graph.

In the remainder of this chapter, the optimal threshold value is first determined from the web server log, and the threshold based algorithm is run using it. The threshold based algorithm is then run again with different threshold values, and the resulting sessions are compared with those obtained using the optimal threshold in order to count the misidentified sessions. Similarly, the clustering algorithm is run to identify sessions, which are compared with the sessions obtained using the optimal threshold. Using the misidentified sessions as the performance metric, the clustering algorithm and the threshold based algorithm are compared.

**7.1 Determination of optimal threshold value:**

The iteration number of a session is the number of TCP connections in that session. For example, if the iteration number of a session is five, the session has five connections. A sequence of connections is grouped into a session if and only if:

The connections are from the same user ID or IP address

The time interval between two adjacent connections is less than or equal to the session interval in use.

By grouping the sessions with the same iteration number, we can see the distribution of sessions. The distributions show the percentage of sessions with a particular iteration number relative to the total number of sessions. Percentages are used because different session intervals cut the logs into different numbers of sessions, so a percentage comparison is more meaningful.

The experiment and analysis focus on the distributions of sessions with at most 6 iterations, because together they cover a very large percentage of all sessions.

Table 7.1. The results of session interval from the web logs.

| Session interval (sec) | 1 iteration | 2 iterations | 3 iterations | 4 iterations | 5 iterations | 6 iterations |
|---|---|---|---|---|---|---|
| 60 | 80% | 13% | 3% | 1% | 0% | 0% |
| 80 | 72% | 16% | 5% | 1% | 0% | 0% |
| 100 | 71% | 17% | 6% | 2% | 1% | 0% |
| 120 | 68% | 18% | 6% | 2% | 1% | 0% |
| 140 | 65% | 18% | 8% | 2% | 1% | 1% |
| 160 | 64% | 18% | 8% | 3% | 1% | 1% |
| 180 | 62% | 19% | 8% | 3% | 1% | 1% |
| 200 | 61% | 19% | 9% | 3% | 1% | 1% |
| 220 | 59% | 19% | 9% | 4% | 2% | 1% |
| 240 | 57% | 19% | 10% | 4% | 2% | 1% |
| 260 | 56% | 19% | 10% | 4% | 2% | 1% |
| 280 | 55% | 20% | 10% | 5% | 2% | 1% |
| 300 | 54% | 20% | 10% | 5% | 2% | 1% |

In this experiment, 60 to 300 seconds is taken as the range of session intervals; beyond 300 seconds, the graphs of all iterations are almost stable. More importance is given to the graphs with a smaller number of connections per session when looking for the optimal session interval, since a sudden, dramatic change indicates its location. The 1-iteration graph increases gradually as the session interval decreases from 300 seconds to 80 seconds, and below 80 seconds it increases dramatically. Similarly, the other graphs are stable from 300 seconds down to 80 seconds and decrease sharply below 80 seconds.

Fig. 7.2. The result of session interval from the web logs.

The optimal threshold value should be neither too large nor too small. The results of the experiment show that most sessions are not affected when the session interval is larger than 80 seconds, whereas most sessions are affected when it is smaller. As the session interval shrinks, the percentage of sessions with 1 iteration increases dramatically while the percentages of sessions with 3-6 iterations decrease dramatically. This shows that the optimal session interval is about 80 seconds, which is therefore taken as the optimal threshold value in the remainder of this work.
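The per-interval distributions of Table 7.1 can be computed as follows (illustrative Python on a toy log; the original implementation was in C#). A session absorbs a connection while the gap does not exceed the session interval, per the grouping rule above:

```python
from collections import Counter

def iteration_distribution(arrival_times, interval, max_iter=6):
    """Percentage of sessions with 1..max_iter connections ("iterations")
    when the log of one user is cut with the given session interval."""
    session_sizes, current = [], 1
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        if cur - prev <= interval:   # gap within the session interval
            current += 1
        else:                        # gap too large: session boundary
            session_sizes.append(current)
            current = 1
    session_sizes.append(current)
    counts = Counter(session_sizes)
    total = len(session_sizes)
    return [100 * counts[i] / total for i in range(1, max_iter + 1)]

# Toy log cut with an 80-second interval -> one session each of 3, 2 and 1
# connections, i.e. one third of sessions per iteration number 1..3.
print(iteration_distribution([0, 10, 30, 200, 210, 500], interval=80))
```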

**7.2 Parameter sensitivity:**

In the clustering algorithm, an initial value of k (the number of clusters) must be assigned to run the algorithm, but this value is replaced when the final partitional algorithm is run with the number of clusters obtained from the hierarchical agglomerative step. The results should therefore be insensitive to the initial choice of k.

Fig.7.3. Clustering sensitivity to the initial number of clusters k.

Figure 7.3 shows the error probability of the clustering algorithm for initial numbers of clusters k = 600, 700 and 800. The graphs are superimposed, showing that the error probability is independent of the value of k; the initial number of clusters is therefore not a critical parameter.

**7.3 Percentage of misidentified sessions:**

To assess the quality of the results, consider the percentage of sessions misidentified by the clustering procedure. The result is compared with that of the threshold based mechanism.

Figure 7.4 shows the percentage of errors obtained by running the threshold based scheme and the clustering scheme. The performance of the threshold based mechanism is evaluated for threshold values of 20, 1000 and 1800 seconds. In this figure, the clustering graph is almost parallel to the x-axis, which shows that it produces fewer errors than the other threshold settings; the threshold graphs (except the optimal threshold) are irregular with respect to the x-axis.

Fig. 7.4. Percentage of errors obtained by running clustering and threshold based mechanism.

From this, it can be concluded that the threshold based mechanism may not perform well if the threshold value is not set correctly: its error probability is much larger than that of the clustering scheme. When the threshold is too large, the percentage of errors is high because two subsequent but distinct sessions are merged. A small threshold also induces a high percentage of errors, because a session is split whenever Toff falls below the chosen value. In contrast, the percentage of errors for the clustering scheme is always less than 0.5% and is less sensitive to the variation of Toff.

**Chapter - 8**

**CONCLUSION**

Clustering techniques are used for inferring web user-sessions. To identify web user sessions, a combination of hierarchical and partitional clustering techniques is used. First, a k-means partitional algorithm is used to obtain an initial clustering. A hierarchical agglomerative clustering technique is then run on the representative samples obtained from the first step to determine the final number of clusters. Finally, partitional clustering is applied to the real data to obtain a fine definition of the clusters. This process for identifying web user-sessions can deal with large amounts of data, and its effectiveness and robustness show that web user-sessions can be identified without requiring any a priori definition of threshold values. It may be used in characterizing web user-sessions and in studying the properties of Internet traffic.

There are some limitations which could be directions for future work. First, the resulting algorithm is applied only to web traffic, so this work could be extended to other types of traffic and to user sessions not related to the web. Second, after user-sessions have been identified, a further pass is required to remove errors in the session pages and unrelated pages, which would improve the quality of the identified sessions.
