Workload characterization and customer interaction at e-commerce web servers
Electronic commerce servers have a significant presence in today's Internet. Corporations want to maintain high availability, sufficient capacity, and satisfactory performance for their E-commerce Web systems, and want to provide satisfactory services to customers. Workload characterization and the analysis of customers' interactions with Web sites are the bases upon which to analyze server performance, plan system capacity, manage system resources, and personalize services at the Web site. To date, little empirical evidence has been discovered that identifies the characteristics for Web workloads of E-commerce systems and the behaviours of customers. This thesis analyzes the Web access logs at public Web sites for three organizations: a car rental company, an IT company, and the Computer Science department of the University of Saskatchewan. In these case studies, the characteristics of Web workloads are explored at the request level, functionlevel, resource level, and session level; customers' interactions with Web sites are analyzed by identifying and characterizing session groups. The main E-commerce Web workload characteristics and performance implications are: i) The requests for dynamic Web objects are an important part of the workload. These requests should be characterized separately since the system processes them differently; ii) Some popular image files, which are embedded in the same Web page, are always requested together. If these files are requested and sent in a bundle, a system will greatly reduce the overheads in processing requests for these files; iii) The percentage of requests for each Web page category tends to be stable in the workload when the time scale is large enough. This observation is helpful in forecasting workload composition; iv) the Secure Socket Layer protocol (SSL) is heavily used and most Web objects are either requested primarily through SSL or primarily not through SSL; and v) Session groups of different characteristics are identified for all logs. The analysis of session groups may be helpful in improving system performance, maximizing revenue throughput of the system, providing better services to customers, and managing and planning system resources. A hybrid clustering algorithm, which is a combination of the minimum spanning tree method and k-means clustering algorithm, is proposed to identify session clusters. Session clusters obtained using the three session representations Pages Requested, Navigation Pattern, and Resource Usage are similar enough so that it is possible to use different session representations interchangeably to produce similar groupings. The grouping based on one session representation is believed to be sufficient to answer questions in server performance, resource management, capacity planning and Web site personalization, which previously would have required multiple different groupings. Grouping by Pages Requested is recommended since it is the simplest and data on Web pages requested is relatively easy to obtain in HTTP logs.
DegreeMaster of Science (M.Sc.)
CommitteeLinks, Graham; Carter, James A.; Bunt, Rick B.
Copyright DateAugust 2004
Web Usage Mining