Log-File Analysis

Log-file analysis uses the records stored in the transaction logs of information retrieval systems, web search engines, and websites to offer valuable understanding of the interactions between these systems and people. This understanding informs system design, interface development, and information architecture. Log files (or transaction logs) are an unobtrusive and relatively easy method of recording significant amounts of usage data on a considerable number of users at low cost.

A log file or transaction log is a record of the interactions between a system and the users of that system. Rice and Borgman (1983) present transaction logs as a data collection method that automatically captures the type, content, or time of transactions made by a person from a terminal with that system. Peters (1993) views transaction logs as electronically recorded interactions between online information systems and the people who search for the information found in those systems.

Nature And Goals Of Log-File Analyses

Once one has collected and recorded the data in a log, one must analyze this data in order to obtain useful information. The process of conducting this analysis is transaction log analysis (TLA) or log-file analysis. TLA can focus on many issues and questions, but it typically addresses issues of system performance, information structure, or measurements of user interactions. Peters (1993) describes TLA as the study of electronically recorded interactions between online information retrieval systems and the persons who search for information found in those systems. Blecic et al. (1998) define TLA as the detailed and systematic examination of each search command or query by a user and the following database result or output. Spink and Jansen (2004) also provide comparable definitions of TLA but with a web focus.
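As a rough illustration of the kind of measurement TLA produces, the following Python sketch parses a small log and computes a few basic descriptive measures. The record layout (tab-separated user identifier, timestamp, and query text), the sample records, and the function names are assumptions made for this example only; real transaction logs vary by system.

```python
from collections import Counter

# Hypothetical log format (an assumption, not a standard): one record per
# line, tab-separated as user_id, ISO timestamp, query text.
SAMPLE_LOG = """\
u1\t2004-05-01T10:00:00\tlog file analysis
u1\t2004-05-01T10:01:30\ttransaction log analysis methods
u2\t2004-05-01T10:02:10\tgrounded theory
"""


def parse_log(text):
    """Return (user_id, timestamp, query) tuples, skipping malformed lines."""
    records = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # ignore lines that do not match the assumed layout
        user_id, timestamp, query = (p.strip() for p in parts)
        records.append((user_id, timestamp, query))
    return records


def summarize(records):
    """Compute simple descriptive measures of user interactions."""
    queries_per_user = Counter(user for user, _, _ in records)
    term_counts = [len(query.split()) for _, _, query in records]
    mean_terms = sum(term_counts) / len(term_counts) if term_counts else 0.0
    return {
        "total_queries": len(records),
        "unique_users": len(queries_per_user),
        "mean_terms_per_query": mean_terms,
        "queries_per_user": dict(queries_per_user),
    }


summary = summarize(parse_log(SAMPLE_LOG))
print(summary)
```

Measures such as queries per user and terms per query are only a starting point; the research questions of a given study determine which measures are worth computing.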

The goal of TLA is to gain a clearer understanding of the interactions among searcher, content, and system, or the interactions between two of these elements. The research questions are the drivers for the particular study. From this understanding, one achieves some stated objective, such as improved system design, advanced searching assistance, or identifying user information searching behavior.

TLA is conceptually based on a grounded theory approach (Glaser & Strauss 1967), although one could also view it from a purely quantitative approach. Grounded theory emphasizes a systematic discovery of theory from data using methods of comparison and sampling. The resulting theories or models are grounded in observations of the real world, rather than being abstractly generated. Therefore, grounded theory is an inductive approach to theory or model development, rather than the deductive alternative.

Computer–User Interaction

Using TLA as a methodology, one examines the characteristics of searching episodes in order to isolate trends and identify typical interactions between searchers and the system. Interaction has several meanings in information searching, addressing a variety of transactions including query submission, query modification, results list viewing, and use of information objects (e.g., web page, pdf file, video). TLA addresses levels one and two (move and tactic) of Bates’s (1990) four levels of interaction, which are move, tactic, stratagem, and strategy. Saracevic (1997) views interaction as the exchange of information between users and system. Increases in interaction result from increases in communication content. Hancock-Beaulieu (2000) identifies three aspects of interaction, which are interaction within and across tasks, interaction as task sharing, and interaction as a discourse.

Interactions are the physical expressions of communication exchanges between the searcher and the system. For example, a searcher may submit a query (i.e., an interaction). The system may respond with a results page (i.e., a reaction). The searcher may click on a uniform resource locator (URL) in the results listing (i.e., an interaction). Therefore, for TLA, interaction is a mechanical expression of underlying information needs or motivations.

Discussing TLA as a methodological approach, Sandore et al. (1993) review methods of applying the results of TLA. Borgman et al. (1996) comprehensively review the past literature on the different methodologies employed in these studies. Several researchers, including Cooper (1998), have viewed TLA as a high-level designed process. Other researchers, such as Hancock-Beaulieu et al. (1990), Griffiths et al. (2002), and Hargittai (2002), have advocated using TLA in conjunction with other research methodologies or data collection methods. These other methods include questionnaires, interviews, video analysis, and verbal protocol analysis.

In terms of strengths, log-file analysis provides a method of collecting data from a great number of users. Given the current nature of the web, transaction logs appear to be a reasonable and non-intrusive means of collecting user–system interaction data during the web information searching process from a large number of searchers. One can easily collect data on hundreds of thousands to millions of interactions, depending on the traffic of the website. One can also collect this data inexpensively. The costs are the software and storage. The data collection is unobtrusive, so the interactions represent the unaltered behavior of searchers. Finally, transaction logs are, at present, the only method for obtaining significant amounts of data within the complex environment that is the web.

Limitations To Log-File Analysis

Researchers have critiqued TLA as a research methodology (Blecic et al. 1998; Hancock-Beaulieu et al. 1990; Phippen et al. 2004). Kurth (1993) comments that transaction logs can only deal with the actions that the user takes, not their perceptions, emotions, or background skills, and further identifies three methodological issues with TLA: execution, conception, and communication. Kurth states that TLA can be difficult to execute due to collection, storage, and analysis issues associated with the large volume and complexity of the dataset (i.e., the significant number of variables). With complex datasets, it is sometimes difficult to develop a conceptual methodology for analyzing the dependent variables from a positivist framework. Communication problems occur when researchers do not define terms and metrics in sufficient detail to allow other researchers to interpret and verify their results. This issue also occurs during the data collection period.

Certainly, any researcher who has utilized TLA would agree with these critiques. However, upon reflection, these are issues with many, if not all, empirical methodologies. Further, although Kurth’s critique is still generally valid, advances in transaction logging software, standardized transaction log formats, and improved data analysis software and methods have addressed many of these shortcomings.

Limitations certainly do exist. Transaction logs are primarily a server-side data collection method; therefore, some interaction events are hidden from these logging mechanisms, such as when the user clicks on the back or print button of the browser, or cuts and pastes information from one window to another on a client computer. Transaction logs also, as stated previously, do not record the underlying situational, cognitive, or affective elements of the searching process.

Another limitation is that there may be certain types of data not in the transaction log, individuals’ identities being the most common example. An IP (Internet protocol) address typically represents the “user” in a transaction log. Since more than one person may use a computer, an IP address is an imprecise representation of the user. Search engines are overcoming this limitation somewhat by the use of cookies. In addition, there is no way to collect demographic data when using transaction logs in a naturalistic setting. This constraint is characteristic of many nonintrusive naturalistic studies. However, there are several sources for demographic data on the web population based on observational and survey data. From these data sources, one may obtain reasonable estimates of needed demographic data.
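The imprecision of the IP address as a stand-in for the user can be illustrated with a minimal Python sketch. The records, the field names (`ip`, `cookie`, `query`), and the `count_users` helper are hypothetical and chosen only for illustration; the addresses come from a range reserved for documentation.

```python
def count_users(records, key_fields):
    """Count distinct 'users' when a user is identified by the given fields."""
    return len({tuple(record[field] for field in key_fields) for record in records})


# Hypothetical records: two people share one computer (same IP address,
# different cookies), and a third person searches from another address.
records = [
    {"ip": "192.0.2.10", "cookie": "a1", "query": "weather"},
    {"ip": "192.0.2.10", "cookie": "b2", "query": "train times"},
    {"ip": "192.0.2.10", "cookie": "a1", "query": "weather tomorrow"},
    {"ip": "192.0.2.44", "cookie": "c3", "query": "grounded theory"},
]

users_by_ip = count_users(records, ["ip"])
users_by_ip_and_cookie = count_users(records, ["ip", "cookie"])

print(users_by_ip)             # distinct IP addresses
print(users_by_ip_and_cookie)  # distinct IP address and cookie pairs
```

In this toy data the IP address alone merges two people into one "user", while the combination of IP address and cookie separates them. Cookies remain an approximation, since they identify a browser rather than a person.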

In addition, a transaction log does not record the reasons for the search, the searcher's motivations, or other qualitative aspects of use. This is certainly a limitation. In the instances where one needs this data, transaction log analysis can be used in conjunction with other data collection methods. However, these additional methods compromise the unobtrusiveness that is an inherent advantage of transaction logs as a data collection method.

Finally, the logged data may not be complete due to caching of server data on the client machine or proxy servers. This is a relatively minor concern for web search engine research because of the method by which most search engines dynamically produce their results pages. For example, suppose a user returns to a page of search engine results using the back button of a browser. This navigation retrieves the results page from the cache on the client machine, so the web server will not record this action. However, if the user clicks on any URL on that results page, functions coded on the results page redirect the click first to the web server, which records the visit before the user reaches the website.
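The redirect mechanism described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a description of any particular search engine: the endpoint `https://search.example/click`, the parameter names `q` and `url`, and both function names are hypothetical, and an in-memory list stands in for the server's transaction log.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical click-tracking endpoint on the search engine's own server.
CLICK_ENDPOINT = "https://search.example/click"

# Stand-in for the server-side transaction log.
click_log = []


def tracking_link(target_url, query):
    """Build the link placed on the results page; it points at the server first."""
    return CLICK_ENDPOINT + "?" + urlencode({"q": query, "url": target_url})


def handle_click(request_url):
    """Record the click on the server, then return the URL to redirect to."""
    params = parse_qs(urlparse(request_url).query)
    target = params["url"][0]
    click_log.append({"query": params["q"][0], "target": target})
    # A real server would answer with an HTTP redirect to this target.
    return target


link = tracking_link("https://example.org/page?id=7", "log analysis")
destination = handle_click(link)
print(link)
print(destination)
print(click_log)
```

Because the link on the cached results page still points at the server, the click is logged even though the results page itself was never requested again.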

Overall, transaction logs are powerful tools for collecting data on the interactions between users and systems. Using this data, TLA can provide significant insights into user–system interactions, and it complements other methods of analysis by overcoming the limitations inherent in these methods. With respect to shortcomings, where feasible TLA can be combined with other data collection methods or other research results to improve the robustness of the analysis. Of course, although transaction logs often “reside in the background” for most searchers, there are privacy concerns that the researcher must be aware of when using this method of data collection. Even with these shortcomings, TLA is a powerful tool for web searching research, and the TLA process can be helpful in future web searching research endeavors.

References:

  1. Bates, M. J. (1990). Where should the person stop and the information search interface start? Information Processing and Management, 26(5), 575–591.
  2. Blecic, D., Bangalore, N. S., Dorsch, J. L., Henderson, C. L., Koenig, M. H., & Weller, A. C. (1998). Using transaction log analysis to improve OPAC retrieval results. College and Research Libraries, 59(1), 39–50.
  3. Borgman, C. L., Hirsh, S. G., & Hiller, J. (1996). Rethinking online monitoring methods for information retrieval systems: From search product to search process. Journal of the American Society for Information Science, 47(7), 568–583.
  4. Cooper, M. D. (1998). Design considerations in instrumenting and monitoring Web-based information retrieval systems. Journal of the American Society for Information Science, 49(10), 903–919.
  5. Glaser, B., & Strauss, A. (1967). The discovery of grounded theory: Strategies for qualitative research. Chicago: Aldine.
  6. Griffiths, J. R., Hartley, R. J., & Willson, J. P. (2002). An improved method of studying user–system interaction by combining transaction log analysis and protocol analysis. Information Research, 7(4). At http://InformationR.net/ir/7-4/paper139.html.
  7. Hancock-Beaulieu, M. (2000). Interaction in information searching and retrieval. Journal of Documentation, 56(4), 431–439.
  8. Hancock-Beaulieu, M., Robertson, S., & Nielsen, C. (1990). Evaluation of online catalogues: An assessment of methods (BL Research Paper 78). London: British Library Research and Development Department.
  9. Hargittai, E. (2002). Beyond logs and surveys: In-depth measures of people’s web use skills. Journal of the American Society for Information Science and Technology, 53(14), 1239–1244.
  10. Kurth, M. (1993). The limits and limitations of transaction log analysis. Library Hi Tech, 11(2), 98–104.
  11. Peters, T. (1993). The history and development of transaction log analysis. Library Hi Tech, 11(2), 41–66.
  12. Phippen, A., Sheppard, L., & Furnell, S. (2004). A practical evaluation of web analytics. Internet Research: Electronic Networking Applications and Policy, 14(4), 284–293.
  13. Rice, R. E., & Borgman, C. L. (1983). The use of computer-monitored data in information science. Journal of the American Society for Information Science, 44(1), 247–256.
  14. Sandore, B., Flaherty, P., & Kaske, N. K. (1993). A manifesto regarding the future of transaction log analysis. Library Hi Tech, 11(2), 105–111.
  15. Saracevic, T. (1997). The stratified model of information retrieval interaction: Extension and applications. Proceedings of the Annual Meeting of the American Society for Information Science, 34, 313–327.
  16. Spink, A., & Jansen, B. J. (2004). Web Search: Public Searching of the Web. New York: Kluwer.