Technical solutions to web site visitor tracking

17 June 2005

There are two main ways to tackle the problem of identifying an individual when they visit a web site. This assumes they are not willing to put their hand up and identify themselves by logging in, which may not be likely depending on the type of site or service on offer. So, assuming we can’t persuade them to identify themselves so readily, the two technical methods for getting this information could broadly be classed as what I’d call ‘intrusive’ and ‘non-intrusive’.

Intrusive methods have a lasting effect on the visitor’s computer, usually through depositing a file of some kind. The most common, and the most familiar, method for doing this is the browser cookie. There are a few other techniques, such as the recent uses of Flash Shared Objects, but only a handful, and each has its own problems. These have the advantage that we can be certain of the identity of the computer that is visiting the site (although still not 100% sure that it’s the same individual using it), as only their computer will have that particular file on it. It doesn’t matter if they change ISP or otherwise play around with their computer’s settings as long as they use the same browser, and changing browser is not too common an occurrence.
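To make the cookie approach concrete, here is a minimal sketch in Python of how a server might recognise a returning browser. It is only an illustration under my own assumptions: the cookie name `visitor_id`, the one-year lifetime and the function name `identify_visitor` are all hypothetical, not taken from any particular analytics product.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"        # hypothetical cookie name
ONE_YEAR = 365 * 24 * 60 * 60     # assumed lifetime, in seconds


def identify_visitor(cookie_header):
    """Return (visitor_id, set_cookie_value).

    cookie_header is the raw value of the request's Cookie header (or None).
    set_cookie_value is None when the browser already carries our cookie,
    otherwise it is the value to send back in a Set-Cookie header.
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")

    if COOKIE_NAME in cookies:
        # Returning browser: the file we left earlier identifies it.
        return cookies[COOKIE_NAME].value, None

    # First visit (or the cookie was blocked or deleted): issue a new ID.
    visitor_id = uuid.uuid4().hex
    outgoing = SimpleCookie()
    outgoing[COOKIE_NAME] = visitor_id
    outgoing[COOKIE_NAME]["max-age"] = ONE_YEAR
    outgoing[COOKIE_NAME]["path"] = "/"
    return visitor_id, outgoing[COOKIE_NAME].OutputString()
```

If the visitor deletes or blocks the cookie, the next request simply looks like a first visit and gets a fresh ID, which is exactly the weakness discussed next.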

The disadvantage of these techniques is that some people don’t like the idea of a web site, let alone an ad banner server, putting something on their computer without being asked. This has generated a market for tools which block or remove these intrusive files, generally sold as anti-spyware. There are also only a limited number of techniques we can exploit to get the file onto the user’s system, and we’re restricted by what the manufacturers of browsers and common plug-ins (Flash and Java) decide to give us. As is likely to happen with Flash Shared Objects, these are all easy to block, either through the use of bespoke software or simply by the user changing their browser preferences to explicitly deny certain technologies.

Non-intrusive methods are what we’re left with if users decide to actively combat the intrusive methods, and this is an area that no-one seems to have really cracked. Historically, this is how web analytics started: by tracking people by IP address. It hasn’t really come a lot further, and most analytics vendors talk about using the combination of IP address and user agent if it’s not possible to use a cookie. (The user agent is the way a web browser identifies itself to a web server, e.g. ‘Microsoft Internet Explorer version x.xxxxxx’.) Although there are a huge number of sub-versions of each browser, that doesn’t really solve the problem of changing IP addresses for repeated sessions (as with dial-up users), changing IP address mid-session (as with AOL users and other proxies) or mass installations of the same browser behind the same proxy server (as with some corporate setups). There are techniques to improve accuracy, but none reach certainty by any means.
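As a rough illustration of the IP-plus-user-agent fallback and why it breaks down, here is a small Python sketch. The function name `fallback_visitor_key`, the truncated SHA-256 hash and the example addresses (taken from the reserved documentation ranges) are my own assumptions, not how any specific vendor does it.

```python
import hashlib


def fallback_visitor_key(ip_address, user_agent):
    """Derive a visitor key from the two values every request volunteers."""
    raw = f"{ip_address}|{user_agent}".encode("utf-8")
    # Truncated purely for readability; the length is an arbitrary choice.
    return hashlib.sha256(raw).hexdigest()[:16]


ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

# Two different people in one office: same proxy IP, same standard browser
# build, so they collapse into a single "visitor".
office_a = fallback_visitor_key("203.0.113.10", ua)
office_b = fallback_visitor_key("203.0.113.10", ua)

# One dial-up user on two different days: a new IP address each session,
# so they look like two different "visitors".
dialup_monday = fallback_visitor_key("198.51.100.7", ua)
dialup_tuesday = fallback_visitor_key("198.51.100.42", ua)

print(office_a == office_b)             # two people counted as one
print(dialup_monday == dialup_tuesday)  # one person counted as two
```

The first comparison shows under-counting (the corporate case) and the second shows over-counting (the dial-up case), which is why this combination can improve accuracy but never gives certainty.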

All of this information, such as IP address and user agent, is ‘volunteered’ by the web browser as a user surfs, and with the addition of some client-side JavaScript code we can find out a lot more, such as screen size, colour depth and system time, plus some others. I call this non-intrusive because, although we are running some code on the user’s machine (as part of the page they are visiting), we’re not leaving any kind of mark on the user’s system, only reading information.
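Here is a hedged sketch of what the server might do with those extra values once the page’s script has reported them. It assumes the attributes arrive as a simple dictionary; the field names (`ip`, `user_agent`, `screen`, `colour_depth`) and the function name `fingerprint` are hypothetical, and hashing a canonical JSON form is just one plausible way to combine them.

```python
import hashlib
import json


def fingerprint(attributes):
    """Combine whatever the browser volunteered into one short identifier."""
    # Sorting the keys gives a canonical form, so the order in which the
    # values were collected does not affect the result.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


first_visit = {
    "ip": "203.0.113.10",
    "user_agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "screen": "1024x768",
    "colour_depth": 32,
}

# Same values reported in a different order: still the same fingerprint.
same_values_reordered = {
    "colour_depth": 32,
    "screen": "1024x768",
    "user_agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "ip": "203.0.113.10",
}

# The visitor changes their screen resolution: a different fingerprint.
after_resolution_change = dict(first_visit, screen="1280x1024")

print(fingerprint(first_visit) == fingerprint(same_values_reordered))
print(fingerprint(first_visit) == fingerprint(after_resolution_change))
```

Nothing is written to the visitor’s machine here; the identifier exists only on the server, and it is only as stable as the values that feed it.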

The aim of the non-intrusive methods is to identify a user by generating some kind of unique fingerprint based on what we can find out. As far as I’m aware, there is no sure-fire way of doing this, but it’s where a lot of the lateral thinking is going. There isn’t really anywhere else to go with the intrusive methods unless the browser and plug-in manufacturers come up with something new. (And then, following that, we’d have to see whether it becomes regularly blocked in the same way as cookies are now.) Areas that we’ve been looking at include what we can find out about the visitor’s browser setup (which, again, will often be identical across corporate installations), but also what they do and what they’ve seen, and whether we can generate a fingerprint that way. These ideas often fall foul of the fact that we’re not dealing with a controlled set of variables but rather with side-effects, i.e. the user may change something about their system not because they don’t want to be tracked, but for some other completely innocuous reason. Examples of this might be making use of the browsing history and cache (which regularly expire and are often cleared out by users, hence losing all that data) or system time (which changes over time, so can’t be relied on to any real degree of accuracy).
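One way to live with attributes that drift for innocuous reasons is to score how similar two observations are rather than demand an exact match. The sketch below is only an assumption about how that might look: the function name `similarity`, the attribute names and the equal weighting of every attribute are all mine, and a real system would need to weigh attributes by how distinguishing and how stable they are.

```python
def similarity(seen, current):
    """Fraction of attributes (across both observations) with equal values."""
    keys = set(seen) | set(current)
    if not keys:
        return 0.0
    matches = sum(1 for key in keys if seen.get(key) == current.get(key))
    return matches / len(keys)


# What we recorded on an earlier visit (hypothetical attribute names).
earlier = {
    "user_agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "screen": "1024x768",
    "colour_depth": 32,
    "plugins": "flash,java",
}

# The same person later, after changing screen resolution for an
# entirely innocent reason.
later = dict(earlier, screen="1280x1024")

score = similarity(earlier, later)
print(score)
```

A score like this only ever says ‘probably the same visitor’. It also scores two identical corporate installations as a perfect match, so it improves the odds without removing the guesswork.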

The change in browser technology and the shift in market share are also playing a part in making it difficult to produce new solutions. As an example, research may be conducted into the way files are downloaded from the server to try and gauge whether a particular visitor has seen a page or combination of files before, but the download methods for the main browsers differ enough to cause problems here. Firefox is easily configurable to open many more parallel requests than either IE or its own default setting. Given that a new version of IE is on the horizon, any solution which made use of any kind of browsing technology side-effect could well become obsolete, or possibly even give completely incorrect results.

These techniques do have some usefulness when it comes to tracking a user within a single session, and indeed we can do that very accurately without cookies. What is much harder is to identify repeat visits weeks or months after someone has first seen a piece of content. It is these kinds of timeframes within which people are more likely to clear out their cookies but even with this deletion the cookie (or other intrusive method) is still much more accurate than any of the non-intrusive ‘guesswork’ methods.

That doesn’t mean we’re going to give up on trying to find a better way to measure, but unless either a) there is a revolution in browsing technology and the general consumer is willing to accept being identified (I’m not holding my breath – as a web user I’d have issues with giving up that kind of anonymity) or b) we persuade people to identify themselves voluntarily, we are for the moment only working towards greater statistical accuracy rather than certainty, especially in the most valuable area of repeat visits.
