Mining DistroWatch.com Logs (Part 2)

This article continues the analysis of DistroWatch.com's logs I started one week ago. Last time, the data were prepared so that we could investigate the evolution, in time and space, of the popularity of GNU/Linux distributions. Pre-processing the logs in a different manner allows us to focus on other interesting questions. Although the extracted patterns will have the same "shape" as in last week's extraction, this time they will help us discover groups of distributions fulfilling similar purposes.

Instead of last week's ternary relation, this time we will end up mining a 4-ary relation: more precisely, a symmetric graph of distributions evolving in time and space. Take the red pill and welcome to the real world... of data-mining! When a visitor of DistroWatch.com (identified by her IP address, from which the country is inferred) visits, on the same day, pages related to different distributions, she is probably searching for a Free operating system to fulfill her specific needs. Hence, we consider that these distributions present a common interest for her. Of course, she may just be clicking randomly here and there. In that case, this visitor creates noise in the data. Anyway, the first step consists in counting how many visitors from a given country have loaded, on the same day, the pages of two given distributions. This is done for all days from June 1st, 2004 (Ladislav Bodnar was not logging the IP addresses before that date), for the 40 biggest countries (according to these counts) and for all pairs of distributions (among the 350 most important ones). Then, as in last week's analysis, we aggregate the days into semesters (December-May and June-November) and normalize the counts so that every semester and every country has the same importance.
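The counting step described above can be sketched in a few lines of Python. This is only an illustration, not the code actually used for the article: the input format (tuples of IP address, country, day and distribution), the function names and the normalization (dividing every count by the total of its semester and country) are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations

def semester_of(day):
    """Label a date with its semester: June-November or December-May."""
    if 6 <= day.month <= 11:
        return (day.year, "Jun-Nov")
    # December opens the semester that ends in May of the following year.
    start_year = day.year if day.month == 12 else day.year - 1
    return (start_year, "Dec-May")

def co_visit_counts(hits):
    """Count, per semester and country, the visitors who loaded two
    distributions on the same day.

    hits: iterable of (ip, country, day, distribution) tuples
    (a hypothetical format, the real logs are not shown in the article).
    """
    seen = defaultdict(set)
    for ip, country, day, distro in hits:
        seen[(ip, country, day)].add(distro)

    counts = defaultdict(int)
    for (ip, country, day), distros in seen.items():
        # Each unordered pair is counted once per visitor and per day.
        for a, b in combinations(sorted(distros), 2):
            counts[(semester_of(day), country, a, b)] += 1
    return counts

def normalize(counts):
    """Give every (semester, country) slice the same total weight (assumed scheme)."""
    totals = defaultdict(int)
    for (semester, country, a, b), n in counts.items():
        totals[(semester, country)] += n
    return {key: n / totals[(key[0], key[1])] for key, n in counts.items()}
```

Sorting the distributions before pairing them means each pair is stored once, in alphabetical order; the symmetric graph is obtained by reading every pair in both directions.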

Among the millions of visits in a semester, almost every pair of distribution pages has, on some day, been consulted by at least one visitor. We will keep in the relation only the pairs that are found most frequently. But wait a minute! The pairs involving the most popular distributions will, of course, occur more frequently than those pertaining to obscure (though very similar in purpose) ones. Hence, instead of stating one unique threshold for the whole data set (as we did last time), we use one threshold per distribution, which filters out the distributions with which it is not sufficiently associated. Such a threshold is easily set: it is a given fraction of the maximal number of visits binding the distribution to another one. As a consequence, the popularity of a distribution does not play any role.
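To make the per-distribution threshold concrete, here is a minimal Python sketch for a single semester and country. The function name, the input format and the fraction of 0.5 are assumptions chosen for the example; the fraction actually used is not given here, nor is the way the two directions of a pair are reconciled, so the sketch simply reports what each distribution keeps.

```python
from collections import defaultdict

def filter_by_own_threshold(weights, fraction=0.5):
    """Keep, for each distribution, the partners it is strongly tied to.

    weights: dict mapping an unordered pair (a, b) to its normalized
    co-visit weight (hypothetical format).
    fraction: share of the distribution's strongest tie used as its
    threshold (0.5 is an arbitrary value for this illustration).
    """
    by_distro = defaultdict(dict)
    for (a, b), w in weights.items():
        by_distro[a][b] = w
        by_distro[b][a] = w

    kept = {}
    for distro, partners in by_distro.items():
        # The threshold depends only on this distribution's own strongest tie.
        threshold = fraction * max(partners.values())
        kept[distro] = {p for p, w in partners.items() if w >= threshold}
    return kept
```

With weights such as 100 for (Debian, Ubuntu), 4 for (Devil, Ubuntu) and 6 for (Devil, IPCop), Ubuntu keeps only Debian, while the far less visited Devil keeps both Ubuntu and IPCop: each distribution is judged against its own strongest tie, not against a global count.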

It is now time to mine the resulting data set. As mentioned earlier, it is a symmetric graph of distributions evolving in time (the semesters) and space (the countries). Our algorithm, namely Data-Peeler, extracts, under constraints, the patterns we will analyze shortly. In addition to the weighted-area constraint we used in last week's extraction, we force the extracted patterns to have the same set of elements in the two dimensions related to the graph of distributions. Thus, the symmetry of the graph is preserved inside the patterns. Let us move on to the results! Many extracted patterns are very similar to each other. We can identify communities of interest, each gathering a specific set of distributions.
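The symmetry constraint mentioned above can be illustrated with a small check, written in Python. This is not Data-Peeler, which searches for the patterns itself; the sketch only tests whether a candidate pattern is fully present in the relation when the same set of distributions is used on both graph dimensions. The function name and the tuple format are assumptions, as is the choice of ignoring the pairing of a distribution with itself. The weighted-area constraint is left out.

```python
from itertools import product

def is_symmetric_pattern(relation, semesters, countries, distros):
    """Check that every tuple of a candidate pattern is in the relation.

    relation: set of (semester, country, a, b) tuples containing both
    orientations of every kept pair (hypothetical format).
    The same set `distros` is used for the two distribution dimensions.
    """
    return all(
        (s, c, a, b) in relation
        for s, c, a, b in product(semesters, countries, distros, distros)
        if a != b  # assumption: a distribution paired with itself is ignored
    )
```

A pattern passing this check is a group of distributions that are all pairwise associated, in every listed semester and every listed country.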

Let us start with the biggest community: the old mainstream general-purpose distributions. At the center of this community (again, remember that this does not relate to popularity but to the common purposes these distributions serve) are Slackware, Gentoo and Ubuntu. A bit further away (i.e., not as closely related to the other distributions of this group) are Fedora, openSUSE and Debian. At the border of this community are Yellow Dog, MEPIS, Mandriva, Vector, FreeBSD and Damn Small Linux. When looking at the countries present in these patterns, it appears that visitors from some European countries are clearly the ones making these associations. The United Kingdom stands out by being in almost all of these patterns. Finland is also extremely present. Australia, Greece and Denmark are not far behind. Why would these European and Australian visitors focus more on mainstream distributions than others do? Maybe they are more conservative and keep tracking the evolution of these solid distributions instead of searching for more specialized ones.

Another clear community of interest gathers the distributions that are specifically designed to manipulate movies and music: dyne:bolic, AGNULA, MoviX, GeeXboX and ArtistX. The clicks on all these distributions are tightly correlated. ArtistX, though, appears somewhat less often in the extracted patterns. Again, these correlations mainly come from Europe and Australia. GNU/Linux is obviously a popular choice among Swiss artists: Switzerland is the country where these associations are the most frequent. The United Kingdom, Belgium, Australia and the Netherlands are almost at the level of Switzerland.

Some mainstream distributions and the art-centric ones are related to each other too. In particular, the GeeXboX and the dyne:bolic pages are often visited by people who also show interest in MEPIS or Damn Small Linux. Ubuntu and ArtistX are also quite often found in patterns grouping mainstream distributions with others focusing on movie and music applications. In this hybrid community, Knoppix, AGNULA, Xandros and PCLinuxOS are sometimes encountered. This community could be understood as a "collision" created by visitors interested in two separate kinds of distributions. Nevertheless, several common points can be found between them. For one thing, the mainstream distributions we are dealing with here are primarily designed for desktop use. Hence they are also suited to playing movies and music. Furthermore, every distribution from this community uses the APT package management system (if any). Where do the connections making such associations come from? The most present countries are a mix of those that lead the two previous communities: the United Kingdom, Australia and Belgium. To some extent, visitors from Finland, Canada, France, Switzerland and South Korea make these associations too.

The last big community gathers the distributions that are specifically designed to act as firewalls. Centered around IPCop and ClarkConnect, this community also includes Devil, SmoothWall and Astaro. To some extent, m0n0wall is also part of this group. Censornet is only associated once with Astaro and ClarkConnect. Since it is a bit particular in its approach to Web filtering, this is not surprising. Whereas the previously discussed distributions were not strictly tied to one community, the firewalls are never related to other distributions in the patterns Data-Peeler extracts. Australian visitors of DistroWatch.com frequently browse several pages related to the firewalls. Visitors from Belgium and Indonesia are also very prone to do so.

The fact that the same countries appear in all the communities reveals a more general phenomenon. Western European and Australian visitors of DistroWatch.com prefer to track the evolution of identified communities of distributions, whereas visitors from other countries (in particular American ones) are prone to click more or less randomly in order to discover new flavors of GNU/Linux. Hence, the former create nice constant patterns Data-Peeler can filter, while the latter follow a behavior that cannot be set apart from noise.

The temporal evolution of these communities has not been discussed so far. The reason is that the task performed here is biased. Indeed, since only the biggest patterns are extracted, they always correspond to communities which were important in the middle of the rather short period of time (w.r.t. the evolution of these communities) studied here. The sets of distributions that were consulted together at the beginning of the period (June 1st, 2004) or at the end of it (November 30th, 2007) do not appear, since the patterns involving them do not gather enough semesters to satisfy the weighted-area constraint. Hence, every community seems to increasingly tighten the links between its distributions until late 2005 and then to decrease in importance. The GNU/Linux ecosystem is alive. Every community we have identified in this article is evolving. Some distributions were commonly recognized as serving the same purposes. Then the visitors of DistroWatch.com swerve to newcomers in the distribution panorama and the past associations are weakened. New ones are created. Nevertheless, it can be observed that the community of old mainstream general-purpose distributions withstands this bias better than, let us say, the art-centric one: these distributions constitute a stable ground which does not evolve much over time.

Let us take a look at a community that is much smaller, in terms of the number and the sizes of the extracted patterns: the light distributions. Around Damn Small Linux, we find, in this community, Puppy, Feather, Knopperdisk, Vector and, to some extent, SaxenOS and DeLi. The United Kingdom, Finland and Australia are, again, the most present countries in these patterns. Interestingly, this community resists the bias identified in the previous paragraph: to this day, these light distributions are, as a group, attracting more and more of the visitors' attention.

Other particular patterns could be discussed, like the tight links between Gentoo, Gentoox and GentooTH or the fact that people visiting Foresight also visit rPath. The big picture, however, essentially comprises the communities detailed above. If you have any comments about the analysis done in this article, do not hesitate to leave a message below. To conclude, I would like to credit the other researchers who have been working, as much as I have, on Data-Peeler: Jérémy Besson, Céline Robardet and Jean-François Boulicaut.
