Mining DistroWatch.com Logs (Part 1)

Mining the logs of the famous DistroWatch.com website makes it possible to formally assess trends in the GNU/Linux ecosystem. In particular, this first part analyzes the popularity of Ubuntu with respect to the former predominance of Mandriva.

A month from now, an algorithm called Data-Peeler will be presented at a prestigious conference in Atlanta, Georgia. This algorithm, developed by the French research team I work in, belongs to the field of data mining, the computer science topic that aims at extracting knowledge from data. In the case of Data-Peeler, the data are n-ary relations. The extracted knowledge binds subsets of the n domains that are simultaneously frequent.

Once I had implemented this algorithm, I needed an interesting real-life n-ary relation to assess the relevance of the extracted knowledge. I immediately thought of DistroWatch.com. This popular website gathers comprehensive information about GNU/Linux, BSD and Solaris distributions. Interpreting trends in Free operating systems may look less serious than the gene expression data analysis my team is accustomed to. However, I feel much more comfortable with it! That is why I wrote to Ladislav Bodnar, the maintainer of DistroWatch.com. He kindly agreed to share his logs with me and wished me "Happy number crunching!".

On DistroWatch.com, every distribution is described on a separate page. I have considered that a visitor loading such a page is "interested" in the distribution. Ladislav analyzes every IP address contacting his server, so the country each connection comes from is logged as well. Finally, timestamps make it possible to study how the interest in the different distributions evolves over time. The result is a wonderful ternary relation between distributions, countries and time. From it, Data-Peeler extracts patterns binding sets of distributions with sets of countries and sets of time periods. Their meaning is the following: some users from those countries were interested in the same set of distributions during the same time periods.
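To make the notion of a pattern concrete, here is a minimal Python sketch. It assumes the relation is stored as a set of (semester, country, distribution) triples; the function name `is_pattern` and the sample data are hypothetical. It only checks a candidate pattern and says nothing about how Data-Peeler actually searches for patterns.

```python
from itertools import product


def is_pattern(relation, semesters, countries, distributions):
    """Return True if every combination of the three sets is in the relation.

    `relation` is assumed to be a set of (semester, country, distribution)
    triples. This representation is an illustration, not Data-Peeler's own.
    """
    return all(
        triple in relation
        for triple in product(semesters, countries, distributions)
    )


# Hypothetical toy relation, not real DistroWatch.com data.
relation = {
    ("2005-1", "FR", "ubuntu"),
    ("2005-1", "DE", "ubuntu"),
    ("2005-2", "FR", "ubuntu"),
}

full = is_pattern(relation, {"2005-1"}, {"FR", "DE"}, {"ubuntu"})
holed = is_pattern(relation, {"2005-1", "2005-2"}, {"FR", "DE"}, {"ubuntu"})
```

In this toy example `full` is true, while `holed` is false because the triple ("2005-2", "DE", "ubuntu") is missing from the relation.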

Since it does not make much sense to consider the evolution of the GNU/Linux ecosystem day by day, I aggregated the data by semester. I decided that the first semester of a year would start on December 1st and finish on May 31st. Hence, a semester matches, more or less, the golden age of the latest version of most distributions that follow a six-month release policy. In this way, I obtained, for each distribution page and each semester, the number of connections logged from each country. To avoid biases related to the heterogeneous number of connections per country and the increasing popularity of DistroWatch.com, I normalized the data so that every semester and every country has the same weight. Finally, I set a global threshold: an entry (semester, country, distribution) above it was kept; an entry below it was rejected. In this way, the less popular distributions disappeared (they will be back in the next article!). In the end, a relation on 10 semesters (from December 2002 to November 2007), 242 countries and 294 distributions was ready to be mined.
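The preprocessing described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the log is assumed to be already reduced to (date, country, distribution) page loads, the normalization is assumed to be the share of a distribution among all page loads of a given (semester, country) pair, and the names `semester_of` and `build_relation` as well as the threshold value are hypothetical. The exact normalization I used is not reproduced here.

```python
from collections import Counter
from datetime import date


def semester_of(day):
    """Map a date to (year, half): semester 1 runs from December 1st to May 31st."""
    if day.month == 12:
        return (day.year + 1, 1)  # December belongs to next year's first semester
    if day.month <= 5:
        return (day.year, 1)
    return (day.year, 2)  # June 1st to November 30th


def build_relation(hits, threshold):
    """Keep the (semester, country, distribution) entries whose share exceeds threshold.

    Assumption: the share is computed within each (semester, country) pair,
    so that every semester and every country carries the same weight.
    """
    counts = Counter()
    totals = Counter()
    for day, country, distribution in hits:
        semester = semester_of(day)
        counts[(semester, country, distribution)] += 1
        totals[(semester, country)] += 1
    return {
        (semester, country, distribution)
        for (semester, country, distribution), n in counts.items()
        if n / totals[(semester, country)] > threshold
    }


# Hypothetical page loads, not real DistroWatch.com data.
hits = [
    (date(2002, 12, 1), "FR", "mandriva"),
    (date(2003, 1, 15), "FR", "mandriva"),
    (date(2003, 2, 20), "FR", "mandriva"),
    (date(2003, 5, 31), "FR", "debian"),
    (date(2003, 6, 1), "FR", "ubuntu"),
]
relation = build_relation(hits, threshold=0.5)
```

With these toy page loads, the Debian entry (a quarter of the French page loads in the first semester of 2003) falls below the threshold and is dropped, while the Mandriva and Ubuntu entries are kept.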

Data-Peeler has the interesting property of handling a broad class of constraints at extraction time. This significantly decreases execution times and makes it possible to actually look at the returned knowledge (without any constraint, the output is much larger than the input!). I chose a minimal weighted area constraint: instead of computing the area of a pattern by simply multiplying its dimension sizes, each dimension size is exponentially weighted by a coefficient inversely proportional to the maximal size of that dimension. Hence, a pattern that spans all 10 semesters will be output, whereas a pattern gathering 10 distributions (out of 294) may not. Let us finally move on to the results!
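One possible reading of this weighted area, sketched in Python: each dimension size is raised to an exponent inversely proportional to the size of its domain. The exact formula, the `scale` constant and the function name `weighted_area` are assumptions made for illustration; only the three domain sizes come from the data set described above.

```python
from math import prod

# Domain sizes of the mined relation.
DOMAIN_SIZES = {"semesters": 10, "countries": 242, "distributions": 294}


def weighted_area(pattern_sizes, domain_sizes=DOMAIN_SIZES, scale=10.0):
    """Multiply the dimension sizes, each raised to scale / domain size.

    Assumed formula: the smaller the domain, the larger the exponent, so
    covering a small domain (semesters) counts more than covering the same
    number of elements in a large one (distributions).
    """
    return prod(
        size ** (scale / domain_sizes[dimension])
        for dimension, size in pattern_sizes.items()
    )


all_semesters = weighted_area({"semesters": 10, "countries": 1, "distributions": 1})
ten_distributions = weighted_area({"semesters": 1, "countries": 1, "distributions": 10})
```

Under these assumptions, the pattern covering all 10 semesters gets a weighted area of 10, whereas the pattern gathering 10 distributions gets only about 1.08, so a minimal weighted area threshold between the two keeps the former and discards the latter.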

Mandriva (formerly called Mandrake) was obviously the most popular distribution until the second half of 2005. Indeed, many big patterns gathering tens of countries are related to Mandriva before this date. Starting with Mandriva 2006, this distribution seems to have lost most of its appeal. Among the 40 biggest countries (in number of connections to DistroWatch.com), visitors from Austria, China, the Czech Republic, Germany, Turkey and the USA stopped looking at Mandriva on DistroWatch.com earlier (from the second half of 2004). Those from Denmark, Greece, the Netherlands, Romania, Russia and Sweden waited only one more semester. It is interesting to point out that Mandriva's geographical predominance has never been total. In particular, Argentina, Bulgaria, Japan and Ukraine have never really shown any interest in this distribution.

One other distribution provides patterns gathering many countries: Ubuntu. Those patterns are even more impressive in size than Mandriva's. Indeed, the largest one gathers 132 countries! The predominance of Ubuntu started at the beginning of 2005, once the first version had been released. Investigating the behavior of the biggest countries further, it appears that, during the last semester in the data (second semester of 2007), users from Argentina, India, Indonesia and Romania no longer showed much interest in Ubuntu. Nevertheless, these are not the more technologically advanced countries that led the decline of Mandriva. Hence, there is probably nothing to worry about at Canonical.

Among the remaining patterns, three other mainstream distributions are well represented: Debian, Fedora and openSUSE. In general, these patterns correspond to small countries whose GNU/Linux users focus on one of these distributions. As a consequence, only Mandriva and Ubuntu have, so far, attracted broad international interest. Nevertheless, openSUSE succeeded in capturing the attention of 35 countries from the second half of 2005 to late 2006. Is the patent deal with Microsoft (signed in November 2006) the reason why this distribution did not follow the examples of Mandriva and Ubuntu? I like to think so! It is interesting to notice that users from Finland, like the country's most famous citizen, Linus Torvalds, do not seem to be much bothered by this agreement, and have kept much of their attention focused on this distribution.

Even though Vine used to capture a lot of attention from Japanese users (until late 2005), they have always seemed interested in Fedora too. Not as much as Taiwanese visitors though. Indeed, connections from Taiwan have kept focusing on the Fedora page since the first release of this Free (as in Freedom) distribution.

Although the patterns regarding Debian only pertain to small countries, I think it does not mean much. Indeed, Debian follows the philosophy "released when ready". This does not generate frequent news on DistroWatch.com. Since the patterns Data-Peeler extracts must contain only tuples from the relation (no "hole"), it looks natural that the measured interest in Debian is not as constant as in the distributions following a fast release cycle. Nevertheless, the only pattern that spans all 10 semesters is related to Debian: Monaco contains very few GNU/Linux users (the Principality has 32,000 inhabitants) but most have always shown interest in Debian!

Let us take a look at the more exotic distributions that show up. Soyombo is the perfect example! Until the first semester of 2006, this Mongolian distribution was of particular interest to... Mongolian visitors of course! However, this pattern is not without interest, since only one beta version of this distribution was released, in December 2003! Obviously, the Mongolian visitors of DistroWatch would like their local version of GNU/Linux to be further developed. Anyone? Another peculiarity: visitors from Yemen have always paid much attention to MoviX since its first release. Can someone explain this?

This extraction task focused on the broad (in space and time) interest in some GNU/Linux distributions. Preprocessing the logs differently makes it possible to focus on other questions. Soon, I will present other interesting knowledge Data-Peeler extracted from these data. The effort will be put on discovering common interest in several (possibly obscure) distributions. Stay tuned!
