Around the Net
Apache broadly maintained its share this month with a gain of three quarters of a million sites. The first European Apache Conference was recently announced, with ApacheCon to be held in London from October 23rd to 25th. Weblogic gains some 90,000 sites, overtaking thttpd and Zeus. The great majority of the new sites are hosted on the fast growing NameZero hosting service located at Exodus.
Microsoft announced a new tool aimed at giving hosting companies the capability of producing Windows 2000 based web sites more quickly. The tool has been co-developed with interland.net, who are currently the largest hoster of Windows 2000 sites in the US, with around 2,500 sites. Interland also have around 30,000 sites hosted on Linux, and some 60,000 on NT4. Windows 2000 has yet to see significant takeup with hosting companies, but this was clearly set out as a goal for Microsoft by Steve Ballmer in the run-up to the Windows 2000 launch.
Dogfood
Hotmail has commenced its much awaited migration to a Microsoft operating system. Some Windows 2000 machines have recently been moved into the load balancing pool, with between 90% and 95% of requests currently being served by the established FreeBSD/Apache platform, and 5% to 10% by Windows 2000. The Hotmail site infrastructure is enormous, and even if everything runs smoothly, a migration will likely take several weeks. LinkExchange, the other prominent FreeBSD site owned by Microsoft, still runs FreeBSD but redirects users to BCentral.
How many Active Sites are there?
Motivation
The Web Server Survey has run since August 1995, exploring the internet to find new web sites. At the end of each month, an HTTP request is sent to each site, determining the web server used to support the site, and, through careful inspection of the TCP/IP characteristics of the response, the operating system.
Over the last two years there has been significant growth in the internet’s DNS, fueled by rife domain name speculation, falling registration prices, easier and more efficient administrative procedures, and widespread publicity. It has become more common practice for companies offering internet registration services to place a template site on the web for each domain that they register. Additionally, some hosting service companies find it convenient to create new sites at the time of customer signup, rather than at the point at which the customer is prepared to place real, personalised content on to the web.
So, whereas in the early days of the web hostnames were a good indication of actively managed content providing information and services to the internet community, the situation now is considerably more blurred. The web includes a great deal of activity, but also a considerable quantity of sites untouched by human hand, produced automatically at the point of customer acquisition by domain registration or hosting service companies.
The biggest domain registries are large enough to be significant even in the context of the 17 million sites found by the June 2000 Web Server Survey. For example, register.com host some 1.4m domain names, the great majority of which are template sites. Network Solutions have a system hosted at Digex which hosts around 750,000 template sites for domain name holders. These two domain name registries presently account for about 12% of the hostnames found in the Web Server Survey.
Additionally, many companies will register in more than one domain. For example, Netcraft holds the netcraft.com, netcraft.net, and netcraft.co.uk domains, and currently uses three hostnames that will resolve to the Netcraft site. This means that there are nine names in the DNS that will resolve to the same content: www.netcraft.com, netcraft.com, ssl.netcraft.com, www.netcraft.net, netcraft.net, ssl.netcraft.net, www.netcraft.co.uk, netcraft.co.uk, and ssl.netcraft.co.uk.
Circa 1996-7, the number of distinct IP addresses would have been a good approximation to the number of real sites, since hosting companies would typically allocate an IP address to each site with distinct content, and multiple domain names could point to the IP address being used to serve the site content.
However, with the adoption of HTTP/1.1 virtual hosting, and the availability of load balancing technology, it is possible to reliably host a great many active sites on a single IP address, or on a small number of them. For example, FreeServe has around 150,000 sites hosted on four load balanced IP addresses. These are substantially all active sites produced by real people crafting HTML in FrontPage, Word, Netscape, text editors, etc.
This all raises the question of how many active sites there really are on the web.
Methodology
In the simplest terms, the front page is taken from each hostname appearing in the Web Server Survey, and compared with the front page of other hostnames on the same IP address. Only sites with distinct content will be counted, such that in the example above of the Netcraft site, the site content is counted once no matter how many domain and hostnames point at the site.
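As a rough illustration of this counting rule, the sketch below groups hostnames by IP address and counts each distinct front page once per address. The function name, the tuple layout, the fingerprint strings and the 192.0.2.x addresses are all hypothetical, chosen only for the example; they are not taken from the survey's own code.

```python
from collections import defaultdict

def count_active_sites(observations):
    """Count distinct front pages per IP address.

    `observations` is an iterable of (hostname, ip_address, fingerprint)
    tuples, where `fingerprint` is any value that is equal for pages
    judged to have the same content (hypothetical layout).
    """
    distinct_by_ip = defaultdict(set)
    for _hostname, ip_address, fingerprint in observations:
        distinct_by_ip[ip_address].add(fingerprint)
    # Each distinct fingerprint on an IP address counts as one active site.
    return sum(len(fingerprints) for fingerprints in distinct_by_ip.values())

# The nine Netcraft names, all assumed here to sit on one illustrative address.
names = [prefix + domain
         for domain in ("netcraft.com", "netcraft.net", "netcraft.co.uk")
         for prefix in ("www.", "", "ssl.")]
observations = [(name, "192.0.2.1", "netcraft-front-page") for name in names]

# Two further hypothetical sites with different content on another address.
observations.append(("a.example", "192.0.2.2", "page-a"))
observations.append(("b.example", "192.0.2.2", "page-b"))

print(len(names), count_active_sites(observations))
```

Under these assumptions the nine Netcraft names contribute a single active site, while the two sites with different content on the second address are both counted.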
On IP addresses with a very large number of sites, we would generate an unreasonably large number of requests with this approach. So, the decision was taken that on IP addresses with more than 100 sites, we would sample.
The formula used to determine this number is as follows:
N <= 100 ? N : int(100 + log10(N/100)*100);
which means that the number of pages we fetch grows logarithmically with the number of sites on that IP address, and is at least 100 whenever there are 100 or more sites on the IP address. This works out to 515 pages from futuresite.register.com’s 1,414,626 sites in June 2000.
Keeping the number of requests under 1,000 even in the most pathological case should ensure that the survey is not banned from sites, and that the results remain respectable and accurate. For the great majority of IP addresses, which have fewer than 100 sites, the survey is effectively a census, taking the front page from each site.
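The formula can be transcribed directly into Python to check the worked figure. This is only a restatement of the expression given above; the function name `pages_to_fetch` is our own.

```python
import math

def pages_to_fetch(n_sites: int) -> int:
    """Number of front pages to fetch from an IP address hosting n_sites sites."""
    if n_sites <= 100:
        # Small IP addresses are fetched in full.
        return n_sites
    # Above 100 sites the sample grows by 100 pages per factor of ten.
    return int(100 + math.log10(n_sites / 100) * 100)

for n in (50, 100, 10_000, 1_414_626):
    print(n, pages_to_fetch(n))
```

For the 1,414,626 sites at futuresite.register.com this gives 515, matching the figure quoted in the text.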
The comparison is made based on the HTML tag structure of the page, rather than the text of the whole page, as the typical scenario is that a domain registration or hosting company will use the same template tag structure for each customer, but vary some text on each site to reflect the domain name, the customer name, or the date registered, etc, which makes a strict comparison unsafe. An MD5 checksum is taken of the tag structure of each page, and this is used as the basis of the comparison, with only distinct MD5 signatures being counted.
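One way such a signature could be computed is sketched below, using Python's standard `html.parser` and `hashlib` modules. It assumes that "tag structure" means the ordered sequence of opening and closing tag names, with attributes and text ignored; the survey's exact definition is not given here, so treat that as an assumption, along with the class and function names.

```python
import hashlib
from html.parser import HTMLParser

class TagStructure(HTMLParser):
    """Records tag names in document order, discarding text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tags.append("</" + tag + ">")

def tag_signature(html: str) -> str:
    """Return the MD5 hex digest of the page's tag sequence."""
    parser = TagStructure()
    parser.feed(html)
    parser.close()
    return hashlib.md5("".join(parser.tags).encode("utf-8")).hexdigest()

# Two template pages that differ only in text, and one hand-built page.
template_a = ("<html><head><title>one.example</title></head>"
              "<body><h1>Coming soon</h1><p>Registered 1 June 2000</p></body></html>")
template_b = ("<html><head><title>two.example</title></head>"
              "<body><h1>Coming soon</h1><p>Registered 9 June 2000</p></body></html>")
hand_built = ("<html><head><title>My site</title></head>"
              "<body><h1>Welcome</h1><ul><li>News</li><li>Links</li></ul></body></html>")

print(tag_signature(template_a) == tag_signature(template_b))
print(tag_signature(template_a) == tag_signature(hand_built))
```

The two template pages share a signature despite their differing text, while the hand-built page, whose tag sequence differs, does not.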
Some companies like to use Framesets for presentation, even on inactive sites. So where the front page contains a <frame> tag, we follow the references to take the frame pages, and compare the tag structure of those in the same way as the initial front pages. Following these links is necessary otherwise there would be a risk that all front pages consisting of <frame> references would be incorrectly assessed as inactive.
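A minimal sketch of the frame-following step is shown below: it extracts the `src` of each `<frame>` tag and resolves it against the front page URL, giving the list of frame pages to fetch and compare. The class and function names and the example URL are hypothetical, and fetching the pages themselves is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class FrameSources(HTMLParser):
    """Collects absolute URLs referenced by <frame src=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(urljoin(self.base_url, src))

def frame_urls(base_url, html):
    """Return the frame page URLs referenced by a front page."""
    parser = FrameSources(base_url)
    parser.feed(html)
    parser.close()
    return parser.sources

front_page = ('<html><frameset cols="20%,80%">'
              '<FRAME SRC="menu.html"><frame src="/main/index.html">'
              '</frameset></html>')
print(frame_urls("http://www.example.com/", front_page))
```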
The reason for comparing tag structure rather than page text is that most inactive sites are automatically generated from a template, and whilst the page contents may differ between sites, the page structure remains constant.
Results
The first deployment of this methodology was in June 2000. The June 2000 “normal” by-hostname survey found a shade over 17 million sites. Nearly 10 million of these sites are removed from the survey by applying the active sites methodology, which found 7.5 million active sites on 3.4 million IP addresses.
Apache loses 6,205,474 sites, causing its share to drop around three percentage points from 62.53% to 59.56%. Netscape-Enterprise loses 957,629 sites, making its share drop from 6.74% to 2.61%. Microsoft-IIS loses 1,458,421 sites, so its share of the survey rises some six and a half percentage points from 20.36% to 26.84%.
[Charts: Web Server Count by Sites; Web Server Count by Active Sites]
Globally, the operating system share of active web sites in June 2000 is as follows:
[Charts: Operating System Count by Sites; Operating System Count by Active Sites]
Compared with the by-hostname survey, Linux drops by 6 percentage points and Solaris by 4, while NT gains 7. Other operating systems each gain a little. On active sites Linux and NT are very close, with Linux having slightly more than a one percentage point lead over the combined figure for NT4 and Windows 2000.
OS | Hosts | % of Hosts | Active Sites | % of Active Sites |
---|---|---|---|---|
Linux | 6,116,811 | 35.73 | 2,265,095 | 29.99 |
Microsoft | 3,644,187 | 21.32 | 2,222,841 | 28.32 |
Other | 3,802,268 | 21.24 | 1,873,525 | 23.59 |
Solaris | 3,484,135 | 20.35 | 1,233,494 | 16.33 |
unknown | 233,676 | 1.36 | 132,862 | 1.76 |
As expected, a sizeable chunk of the Linux drop is at register.com, which has around 1.3m inactive sites. Likewise for Solaris, Network Solutions accounts for somewhere over 800,000 inactive sites. Conversely, theplanet.net, which hosts FreeServe, the largest UK dial-up ISP with around 1.3m active subscribers, only falls by about 20%, retaining 136,000 of around 170,000 sites. This is a nice test case for the methodology, as FreeServe has a large number of active sites on just four IP addresses using HTTP/1.1 virtual hosting.
Of the 3.3 million IP addresses with active sites, the great majority, 3.2 million, have just a single active site on them. At the other end of the scale, the largest number of active sites on a single IP address is 286,620 at Webjump.
Essentially, the domain name registries disappear, and the cheap or free bulk hosters come to the fore, together with the larger colocation companies such as exodus.net who may provide connectivity for “downstream” bulk hosters. This is exactly as one would predict.
Regional Variations
Generally, most countries lose fewer sites as a result of applying the active sites methodology than the US. This is because the largest domain registries are hosted on US networks. Additionally, in many countries domain name registration is more expensive or more bureaucratic than in the US, which has somewhat restricted the use of multiple domain names by businesses. Germany and the United Kingdom are the largest hosters of sites external to the US, and hosting companies there use broadly similar technology to that employed by companies in the US.
Conclusion
Eliminating inactive hostnames, which either provide additional ways of accessing existing sites or are simply used as templates by domain registration companies, is a very useful derivative of the Web Server Survey. Nearly 10 million hostnames are removed from the analysis, giving a much clearer picture of the technology used to host actively developed web sites. This picture is still strongly influenced by hosting companies' choice of technology, but the domain name registries are largely eliminated using this approach. As expected, NT and Windows 2000 are much more fully represented in this analysis, but the success of Linux with the hosting companies is also very clear. Solaris is being pushed further upmarket by the combination of Linux and NT, and does well in specific high volume transaction environments such as internet brokerage.
Reports and Interactive Queries
Reports are provided showing server usage for the Internet as a whole, and for selected domains, with links to all the sites responding to the survey. A facility for you to check what server a particular site is running now is also available. The same form can be used to ensure that a particular site is included in future surveys. A directory of sites running in developer domains is also provided, while the sites discovered by the survey can be explored.
Fair Use, Copyright
Excerpts from this survey may be reproduced if Netcraft and the URL http://www.netcraft.com/survey/ are attributed.
Your comments and criticism are much appreciated.