Fun with Web Spiders II

Markus Stumpf
Friday, January 29, 2010
Day in, day out

The robots.txt file in the root directory of a web server (e.g. http://www.example.com/robots.txt) gives you rudimentary control over web spiders. If you don't want these crawlers to walk a particular area of the web server, you write entries of the form:

User-Agent: *
Disallow: /g2/main.php/tag/
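
As a quick sanity check, a rule like this can be tested locally before relying on crawlers to interpret it. The following is a minimal sketch using Python's standard urllib.robotparser; the helper name is_allowed, the user agent string "DotBot" and the example.com URLs are only illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# The rule from above, written as a two-line record.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /g2/main.php/tag/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url according to robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs on the example host used in the text.
    for url in (
        "http://www.example.com/g2/main.php/tag/banana",
        "http://www.example.com/index.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "DotBot", url))
```

This only shows how Python's parser reads the file; other crawlers may interpret edge cases differently.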

It would be nice if the crawlers actually honored that. Some of them don't, even though they do fetch /robots.txt:

20100128 16:18:22	dotnetdotcom.org	GET	200	515	/robots.txt
20100128 17:14:55	dotnetdotcom.org	GET	200	20193	/g2/main.php/tag/steckerlfisch
...
20100128 20:15:30	dotnetdotcom.org	GET	200	515	/robots.txt
20100128 21:06:41	dotnetdotcom.org	GET	200	30934	/g2/main.php/tag/brugmansia
20100128 21:06:45	dotnetdotcom.org	GET	200	16502	/g2/main.php/tag/butterfly
20100128 21:06:59	dotnetdotcom.org	GET	200	16423	/g2/main.php/tag/eucalyptus
20100128 21:07:25	dotnetdotcom.org	GET	200	57129	/g2/main.php/tag/wallersdorf
...
20100129 00:02:40	dotnetdotcom.org	GET	200	515	/robots.txt
20100129 00:51:20	dotnetdotcom.org	GET	200	16484	/g2/main.php/tag/24indigo
20100129 00:52:04	dotnetdotcom.org	GET	200	25027	/g2/main.php/tag/siegestor
...
20100129 03:43:11	dotnetdotcom.org	GET	200	515	/robots.txt
20100129 04:31:11	dotnetdotcom.org	GET	200	16483	/g2/main.php/tag/bambus
20100129 04:31:15	dotnetdotcom.org	GET	200	16382	/g2/main.php/tag/banana
20100129 04:31:19	dotnetdotcom.org	GET	200	16413	/g2/main.php/tag/buddha
20100129 04:31:23	dotnetdotcom.org	GET	200	56972	/g2/main.php/tag/canico
20100129 08:11:38	dotnetdotcom.org	GET	200	55245	/g2/main.php/tag/winter
...
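
Spotting such requests by eye gets tedious, so here is a rough sketch of how a log in this shape could be filtered automatically. It assumes the six tab-separated columns seen above (timestamp, host, method, status, size, path); the field names and the helper names parse_line and violations are my own labels, not part of any logging tool.

```python
from typing import Iterable, List, NamedTuple, Optional


class Hit(NamedTuple):
    timestamp: str
    host: str
    method: str
    status: int
    size: int
    path: str


def parse_line(line: str) -> Optional[Hit]:
    """Parse one tab-separated log line; return None for anything else (e.g. '...')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return None
    timestamp, host, method, status, size, path = fields
    try:
        return Hit(timestamp, host, method, int(status), int(size), path)
    except ValueError:
        return None


def violations(lines: Iterable[str], disallowed_prefix: str) -> List[Hit]:
    """Return all hits whose path starts with a prefix that robots.txt disallows."""
    hits = (parse_line(line) for line in lines)
    return [h for h in hits if h is not None and h.path.startswith(disallowed_prefix)]


SAMPLE = [
    "20100128 16:18:22\tdotnetdotcom.org\tGET\t200\t515\t/robots.txt",
    "20100128 17:14:55\tdotnetdotcom.org\tGET\t200\t20193\t/g2/main.php/tag/steckerlfisch",
    "...",
]

if __name__ == "__main__":
    for hit in violations(SAMPLE, "/g2/main.php/tag/"):
        print(hit.timestamp, hit.host, hit.path)
```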

[UPDATE] I have since received a reply. dotnetdotcom.org does adhere to the Robot Exclusion Standard. The problem was that I had several lines with "User-Agent: *" in my robots.txt. dotnetdotcom.org interprets that as an error. I have changed the file accordingly.
Further research then led me to B.4.1 Search robots: The robots.txt file. It spells this out more precisely (which was news to me):

There must be exactly one "User-agent" field per record. The robot should be liberal in interpreting this field. A case-insensitive substring match of the name without version information is recommended.
If the value is "*", the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

[/UPDATE]
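
A check for this particular mistake is easy to sketch. The snippet below simply counts "User-agent: *" lines in a robots.txt text; anything above one would violate the rule quoted above. The function name wildcard_record_count and both sample files are hypothetical (my original robots.txt is not reproduced here), and the /private/ path is made up for illustration.

```python
def wildcard_record_count(robots_txt: str) -> int:
    """Count 'User-agent: *' lines, ignoring comments and field-name case."""
    count = 0
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "user-agent" and value.strip() == "*":
            count += 1
    return count


# Hypothetical file with two wildcard records (not allowed).
BAD = """\
User-Agent: *
Disallow: /g2/main.php/tag/

User-Agent: *
Disallow: /private/
"""

# The same rules merged into a single wildcard record.
GOOD = """\
User-Agent: *
Disallow: /g2/main.php/tag/
Disallow: /private/
"""

if __name__ == "__main__":
    print(wildcard_record_count(BAD))
    print(wildcard_record_count(GOOD))
```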

In cases like this, a block in the web server configuration is the way to go. In the case of

Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)

and an Apache web server, that would be an entry like

# dotnetdotcom.org crawler
Deny from 208.115.111.240/28
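
To see which addresses such a /28 actually covers, the CIDR block can be expanded with Python's standard ipaddress module. This is only an illustrative sketch; the helper name is_blocked and the two test addresses are made up.

```python
import ipaddress

# The network from the Deny rule above.
BLOCKED = ipaddress.ip_network("208.115.111.240/28")


def is_blocked(addr: str) -> bool:
    """Return True if addr falls inside the denied network."""
    return ipaddress.ip_address(addr) in BLOCKED


if __name__ == "__main__":
    # First address, last address and size of the block.
    print(BLOCKED[0], BLOCKED[-1], BLOCKED.num_addresses)
    print(is_blocked("208.115.111.250"))  # inside the block
    print(is_blocked("208.115.111.239"))  # just below the block
```

A /28 leaves four host bits, so the rule covers the 16 addresses from 208.115.111.240 to 208.115.111.255.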

Naturally, an e-mail to the operators produced neither a reply nor a change within 36 hours. [UPDATE] Shortly afterwards, however, it did. [/UPDATE]
