Robot Rules Parser
M-Software.de - Robot Rules Parser online
Prompted by a small problem with my robots.txt, I wrote a small test program that lets me check the file. Since I did not want to build everything myself, I looked at the different ways existing tools process robots.txt files. Among them were the parsers in wget, htdig and nutch, which are all open source and therefore available as source code.
The purpose of robots.txt is to tell a crawler which pages of a domain it must not request. A typical robots.txt file looks like this:

# robots.txt for http://m-software.de/
User-agent: Bad Spider
Disallow: /

User-agent: *
Disallow: /intern1/
Disallow: /intern2/
Disallow: /rss.php

This robots.txt forbids all crawlers (*) from the directories intern1 and intern2 and from the file rss.php in the root directory. In addition, the crawler that goes by the name "Bad Spider" is barred from the root directory, and therefore from the entire site. "Bad Spider" is a placeholder that you should replace with the name of the spider that must not access the domain. To make sure the robots.txt contains no mistakes, you can test it with the small Java program loaded in the IFRAME.
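The rules described above can also be checked offline. The following is a minimal sketch in Python, not the Java program behind the IFRAME: it feeds the example file to the standard library's urllib.robotparser and asks whether a given agent may fetch a given URL. The helper names build_parser and is_allowed, the agent name "Googlebot" and the sample paths are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the text, with "Bad Spider" as the placeholder name.
ROBOTS_TXT = """\
# robots.txt for http://m-software.de/
User-agent: Bad Spider
Disallow: /

User-agent: *
Disallow: /intern1/
Disallow: /intern2/
Disallow: /rss.php
"""

BASE = "http://m-software.de"


def build_parser(text):
    """Parse robots.txt content given as a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser, agent, path):
    """Return True if `agent` may request BASE + path (hypothetical helper)."""
    return parser.can_fetch(agent, BASE + path)


parser = build_parser(ROBOTS_TXT)

# Collect a few checks: an ordinary crawler and the blocked "Bad Spider".
checks = {
    ("Googlebot", "/index.html"): is_allowed(parser, "Googlebot", "/index.html"),
    ("Googlebot", "/intern1/page.html"): is_allowed(parser, "Googlebot", "/intern1/page.html"),
    ("Googlebot", "/rss.php"): is_allowed(parser, "Googlebot", "/rss.php"),
    ("Bad Spider", "/index.html"): is_allowed(parser, "Bad Spider", "/index.html"),
}

for (agent, path), allowed in checks.items():
    print(f"{agent:12} {path:22} {'allowed' if allowed else 'forbidden'}")
```

Note that parsers differ in how they match agent names; urllib.robotparser, for instance, compares case-insensitively and accepts a substring match, so other tools may judge the same file differently.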
Have fun with it. Everyone is of course welcome to display the IFRAME on their own page:

<IFRAME SRC="http://service.m-software.de/robots/" WIDTH="467" HEIGHT="151" scrolling="no" frameborder="0"></IFRAME>

PS: Many spiders honor robots.txt, but you should not rely on it. Here is a list of suspicious robots.
User-Agent: ActiveAgent
User-Agent: Alexibot
User-Agent: Aqua_Products
User-Agent: AskJeeves
User-Agent: BackDoorBot
User-Agent: BackDoorBot 1.0
User-Agent: BackDoorBot/1.0
User-Agent: BackWeb
User-Agent: BecomeBot
User-Agent: Black Hole
User-Agent: BlackWidow
User-Agent: BlowFish
User-Agent: BlowFish 1.0
User-Agent: BlowFish/1.0
User-Agent: Bookmark search tool
User-Agent: BotALot
User-Agent: BotRightHere
User-Agent: BuiltBotTough
User-Agent: Bullseye
User-Agent: Bullseye/1.0
User-Agent: BunnySlippers
User-Agent: Cegbfeieh
User-Agent: Cegbfeieh
User-Agent: CheeseBot
User-Agent: CherryPicker
User-Agent: CherryPicker /1.0
User-Agent: CherryPicker 1.0
User-Agent: CherryPickerElite 1.0
User-Agent: CherryPickerElite/1.0
User-Agent: CherryPickerSE 1.0
User-Agent: CherryPickerSE/1.0
User-Agent: ChinaClaw
User-Agent: Collector
User-Agent: Copernic
User-Agent: Copier
User-Agent: CopyRightCheck
User-Agent: Crescent
User-Agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
User-Agent: Crescent Internet ToolPak HTTPOLE Control v.1.0
User-Agent: DISCo
User-Agent: DISCo Pump
User-Agent: DISCo Pump 3.1
User-Agent: DittoSpyder
User-Agent: Download Demon
User-Agent: Download Wonder
User-Agent: Downloader
User-Agent: Drip
User-Agent: EirGrabber
User-Agent: EmailCollector
User-Agent: EmailCollector 1.0
User-Agent: EmailSiphon
User-Agent: EmailWolf
User-Agent: EmailWolf 1.00
User-Agent: Enterprise_Search
User-Agent: Enterprise_Search/1.0
User-Agent: EroCrawler
User-Agent: Express WebPictures
User-Agent: ExtractorPro
User-Agent: EyeNetIE
User-Agent: FairAd Client
User-Agent: FileHound
User-Agent: Flaming AttackBot
User-Agent: FlashGet
User-Agent: Foobot
User-Agent: FreeFind
User-Agent: Gaisbot
User-Agent: GetRight
User-Agent: GetRight/4.2
User-Agent: GetSmart
User-Agent: GetWeb!
User-Agent: Go!Zilla
User-Agent: Go-Ahead-Got-It
User-Agent: Googlebot-Image
User-Agent: GrabNet
User-Agent: Grabber
User-Agent: Grafula
User-Agent: HLoader
User-Agent: HMView
User-Agent: HTTrack
User-Agent: Harvest
User-Agent: Harvest 1.5
User-Agent: Harvest/1.5
User-Agent: Hatena Antenna
User-Agent: Image Stripper
User-Agent: Image Sucker
User-Agent: Indy Library
User-Agent: InfoNaviRobot
User-Agent: InterGET
User-Agent: Internet Ninja
User-Agent: Iria
User-Agent: Iron33
User-Agent: Iron33/1.0.2
User-Agent: JOC
User-Agent: JOC Web Spider
User-Agent: Jeeves
User-Agent: JennyBot
User-Agent: JetCar
User-Agent: Jetbot
User-Agent: Jetbot/1.0
User-Agent: JustView
User-Agent: Kenjin Spider
User-Agent: Keyword Density
User-Agent: Keyword Density/0.9
User-Agent: LNSpiderguy
User-Agent: LexiBot
User-Agent: LinkScan
User-Agent: LinkScan/8.1a Unix
User-Agent: LinkWalker
User-Agent: LinkextractorPro
User-Agent: MIDown tool
User-Agent: MIIxpc
User-Agent: MIIxpc/4.2
User-Agent: MSIECrawler
User-Agent: Mag-Net
User-Agent: Magnet
User-Agent: Mass Downloader
User-Agent: Mata Hari
User-Agent: Memo
User-Agent: Microsoft URL Control
User-Agent: Microsoft URL Control - 5.01.4511
User-Agent: Microsoft URL Control - 6.00.8169
User-Agent: Mirror
User-Agent: Mister PiX
User-Agent: Mozilla
User-Agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 2000)
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 9
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 98)
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows ME)
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows XP)
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; AIRF)
User-Agent: NICErsPRO
User-Agent: NPBot
User-Agent: Navroad
User-Agent: NearSite
User-Agent: Net Vampire
User-Agent: NetAnts
User-Agent: NetMechanic
User-Agent: NetSpider
User-Agent: NetZIP
User-Agent: Ninja
User-Agent: Nutch
User-Agent: Octopus
User-Agent: Offline Explorer
User-Agent: Offline Navigator
User-Agent: OmniExplorer_Bot
User-Agent: Openbot
User-Agent: Openfind
User-Agent: Openfind
User-Agent: Openfind data gathere
User-Agent: Openfind data gatherer
User-Agent: Oracle Ultra Search
User-Agent: PageGrabber
User-Agent: Papa Foto
User-Agent: PerMan
User-Agent: ProPowerBot
User-Agent: ProPowerBot/2.14
User-Agent: ProWebWalker
User-Agent: Pump
User-Agent: Python-urllib
User-Agent: QueryN Metasearch
User-Agent: RMA
User-Agent: Radiation
User-Agent: Radiation Retriever
User-Agent: Radiation Retriever 1.1
User-Agent: ReGet
User-Agent: RealDownload
User-Agent: Reaper
User-Agent: Recorder
User-Agent: RepoMonkey
User-Agent: RepoMonkey Bait & Tackle/v1.01
User-Agent: Roverbot
User-Agent: Siphon
User-Agent: SiteSnagger
User-Agent: SmartDownload
User-Agent: Snake
User-Agent: SpaceBison
User-Agent: SpankBot
User-Agent: Stanford
User-Agent: Stanford Comp Sci
User-Agent: Sucker
User-Agent: SuperBot
User-Agent: SuperHTTP
User-Agent: Surfbot
User-Agent: Szukacz
User-Agent: Szukacz/1.4
User-Agent: Szukacz/1.4
User-Agent: Teleport
User-Agent: Teleport Pro
User-Agent: Teleport Pro/1.29.1590
User-Agent: Teleport Pro/1.29.1616
User-Agent: Teleport Pro/1.29.1632
User-Agent: Teleport Pro/1.29.1718
User-Agent: TeleportPro
User-Agent: Telesoft
User-Agent: Teoma
User-Agent: The Intraformant
User-Agent: TheNomad
User-Agent: TightTwatBot
User-Agent: Titan
User-Agent: True_Robot
User-Agent: True_Robot/1.0
User-Agent: URL Control
User-Agent: URL_Spider_Pro
User-Agent: URLy Warning
User-Agent: VCI
User-Agent: VCI WebViewer VCI WebViewer Win32
User-Agent: Vacuum
User-Agent: VoidEYE
User-Agent: WWW-Collector
User-Agent: WWW-Collector-E
User-Agent: WWWOFFLE
User-Agent: WX_mail
User-Agent: Web Image Collector
User-Agent: Web Sucker
User-Agent: WebAuto
User-Agent: WebBandit
User-Agent: WebBandit 2.1
User-Agent: WebBandit 3.50
User-Agent: WebBandit/3.50
User-Agent: WebCapture 2.0
User-Agent: WebCopier
User-Agent: WebCopier v.2.2
User-Agent: WebCopier v3.2a
User-Agent: WebEMailExtrac.
User-Agent: WebEMailExtractor 1.0B
User-Agent: WebEnhancer
User-Agent: WebFetch
User-Agent: WebGo IS
User-Agent: WebLeacher
User-Agent: WebReaper
User-Agent: WebSauger
User-Agent: WebStripper
User-Agent: WebVac
User-Agent: WebWhacker
User-Agent: WebZIP
User-Agent: WebZIP/4.21
User-Agent: WebZIP/5.0
User-Agent: WebZip
User-Agent: WebZip/4.0
User-Agent: WebmasterWorld
User-Agent: WebmasterWorld Extractor
User-Agent: WebmasterWorldForumBot
User-Agent: Website
User-Agent: Website Quester
User-Agent: Website eXtractor
User-Agent: Webster
User-Agent: Webster Pro
User-Agent: Wget
User-Agent: Wget/1.5.3
User-Agent: Wget/1.6
User-Agent: Whacker
User-Agent: WhoWhere
User-Agent: Widow
User-Agent: Xaldon
User-Agent: Xaldon/WebSpider
User-Agent: Xenu's
User-Agent: Xenu's Link Sleuth 1.1c
User-Agent: Zeus
User-Agent: Zeus 32297 Webster Pro V2.9 Win32
User-Agent: Zeus Link Scout
User-Agent: aconon Index
User-Agent: asterias
User-Agent: autoemailspider
User-Agent: b2w
User-Agent: b2w 0.1
User-Agent: b2w/0.1
User-Agent: cosmos
User-Agent: dloader(naverrobot)/1.0
User-Agent: dumbot
User-Agent: eCatch
User-Agent: emailcollector
User-Agent: es
User-Agent: gotit
User-Agent: grub
User-Agent: grub-client
User-Agent: hloader
User-Agent: httplib
User-Agent: humanlinks
User-Agent: ia_archiver
User-Agent: ia_archiver/1.6
User-Agent: larbin
User-Agent: lftp
User-Agent: libWeb
User-Agent: libWeb/clsHTTP
User-Agent: likse
User-Agent: looksmart
User-Agent: lwp-trivial
User-Agent: lwp-trivial/1.34
User-Agent: moget
User-Agent: moget/2.1
User-Agent: mozilla
User-Agent: mozilla/3
User-Agent: mozilla/4
User-Agent: mozilla/5
User-Agent: naver
User-Agent: pavuk
User-Agent: pcBrowser
User-Agent: psbot
User-Agent: scooter
User-Agent: searchpreview
User-Agent: sootle
User-Agent: spanner
User-Agent: suzuran
User-Agent: tAkeOut
User-Agent: toCrawl/UrlDispatcher
User-Agent: turingos
User-Agent: webbandit 4.00.0
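
The list above consists of User-agent lines only. Assuming the intent is to shut these robots out of the whole site (the page does not say so explicitly), the names can be collected into a single group that ends with Disallow: /. The sketch below builds such a group for a three-name sample from the list and verifies it with Python's urllib.robotparser; the function name blocking_group and the comparison agent "Googlebot" are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocking_group(agents):
    """Build one robots.txt group that denies the whole site to `agents`.

    Hypothetical helper: several User-agent lines followed by one rule
    form a single group, so the rule applies to every listed name.
    """
    lines = [f"User-agent: {name}" for name in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


# A small sample taken from the list of suspicious robots.
SUSPICIOUS = ["EmailCollector", "HTTrack", "WebZIP"]

robots_txt = blocking_group(SUSPICIOUS)
print(robots_txt)

# Verify the generated text with the standard-library parser.
checker = RobotFileParser()
checker.parse(robots_txt.splitlines())

for agent in SUSPICIOUS + ["Googlebot"]:
    verdict = "allowed" if checker.can_fetch(agent, "http://m-software.de/") else "forbidden"
    print(f"{agent}: {verdict}")
```

As the PS points out, this only affects robots that choose to honor robots.txt; it is a request, not an access control.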