Author Topic: Do you know the contents of the Google robot.txt file?  (Read 3346 times)

Offline polonus

  • Avast Überevangelist
  • Probably Bot
  • *****
  • Posts: 34065
  • malware fighter
Do you know the contents of the Google robot.txt file?
« on: March 22, 2009, 11:55:46 PM »
Hi malware fighters,

Here it is:
User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
Disallow: /nwshp
Allow: /news?btcid=
Disallow: /news?btcid=*&
Allow: /news?btaid=
Disallow: /news?btaid=*&
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /sorry/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /default
Disallow: /m?
Disallow: /m/?
Disallow: /m/ig
Disallow: /m/images?
Disallow: /m/lcb
Disallow: /m/news?
Disallow: /m/news/i?
Disallow: /m/setnewsprefs?
Disallow: /m/search?
Disallow: /m/trends
Disallow: /wml?
Disallow: /wml/?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/?
Disallow: /pda/search?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /hws
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Disallow: /answers/search?q=
Disallow: /local?
Disallow: /local_url
Disallow: /froogle?
Disallow: /products?
Disallow: /froogle_
Disallow: /product_
Disallow: /products_
Disallow: /print
Disallow: /books
Allow: /booksrightsholders
Disallow: /patents?
Disallow: /scholar?
Disallow: /complete
Disallow: /sponsoredlinks
Disallow: /videosearch?
Disallow: /videopreview?
Disallow: /videoprograminfo?
Disallow: /maps?
Disallow: /mapstt?
Disallow: /mapslt?
Disallow: /maps/stk/
Disallow: /maps/br?
Disallow: /mapabcpoi?
Disallow: /center
Disallow: /ie?
Disallow: /sms/demo?
Disallow: /katrina?
Disallow: /blogsearch?
Disallow: /blogsearch/
Disallow: /blogsearch_feeds
Disallow: /advanced_blog_search
Disallow: /reader/
Disallow: /uds/
Disallow: /chart?
Disallow: /transit?
Disallow: /mbd?
Disallow: /extern_js/
Disallow: /calendar/feeds/
Disallow: /calendar/ical/
Disallow: /cl2/feeds/
Disallow: /cl2/ical/
Disallow: /coop/directory
Disallow: /coop/manage
Disallow: /trends?
Disallow: /trends/music?
Disallow: /notebook/search?
Disallow: /music
Disallow: /musica
Disallow: /musicad
Disallow: /musicas
Disallow: /musicl
Disallow: /musics
Disallow: /musicsearch
Disallow: /musicsp
Disallow: /musiclp
Disallow: /browsersync
Disallow: /call
Disallow: /archivesearch?
Disallow: /archivesearch/url
Disallow: /archivesearch/advanced_search
Disallow: /base/search?
Disallow: /base/reportbadoffer
Disallow: /base/s2
Disallow: /urchin_test/
Disallow: /movies?
Disallow: /codesearch?
Disallow: /codesearch/feeds/search?
Disallow: /wapsearch?
Disallow: /safebrowsing
Allow: /safebrowsing/diagnostic
Disallow: /reviews/search?
Disallow: /orkut/albums
Disallow: /jsapi
Disallow: /views?
Disallow: /c/
Disallow: /cbk
Disallow: /recharge/dashboard/car
Disallow: /recharge/dashboard/static/
Disallow: /translate_c
Disallow: /translate_suggestion
Disallow: /s2/profiles/me
Allow: /s2/profiles
Disallow: /s2
Disallow: /transconsole/portal/
Disallow: /gcc/
Disallow: /aclk
Disallow: /cse?
Disallow: /tbproxy/
Disallow: /MerchantSearchBeta/
Disallow: /imesync/
Disallow: /websites?
Disallow: /shenghuo/search?
Disallow: /support/forum/search?
Disallow: /reviews/polls/
Disallow: /hosted/images/
Disallow: /hosted/life/
Disallow: /ppob/?
Disallow: /ppob?
Disallow: /ig/add?
Disallow: /adwordsresellers
Disallow: /accounts/o8
Allow: /accounts/o8/id
Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml

That's it,

pol
Cybersecurity is more of an attitude than anything else. Avast Evangelists.

Use NoScript, a limited user account and a virtual machine and be safe(r)!

Offline Lisandro

  • Avast team
  • Certainly Bot
  • *
  • Posts: 67183
Re: Do you know the contents of the Google robot.txt file?
« Reply #1 on: March 23, 2009, 12:17:44 AM »
What is this file?
What does it do?
Which browser does it affect?
The best things in life are free.

Offline polonus

  • Avast Überevangelist
  • Probably Bot
  • *****
  • Posts: 34065
  • malware fighter
Re: Do you know the contents of the Google robot.txt file?
« Reply #2 on: March 23, 2009, 12:31:57 AM »
Hi Tech,

The robots.txt file is a set of instructions for visiting robots (spiders) that index the content of a website's pages, in this case Google's. For those spiders that obey the file, it provides a map of what they can and cannot index. Some of these exclusions also affect ordinary visitors to Google: for instance, enter ? and you will not get a result, because it is excluded. And when you have anonymity set high, Google will not let you search at all, because it takes you for a robot,
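
To see how an obedient spider reads such a file, here is a minimal sketch using Python's standard urllib.robotparser on four rules copied from the listing above. The agent name "ExampleBot" and the helper is_allowed are made up for illustration, and Python's parser may not resolve Allow/Disallow conflicts the same way Google's own crawler does, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Four rules copied from the listing posted in this thread.
RULES = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /news
"""

# Parse the rules from memory; nothing is fetched over the network.
parser = RobotFileParser()
parser.parse(RULES.splitlines())


def is_allowed(path, agent="ExampleBot"):
    """Return True if `agent` (a hypothetical bot name) may fetch `path`."""
    return parser.can_fetch(agent, "http://www.google.com" + path)


for path in ("/searchhistory/", "/search", "/news", "/intl/en/about.html"):
    print(path, "allowed" if is_allowed(path) else "disallowed")
```

With these four rules, /searchhistory/ and the unlisted /intl/en/about.html come back allowed, while /search and /news come back disallowed. The file is only advice: a spider that ignores it is not stopped by it.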

polonus

Offline Lisandro

  • Avast team
  • Certainly Bot
  • *
  • Posts: 67183
Re: Do you know the contents of the Google robot.txt file?
« Reply #3 on: March 23, 2009, 10:55:28 PM »
Thanks Polonus, but I don't have a webpage to protect from robots ;)

Offline scythe944

  • Avast Evangelist
  • Massive Poster
  • ***
  • Posts: 2913
    • My Tech Blog
Re: Do you know the contents of the Google robot.txt file?
« Reply #4 on: March 26, 2009, 07:54:52 PM »
I know the purpose of the robots.txt file, but why did you post this one? What site did it come from, and why was it necessary to post?
For generic computer (not avast) problems, you can also visit my forum for help: http://www.jacobytech.net/forum

Offline polonus

  • Avast Überevangelist
  • Probably Bot
  • *****
  • Posts: 34065
  • malware fighter
Re: Do you know the contents of the Google robot.txt file?
« Reply #5 on: March 26, 2009, 10:58:27 PM »
Hi scythe944,

Sometimes during privacy-sensitive browser sessions, while analyzing specific code, it is nice to pose as a search bot. User Agent Switcher is a nice add-on for this, but you have to tweak its settings a bit to add the specific search bots, because some bots can go where users cannot. When making requests it is also handy to know, if you visit as a search bot (a non-malicious bot, of course), which exclusions Google sets in front of you.
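
Outside the browser, the same idea comes down to setting one request header. A minimal sketch with Python's standard urllib.request follows. The helper make_request and the example.com URL are placeholders I made up, and the User-Agent value is the well-known Googlebot string, used here only as an illustration.

```python
from urllib.request import Request

# The widely published Googlebot User-Agent string, used as an example value.
BOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def make_request(url, user_agent=BOT_UA):
    """Build, but do not send, a request that carries the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder URL; nothing goes over the network at this point.
req = make_request("http://www.example.com/")

# urllib stores header names capitalized, hence "User-agent" in the lookup.
print(req.get_header("User-agent"))
```

To actually send it you would pass req to urllib.request.urlopen. This only changes the header: a site that verifies crawlers by reverse DNS lookup will still see that the request does not come from Google.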

The same goes for users whom Google mistakes for automated tools: you are then blocked by Google and not even offered a captcha to prove you are human. This happens, for instance, when you have high anonymity (Tor + Privoxy + Stealth); then you have to go to Scroogle or another similar search page.
But there are other tools besides a browser. A nice tool I use is IntelliTamper, and there I use a custom-made weak CGI list to see what is on a particular web server,

I do browser sessions with TDIMonitor to see what is going on on the machine, and I also analyze web traffic with Wireshark. All the connections I have are monitored by a special monitor program that reads the host name, IP address, local port, remote port, service type, interface and connection state for every network interface I have.

Do things, get some education, use your imagination, tweak a little, learn to reverse-engineer a bit,
analyze some hex, write some code, be a better user and keep to open standards and free software.
I learned a lot from my Internet guru, F.RAVIA, and from a lot of others,

pol




« Last Edit: March 26, 2009, 11:01:07 PM by polonus »