Google bots are crawling in a new way

On the hunt for HTML forms...

By Stephen Shankland, 16 April 2008 08:55

NEWS

Google's search bots, which scour the web constantly for new pages, have begun a new, more active phase of their indexing jobs.

In a blog post last week, Jayant Madhavan and Alon Halevy of Google's crawling and indexing team said the company has begun an experiment in which its indexing software enters text into website forms to see what previously undiscovered pages appear.

The post said: "In the past few months, we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. This experiment is part of Google's broader effort to increase its coverage of the web. In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines."

The new Google indexing practice involves only "high quality" websites and doesn't run on sites that use 'robots.txt' files or other standard mechanisms to ward off indexing software.
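
As an illustration of how a crawler can honour a 'robots.txt' file before touching a form, here is a minimal sketch using Python's standard urllib.robotparser module. It is not Google's code: the may_crawl helper, the "ExampleBot" user agent, the example.com URLs and the sample rules are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper for illustration; a real crawler would download
    robots.txt from the site instead of receiving it as a string.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: every bot is asked to stay away from /search.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
"""

form_target = "http://example.com/search?q=cars"
other_page = "http://example.com/about"

print(may_crawl(EXAMPLE_ROBOTS, "ExampleBot", form_target))
print(may_crawl(EXAMPLE_ROBOTS, "ExampleBot", other_page))
```

With these sample rules, the form's target URL under /search is off limits while the /about page is not, so a well-behaved bot would leave the form alone.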

To decide what words to "type" into the forms, the indexing software samples from among words on the web page with the form, Google said.
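
Google has not published the details of that sampling, but the general idea can be sketched in a few lines of Python using only the standard library. Everything here is an assumption for illustration: the PageScanner class, the candidate_queries function, the three-letter minimum word length, the uniform random choice and the sample page.

```python
import random
import re
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects visible words and the names of text inputs on a page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.text_inputs = []
        self._skip = 0  # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "input":
            a = dict(attrs)
            # An <input> with no type attribute defaults to a text field.
            if (a.get("type") or "text").lower() == "text" and a.get("name"):
                self.text_inputs.append(a["name"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(re.findall(r"[A-Za-z]{3,}", data.lower()))


def candidate_queries(html, k=3, seed=0):
    """Map each text input name to up to k words sampled from the page."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    vocab = sorted(set(scanner.words))
    rng = random.Random(seed)  # seeded so repeated runs pick the same words
    picks = rng.sample(vocab, min(k, len(vocab)))
    return {name: picks for name in scanner.text_inputs}


PAGE = """
<html><body>
<h1>Used car listings</h1>
<p>Search thousands of sedans, trucks and hybrids.</p>
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="submit" value="Go">
</form>
<script>var tracking = "ignored";</script>
</body></html>
"""

print(candidate_queries(PAGE))
```

The sketch only proposes words for the text box named "q"; it does not submit anything. A real system would also need to decide which of the resulting pages are worth indexing.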

The technology appears to be related to Transformic, a company Google acquired, according to a blog post by Anand Rajaraman, who was involved with the technology earlier in his career while working for Halevy.

Comments

There are 3 comments.

  1. Richard

    Great, Google Tax Service:

    Can we now look forward to Google automatically filing our online tax forms, using its robot to fill in the figures?

    After all, Google must already know enough about our financial affairs?

  2. Karen Challinor

    We could see some interesting automatically generated replies on Silicon.com then.

  3. anonymous

    This is exactly what some hackers already do. One of the reasons for pages not being available directly is that they may contain some form of restricted information, or information that only a few people are allowed to access. Many of those pages may contain images and other information that comes together on the initial page and means something, i.e. a form of database structure in which the basic components are out of context until assembled. This can give rise to all sorts of interesting, and more likely unfortunate, consequences. Not only that, but underlying hidden details may well be out of date and basically misleading. In other instances some information may be in the process of being built and assembled ready for a future launch. With websites potentially being extremely large, it is often easier to upload material bits at a time, especially if different parts of a website need to be updated independently and at different times.

    Also, an active search like this means that web servers will be doing much more unproductive work than they were originally sized for.

    There isn't any real point in this exercise, other than trying to be clever. Many of us waste inordinate amounts of time fighting hacking attacks and this will just add to that.
