Webbots, Spiders, and Screen Scrapers

An excellent introduction to writing server-side scripts that interact with other people's web pages.

My Rating: 4.5 out of 5

This edition has been significantly updated from the first. While the book still contains about the same amount of material, several chapters from the first edition have been removed completely and several new chapters on different subjects have been added. The author comments in the introduction on the huge amount of feedback received for the first edition, so presumably much of the change is a result of that feedback. This should make the second edition even more useful than the first.

One of the new chapters is called "Advanced Parsing With Regular Expressions" and, rather interestingly, it includes a section on how regular expressions are often not the right tool to use for parsing. It seems that many readers of the first edition disagreed with the author's view that regular expressions can match patterns but are ineffective at determining context, and that since context is generally what you are looking for when scraping content, there are better ways to do it that don't involve regular expressions. The prior chapter in both editions covers exactly those alternatives, but the author has listened to the feedback and used this new chapter to show how to extract patterns with regular expressions where the context is of lesser importance.
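The distinction between matching a pattern and finding something in context can be sketched in a few lines. The example below is not taken from the book; it is a minimal illustration in Python using only the standard library, and the page fragment, the `price` class name, and the function names are all invented for the purpose. The regular expression finds every dollar amount on the page, while the small parser only collects the amount that appears inside the element marked as a price.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment, invented for this illustration.
PAGE = """
<div class="ad">Save $5.00 today!</div>
<div class="product">
  <span class="name">Widget</span>
  <span class="price">$19.99</span>
</div>
"""

# Pattern matching: finds every dollar amount, with no idea which one is the price.
AMOUNT = re.compile(r"\$\d+\.\d{2}")


def all_amounts(html):
    """Return every dollar amount in the text, regardless of where it appears."""
    return AMOUNT.findall(html)


class PriceParser(HTMLParser):
    """Collects text found inside <span class="price"> elements (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(data.strip())


def product_prices(html):
    """Return only the amounts that appear in the price element."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


if __name__ == "__main__":
    print(all_amounts(PAGE))     # the advert's amount is included too
    print(product_prices(PAGE))  # only the amount marked as a price
```

On this fragment the regular expression also picks up the amount in the advert, which is the kind of context problem the author describes. Where any dollar amount will do, the one-line pattern is the simpler choice.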

A large part of chapter 31 deals with the potential legal issues involved in running these types of scripts. The author makes it very clear that the information in this chapter is not legal advice, and that much of what is discussed relates to laws that apply in a specific country. This lets him give many examples of the sorts of things that could get you into trouble, and so should be avoided, without suggesting that these are the only such issues.

Overall, the book presents a well-balanced introduction to this topic. The many example applications covered in the "Projects" section of the book make a useful starting point for developing your own scripts. Simply pick one that does part of what you need and add the code to do the rest.

Perhaps the most useful feature of this book isn't in the book itself but is on the associated web site. The author has very kindly provided a selection of dummy web sites suitable to test scripts against without the risk of your script drawing the attention of site owners if it malfunctions and does something inappropriate. By using this service you can make sure that the script actually works properly before letting it loose on real web sites.
