Crawling websites in PHP
tan
Posted: Friday, May 30, 2008 8:44:07 AM
Hi guys,

What syntax do we need to write a PHP page that crawls web pages over the internet?
rasheed
Posted: Friday, May 30, 2008 8:45:35 AM
Well, you can use fopen() with the allow_url_fopen setting enabled in php.ini. You can use stream contexts to supply additional parameters (headers, timeouts, and so on) for this.
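As a rough sketch of the fopen() plus stream context approach: the function names make_crawler_context() and fetch_page(), the user agent string, and the chosen option values are all made up for illustration. It assumes allow_url_fopen is on when a real URL is passed in.

```php
<?php
// Build a stream context carrying HTTP options for fopen().
// The option values here are illustrative assumptions, not recommendations.
function make_crawler_context(string $userAgent, float $timeoutSeconds)
{
    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'user_agent'      => $userAgent,
            'timeout'         => $timeoutSeconds, // read timeout in seconds
            'follow_location' => 0,               // do not follow redirects automatically
            'ignore_errors'   => true,            // still return the body on 4xx/5xx
        ],
    ]);
}

// Fetch a URL (or any path fopen() understands) and return its body,
// or null if it could not be opened or read.
function fetch_page(string $url, $context): ?string
{
    $handle = @fopen($url, 'r', false, $context);
    if ($handle === false) {
        return null;
    }
    $body = stream_get_contents($handle);
    fclose($handle);
    return $body === false ? null : $body;
}

// Hypothetical usage:
// $context = make_crawler_context('ExampleBot/0.1', 10.0);
// $html = fetch_page('http://example.com/', $context);
```

Turning follow_location off is a design choice for a crawler that wants to see each redirect itself; leave it at the default if you would rather let PHP follow them.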

Alternatively you can use curl (if enabled).

As a third option, you can use one of the other HTTP clients already made (there are at least two in PEAR, I've not tried either of them).

As a fourth option you can write your own HTTP implementation (which is what I ultimately did after it became obvious that fopen() wasn't flexible enough even with the stream context options).

You'll also need an HTML parser. Fortunately PHP 5 has one built in via libxml2: the DOMDocument::loadHTML() method will do what you want.
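For example, pulling the links out of a page with DOMDocument::loadHTML() might look roughly like this. The function name extract_links() is invented for the sketch, and it only looks at the href of a elements; resolving relative URLs is left out.

```php
<?php
// Return the href values of all <a> elements in an HTML string, in document order.
function extract_links(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Collect parse warnings internally instead of printing them;
    // real pages are full of markup errors.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Hypothetical usage:
// $links = extract_links($htmlFromSomePage);
```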

Making a web crawler is VERY involved and takes a huge amount of work. Real web pages contain a lot of errors, and you'll run into a lot of problems.

Issues I found:
- Multithreading efficiently
- Database locking / contention issues
- Startup/shutdown and remembering what pages are done
- Parsing robots.txt
- Handling broken things (for example, servers which return a 200 status even for pages which don't exist).
- Handling spam sites created just to piss robots off (believe me, there are a LOT of these)
- Gracefully handling errors / exceptions thrown from inside the crawler itself and deciding what to do with those URLs in the queue
- Handling encodings correctly, even when the page sends several conflicting messages (headers, meta) about which encoding it's in, or just plain lies
- Handling non-HTML pages
- Redirect handling
- Deciding what to spider next / prioritisation
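To give a flavour of one item from the list, here is a very simplified robots.txt parser. It is only a sketch: parse_robots() and is_allowed() are made-up names, it understands User-agent, Allow and Disallow lines only, it treats paths as plain prefixes (no * or $ wildcards), and it matches the agent by simple substring.

```php
<?php
// Parse robots.txt text and return the [type, path] rules that apply to $agent,
// falling back to the "*" group when no specific group matches.
function parse_robots(string $robotsTxt, string $agent): array
{
    $groups = [];        // lowercased agent token => list of [type, path]
    $currentAgents = []; // agents named by the group being read
    $inRules = false;    // true once the current group has seen a rule line

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if ($inRules) { // a User-agent line after rules starts a new group
                $currentAgents = [];
                $inRules = false;
            }
            $currentAgents[] = strtolower($value);
        } elseif ($field === 'allow' || $field === 'disallow') {
            $inRules = true;
            foreach ($currentAgents as $name) {
                $groups[$name][] = [$field, $value];
            }
        }
    }

    $agent = strtolower($agent);
    foreach ($groups as $name => $rules) {
        $name = (string) $name;
        if ($name !== '*' && $name !== '' && strpos($agent, $name) !== false) {
            return $rules;
        }
    }
    return $groups['*'] ?? [];
}

// Decide whether $path may be fetched: the longest matching prefix wins,
// and Allow wins a tie. No matching rule means the path is allowed.
function is_allowed(array $rules, string $path): bool
{
    $bestLength = -1;
    $allowed = true;
    foreach ($rules as [$type, $prefix]) {
        $length = strlen($prefix);
        if ($length === 0 || strncmp($path, $prefix, $length) !== 0) {
            continue; // empty value restricts nothing; non-matching prefix is ignored
        }
        if ($length > $bestLength || ($length === $bestLength && $type === 'allow')) {
            $bestLength = $length;
            $allowed = ($type === 'allow');
        }
    }
    return $allowed;
}
```

Even this much leaves out wildcards, Crawl-delay, sitemaps and the many malformed files found in the wild, which is part of why the list above is as long as it is.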

These are just a few of the issues I found when trying to do this.
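Two of the items above, remembering what pages are done and deciding what to spider next, can be illustrated with a tiny in-memory queue that is saved to a JSON file. This is a hypothetical sketch, not what I actually ran: the Frontier class and its file format are invented here, and it needs PHP 7.4 or later for typed properties.

```php
<?php
// A tiny crawl frontier: a priority queue of pending URLs plus a "seen" set,
// both of which can be saved to disk and loaded again after a restart.
class Frontier
{
    private array $seen = [];    // url => true for every URL ever queued
    private array $pending = []; // url => priority for URLs not yet handed out

    // Queue a URL unless it has been seen before. Returns true if it was added.
    public function add(string $url, int $priority = 0): bool
    {
        if (isset($this->seen[$url])) {
            return false;
        }
        $this->seen[$url] = true;
        $this->pending[$url] = $priority;
        return true;
    }

    // Hand out the pending URL with the highest priority, or null if none are left.
    public function next(): ?string
    {
        if ($this->pending === []) {
            return null;
        }
        arsort($this->pending); // highest priority first
        $url = array_key_first($this->pending);
        unset($this->pending[$url]);
        return (string) $url;
    }

    // Write the current state to a JSON file.
    public function save(string $path): void
    {
        $state = ['seen' => array_keys($this->seen), 'pending' => $this->pending];
        file_put_contents($path, json_encode($state));
    }

    // Rebuild a frontier from a file written by save().
    public static function load(string $path): Frontier
    {
        $state = json_decode(file_get_contents($path), true);
        $frontier = new Frontier();
        foreach ($state['seen'] as $url) {
            $frontier->seen[(string) $url] = true;
        }
        foreach ($state['pending'] as $url => $priority) {
            $frontier->pending[(string) $url] = (int) $priority;
        }
        return $frontier;
    }
}
```

Note that a URL handed out by next() is forgotten if the crawler dies before finishing it, and re-sorting on every call does not scale. A real crawler would keep this state in a database, which is where the locking and contention issues come from.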

My conclusion was that PHP isn't a very suitable language for an HTTP spider. It simply doesn't give you enough low-level control over most things (such as sockets, processes, threads, locking, high-performance db stuff).

But it did work and I spidered hundreds of thousands of web pages with it.