Rank: Advanced Member Groups: Member
Joined: 1/31/2008 Posts: 63 Points: 189 Location: Pakistan
|
Hi guys,
what syntax do we need to write a PHP crawler that fetches web pages over the internet?
|
Rank: Advanced Member Groups: Member
Joined: 1/31/2008 Posts: 42 Points: 126 Location: UK
|
Well, you can use fopen() with allow_url_fopen enabled in php.ini. You can use stream contexts (see stream_context_create()) to supply additional parameters for this, such as the user agent, timeout and redirect behaviour.
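To give an idea of the stream context route, here is a minimal sketch. The function names make_crawler_context() and fetch_page() and the user agent string are made up for illustration; the 'http' option keys are the standard HTTP context options. It assumes allow_url_fopen is on.

```php
<?php
// Hypothetical helper: builds a stream context carrying HTTP options.
function make_crawler_context(string $userAgent, float $timeoutSeconds)
{
    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'user_agent'      => $userAgent,
            'timeout'         => $timeoutSeconds, // read timeout in seconds
            'follow_location' => 1,               // follow Location redirects
            'max_redirects'   => 5,
            'ignore_errors'   => true,            // return the body even on 4xx/5xx
        ],
    ]);
}

// Hypothetical helper: returns the response body, or null if the fetch failed.
function fetch_page(string $url, $context): ?string
{
    // @ suppresses the warning PHP raises on failure; we report null instead.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

You would call it as fetch_page('http://example.com/', make_crawler_context('ExampleBot/0.1', 10.0)). After an HTTP fetch, PHP fills the local variable $http_response_header with the raw response headers if you need the status line.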
Alternatively you can use the curl extension (if it is enabled in your PHP build).
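A rough sketch of the curl route, assuming the curl extension is loaded. The function name curl_fetch(), the shape of the returned array and the default user agent are my own choices for illustration, not anything standard.

```php
<?php
// Hypothetical helper: fetches one URL and returns the details a crawler usually needs.
function curl_fetch(string $url, string $userAgent = 'ExampleBot/0.1'): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects...
        CURLOPT_MAXREDIRS      => 5,    // ...but not forever
        CURLOPT_CONNECTTIMEOUT => 5,    // seconds allowed for connecting
        CURLOPT_TIMEOUT        => 15,   // seconds allowed for the whole transfer
        CURLOPT_USERAGENT      => $userAgent,
    ]);

    $body = curl_exec($ch);

    // The handle is released when $ch goes out of scope.
    return [
        'ok'           => $body !== false,
        'status'       => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'final_url'    => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'content_type' => curl_getinfo($ch, CURLINFO_CONTENT_TYPE),
        'body'         => $body === false ? '' : $body,
        'error'        => curl_error($ch),
    ];
}
```

The 'final_url' and 'content_type' entries are useful for the redirect handling and non-HTML page problems mentioned further down.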
As a third option, you can use one of the other HTTP clients already made (there are at least two in PEAR, I've not tried either of them).
As a fourth option you can write your own HTTP implementation (which is what I ultimately did after it became obvious that fopen() wasn't flexible enough even with the stream context options).
You'll also need an HTML parser. Fortunately PHP5 has one built in via libxml2: the DOMDocument::loadHTML function will do what you want.
Making a web crawler is VERY involved and takes a huge amount of work. Real web pages contain a lot of errors and you'll encounter a lot of problems.
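As a small sketch of the parsing step, this pulls the href values out of a page with DOMDocument. The function name extract_links() is made up for the example. It returns the hrefs exactly as written in the page, so relative URLs still need resolving against the page URL before you queue them.

```php
<?php
// Hypothetical helper: returns the href of every <a> element in the markup.
function extract_links(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();

    // Real pages are full of markup errors; collect them silently
    // instead of letting libxml emit a warning for each one.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}
```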
Issues I found:
- Multithreading efficiently
- Database locking / contention issues
- Startup/shutdown and remembering what pages are done
- Parsing robots.txt
- Handling broken things (for example, servers which return a 200 status even for pages which don't exist)
- Handling SPAM sites created just to piss robots off (believe me, there are a LOT of these)
- Gracefully handling errors / exceptions thrown from inside the crawler itself and deciding what to do with those URLs in the queue
- Handling encodings correctly, even when the page has several conflicting messages (headers, meta) about which encoding it's in or just plain lies
- Handling non-HTML pages
- Redirect handling
- Deciding what to spider next / prioritisation
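To show what just one of these involves, here is a deliberately simplified robots.txt sketch. It only reads Disallow lines from groups that apply to "User-agent: *" and does plain prefix matching; Allow rules, wildcards and per-bot groups are ignored, so treat it as a starting point rather than a compliant parser. Both function names are made up for the example.

```php
<?php
// Hypothetical helper: Disallow path prefixes that apply to all agents.
function robots_disallowed_prefixes(string $robotsTxt): array
{
    $prefixes     = [];
    $groupApplies = false; // does the current group include "User-agent: *"?
    $inAgentLines = false; // still reading consecutive User-agent lines?

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$inAgentLines) {
                $groupApplies = false; // a new group starts here
            }
            $inAgentLines = true;
            if ($value === '*') {
                $groupApplies = true;
            }
        } else {
            $inAgentLines = false;
            // An empty Disallow value means "nothing is disallowed".
            if ($field === 'disallow' && $groupApplies && $value !== '') {
                $prefixes[] = $value;
            }
        }
    }
    return $prefixes;
}

// Hypothetical helper: true if no disallowed prefix matches the start of $path.
function robots_allows(string $path, array $prefixes): bool
{
    foreach ($prefixes as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return false;
        }
    }
    return true;
}
```

You would fetch /robots.txt once per host, cache the prefixes, and call robots_allows() on the path of each URL before fetching it.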
These are just a few of the issues I found when trying to do this.
My conclusion was that PHP isn't a very suitable language for an HTTP spider: it simply doesn't give you enough low-level control over most things (such as sockets, processes, threads, locking and high-performance database access).
But it did work and I spidered hundreds of thousands of web pages with it.
|