Forum URL: http://www.dombom.com/cgi-bin/dcforum/dcboard.cgi
Forum Name: The New MadBomber Marketing and SEO Forum
Topic ID: 44
#0, Web Crawler that Lists URL and Pictures....
Posted by sgtaw on Dec-15-06 at 11:24 AM
Hi all,

Need some help.

I am an affiliate of a merchant that sells hard goods. They don't have a datafeed, so I am trying to create one myself.

I have found several programs that will crawl a site and list every single URL, keyword, description, and title. Then spit it out into csv or whatever.

Here's the problem. I also want it to grab the image URLs on each page and associate them with that particular page. I know this will pick up multiple image URLs for each page, but I can clean that up easily.

If I can do this last step, then a full-blown datafeed (with pictures) is created!

Any suggestions?

Thanks,

Ed


#1, RE: Web Crawler that Lists URL and Pictures....
Posted by Kurt on Dec-15-06 at 11:50 AM
In response to message #0
Hey Ed,

Without seeing the exact data in question, it is hard for me to give a work-around.

Out of curiosity, have you tried to use an html2rss program to create a pheed from the pages?

You can download it here:
http://blogbomb.com/blogless.zip

There are a few other options that may work, depending on the original output.


#2, RE: Web Crawler that Lists URL and Pictures....
Posted by sgtaw on Dec-15-06 at 12:55 PM
In response to message #1
Merry Christmas Kurt!

Thanks for the quick reply....

I'm not sure that html2rss will work... I tried to change your tags for my purposes and got an error.

Here is what I am trying to do.

1. Let's take this site for example http://www.tennis-warehouse.com
I want to "crawl" this site grabbing all the product pages.

2. In grabbing those pages, I want to be able to grab various bits of information (this can change from site to site). The key items are: URL of the page, title, meta description, and (the problem child) the picture URL.

For instance, http://www.tennis-warehouse.com/descpage.html?PCODE=MTLX10.

In addition to the items I mentioned, I want to grab the picture of the tennis racket. Preferably, I would want to have the URL of where the picture is located.

3. I then want to be able to have all that information saved as a CSV so that I can upload it, for instance to BIB.
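
For anyone following along, the three steps above can be sketched roughly in Python. This is only a minimal sketch, not what any of the tools mentioned in this thread do: the names PageInfoParser, page_to_row, and rows_to_csv are made up for illustration, it assumes you already have each page's HTML (the crawling itself is left out), and it simply joins every image URL on a page with "|" so the list can be cleaned up afterwards.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageInfoParser(HTMLParser):
    """Collects the title, meta description, and image URLs from one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.title = ""
        self.description = ""
        self.image_urls = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "img" and attrs.get("src"):
            # Resolve relative image paths against the page URL.
            self.image_urls.append(urljoin(self.page_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_to_row(page_url, html):
    """Returns [url, title, description, image URLs joined with '|']."""
    parser = PageInfoParser(page_url)
    parser.feed(html)
    return [page_url, parser.title.strip(), parser.description,
            "|".join(parser.image_urls)]


def rows_to_csv(rows):
    """Builds CSV text with a header row from a list of page rows."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["url", "title", "description", "image_urls"])
    writer.writerows(rows)
    return out.getvalue()
```

Calling page_to_row once per product page and passing the collected rows to rows_to_csv gives text that can be saved as a .csv file. The column names and the "|" separator are arbitrary choices here and would need to match whatever the upload target expects.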

I played a tiny bit with instantrss. I found a unique tag in the webpages and replaced instantrss tags. But it got me an error.

I guess a workaround would be to download the site I am interested in, then do a search-and-replace using the tags that you have in instantrss, then upload the site to my server so that I can run instantrss.

Thanks Kurt!

Ed


#3, RE: Web Crawler that Lists URL and Pictures....
Posted by Kurt on Dec-15-06 at 01:02 PM
In response to message #2
Hey Ed,

It appears there is a pattern in this particular example that you may be able to use.

It seems the main graphic has the same name as the page name:
descpageRCDUNLOP-MF200P.html
-and-
MF200P.jpeg

Note that the data after the hyphen for the page name is the same as the graphic name.

I only checked this on 3 or 4 pages, but it held true each time. You'll need to check it some more.

If this holds up, you should be able to use the Tuelz to manipulate the data so that you can get the graphic URL using the page URL.
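
If the pattern does hold, the same manipulation could also be sketched in a few lines of Python. This is only an illustration of the idea, not how the Tuelz work: guess_image_url is a made-up name, the image folder is something you would have to find on the site yourself, and it assumes the graphic name is everything after the first hyphen in the page name with a .jpeg extension.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def guess_image_url(page_url, image_dir_url):
    """Guesses the graphic URL from the page name, or returns None.

    Assumes the pattern seen on a handful of pages: the part of the page
    name after the hyphen matches the graphic name. image_dir_url is the
    folder the graphics live in (must end with "/"); it is an assumption
    and has to be checked against the real site.
    """
    page_name = posixpath.basename(urlparse(page_url).path)
    stem, _ext = posixpath.splitext(page_name)
    if "-" not in stem:
        return None  # Page name does not fit the pattern.
    code = stem.partition("-")[2]
    return urljoin(image_dir_url, code + ".jpeg")
```

Pages whose names have no hyphen come back as None, so they are easy to list and check by hand.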

But don't spend 5x as much time and effort trying to find a workaround than it would take to do "by hand".