Disclaimer
Location
Intro
General Info
Basic usage info
Future Plans
Contact info & Closing
THIS APPLICATION IS PROVIDED FREE OF CHARGE AND IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE. USE AT YOUR OWN RISK. YOU ASSUME FULL RESPONSIBILITY FOR THE USE OF THIS PRODUCT AND AGREE TO HOLD THE AUTHOR HARMLESS FROM ANY LIABILITIES THAT MAY ARISE FROM ITS USE.
The latest version can be obtained here.
With that disclaimer out of the way, I want to thank you for using my WebScraper application. If you have any questions or comments, feel free to contact me at siggi@gooeynet.net. I would also be honored if you would sign my guest book.
This application simplifies downloading user-defined file types from any web site. It comes with a fully functional browser built right in, so you can surf to a web site and have the application automatically download all photos embedded on that site, as well as all documents of specified types that are directly linked from it. A good example of where this comes in handy is a site with a lot of pictures, either embedded directly on the page or linked directly from it: this tool will download all those pictures in a few simple mouse clicks. Note: Picture album sites that wrap the big picture in HTML code will not work as well as those that link the pictures themselves directly to the thumbnail. This application was designed for sites that link directly to the file you want. Sites that use any sort of scripting to link the file will also not work. For this application to be able to scrape a site, you need to be able to right-click on the picture or link and select either "Save Picture" or "Save Target As". If that gives the expected results, this application should work for you.
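To illustrate the idea of "direct links only", here is a rough sketch in Python of what scraping a page for directly linked and embedded files amounts to. This is an illustration of the concept, not the application's actual code, and the extension list is just an example:

```python
from html.parser import HTMLParser

# Example extension list -- in the application this comes from the Options dialog.
EXTENSIONS = {"jpg", "jpeg", "gif", "png", "mpg", "avi"}

class LinkScraper(HTMLParser):
    """Collects direct links to files whose extension is in EXTENSIONS."""
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Linked items come from <a href=...>, embedded items from <img src=...>.
        url = attrs.get("href") if tag == "a" else attrs.get("src") if tag == "img" else None
        if url and url.rsplit(".", 1)[-1].lower() in EXTENSIONS:
            self.found.append(url)

page = '<a href="pic1.jpg">pic</a> <img src="thumb.gif"> <a href="page.html">no</a>'
scraper = LinkScraper()
scraper.feed(page)
print(scraper.found)  # ['pic1.jpg', 'thumb.gif']
```

Note how `page.html` is skipped: a link wrapped in an HTML page or generated by a script never appears as a direct file link, which is exactly why such sites cannot be scraped.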
I hope I designed this application to be fairly straightforward to use. If I did, most of the features should be intuitive to the average web surfer; if you have comments on how to improve the application, I'd be very interested in hearing them. First of all, this application has a fully functioning web browser. Simply surf to a site, and all items that match your current settings (per the Options dialog) are shown in a section below the browser. Select the ones you want downloaded and click "Add to Queue". You can keep surfing and adding items to the queue. At any time you can look at the queue by clicking "Show Queue", which opens another window listing all the items in the queue. This is also where you start the download process, by clicking "Start Download". You can also manage the queue by removing items and/or changing their order. While the items in the queue are being downloaded, you can go back to the main window and keep adding items; anything added while a download is in progress will be downloaded automatically. If the queue empties, the download stops, and any items added after that will not be downloaded until the queue is started again.
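The queue behavior described above (a running download drains the queue, and once the queue empties everything stops until restarted) can be sketched like this. This is a behavioral illustration only, not the application's implementation:

```python
from collections import deque

queue = deque(["a.jpg", "b.jpg"])  # items added via "Add to Queue"
completed = []
downloading = True

def pump():
    """Drain the queue; when it empties, downloading stops."""
    global downloading
    while downloading and queue:
        completed.append(queue.popleft())
    downloading = False  # the queue emptied, so the download stops

pump()
queue.append("c.jpg")   # added after the queue emptied...
print(completed)        # ['a.jpg', 'b.jpg']
print(list(queue))      # ['c.jpg'] -- waits until the queue is started again
```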


There are three top-level menus:
- File
- Open Configuration file
- Loads a selected configuration file that was previously saved using the Save option
- Save Configuration file
- Saves the current configuration, along with the history buffer and work queue, into an INI file that can later be loaded using the Open feature.
- Quit
- This simply exits the application and saves the current configuration, along with the history buffer and work queue, into the default INI file, which is automatically loaded at startup.
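Round-tripping settings through an INI file, as Open/Save/Quit do here, can be sketched with Python's configparser module. The section and key names below are made up for illustration; the application's actual INI layout is not documented here:

```python
import configparser
import io

config = configparser.ConfigParser()
# Hypothetical sections and keys standing in for settings and history.
config["Output"] = {"SavePath": r"C:\My Documents\My Pictures\WebScrape"}
config["History"] = {"Url1": "http://example.com/gallery"}

# Save, then load back -- the same round trip Save and Open perform.
buf = io.StringIO()
config.write(buf)

buf.seek(0)
loaded = configparser.ConfigParser()
loaded.read_file(buf)
print(loaded["Output"]["SavePath"])  # C:\My Documents\My Pictures\WebScrape
```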
- Edit
- Tools
- Options
- This brings up a dialog with three tabs.
- Output
- This is where you specify where the application should store the downloaded files

- Logging
- Here you specify different logging options

- Log Path
- This is where the logs are stored; it can be the same as or different from the output path
- Log File Name
- The name of the main log file. Events and errors are written to this file
- Log File Download statistics to a CSV file
- When checked, detailed statistics about each downloaded file are written to a comma-separated values (CSV) file, which can then be analyzed in Microsoft Excel or another spreadsheet application for download performance statistics. The CSV file is written to the same path as the main log file, with the same name but a .csv extension. Using the example in the figure above, the log file is C:\My Documents\My Pictures\WebScrape\WebScrape.log and the CSV file will be C:\My Documents\My Pictures\WebScrape\WebScrape.csv.
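A CSV file like this is also easy to process outside a spreadsheet. As an illustration only -- the columns below are hypothetical, since the application's actual CSV layout is not documented here -- per-file statistics might be written and analyzed like this:

```python
import csv
import io

# Hypothetical columns; the real application's CSV layout may differ.
fields = ["url", "filename", "bytes", "seconds"]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
writer.writerow({"url": "http://example.com/a.jpg", "filename": "a.jpg",
                 "bytes": 204800, "seconds": 1.6})

# Read the statistics back and compute throughput in KB/s.
buf.seek(0)
rows = list(csv.DictReader(buf))
kbps = int(rows[0]["bytes"]) / float(rows[0]["seconds"]) / 1024
print(round(kbps, 1))  # 125.0
```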
- Unattended mode, only log errors and resume
- When checked, no error messages are displayed; they are simply written to the log file, and the application attempts to recover from the error. When cleared, the error is first written to the log file and then displayed in a dialog box. Program execution (including downloads) is halted until the user deals with the error.
- Log each file downloaded
- When checked, the name of each file, including the URL, is written to the log each time a download attempt is made. Each time a file is successfully downloaded, that is also logged. If the check box is cleared, only errors and session-level events are written to the log file; no mention is made of what was downloaded or from where.
- Other
- Here you set various miscellaneous options

- File Extensions to find
- Enter a space-separated list of all file extensions you want to scrape. When a page is loaded into the browser portion, the application looks through all the hyperlinks in the document. If a link points to a file with an extension listed in this box, that link gets listed. The example above lists the most common video and picture extensions. For definitions, check out http://www.filext.com.
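Matching links against a space-separated extension list like this is a one-step lookup. A small sketch of the idea (the names and the example string are illustrative, not the application's own):

```python
# The box might contain a space-separated string like this:
setting = "jpg jpeg gif png mpg mpeg avi"

# Normalize it into a lowercase set for fast membership tests.
extensions = {ext.lower() for ext in setting.split()}

def matches(url: str) -> bool:
    """True if the URL's extension is one the user asked to scrape."""
    return url.rsplit(".", 1)[-1].lower() in extensions

print(matches("http://example.com/photo.JPG"))   # True (case-insensitive)
print(matches("http://example.com/index.html"))  # False
```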
- Startup page
- This describes how the program behaves on startup.
- Home Page
- This option tells the application to start by going to the home page defined in your default system browser
- Last Page visited
- This specifies that the application should start by loading the site at the top of the history buffer. This of course requires that there is something in the history buffer.
- Scrape
- Linked items
- When checked the application will look for files linked on each page. Each link will need to match the file extensions listed above.
- Embedded items
- This allows you to scrape pictures that are actually on a page, rather than linked from it. The pictures still need to match the extensions listed. If you're trying to scrape a picture-album type of site and only want the actual picture and not the thumbnail, you should clear this check box.
- Note: Picture album sites that wrap the big picture in HTML code will not work as well as those that link the pictures themselves directly to the thumbnail. This application was designed for sites that link directly to the file you want. Sites that wrap pictures in any sort of scripting will also not be scraped.
- If File Exists
- This specifies what to do if a filename conflict occurs.
- Skip
- The file will not be downloaded and will be removed from the queue. If you've chosen to have file downloads logged, there will be a log entry to this effect.
- Rename
- The application will add a sequence number to the file name to make it unique. You can tell when this occurred by comparing the URL and the file name in either the log file or the CSV file.
- Overwrite
- The application will not check whether there is a filename conflict when saving files. This is a dangerous option and should be used with care.
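The Rename behavior described above can be sketched as follows. This is a hypothetical stand-in for the application's logic: append an increasing sequence number before the extension until the name no longer conflicts.

```python
import os

def unique_name(path: str, existing: set) -> str:
    """Return path unchanged if it is free, otherwise add a sequence
    number before the extension until the name no longer conflicts."""
    if path not in existing:
        return path
    base, ext = os.path.splitext(path)
    n = 1
    while f"{base}{n}{ext}" in existing:
        n += 1
    return f"{base}{n}{ext}"

taken = {"pic.jpg", "pic1.jpg"}
print(unique_name("pic.jpg", taken))  # pic2.jpg
print(unique_name("new.jpg", taken))  # new.jpg
```

Comparing the URL (`pic.jpg`) with the saved name (`pic2.jpg`) is how you spot a rename in the log, as described above.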
- Do Not track History
- If checked, the application will not populate the URL drop-down box with the URLs you've visited. Clear it so that you can easily return to a previous URL just by selecting it from the URL drop-down box. The history is maintained between sessions via the configuration file.
- Ask before creating non-existent paths
- When checked, you will get the familiar "C:\foo\bar path was not found, do you want to create it" dialog box when either the save path or the log path does not exist. If this is cleared, the application will automatically create any path it cannot find, which can cause confusion if you mistype your path.
- Only Save files larger than X KB
- This is a simple filter that lets you filter out very small files, such as icons and other small images often found on most sites. You may have to experiment with the size that works best for you.
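The size filter works the way you would expect: anything at or below the threshold is discarded. A sketch of the idea, with illustrative values:

```python
MIN_KB = 10  # "Only Save files larger than X KB" with X = 10

# (name, size in bytes) pairs representing download candidates.
candidates = [("logo.gif", 2_048), ("icon.png", 512), ("photo.jpg", 350_000)]

# Keep only files strictly larger than the threshold.
kept = [name for name, size in candidates if size > MIN_KB * 1024]
print(kept)  # ['photo.jpg']
```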
- History
- This brings up a dialog showing you the complete history buffer and allows you to either clear it or remove specific items.
- About
- This is the standard dialog showing version and build info, along with copyright notice and contact info.
- Help
- This will bring up this file in the browser section.
Currently I have ideas for other features, like the ability to automatically crawl an entire site, but I have no timeline for when I might implement them. If you have ideas for additional functionality or other improvements, send them over and I might decide to implement them. If you find a bug, please send me details on how to reproduce it and I'll see what I can do about it.
Again, if you have any questions and/or comments, feel free to drop me an email at siggi@gooeynet.net. I'd love to know what you think of this application.
Sincerely,
Siggi Bjarnason
http://www.icecomputing.com/siggi.htm