WebCopy v0.98b7 96/06/08

Copyright 1994, 1995, 1996 by Víctor Parada (vparada@inf.utfsm.cl).

Copies files (recursively) via the HTTP protocol.

Description:

WebCopy is a Perl program that retrieves the URL specified on a Unix-like command line. It can also recursively retrieve any file that an HTML file references, i.e. inlined images and/or anchors, if specified with an option.

It can be used as a "mirror" program to retrieve a tree of documents from a remote site, and put them on-line immediately through the local server.

By default, only the document pointed to by the URL in the command line is retrieved. Many switches can be specified in the command line, and each option enables one type of reference to follow.

To avoid endless recursion, only files at one site can be retrieved with one command. WebCopy never follows links to files that are not on the same host, port number and protocol (only HTTP/1.x is supported) as the first document retrieved. A list of discarded URLs is logged for future reference in a file called "W.log".

This program does not comply with the Robot Exclusion Standard, since it retrieves little more than what the user specifies in the command line.

The user must know what kind of server and documents they want to access. WebCopy does everything it knows to stop at CGI-generated files (virtual documents).

What's New since v0.97b2:

Usage:

webcopy [options] http://host:port/path/file [http://proxy:port]

Options (can be combined):

-o
output through stdout.
You can redirect the output to another filename or pipe it to a program. Use it with the -g and -s options. You cannot recurse HTML files or use the -v or -q options in this mode.
-v
operates in verbose mode.
Displays every URL to fetch. -vv is "very verbose" and outputs every header line the server sends with the file.
-q
query each URL to transfer.
Use it to select the files to transfer. Enter 'n' to skip a file, 'y' to transfer it, 'a' to transfer all the remaining files, and 'q' to quit immediately. If you don't answer 'y' to the first file (the one specified in the command line), no recursion is made.
-s
do not log in 'W.log'.
This file is always stored in the root of the working directory unless this option is specified. It can be parsed to get a list of every file transferred (or NOT transferred).
Warning: If you run multiple copies of WebCopy at the same time in the same working directory, you'll get one log file with all the messages in it, unless you specify this option in all but one of the running copies of WebCopy.
-tdelay
set 'delay' seconds between transfers.
This option changes the default 15-second delay before every connection. The delay helps avoid server overload.
-wpath
set working directory to 'path'.
WebCopy stores files in the current working directory. Use this option to force WebCopy to use another directory.
-xfile
set default index to 'file'.
When a directory index is required, this is the filename that is used to store the output. Defaults to index.html.
-zfile
post 'file', or the query string if omitted.
You can send URL-encoded form data to a CGI script using the POST method. If the filename is omitted, the data is taken from the query string specified in the URL after a "?". The data in the file must be in URL-encoded format, and spaces are suppressed.
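As a sketch, a data file for this option might be built like this (the field names "user" and "msg" and the CGI path are made-up examples, not anything WebCopy defines):

```shell
# Create a URL-encoded data file for a hypothetical two-field form;
# spaces are encoded as '+', other special characters as %XX escapes.
printf 'user=aladdin&msg=open+sesame' > post.dat
```

It could then be posted with something like: webcopy -so -zpost.dat http://www.host/cgi-bin/proc > result.html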
-yuserid:password
Sends 'userid:password' to retrieve authenticated files.
If the userid and password are omitted, WebCopy asks for them.
If WebCopy finds a password-protected file during a recursive retrieval, or this information is no longer valid, it will ask for it again, but only if option -v was also specified in the command line.
-Yuserid:password
Same as -y except that an older authentication method is used. Try this if the other one does not work.
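For reference, the usual scheme here is HTTP Basic authentication, which sends the userid:password pair base64-encoded in an Authorization header. Assuming a system with the base64 utility, the pair "Aladdin:open sesame" (the classic example from the HTTP specification) encodes as:

```shell
# Encode "userid:password" the way HTTP Basic authentication does.
printf 'Aladdin:open sesame' | base64
# → QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```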
-kext1:ext2:ext3:...
Keep only extensions present in the specified list.
Forces WebCopy to ignore file references without a specified extension. This does not work for URLs with a trailing slash '/' or without a filename extension.
-Kext1:ext2:ext3:...
Kill extensions present in the specified list.
Forces WebCopy to ignore file references with a specified extension. This does not work for URLs with a trailing slash '/' or without a filename extension.
-rdepth
recurse HTML documents to the given depth or the whole tree if this number is not specified.
Same as -il.
Retrieves the files referenced in <A>, <FRAME>, and <AREA> tags.
Warning: Never leave WebCopy unattended if you don't know what you are recursively retrieving.
-i
include inlined images, backgrounds and sounds.
Retrieves the files referenced in <IMG>, <FIG>, <BODY>, <TABLE>, and <BGSOUND> tags.
-l
follow hypertext links.
Recurse through hypertext references in .html documents.
If a maximum depth needs to be specified, use -r instead.
Warning: Never leave WebCopy unattended if you don't know what you are recursively retrieving.
-m
allow imagemaps.
This option makes WebCopy parse a remote .map file specified as a URL in the command line. Only NCSA- and CERN-compatible formats are scanned. To know which URL you must give, you have to guess the location of that file on the server; sometimes you can remove "/cgi-bin/imagemap" from the URL found in the page which contains the map. WebCopy does not do this automagically, because some HTTP servers (like Spinner) recognize the .map extension and run the CGI over that file, using the given coordinates instead of transferring the file.
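As a sketch, an NCSA-style .map file consists of lines of the form "method URL coordinates"; the URLs and coordinates below are made up for illustration:

```
default http://www.host/index.html
rect http://www.host/docs.html 0,0 99,49
circle http://www.host/news.html 150,25 170,25
```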
-c
allow links to CGI scripts.
By default, WebCopy discards references that seem to be a CGI script (e.g. /cgi-bin/ in the path). Use this option if you want to retrieve the output of a CGI script. If the base path is other than the current one, you'll also need the -paf options.
-a
allow absolute references to the same host.
References like /path/to/file.html, where the path is the current one (the one that was specified in the command line), are not rejected when this option is specified. If other paths are required, also use option -p.
-f
allow full URL references to the same host.
Complete http: URLs are accepted only if this option is specified in the command line and the host and port remain the same as the current ones, but they are still rejected unless option -p is also specified.
-p
allow paths other than current.
References like /images/some.gif, where the path is not the current one, are accepted. Use this option to allow references to CGI scripts. To keep the same document structure as the server and to avoid document name collisions, option -d is recommended.
Warning: This option can cause WebCopy to retrieve the whole data from a server if it finds a reference to the server root in some document while using recursion. Never leave WebCopy unattended if you don't know what you are recursively retrieving.
-d
keep directory path in URL for local file.
The default behaviour of WebCopy is to make the working directory the equivalent of the document directory specified in the command line's URL. With this option, WebCopy makes the working directory correspond to the root directory of the server, so directories in the path are also created in the working directory. If you want to specify this option after transferring some documents, you'll have to create the subdirectories yourself and move the already retrieved files from the working directory into them, or you will get duplicated files.
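For example, assuming the command line URL is http://www.host/docs/page.html and the working directory is ./mirror (both made up for illustration), the retrieved file would land as:

```
without -d:  ./mirror/page.html
with    -d:  ./mirror/docs/page.html
```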
-u
use local copy of file if exists.
Before making a request to a server, WebCopy checks for the file in the working directory and sends the file's information to the server; the new version is retrieved only if the file has changed since the last access. This option forces WebCopy to use the local copy of the file if it exists, without checking whether the file has changed on the server.
-g
get a new copy of file even if exists.
Before making a request to a server, WebCopy checks for the file in the working directory and sends the file's information to the server; the new version is retrieved only if the file has changed since the last access. This option forces WebCopy to ignore the local copy of the file if it exists and retrieve the one that resides on the server. This is useful if you are using the -o option and redirecting the output to a file with the same name. It may also force a PROXY or a caching server to refresh the file via the no-cache pragma.
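As a sketch, the check described for -u and -g corresponds to a conditional HTTP request; with -g the no-cache pragma is added (the path and date below are illustrative, not literal WebCopy output):

```
GET /path/page.html HTTP/1.0
If-Modified-Since: Sat, 08 Jun 1996 12:00:00 GMT
Pragma: no-cache
```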
-n
don't use defined PROXY.
If the http_proxy environment variable is defined, this option makes WebCopy ignore it. It also ignores a PROXY specified in the command line.
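As an illustration, a proxy could be configured via the environment like this (proxy.example.com:8080 is a made-up address):

```shell
# webcopy honours the http_proxy environment variable unless -n is
# given, in which case it connects to the origin server directly.
http_proxy=http://proxy.example.com:8080/
export http_proxy
```

After this, "webcopy http://www.host/page.html" would go through the proxy, while "webcopy -n http://www.host/page.html" would not.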
-h
help.
WebCopy displays a brief help message, ignores any other options specified, and exits.
--dump
core dump.
If your OS has the undump command and you want to speed up initialization/compile time, WebCopy generates a big core file when this option is the only one in the command line.

Note: Some options conflict with others. For example, you cannot use -v and -o at the same time because both require STDOUT.

Examples:

  1. To retrieve a single file and store it with some name in current directory:
    webcopy -so http://www.host/images/icon.gif > logo.gif
    If you use the same name for both the source and destination file, you must also specify the -g option, or you will probably get an empty file.
  2. To retrieve a page and some of the inlined images without delay:
    webcopy -vsiqt0 http://www.host/page.html
    and press RETURN for each file NOT to be transferred.
  3. To retrieve a document, all the inlined images and the referenced files without recursing them:
    webcopy -vr1 http://www.host/page.html
    Check in W.log for ignored links.
  4. To mirror a group of files in some other directory:
    webcopy -rwpub/mirror/name http://www.host/intro.html
  5. To retrieve the output of a form:
    1. Get the form:
      webcopy -so http://www.host/form.html > form.html
    2. Using an editor, change:
      <FORM METHOD=POST ACTION="http://www.host/cgi-bin/proc">
      tag into:
      <FORM METHOD=POST ACTION="mailto:yourself@yourdomain">
    3. Using a WWW browser, read the modified file, fill the form and press "OK" button.
    4. Wait for your own mail to arrive. It should contain the posted URL-encoded data in the body.
    5. Save the mail in a file (post.dat) without the mail headers.
    6. Post the data:
      webcopy -so -zpost.dat http://www.host/cgi-bin/proc > result.html
    If you are smart enough, you can write your own data files and just do step 6, or use the following:
    webcopy -so -z http://www.host/cgi-bin/proc?postdata > result.html
  6. To verbosely retrieve html documents and icons that are not in the same directory of the server:
    webcopy -vvrpafd http://www.host/path/page.html
  7. To retrieve a file using a PROXY, overriding the default http_proxy environment variable:
    webcopy http://www.host/path/page.html http://otherproxy
  8. To retrieve a password protected file, identified by userid "Aladdin" and password "open sesame", you must quote embedded blanks:
    webcopy "-yAladdin:open sesame" http://www.host/path/page.html
  9. To retrieve only .html or .txt or .lst or .htm files:
    webcopy -r -khtml:txt:lst:htm http://www.host/path/page.html
    or
    webcopy -r -khtml -ktxt -klst -khtm http://www.host/path/page.html
    or even
    webcopy -r -k.html.txt.lst.htm http://www.host/path/page.html
  10. To ignore postscript, Word and RTF files:
    webcopy -r -Kps:doc:rtf http://www.host/path/page.html

License Agreement and Lack of Warranty:

If you (want to) use this program, please send e-mail to the author. He will try to notify you of any updates made to it.

System Requirements:

Down-loading and Setting-Up:

  1. Make sure you have the previous System Requirements.
  2. Get the latest version of WebCopy from its home FTP server: ftp://ftp.inf.utfsm.cl/pub/utfsm/perl/webcopy.tgz
    This is a gzip'ed tar archive.
  3. Untar the file with the command:
    tar -xzvf webcopy.tgz
    (GNU version of tar).
  4. Make sure you got the following files in a subdir called webcopy-0.98b7:
  5. Chdir to that directory.
  6. Read the License Agreement and Lack of Warranty in webcopy.html using an HTML browser.
  7. Edit the Makefile file and select or change the path and filename for the PERL and DESTINATION macros as required, and select the version of perl for the IGNORE macro.
  8. Run the Makefile file:
    make
    or, to force perl 4 code in WebCopy:
    make perl4
    or, to force perl 5 code in WebCopy:
    make perl5
    If you cannot do make, copy or move webcopy.src to webcopy, edit webcopy, and change "%PERL%" in the first 2 lines into the location of your perl interpreter, for example: "/usr/local/bin/perl".
    Also, remove all lines containing the string "#P5" if your interpreter is perl4, or "#P4" if you are using perl5 (yes, this is OK: #P5 marks lines with Perl 5 code).
    You may also need to comment out all lines containing "#UNDEFINED". If something goes wrong when you run WebCopy, uncomment some of them and fill in the required code.
  9. Check that your site has the hostname program. If it doesn't, create your own script:
    #!/bin/sh
    /bin/uname -n
    or (if you run WebCopy in the same host every time):
    #!/bin/sh
    echo "myhostname"
    or something like that, then make it executable:
    chmod 755 hostname
    and place it in a directory available in the PATH.
  10. Test if it compiles OK:
    ./webcopy -h
    It should display a help menu.
  11. Test if it can connect to an HTTP server:
    ./webcopy -vv http://www/
    It should display some status and create two files: the retrieved index (stored as index.html by default) and the log file W.log.
  12. Move webcopy to a suitable directory. This can also be done with:
    make install
  13. Use it at your own risk!
  14. Register yourself (it's free) and send feed-back!

If you cannot do gunzip or tar, please send e-mail to the author. He will try to send you a shar'ed copy of it :-)


Document last modified on 96/06/08 by Víctor Parada