Metalink/HTTP: Mirrors and Checksums in HTTP Headers

Pompano Beach, FL, USA
anthonybryan@gmail.com
http://www.metalinker.org

neil@nabber.org
http://www.nabber.org

henrik@henriknordstrom.net
http://www.henriknordstrom.net/

Roke Manor Research, Old Salisbury Lane, Romsey, Hampshire SO51 0ZN, UK
+44 1794 833 465
alan.ford@roke.co.uk

This document specifies Metalink/HTTP: Mirrors and Checksums in HTTP Headers, a different way to get information that is usually contained in the Metalink XML-based download description format.
Metalink/HTTP describes multiple download locations (mirrors), Peer-to-Peer, checksums, digital signatures, and other information using existing standards for HTTP headers. Clients can transparently
use this information to make file transfers more robust and reliable.

Metalink/HTTP is an alternative representation of Metalink information, which is usually presented as an XML-based document format.
Metalink/HTTP attempts to provide as much functionality as the Metalink/XML format by using existing standards such as Web Linking,
Instance Digests in HTTP, and ETags. Metalink/HTTP is used to list
information about a file to be downloaded. This can include lists of multiple URIs (mirrors), Peer-to-Peer information, checksums, and digital signatures.

Identical copies of a file are frequently accessible in multiple locations on the Internet over a variety of protocols (such as FTP, HTTP, and Peer-to-Peer).
In some cases, users are shown a list of these multiple download locations (mirrors) and must manually select a single one on the basis of geographical location, priority, or bandwidth.
This distributes the load across multiple servers, and should also increase throughput and resilience. At times, however, individual servers can be slow, outdated, or unreachable, but this cannot be determined until the download has been initiated. Users rarely have sufficient information to choose the most appropriate server and often choose the first in a list, which may not be optimal for their needs and gives that server a disproportionate share of the load.
The use of suboptimal mirrors can lead to the user canceling and restarting the download to try to manually find a better source. During downloads, errors in transmission can corrupt the file.
There are no easy ways to repair these files. For large downloads this can be extremely troublesome.
Any of the problems that can occur during a download leads to frustration on the part of users.

Some popular sites automate the process of selecting mirrors using DNS load balancing, both to approximately balance load between servers, and to direct clients to nearby servers with the hope that this improves throughput. Indeed, DNS load balancing can balance long-term server load fairly effectively, but it is less effective at delivering the best throughput to users when the bottleneck is not the server but the network.

This document describes a mechanism by which the benefit of mirrors can be automatically and more effectively realized. All the information about a download, including mirrors, checksums, digital signatures, and more, can be transferred in coordinated HTTP Headers.
This Metalink transfers the knowledge of the download server (and mirror database) to the client. Clients can fall back to other mirrors if the current one has an issue. With this knowledge,
the client is enabled to work its way to a successful download even under adverse circumstances.
All this is done transparently to the user and the download is much more reliable and efficient.
In contrast, a traditional HTTP redirect to a mirror conveys only extremely minimal information - one link to one server, and there is no provision in the HTTP protocol to handle failures.
Furthermore, in order to provide better load distribution across servers and potentially faster downloads to users, Metalink/HTTP facilitates multi-source downloads, where portions of a file are downloaded from multiple mirrors (and optionally, Peer-to-Peer) simultaneously.

[[ Discussion of this draft should take place on IETF HTTP WG mailing list at ietf-http-wg@w3.org or the Metalink discussion mailing list
located at metalink-discussion@googlegroups.com. To join the list, visit
http://groups.google.com/group/metalink-discussion . ]]

Detailed discussion of Metalink operation is covered later in this document; this section will present a very brief, high-level overview of how Metalink achieves its goals.

Upon connection to a Metalink/HTTP server, a client will receive information about other sources of the same resource and a checksum of the whole resource. The client will then be able to request chunks of the file from the various sources, scheduling appropriately in order to maximise the download rate.

This specification describes conformance of Metalink/HTTP.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14, RFC 2119, as scoped to those conformance targets.

In this context, "Metalink" refers to Metalink/HTTP which consists of mirrors and checksums in HTTP Headers as described in this document. "Metalink/XML" refers to the XML format described in The Metalink Download Description Format.

Metalink resources include a Link header to present a list of mirrors in the response to a client request for the resource. The checksum of a resource must be included via Instance Digests in HTTP.

Metalink servers are HTTP servers with one or more Metalink
resources. Mirror and checksum information provided by the originating Metalink server MUST be considered authoritative. Metalink servers and their associated mirror servers
SHOULD all share the same ETag policy (ETag Synchronization), i.e. based on the file contents (checksum) and not server-unique filesystem metadata. The emitted ETag MAY be implemented the same as the Instance Digest for simplicity. Metalink servers MAY offer Metalink/XML documents that contain partial file checksums and other information.

Mirror servers are typically FTP or HTTP servers that "mirror" another server. That is, they provide identical copies of (at least some) files that are also on the mirrored server.
Mirror servers MAY be Metalink servers. Mirror servers MUST support serving partial content. HTTP mirror servers SHOULD share the same ETag policy as the originating Metalink server. HTTP Mirror servers SHOULD support Instance Digests in HTTP.

Metalink clients use the mirrors provided by a Metalink server with Link header. Metalink clients MUST support HTTP and MAY support FTP, BitTorrent,
or other download methods. Metalink clients MUST switch downloads from one mirror to another if the mirror becomes unreachable. Metalink clients SHOULD support multi-source, or parallel,
downloads, where portions of a file are downloaded from multiple mirrors simultaneously (and optionally, from Peer-to-Peer sources). Metalink clients MUST support Instance Digests in HTTP by
requesting and verifying checksums. Metalink clients MAY make use of digital signatures if they are offered.

Mirrors are specified with the Link header and a relation type of "duplicate", as registered in this document.

[[Some organizations have many mirrors. Only send a few mirrors, or only use the Link header if Want-Digest is used?]]

It is up to the server to choose how many Link headers to send. Such a decision could be a hard-coded limit, a random selection, based on file size, or based on server load.

Mirror servers are listed in order of priority (from most preferred to least) or have a "pri" value, where mirrors with lower values are used first.

This is purely an expression of the server's preferences; it is up to the client what it does with this information, particularly with reference to how many servers to use at any one time. A client MUST respect the server's priority ordering, however.

[[Would it make more sense to use qvalue-style policies here, i.e. q=1.0 through q=0.0 ?]]

Mirror servers MAY have a "geo" value, which
is an ISO 3166-1 alpha-2 two-letter country code for the geographical location of the physical server the URI is used
to access. A client may use this information to select a mirror, or set of mirrors, that are geographically near (if the client has access to such information), with the aim of reducing network load at inter-country bottlenecks.
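
As an illustration of how a client might consume these parameters, the following Python sketch parses Link header values, keeps those with the "duplicate" relation type, and orders them by "pri". The function names and example URIs are invented for this example. Two behaviours are assumptions rather than anything defined by this document: each header value is assumed to carry a single link, and mirrors without a "pri" value are placed after those that have one, in header order.

```python
def parse_link(value):
    """Parse one Link header value holding a single link into (uri, params)."""
    value = value.strip()
    if not value.startswith("<") or ">" not in value:
        raise ValueError("not a link value: %r" % value)
    end = value.index(">")
    uri = value[1:end]
    params = {}
    for part in value[end + 1:].split(";"):
        if "=" not in part:
            continue
        name, _, raw = part.partition("=")
        params[name.strip().lower()] = raw.strip().strip('"')
    return uri, params


def mirrors_from_links(link_values):
    """Return (uri, pri, geo) tuples for rel="duplicate" links, best first."""
    found = []
    for order, value in enumerate(link_values):
        uri, params = parse_link(value)
        if "duplicate" not in params.get("rel", "").split():
            continue
        pri = int(params["pri"]) if "pri" in params else None
        found.append((pri, order, uri, params.get("geo")))
    # Lower "pri" first; links without "pri" follow in header order (assumption).
    found.sort(key=lambda m: (m[0] is None, m[0] if m[0] is not None else 0, m[1]))
    return [(uri, pri, geo) for pri, order, uri, geo in found]
```

A client could then restrict the result to mirrors whose "geo" value matches its own country, when it knows one, before opening connections.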
There are two types of mirror servers: preferred and normal. Preferred mirror servers are HTTP mirror servers that MUST share the same ETag policy as the originating Metalink server. Optimally, they will also support Instance Digests in HTTP. Preferred mirrors make it possible to detect early on, before data is transferred, if the file requested matches the desired file.
Preferred HTTP mirror servers have a "pref" value of 1. If "pref" is unspecified, a mirror is considered "normal" and is not assumed to share the same ETag policy. FTP mirrors, as they do not emit ETags, MUST always be considered "normal".

HTTP Mirror servers SHOULD support Instance Digests in HTTP.

[[Suggestion: In order for clients to identify servers that have coordinated ETag policies, the ETag MUST begin with "Metalink:", e.g.
]]

Some mirrors may mirror single files, whole directories, or multiple directories.

Mirror servers MAY have a "depth" value, where "depth=0" is the default. A value of 0 means ONLY that file is mirrored. A value of 1 means that file and all other files and subdirectories in the
directory are mirrored. A value of 2 means the directory above, and all files and subdirectories, are mirrored.

In the above example, 4 directories up are mirrored, from /dir2/ on down.

Metainfo files, which describe ways to download a file over Peer-to-Peer networks or otherwise, are specified with the Link header and a relation type of "describedby" and a type parameter that indicates
the MIME type of the metadata available at the URI.

Metalink clients MAY support the use of metainfo files for downloading files.

Full Metalink/XML files for a given resource can be specified in the same way as metainfo files.
This is particularly useful for providing metadata such as checksums of chunks, allowing a
client to recover from partial errors (see the discussion of error recovery below).

OpenPGP signatures are specified with the Link header and a relation type of "describedby" and a type parameter of "application/pgp-signature".

Metalink clients MAY support the use of OpenPGP signatures.

Metalink servers MUST provide Instance Digests in HTTP for files they describe with mirrors. Mirror servers SHOULD as well.

Metalink clients begin a download with a standard HTTP GET request to the Metalink server. A Range limit is optional, not required. Alternatively, Metalink clients can begin with a HEAD request to the Metalink server to discover mirrors via Link headers. After that, the client follows with a GET request to the desired mirrors.

The Metalink server responds with the data and these headers:
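
A server-side sketch in Python of how such headers could be assembled is given here. It is illustrative only: the function name and the example URIs are invented, and the use of base64-encoded SHA-256 for the Digest value, together with the reuse of that value as the ETag (which this document permits but does not require), are assumptions made for the example.

```python
import base64
import hashlib


def metalink_headers(body, mirrors, signature_uri=None):
    """Return (name, value) header pairs describing `body` and its mirrors.

    `mirrors` is a list of mirror URIs, most preferred first (hypothetical input).
    """
    digest = base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")
    headers = [
        ("Content-Length", str(len(body))),
        # ETag derived from file contents, so every mirror can emit the same value.
        ("ETag", '"%s"' % digest),
        ("Digest", "SHA-256=%s" % digest),
    ]
    for pri, uri in enumerate(mirrors, start=1):
        headers.append(("Link", '<%s>; rel="duplicate"; pri=%d' % (uri, pri)))
    if signature_uri is not None:
        headers.append((
            "Link",
            '<%s>; rel="describedby"; type="application/pgp-signature"' % signature_uri,
        ))
    return headers
```

A real server would compute the digest once, when the file changes, rather than on every request.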
From the Metalink server response the client learns some or all of the following metadata
about the requested object, in addition to also starting to receive the
object:

o Object size.
o ETag.
o Mirror profile link, which may describe the mirror's priority, whether it shares the ETag policy of the originating Metalink server, geographical location, and mirror depth.
o Peer-to-peer information.
o Metalink/XML, which can include partial file checksums to repair a file.
o Digital signature.
o Instance Digest, which is the whole file checksum.

(Alternatively, the client could have requested a HEAD only, and then skipped to making the following decisions on every available mirror server found via the Link headers.)

If the object is large and gets delivered slower than expected then the Metalink client starts a number of parallel ranged downloads (one per selected mirror server other than the
first) using mirrors provided by the Link header with "duplicate" relation type, using the location of the original GET request in the "Referer" header field. The size and number of ranges requested from each server is for the client to decide, based upon the performance observed from each server. Further discussion of performance considerations is presented later in this document.

If no range limit was given in the original request then work from the tail of the object (the first request is still running and will eventually catch up), otherwise continue after the range requested in the first request. If no Range was provided, the original connection must be terminated once all parts of the resource have been retrieved. It is recommended that a HEAD request is undertaken first, so that the client can find out if there are any Link headers, and then Range-based requests are undertaken to the mirror servers as well as on the original connection.

Preferred mirrors have coordinated ETags, as described above, and If-Match conditions based on the ETag SHOULD be used to quickly detect out-of-date mirrors by using the ETag from the Metalink server response. If no indication of ETag synchronisation/knowledge is given then If-Match
should not be used, and optimally there will be an Instance Digest in the mirror response which we can use to detect a mismatch early, and if not then a mismatch won't be detected until the completed object is verified. Early file mismatch detection is described in detail below.

One of the client requests to a mirror server:
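
For illustration, the header fields of such a ranged request could be built as in the following Python sketch. The function name, URL, ETag value, and byte range are invented for the example; the only behaviour taken from this document is that "Referer" carries the location of the original GET request and that If-Match is sent only to preferred mirrors.

```python
def mirror_request_headers(origin_url, etag, first_byte, last_byte, preferred):
    """Build request header fields for one ranged GET to a mirror.

    origin_url: location of the original GET request (sent as Referer).
    etag:       ETag from the Metalink server response, or None.
    preferred:  True if the mirror was advertised with pref=1.
    """
    if first_byte < 0 or last_byte < first_byte:
        raise ValueError("invalid byte range")
    headers = {
        "Range": "bytes=%d-%d" % (first_byte, last_byte),
        "Referer": origin_url,
    }
    # Only preferred mirrors are known to share the ETag policy.
    if preferred and etag:
        headers["If-Match"] = etag
    return headers
```

With a preferred mirror, an out-of-date copy is then rejected by the mirror itself (the If-Match precondition fails) before any data is transferred.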
The mirror servers respond with a 206 Partial Content HTTP status code and appropriate "Content-Length" and "Content-Range" header fields. The mirror server response, with data, to the above request:
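
A client-side check of such a response might look like the following Python sketch. The names `check_partial_response` and `MirrorMismatch` are hypothetical. The sketch compares the total length reported in Content-Range with the object size learned from the Metalink server, since a mirror reporting a different size cannot hold the same file; treating an unknown total ("*") as acceptable is an assumption of this example.

```python
import re

_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


class MirrorMismatch(Exception):
    """Raised when a mirror response cannot belong to the expected object."""


def check_partial_response(status, content_range, expected_size):
    """Validate a ranged response; return (first, last) byte positions."""
    if status != 206:
        raise MirrorMismatch("expected 206 Partial Content, got %d" % status)
    match = _CONTENT_RANGE.match(content_range.strip())
    if match is None:
        raise MirrorMismatch("unparseable Content-Range: %r" % content_range)
    first, last = int(match.group(1)), int(match.group(2))
    total = match.group(3)
    # A known total that differs from the Metalink server's size means a different file.
    if total != "*" and int(total) != expected_size:
        raise MirrorMismatch("size %s differs from expected %d" % (total, expected_size))
    if last < first or last >= expected_size:
        raise MirrorMismatch("range %d-%d outside object" % (first, last))
    return first, last
```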
If the first request was not Range limited then abort it by closing the connection when it catches up with the other parallel downloads of the same object.

Downloads from mirrors that do not have the same file size as the Metalink server MUST be aborted.

Once the download has completed, the Metalink client MUST verify the checksum of the file.

Error prevention, or early file mismatch detection, is possible before file transfers with the use of file sizes, ETags, and Instance Digests. Error detection requires Instance Digests, or checksums, to
determine after transfers if there has been an error. Error correction, or download repair, is possible with partial file checksums.

In HTTP terms, the requirement is that merging of ranges from multiple
responses must be verified with a strong validator, which in this
context is the same as either Instance Digest or a strong ETag.
In most cases it is sufficient that the Metalink server provides
mirrors and Instance Digest information, but operation will be
more robust and efficient if the mirror servers do implement a synchronized
ETag as well. In fact, the emitted ETag
may be implemented the same as the Instance Digest for simplicity,
but there is no need to specify how the ETag is generated, just that it needs
to be shared among the mirror servers.
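
The whole-object verification that this relies on can be sketched in Python as follows. The function names are invented for the example, and the sketch assumes that the Digest header carries a "SHA-256" entry whose value is the base64 encoding of the binary hash; other algorithms would need their own handling.

```python
import base64
import hashlib
import hmac


def parse_digest_header(value):
    """Split a Digest header value into {lowercased algorithm: encoded digest}."""
    digests = {}
    for item in value.split(","):
        name, sep, encoded = item.strip().partition("=")
        if sep:
            digests[name.lower()] = encoded
    return digests


def verify_instance_digest(stream, digest_header, block_size=1 << 20):
    """Return True if the bytes read from `stream` match the SHA-256 Instance Digest."""
    expected = parse_digest_header(digest_header).get("sha-256")
    if expected is None:
        raise ValueError("no SHA-256 value in Digest header")
    hasher = hashlib.sha256()
    for block in iter(lambda: stream.read(block_size), b""):
        hasher.update(block)
    actual = base64.b64encode(hasher.digest()).decode("ascii")
    return hmac.compare_digest(actual, expected)
```

The Digest value passed in should be the one from the Metalink server's response, not one supplied by a mirror.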
If the mirror server provides neither synchronized ETag nor Instance Digest, then early detection of
mismatches is not possible unless file length also differs. Finally, the error is still detectable, after the download has completed, when the merged response is verified.

ETag cannot be used for verifying the integrity of the received
content. But it is a guarantee issued by the Metalink server that the content is
correct for that ETag. And if the ETag given by the mirror server matches
the ETag given by the master server, then we have a chain of trust where the master server
authorizes these responses as valid for that object.

This guarantees that a mismatch will be detected using only the synchronized ETag from the master server and the mirror server. The mirror servers themselves can even signal the mismatch by responding
with an error, preventing accidental merges of ranges from different
versions of files with the same name. This even includes many malicious attacks
where the data on the mirror has been replaced by some other file, but
not all.

Synchronized ETag cannot strictly protect against malicious attacks or server or
network errors replacing content, but neither can Instance Digest on the
mirror servers as the attacker most certainly can make the server
seemingly respond with the expected Instance Digest even if the file
contents have been modified, just as he can with ETag, and the same
for various system failures also causing bad data to be returned. The Metalink client
has to rely on the Instance Digest returned by the Metalink master server in the
first response for the verification of the downloaded object as a whole.

If the mirror servers do return an Instance Digest, then that is a bonus,
just as having them return the right set of Link headers is. The set of
trusted mirrors doing that can be substituted as master servers
accepting the initial request if one likes.

The benefit of having slave mirror servers (those not trusted as
masters) return Instance Digest is that the client then can detect
mismatches early even if ETag is not used. Both ETag and slave mirror
Instance Digest do provide value, but just one is sufficient for early
detection of mismatches. If none is provided then early detection of
mismatches is not possible unless the file length also differs, but the
error is still detected when the merged response is verified.

Partial file checksums can be used to detect errors during the download. Metalink servers are not required to offer partial file checksums, but they are encouraged to do so.

If the object checksum does not match the Instance Digest then fetch the Metalink/XML offered via the Link header, where partial file checksums may be found,
allowing detection of which server returned incorrect data.
If the Instance Digest computation does not match then the client needs
to fetch the partial file checksums, if available, and from there figure out what of
the downloaded data can be recovered and what needs to be fetched again.
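
As a sketch of that recovery step, the following Python function compares each piece of the downloaded data against a list of partial file checksums and returns the byte ranges that need to be fetched again. It assumes the piece length and the per-piece hashes have already been extracted from the Metalink/XML, that the hashes are hex-encoded SHA-256, and that the downloaded buffer has the full expected length; the function name and these input shapes are illustrative, not defined by this document.

```python
import hashlib


def corrupt_ranges(data, piece_length, piece_hashes):
    """Return inclusive (first, last) byte ranges whose piece hash does not match."""
    bad = []
    for index, expected in enumerate(piece_hashes):
        start = index * piece_length
        piece = data[start:start + piece_length]
        if hashlib.sha256(piece).hexdigest() != expected.lower():
            bad.append((start, start + len(piece) - 1))
    return bad
```

Each returned range can be requested again with a Range header, preferably from a different mirror than the one that supplied the bad piece.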
If no partial checksums are available, then the client MUST fetch the complete object from
other mirrors.

When opting to download simultaneously from multiple mirrors, there are a number of factors (both within and outside the influence of the client software) that are relevant to the performance achieved:

o The number of servers used simultaneously.
o The ability to pipeline sufficient or sufficiently large range requests to each server so as to avoid connections going idle.
o The ability to pipeline sufficiently few or sufficiently small range requests to servers so that all the servers finish their final chunks simultaneously.
o The ability to switch between mirrors dynamically so as to use the fastest mirrors at any moment in time.

Obviously we do not want to use too many simultaneous
connections, or other traffic sharing a bottleneck link will
be starved. But at the same time, good performance requires
that the client can simultaneously download from at least one
fast mirror while exploring whether any other mirror is
faster. Based on laboratory experiments, we suggest a good
default number of simultaneous connections is probably four,
with three of these being used for the best three mirrors
found so far, and one being used to evaluate whether any other
mirror might offer better performance.

The size of chunks chosen by the client should be sufficiently
large that the chunk request headers and response headers
represent negligible overhead, and sufficiently large that they
can be pipelined effectively without needing a very high rate
of chunk requests. At the same time, the amount of time
wasted waiting for the last chunk to download from the last
server after all the other servers have finished should be
minimized. Thus we currently recommend that a chunk size of
at least 10KBytes should be used. If the file being
transferred is very large, or the download speed very high,
this can be increased to perhaps 1MByte. As network
bandwidths increase, we expect these numbers to increase
appropriately, so that the time to transfer a chunk remains
significantly larger than the latency of requesting a chunk
from a server.

Accordingly, IANA has made the following registration to the Link Relation Type registry.

o Relation Name: duplicate
o Description: Refers to a resource whose available representations are byte-for-byte identical with the corresponding representations of the context IRI.
o Reference: This specification.
o Notes: This relation is for static resources. That is, an HTTP GET request on any duplicate will return the same representation. It does not make sense for dynamic or POSTable resources and should not be used for them.

Metalink clients handle URIs and IRIs. See Section 7 of RFC 3986 and Section 8 of RFC 3987 for security
considerations related to their handling and use.

There is potential for spoofing attacks where the attacker publishes Metalinks with false information.
In that case, this could deceive unaware downloaders into downloading a malicious or worthless file. Also, malicious publishers could attempt a distributed denial of service attack by inserting unrelated URIs into Metalinks.

Currently, some of the digest values defined in Instance Digests in HTTP are considered insecure.
These include the whole Message Digest family of algorithms which are not suitable for cryptographically strong verification. Malicious people could provide files that appear to be
identical to another file because of a collision, i.e. the weak cryptographic hashes of the intended file and a substituted malicious file could match.

If a Metalink contains whole file hashes (Instance Digests), it SHOULD include "sha-256" which is SHA-256, as specified in the Secure Hash Standard (SHS), or stronger. It MAY also include other hashes.

Metalinks should include digital signatures, as described above.

Digital signatures provide authentication, message integrity, and non-repudiation with proof of origin.

Key words for use in RFCs to Indicate Requirement Levels. Harvard University, 1350 Mass. Ave., Cambridge, MA 02138. +1 617 495 3864. sob@harvard.edu
Hypertext Transfer Protocol -- HTTP/1.1
Instance Digests in HTTP
Uniform Resource Identifier (URI): Generic Syntax
Internationalized Resource Identifiers (IRIs)
Web Linking. mnot@mnot.net
ISO 3166-1:2006. Codes for the representation of names of countries and their subdivisions -- Part 1: Country codes. International Organization for Standardization.
Secure Hash Standard (SHS). National Institute of Standards and Technology (NIST).
The Metalink Download Description Format.
   Metalinker Project, anthonybryan@gmail.com, http://www.metalinker.org
   Metalinker Project, tatsuhiro.t@gmail.com, http://aria2.sourceforge.net
   Metalinker Project, neil@nabber.org, http://www.nabber.org
   Novell, Inc., poeml@mirrorbrain.org, http://www.mirrorbrain.org/

Thanks to the Metalink community, Mark Handley, Mark Nottingham, Daniel Stenberg, Tatsuhiro Tsujikawa, Peter Poeml, Matt Domsch, Micah Cowan, and David Morris.

Support for simultaneous download from multiple mirrors is based upon work by Mark Handley and Javier Vela Diago, who also provided validation of the benefits of this approach.

[[ to be removed by the RFC editor before publication as an RFC. ]]

This draft, compared to the Metalink/XML format:

(+) Reuses existing HTTP standards without much new besides a Link Relation Type. It's more of a collection/coordinated feature set.
(?) The existing standards don't seem to be widely implemented.
(+) No XML dependency, except for Metalink/XML for partial file checksums.
(+) Existing Metalink/XML clients can be easily converted to support this as well.
(+) Coordination of mirror servers is preferred, but not required. Coordination may be difficult or impossible unless you are in control of all servers on the mirror network.
(-) Requires software or configuration changes to originating server.
(-?) Tied to HTTP, not as generic. FTP/P2P clients won't be using it unless they also support HTTP, unlike Metalink/XML.
(-) Requires server-side support. Metalink/XML can be created by user (or server, but server component/changes not required).
(-) Also, Metalink/XML files are easily mirrored on all servers. Even if usage in that case is not as transparent, it still gives access to users at all mirrors (FTP included) to all download information with no changes needed to the server.
(-) Not portable/archivable/emailable. Metalink/XML is used to import/export transfer queues. Not as easy for search engines to index?
(-) Not as rich metadata.
(-) Not able to add multiple files to a download queue or create directory structure.

[[ to be removed by the RFC editor before publication as an RFC. ]]

Known issues concerning this draft:
o Use of Link header to describe Mirrors. Only send a few mirrors with Link header, or only send them if Want-Digest is used? Some organizations have many mirrors.
o Would it make more sense to use qvalue-style policies to describe mirror priority, i.e. q=1.0 through q=0.0 ?
o Using Metalink/XML for partial file checksums. That adds XML dependency to apps for an important feature. Is there a better method?
o Do we need an "official" MIME type for .torrent files or allow "application/x-bittorrent"?

-14 : December 31, 2009.
   o Baseline file hash: SHA-256.
-13 : November 22, 2009.
   o Metalink/XML for partial file checksums.
-12 : November 11, 2009.
   o Clarifications.
-11 : October 23, 2009.
   o Mirror changes.
-10 : October 15, 2009.
   o Mirror coordination changes.
-09 : October 12, 2009.
   o Mirror location, coordination, and depth.
   o Split HTTP Digest Algorithm Values Registration into draft-bryan-http-digest-algorithm-values-update.
-08 : October 4, 2009.
   o Clarifications.
-07 : September 29, 2009.
   o Preferred mirror servers.
-06 : September 24, 2009.
   o Add Mismatch Detection, Error Recovery, and Digest Algorithm values.
   o Remove Content-MD5 and Want-Digest.
-05 : September 19, 2009.
   o ETags, preferably matching the Instance Digests.
-04 : September 17, 2009.
   o Temporarily remove .torrent.
-03 : September 16, 2009.
   o Mention HEAD request, negotiate mirrors if Want-Digest is used.
-02 : September 6, 2009.
   o Content-MD5 for partial file checksums.
-01 : September 1, 2009.
   o Link Relation Type Registration: "duplicate"
-00 : August 24, 2009.
   o Initial draft.