1-Overview of HTTP


Please indicate the source: http://blog.csdn.net/gaoxiangnumber1

Welcome to my github: https://github.com/gaoxiangnumber1

1.1 HTTP(Hypertext Transfer Protocol): The Internet’s Multimedia Courier

  • HTTP uses reliable data-transmission protocols that guarantee your data will not be damaged or scrambled in transit.

1.2 Web Clients and Servers

  • Web servers(also called HTTP servers) store the Internet’s data and provide it when HTTP clients request it. Clients send HTTP requests to servers, and servers return the requested data in HTTP responses(Figure 1-1).

1.3 Resources

  • Web servers host web resources that are the source of web content.
  • The simplest kind of web resource is a static file on the web server’s filesystem, such as text files, AVI movie files, or any other format you can think of.
  • Resources can also be software programs that generate content on demand. These dynamic content resources can generate content based on your identity or on the information you’ve requested. They can show you a live image from a camera, let you trade stocks, or let you buy gifts from online stores(Figure 1-2).

1.3.1 Media Types

  • HTTP tags each object being transported through the Web with a MIME (Multipurpose Internet Mail Extensions) type label.

  • Web servers attach a MIME type to all HTTP object data(Figure 1-3). When a web browser gets an object back from a server, it looks at the associated MIME type to see if it knows how to handle the object.
  • A MIME type(Appendix D) is a textual label, represented as a primary object type and a specific subtype, separated by a slash. For example:
      text/html: HTML-formatted text document.
      text/plain: Plain ASCII text document.
      image/jpeg: JPEG version of an image.
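
As a rough illustration of how a client might act on a MIME type label, the sketch below splits a label into its primary type and subtype and looks up what to do with the object. The handler table and the function names are hypothetical, invented for this example; real browsers use far richer logic.

```python
# Hypothetical handler table: the MIME types this toy client knows how to handle.
HANDLERS = {
    "text/html": "render as HTML",
    "text/plain": "show as plain text",
    "image/jpeg": "decode as JPEG image",
}


def split_mime_type(mime_type: str) -> tuple[str, str]:
    """Split 'primary/subtype' into its two parts (lowercased)."""
    primary, _, subtype = mime_type.strip().lower().partition("/")
    return primary, subtype


def choose_handler(mime_type: str) -> str:
    """Return what the toy client would do with an object of this type."""
    key = "/".join(split_mime_type(mime_type))
    # Unknown types fall back to a generic action (an assumption of this sketch).
    return HANDLERS.get(key, "offer to save to disk")


if __name__ == "__main__":
    for label in ("text/html", "image/jpeg", "application/zip"):
        print(label, "->", choose_handler(label))
```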

1.3.2 URIs

  • The server resource name is called a uniform resource identifier(URI). URIs uniquely identify and locate information resources around the world.

  • Figure 1-4 shows how the URI specifies the HTTP protocol to access the saw-blade GIF resource on Joe’s store’s server. URIs come in two flavors, called URLs and URNs.

1.3.3 URLs

  • The uniform resource locator(URL) is the most common form of resource identifier. URLs describe the specific location of a resource on a particular server.

  • URLs follow a standardized format of three main parts:
    1. The first part is called the scheme, and it describes the protocol used to access the resource. This is usually the HTTP protocol(http://).
    2. The second part gives the server Internet address(e.g., www.example.com).
    3. The rest names a resource on the web server(e.g., /specials/saw-blade.gif ).
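
The three parts can be pulled out of a URL with Python's standard urllib.parse module. This is only a small sketch; the URL below is assembled from the example host and path mentioned in the list above.

```python
from urllib.parse import urlsplit

# URL assembled from the example host and path used in the list above.
url = "http://www.example.com/specials/saw-blade.gif"
parts = urlsplit(url)

scheme = parts.scheme      # protocol used to access the resource
server = parts.hostname    # server Internet address
resource = parts.path      # name of the resource on that server

print(scheme, server, resource)
```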

1.3.4 URNs

  • The second flavor of URI is the uniform resource name(URN). A URN serves as a unique name for a particular piece of content, independent of where the resource currently resides. URNs allow resources to move from place to place and to be accessed by multiple network access protocols while maintaining the same name.
  • For example, the following URN names the Internet standards document “RFC 2141” regardless of where it resides(it may even be copied in several places):
    urn:ietf:rfc:2141
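
A URN of this shape can be taken apart with a few lines of Python. The sketch assumes the basic "urn:", namespace identifier, namespace-specific string layout from RFC 2141; the function name split_urn is made up for this example, and no validation of the individual parts is attempted.

```python
def split_urn(urn: str) -> tuple[str, str]:
    """Return (namespace identifier, namespace-specific string) of a URN."""
    pieces = urn.split(":", 2)
    if len(pieces) != 3 or pieces[0].lower() != "urn":
        raise ValueError(f"not a URN: {urn!r}")
    _, namespace_id, specific = pieces
    return namespace_id, specific


namespace, name = split_urn("urn:ietf:rfc:2141")
print(namespace, name)
```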

1.4 Transactions

  • An HTTP transaction consists of a request command(sent from client to server) and a response result(sent from the server back to the client). This communication happens with HTTP messages, as shown in Figure 1-5.

1.4.1 Methods

  • HTTP supports different request commands(HTTP methods). Every HTTP request message has a method that tells the server what action to perform. Table 1-2 lists five common HTTP methods.

1.4.2 Status Codes

  • Every HTTP response message comes back with a status code which is a three-digit numeric code that tells the client if the request succeeded, or if other actions are required. A few common status codes are shown in Table 1-3.

  • HTTP also sends an explanatory textual “reason phrase” with each numeric status code. The textual phrase is included only for descriptive purposes; the numeric code is used for all processing. The following status codes and reason phrases are treated identically by HTTP software:
      200 OK
      200 Document attached
      200 Success
      200 All's cool, dude
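
To make the point concrete, here is a minimal sketch of how software might read a response start line and keep only the numeric code for processing. The "HTTP/1.0" version prefix and the function name parse_status_line are assumptions added for this example.

```python
def parse_status_line(line: str) -> tuple[str, int, str]:
    """Split 'version code reason-phrase' into its three parts."""
    version, _, rest = line.partition(" ")
    code, _, reason = rest.partition(" ")
    return version, int(code), reason


# The same status code with four different reason phrases.
lines = [
    "HTTP/1.0 200 OK",
    "HTTP/1.0 200 Document attached",
    "HTTP/1.0 200 Success",
    "HTTP/1.0 200 All's cool, dude",
]

# Only the numeric code is used for processing, so all four look alike.
codes = {parse_status_line(line)[1] for line in lines}
print(codes)
```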

1.4.3 Web Pages Can Consist of Multiple Objects

  • An application often issues multiple HTTP transactions to accomplish a task. For example, a web browser issues a cascade of HTTP transactions to fetch and display a graphics-rich web page. The browser performs one transaction to fetch the HTML “skeleton” that describes the page layout, then issues additional HTTP transactions for each embedded image, graphics pane, etc. These embedded resources might even reside on different servers, as shown in Figure 1-6.
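
The sketch below shows one way a client could discover which extra transactions a page needs: it scans the HTML skeleton for image tags and resolves each one against the page's URL. The page content, the URLs, and the class name ImageFinder are all hypothetical, invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageFinder(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Hypothetical page skeleton and URLs, made up for this example.
page_url = "http://www.example.com/index.html"
skeleton = """
<html><body>
  <img src="/images/logo.gif">
  <img src="http://images.example.net/banner.jpg">
</body></html>
"""

finder = ImageFinder()
finder.feed(skeleton)

# Each embedded image needs its own HTTP transaction, possibly to another server.
to_fetch = [urljoin(page_url, src) for src in finder.sources]
print(to_fetch)
```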

1.5 Messages

  • HTTP messages are line-oriented sequences of characters. There are two kinds of HTTP messages: request messages(sent from web clients to web servers) and response messages(sent from servers to clients), as shown in Figure 1-7.

  • HTTP messages consist of three parts:
    1. Start line: the first line is the start line, indicating what to do for a request or what happened for a response.
    2. Header fields: zero or more header fields follow the start line. Each header field consists of a name and a value, separated by a colon(:). The headers end with a blank line.
    3. Body: after the blank line is an optional message body containing any kind of data. Request bodies carry data to the web server; response bodies carry data back to the client. The body can contain arbitrary data(binary or text).
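
The three parts can be separated mechanically. The following is a simplified sketch, not a full parser: it assumes CRLF line endings and ignores repeated header names. The sample response and the function name parse_message are made up for this example.

```python
def parse_message(raw: bytes) -> tuple[str, dict[str, str], bytes]:
    """Split a raw HTTP message into (start line, headers, body)."""
    # The first blank line separates the headers from the optional body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        # Each header field is a name and a value separated by a colon.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


# Hypothetical response message, invented for this example.
raw = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 19\r\n"
    b"\r\n"
    b"Hi! I'm a message!\n"
)

start_line, headers, body = parse_message(raw)
print(start_line)
print(headers)
print(body)
```

The body is kept as bytes because it may carry arbitrary binary data, while the start line and headers are decoded as text.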

1.5.1 Simple Message Example

  • Figure 1-8:
    1. The browser sends an HTTP request message that has a GET method in the start line, and the local resource is /tools.html. The request indicates it is speaking Version 1.0 of the HTTP protocol. The request message has no body.
    2. The server sends back an HTTP response message that contains the HTTP version number(HTTP/1.0), a success status code(200), a descriptive reason phrase(OK), and a block of response header fields, all followed by the response body containing the requested document. The response body length is noted in the Content-Length header, and the document’s MIME type is noted in the Content-Type header.
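
A request like the one in step 1 can be assembled by hand. This is a minimal sketch; the function name build_request and the particular header fields are assumptions chosen for illustration, not the exact headers from Figure 1-8.

```python
def build_request(method: str, path: str, headers: dict[str, str],
                  version: str = "HTTP/1.0") -> bytes:
    """Serialize an HTTP request message that has no body."""
    lines = [f"{method} {path} {version}"]            # start line
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # A blank line ends the headers; nothing follows because there is no body.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


# Header fields below are illustrative assumptions.
request = build_request(
    "GET",
    "/tools.html",
    {"Host": "www.joes-hardware.com", "Accept": "text/html"},
)
print(request.decode("ascii"))
```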

1.6 Connections

1.6.1 TCP/IP

  • HTTP is an application layer protocol that leaves the details of networking to TCP/IP(Figure 1-9). TCP provides:
    1. Error-free data transportation.
    2. In-order delivery(data will always arrive in the order in which it was sent).
    3. Unsegmented data stream(can dribble out data in any size at any time).

1.6.2 Connections, IP Addresses, and Port Numbers

  • Before an HTTP client can send a message to a server, it needs to establish a TCP/IP connection between the client and server using Internet protocol(IP) addresses and port numbers. URLs are the addresses for resources and they can provide the IP address for the machine that has the resource.
  • http://207.200.83.29:80/index.html
    http://www.netscape.com:80/index.html
    http://www.netscape.com/index.html
    1. The first has the machine’s IP address “207.200.83.29” and port number “80”.
    2. The second has a domain name, or hostname(“www.netscape.com”). Hostnames can be converted into IP addresses through the Domain Name System(DNS).
    3. The last has no port number. When the port number is missing from an HTTP URL, the port defaults to 80.
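
The default-port rule is easy to express in code. The sketch below uses Python's urllib.parse on the three URLs above; the function name host_and_port is made up for this example.

```python
from urllib.parse import urlsplit

DEFAULT_HTTP_PORT = 80


def host_and_port(url: str) -> tuple[str, int]:
    """Return the host (IP address or hostname) and port named by an HTTP URL."""
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError(f"no host in URL: {url!r}")
    # Fall back to port 80 when the URL does not name a port.
    port = parts.port if parts.port is not None else DEFAULT_HTTP_PORT
    return parts.hostname, port


urls = [
    "http://207.200.83.29:80/index.html",
    "http://www.netscape.com:80/index.html",
    "http://www.netscape.com/index.html",
]
for url in urls:
    print(host_and_port(url))
```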

  • Figure 1-10 shows how a browser uses HTTP to display an HTML resource that resides on a server.
    (a) The browser extracts the server’s hostname from the URL.
    (b) The browser converts the server’s hostname into the server’s IP address.
    (c) The browser extracts the port number(if any) from the URL.
    (d) The browser establishes a TCP connection with the web server.
    (e) The browser sends an HTTP request message to the server.
    (f) The server sends an HTTP response back to the browser.
    (g) The connection is closed, and the browser displays the document.
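
Steps (a) through (g) can be sketched with Python's socket module. This is a bare-bones illustration under several assumptions: it speaks HTTP/1.0 so that the server closes the connection when the response is complete, it sends only a Host header, and it returns the raw response bytes instead of displaying anything. The function names plan_request and fetch are invented for this example.

```python
import socket
from urllib.parse import urlsplit


def plan_request(url: str) -> tuple[str, int, bytes]:
    """Work out the host, the port, and the request bytes for a URL."""
    parts = urlsplit(url)
    host = parts.hostname                                  # (a) extract hostname
    if host is None:
        raise ValueError(f"no host in URL: {url!r}")
    port = parts.port if parts.port is not None else 80    # (c) extract port
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    return host, port, request


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Fetch a URL over a fresh TCP connection and return the raw response."""
    host, port, request = plan_request(url)
    ip_address = socket.gethostbyname(host)                # (b) hostname to IP
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:  # (d)
        conn.sendall(request)                              # (e) send the request
        chunks = []
        while True:                                        # (f) read the response
            data = conn.recv(4096)
            if not data:                                   # (g) server closed
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch() needs network access to a reachable web server, so the sketch defines the functions without invoking them.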

1.6.3 A Real Example Using Telnet

  • The Telnet utility connects your keyboard to a destination TCP port and connects the TCP port’s output back to your display screen. Telnet is commonly used for remote terminal sessions, but it can connect to any TCP server, including HTTP servers.
  • Telnet lets you open a TCP connection to a port on a machine and type characters directly into the port. The web server treats you as a web client, and any data sent back on the TCP connection is displayed onscreen.
  • We will use Telnet to fetch the document pointed to by the URL http://www.joes-hardware.com:80/tools.html.
    1. We need to look up the IP address of www.joes-hardware.com and open a TCP connection to port 80 on that machine. Telnet does this legwork for us.
    2. Once the TCP connection is open, we need to type in the HTTP request.
    3. When the request is complete(indicated by a blank line), the server should send back the content in an HTTP response and close the connection.
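  • Our example HTTP request for http://www.joes-hardware.com:80/tools.html is shown in Example 1-1. We typed the telnet command, the GET request line, the Host header, and the blank line that follows it; everything else is output from Telnet or from the server.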
Example 1-1. An HTTP transaction using telnet
    % telnet www.joes-hardware.com 80
    Trying 161.58.228.45...
    Connected to joes-hardware.com.
    Escape character is '^]'.
    GET /tools.html HTTP/1.1
    Host: www.joes-hardware.com

    HTTP/1.1 200 OK
    Date: Sun, 01 Oct 2000 23:25:17 GMT
    Server: Apache/1.3.11 BSafe-SSL/1.38(Unix) FrontPage/4.0.4.3
    Last-Modified: Tue, 04 Jul 2000 09:46:21 GMT
    ETag: "373979-193-3961b26d"
    Accept-Ranges: bytes
    Content-Length: 403
    Connection: close
    Content-Type: text/html

    <HTML>
    <HEAD><TITLE>Joe's Tools</TITLE></HEAD>
    <BODY>
    <H1>Tools Page</H1>
    <H2>Hammers</H2>
    <P>Joe's Hardware Online has the largest selection of hammers on the earth.</P>
    <H2><A NAME=drills></A>Drills</H2>
    <P>Joe's Hardware has a complete line of cordless and corded drills, as well as the latest in plutonium-powered atomic drills, for those big around the house jobs.</P> ...
    </BODY>
    </HTML>
    Connection closed by foreign host.
  • Telnet looks up the hostname and opens a connection to the www.joes-hardware.com web server, which is listening on port 80. The three lines after the command are output from Telnet, telling us it has established a connection.
  • We then type in our request command, “GET /tools.html HTTP/1.1”, and send a Host header providing the original hostname, followed by a blank line, asking the server to GET us the resource “/tools.html” from the server www.joes-hardware.com. After that, the server responds with a response line, several response headers, a blank line, and finally the body of the HTML document.

1.7 Protocol Versions

  • HTTP/0.9: The 1991 prototype version of HTTP is known as HTTP/0.9. This protocol contains many serious design flaws and should be used only to interoperate with legacy clients. HTTP/0.9 supports only the GET method, and it does not support MIME typing of multimedia content, HTTP headers, or version numbers.
  • HTTP/1.0: HTTP/1.0 added version numbers, HTTP headers, additional methods, and multimedia object handling.
  • HTTP/1.0+: Many features, including long-lasting “keep-alive” connections, virtual hosting support, and proxy connection support, were added to HTTP and became unofficial, de facto standards. This informal, extended version of HTTP is often referred to as HTTP/1.0+.
  • HTTP/1.1: HTTP/1.1 focused on correcting architectural flaws in the design of HTTP, specifying semantics, introducing significant performance optimizations, and removing mis-features. HTTP/1.1 is the version these notes describe; HTTP/2 and HTTP/3 have since been standardized.
  • HTTP-NG(a.k.a. HTTP/2.0): HTTP-NG was a prototype proposal for an architectural successor to HTTP/1.1 that focused on significant performance optimizations and a more powerful framework for remote execution of server logic. It is a separate design from the HTTP/2 protocol that was standardized later.

1.8 Architectural Components of the Web

  • In this section, we outline several important applications, including:
    Proxies: HTTP intermediaries that sit between clients and servers.
    Caches: HTTP storehouses that keep copies of popular web pages close to clients.
    Gateways: Special web servers that connect to other applications.
    Tunnels: Special proxies that blindly forward HTTP communications.
    Agents: Semi-intelligent web clients that make automated HTTP requests.

1.8.1 Proxies

  • HTTP proxy servers are important building blocks for web security, application integration, and performance optimization.

  • As shown in Figure 1-11, a proxy sits between a client and a server, receiving all of the client’s HTTP requests and relaying the requests to the server(perhaps after modifying the requests). These applications act as a proxy for the user, accessing the server on the user’s behalf.
  • Proxies are often used for security, acting as trusted intermediaries through which all web traffic flows. Proxies can also filter requests and responses; for example, to detect application viruses in corporate downloads.

1.8.2 Caches

  • A web cache or caching proxy is a special type of HTTP proxy server that keeps copies of popular documents that pass through the proxy. The next client requesting the same document can be served from the cache’s personal copy(Figure 1-12).

  • A client may be able to download a document more quickly from a nearby cache than from a distant web server. HTTP defines many facilities to make caching more effective and to regulate the freshness and privacy of cached content.
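
The core idea of a cache fits in a few lines. The sketch below is deliberately naive: it keeps every document forever and ignores the freshness and privacy rules that HTTP defines. The class SimpleCache and the stand-in origin function fake_origin are hypothetical names invented for this example.

```python
from typing import Callable


class SimpleCache:
    """Keep a copy of each document the first time it is requested."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._copies: dict[str, bytes] = {}

    def get(self, url: str) -> bytes:
        if url not in self._copies:
            # Cache miss: ask the origin server and remember the answer.
            self._copies[url] = self._fetch(url)
        # Cache hit (or freshly stored copy): serve from the local store.
        return self._copies[url]


origin_hits = []


def fake_origin(url: str) -> bytes:
    """Stand-in for a distant web server; records every request it receives."""
    origin_hits.append(url)
    return b"<html>tools</html>"


cache = SimpleCache(fake_origin)
first = cache.get("http://www.joes-hardware.com/tools.html")
second = cache.get("http://www.joes-hardware.com/tools.html")
print(len(origin_hits))
```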

1.8.3 Gateways

  • Gateways are special servers that act as intermediaries for other servers. They are often used to convert HTTP traffic to another protocol. A gateway always receives requests as if it were the origin server for the resource. The client may not be aware it is communicating with a gateway.

  • For example, an HTTP/FTP gateway receives requests for FTP URIs via HTTP requests but fetches the documents using the FTP protocol(Figure 1-13). The resulting document is packed into an HTTP message and sent to the client.

1.8.4 Tunnels

  • Tunnels are HTTP applications that, after setup, blindly relay raw data between two connections. HTTP tunnels are often used to transport non-HTTP data over one or more HTTP connections, without looking at the data.

  • One use of HTTP tunnels is to carry encrypted Secure Sockets Layer(SSL) traffic through an HTTP connection, allowing SSL traffic through corporate firewalls that permit only web traffic(Figure 1-14). An HTTP/SSL tunnel receives an HTTP request to establish an outgoing connection to a destination address and port, then proceeds to tunnel the encrypted SSL traffic over the HTTP channel so that it can be blindly relayed to the destination server.
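
The notes above do not name the request that sets up a tunnel. In HTTP/1.1 this is commonly done with the CONNECT method, and the sketch below builds such a request under that assumption. The destination host and port and the function name build_connect_request are made up for this example.

```python
def build_connect_request(host: str, port: int) -> bytes:
    """Build a request asking an intermediary to open a tunnel to host:port."""
    target = f"{host}:{port}"
    # After the intermediary accepts, the bytes that follow are relayed blindly.
    return f"CONNECT {target} HTTP/1.1\r\nHost: {target}\r\n\r\n".encode("ascii")


# Hypothetical destination: an SSL server on the usual HTTPS port.
request = build_connect_request("www.example.com", 443)
print(request.decode("ascii"))
```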

1.8.5 Agents

  • User agents are client programs that make HTTP requests on the user’s behalf. Any application that issues web requests is an HTTP agent. Web browsers are the most familiar kind, but there are many other kinds of user agents.

  • There are machine-automated user agents that autonomously wander the Web, issuing HTTP transactions and fetching content, without human supervision. These automated agents are called “spiders” or “web robots”(Figure 1-15). Spiders wander the Web to build useful archives of web content, such as a search engine’s database or a product catalog for a comparison-shopping robot.

1.9 The End of the Beginning

1.10 For More Information

1.10.1 HTTP Protocol Information

  • http://www.w3.org/Protocols/: This page contains many great links about the HTTP protocol.
  • http://www.ietf.org/rfc/rfc2616.txt: RFC 2616, “Hypertext Transfer Protocol HTTP/1.1,” is the official specification for HTTP/1.1, the current version of the HTTP protocol.
  • http://www.ietf.org/rfc/rfc1945.txt: RFC 1945, “Hypertext Transfer Protocol HTTP/1.0,” is an informational RFC that describes the modern foundation for HTTP.

1.10.2 Historical Perspective

1.10.3 Other World Wide Web Information

  • http://www.ietf.org/rfc/rfc2396.txt: RFC 2396, “Uniform Resource Identifiers(URI): Generic Syntax,” is the detailed reference for URIs and URLs.
  • http://www.ietf.org/rfc/rfc2141.txt: RFC 2141, “URN Syntax,” is a 1997 specification describing URN syntax.
  • http://www.ietf.org/rfc/rfc2046.txt: RFC 2046, “MIME Part 2: Media Types,” is the second in a suite of five Internet specifications defining the Multipurpose Internet Mail Extensions standard for multimedia content management.
