【翻译】What really happens when you navigate to a URL

来源:互联网 发布:seo转行 编辑:程序博客网 时间:2024/05/16 15:36

本文目录

    • 本文目录
    • You enter a URL into the browser
    • The browser looks up the IP address for the domain name
    • The browser sends a HTTP request to the web server
    • The facebook server responds with a permanent redirect
    • The browser follows the redirect
    • The server handles the request
    • The server sends back a HTML response
    • The browser begins rendering the HTML
    • The browser sends requests for objects embedded in HTML
    • The browser sends further asynchronous AJAX requests
    • Conclusion
    • 引用文献

As a software developer, you certainly have a high-level picture of how web apps work and what kinds of technologies are involved: the browser, HTTP, HTML, web server, request handlers, and so on.

In this article, we will take a deeper look at the sequence of events that take place when you visit a URL.

作为一个软件开发者,你一定会对网络应用如何工作有一个完整的层次化的认知,同样这里也包括这些应用所用到的技术:像浏览器,HTTP,HTML,网络服务器,需求处理等等。
本文将更深入的研究当你输入一个网址的时候,后台到底发生了一件件什么样的事


1. You enter a URL into the browser

It all starts here:
首先,在浏览器里输入要网址:
这里写图片描述

2. The browser looks up the IP address for the domain name

浏览器查找域名的IP地址

这里写图片描述

The first step in the navigation is to figure out the IP address for the visited domain. The DNS lookup proceeds as follows:
浏览器导航的第一步是通过访问的域名找出其IP地址。DNS查找过程如下:

  1. Browser cache – The browser caches DNS records for some time. Interestingly, the OS does not tell the browser the time-to-live for each DNS record, and so the browser caches them for a fixed duration (varies between browsers, 2 – 30 minutes).

    浏览器缓存 – 浏览器会缓存DNS记录一段时间。 有趣的是,操作系统没有告诉浏览器储存DNS记录的时间,这样不同浏览器会储存个自固定的一个时间(2分钟到30分钟不等)。

  2. OS cache – If the browser cache does not contain the desired record, the browser makes a system call (gethostbyname in Windows). The OS has its own cache.
    系统缓存 – 如果在浏览器缓存里没有找到需要的记录,浏览器会做一个系统调用(windows里是gethostbyname)。这样便可获得系统缓存中的记录。

  3. Router cache – The request continues on to your router, which typically has its own DNS cache.
    路由器缓存 – 接着,前面的查询请求发向路由器,它一般会有自己的DNS缓存。

  4. ISP DNS cache – The next place checked is the cache ISP’s DNS server. With a cache, naturally.
    ISP DNS 缓存 – 接下来要check的就是ISP缓存DNS的服务器。在这一般都能找到相应的缓存记录。

  5. Recursive search – Your ISP’s DNS server begins a recursive search, from the root nameserver, through the .com top-level nameserver, to Facebook’s nameserver. Normally, the DNS server will have names of the .com nameservers in cache, and so a hit to the root nameserver will not be necessary.

递归搜索 – 你的ISP的DNS服务器从跟域名服务器开始进行递归搜索,从.com顶级域名服务器到Facebook的域名服务器。一般DNS服务器的缓存中会有.com域名服务器中的域名,所以到顶级服务器的匹配过程不是那么必要了。


Here is a diagram of what a recursive DNS search looks like:
DNS递归查找如下图所示:
这里写图片描述

One worrying thing about DNS is that the entire domain like wikipedia.org or facebook.com seems to map to a single IP address. Fortunately, there are ways of mitigating the bottleneck:

DNS有一点令人担忧,这就是像wikipedia.org 或者 facebook.com这样的整个域名看上去只是对应一个单独的IP地址。还好,有几种方法可以消除这个瓶颈:

  • 循环 DNS 是DNS查找时返回多个IP时的解决方案。举例来说,Facebook.com实际上就对应了四个IP地址。

  • 负载平衡器 是以一个特定IP地址进行侦听并将网络请求转发到集群服务器上的硬件设备。 一些大型的站点一般都会使用这种昂贵的高性能负载平衡器。

  • 地理 DNS 根据用户所处的地理位置,通过把域名映射到多个不同的IP地址提高可扩展性。这样不同的服务器不能够更新同步状态,但映射静态内容的话非常好。

  • Anycast 是一个IP地址映射多个物理主机的路由技术。 美中不足,Anycast与TCP协议适应的不是很好,所以很少应用在那些方案中。
    大多数DNS服务器使用Anycast来获得高效低延迟的DNS查找。

3.The browser sends a HTTP request to the web server

浏览器给web服务器发送一个HTTP请求
这里写图片描述

You can be pretty sure that Facebook’s homepage will not be served from the browser cache because dynamic pages expire either very quickly or immediately (expiry date set to past).
So, the browser will send this request to the Facebook server:

因为像Facebook主页这样的动态页面,打开后在浏览器缓存中很快甚至马上就会过期,毫无疑问他们不能从中读取。
所以,浏览器将把一下请求发送到Facebook所在的服务器:

GET http://facebook.com/ HTTP/1.1Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]Accept-Encoding: gzip, deflateConnection: Keep-AliveHost: facebook.comCookie: datr=1265876274-[...]; locale=en_US; lsd=WW[...]; c_user=2101[...]

The GET request names the URL to fetch: “http://facebook.com/”. The browser identifies itself (User-Agent header), and states what types of responses it will accept (Accept and Accept-Encoding headers). The Connection header asks the server to keep the TCP connection open for further requests.

GET 这个请求定义了要读取的URL: “http://facebook.com/”。 浏览器自身定义 (User-Agent 头), 和它希望接受什么类型的相应 (Accept andAccept-Encoding 头). Connection头要求服务器为了后边的请求不要关闭TCP连接。

The request also contains the cookies that the browser has for this domain. As you probably already know, cookies are key-value pairs that track the state of a web site in between different page requests. And so the cookies store the name of the logged-in user, a secret number that was assigned to the user by the server, some of user’s settings, etc. The cookies will be stored in a text file on the client, and sent to the server with every request.

请求中也包含浏览器存储的该域名的cookies。可能你已经知道,在不同页面请求当中,cookies是与跟踪一个网站状态相匹配的键值。这样cookies会存储登录用户名,服务器分配的密码和一些用户设置等。Cookies会以文本文档形式存储在客户机里,每次请求时发送给服务器。

There is a variety of tools that let you view the raw HTTP requests and corresponding responses. My favorite tool for viewing the raw HTTP traffic is fiddler, but there are many other tools (e.g., FireBug) These tools are a great help when optimizing a site.

用来看原始HTTP请求及其相应的工具很多。作者比较喜欢使用fiddler,当然也有像FireBug这样其他的工具。这些软件在网站优化时会帮上很大忙。

In addition to GET requests, another type of requests that you may be familiar with is a POST request, typically used to submit forms. A GET request sends its parameters via the URL (e.g.: http://robozzle.com/puzzle.aspx?id=85). A POST request sends its parameters in the request body, just under the headers.

除了获取请求,还有一种是发送请求,它常在提交表单用到。发送请求通过URL传递其参数(e.g.: http://robozzle.com/puzzle.aspx?id=85)。发送请求在请求正文头之后发送其参数。

The trailing slash in the URL “http://facebook.com/” is important. In this case, the browser can safely add the slash. For URLs of the form http://example.com/folderOrFile, the browser cannot automatically add a slash, because it is not clear whether folderOrFile is a folder or a file. In such cases, the browser will visit the URL without the slash, and the server will respond with a redirect, resulting in an unnecessary roundtrip.

像“http://facebook.com/”中的斜杠是至关重要的。这种情况下,浏览器能安全的添加斜杠。而像“http: //example.com/folderOrFile”这样的地址,因为浏览器不清楚folderOrFile到底是文件夹还是文件,所以不能自动添加 斜杠。这时,浏览器就不加斜杠直接访问地址,服务器会响应一个重定向,结果造成一次不必要的握手。


4. The facebook server responds with a permanent redirect

facebook服务的永久重定向响应
这里写图片描述
This is the response that the Facebook server sent back to the browser request:
图中所示为Facebook服务器发回给浏览器的响应:

HTTP/1.1 301 Moved PermanentlyCache-Control: private, no-store, no-cache, must-revalidate, post-check=0,      pre-check=0Expires: Sat, 01 Jan 2000 00:00:00 GMTLocation: http://www.facebook.com/P3P: CP="DSP LAW"Pragma: no-cacheSet-Cookie: made_write_conn=deleted; expires=Thu, 12-Feb-2009 05:09:50 GMT;      path=/; domain=.facebook.com; httponlyContent-Type: text/html; charset=utf-8X-Cnection: closeDate: Fri, 12 Feb 2010 05:09:51 GMTContent-Length: 0

The server responded with a 301 Moved Permanently response to tell the browser to go to “http://www.facebook.com/” instead of “http://facebook.com/”.

服务器给浏览器响应一个301永久重定向响应,这样浏览器就会访问“http://www.facebook.com/” 而非“http://facebook.com/”。

There are interesting reasons why the server insists on the redirect instead of immediately responding with the web page that the user wants to see.

One reason has to do with search engine rankings. See, if there are two URLs for the same page, say http://www.igoro.com/ and http://igoro.com/, search engine may consider them to be two different sites, each with fewer incoming links and thus a lower ranking. Search engines understand permanent redirects (301), and will combine the incoming links from both sources into a single ranking.

Also, multiple URLs for the same content are not cache-friendly. When a piece of content has multiple names, it will potentially appear multiple times in caches.

为什么服务器一定要重定向而不是直接发会用户想看的网页内容呢?这个问题有好多有意思的答案。

其中一个原因跟搜索引擎排名有 关。你看,如果一个页面有两个地址,就像http://www.igoro.com/ 和http://igoro.com/,搜索引擎会认为它们是两个网站,结果造成每一个的搜索链接都减少从而降低排名。而搜索引擎知道301永久重定向是 什么意思,这样就会把访问带www的和不带www的地址归到同一个网站排名下。

还有一个是用不同的地址会造成缓存友好性变差。当一个页面有好几个名字时,它可能会在缓存里出现好几次。


5. The browser follows the redirect

浏览器跟踪重定向地址
这里写图片描述

GET http://www.facebook.com/ HTTP/1.1Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]Accept-Language: en-USUser-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]Accept-Encoding: gzip, deflateConnection: Keep-AliveCookie: lsd=XW[...]; c_user=21[...]; x-referer=[...]Host: www.facebook.com

The meaning of the headers is the same as for the first request.
头信息以之前请求中的意义相同。


6. The server ‘handles’ the request

服务器“处理”请求

这里写图片描述

The server will receive the GET request, process it, and send back a response.
This may seem like a straightforward task, but in fact there is a lot of interesting stuff that happens here – even on a simple site like my blog, let alone on a massively scalable site like facebook.

服务器接收到获取请求,然后处理并返回一个响应。
这表面上看起来是一个顺向的任务,但其实这中间发生了很多有意思的东西- 就像作者博客这样简单的网站,何况像facebook那样访问量大的网站呢!

  • Web server software
    The web server software (e.g., IIS or Apache) receives the HTTP request and decides which request handler should be executed to handle this request. A request handler is a program (in ASP.NET, PHP, Ruby, …) that reads the request and generates the HTML for the response.
    In the simplest case, the request handlers can be stored in a file hierarchy whose structure mirrors the URL structure, and so for example http://example.com/folder1/page1.aspx URL will map to file /httpdocs/folder1/page1.aspx. The web server software can also be configured so that URLs are manually mapped to request handlers, and so the public URL of page1.aspx could be http://example.com/folder1/page1.
    Web 服务器软件
    web服务器软件(像IIS和阿帕奇)接收到HTTP请求,然后确定执行什么请求处理来处理它。请求处理就是一个能够读懂请求并且能生成HTML来进行响应的程序(像ASP.NET,PHP,RUBY…)。
    举 个最简单的例子,需求处理可以以映射网站地址结构的文件层次存储。像http://example.com/folder1/page1.aspx这个地 址会映射/httpdocs/folder1/page1.aspx这个文件。web服务器软件可以设置成为地址人工的对应请求处理,这样 page1.aspx的发布地址就可以是http://example.com/folder1/page1。

  • Request handler
    The request handler reads the request, its parameters, and cookies. It will read and possibly update some data stored on the server. Then, the request handler will generate a HTML response.
    请求处理
    请求处理阅读请求及它的参数和cookies。它会读取也可能更新一些数据,并讲数据存储在服务器上。然后,需求处理会生成一个HTML响应。

One interesting difficulty that every dynamic website faces is how to store data. Smaller sites will often have a single SQL database to store their data, but sites that store a large amount of data and/or have many visitors have to find a way to split the database across multiple machines. Solutions include sharding (splitting up a table across multiple databases based on the primary key), replication, and usage of simplified databases with weakened consistency semantics.

所 有动态网站都面临一个有意思的难点 -如何存储数据。小网站一半都会有一个SQL数据库来存储数据,存储大量数据和/或访问量大的网站不得不找一些办法把数据库分配到多台机器上。解决方案 有:sharding (基于主键值讲数据表分散到多个数据库中),复制,利用弱语义一致性的简化数据库。

One technique to keep data updates cheap is to defer some of the work to a batch job. For example, Facebook has to update the newsfeed in a timely fashion, but the data backing the “People you may know” feature may only need to be updated nightly (my guess, I don’t actually know how they implement this feature). Batch job updates result in staleness of some less important data, but can make data updates much faster and simpler.

委 托工作给批处理是一个廉价保持数据更新的技术。举例来讲,Fackbook得及时更新新闻feed,但数据支持下的“你可能认识的人”功能只需要每晚更新 (作者猜测是这样的,改功能如何完善不得而知)。批处理作业更新会导致一些不太重要的数据陈旧,但能使数据更新耕作更快更简洁。

7. The server sends back a HTML response

服务器发回一个HTML响应
这里写图片描述

Here is the response that the server generated and sent back:
图中为服务器生成并返回的响应:

HTTP/1.1 200 OKCache-Control: private, no-store, no-cache, must-revalidate, post-check=0,    pre-check=0Expires: Sat, 01 Jan 2000 00:00:00 GMTP3P: CP="DSP LAW"Pragma: no-cacheContent-Encoding: gzipContent-Type: text/html; charset=utf-8X-Cnection: closeTransfer-Encoding: chunkedDate: Fri, 12 Feb 2010 09:05:55 GMT2b3��������T�n�@����[...]

The entire response is 36 kB, the bulk of them in the byte blob at the end that I trimmed.

The Content-Encoding header tells the browser that the response body is compressed using the gzip algorithm. After decompressing the blob, you’ll see the HTML you’d expect:

整个响应大小为36kB,其中大部分在整理后以blob类型传输。

内容编码头告诉浏览器整个响应体用gzip算法进行压缩。解压blob块后,你可以看到如下期望的HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"       lang="en" id="facebook" class=" no_js"><head><meta http-equiv="Content-type" content="text/html; charset=utf-8" /><meta http-equiv="Content-language" content="en" />...

In addition to compression, headers specify whether and how to cache the page, any cookies to set (none in this response), privacy information, etc.

关于压缩,头信息说明了是否缓存这个页面,如果缓存的话如何去做,有什么cookies要去设置(前面这个响应里没有这点)和隐私信息等等。

Notice the header that sets Content-Type to text/html. The header instructs the browser to render the response content as HTML, instead of say downloading it as a file. The browser will use the header to decide how to interpret the response, but will consider other factors as well, such as the extension of the URL.

请注意报头中把Content-type设置为“text/html”。报头让浏览器将该响应内容以HTML形式呈现,而不是以文件形式下载它。浏览器会根据报头信息决定如何解释该响应,不过同时也会考虑像URL扩展内容等其他因素。


8. The browser begins rendering the HTML

浏览器开始渲染显示HTML

Even before the browser has received the entire HTML document, it begins rendering the website:

在浏览器接收到全部HTML代码之前,浏览器就已经开始渲染网页了。

这里写图片描述

9. The browser sends requests for objects embedded in HTML

浏览器发送获取嵌入在HTML中的对象
这里写图片描述
As the browser renders the HTML, it will notice tags that require fetching of other URLs. The browser will send a GET request to retrieve each of these files.

Here are a few URLs that my visit to facebook.com retrieved:

在浏览器显示HTML时,它会注意到需要获取其他地址内容的标签。这时,浏览器会发送一个获取请求来重新获得这些文件。

下面是几个我们访问facebook.com时需要重获取的几个URL:

#Images  //图片http://static.ak.fbcdn.net/rsrc.php/z12E0/hash/8q2anwu7.gifhttp://static.ak.fbcdn.net/rsrc.php/zBS5C/hash/7hwy7at6.gif#CSS style sheets   //CSShttp://static.ak.fbcdn.net/rsrc.php/z448Z/hash/2plh8s4n.csshttp://static.ak.fbcdn.net/rsrc.php/zANE1/hash/cvtutcee.css#JavaScript files   //JShttp://static.ak.fbcdn.net/rsrc.php/zEMOA/hash/c8yzb6ub.jshttp://static.ak.fbcdn.net/rsrc.php/z6R9L/hash/cq2lgbs8.js

Each of these URLs will go through process a similar to what the HTML page went through. So, the browser will look up the domain name in DNS, send a request to the URL, follow redirects, etc.

这些地址都要经历一个和HTML读取类似的过程。所以浏览器会在DNS中查找这些域名,发送请求,重定向等等…

However, static files – unlike dynamic pages – allow the browser to cache them. Some of the files may be served up from cache, without contacting the server at all. The browser knows how long to cache a particular file because the response that returned the file contained an Expires header. Additionally, each response may also contain an ETag header that works like a version number – if the browser sees an ETag for a version of the file it already has, it can stop the transfer immediately.

但 不像动态页面那样,静态文件会允许浏览器对其进行缓存。有的文件可能会不需要与服务器通讯,而从缓存中直接读取。服务器的响应中包含了静态文件保存的期限 信息,所以浏览器知道要把它们缓存多长时间。还有,每个响应都可能包含像版本号一样工作的ETag头(被请求变量的实体值),如果浏览器观察到文件的版本 ETag信息已经存在,就马上停止这个文件的传输。

Can you guess what “fbcdn.net” in the URLs stands for? A safe bet is that it means “Facebook content delivery network”. Facebook uses a content delivery network (CDN) to distribute static content – images, style sheets, and JavaScript files. So, the files will be copied to many machines across the globe.

试着猜猜看“fbcdn.NET”在地址中代表什么?聪明的答案是”Facebook内容分发网络”。Facebook利用内容分发网络(CDN)分发像图片,CSS表和JavaScript文件这些静态文件。所以,这些文件会在全球很多CDN的数据中心中留下备份。

Static content often represents the bulk of the bandwidth of a site, and can be easily replicated across a CDN. Often, sites will use a third-party CDN provider, instead of operating a CND themselves. For example, Facebook’s static files are hosted by Akamai, the largest CDN provider.

静态内容往往代表站点的带宽大小,也能通过CDN轻松的复制。通常网站会使用第三方的CDN。例如,Facebook的静态文件由最大的CDN提供商Akamai来托管。

As a demonstration, when you try to ping static.ak.fbcdn.net, you will get a response from an akamai.net server. Also, interestingly, if you ping the URL a couple of times, may get responses from different servers, which demonstrates the load-balancing that happens behind the scenes.

举例来讲,当你试着ping static.ak.fbcdn.Net的时候,可能会从某个akamai.net服务器上获得响应。有意思的是,当你同样再ping一次的时候,响应的服务器可能就不一样,这说明幕后的负载平衡开始起作用了。

10. The browser sends further asynchronous (AJAX) requests

浏览器发送异步(AJAX)请求
这里写图片描述

In the spirit of Web 2.0, the client continues to communicate with the server even after the page is rendered.

在Web 2.0伟大精神的指引下,页面显示完成后客户端仍与服务器端保持着联系。

For example, Facebook chat will continue to update the list of your logged in friends as they come and go. To update the list of your logged-in friends, the JavaScript executing in your browser has to send an asynchronous request to the server. The asynchronous request is a programmatically constructed GET or POST request that goes to a special URL. In the Facebook example, the client sends a POST request to http://www.facebook.com/ajax/chat/buddy_list.php to fetch the list of your friends who are online.

以 Facebook聊天功能为例,它会持续与服务器保持联系来及时更新你那些亮亮灰灰的好友状态。为了更新这些头像亮着的好友状态,在浏览器中执行的 JavaScript代码会给服务器发送异步请求。这个异步请求发送给特定的地址,它是一个按照程式构造的获取或发送请求。还是在Facebook这个例 子中,客户端发送给http://www.facebook.com/ajax/chat/buddy_list.PHP一个发布请求来获取你好友里哪个 在线的状态信息。

This pattern is sometimes referred to as “AJAX”, which stands for “Asynchronous JavaScript And XML”, even though there is no particular reason why the server has to format the response as XML. For example, Facebook returns snippets of JavaScript code in response to asynchronous requests.

提起这个模式,就必须要讲讲”AJAX”– “异步JavaScript 和 XML”,虽然服务器为什么用XML格式来进行响应也没有个一清二白的原因。再举个例子吧,对于异步请求,Facebook会返回一些JavaScript的代码片段。

Among other things, the fiddler tool lets you view the asynchronous requests sent by your browser. In fact, not only you can observe the requests passively, but you can also modify and resend them. The fact that it is this easy to “spoof” AJAX requests causes a lot of grief to developers of online games with scoreboards. (Obviously, please don’t cheat that way.)

除了其他,fiddler这个工具能够让你看到浏览器发送的异步请求。事实上,你不仅可以被动的做为这些请求的看客,还能主动出击修改和重新发送它们。AJAX请求这么容易被蒙,可着实让那些计分的在线游戏开发者们郁闷的了。(当然,可别那样骗人家~)

Facebook chat provides an example of an interesting problem with AJAX: pushing data from server to client. Since HTTP is a request-response protocol, the chat server cannot push new messages to the client. Instead, the client has to poll the server every few seconds to see if any new messages arrived.

Facebook聊天功能提供了关于AJAX一个有意思的问题案例:把数据从服务器端推送到客户端。因为HTTP是一个请求-响应协议,所以聊天服务器不能把新消息发给客户。取而代之的是客户端不得不隔几秒就轮询下服务器端看自己有没有新消息。

Long polling is an interesting technique to decrease the load on the server in these types of scenarios. If the server does not have any new messages when polled, it simply does not send a response back. And, if a message for this client is received within the timeout period, the server will find the outstanding request and return the message with the response.

这些情况发生时长轮询是个减轻服务器负载挺有趣的技术。如果当被轮询时服务器没有新消息,它就不理这个客户端。而当尚未超时的情况下收到了该客户的新消息,服务器就会找到未完成的请求,把新消息做为响应返回给客户端。

11.Conclusion

Hopefully this gives you a better idea of how the different web pieces work together.

总结一下
希望看了本文,你能明白不同的网络模块是如何协同工作的

12.引用文献

原文在这里 : http://igoro.com/archive/what-really-happens-when-you-navigate-to-a-url/

1 0