How to get a web page content type
Source: Internet · Editor: 程序博客网 · Time: 2024/04/30 05:57
A web page’s content type tells you the page's MIME type (such as “text/html” or “image/png”) and the character set used by the page text. You'll need the character set to interpret the page's characters when processing text for a search engine or keyword extractor. The content type should be in the web server’s HTTP header for the page, but it can also be set in an HTML file’s <meta> tag or an XML file’s <?xml> tag. This tip shows how to get the page's content type and extract the MIME type and character set.
Table of Contents
- Code
  - If you are using CURL...
  - If you are using the fopen wrappers...
  - If you have an HTML or XHTML file, but not the HTTP header...
  - If you have an XML file, but not the HTTP header...
- Example
- Explanation
  - Content type in the HTTP header
  - Content type in an HTML/XHTML <meta> tag
  - Content type in an XML <?xml> tag
- Further reading
  - Related articles at NadeauSoftware.com
  - Web articles and specifications
This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, extract URLs on the page, strip away HTML syntax, punctuation, symbol characters, and numbers, and break a page down into a keyword list.
Code
If you are using CURL...
After calling curl_exec() to get a web page, call curl_getinfo() to get the content type string from the HTTP header, such as:
text/html; charset=utf-8
Use preg_match() to get the MIME type and (optional) character set:
/* Get the content type from CURL */
$content_type = curl_getinfo( $ch, CURLINFO_CONTENT_TYPE );
/* Get the MIME type and character set (the space after ";" is optional) */
preg_match( '@([\w/+.-]+)(;\s*charset=(\S+))?@i', (string) $content_type, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];
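The pattern above assumes an unquoted character set. Servers may also send the value in double quotes or with different spacing, so a slightly more tolerant parser can be sketched as follows. The function name parse_content_type() is hypothetical and is not part of the original article; it is a minimal sketch, not a full parser for HTTP header parameters.

```php
<?php
// Hypothetical helper (not from the original article): split a Content-Type
// value into a lowercase MIME type and an optional lowercase character set.
function parse_content_type( $value ) {
    $value  = (string) $value;
    $result = array( 'mime' => null, 'charset' => null );

    // MIME type: "type/subtype", e.g. "application/xhtml+xml"
    if ( preg_match( '@^\s*([\w.+-]+/[\w.+-]+)@', $value, $m ) )
        $result['mime'] = strtolower( $m[1] );

    // charset parameter: optional whitespace and optional double quotes
    if ( preg_match( '@;\s*charset\s*=\s*"?([^\s";]+)@i', $value, $m ) )
        $result['charset'] = strtolower( $m[1] );

    return $result;
}

$info = parse_content_type( 'text/html;charset="UTF-8"' );
echo $info['mime'], ' ', $info['charset'], "\n";   // prints "text/html utf-8"
```

Lowercasing both parts makes later comparisons simpler, since MIME types and character set names are case-insensitive.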
If you are using the fopen wrappers...
After reading the web page using file_get_contents(), or one of the other file functions, use the $http_response_header variable to get the HTTP header as an array of strings. PHP creates this variable in the local scope of the code that made the request, so read it in the same function. Look for the last entry that starts with “Content-Type:” (there will be multiple entries like this if the page was redirected; the last one is for the returned page):
Content-Type: text/html; charset=utf-8
Use preg_match() to get the MIME type and (optional) character set:
/* Get the content type from the HTTP response */
$content_type = '';
$nlines = count( $http_response_header );
for ( $i = $nlines-1; $i >= 0; $i-- ) {
    $line = $http_response_header[$i];
    if ( substr_compare( $line, 'Content-Type', 0, 12, true ) == 0 ) {
        $content_type = $line;
        break;
    }
}
/* Get the MIME type and character set (the space after ";" is optional) */
preg_match( '@Content-Type:\s*([\w/+.-]+)(;\s*charset=(\S+))?@i', $content_type, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];
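To see why the loop runs backwards, the search can be tried on a fixed array of header lines without making a network request. The sketch below assumes a hypothetical helper named last_content_type() and made-up header lines for a request that was redirected once.

```php
<?php
// Hypothetical helper (not from the original article): return the value of
// the last Content-Type line in an array of raw header lines, or null.
function last_content_type( array $headers ) {
    for ( $i = count( $headers ) - 1; $i >= 0; $i-- ) {
        if ( preg_match( '@^Content-Type:\s*(.*)$@i', $headers[$i], $m ) )
            return trim( $m[1] );
    }
    return null;
}

// Made-up header lines in the same shape as $http_response_header.
$sample = array(
    'HTTP/1.1 301 Moved Permanently',
    'Content-Type: text/plain',
    'Location: http://example.com/new',
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
    'Content-Length: 1234',
);
echo last_content_type( $sample ), "\n";   // prints "text/html; charset=utf-8"
```

Searching from the end skips the “text/plain” entry that belongs to the redirect response and returns the entry for the page that was actually delivered.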
If you have an HTML or XHTML file, but not the HTTP header...
A web page stored in a local file has no HTTP header. Read the file into a string and look for a <meta> tag for (X)HTML where the http-equiv attribute reads “Content-Type”.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Use preg_match() to find the tag and get the MIME type and (optional) character set:
/* Get the MIME type and character set */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/+.-]+)(;\s*charset=([^\s"]+))?@i',
    $page, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];
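The pattern above only matches when the attributes appear in exactly that order and with double quotes. Pages may also use the shorter HTML5 form <meta charset="utf-8">, single quotes, or the reverse attribute order. A more forgiving lookup might be sketched like this; meta_charset() is a hypothetical name, and because the sketch accepts any “charset=” text inside a <meta> tag it can be fooled by unusual content attributes.

```php
<?php
// Hypothetical helper (not from the original article): find the character set
// declared in the first <meta> tag that mentions one, or null if none does.
function meta_charset( $html ) {
    // Collect every <meta ...> tag in document order
    if ( !preg_match_all( '@<meta\b[^>]*>@i', $html, $tags ) )
        return null;

    foreach ( $tags[0] as $tag ) {
        // Covers charset="utf-8" and content="text/html; charset=utf-8"
        if ( preg_match( '@charset\s*=\s*["\']?([^\s"\'/>;]+)@i', $tag, $m ) )
            return strtolower( $m[1] );
    }
    return null;
}

echo meta_charset( '<head><meta charset="UTF-8"></head>' ), "\n";   // prints "utf-8"
```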
If you have an XML file, but not the HTTP header...
XML data stored in a local file has no HTTP header. Read the file into a string and look for an <?xml> tag:
<?xml version="1.0" encoding="UTF-8" ?>
Use preg_match() to find the tag and get the character set (the <?xml> tag carries no MIME type; “application/xml” is the generic MIME type for XML):
/* Get the character set (look only inside the <?xml> tag itself) */
preg_match( '@<\?xml[^>]*encoding="([^\s"]+)@i', $page, $matches );
$mime = 'application/xml';
if ( isset( $matches[1] ) )
    $charset = $matches[1];
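When the <?xml> tag is missing or has no encoding attribute, XML parsers fall back to UTF-8 unless a byte order mark or an HTTP header says otherwise. The sketch below wraps the lookup in a hypothetical function, xml_encoding(), that applies that default. It ignores byte order marks and tolerates leading whitespace, which strict XML does not allow, so treat it as an illustration only.

```php
<?php
// Hypothetical helper (not from the original article): read the encoding
// from the XML declaration, defaulting to UTF-8 when none is declared.
function xml_encoding( $xml ) {
    // Step 1: isolate the declaration at the start of the document.
    // Step 2: look for encoding="..." (or single quotes) inside it only.
    if ( preg_match( '@^\s*<\?xml\b([^>]*)\?>@', $xml, $decl ) &&
         preg_match( '@\bencoding\s*=\s*["\']([A-Za-z][\w.-]*)["\']@', $decl[1], $m ) )
        return strtoupper( $m[1] );

    return 'UTF-8';   // assumed default when nothing is declared
}

echo xml_encoding( '<?xml version="1.0" encoding="iso-8859-1" ?><doc/>' ), "\n";   // prints "ISO-8859-1"
```

Restricting the second match to the declaration avoids picking up an unrelated encoding="..." attribute further down in the document.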
Example
Read an HTML file, get the character set, and convert to UTF-8:
$text = file_get_contents( $filename );
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/+.-]+)(;\s*charset=([^\s"]+))?@i',
    $text, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
/* Convert only when a character set was found; otherwise keep the text as read */
$utf8_text = $text;
if ( isset( $matches[3] ) ) {
    $charset = $matches[3];
    $utf8_text = iconv( $charset, "utf-8", $text );
}
See the PHP iconv() manual for how to use the function to convert from a web page’s original character set to UTF-8 (or any other character set).
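iconv() returns false when it cannot perform the conversion, for example when a page declares a character set name that the iconv library does not know. A defensive wrapper might look like the sketch below. The name to_utf8() and the choice of ISO-8859-1 as the fallback character set are assumptions made for illustration, not recommendations from the original article.

```php
<?php
// Hypothetical helper (not from the original article): convert text to UTF-8,
// returning the input unchanged if the conversion fails.
function to_utf8( $text, $charset ) {
    // Assumed fallback when no character set was found for the page
    if ( $charset === null || $charset === '' )
        $charset = 'ISO-8859-1';

    // iconv() returns false on failure; "@" hides the warning it raises
    $converted = @iconv( $charset, 'UTF-8', $text );
    return ( $converted === false ) ? $text : $converted;
}

// "caf" followed by byte 0xE9, which is "é" in ISO-8859-1
echo bin2hex( to_utf8( "caf\xE9", 'ISO-8859-1' ) ), "\n";   // prints "636166c3a9"
```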
Explanation
The content type of a page gives the file’s MIME type and character set:
- MIME type: the file’s type, such as text, HTML, an image, a sound, etc. (see Wikipedia on MIME types and a list of standard Internet Media Types). Some common types include:

  MIME type               Meaning
  text/plain              Plain text file
  text/html               HTML web page
  application/xml         XML data file
  application/xhtml+xml   XHTML web page
  image/jpeg              JPEG image
  image/png               PNG image
  image/gif               GIF image

- Character set: the character encoding of the file (see Wikipedia on Character encoding). UTF-8 is widely used for web pages because it can represent characters in any of the world’s languages. Some sites may still use language-specific encodings, such as Big5 for traditional Chinese characters, the ISO 8859 family for Latin and other alphabets, or any of several old Microsoft Windows-specific character sets.
When building a PHP search engine or page analysis tool, the MIME type tells you whether you’ve got an HTML page, an image, a sound, or some other kind of file. The character set tells you the encoding used by the page. For most tasks, you’ll need to convert the page to UTF-8 using iconv() before parsing the text.
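A tool that extracts keywords would typically use the MIME type to skip files it cannot parse as text. One way to make that decision is sketched below. The function name is_parseable_text() and the particular list of accepted types are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper (not from the original article): decide whether a
// MIME type describes text that is worth parsing for keywords.
function is_parseable_text( $mime ) {
    $mime = strtolower( trim( (string) $mime ) );

    // Any "text/..." type, such as text/plain or text/html
    if ( strpos( $mime, 'text/' ) === 0 )
        return true;

    // Text-based formats that live under "application/"
    return in_array( $mime, array( 'application/xml', 'application/xhtml+xml' ), true );
}

var_dump( is_parseable_text( 'text/html' ) );   // bool(true)
var_dump( is_parseable_text( 'image/png' ) );   // bool(false)
```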
Content type in the HTTP header
The content type for a page is specified by a text string included in the HTTP response header from the web server, such as:
Content-Type: text/html; charset=utf-8
After the string “Content-Type:”, the first part of the value is the MIME type and the second part is the character set. The character set is optional, but it should always be included for text-based content. For images, sounds, and other non-text content, there will be no character set specified.
Content type in an HTML/XHTML <meta> tag
If the HTTP header doesn’t include the content type (which may indicate a misconfigured web server), the content type may be included within a <meta> tag in HTML and XHTML files, such as:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
As with the HTTP header, the first part of the type is the MIME type, and the second part is the character set. These values are supposed to match the HTTP header content type, if given. If they don’t, the HTTP header has precedence over the <meta> tag.
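The precedence rule can be written down directly. The sketch below assumes the two character sets have already been extracted (either may be null when absent) and uses a hypothetical function name, pick_charset(). The default of “utf-8” for pages that declare nothing is also an assumption.

```php
<?php
// Hypothetical helper (not from the original article): choose a character
// set, giving the HTTP header precedence over the <meta> tag.
function pick_charset( $header_charset, $meta_charset, $default = 'utf-8' ) {
    // Candidates in order of precedence; the first usable one wins
    foreach ( array( $header_charset, $meta_charset ) as $candidate ) {
        if ( is_string( $candidate ) && $candidate !== '' )
            return strtolower( $candidate );
    }
    return $default;   // assumed fallback when neither source names one
}

echo pick_charset( 'ISO-8859-1', 'utf-8' ), "\n";   // prints "iso-8859-1"
echo pick_charset( null, 'Big5' ), "\n";            // prints "big5"
```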
The content type <meta> tag is optional, but if it is given it should be as close to the top of the <head> section as possible. This enables web browsers, and PHP code, to find the content type quickly and then use the character set name to guide handling of the rest of the page text.
Content type in an XML <?xml> tag
For XML files, the character set may be within a <?xml> tag near the top of the file, such as:
<?xml version="1.0" encoding="UTF-8" ?>
XML doesn’t specify a MIME type in the <?xml> tag. Instead, the generic MIME type for XML is “application/xml”; specific XML formats, such as XHTML with “application/xhtml+xml”, have their own types.