How to get a web page content type

A web page’s content type tells you the page's MIME type (such as “text/html” or “image/png”) and the character set used by the page text. You'll need the character set to interpret the page's characters correctly when processing text for a search engine or keyword extractor. The content type should be in the web server’s HTTP header for the page, but it can also be set in an HTML file’s <meta> tag, or an XML file’s <?xml> tag. This tip shows how to get the page's content type and extract the MIME type and character set.

Table of Contents

  1. Code
    1. If you are using CURL...
    2. If you are using the fopen wrappers...
    3. If you have an HTML or XHTML file, but not the HTTP header...
    4. If you have an XML file, but not the HTTP header...
  2. Example
  3. Explanation
    1. Content type in the HTTP header
    2. Content type in an HTML/XHTML <meta> tag
    3. Content type in an XML <?xml> tag
  4. Further reading
    1. Related articles at NadeauSoftware.com
    2. Web articles and specifications

This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, extract URLs on the page, strip away HTML syntax, punctuation, symbol characters, and numbers, and break a page down into a keyword list.

Code

If you are using CURL...

After calling curl_exec() to get a web page, call curl_getinfo() to get the content type string from the HTTP header, such as:

text/html; charset=utf-8

Use preg_match() to get the MIME type and (optional) character set:

/* Get the content type from CURL (cast guards against a null or false result) */
$content_type = (string) curl_getinfo( $ch, CURLINFO_CONTENT_TYPE );

/* Get the MIME type and character set */
preg_match( '@([\w/+.-]+)(;\s*charset=(\S+))?@i', $content_type, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];
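
If you want to reuse the matching step, it can be wrapped in a small function. The sketch below is one possible shape, not part of cURL: the name split_content_type() is made up for illustration, and it assumes you pass it whatever curl_getinfo() returned. It also accepts a quoted character set and lowercases the MIME type.

```php
<?php
/**
 * Split a content type string into a lowercase MIME type and an optional
 * character set. Returns array( $mime, $charset ); either may be null.
 * (Hypothetical helper, not part of cURL.)
 */
function split_content_type( $content_type )
{
    $mime    = null;
    $charset = null;

    /* curl_getinfo() may give null or false when no header was sent */
    if ( !is_string( $content_type ) )
        return array( $mime, $charset );

    if ( preg_match( '@([\w/+.-]+)(;\s*charset="?([^\s";]+))?@i', $content_type, $matches ) ) {
        $mime = strtolower( $matches[1] );
        if ( isset( $matches[3] ) )
            $charset = $matches[3];
    }
    return array( $mime, $charset );
}

/* Typical use after curl_exec( $ch ):
 *   list( $mime, $charset ) = split_content_type(
 *       curl_getinfo( $ch, CURLINFO_CONTENT_TYPE ) );
 */
list( $mime, $charset ) = split_content_type( 'text/html; charset=utf-8' );
echo $mime, ' / ', $charset, "\n";
```

Like the pattern above, this only finds a character set that directly follows the MIME type.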

If you are using the fopen wrappers...

After reading the web page using file_get_contents(), or one of the other file functions, use the $http_response_header variable to get the HTTP header as an array of strings. PHP creates this variable in the local scope of the code that made the request, so read it in the same function. Look for the last entry that starts with “Content-Type:”. There will be multiple entries like this if the page was redirected, and the last one is for the returned page:

Content-Type: text/html; charset=utf-8

Use preg_match() to get the MIME type and (optional) character set:

/* Get the content type from the HTTP response */
$content_type = '';
$nlines = count( $http_response_header );
for ( $i = $nlines-1; $i >= 0; $i-- ) {
    $line = $http_response_header[$i];
    if ( strncasecmp( $line, 'Content-Type:', 13 ) == 0 ) {
        $content_type = $line;
        break;
    }
}

/* Get the MIME type and character set */
preg_match( '@Content-Type:\s*([\w/+.-]+)(;\s*charset=(\S+))?@i', $content_type, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];
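
The backwards search can also be written as a function that takes the header array as an argument, which makes it easy to try without a network request. This is a minimal sketch: last_content_type() is a hypothetical name, and the sample header lines are made up to imitate a single redirect.

```php
<?php
/**
 * Return the value of the last Content-Type line in an array of raw
 * HTTP header lines, or null if there is none.
 * (Hypothetical helper; pass it $http_response_header.)
 */
function last_content_type( array $headers )
{
    for ( $i = count( $headers ) - 1; $i >= 0; $i-- ) {
        if ( strncasecmp( $headers[$i], 'Content-Type:', 13 ) == 0 )
            return trim( substr( $headers[$i], 13 ) );
    }
    return null;
}

/* Header lines as they might look after one redirect (made-up sample) */
$sample = array(
    'HTTP/1.1 301 Moved Permanently',
    'Content-Type: text/plain',
    'Location: http://example.com/new',
    'HTTP/1.1 200 OK',
    'content-type: text/html; charset=utf-8',
);
echo last_content_type( $sample ), "\n";
```

The comparison ignores case because HTTP header names are case-insensitive.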

If you have an HTML or XHTML file, but not the HTTP header...

A web page stored in a local file has no HTTP header. Read the file into a string and look for a <meta> tag for (X)HTML where the http-equiv attribute reads “Content-Type”.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Use preg_match() to find the tag and get the MIME type and (optional) character set:

/* Get the MIME type and character set */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/+.-]+)(;\s*charset=([^\s"]+))?@i',
    $page, $matches );

if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];
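
The pattern above expects the attributes in a fixed order with double quotes. If your pages vary, a more tolerant search is sketched below. It is still regular-expression matching rather than a real HTML parser, and meta_content_type() is a made-up name used only for this example.

```php
<?php
/**
 * Find the content attribute of a <meta http-equiv="Content-Type"> tag,
 * whatever the attribute order or quote style. Returns null if absent.
 * (Hypothetical helper using regular expressions, not a full HTML parser.)
 */
function meta_content_type( $page )
{
    /* Collect every <meta ...> tag on the page */
    if ( !preg_match_all( '@<meta\b[^>]*>@i', $page, $tags ) )
        return null;

    foreach ( $tags[0] as $tag ) {
        /* Skip tags that are not the Content-Type declaration */
        if ( !preg_match( '@http-equiv\s*=\s*["\']?content-type@i', $tag ) )
            continue;
        /* The back-reference \1 requires the same closing quote */
        if ( preg_match( '@\bcontent\s*=\s*(["\'])(.*?)\1@i', $tag, $m ) )
            return $m[2];
    }
    return null;
}

$html = "<head><meta content='text/html; charset=Big5' http-equiv='Content-Type' /></head>";
echo meta_content_type( $html ), "\n";
```

The returned string has the same form as the HTTP header value, so the MIME type and character set can be pulled out of it with the same preg_match() call used for the header.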

If you have an XML file, but not the HTTP header...

XML data stored in a local file has no HTTP header. Read the file into a string and look for an <?xml> tag:

<?xml version="1.0" encoding="UTF-8" ?>

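Use preg_match() to find the tag and get the character set. The <?xml> tag carries no MIME type; the generic MIME type for XML is “application/xml”, though a server may report a more specific one such as “application/xhtml+xml”: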

/* Get the character set (the search stops at the end of the <?xml> tag) */
preg_match( '@<\?xml[^>]*encoding=["\']([^\s"\']+)@i', $page, $matches );
$mime = 'application/xml';
if ( isset( $matches[1] ) )
    $charset = $matches[1];
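
When the <?xml> tag is missing or names no encoding, the code above leaves $charset unset. The XML specification treats such a file as UTF-8 unless a byte order mark says otherwise, so one option is to fall back to UTF-8. The sketch below assumes that fallback is acceptable for your data; xml_charset() is a hypothetical name, and byte order marks are skipped over rather than interpreted.

```php
<?php
/**
 * Return the encoding named in an XML declaration, or 'UTF-8' when the
 * declaration is missing or names no encoding.
 * (Hypothetical helper; it does not inspect byte order marks.)
 */
function xml_charset( $page )
{
    /* Only look inside the declaration itself, which ends at "?>".
       An optional UTF-8 byte order mark may come first. */
    if ( preg_match( '@^(?:\xEF\xBB\xBF)?\s*<\?xml\b([^>]*)\?>@i', $page, $decl ) &&
         preg_match( '@encoding\s*=\s*["\']([\w.:-]+)["\']@i', $decl[1], $m ) )
        return $m[1];
    return 'UTF-8';
}

echo xml_charset( '<?xml version="1.0" encoding="ISO-8859-1" ?><doc/>' ), "\n";
echo xml_charset( '<doc/>' ), "\n";
```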

Example

Read an HTML file, get the character set, and convert to UTF-8:

$text = file_get_contents( $filename );

preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/+.-]+)(;\s*charset=([^\s"]+))?@i',
    $text, $matches );

if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];

/* Convert only if a character set was found; otherwise keep the text as-is */
$utf8_text = isset( $charset ) ? iconv( $charset, "utf-8", $text ) : $text;

See the PHP iconv() manual for how to use the function to convert from a web page’s original character set to UTF-8 (or any other character set).
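
iconv() returns false when it does not recognize the character set name or when the text contains bytes that are invalid for that character set. The wrapper below is a sketch of one way to handle that. The name to_utf8() and the policy of returning the original text on failure are assumptions; you may prefer to log the failure or skip the page.

```php
<?php
/**
 * Convert $text from $charset to UTF-8. Falls back to the unchanged text
 * when no character set is known or iconv() cannot convert it.
 * (Hypothetical wrapper; the fallback policy is an assumption.)
 */
function to_utf8( $text, $charset = null )
{
    /* Nothing to do when the charset is unknown or already UTF-8 */
    if ( $charset === null || strcasecmp( $charset, 'utf-8' ) == 0 )
        return $text;

    /* @ silences the message iconv() raises on bad input or unknown names */
    $converted = @iconv( $charset, 'UTF-8', $text );
    return ( $converted === false ) ? $text : $converted;
}

/* "caf" followed by byte 0xE9, which is e-acute in ISO-8859-1 */
echo to_utf8( "caf\xE9", 'ISO-8859-1' ), "\n";
```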

Explanation

The content type of a page gives the file’s MIME type and character set:

  • MIME type: the file’s type, such as text, HTML, an image, a sound, etc. (see Wikipedia on MIME types and a list of standard Internet Media Types). Some common types include:

      MIME type                Meaning
      text/plain               Plain text file
      text/html                HTML web page
      application/xml          XML data file
      application/xhtml+xml    XHTML web page
      image/jpeg               JPEG image
      image/png                PNG image
      image/gif                GIF image

  • Character set: the character encoding of the file (see Wikipedia on Character encoding). UTF-8 is widely used for web pages because it can represent characters in any of the world’s languages. Some sites may still use language-specific encodings, such as Big5 for traditional Chinese characters, the ISO 8859 family for Latin and other alphabets, or any of several old Microsoft Windows-specific character sets.

When building a PHP search engine or page analysis tool, the MIME type tells you if you’ve got an HTML page, an image, a sound, or whatever. The character set tells you the encoding used by the page. For most tasks, you’ll need to convert the page to UTF-8 using iconv() before parsing the text.
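
As a rough illustration of using the MIME type to decide what to parse, the sketch below accepts text types and XML-based types and rejects everything else. The function name is_parsable_text() and the choice of accepted types are assumptions, so adjust the list to your own tool.

```php
<?php
/**
 * Decide whether a MIME type holds text that is worth parsing for keywords.
 * (Hypothetical helper; the list of accepted types is an assumption.)
 */
function is_parsable_text( $mime )
{
    $mime = strtolower( trim( $mime ) );

    /* Any type in the text family: text/plain, text/html, and so on */
    if ( strncmp( $mime, 'text/', 5 ) == 0 )
        return true;

    /* Generic XML, plus XML-based types such as application/xhtml+xml */
    if ( $mime == 'application/xml' || substr( $mime, -4 ) == '+xml' )
        return true;

    return false;
}

var_dump( is_parsable_text( 'text/html' ) );
var_dump( is_parsable_text( 'image/png' ) );
```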

Content type in the HTTP header

The content type for a page is specified by a text string included in the HTTP response header from the web server, such as:

Content-Type: text/html; charset=utf-8

After the string “Content-Type:”, the first part of the value is the MIME type and the second part the character set. The character set is optional but it should always be included for text-based content. For images, sounds, and other non-text content, there will be no character set specified.
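
The header value can carry other parameters besides the character set, and parameter values may be quoted. If that matters for your pages, splitting on semicolons is an alternative to a single regular expression. This is a simplified sketch with a made-up function name, and it assumes no parameter value contains a semicolon.

```php
<?php
/**
 * Parse a Content-Type header value into its MIME type and a map of
 * parameters. (Hypothetical helper; it does not handle semicolons inside
 * quoted parameter values.)
 */
function parse_content_type_value( $value )
{
    $parts  = explode( ';', $value );
    $mime   = strtolower( trim( array_shift( $parts ) ) );
    $params = array();

    foreach ( $parts as $part ) {
        /* Each parameter looks like name=value or name="value" */
        $pair = explode( '=', $part, 2 );
        if ( count( $pair ) == 2 )
            $params[ strtolower( trim( $pair[0] ) ) ] = trim( $pair[1], " \t\"" );
    }
    return array( $mime, $params );
}

list( $mime, $params ) = parse_content_type_value( 'text/html; charset="UTF-8"' );
$charset = isset( $params['charset'] ) ? $params['charset'] : null;
echo $mime, ' / ', $charset, "\n";
```

For an image or other non-text content, the parameter map simply comes back empty and $charset stays null.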

Content type in an HTML/XHTML <meta> tag

If the HTTP header doesn’t include the content type (which may indicate a misconfigured web server), the content type may be included within a <meta> tag in HTML and XHTML files, such as:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

As with the HTTP header, the first part of the type is the MIME type, and the second part is the character set. These values are supposed to match the HTTP header content type, if given. If they don’t, the HTTP header has precedence over the <meta> tag.
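
The precedence rule can be written down directly. In this sketch, choose_charset() is a hypothetical helper that takes the character set found in the HTTP header and the one found in the <meta> tag, with null standing for "not given". The UTF-8 default for the case where neither source names a character set is an assumption, not something the rule itself requires.

```php
<?php
/**
 * Pick the character set to trust, given what each source reported.
 * Pass null for a source that gave no character set.
 * (Hypothetical helper; the UTF-8 default is an assumption.)
 */
function choose_charset( $header_charset, $meta_charset, $default = 'UTF-8' )
{
    /* The HTTP header wins whenever it names a character set */
    if ( $header_charset !== null && $header_charset !== '' )
        return $header_charset;

    /* Otherwise fall back to the <meta> tag, if it named one */
    if ( $meta_charset !== null && $meta_charset !== '' )
        return $meta_charset;

    return $default;
}

echo choose_charset( 'ISO-8859-1', 'utf-8' ), "\n";
echo choose_charset( null, 'Big5' ), "\n";
```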

The content type <meta> tag is optional, but if it is given it should be as close to the top of the <head> section as possible. This enables web browsers, and PHP code, to find the content type quickly and then use the character set name to guide handling of the rest of the page text.

Content type in an XML <?xml> tag

For XML files, the character set may be within a <?xml> tag near the top of the file, such as:

<?xml version="1.0" encoding="UTF-8" ?>

XML doesn’t specify a MIME type in the <?xml> tag. Instead, the generic MIME type for XML is “application/xml”; XML-based formats may use a more specific type, such as “application/xhtml+xml” for XHTML.