Fetching and displaying Chinese web page content remotely (keywords: html, stringWithContentsOfURL, encoding)


The goal is simple: fetch the content of a remote page such as www.baidu.com, that is, the same text you would see with View > Source in a browser.

The initial code:
UITextView *web = [[UITextView alloc] initWithFrame:bounds];
NSURL *url = [NSURL URLWithString:@"http://www.baidu.com"];
// Deprecated variant that takes no encoding argument.
NSString *pageSource = [NSString stringWithContentsOfURL:url];
if (pageSource == nil) {
    NSLog(@"nil page source");
} else {
    NSLog(@"not nil");
    // Pass the page as an argument, never as the format string.
    NSLog(@"%@", pageSource);
}
web.text = pageSource;


It compiles and runs, and the page content is fetched, but the text is garbled both in the console and in the text view.

Since the text is garbled, the obvious fix is to specify the encoding. The first attempt was:
NSString *pageSource = [NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:nil];


Result: nil page source
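Passing `error:nil` hides the reason for the nil result, so it is hard to tell a network failure from a decoding failure. A minimal sketch of the same call with the error captured is shown here; `LoadPageSource` is a hypothetical helper name, not something from the original code.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: load a URL as text in the given encoding and
// report why it failed instead of silently returning nil.
static NSString *LoadPageSource(NSURL *url,
                                NSStringEncoding encoding,
                                NSError **outError) {
    NSError *error = nil;
    NSString *text = [NSString stringWithContentsOfURL:url
                                              encoding:encoding
                                                 error:&error];
    if (text == nil) {
        // The error distinguishes "could not reach the URL" from
        // "the bytes are not valid in this encoding".
        NSLog(@"Could not load %@ with encoding 0x%lx: %@",
              url, (unsigned long)encoding, error);
    }
    if (outError != NULL) {
        *outError = error;
    }
    return text;
}
```

If the logged error complains about the text encoding rather than the connection, the page is most likely not UTF-8, which is consistent with what the rest of this post finds.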

Trying the other encodings defined alongside NSUTF8StringEncoding did not help either.

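Next, print the list of available encodings. A few lines of code will do:
    const NSStringEncoding *encodings = [NSString availableStringEncodings];
    NSStringEncoding encoding;
    int i = 0;
    // The list is terminated by a zero entry.
    while ((encoding = *encodings++) != 0) {
        // NSStringEncoding is an NSUInteger, so cast it to match %lx.
        NSLog(@"%d: %@ == 0x%lx\n", i++, [NSString localizedNameOfStringEncoding:encoding], (unsigned long)encoding);
    }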


The output:
2009-06-08 23:12:44.420 MoviePre[1243:20b] 0: Western (Mac OS Roman) == 0x1e
2009-06-08 23:12:44.421 MoviePre[1243:20b] 1: Japanese (Mac OS) == 0x80000001
2009-06-08 23:12:44.421 MoviePre[1243:20b] 2: Traditional Chinese (Mac OS) == 0x80000002
2009-06-08 23:12:44.422 MoviePre[1243:20b] 3: Korean (Mac OS) == 0x80000003
2009-06-08 23:12:44.422 MoviePre[1243:20b] 4: Arabic (Mac OS) == 0x80000004
2009-06-08 23:12:44.432 MoviePre[1243:20b] 5: Hebrew (Mac OS) == 0x80000005
2009-06-08 23:12:44.433 MoviePre[1243:20b] 6: Greek (Mac OS) == 0x80000006
2009-06-08 23:12:44.433 MoviePre[1243:20b] 7: Cyrillic (Mac OS) == 0x80000007
2009-06-08 23:12:44.436 MoviePre[1243:20b] 8: Devanagari (Mac OS) == 0x80000009
2009-06-08 23:12:44.447 MoviePre[1243:20b] 9: Gurmukhi (Mac OS) == 0x8000000a
2009-06-08 23:12:44.447 MoviePre[1243:20b] 10: Gujarati (Mac OS) == 0x8000000b
2009-06-08 23:12:44.447 MoviePre[1243:20b] 11: Thai (Mac OS) == 0x80000015
2009-06-08 23:12:44.448 MoviePre[1243:20b] 12: Simplified Chinese (Mac OS) == 0x80000019
2009-06-08 23:12:44.448 MoviePre[1243:20b] 13: Tibetan (Mac OS) == 0x8000001a
2009-06-08 23:12:44.452 MoviePre[1243:20b] 14: Central European (Mac OS) == 0x8000001d
2009-06-08 23:12:44.453 MoviePre[1243:20b] 15: Symbol (Mac OS) == 0x6
2009-06-08 23:12:44.455 MoviePre[1243:20b] 16: Dingbats (Mac OS) == 0x80000022
2009-06-08 23:12:44.455 MoviePre[1243:20b] 17: Turkish (Mac OS) == 0x80000023
2009-06-08 23:12:44.456 MoviePre[1243:20b] 18: Croatian (Mac OS) == 0x80000024
2009-06-08 23:12:44.464 MoviePre[1243:20b] 19: Icelandic (Mac OS) == 0x80000025
2009-06-08 23:12:44.467 MoviePre[1243:20b] 20: Romanian (Mac OS) == 0x80000026
2009-06-08 23:12:44.467 MoviePre[1243:20b] 21: Celtic (Mac OS) == 0x80000027
2009-06-08 23:12:44.468 MoviePre[1243:20b] 22: Gaelic (Mac OS) == 0x80000028
2009-06-08 23:12:44.469 MoviePre[1243:20b] 23: Keyboard Symbols (Mac OS) == 0x80000029
2009-06-08 23:12:44.469 MoviePre[1243:20b] 24: Farsi (Mac OS) == 0x8000008c
2009-06-08 23:12:44.470 MoviePre[1243:20b] 25: Cyrillic (Mac OS Ukrainian) == 0x80000098
2009-06-08 23:12:44.470 MoviePre[1243:20b] 26: Inuit (Mac OS) == 0x800000ec
2009-06-08 23:12:44.471 MoviePre[1243:20b] 27: Unicode (UTF-32LE) == 0x9c000100
2009-06-08 23:12:44.471 MoviePre[1243:20b] 28: Unicode (UTF-8) == 0x4
2009-06-08 23:12:44.472 MoviePre[1243:20b] 29: Unicode (UTF-16) == 0xa
2009-06-08 23:12:44.473 MoviePre[1243:20b] 30: Unicode (UTF-16BE) == 0x90000100
2009-06-08 23:12:44.473 MoviePre[1243:20b] 31: Unicode (UTF-16LE) == 0x94000100
2009-06-08 23:12:44.480 MoviePre[1243:20b] 32: Unicode (UTF-32) == 0x8c000100
2009-06-08 23:12:44.480 MoviePre[1243:20b] 33: Unicode (UTF-32BE) == 0x98000100
2009-06-08 23:12:44.481 MoviePre[1243:20b] 34: Western (ISO Latin 1) == 0x5
2009-06-08 23:12:44.481 MoviePre[1243:20b] 35: Central European (ISO Latin 2) == 0x9
2009-06-08 23:12:44.481 MoviePre[1243:20b] 36: Western (ISO Latin 3) == 0x80000203
2009-06-08 23:12:44.482 MoviePre[1243:20b] 37: Central European (ISO Latin 4) == 0x80000204
2009-06-08 23:12:44.493 MoviePre[1243:20b] 38: Cyrillic (ISO 8859-5) == 0x80000205
2009-06-08 23:12:44.493 MoviePre[1243:20b] 39: Arabic (ISO 8859-6) == 0x80000206
2009-06-08 23:12:44.494 MoviePre[1243:20b] 40: Greek (ISO 8859-7) == 0x80000207
2009-06-08 23:12:44.494 MoviePre[1243:20b] 41: Hebrew (ISO 8859-8) == 0x80000208
2009-06-08 23:12:44.495 MoviePre[1243:20b] 42: Turkish (ISO Latin 5) == 0x80000209
2009-06-08 23:12:44.495 MoviePre[1243:20b] 43: Nordic (ISO Latin 6) == 0x8000020a
2009-06-08 23:12:44.506 MoviePre[1243:20b] 44: Thai (ISO 8859-11) == 0x8000020b
2009-06-08 23:12:44.507 MoviePre[1243:20b] 45: Baltic Rim (ISO Latin 7) == 0x8000020d
2009-06-08 23:12:44.510 MoviePre[1243:20b] 46: Celtic (ISO Latin 8) == 0x8000020e
2009-06-08 23:12:44.511 MoviePre[1243:20b] 47: Western (ISO Latin 9) == 0x8000020f
2009-06-08 23:12:44.511 MoviePre[1243:20b] 48: Romanian (ISO Latin 10) == 0x80000210
2009-06-08 23:12:44.512 MoviePre[1243:20b] 49: Latin-US (DOS) == 0x80000400
2009-06-08 23:12:44.512 MoviePre[1243:20b] 50: Greek (DOS) == 0x80000405
2009-06-08 23:12:44.513 MoviePre[1243:20b] 51: Baltic Rim (DOS) == 0x80000406
2009-06-08 23:12:44.513 MoviePre[1243:20b] 52: Western (DOS Latin 1) == 0x80000410
2009-06-08 23:12:44.513 MoviePre[1243:20b] 53: Greek (DOS Greek 1) == 0x80000411
2009-06-08 23:12:44.514 MoviePre[1243:20b] 54: Central European (DOS Latin 2) == 0x80000412
2009-06-08 23:12:44.514 MoviePre[1243:20b] 55: Cyrillic (DOS) == 0x80000413
2009-06-08 23:12:44.514 MoviePre[1243:20b] 56: Turkish (DOS) == 0x80000414
2009-06-08 23:12:44.515 MoviePre[1243:20b] 57: Portuguese (DOS) == 0x80000415
2009-06-08 23:12:44.516 MoviePre[1243:20b] 58: Icelandic (DOS) == 0x80000416
2009-06-08 23:12:44.517 MoviePre[1243:20b] 59: Hebrew (DOS) == 0x80000417
2009-06-08 23:12:44.517 MoviePre[1243:20b] 60: Canadian French (DOS) == 0x80000418
2009-06-08 23:12:44.517 MoviePre[1243:20b] 61: Arabic (DOS) == 0x80000419
2009-06-08 23:12:44.518 MoviePre[1243:20b] 62: Nordic (DOS) == 0x8000041a
2009-06-08 23:12:44.518 MoviePre[1243:20b] 63: Russian (DOS) == 0x8000041b
2009-06-08 23:12:44.519 MoviePre[1243:20b] 64: Greek (DOS Greek 2) == 0x8000041c
2009-06-08 23:12:44.519 MoviePre[1243:20b] 65: Thai (Windows, DOS) == 0x8000041d
2009-06-08 23:12:44.520 MoviePre[1243:20b] 66: Japanese (Windows, DOS) == 0x8
2009-06-08 23:12:44.522 MoviePre[1243:20b] 67: Simplified Chinese (Windows, DOS) == 0x80000421
2009-06-08 23:12:44.522 MoviePre[1243:20b] 68: Korean (Windows, DOS) == 0x80000422
2009-06-08 23:12:44.524 MoviePre[1243:20b] 69: Traditional Chinese (Windows, DOS) == 0x80000423
2009-06-08 23:12:44.524 MoviePre[1243:20b] 70: Western (Windows Latin 1) == 0xc
2009-06-08 23:12:44.525 MoviePre[1243:20b] 71: Central European (Windows Latin 2) == 0xf
2009-06-08 23:12:44.525 MoviePre[1243:20b] 72: Cyrillic (Windows) == 0xb
2009-06-08 23:12:44.525 MoviePre[1243:20b] 73: Greek (Windows) == 0xd
2009-06-08 23:12:44.526 MoviePre[1243:20b] 74: Turkish (Windows Latin 5) == 0xe
2009-06-08 23:12:44.526 MoviePre[1243:20b] 75: Hebrew (Windows) == 0x80000505
2009-06-08 23:12:44.526 MoviePre[1243:20b] 76: Arabic (Windows) == 0x80000506
2009-06-08 23:12:44.527 MoviePre[1243:20b] 77: Baltic Rim (Windows) == 0x80000507
2009-06-08 23:12:44.529 MoviePre[1243:20b] 78: Vietnamese (Windows) == 0x80000508
2009-06-08 23:12:44.531 MoviePre[1243:20b] 79: Western (ASCII) == 0x1
2009-06-08 23:12:44.532 MoviePre[1243:20b] 80: Japanese (Shift JIS X0213) == 0x80000628
2009-06-08 23:12:44.533 MoviePre[1243:20b] 81: Chinese (GBK) == 0x80000631
2009-06-08 23:12:44.534 MoviePre[1243:20b] 82: Chinese (GB 18030) == 0x80000632
2009-06-08 23:12:44.534 MoviePre[1243:20b] 83: Japanese (ISO 2022-JP) == 0x15
2009-06-08 23:12:44.535 MoviePre[1243:20b] 84: Korean (ISO 2022-KR) == 0x80000840
2009-06-08 23:12:44.536 MoviePre[1243:20b] 85: Japanese (EUC) == 0x3
2009-06-08 23:12:44.536 MoviePre[1243:20b] 86: Simplified Chinese (EUC) == 0x80000930
2009-06-08 23:12:44.538 MoviePre[1243:20b] 87: Traditional Chinese (EUC) == 0x80000931
2009-06-08 23:12:44.539 MoviePre[1243:20b] 88: Korean (EUC) == 0x80000940
2009-06-08 23:12:44.540 MoviePre[1243:20b] 89: Japanese (Shift JIS) == 0x80000a01
2009-06-08 23:12:44.541 MoviePre[1243:20b] 90: Cyrillic (KOI8-R) == 0x80000a02
2009-06-08 23:12:44.541 MoviePre[1243:20b] 91: Traditional Chinese (Big 5) == 0x80000a03
2009-06-08 23:12:44.541 MoviePre[1243:20b] 92: Western (Mac Mail) == 0x80000a04
2009-06-08 23:12:44.542 MoviePre[1243:20b] 93: Simplified Chinese (HZ GB 2312) == 0x80000a05
2009-06-08 23:12:44.542 MoviePre[1243:20b] 94: Traditional Chinese (Big 5 HKSCS) == 0x80000a06
2009-06-08 23:12:44.542 MoviePre[1243:20b] 95: Ukrainian (KOI8-U) == 0x80000a08
2009-06-08 23:12:44.546 MoviePre[1243:20b] 96: Traditional Chinese (Big 5-E) == 0x80000a09
2009-06-08 23:12:44.547 MoviePre[1243:20b] 97: Western (NextStep) == 0x2
2009-06-08 23:12:44.547 MoviePre[1243:20b] 98: Non-lossy ASCII == 0x7
2009-06-08 23:12:44.548 MoviePre[1243:20b] 99: Western (EBCDIC US) == 0x80000c01
2009-06-08 23:12:44.548 MoviePre[1243:20b] 100: Western (EBCDIC Latin 1) == 0x80000c02


There they are. Try the Chinese encodings in the list.
In the end I used item 81, Chinese (GBK). The code:
 NSString *pageSource = [NSString stringWithContentsOfURL:url encoding:0x80000631 error:nil];
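The literal 0x80000631 can also be obtained from a named CoreFoundation constant, and the bytes can be fetched once and decoded with several candidate encodings in turn. The following is a rough sketch under those assumptions, not the approach used above: `GBKStringEncoding` and `DecodePageData` are hypothetical helper names, the UTF-8-then-GBK order is an illustrative choice, and the code assumes ARC.

```objc
#import <Foundation/Foundation.h>

// Named equivalent of the literal 0x80000631 from the encoding list.
static NSStringEncoding GBKStringEncoding(void) {
    return CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGBK_95);
}

// Hypothetical helper: decode already-downloaded bytes, trying UTF-8
// first (it rejects most non-UTF-8 input) and GBK second.
static NSString *DecodePageData(NSData *data) {
    if (data == nil) {
        return nil;
    }
    const NSStringEncoding candidates[] = {
        NSUTF8StringEncoding,
        GBKStringEncoding(),
    };
    const NSUInteger count = sizeof(candidates) / sizeof(candidates[0]);
    for (NSUInteger i = 0; i < count; i++) {
        NSString *text = [[NSString alloc] initWithData:data
                                               encoding:candidates[i]];
        if (text != nil) {
            return text;
        }
    }
    return nil;  // none of the candidate encodings fit
}
```

The bytes could come from `[NSData dataWithContentsOfURL:url]`, so the page is downloaded only once however many encodings are tried.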



Both the log and the simulator now display the text correctly.

I have not tested it on a real device yet.

Once you have the page content, a few regular expressions are enough to extract whatever you want :)
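As one possible illustration of the regular-expression step, the sketch below pulls the text of the first `<title>` element out of a decoded page. It assumes `NSRegularExpression`, which only became available in later SDKs (iOS 4 and OS X 10.7), so it would not have compiled against the SDK used when this was written. `FirstTitleInHTML` is a hypothetical helper name, and a regular expression is only a rough tool for HTML.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: return the trimmed text of the first <title>
// element in the page, or nil if there is none.
static NSString *FirstTitleInHTML(NSString *html) {
    if (html == nil) {
        return nil;
    }
    NSError *error = nil;
    NSRegularExpression *regex = [NSRegularExpression
        regularExpressionWithPattern:@"<title[^>]*>(.*?)</title>"
                             options:(NSRegularExpressionCaseInsensitive |
                                      NSRegularExpressionDotMatchesLineSeparators)
                               error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return nil;
    }
    NSTextCheckingResult *match =
        [regex firstMatchInString:html
                          options:0
                            range:NSMakeRange(0, html.length)];
    if (match == nil) {
        return nil;
    }
    // Capture group 1 holds the text between the tags.
    NSString *title = [html substringWithRange:[match rangeAtIndex:1]];
    return [title stringByTrimmingCharactersInSet:
                      [NSCharacterSet whitespaceAndNewlineCharacterSet]];
}
```

Calling `FirstTitleInHTML(pageSource)` on the decoded page would then give its title; other fields can be extracted the same way with different patterns.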