Go in Practice: a jQuery for golang (PuerkitoBio/goquery, getting links from HTML)

Source: Internet  Editor: 程序博客网  Time: 2024/04/28 03:44

Life goes on, so keep on go go go!!!
jQuery is pretty much a household name.

jQuery is a fast, small, and feature-rich JavaScript library. It makes things like HTML document traversal and manipulation, event handling, animation, and Ajax much simpler with an easy-to-use API that works across a multitude of browsers. With a combination of versatility and extensibility, jQuery has changed the way that millions of people write JavaScript.

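jQuery is a JavaScript library.
The jQuery library includes the following features:
HTML element selection
HTML element manipulation
CSS manipulation
HTML event functions
JavaScript effects and animations
HTML DOM traversal and modification
AJAX
Utilities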

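In the golang world,
the library github.com/PuerkitoBio/goquery provides jQuery-like functionality, which makes it convenient to work with HTML documents from Go.

Keep in mind: if you write crawlers or scrapers in golang, there is a good chance you will end up using goquery.

Reference: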
http://blog.studygolang.com/2015/04/go-jquery-goquery/

PuerkitoBio/goquery

GitHub repository:
https://github.com/PuerkitoBio/goquery

Star: 4833

Description:
A little like that j-thing, only in Go.

Install:
go get github.com/PuerkitoBio/goquery

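Creating a Document object
goquery exposes two structs: Document and Selection.
Document represents an HTML document, and Selection is what you work with jQuery-style, including chained calls. goquery needs to be given an HTML document before any further operations can take place.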

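Finding specific nodes
Selection has a set of jQuery-like methods, and because the Document struct embeds *Selection, those methods can be called directly on a Document as well. The main one is Selection.Find(selector string): pass in a selector and it returns a new *Selection holding the matches, which is why calls can be chained.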

Working with attributes
You often need the content of a tag and the values of some of its attributes; goquery makes that easy.

Official example

package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
	doc, err := goquery.NewDocument("http://metalsucks.net")
	if err != nil {
		log.Fatal(err)
	}

	// Find the review items
	doc.Find(".sidebar-reviews article .content-block").Each(func(i int, s *goquery.Selection) {
		// For each item found, get the band and title
		band := s.Find("a").Text()
		title := s.Find("i").Text()
		fmt.Printf("Review %d: %s - %s\n", i, band, title)
	})
}

func main() {
	ExampleScrape()
}

Output:

Review 0: Cavalera Conspiracy - Psychosis
Review 1: Cannibal Corpse - Red Before Black
Review 2: All Pigs Must Die - Hostage Animal
Review 3: Electric Wizard - Wizard Bloody Wizard
Review 4: Trivium - The Sin and the Sentence

Getting links from HTML

package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func linkScrape() {
	doc, err := goquery.NewDocument("http://jonathanmh.com")
	if err != nil {
		log.Fatal(err)
	}

	// Visit every <a> element inside <body>
	doc.Find("body a").Each(func(index int, item *goquery.Selection) {
		linkTag := item
		// The second return value (ignored here) reports whether href exists
		link, _ := linkTag.Attr("href")
		linkText := linkTag.Text()
		fmt.Printf("Link #%d: '%s' - '%s'\n", index, linkText, link)
	})
}

func main() {
	linkScrape()
}

Output:

Link #0: 'Skip to content' - '#content'
Link #1: 'JonathanMH' - 'https://jonathanmh.com/'
Link #2: 'Blog' - 'https://jonathanmh.com/category/blog/'
Link #3: 'Hire Me' - 'https://jonathanmh.com/hire-me/'
Link #4: 'About' - 'https://jonathanmh.com/about/'
Link #5: 'twitter' - 'https://twitter.com/JonathanMH_com'
Link #6: 'rss feed' - 'http://jonathanmh.com/feed/'
Link #7: 'github' - 'https://github.com/JonathanMH'
Link #8: 'stackoverflow' - 'http://stackoverflow.com/users/896285/jonathan-m-hethey'
Link #9: 'instagram' - 'http://instagram.com/jonathanmh'
Link #10: 'facebook' - 'https://www.facebook.com/pages/JonathanMH/159526834122370'
Link #11: 'linkedin' - 'http://www.linkedin.com/in/jonathanmh'
Link #12: 'hire me' - '/hire-me'
Link #13: '' - 'https://twitter.com/JonathanMH_com'
Link #14: '' - 'https://www.facebook.com/JonathanMH-159526834122370/'
Link #15: '' - 'https://www.instagram.com/jonathanmh/'
Link #16: '' - 'https://github.com/jonathanmh/'
Link #17: 'Work every day like you just got fired' - 'https://jonathanmh.com/work-every-day-like-just-got-fired/'
Link #18: 'Vue.js API Client / Single Page App (SPA) Tutorial' - 'https://jonathanmh.com/vue-js-api-client-single-page-app-spa-tutorial/'
Link #19: 'Building a Simple Searchable API with Express (Backend)' - 'https://jonathanmh.com/building-a-simple-searchable-api-with-express-backend/'
Link #20: 'Music Monday: Doom Soundtrack' - 'https://jonathanmh.com/music-monday-doom-soundtrack/'
Link #21: 'Brick by Brick' - 'https://jonathanmh.com/brick-by-brick/'
Link #22: 'Taking Screenshots with Headless, The Chrome Debuggping Protocol (CDP) and Golang' - 'https://jonathanmh.com/taking-screenshots-headless-chrome-debuggping-protocol-cdp-golang/'
Link #23: 'Firefox has re-joined the Browser Wars' - 'https://jonathanmh.com/firefox-re-joined-browser-wars/'
Link #24: 'A Mastodon Review, is it the next Twitter / Facebook by the People?' - 'https://jonathanmh.com/mastodon-review-next-twitter-facebook-people/'
Link #25: 'Testing Coin Hive Crowd Source Monero Mining' - 'https://jonathanmh.com/testing-coin-hive-crowd-source-monero-mining/'
Link #26: 'Glass Half' - 'https://jonathanmh.com/glass-half/'
Link #27: 'read older posts' - '/blog/page/2/'
Link #28: 'twitter' - 'https://twitter.com/JonathanMH_com'
Link #29: 'rss feed' - 'http://jonathanmh.com/feed/'
Link #30: 'github' - 'https://github.com/JonathanMH'
Link #31: 'stackoverflow' - 'http://stackoverflow.com/users/896285/jonathan-m-hethey'
Link #32: 'instagram' - 'http://instagram.com/jonathanmh'
Link #33: 'facebook' - 'https://www.facebook.com/pages/JonathanMH/159526834122370'
Link #34: 'linkedin' - 'http://www.linkedin.com/in/jonathanmh'
Link #35: '.htaccess' - 'https://jonathanmh.com/tag/htaccess/'
Link #36: 'Adobe' - 'https://jonathanmh.com/tag/adobe/'
Link #37: 'Android' - 'https://jonathanmh.com/tag/android/'
Link #38: 'Arch Linux' - 'https://jonathanmh.com/tag/arch-linux/'
Link #39: 'atom' - 'https://jonathanmh.com/tag/atom/'
Link #40: 'bash' - 'https://jonathanmh.com/tag/bash/'
Link #41: 'blogging' - 'https://jonathanmh.com/tag/blogging/'
Link #42: 'Brackets' - 'https://jonathanmh.com/tag/brackets/'
Link #43: 'cigtrack' - 'https://jonathanmh.com/tag/cigtrack/'
Link #44: 'CodeIgniter' - 'https://jonathanmh.com/tag/codeigniter/'
Link #45: 'CSS' - 'https://jonathanmh.com/tag/css/'
Link #46: 'Digital Ocean' - 'https://jonathanmh.com/tag/digital-ocean/'
Link #47: 'express.js' - 'https://jonathanmh.com/tag/express-js/'
Link #48: 'facebook' - 'https://jonathanmh.com/tag/facebook/'
Link #49: 'ghost' - 'https://jonathanmh.com/tag/ghost/'
Link #50: 'git' - 'https://jonathanmh.com/tag/git/'
Link #51: 'github' - 'https://jonathanmh.com/tag/github/'
Link #52: 'gitlab' - 'https://jonathanmh.com/tag/gitlab/'
Link #53: 'go' - 'https://jonathanmh.com/tag/go/'
Link #54: 'golang' - 'https://jonathanmh.com/tag/golang/'
Link #55: 'Google' - 'https://jonathanmh.com/tag/google/'
Link #56: 'Gulp' - 'https://jonathanmh.com/tag/gulp/'
Link #57: 'gvim' - 'https://jonathanmh.com/tag/gvim/'
Link #58: 'JavaScript' - 'https://jonathanmh.com/tag/javascript/'
Link #59: 'kickstarter' - 'https://jonathanmh.com/tag/kickstarter/'
Link #60: 'Linux' - 'https://jonathanmh.com/tag/linux/'
Link #61: 'markdown' - 'https://jonathanmh.com/tag/markdown/'
Link #62: 'mindset' - 'https://jonathanmh.com/tag/mindset/'
Link #63: 'MVC' - 'https://jonathanmh.com/tag/mvc/'
Link #64: 'Nginx' - 'https://jonathanmh.com/tag/nginx/'
Link #65: 'node.js' - 'https://jonathanmh.com/tag/node-js/'
Link #66: 'npm' - 'https://jonathanmh.com/tag/npm/'
Link #67: 'PHP' - 'https://jonathanmh.com/tag/php/'
Link #68: 'plugin' - 'https://jonathanmh.com/tag/plugin/'
Link #69: 'Raspberry PI' - 'https://jonathanmh.com/tag/raspberry-pi/'
Link #70: 'SCSS' - 'https://jonathanmh.com/tag/scss/'
Link #71: 'social media' - 'https://jonathanmh.com/tag/social-media/'
Link #72: 'ssh' - 'https://jonathanmh.com/tag/ssh/'
Link #73: 'Terminal' - 'https://jonathanmh.com/tag/terminal/'
Link #74: 'toolbox' - 'https://jonathanmh.com/tag/toolbox/'
Link #75: 'UberWriter' - 'https://jonathanmh.com/tag/uberwriter/'
Link #76: 'Ubuntu' - 'https://jonathanmh.com/tag/ubuntu/'
Link #77: 'vim' - 'https://jonathanmh.com/tag/vim/'
Link #78: 'web crawling' - 'https://jonathanmh.com/tag/web-crawling/'
Link #79: 'WordPress' - 'https://jonathanmh.com/tag/wordpress/'
Link #80: 'Blog' - 'https://jonathanmh.com/category/blog/'
Link #81: 'Hire Me' - 'https://jonathanmh.com/hire-me/'
Link #82: 'About' - 'https://jonathanmh.com/about/'
Link #83: 'twitter' - 'https://twitter.com/JonathanMH_com'
Link #84: 'rss feed' - 'http://jonathanmh.com/feed/'
Link #85: 'github' - 'https://github.com/JonathanMH'
Link #86: 'stackoverflow' - 'http://stackoverflow.com/users/896285/jonathan-m-hethey'
Link #87: 'instagram' - 'http://instagram.com/jonathanmh'
Link #88: 'facebook' - 'https://www.facebook.com/pages/JonathanMH/159526834122370'
Link #89: 'linkedin' - 'http://www.linkedin.com/in/jonathanmh'
Link #90: 'JonathanMH' - 'https://jonathanmh.com/'
Link #91: 'Proudly powered by WordPress' - 'https://wordpress.org/'

The next example prints every link on a page as a reStructuredText hyperlink, using text/template:

package main

import (
	"os"
	"strings"
	"text/template"

	"github.com/PuerkitoBio/goquery"
)

// rstLink renders one link in reStructuredText hyperlink syntax.
const rstLink = "`{{.Text}} <{{.Href}}>`_\n"

type htmlLink struct {
	Text string
	Href string
}

func main() {
	url := "https://www.baidu.com"
	doc, err := goquery.NewDocument(url)
	if err != nil {
		panic(err)
	}

	tmpl := template.Must(template.New("test").Parse(rstLink))

	doc.Find("a").Each(func(_ int, link *goquery.Selection) {
		text := strings.TrimSpace(link.Text())
		href, ok := link.Attr("href")
		// Only render anchors that actually carry an href attribute
		if ok {
			if err := tmpl.Execute(os.Stdout, &htmlLink{text, href}); err != nil {
				panic(err)
			}
		}
	})
}

Output:

` </>`_
`手写 <javascript:;>`_
`拼音 <javascript:;>`_
`关闭 <javascript:;>`_
`百度首页 </>`_
`设置 <javascript:;>`_
`登录 <https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F>`_
`新闻 <http://news.baidu.com>`_
`hao123 <http://www.hao123.com>`_
`地图 <http://map.baidu.com>`_
`视频 <http://v.baidu.com>`_
`贴吧 <http://tieba.baidu.com>`_
`学术 <http://xueshu.baidu.com>`_
`登录 <https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F>`_
`设置 <http://www.baidu.com/gaoji/preferences.html>`_
`更多产品 <http://www.baidu.com/more/>`_
`新闻 <http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=>`_
`贴吧 <http://tieba.baidu.com/f?kw=&fr=wwwt>`_
`知道 <http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt>`_
`音乐 <http://music.baidu.com/search?fr=ps&ie=utf-8&key=>`_
`图片 <http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=>`_
`视频 <http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=>`_
`地图 <http://map.baidu.com/m?word=&fr=ps01000>`_
`文库 <http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8>`_
`更多» <//www.baidu.com/more/>`_
`把百度设为主页 <//www.baidu.com/cache/sethelp/help.html>`_
`关于百度 <http://home.baidu.com>`_
`About  Baidu <http://ir.baidu.com>`_
`百度推广 <http://e.baidu.com/?refer=888>`_
`使用百度前必读 <http://www.baidu.com/duty/>`_
`意见反馈 <http://jianyi.baidu.com/>`_
`京公网安备11000002000001号 <http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001>`_
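
The outputs above mix absolute URLs with relative ones ("/", "/hire-me", "//www.baidu.com/more/") and javascript: pseudo-links. If the collected hrefs are going to be crawled, one possible cleanup step, sketched here with only the standard net/url package, is to resolve each href against the page URL and keep just the http and https results. The absoluteLinks helper and the sample inputs are assumptions for illustration, not part of goquery.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// absoluteLinks (a hypothetical helper) resolves each href against base
// and keeps only the results whose scheme is http or https.
func absoluteLinks(base string, hrefs []string) ([]string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, h := range hrefs {
		ref, err := url.Parse(h)
		if err != nil {
			continue // skip hrefs that do not parse as URLs
		}
		abs := baseURL.ResolveReference(ref)
		if abs.Scheme != "http" && abs.Scheme != "https" {
			continue // drops javascript:, mailto: and similar
		}
		out = append(out, abs.String())
	}
	return out, nil
}

func main() {
	// Sample hrefs taken from the Baidu output shown earlier.
	hrefs := []string{"/", "javascript:;", "//www.baidu.com/more/", "http://news.baidu.com"}
	links, err := absoluteLinks("https://www.baidu.com", hrefs)
	if err != nil {
		log.Fatal(err)
	}
	for _, l := range links {
		fmt.Println(l)
	}
	// prints:
	// https://www.baidu.com/
	// https://www.baidu.com/more/
	// http://news.baidu.com
}
```

In a scraper, the hrefs slice would be filled inside the Each callback from link.Attr("href").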
