Installing and Using goquery

Installation

Run:

go get github.com/PuerkitoBio/goquery

Import

import "github.com/PuerkitoBio/goquery"

Loading a Page

We use IMDb's popular movies page as an example:

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	if res.StatusCode != 200 {
		log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
	}

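Getting the Document Object

	// Build the document from the response body (continuing main from above).
	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}
	// Other ways to create a document:
	// doc, err := goquery.NewDocumentFromReader(reader) // any io.Reader
	// doc, err := goquery.NewDocument(url) // deprecated; fetch with net/http instead
	// doc, err := goquery.NewDocumentFromReader(strings.NewReader("<p>Example content</p>"))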

Selecting Elements

Element Selector

Selects elements by their HTML tag name. For example, dom.Find("p") matches all p tags. Find calls can be chained:

ele.Find("h2").Find("a")

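Attribute Selector

Selects elements by attribute and value. Several matching forms are available:

Find("div[my]")        // div elements that have a my attribute
Find("div[my=zh]")     // div elements whose my attribute equals zh
Find("div[my!=zh]")    // div elements whose my attribute is not equal to zh
Find("div[my|=zh]")    // div elements whose my attribute is zh or starts with zh-
Find("div[my*=zh]")    // div elements whose my attribute contains the substring zh
Find("div[my~=zh]")    // div elements whose my attribute contains the word zh
Find("div[my$=zh]")    // div elements whose my attribute ends with zh
Find("div[my^=zh]")    // div elements whose my attribute starts with zh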

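`parent > child` Selector

Selects the direct children of an element. For example, dom.Find("div>p") selects p tags that are direct children of a div tag.

`element + next` Adjacent Sibling Selector

Useful when the target elements follow no pattern of their own but the element before them does. For example, dom.Find("p[my=a]+p") selects the p tag that immediately follows a p tag whose my attribute is a.

`element~next` General Sibling Selector

Selects sibling tags that come after an element under the same parent, whether or not they are adjacent. For example, dom.Find("p[my=a]~p") selects every p tag that follows, as a sibling, a p tag whose my attribute is a.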

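ID Selector

Starts with # and matches an element by its id. For example, dom.Find("#title") matches the element with id=title. A tag name can be added to narrow the match, as in dom.Find("p#title").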

ele.Find("#title")

Class Selector

Starts with . and selects elements that have the given class name, for example dom.Find(".content1"). A tag name can be added to narrow the match, as in dom.Find("div.content1").

ele.Find(".title")

Selector OR (,) Operation

Combines multiple selectors separated by commas. An element is selected if it matches any one of them. For example, Find("div,span").

func main() {
	html := `<body>
                <div lang="zh">DIV1</div>
                <span>
                    <div>DIV5</div>
                </span>
            </body>`
	dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatalln(err)
	}
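	dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
		// Html returns (string, error); handle the error instead of printing the pair.
		inner, err := selection.Html()
		if err != nil {
			log.Println(err)
			return
		}
		fmt.Println(inner)
	})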
}

Filters

`:contains` Filter

Selects elements that contain the given text. For example, dom.Find("p:contains(a)") selects p tags whose text contains a.

dom.Find("div:contains(DIV2)").Each(func(i int, selection *goquery.Selection) {
	fmt.Println(selection.Text())
})

`:has(selector)`

Selects elements that contain at least one descendant matching the given selector.

`:empty`

Selects elements that have no children, meaning no child elements and no text.

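`:first-child` and `:first-of-type` Filters

Find("p:first-child") selects p tags that are the first child of their parent. :first-of-type only requires the element to be the first of its type among its siblings, so other tags may come before it.

`:last-child` and `:last-of-type` Filters

The counterparts of :first-child and :first-of-type: they match the last child and the last element of its type.

`:nth-child(n)` and `:nth-of-type(n)` Filters

:nth-child(n) selects elements that are the nth child of their parent. :nth-of-type(n) selects the nth element of its type among its siblings. Counting starts at 1.

`:nth-last-child(n)` and `:nth-last-of-type(n)` Filters

These count from the end, so the last element counts as the first.

`:only-child` and `:only-of-type` Filters

Find(":only-child") selects elements that are the only child of their parent. Find(":only-of-type") selects elements that are the only one of their type among their siblings.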

Getting Content

ele.Html()
ele.Text()

Traversal

Use the Each method to iterate over the selected elements:

ele.Find(".item").Each(func(index int, elA *goquery.Selection) {
	href, _ := elA.Attr("href")
	fmt.Println(href)
})

Built-in Functions

Array Positioning Functions

Eq(index int) *Selection
First() *Selection
Get(index int) *html.Node
Index...() int
Last() *Selection
Slice(start, end int) *Selection

Expansion Functions

Add...()
AndSelf()
Union()

Filtering Functions

End()
Filter...()
Has...()
Intersection()
Not...()

Iteration Functions

Each(f func(int, *Selection)) *Selection
EachWithBreak(f func(int, *Selection) bool) *Selection
Map(f func(int, *Selection) string) (result []string)

Document Manipulation Functions

After...()
Append...()
Before...()
Clone()
Empty()
Prepend...()
Remove...()
ReplaceWith...()
Unwrap()
Wrap...()
WrapAll...()
WrapInner...()

Attribute Manipulation Functions

Attr*(), RemoveAttr(), SetAttr()
AttrOr(e string, d string)
AddClass(), HasClass(), RemoveClass(), ToggleClass()
Html()
Length()
Size()
Text()

Node Query Functions

Contains()
Is...()

Document Tree Traversal Functions

Children...()
Contents()
Find...()
Next...() *Selection
NextAll() *Selection
Parent[s]...()
Prev...() *Selection
Siblings...()

Type Definitions

Document
Selection
Matcher

Helper Functions

NodeName
OuterHtml

Examples

Getting Started Example

func main() {
	html := `<html>
            <body>
                <h1 id="title">O Captain! My Captain!</h1>
                <p class="content1">
                O Captain! my Captain! our fearful trip is done,
                The ship has weather’d every rack, the prize we sought is won,
                The port is near, the bells I hear, the people all exulting,
                While follow eyes the steady keel, the vessel grim and daring;
                </p>
            </body>
            </html>`
	dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatalln(err)
	}
	dom.Find("p").Each(func(i int, selection *goquery.Selection) {
		fmt.Println(selection.Text())
	})
}

Example: Crawling IMDb's Popular Movies

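package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		log.Fatalf("status code error: %s", res.Status)
	}

	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// NOTE: this selector is an assumption. The original selector was lost,
	// and IMDb's markup changes, so inspect the page and adjust it.
	doc.Find(`a[href^="/title/"]`).Each(func(i int, selection *goquery.Selection) {
		title := selection.Text()
		href, _ := selection.Attr("href")
		fmt.Printf("Movie Name: %s, Link: https://www.imdb.com%s\n", title, href)
	})
}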

The example above extracts movie names and links from IMDb's popular movies page. In practice, adjust the selector and the processing logic to fit your needs.
