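Is Google-Extended allowed by default?

Yes. Google-Extended follows the same robots.txt convention as GPTBot and ClaudeBot: with no rule naming it, your content stays eligible for training Gemini and the Vertex AI Generative API, and you opt out by writing User-agent: Google-Extended with Disallow: /. The real trap is that Google-Extended is a control token with no crawler of its own, so it never shows up in access logs and is easy to forget. An explicit User-agent: Google-Extended group with Allow: / is still worth writing because it states your choice outright.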

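What is the difference between GPTBot and ChatGPT-User?

GPTBot is the crawler OpenAI uses to fetch web pages at scale for training future models, a role comparable to Googlebot's. ChatGPT-User is the user agent that fetches a single page in real time when a user triggers browsing inside a ChatGPT conversation (for example, "find the official site for this product"). The two are separate entries in robots.txt: you can block GPTBot to stay out of training while allowing ChatGPT-User so your content can still be cited when a user asks for it.

What if Bytespider ignores robots.txt?

Bytespider (ByteDance) was reported several times in 2023-2024 as not fully honoring robots.txt. The most effective response is to block it at the edge: Cloudflare offers a one-click "AI Scrapers and Crawlers" control that blocks known AI crawlers, or you can match the User-Agent string in nginx/Apache and return 403. robots.txt alone does nothing against a crawler that ignores it.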

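GEO Technical ✦ 21-Day Series · Day 10 📅 2026-04-26 ⏱ 10 min read ✍️ By Rhoda 羅達 🔄 Last updated 2026-07-04

A Setup Guide to 10 AI Crawlers (GPTBot/ClaudeBot/Google-Extended)

📌 What you will learn

The User-agent, default behavior, and training vs. inference role of 10 key AI crawlers (GPTBot, ChatGPT-User, ClaudeBot, Google-Extended, PerplexityBot, CCBot, Bytespider, Applebot-Extended, Amazonbot, anthropic-ai)
A complete, copy-ready robots.txt template that can go live once you change the domain and Disallow paths
The most common setup pitfalls: Google-Extended and Googlebot are two different things, Allow vs. Disallow ordering, and default-behavior traps
How to monitor real AI crawler visits with nginx access logs, GA4, and Cloudflare AI Audit
Who this is for: site owners, developers, and SEO consultants who want their site fully open to AI engines and monitorable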

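AI crawlers decide whether you can appear in answers from ChatGPT, Claude, Gemini, and Perplexity. They reach your site through a path completely separate from traditional search bots (Googlebot, Bingbot): however high you rank in Google, if robots.txt blocks GPTBot, OpenAI will not use your content for training, and users asking ChatGPT will not find you.

This article lists the 10 key AI crawlers as of May 2026, covering each one's user-agent string, default behavior, owning company, and official documentation URL, along with a sample robots.txt template you can copy and paste. The final sections show how to use the nginx access log to monitor how often AI crawlers actually visit.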

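💡 The one-sentence takeaway

Of the 10 bots, 3 are critical (GPTBot, ClaudeBot, Google-Extended); blocking any one of them seriously hurts your visibility in AI engines. Google-Extended deserves extra care: it is a control token with no crawler of its own, so give it an explicit group that states your choice.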

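What is robots.txt? A 60-second refresher

robots.txt is a plain-text file at the site root (https://yoursite.com/robots.txt) that tells crawlers which paths they may fetch. Its core syntax is two kinds of lines: User-agent: names the crawler, and Disallow: or Allow: sets the path rules. Crawlers follow the convention that anything not disallowed may be fetched, and that holds for AI tokens too, Google-Extended included. What changes in the AI era is that some tokens, such as Google-Extended, control how already-crawled content is used rather than identifying a crawler, which makes them the easiest to overlook.

A common mistake is to set only User-agent: * (the wildcard) and assume it covers every crawler the way you intend. Mainstream AI crawlers match their own named group first and fall back to * only when no named group exists; once a named group matches, the * rules no longer apply to that crawler at all. The right approach is a separate group for every bot you care about. The 10 below are the ones you cannot ignore in 2026.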
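The fallback behavior described above can be checked locally. This is a minimal sketch that assumes Python's standard-library `urllib.robotparser` as a stand-in for a crawler's matcher; real crawlers may differ in details, and `OtherBot` and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /checkout/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse a robots.txt body held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# GPTBot has its own group, so only that group applies to it.
gptbot_checkout = parser.can_fetch("GPTBot", "https://example.com/checkout/cart")
# The * group's /admin/ rule is NOT inherited by GPTBot.
gptbot_admin = parser.can_fetch("GPTBot", "https://example.com/admin/")
# A crawler without a named group falls back to the * group.
other_admin = parser.can_fetch("OtherBot", "https://example.com/admin/")

print(gptbot_checkout, gptbot_admin, other_admin)  # False True False
```

The second result is the one that surprises people: a named group replaces the * group instead of adding to it, so every Disallow path you care about has to be repeated inside each named group.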

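What are the 10 key AI crawlers?

These 10 crawlers fall into three roles by purpose: training crawlers (collect content into model training data, e.g. GPTBot, ClaudeBot), search indexers (build the index for an AI search engine, e.g. PerplexityBot), and live browsers (fetch a single page at the moment a user asks, e.g. ChatGPT-User). GPTBot, ClaudeBot, and Google-Extended are the critical tier (scoring weight 0.20 each); blocking any one of them seriously hurts your visibility in AI engines. Each is covered below.

1. GPTBot: OpenAI training crawler

GPTBot is the main crawler OpenAI uses to fetch web pages for training GPT-series models; it was announced publicly in August 2023. Its user-agent string contains the GPTBot token, and the full format is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.2; +https://openai.com/gptbot. GPTBot honors robots.txt by default, and with no Disallow it treats training as allowed.

If you want your content in future GPT models (so that ChatGPT can answer about your brand, products, and articles), be sure to allow GPTBot. It carries the top weight in our scoring model (0.20) and is critical tier: blocking it means your content will not reach ChatGPT through training data. OpenAI lists the full IP ranges at platform.openai.com/docs/bots, so you can allowlist them at the server level.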
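A server-level allowlist usually pairs the user-agent check with an IP-range check, since a user-agent string alone can be spoofed. The sketch below assumes you have copied the published ranges into a list; the CIDR blocks shown are placeholders from the RFC 5737 documentation ranges, not OpenAI's real addresses, and `is_verified_gptbot` is a hypothetical helper name.

```python
import ipaddress

# Placeholder CIDR blocks (RFC 5737 documentation ranges).
# Replace them with the ranges OpenAI actually publishes.
ALLOWED_RANGES = [
    ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "198.51.100.0/28")
]


def is_verified_gptbot(user_agent: str, remote_ip: str) -> bool:
    """True only if the UA claims GPTBot AND the IP is inside an allowed range."""
    if "GPTBot" not in user_agent:
        return False
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in ALLOWED_RANGES)


ua = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.2; +https://openai.com/gptbot"
print(is_verified_gptbot(ua, "192.0.2.15"))   # True: inside a listed range
print(is_verified_gptbot(ua, "203.0.113.9"))  # False: claims GPTBot, wrong IP
```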

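2. ChatGPT-User: live browsing agent

ChatGPT-User is the user agent that fetches a single page in real time when a user triggers browsing in a ChatGPT conversation (for example, "find the latest price on this product's official site"). It is a separate path from GPTBot: GPTBot crawls in bulk for training, while ChatGPT-User reads one page on demand. A common strategy is to block GPTBot to stay out of training while allowing ChatGPT-User so you can still be cited when users look you up.

The ChatGPT-User user-agent string contains the ChatGPT-User token. OpenAI also runs OAI-SearchBot (dedicated to SearchGPT and kept distinct from ChatGPT-User). If you run an e-commerce or news site and want users to get your content when they ask ChatGPT for live information, allow both. The documentation is on the same official OpenAI bots page.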
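The "block training, allow live browsing" strategy can be written down as a small policy and rendered into robots.txt groups. This is a sketch under assumptions: the `POLICY` dict and `render_robots` helper are hypothetical names, and the check at the end uses Python's `urllib.robotparser`, which approximates but does not reproduce OpenAI's own matcher.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of training, stay reachable for live browsing.
POLICY = {
    "GPTBot": "Disallow: /",
    "ChatGPT-User": "Allow: /",
    "OAI-SearchBot": "Allow: /",
}


def render_robots(policy: dict[str, str]) -> str:
    """Render one robots.txt group per user-agent token, blank line between groups."""
    groups = [f"User-agent: {agent}\n{rule}\n" for agent, rule in policy.items()]
    return "\n".join(groups)


robots_txt = render_robots(POLICY)
print(robots_txt)

# Sanity-check the rendered file before deploying it.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
for agent in POLICY:
    print(agent, parser.can_fetch(agent, "https://example.com/pricing"))
```

Keeping the policy in one place makes it easy to flip a single bot later without hand-editing the file.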

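3. ClaudeBot: Anthropic's main crawler

ClaudeBot is the web crawler Anthropic officially launched in 2024 for training Claude models (including the Opus, Sonnet, and Haiku families). Its user-agent string is Mozilla/5.0 (compatible; ClaudeBot/1.0; [email protected]). ClaudeBot fully follows the robots.txt standard and provides a contact email for site owners to report problems, a much friendlier stance than Bytespider's.

In our scoring model, ClaudeBot is critical tier alongside GPTBot (weight 0.20). Claude models are growing very fast in enterprise use (Bedrock, Vertex AI, third-party integrations), so blocking it gives up visibility across a large share of LLM applications. Anthropic's official guidance is at support.anthropic.com.

4. anthropic-ai: Anthropic's legacy user agent

anthropic-ai is the older user agent Anthropic used before standardizing on the ClaudeBot name, and it still shows up in some access logs. For backward compatibility (some older crawlers may still use it), write one group each for anthropic-ai and ClaudeBot in robots.txt. Its weight is low (0.05) and it is non-critical, but the configuration cost is close to zero, so it is worth including.

Anthropic also has a separate claude-web (for live browsing triggered by Claude.ai users, a role similar to ChatGPT-User). To cover the Anthropic ecosystem fully, configure all three: ClaudeBot + anthropic-ai + claude-web.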

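5. Google-Extended: Gemini training control (no crawler of its own)

Google-Extended is the most easily overlooked entry in 2026. It is not a standalone bot but a policy token: after Googlebot has fetched your pages, Google uses it to decide whether that data may be used to train Gemini and the Vertex AI Generative API. Google introduced it in September 2023 so that a site can stay open to Google Search while choosing separately whether to take part in Gemini training.

How the default works: Google-Extended uses ordinary robots.txt semantics, so with no rule naming it your content stays eligible for Gemini training, and User-agent: Google-Extended with Disallow: / opts you out. We still recommend writing User-agent: Google-Extended with Allow: / explicitly so the choice is visible and deliberate. In our scoring it is critical tier (weight 0.20) and the only token that loses points when it is not configured explicitly. Google's announcement: blog.google, "An update on web publisher controls".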

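6. PerplexityBot: Perplexity's answer-engine crawler

PerplexityBot is the crawler Perplexity uses to build the index for its AI search engine. Among AI search products, Perplexity is known for listing its citation links explicitly, and the share of clicks that flow back to the source site is much higher than with ChatGPT, so allowing PerplexityBot brings real traffic, not just exposure. Weight 0.10; non-critical, but valuable for traffic-driven sites.

Its user-agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot). Perplexity has a second agent, Perplexity-User, for live user queries, the same split as OpenAI's GPTBot/ChatGPT-User. Documentation: docs.perplexity.ai/guides/bots.

7. CCBot: Common Crawl (the upstream source of LLM training data)

CCBot is the crawler of the nonprofit Common Crawl, which has been crawling the web and publishing open datasets since 2008. It is not an AI company, but the early training data of nearly every mainstream LLM (GPT, Claude, Llama, Mistral) included Common Crawl. In other words, blocking CCBot removes you at the source from the training data of most open-source and commercial models.

Its user-agent string is CCBot/2.0 (https://commoncrawl.org/faq/). CCBot fully honors robots.txt and refreshes only about once a month, so its traffic load is very low. Weight 0.05, but with a broad long-tail effect. Common Crawl FAQ: commoncrawl.org/big-picture/frequently-asked-questions.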

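8. Bytespider: ByteDance (豆包/Doubao)

Bytespider is ByteDance's crawler, used to train Doubao (豆包) and its overseas AI products. In 2023-2024, research groups and media reported repeatedly that it does not fully honor robots.txt: even when a site wrote Disallow, Bytespider kept crawling. Cloudflare and several other CDNs later shipped blocking rules that target Bytespider.

Its user-agent string is Mozilla/5.0 (compatible; Bytespider; [email protected]). If you do not want your content in ByteDance's AI models, robots.txt alone is not enough; block it by user-agent match at the nginx or Cloudflare edge. Weight 0.05. Blocking guide: Cloudflare, "Manage AI scrapers".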
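The same user-agent match can be enforced inside the application when you do not control nginx or Cloudflare. The sketch below is a plain WSGI middleware, assuming a Python web app; `block_crawlers` and `hello_app` are hypothetical names for illustration. A crawler that changes its user agent will slip past any check of this kind.

```python
BLOCKED_TOKENS = ("bytespider",)  # lowercase substrings to reject


def block_crawlers(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app; answer 403 when the User-Agent contains a blocked token."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Stand-in application used only for this illustration."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


# Any WSGI server can serve `application`.
application = block_crawlers(hello_app)
```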

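9. Applebot-Extended: Apple Intelligence training control

Applebot-Extended is the token Apple added after launching Apple Intelligence in June 2024. It works like Google-Extended: it does not crawl on its own but controls whether pages already fetched by Applebot may be used to train Apple's AI models, and a Disallow rule opts you out. The original Applebot (Spotlight and Siri indexing) keeps its usual rules; Applebot-Extended is a separate entry dedicated to training use.

The Apple ecosystem (iPhone, Mac, iPad) has a very large market share in Taiwan and North America. If you are a consumer brand, a media outlet, or a knowledge site, Apple Intelligence will become the default AI entry point for iOS users, so an explicit Allow is worthwhile. Weight 0.05, but the effect is significant in certain verticals (e-commerce, travel, local services). Apple's documentation: support.apple.com, "About Applebot".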

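10. Amazonbot: Alexa / Rufus assistants

Amazonbot is Amazon's crawler, used for content understanding in Alexa smart speakers, Rufus (the AI shopping assistant inside Amazon), and some AWS Bedrock services. Its user-agent string is Mozilla/5.0 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot). Weight 0.05, non-critical; it matters most for e-commerce and product-review sites, because Rufus builds review summaries directly into Amazon shopping pages.

Meta (Facebook) has its counterparts too: FacebookBot (used for Meta AI training) and Meta-ExternalAgent. If you care about being cited by Meta AI in Threads, Instagram, and WhatsApp, add them to robots.txt as well. Amazon's documentation: developer.amazon.com/support/amazonbot.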

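How do the 10 AI crawlers compare in one table?

One table covers the key points. In the Priority column, critical means blocking it seriously hurts your GEO score, so allow all of them; non-critical means worth opening but with a smaller effect. In the Default behavior column, control token means the name never crawls by itself and only governs how content fetched by the vendor's main crawler is used.

Crawler	Company	Purpose	Default behavior	Priority
`GPTBot`	OpenAI	GPT training	Honors robots.txt	critical
`ChatGPT-User`	OpenAI	Live browsing	Honors robots.txt	Recommended
`ClaudeBot`	Anthropic	Claude training	Honors robots.txt	critical
`anthropic-ai`	Anthropic	Legacy user agent	Honors robots.txt	non-critical
`Google-Extended`	Google	Gemini training	Control token (honors robots.txt)	critical
`PerplexityBot`	Perplexity	AI search index	Honors robots.txt	Recommended
`CCBot`	Common Crawl	Open dataset	Honors robots.txt	Recommended
`Bytespider`	ByteDance	豆包 / Doubao	Sometimes ignores robots.txt	Depends on your policy
`Applebot-Extended`	Apple	Apple Intelligence	Control token (honors robots.txt)	Recommended
`Amazonbot`	Amazon	Alexa / Rufus	Honors robots.txt	non-critical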
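The weights quoted throughout this article can be combined into a single openness score. How the scoring model actually combines them is not described here, so the sketch below assumes a plain weighted sum. ChatGPT-User's weight is not stated in the article; 5 points is an assumption chosen so the total comes to 100.

```python
# Weights in percentage points, as quoted in the article.
# ChatGPT-User's weight is NOT given there; 5 is an assumption.
WEIGHTS = {
    "GPTBot": 20,
    "ClaudeBot": 20,
    "Google-Extended": 20,
    "PerplexityBot": 10,
    "ChatGPT-User": 5,
    "anthropic-ai": 5,
    "CCBot": 5,
    "Bytespider": 5,
    "Applebot-Extended": 5,
    "Amazonbot": 5,
}


def openness_score(allowed: set[str]) -> int:
    """Sum the weights of every crawler the site allows (0 to 100)."""
    return sum(weight for bot, weight in WEIGHTS.items() if bot in allowed)


everything_but_bytespider = set(WEIGHTS) - {"Bytespider"}
print(openness_score(everything_but_bytespider))  # 95
print(openness_score({"GPTBot", "ClaudeBot"}))    # 40
```

Integer points are used instead of 0.20-style floats so the sums stay exact.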

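How do you use the complete robots.txt template? (Copy-ready)

Below is a complete, commented robots.txt that is AI-friendly and protects sensitive paths. After copying, change example.com to your domain and adjust the Disallow paths as needed:

# ==========================================
# robots.txt: AI-friendly configuration (2026-05 edition)
# ==========================================

# Traditional search bots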
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# --- AI training crawlers (the three critical ones) ---

User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /checkout/
Disallow: /api/

User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /checkout/

# Google-Extended: control token; state the training permission explicitly
User-agent: Google-Extended
Allow: /

# --- AI live browsing (user-triggered) ---

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Perplexity-User
Allow: /

# --- Other AI crawlers ---

User-agent: PerplexityBot
Allow: /

User-agent: CCBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: FacebookBot
Allow: /

# --- Depends on your policy ---
# To keep your content out of ByteDance's AI:
# User-agent: Bytespider
# Disallow: /

# --- Generic fallback ---
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /api/private/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml

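✓ Key checklist

(1) Google-Extended has its own group with Allow: /; (2) Applebot-Extended likewise has an explicit Allow; (3) GPTBot, ClaudeBot, and PerplexityBot each have a group, none missing; (4) User-agent: * comes after all the named bots as the generic fallback; (5) the file ends with a Sitemap: line pointing to your sitemap.xml.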
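Part of this checklist can be automated. The sketch below is a simplified linter rather than a full robots.txt parser: it only checks that the named groups, the * fallback, and a Sitemap: line are present, and it does not inspect the Allow/Disallow rules inside each group or their order. `lint_robots` and `REQUIRED_AGENTS` are hypothetical names.

```python
import re

REQUIRED_AGENTS = (
    "Google-Extended", "Applebot-Extended", "GPTBot", "ClaudeBot", "PerplexityBot",
)


def lint_robots(text: str) -> list[str]:
    """Return a list of checklist problems found in a robots.txt body."""
    # Strip comments and surrounding whitespace, then drop empty lines.
    lines = [line.split("#", 1)[0].strip() for line in text.splitlines()]
    lines = [line for line in lines if line]

    # Collect every user-agent token that opens or joins a group.
    agents = set()
    for line in lines:
        match = re.match(r"user-agent\s*:\s*(.+)", line, re.IGNORECASE)
        if match:
            agents.add(match.group(1).strip().lower())

    problems = [
        f"missing group for {name}"
        for name in REQUIRED_AGENTS
        if name.lower() not in agents
    ]
    if "*" not in agents:
        problems.append("missing User-agent: * fallback group")
    if not any(line.lower().startswith("sitemap:") for line in lines):
        problems.append("missing Sitemap: line")
    return problems


print(lint_robots("User-agent: *\nDisallow: /admin/\n"))
```

An empty list means the file passes these presence checks; it does not prove the rules inside each group are what you intended.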

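What are the common mistakes, and which tools check for them?

The three most common mistakes. First, writing Disallow: / under User-agent: *, which shuts out the whole site for every crawler that lacks a named group of its own; a named group overrides * wherever it sits in the file, so each bot you want to admit needs its own group. Second, never writing a Google-Extended group, which leaves your Gemini training choice implicit rather than stated. Third, confusing crawlers with wildcard paths (for example Disallow: /*?): crawlers differ slightly in how they parse wildcards, so when in doubt use concrete paths.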
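Parser differences also affect plain Allow/Disallow ordering, not just wildcards. RFC 9309 says the longest matching rule wins, with Allow winning ties, but some simpler parsers apply the first rule that matches. The sketch below implements both strategies by hand to show where they disagree; the rule lists and function names are illustrative, and only plain path prefixes are handled (no `*` or `$`).

```python
def longest_match_allows(path: str, rules: list[tuple[str, str]]) -> bool:
    """RFC 9309 style: the longest matching prefix wins; ties go to allow."""
    matches = [
        (len(prefix), kind == "allow")
        for kind, prefix in rules
        if path.startswith(prefix)
    ]
    if not matches:
        return True  # nothing matched, so fetching is allowed
    return max(matches)[1]  # tuple order: longer prefix first, then allow over disallow


def first_match_allows(path: str, rules: list[tuple[str, str]]) -> bool:
    """Simpler parsers: the first rule whose prefix matches decides."""
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    return True


broad_allow_first = [("allow", "/"), ("disallow", "/admin/")]
specific_first = [("disallow", "/admin/"), ("allow", "/")]

for rules in (broad_allow_first, specific_first):
    print(
        longest_match_allows("/admin/users", rules),
        first_match_allows("/admin/users", rules),
    )
# False True   <- a first-match parser lets /admin/ through
# False False  <- both strategies agree
```

Listing the specific Disallow lines before a broad Allow: / gives the same result under both strategies, which makes it the safer order when you cannot be sure how a crawler parses the file.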

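Three verification aids: the robots.txt report in Google Search Console shows how Google reads your file; OpenAI's documentation at platform.openai.com/docs/bots provides GPTBot rule examples to compare against; and Anthropic's documentation lists ClaudeBot's full user-agent string. This tool (AV8D AI SEO Search) also offers automated AI crawler openness checks that scan the configuration for all 10 bots in one click.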

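⚠️ Note for Cloudflare users

If you use Cloudflare, "Bot Fight Mode" may block GPTBot and ClaudeBot by mistake. Check the Security → Bots panel; you can also use Cloudflare's "AI Audit" feature to see which AI crawlers actually visit and which pages they fetch.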

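How do you monitor real AI crawler visits?

Once robots.txt is in place, verify that the crawlers really show up. The most direct way is to filter the server access log by user agent. With nginx, the command below counts visits per AI crawler in the current access log; to cover the last 7 days, include the rotated log files as well: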

# nginx: count visits per AI crawler in the current access log
# Google-Extended is a control token and never appears as a user agent, so it is not listed.
# Applebot is matched in both passes so the counts line up.
BOTS="GPTBot|ClaudeBot|PerplexityBot|CCBot|Bytespider|Applebot|Amazonbot|anthropic-ai"
sudo grep -E "$BOTS" /var/log/nginx/access.log \
  | awk -F'"' '{print $6}' \
  | grep -oE "($BOTS)" \
  | sort | uniq -c | sort -rn

# Sample output (a healthy site)
#  1247 GPTBot
#   892 ClaudeBot
#   534 PerplexityBot
#   312 CCBot
#   201 Applebot
#    45 Bytespider

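If a key bot (GPTBot, ClaudeBot) shows 0 visits for 30 consecutive days, robots.txt is wrong, the CDN is blocking it, or a firewall has banned its IPs, and you need to investigate. For a healthy mid-sized content site, GPTBot and ClaudeBot together should make at least several hundred to several thousand visits a month. Google-Extended never appears in the log under its own name: its fetches are recorded under Googlebot's user agent (it is a policy token, not a standalone bot), so the "Crawl stats" report in Google Search Console is the more accurate place to look.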
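To apply a real 7-day or 30-day window, the log timestamps have to be parsed. The sketch below assumes nginx's default "combined" log format and an English month locale; `count_ai_hits` and the `AI_BOTS` list are illustrative names, and the list should be adjusted to the bots you track.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

AI_BOTS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "CCBot",
           "Bytespider", "Applebot", "Amazonbot", "anthropic-ai")

# nginx "combined": ... [time_local] "request" status bytes "referer" "user_agent"
LINE_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_ai_hits(lines, now: datetime, days: int = 7) -> Counter:
    """Count requests per AI crawler token within the last `days` days."""
    cutoff = now - timedelta(days=days)
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not in combined format; skip
        when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        if when < cutoff:
            continue  # older than the window
        for bot in AI_BOTS:
            if bot in match["ua"]:
                hits[bot] += 1
                break  # count each request once
    return hits


sample = [
    '203.0.113.5 - - [09/May/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '203.0.113.6 - - [01/Apr/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]
now = datetime(2026, 5, 10, tzinfo=timezone.utc)
print(count_ai_hits(sample, now))  # Counter({'GPTBot': 1}); the April hit is outside the window
```

For a live log, pass an open file handle and `datetime.now(timezone.utc)` instead of the sample list and the fixed date; setting `days=30` gives the 30-day check described above.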

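For more advanced monitoring, connect GA4 (filtering by user agent) or CloudWatch Logs Insights (on AWS), or go straight to the dashboard in Cloudflare's "AI Audit". This tool also has built-in robots.txt scanning plus AI crawler visit tracking, and can alert you when it detects an anomaly.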

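Key terms in 3 sentences

robots.txt: a plain-text file at the site root that tells crawlers which paths they may and may not fetch. AI crawlers read it too.
User-agent: the string that identifies a crawler. Each AI crawler (GPTBot, ClaudeBot...) has its own User-agent name.
Google-Extended: the control token Google uses to govern Gemini training. It is separate from Googlebot: rules written for Googlebot decide search crawling, while a Google-Extended group decides whether the crawled content may be used for AI training.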

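FAQ: what are the most common pitfalls when configuring AI crawlers?

Q: Will blocking AI crawlers hurt traffic? A: The short-term effect is limited; the medium- and long-term effect keeps growing. As more people use ChatGPT / Perplexity to find information, a blocked site is never cited and its exposure there drops to zero.
Q: Will opening up to AI crawlers let them take my content for free? A: This is a common misconception. After crawling, AI products cite source links in their answers (ChatGPT, Perplexity, and Claude all do), which brings brand exposure and clicks. Blocking them completely is what really costs you exposure.
Q: How do Google-Extended and Googlebot differ? A: Googlebot is the traditional search crawler (it affects organic Google rankings); Google-Extended is the control token for Gemini training data. An Allow for Googlebot says nothing about training, so write a separate Google-Extended group: Allow: / to take part, Disallow: / to opt out.
Q: How long until a robots.txt change takes effect? A: As soon as the crawler next fetches the file. Google's crawlers take roughly a few hours to 1 day; GPTBot/ClaudeBot usually re-fetch robots.txt within 1-3 days.
Q: Do I have to list every AI crawler? A: Not necessarily. If `User-agent: *` allows access, every crawler not listed explicitly is allowed. Still, list at least GPTBot, ClaudeBot, PerplexityBot, and Google-Extended: it states your intent clearly and makes per-bot adjustments easier later.
Q: Should Bytespider and Amazonbot be open too? A: It depends on your target market. Bytespider (豆包/Douyin) targets the Chinese market; Amazonbot (Alexa/Rufus) targets the Amazon ecosystem. In general, open everything unless you have specific privacy concerns.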

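Conclusion: which 3 actions should you take right now, and in what order?

Complete these three steps and your site has the basic eligibility for exposure in the AI engines of 2026:

Step 1: Open https://yoursite.com/robots.txt right now. Check that the three key bots, Google-Extended, GPTBot, and ClaudeBot, each have their own group with an Allow. If not, go to step 2.
Step 2: Apply this article's robots.txt template. Change the domain and Disallow paths, upload it to the site root, and confirm that https://yoursite.com/robots.txt loads in a browser.
Step 3: Check the nginx access log after one week. Confirm that GPTBot, ClaudeBot, and other User-agents have actually visited; if none have appeared after a week, the site's authority may be too low or the sitemap may not have been submitted.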
