Posted on: 2025-1-22. Last edited: 2026-6-12.

A practical guide to web scraping that starts with robots.txt and data-use boundaries, then compares requests, aiohttp, Scrapy, and Playwright before covering anti-bot handling, parsing, retries, and data quality.
I started writing crawlers in college, from scraping Tieba posts to collecting stock prices, Douban pages, and GitHub Trending. The tools changed along the way: requests, then Scrapy, then Playwright. This is not meant to be a complete web scraping guide. It is more like a map of the tools I have actually used, when each one fits, and the lessons I only remembered after stepping into a few holes.

Start With Permission

Before writing a crawler, check robots.txt. It is the file where a site tells crawlers which paths are welcome and which paths should be avoided. In Python, the basic check takes only a few lines with the standard library's urllib.robotparser module.
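A minimal sketch of that check, using only the standard library. The robots.txt body, the my-crawler user agent name, and the example.com URLs are made up for illustration; a real crawler would load the live file with set_url() and read() instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-crawler" is an assumed user agent name; it falls under the "*" rules.
print(parser.can_fetch("my-crawler", "https://example.com/posts/1"))    # True
print(parser.can_fetch("my-crawler", "https://example.com/private/x"))  # False
print(parser.crawl_delay("my-crawler"))                                 # 2

# Against a live site the setup would instead be:
#   parser = RobotFileParser()
#   parser.set_url("https://example.com/robots.txt")
#   parser.read()
```

If the file declares a Crawl-delay, it is a reasonable floor for the pause between requests.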
robots.txt is not the same thing as law. Ignoring it does not automatically mean you will be sued. But it is a signal about whether the site wants to cooperate with crawlers. If you ignore that signal, the site can answer with things that are much more annoying than robots.txt: IP bans, CAPTCHAs, JavaScript encryption, and risk-control rules. Any one of those can waste several days.
There is also the question of what you do with the data. Scraping public pages for your own analysis is one thing. Reselling the data, training a model on it, or bypassing a paywall is another. That is where legal and ethical boundaries start to matter.

My Current Tool Stack

Different scraping jobs need different tools. I roughly split them into three levels.
First level: quick small jobs. requests plus BeautifulSoup. A few lines are enough to fetch a page and pull something out.
Second level: concurrency. aiohttp plus asyncio. If you need to fetch hundreds of URLs, async requests let a single process keep many connections in flight at once instead of waiting on each response in turn.
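The concurrency pattern can be sketched with asyncio alone. Here fake_fetch is a stand-in I made up so the example runs without a network; with aiohttp, the fetch coroutine would wrap session.get(url) inside an aiohttp.ClientSession. The names fetch_all and max_concurrency are also just illustrative.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=10):
    """Run fetch(url) for every URL, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:
                # Keep the failure as the value so one bad URL
                # does not cancel the rest of the batch.
                return url, exc

    pairs = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(pairs)


async def fake_fetch(url):
    """Placeholder fetcher: pretends to wait on the network."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


URLS = [f"https://example.com/page/{i}" for i in range(20)]
results = asyncio.run(fetch_all(URLS, fake_fetch, max_concurrency=5))
print(len(results))
```

The semaphore is the important part. Without a cap, async code makes it very easy to hit a site with hundreds of simultaneous requests.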
Third level: engineering. Scrapy. It comes with scheduling, deduplication, pipelines, and middleware. If a crawler needs to run for days, resume after failure, slow itself down, and write to a database, Scrapy already has most of that shape.
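As a rough illustration, most of that behavior is switched on through project settings rather than custom code. The fragment below is a sketch of a settings.py for a hypothetical project called myproject; the values are arbitrary starting points and the pipeline class path is a placeholder, not something from a real project.

```python
# settings.py for a hypothetical Scrapy project named "myproject"

BOT_NAME = "myproject"

# Respect robots.txt rules before downloading a page.
ROBOTSTXT_OBEY = True

# Slow down: base delay in seconds between requests to the same site.
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adjust the delay based on how the server is responding.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30

# Retry failed requests a few times before giving up.
RETRY_ENABLED = True
RETRY_TIMES = 3

# Persist scheduler state on disk so an interrupted crawl can resume.
JOBDIR = "crawls/myproject-run1"

# Send scraped items through a pipeline; this class path is a placeholder.
ITEM_PIPELINES = {
    "myproject.pipelines.DatabasePipeline": 300,
}
```

Duplicate requests are filtered by Scrapy's default dupefilter, so deduplication needs no setting here.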
If the page is a SPA and the content only appears after JavaScript runs, these tools are not enough, because none of them executes the page's scripts. Then I use a headless browser. These days I prefer Playwright over Selenium. In my experience it is faster, and the API feels more modern.

Anti-Bot Measures

Once a site notices that you are a crawler, the blocking usually follows a familiar order.
  1. User-Agent checks. The default requests User-Agent identifies itself as python-requests, so it is trivial to spot. The first fix is to send a browser-like User-Agent string.
  2. IP frequency checks. Too many requests from the same IP in a short time will get blocked. The simplest fix is to slow down. A proxy pool can help, but it is expensive and often not worth it.
  3. Cookies and sessions. For pages that require login state, use requests.Session() to keep cookies across requests.
  4. CAPTCHAs. At this point the site is basically telling you to stop. OCR plus CAPTCHA-solving services can work, but the cost often outweighs the value of the data.
A proxy pool is easy to write and hard to maintain. Free proxies are mostly dead. Paid proxies cost real money. Before building around them, ask whether the thing you are scraping is worth that cost.
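The two cheapest fixes above, a browser-like User-Agent and slowing down, can be sketched without a proxy pool. HostThrottle and BROWSER_HEADERS are names I made up for this example, and the User-Agent string is only a sample. The clock and sleep parameters exist so the class can be tested without real waiting.

```python
import time
from urllib.parse import urlsplit

# Sample browser-like headers. With requests, these would typically go on
# requests.Session().headers so they persist together with cookies.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time the previous request was released

    def wait(self, url):
        """Sleep if the host was hit too recently; return the seconds slept."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        waited = 0.0
        if last is not None:
            waited = max(0.0, self.min_interval - (now - last))
            if waited:
                self._sleep(waited)
        self._last_request[host] = now + waited
        return waited
```

Usage is one call before each request: throttle.wait(url), then send the request with BROWSER_HEADERS. Different hosts are tracked separately, so a crawl across several sites is not slowed to the pace of one.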

Parsing The HTML

After fetching HTML, you still need to extract data from it. The common choices are:
  • BeautifulSoup. Easiest to start with. It feels a bit like jQuery.
  • lxml plus XPath. Steeper syntax, but fast and stable for complicated pages.
  • PyQuery. A Python implementation of jQuery-like selectors. If you have written frontend code, it is comfortable.
  • Regex. Avoid it when you can. HTML is a tree, not a flat string stream. Parsing it with regex will eventually hurt.
My default is lxml plus XPath, especially with Scrapy. For simple jobs, BeautifulSoup is enough.
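To make the regex warning concrete, here is a small comparison. It uses the standard library's html.parser instead of BeautifulSoup or lxml, purely so it runs without extra installs; those libraries offer the same idea with far nicer selectors. The HTML snippet and the TitleExtractor class are invented for the example.

```python
import re
from html.parser import HTMLParser

HTML = """
<div>
  <h2 class="title">Plain heading</h2>
  <h2 class="title featured">Heading with <em>nested</em> markup</h2>
  <h2 class="other">Not a title</h2>
</div>
"""


class TitleExtractor(HTMLParser):
    """Collect the text of every <h2> whose class list contains "title"."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "title" in classes:
            self._in_title = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_title:
            self._in_title = False
            self.titles.append("".join(self._buffer).strip())

    def handle_data(self, data):
        # Text inside nested tags such as <em> still arrives here.
        if self._in_title:
            self._buffer.append(data)


def extract_titles(html):
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.titles


# The naive regex only matches the exact attribute string it was written for.
regex_titles = re.findall(r'<h2 class="title">(.*?)</h2>', HTML)

print(extract_titles(HTML))  # ['Plain heading', 'Heading with nested markup']
print(regex_titles)          # ['Plain heading']
```

The regex silently drops the second heading because its class attribute has an extra value. The parser does not care, because it works on tags and attributes rather than on the literal text.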

Practical Lessons

After writing crawlers for a while, the lessons that stayed with me are less about clever tricks and more about boring discipline.
  1. Go slower. Add a sleep of 1 to 3 seconds between requests. Most sites will never notice you. This is much better than blasting concurrency and getting blocked.
  2. Save the raw HTML. Store the complete HTML locally first, then debug parsing against the local copy. You can change selectors without re-fetching the page every time.
  3. Write retries. Networks fail constantly. Wrap each request with a small retry loop and exponential backoff.
  4. Do not trust selectors. .title today may become .article-title tomorrow. Crawlers are fragile because page structure changes. Monitoring and alerts matter as much as the crawler code.
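The retry and raw-HTML lessons above can be sketched in a few lines. fetch_with_retry and save_raw_html are hypothetical helper names, and fetch is whatever function actually downloads the page, for example a thin wrapper around requests.get. The bare except Exception is a simplification; real code should catch only network errors.

```python
import hashlib
import random
import time
from pathlib import Path


def fetch_with_retry(fetch, url, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt plus jitter."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts, surface the last error
            # Exponential backoff: roughly 1s, 2s, 4s, ... plus up to 0.5s jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)


def save_raw_html(html, url, directory="raw_html"):
    """Write the page to disk under a name derived from the URL hash."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = folder / name
    path.write_text(html, encoding="utf-8")
    return path
```

A typical loop would call fetch_with_retry, pass the result to save_raw_html, and only then run the parser against the saved file. When a selector breaks, the parser can be fixed and rerun on the local copies without touching the site again.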
