scraping

“Scraping” is in the news these days. That’s the practice of third parties scanning websites and collecting all the information they can. It’s not necessarily a bad thing: without scraping, we wouldn’t have search engines or the fabulous Internet Archive.

Inevitably, bad actors get involved. Early scrapers were used to scan the World Wide Web for email addresses to add to spammers’ spew lists. These days the controversy is more around builders of large language models scraping human-generated content so that their “AI” bots can repackage it as slag and sling it around the internet.

Deep in the distant past, I was in my kitchen when I heard a great roar come from the server room (I had 800 square feet or so of space in the house dedicated to data center ops). Moments later I got some text messages from status monitors. It was hard to even get a shell on the servers to see what was going on, but when I finally got through I discovered I was under a full-scale DDoS attack… from Microsoft. Hundreds of IP addresses were hammering my servers thousands of times per second. They were blasting through robots.txt restrictions and hitting everything they could as hard as they could. The roar was from a hundred or so servers’ fans spinning up to full speed to cool the overtaxed machines. I momentarily regretted having a 100 Mbps connection (fast for that time) but, remarkably, everything stayed up. My lines were so congested that I had to use a cellular connection to try to figure out what was going on until I could get enough Microsoft address space blocked that I could comfortably start analyzing logs.

What had I done to warrant an attack from Microsoft? Nothing: don’t attribute to malice what can be explained by incompetence. In this case, incompetence combined with arrogance. Microsoft was launching its Bing search engine, and decided that it was too inconvenient to follow the standards specified for the robots.txt file (which tells scrapers which parts of a site they may access and how fast they may access them) and just published their own. This let them build the Bing search index at, basically, your expense and mine, since it was other people’s networks and computers that supplied the bandwidth and cycles to feed Microsoft’s gluttonous need for data and their rude, selfish, and toxic way of getting it.
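For readers who haven't met robots.txt, here is a minimal sketch of what a well-behaved scraper is expected to do before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` agent name, the example.com URLs, and the `check` helper are all made up for illustration; none of them come from the incident described here. Note that `Crawl-delay` is a widely recognized extension rather than part of the core standard, so not every crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def check(robots_txt: str, agent: str, url: str):
    """Return (allowed, crawl_delay_seconds) for `agent` fetching `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a live crawler would instead call
    # set_url() and read() to download the site's actual robots.txt.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)


if __name__ == "__main__":
    for url in ("https://example.com/index.html",
                "https://example.com/private/report.html"):
        allowed, delay = check(ROBOTS_TXT, "ExampleBot", url)
        print(url, "allowed:", allowed, "delay:", delay)
```

A polite scraper would skip any URL where `allowed` is false and wait at least `delay` seconds between requests to the same site.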

Had I launched an attack like that, even at a fraction of the scale, the victims would simply complain to my ISP and upstream providers and I’d get cut off, pronto. Microsoft, of course, was permitted to abuse everyone else’s systems with impunity. Usenet forums were packed with sysadmins complaining of the Microsoft attack, which ripped off the resources of thousands of companies. Microsoft’s response: “it’s your fault for not reading our blogs and modifying your robots.txt files to conform with our new guidelines.”

—2p

addendum 20240811@09:30

Microsoft is doing it again with their “AI” projects, claiming that anything on the web is theirs to scrape and use however they see fit. This is like rolling a wheelbarrow into your local taqueria and filling it up with chips, salsa, plastic flatware, condiments, and napkins. What arrogant jerks.


twoprops.net
