GLADE ART

10 million requests in my bot black hole... Here is some information:

By Jackie Glade, June 01, 2026
And so, ladies and gentlemen, we have hit the 10,000,000 serves threshold of the Data Export tar pit. This is a whole new level of bot trapping, and it's peak. Let's take a look at the logs and stuff.

Tar pit we're talking about today: https://gladeart.com/data-export (and all parts of its link tree). If you're not too familiar with bots and tar pits, you can read this first: https://gladeart.com/blog/the-bot-situation-on-the-internet-is-actually-worse-than-you-could-imagine-heres-why

What or who are they scraping for? At these scales of scraping, we can safely assume that this is for AI training; these guys have some pretty good funding.

LOGS for download:

First 6 million requests: https://mega.nz/file/69Rh3IpS#ThlagHz8e58jLvU-vWn9U9m9T_WegL4SE0H2mhZRcZY (decompresses to a 1.1GB text file).

Next 4 million requests: https://mega.nz/file/rphUlJ6Z#2SHnCfGemVZb-qcqL2CFZ7LNeaQVo2hLZE8ynlQakS4 (decompresses to a 700MB text file).

(NOTE: You may be wondering why I am sharing logs for public download when they contain IPs. The answer is that public IPs are not valuable information, and they are considered public info in the USA. Besides, these are the logs of literal bot swarms.)

These logs are not taken from the Nginx layer; they are taken from the server application itself, after the server-side delays. This means that 499s, 504s, and the like aren't counted in the logs. Additionally, 429'd (rate-limited) requests aren't logged. Many bots time out from the delays and many get 429'd, but these logs don't show those requests. The logs would be much larger if they did.

As we can see in the logs, the traffic seems to be made up mostly of 2 major bot swarms. We'll break it into those 2 swarms, plus everyone else:

1. The "usual suspects": extremely common on nearly all websites which don't require JS. They use a massive pool of mobile/residential IPs, mostly in Asian countries such as Indonesia.
These may appear like legitimate traffic to you on your website, but as we can see by them swarming in the tar pit, they are not. Compromised devices in a botnet used for scraping? From what, some sort of popular mobile app(s)? Something else? Hard to say. Anyways, these have some interesting behaviors: on smaller websites, they often pause scraping for a while when actual users access the site (someone uses up a bit of the server's upload bandwidth, causing slightly longer load times for them). Why? You name the reason. Perhaps they are just trying to be respectful /s. They aren't too aggressive, but they can scrape your website 24/7 for literal months. IP rotation is on nearly every request btw.

2. The "47" datacenter. IPs look something like this: 47.79.XXX.XXX. You will find the 6m request file to have mostly "usual suspect" traffic, but the 4m one has a lot of the "47" ones in there. These are extremely aggressive. In fact, they had Data Export going at 4000 RPM (requests per minute, the maximum global rate limit for it), and then kept spamming into the 429s, hitting about 10,000 requests per minute or more. This tar pit is made for these loads, but that would be reaching DDoS levels on many sites. So yeah, not very respectful to websites. Scraping mostly comes in lengthy bursts from these guys.

3. And then there is "everyone else": just the average datacenters, VPNs, and stuff. Not significant compared to the other 2.

The "Data Export" tar pit is my oldest and most famous pit, which these logs are from. It's been around for a while longer, but the actual, good version dropped on January 29th, 2026. As we know, nearly no bots execute JS (JavaScript), so this one stopped requiring it then. Since then I have made many other pits which use ultra high-variation word babblers, but Data Export is still the most popular one (mostly because of its age and contents). As seen with my many other tar pits, it takes about a month or two for them to really start ramping up the traffic.
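If you grab the logs and want a rough head count of the "47" traffic versus everything else, something like the Python sketch below could do it. Heads up on the assumptions: the log line layout isn't documented in this post, so the sketch just takes the first IPv4-looking thing on each line as the client IP, and it treats any address starting with 47. as the "47" group, which is a rough shortcut and not necessarily how the groups above were identified. The names classify and tally are made up for illustration. It also can't tell the "usual suspects" apart from "everyone else"; that would need some outside IP-type dataset.

```python
import re
from collections import Counter

# ASSUMPTION: each log line contains the client IPv4 address somewhere in it.
# The real log layout is not documented here, so adjust this pattern to fit.
IPV4 = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")


def classify(line):
    """Return "47" for a 47.x.x.x client, "other" for any other IPv4,
    or None when the line has nothing that looks like an IPv4 address."""
    match = IPV4.search(line)
    if match is None:
        return None
    # ASSUMPTION: a first octet of 47 is used as a rough stand-in
    # for the "47" datacenter group.
    return "47" if match.group(1) == "47" else "other"


def tally(lines):
    """Count how many lines fall into each group; lines without an IP are skipped."""
    counts = Counter()
    for line in lines:
        group = classify(line)
        if group is not None:
            counts[group] += 1
    return counts
```

To run it over a decompressed log, pass the open file straight to tally, for example: with open(path) as f: print(tally(f)), where path is wherever you saved the file. It reads line by line, so the 1.1GB file shouldn't need to fit in memory.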
Welp, time to talk about tar pit 'SEO' now. So how can your tar pit get discovered by the bot swarms? Well, unsurprisingly, just sharing the link to it on a social media site or forum that doesn't require JS gets the bot swarms onto your tar pit. So yeah, Reddit for example is a good one (because Old Reddit doesn't require JS). I'd imagine they get absolutely hammered by these bots. When sharing your pit link, you can even say that it's a tar pit; the bots aren't smart, they just crawl everything they can find and add seemingly valuable links to their database for further hammering.

Disallow the pit in your robots.txt, because quite often the reason they go there in the first place is that they're disallowed from going there.

So yeah, I won't repeat the same stuff I said in the other blog articles here, but you can read them if you want. Some data about bots and PoW challenges: https://gladeart.com/blog/proof-of-work-challenges-are-actually-very-effective-against-bots-here-is-some-data-showing-it

Oh and btw, I have a really huge tar pit site gaining popularity: a massive code repository filled with fake code (code fetched from The Poison Fountain). Stay tuned!

Thanks for reading and have a nice day!
