GenAI is a disgusting, festering thing that keeps getting worse by the day. And night. And minute. Its crawlers relentlessly ignore rules, copyright, established practices, and standards[1] - and that's just the technical perspective, let alone the moral one[2].
People came up with proof-of-work JS solutions to fend the attacks off[3], and while that's the proper solution to the problem, I wanted something different.
A little while ago I came across AI "poisoning" tarpits[4]. While it's a wonderful idea, it turned out that there are many, many, MANY bots that would gladly get lost 5-6 levels deep in the link maze it generates, happily heating my meager server CPU (I use a passively cooled thin client as my server), so I decided to add fail2ban blocking on top.
I do have to note that if you run a web application that is JS-dependent anyway, you're probably best off setting up Anubis. This solution is a better fit for websites that don't need or use JS. There's a promised no-JS version of Anubis, but it's not there yet.
1: add a rule to your robots.txt disallowing visits to a certain URL path on your site; well-behaved crawlers will honor it, so only bots that ignore the rules will ever follow the trap link added in the next step
User-agent: *
Disallow: /ohwowyoushouldntbehere
2: add an invisible link pointing to the disallowed path on your site, ideally on every page
<a href="/ohwowyoushouldntbehere" title="Please do not visit this link, if, somehow, you can see it. It's not meant to be visited."></a>
3: set up Nepenthes[5] on the disallowed path
templates/toplevel.lustache
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
<p>You shouldn't be on this page. Please leave.</p>
<article>
{{> content }}
</article>
</body>
</html>
templates/list.lustache
<h3>{{ header }}</h3>
{{# content }}
<p>
{{ content }}
</p>
{{/ content }}
<ul>
{{# links }}
<li><a href="{{{ prefix }}}/{{{ link }}}">{{ description }}</a></li>
{{/ links }}
</ul>
config.yml
http_host: '127.0.0.1'
http_port: 8893
templates: './templates'
words: '/usr/share/dict/words'
forget_time: 86400
forget_hits: 10
persist_stats: './statsfile.json'
seed_file: './seed.txt'
markov: './corpus.sqlite.db'
markov_min: 200
markov_max: 1200
min_wait: 5
max_wait: 30
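As far as I understand the Nepenthes documentation, the min_wait/max_wait pair is what puts the tar in the tarpit: each response is deliberately delayed by a random number of seconds in that range, so a trapped crawler's connections stay tied up for as long as possible.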
The rest of the Nepenthes setup is covered on the project site at https://zadzmo.org/code/nepenthes/
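The glue that's easy to miss is pointing nginx at the tarpit. Here's a minimal reverse-proxy sketch, assuming the http_host/http_port values from config.yml above and that the trap path is passed through as-is; the timeout and header choices are my assumptions, so check the project docs for the authoritative instructions:
# in the site's server { } block: hand the trap path over to Nepenthes
location /ohwowyoushouldntbehere {
    proxy_pass http://127.0.0.1:8893;
    # forward the real client address; assumption: Nepenthes reads the usual headers
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    # the tarpit drip-feeds its responses on purpose, so be generous with timeouts
    proxy_read_timeout 300s;
    # and let the bytes trickle out to the bot unbuffered
    proxy_buffering off;
}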
4: block anything that visits the path more than X times with fail2ban
Note: I'm using the ncsa log format in nginx:
access_log /var/log/nginx/access.log ncsa;
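nginx has no built-in log format named ncsa, so it has to be declared in the http block. A definition compatible with the failregex below could look like this (a sketch; the essential part is that the first field is the hostname and the second is the client IP, which the filter captures as <HOST>):
# http { } block; field order must match the failregex in the fail2ban filter
log_format ncsa '$host $remote_addr - $remote_user [$time_local] '
                '"$request" $status $body_bytes_sent';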
filter.d/nginx-nepenthes.conf
before = common.conf
[Definition]
failregex = ^[a-zA-Z\.]+ <HOST> [^\s]+ [^\s]+ \[[^\]]+\] \"[A-Z]+ \/ohwowyoushouldntbehere.*$
datepattern = %%d/%%b/%%Y:%%H:%%M:%%S
journalmatch = _SYSTEMD_UNIT=nginx.service + _COMM=nginx
ignoreregex =
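It's worth dry-running the filter against the live log before enabling the jail; fail2ban ships a tester for exactly that, which prints how many log lines the failregex matched:
fail2ban-regex /var/log/nginx/access.log /etc/fail2ban/filter.d/nginx-nepenthes.conf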
Relevant lines in jail.local:
[nginx-nepenthes]
enabled = true
port = 80,443
filter = nginx-nepenthes
logpath = /var/log/nginx/access.log
maxretry = 3
bantime = 86400
findtime = 86400
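Once the jail is in place, reload fail2ban and watch the bans pile up with the stock client; status lists the currently banned addresses and the hit counters for the jail:
fail2ban-client reload
fail2ban-client status nginx-nepenthes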
[1]: https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
[2]: https://www.zhangjingna.com/blog/2025/3/30/people-are-generating-so-much-ai-csam-that-its-become-increasingly-difficult-for-law-enforcement-to-find-amp-rescue-real-human-child-victims
[4]: https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/
[5]: https://zadzmo.org/code/nepenthes/
(Oh, by the way: this entry was written by Peter Molnar, and originally posted on petermolnar dot net.)