AI, scraping, nepenthes, fail2ban

A quick guide on combining an anti-AI tarpit with automatic blocking

GenAI is a disgusting, festering thing that keeps getting worse by the day. And night. And minute. The scrapers relentlessly ignore rules, copyright, and established practices and standards1 - and that's just the technical perspective, let alone the moral implications2.

People came up with proof-of-work JS solutions to fend the attacks off3, and while that is the proper solution to the problem, I wanted something different.

A little while ago I came across AI "poisoning" tarpits4. While it's a wonderful idea, it turned out that there are many, many, MANY bots that will gladly get lost 5-6 levels deep in the link maze it generates, happily heating my meager server CPU (I use a passively cooled thin client as my server), so I decided to add fail2ban blocking on top.

I do have to note that if you run a web application that depends on JS anyway, you're probably best off setting up Anubis. This solution is a better fit for websites that don't need or use JS. There's a promised no-JS version of Anubis, but it's not there yet.

1: add a rule to your robots.txt to disallow visits to a certain path on your site

User-agent: *
Disallow: /ohwowyoushouldntbehere
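As a quick sanity check, Python's standard-library robotparser can confirm what this rule tells compliant crawlers (a sketch - the rules are fed in directly rather than fetched over HTTP, and the bot name is made up):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse the rules directly instead of fetching robots.txt from a server
rp.parse([
    'User-agent: *',
    'Disallow: /ohwowyoushouldntbehere',
])

# a compliant crawler must not fetch the tarpit path...
print(rp.can_fetch('ExampleBot', '/ohwowyoushouldntbehere'))  # False
# ...but everything else stays allowed
print(rp.can_fetch('ExampleBot', '/index.html'))  # True
```

Of course, the whole point of this exercise is that the bots we care about ignore these rules - which is exactly what makes the path a safe trap for them.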

2: add an invisible link pointing to the disallowed path on your site, ideally on every page

<a href="/ohwowyoushouldntbehere" title="Please do not visit this link, if, somehow, you can see it. It's not meant to be visited." ></a>

3: set up Nepenthes5 on the disallowed path

templates/toplevel.lustache

<!DOCTYPE html>
<html lang="en" >
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
    <p>You shouldn't be on this page. Please leave.</p>
    <article>
        {{> content }}
    </article>
</body>
</html>

templates/list.lustache

<h3>{{ header }}</h3>
{{# content }}
<p>
        {{ content }}
</p>
{{/ content }}
<ul>
{{# links }}
        <li><a href="{{{ prefix }}}/{{{ link }}}">{{ description }}</a></li>
{{/ links }}
</ul>

config.yml

http_host: '127.0.0.1'
http_port: 8893
templates: './templates'
words: '/usr/share/dict/words'

forget_time: 86400
forget_hits: 10

persist_stats: './statsfile.json'
seed_file: './seed.txt'

markov: './corpus.sqlite.db'
markov_min: 200
markov_max: 1200

min_wait: 5
max_wait: 30
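To wire the two together, nginx needs to proxy the disallowed path to Nepenthes on 127.0.0.1:8893, as configured above. A minimal sketch of the location block - the directives beyond proxy_pass are assumptions to suit the tarpit's deliberately slow responses, adapt them to your vhost:

```nginx
location /ohwowyoushouldntbehere {
    proxy_pass http://127.0.0.1:8893;
    # pass the real client address along so the tarpit sees it
    proxy_set_header X-Real-IP $remote_addr;
    # Nepenthes deliberately stalls (min_wait/max_wait above); don't buffer
    proxy_buffering off;
}
```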

The rest of the Nepenthes setup is documented on the project site at https://zadzmo.org/code/nepenthes/

4: block anything that visits it more than X times with fail2ban

Note: I'm using an NCSA-style log format, named ncsa, in nginx:

access_log /var/log/nginx/access.log ncsa;
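For reference, ncsa here is a custom log_format name. The failregex in the next step expects the virtual host first and the client address second, so the format would need to look along these lines (an assumption - match it to your own definition):

```nginx
log_format ncsa '$host $remote_addr - $remote_user [$time_local] '
                '"$request" $status $body_bytes_sent';
```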

filter.d/nginx-nepenthes.conf

[INCLUDES]
before = common.conf

[Definition]
failregex = ^[a-zA-Z\.]+ <HOST> [^\s]+ [^\s]+ \[[^\]]+\] \"[A-Z]+ \/ohwowyoushouldntbehere.*$
datepattern = %%d/%%b/%%Y:%%H:%%M:%%S
journalmatch = _SYSTEMD_UNIT=nginx.service + _COMM=nginx
ignoreregex =
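To sanity-check the pattern before reloading fail2ban, you can replay it against a sample log line. fail2ban substitutes <HOST> with its own IP-capturing group; a plain named group stands in for it below, and the log line is made up to fit the assumed "$host $remote_addr - $remote_user [...]" shape:

```python
import re

# the failregex from above, with fail2ban's <HOST> tag swapped for a
# simple stand-in capturing group (fail2ban's real expansion is stricter)
failregex = (r'^[a-zA-Z\.]+ (?P<host>\S+) [^\s]+ [^\s]+ \[[^\]]+\] '
             r'\"[A-Z]+ \/ohwowyoushouldntbehere.*$')

# made-up sample access log line
line = ('example.com 203.0.113.7 - - [01/Jan/2025:12:00:00 +0000] '
        '"GET /ohwowyoushouldntbehere/maze HTTP/1.1" 200 512')

match = re.match(failregex, line)
print(match.group('host') if match else 'no match')  # prints 203.0.113.7
```

On a live system, `fail2ban-regex /var/log/nginx/access.log /etc/fail2ban/filter.d/nginx-nepenthes.conf` does the same against your real log.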

Relevant lines in jail.local:

[nginx-nepenthes]
enabled = true
port = 80,443
filter = nginx-nepenthes
logpath = /var/log/nginx/access.log
maxretry = 3
bantime = 86400
findtime = 86400

  1. https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/

  2. https://www.zhangjingna.com/blog/2025/3/30/people-are-generating-so-much-ai-csam-that-its-become-increasingly-difficult-for-law-enforcement-to-find-amp-rescue-real-human-child-victims

  3. https://anubis.techaro.lol/

  4. https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/

  5. https://zadzmo.org/code/nepenthes/

(Oh, by the way: this entry was written by Peter Molnar, and originally posted on petermolnar dot net.)