AI, scraping, nepenthes, fail2ban

A quick guide on combining an anti-AI tarpit with automatic blocking

GenAI is a disgusting, festering thing that keeps getting worse by the day. And night. And minute. The scrapers relentlessly ignore rules, copyright, and established practices and standards1 - and that's just the technical perspective, let alone the moral implications2.

People came up with proof-of-work JS solutions to fend the attacks off3, and while that is the proper solution to the problem, I wanted something different.

A little while ago I came across AI "poisoning" tarpits4. While it's a wonderful idea, it turned out that there are many, many, MANY bots that will gladly get lost 5-6 levels deep in the link maze it generates, happily heating my meager server CPU (I use a passively cooled thin client as my server), so I decided to add fail2ban blocking on top.

I do have to note that if you run a web application that depends on JS anyway, you're probably best off setting up Anubis. This solution is a better fit for websites that don't need or use JS. There's a promised no-JS version of Anubis, but it's not there yet.

1: add a rule to your robots.txt to disallow visits to a certain path on your site

User-agent: *
Disallow: /ohwowyoushouldntbehere
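As a quick sanity check, Python's standard-library robotparser can confirm what this rule tells compliant crawlers (a sketch - the rules are fed in directly rather than fetched over HTTP, and the bot name is made up):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse the rules directly instead of fetching robots.txt from a server
rp.parse([
    'User-agent: *',
    'Disallow: /ohwowyoushouldntbehere',
])

# a compliant crawler must not fetch the tarpit path...
print(rp.can_fetch('ExampleBot', '/ohwowyoushouldntbehere'))  # False
# ...but everything else stays allowed
print(rp.can_fetch('ExampleBot', '/index.html'))  # True
```

Of course, the whole point of this exercise is that the bots we care about ignore these rules - which is exactly what makes the path a safe trap for them.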

2: add an invisible link pointing to the disallowed path on your site, ideally on every page

<a href="/ohwowyoushouldntbehere" title="Please do not visit this link, if, somehow, you can see it. It's not meant to be visited." ></a>

3: set up Nepenthes5 on the disallowed path

templates/toplevel.lustache

<!DOCTYPE html>
<html lang="en" >
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
    <p>You shouldn't be on this page. Please leave.</p>
    <article>
        {{> content }}
    </article>
</body>
</html>

templates/list.lustache

<h3>{{ header }}</h3>
{{# content }}
<p>
        {{ content }}
</p>
{{/ content }}
<ul>
{{# links }}
        <li><a href="{{{ prefix }}}/{{{ link }}}">{{ description }}</a></li>
{{/ links }}
</ul>

config.yml

http_host: '127.0.0.1'
http_port: 8893
templates: './templates'
words: '/usr/share/dict/words'

forget_time: 86400
forget_hits: 10

persist_stats: './statsfile.json'
seed_file: './seed.txt'

markov: './corpus.sqlite.db'
markov_min: 200
markov_max: 1200

min_wait: 5
max_wait: 30
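To wire the two together, nginx needs to proxy the disallowed path to Nepenthes on 127.0.0.1:8893, as configured above. A minimal sketch of the location block - the directives beyond proxy_pass are assumptions to suit the tarpit's deliberately slow responses, adapt them to your vhost:

```nginx
location /ohwowyoushouldntbehere {
    proxy_pass http://127.0.0.1:8893;
    # pass the real client address along so the tarpit sees it
    proxy_set_header X-Real-IP $remote_addr;
    # Nepenthes deliberately stalls (min_wait/max_wait above); don't buffer
    proxy_buffering off;
}
```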

The rest of the Nepenthes setup is documented on the project site at https://zadzmo.org/code/nepenthes/

4: block anything that visits it more than X times with fail2ban

Note: I'm using an NCSA-style log format, named ncsa, in nginx:

access_log /var/log/nginx/access.log ncsa;
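For reference, ncsa here is a custom log_format name. The failregex in the next step expects the virtual host first and the client address second, so the format would need to look along these lines (an assumption - match it to your own definition):

```nginx
log_format ncsa '$host $remote_addr - $remote_user [$time_local] '
                '"$request" $status $body_bytes_sent';
```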

filter.d/nginx-nepenthes.conf

[INCLUDES]
before = common.conf

[Definition]
failregex = ^[a-zA-Z\.]+ <HOST> [^\s]+ [^\s]+ \[[^\]]+\] \"[A-Z]+ \/ohwowyoushouldntbehere.*$
datepattern = %%d/%%b/%%Y:%%H:%%M:%%S
journalmatch = _SYSTEMD_UNIT=nginx.service + _COMM=nginx
ignoreregex =
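To sanity-check the pattern before reloading fail2ban, you can replay it against a sample log line. fail2ban substitutes <HOST> with its own IP-capturing group; a plain named group stands in for it below, and the log line is made up to fit the assumed "$host $remote_addr - $remote_user [...]" shape:

```python
import re

# the failregex from above, with fail2ban's <HOST> tag swapped for a
# simple stand-in capturing group (fail2ban's real expansion is stricter)
failregex = (r'^[a-zA-Z\.]+ (?P<host>\S+) [^\s]+ [^\s]+ \[[^\]]+\] '
             r'\"[A-Z]+ \/ohwowyoushouldntbehere.*$')

# made-up sample access log line
line = ('example.com 203.0.113.7 - - [01/Jan/2025:12:00:00 +0000] '
        '"GET /ohwowyoushouldntbehere/maze HTTP/1.1" 200 512')

match = re.match(failregex, line)
print(match.group('host') if match else 'no match')  # prints 203.0.113.7
```

On a live system, `fail2ban-regex /var/log/nginx/access.log /etc/fail2ban/filter.d/nginx-nepenthes.conf` does the same against your real log.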

Relevant lines in jail.local:

[nginx-nepenthes]
enabled = true
port = 80,443
filter = nginx-nepenthes
logpath = /var/log/nginx/access.log
maxretry = 3
bantime = 86400
findtime = 86400

  1. https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/

  2. https://www.zhangjingna.com/blog/2025/3/30/people-are-generating-so-much-ai-csam-that-its-become-increasingly-difficult-for-law-enforcement-to-find-amp-rescue-real-human-child-victims

  3. https://anubis.techaro.lol/

  4. https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/

  5. https://zadzmo.org/code/nepenthes/

(Oh, by the way: this entry was written by Peter Molnar, and originally posted on petermolnar dot net.)