Refactoring my static generator

A few weeks ago I sat down in front of my site and realized: it's doing too many things, the code is over 3500 lines of Python, and I feel lost when I look at it. It was an organic growth, and happened somewhat like this:

Let's start simple: collect images, extract EXIF using exiftool¹, watermark them, if needed, resize them, if needed. Collect markdown files, convert them to HTML with pandoc² using microformat friendly templates. Ah, wait, I need categories. I also need pages. And feeds. Multiple feeds, because I'm not going to choose sides, RSS, Atom, JSON, hfeed. Let's make all of them! I'll even invent YAMLFeed³ for the lulz. I need webmentions. Receive them, create comments before rendering anything, then render, then sync, then send outgoing webmentions. Oh. Don't send them every single time, just on change. I need to publish to flickr, but I need to be able to backfill from brid.gy. Let's handle gone content properly, also redirects nicely. Let's try JavaScript based search. Or let's not, it needs people to download the full index every time, let's do PHP from Python templates instead. Google zombies are doing JSON-LD, academics are doing Linked Data, let's do all that. Hell, let's make an intermediate representation of all my content in JSON-LD that is made from the Markdown files before it hits the HTML templates! In the meanwhile, why not auto-save my posts to archive.org? But what if I already did it? Let's find the earliest version automagically! OK, this is a bit slow now, let's start using async stuff. Let's syndicate to fediverse via fed.brid.gy. I don't like my pagination logic, let's do some categories flat: all on one page; others paginated by year. I want to add something funky for IWC this year, I'll add a worldmap for photos with location data. I see federated things are pinging .well-known locations, let's generate data for them.

I'm not certain if this is the whole list of features, but it's quite clear it has overgrown its original purpose. In my defense, some of these functionalities were only meant to be learning experiences.

DRY - don't repeat yourself

I started with the most painful point. The previous iteration had a source directory for the content, with the unprocessed, original files, a nasg for the code, and a www for the generated output. What I should have done from the start is to have 1 and only 1 directory for everything.

The original layout had two main reasons. One was to keep my original - quite large - images safe on my own computer and copy only the resized and potentially watermarked ones online. The other was to keep the code in its own repository, so it can be "Open Sourced". Why the quotes: because I've started to question what Open Source means to me and what it is right now in the world, but this is for another day.

The more I complicated this the more I realized all these disconnected pieces are making the originally simple process more and more convoluted. So I made certain decisions.

My generator code is not going to live on Github any more. Instead, it'll be in the root folder of my site content, which will also be the root folder for the website. I'll generate everything in place. I'll move the original images to be hidden files and protect them via webserver rules, like I did in the WordPress times. I'll place the Python virtualenv in this directory as well.

With the move to a single directory structure I also moved away from the weird path system I ended up with: direct uris for entries and /category/ prefixes for categories. Now everything always is /folder/subfolder/ etc, as it should have been from the start.

It needed some rewrite magic to have it done properly, but it should all be fine now.

Parsing should be stiff and intolerant

When I saved markdown files by hand, I wasn't paying too much attention to, for example, dates. The Python library I used - arrow - parsed nearly everything. The same sloppiness applied to the comments, even though those were saved by my own code: missing or null authors, bad date formats, etc.

With the refactoring I decided to ditch as many libraries as possible in favour of Python's built-in ones, and datetime suddenly wasn't happy.

I fixed all of them; some with scripts, others by hand. Then I swapped to a very strict parsing: if stuff is malformed, fail hard. Make me have to fix it.

No workarounds in the code, no clever hundreds of lines of fallbacks; the source should be cleaned if there is an issue.
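
As a rough illustration of what "fail hard" can look like with nothing but the standard library: the sketch below accepts exactly one date format and rejects comments without an author. The format string, the function names and the dictionary keys are assumptions, not the generator's actual code.

```python
from datetime import datetime

# Assumption: the single accepted format is ISO 8601 with a UTC offset.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def parse_date(value: str) -> datetime:
    """Parse a string in exactly DATE_FORMAT; any other string raises ValueError."""
    return datetime.strptime(value, DATE_FORMAT)


def parse_comment(meta: dict) -> dict:
    """Validate comment metadata (hypothetical keys: author, date), no fallbacks."""
    author = meta.get("author")
    if not isinstance(author, str) or not author.strip():
        raise ValueError(f"comment has a missing or empty author: {meta!r}")
    # A missing "date" key raises KeyError, a malformed one raises ValueError.
    return {"author": author.strip(), "date": parse_date(meta["date"])}
```

There is deliberately no "try the next format" branch: a file that trips this has to be fixed at the source.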

Not everything needs templating

In order to have a nice search, I had templated PHP files. Truth is: it's not essential. Search is happy with a few lines of CSS and a "back to petermolnar.net" button.

My fallback 404.php can now rely on looking up files itself. Previously I had removeduri.del and some-old-uri.url files. The first were empty files, with the deleted URIs in their names; the second contained the URL to redirect to. Because of the content and www directory setup, I had to parse these, collect them, and then insert them in the PHP. But now the files are accessible from the PHP itself, meaning it can look them up on its own.

This way both my 404.php and my search.php became self-sufficient: no more Python Jinja2 templates for PHP files.
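
The lookup that 404.php now does on its own is simple enough to sketch. The real thing is PHP; the version below is Python only to stay in the generator's language, and the function name, the flat layout of marker files in the site root, and the 410/301 status codes are assumptions for illustration.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_missing(root: Path, uri: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, redirect target or None) for a URI that was not found."""
    name = uri.strip("/")
    # Refuse anything that is not a single, plain path segment.
    if not name or "/" in name or name.startswith("."):
        return 404, None
    # An empty <name>.del file marks deleted content.
    if (root / f"{name}.del").is_file():
        return 410, None
    # A <name>.url file holds the URL to redirect to.
    marker = root / f"{name}.url"
    if marker.is_file():
        return 301, marker.read_text(encoding="utf-8").strip()
    return 404, None
```

With a removeduri.del file in the root, resolve_missing(root, "/removeduri/") reports the content as gone; with a some-old-uri.url file, the same call for "/some-old-uri/" returns the redirect target stored in that file.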

Semantic HTML5 is a joke, JSON-LD is a monster, and I have no need for either

Some elements in HTML5 are good, and were much needed. Personally I'm very happy with figure and figcaption, details and summary, and time.

I find header, footer, and nav a bit useless, but nothing tops the main, section, article (and probably some other) mess. There's no definitive way of using one or the other, so everyone is doing whatever makes sense to them⁴ - which is the opposite of a standard. Try to figure out which definition goes for which (official definitions from the "living" HTML standard):

The X element represents a generic section of a document or application. The X, in this context, is a thematic grouping of content, typically with a heading.

The Y element represents a complete, or self-contained, composition in a document, page, application, or site and that is, in principle, independently distributable or reusable, e.g. in syndication.

The Z element represents the dominant contents of the document.

So I dropped most of it; especially because I have microformats⁵ v1 and v2 markup already, and that is an actual standard with obvious guidelines.

Next ripe for reaping was JSON-LD. I got into the semantic web possibilities because I was curious. I learnt a lot, including the fact that I have no need for it.

The enforced vocabulary for JSON-LD, schema.org, is terrible to use. Whenever you have a need for something that's not present already, you're done for, and it'll probably pollute the structured data results, because all the search engines, especially Google, are picky: they limit the options plus they require properties. Examples: everything MUST have a photo! And an address! And a publisher! If you don't believe me, try to make a resume with schema.org then check the opinion of the Google Structured Data Testing Tool about it.

No, Google. Not everything has an image - see http://textfiles.com. Like it or not, a website doesn't need an address. The list goes on forever.

I'm going to stop feeding it, stop feeding all of them, stop playing by their weird rules. HTML has link and meta elements, plus the rel= attribute, so it can already represent the minimum, which is enough. Plus, again, there's microformats, and Google is still OK with them⁶.

Note: with structured data, in theory, one could pull in other vocabularies to overcome problems like nonexistent properties in one, but search engines are not real RDF parsers. Unless you're writing for academic publishing tools that will do so, don't bother.

Update, 2020-07-08: it very much seems like Google is sunsetting their microformats support with their incredibly shitty new Rich Results Test that doesn't even tell you what's wrong⁷, so I'm putting RDFa back.

Pick your format, and pick just one

Between 2003 and 2007 some tragic mud-throwing (mirror translated Hungarian phrase, just because it's pretty visual) was going on on the web, over something ridiculously small: my XML is better than your XML!⁸

When I first encountered the whole "feed" idea, there was only RSS, and for a very long time, I was happy with it. Then I read opinions of people I listen to on how Atom is better. https://fed.brid.gy is Atom only. Much later someone on the internet popped the JSONFeed thought.

When I first saw JSONFeed, I thought it was a joke. Turned out it's not, because there are simpletons who honestly believe the world will be better if things are JSON and not XML. It won't, it'll only result in things like JSON-LD.

In the heat of the moment, I coined the thought of YAMLFeed⁹, strictly as a satire, but for a brief time I actually maintained a YAMLFeed file as well. Do not follow my example.

And then I found myself serving them all. I had a Category class in Python that had JSONFeed and XMLFeed subclasses, the latter of which had AtomFeed and RSSFeed subclasses, it used FeedParser to deal with it, and so on... in short, I made a monster.

~~I went back to an RSS 2.0 feed and an h-feed.~~ Update from 2021-05-22: I settled on Atom after learning a bit more about the possibilities in it. It can still be made with the lxml library directly. I still prefer the "RSS" acronym though.
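
For a sense of how little is needed for a single Atom feed, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without dependencies; the function name, its parameters and the entry keys are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def _tag(name: str) -> str:
    """Qualify an element name with the Atom namespace."""
    return f"{{{ATOM_NS}}}{name}"


def build_atom(feed_id, title, updated, author, entries):
    """Build a minimal Atom feed as a string.

    `entries` is a list of dicts with hypothetical keys: id, title, updated, url.
    Dates are expected to be RFC 3339 strings already.
    """
    # Serialize Atom as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(_tag("feed"))
    ET.SubElement(feed, _tag("id")).text = feed_id
    ET.SubElement(feed, _tag("title")).text = title
    ET.SubElement(feed, _tag("updated")).text = updated
    ET.SubElement(ET.SubElement(feed, _tag("author")), _tag("name")).text = author
    for item in entries:
        entry = ET.SubElement(feed, _tag("entry"))
        ET.SubElement(entry, _tag("id")).text = item["id"]
        ET.SubElement(entry, _tag("title")).text = item["title"]
        ET.SubElement(entry, _tag("updated")).text = item["updated"]
        ET.SubElement(entry, _tag("link"), href=item["url"])
    return ET.tostring(feed, encoding="unicode")
```

One function and one format, with no class hierarchy in sight.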

Closure

If you have a website in 2020, it's probably a hobby for you as well; don't let anything change that.

It should never become a burden, any part of it. It did for me, and I seriously considered firing up something like Microsoft FrontPage 98 to start from the proverbial scratch, but managed to salvage it before resorting to drastic measures.

Don't follow trends. Once a solution grows deep enough roots - microformats, RSS, etc - it'll be around for a very long time.

Screw SEO. If you're like me, and you write for yourself, and, maybe, for the small web¹⁰, don't bother trying to please an ever-changing power play.

If you want to learn something new, be careful not to embed it too deep as it may be a fast fading idea.

(Oh, by the way: this entry was written by Peter Molnar, and originally posted on petermolnar dot net.)