A few weeks ago I sat down in front of my site and realized: it's doing too many things, the code is over 3500 lines of Python, and I feel lost when I look at it. It was an organic growth, and happened somewhat like this:
Let's start simple: collect images, extract EXIF using exiftool1, watermark them if needed, resize them if needed.
Collect markdown files, convert them to HTML with pandoc2 using microformat-friendly
templates. Ah, wait I need categories. I also need pages. And feeds.
Multiple feeds, because I'm not going to choose sides, RSS, Atom, JSON,
hfeed. Let's make all of them! I'll even invent YAMLFeed3 for
the lulz. I need webmentions. Receive them, create comments before
rendering anything, then render, then sync, then send outgoing
webmentions. Oh. Don't send them every single time, just on change. I
need to publish to flickr, but I need to be able to backfill from
brid.gy. Let's handle gone content properly, also redirects nicely.
Let's try JavaScript based search. Or let's not, it needs people to
download the full index every time, let's do PHP from Python templates
instead. Google zombies are doing JSON-LD, academics are doing Linked
Data, let's do all that. Hell, let's make an intermediate representation
of all my content in JSON-LD that is made from the Markdown files before
it hits the HTML templates! In the meantime, why not auto-save my
posts to archive.org? But what if I already did it? Let's find the
earliest version automagically! OK, this is a bit slow now, let's start
using async stuff. Let's syndicate to fediverse via fed.brid.gy. I don't
like my pagination logic, let's do some categories flat: all on one
page; others paginated by year. I want to add something funky for IWC
this year, I'll add a worldmap for photos with location data. I see
federated things are pinging .well-known locations, let's generate data for them.
I'm not certain if this is the whole list of features, but it's quite clear it has outgrown its original purpose. In my defense, some of these features were only meant to be learning experiences.
DRY - don't repeat yourself
I started with the most painful point. The previous iteration had a directory for the content, holding the unprocessed, original files; a nasg directory for the code; and a www directory for the generated output. What I should have done from the start is to have one and only one directory for everything.
There were two main reasons for the original layout. One was to keep my original - quite large - images safe on my own computer, and copy only the resized and potentially watermarked ones online. The other was to keep the code in its own repository, so it can be "Open Sourced". Why the quotes: because I've started to question what Open Source means to me and what it is right now in the world, but this is for another day.
The more I complicated this, the more I realized that all these disconnected pieces were making the originally simple process more and more convoluted. So I made certain decisions.
My generator code is not going to live on Github any more. Instead, it'll be in the root folder of my site content, which will also be the root folder for the website. I'll generate everything in place. I'll move the original images to be hidden files and protect them via webserver rules, like I did in the WordPress times. I'll place the Python virtualenv in this directory as well.
With the move to a single directory structure I also moved away from the weird path system I ended up with: direct URIs for entries and /category/ prefixes for categories. Now everything is always /folder/subfolder/ and so on, as it should have been from the start.
It needed some rewrite magic to have it done properly, but it should all be fine now.
Parsing should be stiff and intolerant
When I saved markdown files by hand, I wasn't paying too much attention to, for example, dates. The Python library I used - arrow - parsed nearly everything. The same sloppiness applied to the comments, even though those were saved by my own code: missing or null authors, bad date formats, etc.
With the refactoring I decided to ditch as many libraries as possible in favour of Python's built-in ones, and datetime wasn't happy.
I fixed all of them; some with scripts, others by hand. Then I swapped to very strict parsing: if stuff is malformed, fail hard. Make me have to fix it.
No workarounds in the code, no clever hundreds of lines of fallbacks; the source should be cleaned if there is an issue.
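As an illustration only, strict parsing of this kind could look like the sketch below. It assumes a single accepted date format (ISO 8601 with a numeric UTC offset) and hypothetical field names `author` and `published`; neither the format nor the function names come from the actual generator.

```python
from datetime import datetime

# Assumption: exactly one accepted date format, ISO 8601 with a UTC offset.
STRICT_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def parse_published(value: str) -> datetime:
    """Parse a date in exactly one format; strptime raises ValueError otherwise."""
    return datetime.strptime(value, STRICT_FORMAT)


def parse_comment(meta: dict) -> dict:
    """Validate comment metadata (hypothetical field names); never guess."""
    author = meta.get("author")
    if not author:
        # Missing or null authors are a source problem, not something to paper over.
        raise ValueError(f"comment has a missing or null author: {meta!r}")
    # A missing 'published' key raises KeyError, which is also a hard failure.
    return {"author": author, "published": parse_published(meta["published"])}
```

With this shape, a value such as "May 22, 2020" stops the build with a ValueError instead of being silently reinterpreted, which forces the fix to happen in the source file.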
Not everything needs templating
In order to have a nice search, I had templated PHP files. Truth is: it's not essential. Search is happy with a few lines of CSS and a "back to petermolnar.net" button.
My fallback 404.php can now rely on looking up files itself.
Previously I had two kinds of helper files: removeduri.del files and redirect files. The first were empty files, with the deleted URIs in their names; the second contained the URL to redirect to. Because of the content and www directory setup, I had to parse these, collect them, and then insert them into the PHP. But now the files are accessible from the PHP itself, meaning it can look them up itself.
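The real fallback is a PHP file; the sketch below only illustrates the lookup logic, written in Python to match the generator. It assumes the marker files sit under the site root and are named after the requested path. The `.url` extension for redirect files is hypothetical, since the text only names the `.del` files, and `resolve_missing` is an invented name.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_missing(root: Path, uri: str) -> Tuple[int, Optional[str]]:
    """Return (status, location) for a URI that has no content behind it."""
    name = uri.strip("/")
    # Refuse empty names and anything that could climb out of the site root.
    if not name or ".." in name.split("/"):
        return 404, None
    # Empty marker file named after the deleted URI: the content is gone.
    if (root / f"{name}.del").is_file():
        return 410, None
    # Hypothetical '.url' file whose content is the redirect target.
    redirect = root / f"{name}.url"
    if redirect.is_file():
        return 301, redirect.read_text(encoding="utf-8").strip()
    return 404, None
```

Because the lookup happens at request time against files in the same directory tree, nothing has to be collected and templated into the handler at build time.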
This way both my 404.php and my search script became self-sufficient: no more Python Jinja2 templates for PHP files.
## Semantic HTML5 is a joke, JSON-LD is a monster, and I have no need for either
Some elements in HTML5 are good, and were much needed. Personally I'm very happy with figure, figcaption, and summary. I find header, footer, and nav a bit useless, but nothing tops the main, article (and probably some other) mess. There's no definitive way of using one or the other, so everyone does whatever makes sense to them4 - which is the opposite of a standard. Try to figure out which definition goes with which element (official definitions from the "living" HTML standard):
The X element represents a generic section of a document or application. The X, in this context, is a thematic grouping of content, typically with a heading.
The Y element represents a complete, or self-contained, composition in a document, page, application, or site and that is, in principle, independently distributable or reusable, e.g. in syndication.
The Z element represents the dominant contents of the document.
So I dropped most of it; especially because I have microformats5 v1 and v2 markup already, and that is an actual standard with obvious guidelines.
Next ripe for reaping was JSON-LD. I got into the semantic web possibilities because I was curious. I learnt a lot, including the fact that I have no need for it.
The enforced vocabulary for JSON-LD, schema.org, is terrible to use. Whenever you need something that's not already present, you're done for, and it'll probably pollute the structured data results, because all the search engines, especially Google, are picky: they limit the options and they require properties. Examples: everything MUST have a photo! And an address! And a publisher! If you don't believe me, try to make a resume with schema.org, then check what the Google Structured Data Testing Tool thinks of it.
No, Google. Not everything has an image - see http://textfiles.com. Like it or not, a website doesn't need an address. The list goes on forever.
I'm going to stop feeding it, stop feeding all of them, stop playing by their weird rules. HTML has link elements, plus the rel attribute, so it can already represent the minimum, which is enough. Plus, again, there's microformats, and Google is still OK with them6.
Note: with structured data, in theory, one could pull in other vocabularies to overcome problems like nonexistent properties in one, but search engines are not real RDF parsers. Unless you're writing for academic publishing tools that will do so, don't bother.
Update from 2020-07-08: it very much seems like Google is sunsetting their microformats support with their incredibly shitty new Rich Results Test, which doesn't even tell you what's wrong7, so I'm putting RDFa back.
Pick your format, and pick just one
Between 2003 and 2007 some tragic mud-throwing (a mirror-translated Hungarian phrase, used just because it's pretty visual) was going on on the web, over something ridiculously small: my XML is better than your XML! 8.
When I first encountered the whole "feed" idea, there was only RSS, and for a very long time I was happy with it. Then I read opinions from people I listen to on how Atom is better. https://fed.brid.gy is Atom only. Much later someone on the internet floated the JSONFeed idea.
When I first saw JSONFeed, I thought it was a joke. It turned out it wasn't, because there are simpletons who honestly believe the world will be better if things are JSON and not XML. It won't; it'll only result in things like JSON-LD.
In the heat of the moment, I coined the thought of YAMLFeed9, strictly as a satire, but for a brief time I actually maintained a YAMLFeed file as well. Do not follow my example.
And then I found myself serving them all. I had a base class in Python with JSONFeed and XMLFeed subclasses; the latter in turn had its own subclasses, RSSFeed among them, it used a separate library to deal with the XML, and so on... in short, I overcomplicated it.
I went back to an RSS 2.0 feed and an h-feed.
Update from 2021-05-22: I settled on Atom after learning a bit more
about the possibilities in it. It can still be made with the
library directly. I still prefer the "RSS" acronym.
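For a sense of how little a single Atom feed needs, here is a minimal sketch. The library used by the actual generator is not named in the text, so this assumes Python's built-in xml.etree.ElementTree instead. The function name and the entry dictionary keys (`url`, `title`, `updated`) are hypothetical, and the output has not been checked against a feed validator.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def _tag(name: str) -> str:
    """Qualify an element name with the Atom namespace."""
    return f"{{{ATOM_NS}}}{name}"


def build_atom(feed_id: str, title: str, updated: str, author: str, entries: list) -> str:
    """Serialize a small Atom feed; 'entries' holds dicts with url, title, updated."""
    # Emit Atom as the default namespace instead of an 'ns0:' prefix.
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(_tag("feed"))
    ET.SubElement(feed, _tag("id")).text = feed_id
    ET.SubElement(feed, _tag("title")).text = title
    ET.SubElement(feed, _tag("updated")).text = updated
    ET.SubElement(ET.SubElement(feed, _tag("author")), _tag("name")).text = author
    for item in entries:
        entry = ET.SubElement(feed, _tag("entry"))
        ET.SubElement(entry, _tag("id")).text = item["url"]
        ET.SubElement(entry, _tag("title")).text = item["title"]
        ET.SubElement(entry, _tag("updated")).text = item["updated"]
        ET.SubElement(entry, _tag("link"), rel="alternate", href=item["url"])
    return ET.tostring(feed, encoding="unicode")
```

One function and no class hierarchy is enough when there is only one format to serve.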
If you have a website in 2020, it's probably a hobby for you as well; don't let anything change that.
It should never become a burden, any part of it. It did for me, and I seriously considered firing up something like Microsoft FrontPage 98 to start from the proverbial scratch, but managed to salvage it before resorting to drastic measures.
Don't follow trends. Once a solution grows deep enough roots - microformats, RSS, etc - it'll be around for a very long time.
Screw SEO. If you're like me, and you write for yourself, and, maybe, for the small web10, don't bother trying to please an ever-changing power play.
If you want to learn something new, be careful not to embed it too deeply, as it may be a fast-fading idea.
(Oh, by the way: this entry was written by Peter Molnar, and originally posted on petermolnar dot net.)