Content, bloat, privacy, archives

I spent a lot of time trying to centralise my online activities, including adding bookmarks and imports from social networks. Lately my site looked bloated and unmaintainable. I started questioning what data is my data, what data I should or could own - it was time to rethink some ideas.


Have you ever reached the point when you started questioning why you're doing something? I have, but never before with my website.

What is my purpose? The unfortunate, sentient robot Rick created for the sole purpose of passing the butter.

The precursor to petermolnar.net started existing for a very simple reason: I wanted an online home and I wanted to put "interesting" things on it. It was in 1999, before chronological ordering took over the internet.1 Soon it got a blog-ish stream, then a portfolio for my photos, later tech howtos and long journal entries, but one thing was consistent for a very long time: the majority of the content was made by me.

After encountering the indieweb movement2 I started developing the idea of centralising one's self. I wrote about it not once3 but twice4, but going through with importing bookmarks and favourites had an unexpected outcome: they heavily outweighed my original content.

Do you know what happens when your own website doesn't have your own content? It starts feeling distant and unfamiliar. When you get here, you either leave the whole thing behind or reboot it somehow. I couldn't imagine not having a website, so I rebooted.

I kept long journal entries; notes, for replies to other websites and for short entries; photos; and tech articles - the rest needs to continue its life either archived privately or forgotten for good.

Outsourcing bookmarks

The indieweb wiki entry on bookmark says5:

Why should you post bookmark posts? Good question. People seem to have reasons for doing so. (please feel free to replace this rhetorical question with actual reasoning)

Since that didn't help, I stepped back one step further: why do I bookmark?

Usually it's because I found the pages interesting and/or useful. What I ended up having was a date of bookmarking, a title, a URL, and some badly applied tags. In this form, bookmarks on my site were completely useless: I had neither the content that made them interesting nor a way to search them properly.

To solve the first problem, the missing content, my initial idea was to leave everything in place and pull an extract of the content to have something to search in. It didn't go well. There's a plethora of js;dr6 sites these days, which no longer offer working, plain HTML output without executing JavaScript. For archival purposes, archive.org introduced an arcane file format, WARC7: it saves everything about the site, but there is no way to simply open it for viewing. Saving pages with crawlers, including media files, generated a silly amount of data on my system and soon became unsustainable.
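A minimal sketch of what "pulling an extract" could look like, assuming Python and only its standard library. The names TextExtractor and extract_text are made up for illustration; this is not the crawler described above. It also shows why js;dr pages are a dead end: without running their JavaScript there is no text to pull.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores script, style and noscript content."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text(html):
    """Return the searchable plain text of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# A plain HTML page yields something to search in ...
print(extract_text("<body><h1>Title</h1><p>Hello <b>world</b></p></body>"))
# ... while a JavaScript-only page yields an empty string.
print(repr(extract_text('<body><div id="app"></div><script>render()</script></body>')))
```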

Soon I realised I was trying to solve a problem others had worked on for years, if not decades, so I decided to look into existing bookmark managers. I tried two paid services first, Pinboard8 and Pocket9. Pocket would be unbeatable, even though it's not self-hosted, if the article extracts they make were available through their API. They are not. Unfortunately Pinboard wasn't giving me much over my existing crawler solutions.

The winner was Wallabag10: it's self-hosted, which is great, painful to install and set up, which is not, but it's completely self-sustaining, runs on SQLite and is good enough for me.

There was only one problem: none of these offered archival copies of images, and some of the bookmarks I made were solely for the photos on the sites. I found a format, called MHTML11, also known as .eml, which is perfect for single-file archives of HTML pages: it inlines all images as base64 encoded data.
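To illustrate the idea of a single-file archive with inlined images, here is a rough sketch using Python's standard email package. build_mhtml and the example URLs are hypothetical, and the output is a generic multipart/related message in the spirit of MHTML; it is not guaranteed to match what Chrome writes when saving a page.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url, html, images):
    """Bundle a page and its images into one MIME document.

    images maps an image URL to a (raw_bytes, subtype) pair, e.g. "png".
    """
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = page_url

    # The HTML itself is the first part; Content-Location records its URL.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Each image becomes its own part; MIMEImage base64-encodes the bytes.
    for url, (data, subtype) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = url
        archive.attach(part)

    return archive.as_string()


# Placeholder bytes stand in for a real image file.
archive_text = build_mhtml(
    "https://example.com/",
    "<html><body><img src='https://example.com/photo.png'></body></html>",
    {"https://example.com/photo.png": (b"hello", "png")},
)
print(archive_text)
```

The Content-Location header of each part is what lets a viewer match the image references in the HTML to the embedded copies.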

However, no browser offers a save-as-mhtml in headless mode, so to get your archives, you'll need to revisit your bookmarks. All of them. I enabled12 save as MHTML in Chrome (Firefox doesn't know this format), installed the Wayback Machine13 extension and saved GBs of websites. I also added them into Wallabag. It's an interesting, though very long journey, but you'll rediscover a lot of things for sure.

When this was done, I dropped thousands of bookmark entries from my site.

If I do want to share a site, I'll write a note about it, but bookmarks, without context, belong to my archives.

(Some) microblog imports should never have happened

I had iterations of imports, so after bookmarks it seemed reasonable to check what else may simply be noise on my site.

Back in the day people mostly wrote much lengthier entries: journal-like diary pages and thoughts, and they were nearly always anonymous. It all happened under pseudonyms.

Parallel to this there were the oldschool instant messengers, like ICQ and MSN Messenger. In many cases, though you all had handles, or numbers, or usernames, you knew exactly who you were talking to. Most of these programs had a feature called status message - looking back at it they may have been precursors to microblogging, but there was a huge difference: they were ephemeral.

With the rise of Twitter and Facebook status messages also came (forced?) real identities, and tools letting us post from anywhere, within seconds. The content that earlier landed in status messages - XY is listening to..., Feels like..., etc. - suddenly became readable at any time, sometimes to anyone.

I had content like this and I am, as well, guilty of posting short, meaningless, out-of-context entries. Imported burps of private life; useless shares of music pointing to long dead links; one-liner jokes, linking to bash.org; tiny replies and notes that should have been sent privately, either via email or some other mechanism.

Some things are meant to be ephemeral, no matter how loud the librarian is screaming deep inside me. Others belong in logs, and probably not on the public internet.

I deleted most of them and set up an HTTP 410 Gone response for their URLs.
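The difference between 410 and the usual 404 is that 410 says the entry was removed on purpose and will not come back. A tiny sketch of that decision, assuming Python; status_for and the example paths are invented for illustration, and the site's real server configuration is not shown in this post.

```python
from http import HTTPStatus


def status_for(path, published, gone):
    """Pick the HTTP status for a requested path.

    published: set of paths that still exist
    gone: set of paths that were deleted deliberately
    """
    if path in published:
        return HTTPStatus.OK
    if path in gone:
        # Removed on purpose: tell clients and crawlers to stop asking.
        return HTTPStatus.GONE
    # Never existed, or unknown: the ordinary "not found".
    return HTTPStatus.NOT_FOUND


# Hypothetical paths, only for demonstration.
published = {"/journal/some-entry/"}
gone = {"/note/imported-status-update/"}

for path in ("/journal/some-entry/", "/note/imported-status-update/", "/typo/"):
    status = status_for(path, published, gone)
    print(path, status.value, status.phrase)
```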

Reposts are messy

For a few months I've been silently populating a category that I didn't promote openly: favorites. At that page, I basically had a lot of reposts: images and galleries, with complete content, but with big fat URLs over them, linking to the original content.

By using a silo you usually give the silo permission to use your work there. Due to the effects of votes and likes (see later) you do, in fact, boost the visibility of the artist. Note that these permissions are usually much broader than you imagine: a lawyer reworded Instagram's policy so that everyone can understand that by using the service, you allow them to do more or less anything they want to with your work14.

But what if you take content out of a silo? The majority of images and works are not licensed in any special way, meaning you need to assume full copyright protection. Copyright prohibits publishing works without the author's explicit consent, so when you repost something that doesn't indicate it's OK with it - Creative Commons, Public Domain, etc. - what you do is illegal.

Also: for me, it feels like reposts, without notifying the creator, even though the licence allows it, are somewhat unfair - which is exactly what I was doing with these. Webmentions15 would like to address this by having an option to send notifications and delete requests, but silos are not there yet to send or to receive any of these.

There is a very simple solution: avoid reposting anything unless you are sure its licence allows it. Save it in a private, offline copy, if you really want to. Cweiske had a nice idea about adding source URLs into JPG XMP metadata16, so you know where it's from.

Silo reactions only make sense within the silo

When I started writing this entry, I differentiated three non-comment reaction types in silos:

A reaction is a social interaction, essentially a templated comment. "Well done", "I disagree", "buu", "acknowledged", ❤, 👍, ★, and so on. I asked my wife what she thinks about likes, why she uses them, and I got an unexpected answer: because, unlike with regular text comments, others will not be able to react to it - so no trolling or abuse is possible.

A vote has a direct effect on ranking: think reddit up- and downvotes. Ideally it's anonymous: the list of voters should not be displayed, not even to the owner of the entry.

A bookmark is solely for one's self: save this entry because I value it and I want to be able to find it again. They should have no social implications or boosting effect at all.

In many of the silos these are mixed - a Twitter fav used to range from an appreciation to a sarcastic meh17. With a range of reactions available this may get simpler to differentiate, but a like in Facebook still counts as both a vote and a reaction.

I thought a lot about reactions and I came to the conclusion that I should not have them on my site. The main problem is that they will be linking into a walled garden, without context, maybe pointing at a private(ish) post, available to a limited audience. If the content is that good, bookmark it as well. If it's a reaction for the sake of being social, it's ephemeral.

Conclusions

Don't let your ideas take over the things you enjoy. Some ideas can be beneficial, others are passing experiments.

There's a lot of data worth collecting: scrobbles, location data, etc., but these are logs, and most of them, in my opinion, should be private. If I'm getting paranoid about how much services know about me, I shouldn't publish the same information publicly either.

And finally: keep things simple. I'm finding myself throwing out my filter coffee machine and replacing it with a pot that has a paper filter slot - it makes even better coffee and I have to care about one less electrical thing. The same should apply to my web presence: simpler is usually better.


  1. https://stackingthebricks.com/how-blogs-broke-the-web/

  2. https://indieweb.org/

  3. https://petermolnar.net/indieweb-decentralize-web-centralizing-ourselves/

  4. https://petermolnar.net/personal-website-as-archiving-vault/

  5. https://indieweb.org/bookmark

  6. http://tantek.com/2015/069/t1/js-dr-javascript-required-dead

  7. http://www.archiveteam.org/index.php?title=Wget_with_WARC_output

  8. http://pinboard.in/

  9. http://getpocket.com/

  10. https://wallabag.org/en

  11. https://en.wikipedia.org/wiki/MHTML

  12. https://superuser.com/a/445988

  13. https://chrome.google.com/webstore/detail/waybackmachine/gofnhkhaadkoabedkchceagnjjicaihi

  14. https://qz.com/878790/a-lawyer-rewrote-instagrams-terms-of-service-for-kids-now-you-can-understand-all-of-the-private-data-you-and-your-teen-are-giving-up-to-social-media/

  15. https://webmention.net/draft/#sending-webmentions-for-deleted-posts

  16. http://cweiske.de/tagebuch/exif-url.htm

  17. http://time.com/4336/a-simple-guide-to-twitter-favs/
