{
  "version": "https://jsonfeed.org/version/1",
  "title": "mercury parser on jocmp",
  "icon": "https://avatars.micro.blog/avatars/2026/14/52207.jpg",
  "home_page_url": "https://jocmp.com/",
  "feed_url": "https://jocmp.com/feed.json",
  "items": [
      {
        "id": "http://jocmp.micro.blog/2026/01/19/mercury-parser/",
        "title": "Mercury Parser 3.0.0",
        "content_html": "<p>In 2024 I forked <a href=\"https://github.com/postlight/parser\">Postlight Parser</a> to use in <a href=\"https://capyreader.com/\">Capy Reader</a> and returned the name to <a href=\"https://github.com/jocmp/mercury-parser\">Mercury Parser</a>. I really, really wanted to use the upstream version of the parser to avoid maintaining something off to the side. But nearly two years on, I feel I made the right choice for the scope of the app.</p>\n<p>That&rsquo;s why I&rsquo;m happy to announce <a href=\"https://github.com/jocmp/mercury-parser/releases/tag/v3.0.0\">Mercury Parser version 3.0</a>. My fork of this project follows romantic versioning, or <a href=\"https://github.com/romversioning/romver\">romver</a> meaning that version 3 represents a major overhaul to how the project is developed.</p>\n<p>For starters, the app no longer uses jQuery. The previous version of the app used jQuery for the web-based version of the parser. This was completely removed by <a href=\"https://github.com/jocmp/mercury-parser/pull/88\">upgrading Cheerio</a> to the latest version. Cheerio is responsible for much of the parsing in Mercury Parser and follows a similar interface to jQuery, just without the jQuery.</p>\n<p>Next, I replaced moment.js with <a href=\"https://github.com/iamkun/dayjs\">dayjs</a> for article date handling. This was another necessary shift since moment.js had been deprecated for several years. The shift was mostly one-to-one but there were a few trade-offs. Date patterns like <code>DD</code> are now replaced by <code>D</code> which was something lenient in moment.js. Date boundaries and timezone suffixes were also changed. This required manual code in the <a href=\"https://github.com/jocmp/mercury-parser/blob/4e4ebcab3f4251512e746b4fe7b80a10b0da5dd8/src/cleaners/date-published.js#L22\">date-published</a> cleaner.</p>\n<p>I also migrated from Karma for web tests to <a href=\"https://vitest.dev/\">Vitest</a> with Playwright. 
For now this means that the fixtures are only tested in Node due to the constraint of the Node-only <code>fs</code> import. This is something I may revisit in the future for broader test coverage. Lastly, I <a href=\"https://github.com/jocmp/mercury-parser/pull/86\">migrated</a> the project from yarn v1 to npm. There&rsquo;s safety in defaults, and npm has come a long way with performance since Mercury Parser launched in 2016.</p>\n<p>All in all, these changes have been easy to manage because my fork of Mercury Parser is a hobby project. There are no stakeholders and no mission-critical applications. That, and Claude Code Opus 4.5 is just leagues better than me at catching package conflicts. It&rsquo;s been a fun journey to continue to revitalize this project in my spare time and use it in Capy Reader.</p>\n<hr>\n<p>If you use this fork and have feedback, let me know! If you use the full content extractor in Capy Reader and enjoy its benefits, consider sponsoring my efforts <a href=\"https://github.com/sponsors/jocmp\">on GitHub</a> or <a href=\"https://ko-fi.com/capyreader\">Ko-fi</a>.</p>\n",
        "date_published": "2026-01-19T16:48:11-05:00",
        "url": "https://jocmp.com/2026/01/19/mercury-parser/",
        "tags": ["programming","mercury parser"]
      },
      {
        "id": "http://jocmp.micro.blog/2025/07/12/full-content-extractors-comparing-defuddle/",
        "title": "Full Content Extractors: Comparing Defuddle and Postlight Parser",
        "content_html": "<p>One of the hardest problems with RSS feeds is displaying full content. It&rsquo;s essentially an unsolvable problem given the complexity of webpages and the lack of adherence to a semantic layout for any given blog or news site. It would be nice if each page had a header tag, and an article tag, but it&rsquo;s not that simple. Full content parsers attempt to solve this, but each has their own set of trade-offs.</p>\n<h2 id=\"legacy-contenders\">Legacy Contenders</h2>\n<p>Tools like Mozilla&rsquo;s Readability.js used to solve this in the past, but given <a href=\"https://www.theregister.com/2025/06/17/opinion_column_firefox/\">their recent woes</a> it&rsquo;s hard to trust it as a tool. <a href=\"https://github.com/postlight/parser\">Postlight Parser</a> née <a href=\"https://archive.postlight.com/insights/mercury-goes-open-source\">Mercury Parser</a> was a better option. Instead of trying to solve all webpages with a set of common heuristics, each domain could be overridden with a custom parser. In effect, The Verge could have a Verge-specific parser while Ars Technica could have a slightly different parser.</p>\n<p>It would be simple to stop there, but Postlight Parser was a product of its time. Around 2015, there was a <a href=\"https://www.theverge.com/23711172/google-amp-accelerated-mobile-pages-search-publishers-lawsuit\">push by Google</a> to speed up the web by simplifying webpages with AMP. Postlight was one of many companies that stepped in with their own set of tools like the parser to improve development with AMP. But as time passed, AMP fell out of fashion, and Postlight was <a href=\"https://archive.postlight.com/insights/postlight-joins-launch-by-ntt-data\">acquired by NTT Data</a>.</p>\n<p>Postlight Parser essentially ended with the acquisition. It&rsquo;s still possible to find Postlight Parser in the wild, however. 
Feedbin, a web-based feed reader, uses Postlight Parser to <a href=\"https://feedbin.com/blog/2019/03/11/the-future-of-full-content/\">power its full content mode</a>. Core development has ground to a halt with the last release <a href=\"https://github.com/postlight/parser/releases/tag/v2.2.3\">in 2022</a>.</p>\n<h2 id=\"a-new-entrant\">A New Entrant</h2>\n<p>Readability.js and Postlight Parser may very well represent the past of full content extraction. However, a new project called Defuddle might take their place. <a href=\"https://github.com/kepano/defuddle\">Defuddle</a> was released in early 2025 by the developer behind the note-taking app Obsidian. It takes the Readability.js route of a one-size-fits-all input function with different internal heuristics.</p>\n<p>The following is a brief and non-exhaustive comparison between v2.2.3 of Postlight Parser and v0.6.4 of Defuddle using a small Node.js application (<a href=\"https://github.com/jocmp/parser-comparison\">source code on GitHub</a>). Defuddle seems to work best when the site&rsquo;s markup is already well formatted, which is the case with The Verge. In the <a href=\"https://www.theverge.com/24324299/asus-rog-zephyrus-g16-2024-gaming-laptop-review-amd-strix-point\">following review article</a>, Defuddle picks up more images, headers, and content like the overall review score than Postlight Parser.</p>\n<img src=\"https://cdn.uploads.micro.blog/238475/2025/parser-compare-verge.png\" width=\"600\" height=\"449\" alt=\"\">\n<p>Parsing fails if you throw <a href=\"https://sg.news.yahoo.com/mcdonald-pore-launches-chilli-crab-064000706.html\">an article from Yahoo News Singapore</a> at either Defuddle or Postlight Parser. 
Defuddle has a slight edge in that it at least extracts images and article content, but it still captures garbage text like &ldquo;ADVERTISEMENT.&rdquo;</p>\n<img src=\"https://cdn.uploads.micro.blog/238475/2025/parser-compare-yn-sg.png\" width=\"600\" height=\"449\" alt=\"\">\n<p>In short, better base markup still results in a better outcome. Defuddle is clearly the project to watch given Postlight Parser&rsquo;s lack of updates, and it&rsquo;s backed by a live project with Obsidian. Full content parsers come and go, but the need to tame the chaos of the web is never-ending.</p>\n",
        "date_published": "2025-07-12T18:03:00-05:00",
        "url": "https://jocmp.com/2025/07/12/full-content-extractors-comparing-defuddle/",
        "tags": ["rss","programming","mercury parser"]
      }
  ]
}
