Inspiration

Earlier this week I stumbled upon a post on Medium that discussed quotation mark usage in web content. The main point of the article was that the ASCII characters for single quotes (') and double quotes (") are often not typographically correct. When using quotation marks or apostrophes in text, it’s almost always more appropriate to use “curly quotes”; notice the distinction between the opening and closing quote characters. Similarly, an apostrophe should be a curly closing single quote rather than a straight quote. One notable exception mentioned in the Medium article is the use of quotes to represent feet and inches, e.g. 5'11". For that particular case the straight quotes (standing in for prime marks) are the more correct choice.

Word processors like Microsoft Word automatically handle quote pairs based on the user’s input; however, this functionality is unavailable or impractical when editing raw text files like HTML. To properly use curly quotes in an HTML document, the quotation mark characters need to be replaced with the equivalent HTML entities, such as &ldquo; and &rdquo;. Medium solved the problem by adding functionality to their blog post editor that detects which style is most appropriate in a given context. The editor performs the character replacement on its own, in much the same way that a word processor would.
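To make the idea of context-based replacement concrete, here is a minimal Python sketch. It is not Medium's algorithm; the function name curl_quotes and the three rules it applies (keep a quote after a digit, open after whitespace or an opening bracket, close otherwise) are assumptions chosen for illustration.

```python
def curl_quotes(text):
    """Replace straight quotes with HTML entities for curly quotes."""
    out = []
    for i, ch in enumerate(text):
        # Treat the start of the text as if it followed a space.
        prev = text[i - 1] if i > 0 else " "
        if ch not in "'\"":
            out.append(ch)
        elif prev.isdigit():
            # Likely a measurement such as 5'11": keep the straight quote.
            out.append(ch)
        elif prev.isspace() or prev in "([{-":
            # After whitespace or an opening bracket: opening quote.
            out.append("&lsquo;" if ch == "'" else "&ldquo;")
        else:
            # Anything else is treated as a closing quote or an apostrophe.
            out.append("&rsquo;" if ch == "'" else "&rdquo;")
    return "".join(out)
```

These rules are deliberately naive. A quoted phrase that ends in a digit, an abbreviated year such as '99, or directly nested quotes would all be handled incorrectly, so a real implementation would need more context than a single preceding character.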

After doing some additional research I found a similar article covering a wider range of typographical issues on the web, including the incorrect usage of hyphens and dashes. This wasn’t an area that I’d given a lot of thought to in the past, and after reviewing the content on my website I realized my blog posts exhibited many of the errors described in the articles. I manually changed the characters in a few posts, and the visual difference was significant enough that it made sense to find a scalable solution to the problem.

The approach used by Medium wasn’t really an option for me since I don’t have a blog post editor; until recently, I wrote most of my posts in Vim. I ended up defining my own markup language to write blog posts with, which is then converted to HTML by a simple program. In addition to performing quote and dash replacement, the markup language also has utility functions for generating blocks of HTML such as section headers and code snippets. Modifying Markdown to do the quote replacement would have also worked, but it was more interesting to write my own parser from scratch.

Markup Language

The language has two primary components — text and functions. During the first parser pass anything that begins with an @ symbol is extracted as a function, while everything else is treated as plain text. Text blocks can use a number of custom inline syntax markers, which are processed alongside the quotation replacement.
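The first pass can be pictured with a short Python sketch. This is an illustration only, not the code from the C++ parser: the token layout, the name first_pass, and the assumption that a call never contains a closing parenthesis inside its parameters are all hypothetical.

```python
import re

# Assumed call shape: @name( ... ) with no ")" inside the parameters.
CALL = re.compile(r"@(\w+)\(([^)]*)\)")

def first_pass(source):
    """Split source into ("text", str) and ("function", name, raw_args) tokens."""
    tokens = []
    pos = 0
    for match in CALL.finditer(source):
        if match.start() > pos:
            # Everything between two calls is plain text.
            tokens.append(("text", source[pos:match.start()]))
        tokens.append(("function", match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(source):
        tokens.append(("text", source[pos:]))
    return tokens
```

The text tokens would then go on to marker parsing and quote replacement, while the function tokens are handled separately.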

The syntax is inspired by Markdown and the Slack chat syntax. Text can be made bold or italic, or displayed as inline code, using *...*, _..._ and `...` respectively. The parser supports nesting of the syntax markers as well, such as *`bold inline code`*.
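A rough Python sketch of marker conversion follows. The output tags (b, i, code) and the regex-based approach are assumptions; the actual C++ parser may emit different HTML and almost certainly tokenizes the text more carefully.

```python
import re

# Each marker pairs a pattern with its HTML replacement (tags are assumed).
MARKERS = [
    (re.compile(r"\*(.+?)\*"), r"<b>\1</b>"),
    (re.compile(r"_(.+?)_"), r"<i>\1</i>"),
    (re.compile(r"`(.+?)`"), r"<code>\1</code>"),
]

def apply_markers(text):
    """Convert *bold*, _italic_ and `code` markers to HTML tags."""
    # Applying the patterns one after another lets an outer marker wrap
    # an inner one, which is what makes simple nesting work.
    for pattern, replacement in MARKERS:
        text = pattern.sub(replacement, text)
    return text
```

One known weakness of this shortcut is that underscores inside inline code, such as a snake_case identifier, would be read as italic markers.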

Links can be embedded in a block of text using the [label]{link, target} syntax. The ‘target’ value is an optional parameter for specifying the link target, e.g. ‘_blank’ for opening the link in a new tab/window. As a convenience, the link syntax isn’t parsed when used inside of a code block.
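The link syntax could be converted along the following lines. This Python sketch is an assumption-laden illustration: the names convert_links and link_to_html are made up, "code block" is interpreted as a backtick-delimited inline span, and URLs containing a comma or closing brace are not supported.

```python
import re

# [label]{link} or [label]{link, target}
LINK = re.compile(r"\[([^\]]+)\]\{([^,}]+)(?:,\s*([^}]+))?\}")

def link_to_html(match):
    label, url, target = match.group(1), match.group(2).strip(), match.group(3)
    if target:
        return f'<a href="{url}" target="{target.strip()}">{label}</a>'
    return f'<a href="{url}">{label}</a>'

def convert_links(text):
    """Convert link syntax to anchor tags, skipping inline code spans."""
    # The capturing group keeps the code spans in the result at odd
    # indexes, where they are left untouched.
    parts = re.split(r"(`[^`]*`)", text)
    for i in range(0, len(parts), 2):
        parts[i] = LINK.sub(link_to_html, parts[i])
    return "".join(parts)
```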

The escape character \ can be added before a special character to skip marker parsing; the character will be treated normally and excluded from the replacer step.
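One way to get this behavior is to hide escaped characters before marker parsing and put them back afterwards. The Python sketch below takes that approach as an assumption, using private-use Unicode code points as placeholders; for brevity it treats any character after a backslash as escaped and only implements the bold marker.

```python
import re

def protect_escapes(text):
    """Swap each backslash-escaped character for a private-use placeholder."""
    saved = []

    def stash(match):
        saved.append(match.group(1))
        return chr(0xE000 + len(saved) - 1)

    return re.sub(r"\\(.)", stash, text), saved

def restore_escapes(text, saved):
    """Put the original characters back, without their backslashes."""
    for index, char in enumerate(saved):
        text = text.replace(chr(0xE000 + index), char)
    return text

def bold(text):
    # Stand-in for the full marker and replacer steps.
    return re.sub(r"\*(.+?)\*", r"<b>\1</b>", text)

def render(text):
    protected, saved = protect_escapes(text)
    return restore_escapes(bold(protected), saved)
```

Because the escaped characters are absent while the markers are processed, they can neither open nor close a marker, and they reappear unchanged in the output.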

Functions

Functions are currently defined in the C++ parser. At the moment their only use is for emitting predetermined blocks of HTML code with specific values filled in. For example, the header function takes two parameters: the first is the visible header text to display, and the second is the value for the HTML id attribute, used for linking. The second parameter is optional. The syntax for using a function is @function('param0', 'param1', ...).
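As an illustration of how a call might be turned into HTML, here is a small Python sketch. It does not reproduce the C++ implementation: the h2 tag, the FUNCTIONS table, and the helper names parse_args and call are all hypothetical, and parameters containing an apostrophe are not handled.

```python
import re

def parse_args(raw):
    """Extract the single-quoted parameters from a raw argument string."""
    return re.findall(r"'([^']*)'", raw)

def header(text, anchor=None):
    # The <h2> tag is a guess; the post only says that the second
    # parameter fills in the id used for linking.
    if anchor is None:
        return f"<h2>{text}</h2>"
    return f'<h2 id="{anchor}">{text}</h2>'

# Table mapping function names to the code that emits their HTML.
FUNCTIONS = {"header": header}

def call(name, raw_args):
    """Look up a function by name and invoke it with the parsed parameters."""
    return FUNCTIONS[name](*parse_args(raw_args))
```

With this sketch, @header('Markup Language', 'markup') would correspond to call("header", "'Markup Language', 'markup'"), and leaving out the second parameter produces a header with no id.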

Code and Extensions

I’ve published the TML parser on GitHub so others can use or modify it. As discussed in the readme in the repo, I’ll continue to update the repo as I make changes. I’m not sure if I’ll have time to review and merge pull requests, but I’ll try to look at them when I can.

The code for functions and syntax markers is quite modular and configurable. Rather than hard-coding the syntax markers in C++, I’d eventually like to specify them at runtime with a language file. This would make it easier to extend TML for other needs outside of my blog. The same applies to functions: it wouldn’t be too much work to load function definitions from a file instead. Both of these items are on my TODO list.
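To give a feel for what a runtime language file might look like, here is a speculative Python sketch. The JSON layout and the field names open, close and tag are invented for this example; no such file format exists in TML yet.

```python
import json
import re

# Hypothetical language file contents, inlined here as a string.
LANGUAGE_FILE = """
{
  "markers": [
    {"open": "*", "close": "*", "tag": "b"},
    {"open": "_", "close": "_", "tag": "i"}
  ]
}
"""

def load_markers(definition):
    """Build (pattern, tag) rules from a JSON marker definition."""
    rules = []
    for marker in json.loads(definition)["markers"]:
        pattern = re.compile(
            re.escape(marker["open"]) + r"(.+?)" + re.escape(marker["close"])
        )
        rules.append((pattern, marker["tag"]))
    return rules

def apply_markers(text, rules):
    """Wrap each marked span in the HTML tag named by its rule."""
    for pattern, tag in rules:
        text = pattern.sub(
            lambda m, tag=tag: f"<{tag}>{m.group(1)}</{tag}>", text
        )
    return text
```

Adding a new marker would then mean editing the file rather than recompiling the parser.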