Adding syntax highlighting

studen

Oh, boy! Adding syntax highlighting to the blog has been a ride! I've spent almost two entire days tinkering with libraries and implementing the logic to get it done. Here is a summary of what I've been up to.

Highlighting with Rust

This whole project is a learning experience in Rust (and Yew in particular) [1], so I wanted a library that implements syntax highlighting in Rust. Also, I guessed it would be simpler to use a Rust library inside Yew than to interoperate with a JavaScript library.

The contenders

I embarked on my quest for the right library for the job and started Ecosiaing [2]:

  • The first result from my search was highlight.js. It seems to be very popular and to have everything I would want for my project, except that it's not made with Rust. So I continued.

  • Then I found syntect. It looked nice, with projects like bat (a cat clone made with Rust) using it for syntax highlighting. However, the documentation was not very straightforward to me at that moment: How do I add new language syntaxes? How do I add the style? I don't want to use any prebuilt styles; I know the one I want to use [3], so how do I import it?

  • After some time going through syntect's documentation, code and examples, I did another search and found inkjet. Its API looked simpler, and it had the language syntaxes built in, each one optional via Cargo features. So I decided to go with it.

The pain

inkjet uses tree-sitter, which makes heavy use of linked C functions. When I tried to build my project, I got an error saying something like:

#include <stdio.h>
          |
          no such file or directory

Well, it makes sense, I'm building for wasm32-unknown-unknown, but there has to be a way, right?

I looked into the tree-sitter bindings for Rust that inkjet was using, and I found that there are some non-default features that weren't being included in inkjet, in particular a "wasm" feature that seemed to be the solution to my problem.

I had to try, so I cloned the inkjet repo and modified the Cargo.toml to include the needed features from tree-sitter. The previous error disappeared, but, as usual, it was replaced with another when building another crate down the dependency tree.

After some time searching for information about the error I was facing, I read a message in tree-sitter's Discord server saying that some dependencies of tree-sitter can't build to wasm32-unknown-unknown. The message pointed to a fork called tree-sitter-c2rust, which converts the C dependencies of tree-sitter to Rust using c2rust so that it can build to other targets, wasm32-unknown-unknown in particular.

tree-sitter-c2rust's documentation didn't help a lot. I guessed that I needed to run a script added in the commit by which the fork is ahead, since it seems to do the transpilation. But I wasn't able to install c2rust: it would fail with a linker error, something about -lclang{Stuff} not being a file or directory. At this point, it all felt like a joke.

Back to syntect

I turned back to syntect; by then it was much clearer how to use it. Part of my difficulty understanding it at first was because of my lack of familiarity with the whole parsing/theming topic.

How highlighting works

As I understand it, there are roughly two steps involved in the syntax highlighting process:

  1. First, we need to identify the tokens inside the code, and classify them. For example, in the code

    int my_var = 1;
    

    we can say that int is a type, = is an operator, 1 is a constant and ; is punctuation.

    The specific classification depends on the language and the syntax definition. syntect uses Sublime Text's syntax definitions, which are basically YAML files that declare rules with scopes and regexes to classify the tokens in the code.

  2. After the code is classified, we apply the theme: a set of rules on what styles to apply to the tokens based on their classification, much like a CSS style sheet.
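To make the two steps concrete, here is a minimal sketch in Python. It is not how syntect works internally: the rules, the scope names and the colors are all made up for illustration, and a real syntax definition is far richer than five regexes.

```python
import re

# Hypothetical classification rules: (scope name, regex).
# Order matters: earlier rules win when several could match.
RULES = [
    ("type", r"\b(?:int|float|char)\b"),
    ("constant", r"\b\d+\b"),
    ("variable", r"\b[A-Za-z_]\w*\b"),
    ("operator", r"[=+\-*/]"),
    ("punctuation", r";"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in RULES))

# Hypothetical theme: scope name -> color (placeholder values).
THEME = {
    "type": "#aa8800",
    "constant": "#aa00aa",
    "variable": "#dddddd",
    "operator": "#ff6600",
    "punctuation": "#888888",
}


def classify(code):
    """Step 1: split the code into (scope, text) tokens."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(code)]


def apply_theme(tokens):
    """Step 2: look up a style for each token based on its scope."""
    return [(text, THEME.get(scope)) for scope, text in tokens]


tokens = classify("int my_var = 1;")
styled = apply_theme(tokens)
```

Keeping the two steps separate is what lets the same classification be reused with any theme.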

Getting the CSS

The color scheme that I use for VSCode, and the one I wanted to use for the website, is Gruvbox Material Dark. It comes as a JSON file that VSCode uses not only for code highlighting but also for the whole application, so that everything looks cohesive.

I made the following Python script to parse it and convert it into a CSS file suited to my application:

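import json
import re

# Load the VSCode color theme (assumes plain JSON, without comments).
with open("gruvbox-material-dark.json") as f:
    theme = json.load(f)

css = ""

# Semantic token keys look like "selector:language"; keys without a
# language part are skipped. Assumes every value is a plain color string.
for key, color in theme["semanticTokenColors"].items():
    parts = key.split(":")
    if len(parts) > 1:
        css += f".language-{parts[1]} .{parts[0]} {{ color: {color}; }}\n"

for tc in theme["tokenColors"]:
    # Assumes "scope" is a comma-separated string, as in this theme.
    selectors = []
    for sc in tc["scope"].split(", "):
        # "a.b c.d" becomes the descendant selector ".a.b .c.d"
        selector = " ".join("." + cls for cls in sc.split(" "))
        # ".heading.1" is invalid CSS, so it becomes ".heading-1"
        selectors.append(re.sub(r"\.(\d)", r"-\1", selector))

    css += ", ".join(selectors) + " {\n"

    settings = tc["settings"]
    if "foreground" in settings:
        css += f"    color: {settings['foreground']};\n"

    # fontStyle is a space-separated list such as "italic bold"
    font_style = settings.get("fontStyle", "").split()
    if "bold" in font_style:
        css += "    font-weight: bold;\n"
    if "underline" in font_style:
        css += "    text-decoration: underline;\n"
    if "italic" in font_style:
        css += "    font-style: italic;\n"

    css += "}\n"

with open("gruvbox.css", "w") as f:
    f.write(css)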

It was a quick'n'dirty implementation, but it got the work done and now I have a CSS file for the theme.
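The least obvious part of the script is how a theme scope turns into a CSS selector, so here is that step isolated as a small sketch. The function name scope_to_selector and the sample scopes are mine, picked only for illustration.

```python
import re


def scope_to_selector(scope):
    """Turn a space-separated theme scope into a CSS descendant selector."""
    # Each space-separated part becomes one class chain: "a.b" -> ".a.b"
    selector = " ".join("." + part for part in scope.split(" "))
    # A class name cannot start with a digit, so ".1" becomes "-1"
    return re.sub(r"\.(\d)", r"-\1", selector)


print(scope_to_selector("markup.heading.1"))
print(scope_to_selector("source.rust keyword.control"))
```

The first call prints .markup.heading-1 and the second prints .source.rust .keyword.control, so the HTML classes have to follow the same digit convention for the selectors to match.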

Getting the classed HTML

Now I can use syntect's ClassedHTMLGenerator::new_with_class_style to turn the code into HTML in which each token is wrapped in a span whose classes carry its classification.

The articles of this blog are written in Markdown and then converted to HTML. I do it with pulldown_cmark, which normally renders code blocks as <pre><code class="language-{lang}">...</code></pre>. To highlight the code blocks I need to extract them and transform them into classed spans myself.
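The blog does this in Rust, but the idea of the extraction is easy to sketch in Python. Everything here is an assumption for illustration: the regex only handles the exact <pre><code class="language-..."> shape shown above, and classify_to_spans is a placeholder standing in for the real highlighter.

```python
import html
import re

# Matches only the exact markup shape produced for fenced code blocks.
CODE_BLOCK_RE = re.compile(
    r'<pre><code class="language-(?P<lang>[\w+-]+)">(?P<code>.*?)</code></pre>',
    re.DOTALL,
)


def classify_to_spans(code, lang):
    """Placeholder highlighter: wraps the whole block in a single span."""
    return f'<span class="source {lang}">{html.escape(code)}</span>'


def highlight_code_blocks(page_html):
    """Replace the plain text of every code block with classed spans."""

    def replace(match):
        lang = match.group("lang")
        # The renderer escaped &, < and >, so recover the raw code first.
        code = html.unescape(match.group("code"))
        spans = classify_to_spans(code, lang)
        return f'<pre><code class="language-{lang}">{spans}</code></pre>'

    return CODE_BLOCK_RE.sub(replace, page_html)


page = '<p>hi</p><pre><code class="language-cpp">int a = 1 &lt; 2;\n</code></pre>'
result = highlight_code_blocks(page)
```

Unescaping before highlighting matters because the highlighter has to see the real source text, not its HTML entities.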

Performance

Initially, I did the highlighting on the front-end, converting the code to classed HTML on the fly, but it turned out to be too slow: converting a small block of code took about 300 ms. Unacceptable! That was with the default syntax set that comes with syntect. I figured that a smaller set with just the handful of languages I would use would be faster, and indeed it was: with only about 6 language syntaxes the conversion ran 3x faster, but it was still slow. Furthermore, adding the languages myself from text files made loading the syntax set itself significantly slower. It took almost 3 seconds for just 6 languages, and I could not spend that much every time I wanted to parse an article.

I thought about making a global syntax set in the app that could be used every time a new article needs to be parsed, but it would have to be initialized every time the blog gets loaded anyway, so I discarded that approach.

Finally, I decided that it would be better to perform the classification in the backend. First, I render the code blocks the way pulldown_cmark does by default, adding an id to each. Then I send the code to the backend and, when the response arrives, update the elements with the classed HTML. I was expecting the code to be visible as boring plain text and then colored after a noticeable delay, but it turned out to be very fast, and the syntax-highlighted code gets rendered almost instantly.
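Here is a rough Python sketch of that flow, with the network call replaced by a local function. The id format, the choice of putting the id on the <pre> element and the fake_backend stand-in are all assumptions made for the example, not the actual implementation.

```python
import re

CODE_RE = re.compile(
    r'<pre><code class="language-(?P<lang>[\w+-]+)">(?P<code>.*?)</code></pre>',
    re.DOTALL,
)


def tag_code_blocks(page_html):
    """Give every code block an id and collect what must be sent to the backend."""
    pending = {}

    def add_id(match):
        block_id = f"code-block-{len(pending)}"  # hypothetical id format
        lang, code = match.group("lang"), match.group("code")
        pending[block_id] = {"lang": lang, "code": code}
        return f'<pre id="{block_id}"><code class="language-{lang}">{code}</code></pre>'

    return CODE_RE.sub(add_id, page_html), pending


def fake_backend(pending):
    """Stand-in for the backend: returns one span per block (code kept escaped)."""
    return {
        block_id: f'<span class="source {block["lang"]}">{block["code"]}</span>'
        for block_id, block in pending.items()
    }


def apply_response(page_html, highlighted):
    """Swap the plain text of each tagged block for its classed HTML."""
    for block_id, spans in highlighted.items():
        pattern = re.compile(
            rf'(<pre id="{re.escape(block_id)}"><code class="[^"]*">).*?(</code></pre>)',
            re.DOTALL,
        )
        page_html = pattern.sub(lambda m: m.group(1) + spans + m.group(2), page_html)
    return page_html


page = '<pre><code class="language-py">x = 1\n</code></pre>'
tagged, pending = tag_code_blocks(page)          # shown right away as plain text
final = apply_response(tagged, fake_backend(pending))  # shown once the response arrives
```

The reader sees the tagged version immediately, so even a slow response only delays the colors and never the text.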

Conclusion

I finally got it done, as you can see from the code snippets above. Meanwhile, I learned some cool stuff about this topic and a little more about the WASM toolchain.

If you got this far into the article, I would like to thank you for taking the time to read it. I'm just starting with blogging, and I hope I can get better at it with some practice so that the articles become clearer and more entertaining.

[1] Of course, it would still be a learning experience if I had decided to use some other (probably more sensible) technology, like React.

[3] I'm talking about the theme that I use in VSCode, Gruvbox Material Dark. The style of the whole website is based on it because I'm not a designer, and coming up with nice color palettes is hard.