WordPress still needs a better shortcode parser.

WordPress 5.0 has now been released, with the new block syntax front and center. This data structure is completely new, and is layered on top of all the existing structures that modify post content, such as shortcodes, leaving many people wondering, “Why didn’t they just use shortcodes to store block data?” There are several reasons for this, one of them being the Gutenberg team’s preference for HTML comments, which browsers do not render by default. However, the biggest reason for the switch that I can see is a bit more fundamental than that. Simply put, shortcode parsing in WordPress is terrible!

In the age of Blocks, why should we care?

The block editor still parses shortcodes, and even includes a dedicated Shortcode block.
Many plugins and themes use shortcodes heavily, and those aren’t going away any time soon.
Shortcodes can operate inline, while block capabilities in that department are still somewhat experimental.
Not all post types will make sense with the block editor. Some use cases still need shortcodes.
Shortcodes are used in widget content, as well as post content.
Shortcodes are easy for novice developers to create. Custom blocks? Not so much.
Shortcode parsing has been a stain on WordPress for a long time. If they are going to exist, they should make sense.

Ok, so what’s wrong with shortcode parsing?

The primary issue is that shortcode parsing is performed by a single regular expression, but shortcode grammar is ill-suited to being parsed by regex. As a result, certain limitations are imposed by the parser rather than by the grammar itself. The three most notable limitations are:

Shortcodes cannot be nested without recursively calling the shortcode parser within the shortcode’s rendering callback.
Nested shortcodes cannot contain instances of the same shortcode they are nested within.
The current parser is monolithic, with all logic obscured within a massive, inscrutable regex.

The first of these issues is frustrating for users, who don’t expect shortcodes to stop working just because they are inside another shortcode. However, if a developer does nest a call to the shortcode renderer in their shortcodes’ callback functions, they may exponentially increase the time needed to render a post.
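To make the first limitation concrete, here is a minimal sketch in Python rather than PHP. It is not WordPress's actual parser: the regex, the `render` function, and the handler names are all hypothetical, and the pattern is far simpler than the real one. It only shows the general behavior of a single-pass regex renderer, where inner shortcodes survive untouched unless the outer callback runs the renderer again on its own content.

```python
import re

# Toy pattern (an assumption for illustration): "[tag]content[/tag]".
SHORTCODE = re.compile(r"\[(\w+)\](.*?)\[/\1\]", re.DOTALL)


def render(text, handlers):
    """Replace each top-level shortcode with the output of its handler."""
    def replace(match):
        tag, content = match.group(1), match.group(2)
        handler = handlers.get(tag)
        # Unregistered tags are left in the text unchanged.
        return handler(content) if handler else match.group(0)

    # re.sub does not rescan the text produced by the replacement,
    # so anything nested inside `content` is never seen by this pass.
    return SHORTCODE.sub(replace, text)


def bold(content):
    return "<b>" + content + "</b>"


def box_plain(content):
    # Does not re-run the renderer: nested shortcodes stay as raw text.
    return "<div>" + content + "</div>"


def box_recursive(content):
    # Re-runs the whole renderer on its content, one extra scan per level.
    return "<div>" + render(content, RECURSIVE) + "</div>"


PLAIN = {"b": bold, "box": box_plain}
RECURSIVE = {"b": bold, "box": box_recursive}

source = "[box]Hello [b]world[/b][/box]"
plain_output = render(source, PLAIN)
recursive_output = render(source, RECURSIVE)
```

With the plain handlers, the inner `[b]` shortcode is emitted as literal text; with the recursive handlers it is rendered, at the cost of scanning the inner content again at every level of nesting.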

The second issue has been the bane of page-builder plugins that rely on shortcodes. For example, users will often want to nest a set of columns inside a column. This can only be accomplished by registering multiple identical shortcodes, such as [column-outer] and [column-inner], where [column] should be sufficient. These issues also apply to other use cases, such as restricted-access plugins that might want to support nested access conditions.
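The same-name nesting problem can be reproduced with a toy pattern. The sketch below is in Python and uses a simplified regex of my own (`PAIR`), not the one WordPress builds, so treat it as an illustration of the general failure mode: a lazy match pairs the outer opening tag with the first closing tag it finds.

```python
import re

# Simplified stand-in pattern (an assumption): opening tag, lazy content,
# then a closing tag with the same name.
PAIR = re.compile(r"\[([\w-]+)\](.*?)\[/\1\]", re.DOTALL)

# Same shortcode nested inside itself.
nested = "[column]A [column]B[/column] C[/column]"
match = PAIR.search(nested)
captured = match.group(2)          # what the outer [column] receives
leftover = nested[match.end():]    # what is left dangling afterwards

# Distinct names for each level, the workaround described above.
renamed = "[column-outer]A [column-inner]B[/column-inner] C[/column-outer]"
renamed_content = PAIR.search(renamed).group(2)
```

Here `captured` is `A [column]B`, and ` C[/column]` is left behind as stray text. With distinct tag names, the outer shortcode receives its full content. Switching to a greedy match does not help either, since two sibling columns would then be merged into one.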

The third issue is problematic for developers, who need to be able to reason about shortcodes, and potentially expand their capabilities. For example, until I started reverse-engineering that regular expression, I had no idea that there was a slight performance benefit to explicitly marking self-closing shortcodes, like [shortcode /]. It seems to be entirely unmentioned in the documentation, and practically no one seems to do it, but there we are.

How do we fix it?

Well, luckily, we can build on the work done by Dennis Snell on the Gutenberg block parser. He used a regular expression to tokenize the post content, then iterated over the tokens to produce a document tree. That document tree is then consumed by a parser to generate the final HTML. Fundamentally, there is nothing about this process that isn’t compatible with the shortcode specification. The only complexity is the existence of unmarked, self-closing shortcodes. Since we are doing a depth-first parse that relies on a stack, this isn’t much of a problem at all. We just need to be able to backtrack up the stack and retroactively convert unknown blocks to self-closing when we hit the end of the document (EOD) or the end of the parent.
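The approach described above can be sketched roughly as follows. This is a Python illustration written for this post's argument, not the Gutenberg parser and not my proof of concept: the `TOKEN` pattern, the dictionary-based nodes, and the `parse`, `hoist`, and `outline` names are all assumptions, and a real implementation would only treat registered shortcode names as tags. The regex is used only to find individual tags, while a stack tracks nesting. Any shortcode still open when its parent closes, or when the document ends, is retroactively treated as self-closing and its collected content is handed back to the parent.

```python
import re

# Tokenizer (assumed syntax): [tag attrs], [tag attrs /], or [/tag].
TOKEN = re.compile(r"\[(/?)([\w-]+)([^\]]*?)(/?)\]")


def hoist(stack):
    """Convert the top of the stack to self-closing by moving its
    children up into its parent, right after the node itself."""
    unclosed = stack.pop()
    stack[-1]["children"].extend(unclosed["children"])
    unclosed["children"] = []


def parse(text):
    root = {"tag": "#root", "attrs": "", "children": []}
    stack = [root]
    pos = 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            stack[-1]["children"].append(text[pos:m.start()])
        pos = m.end()
        closing, tag, attrs, self_closing = m.groups()
        if closing:
            # Look down the stack (excluding the root) for the opener.
            depth = next((i for i in range(len(stack) - 1, 0, -1)
                          if stack[i]["tag"] == tag), None)
            if depth is None:
                stack[-1]["children"].append(m.group(0))  # stray closer
                continue
            # Everything opened after it was never closed: backtrack.
            while len(stack) - 1 > depth:
                hoist(stack)
            stack.pop()
        else:
            node = {"tag": tag, "attrs": attrs.strip(), "children": []}
            stack[-1]["children"].append(node)
            if not self_closing:
                stack.append(node)  # may turn out to be self-closing later
    if pos < len(text):
        stack[-1]["children"].append(text[pos:])
    while len(stack) > 1:  # end of document: close whatever is still open
        hoist(stack)
    return root


def outline(node):
    """Compact string form of the tree, handy for eyeballing the result."""
    if isinstance(node, str):
        return repr(node)
    inner = ", ".join(outline(child) for child in node["children"])
    return node["tag"] + "(" + inner + ")"


tree = parse("[column]A [column]B [hr] C[/column] D[/column][gallery /]")
```

In this example, `[column]` nests inside `[column]` without any renaming, and the unmarked `[hr]` ends up as an empty node followed by its sibling text ` C`. The explicitly marked `[gallery /]` never touches the stack at all, which hints at why marking self-closing shortcodes can save a little work.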

Where to from here?

Well, you probably knew it was coming, but I’ve taken the time to write a proof-of-concept iterator + parser + renderer for shortcodes that is fully backward-compatible with the existing standard. It isn’t fully tested yet, and probably has some bugs, but I’d love to see this form the basis of an eventual shortcode implementation in WordPress core.

Better Shortcode Parser on GitHub

Of course, this raises more questions, like “did WordPress really need to introduce a second language for blocks, when they already had shortcodes?”… I’ll leave that as an exercise for the reader.

2 Comments

max says:

January 14, 2020 at 6:13 am

Hi, thanks a lot for your article!
I actually have a problem I have already spent quite a lot of time on. Maybe, as a shortcode parser expert, you could help me solve it.
I use a shortcode like:

[gfchartsreports type="line" fill="origin" height="400px" grouped_tooltips="true" tooltip_dataserie_prefix="Kg Co2 " colors="#05AD1F,#E6550D" xaxislabel="Jaren" yaxislabel="Kosten baten jaar over jaar C02" gf_form_id="12" include="136,137,138,139,140,141,142,143,144,145,147,148,149,150,151,152,153,154,155,156" css_classes_as_series="maxicharts_met_serie_co2,maxicharts_zonder_serie_co2" css_datasets_labels="Co2 uitstoot met wtw douche,Co2 uitstoot zonder wtw douche" chart_js_options="title: {display: 1, text: 'Terugverdientijd en besparing in C02', fontSize:28,fontFamily:'Segoe UI', fontColor:'#161616',fontStyle:'bold',padding:20}" gf_entry_id="{entry_id}"]

I know it does not contain any HTML or line breaks inside the shortcode, and I double-checked it (I once had a parsing problem when a shortcode contained a special character).

Outputting the $atts at the beginning of the callback gives me additional parameters:
[type] => line [fill] => origin [height] => 400px [grouped_tooltips] => true [tooltip_dataserie_prefix] => Kg Co2 [colors] => #05AD1F,#E6550D [xaxislabel] => Jaren [yaxislabel] => Kosten baten jaar over jaar C02 [gf_form_id] => 12 [include] => 136,137,138,139,140,141,142,143,144,145,147,148,149,150,151,152,153,154,155,156 [css_classes_as_series] => maxicharts_met_serie_co2,maxicharts_zonder_serie_co2 [css_datasets_labels] => Co2 uitstoot met wtw douche,Co2 uitstoot zonder wtw douche [0] => chart_js_options="title: [1] => {display: [2] => true, [3] => text: [chart_js_options] => title: {display: 1, text: 'Terugverdientijd en besparing in C02', fontSize:28,fontFamily:'Segoe UI', fontColor:'#161616',fontStyle:'bold',padding:20} [gf_entry_id] => 118

Any clue would be greatly appreciated ! 🙂

- gschoppe says:
  
  January 17, 2020 at 12:58 am
  
  What is the problem that you are seeing? Based on your dump of the attributes in the callback, the shortcode certainly seems to be parsed properly. My guess is that the issue is in the callback that interprets the shortcode, or in the actual attributes you sent to the shortcode, rather than an issue with shortcode syntax.

Greg Schoppe