Extracting things from JavaScript strings

You want to pull some information out of a JavaScript string and you have a pattern (a regular expression, that is) to match against. Surprised that the String/RegExp combo API lacks a direct method, you comb the web for a solution. On Stack Overflow you're invited to put a RegExp.exec() in a while loop. I'll show you a better, if unorthodox, way of doing it with String.replace.

(Ab)using String.replace

Let's start with the real-life-sounding, contrived scenario of wanting to extract all numbers from a JavaScript string. A simple pattern for matching all occurrences of numbers (specifically, integers and decimal numbers, with an optional sign) is:

/[+-]?\d+(\.\d+)?/g

(don't forget the g flag, for Global; without it, String.replace stops after the first match)

The RegExp.exec way of doing it:

var str = "Some of the best numbers are 42, and, in particular 42.999.";
var number_regex = /[+-]?\d+(\.\d+)?/g;

var matches = [];

var match;
while ((match = number_regex.exec(str)) !== null) {
  matches.push(match[0]);
}

console.log(matches);

// => ["42", "42.999"]

If instead we use String.replace we can write:

var str = "Some of the best numbers are 42, and, in particular 42.999.";
var number_regex = /[+-]?\d+(\.\d+)?/g;

var matches = [];
str.replace(number_regex, function(match) {
  matches.push(match);
});

console.log(matches);

// => ["42", "42.999"]

In effect, we're hijacking String.replace to act as an iterator over the matches, rather than actually replacing anything in the String.
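If you use this trick in more than one place, it may be worth wrapping it in a small function. The sketch below assumes a helper named extractAll, a name made up for this article and not part of any library. It also spells out what happens to the replaced string: the callback returns nothing, so each match would become "undefined" in the result, and that result is simply thrown away.

```javascript
// Hypothetical helper (not a built-in): collects every match of a global
// regex by using String.replace purely as an iterator.
function extractAll(str, regex) {
  var matches = [];
  str.replace(regex, function (match) {
    matches.push(match);
    // Nothing is returned here, so every match is replaced with "undefined"
    // in the string that replace() builds. We never look at that string.
  });
  return matches;
}

var source = "Some of the best numbers are 42, and, in particular 42.999.";
var numbers = extractAll(source, /[+-]?\d+(\.\d+)?/g);

console.log(numbers); // => ["42", "42.999"]
console.log(source);  // the original string is left untouched
```

The regex passed in still needs the g flag; without it only the first match is visited.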

This has a couple of advantages:

  1. you do away with the anxiety-inducing while loop, because what can go wrong? I'll tell you what can go wrong: use a regular expression literal in the while condition, instead of a regex saved in a variable, and you'll get an infinite loop. Each pass evaluates the literal into a fresh RegExp whose lastIndex starts at 0, so exec keeps returning the first match:
while ((match = /[+-]?\d+(\.\d+)?/g.exec(str)) !== null) { FOREVERCODE }
  2. you get to have named parameters for your matches instead of indexes in an array.
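To see the lastIndex behaviour without actually hanging your browser, here is a small sketch that calls exec by hand a few times instead of looping. The variable names are only for illustration.

```javascript
var text = "42 and 42.999";
var saved = /[+-]?\d+(\.\d+)?/g;

// A regex stored in a variable remembers where the last match ended.
var first = saved.exec(text)[0];    // "42"
var afterFirst = saved.lastIndex;   // 2
var second = saved.exec(text)[0];   // "42.999"
var afterSecond = saved.lastIndex;  // 13
var third = saved.exec(text);       // null: no more matches
var afterThird = saved.lastIndex;   // 0: reset after the failed match

// A regex literal is evaluated into a new object every time, so lastIndex
// always starts at 0 and exec always finds the first match again.
var freshA = /[+-]?\d+(\.\d+)?/g.exec(text)[0]; // "42"
var freshB = /[+-]?\d+(\.\d+)?/g.exec(text)[0]; // "42" again

console.log(first, second, third);
console.log(freshA, freshB);
```

With the saved regex, exec eventually returns null and the loop ends. With the literal it never does.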

Extracting parts of a pattern

Each group we create in a regular expression, using the (...) construct, will result in an additional parameter to our "replacer" function. Let's take another example.

Assume that in a blog article you can add WordPress-style shortcodes for embedding videos:

Top 10 videos this month:

1. [youtube:FyCsJAj69sc]
2. [vimeo:128373915]
...

A straightforward regular expression to match these shortcodes is:

/\[\w+:\w+\]/g

// match alphanumeric characters, followed by colon,
// followed by another set of alphanumeric characters
// all wrapped in square brackets

but we want to match the parts individually, so we put them in groups:

/\[(\w+):(\w+)\]/g

Our pattern-extraction function now receives two extra parameters, one for each group:

var str = "Top 10 videos this month: \
1. [youtube:FyCsJAj69sc] \
2. [vimeo:128373915]";
var shortcode_regex = /\[(\w+):(\w+)\]/g;

var matches = [];
str.replace(shortcode_regex, function(match, code, id) {
  matches.push({
    code: code,
    id: id
  });
});

console.log(matches);

For which we get:

[
  {"code": "youtube", "id": "FyCsJAj69sc"},
  {"code": "vimeo", "id": "128373915"}
]

Tip: Have a regex group that you don't want to show up as a parameter of the replacer function? Use the non-capturing group syntax (?: ... ).
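As a quick illustration of the tip, suppose we only care about the video id and want to accept just the two services from the example. Restricting the pattern to youtube and vimeo is an assumption made for this sketch. The alternation needs a group, but because it is non-capturing the replacer function receives the id as its first group parameter:

```javascript
var post = "1. [youtube:FyCsJAj69sc] 2. [vimeo:128373915] 3. [other:123]";

// (?:youtube|vimeo) groups the alternatives without capturing them,
// so the only captured group is the id.
var id_regex = /\[(?:youtube|vimeo):(\w+)\]/g;

var ids = [];
post.replace(id_regex, function (match, id) {
  ids.push(id);
});

console.log(ids);

// => ["FyCsJAj69sc", "128373915"]
```

Had the service name been wrapped in a plain (...) group, the function would need an extra parameter for it ahead of id.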


Now you know how to make pattern extraction more readable and less error-prone.