Counting Words in an HTML Web Page

You might think that having JavaScript count the words in the body of a web page requires a rather complicated loop that steps through the content bit by bit, performing all sorts of tests to determine what is part of the HTML, what is whitespace, and what is part of an actual word that the browser will display. In fact, the JavaScript code we need to count all the words is much simpler than that.

var wordcount = function() {
  var words = document.getElementsByTagName('body')[0].innerHTML.replace(/<.*?>/g, '');
  return words.match(/\S+/g).length;
};

The short function above takes care of counting the words in a web page for us and even gets the right answer most of the time. (Note that there are only two statements in the function, one starting with var and one starting with return. The first of these is rather long and may wrap in the text on this page; when you enter it, keep it on one line so that neither regular expression is split across lines.) We can of course add an extra statement or two to handle those rare instances where this code doesn't give a correct count. Before we do that, though, let's examine the function to see just how it is able to give us the number of words in the page.

The first and last of the four lines of code define a function called wordcount, and the code within the function counts the words in the web page and returns that number. So we can simply substitute wordcount() into our JavaScript wherever we want to use the count.

Now let's consider just what it is that we are trying to do here. We want to get the content of the body of the web page and count the number of words within that which will appear on the screen. The first thing we need to do then is to get the content of the body of the current web page.

A call to document.getElementsByTagName('body')[0] locates the first (and only) body element of the web page within the Document Object Model and makes it available to our script. Reading its innerHTML property retrieves everything between the <body> and </body> tags and provides it to us as one long string of text.

Now this text contains HTML, whitespace, and of course the words we are trying to count. It is with this copy of the page content that some people would start thinking about looping through the text looking for the parts that represent words in order to count them. We can solve the problem much more easily if we first get rid of all the HTML tags so that our long string of text consists of just words and whitespace.

Now the HTML tags are easy to identify. They all start with a < and end with a >. So what we actually want to do is find each < in the web page and the > that most closely follows it, and remove those characters as well as everything in between them. The simplest way to do that is with a regular expression. Simply calling replace(/<.*?>/g, '') will remove all of the HTML tags for us. The . in the expression means any character (other than a line break), the * means zero or more occurrences, and the ? following the * means select the shortest possible string. The g after the expression means apply it to every match rather than just the first. So <.*?> will match from the < to the closest following >, and instructing JavaScript to replace each match with an empty string means that the replace strips out all of the HTML, leaving us with just plain text consisting of words and the whitespace between the words. This plain text is what we have assigned to the variable words.
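To see the difference the ? makes, here is a minimal sketch that runs the expression against a fixed string instead of a live page. The helper name stripTags and the sample markup are made up for illustration and are not part of the original function.

```javascript
// Hypothetical helper: removes anything that looks like an HTML tag.
function stripTags(html) {
  // <.*?> matches from a < to the nearest following > (lazy match)
  return html.replace(/<.*?>/g, '');
}

var sample = '<p>Hello <b>big</b> world</p>';

console.log(stripTags(sample));             // Hello big world
// Without the ? the match is greedy and runs from the first < to the last >,
// so the whole sample string is removed.
console.log(sample.replace(/<.*>/g, ''));   // (empty string)
```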

The remaining line of our code is what counts the number of words in our text and returns that to whatever called the function. We do this using a second regular expression. The expression \S+ will match any occurrence of one or more adjacent characters that are not whitespace. Now one or more adjacent characters that are not whitespace is a word, and so this regular expression matches all of the words in the text.

We used replace with our first regular expression to replace each occurrence of the content that matched the expression. With this second regular expression we are not trying to replace each occurrence, we are trying to count them. So with this expression we use match, which (because of the g flag) creates an array and loads each of our matched strings (the words) into a separate entry in this array.

Now we are not actually interested in the content of the array; what we are after is a count of the number of words. Of course, with each word from the page in a separate entry in the array, the word count we are after is identical to the number of entries in the array. JavaScript provides a really easy way to determine how many entries there are in an array, the length property, so simply returning the length of the array that the match generated returns the count of the number of words.
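The counting step can be tried on its own with a plain string. One detail worth knowing is that match returns null rather than an empty array when nothing matches, so reading length would throw an error on a body with no words at all. The sketch below adds a guard for that case; the guard and the name countWords are additions for illustration, not part of the function as originally given.

```javascript
// Hypothetical helper: counts runs of non-whitespace characters.
function countWords(text) {
  // match with the g flag returns an array of all matches, or null if none
  var found = text.match(/\S+/g);
  return found ? found.length : 0;
}

console.log(countWords('  the quick\tbrown\nfox '));  // 4
console.log(countWords('   '));                        // 0
```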

Now this code gets the right answer most of the time. So we need to consider the situations where it will get it wrong and work out what changes to make to handle when those occur.

There is one additional type of element that can occur within HTML that we haven't considered - entity codes. While all web pages contain HTML tags, only some will contain entity codes. For those of you not familiar with entity codes they all start with an & and end with a ; and the value in between identifies a specific character that is to appear at that spot in the web page. These codes can either be a short description of the character or a # followed by a number.

The code we already have actually handles almost all entity codes, since the characters that they represent will be part of the words that they are in. There is one entity code, though, that counts as whitespace, and that is the non-breaking space: &nbsp; or &#160;. We can amend our function to handle either of these two values simply by adding another replace next to the one we already have, so as to replace non-breaking spaces with ordinary spaces and so have them recognised as whitespace. The replace to do this is replace(/&nbsp;|&#160;/g, ' '), where the g is needed so that every non-breaking space is replaced and not just the first one. Since entity codes are separate from HTML tags, it doesn't matter which order the two replaces are in.
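A small sketch of the effect, using a made-up sample string and a hypothetical helper name normalizeNbsp. It assumes the entity appears in the text as written, which is how it would look in the page source.

```javascript
// Hypothetical helper: turns non-breaking space entities into ordinary spaces.
function normalizeNbsp(html) {
  // the g flag replaces every occurrence, not just the first
  return html.replace(/&nbsp;|&#160;/g, ' ');
}

var text = 'one&nbsp;two&#160;three';

// Untouched, the string contains no whitespace, so it counts as one word.
console.log(text.match(/\S+/g).length);                 // 1
// After normalising, the three words are separated by real spaces.
console.log(normalizeNbsp(text).match(/\S+/g).length);  // 3
```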

Another situation where the count might end up incorrect is where an HTML tag starts on one line of the source and ends on a different line. In a JavaScript regular expression the . does not match line break characters, so <.*?> cannot match a tag that spans lines (multiline mode does not help here, since it only changes how ^ and $ behave). We need to either use a pattern that does match line breaks, such as [\s\S] in place of the ., or convert all the text into one line. A simple replace added in front of the others, replace(/[\n\r]/g, ' '), will convert the entire body of the page into a single line before we look for the tags.
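The following sketch shows the miscount on a tag that is split over two lines and the effect of flattening the text first. The sample markup and the helper name flattenLines are assumptions made for the example.

```javascript
// Hypothetical helper: converts line breaks to spaces so the text is one line.
function flattenLines(html) {
  return html.replace(/[\n\r]/g, ' ');
}

// An opening tag whose attributes continue on a second line.
var html = '<a href="#"\n   class="x">link text</a>';

// The opening tag is not matched, so its pieces get counted as words.
var naive = html.replace(/<.*?>/g, '');
console.log(naive.match(/\S+/g).length);   // 4

// With the line break removed first, only the two real words remain.
var fixed = flattenLines(html).replace(/<.*?>/g, '');
console.log(fixed.match(/\S+/g).length);   // 2
```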

The other situation where our count will be wrong is where the page contains content that isn't text. Now this should never occur with a properly designed web page as the CSS and JavaScript should be in their own separate files and so wouldn't be retrieved as a part of the body of the page. Even if the CSS is in the web page it goes in the head of the page and not the body and so would not cause a problem for our word count. That leaves JavaScript actually embedded into the bottom of the web page as content we don't want to count (assuming we are dealing with modern web pages and not ones written in the twentieth century that have scripts calling document.write every few lines through the body).

Of course the simplest solution is to move the script out of the web page into its own file before you add the wordcount function but if that isn't possible for some reason then we can strip out everything following the <script> tag before we strip out the HTML tags.

Doing this involves, yes you guessed it, one additional replace to strip out not just the script tag but the entire content of the script. We just need to add replace(/<script.*?<\/script>/g, '') before we strip out the HTML tags so as to remove the entire script instead of just the tags around it. This assumes that we have already added the replace that makes the content all one line, as otherwise the closing script tag will almost certainly not be on the same line as the opening tag and so will not match the expression. Where your scripts themselves reference the closing script tag, you need to ensure that every reference has a backslash in front of the slash so that it isn't mistaken for the end of the script. It should be reasonable to expect the backslash to be there, since leaving it out not only results in this function miscalculating where the script finishes but can also cause browsers to end the script at that point. In our wordcount function we don't have a choice about the backslash anyway, as leaving it out would break the regular expression (just as not specifying the < before the backslash would allow the expression to match itself).
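As a rough illustration, the sketch below compares the count with and without the script content removed. The sample page and the helper name stripScripts are invented for the example, and the input is assumed to be on a single line already.

```javascript
// Hypothetical helper: removes each script element together with its content.
// Assumes line breaks have already been converted to spaces.
function stripScripts(html) {
  return html.replace(/<script.*?<\/script>/g, '');
}

var page = '<p>Two words</p> <script>var n = 1; alert(n);</script>';

// Removing only the tags leaves the script source behind to be counted.
var tagsOnly = page.replace(/<.*?>/g, '');
console.log(tagsOnly.match(/\S+/g).length);   // 7

// Removing the whole script first gives the expected count.
var cleaned = stripScripts(page).replace(/<.*?>/g, '');
console.log(cleaned.match(/\S+/g).length);    // 2
```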

So the simple function as first shown will work provided that you don't have tags that start on one line and end on a different line, don't use non-breaking spaces, and keep all your scripts in separate files. If any or all of these apply to your page then you need to add the extra replace statements that clean them up so as to leave just the text for the other statement to process.

With all the extra replace statements added our function looks like this:

var wordcount = function() {
  var words = document.getElementsByTagName('body')[0].innerHTML.replace(/&nbsp;|&#160;/g, ' ').replace(/[\n\r]/g, ' ').replace(/<script.*?<\/script>/g, '').replace(/<.*?>/g, '');
  return words.match(/\S+/g).length;
};

This is still much simpler code than you would need if you were to set up a loop to process the content character by character looking for the words.
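For trying the whole chain outside a browser, one option is a variant that takes the HTML as a string argument instead of reading it from the page. This is only a sketch under that assumption: the name countWordsInHtml, the sample markup, and the guard for a page with no words are all additions for illustration.

```javascript
// Hypothetical string-based variant of the word count function.
function countWordsInHtml(html) {
  var text = html
    .replace(/&nbsp;|&#160;/g, ' ')          // non-breaking spaces to spaces
    .replace(/[\n\r]/g, ' ')                 // make the content one line
    .replace(/<script.*?<\/script>/g, '')    // drop scripts and their content
    .replace(/<.*?>/g, '');                  // drop the remaining tags
  var found = text.match(/\S+/g);            // null when there are no words
  return found ? found.length : 0;
}

var sample = '<h1>Word&nbsp;count</h1>\n' +
  '<p class="a"\n   id="b">three more words</p>\n' +
  '<script>\nvar x = 1;\n</script>';

console.log(countWordsInHtml(sample));  // 5
```

In a browser the same function could be called as countWordsInHtml(document.body.innerHTML).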


This article written by Stephen Chapman, Felgall Pty Ltd.
