The above short function will take care of counting the words in a web page for us and even get the right answer most of the time. (Note that there are only two statements in the function - one starting var and one starting return - the first of these statements is rather long and has wrapped in the text on this page, but must be all on one line for the code to actually work.) We can of course add an extra statement or two to this code to handle those rare instances where this code doesn't give a correct count. Before we do that though, let's examine our function so as to determine just how this code is able to give us the number of words in the page.
Now let's consider just what it is that we are trying to do here. We want to get the content of the body of the web page and count the number of words within that which will appear on the screen. The first thing we need to do then is to get the content of the body of the current web page.
A call to document.getElementsByTagName('body') locates the first (and only) body tag of the web page within the Document Object Model and makes it available to our script. Accessing its innerHTML property retrieves everything between the <body> and </body> tags within the HTML and provides it to us as one long string of text.
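As a rough sketch of this first step, the lookup can be wrapped in a small helper. The name getBodyHtml and its doc parameter are assumptions made for illustration and are not part of the original function; in a browser you would pass in document.

```javascript
// Hypothetical helper (not from the original function): returns everything
// between <body> and </body> as one string.
// `doc` is expected to be the browser's document object.
function getBodyHtml(doc) {
  // getElementsByTagName returns a list of matching elements,
  // so [0] picks out the first (and only) body element.
  var body = doc.getElementsByTagName('body')[0];
  return body.innerHTML;
}

// In a browser: var html = getBodyHtml(document);
```

Taking the document as a parameter is only a convenience so that the helper can be tried outside a browser with a stand-in object.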
Now this text contains HTML, whitespace, and of course the words we are trying to count. It is with this copy of the page content that some people would start thinking about looping through the text looking for the parts that represent words in order to count them. We can solve the problem much more easily if we next simply get rid of all the HTML tags, using replace with a regular expression that matches the tags, so that our long string of text consists of just words and whitespace.
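A minimal sketch of the tag-stripping step follows. The exact expression used by the function is not quoted in this part of the article, so the pattern /<.*?>/g and the helper name stripTags are assumptions; the pattern was chosen to mirror the lazy .*? style used for scripts further down.

```javascript
// Hypothetical helper: removes anything that looks like an HTML tag.
// The pattern is an assumption, not necessarily the article's own expression.
function stripTags(html) {
  // <.*?> matches a "<", then as few characters as possible, then a ">".
  // The g flag makes replace act on every tag, not just the first one.
  return html.replace(/<.*?>/g, '');
}

var sample = '<p>Hello <b>world</b></p>';
var textOnly = stripTags(sample); // 'Hello world'
```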
The remaining line of our code is what is going to count the number of words in our text and return that to whatever called the function. We do this using a second regular expression. The expression \S+ matches any occurrence of one or more adjacent characters that are not whitespace. Now one or more adjacent characters that are not whitespace is a word and so this regular expression matches all of the words in the text.
We used replace with our first regular expression to replace each occurrence of the content that matched the expression. With this second regular expression we are not trying to replace each occurrence, we are trying to count them, so with this expression we use match, which creates an array and loads each of our matched strings (the words) into separate entries in this array. The number of entries in the array is then the number of words.
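The counting step might look something like the sketch below. The helper name countWords is an assumption. Two details of match are worth showing: it only returns every match when the expression has the g flag, and it returns null rather than an empty array when nothing matches.

```javascript
// Hypothetical helper: counts runs of non-whitespace characters in plain text.
function countWords(text) {
  // With the g flag, match returns an array holding every matched word.
  var words = text.match(/\S+/g);
  // match returns null when there are no matches at all, so guard for that.
  return words ? words.length : 0;
}

var count = countWords('  the quick\n brown fox ');  // 4
var none = countWords('   ');                        // 0
```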
Now this code gets the right answer most of the time. So we need to consider the situations where it will get it wrong and work out what changes to make to handle those situations when they occur.
There is one additional type of element that can occur within HTML that we haven't considered - entity codes. While all web pages contain HTML tags, only some will contain entity codes. For those of you not familiar with entity codes they all start with an & and end with a ; and the value in between identifies a specific character that is to appear at that spot in the web page. These codes can either be a short description of the character or a # followed by a number.
The code we already have actually handles almost all entity codes since the characters that they represent will be part of the words that they are in. There is one entity code though that counts as whitespace and that is the non-breaking space - &nbsp; or &#160;. We can amend our function to handle when either of these two values is used simply by adding another replace next to the one we already have so as to replace non-breaking spaces with ordinary spaces and so have them recognised as whitespace. The replace statement to do this is replace(/(&nbsp;|&#160;)/g,' ') where the g flag ensures that every occurrence is replaced and not just the first. Since entity codes are separate from HTML tags it doesn't matter which order the two replaces are in.
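To see the effect of the non-breaking space replace on its own, here is a small sketch. The helper name replaceNbsp and the sample string are assumptions made for illustration.

```javascript
// Hypothetical helper: turns non-breaking space entity codes into ordinary spaces.
function replaceNbsp(html) {
  // Matches either the named or the numeric form; g replaces every occurrence.
  return html.replace(/(&nbsp;|&#160;)/g, ' ');
}

var sample = 'one&nbsp;two&#160;three';

// Without the replace the whole string is one run of non-whitespace characters.
var before = sample.match(/\S+/g).length;              // 1
// With the replace the three words are separated by real spaces.
var after = replaceNbsp(sample).match(/\S+/g).length;  // 3
```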
Another situation where the count might end up incorrect is if we have HTML tags that start on one line of the source and end on a different line. Because the . in a regular expression does not match line break characters, our expressions as coded cannot match anything that runs across more than one line. Turning on multiline mode does not help here, since in JavaScript that only changes how ^ and $ behave, so the simplest fix is to convert all the text into one line. A simple replace added in front of the others, replace(/[\n\r]/g, ' '), will convert the entire body of the page into a single line before looking for the tags.
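The difference that the line-joining replace makes can be sketched as follows. The helper names are assumptions, and the tag pattern /<.*?>/g is likewise assumed rather than taken from the original listing.

```javascript
// Hypothetical helper: replaces every line feed and carriage return with a space.
function toSingleLine(html) {
  return html.replace(/[\n\r]/g, ' ');
}

// Assumed tag-stripping pattern; the . in it does not match line breaks.
function stripTags(html) {
  return html.replace(/<.*?>/g, '');
}

// An opening tag that starts on one line and ends on the next.
var sample = '<a\nhref="#">link</a>';

// Without the fix only the closing tag is removed.
var withoutFix = stripTags(sample);               // '<a\nhref="#">link'
// With the fix both tags are removed.
var withFix = stripTags(toSingleLine(sample));    // 'link'
```

In this sample the unfixed version would be counted as two words instead of one.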
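A third situation is where the body of the page contains scripts. Stripping out the HTML tags removes the <script> and </script> tags themselves but leaves the code between them behind to be counted as words. Of course the simplest solution is to move the script out of the web page into its own file before you add the wordcount function but if that isn't possible for some reason then we can strip out everything between the <script> and </script> tags before we strip out the HTML tags.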
Doing this involves - yes you guessed it - one additional replace statement to strip out not just the script tag but the entire content of the script. We just need to add replace(/<script.*?<\/script>/g, '') before we strip out the HTML tags so as to remove the entire script instead of just the tags around it (assuming that we have already added the replace statement that makes the content all one line, as otherwise the closing script tag will almost certainly not be on the same line as the opening tag and so will not match the expression). Where your scripts themselves reference the closing script tag you need to ensure that every reference has a backslash in front of the slash to ensure that it isn't mistaken for the end of the script. It should be reasonable to expect the backslash to be there since not having it there not only results in this script miscalculating where the script finishes but can also have the same effect in some browsers. In our wordcount function we don't have a choice about the backslash anyway as leaving it out would break the regular expression (just as not specifying the < before the backslash would allow the expression to match itself).
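A short sketch of the script-stripping step, under the same assumptions as before: the helper name stripScripts and the sample page are made up for illustration, and the tag pattern /<.*?>/g is assumed.

```javascript
// Hypothetical helper: removes each script together with its content.
// It relies on the text already having been converted to a single line.
function stripScripts(html) {
  return html.replace(/<script.*?<\/script>/g, '');
}

var page = '<p>Two words</p>\n<script>\nvar total = 1 + 2;\n</script>';

// Step 1: make the content all one line so that . can reach the closing tag.
var flattened = page.replace(/[\n\r]/g, ' ');

// Step 2: remove the scripts, then the remaining tags (assumed pattern).
var text = stripScripts(flattened).replace(/<.*?>/g, '');
var words = text.match(/\S+/g).length;            // 2

// Skipping step 2's script removal leaves the script code to be counted.
var wrong = flattened.replace(/<.*?>/g, '').match(/\S+/g).length;  // 8
```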
So the simple function as first shown will work provided that you don't have tags that start on one line and end on a different line, don't use non-breaking spaces, and keep all your scripts in separate files. If you have any or all of these then you need to add the extra replace statements that clean them up so as to leave just the text for the other statement to process.
With all the extra replace statements added our function looks like this:
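The sketch below is assembled from the replace statements described in this article rather than copied from the original listing, so treat it as an approximation. The tag pattern /<.*?>/g, the helper name wordCountFromHtml, the null guard on match, and splitting the work into two functions are all assumptions.

```javascript
// Hypothetical helper: counts the visible words in a string of HTML.
function wordCountFromHtml(html) {
  var words = html
    .replace(/[\n\r]/g, ' ')                // make the content all one line
    .replace(/<script.*?<\/script>/g, '')   // remove scripts and their content
    .replace(/<.*?>/g, '')                  // remove remaining tags (assumed pattern)
    .replace(/(&nbsp;|&#160;)/g, ' ')       // non-breaking spaces become spaces
    .match(/\S+/g);                         // collect the words into an array
  // match returns null when there are no words at all.
  return words ? words.length : 0;
}

// In a browser the count for the current page would then be obtained like this.
function wordcount() {
  return wordCountFromHtml(document.getElementsByTagName('body')[0].innerHTML);
}

var sample = '<p>Hello&nbsp;big\nworld</p>\n' +
  '<script>\nvar x = "not words";\n</script>\n' +
  '<a\nhref="#">link</a>';
var total = wordCountFromHtml(sample);      // 4
```

The order of the replaces follows the text: the line-joining replace comes first, the script replace comes before the tag replace, and the non-breaking space replace could go anywhere before match.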
This is still much simpler code than you would need if you were to set up a loop to process through the content character by character looking for the words.
This article written by Stephen Chapman, Felgall Pty Ltd.