Java Tokenizer
August 12th, 2007 by Rick
Why the heck don’t I just use the Java StringTokenizer class from java.util? Well, I’m in the middle of reading Writing Compilers And Interpreters, and I thought it would be instructive to port the tokenizer described in Chapter 2 (aka, “token1.c”) to Java.
I also apparently have a masochistic streak.
What is the tokenizer supposed to do? As it stands, it should read a designated text file line by line, separate “words” or “tokens” on each line, and decide whether the token is a number, a word, or the end of file indicator (a period in this case).
So, given an input file like this:
1500
This is a test
NAN 349 New
End of file. Test
Nothing here
Our program is supposed to give output like this:
```
$ ./token1 test2.dat

Page 1   test2.dat   Sun Aug 12 17:05:51 2007

   1 0: 1500
        >> <NUMBER>   1500
   2 0: This is a test
        >> <WORD>     This
        >> <WORD>     is
        >> <WORD>     a
        >> <WORD>     test
   3 0: NAN 349 New
        >> <WORD>     NAN
        >> <NUMBER>   349
        >> <WORD>     New
   4 0: End of file. Test
        >> <WORD>     End
        >> <WORD>     of
        >> <WORD>     file
        >> <PERIOD>   .
$
```
This is the output we get from the original C code in the compiler book. When I ported the first program from Chapter 1, the “list.c” program, I found myself doing an almost straight port. You can see the results in my previous entry, “Java Lister”. It’s not very OO, and I’m not very happy with it, but it does what it’s supposed to do.
For the tokenizer, I decided to get a handle on the book’s source (“token1.c”) and start writing Java with the target output in mind.
First, I figured I would need three classes:
- A “driver” class I called TokenizerOne — This contains main(), is responsible for I/O, and breaks up the input into its constituent tokens;
- A TokenCode class/enum-type — This Java 5 construct is actually pretty nifty. Not only did I not have to deal with the well-documented issues regarding faking out enums in previous Java versions, I was also able to customize output more easily than in C by simply overriding the toString method for TokenCode;
- A Token class — Since the value of a token here could be either a String or a numeric, the original C program tacked on a union construct. I really didn’t see the need to fake out a union in Java since we can easily tell if a Token is numeric or alpha by checking its TokenCode. If it’s numeric, you can always pull its numeric value on-the-fly by sending its value to Integer.parseInt().
So, first thing, let’s create a file called TokenCode.java. As the code below shows, there are six possible values for this enum. In addition, the Java enum type allows us to define fields and a constructor, so each value can carry its own data. I’ve kept things simple by just associating a String called “outValue” with each enum value. This makes it easier to customize this enum’s behavior. Further customizations are possible by adding methods to the enum or overriding inherited ones, as I have with the toString() method.
```java
// Package statement

// Imports go here.

/**
 * TokenCode enum
 */
public enum TokenCode {
    NO_TOKEN    ("<no token>"),
    WORD        ("<WORD>"),
    NUMBER      ("<NUMBER>"),
    PERIOD      ("<PERIOD>"),
    END_OF_FILE ("<end of file>"),
    ERROR       ("<ERROR>");

    // Display text associated with each constant.
    private final String outValue;

    TokenCode (String outValue) {
        this.outValue = outValue;
    }

    @Override
    public String toString () {
        return this.outValue;
    }
}
```
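To see what the overridden toString() changes and what it leaves alone, here is a minimal sketch. It uses a trimmed-down copy of the enum under the hypothetical name DemoCode (nested in a hypothetical EnumDemo class) so that it compiles on its own; neither name is part of the port.

```java
class EnumDemo {
    // Hypothetical, trimmed-down copy of TokenCode so the sketch stands alone.
    enum DemoCode {
        WORD   ("<WORD>"),
        NUMBER ("<NUMBER>"),
        PERIOD ("<PERIOD>");

        private final String outValue;

        DemoCode(String outValue) {
            this.outValue = outValue;
        }

        @Override
        public String toString() {
            return outValue;
        }
    }

    // name() and ordinal() come from the declaration; string
    // concatenation goes through the overridden toString().
    static String describe(DemoCode code) {
        return code.name() + " (" + code.ordinal() + ") prints as " + code;
    }

    public static void main(String[] args) {
        for (DemoCode code : DemoCode.values()) {
            System.out.println(describe(code));
        }
        // valueOf() looks constants up by identifier, not by outValue.
        System.out.println(DemoCode.valueOf("NUMBER") == DemoCode.NUMBER);
    }
}
```

The identifier stays available through name(), so overriding toString() only affects how the constant is displayed.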
Yeah, the enum declaration is a bit foreign to some Java developers. You really want to say something like
right? I know I did …
The next thing to create is the Token class. A token has two data members: its value, and something “meta” that is supposed to tell you what kind of data the value member is. Sounds like a good use of a TokenCode object, no?
```java
// Package statement

// Imports go here.

/**
 * Token class
 */
public class Token {
    private String value;
    private TokenCode code;

    public Token (String value, TokenCode code) {
        this.value = value;
        this.code = code;
    }

    public String getValue () {
        return value;
    }

    public TokenCode getTokenCode () {
        return code;
    }

    @Override
    public String toString() {
        // Token code, two tabs, then the token text.
        return (code.toString() + "\t\t" + value);
    }
}
```
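Pulling the numeric value on the fly, instead of storing it in a union, might look something like the sketch below. Kind, SimpleToken, and intValue() are hypothetical stand-ins invented for this example so that it compiles on its own; the Token class above has no such method.

```java
class TokenValueDemo {
    // Hypothetical stand-in for TokenCode.
    enum Kind { WORD, NUMBER }

    // Hypothetical stand-in for Token: the text plus a code describing it.
    static final class SimpleToken {
        final String value;
        final Kind kind;

        SimpleToken(String value, Kind kind) {
            this.value = value;
            this.kind = kind;
        }

        // Parse only when the code says the text is numeric.
        int intValue() {
            if (kind != Kind.NUMBER) {
                throw new IllegalStateException("not a number: " + value);
            }
            return Integer.parseInt(value);
        }
    }

    public static void main(String[] args) {
        SimpleToken number = new SimpleToken("349", Kind.NUMBER);
        SimpleToken word = new SimpleToken("NAN", Kind.WORD);

        System.out.println(number.intValue() + 1);
        try {
            word.intValue();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

One caveat: Integer.parseInt() throws NumberFormatException when the digit string does not fit in an int, so a very long run of digits would still need handling somewhere.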
The general idea behind the TokenizerOne class is to read one line at a time from the input file. The java.io.LineNumberReader class makes this easy and keeps track of what line we’ve just read.
Then, the line is taken apart one character at a time. Leading whitespace is skipped. Characters are read into a buffer until we hit a character that cannot belong to the current token, or the end of the line. Then we decide what type of token we have and send the token code and token buffer contents to the Token constructor (at the end of nextToken()). A character that is not a letter, a digit, or the period is treated as an error by nextToken(), but main() never gets that far: lineHasMoreTokens() returns false as soon as it sees one, so the rest of that line is skipped.
Everything else here is really dealing with output formatting and should probably be separated out into another class or something.
```java
// Package statement

// Imports go here.
import java.io.FileReader;
import java.io.LineNumberReader;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.GregorianCalendar;

/**
 * TokenizerOne class
 */
public class TokenizerOne {
    private static final char FORM_FEED_CHAR = '\f';
    private static final int MAX_LINES_PER_PAGE = 50;
    private static final char END_OF_FILE_FLAG = '.';

    private LineNumberReader sourceReader;
    private boolean endOfFile;
    private StringBuffer lineBuffer;
    private String fileName;
    private String date;

    private int pageNumber;
    private int lineCount;
    private Token currentToken;

    public static void main (String[] args) {
        if (args.length < 1) {
            System.err.println("Need to supply the filename");
            System.exit(1);
        }

        TokenizerOne myTokenizer = new TokenizerOne(args[0]);

        myTokenizer.nextLine();
        while (!myTokenizer.isEndOfFile()) {
            myTokenizer.printLine();
            while (myTokenizer.lineHasMoreTokens()) {
                myTokenizer.nextToken();
                myTokenizer.printToken();
                if (myTokenizer.getTokenCode() == TokenCode.PERIOD) {
                    break;
                }
            }
            // The period is our end-of-file flag, so stop reading lines too.
            if (myTokenizer.getTokenCode() == TokenCode.PERIOD) {
                break;
            }
            myTokenizer.nextLine();
        }
    }

    public TokenizerOne (String newFileName) {
        pageNumber = 0;
        lineCount = 0;
        endOfFile = false;

        // Set the date string
        SimpleDateFormat dateFormat =
            new SimpleDateFormat("yyyy.MMMMM.dd hh:mm aaa");
        Date timer = new GregorianCalendar().getTime();
        date = dateFormat.format(timer);

        // Set the file name
        this.fileName = newFileName;

        try {
            // After this, the file should be ready for action
            sourceReader =
                new LineNumberReader(new FileReader(this.fileName));
        } catch (IOException e) {
            System.err.println ("Problem opening " + fileName + ".");
            e.printStackTrace();
            System.exit(1);
        }

        lineBuffer = new StringBuffer();

        // Assign the field; the original declared a local variable here,
        // which shadowed the field and had no effect.
        currentToken = null;
    }

    public void nextLine() {
        if ( lineBuffer != null &&
             lineBuffer.length() > 0 ) {
            lineBuffer.delete (0, lineBuffer.length());
        }

        this.setEndOfFile(false);
        try {
            String nextLine = sourceReader.readLine();
            this.setEndOfFile( nextLine == null );
            if ( ! this.isEndOfFile() ) {
                lineBuffer.append (nextLine);

                // Remove leading whitespace.
                while (lineBuffer.length() > 0 &&
                       Character.isWhitespace(lineBuffer.charAt(0))) {
                    lineBuffer.deleteCharAt(0);
                }
            }
        } catch (IOException e) {
            System.err.println
                ("Problem reading from " + fileName + ".");
            e.printStackTrace();
            // Stop the main loop instead of retrying a failed read forever.
            this.setEndOfFile(true);
        }
    }

    public boolean lineHasMoreTokens() {
        boolean moreTokens = true;

        // Remove any white space before the next token.
        if (lineBuffer != null) {
            while (lineBuffer.length() > 0 &&
                   Character.isWhitespace(lineBuffer.charAt(0))) {
                lineBuffer.deleteCharAt(0);
            }
        }

        if ( lineBuffer != null &&
             lineBuffer.length() > 0 ) {
            // Check if the first character of lineBuffer
            // is the beginning of one of our token types.
            char nextChar = lineBuffer.charAt(0);
            moreTokens = Character.isDigit(nextChar) ||
                         Character.isLetter(nextChar) ||
                         (nextChar == END_OF_FILE_FLAG);
        } else {
            moreTokens = false;
        }

        return moreTokens;
    }

    public Token nextToken() {
        StringBuffer tokenBuffer = new StringBuffer("");
        TokenCode currentTokenCode = TokenCode.ERROR;

        char nextChar = lineBuffer.charAt(0);
        lineBuffer.deleteCharAt(0);

        if ( Character.isDigit(nextChar) ) {
            // A number is a run of digits.
            currentTokenCode = TokenCode.NUMBER;
            tokenBuffer.append(nextChar);
            while (lineBuffer.length() > 0 &&
                   Character.isDigit(lineBuffer.charAt(0))) {
                tokenBuffer.append(lineBuffer.charAt(0));
                lineBuffer.deleteCharAt(0);
            }
        } else if ( Character.isLetter(nextChar) ) {
            // A word starts with a letter and continues with letters or digits.
            currentTokenCode = TokenCode.WORD;
            tokenBuffer.append(nextChar);

            while ( lineBuffer.length() > 0 &&
                    ((Character.isLetter(lineBuffer.charAt(0))) ||
                     (Character.isDigit (lineBuffer.charAt(0)))) ) {
                tokenBuffer.append(lineBuffer.charAt(0));
                lineBuffer.deleteCharAt(0);
            }
        } else if ( nextChar == END_OF_FILE_FLAG ) {
            currentTokenCode = TokenCode.PERIOD;
            tokenBuffer.append(nextChar);
        } else {
            // Anything else: swallow everything up to the next letter or digit.
            currentTokenCode = TokenCode.ERROR;
            tokenBuffer.append(nextChar);
            while ( lineBuffer.length() > 0 &&
                    ! ((Character.isLetter(lineBuffer.charAt(0))) ||
                       (Character.isDigit (lineBuffer.charAt(0)))) ) {
                tokenBuffer.append(lineBuffer.charAt(0));
                lineBuffer.deleteCharAt(0);
            }
        }

        currentToken =
            new Token (tokenBuffer.toString(), currentTokenCode);
        return currentToken;
    }

    public TokenCode getTokenCode() {
        // No token yet (e.g. the first line was blank): avoid a
        // NullPointerException by reporting NO_TOKEN.
        if (currentToken == null) {
            return TokenCode.NO_TOKEN;
        }
        return currentToken.getTokenCode();
    }

    // Print line number and page info, if necessary.
    public void printLine() {
        if ( lineCount == 0 ) {
            System.out.println (FORM_FEED_CHAR);
            System.out.println ("Page\t" + ++pageNumber + "\t" +
                                fileName + "\t" + date + "\n\n");
        }

        System.out.println ("\t" + sourceReader.getLineNumber() +
                            ":\t" + lineBuffer.toString());

        lineCount = (lineCount + 1) % MAX_LINES_PER_PAGE;
    }

    public void printToken() {
        if ( lineCount == 0 ) {
            System.out.println (FORM_FEED_CHAR);
            System.out.println ("Page\t" + ++pageNumber + "\t" +
                                fileName + "\t" + date + "\n\n");
        }
        System.out.println ("\t\t>> " + currentToken.toString());
        lineCount = (lineCount + 1) % MAX_LINES_PER_PAGE;
    }

    public boolean isEndOfFile() {
        return endOfFile;
    }

    public boolean getEndOfFile() {
        return endOfFile;
    }

    public void setEndOfFile(boolean endOfFile) {
        this.endOfFile = endOfFile;
    }

}
```
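The scanning loop itself can be shown in isolation. What follows is a rough sketch rather than the code from the port: the class name LineScanSketch and its scan() method are made up for this example, it walks an index through a String instead of deleting characters from the front of a StringBuffer, and it reports each stray character as its own error entry instead of abandoning the line.

```java
import java.util.ArrayList;
import java.util.List;

class LineScanSketch {
    // Splits one line into "<KIND> text" entries, stopping after the first period.
    static List<String> scan(String line) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < line.length()) {
            char c = line.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++;                       // skip separators
                continue;
            }
            int start = pos;
            String kind;
            if (Character.isDigit(c)) {
                kind = "<NUMBER>";           // a run of digits
                while (pos < line.length()
                        && Character.isDigit(line.charAt(pos))) {
                    pos++;
                }
            } else if (Character.isLetter(c)) {
                kind = "<WORD>";             // a letter, then letters or digits
                while (pos < line.length()
                        && Character.isLetterOrDigit(line.charAt(pos))) {
                    pos++;
                }
            } else if (c == '.') {
                kind = "<PERIOD>";
                pos++;
            } else {
                kind = "<ERROR>";            // one unexpected character
                pos++;
            }
            tokens.add(kind + " " + line.substring(start, pos));
            if (kind.equals("<PERIOD>")) {
                break;                       // the period ends the input
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : scan("End of file. Test")) {
            System.out.println(token);
        }
    }
}
```

Walking an index avoids the repeated deleteCharAt(0) calls, each of which shifts the rest of the buffer, at the cost of carrying the position around explicitly.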
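For anyone who hasn’t used LineNumberReader, here is a small sketch of the line-counting behavior the tokenizer relies on. It reads from a StringReader instead of a file so it can run anywhere; LineNumberSketch and numberLines() are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineNumberSketch {
    // Returns each line of the text prefixed with its 1-based line number.
    static List<String> numberLines(String text) {
        List<String> out = new ArrayList<>();
        try (LineNumberReader reader =
                 new LineNumberReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Numbering starts at 0 and has already been bumped for the
                // line just returned, so the first line reports 1.
                out.add(reader.getLineNumber() + ": " + line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : numberLines("1500\nThis is a test\nNAN 349 New\n")) {
            System.out.println(s);
        }
    }
}
```

This is why printLine() can call getLineNumber() right after nextLine() and get the number of the line it is about to print.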
And this is how it works. It looks pretty close to the output from the original C program …
```
$ java TokenizerOne test2.dat

Page    1       test2.dat       2007.August.12 11:19 PM

        1:      1500
                >> <NUMBER>             1500
        2:      This is a test
                >> <WORD>               This
                >> <WORD>               is
                >> <WORD>               a
                >> <WORD>               test
        3:      NAN 349 New
                >> <WORD>               NAN
                >> <NUMBER>             349
                >> <WORD>               New
        4:      End of file. Test
                >> <WORD>               End
                >> <WORD>               of
                >> <WORD>               file
                >> <PERIOD>             .
$
```
The next section of the book deals with modularizing the C code so it’s more manageable. Header files, oh boy. I wonder if creating interfaces to deal with that sort of thing is going to be overkill.