Welcome to

Stabile-Dot-Org

If I have time to write about it, it’s here.

Java Tokenizer

August 12th, 2007 by Rick

Why the heck don’t I just use the Java StringTokenizer class from java.util? Well, I’m in the middle of reading Writing Compilers And Interpreters, and I thought it would be instructive to port the tokenizer described in Chapter 2 (aka, “token1.c”) to Java.
I also apparently have a masochistic streak.
What is the tokenizer supposed to do? As it stands, it should read a designated text file line by line, separate the “words” or “tokens” on each line, and decide whether each token is a number, a word, or the end-of-file indicator (a period in this case).
So, given an input file like this:

1500
This is a test
NAN 349 New
End of file. Test
Nothing here

Our program is supposed to give output like this:

$ ./token1 test2.dat

Page 1    test2.dat    Sun Aug 12 17:05:51 2007

   1 0: 1500
     >> <NUMBER>         1500
   2 0: This is a test
     >> <WORD>           This
     >> <WORD>           is
     >> <WORD>           a
     >> <WORD>           test
   3 0: NAN 349 New
     >> <WORD>           NAN
     >> <NUMBER>         349
     >> <WORD>           New
   4 0: End of file. Test
     >> <WORD>           End
     >> <WORD>           of
     >> <WORD>           file
     >> <PERIOD>         .
$


This is the output we get from the original C code in the compiler book. When I ported the first program from Chapter 1, the “list.c” program, I found myself doing an almost straight port. You can see the results in my previous entry, “Java Lister”. It’s not very OO, and I’m not very happy with it, but it does what it’s supposed to do.
For the tokenizer, I decided to get a handle on the book’s source (“token1.c”) and start writing Java with the target output in mind.
First, I figured I would need three classes:

  1. A “driver” class I called TokenizerOne — This contains main(), is responsible for I/O, and breaks up the input into its constituent tokens;
  2. A TokenCode class/enum-type — This Java 5 construct is actually pretty nifty. Not only did I not have to deal with the well-documented issues regarding faking out enums in previous Java versions, I was also able to customize output more easily than in C by simply overriding the toString method for TokenCode;
  3. A Token class — Since the value of a token here could be either a String or a numeric, the original C program tacked on a union construct. I really didn’t see the need to fake out a union in Java since we can easily tell if a Token is numeric or alpha by checking its TokenCode. If it’s numeric, you can always pull its numeric value on-the-fly by sending its value to Integer.parseInt().
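Here’s a minimal sketch of that third point: keep the token’s text as a String and parse it only when the code says it’s a number. The class name TokenValueSketch, the trimmed-down Kind enum, and the intValue helper are made up for illustration; they are not part of the tokenizer in this post.

```java
public class TokenValueSketch {
    // Stand-in for the TokenCode enum, trimmed to two values for this sketch.
    public enum Kind { WORD, NUMBER }

    // Hypothetical helper: returns the numeric value of a token's text,
    // and refuses to parse anything that was not tagged as a NUMBER.
    public static int intValue(String text, Kind kind) {
        if (kind != Kind.NUMBER) {
            throw new IllegalStateException("Not a numeric token: " + text);
        }
        // Throws NumberFormatException if the digits overflow an int.
        return Integer.parseInt(text);
    }

    public static void main(String[] args) {
        // The text stays a String until a numeric value is actually needed.
        System.out.println(intValue("1500", Kind.NUMBER) + 1); // prints 1501
    }
}
```

The trade-off is that a very long run of digits would still be tagged as a number but fail at parse time, so a real version would probably want to handle that case.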

So, first thing, let’s create a file called TokenCode.java. As the code below shows, there are six possible values for this enum. In addition, the Java enum type allows us to define fields and associate a value for each field with every enum constant. I’ve kept things simple by just associating a String called “outValue” with each constant. This makes it easier to customize this enum’s behavior. Further customizations are possible by adding methods to the enum or overriding existing ones, as I have with the toString() method.

// Package statement

// Imports go here.

/**
 * TokenCode enum
 */
public enum TokenCode {
    NO_TOKEN        ("<no token>"),
    WORD            ("<WORD>"),
    NUMBER          ("<NUMBER>"),
    PERIOD          ("<PERIOD>"),
    END_OF_FILE     ("<end of file>"),
    ERROR           ("<ERROR>");

    // Label printed in the token listing for this code.
    private final String outValue;

    TokenCode (String outValue) {
        this.outValue = outValue;
    }

    @Override
    public String toString () {
        return this.outValue;
    }
}
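To see what overriding toString() buys you, here is a small sketch that prints each constant’s identifier next to its label. It uses a trimmed copy of the enum (called Code here, inside a hypothetical TokenCodeDemo class) so it compiles on its own; the describe helper is also just for illustration.

```java
public class TokenCodeDemo {
    // Trimmed copy of the TokenCode enum so this sketch is self-contained.
    public enum Code {
        WORD("<WORD>"), NUMBER("<NUMBER>"), PERIOD("<PERIOD>");

        private final String outValue;

        Code(String outValue) {
            this.outValue = outValue;
        }

        @Override
        public String toString() {
            return outValue;
        }
    }

    // name() still returns the identifier; toString() returns the label.
    public static String describe(Code c) {
        return c.name() + " -> " + c;
    }

    public static void main(String[] args) {
        for (Code c : Code.values()) {
            System.out.println(describe(c));
        }
    }
}
```

Note that Code.valueOf("WORD") keeps working after the override, because valueOf matches on the identifier rather than on whatever toString() returns.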



Yeah, the enum declaration is a bit foreign to some Java developers. You really want to say something like

public class TokenCode extends Enum

right? I know I did … (The compiler won’t let a class extend java.lang.Enum directly; the enum keyword takes care of that for you.)

The next thing to create is the Token class. A token has two data members: its value, and something “meta” that is supposed to tell you what kind of data the value member is. Sounds like a good use of a TokenCode object, no?

// Package Statement

// Imports go here.


/**
 * Token class
 */
public class Token {
    private String    value;   // the token's text, exactly as read
    private TokenCode code;    // what kind of token the text represents

    public Token (String value, TokenCode code) {
        this.value = value;
        this.code  = code;
    }

    public String getValue () {
        return value;
    }

    public TokenCode getTokenCode () {
        return code;
    }

    // Formats the token as "<CODE>", two tabs, then the text.
    @Override
    public String toString() {
        return (code.toString() + "\t\t" + value);
    }
}



The general idea behind the TokenizerOne class is to read one line at a time from the input file. The java.io.LineNumberReader class makes this easy and keeps track of what line we’ve just read.

Then, the line is taken apart one character at a time. Leading whitespace is skipped; non-alphanumerics cause errors. Characters are read into a buffer until whitespace, the end of the line, or a character that can’t belong to the current token is encountered. Then we decide what type of token we have and send the TokenCode and the token buffer contents to the Token constructor (lines 177-178).

Everything else here is really dealing with output formatting and should probably be separated out into another class or something.

  1 // Package Statement
  2
  3 // Imports go here.
  4 import java.io.FileReader;
  5 import java.io.LineNumberReader;
  6 import java.io.IOException;
  7 import java.text.SimpleDateFormat;
  8 import java.util.Date;
  9 import java.util.GregorianCalendar;
 10
 11 /**
 12  * TokenizerOne class
 13  */
 14
 15 public class TokenizerOne {
 16     private static final char FORM_FEED_CHAR     = '\f';
 17     private static final int  MAX_LINES_PER_PAGE = 50;
 18     private static final char END_OF_FILE_FLAG   = '.';
 19
 20     private LineNumberReader sourceReader;
 21     private boolean endOfFile;
 22     private StringBuffer lineBuffer;
 23     private String fileName;
 24     private String date;
 25
 26     private int pageNumber;
 27     private int lineCount;
 28     private Token currentToken;
 29
 30     public static void main (String[] args) {
 31         if (args.length < 1) {
 32             System.err.println("Need to supply the filename");
 33             System.exit(1);
 34         }
 35
 36         TokenizerOne myTokenizer = new TokenizerOne(args[0]);
 37
 38
 39         myTokenizer.nextLine();
 40         while (!myTokenizer.isEndOfFile()) {
 41             myTokenizer.printLine();
 42             while (myTokenizer.lineHasMoreTokens()) {
 43                 myTokenizer.nextToken();
 44                 myTokenizer.printToken();
 45                 if (myTokenizer.getTokenCode() == TokenCode.PERIOD) {
 46                     break;
 47                 }
 48             }
 49             if (myTokenizer.getTokenCode() == TokenCode.PERIOD) {
 50                 break;
 51             }
 52             myTokenizer.nextLine();
 53         }
 54     }
 55
 56     public TokenizerOne (String newFileName) {
 57         pageNumber = 0;
 58         lineCount  = 0;
 59         endOfFile  = false;
 60
 61         // Set the date string
 62         SimpleDateFormat dateFormat =
 63             new SimpleDateFormat("yyyy.MMMMM.dd hh:mm aaa");
 64         Date timer = new GregorianCalendar().getTime();
 65         date = dateFormat.format(timer);
 66
 67         // Set the file name
 68         this.fileName = newFileName;
 69
 70         try {
 71             // After this, the file should be ready for action :-) 
 72             sourceReader =
 73                 new LineNumberReader(new FileReader(this.fileName));
 74         } catch (IOException e) {
 75             System.err.println ("Problem opening " + fileName + ".");
 76             e.printStackTrace();
 77             System.exit(1);
 78         }
 79
 80         lineBuffer = new StringBuffer();
 81
 82         currentToken = null;
 83     }
 84
 85     public void nextLine() {
 86         if (    lineBuffer != null &&
 87                 lineBuffer.length() > 0 ) {
 88             lineBuffer.delete (0, lineBuffer.length());
 89         }
 90
 91         this.setEndOfFile(false);
 92         try {
 93             String nextLine = sourceReader.readLine();
 94             this.setEndOfFile( nextLine == null );
 95             if ( ! this.isEndOfFile() ) {
 96                 lineBuffer.append (nextLine);
 97
 98                 // Remove leading whitespace.
 99                 while (lineBuffer.length() > 0 &&
100                        Character.isWhitespace(lineBuffer.charAt(0))) {
101                     lineBuffer.deleteCharAt(0);
102                 }
103             }
104         } catch (IOException e) {
105             System.err.println
106                 ("Problem reading from " + fileName + ".");
107             e.printStackTrace();
108         }
109     }
110
111     public boolean lineHasMoreTokens() {
112         boolean moreTokens = true;
113
114         // Remove any white space before the next token.
115         if (lineBuffer != null) {
116             while (lineBuffer.length() > 0 &&
117                    Character.isWhitespace(lineBuffer.charAt(0))) {
118                 lineBuffer.deleteCharAt(0);
119             }
120         }
121
122         if (    lineBuffer != null &&
123                 lineBuffer.length() > 0 ) {
124             // Check if the first character of lineBuffer 
125             // is the beginning of one of our token types.
126             char nextChar = lineBuffer.charAt(0);
127             moreTokens = Character.isDigit(nextChar)  ||
128                          Character.isLetter(nextChar) ||
129                          (nextChar == END_OF_FILE_FLAG);
130
131         } else {
132              moreTokens = false;
133         }
134
135         return moreTokens;
136     }
137
138     public Token nextToken() {
139         StringBuffer tokenBuffer      = new StringBuffer("");
140         TokenCode    currentTokenCode = TokenCode.ERROR;
141
142         char nextChar = lineBuffer.charAt(0);
143         lineBuffer.deleteCharAt(0);
144
145         if ( Character.isDigit(nextChar) ) {
146             currentTokenCode = TokenCode.NUMBER;
147             tokenBuffer.append(nextChar);
148             while (lineBuffer.length() > 0 &&
149                    Character.isDigit(lineBuffer.charAt(0))) {
150                 tokenBuffer.append(lineBuffer.charAt(0));
151                 lineBuffer.deleteCharAt(0);
152             }
153         } else if ( Character.isLetter(nextChar) ) {
154             currentTokenCode = TokenCode.WORD;
155             tokenBuffer.append(nextChar);
156
157             while ( lineBuffer.length() > 0 &&
158                    ((Character.isLetter(lineBuffer.charAt(0))) ||
159                     (Character.isDigit (lineBuffer.charAt(0)))) ) {
160                 tokenBuffer.append(lineBuffer.charAt(0));
161                 lineBuffer.deleteCharAt(0);
162             }
163         } else if ( nextChar == END_OF_FILE_FLAG ) {
164             currentTokenCode = TokenCode.PERIOD;
165             tokenBuffer.append(nextChar);
166         } else {
167             currentTokenCode = TokenCode.ERROR;
168             tokenBuffer.append(nextChar);
169             while ( lineBuffer.length() > 0 &&
170                    ! ((Character.isLetter(lineBuffer.charAt(0))) ||
171                       (Character.isDigit (lineBuffer.charAt(0)))) ) {
172                 tokenBuffer.append(lineBuffer.charAt(0));
173                 lineBuffer.deleteCharAt(0);
174             }
175         }
176
177         currentToken =
178             new Token (tokenBuffer.toString(), currentTokenCode);
179         return currentToken;
180     }
181
182     public TokenCode getTokenCode() {
183         return currentToken.getTokenCode();
184     }
185
186     // Print line number and page info, if necessary.
187     public void printLine() {
188         if ( lineCount == 0 ) {
189             System.out.println (FORM_FEED_CHAR);
190             System.out.println ("Page\t" + ++pageNumber + "\t" +
191                                  fileName + "\t" + date + "\n\n");
192         }
193
194         System.out.println ("\t" + sourceReader.getLineNumber() +
195                             ":\t" + lineBuffer.toString());
196
197         lineCount = ++lineCount % MAX_LINES_PER_PAGE;
198
199     }
200
201     public void printToken() {
202         if ( lineCount == 0 ) {
203             System.out.println (FORM_FEED_CHAR);
204             System.out.println ("Page\t" + ++pageNumber + "\t" +
205                                  fileName + "\t" + date + "\n\n");
206         }
207         System.out.println ("\t\t>> " + currentToken.toString());
208         lineCount = ++lineCount % MAX_LINES_PER_PAGE;
209     }
210
211     public boolean isEndOfFile() {
212         return endOfFile;
213     }
214
215     public boolean getEndOfFile() {
216         return endOfFile;
217     }
218
219     public void setEndOfFile(boolean endOfFile) {
220         this.endOfFile = endOfFile;
221     }
222
223 }
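The same character-by-character scan can also be written with an index into the line instead of deleting characters from the front of a StringBuffer. The sketch below is one way that might look, not the code from this post: LineScanSketch and scan are hypothetical names, tokens come back as plain "<KIND> text" strings, and unlike TokenizerOne it neither stops at the period nor gives up on the line when it hits an unrecognized character.

```java
import java.util.ArrayList;
import java.util.List;

public class LineScanSketch {
    // Splits one line into "<KIND> text" strings by walking an index
    // across the line; nothing is deleted from the input.
    public static List<String> scan(String line) {
        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                       // skip whitespace between tokens
                continue;
            }
            int start = i;
            String kind;
            if (Character.isDigit(c)) {
                kind = "<NUMBER>";         // a run of digits
                while (i < line.length() && Character.isDigit(line.charAt(i))) i++;
            } else if (Character.isLetter(c)) {
                kind = "<WORD>";           // a letter, then letters or digits
                while (i < line.length() && Character.isLetterOrDigit(line.charAt(i))) i++;
            } else if (c == '.') {
                kind = "<PERIOD>";         // single-character token
                i++;
            } else {
                kind = "<ERROR>";          // a run of unrecognized characters
                while (i < line.length()
                        && !Character.isLetterOrDigit(line.charAt(i))
                        && !Character.isWhitespace(line.charAt(i))
                        && line.charAt(i) != '.') i++;
            }
            tokens.add(kind + " " + line.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : scan("End of file. Test")) {
            System.out.println(token);
        }
    }
}
```

Repeatedly calling deleteCharAt(0) shifts the rest of the buffer each time, so an index avoids that copying; for the short lines in this exercise the difference is unlikely to matter.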



And this is how it works. It looks pretty close to the output from the original C program …

$ java TokenizerOne test2.dat

Page    1       test2.dat       2007.August.12 11:19 PM

        1:      1500
                >> <NUMBER>             1500
        2:      This is a test
                >> <WORD>               This
                >> <WORD>               is
                >> <WORD>               a
                >> <WORD>               test
        3:      NAN 349 New
                >> <WORD>               NAN
                >> <NUMBER>             349
                >> <WORD>               New
        4:      End of file. Test
                >> <WORD>               End
                >> <WORD>               of
                >> <WORD>               file
                >> <PERIOD>             .
$
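The page-header logic that shows up in both printLine() and printToken() is the obvious candidate for pulling out of the tokenizer. Here’s a rough sketch of what a separate pager might look like; PagerSketch and emit are hypothetical names, it returns the text rather than printing it, and it leaves out the form feed, so its output is not meant to match the listing above byte for byte.

```java
public class PagerSketch {
    private final int linesPerPage;   // assumed to be greater than zero
    private int lineCount  = 0;
    private int pageNumber = 0;

    public PagerSketch(int linesPerPage) {
        this.linesPerPage = linesPerPage;
    }

    // Returns the text for one output line, with a page header in front
    // of it whenever a new page starts.
    public String emit(String line, String fileName, String date) {
        StringBuilder out = new StringBuilder();
        if (lineCount == 0) {
            pageNumber++;
            out.append("Page\t").append(pageNumber).append('\t')
               .append(fileName).append('\t').append(date).append("\n\n");
        }
        out.append(line);
        // Wrap back to zero after a full page so the next call gets a header.
        lineCount = (lineCount + 1) % linesPerPage;
        return out.toString();
    }

    public static void main(String[] args) {
        PagerSketch pager = new PagerSketch(50);
        String date = "2007.August.12 11:19 PM";
        System.out.println(pager.emit("\t1:\t1500", "test2.dat", date));
        System.out.println(pager.emit("\t\t>> <NUMBER>\t\t1500", "test2.dat", date));
    }
}
```

With something like this in place, the tokenizer would only need to hand over finished lines, and the page bookkeeping would live in one spot instead of two.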


The next section of the book deals with modularizing the C code so it’s more manageable. Header files, oh boy. I wonder if creating interfaces to deal with that sort of thing is going to be overkill.
