Antlr3.Runtime

A kind of ReaderStream that pulls from an InputStream. Useful for reading from stdin and specifying file encodings, etc.

Vacuum all input from a Reader and then treat it like a StringStream. Manage the buffer manually to avoid unnecessary data copying. If you need encoding, use ANTLRInputStream.

A pretty quick CharStream that pulls all data from an array directly. Every method call counts in the lexer. Java's strings aren't very good so I'm avoiding them.

The data being scanned.

How many characters are actually in the buffer.

0..n-1 index into string of next char.

Line number 1..n within the input.

The index of the character relative to the beginning of the line, 0..n-1.

Tracks how deep mark() calls are nested.

A list of CharStreamState objects that tracks the stream state values line, charPositionInLine, and p that can change as you move through the input stream. Indexed from 1..markDepth. A null is kept at index 0. Created upon first call to mark().

Track the last mark() call result value for use in rewind().

What is the name or source of this char stream?

Copy data in string to a local char array.

This is the preferred constructor as no data is copied.

Return the current input symbol index 0..n where n indicates the last symbol has been read. The index is the index of char to be returned from LA(1).

Reset the stream so that it's in the same state it was when the object was created *except* the data array is not touched.

consume() ahead until p==index; can't just set p=index as we must update line and charPositionInLine.

A generic recognizer that can handle recognizers generated from lexer, parser, and tree grammars. This is all the parsing support code essentially; most of it is error recovery stuff and backtracking.

The state of a lexer, parser, or tree parser is collected into a state object so the state can be shared. This sharing is needed to have one grammar import others and share the same error variables and other state variables.
It's a kind of explicit multiple inheritance via delegation of methods and shared state.

Reset the parser's state; subclasses must rewind the input stream.

Match current input symbol against ttype. Attempt single token insertion or deletion error recovery. If that fails, throw MismatchedTokenException. To turn off single token insertion or deletion error recovery, override recoverFromMismatchedToken() and have it throw an exception. See TreeParser.recoverFromMismatchedToken(). This way any error in a rule will cause an exception and immediate exit from rule. Rule would recover by resynchronizing to the set of symbols that can follow rule ref.

Match the wildcard: in a symbol.

Report a recognition problem. This method sets errorRecovery to indicate the parser is recovering, not parsing. Once in recovery mode, no errors are generated. To get out of recovery mode, the parser must successfully match a token (after a resync). So it will go:

1. error occurs
2. enter recovery mode, report error
3. consume until token found in resynch set
4. try to resume parsing
5. next match() will reset errorRecovery mode

If you override, make sure to update syntaxErrors if you care about that.

What error message should be generated for the various exception types? Not very object-oriented code, but I like having all error message generation within one method rather than spread among all of the exception classes. This also makes it much easier for the exception handling because the exception classes do not have to have pointers back to this object to access utility routines and so on. Also, changing the message for an exception type would be difficult because you would have to subclass the exception, but then somehow get ANTLR to make those kinds of exception objects instead of the default. This looks weird, but trust me--it makes the most sense in terms of flexibility.
For grammar debugging, you will want to override this to add more information such as the stack frame with getRuleInvocationStack(e, this.getClass().getName()) and, for no viable alts, the decision description and state, etc. Override this to change the message generated for one or more exception types.

Get number of recognition errors (lexer, parser, tree parser). Each recognizer tracks its own number, so parser and lexer each have a separate count. Does not count the spurious errors found between an error and the next valid token match.

What is the error header, normally line/character position information?

How should a token be displayed in an error message? The default is to display just the text, but during development you might want to have a lot of information spit out. Override in that case to use t.ToString() (which, for CommonToken, dumps everything about the token). This is better than forcing you to override a method in your token objects because you don't have to go modify your lexer so that it creates a new Java type.

Override this method to change where error messages go.

Recover from an error found on the input stream. This is for NoViableAlt and mismatched symbol exceptions. If you enable single token insertion and deletion, this will usually not handle mismatched symbol exceptions but there could be a mismatched token that the match() routine could not recover from.

A hook to listen in on the token consumption during error recovery. The DebugParser subclasses this to fire events to the listener.

Compute the context-sensitive FOLLOW set for current rule. This is the set of token types that can follow a specific rule reference given a specific call chain. You get the set of viable tokens that can possibly come next (lookahead depth 1) given the current call chain.
Contrast this with the definition of plain FOLLOW for rule r:

    FOLLOW(r)={x | S=>*alpha r beta in G and x in FIRST(beta)}

where x in T* and alpha, beta in V*; T is set of terminals and V is the set of terminals and nonterminals. In other words, FOLLOW(r) is the set of all tokens that can possibly follow references to r in *any* sentential form (context). At runtime, however, we know precisely which context applies as we have the call chain. We may compute the exact (rather than covering superset) set of following tokens.

For example, consider grammar:

    stat : ID '=' expr ';'      // FOLLOW(stat)=={EOF}
         | "return" expr '.'
         ;
    expr : atom ('+' atom)* ;   // FOLLOW(expr)=={';','.',')'}
    atom : INT                  // FOLLOW(atom)=={'+',')',';','.'}
         | '(' expr ')'
         ;

The FOLLOW sets are all inclusive whereas context-sensitive FOLLOW sets are precisely what could follow a rule reference. For input "i=(3);", here is the derivation:

    stat => ID '=' expr ';'
         => ID '=' atom ('+' atom)* ';'
         => ID '=' '(' expr ')' ('+' atom)* ';'
         => ID '=' '(' atom ')' ('+' atom)* ';'
         => ID '=' '(' INT ')' ('+' atom)* ';'
         => ID '=' '(' INT ')' ';'

At the "3" token, you'd have a call chain of

    stat -> expr -> atom -> expr -> atom

What can follow that specific nested ref to atom? Exactly ')' as you can see by looking at the derivation of this specific input. Contrast this with the FOLLOW(atom)={'+',')',';','.'}.

You want the exact viable token set when recovering from a token mismatch. Upon token mismatch, if LA(1) is a member of the viable next token set, then you know there is most likely a missing token in the input stream. "Insert" one by just not throwing an exception.

Attempt to recover from a single missing or extra token.

EXTRA TOKEN

LA(1) is not what we are looking for. If LA(2) has the right token, however, then assume LA(1) is some extra spurious token. Delete it and consume LA(2) as if we were doing a normal match(), which advances the input.
MISSING TOKEN

If current token is consistent with what could come after ttype then it is ok to "insert" the missing token, else throw exception. For example, input "i=(3;" is clearly missing the ')'. When the parser returns from the nested call to expr, it will have call chain:

    stat -> expr -> atom

and it will be trying to match the ')' at this point in the derivation:

    => ID '=' '(' INT ')' ('+' atom)* ';'
                       ^

match() will see that ';' doesn't match ')' and report a mismatched token error. To recover, it sees that LA(1)==';' is in the set of tokens that can follow the ')' token reference in rule atom. It can assume that you forgot the ')'.

Not currently used.

Match needs to return the current input symbol, which gets put into the label for the associated token ref; e.g., x=ID. Token and tree parsers need to return different objects. Rather than test for input stream type or change the IntStream interface, I use a simple method to ask the recognizer to tell me what the current input symbol is. This is ignored for lexers.

Conjure up a missing token during error recovery. The recognizer attempts to recover from single missing symbols. But, actions might refer to that missing symbol. For example, x=ID {f($x);}. The action clearly assumes that there has been an identifier matched previously and that $x points at that token. If that token is missing, but the next token in the stream is what we want, we assume that this token is missing and we keep going. Because we have to return some token to replace the missing token, we have to conjure one up. This method gives the user control over the tokens returned for missing tokens. Mostly, you will want to create something special for identifier tokens. For literals such as '{' and ',', the default action in the parser or tree parser works. It simply creates a CommonToken of the appropriate type. The text will be the token. If you change what tokens must be created by the lexer, override this method to create the appropriate tokens.
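
The EXTRA TOKEN and MISSING TOKEN cases above can be sketched in a few lines. This is a minimal illustration in Python, not the runtime's actual code: the `TinyParser` class, the token type constants, and the idea of passing the context-sensitive follow set directly to `match()` are all assumptions made for the example (the real recognizer computes that set from its rule invocation stack).

```python
from dataclasses import dataclass
from typing import List, Set

# Hypothetical token types for the stat/expr/atom grammar above.
EOF, ID, EQ, LP, INT, RP, SEMI, PLUS = -1, 1, 2, 3, 4, 5, 6, 7


@dataclass
class Token:
    type: int
    text: str


class MismatchedTokenError(Exception):
    """Stand-in for MismatchedTokenException (name is illustrative)."""


class TinyParser:
    def __init__(self, tokens: List[Token]):
        self.tokens = tokens
        self.p = 0          # index of the next token to consume
        self.errors = 0     # number of reported syntax errors

    def lt(self, i: int) -> Token:
        """Return the i-th lookahead token (i >= 1), or an EOF token past the end."""
        idx = self.p + i - 1
        return self.tokens[idx] if idx < len(self.tokens) else Token(EOF, "<EOF>")

    def match(self, ttype: int, follow: Set[int]) -> Token:
        """Match ttype; 'follow' is the assumed context-sensitive follow set."""
        if self.lt(1).type == ttype:
            tok = self.lt(1)
            self.p += 1
            return tok
        return self.recover_from_mismatched_token(ttype, follow)

    def recover_from_mismatched_token(self, ttype: int, follow: Set[int]) -> Token:
        # EXTRA TOKEN: LA(2) is what we want, so LA(1) is spurious.
        if self.lt(2).type == ttype:
            self.errors += 1
            self.p += 1             # delete the spurious token
            tok = self.lt(1)
            self.p += 1             # consume the token we wanted
            return tok
        # MISSING TOKEN: LA(1) could legally follow ttype, so "insert" ttype.
        if self.lt(1).type in follow:
            self.errors += 1
            return Token(ttype, "<missing type %d>" % ttype)  # conjured token
        raise MismatchedTokenError("expected type %d" % ttype)
```

For the input "i=(3;", matching ID, '=', '(' and INT succeeds, and `match(RP, {PLUS, SEMI})` then returns a conjured ')' token without consuming the ';', so the following `match(SEMI, ...)` still works.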
Consume tokens until one matches the given token set.

Push a rule's follow set using our own hardcoded stack.

Return whether or not a backtracking attempt failed.

Used to print out token names like ID during debugging and error reporting. The generated parsers implement a method that overrides this to point to their String[] tokenNames.

For debugging and other purposes, might want the grammar name. Have ANTLR generate an implementation for this method.

A convenience method for use most often with template rewrites. Convert a list of tokens to a list of strings.

Given a rule number and a start token index number, return MEMO_RULE_UNKNOWN if the rule has not parsed input starting from start index. If this rule has parsed input starting from the start index before, then return where the rule stopped parsing. It returns the index of the last token matched by the rule. For now we use a hashtable and just the slow Object-based one. Later, we can make a special one for ints and also one that tosses out data after we commit past input position i.

Has this rule already parsed input at the current index in the input stream? Return the stop token index or MEMO_RULE_UNKNOWN. If we attempted but failed to parse properly before, return MEMO_RULE_FAILED. This method has a side-effect: if we have seen this input for this rule and successfully parsed before, then seek ahead to 1 past the stop token matched for this rule last time.

Record whether or not this rule parsed the input at this position successfully. Use a standard java hashtable for now.

Return how many rule/input-index pairs there are in total. TODO: this includes synpreds. :(

A stripped-down version of org.antlr.misc.BitSet that is just good enough to handle runtime requirements such as FOLLOW sets for automatic error recovery.

We will often need to do a mod operator (i mod nbits). It turns out that, for powers of two, this mod operation is the same as (i & (nbits-1)).
Since mod is slow, we use a precomputed mod mask to do the mod instead.

The actual data bits.

Construct a bitset of size one word (64 bits).

Construction from a static array of longs.

Construction from a list of integers.

Construct a bitset given the size, which is the size of the bitset in bits.

Return this | a in a new set.

Or this element into this set (grow as necessary to accommodate).

Grows the set to a larger number of bits. The argument is the element that must fit in the set.

Sets the size of a set. The argument is how many words the new set should be.

Return how much space is being used by the bits array, not how many actually have member bits on.

Is this contained within a?

Buffer all input tokens but do on-demand fetching of new tokens from lexer. Useful when the parser or lexer has to set context/mode info before proper lexing of future tokens. The ST template parser needs this, for example, because it has to constantly flip back and forth between inside/output templates. E.g., <names:{hi, <it>}> has to parse names as part of an expression but "hi, <it>" as a nested template.

You can't use this stream if you pass whitespace or other off-channel tokens to the parser. The stream can't ignore off-channel tokens. (UnbufferedTokenStream is the same way.)

This is not a subclass of UnbufferedTokenStream because I don't want to confuse the small moving window of tokens it uses for the full buffer.

Record every single token pulled from the source so we can reproduce chunks of it later. The buffer in LookaheadStream overlaps sometimes as its moving window moves through the input. This list captures everything so we can access complete input text.

Track the last mark() call result value for use in rewind().

The index into the tokens list of the current token (next token to consume). tokens[p] should be LT(1). p=-1 indicates need to initialize with first token. The ctor doesn't get a token. First call to LT(1) or whatever gets the first token and sets p=0.

How deep have we gone?
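
The mod-mask trick described for the runtime BitSet earlier in this section can be illustrated with a small sketch. It is written in Python for brevity and is only an approximation of the idea: the class name `TinyBitSet` and its method names are hypothetical, and Python integers stand in for 64-bit longs.

```python
from typing import List

BITS = 64            # bits per word
LOG_BITS = 6         # 2**6 == 64
MOD_MASK = BITS - 1  # i % 64 == i & 63 because 64 is a power of two


class TinyBitSet:
    def __init__(self, nbits: int = BITS):
        # Enough words to hold nbits bits.
        self.words: List[int] = [0] * (((nbits - 1) >> LOG_BITS) + 1)

    @staticmethod
    def word_number(el: int) -> int:
        return el >> LOG_BITS            # el // 64

    @staticmethod
    def bit_mask(el: int) -> int:
        return 1 << (el & MOD_MASK)      # 1 << (el % 64), without a mod

    def add(self, el: int) -> None:
        """Or this element into the set, growing as necessary."""
        n = self.word_number(el)
        if n >= len(self.words):
            self.words.extend([0] * (n + 1 - len(self.words)))
        self.words[n] |= self.bit_mask(el)

    def member(self, el: int) -> bool:
        if el < 0:
            return False
        n = self.word_number(el)
        return n < len(self.words) and (self.words[n] & self.bit_mask(el)) != 0

    def union(self, other: "TinyBitSet") -> "TinyBitSet":
        """Return this | other in a new set."""
        result = TinyBitSet(BITS * max(len(self.words), len(other.words)))
        for i, w in enumerate(self.words):
            result.words[i] |= w
        for i, w in enumerate(other.words):
            result.words[i] |= w
        return result
```

Adding element 70 to a one-word set grows it to two words and sets bit 70 & 63 = 6 of the second word.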
Move the input pointer to the next incoming token. The stream must become active with LT(1) available. consume() simply moves the input pointer so that LT(1) points at the next input symbol. Consume at least one token. Walk past any token not on the channel the parser is listening to.

Make sure index i in tokens has a token.

Add n elements to buffer.

Given a start and stop index, return a List of all tokens in the token type BitSet. Return null if no tokens were found. This method looks at both on and off channel tokens.

When walking ahead with cyclic DFA or for syntactic predicates, we need to record the state of the input stream (char index, line, etc.) so that we can rewind the state after scanning ahead. This is the complete state of a stream.

Index into the char stream of next lookahead char.

What line number is the scanner at before processing buffer[p]?

What char position 0..n-1 in line is scanner before processing buffer[p]?

A Token object like we'd use in ANTLR 2.x; has an actual string created and associated with this object. These objects are needed for imaginary tree nodes that have payload objects. We need to create a Token object that has a string; the tree node will point at this token. CommonToken has indexes into a char stream and hence cannot be used to introduce new strings.

What token number is this from 0..n-1 tokens.

We need to be able to change the text once in a while. If this is non-null, then getText should return this. Note that start/stop are not affected by changing this.

What token number is this from 0..n-1 tokens; < 0 implies invalid index.

The char position into the input buffer where this token starts.

The char position into the input buffer where this token stops.

The most common stream of tokens is one where every token is buffered up and tokens are prefiltered for a certain channel (the parser will only see these tokens and cannot change the filter channel number during the parse). TODO: how to access the full token stream?
How to track all tokens matched per rule?

Skip tokens on any channel but this one; this is how we skip whitespace...

Reset this token stream by setting its token source.

Always leave p on an on-channel token.

Given a starting index, return the index of the first on-channel token.

All debugging events that a recognizer can trigger. I did not create a separate AST debugging interface as it would create lots of extra classes and DebugParser has a dbg var defined, which makes it hard to change to ASTDebugEventListener. I looked hard at this issue and it is easier to understand as one monolithic event interface for all possible events. Hopefully, adding ST debugging stuff won't be bad. Leave for future. 4/26/2006.

The parser has just entered a rule. No decision has been made about which alt is predicted. This is fired AFTER init actions have been executed. Attributes are defined and available, etc. The grammarFileName allows composite grammars to jump around among multiple grammar files.

Because rules can have lots of alternatives, it is very useful to know which alt you are entering. This is 1..n for n alts.

This is the last thing executed before leaving a rule. It is executed even if an exception is thrown. This is triggered after error reporting and recovery have occurred (unless the exception is not caught in this rule). This implies an "exitAlt" event. The grammarFileName allows composite grammars to jump around among multiple grammar files.

Track entry into any (...) subrule or other EBNF construct.

Every decision, fixed k or arbitrary, has an enter/exit event so that a GUI can easily track what LT/consume events are associated with prediction. You will see a single enter/exit subrule but multiple enter/exit decision events, one for each loop iteration.

An input token was consumed; matched by any kind of element. Trigger after the token was matched by things like match(), matchAny().

An off-channel input token was consumed.
Trigger after the token was matched by things like match(), matchAny() (unless of course the hidden token is the first stuff in the input stream).

Somebody (anybody) looked ahead. Note that this actually gets triggered by both LA and LT calls. The debugger will want to know which Token object was examined. Like consumeToken, this indicates what token was seen at that depth. A remote debugger cannot look ahead into a file it doesn't have so LT events must pass the token even if the info is redundant.

The parser is going to look arbitrarily ahead; mark this location. The token stream's marker is sent in case you need it.

After an arbitrarily long lookahead as with a cyclic DFA (or with any backtrack), this informs the debugger that the stream should be rewound to the position associated with marker.

Rewind to the input position of the last marker. Used currently only after a cyclic DFA and just before starting a sem/syn predicate to get the input position back to the start of the decision. Do not "pop" the marker off the state. mark(i) and rewind(i) should balance still.

To watch a parser move through the grammar, the parser needs to inform the debugger what line/charPos it is passing in the grammar. For now, this does not know how to switch from one grammar to the other and back for island grammars, etc. This should also allow breakpoints because the debugger can stop the parser whenever it hits this line/pos.

A recognition exception occurred such as NoViableAltException. I made this a generic event so that I can alter the exception hierarchy later without having to alter all the debug objects. Upon error, the stack of enter rule/subrule must be properly unwound. If no viable alt occurs it is within an enter/exit decision, which also must be rewound. Even the rewind for each mark must be unwound. In the Java target this is pretty easy using try/finally, if a bit ugly in the generated code.
The rewind is generated in DFA.predict() actually so no code needs to be generated for that. For languages w/o this "finally" feature (C++?), the target implementor will have to build an event stack or something. Across a socket for remote debugging, only the RecognitionException data fields are transmitted. The token object or whatever that caused the problem was the last object referenced by LT. The immediately preceding LT event should hold the unexpected Token or char. Here is a sample event trace for grammar:

    b : C ({;}A|B) // {;} is there to prevent A|B becoming a set
      | D
      ;

The sequence for this rule (with no viable alt in the subrule) for input 'c c' (there are 3 tokens) is:

    commence
    LT(1)
    enterRule b
    location 7 1
    enter decision 3
    LT(1)
    exit decision 3
    enterAlt1
    location 7 5
    LT(1)
    consumeToken [c/<4>,1:0]
    location 7 7
    enterSubRule 2
    enter decision 2
    LT(1)
    LT(1)
    recognitionException NoViableAltException 2 1 2
    exit decision 2
    exitSubRule 2
    beginResync
    LT(1)
    consumeToken [c/<4>,1:1]
    LT(1)
    endResync
    LT(-1)
    exitRule b
    terminate

Indicates the recognizer is about to consume tokens to resynchronize the parser. Any consume events from here until the recovered event are not part of the parse--they are dead tokens.

Indicates that the recognizer has finished consuming tokens in order to resynchronize. There may be multiple beginResync/endResync pairs before the recognizer comes out of errorRecovery mode (in which multiple errors are suppressed). This will be useful in a GUI where you probably want to grey out tokens that are consumed but not matched to anything in the grammar. Anything between a beginResync/endResync pair was tossed out by the parser.

A semantic predicate was evaluated with this result and action text.

Announce that parsing has begun. Not technically useful except for sending events over a socket. A GUI for example will launch a thread to connect and communicate with a remote parser. The thread will want to notify the GUI when a connection is made.
ANTLR parsers trigger this upon entry to the first rule (the ruleLevel is used to figure this out).

Parsing is over; successfully or not. Mostly useful for telling remote debugging listeners that it's time to quit. When the rule invocation level goes to zero at the end of a rule, we are done parsing.

Input for a tree parser is an AST, but we know nothing for sure about a node except its type and text (obtained from the adaptor). This is the analog of the consumeToken method. Again, the ID is the hashCode usually of the node so it only works if hashCode is not implemented. If the type is UP or DOWN, then the ID is not really meaningful as it's fixed--there is just one UP node and one DOWN navigation node.

The tree parser looked ahead. If the type is UP or DOWN, then the ID is not really meaningful as it's fixed--there is just one UP node and one DOWN navigation node.

A nil was created (even nil nodes have a unique ID... they are not "null" per se). As of 4/28/2006, this seems to be uniquely triggered when starting a new subtree such as when entering a subrule in automatic mode and when building a tree in rewrite mode. If you are receiving this event over a socket via RemoteDebugEventSocketListener then only t.ID is set.

Upon syntax error, recognizers bracket the error with an error node if they are building ASTs.

Announce a new node built from token elements such as type, etc. If you are receiving this event over a socket via RemoteDebugEventSocketListener then only t.ID, type, text are set.

Announce a new node built from an existing token. If you are receiving this event over a socket via RemoteDebugEventSocketListener then only node.ID and token.tokenIndex are set.

Make a node the new root of an existing root. Note: the newRootID parameter is possibly different than the TreeAdaptor.becomeRoot() newRoot parameter. In our case, it will always be the result of calling TreeAdaptor.becomeRoot() and not root_n or whatever.
The listener should assume that this event occurs only when the current subrule (or rule) subtree is being reset to newRootID. If you are receiving this event over a socket via RemoteDebugEventSocketListener then only IDs are set.

Make childID a child of rootID. If you are receiving this event over a socket via RemoteDebugEventSocketListener then only IDs are set.

Set the token start/stop token index for a subtree root or node. If you are receiving this event over a socket via RemoteDebugEventSocketListener then only t.ID is set.

A DFA implemented as a set of transition tables. Any state that has a semantic predicate edge is special; those states are generated with if-then-else structures in a specialStateTransition() which is generated by the cyclicDFA template. There are at most 32767 states (16-bit signed short). Could get away with byte sometimes but would have to generate different types and the simulation code too. For a point of reference, the Java lexer's Tokens rule DFA has 326 states roughly.

Which recognizer encloses this DFA? Needed to check backtracking.

From the input stream, predict what alternative will succeed using this DFA (representing the covering regular approximation to the underlying CFL). Return an alternative number 1..n. Throw an exception upon error.

A hook for debugging interface.

Given a String that has a run-length-encoding of some unsigned shorts like "\1\2\3\9", convert to short[] {2,9,9,9}. We do this to avoid static short[] which generates so much init code that the class won't compile. :(

Hideous duplication of code, but I need different typed arrays out :(

The recognizer did not match anything for a (..)+ loop.

A semantic predicate failed during validation. Validation of predicates occurs when normally parsing the alternative just like matching a token. Disambiguating predicate evaluation occurs when we hoist a predicate into a prediction decision.
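
The run-length encoding used for the DFA tables above is simple enough to show directly. The following Python sketch assumes the string is a sequence of (count, value) character pairs, which is what the "\1\2\3\9" example implies; the function name is hypothetical and this is not the runtime's own unpacking routine.

```python
from typing import List


def unpack_encoded_string(encoded: str) -> List[int]:
    """Expand a run-length-encoded string of (count, value) character pairs."""
    if len(encoded) % 2 != 0:
        raise ValueError("encoded string must hold (count, value) pairs")
    out: List[int] = []
    for i in range(0, len(encoded), 2):
        count = ord(encoded[i])        # how many times to repeat
        value = ord(encoded[i + 1])    # the value being repeated
        out.extend([value] * count)
    return out
```

With the example from the text, written with hex escapes because Python has no `\9` escape, `unpack_encoded_string("\x01\x02\x03\x09")` yields `[2, 9, 9, 9]`: one 2 followed by three 9s.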
AST rules have trees.

Has a value potentially if output=AST;

AST rules have trees.

Has a value potentially if output=AST;

A source of characters for an ANTLR lexer.

For infinite streams, you don't need this; primarily I'm providing a useful interface for action code. Just make sure actions don't use this on streams that don't support it.

Get the ith character of lookahead. This is the same usually as LA(i). This will be used for labels in the generated lexer code. I'd prefer to return a char here type-wise, but it's probably better to be 32-bit clean and be consistent with LA.

ANTLR tracks the line information automatically. Because this stream can rewind, we need to be able to reset the line.

The index of the character relative to the beginning of the line, 0..n-1.

A simple stream of integers used when all I care about is the char or token type sequence (such as interpretation).

Get int at current input pointer + i ahead where i=1 is next int. Negative indexes are allowed. LA(-1) is previous token (token just matched). LA(-i) where i is before first token should yield -1, invalid char / EOF.

Tell the stream to start buffering if it hasn't already. Return current input position, Index, or some other marker so that when passed to rewind() you get back to the same spot. rewind(mark()) should not affect the input cursor. The Lexer tracks line/col info as well as input index so its markers are not pure input indexes. Same for tree node streams.

Return the current input symbol index 0..n where n indicates the last symbol has been read. The index is the symbol about to be read, not the most recently read symbol.

Reset the stream so that next call to index would return marker. The marker will usually be Index but it doesn't have to be. It's just a marker to indicate what state the stream was in. This is essentially calling release() and seek(). If there are markers created after this marker argument, this routine must unroll them like a stack.
Assume the state the stream was in when this marker was created.

Rewind to the input position of the last marker. Used currently only after a cyclic DFA and just before starting a sem/syn predicate to get the input position back to the start of the decision. Do not "pop" the marker off the state. mark(i) and rewind(i) should balance still. It is like invoking rewind(last marker) but it should not "pop" the marker off. It's like seek(last marker's input position).

You may want to commit to a backtrack but don't want to force the stream to keep bookkeeping objects around for a marker that is no longer necessary. This will have the same behavior as rewind() except it releases resources without the backward seek. This must throw away resources for all markers back to the marker argument. So if you're nested 5 levels of mark(), and then release(2) you have to release resources for depths 2..5.

Set the input cursor to the position indicated by index. This is normally used to seek ahead in the input stream. No buffering is required to do this unless you know your stream will use seek to move backwards such as when backtracking. This is different from rewind in its multi-directional requirement and in that its argument is strictly an input cursor (index). For char streams, seeking forward must update the stream state such as line number. For seeking backwards, you will be presumably backtracking using the mark/rewind mechanism that restores state and so this method does not need to update state when seeking backwards. Currently, this method is only used for efficient backtracking using memoization, but in the future it may be used for incremental parsing. The index is 0..n-1. A seek to position i means that LA(1) will return the ith symbol. So, seeking to 0 means LA(1) will return the first element in the stream.

Only makes sense for streams that buffer everything up probably, but might be useful to display the entire stream or for testing.
This value includes a single EOF.

Where are you getting symbols from? Normally, implementations will pass the buck all the way to the lexer who can ask its input stream for the file name or whatever.

Rules can have start/stop info.

Gets the start element from the input stream.

Gets the stop element from the input stream.

Rules can have start/stop info. The type parameter is the element type of the input stream.

Gets the start element from the input stream.

Gets the stop element from the input stream.

Get the text of the token.

The line number on which this token was matched; line=1..n.

The index of the first character relative to the beginning of the line, 0..n-1.

An index from 0..n-1 of the token object in the input stream. This must be valid in order to use the ANTLRWorks debugger.

From what character stream was this token created? You don't have to implement but it's nice to know where a Token comes from if you have include files, etc. on the input.

A source of tokens must provide a sequence of tokens via nextToken() and also must reveal its source of characters; CommonToken's text is computed from a CharStream; it only stores indices into the char stream.

Errors from the lexer are never passed to the parser. Either you want to keep going or you do not upon token recognition error. If you do not want to continue lexing then you do not want to continue parsing. Just throw an exception not under RecognitionException and Java will naturally toss you all the way out of the recognizers. If you want to continue lexing then you should not throw an exception to the parser--it has already requested a token. Keep lexing until you get a valid one. Just report errors and keep going, looking for a valid token.

Return a Token object from your input stream (usually a CharStream). Do not fail/return upon lexing error; keep chewing on the characters until you get a good one; errors are not passed through to the parser.

Where are you getting tokens from?
Normally the implementation will simply ask the lexer's input stream.

A stream of tokens accessing tokens from a TokenSource.

Get Token at current input pointer + i ahead where i=1 is next Token. i<0 indicates tokens in the past. So -1 is previous token and -2 is two tokens ago. LT(0) is undefined. For i>=n, return Token.EOFToken. Return null for LT(0) and any index that results in an absolute address that is negative.

How far ahead has the stream been asked to look? The return value is a valid index from 0..n-1.

Get a token at an absolute index i; 0..n-1. This is really only needed for profiling and debugging and token stream rewriting. If you don't want to buffer up tokens, then this method makes no sense for you. Naturally you can't use the rewrite stream feature. I believe DebugTokenStream can easily be altered to not use this method, removing the dependency.

Where is this stream pulling tokens from? This is not the name, but the object that provides Token objects.

Return the text of all tokens from start to stop, inclusive. If the stream does not buffer all the tokens then it can just return "" or null. Users should not access $ruleLabel.text in an action of course in that case.

Because the user is not required to use a token with an index stored in it, we must provide a means for two token objects themselves to indicate the start/end location. Most often this will just delegate to the other toString(int,int). This is also parallel with the TreeNodeStream.toString(Object,Object).

The most common stream of tokens is one where every token is buffered up and tokens are prefiltered for a certain channel (the parser will only see these tokens and cannot change the filter channel number during the parse). TODO: how to access the full token stream? How to track all tokens matched per rule?

Record every single token pulled from the source so we can reproduce chunks of it later.
Map from token type to channel to override some Tokens' channel numbers. Set of token types; discard any tokens with this type. Skip tokens on any channel but this one; this is how we skip whitespace... By default, track all incoming tokens. Track the last mark() call result value for use in rewind(). The index into the tokens list of the current token (next token to consume). p==-1 indicates that the tokens list is empty. How deep have we gone? Reset this token stream by setting its token source. Load all tokens from the token source and put in tokens. This is done upon first LT request because you might want to set some token type / channel overrides before filling buffer. Move the input pointer to the next incoming token. The stream must become active with LT(1) available. consume() simply moves the input pointer so that LT(1) points at the next input symbol. Consume at least one token. Walk past any token not on the channel the parser is listening to. Given a starting index, return the index of the first on-channel token. A simple filter mechanism whereby you can tell this token stream to force all tokens of type ttype to be on channel. For example, when interpreting, we cannot exec actions so we need to tell the stream to force all WS and NEWLINE to be a different, ignored channel. Given a start and stop index, return a List of all tokens in the token type BitSet. Return null if no tokens were found. This method looks at both on and off channel tokens. Get the kth token from the current position 1..n where k=1 is the first symbol of lookahead. Look backwards k on-channel tokens. Return absolute token i; ignore which channel the tokens are on; that is, count all tokens not just on-channel tokens. A lexer is a recognizer that draws input symbols from a character stream. Lexer grammars result in a subclass of this object. A Lexer object uses simplified match() and error recovery mechanisms in the interest of speed. Where is the lexer drawing characters from?
Gets or sets the text matched so far for the current token or any text override. Setting this value replaces any previously set value, and overrides the original text. Return a token from this source; i.e., match a token on the char stream. Returns the EOF token (default); if you need to return a custom token instead, override this method. Instruct the lexer to skip creating a token for current lexer rule and look for another token. nextToken() knows to keep looking when a lexer rule finishes with token set to SKIP_TOKEN. Recall that if token==null at end of any token rule, it creates one for you and emits it. This is the lexer entry point that sets instance var 'token'. Currently does not support multiple emits per nextToken invocation for efficiency reasons. Subclass and override this method and nextToken (to push tokens into a list and pull from that list rather than a single variable as this implementation does). The standard method called to automatically emit a token at the outermost lexical rule. The token object should point into the char buffer start..stop. If there is a text override in 'text', use that to set the token's text. Override this method to emit custom Token objects. If you are building trees, then you should also override Parser or TreeParser.getMissingSymbol(). What is the index of the current character of lookahead? Lexers can normally match any char in its vocabulary after matching a token, so do the easy thing and just kill a character and hope it all works out. You can instead use the rule invocation stack to do sophisticated error recovery if you are in a fragment rule. A queue that can dequeue and get(i) in O(1) and grow arbitrarily large. A linked list is fast at dequeue but slow at get(i). An array is the reverse. This is O(1) for both operations. List grows until you dequeue last element at end of buffer. Then it resets to start filling at 0 again. If adds/removes are balanced, the buffer will not grow too large.
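The queue behaviour described above (O(1) dequeue, O(1) relative get, reset once the last buffered element is removed) can be sketched as below. This is a minimal illustration, not the runtime's own class: the name SketchQueue and its method names are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch of a resettable array-backed queue. */
final class SketchQueue<T> {
    private final List<T> data = new ArrayList<>(); // dynamically sized buffer
    private int p = 0;                              // index of the current head element

    /** Append at the end of the buffer. */
    void add(T o) {
        data.add(o);
    }

    /** Number of elements not yet dequeued. */
    int size() {
        return data.size() - p;
    }

    /** Element i positions ahead of the head; i == 0 is the head itself. */
    T elementAt(int i) {
        int absolute = p + i; // p marks where the live part of the buffer starts
        if (i < 0 || absolute >= data.size()) {
            throw new NoSuchElementException("no element at offset " + i);
        }
        return data.get(absolute);
    }

    /** Dequeue the head in O(1); no elements are shifted. */
    T remove() {
        T o = elementAt(0);
        p++;
        if (p == data.size()) { // dequeued the last buffered element
            data.clear();       // so start filling at 0 again
            p = 0;
        }
        return o;
    }

    public static void main(String[] args) {
        SketchQueue<String> q = new SketchQueue<>();
        q.add("a");
        q.add("b");
        System.out.println(q.elementAt(1)); // look ahead without removing
        System.out.println(q.remove());
        System.out.println(q.remove());     // buffer resets after this call
        System.out.println(q.size());
    }
}
```

The design trade-off is that dequeued slots are only reclaimed at reset time, which is why balanced adds and removes keep the buffer small.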
No iterator stuff as that's not how we'll use it. dynamically-sized buffer of elements index of next element to fill How deep have we gone? Return element {@code i} elements ahead of current element. {@code i==0} gets current element. This is not an absolute index into {@link #data} since {@code p} defines the start of the real list. Get and remove first element in queue Return string of current buffer contents; non-destructive A lookahead queue that knows how to mark/release locations in the buffer for backtracking purposes. Any markers force the {@link FastQueue} superclass to keep all elements until no more markers; then can reset to avoid growing a huge buffer. Absolute token index. It's the index of the symbol about to be read via {@code LT(1)}. Goes from 0 to numtokens. This is the {@code LT(-1)} element for the first element in {@link #data}. Track object returned by nextElement upon end of stream; Return it later when they ask for LT passed end of input. Track the last mark() call result value for use in rewind(). tracks how deep mark() calls are nested Implement nextElement to supply a stream of elements to this lookahead buffer. Return EOF upon end of the stream we're pulling from. Get and remove first element in queue; override {@link FastQueue#remove()}; it's the same, just checks for backtracking. Make sure we have at least one element to remove, even if EOF Make sure we have 'need' elements from current position p. Last valid p index is data.size()-1. p+need-1 is the data index 'need' elements ahead. If we need 1 element, (p+1-1)==p must be < data.size(). add n elements to buffer Size of entire stream is unknown; we only know buffer size from FastQueue Seek to a 0-indexed absolute token index. Normally used to seek backwards in the buffer. Does not force loading of nodes. To preserve backward compatibility, this method allows seeking past the end of the currently buffered data. 
In this case, the input pointer will be moved but the data will only actually be loaded upon the next call to {@link #consume} or {@link #LT} for {@code k>0}. A mismatched char or Token or tree node We were expecting a token but it's not found. The current token is actually what we wanted next. Used for tree node errors too. A parser for TokenStreams. "parser grammars" result in a subclass of this. Gets or sets the token stream; resets the parser upon a set. Rules that return more than a single value must return an object containing all the values. Besides the properties defined in RuleLabelScope.predefinedRulePropertiesScope there may be user-defined return values. This class simply defines the minimum properties that are always defined and methods to access the others that might be available depending on output option such as template and tree. Note text is not an actual property of the return value, it is computed from start and stop using the input stream's toString() method. I could add a ctor to this so that we can pass in and store the input stream, but I'm not sure we want to do that. It would seem to be undefined to get the .text property anyway if the rule matches tokens from multiple input streams. I do not use getters for fields of objects that are used simply to group values such as this aggregate. The getters/setters are there to satisfy the superclass interface. The root of the ANTLR exception hierarchy. To avoid English-only error messages and to generally make things as flexible as possible, these exceptions are not created with strings, but rather the information necessary to generate an error. Then the various reporting methods in Parser and Lexer can be overridden to generate a localized error message. For example, MismatchedToken exceptions are built with the expected token type. So, don't expect getMessage() to return anything. 
Note that as of Java 1.4, you can access the stack trace, which means that you can compute the complete trace of rules from the start symbol. This gives you considerable context information with which to generate useful error messages. ANTLR generates code that throws exceptions upon recognition error and also generates code to catch these exceptions in each rule. If you want to quit upon first error, you can turn off the automatic error handling mechanism using rulecatch action, but you still need to override methods mismatch and recoverFromMismatchSet. In general, the recognition exceptions can track where in a grammar a problem occurred and/or what was the expected input. While the parser knows its state (such as current input symbol and line info) that state can change before the exception is reported so current token index is computed and stored at exception time. From this info, you can perhaps print an entire line of input not just a single token, for example. Better to just say the recognizer had a problem and then let the parser figure out a fancy report. What input stream did the error occur in? What was the lookahead index when this exception was thrown? What is the index of the token/char we were looking at when the error occurred? The current Token when an error occurred. Since not all streams can retrieve the ith Token, we have to track the Token object. For parsers. Even when it's a tree parser, token might be set. If this is a tree parser exception, node is set to the node with the problem. The current char when an error occurred. For lexers. Track the line (1-based) at which the error occurred in case this is generated from a lexer. We need to track this since the unexpected char doesn't carry the line info. The 0-based index into the line where the error occurred. If you are parsing a tree node stream, you will encounter some imaginary nodes w/o line/col info.
We now search backwards looking for most recent token with line/col info, but notify getErrorHeader() that info is approximate. Used for remote debugger deserialization. Return the token type or char of the unexpected input element. The set of fields needed by an abstract recognizer to recognize input and recover from errors etc... As a separate state object, it can be shared among multiple grammars; e.g., when one grammar imports another. These fields are publicly visible but the actual state pointer per parser is protected. Track the set of token types that can follow any rule invocation. Stack grows upwards. When it hits the max, it grows 2x in size and keeps going. This is true when we see an error and before having successfully matched a token. Prevents generation of more than one error message per error. The index into the input stream where the last error occurred. This is used to prevent infinite loops where an error is found but no token is consumed during recovery...another error is found, ad nauseam. This is a failsafe mechanism to guarantee that at least one token/tree node is consumed for two errors. In lieu of a return value, this indicates that a rule or token has failed to match. Reset to false upon valid token match. Did the recognizer encounter a syntax error? Track how many. If 0, no backtracking is going on. Safe to exec actions etc... If >0 then it's the level of backtracking. An array[size num rules] of dictionaries that tracks the stop token index for each rule. ruleMemo[ruleIndex] is the memoization table for ruleIndex. For key ruleStartIndex, you get back the stop token for associated rule or MEMO_RULE_FAILED. This is only used if rule memoization is on (which it is by default). The goal of all lexer rules/methods is to create a token object. This is an instance variable as multiple rules may collaborate to create a single token. nextToken will return this object after matching lexer rule(s).
If you subclass to allow multiple token emissions, then set this to the last token to be matched or something nonnull so that the auto token emit mechanism will not emit another token. What character index in the stream did the current token start at? Needed, for example, to get the text for current token. Set at the start of nextToken. The line on which the first character of the token resides The character position of first character within the line The channel number for the current token The token type for the current token You can set the text for the current token to override what is in the input char buffer. Use setText() or can set this instance var. All tokens go to the parser (unless skip() is called in that rule) on a particular "channel". The parser tunes to a particular channel so that whitespace etc... can go to the parser on a "hidden" channel. Anything on different channel than DEFAULT_CHANNEL is not parsed by parser. Useful for dumping out the input stream after doing some augmentation or other manipulations. You can insert stuff, replace, and delete chunks. Note that the operations are done lazily--only if you convert the buffer to a String. This is very efficient because you are not moving data around all the time. As the buffer of tokens is converted to strings, the toString() method(s) check to see if there is an operation at the current index. If so, the operation is done and then normal String rendering continues on the buffer. This is like having multiple Turing machine instruction streams (programs) operating on a single input tape. :) Since the operations are done lazily at toString-time, operations do not screw up the token index values. That is, an insert operation at token index i does not change the index values for tokens i+1..n-1. Because operations never actually alter the buffer, you may always get the original token stream back without undoing anything. 
Since the instructions are queued up, you can easily simulate transactions and roll back any changes if there is an error just by removing instructions. For example,

    CharStream input = new ANTLRFileStream("input");
    TLexer lex = new TLexer(input);
    TokenRewriteStream tokens = new TokenRewriteStream(lex);
    T parser = new T(tokens);
    parser.startRule();

Then in the rules, you can execute

    Token t,u;
    ...
    input.insertAfter(t, "text to put after t");
    input.insertAfter(u, "text after u");
    System.out.println(tokens.toString());

Actually, you have to cast the 'input' to a TokenRewriteStream. :( You can also have multiple "instruction streams" and get multiple rewrites from a single pass over the input. Just name the instruction streams and use that name again when printing the buffer. This could be useful for generating a C file and also its header file--all from the same buffer:

    tokens.insertAfter("pass1", t, "text to put after t");
    tokens.insertAfter("pass2", u, "text after u");
    System.out.println(tokens.toString("pass1"));
    System.out.println(tokens.toString("pass2"));

If you don't use named rewrite streams, a "default" stream is used as the first example shows. What index into rewrites List are we? Token buffer index. Execute the rewrite operation by possibly adding to the buffer. Return the index of the next token to operate on. I'm going to try replacing range from x..y with (y-x)+1 ReplaceOp instructions. You may have multiple, named streams of rewrite operations. I'm calling these things "programs." Maps String (name) -> rewrite (List). Map String (program name) -> Integer index. Rollback the instruction stream for a program so that the indicated instruction (via instructionIndex) is no longer in the stream. UNTESTED! Reset the program so that no instructions exist. We need to combine operations and report invalid operations (like overlapping replaces that are not completely nested). Inserts to same index need to be combined etc...
Here are the cases:

    I.i.u I.j.v                             leave alone, nonoverlapping
    I.i.u I.i.v                             combine: Iivu
    R.i-j.u R.x-y.v | i-j in x-y            delete first R
    R.i-j.u R.i-j.v                         delete first R
    R.i-j.u R.x-y.v | x-y in i-j            ERROR
    R.i-j.u R.x-y.v | boundaries overlap    ERROR

Delete special case of replace (text==null):

    D.i-j.u D.x-y.v | boundaries overlap    combine to max(min)..max(right)

    I.i.u R.x-y.v | i in (x+1)-y            delete I (since insert before we're not deleting i)
    I.i.u R.x-y.v | i not in (x+1)-y        leave alone, nonoverlapping
    R.x-y.v I.i.u | i in x-y                ERROR
    R.x-y.v I.x.u                           R.x-y.uv (combine, delete I)
    R.x-y.v I.i.u | i not in x-y            leave alone, nonoverlapping

    I.i.u = insert u before op @ index i
    R.x-y.u = replace x-y indexed tokens with u

First we need to examine replaces. For any replace op:

    1. wipe out any insertions before op within that range.
    2. Drop any replace op before that is contained completely within that range.
    3. Throw exception upon boundary overlap with any previous replace.

Then we can deal with inserts:

    1. for any inserts to same index, combine even if not adjacent.
    2. for any prior replace with same left boundary, combine this insert with replace and delete this replace.
    3. throw exception if index in same range as previous replace

Don't actually delete; make op null in list. Easier to walk list. Later we can throw as we add to index -> op map. Note that I.2 R.2-2 will wipe out I.2 even though, technically, the inserted stuff would be before the replace range. But, if you add tokens in front of a method body '{' and then delete the method body, I think the stuff before the '{' you added should disappear too. Return a map from token index to operation. Get all operations before an index of a particular kind. In an action, a lexer rule can set token to this SKIP_TOKEN and ANTLR will avoid creating a token for this symbol and try to fetch another.
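The lazy rewriting idea described for the token rewrite stream (queue up instructions, apply them only when rendering, never touch the token buffer) can be sketched as follows. This is a simplified illustration under stated assumptions: LazyRewriteSketch and its methods are hypothetical names, tokens are plain strings, there is a single unnamed program, and the overlap checks from the case list above are not implemented, so callers are assumed to queue non-overlapping replaces.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: rewrite instructions are queued and applied only in toString(). */
final class LazyRewriteSketch {
    /** A queued replace of tokens start..stop; text == null means delete. */
    private static final class Replace {
        final int stop;
        final String text;
        Replace(int stop, String text) { this.stop = stop; this.text = text; }
    }

    private final List<String> tokens;                              // never modified
    private final Map<Integer, String> inserts = new HashMap<>();   // index -> text before it
    private final Map<Integer, Replace> replaces = new HashMap<>(); // start index -> op

    LazyRewriteSketch(List<String> tokens) { this.tokens = tokens; }

    void insertBefore(int index, String text) {
        // A later insert at the same index goes in front (I.i.u I.i.v combines to Iivu).
        inserts.merge(index, text, (earlier, later) -> later + earlier);
    }

    void insertAfter(int index, String text) { insertBefore(index + 1, text); }

    void replace(int start, int stop, String text) { replaces.put(start, new Replace(stop, text)); }

    void delete(int start, int stop) { replace(start, stop, null); }

    /** The untouched token text; no undo is needed to get it back. */
    String original() { return String.join("", tokens); }

    /** Render the buffer, executing any instruction found at the current index. */
    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < tokens.size()) {
            String ins = inserts.get(i);
            if (ins != null) out.append(ins);
            Replace r = replaces.get(i);
            if (r != null) {
                if (r.text != null) out.append(r.text);
                i = r.stop + 1; // skip the replaced range, including inserts inside it
            } else {
                out.append(tokens.get(i));
                i++;
            }
        }
        String tail = inserts.get(tokens.size()); // insertAfter on the last token
        if (tail != null) out.append(tail);
        return out.toString();
    }

    public static void main(String[] args) {
        LazyRewriteSketch rw = new LazyRewriteSketch(Arrays.asList("int", " ", "x", ";"));
        rw.replace(2, 2, "y");
        rw.insertBefore(0, "static ");
        System.out.println(rw);            // rewritten view
        System.out.println(rw.original()); // original text is still available
    }
}
```

Because the instructions live in separate maps, rolling back a change in this sketch would amount to removing an entry, which mirrors the transaction-style use mentioned in the description.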
imaginary tree navigation type; traverse "get child" link imaginary tree navigation type; finish with a child list A generic tree implementation with no payload. You must subclass to actually have any user data. ANTLR v3 uses a list of children approach instead of the child-sibling approach in v2. A flat tree (a list) is an empty node whose children represent the list. An empty, but non-null node is called "nil". Create a new node from an existing node does nothing for BaseTree as there are no fields other than the children list, which cannot be copied as the children are not considered part of this node. Get the children internal List; note that if you directly mess with the list, do so at your own risk. BaseTree doesn't track parent pointers. BaseTree doesn't track child indexes. Add t as child of this node. Warning: if t has no children, but child does and child isNil then this routine moves children to t via t.children = child.children; i.e., without copying the array. Add all elements of kids list as children of this node Insert child t at child position i (0..n-1) by shifting children i+1..n-1 to the right one position. Set parent / indexes properly but does NOT collapse nil-rooted t's that come in here like addChild. Delete children from start to stop and replace with t even if t is a list (nil-root tree). num of children can increase or decrease. For huge child lists, inserting children can force walking rest of children to set their childindex; could be slow. Override in a subclass to change the impl of children list Set the parent and child index values for all child of t Walk upwards looking for ancestor with this token type. Walk upwards and get first ancestor with this token type. Return a list of all ancestors of this node. The first node of list is the root and the last is the parent of this node. Print out a whole tree not just a node Override to say how a node (not a tree) should look as text A TreeAdaptor that works with any Tree implementation. 
System.identityHashCode() is not always unique; we have to track ourselves. That's ok, it's only for debugging, though it's expensive: we have to create a hashtable with all tree nodes in it. Create tree node that holds the start and stop tokens associated with an error. If you specify your own kind of tree nodes, you will likely have to override this method. CommonTree returns Token.INVALID_TOKEN_TYPE if no token payload but you might have to set token type for diff node type. You don't have to subclass CommonErrorNode; you will likely need to subclass your own tree node class to avoid class cast exception. This is generic in the sense that it will work with any kind of tree (not just ITree interface). It invokes the adaptor routines not the tree node routines to do the construction. Add a child to the tree t. If child is a flat tree (a list), make all in list children of t. Warning: if t has no children, but child does and child isNil then you can decide it is ok to move children to t via t.children = child.children; i.e., without copying the array. Just make sure that this is consistent with how the user will build ASTs. If oldRoot is a nil root, just copy or move the children to newRoot. If not a nil root, make oldRoot a child of newRoot. old=^(nil a b c), new=r yields ^(r a b c) old=^(a b c), new=r yields ^(r ^(a b c)) If newRoot is a nil-rooted single child tree, use the single child as the new root node. old=^(nil a b c), new=^(nil r) yields ^(r a b c) old=^(a b c), new=^(nil r) yields ^(r ^(a b c)) If oldRoot was null, it's ok, just return newRoot (even if isNil). old=null, new=r yields r old=null, new=^(nil r) yields ^(nil r) Return newRoot. Throw an exception if newRoot is not a simple node or nil root with a single child node--it must be a root node. If newRoot is ^(nil x) return x as newRoot. Be advised that it's ok for newRoot to point at oldRoot's children; i.e., you don't have to copy the list.
We are constructing these nodes so we should have this control for efficiency. Transform ^(nil x) to x and nil to null Tell me how to create a token for use with imaginary token nodes. For example, there is probably no input symbol associated with imaginary token DECL, but you need to create it as a payload or whatever for the DECL node as in ^(DECL type ID). If you care what the token payload objects' type is, you should override this method and any other createToken variant. Tell me how to create a token for use with imaginary token nodes. For example, there is probably no input symbol associated with imaginary token DECL, but you need to create it as a payload or whatever for the DECL node as in ^(DECL type ID). This is a variant of createToken where the new token is derived from an actual real input token. Typically this is for converting '{' tokens to BLOCK etc... You'll see r : lc='{' ID+ '}' -> ^(BLOCK[$lc] ID+) ; If you care what the token payload objects' type is, you should override this method and any other createToken variant. Duplicate a node. This is part of the factory; override if you want another kind of node to be built. I could use reflection to prevent having to override this but reflection is slow. Track start/stop token for subtree root created for a rule. Only works with Tree nodes. For rules that match nothing, seems like this will yield start=i and stop=i-1 in a nil node. Might be useful info so I'll not force to be i..i. A buffered stream of tree nodes. Nodes can be from a tree of ANY kind. This node stream sucks all nodes out of the tree specified in the constructor during construction and makes pointers into the tree using an array of Object pointers. The stream necessarily includes pointers to DOWN and UP and EOF nodes. This stream knows how to mark/release for backtracking. This stream is most suitable for tree interpreters that need to jump around a lot or for tree parsers requiring speed (at cost of memory). 
There is some duplicated functionality here with UnBufferedTreeNodeStream but just in bookkeeping, not tree walking etc... TARGET DEVELOPERS: This is the old CommonTreeNodeStream that buffered up entire node stream. No need to implement really as new CommonTreeNodeStream is much better and covers what we need. @see CommonTreeNodeStream The complete mapping from stream index to tree node. This buffer includes pointers to DOWN, UP, and EOF nodes. It is built upon ctor invocation. The elements are type Object as we don't know what the trees look like. Load upon first need of the buffer so we can set token types of interest for reverseIndexing. Slows us down a wee bit to do all of the if p==-1 testing everywhere though. Pull nodes from which tree? If this tree (root) was created from a token stream, track it. What tree adaptor was used to build these trees? Reuse same DOWN, UP navigation nodes unless this is true. The index into the nodes list of the current node (next node to consume). If -1, nodes array not filled yet. Track the last mark() call result value for use in rewind(). Stack of indexes used for push/pop calls. Walk tree with depth-first-search and fill nodes buffer. Don't do DOWN, UP nodes if it's a list (t is isNil). What is the stream index for node? 0..n-1. Return -1 if node not found. As we flatten the tree, we use UP, DOWN nodes to represent the tree structure. When debugging we need unique nodes so instantiate new ones when uniqueNavigationNodes is true. Look backwards k nodes. Make stream jump to a new location, saving old location. Switch back with pop(). Seek back to previous index saved during last push() call. Return top of stack (return index). Used for testing, just return the token type stream. Debugging. A node representing erroneous token range in token stream. A tree node that is wrapper for a Token object. After 3.0 release while building tree rewrite stuff, it became clear that computing parent and child index is very difficult and cumbersome.
Better to spend the space in every tree node. If you don't want these extra fields, it's easy to cut them out in your own BaseTree subclass. A single token is the payload. What token indexes bracket all tokens associated with this node and below? Who is the parent node of this node; if null, implies node is root. What index is this node in the child list? Range: 0..n-1. For every node in this subtree, make sure its start/stop tokens are set. Walk depth first, visit bottom up. Only updates nodes with at least one token index < 0. A TreeAdaptor that works with any Tree implementation. It provides really just factory methods; all the work is done by BaseTreeAdaptor. If you would like to have different tokens created than ClassicToken objects, you need to override this and then set the parser tree adaptor to use your subclass. To get your parser to build nodes of a different type, override create(Token), errorNode(), and to be safe, YourTreeClass.dupNode(). dupNode is called to duplicate nodes during rewrite operations. Tell me how to create a token for use with imaginary token nodes. For example, there is probably no input symbol associated with imaginary token DECL, but you need to create it as a payload or whatever for the DECL node as in ^(DECL type ID). If you care what the token payload objects' type is, you should override this method and any other createToken variant. Tell me how to create a token for use with imaginary token nodes. For example, there is probably no input symbol associated with imaginary token DECL, but you need to create it as a payload or whatever for the DECL node as in ^(DECL type ID). This is a variant of createToken where the new token is derived from an actual real input token. Typically this is for converting '{' tokens to BLOCK etc... You'll see r : lc='{' ID+ '}' -> ^(BLOCK[$lc] ID+) ; If you care what the token payload objects' type is, you should override this method and any other createToken variant.
What is the Token associated with this node? If you are not using CommonTree, then you must override this in your own adaptor. Pull nodes from which tree? If this tree (root) was created from a token stream, track it. What tree adaptor was used to build these trees? The tree iterator we are using. Stack of indexes used for push/pop calls. Treat (nil A B C) trees like flat A B C streams. Tracks tree depth. Level=0 means we're at root node level. Tracks the last node before the start of {@link #data} which contains position information to provide information for error reporting. This is tracked in addition to {@link #prevElement} which may or may not contain position information. @see #hasPositionInformation @see RecognitionException#extractInformationFromTreeNodeStream Make stream jump to a new location, saving old location. Switch back with pop(). Seek back to previous index saved during last push() call. Return top of stack (return index). Returns an element containing position information. If {@code allowApproximateLocation} is {@code false}, then this method will return the {@code LT(1)} element if it contains position information, and otherwise return {@code null}. If {@code allowApproximateLocation} is {@code true}, then this method will return the last known element containing position information. @see #hasPositionInformation For debugging; destructive: moves tree iterator to end. A utility class to generate DOT diagrams (graphviz) from arbitrary trees. You can pass in your own templates and can pass in any kind of tree or use Tree interface method. I wanted this separate so that you don't have to include ST just to use the org.antlr.runtime.tree.* package. This is a set of non-static methods so you can subclass to override.
For example, here is an invocation:

    CharStream input = new ANTLRInputStream(System.in);
    TLexer lex = new TLexer(input);
    CommonTokenStream tokens = new CommonTokenStream(lex);
    TParser parser = new TParser(tokens);
    TParser.e_return r = parser.e();
    Tree t = (Tree)r.tree;
    System.out.println(t.toStringTree());
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(t);
    System.out.println(st);

Track node to number mapping so we can get proper node name back. Track node number so we can get unique node names. Generate DOT (graphviz) for a whole tree not just a node. For example, 3+4*5 should generate:

    digraph {
      node [shape=plaintext, fixedsize=true, fontsize=11, fontname="Courier",
            width=.4, height=.2];
      edge [arrowsize=.7]
      "+"->3
      "+"->"*"
      "*"->4
      "*"->5
    }

Takes a Tree interface object. @author Sam Harwell Returns an element containing concrete information about the current position in the stream. @param allowApproximateLocation if {@code false}, this method returns {@code null} if an element containing exact information about the current position is not available Determines if the specified {@code element} contains concrete position information. @param element the element to check @return {@code true} if {@code element} contains concrete position information, otherwise {@code false} What does a tree look like? ANTLR has a number of support classes such as CommonTreeNodeStream that work on these kinds of trees. You don't have to make your trees implement this interface, but if you do, you'll be able to use more support code. NOTE: When constructing trees, ANTLR can build any kind of tree; it can even use Token objects as trees if you add a child list to your tokens. This is a tree node without any payload; just navigation and factory stuff. Is there a node above with token type ttype? Walk upwards and get first ancestor with this token type. Return a list of all ancestors of this node.
The first node of list is the root and the last is the parent of this node. This node is what child index? 0..n-1 Set the parent and child index values for all children Add t as a child to this node. If t is null, do nothing. If t is nil, add all children of t to this' children. Set ith child (0..n-1) to t; t must be non-null and non-nil node Delete children from start to stop and replace with t even if t is a list (nil-root tree). num of children can increase or decrease. For huge child lists, inserting children can force walking rest of children to set their childindex; could be slow. Indicates the node is a nil node but may still have children, meaning the tree is a flat list. What is the smallest token index (indexing from 0) for this node and its children? What is the largest token index (indexing from 0) for this node and its children? Return a token type; needed for tree parsing In case we don't have a token payload, what is the line for errors? How to create and navigate trees. Rather than have a separate factory and adaptor, I've merged them. Makes sense to encapsulate. This takes the place of the tree construction code generated in the generated code in 2.x and the ASTFactory. I do not need to know the type of a tree at all so they are all generic Objects. This may increase the amount of typecasting needed. :( Create a tree node from Token object; for CommonTree type trees, then the token just becomes the payload. This is the most common create call. Override if you want another kind of node to be built. Create a new node derived from a token, with a new token type. This is invoked from an imaginary node ref on right side of a rewrite rule as IMAG[$tokenLabel]. This should invoke createToken(Token). Same as create(tokenType,fromToken) except set the text too. This is invoked from an imaginary node ref on right side of a rewrite rule as IMAG[$tokenLabel, "IMAG"]. This should invoke createToken(Token). Same as create(fromToken) except set the text too. 
This is invoked when the text terminal option is set, as in IMAG<text='IMAG'>. This should invoke createToken(Token).

Create a new node derived from a token, with a new token type. This is invoked from an imaginary node ref on right side of a rewrite rule as IMAG["IMAG"]. This should invoke createToken(int,String).

Duplicate a single tree node. Override if you want another kind of node to be built.

Duplicate tree recursively, using dupNode() for each node

Return a nil node (an empty but non-null node) that can hold a list of elements as the children. If you want a flat tree (a list) use "t=adaptor.nil(); t.addChild(x); t.addChild(y);"

Return a tree node representing an error. This node records the tokens consumed during error recovery. The start token indicates the input symbol at which the error was detected. The stop token indicates the last symbol consumed during recovery. You must specify the input stream so that the erroneous text can be packaged up in the error node. The exception could be useful to some applications; default implementation stores ptr to it in the CommonErrorNode. This only makes sense during token parsing, not tree parsing. Tree parsing should happen only when parsing and tree construction succeed.

Is tree considered a nil node used to make lists of child nodes?

Add a child to the tree t. If child is a flat tree (a list), make all in list children of t. Warning: if t has no children, but child does and child isNil then you can decide it is ok to move children to t via t.children = child.children; i.e., without copying the array. Just make sure that this is consistent with how the user will build ASTs. Do nothing if t or child is null.

If oldRoot is a nil root, just copy or move the children to newRoot. If not a nil root, make oldRoot a child of newRoot.
    old=^(nil a b c), new=r yields ^(r a b c)
    old=^(a b c), new=r yields ^(r ^(a b c))
If newRoot is a nil-rooted single child tree, use the single child as the new root node.
    old=^(nil a b c), new=^(nil r) yields ^(r a b c)
    old=^(a b c), new=^(nil r) yields ^(r ^(a b c))
If oldRoot was null, it's ok, just return newRoot (even if isNil).
    old=null, new=r yields r
    old=null, new=^(nil r) yields ^(nil r)
Return newRoot. Throw an exception if newRoot is not a simple node or nil root with a single child node--it must be a root node. If newRoot is ^(nil x) return x as newRoot. Be advised that it's ok for newRoot to point at oldRoot's children; i.e., you don't have to copy the list. We are constructing these nodes so we should have this control for efficiency.

Given the root of the subtree created for this rule, post process it to do any simplifications or whatever you want. A required behavior is to convert ^(nil singleSubtree) to singleSubtree as the setting of start/stop indexes relies on a single non-nil root for non-flat trees. Flat trees such as for lists like "idlist : ID+ ;" are left alone unless there is only one ID. For a list, the start/stop indexes are set in the nil node. This method is executed after all rule tree construction and right before setTokenBoundaries().

For identifying trees. How to identify nodes so we can say "add node to a prior node"? Even becomeRoot is an issue. Use System.identityHashCode(node) usually.

Create a node for newRoot and make it the root of oldRoot. If oldRoot is a nil root, just copy or move the children to newRoot. If not a nil root, make oldRoot a child of newRoot. Return node created for newRoot. Be advised: when debugging ASTs, the DebugTreeAdaptor manually calls create(Token child) and then plain becomeRoot(node, node) because it needs to trap calls to create, but it can't since it delegates to, rather than inherits from, the TreeAdaptor.

For tree parsing, I need to know the token type of a node

Node constructors can set the type of a node

Node constructors can set the text of a node

Return the token object from which this node was created. Currently used only for printing an error message.
The error display routine in BaseRecognizer needs to display where in the input the error occurred. If your tree implementation does not store information that can lead you to the token, you can create a token filled with the appropriate information and pass that back. See BaseRecognizer.getErrorMessage().

Where are the bounds in the input token stream for this node and all children? Each rule that creates AST nodes will call this method right before returning. Flat trees (i.e., lists) will still usually have a nil root node just to hold the children list. That node would contain the start/stop indexes then.

Get the token start index for this subtree; return -1 if no such index

Get the token stop index for this subtree; return -1 if no such index

Get a child 0..n-1 node

Set ith child (0..n-1) to t; t must be non-null and non-nil node

Remove ith child and shift children down from right.

How many children? If 0, then this is a leaf node

Who is the parent node of this node; if null, implies node is root. If your node type doesn't handle this, it's ok but the tree rewrites in tree parsers need this functionality.

What index is this node in the child list? Range: 0..n-1. If your node type doesn't handle this, it's ok but the tree rewrites in tree parsers need this functionality.

Replace from start to stop child index of parent with t, which might be a list. Number of children may be different after this call. If parent is null, don't do anything; must be at root of overall tree. Can't replace whatever points to the parent externally. Do nothing.

A stream of tree nodes, accessing nodes from a tree of some kind

Get a tree node at an absolute index i; 0..n-1. If you don't want to buffer up nodes, then this method makes no sense for you.

Get tree node at current input pointer + k ahead where k==1 is next node. k<0 indicates nodes in the past. So {@code LT(-1)} is previous node, but implementations are not required to provide results for k < -1. {@code LT(0)} is undefined.
For a lookahead k>=n, return {@code null}. Return {@code null} for {@code LT(0)} and any index that results in an absolute address that is negative. This is analogous to {@link TokenStream#LT}, but this returns a tree node instead of a {@link Token}. Makes code generation identical for both parser and tree grammars.

Where is this stream pulling nodes from? This is not the name, but the object that provides node objects.

If the tree associated with this stream was created from a {@link TokenStream}, you can specify it here. Used to do rule {@code $text} attribute in tree parser. Optional unless you use tree parser rule {@code $text} attribute or {@code output=template} and {@code rewrite=true} options.

What adaptor can tell me how to interpret/navigate nodes and trees. E.g., get text of a node.

As we flatten the tree, we use {@link Token#UP}, {@link Token#DOWN} nodes to represent the tree structure. When debugging we need unique nodes so we have to instantiate new ones. When doing normal tree parsing, it's slow and a waste of memory to create unique navigation nodes. Default should be {@code false}.

Return the text of all nodes from {@code start} to {@code stop}, inclusive. If the stream does not buffer all the nodes then it can still walk recursively from start until stop. You can always return {@code null} or {@code ""} too, but in that case users should not access {@code $ruleLabel.text} in an action.

Replace children of {@code parent} from index {@code startChildIndex} to {@code stopChildIndex} with {@code t}, which might be a list. Number of children may be different after this call. The stream is notified because it is walking the tree and might need to know you are monkeying with the underlying tree. Also, it might be able to modify the node stream to avoid restreaming for future phases. If {@code parent} is {@code null}, don't do anything; must be at root of overall tree. Can't replace whatever points to the parent externally. Do nothing.

How to execute code for node t when a visitor visits node t.
Execute pre() before visiting children and execute post() after visiting children.

Execute an action before visiting children of t. Return t or a rewritten t. It is up to the visitor to decide what to do with the return value. Children of returned value will be visited if using TreeVisitor.visit().

Execute an action after visiting children of t. Return t or a rewritten t. It is up to the visitor to decide what to do with the return value.

A record of the rules used to match a token sequence. The tokens end up as the leaves of this tree and rule nodes are the interior nodes. This really adds no functionality; it is just an alias for CommonTree that is more meaningful (specific) and holds a String to display for a node.

Emit a token and all hidden nodes before. EOF node holds all hidden tokens after last real token.

Print out the leaves of this tree, which means printing original input back out.

Base class for all exceptions thrown during AST rewrite construction. This signifies a case where the cardinalities of two or more elements in a subrule differ: (ID INT)+ where |ID|!=|INT|

No elements within a (...)+ in a rewrite rule

Ref to ID or expr but no tokens in ID stream or subtrees in expr stream

A generic list of elements tracked in an alternative to be used in a -> rewrite rule. We need to subclass to fill in the next() method, which returns either an AST node wrapped around a token payload or an existing subtree. Once you start next()ing, do not try to add more elements. It will break the cursor tracking I believe. TODO: add mechanism to detect/puke on modification after reading from stream

Cursor 0..n-1. If singleElement!=null, cursor is 0 until you next(), which bumps it to 1 meaning no more elements.

Track single elements w/o creating a list. Upon 2nd add, alloc list

The list of tokens or subtrees we are tracking

Once a node / subtree has been used in a stream, it must be dup'd from then on.
Streams are reset after subrules so that the streams can be reused in future subrules. So, reset must set a dirty bit. If dirty, then next() always returns a dup.

The element or stream description; usually has name of the token or rule reference that this list tracks. Can include rulename too, but the exception would track that info.

Create a stream with one element

Create a stream, but feed off an existing list

Reset the condition of this stream so that it appears we have not consumed any of its elements. Elements themselves are untouched. Once we reset the stream, any future use will need duplicates. Set the dirty bit.

Return the next element in the stream. If out of elements, throw an exception unless size()==1. If size is 1, then return elements[0]. Return a duplicate node/subtree if stream is out of elements and size==1. If we've already used the element, dup (dirty bit set).

Do the work of getting the next element, making sure that it's a tree node or subtree. Deal with the optimization of single-element list versus list of size > 1. Throw an exception if the stream is empty or we're out of elements and size>1. Protected so you can override in a subclass if necessary.

When constructing trees, sometimes we need to dup a token or AST subtree. Dup'ing a token means just creating another AST node around it. For trees, you must call the adaptor.dupTree() unless the element is for a tree root; then it must be a node dup.

Ensure stream emits trees; tokens must be converted to AST nodes. AST nodes can be passed through unmolested.

Queues up nodes matched on left side of -> in a tree parser. This is the analog of RewriteRuleTokenStream for normal parsers.

Create a stream with one element

Create a stream, but feed off an existing list

Create a stream with one element

Create a stream, but feed off an existing list

Treat next element as a single node even if it's a subtree. This is used instead of next() when the result has to be a tree root node.
Also prevents us from duplicating recently-added children; e.g., ^(type ID)+ adds ID to type and then 2nd iteration must dup the type node, but ID has been added. Referencing a rule result twice is ok; dup entire tree as we can't be adding trees as root; e.g., expr expr. Hideous code duplication here with super.next(). Can't think of a proper way to refactor. This needs to always call dup node and super.next() doesn't know which to call: dup node or dup tree.

Create a stream with one element

Create a stream, but feed off an existing list

Get next token from stream and make a node for it

Don't convert to a tree unless they explicitly call nextTree. This way we can do hetero tree nodes in rewrite.

Return a node stream from a doubly-linked tree whose nodes know what child index they are. No remove() is supported. Emit navigation nodes (DOWN, UP, and EOF) to show tree structure.

If we emit UP/DOWN nodes, we need to spit out multiple nodes per next() call.

A parser for a stream of tree nodes. "tree grammars" result in a subclass of this. All the error reporting and recovery is shared with Parser via the BaseRecognizer superclass.

Set the input stream

Match '.' in tree parser has special meaning. Skip node or entire tree if node has children. If children, scan until corresponding UP node.

We have DOWN/UP nodes in the stream that have no line info; override. Plus we want to alter the exception type. Don't try to recover from tree parser errors inline...

Prefix error message with the grammar name because the message is always intended for the programmer: the parser, not the user, built the input tree.

Tree parsers parse nodes that usually have a token object as payload. Set the exception token and do the default behavior.

The tree pattern to lex like "(A B C)"

Index into input string

Current char

How long is the pattern in char?
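The wildcard rule for tree parsers described above (skip a single node, or the whole subtree up to its corresponding UP node when the node has children) can be sketched on a flattened node stream. This is a minimal sketch, not the TreeParser implementation: it assumes nodes are plain strings, that the navigation markers are the literal strings "DOWN" and "UP", and the class and method names (WildcardSkipSketch, matchAny) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class WildcardSkipSketch {
    // Assumed stand-ins for the DOWN/UP navigation nodes of a flattened tree.
    static final String DOWN = "DOWN";
    static final String UP = "UP";

    /**
     * Skips the node at index p. If that node has children (the next
     * element is DOWN), also skips everything up to and including the
     * matching UP. Returns the index of the first element not skipped.
     */
    static int matchAny(List<String> nodes, int p) {
        p++; // consume the node itself
        if (p >= nodes.size() || !nodes.get(p).equals(DOWN)) {
            return p; // leaf node: nothing more to skip
        }
        int depth = 0;
        while (p < nodes.size()) {
            String n = nodes.get(p);
            if (n.equals(DOWN)) {
                depth++;
            } else if (n.equals(UP)) {
                depth--;
            }
            p++;
            if (depth == 0) {
                break; // consumed the UP that closes the first DOWN
            }
        }
        return p;
    }

    public static void main(String[] args) {
        // Flattened form of ^(+ 3 ^(* 4 5))
        List<String> stream = Arrays.asList(
            "+", DOWN, "3", "*", DOWN, "4", "5", UP, UP);
        System.out.println(matchAny(stream, 0)); // whole tree skipped: 9
        System.out.println(matchAny(stream, 2)); // leaf "3" skipped: 3
        System.out.println(matchAny(stream, 3)); // subtree ^(* 4 5) skipped: 8
    }
}
```

Counting nesting depth is what lets the scan stop at the UP that belongs to the skipped node rather than at the first UP it meets.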
Set when token type is ID or ARG (name mimics Java's StreamTokenizer)

Override this if you need transformation tracing to go somewhere other than stdout or if you're not using ITree-derived trees.

This is identical to the ParserRuleReturnScope except that the start property is a tree node, not a Token object, when you are parsing trees.

Gets the first node or root node of tree matched for this rule.

Do a depth first walk of a tree, applying pre() and post() actions as we go.

Visit every node in tree t and trigger an action for each node before/after having visited all of its children. Bottom up walk. Execute both actions even if t has no children. Ignore return results from transforming children since they will have altered the child list of this node (their parent). Return result of applying post action to this node.

Build and navigate trees with this object. Must know about the names of tokens so you have to pass in a map or array of token names (from which this class can build the map). I.e., Token DECL means nothing unless the class can translate it to a token type. In order to create nodes and navigate, this class needs a TreeAdaptor. This class can build a token type -> node index for repeated use or for iterating over the various nodes with a particular type. This class works in conjunction with the TreeAdaptor rather than moving all this functionality into the adaptor. An adaptor helps build and navigate trees using methods. This class helps you do it with string patterns like "(A B C)". You can create a tree from that pattern or match subtrees against it.

When using %label:TOKENNAME in a tree for parse(), we must track the label.

This adaptor creates TreePattern objects for use during scan()

Compute a Map<String, Integer> that is an inverted index of tokenNames (which maps int token types to names).

Using the map of token names to token types, return the type.

Walk the entire tree and make a node name to nodes mapping.
For now, use recursion but later nonrecursive version may be more efficient. Returns Map<Integer, List> where the List is of your AST node type. The Integer is the token type of the node. TODO: save this index so that find and visit are faster

Do the work for index

Return a List of tree nodes with token type ttype

Return a List of subtrees matching pattern.

Visit every ttype node in t, invoking the visitor. This is a quicker version of the general visit(t, pattern) method. The labels arg of the visitor action method is never set (it's null) since using a token type rather than a pattern doesn't let us set a label.

Do the recursive work for visit

For all subtrees that match the pattern, execute the visit action. The implementation uses the root node of the pattern in combination with visit(t, ttype, visitor) so nil-rooted patterns are not allowed. Patterns with wildcard roots are also not allowed.

Given a pattern like (ASSIGN %lhs:ID %rhs:.) with optional labels on the various nodes and '.' (dot) as the node/subtree wildcard, return true if the pattern matches and fill the labels Map with the labels pointing at the appropriate nodes. Return false if the pattern is malformed or the tree does not match. If a node specifies a text arg in pattern, then that must match for that node in t. TODO: what's a better way to indicate bad pattern? Exceptions are a hassle

Do the work for parse. Check to see if the t2 pattern fits the structure and token types in t1. Check text if the pattern has text arguments on nodes. Fill labels map with pointers to nodes in tree matched against nodes in pattern with labels.

Create a tree or node from the indicated tree pattern that closely follows ANTLR tree grammar tree element syntax: (root child1 ... child2). You can also just pass in a node: ID. Any node can have a text argument: ID[foo] (notice there are no quotes around foo--it's clear it's a string). nil is a special name meaning "give me a nil node".
Useful for making lists: (nil A B C) is a list of A B C.

Compare t1 and t2; return true if token types/text, structure match exactly. The trees are examined in their entirety so that (A B) does not match (A B C) nor (A (B C)). TODO: allow them to pass in a comparator. TODO: have a version that is nonstatic so it can use instance adaptor. I cannot rely on the tree node's equals() implementation as I make no constraints at all on the node types nor interface etc...

Compare type, structure, and text of two trees, assuming adaptor in this instance of a TreeWizard.

A token stream that pulls tokens from the code source on-demand and without tracking a complete buffer of the tokens. This stream buffers the minimum number of tokens possible. It's the same as OnDemandTokenStream except that OnDemandTokenStream buffers all tokens. You can't use this stream if you pass whitespace or other off-channel tokens to the parser. The stream can't ignore off-channel tokens. You can only look backwards 1 token: LT(-1). Use this when you need to read from a socket or other infinite stream.
@see BufferedTokenStream
@see CommonTokenStream

Skip tokens on any channel but this one; this is how we skip whitespace...

An extra token while parsing a TokenStream
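
The exact-match comparison described earlier for TreeWizard (token type, text and structure must all agree, so (A B) matches neither (A B C) nor (A (B C))) can be illustrated with a small sketch. The Node class and treeEquals method below are hypothetical stand-ins written for this example; they are not ANTLR's Tree or TreeAdaptor API, and treating a null tree as unequal to everything is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class TreeEqualsSketch {
    /** Hypothetical minimal node: a token type, its text, and children. */
    static final class Node {
        final int type;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(int type, String text, Node... kids) {
            this.type = type;
            this.text = text;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    /** True only if type, text, and the entire child structure match. */
    static boolean treeEquals(Node t1, Node t2) {
        if (t1 == null || t2 == null) {
            return false; // assumption: null never equals anything
        }
        if (t1.type != t2.type || !Objects.equals(t1.text, t2.text)) {
            return false; // root nodes differ
        }
        if (t1.children.size() != t2.children.size()) {
            return false; // (A B) vs (A B C)
        }
        for (int i = 0; i < t1.children.size(); i++) {
            if (!treeEquals(t1.children.get(i), t2.children.get(i))) {
                return false; // some subtree differs, e.g. (A B) vs (A (B C))
            }
        }
        return true;
    }

    public static void main(String[] args) {
        final int A = 1, B = 2, C = 3; // made-up token types
        Node ab = new Node(A, "A", new Node(B, "B"));
        Node abCopy = new Node(A, "A", new Node(B, "B"));
        Node abc = new Node(A, "A", new Node(B, "B"), new Node(C, "C"));
        Node nested = new Node(A, "A", new Node(B, "B", new Node(C, "C")));

        System.out.println(treeEquals(ab, abCopy)); // true
        System.out.println(treeEquals(ab, abc));    // false
        System.out.println(treeEquals(ab, nested)); // false
    }
}
```

Comparing through an explicit helper rather than each node's equals() mirrors the stated reason above: no constraints are placed on the node types, so their own equals() cannot be relied on.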