TokenType
    ├── Doctype: name, publicIdentifier, systemIdentifier
    ├── StartTag: tagName, pendingAttributeName, pendingAttributeValue, attributes
    ├── EndTag: tagName, pendingAttributeName, pendingAttributeValue
    ├── Comment: data
    ├── Character: data
    └── EOF: EOF
For example, the Initial state:
Initial {
        boolean process(Token t, HtmlTreeBuilder tb) {
            if (isWhitespace(t)) {
                return true; // ignore whitespace
            } else if (t.isComment()) {
                tb.insert(t.asComment());
            } else if (t.isDoctype()) {
                // todo: parse error check on expected doctypes
                // todo: quirk state check on doctype ids
                Token.Doctype d = t.asDoctype();
                DocumentType doctype = new DocumentType(d.getName(), d.getPublicIdentifier(), d.getSystemIdentifier(), tb.getBaseUri());
                tb.getDocument().appendChild(doctype);
                if (d.isForceQuirks())
                    tb.getDocument().quirksMode(Document.QuirksMode.quirks);
                tb.transition(BeforeHtml);
            } else {
                // todo: check not iframe srcdoc
                tb.transition(BeforeHtml);
                return tb.process(t); // re-process token
            }
            return true;
        }
    },
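On entering the Initial state, only four cases can occur: whitespace, a comment, a doctype, or anything else. Each case defines its own follow-up handling.
The transition from one state to the next is made with tb.transition(BeforeHtml).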
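The pattern above (one enum constant per state, each with its own process method, plus a transition call on the builder) can be sketched in isolation. This is a minimal illustration, not jsoup's code: Kind, Builder, and the log list are hypothetical stand-ins for Token and HtmlTreeBuilder, and only two states are modelled.

```java
import java.util.ArrayList;
import java.util.List;

public class StateMachineSketch {
    // Hypothetical token kinds; jsoup's real Token class is richer.
    enum Kind { WHITESPACE, COMMENT, DOCTYPE, OTHER }

    // Hypothetical stand-in for HtmlTreeBuilder: holds the current state and a log.
    static class Builder {
        State state = State.Initial;
        final List<String> log = new ArrayList<>();

        void transition(State next) { state = next; }

        boolean process(Kind token) { return state.process(token, this); }
    }

    // Each enum constant is one state and supplies its own process() method.
    enum State {
        Initial {
            boolean process(Kind t, Builder b) {
                switch (t) {
                    case WHITESPACE:
                        return true;                 // ignored
                    case COMMENT:
                        b.log.add("Initial:comment");
                        return true;
                    case DOCTYPE:
                        b.log.add("Initial:doctype");
                        b.transition(BeforeHtml);
                        return true;
                    default:
                        b.transition(BeforeHtml);
                        return b.process(t);         // re-process in the new state
                }
            }
        },
        BeforeHtml {
            boolean process(Kind t, Builder b) {
                b.log.add("BeforeHtml:" + t);
                return true;
            }
        };

        abstract boolean process(Kind t, Builder b);
    }

    public static void main(String[] args) {
        Builder b = new Builder();
        b.process(Kind.WHITESPACE);                  // stays in Initial
        b.process(Kind.OTHER);                       // moves to BeforeHtml and is re-processed there
        System.out.println(b.state + " " + b.log);
    }
}
```

The point to notice is the default branch: the token is not consumed by Initial, it is handed back to the builder so that the new state sees the same token.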
    <!-- State: Initial -->
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <!-- State: BeforeHtml -->
    <html lang='zh-CN' xml:lang='zh-CN' xmlns='http://www.w3.org/1999/xhtml'>
    <!-- State: BeforeHead -->
    <head>
      <!-- State: InHead -->
      <script type="text/javascript">
      //<!-- State: Text -->
        function xx(){
        }
      </script>
      <noscript>
        <!-- State: InHeadNoscript -->
        Your browser does not support JavaScript!
      </noscript>
    </head>
    <!-- State: AfterHead -->
    <body>
    <!-- State: InBody -->
    <textarea>
        <!-- State: Text -->
        xxx
    </textarea>
    <table>
        <!-- State: InTable -->
        <!-- State: InTableText -->
        xxx
        <tbody>
        <!-- State: InTableBody -->
        </tbody>
        <tr>
            <!-- State: InRow -->
            <td>
                <!-- State: InCell -->
            </td>
        </tr>    
    </table>
    </body>
    </html>
Given the following code:
        String html = "<html><head><title>First!</title></head><body><p>First post! <img src=\"foo.png\" /></p></body></html>";
        Document doc = Jsoup.parse(html);
the generated HTML parse tree (Document) looks like this:
doc->childNodes: 
        <html><head>...</head><body>...</body></html>
doc->childNodes->childNodes
        <head>...</head>
        <body>...</body>
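Keep descending through childNodes in the same way and you get the complete tree.
Token token = tokeniser.read(); is the tokenization step: it splits the input into tokens such as <html>.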
protected void runParser() {
   while (true) {
       Token token = tokeniser.read();
       process(token);
       if (token.type == Token.TokenType.EOF)
           break;
   }
}
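To make the read/process/EOF loop concrete, here is a self-contained sketch with a toy tokeniser. It is an assumption-laden simplification: the Token and Tokeniser classes below are hypothetical, handle only <tag>, </tag>, and plain text (no attributes, comments, or doctypes), and process(token) is replaced by recording the token in a list.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenLoopSketch {
    enum TokenType { StartTag, EndTag, Character, EOF }

    // Hypothetical, simplified token: a type plus its text (tag name or character data).
    static class Token {
        final TokenType type;
        final String data;
        Token(TokenType type, String data) { this.type = type; this.data = data; }
        @Override public String toString() { return type + "(" + data + ")"; }
    }

    // Hypothetical tokeniser: handles only <tag>, </tag> and plain text, no attributes.
    static class Tokeniser {
        private final String input;
        private int pos = 0;
        Tokeniser(String input) { this.input = input; }

        Token read() {
            if (pos >= input.length()) return new Token(TokenType.EOF, "");
            if (input.charAt(pos) == '<') {
                int end = input.indexOf('>', pos);
                if (end < 0) end = input.length();   // unterminated tag: take the rest
                boolean closing = input.startsWith("</", pos);
                String name = input.substring(pos + (closing ? 2 : 1), end);
                pos = Math.min(end + 1, input.length());
                return new Token(closing ? TokenType.EndTag : TokenType.StartTag, name);
            }
            int next = input.indexOf('<', pos);
            if (next < 0) next = input.length();
            String text = input.substring(pos, next);
            pos = next;
            return new Token(TokenType.Character, text);
        }
    }

    // Same shape as runParser(): read, process, stop after EOF has been processed.
    static List<String> run(String html) {
        Tokeniser tokeniser = new Tokeniser(html);
        List<String> seen = new ArrayList<>();
        while (true) {
            Token token = tokeniser.read();
            seen.add(token.toString());              // stands in for process(token)
            if (token.type == TokenType.EOF) break;
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run("<html><p>Hi</p></html>"));
    }
}
```

As in runParser(), the EOF token is itself processed before the loop exits, so the tree builder gets a chance to finish up.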
Finding the elements whose tagName is a:
new NodeTraversor(new Accumulator(root, elements, eval)).traverse(root);
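NodeTraversor walks the generated document one element at a time. For each element it reaches, Accumulator's head method checks whether it is the element being looked for (tagName a) and, if so, adds it to the elements variable.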
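The traversal-plus-accumulator idea can be sketched without jsoup. Everything below is hypothetical scaffolding: Element is a bare tag name with children, NodeVisitor mirrors the head/tail callback shape, TagAccumulator plays the role described for Accumulator, and the walk is written recursively for brevity (the real NodeTraversor may be implemented differently).

```java
import java.util.ArrayList;
import java.util.List;

public class TraversalSketch {
    // Hypothetical minimal element: a tag name plus child elements.
    static class Element {
        final String tagName;
        final List<Element> children = new ArrayList<>();
        Element(String tagName) { this.tagName = tagName; }
        Element append(Element child) { children.add(child); return this; }
    }

    // Visitor callbacks: head() on entering a node, tail() on leaving it.
    interface NodeVisitor {
        void head(Element node, int depth);
        void tail(Element node, int depth);
    }

    // Collects every visited element whose tag name matches.
    static class TagAccumulator implements NodeVisitor {
        private final String tagName;
        final List<Element> elements = new ArrayList<>();
        TagAccumulator(String tagName) { this.tagName = tagName; }
        public void head(Element node, int depth) {
            if (node.tagName.equals(tagName)) elements.add(node);
        }
        public void tail(Element node, int depth) { /* nothing to do on exit */ }
    }

    // Recursive depth-first walk; recursion keeps the sketch short.
    static void traverse(Element node, NodeVisitor visitor, int depth) {
        visitor.head(node, depth);
        for (Element child : node.children) traverse(child, visitor, depth + 1);
        visitor.tail(node, depth);
    }

    static List<Element> findByTag(Element root, String tagName) {
        TagAccumulator acc = new TagAccumulator(tagName);
        traverse(root, acc, 0);
        return acc.elements;
    }

    public static void main(String[] args) {
        Element body = new Element("body")
                .append(new Element("p").append(new Element("a")))
                .append(new Element("a"));
        System.out.println(findByTag(body, "a").size());
    }
}
```

Separating the walk from the match test means the same traversal can serve any selector: only the check inside head() changes.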
TokenQueue tq = new TokenQueue("(one (two) three) four");
String guts = tq.chompBalanced('(', ')');
assertEquals("one (two) three", guts);
chompBalanced(char open, char close) returns the content between open and close:
do {
       if (isEmpty()) break;
       Character c = consume();
       if (last == 0 || last != ESC) {
           if (c.equals(open))
               depth++;
           else if (c.equals(close))
               depth--;
       }
       if (depth > 0 && last != 0)
           accum.append(c); // don't include the outer match pair in the return
       last = c;
   } while (depth > 0);
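Each character matching open increments depth; each character matching close decrements it. Everything appended along the way is the content to extract.
If the input runs out while depth is still non-zero, meaning no matching close was found, everything consumed so far is returned.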
TokenQueue tq = new TokenQueue("(one (two three) four");
String guts = tq.chompBalanced('(', ')');
// guts = one (two three) four
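To try the loop outside jsoup, the following sketch reimplements it over a plain String. It follows the loop shown above under a few assumptions: ESC is taken to be the backslash character, the queue is replaced by an index into the string, and the method name and signature are hypothetical.

```java
public class ChompBalancedSketch {
    private static final char ESC = '\\';            // assumed escape character

    // Returns the text between the outermost open/close pair at the start of input.
    static String chompBalanced(String input, char open, char close) {
        StringBuilder accum = new StringBuilder();
        int depth = 0;
        char last = 0;
        int pos = 0;
        do {
            if (pos >= input.length()) break;        // ran out of input before balancing
            char c = input.charAt(pos++);
            if (last != ESC) {                       // an escaped bracket does not count
                if (c == open) depth++;
                else if (c == close) depth--;
            }
            if (depth > 0 && last != 0)
                accum.append(c);                     // skip the outer pair itself
            last = c;
        } while (depth > 0);
        return accum.toString();
    }

    public static void main(String[] args) {
        System.out.println(chompBalanced("(one (two) three) four", '(', ')'));
        System.out.println(chompBalanced("(one (two three) four", '(', ')'));
    }
}
```

In the unbalanced case the inner close bracket only brings depth back to 1, so it is appended like any other character and the loop ends when the input is exhausted.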