JSoup

2016/2/22 posted in Source Code Reading
  • To build a class whose only content is an ArrayList<Object>, you can simply extend ArrayList<Object>.
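
As a minimal sketch of this idea, a type whose only state is a list can extend ArrayList directly and inherit every list operation. The class name TagNames and its helper method are hypothetical, chosen for illustration and not taken from jsoup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class: its only state is the inherited list.
final class TagNames extends ArrayList<String> {
    // Domain-specific helper layered on top of the inherited list API.
    List<String> startingWith(String prefix) {
        List<String> result = new ArrayList<>();
        for (String name : this) {
            if (name.startsWith(prefix)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Callers can use add, get, size and for-each on a TagNames instance unchanged. The trade-off is that the whole ArrayList API becomes part of the public surface; wrapping the list in a field would hide it.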

  • TokenType covers the following token kinds and their fields:

    TokenType
    ├── Doctype: name, publicIdentifier, systemIdentifier
    ├── StartTag: tagName, pendingAttributeName, pendingAttributeValue, attributes
    ├── EndTag: tagName, pendingAttributeName, pendingAttributeValue
    ├── Comment: data
    ├── Character: data
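    └── EOF: EOF
  • HtmlTreeBuilderState defines the states the parser can be in while analysing an HTML page, together with the handling method for each state (a state machine). Each token is handed to the current state: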

    this.state.process(..)
    

For example:

Initial {
        boolean process(Token t, HtmlTreeBuilder tb) {
            if (isWhitespace(t)) {
                return true; // ignore whitespace
            } else if (t.isComment()) {
                tb.insert(t.asComment());
            } else if (t.isDoctype()) {
                // todo: parse error check on expected doctypes
                // todo: quirk state check on doctype ids
                Token.Doctype d = t.asDoctype();
                DocumentType doctype = new DocumentType(d.getName(), d.getPublicIdentifier(), d.getSystemIdentifier(), tb.getBaseUri());
                tb.getDocument().appendChild(doctype);
                if (d.isForceQuirks())
                    tb.getDocument().quirksMode(Document.QuirksMode.quirks);
                tb.transition(BeforeHtml);
            } else {
                // todo: check not iframe srcdoc
                tb.transition(BeforeHtml);
                return tb.process(t); // re-process token
            }
            return true;
        }
    },

In the Initial state only four cases are possible: whitespace, comment, doctype, or anything else. Each case has its own follow-up strategy.

Moving from one state to the next is done with tb.transition(BeforeHtml).
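
The dispatch-and-transition pattern can also be sketched in isolation. The following is a simplified, hypothetical miniature: MiniBuilder, MiniState and the plain-string tokens are illustrative assumptions, not jsoup classes. It only shows how an enum constant can carry its own process method and hand a token on to the next state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the enum-per-state pattern; not jsoup's real classes.
final class MiniBuilder {
    MiniState state = MiniState.INITIAL;
    final List<String> log = new ArrayList<>();

    void transition(MiniState next) {
        state = next;
    }

    // Dispatch to whichever state is current, in the spirit of this.state.process(..).
    boolean process(String token) {
        return state.process(token, this);
    }
}

enum MiniState {
    INITIAL {
        @Override
        boolean process(String token, MiniBuilder mb) {
            if (token.trim().isEmpty()) {
                return true; // ignore whitespace, stay in INITIAL
            }
            if (token.startsWith("<!DOCTYPE")) {
                mb.log.add("doctype");
                mb.transition(BEFORE_HTML);
                return true;
            }
            // Anything else: switch state, then re-process the same token there.
            mb.transition(BEFORE_HTML);
            return mb.process(token);
        }
    },
    BEFORE_HTML {
        @Override
        boolean process(String token, MiniBuilder mb) {
            mb.log.add("before-html:" + token);
            return true;
        }
    };

    abstract boolean process(String token, MiniBuilder mb);
}
```

Feeding "  " and then "<html>" to a fresh MiniBuilder leaves it in BEFORE_HTML, with the second token handled by that state rather than by INITIAL.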

  • HTML parsing state machine
    <!-- State: Initial -->
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <!-- State: BeforeHtml -->
    <html lang='zh-CN' xml:lang='zh-CN' xmlns='http://www.w3.org/1999/xhtml'>
    <!-- State: BeforeHead -->
    <head>
      <!-- State: InHead -->
      <script type="text/javascript">
      //<!-- State: Text -->
        function xx(){
        }
      </script>
      <noscript>
        <!-- State: InHeadNoscript -->
        Your browser does not support JavaScript!
      </noscript>
    </head>
    <!-- State: AfterHead -->
    <body>
    <!-- State: InBody -->
    <textarea>
        <!-- State: Text -->
        xxx
    </textarea>
    <table>
        <!-- State: InTable -->
        <!-- State: InTableText -->
        xxx
        <tbody>
        <!-- State: InTableBody -->
        </tbody>
        <tr>
            <!-- State: InRow -->
            <td>
                <!-- State: InCell -->
            </td>
        </tr>    
    </table>
    </body>
    </html>

  • HTML parse tree

Given the following code:

        String html = "<html><head><title>First!</title></head><body><p>First post! <img src=\"foo.png\" /></p></body></html>";
        Document doc = Jsoup.parse(html);

The resulting HTML parse tree (Document):

doc->childNodes: 
        <html><head>...</head><body>...</body></html>

doc->childNodes->childNodes
        <head>...</head>
        <body>...</body>

Continuing to descend through childNodes level by level yields the complete tree.
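
To illustrate the descent through childNodes, here is a minimal sketch using a hypothetical stand-in class, SimpleNode, rather than jsoup's real Node (which carries attributes, text and parent links as well). It prints a subtree as an indented outline by recursing into each child list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed node; jsoup's real Node class is richer.
final class SimpleNode {
    final String name;
    final List<SimpleNode> childNodes = new ArrayList<>();

    SimpleNode(String name) {
        this.name = name;
    }

    // Adds a child and returns this node so calls can be chained.
    SimpleNode append(SimpleNode child) {
        childNodes.add(child);
        return this;
    }

    // Renders the subtree as an indented outline, one node per line.
    static void outline(SimpleNode node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.name).append('\n');
        for (SimpleNode child : node.childNodes) {
            outline(child, depth + 1, out);
        }
    }
}
```

Building a tree shaped like the example (document, html, head with title, body with p and img) and calling outline on the root lists each node under its parent, one indentation level per childNodes step.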

  • HTML parsing

Token token = tokeniser.read(); is the tokenizing step: it splits the input into tokens such as <html>.

protected void runParser() {
   while (true) {
       Token token = tokeniser.read();
       process(token);

       if (token.type == Token.TokenType.EOF)
           break;
   }
}
  • getElementsByTag("a")

Finds the elements whose tagName is a.

new NodeTraversor(new Accumulator(root, elements, eval)).traverse(root);

NodeTraversor walks the generated document one node at a time. For each node it reaches, the head method of Accumulator checks whether it is an element with tagName a and, if so, adds it to the elements variable.
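
The split between "something that walks" and "something that decides" can be sketched as follows. All names here (TagNode, TagVisitor, TagCollector, TagWalker) are hypothetical, and the walk uses plain recursion for brevity, which is a simplification and not necessarily how jsoup's NodeTraversor is written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical miniature of the traversor/accumulator split; names are illustrative.
final class TagNode {
    final String tagName;
    final List<TagNode> children = new ArrayList<>();

    TagNode(String tagName, TagNode... kids) {
        this.tagName = tagName;
        for (TagNode kid : kids) {
            children.add(kid);
        }
    }
}

interface TagVisitor {
    void head(TagNode node, int depth); // called when a node is first reached
}

final class TagCollector implements TagVisitor {
    private final Predicate<TagNode> eval;
    final List<TagNode> matches = new ArrayList<>();

    TagCollector(Predicate<TagNode> eval) {
        this.eval = eval;
    }

    @Override
    public void head(TagNode node, int depth) {
        if (eval.test(node)) {
            matches.add(node); // keep only nodes the evaluator accepts
        }
    }
}

final class TagWalker {
    // Depth-first walk in document order; the visitor sees every node once.
    static void traverse(TagVisitor visitor, TagNode node, int depth) {
        visitor.head(node, depth);
        for (TagNode child : node.children) {
            traverse(visitor, child, depth + 1);
        }
    }

    static List<TagNode> byTag(TagNode root, String tag) {
        TagCollector collector = new TagCollector(n -> n.tagName.equals(tag));
        traverse(collector, root, 0);
        return collector.matches;
    }
}
```

Because the walker knows nothing about the matching rule, the same traversal can serve other queries by passing a different predicate.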

  • TokenQueue
TokenQueue tq = new TokenQueue("(one (two) three) four");
String guts = tq.chompBalanced('(', ')');
assertEquals("one (two) three", guts);

Use chompBalanced(char open, char close) to get the content between open and close.

do {
       if (isEmpty()) break;
       Character c = consume();
       if (last == 0 || last != ESC) {
           if (c.equals(open))
               depth++;
           else if (c.equals(close))
               depth--;
       }

       if (depth > 0 && last != 0)
           accum.append(c); // don't include the outer match pair in the return
       last = c;
   } while (depth > 0);

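Each character matching open increments depth; each character matching close decrements it. The characters appended along the way are the content to extract.

If the input runs out while depth is still not 0, that is, no matching close was found, everything consumed so far is returned.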

TokenQueue tq = new TokenQueue("(one (two three) four");
String guts = tq.chompBalanced('(', ')');
// with the loop shown above: guts = one (two three) four
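
For experimenting with the depth counting outside jsoup, the loop can be restated as a standalone method. This is a sketch under stated assumptions: the class name BalancedSketch and the method chomp are hypothetical, the input is a plain String rather than a TokenQueue, and the escape character is assumed to be a backslash. The condition last == 0 || last != ESC from the loop is written as last != ESC, which is equivalent.

```java
// Standalone sketch of the loop above; class and method names are hypothetical.
final class BalancedSketch {
    private static final char ESC = '\\'; // assumed escape character

    // Returns the text between the first balanced open/close pair, or everything
    // after the opening character if the input ends before depth returns to 0.
    static String chomp(String input, char open, char close) {
        StringBuilder accum = new StringBuilder();
        int depth = 0;
        char last = 0;
        for (int pos = 0; pos < input.length(); pos++) {
            char c = input.charAt(pos);
            if (last != ESC) { // an escaped bracket does not change depth
                if (c == open) {
                    depth++;
                } else if (c == close) {
                    depth--;
                }
            }
            if (depth > 0 && last != 0) {
                accum.append(c); // the outer pair itself is never appended
            }
            last = c;
            if (depth <= 0) {
                break; // balanced pair closed, or input did not start with open
            }
        }
        return accum.toString();
    }
}
```

The final closing bracket is dropped because depth has already fallen to 0 when it is reached, and the first opening bracket is dropped because last is still 0 at that point.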