TokenType
├── Doctype: name, publicIdentifier, systemIdentifier
├── StartTag: tagName, pendingAttributeName, pendingAttributeValue, attributes
├── EndTag: tagName, pendingAttributeName, pendingAttributeValue
├── Comment: data
├── Character: data
└── EOF
For example:
Initial {
    boolean process(Token t, HtmlTreeBuilder tb) {
        if (isWhitespace(t)) {
            return true; // ignore whitespace
        } else if (t.isComment()) {
            tb.insert(t.asComment());
        } else if (t.isDoctype()) {
            // todo: parse error check on expected doctypes
            // todo: quirk state check on doctype ids
            Token.Doctype d = t.asDoctype();
            DocumentType doctype = new DocumentType(d.getName(), d.getPublicIdentifier(), d.getSystemIdentifier(), tb.getBaseUri());
            tb.getDocument().appendChild(doctype);
            if (d.isForceQuirks())
                tb.getDocument().quirksMode(Document.QuirksMode.quirks);
            tb.transition(BeforeHtml);
        } else {
            // todo: check not iframe srcdoc
            tb.transition(BeforeHtml);
            return tb.process(t); // re-process token
        }
        return true;
    }
},
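On entering the Initial state, only four kinds of token can occur: whitespace, comment, doctype, or anything else. Each case defines its own follow-up handling. Moving from the current state to the next one is done with tb.transition(BeforeHtml). The annotated page below shows which state the tree builder is in at each point: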
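To make the dispatch-and-transition pattern concrete, here is a minimal, self-contained sketch. It is not jsoup code: StateMachineSketch, Kind, State, Builder and run are hypothetical names invented for illustration, and only two states are modeled. It assumes the same shape as the snippet above: each state is an enum constant with its own process method, and the "anything else" branch switches state and then re-processes the same token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a tree-builder state machine (not jsoup's classes).
final class StateMachineSketch {
    enum Kind { WHITESPACE, COMMENT, DOCTYPE, OTHER }

    enum State {
        INITIAL {
            @Override
            boolean process(Kind token, Builder tb) {
                switch (token) {
                    case WHITESPACE:
                        return true; // ignored, stay in INITIAL
                    case COMMENT:
                        tb.log.add("Initial:insert comment");
                        return true;
                    case DOCTYPE:
                        tb.log.add("Initial:append doctype");
                        tb.transition(BEFORE_HTML);
                        return true;
                    default:
                        tb.transition(BEFORE_HTML);
                        return tb.process(token); // re-process the token in the new state
                }
            }
        },
        BEFORE_HTML {
            @Override
            boolean process(Kind token, Builder tb) {
                tb.log.add("BeforeHtml:" + token);
                return true;
            }
        };

        abstract boolean process(Kind token, Builder tb);
    }

    static final class Builder {
        State state = State.INITIAL;
        final List<String> log = new ArrayList<>();

        void transition(State next) { state = next; }

        // Delegates to whichever state is current at the time of the call.
        boolean process(Kind token) { return state.process(token, this); }
    }

    // Feeds the tokens through the machine and returns a log of what happened.
    static List<String> run(Kind... tokens) {
        Builder tb = new Builder();
        for (Kind t : tokens) tb.process(t);
        tb.log.add("final:" + tb.state);
        return tb.log;
    }

    public static void main(String[] args) {
        System.out.println(run(Kind.WHITESPACE, Kind.COMMENT, Kind.OTHER));
    }
}
```

The point of the design is that a state never needs to know how its successor handles a token; it only changes the current state and hands the token back to the builder.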
<!-- State: Initial -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- State: BeforeHtml -->
<html lang='zh-CN' xml:lang='zh-CN' xmlns='http://www.w3.org/1999/xhtml'>
<!-- State: BeforeHead -->
<head>
<!-- State: InHead -->
<script type="text/javascript">
//<!-- State: Text -->
function xx(){
}
</script>
<noscript>
<!-- State: InHeadNoscript -->
Your browser does not support JavaScript!
</noscript>
</head>
<!-- State: AfterHead -->
<body>
<!-- State: InBody -->
<textarea>
<!-- State: Text -->
xxx
</textarea>
<table>
<!-- State: InTable -->
<!-- State: InTableText -->
xxx
<tbody>
<!-- State: InTableBody -->
</tbody>
<tr>
<!-- State: InRow -->
<td>
<!-- State: InCell -->
</td>
</tr>
</table>
</html>
Given the following code:
String html = "<html><head><title>First!</title></head><body><p>First post! <img src=\"foo.png\" /></p></body></html>";
Document doc = Jsoup.parse(html);
the resulting HTML parse tree (Document) looks like this:
doc->childNodes:
<html><head>...</head><body>...</body></html>
doc->childNodes->childNodes
<head>...</head>
<body>...</body>
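Descending through childNodes level by level in this way yields the complete tree.
The call Token token = tokeniser.read(); is where tokenizing happens: it splits the input into tokens such as <html>.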
protected void runParser() {
    while (true) {
        Token token = tokeniser.read();
        process(token);
        if (token.type == Token.TokenType.EOF)
            break;
    }
}
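The read-process-until-EOF loop can be tried out with a toy tokenizer. This is only a sketch under simplifying assumptions: TokenLoopSketch, ToyTokeniser and tokenize are hypothetical names, the tokenizer recognizes nothing but plain start tags, end tags and text (no attributes, comments or doctypes), and recording the token in a list stands in for process(token).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy tokenizer; jsoup's real Tokeniser is a much larger state machine.
final class TokenLoopSketch {
    enum TokenType { StartTag, EndTag, Character, EOF }

    static final class Token {
        final TokenType type;
        final String data;
        Token(TokenType type, String data) { this.type = type; this.data = data; }
        @Override public String toString() { return type + "(" + data + ")"; }
    }

    static final class ToyTokeniser {
        private final String input;
        private int pos = 0;
        ToyTokeniser(String input) { this.input = input; }

        // Returns the next token; once the input is exhausted it keeps returning EOF.
        Token read() {
            if (pos >= input.length()) return new Token(TokenType.EOF, "");
            int close = input.indexOf('>', pos);
            if (input.charAt(pos) == '<' && close > pos) {
                String inner = input.substring(pos + 1, close);
                pos = close + 1;
                return inner.startsWith("/")
                        ? new Token(TokenType.EndTag, inner.substring(1))
                        : new Token(TokenType.StartTag, inner);
            }
            // Character data runs up to the next '<' or the end of the input.
            int next = input.indexOf('<', pos + 1);
            if (next < 0) next = input.length();
            String text = input.substring(pos, next);
            pos = next;
            return new Token(TokenType.Character, text);
        }
    }

    // Same loop shape as runParser(): read, handle, stop after EOF has been handled.
    static List<String> tokenize(String html) {
        ToyTokeniser tokeniser = new ToyTokeniser(html);
        List<String> seen = new ArrayList<>();
        while (true) {
            Token token = tokeniser.read();
            seen.add(token.toString()); // stands in for process(token)
            if (token.type == TokenType.EOF) break;
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<html><p>Hi</p></html>"));
    }
}
```

Note that the EOF token is handled before the loop exits, just as in runParser(), so the consumer gets a chance to finish any open work.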
Finding the elements whose tagName is a:
new NodeTraversor(new Accumulator(root, elements, eval)).traverse(root);
NodeTraversor walks the generated document one element at a time. For each element it reaches, the head method of Accumulator checks whether it is the element being searched for (tagName a); if so, it is added to the elements variable.
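The traverse-and-collect idea can be sketched without jsoup. Everything below is a hypothetical miniature: Node, Visitor, traverse, Accumulator and countByTag are illustrative names, the tag name is compared directly instead of going through an evaluator, and the traversal is recursive for brevity (an assumption of this sketch, not a claim about how jsoup walks the tree).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature of a visitor-based tree walk that collects matching elements.
final class TraversalSketch {
    static final class Node {
        final String tagName;
        final List<Node> children;
        Node(String tagName, Node... children) {
            this.tagName = tagName;
            this.children = Arrays.asList(children);
        }
    }

    interface Visitor {
        void head(Node node, int depth); // called when a node is first reached
        void tail(Node node, int depth); // called after all of its children were visited
    }

    // Depth-first walk: head, then children, then tail.
    static void traverse(Visitor visitor, Node node, int depth) {
        visitor.head(node, depth);
        for (Node child : node.children) traverse(visitor, child, depth + 1);
        visitor.tail(node, depth);
    }

    static final class Accumulator implements Visitor {
        private final String wanted;
        private final List<Node> elements;
        Accumulator(String wanted, List<Node> elements) {
            this.wanted = wanted;
            this.elements = elements;
        }
        @Override public void head(Node node, int depth) {
            if (node.tagName.equals(wanted)) elements.add(node); // the match test lives in head
        }
        @Override public void tail(Node node, int depth) { /* nothing to do on the way out */ }
    }

    static int countByTag(Node root, String tagName) {
        List<Node> elements = new ArrayList<>();
        traverse(new Accumulator(tagName, elements), root, 0);
        return elements.size();
    }

    static Node sampleTree() {
        return new Node("html",
                new Node("head", new Node("title")),
                new Node("body", new Node("a"), new Node("p", new Node("a"))));
    }

    public static void main(String[] args) {
        System.out.println(countByTag(sampleTree(), "a"));
    }
}
```

Keeping the walk and the match test in separate classes means the same traversal can serve any selector; only the visitor changes.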
TokenQueue tq = new TokenQueue("(one (two) three) four");
String guts = tq.chompBalanced('(', ')');
assertEquals("one (two) three", guts);
chompBalanced(char open, char close) returns the content between the outermost open and its matching close. Its core loop is:
do {
    if (isEmpty()) break;
    Character c = consume();
    if (last == 0 || last != ESC) {
        if (c.equals(open))
            depth++;
        else if (c.equals(close))
            depth--;
    }
    if (depth > 0 && last != 0)
        accum.append(c); // don't include the outer match pair in the return
    last = c;
} while (depth > 0);
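Each character that matches open increments depth; each character that matches close decrements it. The characters appended to accum along the way are the content being extracted.
If the input runs out while depth is still non-zero, that is, no matching close was found, everything accumulated so far is returned: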
TokenQueue tq = new TokenQueue("(one (two three) four");
String guts = tq.chompBalanced('(', ')');
// guts = "one (two three) four" (the inner ')' only lowers depth from 2 to 1, so it is still appended)
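To check both cases by hand, the loop can be lifted into a standalone method. This is a sketch, not the TokenQueue class: ChompBalancedSketch is a hypothetical name, the method takes the whole input as a String instead of consuming from a queue, and it returns only the extracted text (the unconsumed remainder is not tracked).

```java
// Hypothetical standalone port of the depth-counting loop shown above.
final class ChompBalancedSketch {
    private static final char ESC = '\\';

    static String chompBalanced(String input, char open, char close) {
        StringBuilder accum = new StringBuilder();
        int depth = 0;
        char last = 0; // 0 means "no character consumed yet"
        int pos = 0;
        do {
            if (pos >= input.length()) break; // ran out of input, possibly unbalanced
            char c = input.charAt(pos++);
            if (last == 0 || last != ESC) {   // an escaped delimiter does not change depth
                if (c == open) depth++;
                else if (c == close) depth--;
            }
            if (depth > 0 && last != 0) accum.append(c); // skip the outer pair itself
            last = c;
        } while (depth > 0);
        return accum.toString();
    }

    public static void main(String[] args) {
        System.out.println(chompBalanced("(one (two) three) four", '(', ')'));
        System.out.println(chompBalanced("(one (two three) four", '(', ')'));
    }
}
```

In the balanced input the loop stops as soon as depth returns to 0, so " four" is never consumed. In the unbalanced input depth never drops below 1, so the loop only ends when the input is exhausted and everything after the first '(' is returned.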