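I am currently looking into how to automatically fetch the information I want from the web, and then classify and archive it according to my own needs.
As usual, I first looked for open-source projects to borrow from. Building on other people's work is usually faster, although it does not always give good results.
Quite a few people seem to be following Jericho HTML Parser, so it is probably a solid tool. I ran a quick test and it covers the basic features I need, so I decided to study it more closely: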
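Introduction to Jericho HTML Parser:
Jericho HTML Parser is a simple yet powerful Java library for parsing HTML. It can analyse and manipulate parts of an HTML document, including common server-side tags, while reproducing any unrecognised or invalid HTML verbatim. It also provides high-level functions for working with HTML forms.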
The author's home page: http://jericho.htmlparser.net/docs/index.html
Project information: http://jerichohtml.sourceforge.net/
The original author's introduction is quoted below, with thanks to the author of Jericho HTML Parser:
Jericho HTML Parser is a java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.
It is an open source library released under both the Eclipse Public License (EPL) and GNU Lesser General Public License (LGPL). You are therefore free to use it in commercial applications subject to the terms detailed in either one of these licence documents.
The javadocs provide comprehensive documentation of the entire API, as well as being a very useful reference on aspects of HTML and XML in general.
Visit the SourceForge.net project page at http://sourceforge.net/projects/jerichohtml/ for downloads and support.
You can also rate the project highly at http://freshmeat.net/projects/jerichohtml/
Release notes for each version can be found in a file called release.txt in the project root directory.
The rest of this post explains and runs the Jericho HTML Parser samples, and adds some extensions and applications of my own.
The following sample programs are available. Let's try these simple samples; the source code of each one can be downloaded from its link:

| Sample | Description |
| --- | --- |
| DisplayAllElements.java | Demonstrates the behaviour of the library when retrieving all elements from a document containing a mix of normal HTML, different types of server tags, and badly formatted HTML. Note: it retrieves every element on the page, including some server-side tags. |
| FindSpecificTags.java | Demonstrates how to search for tags with a specified name, in a specified namespace, or special tags such as document type declarations, XML declarations, XML processing instructions, common server tags, PHP tags, Mason tags, and HTML comments. |
| ExtractText.java | Demonstrates the use of the TextExtractor class that extracts all of the text from a document, as well as the title, description, keywords and links. |
| RenderToText.java | Demonstrates the use of the Renderer class that performs a simple text rendering of HTML markup, similar to the way Mozilla Thunderbird and other email clients provide an automatic conversion of HTML content to text in their alternative MIME encoding of emails. Note: this gets the text of an HTML page, which is handy for full-text search applications such as Apache Lucene. |
| HTMLSanitiser.java | Demonstrates how to sanitise HTML containing unwanted or invalid tags into clean HTML. The unit test class for this functionality is available here. Note: it cleans out unwanted or invalid tags and normalises the markup with reference to the HTML standard; anything it misses can be covered by your own extensions. It could be extended into something like an HTML checker or an automatic code-correction engine. |
| StreamedSourceCopy.java | Demonstrates the use of the StreamedSource class by iterating through the parsed segments of a source document and creating an exact copy of it. Note: it copies the source. |
| FormControlDisplayCharacteristics.java | Demonstrates setting the display characteristics of individual form controls. This allows a control to be disabled, removed, or replaced with a plain text representation of its value (display value). The new document is written to a file called NewForm.html |
| FormFieldCSVOutput.java | Demonstrates the use of the FormFields.getColumnValues(Map) method to store form data in a .CSV file, automatically creating separate columns for fields that can contain multiple values (such as checkboxes). The output is written to a file called FormData.csv |
| FormFieldList.java | Demonstrates the use of the Segment.findFormFields() method to list all form fields and their associated controls in a document. |
| FormFieldSetValues.java | Demonstrates setting the values of form controls, which is best done via the FormFields object. The new document is written to a file called NewForm.html |
| FormatSource.java | Demonstrates the use of the SourceFormatter class that formats HTML source by laying out each non-inline-level element on a new line with an appropriate indent. Also known as a “source beautifier”. (Click here for an online demonstration) |
| CompactSource.java | Demonstrates the use of the SourceCompactor class that compacts HTML source by removing all white space. |
| Encoding.java | Demonstrates the use of the EncodingDetector class and how to determine the encoding of a source document. |
| SplitLongLines.java | Demonstrates how to reformat a document so that lines exceeding a certain number of characters are split into multiple lines. Note: it wraps long lines. |
| ConvertStyleSheets.java | Demonstrates how to detect all external style sheets and place them inline into the document. |
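
To get a feel for what the SplitLongLines sample is about, here is a minimal plain-Java sketch of the line-wrapping idea. It does not use Jericho and is not the code from the sample: the class name `SplitLongLinesSketch`, the methods `splitLine` and `split`, and the rule of breaking at the last space that fits are all my own assumptions. It also treats the input as plain text, so it knows nothing about tags or attributes and could break a line in a place that an HTML-aware tool would avoid.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jericho: greedy wrapping of plain text lines.
class SplitLongLinesSketch {

    // Splits one line into pieces of at most maxLen characters.
    // Prefers to break at the last space at or before index maxLen (the space is dropped);
    // falls back to a hard cut when no usable space exists.
    static List<String> splitLine(String line, int maxLen) {
        if (maxLen < 1) {
            throw new IllegalArgumentException("maxLen must be positive");
        }
        List<String> pieces = new ArrayList<>();
        String rest = line;
        while (rest.length() > maxLen) {
            int cut = rest.lastIndexOf(' ', maxLen);
            if (cut <= 0) {
                // No space to break at (or only a leading one): cut at maxLen.
                pieces.add(rest.substring(0, maxLen));
                rest = rest.substring(maxLen);
            } else {
                pieces.add(rest.substring(0, cut));
                rest = rest.substring(cut + 1);
            }
        }
        pieces.add(rest);
        return pieces;
    }

    // Applies splitLine to every line of a text, keeping empty lines as they are.
    static String split(String text, int maxLen) {
        List<String> out = new ArrayList<>();
        for (String line : text.split("\n", -1)) {
            out.addAll(splitLine(line, maxLen));
        }
        return String.join("\n", out);
    }

    public static void main(String[] args) {
        String text = "This is a fairly long line of text that should be wrapped somewhere.";
        System.out.println(split(text, 30));
    }
}
```

Breaking at a space keeps words intact where possible, and the hard cut guarantees the loop always makes progress even on a long run of characters with no spaces.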