White Papers
Extracting Structure From HTML Documents for Language Visualization and Analysis
Category: Data Management, Software and Web Development
Tags: pdf
Overview: Document analysis is shifting from document image analysis to the analysis of electronic documents, especially those available on the Web in HTML and PDF formats. This paper analyzes a 250M-word collection of HTML-formatted papers from the American Society for Microbiology, with the ultimate goal of supporting query answering and information extraction. Each document is converted to a sequence of token-id items by an invertible process called Extreme Tokenization. A lexicon is constructed with attributes such as token string, tag, and capitalized. An XML descriptive structure is built using JAXB 1.0, and sentence boundaries are discovered. Language framework patterns are then visualized in a custom Framework Viewer to identify important patterns of expression for further analysis.
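The overview does not spell out how Extreme Tokenization works, so the following is only a minimal sketch of the general idea of an invertible tokenizer backed by a lexicon. The token classes in `TOKEN_RE`, the `Lexicon` and `LexiconEntry` names, and the reading of "tag" as "this token is an HTML tag" are all assumptions made for illustration, not the paper's actual design.

```python
import re
from dataclasses import dataclass

# Assumed token classes (not from the paper): a whole HTML tag, a run of
# word characters, a run of whitespace, or any other single character.
# Because "." matches any character, every character lands in some token.
TOKEN_RE = re.compile(r"<[^<>]+>|\w+|\s+|.", re.DOTALL)


@dataclass(frozen=True)
class LexiconEntry:
    """One distinct token string with a few illustrative attributes."""
    token_id: int
    text: str
    is_tag: bool        # assumed meaning of the "tag" attribute
    capitalized: bool


class Lexicon:
    """Assigns a stable integer id to each distinct token string."""

    def __init__(self):
        self.entries = []   # token_id -> LexiconEntry
        self._ids = {}      # token string -> token_id

    def id_for(self, text):
        if text not in self._ids:
            token_id = len(self.entries)
            self._ids[text] = token_id
            self.entries.append(LexiconEntry(
                token_id=token_id,
                text=text,
                is_tag=len(text) > 2 and text[0] == "<" and text[-1] == ">",
                capitalized=text[:1].isupper(),
            ))
        return self._ids[text]


def encode(document, lexicon):
    """Convert a document into a sequence of token ids."""
    return [lexicon.id_for(m.group(0)) for m in TOKEN_RE.finditer(document)]


def decode(token_ids, lexicon):
    """Rebuild the original document text from its token ids."""
    return "".join(lexicon.entries[i].text for i in token_ids)


sample = "<p>Escherichia coli grows.</p>\n"
lexicon = Lexicon()
ids = encode(sample, lexicon)
restored = decode(ids, lexicon)
```

Since whitespace and markup are kept as tokens rather than discarded, `decode(encode(text))` returns the input unchanged, which is the property the word "invertible" refers to. A production tokenizer would need more careful handling of stray `<` and `>` characters in running text.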
- Publisher: Northeastern University
- File Format:
- Date Published: Dec 4, 2008
- Format: White Papers
- Topics: HTML, Data Visualization, Data Mining - Analysis