White Papers

Extracting Structure From HTML Documents for Language Visualization and Analysis

Overview

Document analysis is shifting from document image analysis to the analysis of electronic documents, especially those available on the Web in HTML and PDF formats. This paper analyzes a 250M-word collection of HTML-formatted papers from the American Society for Microbiology, with the ultimate goal of supporting query answering and information extraction. Each document is converted to a sequence of token-id items by an invertible process called Extreme Tokenization. A lexicon is constructed with attributes including token string, tag, capitalized, etc. An XML descriptive structure is built using JAXB 1.0. Sentence boundaries are discovered. Language framework patterns are visualized in a custom Framework Viewer to identify important patterns of expression for further analysis.
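The overview describes tokenization as invertible and mentions a lexicon with per-token attributes. The sketch below illustrates that general idea only; it is not the paper's Extreme Tokenization algorithm. The regular expression, the function names (tokenize, build_lexicon, encode, decode), and the attribute names (is_tag, capitalized) are all assumptions made for illustration. The key property shown is that every character of the input, including whitespace and markup, lands in some token, so joining the tokens reproduces the original text exactly.

```python
import re

# Assumed token pattern (not from the paper): an HTML tag, a run of word
# characters, a run of whitespace, or any other single character.
# Because the last alternative matches any character, nothing is dropped.
TOKEN_RE = re.compile(r"<[^>]*>|\w+|\s+|.", re.DOTALL)


def tokenize(text):
    """Split text into tokens whose concatenation equals the input."""
    return TOKEN_RE.findall(text)


def build_lexicon(tokens):
    """Map each distinct token string to an entry with an id and attributes."""
    lexicon = {}
    for tok in tokens:
        if tok not in lexicon:
            lexicon[tok] = {
                "id": len(lexicon),  # ids are assigned in order of first use
                "token": tok,
                "is_tag": tok.startswith("<") and tok.endswith(">"),
                "capitalized": tok[:1].isupper(),
            }
    return lexicon


def encode(text):
    """Return the document as a list of token ids plus its lexicon."""
    tokens = tokenize(text)
    lexicon = build_lexicon(tokens)
    ids = [lexicon[tok]["id"] for tok in tokens]
    return ids, lexicon


def decode(ids, lexicon):
    """Invert encode(): look up each id and join the token strings."""
    by_id = {entry["id"]: entry["token"] for entry in lexicon.values()}
    return "".join(by_id[i] for i in ids)


if __name__ == "__main__":
    sample = "<p>Bacteria grow.</p>\n"
    ids, lexicon = encode(sample)
    print(ids)
    print(decode(ids, lexicon) == sample)
```

Keeping whitespace and tags as ordinary tokens is what makes the round trip lossless in this sketch; a tokenizer that discarded them could not reconstruct the source document from the id sequence alone.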

Publisher: Northeastern University
File Format: PDF
Date Published: Dec 4, 2008
Format: White Papers
Topics: HTML, Data Visualization, Data Mining - Analysis

Northeastern University White Papers

On the Performance of IEEE 802.11 Under Jamming

This paper studies the performance of the IEEE 802.11 MAC protocol under a range of jammers that covers both channel-obl

Publisher: Northeastern University

Neural Network Based Fault Diagnostics of Industrial Robots Using Wavelet Multi-Resolution Analysis

A multi-resolution wavelet analysis coupled with a neural network based approach is applied in the problem of fault diag

Publisher: Northeastern University  |  Tags: data, network, robots

Mobility Models for Ad Hoc Network Simulation

This paper proposes a novel general technique, based on renewal theory, for analyzing mobility models in ad hoc networks

Publisher: Northeastern University  |  Tags: mobility

Preprocessing DNS Log Data for Effective Data Mining

The Domain Name Service (DNS) provides a critical function in directing Internet traffic. Defending DNS servers from ban

Publisher: Northeastern University  |  Tags: data, dns servers, server

NEUStore (Version 1.4): A Simple Java Package for the Construction of Disk-Based, Paginated, and Buffered Indices

This paper describes NEUStore, a Java package that aims to support the development of disk-based, paginated, and buffere

Publisher: Northeastern University  |  Tags: java