White Papers

Fast Indexes and Algorithms for Set Similarity Selection Queries

Overview

Data collections often contain inconsistencies that arise for a variety of reasons, and it is desirable to identify and resolve them efficiently. Set similarity queries are commonly used in data cleaning to match similar data. This paper concentrates on set similarity selection queries: given a query set, retrieve all sets in a collection whose similarity to the query is greater than some threshold. Various set similarity measures have been proposed in the past for data cleaning purposes. This paper concentrates on weighted similarity functions like TF/IDF, and introduces variants that are well suited for set similarity selections in a relational database context. These variants have special semantic properties that can be exploited to design very efficient index structures and algorithms for answering such queries.
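To make the query definition concrete, here is a minimal sketch of a set similarity selection using an IDF-weighted Jaccard measure and a plain inverted index. It is not the paper's method: the specific measure, the weighting formula `log(1 + N / df)`, and all function names (`idf_weights`, `weighted_jaccard`, `build_inverted_index`, `select_similar`) are assumptions chosen for illustration, and the paper's own variants and index structures are not reproduced here.

```python
import math
from collections import defaultdict


def idf_weights(collection):
    """Assumed weighting: each token gets log(1 + N / df), df = number of sets containing it."""
    n = len(collection)
    df = defaultdict(int)
    for s in collection:
        for tok in s:
            df[tok] += 1
    return {tok: math.log(1 + n / count) for tok, count in df.items()}


def weighted_jaccard(a, b, w):
    """Weight of shared tokens divided by weight of all tokens in either set.

    Tokens that never occur in the collection get weight 0 (an illustrative choice).
    """
    union = sum(w.get(t, 0.0) for t in a | b)
    if union == 0:
        return 0.0
    inter = sum(w.get(t, 0.0) for t in a & b)
    return inter / union


def build_inverted_index(collection):
    """Map each token to the positions of the sets that contain it."""
    index = defaultdict(list)
    for i, s in enumerate(collection):
        for tok in s:
            index[tok].append(i)
    return index


def select_similar(query, collection, index, w, threshold):
    """Return positions of sets whose similarity to `query` is strictly greater than `threshold`."""
    # Only sets sharing at least one token with the query can have nonzero similarity.
    candidates = set()
    for tok in query:
        candidates.update(index.get(tok, ()))
    return sorted(
        i for i in candidates
        if weighted_jaccard(query, collection[i], w) > threshold
    )


if __name__ == "__main__":
    collection = [
        {"at", "t", "labs"},
        {"at", "t", "research"},
        {"ibm", "research"},
    ]
    w = idf_weights(collection)
    index = build_inverted_index(collection)
    print(select_similar({"at", "t", "labs"}, collection, index, w, 0.5))
```

The inverted index here only prunes sets that share no token with the query; every remaining candidate is still scored in full. Exploiting properties of the similarity function to avoid that full scoring is the kind of optimization the overview attributes to the paper's index structures.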

Publisher: AT&T Intellectual Property
File Format: PDF
Date Published: May 29, 2009
Format: White Papers
Topics: Software Engineering

Similar White Papers

High Level Best Practices in Software Configuration Management

When deploying new software configuration management (SCM) tools, implementers sometimes focus on perfecting fine-graine

Publisher: Perforce Software  |  Tags: management, software

Software Configuration Management: The Foundation of Global Distributed Development Today

By distributing development, you can create a collaborative work environment staffed by the best developers you can hire

Publisher: Perforce Software  |  Tags: developers, it department, network

BMC Best Practice Process Flow for Release Management

The Release Management process consists of four procedures. The first procedure is called "Request for Change Handling".

Publisher: BMC Software

Application Lifecycle Management With ClearQuest 7.1.0.0

This overview of the concepts and design goals behind an out-of-the-box Application Lifecycle Management (ALM) solution

Publisher: IBM  |  Tags: software

White Paper: Tips for Writing Good Use Cases

Writing good use cases is more of an art than a science. In this IBM Rational white paper "Tips for writing good use cas

Publisher: IBM  |  Tags: software

AT&T Intellectual Property White Papers

On-Demand Webcast: Lowering the TCO for Business Applications

Critical business applications require continuous care and investment. Now, more than ever, enterprises must ensure thei

Publisher: AT&T Intellectual Property  |  Tags: applications, business applications, infrastructure, network, real-time, tco

Design, Implementation and Operation of a Large Enterprise Content Distribution Network

Content Distribution Networks (CDNs) are becoming an important resource in enterprise networks. They are being used in a

Publisher: AT&T Intellectual Property  |  Tags: applications

On-Demand Webcast: When Is VPLS the Right Choice?

Networking is constantly evolving, bringing new and demanding applications for today's enterprises to manage. Making WAN

Publisher: AT&T Intellectual Property  |  Tags: applications, network, wan

FastRWeb: Fast Interactive Web Framework for Data Mining Using R

R is widely used and accepted as a very versatile tool for statistical computing and data analysis. It provides a pletho

Publisher: AT&T Intellectual Property  |  Tags: computing, data, infrastructure

On-Demand Webcast: Comparing WAN Technology Choices

As new applications are added to the network, how do you choose the right networking solution? Frame Relay, ATM, Ethernet, MPLS,

Publisher: AT&T Intellectual Property  |  Tags: applications, atm, ethernet, ip, mpls, network, wan