Download Classifying and Searching Hidden-Web Text Databases by Panagiotis G. Ipeirotis PDF

By Panagiotis G. Ipeirotis

The World-Wide Web continues to grow rapidly, which makes exploiting all available information a challenge. Search engines such as Google index an unprecedented amount of information, but still do not provide access to valuable content in text databases "hidden" behind search interfaces. For example, current search engines largely ignore the contents of the Library of Congress, the U.S. Patent and Trademark database, newspaper archives, and many other valuable sources of information because their contents are not "crawlable." However, users should be able to find the information that they want with as little effort as possible, regardless of whether this information is crawlable or not. As a significant step towards this goal, we have designed algorithms that support browsing and searching, the dominant ways of finding information on the web, over "hidden-web" text databases.

Similar algorithms and data structures books

Algorithmic Foundation of Multi-Scale Spatial Representation (2006)(en)(280s)

With the widespread use of GIS, multi-scale representation has become an important issue in the realm of spatial data handling. Focusing on geometric transformations, this resource provides comprehensive coverage of the low-level algorithms available for the multi-scale representations of different types of spatial features, including point clusters, individual lines, a class of lines, individual areas, and a class of areas.

INFORMATION RANDOMNESS & INCOMPLETENESS Papers on Algorithmic Information Theory

"One will locate [Information, Randomness and Incompleteness] all types of articles that are popularizations or epistemological reflections and displays which allow one to swiftly receive an exact inspiration of the topic and of a few of its functions (in specific within the organic domain). Very entire, it's endorsed to an individual who's drawn to algorithmic details thought.

A Method of Programming

Book by Dijkstra, Edsger W., Feijen, W. H. J., Sterringa, Joke

Additional info for Classifying and Searching Hidden-Web Text Databases

Example text

The “heterogeneous” databases, on the other hand, have documents from different categories that reside in the same level in the hierarchy (not necessarily siblings), with different mixture percentages. We believe that these databases model real-world text databases, with a variety of sizes and foci. These databases were indexed and queried by a SMART-based program [SM97] supporting both boolean and vector-space retrieval models. Web Database Set: We also evaluate our techniques on real web-accessible databases over which we do not have any control.

M, where pk is a conjunction of words and Clk ∈ C. A document d matches a rule pk → Clk if all the words in that rule’s antecedent, pk , appear in d. If a document matches multiple rules with different classification decisions, the final classification decision depends on the specific implementation of the rule-based classifier. We can simulate the behavior of a rule-based classifier over all documents of a database by mapping each rule pk → Clk of the classifier into a boolean query qk that is the conjunction of all words in pk .
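The mapping from rules to boolean queries described above can be sketched in a few lines. This is only an illustration, not the book's implementation: the rules, category names, and the `count_matches` helper (a stand-in for a database's search interface that reports the number of matching documents) are all hypothetical.

```python
from collections import Counter

# Hypothetical rules pk -> Clk: (antecedent words, category).
RULES = [
    (("ibm", "computer"), "Computers"),
    (("jordan", "bulls"), "Sports"),
    (("cancer",), "Health"),
]

def matches(document_words, antecedent):
    """A document matches a rule if every antecedent word appears in it."""
    return all(word in document_words for word in antecedent)

def count_matches(documents, antecedent):
    """Assumed stand-in for a boolean search interface: returns the number
    of documents that satisfy the conjunctive query built from a rule."""
    return sum(
        1 for doc in documents
        if matches(set(doc.lower().split()), antecedent)
    )

def probe(documents, rules):
    """Issue one conjunctive query per rule and add up the match counts
    per category, without retrieving the documents themselves."""
    counts = Counter()
    for antecedent, category in rules:
        counts[category] += count_matches(documents, antecedent)
    return counts

if __name__ == "__main__":
    docs = [
        "IBM sells a computer",
        "Jordan led the Bulls",
        "cancer research",
        "ibm computer cancer",
    ]
    print(probe(docs, RULES))
```

Note that a document matching rules from several categories is counted once per matching rule here; as the text says, how such conflicts are resolved depends on the specific classifier.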

2 Classifying Databases through Probing

The elimination of words dictated by Zipf’s law is a form of feature selection. However, frequency information alone is not, after some point, a good indicator to drive the feature selection process further. Thus, we use an information theoretic feature selection algorithm that eliminates the terms that have the least impact on the class distribution of documents [KS97, KS96]. , words that are not strongly associated with one specific category) or features that are redundant given the presence of another feature.
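A rough sketch of this two-stage idea follows: a frequency cut first, then an information-theoretic ranking. It uses plain information gain as a simpler stand-in and is not the [KS97, KS96] algorithm; in particular, it does not detect features that are redundant given another feature. The thresholds `min_df`, `max_df_ratio`, and `keep` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, term):
    """Reduction in class entropy from knowing whether `term` occurs.
    `docs` is a list of (set_of_words, category) pairs."""
    n = len(docs)
    classes = Counter(cat for _, cat in docs)
    with_term = Counter(cat for words, cat in docs if term in words)
    without_term = Counter(cat for words, cat in docs if term not in words)
    conditional = sum(
        sum(part.values()) / n * entropy(part.values())
        for part in (with_term, without_term)
    )
    return entropy(classes.values()) - conditional

def select_features(docs, min_df=2, max_df_ratio=0.9, keep=2):
    """Stage 1: drop very rare and very common words by document frequency.
    Stage 2: keep the `keep` surviving words with the highest information gain."""
    n = len(docs)
    df = Counter(word for words, _ in docs for word in words)
    candidates = [w for w, c in df.items() if c >= min_df and c / n <= max_df_ratio]
    ranked = sorted(candidates, key=lambda w: (-information_gain(docs, w), w))
    return ranked[:keep]

if __name__ == "__main__":
    docs = [
        ({"the", "cpu", "memory"}, "Computers"),
        ({"the", "cpu", "disk"}, "Computers"),
        ({"the", "goal", "team"}, "Sports"),
        ({"the", "goal", "score"}, "Sports"),
    ]
    print(select_features(docs))
```

In this toy collection, "the" appears in every document and carries no class information, while words that occur in only one document are removed by the frequency cut before the ranking step.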
