TermExtractor is a FREE software package for Terminology Extraction. The software helps a web community
extract and validate the relevant terms of its domain of interest, by submitting an archive of
domain-related documents in any format.
Furthermore, TermExtractor is a very useful starting point for Domain Ontology construction,
Semantic Similarity, Knowledge Management, etc.,
since it allows the identification of domain-relevant terms, constituting the linguistic surface manifestation of domain concepts.
TermExtractor extracts the terminology consensually referred to in a specific application domain.
The software takes as input a corpus of domain documents, parses the documents, and extracts a
list of "syntactically plausible" terms (e.g. compounds, adjective-nouns, etc.).
Document parsing assigns greater importance to terms that appear in emphasized text layouts (title, abstract,
bold, italic, underlined, etc.).
Two entropy-based measures, called Domain Relevance and Domain Consensus, are then used to filter the candidates.
Domain Consensus selects only the terms that are consensually referred to throughout
the corpus documents. Domain Relevance selects only the terms that are relevant to the
domain of interest; it is computed with reference to a set of contrastive
terminologies from different domains. Finally, extracted terms are further filtered using
Lexical Cohesion, which measures the degree of association of all the words in a
terminological string.
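The two filtering measures can be illustrated with a minimal sketch. It assumes each document has already been reduced to a list of candidate terms. The consensus score is computed as the entropy of a term's frequency distribution across the domain documents, and the relevance score as a simple frequency ratio against contrastive domains. Both formulas, and the names `domain_consensus` and `domain_relevance`, are assumptions made for illustration and are not necessarily the exact definitions used by TermExtractor.

```python
import math


def domain_consensus(term, documents):
    """Entropy (in bits) of the term's distribution over the domain documents.

    `documents` is a list of documents, each a list of candidate terms.
    Assumed formula: a term spread evenly over many documents scores high,
    a term concentrated in a single document scores 0.
    """
    counts = [doc.count(term) for doc in documents]
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def domain_relevance(term, target_domain, corpora):
    """Relative frequency of the term in the target domain, divided by its
    highest relative frequency over all domains (target and contrastive).

    `corpora` maps a domain name to its list of documents.
    Assumed formula: 1.0 means the term is most typical of the target domain.
    """
    def relative_frequency(documents):
        total_terms = sum(len(doc) for doc in documents)
        hits = sum(doc.count(term) for doc in documents)
        return hits / total_terms if total_terms else 0.0

    frequencies = {name: relative_frequency(docs) for name, docs in corpora.items()}
    highest = max(frequencies.values())
    return frequencies[target_domain] / highest if highest > 0 else 0.0
```

In a filter built on these scores, a candidate term would be kept only if both values exceed chosen thresholds; the threshold values themselves would have to be tuned and are not specified here.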
See the help page for additional information about TermExtractor.
Details can also be found in:
F. Sclano and P. Velardi, "TermExtractor: a Web
Application to Learn the Common Terminology of
Interest Groups and Research Communities", 9th
Conf. on Terminology and Artificial Intelligence
(TIA 2007), Sophia Antipolis, October 2007.
Enter one document of at most 5 MB and START the terminology extraction process.
Accepted formats are: txt, pdf, ps, dvi, tex, doc, rtf, ppt,
xls, xml, html/htm, chm, wpd.
The document must not be encrypted and must be written in English.
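The size and format constraints above can be checked before submitting a document. The sketch below is a hypothetical client-side helper, not part of TermExtractor: the name `check_upload` is invented for illustration, and "5 MB" is assumed to mean 5 × 1024 × 1024 bytes. It does not check for encryption or language.

```python
from pathlib import Path

# Assumption: "5 MB" is interpreted as 5 * 1024 * 1024 bytes.
MAX_BYTES = 5 * 1024 * 1024

# Extensions taken from the accepted-formats list above.
ACCEPTED_EXTENSIONS = {
    "txt", "pdf", "ps", "dvi", "tex", "doc", "rtf", "ppt",
    "xls", "xml", "html", "htm", "chm", "wpd",
}


def check_upload(filename, size_bytes):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

For a file on disk, the size can be obtained with `Path(filename).stat().st_size` and passed as `size_bytes`.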