Istella News platform
Istella News is a platform for crawling, clustering, indexing and searching news sources. It collects news articles in several formats (RSS, Atom, HTML) from a large number of heterogeneous sources.
News clustering is implemented with an agglomerative algorithm that links each document to its most similar one. The similarity between documents is based mainly on the following features:
- simple terms (with or without stemming)
- pairs of uppercase terms
- named entities
- shingles with variable size
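As an illustration only, the linking step can be sketched with a single one of the features above, word shingles, compared by Jaccard similarity. The function names, the shingle size and the threshold are assumptions made for this example; the production system combines several features and its actual scoring is not described here.

```python
def shingles(text, size=2):
    """Return the set of word shingles (contiguous word n-grams) of a text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(docs, size=2, threshold=0.2):
    """Link each document to its most similar one (if similar enough)
    and return the resulting clusters as sorted lists of indices."""
    sets = [shingles(d, size) for d in docs]
    parent = list(range(len(docs)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(docs)):
        best_j, best_sim = None, 0.0
        for j in range(len(docs)):
            if i == j:
                continue
            sim = jaccard(sets[i], sets[j])
            if sim > best_sim:
                best_j, best_sim = j, sim
        # Hypothetical cut-off: weak links do not merge clusters.
        if best_j is not None and best_sim >= threshold:
            parent[find(i)] = find(best_j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


docs = [
    "Rome mayor announces new metro line",
    "New metro line announced by Rome mayor today",
    "Stock markets fall sharply in Asia",
    "Asia stock markets fall sharply after rate decision",
]
print(cluster(docs))
```

With these four headlines the two metro stories end up in one cluster and the two market stories in another. Comparing every pair of documents is quadratic, so a sketch like this only suits small batches.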
In addition to the retrieval and clustering functions, the Istella News System includes the following features:
- detection of the most important text blocks and images;
- deduplication of similar images;
- geolocation of content;
- detection of trending topics;
- incremental indexing and real-time search;
- support for multiple languages (including Italian, English and Arabic);
- a news suggestion module that analyzes recent news and suggests the most relevant articles while the user is typing the query;
- administration backend tools to refine the cluster aggregation, configure the news feeds, remove unwanted images, edit the top named entities, etc.
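To give an idea of how suggestion while typing could behave, here is a minimal sketch. It assumes that every typed token must be a prefix of some word in the title, that only articles inside a recent time window are considered, and that the newest matches come first. The `Article` class, the `suggest` function and the 48-hour window are hypothetical and do not describe the actual module or its relevance ranking.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Article:
    title: str
    published: datetime


def suggest(articles, typed, now, window=timedelta(hours=48), limit=5):
    """Return titles of recent articles matching the partially typed query."""
    tokens = typed.lower().split()
    if not tokens:
        return []
    hits = []
    for article in articles:
        if now - article.published > window:
            continue  # outside the assumed "recent news" window
        words = article.title.lower().split()
        # Every typed token must be a prefix of at least one title word.
        if all(any(w.startswith(t) for w in words) for t in tokens):
            hits.append(article)
    # Assumption: recency stands in for relevance in this sketch.
    hits.sort(key=lambda a: a.published, reverse=True)
    return [a.title for a in hits[:limit]]


now = datetime(2024, 1, 10, 12, 0)
articles = [
    Article("Rome metro line opens", now - timedelta(hours=1)),
    Article("Metro strike in Milan", now - timedelta(hours=5)),
    Article("Metro expansion plan", now - timedelta(days=10)),
    Article("Stock markets fall", now - timedelta(hours=2)),
]
print(suggest(articles, "met", now))
```

Typing "met" returns the two recent metro headlines, newest first, and leaves out the ten-day-old one. A real deployment would more likely query the incremental index than scan a list in memory.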
The current service provides access to news collected from over 1800 categorized feeds originating from more than 700 sources. It also provides access to a historical archive of more than 40 million articles collected over the last ten years.
The system can easily be extended to meet customers' needs by including additional information sources in multiple languages.