Background

The ATLAS platform is made up of several independent software components – the KMS web content management system, the i-Publisher visualization layer for KMS, a Text Mining engine, a Search engine, and a Machine Translation engine. These components are described in detail below:

  1. KMS Content Management System

    The web content management system that will be used in this project as the basis of the ATLAS platform is the latest stable version of the ISN KMS (ISN Knowledge Management System). Tetracom has been working constantly on the ISN KMS since 2003 when the system was developed as a standalone solution for the International Relations and Security Network, ETH Zurich.

    ISN KMS was designed to address specific needs such as the capability to easily unify, categorize, retrieve, store, and package heterogeneous content items, and render them on different topic-specific multilingual web sites. Other requirements were also specified, such as the capacity of the system to support numerous classification criteria, reusable presentation layers, and tight security.

    Having been in production for nearly seven years, the system is well documented and has been thoroughly tested and evaluated both by users and by technical staff. Currently, the International Relations and Security Network supports more than 60 multilingual web sites, 430 output channels, and 5000 content classification criteria, and receives more than 80 000 page hits per day.

    The software components that make up the ISN KMS are distributed among six servers at ISN. We have run several stress tests using Apache JMeter in order to determine the capabilities of the system with this hardware configuration. Results demonstrated that the system can support 150 web sites, 1000 output channels, 10 000 classification criteria, more than 1.5 million content items, around 1.2 million page hits and 90 000 unique visitors per day with the following hardware resources:

    • Database server – 2 CPU x 2.66GHz, 4GB of system memory
    • Storage server – 15 hard drives x 146GB
    • Language processing server – 2 CPU x 3.2GHz, 8GB of system memory
    • Content management server – 2 CPU x 3.2GHz, 4GB of system memory
    • Two content publishing servers – 2 CPU x 2.66GHz, 3GB of system memory
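
    As a rough illustration of what these figures imply, the sketch below converts daily page hits into an average request rate and compares the stress-tested limits with the current production load quoted above. It only restates numbers from this section; the names used are illustrative, and an average rate says nothing about peak load, which is not specified here.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted in this section: (current production load, stress-tested limit).
# The production values are lower bounds ("more than ..."), so the ratios
# computed from them are upper bounds on the real headroom.
FIGURES = {
    "web sites": (60, 150),
    "output channels": (430, 1000),
    "classification criteria": (5000, 10_000),
    "page hits per day": (80_000, 1_200_000),
}


def average_rate_per_second(hits_per_day: int) -> float:
    """Average request rate implied by a daily hit count (ignores peaks)."""
    return hits_per_day / SECONDS_PER_DAY


def headroom(current: int, tested: int) -> float:
    """How many times the current load fits into the stress-tested limit."""
    return tested / current


if __name__ == "__main__":
    for name, (current, tested) in FIGURES.items():
        print(f"{name}: tested limit is x{headroom(current, tested):.2f} of current load")
    rate = average_rate_per_second(FIGURES["page hits per day"][1])
    print(f"tested page hits correspond to about {rate:.1f} requests per second on average")
```

    Under these figures, the tested configuration leaves the least headroom for classification criteria and the most for page hits.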

    Adding more physical servers will allow more users and content items to be easily handled because the system has been designed with horizontal scalability in mind. Indeed, the long-term usage of the system, the large number of multilingual web sites that it powers, and the large amounts of content that it handles prove its technical viability and potential.

  2. i-Publisher (KMS visualization layer)

    At the beginning of 2009, the i-Publisher visualization layer was designed to replace the existing IMI (Integrated Management Interface) that is still used to create and deploy web sites powered by KMS. i-Publisher is currently in a pre-prototype phase and, after appropriate adjustments and integration within the KMS, it will become an integral part of the ATLAS platform, providing an intuitive web-based point-and-click graphical user interface for web site building.

    A major advantage of i-Publisher over IMI is the enhanced usability and productivity of the new tool – multilingual web sites can be built with it from scratch much faster than with IMI or other systems, and existing web sites may be modified at any time to address changes in requirements. Finally, as i-Publisher is a web application, it can be offered as a free online service with reduced installation and management costs.

  3. Text Mining engine

    At the end of 2008, Tetracom created a Text Mining engine whose purpose was to add several language-based mechanisms to the KMS in order to facilitate web content management and publishing, and to enhance the end-user experience. These mechanisms include the extraction of significant words, important phrases, names, and references; the location of similar documents; and text summarization and categorization. As each specific mechanism is powered by external language processing tools, Tetracom started working on a framework that would allow tools for different languages to be easily plugged into the Text Mining engine.

    Work on this integration framework has started and will continue within the project. In addition, an integration discipline document will be compiled and agreed upon by the partners; it will define the methodology and the integration requirements, with the aim of reaching consensus among the partners, who should endorse the framework's communication interfaces.
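
    The plug-in idea behind such a framework can be sketched as a registry that maps a language and a task to an external tool. This is a minimal illustration only, not the actual framework interface: `ToolRegistry`, the `Tool` signature, the task name "names", and the toy extractor are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: raw text in, list of extracted strings out.
Tool = Callable[[str], List[str]]


class ToolRegistry:
    """Maps (language, task) pairs to pluggable language processing tools."""

    def __init__(self) -> None:
        self._tools: Dict[Tuple[str, str], Tool] = {}

    def register(self, language: str, task: str, tool: Tool) -> None:
        """Plug a tool in for one language and one task."""
        self._tools[(language, task)] = tool

    def run(self, language: str, task: str, text: str) -> List[str]:
        """Run the tool registered for this language and task on the text."""
        try:
            tool = self._tools[(language, task)]
        except KeyError:
            raise LookupError(f"no tool registered for {task!r} in {language!r}") from None
        return tool(text)


def capitalized_words(text: str) -> List[str]:
    """Toy stand-in for a name extractor: keeps words that start with a capital."""
    return [word for word in text.split() if word[:1].isupper()]


registry = ToolRegistry()
registry.register("en", "names", capitalized_words)
print(registry.run("en", "names", "Tetracom works with ETH Zurich"))
```

    Keeping the engine dependent only on the registry interface is what would let a tool for a new language be added without changing the engine itself.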

  4. Text Mining engine language packs

    An English language pack for the Text Mining engine was created in 2008 and integrated into a demo online digital library. Several features in this language pack are based on the GATE architecture – extraction of important words, noun phrases, references and named entities. The language pack also includes a module for extractive summarization which uses features from algorithms such as LexRank, GistSumm, and genetic algorithms. So far, nearly 100 000 texts in English have been processed using this language pack.

    Language packs for the other project languages will be created during the project by adjusting, fine-tuning, and combining existing tools that were developed as part of earlier EU projects or authored independently. Whenever a language-specific tool for a particular task is not available, or its quality is not satisfactory for the needs of the project, a language-independent approach will be taken. In particular, the text categorization module used in this project will be to some extent language-independent, based on supervised learning methods, and only adjusted to each project language to improve the quality of categorization for that language.
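
    The fallback rule described above can be sketched as follows. This is an illustration under stated assumptions: the `ToolInfo` record, the numeric quality score, the threshold, and the tool names are hypothetical, since this section does not define how tool quality will be measured.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ToolInfo:
    name: str
    task: str
    language: Optional[str]  # None marks a language-independent tool
    quality: float           # hypothetical evaluation score between 0 and 1


def select_tool(candidates: Iterable[ToolInfo], language: str, task: str,
                min_quality: float) -> ToolInfo:
    """Prefer a good enough language-specific tool, else fall back to a generic one."""
    candidates = list(candidates)
    specific = [t for t in candidates
                if t.task == task and t.language == language and t.quality >= min_quality]
    if specific:
        return max(specific, key=lambda t: t.quality)
    generic = [t for t in candidates if t.task == task and t.language is None]
    if generic:
        return max(generic, key=lambda t: t.quality)
    raise LookupError(f"no tool available for {task!r} in {language!r}")


# Hypothetical catalogue: the Croatian summarizer scores below the threshold.
TOOLS = [
    ToolInfo("en-summarizer", "summarization", "en", 0.8),
    ToolInfo("hr-summarizer", "summarization", "hr", 0.3),
    ToolInfo("generic-summarizer", "summarization", None, 0.6),
]

print(select_tool(TOOLS, "en", "summarization", min_quality=0.5).name)
print(select_tool(TOOLS, "hr", "summarization", min_quality=0.5).name)
```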

  5. Machine Translation engine

    Tetracom has created a Machine Translation engine that uses the Moses tool developed by EuroMatrix. During the project, existing language models for the target languages will be improved and the Machine Translation engine will be integrated into the ATLAS platform so that computer-aided translation is available to content authors and machine translation is available to end-users. For lexical gaps, that is, items that are not translated or not translated correctly, we will give content authors and editors the possibility to consult available lexical resources, including resources available through web services offered by projects like CLARIN. The portability of the Moses system to different language pairs was demonstrated within the EuroMatrix project and has also been shown for other language pairs. For instance, the University of Hamburg is currently using Moses to develop a German-Romanian Machine Translation system. Furthermore, studies have shown that even for less-resourced languages (like Croatian), non-tuned Statistical Machine Translation systems like Moses demonstrate accuracy similar to that of Google Translate. Therefore, Moses is well suited to the needs of the ATLAS project.

  6. Machine Translation engine language packs

    A Machine Translation engine language pack is a Moses-based translation model for a given language pair.

  7. Search engine

    The ATLAS Search engine includes cross-lingual full-text and similarity search. Currently, a full-text search engine is part of the KMS web content management system. This engine needs to be adjusted to also support cross-lingual and similarity search. This functionality will most likely be based on the outcomes of the EU-funded initiative Language Technology for e-Learning (LT4eL), in which a prototype of a semantics-based cross-lingual search engine was created, thoroughly tested, and validated for nine European languages, some of which are targeted by ATLAS.
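
    To make the notion of similarity search concrete, the sketch below ranks documents by cosine similarity over word counts. This is a generic, monolingual technique chosen purely for illustration; it is not the semantics-based LT4eL approach mentioned above, and the function names and sample texts are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a real system would use proper tokenization."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: str, documents: List[str]) -> str:
    """Return the document whose word counts are closest to the query's."""
    query_vector = bag_of_words(query)
    return max(documents, key=lambda doc: cosine_similarity(query_vector, bag_of_words(doc)))


DOCUMENTS = [
    "security policy in europe",
    "machine translation with moses",
]
print(most_similar("statistical machine translation", DOCUMENTS))
```

    A cross-lingual variant would additionally need to map the query and the documents into a shared representation, which is the role of the language resources described in the next component.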

  8. Search engine language packs

    A Search engine language pack includes the language resources that add support for cross-lingual full-text and similarity search in a specific language to the ATLAS Search engine.