LinA – Linguistic Text Analysis


LinA is software for automatic text processing. It performs linguistic analysis of unstructured text, including social media content. LinA is designed to handle large volumes of text, and its pipeline architecture makes it highly flexible and configurable. It can process a large variety of input formats and is available for English and German.

Available modules

  • Text filters: extract content from text, HTML, PDF, MS Office, Open Office, XML, TMX, XLIFF and many other file formats.
  • Language identification: automatically determines the language of a document.
  • Sentence segmentation and tokenization: identifies sentence and word boundaries.
  • Truecasing: normalizes the casing of a text, e.g. "This Sentence Is in ENGLISH." → "this sentence is in English."
  • Part-of-speech tagging: assigns to every word its part of speech.
  • Morphological analysis and lemmatization: analyses unknown words according to morphological rules (including compound splitting for German) and generates the base form of a word in its current context.
  • Several configurable output methods: XML writer, Lucene index writer, etc.
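
To make the idea of chained modules concrete, here is a minimal sketch of two of the steps listed above, sentence segmentation followed by tokenization. It does not use LinA's API, which is not described on this page. It relies only on the JDK's java.text.BreakIterator, and the class and method names (PipelineSketch, sentences, tokens) are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: this is NOT LinA's API, just a JDK-based stand-in
// for a "sentence segmentation -> tokenization" pipeline.
public class PipelineSketch {

    /** Splits text into sentences with the JDK's locale-aware BreakIterator. */
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    /** Splits one sentence into tokens; punctuation is kept, whitespace is dropped. */
    static List<String> tokens(String sentence, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(sentence);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = sentence.substring(start, end);
            if (!token.trim().isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "LinA reads documents. It splits them into sentences and words.";
        // Stage 1 feeds stage 2, as in a pipeline of modules.
        for (String sentence : sentences(text, Locale.ENGLISH)) {
            System.out.println(tokens(sentence, Locale.ENGLISH));
        }
    }
}
```

A production module would add abbreviation handling and language-specific rules. The sketch only shows how the output of one stage becomes the input of the next.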

Main features

  • Pipeline architecture: highly flexible and configurable. You only have to license the modules you actually need.
  • 100% Java: LinA is written entirely in Java, which makes it fast and lets it run on Unix as well as Windows and macOS.
  • Multi-threaded: LinA takes advantage of multi-processor machines and automatically runs in parallel on all available cores.

Ready for Big Data!

LinA takes advantage of multi-core machines and automatically runs in parallel on all available cores. Furthermore, it can be scaled out to run on clusters of arbitrarily many machines.
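
How LinA distributes work internally is not documented here. As a rough illustration of running document analysis in parallel on all available cores, the following sketch uses a standard Java thread pool sized to the number of processors. The names ParallelSketch and processAll are hypothetical, and the analyze function stands in for whatever pipeline is applied to each document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustrative only: a generic JDK thread-pool pattern, not LinA's scheduler.
public class ParallelSketch {

    /** Applies analyze to every document using one worker thread per core; results keep input order. */
    static <R> List<R> processAll(List<String> documents, Function<String, R> analyze) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String doc : documents) {
                Callable<R> task = () -> analyze.apply(doc);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // waits until this document is finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("analysis failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("first document");
        docs.add("second, slightly longer document");
        // Toy "analysis": count whitespace-separated words per document.
        System.out.println(processAll(docs, d -> d.split("\\s+").length));
    }
}
```

Because each document is analyzed independently, the same pattern extends to several machines by partitioning the document set among them.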

Average speed when feeding HTML, PDF, and DOCX documents through a LinA pipeline consisting of a filter for text extraction, sentence detector, tokenizer, POS tagger, lemmatizer, and an XML annotation writer:

  • Intel Core-i3 (two pipelines with two threads each): 10,000 documents per hour
  • Intel Core-i7 (four pipelines with two threads each): 20,000 documents per hour

Available languages

LinA is available for English and German. Many more languages are in preparation.

Module en de es fr it pt nl sv da pl cz ru ro zh
Language identification
Sentence segmentation
Tokenization
Truecasing
Part-of-speech Tagging
Morphological analysis
Lemmatization

☐ = in preparation, ☑ = available.

Licensing the software

You can run LinA on your own machines or integrate it into your product by obtaining a software license. Pricing depends on the modules and languages ordered. For more information, please contact peter.kolb@linguatools.org.

