BSP – Crawling the web for translations

To feed our online context dictionary with more bilingual example sentences, we have built BSP, the Example Sentence Generation Pipeline. BSP is a web crawler that identifies web pages that are translations of each other, extracts the pages’ contents, splits them into sentences using LinA, and aligns the sentences. A series of classifiers then separates the wheat from the chaff:

GarbageFilter: filters out boilerplate text from navigation elements and menus, as well as sentences that consist only of keyword lists and other SEO spam.

Parallelness Scorer: uses a dictionary-based classifier to make sure that the aligned sentences really are translations of each other.

Machine Translation Filter: detects pages that were translated by machine translation engines like Google Translate.

Domain Classification: assigns each page up to three domains from our domain hierarchy.

Duplicate Detection: identifies duplicate sentence pairs so that only unique pairs are kept.
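
To make two of the steps above more concrete, here is a minimal Python sketch of a dictionary-based parallelness score followed by duplicate removal. It is not BSP's actual implementation: the toy dictionary, the function names (tokenize, parallelness_score, pair_key, filter_pairs), the coverage-based score, and the 0.5 threshold are all assumptions chosen for illustration. A production classifier would presumably use a full lexicon and more features than simple word coverage.

```python
import hashlib
import re
import unicodedata

# Hypothetical toy German-English dictionary (illustration only).
TOY_DICT = {
    "die": {"the"},
    "katze": {"cat"},
    "schläft": {"sleeps", "sleeping"},
    "im": {"in"},
    "haus": {"house", "home"},
}


def tokenize(sentence):
    """Normalize to NFC, lowercase, and split into word tokens."""
    text = unicodedata.normalize("NFC", sentence).lower()
    return re.findall(r"\w+", text)


def parallelness_score(src, tgt, dictionary=TOY_DICT):
    """Fraction of source tokens that have a dictionary translation in the target."""
    src_tokens = tokenize(src)
    if not src_tokens:
        return 0.0
    tgt_tokens = set(tokenize(tgt))
    covered = sum(
        1 for tok in src_tokens if dictionary.get(tok, set()) & tgt_tokens
    )
    return covered / len(src_tokens)


def pair_key(src, tgt):
    """Hash of the normalized pair, so case and punctuation variants collide."""
    normalized = " ".join(tokenize(src)) + "\t" + " ".join(tokenize(tgt))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def filter_pairs(pairs, threshold=0.5):
    """Keep pairs that score at least `threshold`, dropping duplicates."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        if parallelness_score(src, tgt) < threshold:
            continue  # probably not translations of each other
        key = pair_key(src, tgt)
        if key in seen:
            continue  # duplicate sentence pair
        seen.add(key)
        kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    candidates = [
        ("Die Katze schläft im Haus", "The cat sleeps in the house"),
        ("Die Katze schläft im Haus.", "The cat sleeps in the house!"),
        ("Die Katze schläft im Haus", "Buy cheap tickets now"),
    ]
    for src, tgt in filter_pairs(candidates):
        print(src, "|||", tgt)
```

Hashing a normalized form of each pair keeps the set of seen keys small and treats pairs that differ only in case or punctuation as duplicates. Whether that is the right notion of "duplicate" for a dictionary of example sentences is a design choice this sketch does not settle.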

We have already collected 10 million parallel sentences for German-English. At the moment, we’re adapting BSP to more language pairs.