Big data processing with MapReduce (corpora, text processing, building optimized models)

Link to the work
Several times I had the opportunity to run my algorithms on relatively big data (the actual sizes were approximately 2,000 GB, 100 GB, and 11,000 GB of raw data in text format in different cases). The work went through the following activities:

- getting access to large unstructured (or partially unstructured) data;
- writing a set of utilities to store this data properly (merging it from different locations and making it homogeneous);
- writing mappers for data pre-processing: HTML/XML cleaning, tokenization, sometimes vectorization, building bags of words, filtering, and applying various NLP techniques (a mapper sketch is given after this list);
- performing research and analysis: what features we have, how well the corpus fits the current purpose, what we would be able to extract from this data, and so on;
- extracting the required data;
- creating memory-efficient in-memory model representations for machine-learning algorithms using MapReduce (see the sparse representation sketch below);
- running MapReduce in different ways: I tried my own simple implementations for parallelization, Hadoop, and JavaScript-based MapReduce solutions (a minimal hand-rolled example follows).
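To illustrate the pre-processing step, here is a minimal sketch of what such a mapper can look like when run as a Hadoop Streaming job. The regular expressions and the tab-separated (token, 1) output format are my assumptions for the example, not the exact code used in the project; the script would be plugged into a streaming job via its -mapper option.

#!/usr/bin/env python
"""Illustrative Hadoop Streaming mapper: strips markup, tokenizes the text,
and emits (token, 1) pairs for a downstream bag-of-words reducer."""
import re
import sys

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML/XML tag removal
TOKEN_RE = re.compile(r"\w+")     # simple word tokenization


def main() -> None:
    for line in sys.stdin:
        text = TAG_RE.sub(" ", line)              # drop markup
        for token in TOKEN_RE.findall(text):      # tokenize
            sys.stdout.write(f"{token.lower()}\t1\n")


if __name__ == "__main__":
    main()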
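One common way to keep a bag-of-words corpus compact in memory is a sparse document-term matrix built with the hashing trick. The sketch below shows this general approach using SciPy's CSR format; the feature-space size and the use of Python's built-in hash are illustrative assumptions, not the project's actual representation.

"""Sketch: compact in-memory bag-of-words via the hashing trick and a CSR
sparse matrix (illustrates the general idea only)."""
import numpy as np
from scipy.sparse import csr_matrix

N_FEATURES = 2 ** 20  # hashed feature space keeps the vocabulary bounded


def vectorize(docs: list[list[str]]) -> csr_matrix:
    """Turn tokenized documents into a sparse document-term matrix."""
    indptr, indices, data = [0], [], []
    for tokens in docs:
        counts = {}
        for tok in tokens:
            # Hashing trick; note that Python's hash is salted per process,
            # so a stable hash function would be used in practice.
            j = hash(tok) % N_FEATURES
            counts[j] = counts.get(j, 0) + 1
        indices.extend(counts.keys())
        data.extend(counts.values())
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr),
                      shape=(len(docs), N_FEATURES), dtype=np.int32)


# Example: two tiny tokenized documents.
X = vectorize([["big", "data", "data"], ["map", "reduce"]])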
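For the "own simple implementation" case, a hand-rolled map-reduce over a pool of worker processes is enough for many pre-processing jobs. The following is a minimal sketch under that assumption; the file names and the word-count task are hypothetical placeholders.

"""Minimal hand-rolled map-reduce sketch using a process pool
(illustrative, not the project's actual implementation)."""
from collections import Counter
from multiprocessing import Pool


def map_chunk(path: str) -> Counter:
    """Map step: count tokens in one input shard."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.lower().split())
    return counts


def reduce_counts(partials: list[Counter]) -> Counter:
    """Reduce step: merge per-shard counters into one."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total


if __name__ == "__main__":
    paths = ["part-0000.txt", "part-0001.txt"]  # hypothetical input shards
    with Pool() as pool:
        totals = reduce_counts(pool.map(map_chunk, paths))
    print(totals.most_common(10))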