feat: Imdb sentiment dataset reader (!962) · Merge requests · DeepPavlov / DeepPavlov

Merged Andrei Glinskii requested to merge github/fork/Huawei-MRC-OSI/pr-imdb-clean into dev Aug 08, 2019

Created by: sgrechanik-h

This PR implements a dataset reader for the IMDb sentiment classification dataset. It also includes a JSON configuration for BERT (en, cased), which is mostly the same as the rusentiment configuration except for the maximum sequence length and batch size. I set those two to values that avoid out-of-memory errors on my hardware.
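For readers unfamiliar with the dataset layout, here is a rough sketch of what such a reader does. It is not the code from this PR: the function name `read_imdb_sketch` and the returned dictionary shape are made up for illustration. The only thing assumed about the data is the standard extracted `aclImdb` layout, with `train/` and `test/` folders each containing `pos/` and `neg/` subfolders of `.txt` reviews.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def read_imdb_sketch(data_path: str) -> Dict[str, List[Tuple[str, str]]]:
    """Collect (review_text, label) pairs from an extracted aclImdb directory.

    Hypothetical helper for illustration only, not the reader added in this PR.
    """
    root = Path(data_path)
    data: Dict[str, List[Tuple[str, str]]] = {"train": [], "test": []}
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = root / split / label
            if not folder.is_dir():
                # A missing folder simply contributes no samples.
                continue
            # Sort file names so the sample order is deterministic.
            for file in sorted(folder.glob("*.txt")):
                text = file.read_text(encoding="utf-8")
                data[split].append((text, label))
    return data
```

The labels here are plain strings (`"pos"` / `"neg"`), which is the case where a metric that handles string labels correctly matters.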

This PR also includes a fix for the sets_accuracy metric, which should now work correctly for string labels (i.e. it wraps them into sets instead of converting them to sets, since converting a string to a set splits it into individual characters). I also added reporting of cached files in download_decompress.
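To illustrate the difference between wrapping and converting, here is a minimal sketch. It is not the actual sets_accuracy code from the repository: `as_label_set` and `sets_accuracy_sketch` are hypothetical names, and returning 0.0 for empty input is my assumption.

```python
from typing import Iterable, Union

Labels = Union[str, Iterable[str]]


def as_label_set(labels: Labels) -> set:
    """Wrap a single string label into a one-element set.

    set("pos") would give {"p", "o", "s"}, so strings are wrapped,
    while other iterables of labels are converted as usual.
    """
    if isinstance(labels, str):
        return {labels}
    return set(labels)


def sets_accuracy_sketch(y_true: Iterable[Labels],
                         y_predicted: Iterable[Labels]) -> float:
    """Share of samples whose true and predicted label sets are equal."""
    pairs = list(zip(y_true, y_predicted))
    if not pairs:
        # Assumption: report 0.0 when there is nothing to score.
        return 0.0
    correct = sum(as_label_set(t) == as_label_set(p) for t, p in pairs)
    return correct / len(pairs)
```

With naive conversion, `set("pos") == set("sop")` is true because both become the same set of characters, so a wrong prediction could be counted as correct. With wrapping, `sets_accuracy_sketch(["pos"], ["sop"])` gives 0.0, and list-valued labels such as `[["a", "b"]]` vs `[["b", "a"]]` still compare as equal sets.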