Yesterday, a Microsoft team named Bling (Beyond Language Understanding) announced Bling Fire, a finite-state machine and regular expression manipulation library.
Bling Fire was developed for a range of linguistic operations inside Bing, including tokenization, multi-word expression matching, unknown-word guessing, and stemming/lemmatization, among others.
Bling Fire includes a tokenizer designed for fast, high-quality tokenization of natural language text. It follows the tokenization logic of NLTK (Natural Language Toolkit), except that hyphenated words are split and a few errors are fixed. The team also reports that, compared with other popular NLP libraries, Bling Fire is about 10x faster at the tokenization task.
The latest release of the Bling Fire model supports most languages, including East Asian ones (Simplified Chinese, Traditional Chinese, Japanese, Korean, Thai). The tokenizer’s high-level API is easy to call from languages such as Python, Perl, C#, and Java. The tokenizer requires no configuration, initialization, or additional files. It is fast because it runs on deterministic finite-state machines underneath.
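To illustrate why a deterministic finite-state machine makes tokenization fast, here is a minimal toy sketch in Python. It is not Bling Fire's actual automaton (those are compiled from rules by the library's own tools); the states, character classes, and the `tokenize` function are all hypothetical and chosen only for illustration. The point is that each character costs a single table lookup and the scanner never backtracks.

```python
from typing import Dict, List, Tuple


def char_class(ch: str) -> str:
    """Map a character to one of the classes the toy automaton distinguishes."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "other"


# Deterministic transition table: (state, character class) -> next state.
# A missing entry means the automaton has no move, so the current token ends.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("start", "letter"): "word",
    ("start", "digit"): "number",
    ("start", "other"): "punct",   # any single non-alphanumeric symbol
    ("word", "letter"): "word",
    ("number", "digit"): "number",
}


def tokenize(text: str) -> List[str]:
    """Split text into tokens by running the automaton from each token start."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1  # whitespace only separates tokens
            continue
        state = "start"
        j = i
        # Follow transitions until the table has no move for the next character.
        while j < len(text):
            next_state = TRANSITIONS.get((state, char_class(text[j])))
            if next_state is None:
                break
            state = next_state
            j += 1
        tokens.append(text[i:j])
        i = j
    return tokens


print(tokenize("well-known 42!"))
```

Because a hyphen falls in the "other" class, this toy automaton also splits hyphenated words into separate tokens, loosely mirroring the behavior described above.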
The Bling Fire library and its finite-state machine manipulation tools can be built on Windows or Linux using CMake; with these tools you can create your own tokenization/segmentation, stemming, and other models. To use Bling Fire from Python, install the release with: pip install blingfire
For more information, check out Bling Fire on GitHub.