Yesterday, a Microsoft team named Bling (Beyond Language Understanding) announced a finite state machine and regular expression manipulation library called Bling Fire.

Bling Fire was developed for a range of linguistic operations inside Bing, including tokenization, multi-word expression matching, unknown-word guessing, and stemming/lemmatization, among others.

Bling Fire includes a tokenizer designed for fast, high-quality tokenization of natural language text. The tokenizer mostly follows the tokenization logic of NLTK (Natural Language Toolkit), except that hyphenated words are split and a few errors are fixed. Compared with other popular NLP libraries, Bling Fire is reported to be about 10x faster at tokenization.

The currently released Bling Fire model supports most languages except East Asian ones (Chinese Simplified, Traditional, Japanese, Korean, Thai). The tokenizer's high-level API is designed to be easy to call from languages such as Python, Perl, C#, and Java. The tokenizer also requires no configuration, initialization, or additional files. It is fast because it uses deterministic finite state machines underneath.
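To give a feel for why a deterministic finite state machine makes tokenization fast, here is a toy sketch in Python. It is not Bling Fire's implementation: the two states, the three character classes, and the names `char_class`, `TRANSITIONS`, and `dfa_tokenize` are all made up for illustration. The point it shows is that each input character costs exactly one table lookup, with no backtracking.

```python
# Toy DFA tokenizer: one transition-table lookup per input character.
# Illustrative only; this is NOT how Bling Fire is implemented internally.

def char_class(ch):
    """Map a character to one of the coarse classes the toy DFA understands."""
    if ch.isalnum():
        return "alnum"
    if ch.isspace():
        return "space"
    return "punct"

# States: "start" (between tokens) and "word" (inside an alphanumeric run).
# Each entry maps (state, character class) -> (next state, action).
TRANSITIONS = {
    ("start", "alnum"): ("word", "begin"),
    ("start", "space"): ("start", "skip"),
    ("start", "punct"): ("start", "single"),
    ("word", "alnum"): ("word", "extend"),
    ("word", "space"): ("start", "end"),
    ("word", "punct"): ("start", "end_single"),
}

def dfa_tokenize(text):
    """Split text into alphanumeric runs and single punctuation characters."""
    tokens = []
    state = "start"
    token_start = 0
    for i, ch in enumerate(text):
        state, action = TRANSITIONS[(state, char_class(ch))]
        if action == "begin":
            token_start = i                      # a new word starts here
        elif action == "single":
            tokens.append(ch)                    # punctuation is its own token
        elif action == "end":
            tokens.append(text[token_start:i])   # whitespace closes the word
        elif action == "end_single":
            tokens.append(text[token_start:i])   # punctuation closes the word...
            tokens.append(ch)                    # ...and is emitted as a token
        # "skip" and "extend" need no work
    if state == "word":
        tokens.append(text[token_start:])        # flush a word at end of input
    return tokens

print(dfa_tokenize("Hello, world!"))   # ['Hello', ',', 'world', '!']
print(dfa_tokenize("well-known fix"))  # ['well', '-', 'known', 'fix']
```

Because the transition table is fixed ahead of time, the loop does the same small amount of work for every character, which is the property that makes this style of tokenizer predictable and fast.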


To use the Bling Fire library and its finite state machine manipulation tools, you can build the project on Windows or Linux with CMake. The tools let you create your own tokenization/segmentation, stemming, and similar models. To use Bling Fire from Python, install the release with: pip install blingfire
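Once the package is installed, calling the tokenizer from Python might look like the minimal sketch below. `text_to_words` and `text_to_sentences` are functions exported by the `blingfire` package; the wrapper names `tokenize_words` and `split_sentences` are hypothetical helpers added here for illustration, and the sketch assumes `pip install blingfire` has already been run.

```python
# Minimal sketch; assumes `pip install blingfire` has been run.
# tokenize_words and split_sentences are hypothetical wrapper names;
# text_to_words and text_to_sentences come from the blingfire package.

def tokenize_words(text):
    """Return the tokens of `text` as a list of strings."""
    # Imported inside the function so this file loads even before
    # the package is installed.
    from blingfire import text_to_words
    # text_to_words returns one string with tokens separated by spaces.
    return text_to_words(text).split()

def split_sentences(text):
    """Return the sentences of `text` as a list of strings."""
    from blingfire import text_to_sentences
    # text_to_sentences returns one string with one sentence per line.
    return text_to_sentences(text).splitlines()
```

For example, `tokenize_words("Bling Fire is fast, isn't it?")` returns a list of tokens, and `split_sentences` does the same at the sentence level. No model path or setup call is passed, in line with the tokenizer needing no configuration or additional files.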

For more information, check out Bling Fire on GitHub.
