Viterbi algorithm for HMM, Testing the Kafka cluster & ML with Python Bundle.

What’s happening in Data?

Here are the updates from Google Data Cloud, AWS SageMaker and Oracle Database Insider.   

Google Data Cloud 

AWS SageMaker 

Oracle Database Insider 

  • MySQL Support in Database Tools in OCI: The Database Tools service in OCI provides instant web browser SQL access to Oracle Cloud Databases via REST. The service now also lets you create connections to the MySQL Database Service in OCI.
  • Create Graph Databases with Graph Studio: The Graph Studio in ADB makes it easy to create graph models from data in your database or data warehouse, perform graph analysis, develop graph applications, and visualize and share results. 

Understanding Machine Learning Algorithms

Viterbi algorithm for Hidden Markov Models
– By Giuseppe Bonaccorso 

The Viterbi algorithm is one of the most common decoding algorithms for HMMs. Its goal is to find the most likely hidden state sequence corresponding to a series of observations. The structure is very similar to the forward algorithm, but instead of computing the probability of a sequence of observations joint with the state at the last time instant, this algorithm looks for the probability of the single most likely hidden state path that ends in that state.
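Written out, and assuming o_1, …, o_t denote the observations and x_1, …, x_t the hidden states (the notation used in the rest of this explainer), the quantity the algorithm looks for can be sketched as:

```latex
v_t^i = \max_{x_1, \dots, x_{t-1}} P\left(o_1, o_2, \dots, o_t,\; x_1, \dots, x_{t-1},\; x_t = i\right)
```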

The variable v_t^i represents the maximum probability of the given observation sequence joint with x_t = i, considering all possible hidden state paths (from time instant 1 to t-1). We can compute v_t^i recursively by evaluating all the v_{t-1}^j multiplied by the corresponding transition probabilities p_ji and emission probability P(o_t|x_i), and always picking the maximum over all possible values of j.
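Following that description term by term, the recursion can be written as below. This is a reconstruction from the text, assuming p_ji is the probability of moving from state j to state i:

```latex
v_t^i = \max_{j} \; v_{t-1}^j \, p_{ji} \, P\left(o_t \mid x_i\right)
```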

The algorithm is based on a backtracking approach, using a backpointer bp_t^i whose recursive expression is the same as that of v_t^i, but with the argmax function instead of max.
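Under the same assumptions about the notation, the backpointer expression differs from the recursion for v_t^i only in the operator:

```latex
bp_t^i = \operatorname*{argmax}_{j} \; v_{t-1}^j \, p_{ji} \, P\left(o_t \mid x_i\right)
```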

Therefore, bp_t^i represents the partial sequence of hidden states x_1, x_2, …, x_{t-1} that maximizes v_t^i. That is why we need to backtrack over the partial result and replace any sequence built at time t that no longer maximizes v_{t+1}^i.

The algorithm is based on the following steps (as in the other cases, the initial and ending states are non-emitting):

  1. Initialization of a matrix V with shape (N + 2, Sequence Length).
  2. Initialization of a matrix BP with shape (N + 2, Sequence Length).
  3. Initialization of A (transition probability matrix) with shape (N + 2, N + 2), so that it also covers the non-emitting initial state (index 0) and ending state (xEnd). Each element A[j, i] is P(x_i|x_j), the probability of moving from state j to state i.
  4. Initialization of B (emission probability matrix) with shape (Sequence Length, N). Each element B[t, i] is P(o_t|x_i).
  5. For i=1 to N:
    1. Set V[i, 1] = A[0, i] · B[1, i]
    2. Set BP[i, 1] = Null (or any other value that cannot be interpreted as a state)
  6. For t=2 to Sequence Length:
    1. For i=1 to N:
      1. Set V[i, t] = max_j V[j, t-1] · A[j, i] · B[t, i]
      2. Set BP[i, t] = argmax_j V[j, t-1] · A[j, i] · B[t, i]
  7. Set V[xEnd, Sequence Length] = max_j V[j, Sequence Length] · A[j, xEnd].
  8. Set BP[xEnd, Sequence Length] = argmax_j V[j, Sequence Length] · A[j, xEnd].
  9. Follow the backpointers from BP[xEnd, Sequence Length] back to t = 1 and reverse the resulting sequence.
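The steps above can be sketched in plain Python. This is a minimal illustration, not the book's implementation: the function name viterbi, the arguments start_p, trans_p, end_p and emit_p, and the toy probabilities are all assumptions made for this example. Instead of one (N + 2, N + 2) matrix, the transitions out of the initial state and into the ending state are passed as separate vectors.

```python
from typing import List, Tuple


def viterbi(start_p: List[float],
            trans_p: List[List[float]],
            end_p: List[float],
            emit_p: List[List[float]]) -> Tuple[List[int], float]:
    """Return the most likely hidden state path and its probability.

    start_p[i]    : P(x_i | initial state)       (hypothetical name)
    trans_p[j][i] : P(x_i | x_j), from j to i    (hypothetical name)
    end_p[j]      : P(ending state | x_j)        (hypothetical name)
    emit_p[t][i]  : P(o_t | x_i) for the observation seen at time t
    """
    n_states = len(start_p)
    seq_len = len(emit_p)

    # V[t][i]: probability of the best path that ends in state i at time t.
    V = [[0.0] * n_states for _ in range(seq_len)]
    # BP[t][i]: previous state on that best path (None at the first step).
    BP = [[None] * n_states for _ in range(seq_len)]

    # Initialization: leave the non-emitting initial state.
    for i in range(n_states):
        V[0][i] = start_p[i] * emit_p[0][i]

    # Recursion: keep the max in V and the argmax in BP.
    for t in range(1, seq_len):
        for i in range(n_states):
            scores = [V[t - 1][j] * trans_p[j][i] * emit_p[t][i]
                      for j in range(n_states)]
            best_j = max(range(n_states), key=scores.__getitem__)
            V[t][i] = scores[best_j]
            BP[t][i] = best_j

    # Termination: move into the non-emitting ending state.
    final = [V[-1][j] * end_p[j] for j in range(n_states)]
    last = max(range(n_states), key=final.__getitem__)

    # Backtracking: follow the backpointers, then reverse.
    path = [last]
    for t in range(seq_len - 1, 0, -1):
        path.append(BP[t][path[-1]])
    path.reverse()
    return path, final[last]


# Toy model with two hidden states (0 and 1) and three observations.
# All numbers are illustrative assumptions.
start_p = [0.6, 0.4]
trans_p = [[0.7, 0.3],
           [0.4, 0.6]]
end_p = [1.0, 1.0]  # no preference for the final state
emit_p = [[0.5, 0.1],   # P(o_1 | x_0), P(o_1 | x_1)
          [0.4, 0.3],   # P(o_2 | x_0), P(o_2 | x_1)
          [0.1, 0.6]]   # P(o_3 | x_0), P(o_3 | x_1)

path, prob = viterbi(start_p, trans_p, end_p, emit_p)
print(path, prob)
```

For this toy model the most likely path is state 0, state 0, state 1, with a probability of about 0.01512. For long sequences the products underflow, so a practical version would usually work with sums of log-probabilities instead.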

The output of the Viterbi algorithm is a tuple with the most likely sequence BP and the corresponding probabilities V. This explainer on algorithms was curated from the book Mastering Machine Learning Algorithms.


Quick Tutorial

Testing the Kafka cluster
– By Paul Crickard  

Kafka comes with scripts to allow you to perform some basic functions from the command line. To test the cluster, you can create a topic, create a producer, send some messages, and then create a consumer to read them. If the consumer can read them, your cluster is running. To create a topic, run the following command from your kafka_1 directory: 

bin/kafka-topics.sh --create --zookeeper localhost:2181,localhost:2182,localhost:2183 --replication-factor 2 --partitions 1 --topic dataengineering

The preceding command runs the kafka-topics script with the create flag. It then specifies the ZooKeeper cluster addresses, the replication factor, the number of partitions, and the topic name. If the topic is created, the terminal prints the following line:

created topic dataengineering 

You can verify this by listing all the topics in the Kafka cluster using the same script, but with the list flag: 

bin/kafka-topics.sh --list --zookeeper localhost:2181,localhost:2182,localhost:2183

The result should be a single line: dataengineering. Now that you have a topic, you can send and receive messages on it. The next section will show you how. 

Testing the cluster with messages 


For a quick test of the cluster, you can again use the provided scripts. To create a producer, use the following command:

bin/kafka-console-producer.sh --broker-list localhost:9092,localhost:9093,localhost:9094 --topic dataengineering

The preceding command uses the kafka-console-producer script with the broker-list flag, which lists the Kafka cluster servers. Lastly, it takes a topic, and since we only have one, it is dataengineering. When it is ready, you will have a > prompt to type messages into. To read the messages, you will need to use the kafka-console-consumer script. The command is as shown:

bin/kafka-console-consumer.sh --zookeeper localhost:2181,localhost:2182,localhost:2183 --topic dataengineering --from-beginning

The consumer passes the zookeeper flag with the list of servers. It also specifies the topic and the from-beginning flag. If you had already read messages, you could specify an offset flag with the index of the last message so that you start from your last position. Putting the producer and consumer terminals next to each other, you should have something like the following screenshot: 

Figure – Producer and consumer 

When the consumer starts, it reads all the messages on the topic. Once it has read them all, it waits for new messages. If you type a message in the producer, it shows up in the consumer window after a short lag. This quick tutorial was curated from the book Data Engineering with Python.
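The same produce-then-consume check can also be scripted. The sketch below assumes the third-party kafka-python package is installed and reuses the broker addresses and topic from the commands above. The helper names send_messages and read_messages are hypothetical. Unlike the console consumer shown earlier, this client connects to the brokers directly rather than through ZooKeeper.

```python
# Broker addresses and topic taken from the console commands above.
BOOTSTRAP_SERVERS = ["localhost:9092", "localhost:9093", "localhost:9094"]
TOPIC = "dataengineering"


def send_messages(messages, servers=BOOTSTRAP_SERVERS, topic=TOPIC):
    """Send each string in `messages` to the topic (hypothetical helper)."""
    # Imported lazily so this file loads even without kafka-python installed.
    from kafka import KafkaProducer  # assumption: pip install kafka-python

    producer = KafkaProducer(bootstrap_servers=servers)
    for text in messages:
        # Kafka message values are bytes, so encode the text first.
        producer.send(topic, text.encode("utf-8"))
    producer.flush()  # block until buffered messages have been sent
    producer.close()


def read_messages(servers=BOOTSTRAP_SERVERS, topic=TOPIC, timeout_ms=5000):
    """Read the topic from the beginning and return the decoded messages."""
    from kafka import KafkaConsumer  # assumption: pip install kafka-python

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        auto_offset_reset="earliest",    # start from the oldest message
        consumer_timeout_ms=timeout_ms,  # stop iterating when idle this long
    )
    try:
        return [record.value.decode("utf-8") for record in consumer]
    finally:
        consumer.close()
```

With the cluster running, calling send_messages(["hello"]) and then read_messages() should return a list that includes "hello", along with any messages already on the topic.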