Data Binarisation and AI

Pros, cons, and overcoming the limitations of using data binarisation in machine learning

As the saying goes, time is money. That adage is as true of artificial intelligence as of anything else. The longer training or inference takes, the more an AI model typically costs in energy and water consumption. That leads machine learning engineers to pursue a variety of techniques to make models faster and more efficient. Key amongst them is data binarisation: simplifying training and inference input data by representing it with binary digits (0 and 1).

But such simplification is not typically without trade-offs.

What is data binarisation?

Data binarisation is a critical data transformation technique: it converts continuous numerical data into a binary format, typically represented as 0 or 1. This binary approach simplifies complex datasets, allowing AI and machine learning models to process and interpret the information more efficiently. Because it operates on numbers, data binarisation is most commonly used within AI models to represent values that are already numerical, such as those found in time series analysis and anomaly detection. Literal Labs' AI models extend this, however, allowing non-numerical data, including sounds and images, to be represented in binary before being passed to a Tsetlin machine for further processing.
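At its simplest, binarisation is a thresholding operation. The minimal Python sketch below illustrates the idea; the readings and the 0.5 threshold are purely illustrative assumptions, not values from any real pipeline.

```python
import numpy as np

def binarise(values: np.ndarray, threshold: float) -> np.ndarray:
    """Map each continuous value to 1 if it exceeds the threshold, else 0."""
    return (values > threshold).astype(np.uint8)

# Continuous sensor readings (illustrative values)
readings = np.array([0.12, 0.48, 0.93, 0.31, 0.77])

print(binarise(readings, threshold=0.5))  # -> [0 0 1 0 1]
```

In practice the threshold is usually chosen per feature, and richer schemes (quantile bins, adaptive thresholds) exist, but the principle is the same.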

Data binarisation examples

Machine health

In the realm of machine health monitoring, binarisation can be highly effective when applied to audio anomaly detection, as reflected in strong results on the ToyADMOS benchmark.

For example, audio sensors continuously monitor the sound emitted by a machine. By applying binarisation, an AI model can set a threshold for normal operational noise levels. Any sound pattern that deviates from this expected range—such as grinding, screeching, or sudden changes in frequency—can be classified as an 'anomaly' (1), while regular sound levels are marked as 'normal' (0). This binary distinction enables AI models to swiftly flag abnormal machine behaviour, allowing operators to intervene before catastrophic failure occurs.
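A hedged sketch of that idea follows. The RMS energy feature and the noise threshold below are illustrative assumptions, not a description of any specific monitoring system.

```python
import numpy as np

def rms(frame: np.ndarray) -> float:
    """Root-mean-square energy of a single audio frame."""
    return float(np.sqrt(np.mean(frame ** 2)))

def flag_frames(frames: list, noise_threshold: float) -> list:
    """Label each frame 1 ('anomaly') if its energy exceeds the
    expected operational range, else 0 ('normal')."""
    return [1 if rms(f) > noise_threshold else 0 for f in frames]

# Two quiet frames and one loud frame (synthetic data for illustration)
rng = np.random.default_rng(42)
quiet = [rng.normal(0.0, 0.1, 1024) for _ in range(2)]
loud = [rng.normal(0.0, 0.8, 1024)]

print(flag_frames(quiet + loud, noise_threshold=0.3))  # -> [0, 0, 1]
```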

Predictive maintenance

Binarisation can similarly be applied to predictive maintenance through vibration monitoring.

Machines with moving parts generate specific vibration patterns during normal operation. However, as components wear out or faults develop, these vibration signatures change. By using binarisation, an AI model can learn an appropriate threshold for normal vibration from historical data. In the case of a motor, for example, any vibration amplitude above a certain level, indicating excessive wear or imbalance, can be marked as 'abnormal' (1), while amplitudes within the acceptable range are classified as 'normal' (0). This binary labelling allows AI systems to automatically detect potential issues and schedule maintenance before more severe mechanical failure occurs, reducing costly downtime and repairs.
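A minimal sketch of how such a threshold might be learned from history is given below; the mean-plus-three-sigma rule is an assumption chosen for illustration, not Literal Labs' actual method.

```python
import numpy as np

def learn_threshold(healthy_history: np.ndarray, n_sigmas: float = 3.0) -> float:
    """Derive a vibration threshold from amplitudes recorded during
    known-healthy operation (mean + n_sigmas * std, an illustrative rule)."""
    return float(healthy_history.mean() + n_sigmas * healthy_history.std())

def label(amplitudes: np.ndarray, threshold: float) -> np.ndarray:
    """1 = 'abnormal' vibration, 0 = 'normal' (within the acceptable range)."""
    return (amplitudes > threshold).astype(np.uint8)

# Synthetic healthy baseline: amplitudes around 1.0 with a small spread
healthy = np.random.default_rng(0).normal(loc=1.0, scale=0.1, size=1000)
t = learn_threshold(healthy)  # roughly 1.3 for this data

print(label(np.array([0.95, 1.05, 1.60]), t))  # -> [0 0 1]
```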

Pros and cons of data binarisation in AI

Of course, data binarisation is not without limitations; if it were, all AI models would use it, reducing their environmental impact while increasing their speed.

Pros of data binarisation

  1. Simplifies Data: Binarisation reduces complex, continuous data into simple 0 or 1 values, making it easier for AI models to process.
  2. Faster Computation: With reduced data complexity, models can perform faster calculations and require less computational power.
  3. Improves Decision-Making: Binary distinctions can enable models to focus on clear, actionable insights, leading to quicker decisions.
  4. Enhances Model Interpretability: Binary data is more straightforward, making AI model outputs easier to interpret and explain.
  5. Enables Efficient Storage: Since binary data takes up less space, it results in smaller models, more efficient storage, and quicker data retrieval (see the sketch after this list).
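To make the storage point concrete: eight binary values fit into a single byte. NumPy's packbits, used below, is just one illustrative way to achieve that packing, not a claim about any particular model's storage format.

```python
import numpy as np

bits = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=np.uint8)

packed = np.packbits(bits)    # eight 0/1 values -> one byte
print(packed, packed.nbytes)  # [105] 1
# The same eight values stored as float32 would occupy 32 bytes.
```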

Cons of data binarisation

  1. Loss of Accuracy: By reducing data to binary values, subtle nuances and variations within the original data can be lost and model accuracy reduced.
  2. Threshold Sensitivity: Models heavily depend on the chosen threshold, and incorrect threshold selection can lead to inaccurate results (see the sketch after this list).
  3. Limited to Certain Applications: Binarisation is less effective for tasks that require nuanced or continuous data, such as regression problems.
  4. Reduced Model Flexibility: Binary data can oversimplify complex relationships, limiting a model’s ability to capture intricate patterns.
  5. Potential for Misclassification: Edge cases near the threshold may be misclassified, which can lead to false positives or negatives in predictions.
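Points 2 and 5 can be demonstrated with a short sketch (the readings and thresholds are purely illustrative): a roughly 3% shift in the threshold is enough to relabel a borderline reading.

```python
import numpy as np

# Readings clustered near the cut-off (illustrative values)
readings = np.array([0.49, 0.50, 0.51, 0.52])

for threshold in (0.500, 0.515):
    labels = (readings > threshold).astype(np.uint8)
    print(f"threshold={threshold}: {labels}")

# threshold=0.5:   [0 0 1 1]
# threshold=0.515: [0 0 0 1]  <- the 0.51 reading flips label
```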

How Literal Labs uses binarisation

In the machine learning and artificial intelligence space, accuracy is mission-critical. Yet, as noted above, binarisation can reduce accuracy. So how does Literal Labs use the technique?

Our novel approach to AI models has shown that data binarisation need have no significant impact on a model's accuracy when underpinned by a series of other technologies. In fact, our benchmarking has shown our models to have an average accuracy variation of ±2%. Put another way, while utilising binarisation, our pipeline is often able to build models that aren't just faster and more energy efficient than their neural network counterparts, they're more accurate as well.