Rainbow Vectorization: Rainbow Method Applied to Binary Vectorization

When the rainbow meets text

Dmytro Karabash
Mar 5, 2023

Co-authored with Anna Arakelyan

Credit: Generated by Dmytro Karabash assisted by fotor and pinetools

We start with a short introduction to Binary Vectorization and an example that we will use, and then answer the question that Joshua Banks Mailman, Ph.D., asked here. We want to thank Joshua, as great questions create new methods!

Intro to Binary Vectorization:

Binary vectorization is a process of converting categorical or textual data into a binary format, which can be easily processed by machine learning algorithms. This technique involves assigning a binary value (usually 0 or 1) to each category or term in the dataset, creating a vector of binary values for each data point. These vectors can then be used as input for various machine learning models such as logistic regression, decision trees, and neural networks.

Binary vectorization is commonly used in natural language processing (NLP) tasks such as sentiment analysis and document classification. It can be applied to textual data such as words or phrases, where each word or phrase is treated as a category and assigned a binary value indicating its presence or absence in a document or text corpus.

Overall, binary vectorization allows for efficient processing and analysis of categorical or textual data in machine learning, making it a valuable technique in many applications.

Examples of Binary Vectorization:

Example 1. Consider two reviews.

Review 1: “I loved this movie. It was funny and heartwarming.”
Review 2: “This movie was terrible. The acting was terrible and the plot was boring.”

After preprocessing (lowercasing and removing punctuation and stop words), the two reviews might be represented by the following words:

Review 1: “loved movie funny heartwarming”
Review 2: “movie terrible acting plot boring”

We can then create a binary vector for each review, using the words in the corpus as features:

Review #  loved  movie  funny  heartwarming  terrible  acting  plot  boring
1         1      1      1      1             0         0       0     0
2         0      1      0      0             1         1       1     1
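
The table above can be built with a few lines of plain Python. This is only a minimal sketch: `binary_vectorize` is a helper name introduced here for illustration, and the token lists are chosen to match the columns of the table.

```python
def binary_vectorize(docs):
    """Return (vocabulary, vectors) with one 0/1 entry per vocabulary word."""
    vocabulary = []
    for tokens in docs:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)  # keep first-seen order
    # 1 if the word occurs in the document, 0 otherwise
    vectors = [[1 if word in tokens else 0 for word in vocabulary]
               for tokens in docs]
    return vocabulary, vectors


# Preprocessed reviews from Example 1 (illustrative token lists)
reviews = [
    "loved movie funny heartwarming".split(),
    "movie terrible acting plot boring".split(),
]

vocabulary, vectors = binary_vectorize(reviews)
print(vocabulary)
for number, vector in enumerate(vectors, start=1):
    print(number, vector)
```

Each review becomes one row, and the row length equals the vocabulary size, which is why binary vectors grow quickly on a real corpus.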

Rainbow Vectorization:

To jump right into how to apply the Rainbow Method to Binary Vectorization, let us consider the following example.

Example 2. Consider the following news items:
News 1: “Cutting-edge amazing product from Astral is coming soon and this will be huge.”
News 2: “Astral novel research is proceeding well but will take some time.”
News 3: “Astral is adding a minor update that will take an hour.”
News 4: “New major update next week will have significant impact.”

After preprocessing, the news items will be represented by the following vectors of words:
News 1: cutting-edge amazing product soon huge
News 2: novel research well take_some_time
News 3: minor update hour
News 4: new major update week significant

We then sort these words into categories:

News #  Novelty       RTime           QTime  Sentiment    Dev Stage  Magnitude
1       cutting-edge  soon            NA     amazing      product    huge
2       novel         take_some_time  NA     well         research   NA
3       NA            NA              hour   NA           update     minor
4       new           NA              week   significant  update     major

which would then be transcribed into

News #  Novelty  RTime  QTime  Sentiment  Dev Stage  Magnitude
1       4        0      NA     4          4          4
2       2        1      NA     3          1          NA
3       NA       NA     4      NA         5          1
4       2        NA     8      2          5          3

The following rainbows are used for these categories (the lists could be expanded considerably, but we keep them short for brevity):
Novelty: old->0, original->1, new->2, novel->2, advanced->3, cutting-edge->4
RTime (relative time): soon->0, take_some_time->1, take_long->2
QTime (quantitative time): second->0, seconds->1, minute->2, minutes->3, hour->4, hours->5, day->6, days->7, week->8, weeks->9, month->10…
Sentiment: poor->0, normal->1, significant->2, well->3, amazing->4
Dev Stage: proposal->0, research->1, development->2, testing->3, product->4, update->5
Magnitude: tiny->0, minor->1, medium->2, major->3, huge->4
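
As a rough sketch of how these rainbows could be applied in code, the snippet below maps each preprocessed news item to one ordinal value per category. The names `RAINBOWS` and `rainbow_vectorize` are introduced here for illustration only, `None` stands in for NA, and the rule that the last matching word wins within a category is our assumption, not part of the method.

```python
# Rainbows copied from the lists above. QTime is truncated in the article,
# so only the values actually written out are included here.
RAINBOWS = {
    "Novelty": {"old": 0, "original": 1, "new": 2, "novel": 2,
                "advanced": 3, "cutting-edge": 4},
    "RTime": {"soon": 0, "take_some_time": 1, "take_long": 2},
    "QTime": {"second": 0, "seconds": 1, "minute": 2, "minutes": 3,
              "hour": 4, "hours": 5, "day": 6, "days": 7,
              "week": 8, "weeks": 9, "month": 10},
    "Sentiment": {"poor": 0, "normal": 1, "significant": 2,
                  "well": 3, "amazing": 4},
    "Dev Stage": {"proposal": 0, "research": 1, "development": 2,
                  "testing": 3, "product": 4, "update": 5},
    "Magnitude": {"tiny": 0, "minor": 1, "medium": 2, "major": 3, "huge": 4},
}


def rainbow_vectorize(tokens, rainbows=RAINBOWS):
    """Map a token list to one ordinal value per category (None means NA)."""
    row = {category: None for category in rainbows}
    for token in tokens:
        word = token.lower()
        for category, rainbow in rainbows.items():
            if word in rainbow:
                # Assumption: if several words hit one category, the last wins.
                row[category] = rainbow[word]
    return row


# Preprocessed news from Example 2 (lowercased token lists)
news = {
    1: "cutting-edge amazing product soon huge".split(),
    2: "novel research well take_some_time".split(),
    3: "minor update hour".split(),
    4: "new major update week significant".split(),
}

encoded = {number: rainbow_vectorize(tokens) for number, tokens in news.items()}
for number, row in encoded.items():
    print(number, row)
```

Each news item ends up with six ordinal features, one per category, instead of one binary column per vocabulary word.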

These lists are imperfect on purpose, so that you can think about how to improve them.
In particular, you can come up with a different rainbow for any of these categories. In fact, you can use several rainbows to encode the same category; the recommendation is to use significantly fewer than log_2(N) rainbows if the category has N subcategories.
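
To get a feel for the log_2(N) guideline, the sketch below prints, for each category, the number of subcategories N next to log_2(N). The sizes are simply counted from the short lists in this article (QTime counts only the eleven values written out), and `rainbow_budget` is a hypothetical helper name.

```python
import math

# Number of words listed per rainbow in this article (illustrative counts)
category_sizes = {
    "Novelty": 6,
    "RTime": 3,
    "QTime": 11,
    "Sentiment": 5,
    "Dev Stage": 6,
    "Magnitude": 5,
}


def rainbow_budget(n_subcategories):
    """Guideline from the text: stay well below log2(N) rainbows."""
    return math.log2(n_subcategories)


for name, n in category_sizes.items():
    print(f"{name}: N={n}, log2(N)={rainbow_budget(n):.2f}")
```

For short lists like these, the guideline leaves room for only one or two rainbows per category; several rainbows become practical only for categories with many subcategories.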
