Text Processing & Embeddings
Machines don't understand words — they understand numbers. The entire field of NLP revolves around one central challenge: how do we convert human language into numerical representations that preserve meaning?
Over the past two decades, the answer to that question has evolved dramatically:
| Era | Technique | Key Idea |
|---|---|---|
| ~2000 | Bag of Words (BoW) | Count word occurrences |
| ~2005 | TF-IDF | Weight words by importance |
| 2013 | Word2Vec / GloVe | Learn dense vector representations |
| 2018+ | Contextual Embeddings (BERT, GPT) | Same word gets different vectors based on context |
Why Numbers, Not Text?
Every model downstream, from logistic regression to deep networks, operates on vectors of numbers. The techniques below all answer the same question: how should a piece of text become a vector, and how much of its meaning should that vector preserve?
Bag of Words (BoW)
The simplest approach: create a vocabulary of all unique words, then represent each document as a vector of word counts.
Example:
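A minimal pure-Python sketch of the counting idea (the toy documents here are chosen for illustration):

```python
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Build the vocabulary: one slot per unique word across all documents
vocab = sorted({word for doc in documents for word in doc.split()})

# Represent a document as a vector of word counts over that vocabulary
def bow_vector(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

print(vocab)
# ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(bow_vector("the cat sat on the mat"))
# [1, 0, 0, 1, 1, 1, 2]
```

Note that "the" gets the biggest count even though it carries the least information, which is exactly the problem TF-IDF addresses below.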
Problems with BoW:
- Vectors are sparse and high-dimensional: one slot per vocabulary word, mostly zeros.
- Word order is discarded, so "dog bites man" and "man bites dog" are identical.
- Every occurrence counts equally, so frequent filler words like "the" dominate.
- There is no notion of similarity: "cat" and "kitten" are as far apart as "cat" and "carburetor".
TF-IDF: Term Frequency–Inverse Document Frequency
TF-IDF improves on raw counts by asking: how important is this word to this specific document, relative to the whole corpus?
$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t), \qquad \text{IDF}(t) = \log\frac{N}{\text{df}(t)}$$

where $\text{TF}(t, d)$ is how often term $t$ occurs in document $d$, $N$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$. (Libraries such as scikit-learn use a smoothed variant of this formula.)
Words like "the" appear in every document, so their IDF is near zero. Domain-specific terms like "embedding" have high IDF in a general corpus.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "A bird flew over the mat",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF shape:", tfidf_matrix.shape)
print("Document 0 vector:\n", tfidf_matrix[0].toarray())
# Notice "the" has low weight (appears everywhere)
# while "sat" has high weight (unique to document 0)
```

Word2Vec: Dense Learned Embeddings
The breakthrough came in 2013 when Mikolov et al. showed that you could learn vector representations by training a shallow neural network on a simple task: predict a word from its context (CBOW) or predict the context from a word (Skip-gram).
The magic is that the learned vectors capture semantic relationships:
king - man + woman ≈ queen
paris - france + italy ≈ rome
Each word gets a dense vector (typically 100–300 dimensions) where similar words are close together in the vector space.
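To make the training task concrete, here is a sketch of how Skip-gram (center word predicts context) training pairs are generated from a sentence. The window size and sentence are illustrative; real implementations add subsampling and negative sampling on top of this:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-gram.

    For each position, the model is trained to predict every word
    within `window` positions of the center word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
for center, context in skipgram_pairs(tokens, window=1):
    print(center, "->", context)
```

A shallow network trained on millions of such pairs ends up placing words that occur in similar contexts near each other, which is where the analogy arithmetic above comes from.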
Static vs. Contextual Embeddings
Word2Vec and GloVe produce static embeddings: each word has exactly one vector no matter where it appears, so "bank" gets the same representation in "river bank" and "bank account". Contextual models such as BERT and GPT instead compute a fresh vector for every occurrence of a word, conditioned on the surrounding sentence, which resolves exactly this kind of ambiguity.
Tokenization in Practice
Before any of these techniques work, we need to tokenize the raw text — split it into individual units. There are several strategies:
| Strategy | Example: "unhappiness" | Pros | Cons |
|---|---|---|---|
| Word-level | ["unhappiness"] | Simple, intuitive | Large vocab, can't handle unknown words |
| Character-level | ["u","n","h","a","p","p","i","n","e","s","s"] | Tiny vocab, no OOV | Very long sequences, harder to learn |
| Subword (BPE) | ["un", "happiness"] | Balanced vocab, handles OOV | Requires training a tokenizer |
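To illustrate the subword idea, here is a heavily simplified sketch of BPE training: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent one. The tiny word list and fixed number of merges are illustrative assumptions; production tokenizers run thousands of merges over large corpora:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Start from individual characters and apply a few merges
words = [list("unhappy"), list("happiness"), list("unhappiness")]
for _ in range(6):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", words)
```

Frequent fragments like "happ" emerge as single tokens after a few merges, while rare words stay split into smaller pieces, which is how BPE handles out-of-vocabulary words.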
Building Text Models in Keras
Keras provides two key tools for text processing:
1. TextVectorization — a preprocessing layer that tokenizes and indexes text
2. Embedding — a trainable layer that maps integer token IDs to dense vectors
Let's build a complete text classification pipeline.
```python
import tensorflow as tf
from tensorflow.keras import layers, models

# --- TextVectorization: converts raw strings → integer sequences ---
max_tokens = 10000
max_length = 200

vectorize_layer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)

# Adapt the layer to your training data (builds the vocabulary)
train_texts = ["This movie was great!", "Terrible film, waste of time.", ...]
vectorize_layer.adapt(train_texts)

# See the vocabulary
vocab = vectorize_layer.get_vocabulary()
print(f"Vocabulary size: {len(vocab)}")
print(f"First 10 tokens: {vocab[:10]}")

# Vectorize a sentence
sample = tf.constant(["This movie was great!"])
print(vectorize_layer(sample))  # e.g., [ 12 45 8 203 0 0 ...]
```

```python
# --- Full text classification model ---
embedding_dim = 128

model = models.Sequential([
    # Input: raw strings
    layers.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,

    # Embedding: (batch, seq_len) → (batch, seq_len, embedding_dim)
    layers.Embedding(
        input_dim=max_tokens,
        output_dim=embedding_dim,
        mask_zero=True,  # Ignore padding tokens
    ),

    # GlobalAveragePooling1D: (batch, seq_len, 128) → (batch, 128)
    # Averages across the sequence dimension — simple but effective
    layers.GlobalAveragePooling1D(),

    # Classification head
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # Binary classification
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

model.summary()
```

GlobalAveragePooling1D vs. Flatten
Flatten would reshape (batch, seq_len, 128) into a single (batch, seq_len × 128) vector, tying the model to one fixed sequence length and feeding a very wide first Dense layer. GlobalAveragePooling1D instead averages the embeddings across the sequence, so the output size is independent of sequence length, the parameter count stays small, and the mask from mask_zero=True keeps padding tokens out of the average.
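A quick NumPy sketch of the shape arithmetic behind the two pooling options (the shapes are illustrative, independent of Keras):

```python
import numpy as np

batch, seq_len, embedding_dim = 2, 5, 4
x = np.random.rand(batch, seq_len, embedding_dim)

# Flatten: concatenates all timesteps into one long vector,
# so the output width depends on seq_len
flattened = x.reshape(batch, seq_len * embedding_dim)
print(flattened.shape)  # (2, 20)

# GlobalAveragePooling1D: averages over the sequence axis,
# so the output width depends only on embedding_dim
pooled = x.mean(axis=1)
print(pooled.shape)  # (2, 4)
```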
IMDB Sentiment Classification — End to End
Let's put it all together with the classic IMDB movie review dataset:
```python
import tensorflow as tf
from tensorflow.keras import layers, models
import tensorflow_datasets as tfds

# Load IMDB dataset (25k train, 25k test)
(train_data, test_data), info = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,
    with_info=True,
)

# Prepare batches
train_ds = train_data.shuffle(10000).batch(64).prefetch(tf.data.AUTOTUNE)
test_ds = test_data.batch(64).prefetch(tf.data.AUTOTUNE)

# Build the TextVectorization layer
max_tokens = 20000
max_length = 500

vectorize_layer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)

# Adapt on training text only
train_text = train_data.map(lambda text, label: text)
vectorize_layer.adapt(train_text)

# Build model
model = models.Sequential([
    vectorize_layer,
    layers.Embedding(max_tokens, 128, mask_zero=True),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train
history = model.fit(train_ds, validation_data=test_ds, epochs=10)
# Expect ~87-89% accuracy with this simple architecture
```