Named Entity Recognition & Text Parsing

Extract structured information from unstructured text — identify people, organizations, locations, and relationships using spaCy and custom NER models.


Raw text is unstructured — it's just a sequence of characters. But buried inside that text is structured information: names of people, organizations, dates, monetary amounts, and more. Named Entity Recognition (NER) is the task of automatically identifying and classifying these entities.
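Concretely, an NER system is a function from a string to a list of labeled character spans. A toy sketch of that output shape, using a hypothetical gazetteer (fixed word list) in place of a trained model; real NER relies on context, which is exactly what a word list cannot capture:

```python
# Toy NER via gazetteer lookup. This illustrates the *output shape* of NER,
# not how real models work (they classify spans from context, not a list).
GAZETTEER = {
    "Albert Einstein": "PERSON",
    "Google": "ORG",
    "Paris": "GPE",
}

def toy_ner(text):
    """Return (entity_text, label, start, end) tuples found in text."""
    entities = []
    for name, label in GAZETTEER.items():
        start = text.find(name)
        if start != -1:
            entities.append((name, label, start, start + len(name)))
    return sorted(entities, key=lambda e: e[2])  # sort by start offset

print(toy_ner("Albert Einstein never worked at Google in Paris."))
# [('Albert Einstein', 'PERSON', 0, 15), ('Google', 'ORG', 32, 38), ('Paris', 'GPE', 42, 47)]
```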

Why NER Matters

NER is a critical building block for:

  • Information extraction — pull key facts from documents
  • Search engines — understand queries like "Apple stock price" (company, not fruit)
  • Chatbots — extract user intent ("Book a flight to Paris on March 15")
  • Compliance — automatically detect PII (personally identifiable information)
  • Knowledge graphs — build structured databases from text

Standard Entity Types

| Entity Type  | Code      | Examples                         |
| ------------ | --------- | -------------------------------- |
| Person       | PERSON    | "Albert Einstein", "Marie Curie" |
| Organization | ORG       | "Google", "United Nations"       |
| Location     | GPE / LOC | "Paris", "Mount Everest"         |
| Date         | DATE      | "March 15, 2024", "last Tuesday" |
| Money        | MONEY     | "$1.5 billion", "500 euros"      |
| Product      | PRODUCT   | "iPhone 15", "Tesla Model 3"     |

The NLP Pipeline

NER doesn't happen in isolation. A full NLP pipeline typically runs these steps in order: (1) Tokenization — split text into tokens, (2) POS Tagging — assign grammatical roles, (3) Dependency Parsing — identify grammatical relationships, (4) Named Entity Recognition — classify entity spans. Each step builds on the previous one.
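These stages can be pictured as a chain of components, each reading the annotations the previous one attached. A toy sketch of that design (plain dicts and a placeholder tagger, not spaCy's internals):

```python
def tokenize(doc):
    # Naive whitespace tokenization; real tokenizers handle punctuation etc.
    doc["tokens"] = doc["text"].split()
    return doc

def pos_tag(doc):
    # Placeholder tagger: real taggers are statistical. Here every token
    # gets an "X" tag just to show where the annotation would be attached,
    # and to show that this step depends on tokenize() having run first.
    doc["pos"] = ["X" for _ in doc["tokens"]]
    return doc

PIPELINE = [tokenize, pos_tag]

def run(text):
    doc = {"text": text}
    for component in PIPELINE:
        doc = component(doc)  # each stage enriches the shared doc
    return doc

doc = run("NER builds on tokenization")
print(doc["tokens"])  # ['NER', 'builds', 'on', 'tokenization']
```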

spaCy: Industrial-Strength NLP

spaCy is the most popular library for production NLP pipelines. It's fast, accurate, and provides a unified API for tokenization, POS tagging, NER, and dependency parsing.

```python
import spacy

# Load the pre-trained English model
# Install first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Process a text — runs the full pipeline
text = "Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976."
doc = nlp(text)

# --- Tokenization ---
print("Tokens:")
for token in doc:
    print(f"  {token.text:15s}  POS: {token.pos_:6s}  Dep: {token.dep_:10s}")

# --- Named Entity Recognition ---
print("\nEntities:")
for ent in doc.ents:
    print(f"  {ent.text:25s}  Label: {ent.label_:10s}  Span: [{ent.start_char}:{ent.end_char}]")

# Expected output (exact labels may vary by model version):
#   Apple Inc.                 Label: ORG         Span: [0:10]
#   Steve Jobs                 Label: PERSON      Span: [26:36]
#   Cupertino                  Label: GPE         Span: [40:49]
#   California                 Label: GPE         Span: [51:61]
#   April 1, 1976              Label: DATE        Span: [65:78]
```

Part-of-Speech (POS) Tagging

POS tagging assigns a grammatical category to each token. This is essential for understanding sentence structure:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog")

# Detailed POS tags
print(f"{'Token':<10} {'POS':<8} {'Fine POS':<8} {'Explanation'}")
print("-" * 50)
for token in doc:
    print(f"{token.text:<10} {token.pos_:<8} {token.tag_:<8} {spacy.explain(token.tag_)}")

# Output:
# Token      POS      Fine POS Explanation
# --------------------------------------------------
# The        DET      DT       determiner
# quick      ADJ      JJ       adjective
# brown      ADJ      JJ       adjective
# fox        NOUN     NN       noun, singular or mass
# jumps      VERB     VBZ      verb, 3rd person singular present
# over       ADP      IN       conjunction, subordinating or preposition
# the        DET      DT       determiner
# lazy       ADJ      JJ       adjective
# dog        NOUN     NN       noun, singular or mass
```

Dependency Parsing

Dependency parsing identifies grammatical relationships between words — which word is the subject, which is the object, which modifies which:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The CEO of Google announced a new product yesterday")

# Dependency tree
for token in doc:
    print(f"{token.text:12s} --{token.dep_:10s}--> {token.head.text}")

# Output:
# The          --det       --> CEO
# CEO          --nsubj     --> announced      (subject of "announced")
# of           --prep      --> CEO
# Google       --pobj      --> of             (object of "of")
# announced    --ROOT      --> announced      (root of the sentence)
# a            --det       --> product
# new          --amod      --> product        (adjective modifier)
# product      --dobj      --> announced      (direct object of "announced")
# yesterday    --npadvmod  --> announced      (temporal modifier)

# Extract subject-verb-object triples
for token in doc:
    if token.dep_ == "ROOT":
        subjects = [child for child in token.children if "subj" in child.dep_]
        objects = [child for child in token.children if "obj" in child.dep_]
        if subjects and objects:
            print(f"\nTriple: {subjects[0].text} → {token.text} → {objects[0].text}")
            # Triple: CEO → announced → product
```

Regex Patterns for Structured Extraction

Sometimes you need to extract well-defined patterns that don't require a neural model — emails, phone numbers, URLs, dates in specific formats. Regular expressions are the right tool here:

```python
import re

text = """
Contact John Smith at john.smith@example.com or call (555) 123-4567.
The invoice total is $1,234.56 due on 2024-03-15.
Visit https://www.example.com for more info.
"""

patterns = {
    "email": r'[\w.+-]+@[\w-]+\.[\w.-]+',
    "phone": r'\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}',
    "money": r'\$[\d,]+\.?\d*',
    "date": r'\d{4}-\d{2}-\d{2}',
    "url": r'https?://[\w./\-]+',
}

for name, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{name:8s}: {matches}")

# email   : ['john.smith@example.com']
# phone   : ['(555) 123-4567']
# money   : ['$1,234.56']
# date    : ['2024-03-15']
# url     : ['https://www.example.com']
```

Regex vs. NER: When to Use Which

Use **regex** when the pattern is well-defined and consistent (emails, phone numbers, dates in ISO format, credit card numbers). Use **NER** when the entity varies in form and requires context to identify (company names, product names, people's names). For the best results, combine both: use regex for structured patterns and NER for contextual entities.
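Combining the two can be as simple as running both extractors side by side over the same text. A minimal sketch, with a hypothetical `KNOWN_ORGS` lookup standing in for the trained NER model:

```python
import re

# Stand-in for NER: in a real pipeline this lookup would be replaced by
# a model call such as nlp(text).ents.
KNOWN_ORGS = {"Google", "United Nations"}

def extract(text):
    """Hybrid extraction: regex for rigid patterns, lookup for entities."""
    return {
        "emails": re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text),
        "orgs": [org for org in KNOWN_ORGS if org in text],
    }

print(extract("Mail press@google.com about the Google announcement"))
# {'emails': ['press@google.com'], 'orgs': ['Google']}
```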

Custom NER Training with spaCy

Pre-trained models know about standard entity types, but your domain might have custom entities — drug names, legal clauses, product SKUs. You can train spaCy's NER on your own data:

```python
import spacy
from spacy.training import Example
import random

# --- Prepare training data ---
# Format: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    ("Aspirin is used for headaches", {"entities": [(0, 7, "DRUG")]}),
    ("Ibuprofen reduces inflammation", {"entities": [(0, 9, "DRUG")]}),
    ("The patient was prescribed Metformin for diabetes",
     {"entities": [(27, 36, "DRUG")]}),
    ("Amoxicillin treats bacterial infections",
     {"entities": [(0, 11, "DRUG")]}),
    ("She takes Lisinopril for blood pressure",
     {"entities": [(10, 20, "DRUG")]}),
]

# --- Create a blank model or start from existing ---
nlp = spacy.blank("en")  # Start fresh, or use spacy.load("en_core_web_sm")

# Add NER pipe if not present
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner", last=True)
else:
    ner = nlp.get_pipe("ner")

# Add custom entity label
ner.add_label("DRUG")

# --- Training loop ---
# initialize() sets up the model weights and returns the optimizer (spaCy v3)
optimizer = nlp.initialize()

for epoch in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}

    for text, annotations in TRAIN_DATA:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], sgd=optimizer, losses=losses)

    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d} | Loss: {losses['ner']:.4f}")

# --- Test the trained model ---
test_doc = nlp("The doctor recommended Aspirin and Ibuprofen for the pain")
for ent in test_doc.ents:
    print(f"  {ent.text:15s}  Label: {ent.label_}")
```

Building a Complete Text Processing Pipeline

In practice, you'll combine multiple techniques into a pipeline:

```python
import spacy
import re
from collections import defaultdict

def process_document(text, nlp_model):
    """
    Complete text processing pipeline:
    1. spaCy NER for contextual entities
    2. Regex for structured patterns
    3. Noun-chunk extraction for key phrases
    """
    doc = nlp_model(text)

    results = {
        "entities": defaultdict(list),
        "patterns": {},
        "key_phrases": [],
    }

    # 1. Named Entities
    for ent in doc.ents:
        results["entities"][ent.label_].append({
            "text": ent.text,
            "start": ent.start_char,
            "end": ent.end_char,
        })

    # 2. Regex patterns
    results["patterns"]["emails"] = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
    results["patterns"]["dates"] = re.findall(r'\d{4}-\d{2}-\d{2}', text)
    results["patterns"]["money"] = re.findall(r'\$[\d,]+\.?\d*', text)

    # 3. Key noun phrases
    for chunk in doc.noun_chunks:
        results["key_phrases"].append(chunk.text)

    return results

# --- Use it ---
nlp = spacy.load("en_core_web_sm")
text = """
On 2024-03-15, Google CEO Sundar Pichai announced a $2 billion investment
in cloud infrastructure in London. Contact press@google.com for details.
"""

results = process_document(text, nlp)

print("=== Entities ===")
for label, ents in results["entities"].items():
    print(f"  {label}: {[e['text'] for e in ents]}")

print("\n=== Patterns ===")
for pattern, matches in results["patterns"].items():
    print(f"  {pattern}: {matches}")

print("\n=== Key Phrases ===")
for phrase in results["key_phrases"]:
    print(f"  - {phrase}")
```

IOB Tagging Under the Hood

NER models internally use **IOB tagging** (Inside-Outside-Beginning) to label each token. For the entity "Steve Jobs": "Steve" gets B-PERSON (beginning) and "Jobs" gets I-PERSON (inside). Non-entity tokens get O (outside). This scheme handles multi-word entities and distinguishes consecutive entities of the same type ("Steve Jobs" and "Tim Cook" appearing next to each other).
    NER models internally use **IOB tagging** (Inside-Outside-Beginning) to label each token. For the entity "Steve Jobs": "Steve" gets B-PERSON (beginning) and "Jobs" gets I-PERSON (inside). Non-entity tokens get O (outside). This scheme handles multi-word entities and distinguishes consecutive entities of the same type ("Steve Jobs" and "Tim Cook" appearing next to each other).