Who is this guide for?

This guide is designed for beginner-level users and takes about 2 minutes to read.

How to Extract and Transform Structured Data from Text

Parse emails, addresses, phone numbers, dates, and URLs from unstructured text using regex and pattern matching.

Extracting Structured Data from Text

Unstructured text contains valuable structured data — email addresses, phone numbers, dates, URLs, prices, and identifiers. Extracting this data accurately requires understanding the patterns and edge cases of each data type.

Email Extraction

The RFC 5322 email specification allows a surprising range of characters, but practical extraction uses simpler patterns. Match a local part made of alphanumeric characters, dots, hyphens, underscores, and plus signs, followed by @, followed by a domain with at least one dot. Be aware of edge cases: plus-addressed emails (user+tag@example.com), subdomains (user@mail.example.com), and newer, longer TLDs (.museum, .technology), which break patterns that cap the TLD at three or four letters.
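As a rough illustration of the simplified approach described above, the following sketch uses a deliberately permissive pattern. The names `EMAIL_RE` and `extract_emails` are made up for this example, and the pattern is an assumption about what "practical" means here, not a full RFC 5322 parser.

```python
import re
from typing import List

# Simplified pattern (an assumption for illustration, not RFC 5322 compliant).
EMAIL_RE = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._%+-]*"   # local part; '+' allows plus-addressing
    r"@"
    r"(?:[A-Za-z0-9-]+\.)+"           # one or more domain labels (covers subdomains)
    r"[A-Za-z]{2,}"                   # TLD with no upper length limit (.museum, .technology)
)

def extract_emails(text: str) -> List[str]:
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_RE.findall(text)

sample = "Contact jane.doe+news@mail.example.com or curator@gallery.museum."
print(extract_emails(sample))
```

Because the TLD part only accepts letters, the sentence-ending period after `gallery.museum` is left out of the match.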

Phone Number Parsing

Phone numbers appear in dozens of formats: (555) 123-4567, +1-555-123-4567, 555.123.4567, 5551234567. Rather than writing one regex to match all formats, strip all non-digit characters first (noting whether the number began with a +, which signals that a country code follows), then validate the resulting digit sequence against known country formats. The E.164 format (+[country code][subscriber number], at most 15 digits) is the international standard normalized form.
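The strip-then-validate idea can be sketched as follows. This example assumes every input is a US or Canadian number (country code 1, ten national digits); the function name `normalize_us_phone` is hypothetical, and other countries would need their own length rules.

```python
import re
from typing import Optional

def normalize_us_phone(raw: str) -> Optional[str]:
    """Return the E.164 form of a US/Canada number, or None if the digit count is wrong.

    Assumption for this sketch: the input is a North American number.
    """
    digits = re.sub(r"\D", "", raw)           # strip everything that is not a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                   # drop the leading country code
    if len(digits) != 10:
        return None                           # wrong length for this country format
    return "+1" + digits

for raw in ["(555) 123-4567", "+1-555-123-4567", "555.123.4567", "5551234567"]:
    print(raw, "->", normalize_us_phone(raw))
```

All four formats listed above normalize to the same string, `+15551234567`, which makes deduplication and comparison straightforward.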

Date Recognition

Dates are ambiguous: 01/02/03 could be January 2, 2003 (US month-first order), February 1, 2003 (UK day-first order), or February 3, 2001 (year-first order). Strict ISO 8601 avoids the problem by requiring a four-digit year and hyphens, as in 2003-02-01. When extracting dates from text, look for unambiguous patterns first (written month names, ISO 8601 format), then fall back to locale-specific patterns with clear documentation about which interpretation you're using.
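One possible way to apply the "unambiguous first, locale fallback second" order is shown below. The function `parse_date` and its `day_first` flag are assumptions made for this sketch; only English month names, ISO 8601 dates, and slash-separated dates with four-digit years are handled.

```python
import re
from datetime import date
from typing import Optional

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

ISO_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
WRITTEN_RE = re.compile(
    r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b", re.IGNORECASE)
NUMERIC_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def parse_date(text: str, day_first: bool = False) -> Optional[date]:
    """Return the first date found in `text`, or None.

    `day_first` documents the locale choice for ambiguous numeric dates:
    False means month/day/year (US), True means day/month/year (UK).
    """
    try:
        m = ISO_RE.search(text)               # 1. unambiguous: ISO 8601
        if m:
            y, mo, d = map(int, m.groups())
            return date(y, mo, d)
        m = WRITTEN_RE.search(text)           # 2. unambiguous: written month name
        if m:
            return date(int(m.group(3)), MONTHS[m.group(1).lower()], int(m.group(2)))
        m = NUMERIC_RE.search(text)           # 3. ambiguous: use the stated locale
        if m:
            a, b, y = map(int, m.groups())
            day, month = (a, b) if day_first else (b, a)
            return date(y, month, day)
    except ValueError:
        return None                           # matched the pattern but not a real date
    return None

print(parse_date("Due 01/02/2003"))                  # month-first reading
print(parse_date("Due 01/02/2003", day_first=True))  # day-first reading
```

Making the locale an explicit parameter keeps the interpretation visible at every call site instead of burying it in the pattern.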

URL Detection

URLs in text may or may not have the protocol prefix. Match both http(s)://domain.tld/path and bare domain.tld/path patterns. Be careful with punctuation at the end: "Visit example.com." — the period is sentence punctuation, not part of the URL. Parentheses in URLs (common in Wikipedia links) require careful handling to avoid matching surrounding sentence parentheses.
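The trailing-punctuation and parenthesis handling described above might look like the sketch below. `URL_RE` and `extract_urls` are illustrative names, and the pattern is intentionally loose: because it accepts bare domains, it will also match things like file names (report.pdf) or the domain part of an email address, so results would still need filtering.

```python
import re
from typing import List

URL_RE = re.compile(
    r"(?:https?://)?"                      # optional protocol prefix
    r"(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}"    # domain.tld, with optional subdomains
    r"(?:/\S*)?"                           # optional path: everything up to whitespace
)

TRAILING = ".,;:!?\"'"

def extract_urls(text: str) -> List[str]:
    """Find URL-like substrings and trim punctuation that belongs to the sentence."""
    urls = []
    for m in URL_RE.finditer(text):
        url = m.group(0).rstrip(TRAILING)
        # Remove a closing parenthesis only while it has no matching opener
        # inside the URL, so Wikipedia-style links keep their balanced pair.
        while url.endswith(")") and url.count(")") > url.count("("):
            url = url[:-1].rstrip(TRAILING)
        urls.append(url)
    return urls

sample = ("Visit example.com. Docs: https://example.org/a/b, and "
          "(see https://en.wikipedia.org/wiki/Rust_(programming_language)).")
for url in extract_urls(sample):
    print(url)
```

Counting parentheses is a heuristic: it keeps the balanced pair in the Wikipedia link while dropping the extra one that closes the surrounding sentence parenthesis.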

Validation After Extraction

Extraction finds patterns; validation confirms they're real. Extracted emails should have valid MX records on their domain. Phone numbers should have the correct number of digits for their country. URLs should be accessible (HTTP HEAD request). Dates should be valid calendar dates (no February 30th). Build a pipeline that extracts, validates, and flags uncertain matches for human review.
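A minimal, offline sketch of such an extract, validate, and flag pipeline is shown below. It covers only ISO dates and numbers written with a leading +; the names `Finding`, `RULES`, and `run_pipeline` are hypothetical, and the phone length rules are simplifying assumptions. Network checks such as MX lookups or HTTP HEAD requests are left out, but could be plugged in as additional validator functions.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class Finding:
    kind: str
    value: str
    status: str          # "valid" or "needs_review"

def valid_iso_date(value: str) -> bool:
    """True only for real calendar dates (rejects February 30th)."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

def valid_plus_phone(value: str) -> bool:
    """Assumed rules: +1 numbers need 11 digits; others at most 15 (E.164 cap)."""
    digits = re.sub(r"\D", "", value)
    if digits.startswith("1"):
        return len(digits) == 11
    return len(digits) <= 15

# Each rule pairs a loose extraction pattern with a stricter validator.
RULES = {
    "date": (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), valid_iso_date),
    "phone": (re.compile(r"\+\d[\d -]{6,}\d"), valid_plus_phone),
}

def run_pipeline(text: str) -> List[Finding]:
    findings = []
    for kind, (pattern, validator) in RULES.items():
        for m in pattern.finditer(text):
            status = "valid" if validator(m.group(0)) else "needs_review"
            findings.append(Finding(kind, m.group(0), status))
    return findings

sample = "Ship by 2024-02-30 or 2024-03-01. Call +1 555 123 4567 or +1 555 123."
for f in run_pipeline(sample):
    print(f.kind, f.value, f.status)
```

Keeping extraction patterns loose and validators strict means questionable matches are surfaced for review rather than silently dropped.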

Related guides

Text Encoding Explained: UTF-8, ASCII, and Beyond

Text encoding determines how characters are stored as bytes. Understanding UTF-8, ASCII, and other encodings prevents garbled text, mojibake, and data corruption in your applications and documents.

Regular Expressions: A Practical Guide for Text Processing

Regular expressions are powerful patterns for searching, matching, and transforming text. This guide covers the most useful regex patterns with real-world examples for common text processing tasks.

Markdown vs Rich Text vs Plain Text: When to Use Each

Choosing between Markdown, rich text, and plain text affects portability, readability, and editing workflow. This comparison helps you select the right text format for documentation, notes, and content creation.

How to Convert Case and Clean Up Messy Text

Messy text with inconsistent capitalization, extra whitespace, and mixed formatting is a common problem. This guide covers tools and techniques for cleaning, transforming, and standardizing text efficiently.

Troubleshooting Character Encoding Problems

Garbled text, question marks, and missing characters are symptoms of encoding mismatches. This guide helps you diagnose and fix the most common character encoding problems in web pages, files, and databases.