Diaries...

Named Entity Linking (NEL)

Named Entity Linking, also known as Named Entity Disambiguation, is an advanced Natural Language Processing (NLP) technique that extends beyond simply identifying entities in text. Its primary goal is to connect a named entity to a unique, real-world identity in a knowledge base. This is crucial for resolving ambiguity when a name, like “Paris,” could refer to multiple different things.

The State of the Art in Information Extraction: From Pipelines to Unified Paradigms

Information Extraction (IE) is a cornerstone of modern Natural Language Processing (NLP), focused on automatically extracting structured information from unstructured or semi-structured text. Its goal is to transform free-form text into a machine-readable format, such as a database or knowledge graph, enabling applications from sentiment analysis and question answering to semantic search and bioinformatics. The field has seen a dramatic evolution, moving from rule-based systems to sophisticated neural architectures, with Large Language Models (LLMs) now redefining the cutting edge.

Git Conventional Commits

Git Conventional Commits provide a standardized way of writing commit messages to make them more readable and machine-parsable. The format is <type>[optional scope]: <description>. Here are some common examples and variations:
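As a rough illustration of the "machine-parsable" claim, the header format can be matched with a regular expression. This is a minimal sketch, not an official parser: the names `HEADER_RE` and `parse_header` are hypothetical, and the optional `!` breaking-change marker is taken from the Conventional Commits specification rather than from the format shown above.

```python
import re

# Header format: <type>[optional scope]: <description>
# Assumption: an optional "!" before the colon marks a breaking change,
# as described in the Conventional Commits specification.
HEADER_RE = re.compile(
    r"^(?P<type>[a-zA-Z]+)"            # type, e.g. feat, fix, docs
    r"(?:\((?P<scope>[^()\r\n]+)\))?"  # optional scope in parentheses
    r"(?P<breaking>!)?"                # optional breaking-change marker
    r": (?P<description>.+)$"          # colon, space, then the description
)


def parse_header(header):
    """Return the parts of a commit header as a dict, or None if it does not match."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    return {
        "type": match.group("type"),
        "scope": match.group("scope"),  # None when no scope is given
        "breaking": match.group("breaking") is not None,
        "description": match.group("description"),
    }


if __name__ == "__main__":
    for line in (
        "feat(parser): add support for arrays",
        "fix: handle empty input",
        "updated stuff",  # not a conventional commit, so this prints None
    ):
        print(parse_header(line))
```

Only the first line of a commit message is handled here; bodies and footers are out of scope for this sketch.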

Python-Powered PDF Text Extraction: A Practical Guide

Extracting text from PDFs is a common first step in many data pipelines, but it’s rarely a clean process. PDFs are designed for visual presentation, not data extraction, which means the raw text you get is often riddled with formatting issues like unwanted line breaks, hyphenated words, and inconsistent spacing.
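The three issues named above can be handled with a few regular expressions once the raw text is in hand. The sketch below is one possible cleanup pass, assuming the text has already been extracted into a plain string by whatever PDF library is in use; the function name `clean_pdf_text` is hypothetical.

```python
import re


def clean_pdf_text(raw):
    """Rejoin hyphenated words, unwrap line breaks, and normalize spacing."""
    # Rejoin words split across a line break: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)

    # Treat blank lines as paragraph boundaries and keep them.
    paragraphs = re.split(r"\n\s*\n", text)

    cleaned = []
    for paragraph in paragraphs:
        # Remaining single line breaks are layout artifacts, so unwrap them.
        paragraph = paragraph.replace("\n", " ")
        # Collapse runs of spaces and tabs into a single space.
        paragraph = re.sub(r"[ \t]+", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)

    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "PDF text extrac-\ntion is rarely\nclean.\n\nSpacing   is   inconsistent."
    print(clean_pdf_text(sample))
```

The hyphen rule is deliberately naive: it also joins genuine compounds that happen to break at a line end, so a dictionary check may be worth adding for real documents.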

Turning Text into Numbers: The Art of Text Representation

In the world of machine learning, the quality of your features directly determines the quality of your results—a principle known as “garbage in, garbage out.” For Natural Language Processing (NLP), this means that converting raw text into a numerical format, or text representation, is one of the most critical steps in the entire pipeline.

An In-Depth Guide to Essential Text Exploration Techniques

Text exploration is the indispensable first step in Natural Language Processing (NLP) and data science, where raw, unstructured text is transformed into meaningful, actionable insights. By applying these techniques, we can uncover hidden patterns, themes, and linguistic properties that are crucial for building more advanced models and making data-driven decisions. This guide details 15 of the most important text exploration techniques, complete with their applications, units of analysis, metrics, visualization strategies, and key scientific references. 💡

Getting started with Apache Airflow

Setting up Apache Airflow with Docker is the recommended and easiest way to get started, especially for local development. Docker isolates Airflow and its dependencies in containers, preventing conflicts with your host machine and ensuring a reproducible environment.

Data acquisition

Data is paramount to any Machine Learning (ML) system, frequently becoming the primary bottleneck in industrial projects. This section outlines various strategies for acquiring relevant data for Natural Language Processing (NLP) initiatives.

Evaluation in the NLP Pipeline: Measuring Model Success

Evaluation is a crucial step in the Natural Language Processing (NLP) pipeline, assessing a model’s “goodness,” primarily its performance on unseen data. Success hinges on using the right metrics and following a proper evaluation process. Metrics vary by NLP task and pipeline phase (model building, deployment, production), with machine learning (ML) metrics common in early phases and business metrics added in production to gauge business impact.

A Look at the Modern Natural Language Processing Pipeline: From Data to Intelligent Production

As an NLP learner specializing in modern techniques, I’ve outlined the essential stages of a contemporary NLP pipeline. This structured workflow transforms raw textual data into actionable insights or intelligent applications. This article summarizes these eight fundamental steps, integrating best practices and key academic references.
