Sometimes we build “News Aggregator” type themes, where a post doesn’t have its own content, but only links to an external article. Or we want the first image in the content to automatically become the “Featured Image” if the editor forgets to set it.
In both cases, we need to “scan” the post content (the_content) and extract the first <a> or <img> tag from it.
Method: the DOMDocument class
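Many developers use Regular Expressions (Regex) for this, but parsing HTML with Regex is fragile: a pattern that works on one post breaks as soon as the attribute order, the quote style, or the line breaks change. It’s better to use PHP’s built-in DOMDocument class, which actually parses the markup.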
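To make the difference concrete, here is a small sketch comparing the two approaches. The sample markup and the regex pattern are made up for illustration; the pattern stands in for the kind of quick one-liner often found in themes, not for any specific plugin’s code.

```php
<?php
// Hypothetical sample markup: class comes before href, and the value is single-quoted.
$html = "<p><a class='btn' href='https://example.com/story'>Story</a></p>";

// A naive pattern that expects href as the first attribute, in double quotes.
$found = preg_match( '/<a href="([^"]+)"/', $html, $matches );
// $found is 0: the pattern misses this perfectly valid link.

// DOMDocument parses the markup, so attribute order and quote style do not matter.
$doc      = new DOMDocument();
$previous = libxml_use_internal_errors( true ); // keep parser warnings quiet
$doc->loadHTML( $html );
libxml_clear_errors();
libxml_use_internal_errors( $previous );

$href = $doc->getElementsByTagName( 'a' )->item( 0 )->getAttribute( 'href' );
// $href is 'https://example.com/story'
```

A more elaborate regex could handle this particular case, but each new variation in the markup needs another special case, while the parser handles them all the same way.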
Here’s a ready-made function you can paste into functions.php:
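function get_first_link_url( $content ) {
    // If content is empty, return false
    if ( empty( $content ) ) {
        return false;
    }

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them
    // (libxml predates HTML5 and complains about <section> etc.)
    $previous = libxml_use_internal_errors( true );

    // Load HTML as UTF-8. The XML encoding declaration replaces the older
    // mb_convert_encoding( ..., 'HTML-ENTITIES', 'UTF-8' ) hack, which is deprecated since PHP 8.2.
    $doc->loadHTML( '<?xml encoding="UTF-8">' . $content );

    // Discard the collected warnings and restore the previous setting
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    $links = $doc->getElementsByTagName( 'a' );
    if ( $links->length > 0 ) {
        // Return href of the first link
        return $links->item( 0 )->getAttribute( 'href' );
    }

    return false;
}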
Usage in the Loop
$link = get_first_link_url( get_the_content() );
if ( $link ) {
    echo '<a href="' . esc_url( $link ) . '" class="read-more-external">Read original</a>';
}
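This solution copes with broken HTML much better than a Regex, because the parser repairs unclosed or badly nested tags while loading the content. The extracted href is still untrusted input, so keep passing it through esc_url() on output, as in the example above.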
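The same approach covers the <img> case mentioned at the start. The following is a minimal sketch rather than a finished helper: get_first_image_url() is a hypothetical name, and it assumes the image URL sits in a plain src attribute (markup from lazy-loading plugins that move the URL to data-src would need extra handling).

```php
<?php
/**
 * Hypothetical helper: returns the src of the first <img> in $content
 * that has a non-empty src attribute, or false if there is none.
 */
function get_first_image_url( $content ) {
    if ( empty( $content ) ) {
        return false;
    }

    $doc      = new DOMDocument();
    $previous = libxml_use_internal_errors( true ); // keep parser warnings quiet

    // The XML encoding declaration hints that the fragment is UTF-8.
    $doc->loadHTML( '<?xml encoding="UTF-8">' . $content );

    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    // Walk the images in document order and skip any without a usable src.
    foreach ( $doc->getElementsByTagName( 'img' ) as $img ) {
        $src = $img->getAttribute( 'src' );
        if ( '' !== $src ) {
            return $src;
        }
    }

    return false;
}
```

In the loop it could be called as get_first_image_url( get_the_content() ), with the result passed through esc_url() before it is printed, just like the link URL. When pasting into functions.php, leave out the opening <?php tag, since the file already has one.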


