Starbeamrainbowlabs

Converting Hashtags into Titles with PHP

Recently I have been working on a website for someone I know. Mythdael (the awesome artist who created the design for this website!) did the design for that one too, and I felt that I had to bring it to life. While I was writing it, I found that I needed to convert any given hashtag into a presentable title. Since hashtags are all one word (and usually lower case), it is more or less impossible to work out where one word starts and another ends without outside help. The solution: a wordlist (my choice was a modified enable1.txt).

Here is the solution I came up with:

/*
 * From https://terenceyim.wordpress.com/2011/02/01/all-purpose-binary-search-in-php/
 * Parameters: 
 *   $a - The sorted array to search.
 *   $first - First index of the array to be searched (inclusive).
 *   $last - Last index of the array to be searched (exclusive).
 *   $key - The key to be searched for.
 *   $compare - A user defined function for comparison. Same definition as the one in usort
 *
 * Return:
 *   index of the search key if found, otherwise (-insert_index - 1).
 *   insert_index is the index of the smallest element that is greater than $key, or $last
 *   if $key is larger than all elements in the searched range.
 */
function binary_search(array $a, $first, $last, $key, $compare) {
    $lo = $first; 
    $hi = $last - 1;

    while ($lo <= $hi) {
        $mid = (int)(($hi - $lo) / 2) + $lo;
        $cmp = call_user_func($compare, $a[$mid], $key);

        if ($cmp < 0) {
            $lo = $mid + 1;
        } elseif ($cmp > 0) {
            $hi = $mid - 1;
        } else {
            return $mid;
        }
    }
    return -($lo + 1);
}

class hashtag_parser
{
    public $wordlist_length = 0;
    public $wordlist = [];

    function __construct($wordlist_path) {
        // Read the wordlist once, one word per line, without trailing newlines
        $this->wordlist = file($wordlist_path, FILE_IGNORE_NEW_LINES);
        $this->wordlist_length = count($this->wordlist);
    }

    public function in_wordlist($word)
    {
        $word = strtolower($word);

        $result = binary_search($this->wordlist, 0, $this->wordlist_length, $word, "strcmp");

        // binary_search() returns a negative number when the word is absent
        return $result > -1;
    }

    /**
     * Splits a hashtag into words using a greedy longest-match against
     * the wordlist, and returns the result as a title-cased string.
     */
    public function extract_words($hashtag)
    {
        // Remove the hash from the beginning if it is present
        if(substr($hashtag, 0, 1) === "#") $hashtag = substr($hashtag, 1);

        // Create an array to hold the words we find
        $words = [];

        $length = strlen($hashtag); // Cache the length of the hashtag
        $pos = 0;
        while($pos < $length)
        {
            // aim: find the length of the longest substring starting at
            // $pos that is a valid word according to the wordlist
            $longest_word_length = 0;
            for($scan_pos = $pos + 1; $scan_pos <= $length; $scan_pos++)
            {
                if($this->in_wordlist(substr($hashtag, $pos, $scan_pos - $pos)))
                    $longest_word_length = $scan_pos - $pos;
            }

            // Set the length of the longest word to the remainder of the
            // string if we don't find any valid words
            if($longest_word_length == 0) $longest_word_length = $length - $pos;

            $words[] = substr($hashtag, $pos, $longest_word_length);

            $pos += $longest_word_length;
        }

        return ucwords(implode(" ", $words));
    }
}

The code is a bit messy (perhaps I should tidy it up a lot), but it does the job I intended it to do. I used a class so that the wordlist is only read in once, when the parser is constructed: I was concerned that reading it in for every hashtag would make the code take too long - it is slow enough as it is. Thankfully a binary search algorithm written in PHP by terenceyim helped speed things up enormously compared with a linear in_array() check. Anyway, it can be used like this:

$parser = new hashtag_parser("/path/to/wordlist.txt");
echo($parser->extract_words("sometext")); // Prints "Some Text"
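One caveat worth spelling out: a binary search that compares with strcmp only gives correct answers if the wordlist is sorted in the same byte-wise order, holds lowercase words, and has no stray carriage returns left over from Windows line endings. The following is a minimal sketch of how a list could be normalised before use. The helper name prepare_wordlist() and the sample words are made up for this illustration and are not part of the class above.

```php
<?php
// Hypothetical helper (not part of the original class): normalise a list of
// words so that a strcmp-based binary search can be run against it.
function prepare_wordlist(array $lines): array
{
    $words = [];
    foreach ($lines as $line) {
        // trim() also removes a stray "\r" left by Windows line endings
        $word = strtolower(trim($line));
        if ($word !== "") {
            $words[] = $word;
        }
    }

    // Drop duplicates, then sort as plain strings so that the order
    // agrees with strcmp()
    $words = array_values(array_unique($words));
    sort($words, SORT_STRING);

    return $words;
}

// Made-up sample input standing in for the lines of a wordlist file
$words = prepare_wordlist(["Text\r", "some", "", "so", "some"]);
print_r($words);
```

If the wordlist file is already lowercase, sorted and free of duplicates, this step changes nothing and can be skipped.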

The algorithm I used is quite simple:

  1. Start with the character pointer at the beginning of the input.
  2. Scan ahead from the character pointer and find the length of the longest substring starting there that appears in the wordlist.
  3. If no valid word can be found, assume that the rest of the input is all one word.
  4. Extract the longest word we can find and add it to an array.
  5. Add the length of the word we found to the character pointer.
  6. If we haven't reached the end of the given input, go to step 2.
  7. If we have reached the end of the input, return the words we found.
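The steps above can also be sketched in a self-contained form that runs without a wordlist file. This is only an illustration under a few assumptions: greedy_split() and the tiny $wordset are hypothetical names invented for this example, and the binary search is swapped for a plain isset() lookup on array keys to keep the sketch short.

```php
<?php
// Hypothetical helper: greedy longest-match split of $text.
// $wordset is an array whose KEYS are the known lowercase words.
function greedy_split(string $text, array $wordset): array
{
    $words = [];
    $length = strlen($text);
    $pos = 0;

    while ($pos < $length) {
        $longest = 0;

        // Try every substring starting at $pos and remember the longest
        // one that is a known word
        for ($len = 1; $len <= $length - $pos; $len++) {
            if (isset($wordset[strtolower(substr($text, $pos, $len))])) {
                $longest = $len;
            }
        }

        // No known word starts here: treat the rest of the input as one word
        if ($longest === 0) {
            $longest = $length - $pos;
        }

        $words[] = substr($text, $pos, $longest);
        $pos += $longest;
    }

    return $words;
}

// Tiny made-up wordlist for illustration (not enable1.txt)
$wordset = array_flip(["so", "some", "text", "me"]);

echo ucwords(implode(" ", greedy_split("sometext", $wordset))), "\n"; // Prints "Some Text"
```

Because the match is greedy, "some" wins over the shorter "so" at the start of the input. The flip side is that a long word can swallow the beginning of the next one, so the output still needs a sanity check for unusual hashtags.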

I am posting this here in the hopes that someone else will find this code useful :)

Art by Mythdael