Drift

The personal blog of Phoenix-based developer Josh Janusch

Using TCPDF to Handle PDF Form Overflows

One of the most difficult challenges I have faced was a project that required filling out PDF forms dynamically and automatically handling text that falls beyond the bounds of an input. The project was written in PHP and after numerous attempts at solving this issue, we managed to do it using TCPDF.

The Problem

Let's start off with the problem. The project was a large report creator that started as a way to generate dynamic reports based on a few dozen fields that the users had full control over. To do this, we utilized Twig templates using a recursive include method. We then fed the rendered HTML to wkhtmltopdf to convert it to PDFs, sometimes 30-40 pages long.

The next phase of the project switched gears. We forked the editor, added dozens of new record types and input methods, and instead of generating PDFs, we had to fill out several existing PDF forms, each containing many inputs. With this phase, we had to account for users entering far more text than fits into a PDF input and move the overflow onto supplemental pages.

[Image: example of overflowing PDF content]

This was far more trouble than we had bargained for.

Options

My first thought was somehow connecting the inputs of the PDF so that text automatically flows between them. This is possible by embedding JavaScript in the PDF. Unfortunately, we could not modify the source documents, and this wouldn't really work programmatically since it relies on keyup or similar events.

The next possibility was finding a tool that would take an input and say if it was too much. That would have helped, but it wouldn't help in moving the overflow to supplemental pages.

Another option was to measure the width of each character and add it all up manually. This might have worked to an extent. It would have been a ton of work (the client is creating reports in dozens of languages and the character set would be huge), but it showed promise. After investigation, however, we realized that kerning, line heights, line breaks, field padding, etc. would all play into it. We determined it was highly unlikely we could handle all situations or do this efficiently, so we moved on.

The final option was to calculate how large a block of text is once rendered and figure out whether it fits and where it breaks. This seemed like the most foolproof way, so we went with it.

Finding A Way

This solution proved difficult. Few PHP libraries or command line tools have the ability to measure the height of text. We were already using wkhtmltopdf and PDFtk, so we first tried those. They were a no-go.

Next, we looked at PDFlib, a native PHP extension. This allowed me to half-accomplish the goal in a roundabout way. I was able to create a temporary PDF that would let me set the dimensions of a text field (called a textflow), put text into it, and move any remaining text into a second field. It worked, and it was fast. A 10-page overflow could be done in milliseconds. My prototype for this is no longer available, but it was built around this code from PDFlib's documentation:

do {  
    /* Fill the first column */
    $result = $p->fit_textflow($tf, $llx1, $lly1, $urx1, $ury1, $optlist);

    /* Fill the second column if we have more text*/
    if ($result != "_stop") {
        $result = $p->fit_textflow($tf, 
                    $llx2, $lly2, $urx2, $ury2, $optlist);
    }
} while ($result == "_boxfull" || $result == "_nextpage");

Sadly, this was a one-way street. You could input text, but there was no way to output the text in a specific textflow. So there was no way to figure out which text needed to be added to supplemental pages or even where to break the first field, just that it wasn't going to fit.

With PDFlib out, I reluctantly turned toward PHP libraries. I had previously had some experience with Dompdf, mPDF, FPDF, and a few others. They were extremely slow, and it worried me that this was all that was left to me without resorting to an API written in Java or something else.

We went back through those and none of them had the functionality we were looking for, or didn't have it implemented well enough to be usable. We finally found TCPDF, which included a function which did exactly what we needed, getStringHeight().

Implementation

getStringHeight() takes a width, a string, and padding options and returns the height the string would occupy within those constraints, using the font currently set on the TCPDF object. We could then compare that to the height of the field and know if there was overflow. We finally had it.
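As a rough illustration of that comparison, here is a minimal sketch. To keep it runnable without TCPDF installed, the measuring step is passed in as a callable. `fieldOverflows()` and the fixed-width `$fakeMeasure` stand-in are hypothetical names for this example and are not part of TCPDF.

```php
<?php
// Hypothetical helper: true when $text is taller than the field.
// $measure is any callable (width, text) => height.
function fieldOverflows(callable $measure, string $text, float $fieldWidth, float $fieldHeight): bool
{
    return $measure($fieldWidth, $text) > $fieldHeight;
}

// Stand-in measurer (an assumption, not TCPDF): every character is 2 units wide,
// every line is 5 units tall, and text wraps purely by character count.
$fakeMeasure = function (float $width, string $text): float {
    $charsPerLine = max(1, (int) floor($width / 2));
    $lines = max(1, (int) ceil(strlen($text) / $charsPerLine));
    return $lines * 5.0;
};

$short = 'Fits on one line';
$long  = str_repeat('This sentence keeps going. ', 20);

// A field 100 units wide and 15 units (three lines) tall
var_dump(fieldOverflows($fakeMeasure, $short, 100, 15));
var_dump(fieldOverflows($fakeMeasure, $long, 100, 15));
```

With a real TCPDF instance, the callable would presumably just wrap the call used in this post, something like `fn($w, $t) => $pdf->getStringHeight($w, $t, false, false, $paddings)`.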

The first prototype went word-by-word through the input until it went over the height. It worked and it worked well... if you had 10 minutes to handle each field. It was extremely slow. The basic logic behind this:

function getOverflow(\TCPDF $pdf, $string, $fieldWidth, $fieldHeight) {
    // Cell padding for the field; defined inside the function so it is in scope
    $paddings = ['T' => 0, 'R' => 2.835, 'B' => 0, 'L' => 2.835];

    // replace line breaks with PHP_EOL. Shouldn't be necessary but it didn't work correctly otherwise
    $string = preg_replace('/(?<! )\\n(?! )/', sprintf(' %s ', PHP_EOL), $string);
    // 'strlen' keeps words like "0", which a bare array_filter() would drop
    $words = array_filter(explode(' ', $string), 'strlen');

    // $i is the number of leading words known to fit in the field
    $i = 0;
    $wordCount = count($words);
    while ($i < $wordCount) {
        // measure the text with one more word added
        $testString = implode(' ', array_slice($words, 0, $i + 1));
        if ($pdf->getStringHeight($fieldWidth, $testString, false, false, $paddings) > $fieldHeight) {
            break;
        }
        $i++;
    }

    return ['text' => implode(' ', array_slice($words, 0, $i)), 'overflow' => implode(' ', array_slice($words, $i))];
}

It's relatively simple. Split the string on spaces, loop through the words, and compare the height of the growing string to the field height. It worked, albeit very slowly. But as a proof-of-concept, it did its job and proved that this problem was solvable.

Optimization

Now that it worked, it needed to be optimized. After some tests, it became clear that getting it fast enough to run while the user waited was unrealistic. We resigned ourselves to offloading the generation to background jobs whose output would then be sent to the user. It wasn't ideal, but it was acceptable given the functionality it provided.

So the first plan was to start with a simple one-way binary search. Start at the full string, halve it until it's shorter than the field, and then run the previous function starting at that word.

function getOverflow(\TCPDF $pdf, $string, $fieldWidth, $fieldHeight) {
    // Cell padding for the field; defined inside the function so it is in scope
    $paddings = ['T' => 0, 'R' => 2.835, 'B' => 0, 'L' => 2.835];

    // replace line breaks with PHP_EOL. Shouldn't be necessary but it didn't work correctly otherwise
    $string = preg_replace('/(?<! )\\n(?! )/', sprintf(' %s ', PHP_EOL), $string);
    // 'strlen' keeps words like "0", which a bare array_filter() would drop
    $words = array_filter(explode(' ', $string), 'strlen');

    $testString = implode(' ', $words);
    $i = $wordCount = count($words);

    // halve the word count until the first $i words fit in the field
    while ($i > 0 && $pdf->getStringHeight($fieldWidth, $testString, false, false, $paddings) > $fieldHeight) {
        $i = (int) floor($i / 2);
        $testString = implode(' ', array_slice($words, 0, $i));
    }

    // from there, add one word at a time until the next word would overflow
    while ($i < $wordCount) {
        $testString = implode(' ', array_slice($words, 0, $i + 1));
        if ($pdf->getStringHeight($fieldWidth, $testString, false, false, $paddings) > $fieldHeight) {
            break;
        }
        $i++;
    }

    return ['text' => implode(' ', array_slice($words, 0, $i)), 'overflow' => implode(' ', array_slice($words, $i))];
}

This improved things dramatically. What previously took minutes was now taking ~45s. That was only for a single field, however, and was still unacceptable: we were hoping to do an entire document (with dozens of fields) in an average of 10s.

Next, we had to look into ways to improve that. We took the binary search a step further and made it go forward until the string is taller than the field and then go in reverse word-by-word. This was another improvement, decreasing that average by another 10s.

We realized here that we had to automate the binary search. Go back and forth until we're within a few words, and then go into word-by-word mode. That should greatly reduce iterations.
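For comparison, a textbook two-way binary search over the word count might look like the sketch below. This is not the code from the project. It assumes the measured height never shrinks as words are added, and it takes the measurer as a callable so it runs without TCPDF. `countFittingWords()` and the toy `$measure` closure are hypothetical names.

```php
<?php
// Returns how many leading words fit within $fieldHeight.
// Assumes the measured height never decreases as words are added.
// $measure stands in for a TCPDF-backed callable: (string) => height.
function countFittingWords(callable $measure, array $words, float $fieldHeight): int
{
    $low = 0;               // zero words always counts as fitting
    $high = count($words);  // upper bound on the answer
    while ($low < $high) {
        // Bias the midpoint upward so the loop always makes progress
        $mid = intdiv($low + $high + 1, 2);
        $candidate = implode(' ', array_slice($words, 0, $mid));
        if ($measure($candidate) <= $fieldHeight) {
            $low = $mid;       // the first $mid words fit, try more
        } else {
            $high = $mid - 1;  // too tall, try fewer
        }
    }
    return $low;
}

// Toy measurer (assumption): 4 words per line, 5 units per line.
$measure = fn(string $text): float => 5.0 * ceil(count(array_filter(explode(' ', $text), 'strlen')) / 4);

$words = explode(' ', 'one two three four five six seven eight nine ten eleven twelve thirteen');
$fit = countFittingWords($measure, $words, 10.0); // room for two lines

echo implode(' ', array_slice($words, 0, $fit)), PHP_EOL; // text kept in the field
echo implode(' ', array_slice($words, $fit)), PHP_EOL;    // overflow
```

The appeal is that the number of measurements grows with the logarithm of the word count, so a 1,000-word string needs roughly ten measurements instead of hundreds.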

Unfortunately, I can't show full code samples beyond this point because it's what was actually used in the application, but I can give a general overview.

protected function operateWhile(array $words, $startingIndex, $fieldWidth, $fieldHeight, $operator, $operatorVal, $comparison) {}  

The above method is what the getOverflow() method grew into. It runs a search in a specified direction using a specified interval. For example, the previous function could be run with:

$idx = $this->operateWhile($words, count($words), $fieldWidth, $fieldHeight, '/', 2, '>=');
$idx = $this->operateWhile($words, $idx, $fieldWidth, $fieldHeight, '+', 1, '<=');
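Since the body of operateWhile() isn't shown, here is one guess at how such a method could behave, written as a plain function. It is an assumption-heavy sketch: it takes a `$measure` callable in place of `$fieldWidth` so it runs without TCPDF, and the stepping and stopping rules are inferred from the two calls above rather than taken from the real code.

```php
<?php
// Hypothetical reconstruction of operateWhile() as a plain function.
// While "height of the first $index words <comparison> $fieldHeight" holds,
// replace $index with "$index <operator> $operatorVal". Returns the first
// index where the comparison stops holding (may be count($words) + 1).
function operateWhile(callable $measure, array $words, $startingIndex, $fieldHeight, $operator, $operatorVal, $comparison)
{
    $steps = [
        '+' => fn($a, $b) => $a + $b,
        '-' => fn($a, $b) => $a - $b,
        '*' => fn($a, $b) => $a * $b,
        '/' => fn($a, $b) => $a / $b,
    ];
    $tests = [
        '<'  => fn($a, $b) => $a < $b,
        '<=' => fn($a, $b) => $a <= $b,
        '>'  => fn($a, $b) => $a > $b,
        '>=' => fn($a, $b) => $a >= $b,
    ];

    $index = (int) $startingIndex;
    while ($index >= 0 && $index <= count($words)) {
        $height = $measure(implode(' ', array_slice($words, 0, $index)));
        if (!$tests[$comparison]($height, $fieldHeight)) {
            break;
        }
        $next = (int) floor($steps[$operator]($index, $operatorVal));
        if ($next === $index) {
            break; // the step no longer moves the index, so stop
        }
        $index = $next;
    }
    return max(0, $index);
}

// Toy measurer (assumption): 4 words per line, 5 units per line.
$measure = fn(string $text): float => 5.0 * ceil(count(array_filter(explode(' ', $text), 'strlen')) / 4);
$words = explode(' ', 'a b c d e f g h i j k l m n o p q r s t');

// Halve from the full string, then walk forward word by word until it overflows.
$idx = operateWhile($measure, $words, count($words), 10.0, '/', 2, '>=');
$idx = operateWhile($measure, $words, $idx, 10.0, '+', 1, '<=');

// $idx is the first word count that overflows, so one fewer word fits
echo implode(' ', array_slice($words, 0, max(0, $idx - 1))), PHP_EOL;
```

In this reading, the forward pass stops on the first word count that no longer fits, which is why the text kept in the field is one word shorter than the returned index.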

Further testing came up with things like:

$idx = $this->operateWhile($words, count($words), $fieldWidth, $fieldHeight, '/', 2, '>=');
$idx = $this->operateWhile($words, $idx, $fieldWidth, $fieldHeight, '*', .5, '<=');
$idx = $this->operateWhile($words, $idx, $fieldWidth, $fieldHeight, '/', 2, '>=');
$idx = $this->operateWhile($words, $idx, $fieldWidth, $fieldHeight, '*', .5, '<=');
$idx = $this->operateWhile($words, $idx, $fieldWidth, $fieldHeight, '/', 2, '>=');
$idx = $this->operateWhile($words, $idx, $fieldWidth, $fieldHeight, '+', 1, '<=');

For the most part, the more iterations of the binary search, the quicker things ran. The problem was that the longer the string, the larger the gap the one-by-one had to bridge. We needed a way to dynamically run the search based on string length and field size. To that end, we developed this:

const OVERFLOW_INCREMENT_MULTIPLIER = .25;  
const OVERFLOW_INCREMENT_MIN = 7;

$breakIndex = floor($this->operateWhile($words, count($words), $fieldWidth, $fieldHeight, '/', 2, '>='));
// Assumed starting step: a fraction of the first break index
$incrementer = floor($breakIndex * self::OVERFLOW_INCREMENT_MULTIPLIER);

while ($incrementer >= self::OVERFLOW_INCREMENT_MIN) {  
    $breakIndex = $this->operateWhile($words, $breakIndex, $fieldWidth, $fieldHeight, '+', $incrementer, '<');
    $breakIndex = $this->operateWhile($words, $breakIndex, $fieldWidth, $fieldHeight, '-', floor($incrementer / 3), '>');
    $incrementer = floor($incrementer * self::OVERFLOW_INCREMENT_MULTIPLIER);
}

$breakIndex = $this->operateWhile($words, $breakIndex, $fieldWidth, $fieldHeight, '+', 1, '<=');

That's roughly the final approach, minus some additional logic to handle it running too long or breaking when the string is too short. The two constants dictate how the search is run and for how long.
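To make the role of the two constants more concrete, here is a simplified, self-contained variant of the shrinking-step idea. It is a sketch and not the production code: it only strides forward (there is no backward pass), it uses plain constants instead of class constants, and `findFitCount()` and the toy `$measure` closure are hypothetical names.

```php
<?php
const OVERFLOW_INCREMENT_MULTIPLIER = 0.25; // how quickly the stride shrinks
const OVERFLOW_INCREMENT_MIN = 7;           // below this stride, go word by word

// Returns how many leading words fit, using strides that shrink each pass.
// $index always refers to a word count that is known to fit.
function findFitCount(callable $measure, array $words, float $fieldHeight): int
{
    $total = count($words);
    $fits = fn(int $n): bool => $measure(implode(' ', array_slice($words, 0, $n))) <= $fieldHeight;

    $index = 0;
    $stride = (int) floor($total * OVERFLOW_INCREMENT_MULTIPLIER);

    while ($stride >= OVERFLOW_INCREMENT_MIN) {
        // Stride forward for as long as the longer text still fits
        while ($index + $stride <= $total && $fits($index + $stride)) {
            $index += $stride;
        }
        $stride = (int) floor($stride * OVERFLOW_INCREMENT_MULTIPLIER);
    }

    // Close the remaining gap one word at a time
    while ($index < $total && $fits($index + 1)) {
        $index++;
    }
    return $index;
}

// Toy measurer (assumption): 4 words per line, 5 units per line.
$measure = fn(string $text): float => 5.0 * ceil(count(array_filter(explode(' ', $text), 'strlen')) / 4);

$words = array_map(fn($i) => "word$i", range(1, 1000));
$fit = findFitCount($measure, $words, 40.0); // a field eight lines tall
echo $fit, PHP_EOL;
```

A smaller multiplier shrinks the stride faster, which means fewer passes but a potentially longer word-by-word finish. The minimum decides when striding stops being worth it.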

That brought the final duration down to our acceptable mark of ~10s on average. It's still not ideal, but it works and the UX is acceptable. If a string flows over the dimensions, it is truncated and the remainder is moved onto another page.
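The truncate-and-continue step could be sketched roughly as follows. This is an illustration under several assumptions: the measurer is a stand-in callable rather than TCPDF, whitespace (including line breaks) is collapsed for simplicity, a plain linear scan is used to find each break, and `fitCount()` and `splitForSupplementalPages()` are hypothetical names.

```php
<?php
// Hypothetical helper: how many leading words fit within $maxHeight (linear scan).
function fitCount(callable $measure, array $words, float $maxHeight): int
{
    $n = 0;
    while ($n < count($words) && $measure(implode(' ', array_slice($words, 0, $n + 1))) <= $maxHeight) {
        $n++;
    }
    return $n;
}

// Splits text into the part that stays in the form field plus one chunk per supplemental page.
function splitForSupplementalPages(callable $measure, string $text, float $fieldHeight, float $pageHeight): array
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $n = fitCount($measure, $words, $fieldHeight);
    $result = ['field' => implode(' ', array_slice($words, 0, $n)), 'pages' => []];
    $words = array_slice($words, $n);

    while ($words) {
        // Always take at least one word so an oversized word cannot stall the loop
        $n = max(1, fitCount($measure, $words, $pageHeight));
        $result['pages'][] = implode(' ', array_slice($words, 0, $n));
        $words = array_slice($words, $n);
    }
    return $result;
}

// Toy measurer (assumption): 4 words per line, 5 units per line.
$measure = fn(string $text): float => 5.0 * ceil(count(array_filter(explode(' ', $text), 'strlen')) / 4);

$text = implode(' ', array_map(fn($i) => "w$i", range(1, 20)));
$split = splitForSupplementalPages($measure, $text, 5.0, 10.0);

echo $split['field'], PHP_EOL;        // what stays in the field
echo count($split['pages']), PHP_EOL; // number of supplemental pages
```

In practice, `fitCount()` would be swapped for whichever faster search is in use; the pagination loop itself does not care how the break point is found.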

Note: No actual code samples were used due to contractual requirements, although original code samples were referenced to create the ones in this post.