Getting started with PDF development

Introduction

PDF, short for "Portable Document Format," is a standardized format for document presentation that Adobe introduced in the early 1990s. Note that "portable" in this context means "cross-platform," not "small"; PDFs can be very large in the literal storage sense.

PDF documents are composed of a hierarchy of standardized "objects," which you can read all about in the official reference specification. When developing something involving them, there is no need to build it from the ground up. Instead, we can consider a few different options:

  • Use a pre-existing tool

  • Build tools of our own, using existing PDF libraries

  • Some combination of the above

Let's say we're interested in building an application that accepts report data, and produces a nicely formatted document at the other end. We'd probably have a designer help us figure out what can be modularized, and give us the general look and feel to build towards.

Tools

There are many PDF creation and modification tools available to us. On Linux systems, a popular choice is pdftk, a command-line tool that performs specific operations on documents - such as splitting, merging, reordering, rotating, filling out form fields, and so on. Given a design, one might be able to stitch together modular pieces of a report to create the whole, but only if the pages themselves are interchangeable.
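As a rough sketch of what that stitching could look like from a script, here is one way to drive pdftk from Python. The file names are hypothetical, and the snippet assumes pdftk is installed and on the PATH; the cat and output keywords are pdftk's own.

```python
import shutil
import subprocess


def build_cat_command(parts, output_path):
    """Return the pdftk argv that concatenates `parts` into `output_path`."""
    # pdftk syntax: pdftk in1.pdf in2.pdf cat output combined.pdf
    return ["pdftk", *parts, "cat", "output", output_path]


def stitch(parts, output_path):
    """Run pdftk if it is available; raise a clear error otherwise."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk was not found on the PATH")
    subprocess.run(build_cat_command(parts, output_path), check=True)


# Hypothetical single-purpose PDFs produced elsewhere.
command = build_cat_command(["cover.pdf", "summary.pdf", "charts.pdf"], "report.pdf")
print(" ".join(command))
```

Calling stitch(...) with the same arguments would actually run the command; it is kept separate so the command can be inspected first.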

If we want to, we can also use a desktop publishing tool that has some sort of scripting capability, and feed it information to create a similar report. One such tool is the open-source Scribus; a paid counterpart is Adobe InDesign. If you ask me, though, most projects are far from needing such expensive software.

There are also purpose-designed applications for performing these actions - they're examples of exactly what we might want to build. Two examples are Pentaho and BIRT. These may be worth exploring, especially if we find an open-source application capable of saving us a lot of frustration. However, note that you are then at the mercy of these products.

Libraries

Pretty much every major programming language has some sort of PDF library we can utilize. The advantage of using a library - as opposed to an existing tool - is flexibility. The disadvantage is time, effort and maintenance.

One example of a useful Python PDF library is ReportLab. The advantage of abstracting away the PDF specification becomes evident fairly quickly; you gain not only speed, but many options, without actually needing to know low-level details. A similar Ruby library is Prawn, and Java has Apache PDFBox.

All of these libraries have a broadly similar set of features. You can add, edit and remove raster images, vector objects, text (whose glyphs are usually drawn from vector font outlines), forms, and pretty much anything else you might associate with PDFs.

What should we do?

The advantage of using a pre-built application is time, but you lose a significant amount of flexibility in exchange. We need flexibility for our internally designed products, so it makes sense to build our own solution. How do we go about this? There are a few sensible options that come to mind, but here is one strong contender:

Client's data source -> Server -> Client


This method would consist of a server taking data from some source, and returning a document. In this context, one or more users would provide either field data or a file in some text-based format; these fields would correspond to sections of the resulting PDF, which are then laid out by whatever library we've chosen.

Take, for example, the Python library ReportLab and its layout package, Platypus ("Page Layout and Typography Using Scripts"). Platypus allows a developer to define "flowing" sections in a document, similar to the way elements flow in an HTML page; each item has its own coordinate system inside its boundaries, and is also positioned relative to the document itself.

Knowing this, a good way of keeping the project tidy would be to find and isolate all of the small components that make up the document, and create a module for each of them. The modules would be rendered by Platypus, given the input data, which presents both a challenge and flexibility - we must take care of a significant fraction of the math regarding distances between elements inside these modules, and so on. However, Platypus takes care of concerns such as the distance between one module and another, or determining when to break a page if an element is too large, among other things.

Assuming we've authored our modules and are happy with them, we'd then set up a server to:

  • Receive data submitted from some source

  • Make sure the data is safe to manipulate

  • Analyze the form data

  • Attempt to render PDF modules, passing data to them based on our analysis of a client's input

  • Flow the modules together into a final document

  • Send the result back to the client interface
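The steps above can be sketched as a small pipeline. This is only an outline under stated assumptions: the field names, the module names and the fake_render stand-in are all hypothetical, and a real renderer would flow Platypus modules into a PDF instead of returning a string.

```python
import html

REQUIRED_FIELDS = ("project_id", "title")  # hypothetical field names


def sanitize(raw):
    """Keep only string values, trimmed, with <, > and & escaped."""
    return {
        key: html.escape(value.strip(), quote=False)
        for key, value in raw.items()
        if isinstance(value, str)
    }


def analyze(fields):
    """Decide which modules to render; fail early on missing data."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    modules = ["header"]
    if fields.get("notes"):
        modules.append("notes")
    return modules


def handle_request(raw, render):
    """Receive, sanitize, analyze, render; the caller sends the result back."""
    fields = sanitize(raw)
    return render(analyze(fields), fields)


def fake_render(modules, fields):
    """Stand-in renderer so the sketch runs without a PDF library."""
    return "|".join(modules) + ":" + fields["title"]


result = handle_request(
    {"project_id": "38000P153163", "title": " Q1 <draft> ", "notes": "ok"},
    fake_render,
)
print(result)
```

Passing the renderer in as an argument keeps the validation and analysis steps testable without touching any PDF code.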

There are a few additional areas we should examine, and make decisions about, before we proceed.

Dynamic Graphics

As it stands, our conceptual redesign of the reports received contains a number of "dynamic graphics" - that is, graphical representations of data, such as bar graphs, histograms, and some objects that don't fit into a specific category.

Many PDF programming libraries, including ReportLab, contain components which will help us build these out. However, it should be assumed that some manual development of graphical modules will be necessary, especially regarding "unique" views.

We should strive to make sure these types of components are truly modular. In particular, it is probably important to:

  • Define graphic boundaries

  • Specify graphic scales (such as the scale of a simple x-y graph)

  • Make graphics vector based, and not raster image renders. This is important for scaling.

Variable size fields

When working with data whose fields are of an indeterminate length, we need to make sure none of the data is truncated when it is rendered in the final result. To accomplish this, we must calculate a few things:

  • What is the maximum possible size of the component, given that it has the maximum possible size of input data?

  • What is the minimum possible size of the component, given that it has the minimum possible size of input data?

  • In what ways must we account for these minimum and maximum sizes, considering the flow of other components?

Mostly, we're concerned with the first metric - we don't want the component to go past a certain point in the document. If it exceeds that boundary, we should move it into a new page as the first component of that page. Then, we must also make sure that other components do not "break" as a result of our modification to the prior page's flow.

If our design includes elements that conditionally split depending on their distance from the end of the document, the remaining available space, and the value of their contents, the Flowables (in the case of ReportLab / Platypus) must be coded in a way that accounts for these behaviors.

Observations

In our experimentation, we've found that using libraries as a foundation for PDF creation lets you ignore the tricky parts of the specification, but you still need to be pretty careful about what you're doing. Even though your hand is held to some degree, there are many pitfalls to watch out for - here is the provided "Hello World" equivalent layout in Platypus:

from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.rl_config import defaultPageSize
from reportlab.lib.units import inch

PAGE_HEIGHT = defaultPageSize[1]
PAGE_WIDTH = defaultPageSize[0]
styles = getSampleStyleSheet()
Title = "Hello world"
pageinfo = "platypus example"

# Fixed decorations for the first page: centred title and a footer.
def myFirstPage(canvas, doc):
  canvas.saveState()
  canvas.setFont('Times-Bold', 16)
  canvas.drawCentredString(PAGE_WIDTH/2.0, PAGE_HEIGHT-108, Title)
  canvas.setFont('Times-Roman', 9)
  canvas.drawString(inch, 0.75 * inch, "First Page / %s" % pageinfo)
  canvas.restoreState()

# Fixed decorations for every later page: a footer with the page number.
def myLaterPages(canvas, doc):
  canvas.saveState()
  canvas.setFont('Times-Roman', 9)
  canvas.drawString(inch, 0.75 * inch, "Page %d %s" % (doc.page, pageinfo))
  canvas.restoreState()

def go():
  doc = SimpleDocTemplate("phello.pdf")
  # Leave room at the top of the first page for the title.
  Story = [Spacer(1, 2*inch)]
  style = styles["Normal"]
  for i in range(100):
    bogustext = ("This is Paragraph number %s. " % i) * 20
    p = Paragraph(bogustext, style)
    Story.append(p)
    Story.append(Spacer(1, 0.2*inch))
  # Build once, after the whole story has been assembled.
  doc.build(Story, onFirstPage=myFirstPage, onLaterPages=myLaterPages)

if __name__ == "__main__":
  go()

This example comes from ReportLab. It's not too bad yet; you start to run into larger program design decisions very quickly, though. Depending on your background experience with data structures and the like, you will flesh it out differently.

Immediately, assuming your end-goal is not to print one line of text in a specific section of a document, you have to figure out how to:

  • Keep everything orthogonal

  • Modularize common components

  • Prevent overlapping elements

  • Define "safe" page boundaries

  • Calculate module distances

... et cetera.

You might need to answer these on a project-to-project basis, but I think we can come up with some generic rules. In our case, since we are experimenting with Platypus, we want to keep each module in its own Flowable.

A flowable is a kind of abstraction that you can use to define modules with certain rules - the manual states,

"Flowables are things which can be drawn and which have wrap, draw and perhaps split methods." They sit inside of Frames, which sit inside of the PageTemplate.

A Small Experiment

Let's try creating a simple module, and see if we can learn anything on the way.

This is a hypothetical section from one of our potential reports. I'd consider this a Frame that has six Flowables inside of it, as it is all based on the same informational topic, and also is aesthetically consistent.

You might want to think of these module build-outs as individual PDFs - in this library, they have their own coordinate systems, as mentioned earlier.

The Frame

Frames are defined as such:

f = Frame(x1,  
   y1,
   width,
   height,
   leftPadding=6,
   bottomPadding=6,
   rightPadding=6,
   topPadding=6,
   id=None,
   showBoundary=0)

In this case, x1 and y1 refer to the offset of the frame relative to its canvas, in a normal coordinate system; width and height are fairly self-explanatory, and padding parameters act as you might expect in, say, CSS. You can pass units like inch to them if you've imported the definition:

from reportlab.lib.units import inch  
f = Frame(inch, inch, 6*inch, 9*inch ... )

If we assume our section is 3 inches wide and 8 inches tall, we might define
 f = Frame(inch, inch, 3*inch, 8*inch [...]).

We will then need to add Flowables to the Frame. This is accomplished with a simple method - f.addFromList(list_of_flowables, canvas). Before calling it, we need a list of Flowables for it to consume, and a Canvas instance for it to draw on.

The Flowables

For historical reasons, a list of Flowables is usually defined as a variable called story (story = [] - literally, just a Python list object).

There are a couple of ways to populate our list of Flowable objects. The first is to directly define them as we add them, like this:

story.append(Paragraph("This is a Heading",styleH))

styleH, in this case, is taken from an example stylesheet included with the default distribution of ReportLab's open-source version:

from reportlab.lib.styles import getSampleStyleSheet  
styles = getSampleStyleSheet()  
styleH = styles['Heading1']  

The other way is to define the Flowable beforehand, which is probably preferable in our case - this is what that looks like in the official manual document:

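# Imports and the style lookup are added here so the snippet runs on its own;
# the raise statement is updated to Python 3 syntax.
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.pdfgen.canvas import Canvas
from reportlab.platypus import Paragraph

style = getSampleStyleSheet()['Normal']
P = Paragraph('This is a very silly example', style)
canv = Canvas('doc.pdf')
aW = 460 # available width and height
aH = 800
w, h = P.wrap(aW, aH) # find required space
if w <= aW and h <= aH:
  P.drawOn(canv, 0, aH)
  aH = aH - h # reduce the available height
  canv.save()
else:
  raise ValueError("Not enough room")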

This is where we'd work on making sure the Flowables take up an appropriate amount of space in the Frame. There are a number of methods available to us that are utilized to help Platypus figure this out, but two are most commonly needed:

Flowable.wrap(availWidth, availHeight)

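Returns the space that a Flowable object actually needs, as a (width, height) pair. The arguments are the width and height available to it, which you'll want to derive from the enclosing Frame or, when drawing by hand, from the canvas itself.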

Flowable.drawOn(canvas, x, y)

Invokes the rendering action for this Flowable, positioned absolutely in the specified Canvas object. To make another comparison to web technologies, this is sort of like absolutely positioning a child of a relatively positioned HTML element.

We only need to concern ourselves with these methods when creating a new type of Flowable, beyond the ones that already exist. In that case, we are essentially defining our own generic object that we can reuse in a portable manner.

Say we've created some Flowable called StatisticBox, which would basically be a box containing one header and one value - we'd need six to fill our Frame from before. If we do things correctly, it shouldn't look more complicated than this:

story.append(StatisticBox('PROJECT ID', '38000P153163'))

Our code should, ideally, take care of situations involving too little space, a need for page-breaking, splitting behaviors, and other contingencies.

Feeding the data

Besides implementation details for our layout, we also need to design a way for our application to accept some form of data. Assuming we have some standardized format, such as a .csv, some XML markup, or anything similar, what we need to do is break that data into components recognizable at the other end.

This is something that is primarily on the developer to implement; however, it is a common problem that has been solved many times over several decades. You might need to slightly modify the input data, so as to identify one kind of data with one frame, for instance.

The basic process is to identify what will be used to specify frames, the Flowables inside them, where the frames are positioned, whether to wrap or break them to the following page, what data belongs to what object, and so on.

Since this is largely contextual, I hesitate to provide any solid "recommendations" - however, we do have one suggestion. Look into tokenization to extract useful bits of information out of your data. This is basically the practice of creating a text parser, of sorts.
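For a format like .csv, Python's standard csv module does the tokenizing for us. The sketch below groups rows by the frame they belong to; the column names and sample rows are invented for illustration, and a real input format would dictate its own.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input: each row names the frame it belongs to.
RAW = """frame,header,value
overview,PROJECT ID,38000P153163
overview,STATUS,Active
budget,TOTAL,"1,250,000"
"""


def group_by_frame(text):
    """Parse CSV text and group (header, value) pairs under their frame name."""
    frames = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        frames[row["frame"]].append((row["header"].strip(), row["value"].strip()))
    return dict(frames)


frames = group_by_frame(RAW)
for name, pairs in frames.items():
    print(name, pairs)
```

Each (header, value) pair then maps naturally onto one Flowable, and each group onto one Frame.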

Tips on Working Practices

Knowing what we do now, we should just adhere to a few principles going forward:

Be Pragmatic

We should evaluate our choice of tool and/or library at each juncture. Let's not use one for no particular reason; justify each choice with objective data. Use what makes our lives easier, as many before us have struggled with PDFs.

Consult design

It is much easier to talk to the designer(s) about segments that might be unreasonably difficult to develop than to fight the layout alone. After you've come to a conclusion, modularize each part of the design into objects that are easy to tack together.

Be aware of, but not obsessed with, the standard

Adobe's PDF Specification document is very thorough. It is, after all, a specification. When dealing with an issue that seems strange, it can be useful to come back to this document in order to figure out what's gone wrong.

Stand on the shoulders of giants

We don't have to come up with everything from scratch. We are not the first ones to build something for PDF documents, and we won't be the last. Viewing the source code of successful projects, if they exist, can be very useful.


Written by
Cody Welsh 19 Apr 2017

Musician; software engineer; science doer.
