PDF is a standardized format for document presentation that was developed by Adobe long ago - it is an acronym that stands for "Portable Document Format." It might be worth noting that the definition of "portable" in this context is supposed to mean "cross-platform"; PDFs can be very large in the literal storage sense.
PDF documents are composed of a hierarchy of standardized "objects," which you can read all about in the official reference specification. When developing something involving them, there is no need to build it from the ground up. Instead, we can consider a few different options:
Use a pre-existing tool
Build tools of our own, using existing PDF libraries
Some combination of the above
Let's say we're interested in building an application that accepts report data, and produces a nicely formatted document at the other end. We'd probably have a designer help us figure out what can be modularized, and give us the general look and feel to build towards.
There are a lot of PDF modification and creation tools that are available to us in the modern day. On Linux systems, a popular choice is pdftk. This particular instance is a terminal client that allows you to perform specific operations on documents - such as splicing, reordering, filling form fields out with information, rotating, and so on. Given a design, one might be able to stitch together modular pieces of a report to create the whole, but only if the pages themselves are interchangeable.
If we want to, we can also use a "Desktop Publishing Tool" that has some sort of scripting capability, and feed it information to create a similar report. One such tool is Scribus, but its paid equivalent is Adobe InDesign. If you ask me, most projects are far from needing such expensive software, though.
There are also purpose-designed applications for performing these actions - they're examples of exactly what we might want to build. Two examples are Pentaho and BIRT. These may be worthy of being explored, especially if we find an open-source application capable of saving us a lot of frustrations. However, note that you are then at the mercy of these products.
Pretty much every major programming language has some sort of PDF library we can utilize. The advantage of using a library - as opposed to an existing tool - is flexibility. The disadvantage is time, effort and maintenance.
One example of a useful Python PDF library is ReportLab. The advantage of abstracting away the Adobe PDF specifications becomes evident fairly quickly; you gain not only speed, but many options, without actually needing to know low-level details. A similar Ruby library is Prawn, and even Java has one called PDFBox.
All of these libraries have a similar set of features. You can edit, add and remove raster images, vector objects, text (which is also represented as a set of vector objects), forms, and pretty much anything else you might associate with PDFs.
The advantage of using a pre-built application is time, but you lose a significant amount of flexibility in exchange. We need flexibility for our internally designed products, so it makes sense to build our own solution. How do we go about this? There are a few sensible options that come to mind, but here is one strong contender:
This method would consist of a server taking data from some source, and returning a document. In this context, one or more users would provide either field data, or an arbitrary non-binary file format; these fields would correspond to sections of the resulting PDF, which are then organized according to whatever library we've chosen.
Take, for example, the python library
ReportLab, and one of its child libraries,
Platypus. These allow a developer to define "flowing" sections in a document, similar to the way that the DOM functions in HTML; each item has its own coordinate system inside of its boundaries, as well as being defined and positioned relative to the document itself.
Knowing this, a good way of keeping the project tidy would be to find and isolate all of the small components that make up the document, and creating a module for each of them. The modules would be rendered by
Platypus, given the input data, which presents both challenge and flexibility to us - we must take care of a significant fraction of the math regarding distances between elements inside these modules, and so on. However,
Platypus takes care of concepts such as distances between one module and another, or determining when to break a page if an element is too large, among other things.
Assuming we've authored our modules and are happy with them, we'd then set up a server to:
Receive data submitted from some source
Make sure the data is safe to manipulate
Analyze the form data
Attempt to render PDF modules, passing data to them based on our analysis of a client's input
Flow the modules together into a final document
Send the result back to the client interface
There are additional areas which should be questioned, and a decision made with regards to how we proceed.
As it stands, our conceptual redesign of the reports received contains a number of "dynamic graphics" - that is, graphical representations of data, such as bar graphs, histographs, and some objects that don't fit into a specific category.
Many PDF programming libraries, including
ReportLab, contain components which will help us build these out. However, it should be assumed that some manual development of graphical modules will be necessary, especially regarding "unique" views.
We should strive to make sure these types of components are truly modular. In particular, it is probably important to:
Define graphic boundaries
Specify graphic scales (such as the scale of a simple x-y graph)
Make graphics vector based, and not raster image renders. This is important for scaling.
When working with data whose fields are of an indeterminable length, we need to make sure none of the data is truncated when it is rendered in the final result.
To accomplish this, we must calculate a few things:
What is the maximum possible size of the component, given that it has the maximum possible size of input data?
What is the minimum possible size of the component, given that it has the minimum possible size of input data?
In what ways must we account for these minimum and maximum sizes, considering the flow of other components?
Mostly, we're concerned with the first metric - we don't want the component to go past a certain point in the document. If it exceeds that boundary, we should move it into a new page as the first component of that page. Then, we must also make sure that other components do not "break" as a result of our modification to the prior page's flow.
If our design includes elements that conditionally split depending on their distance from the end of the document, the remaining available space, and the value of their contents, the
Flowables (in the case of
Platypus) must be coded in a way that accounts for these behaviors.
In our experimentation, we've found that using libraries as a foundation for PDF creation lets you ignore the tricky parts of the specification, but you still need to be pretty careful about what you're doing. Even though your hand is held to some degree, there are many pitfalls to watch out for - here is the provided "Hello World" equivalent layout in
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer from reportlab.lib.styles import getSampleStyleSheet from reportlab.rl_config import defaultPageSize from reportlab.lib.units import inch PAGE_HEIGHT=defaultPageSize PAGE_WIDTH=defaultPageSize styles = getSampleStyleSheet() Title = "Hello world" pageinfo = "platypus example" def myFirstPage(canvas, doc): canvas.saveState() canvas.setFont('Times-Bold',16) canvas.drawCentredString(PAGE_WIDTH/2.0, PAGE_HEIGHT-108, Title) canvas.setFont('Times-Roman',9) canvas.drawString(inch, 0.75 * inch,"First Page / %s" % pageinfo) canvas.restoreState() def myLaterPages(canvas, doc): canvas.saveState() canvas.setFont('Times-Roman',9) canvas.drawString(inch, 0.75 * inch, "Page %d %s" % (doc.page, pageinfo)) canvas.restoreState() def go(): doc = SimpleDocTemplate("phello.pdf") Story = [Spacer(1,2*inch)] style = styles["Normal"] for i in range(100): bogustext = ("This is Paragraph number %s. " % i) *20 p = Paragraph(bogustext, style) Story.append(p) Story.append(Spacer(1,0.2*inch)) - doc.build(Story, onFirstPage=myFirstPage, onLaterPages=myLaterPages)
This example comes from
ReportLab. It's not too bad, yet; you start to run in to large program design decisions very quickly, though. Depending on your background experience with data structures and the like, you will flesh it out differently.
Immediately, assuming your end-goal is not to print one line of text in a specific section of a document, you have to figure out how to:
Keep everything orthogonal
Modularize common components
Prevent overlapping elements
Define "safe" page boundaries
Calculate module distances
.. et cetera.
You might need to answer these on a project-to-project basis, but I think we can come up with some generic rules. In my case, since we are experimenting with
Platypus in this context, we want to keep each module in its own
flowable is a kind of abstraction that you can use to define modules with certain rules - the manual states,
"Flowables are things which can be drawn and which have wrap, draw and perhaps split methods." They sit inside of Frames, which sit inside of the PageTemplate.
Let's try creating a simple module, and see if we can learn anything on the way.
This is a hypothetical section from one of our potential reports. I'd consider this a Frame that has six
Flowables inside of it, as it is all based on the same informational topic, and also is aesthetically consistent.
You might want to think of these module build-outs as individual PDFs - in this library, they have their own coordinate systems, as mentioned earlier.
Frames are defined as such:
f = Frame(x1, y1, width, height, leftPadding=6, bottomPadding=6, rightPadding=6, topPadding=6, id=None, showBoundary=0)
In this case,
y1 refer to the offset of the frame relative to its canvas, in a normal coordinate system; width and height are fairly self-explanatory, and padding parameters act as you might expect in, say,
CSS. You can pass units like inch to them if you've imported the definition:
from reportlab.lib.units import inch f = Frame(inch, inch, 6*inch, 9*inch ... ) If we assume we have a size of 3 inches wide, 8 inches vertically, we might define f = Frame(inch, inch, 3*inch, 8*inch [...]).
We then will need to append
Flowables to the Frame. This is accomplished with a simple method -
f.addFromList([list of flowables], [canvas]). Before calling the method, it needs to have a list of
Flowables to consume, and a defined
For historical reasons, a list of
Flowables is usually defined as a variable called story (
story =  - literally, just a Python list object).
There are a couple of ways to populate our list of
Flowable objects. The first is to directly define them as we add them, like this:
story.append(Paragraph("This is a Heading",styleH))
styleH, in this case, is taken from an example stylesheet included with the default distribution of
ReportLab's open-source version:
from reportlab.lib.styles import getSampleStyleSheet styles = getSampleStyleSheet() styleH = styles['Heading1']
The other way is to define the
Flowable beforehand, which is probably preferable in our case - this is what that looks like in the official manual document:
P=Paragraph('This is a very silly example',style) canv = Canvas('doc.pdf') aW = 460 # available width and height aH = 800 w,h = P.wrap(aW, aH) # find required space if w<=aW and h<=aH: P.drawOn(canv,0,aH) aH = aH - h # reduce the available height canv.save() else: raise ValueError, "Not enough room"
This is where we'd work on making sure the
Flowables take up an appropriate amount of space in the Frame. There are a number of methods available to us that are utilized to help
Platypus figure this out, but two are most commonly needed:
Returns the space that a Flowable object actually occupies. You'll want to define the available height and width of the canvas itself.
Flowable.drawOn(canvas, x, y)
Invokes the rendering action for this
Flowable, positioned absolutely in the specified
Canvas object. To make another comparison to web technologies, this is sort of like absolutely positioning a child of a relatively positioned
These kinds of methods are only of our concern if we are creating a new type of
Flowable other than the ones that exist already. We are essentially defining our own generic object that we can use in a portable manner.
Say we've created some
StatisticBox, which would basically be a box containing one header and one value - we'd need six to fill our
Frame from before. If we do things correctly, it shouldn't look more complicated than this:
story.append(StatisticBox('PROJECT ID', '38000P153163'))
Our code should, ideally, take care of situations involving too little space, a need for page-breaking, splitting behaviors, and other contingencies.
Besides implementation details for our layout, we also need to design a way for our application to accept some form of data. Assuming we have some standardized format, such as a
XML markup, or anything similar, what we need to do is break that data into components recognizable at the other end.
This is something that is primarily on the developer to implement; however, it is a common problem that has been solved many times over several decades. You might need to slightly modify the input data, so as to identify one kind of data with one frame, for instance.
The basic process is to identify what will be used to specify frames, the
Flowables inside them, where the frames are positioned, whether to wrap or break them to the proceeding page, what data belongs to what object, and so on.
Since this is largely contextual, I hesitate to provide any solid "recommendations" - however, we do have one suggestion. Look into tokenization to extract useful bits of information out of your data. This is basically the practice of creating a text parser, of sorts.
Knowing what we do now, we should just adhere to a few principles going forward:
We should evaluate our choice of tool and/or library at each junction. Let's not use one for no particular reason; justify them with objective data. Use what makes our lives easier, as many before us have struggled with PDFs.
It is much easier to talk to the designer(s) about segments that might be unreasonably difficult to develop. After you've come to a conclusion, modularize each part of the design into objects that are easy to tack together.
Adobe's PDF Specification document is very thorough. It is, after all, a specification. When dealing with an issue that seems strange, it can be useful to come back to this document in order to figure out what's gone wrong.
We don't have to come up with everything from scratch. We are not the first ones to build something for PDF documents, and we won't be the last. Viewing the source code of successful projects, if they exist, can be very useful.
Download our Incubator Resources
We’re known for sharing everything!
Save more time, get more done!
Innovate from the inside
YOU MIGHT ALSO LIKE