Testing PDF Files With TestComplete

Files

Author: SmartBear Software
Applied to: TestComplete 10

PDF is a widely used, platform-independent document format. Many applications use this format to export or import data. If your application outputs data in the PDF format, you may need to check whether the exported data matches some baseline values. This article explains how you can do this with TestComplete.

Overview

To work with PDF documents, we will use the Apache PDFBox library. This is a Java library that provides objects, methods and properties for retrieving and changing PDF file data. You can find more information about the library on its official web site and in the library’s documentation.

In this article, we will demonstrate how you can use the library’s objects, methods and properties to extract text and images from PDF documents and compare or verify them in your tests.

To call the library’s methods and properties, we will use the Java Bridge – TestComplete’s subsystem that lets you call arbitrary methods of .jar modules directly from test scripts. You can find more on calling methods of Java classes from test scripts in TestComplete documentation.

All sample code that you find in this article is also available in a sample TestComplete project. You can find a link to it at the top of the page.

About the PDF File Structure

To understand the functionality of the PDFBox objects faster, it is useful to know the document structure. Logically, a PDF file consists of several so-called layers: the Acrobat Support layer for core functionality, the PDSEdit layer for storing structure information, and others.

In this article, we will be interested in the PD and COS layers. Objects of the PD layer provide access to various parts of the document (pages, images, the document itself). The contents of the PD objects (text of a page, image data and so on) are stored in the corresponding objects of the COS layer.

Below is a rough structure of a simple PDF file:

PD layer                    COS layer
PDDocument                  COSDocument
• PDDocumentCatalog
      · Metadata            COSStream
      · Pages               COSArray
            - PDPage        COSDictionary

For the complete description of PDF internals, see the Acrobat PDF Library API Reference.

Preparation Steps

To use the PDFBox library with TestComplete, you need to prepare your test computer and configure TestComplete settings.

  1. Download the PDFBox Library

    Download the PDFBox library from the following web site to some folder on your computer:

    You will need to specify the path to the library in TestComplete settings (see below).

  2. Install Java

    PDFBox is a Java library. To be able to call its routines, TestComplete uses the Java virtual machine that is part of Java Runtime Environment. If you do not have Java yet, download it from this web site and install it on your computer:

  3. Configure TestComplete

    After you have installed Java Runtime Environment, you need to adjust TestComplete settings:

    • From TestComplete’s main menu, select Tools | Options, and in the subsequent Options dialog, choose the Engines | Java Bridge option group.
    • On the right of the dialog, specify the name of the Java virtual machine module that will be used to host the library classes, for instance,
          C:\Program Files (x86)\Java\jre7\bin\client\jvm.dll
    • Click OK to save the changes.
  4. Configure Test Project

    Make the classes of the PDFBox library available for your test project:

    • From TestComplete’s main menu, select Tools | Current Project Properties to open properties of your project.
    • On the Properties page of the project editor, select Java Bridge.
    • Click Add JAR Files and specify the path to the PDFBox library.

    • Add the following classes to the Java Classes list:

      Class Used for
      org.apache.pdfbox.pdmodel.PDDocument Working with PDF documents.
      org.apache.pdfbox.util.PDFTextStripper Extracting text from PDF documents.
      javax.imageio.ImageIO Extracting image data and saving images.
      java.io.File Saving image data.

      To add a class, click Add, type the full name of the class and press Enter to confirm the input.

      Specifics of Calling Java Methods From TestComplete

      If you have experience with using Java libraries with TestComplete, you can skip this part and proceed to the next section. Otherwise, you should be aware of the following specifics of working with Java libraries in TestComplete:

      • In Java, methods can be overloaded. A method is overloaded if it has the same name as another method of the class but a different parameter list. For instance, the load method below is overloaded:

        load(URL url, Boolean force)
        load(String filename)
        load(InputStream input)

        When running Java code, the Java run time engine analyzes the parameter types and automatically calls the appropriate method. However, in TestComplete, this is not possible as all script variables have the Variant type. To distinguish different implementations of a method, TestComplete appends a postfix to the method name: load, load_2, load_3 and so on.

        To get information about method names and parameters, examine the methods in the Code Completion window at design time. Alternatively, at run time, you can pause the script on a breakpoint and explore objects’ methods and properties with TestComplete’s Evaluate dialog.
      • In Java, the constructor of a class has the name of this class. TestComplete changes the constructor names to newInstance(). If a class has overloaded constructors, TestComplete names them newInstance, newInstance_2, newInstance_3 and so on. Explore the constructor parameters in the Code Completion window to choose the constructor that you need.
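The naming pattern described above can be pictured with a small sketch. The overloadNames function is a hypothetical helper and is not part of TestComplete; it only reproduces the postfix pattern mentioned in this section. Which overload receives which number is decided by the Java Bridge, so the Code Completion window remains the place to check the actual mapping.

```javascript
// Hypothetical helper (not a TestComplete API): lists the names that follow
// the "name, name_2, name_3, ..." pattern for a given number of overloads.
function overloadNames(baseName, overloadCount)
{
  var names = [];
  for (var i = 1; i <= overloadCount; i++)
  {
    // The first variant keeps the plain name; the others get a numeric postfix
    names.push(i == 1 ? baseName : baseName + "_" + i);
  }
  return names;
}

// Three variants of the load method and two constructors
var loadNames = overloadNames("load", 3);
var ctorNames = overloadNames("newInstance", 2);
```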

      Loading PDF Documents

      To work with a PDF file, you need to load this file in memory. To do this, you use the load() method of the PDDocument object. This is a static method. You can call it without creating the PDDocument object first. Here is an example --

      function loadDocument(fileName)
      {
        var docObj;
      
        // Load the PDF file to the PDDocument object
        docObj = JavaClasses.org_apache_pdfbox_pdmodel.PDDocument.load_3(fileName);
      
        // Return the resulting PDDocument object
        return docObj;
      }
      

      The method parameter specifies the PDF file to be opened.

      The method returns a PDDocument object that corresponds to the specified PDF file.
      Note that the load method is overloaded. That is, it can use different sets of parameters and these parameters can be of different data types. load_3 is TestComplete’s name for the method variant that uses one parameter of the string type that specifies the path to the desired PDF file.

      You can call the above-mentioned function by using the following code snippet:

      // Get the PDF document
      docObj = loadDocument("C:\\Work\\Document.pdf");
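The samples in this article do not release the loaded document. PDFBox's PDDocument class has a close() method for this purpose. Below is a minimal sketch of a wrapper that always closes the document after use. The withDocument name and its loadFn and action parameters are assumptions made for illustration; loadFn can be the loadDocument routine shown above.

```javascript
// Hypothetical helper: loads a document, runs an action on it and
// closes the document afterwards, even if the action throws an exception.
//   loadFn   - a function that returns a PDDocument object by a file name
//   fileName - the path to the PDF file
//   action   - a function that receives the loaded document object
function withDocument(loadFn, fileName, action)
{
  var docObj = loadFn(fileName);
  try
  {
    // Return whatever the action returns
    return action(docObj);
  }
  finally
  {
    // Release the resources held by the document
    docObj.close();
  }
}
```

In a test, a call could look like withDocument(loadDocument, "C:\\Work\\Document.pdf", function (docObj) { return docObj.getNumberOfPages(); }).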
      

      Getting Document Pages

      After you have loaded a PDF document in memory, you can work with its data.
      To get access to the document’s pages, you should first obtain the catalog object that provides a scripting interface to the document’s data. Then, you can use the catalog’s getAllPages() method to obtain a collection of pages:

      // Get the catalog object
      catalog = docObj.getDocumentCatalog();
      
      // Get the page collection
      pageArray = catalog.getAllPages();
      

      The getAllPages method returns a collection (a Java List) of the PDPage objects that correspond to the document’s pages.
      To obtain an individual page, call the get method of the collection object and specify the index of the desired page as a parameter (the indexes are zero-based):

      pageArray = docObj.getDocumentCatalog().getAllPages(); // Get the collection of pages
      pageObj = pageArray.get(0); // Obtain the first page of the document
      pageObj = pageArray.get(2); // Obtain the third page of the document
      

      For instance, the following script routine returns the desired page by its zero-based index (number):

      function getPage(docObj, pageIndex)
      {
        var pageArray, pageObj;
      
        // Obtain a collection of the pages 
        pageArray = docObj.getDocumentCatalog().getAllPages();
      
        // Obtain the specified page
        pageObj =  pageArray.get(pageIndex);
      
        // Return the result
        return pageObj;
      } 
      

      You can loop through the pages by using the following code:

      // Obtain the iterator of the page collection
      pageIterator = pageArray.iterator();
      
      while (pageIterator.hasNext()) // Check if the iterator is at the end of the collection
      {
        pageObj = pageIterator.next(); // Obtain the next page
      
        // Do some action
        // ...
      }
      

      Extracting Data From PDF Files

      Extracting Text

      To retrieve text of a PDF file, you use the PDFTextStripper object. You can create it in the following way:

      textStripperObj = JavaClasses.org_apache_pdfbox_util.PDFTextStripper.newInstance();
      

      To obtain text from your document, use the getText_2 method of the PDFTextStripper object. This method takes the document object as a parameter and returns the extracted text.

      // Obtain text from the entire document
      text = textStripperObj.getText_2(docObj);
      

      To extract the text of specific pages, you need to set the page range. You do this by calling the setStartPage and setEndPage methods. Note that these methods take one-based page numbers (the first page has number 1), unlike the zero-based indexes of the page collection. For instance, the following code extracts text from the third, fourth and fifth pages:

      // Set the start page (note that the page index is not zero-based)
      textStripperObj.setStartPage(3);
      
      // Set the end page
      textStripperObj.setEndPage(5);
      
      // Get the text of the specified pages
      text = textStripperObj.getText_2(docObj);
      

      To retrieve text on one page only, set the same index for the start and end page:

      // Set the page index as a start page
      // Note that the page index here is not zero-based
      textStripperObj.setStartPage(3);
      
      // Set the same page index as an end page
      textStripperObj.setEndPage(3);
      
      // Get the text of the page
      text = textStripperObj.getText_2(docObj);
      

      Note: The getText_2 method returns a string that includes all the text of the specified pages, including footnotes, image captions and so on.
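Because the page collection uses zero-based indexes and the text stripper uses one-based page numbers, it is easy to be off by one page. The following sketch wraps the conversion in one place. The setPageRange function is a hypothetical helper; it only calls the setStartPage and setEndPage methods described above.

```javascript
// Hypothetical helper: accepts zero-based page indexes and passes
// one-based page numbers to the text stripper object.
function setPageRange(textStripperObj, firstIndex, lastIndex)
{
  if (firstIndex < 0 || lastIndex < firstIndex)
  {
    throw new Error("Invalid page range: " + firstIndex + ".." + lastIndex);
  }

  // Zero-based index N corresponds to page number N + 1
  textStripperObj.setStartPage(firstIndex + 1);
  textStripperObj.setEndPage(lastIndex + 1);
}
```

For example, setPageRange(textStripperObj, 2, 4) selects the third, fourth and fifth pages, and setPageRange(textStripperObj, 0, 0) selects the first page only.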

      Extracting Paragraphs

      The PDFTextStripper object does not have a method or property that would return a paragraph’s text. To get the text of a paragraph, you can command the PDFTextStripper object to put a certain string at the end of each paragraph. By using this string as a separator, you can then divide the text into paragraphs and get the desired paragraph. The sample function demonstrates how you can do this. It uses the following parameters:

      • docObj - The PDDocument that represents the desired document.
      • pageIndex - A zero-based index of the desired page.
      • paraIndex - A zero-based index of the paragraph from the page.
      The function returns the text of the specified paragraph on the specified page:

      function getParagraphText(docObj, pageIndex, paraIndex)
      {
        var textStripperObj, constMarker, pageText, paragraphText;
      
        // Create an instance of the PDFTextStripper object
        textStripperObj = JavaClasses.org_apache_pdfbox_util.PDFTextStripper.newInstance();
      
        // A string to be used as a marker of the paragraph end
        constMarker = "SOME_UNIQUE_STRING_FOR_PARAGRAPH_END";
      
        // Set the page range to retrieve text of one page only.
        // Note that pageIndex is zero-based, and the methods below
        // use non-zero-based indexes. So, we increase pageIndex
        // when passing it to the methods.
      
        textStripperObj.setStartPage(pageIndex + 1);
        textStripperObj.setEndPage(pageIndex + 1);
      
        // Specify the paragraph marker
        textStripperObj.setParagraphEnd(constMarker);
      
        // Specify the list separator
        aqString.ListSeparator = constMarker;
      
        // Obtain the text of the page
        pageText = textStripperObj.getText_2(docObj);
      
        // Check that the paragraph index is not negative
        if (paraIndex < 0)
        {
          Log.Error("The paragraph index is negative.");

          // Return an empty string
          return "";
        }

        // Check that paraIndex does not exceed
        // the actual number of paragraphs on the page
        // (valid indexes are 0 ... GetListLength - 1)
        if (paraIndex >= aqString.GetListLength(pageText))
        {
          // Post an error to the test log
          Log.Error("The paragraph index exceeds the number of paragraphs on the page.");

          // Return an empty string
          return "";
        }

        // Get the text of the specified paragraph
        // (the list separator was set to the marker above)
        paragraphText = aqString.GetListItem(pageText, paraIndex);

        // Return the paragraph's text
        return paragraphText;
      }
      

      To call this function in your test, you can use the following code:

      function Test()
      {
        // Load the desired PDF file
        docObj = loadDocument("C:\\Work\\Document.pdf");
      
        // Get the text of the fifth paragraph on the first page
        // and post it to the log
        text = getParagraphText(docObj, 0, 4);
        Log.Message(text);
      }
      

      Important: Image captions, header lines and footnote lines are treated as paragraphs. You should take them into account when specifying the paragraph index. Alternatively, you can exclude header and footnote lines from extraction. The sample project that is attached to this article contains examples that demonstrate how to do that.
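The marker-based splitting can also be done with plain script code, which avoids changing the aqString.ListSeparator setting that other routines in a project may rely on. The sketch below is one possible approach, not the approach of the sample project. The splitParagraphs function and the "#END#" marker are hypothetical; in a real test, the marker would be the string passed to setParagraphEnd.

```javascript
// Hypothetical helper: splits the page text by the paragraph marker and
// drops the pieces that contain only whitespace.
function splitParagraphs(pageText, marker)
{
  // String() converts the value to a script string before splitting
  var parts = String(pageText).split(marker);
  var paragraphs = [];
  for (var i = 0; i < parts.length; i++)
  {
    // Keep only the pieces that have at least one non-whitespace character
    if (/\S/.test(parts[i]))
    {
      paragraphs.push(parts[i]);
    }
  }
  return paragraphs;
}

// A made-up page text with two paragraphs and a trailing line break
var sample = "Title" + "#END#" + "First paragraph." + "#END#" + "\r\n";
var paragraphs = splitParagraphs(sample, "#END#");
// paragraphs.length is 2; paragraphs[1] is "First paragraph."
```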

      Extracting Images

      To extract an image from a PDF file, you need to obtain the image data and then save it to some file on your disk.

      1. Images are resources that are stored within the page where they are used. To get the collection of image data, use the following code:

        // Get the image data collection
        imgMap = page.getResources().getXObjects();
        

        The code above returns a HashMap object that consists of key-value entries. Each key is an internal name of the image (for instance, Im0 or Image3), and each value is a descendant object of the PDXObjectImage class (PDJpeg or PDPixelMap).
        page is a PDPage object that corresponds to the desired page. You can obtain it, for example, by using the getPage function described above.
      2. The next step is to convert the HashMap values to an array:

        imgArray = imgMap.values().toArray();
      3. To get an individual image, specify the image index in the items property of the array:

        imageObj = imgArray.items(2); // Returns the third image in the array

        Note that the image index may differ from the actual ordinal number of the image on the page.

      Now, we can write a script routine that gets a page object by using the getPage function and returns an image object by the image’s index on a page:

      function getImage(docObj, pageIndex, imgIndex)
      {
        var pageObj, imgMap, imgArray, imageObj;
      
        // Get the desired page
        pageObj = getPage(docObj, pageIndex);
      
        // Obtain HashMap of the images from the specified page
        imgMap = pageObj.getResources().getXObjects();
      
        // Get an array of the images
        imgArray = imgMap.values().toArray();
      
        // Get an individual image by its index
        imageObj = imgArray.items(imgIndex);
      
        // Return the image object
        return imageObj;
      }
      

      The resulting array can contain both PDJpeg and PDPixelMap objects that correspond to JPEG and PNG images, respectively.

      To get the total number of images on a page, use the length property of the image array. You can use this property to iterate through all the images on a page:

      // Loop through the images on a page
      for (var i = 0; i < imgArray.length; i++)
      {
        // Get an image
        imageObj = imgArray.items(i);
      
        // Do some action
      }
      

      To save the image, use the write2file_2 method of the image object. For example, the command --

      imageObj.write2file_2("C:\\Work\\test_image");
      

      -- saves the image to a file in the C:\Work directory. This method automatically adds the extension to the file name depending on the object type (.jpg for PDJpeg and .png for PDPixelMap objects).
      To save an image in another format, use the getRGBImage method of the image object. It returns a BufferedImage object that you can save to a file in any format that ImageIO supports. For instance, the following code example saves an image to a .png file (regardless of the initial format of the image: JPEG or PNG):

      // Obtain the binary data of the image
      imgBuffer = imageObj.getRGBImage();
      
      // Create a new file to save
      imgFile = JavaClasses.java_io.File.newInstance("C:\\Temp\\image.png");
      
      // Save the image to the created file
      JavaClasses.javax_imageio.ImageIO.write(imgBuffer, "png", imgFile);
      

      Converting Document Pages to Images

      Sometimes (especially when you compare PDF files), you may want not only to get the document text, but also to check if it is positioned properly. To perform this task, you can convert a page to an image and then use this image in your comparison tasks.

      To convert a page to an image, use the convertToImage method of the PDPage object. It returns a BufferedImage object that contains the rendered page. You can then save it to a file. For instance, the following function gets the PDPage object by using the getPage function, saves the page as an image and returns the Picture object that corresponds to the image:

      function convertPageToPicture(docObj, pageIndex, fileName)
      {
        var pageObj, imgBuffer, imgFile, imgFormat, pictureObj;
      
        // Get the desired page
        pageObj = getPage(docObj, pageIndex);
      
        // Convert the page to image data
        imgBuffer = pageObj.convertToImage();
      
        // Create a new file to save
        imgFile = JavaClasses.java_io.File.newInstance(fileName);
      
        // Get the image format from the name
        imgFormat = aqString.SubString(fileName, aqString.GetLength(fileName)-3, 3);
      
        // Save the image to the created file
        JavaClasses.javax_imageio.ImageIO.write(imgBuffer, imgFormat, imgFile);
      
        // Create a Picture object
        pictureObj = Utils.Picture;
      
        // Load the image as a picture
        pictureObj.LoadFromFile(fileName);
      
        // Return the picture object
        return pictureObj; 
      }
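The function above takes the last three characters of the file name as the image format, which works for names like page.png but not for page.jpeg or for names without an extension. A slightly more defensive way to get the format is sketched below. The getImageFormat function is a hypothetical helper written in plain script code.

```javascript
// Hypothetical helper: returns the lower-case extension of a file name
// (without the dot), or an empty string if the name has no extension.
function getImageFormat(fileName)
{
  var name = String(fileName);
  var dotPos = name.lastIndexOf(".");
  var slashPos = Math.max(name.lastIndexOf("\\"), name.lastIndexOf("/"));

  // A dot that belongs to a folder name, or a trailing dot, is not an extension
  if (dotPos <= slashPos || dotPos == name.length - 1)
  {
    return "";
  }
  return name.substring(dotPos + 1).toLowerCase();
}

var format = getImageFormat("C:\\Temp\\page_doc_1.PNG"); // "png"
```

The returned value can be checked for an empty string before it is passed to ImageIO.write as the format name.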
      

      Extracting Metadata

      To get the document’s metadata (author name, subject, creation date and so on), use the getDocumentInformation method of the PDDocument object. This method returns the PDDocumentInformation object, whose methods and properties allow you to retrieve the metadata. For example, the following code example posts the title, author name, subject and other document information to the test log:

      // Get information about the document
      info = docObj.getDocumentInformation();
      
      // Log the total number of pages to the log
      Log.Message("Pages: " + docObj.getNumberOfPages());
      
      // Log the title of the document to the log
      Log.Message("Title: " + info.getTitle());
      
      // Log the author of the document to the log
      Log.Message("Author: " + info.getAuthor()); 
      
      // Log the subject of the document to the log
      Log.Message("Subject: " + info.getSubject());
      
      // Log the creator of the document to the log
      Log.Message("Creator: " + info.getCreator());
      
      // Log the date and time when the document was created in the local settings
      Log.Message("Creation Date: " + info.getCreationDate().getTime().toLocaleString());
      
      // Log the date and time when the document was last updated in the local settings
      Log.Message("Modification Date: " + info.getModificationDate().getTime().toLocaleString());
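To verify metadata rather than just log it, you can collect the values into a script object and compare them with baseline values. The sketch below shows one possible way to do this. The diffMetadata function, the field names and the sample values are all made up for illustration; in a test, the actual values would come from calls like info.getTitle() and info.getAuthor().

```javascript
// Hypothetical helper: compares two sets of metadata values field by field
// and returns the names of the fields whose values differ.
function diffMetadata(actual, baseline, fieldNames)
{
  var different = [];
  for (var i = 0; i < fieldNames.length; i++)
  {
    var name = fieldNames[i];
    // Compare the values as strings so that values of different
    // script types are treated alike
    if (String(actual[name]) != String(baseline[name]))
    {
      different.push(name);
    }
  }
  return different;
}

// Made-up sample values
var actualInfo = { Title: "Report", Author: "QA Team", Subject: "Sales" };
var baselineInfo = { Title: "Report", Author: "QA Team", Subject: "Orders" };
var changedFields = diffMetadata(actualInfo, baselineInfo, ["Title", "Author", "Subject"]);
// changedFields contains one item: "Subject"
```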
      

      Useful Test Cases

      Finding Text in a Document

      To find a string within a PDF file, extract the text from the file and then use the aqString methods to search for the desired string in the text. For instance, the following example searches for a string in a PDF file:

      function findText(docObj, string)
      {
        var textStripperObj, text;
      
        // Create the PDFTextStripper object
        textStripperObj = JavaClasses.org_apache_pdfbox_util.PDFTextStripper.newInstance();
      
        // Get the document text
        text = textStripperObj.getText_2(docObj);
      
        // Search for the specified string in the text
        if (aqString.Find(text, string)> -1)
        {
          Log.Message("The document contains the specified string.");
        } else
        {
          Log.Warning("The document does not contain the specified string.");
        }
      }
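The extracted text contains line breaks at the end of each text line, so a phrase that wraps to the next line in the document may not be found by a direct search. One possible workaround is to collapse whitespace in both strings before searching. The normalizeWhitespace and containsText functions below are hypothetical helpers written in plain script code.

```javascript
// Hypothetical helper: replaces every run of whitespace (spaces, tabs,
// line breaks) with a single space and removes leading and trailing spaces.
function normalizeWhitespace(text)
{
  return String(text).replace(/\s+/g, " ").replace(/^ | $/g, "");
}

// Hypothetical helper: checks whether the text contains the search string,
// ignoring differences in whitespace.
function containsText(text, searchString)
{
  return normalizeWhitespace(text).indexOf(normalizeWhitespace(searchString)) > -1;
}

// The phrase "amount due" is split by a line break in the extracted text
var extracted = "Total amount\r\ndue: 42";
var found = containsText(extracted, "amount due"); // true
```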
       

      Comparing Paragraphs

      To verify a paragraph’s content, you extract this paragraph’s text from the document by using the getParagraphText function that we created earlier, and then use the aqString.Compare method to compare it with some baseline text value. The following code snippet shows you how to do this:

      function compareParagraph(docObj_1, pageIndex, paraIndex, baselineValue)
      {
        var paragraph;
      
        // Obtain the desired paragraph text
        paragraph = getParagraphText(docObj_1, pageIndex, paraIndex);
      
        // Compare the paragraphs with the sample
        if (aqString.Compare(paragraph, baselineValue, true) == 0)
        {
          // Post the message that paragraphs are equal
          Log.Message("The paragraph text coincides with the baseline value.");
        } else
        {
          // Post the message that paragraphs are different
          Log.Message("The paragraph text differs from the baseline value.");
        }
      }
      

      Comparing Images

      When you need to compare an image from the PDF, retrieve it from the PDF file and compare it using the Regions.Compare method or another suitable method.

      For this purpose, save the desired picture to the temporary file and then compare the saved image with another one. For instance, the following function compares a picture from the PDF file with the sample. Note that in this example, we use the getPage and getImage functions that were written earlier:

      function compareImages(docObj, pageIndex, imgIndex, imgFile, imgSample)
      {
        var imageObj, imgBuffer, tempFile;
      
        // Load an image from the page
        imageObj = getImage(docObj, pageIndex, imgIndex);
      
        // Get binary data of the image
        imgBuffer = imageObj.getRGBImage();
      
        // Create a temp file
        tempFile = JavaClasses.java_io.File.newInstance(imgFile);
      
        // Save the image to the temp file
        JavaClasses.javax_imageio.ImageIO.write(imgBuffer, "png", tempFile);
      
        // Compare the image with the sample
        Regions.Compare(imgSample, imgFile);
      
        // Delete the temp image file
        aqFile.Delete(imgFile);
      }
      

      Comparing Whole Documents

      When you compare two documents, you can compare their text, images or metadata.

      For example, to check the text of the documents, you can obtain lists of paragraphs and compare them one by one. The following code snippet shows you how to do it. We do not use the getParagraphText function here on purpose, because it extracts and divides the page text at every call. The function below extracts the text only once for each page, and then iterates through the resulting list of paragraphs:

      function compareDocsAsText(pdfFile_1, pdfFile_2)
      {
        var markerConst, docObj_1, docObj_2, textStripperObj, text_1, text_2, par_1, par_2;
      
        // Specify the unique string as a custom marker
        markerConst = "SOME_UNIQUE_STRING_FOR_PARAGRAPH_END"
      
        // Load the PDDocument objects
        docObj_1 = JavaClasses.org_apache_pdfbox_pdmodel.PDDocument.load_3(pdfFile_1);
        docObj_2 = JavaClasses.org_apache_pdfbox_pdmodel.PDDocument.load_3(pdfFile_2);
      
        // Obtain the stripper object
        textStripperObj = JavaClasses.org_apache_pdfbox_util.PDFTextStripper.newInstance();
      
        // Specify the custom string as a marker of the paragraph's end
        textStripperObj.setParagraphEnd(markerConst);
      
        // Specify the custom string as a list separator
        aqString.ListSeparator = markerConst;
      
        // Check if the number of pages is different
        if (docObj_1.getNumberOfPages() != docObj_2.getNumberOfPages())
        {
          Log.Error("The number of pages is different.");
        } else
        {
          // Loop through the pages (page numbers are one-based here)
          for (var pageIndex = 1; pageIndex <= docObj_1.getNumberOfPages(); pageIndex++)
          {
            // Specify the page for extraction
            textStripperObj.setStartPage(pageIndex);
            textStripperObj.setEndPage(pageIndex);
      
            // Get the documents’ text
            text_1 = textStripperObj.getText_2(docObj_1);
            text_2 = textStripperObj.getText_2(docObj_2);
      
            // Check if the documents have the same number of paragraphs
            if (aqString.GetListLength(text_1) == aqString.GetListLength(text_2))
            {
              // Loop through the text paragraphs
              for (var i = 0; i < aqString.GetListLength(text_1); i++)
              {
                // Obtain the next paragraph
                par_1 = aqString.GetListItem(text_1, i);
                par_2 = aqString.GetListItem(text_2, i);
      
                // Compare paragraphs of two documents
                if (aqString.Compare(par_1, par_2, true) != 0)
                {
                  // Post a warning to the test log that the current paragraphs are different
                  Log.Warning("Paragraph " + aqConvert.IntToStr(i + 1) + " on page " + pageIndex + " differs between the documents.");
                } else
                {
                  // Post a message to the test log that the current paragraphs are equal
                  Log.Message("Paragraph " + aqConvert.IntToStr(i + 1) + " on page " + pageIndex + " is the same in both documents.");
                }
              }
            } else
            {
              // Post a warning to the test log that the number of paragraphs is different
              Log.Warning("The number of paragraphs on page " + pageIndex + " is different.");
            }
          }
        }
      }
      

      Comparing a document’s text, images or metadata has a limitation: it does not check the relative location of the elements on the page. To verify that documents are completely identical, you can convert the documents’ pages to images and compare them. Note that documents can differ even if they were generated by the same application, for example, because of different creation dates in the header. To exclude such regions from comparison, use an image mask. The following code snippet shows how you can compare two PDF files (in this example, we use the convertPageToPicture function that was written earlier):

      function compareDocsAsImg(pdfFile_1, pdfFile_2, maskImg)
      {
        var imgFile_1, imgFile_2, docObj_1, docObj_2, totalPages_1, totalPages_2;
      
        // Specify the fully-qualified name of the temporary image files
        imgFile_1 = "C:\\Temp\\page_doc_1.png";
        imgFile_2 = "C:\\Temp\\page_doc_2.png";
      
        // Load specified documents
        docObj_1 = JavaClasses.org_apache_pdfbox_pdmodel.PDDocument.load_3(pdfFile_1);
        docObj_2 = JavaClasses.org_apache_pdfbox_pdmodel.PDDocument.load_3(pdfFile_2);
      
        // Get the total number of pages in both documents
        totalPages_1 = docObj_1.getNumberOfPages();
        totalPages_2 = docObj_2.getNumberOfPages();
      
        // Check whether the documents contain the same number of the pages
        if (totalPages_1 != totalPages_2)
        {
          Log.Warning("The documents contain a different number of pages.");
        } else
        {
          for (var i = 0; i < totalPages_1; i++)
          {
            // Call a routine that converts the specified page to an image
            var pic_1 = convertPageToPicture(docObj_1, i, imgFile_1);
            var pic_2 = convertPageToPicture(docObj_2, i, imgFile_2);
      
            // Delete the temporary image files. The pictures have already
            // been loaded from them, and deleting the files here also
            // covers the case when the loop is interrupted below.
            aqFile.Delete(imgFile_1);
            aqFile.Delete(imgFile_2);
      
            // Compare two images
            if (!pic_1.Compare(pic_2, false, 5, false, 5, maskImg))
            {
              // If the images are different...
              // Post image differences to the log
              Log.Picture(pic_1.Difference(pic_2, false, 5, false, 5, maskImg));
      
              // Post a warning message
              Log.Warning("Page " + aqConvert.IntToStr(i+1) + " is different. The documents are different.");
      
              // Break the loop
              break;
            } else
            {
              // Post a message that the pages are equal
              Log.Message("Page " + aqConvert.IntToStr(i+1) + " is the same in both documents.");
            }
          }
        }
      }
      

      Specifics of Testing Encrypted PDF Files

      In this article, we assume that your PDF document is not encrypted. If your PDF file is encrypted, the test scripts shown above will not function until you decrypt the file. You can do this in Adobe Acrobat, or by using the decrypt method of the PDFBox library. In either case, you will have to specify the password that was used to encrypt the document.
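A minimal sketch of the second option is shown below. It assumes that the Java Bridge exposes the isEncrypted and decrypt methods of the PDDocument object under exactly these names; if decrypt appears with a postfix in your version, use the name shown in the Code Completion window. The ensureDecrypted function is a hypothetical helper, and decryption may also require additional libraries that PDFBox depends on.

```javascript
// Hypothetical helper: decrypts the loaded document if it is encrypted.
// Assumption: the document object provides the isEncrypted() and
// decrypt(password) methods under these names.
function ensureDecrypted(docObj, password)
{
  if (docObj.isEncrypted())
  {
    // Decrypt the document with the password that was used to encrypt it
    docObj.decrypt(password);
  }

  // Return the same document object, now ready for text and image extraction
  return docObj;
}
```

For example, docObj = ensureDecrypted(loadDocument("C:\\Work\\Document.pdf"), "password") could be used in place of a plain loadDocument call.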

      Conclusion

      This article describes basic approaches for retrieving and comparing data of PDF documents with TestComplete. We hope it helps you get started working with PDF documents in your tests. We suggest that you explore the PDFBox library documentation to find objects, methods and properties that will help you perform more sophisticated test operations.

      If you have not used TestComplete yet, download and try it for free.