Working With PDF Files

Applies to TestComplete 15.47, last modified on January 20, 2023

With TestComplete, you can read data from PDF files and compare the files.

To extract text contents of PDF files, TestComplete uses optical character recognition (OCR).

Requirements

  • Your TestComplete version must be 14.20 or later.

  • Your computer must have access to the ocr.api.dev.smartbear.com web service.

    If you have firewalls or proxies running in your network, they should allow your computer to access the web service. This web service is used to recognize the text content of PDF files.

  • Your firewall must allow traffic through port 443.

  • You need an active license for the TestComplete Intelligent Quality add-on.

  • The Intelligent Quality add-on must be enabled in TestComplete.

    You can enable the add-on during TestComplete installation. If you did not, you can enable it at any time later: select File > Install Extensions from the TestComplete main menu and enable the Intelligent Quality > Intelligent Quality Core plugin in the resulting dialog.

  • PDF to Text support must be enabled in TestComplete.

    By default, it is enabled automatically if you enable the Intelligent Quality add-on during TestComplete installation.

    If you experience issues with PDF support in your tests, select File > Install Extensions from the TestComplete main menu and make sure the PDF to Text plugin is enabled (you can find it in the Intelligent Quality group). If the plugin is disabled, enable it. In the confirmation message, click the link to read a third-party license agreement. If you agree to the license terms, click Enable OCR.
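The network requirements above can be checked with a short script before you run any PDF tests. The following is a minimal sketch in plain Python (run outside TestComplete). The helper name can_reach is hypothetical, and the check only confirms that a direct TCP connection to the host and port can be opened. It does not verify proxy settings, licensing, or the web service itself.

```python
import socket

# Host and port taken from the requirements above.
OCR_HOST = "ocr.api.dev.smartbear.com"
OCR_PORT = 443


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling can_reach(OCR_HOST, OCR_PORT) on the test machine returns False if name resolution fails or the connection is blocked. If your network routes traffic through a proxy, a direct connection may fail even though TestComplete can reach the service, so treat a negative result as a hint rather than a diagnosis.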

How TestComplete recognizes PDF contents

TestComplete sends PDF files to the ocr.api.dev.smartbear.com web service by SmartBear. This web service forwards the files to Google Vision API and transfers the recognition results back to TestComplete.

In your test, you get the text of an entire PDF file as a string and work with it as your needs dictate. For example, you can compare it against some expected values. See below for more information.
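As an illustration of comparing the recognized string against expected values, here is a minimal sketch in plain Python. The helper names normalize and find_missing are hypothetical, and sample_text stands in for the string that PDF.ConvertToText would return. The sketch assumes that line breaks and letter case in the recognized text are not significant for the check, which may not hold for your documents.

```python
import re


def normalize(text):
    """Collapse runs of whitespace so line breaks do not affect matching."""
    return re.sub(r"\s+", " ", text).strip()


def find_missing(recognized_text, expected_values):
    """Return the expected values that do not occur in the recognized text."""
    haystack = normalize(recognized_text).lower()
    return [v for v in expected_values if normalize(v).lower() not in haystack]


# Stand-in for the string that PDF.ConvertToText(path) would return.
sample_text = "Invoice No. 1042\nBilled to:\nJohn   Smith\nTotal due: $310.00"

missing = find_missing(sample_text, ["Invoice No. 1042", "john smith", "Paid in full"])
print(missing)
```

Returning the list of missing values, rather than a single True or False, makes it easier to report which expectation failed.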

Extract data

In keyword tests

To get text contents of a PDF file, use the PDF to Text operation. Then, to get the recognized text, you can, for example, store it to a variable by using the Set Variable Value operation:

Getting text of a PDF file in a keyword test in TestComplete


To get a fragment of the recognized text:

In script tests

To get text contents of a PDF file, use the PDF.ConvertToText(…) method. It receives the PDF file name as a parameter, processes the file and returns a string that contains the recognized text:

JavaScript, JScript

var path = "C:\\work\\sample.pdf";
var contents = PDF.ConvertToText(path);

Python

path = "C:\\work\\sample.pdf"
contents = PDF.ConvertToText(path)

VBScript

path = "C:\work\sample.pdf"
contents = PDF.ConvertToText(path)

DelphiScript

var path, contents;

path := 'C:\work\sample.pdf';
contents := PDF.ConvertToText(path);

C++Script, C#Script

var path = "C:\\work\\sample.pdf";
var contents = PDF["ConvertToText"](path);

To get a fragment of the recognized text, use methods and properties of the aqString object: Contains(…), StrMatches(…), SubString(…) and others. With aqString.StrMatches(…), you can search and validate the string with regular expressions.
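The idea of locating a fragment by position can be sketched in plain Python as follows. This sketch uses Python's built-in string operations instead of aqString so that it runs outside TestComplete. The helper name value_after_label and the sample text are hypothetical, and the sketch assumes that the value you need sits on the same line as its label.

```python
def value_after_label(text, label):
    """Return the rest of the line that follows label, or None if label is absent."""
    start = text.find(label)
    if start == -1:
        return None
    # Move past the label, then cut at the end of the line (or of the text).
    start += len(label)
    end = text.find("\n", start)
    if end == -1:
        end = len(text)
    return text[start:end].strip()


# Stand-in for the string that PDF.ConvertToText(path) would return.
sample_text = "Order ID: A-7731\nCustomer: Jane Doe\nStatus: Shipped"
print(value_after_label(sample_text, "Customer:"))
```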

Examples

Extract a Value From a PDF File

The sample code below shows how to get all date values from a PDF file:

JavaScript, JScript

function GetDateValuesFromPDF()
{
  // Get the path to the tested PDF file
  var path = "C:\\work\\sample.pdf";
  if ((path != "") && (aqFile.Exists(path)) && (aqFileSystem.GetFileExtension(path) == "pdf"))
  {
    // Get the entire file contents
    var contents = PDF.ConvertToText(path);
    if (contents != "")
    {
      // This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      var regEx = /\d{1,2}\/\d{1,2}\/\d{2,4}/gim;

      // Post all the date values that match the specified pattern
      // to the test log
      var r = contents.match(regEx);
      if (r != null && r.length > 0)
      {
        for (var i = 0; i < r.length; i++ )
          Log.Message(r[i]);
      }
    }
  }

}

Python

import re

def GetDateValuesFromPDF():
  # Get the path to the tested PDF file
  path = "C:\\work\\sample.pdf"
  if ((path != "") and (aqFile.Exists(path)) and (aqFileSystem.GetFileExtension(path) == "pdf")):
    # Get the entire file contents
    contents = PDF.ConvertToText(path)
    if (contents != ""):
      # This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      regEx = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")
      # Post all date values that match the specified pattern
      # to the test log
      r = regEx.findall(contents)
      for value in r:
        Log.Message(value)

VBScript

Sub GetDateValuesFromPDF()
  ' Get the path to the tested PDF file
  path = "C:\work\sample.pdf"
  If path <> "" And aqFile.Exists(path) And aqFileSystem.GetFileExtension(path) = "pdf" Then
    ' Get the entire file contents
    contents = PDF.ConvertToText(path)
    If contents <> "" Then
      ' This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      Set regEx = New RegExp
      regEx.pattern = "\d{1,2}\/\d{1,2}\/\d{2,4}"
      regEx.Global = True

      ' Post all the date values that match the specified pattern
      ' to the test log
      Set r = regEx.Execute(contents)
      For Each m In r
        Log.Message(m.Value)
      Next

    End If
  End If

End Sub

DelphiScript

function GetDateValuesFromPDF();
var path, contents, regEx;
begin
  // Get the path to the tested PDF file
  path := 'C:\work\sample.pdf';
  if ((path <> '') and (aqFile.Exists(path)) and (aqFileSystem.GetFileExtension(path) = 'pdf')) then
  begin

    // Get the entire file contents
    contents := PDF.ConvertToText(path);
    if contents <> '' then
    begin
      // Extract a value
      regEx := HISUtils.RegExpr();

      // This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      regEx.Expression := '\d{1,2}/\d{1,2}/\d{2,4}';

      // Post all the date values that match the specified pattern
      // to the test log
      if regEx.Exec(contents) then
        repeat
          Log.Message(regEx.Match[0]);
        until not regEx.ExecNext;

    end;

  end;

end;

C++Script, C#Script

function GetDateValuesFromPDF()
{
  // Get the path to the tested PDF file
  var path = "C:\\work\\sample.pdf";
  if ((path != "") && (aqFile["Exists"](path)) && (aqFileSystem["GetFileExtension"](path) == "pdf"))
  {
    // Get the entire file contents
    contents = PDF["ConvertToText"](path);
    if (contents != "")
    {
      // This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      regEx = /\d{1,2}\/\d{1,2}\/\d{2,4}/gim

      // Post all the date values that match the specified pattern
      // to the test log
      var r = contents["match"](regEx);
      if (r != null && r["length"] > 0)
      {
        for (var i = 0; i < r["length"]; i++ )
          Log["Message"](r[i]);
      }
    }
  }

}

Extract Section Contents From a PDF File

This sample shows how to get the contents of a section from a PDF file (the text between two subheaders):

JavaScript, JScript

function Main()
{
  // Specifies the tested PDF file
  var path = "C:\\work\\sample.pdf";
  // Set the name of the section
  // whose contents you want to get
  section1Name = "Section 1 Name";

  // Set the name of the section
  // that follows the target section
  // (to get the contents of the last section, leave this value empty)
  section2Name = "Section 2 Name";

  // Get the section contents
  contents = GetSectionContents(path, section1Name, section2Name);

  // Post the contents to the test log
  if (contents != null && contents != "")
    Log.Message("View the contents of the section \"" + section1Name + "\" in the Details panel", contents);
}

// Use a regular expression
// to get the section contents
function GetSectionContents(aPath, aSection1, aSection2)
{
  if (aqFile.Exists(aPath) && (aqFileSystem.GetFileExtension(aPath) == 'pdf'))
  {
    // Get the entire text content of the PDF file
    str = PDF.ConvertToText(aPath);
    if (str != "")
    {
      // Create a regular expression that will get the text between the section headers
      var w = aSection1 + "[\\r\\n]*([^]*)" + aSection2;
      var regEx = new RegExp(w, "gim");
      var r = regEx.exec(str);
      if (r != null && r.length > 0)
      {
        return r[1];
      }
      else
      {
        Log.Warning("Failed to get the section contents");
        return "";
      }
    }
  }

}

Python

import re

def Main():
  # Specifies the tested PDF file
  path = "C:\\work\\sample.pdf"
  # Set the name of the section
  # Whose contents you want to get
  section1Name = "Section 1 Name"

  # Set the name of the section
  # That follows the target section
  # (To get the contents of the last section, leave this value empty)
  section2Name = "Section 2 Name"

  # Get the section contents
  contents = GetSectionContents(path, section1Name, section2Name)

  # Post the contents to the test log
  if (contents != None and contents != ""):
    Log.Message("View the contents of the section \"" + section1Name + "\" in the Details panel", contents)


# Use a regular expression
# To get the section contents
def GetSectionContents(aPath, aSection1, aSection2):
  if (aqFile.Exists(aPath) and (aqFileSystem.GetFileExtension(aPath) == 'pdf')):
    # Get the entire text contents of the PDF file
    str = PDF.ConvertToText(aPath)
    if (str != ""):
      # Create a regular expression that will get text between section headers
      w = aSection1 + "[\\r\\n]+([\\w\\W\\s\\S\\r\\n]*)" + aSection2
      regEx = re.compile(w)
      r = regEx.search(str)
      if (r and len(r.groups()) > 0):
        return r.groups()[0]
      else:
        Log.Warning("Failed to get the section contents")
        return ""

VBScript

Sub Main()
  ' Specifies the tested PDF file
  path = "C:\work\sample.pdf"
  ' Set the name of the section
  ' whose contents you want to get
  section1Name = "Section 1 Name"

  ' Set the name of the section
  ' that follows the target section
  ' (to get the contents of the last section, leave this value empty)
  section2Name = "Section 2 Name"

  ' Get the section contents
  contents = GetSectionContents(path, section1Name, section2Name)

  ' Posts the contents to the test log
  If contents <> "" Then
    Call Log.Message("View the contents of the section '" & section1Name & "' in the Details panel", contents)
  End If
End Sub

' Use a regular expression
' to get the section contents
Function GetSectionContents(aPath, aSection1, aSection2)
  If aqFile.Exists(aPath) And aqFileSystem.GetFileExtension(aPath) = "pdf" Then
    ' Get the entire text content of the PDF file
    str = PDF.ConvertToText(aPath)
    If str <> "" Then
      ' Create a regular expression that will get the text between the section headers
      w = aSection1 & "[\r\n]*([\w\W\s\S\r\n]*)" & aSection2

      Set regEx = New RegExp
      regEx.Pattern = w
      regEx.Global = True
      ' Run the search and collect the matches
      Set r = regEx.Execute(str)
      If r.Count > 0 Then
        For Each m In r
          GetSectionContents = m.SubMatches(0)
        Next
      Else
        Log.Warning("Failed to get the section contents")
        GetSectionContents = ""
      End If

    End If
  End If

End Function

DelphiScript

// Use a regular expression
// to get the section contents
function GetSectionContents(aPath, aSection1, aSection2);
var str, regEx, w;
begin
  if (aqFile.Exists(aPath) and (aqFileSystem.GetFileExtension(aPath) = 'pdf')) then
  begin
    // Get the entire text content of the PDF file
    str := PDF.ConvertToText(aPath);
    if str <> '' then
    begin
      // Create a regular expression that will get the text between the section headers
      aSection1 := aqString.Replace(aSection1, ' ', '[\s\t\r\n]+');
      aSection2 := aqString.Replace(aSection2, ' ', '[\s\t\r\n]+');
      w := aSection1 + '(.*)' + aSection2;

      regEx := HISUtils.RegExpr();
      regEx.Expression := w;
      if regEx.Exec(str) then
        // Return the text that is between the specified section headers
        result := regEx.Match[1]
      else
        begin
          Log.Warning('Failed to get the section contents');
          result := '';
        end;
    end;
  end;

end;

procedure Main();
var path, section1Name, section2Name, contents;
begin
  // Specifies the tested PDF file
  path := 'C:\work\sample.pdf';
  // Set the name of the section
  // whose contents you want to get
  section1Name := 'Section 1 Name';

  // Set the name of the section
  // that follows the target section
  // (to get the contents of the last section, leave this value empty)
  section2Name := 'Section 2 Name';

  // Get the section contents
  contents := GetSectionContents(path, section1Name, section2Name);

  // Post the contents to the test log
  if (contents <> '') then
    Log.Message('View the contents of the section "' + section1Name + '" in the Details panel', contents);

end;

C++Script, C#Script

function Main()
{
  // Specifies the tested PDF file
  var path = "C:\\work\\sample.pdf";
  // Set the name of the section
  // whose contents you want to get
  section1Name = "Section 1 Name";

  // Set the name of the section
  // that follows the target section
  // (To get the contents of the last section, leave this value empty)
  section2Name = "Section 2 Name";

  // Get the section contents
  contents = GetSectionContents(path, section1Name, section2Name);

  // Post the contents to the test log
  if (contents != null && contents != "")
    Log["Message"]("View the contents of the section \"" + section1Name + "\" in the Details panel", contents);
}

// Use a regular expression
// to get the section contents
function GetSectionContents(aPath, aSection1, aSection2)
{
  if (aqFile["Exists"](aPath) && (aqFileSystem["GetFileExtension"](aPath) == 'pdf'))
  {
    // Get the entire text content of the PDF file
    str = PDF["ConvertToText"](aPath);
    if (str != "")
    {
      // Create a regular expression that will get the text between the section headers
      var w = aSection1 + "[\\r\\n]*([^]*)" + aSection2;
      var regEx = new RegExp(w, "gim");
      var r = regEx["exec"](str);
      if (r != null && r.length > 0)
      {
        return r[1];
      }
      else
      {
        Log["Warning"]("Failed to get the section contents");
        return "";
      }
    }
  }

}
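The samples above insert the section names into the regular expression as-is, so a header that contains characters such as "." or "(" is interpreted as part of the pattern. A possible way to handle such headers is sketched below in plain Python. The function name get_section_contents and the sample text are hypothetical, and the sketch works on a string that stands in for the result of PDF.ConvertToText.

```python
import re


def get_section_contents(text, section_name, next_section_name=""):
    """Return the text between two headers, or None if the headers are not found."""
    # re.escape makes characters such as '.', '(' or '+' in a header match literally.
    start = re.escape(section_name)
    if next_section_name:
        # The non-greedy capture stops at the first occurrence of the next header.
        pattern = start + r"\s*(.*?)\s*" + re.escape(next_section_name)
    else:
        # No following header: capture everything up to the end of the text.
        pattern = start + r"\s*(.*)"
    # re.DOTALL lets '.' match line breaks, so a section can span several lines.
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None


# Stand-in for the string that PDF.ConvertToText(path) would return.
sample_text = "1. Scope (draft)\nApplies to all units.\n2. Terms\nNet 30 days.\n"
print(get_section_contents(sample_text, "1. Scope (draft)", "2. Terms"))
print(get_section_contents(sample_text, "2. Terms"))
```

Without the escaping step, the parentheses in "1. Scope (draft)" would be treated as a capturing group and the header would not match the literal text.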

Validate PDF files

PDF Checkpoints

To verify contents of an entire PDF file, use PDF checkpoints. They compare contents of the PDF file with the expected contents you store in the Stores > Files collection of your project. You can create checkpoints both during test recording and at design time. In the checkpoint properties, you can set the allowed difference between files. For more information, see PDF Checkpoints.

Custom Verification

To validate a fragment of a PDF file, or to create custom verifications, you need to create a keyword test or write a script.

In keyword tests
  1. Use the PDF to Text operation to get file contents. Follow this operation with the Set Variable Value operation to save the extracted text (the last operation result) to a variable:

    Getting text of a PDF file in a keyword test in TestComplete


  2. Use the If…Then operation to check the variable value, or write a script routine that will perform the needed verification, and then call it from your keyword test by using the Call Script Routine, Run Code Snippet, or Run Test operation.

In script tests
  1. Call PDF.ConvertToText(…) to get the text of your PDF file:

    JavaScript, JScript

    var path = "C:\\work\\sample.pdf";
    var contents = PDF.ConvertToText(path);

    Python

    path = "C:\\work\\sample.pdf"
    contents = PDF.ConvertToText(path)

    VBScript

    path = "C:\work\sample.pdf"
    contents = PDF.ConvertToText(path)

    DelphiScript

    var path, contents;

    path := 'C:\work\sample.pdf';
    contents := PDF.ConvertToText(path);

    C++Script, C#Script

    var path = "C:\\work\\sample.pdf";
    var contents = PDF["ConvertToText"](path);

  2. Use script objects and methods that TestComplete provides for working with strings, for example, aqString, to perform the operation you need. See the example below.

Example

The example below shows how to validate the contents of a PDF file ignoring some part of it.

JavaScript

function Main()
{
  let path1 = "C:\\work\\baseline.pdf";
  let path2 = "C:\\work\\report.pdf";

  if (ComparePDF(path1, path2))
    Log.Message("The text contents of specified PDF files are the same");

}

function ComparePDF(path1, path2)
{
  if (((path1 != "") && (aqFile.Exists(path1)) && (aqFileSystem.GetFileExtension(path1) == "pdf"))
    && ((path2 != "") && (aqFile.Exists(path2)) && (aqFileSystem.GetFileExtension(path2) == "pdf")))
    {
      // Get the text contents of PDF files
      let str1 = PDF.ConvertToText(path1);
      let str2 = PDF.ConvertToText(path2);

      // Use the regular expression
      // to replace the date/time stamp
      let regEx = /\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}/gim;

      str1 = str1.replace(regEx, "<ignore>");
      str2 = str2.replace(regEx, "<ignore>");

      // Compare the resulting contents
      return equal(str1, str2);

    }
  else
    return false;
}

JScript

function Main()
{
  var path1 = "C:\\work\\baseline.pdf";
  var path2 = "C:\\work\\report.pdf";

  if (ComparePDF(path1, path2))
    Log.Message("The text contents of specified PDF files are the same");

}

function ComparePDF(path1, path2)
{
  if (((path1 != "") && (aqFile.Exists(path1)) && (aqFileSystem.GetFileExtension(path1) == "pdf"))
  && ((path2 != "") && (aqFile.Exists(path2)) && (aqFileSystem.GetFileExtension(path2) == "pdf")))
    {
      // Get the text contents of PDF files
      var str1 = PDF.ConvertToText(path1);
      var str2 = PDF.ConvertToText(path2);

      // Use the regular expression
      // to replace the date/time stamp
      regEx = /\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}/gim;

      str1 = str1.replace(regEx, "<ignore>");
      str2 = str2.replace(regEx, "<ignore>");

      // Compare the resulting contents
      return (str1 == str2);

  }
  else
    return false;
}

Python

import re

def Main():
  path1 = "C:\\work\\baseline.pdf"
  path2 = "C:\\work\\report.pdf"

  if (ComparePDF(path1, path2)):
    Log.Message("The text contents of specified PDF files are the same")

def ComparePDF(path1, path2):
  if (path1 != "" and aqFile.Exists(path1) and aqFileSystem.GetFileExtension(path1) == "pdf" and \
  path2 != "" and aqFile.Exists(path2) and aqFileSystem.GetFileExtension(path2) == "pdf"):
    # Get the text contents of PDF files
    str1 = PDF.ConvertToText(path1)
    str2 = PDF.ConvertToText(path2)

    # Use the regular expression
    # to replace the date/time stamp
    # (str.replace() treats its argument as literal text, so use re.sub here)
    regEx = re.compile(r"\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}", re.IGNORECASE | re.MULTILINE)

    str1 = regEx.sub("<ignore>", str1)
    str2 = regEx.sub("<ignore>", str2)

    # Compare the resulting contents
    return (str1 == str2)

  else:
    return False

VBScript

Sub Main()

  path1 = "C:\work\baseline.pdf"
  path2 = "C:\work\report.pdf"

  If ComparePDF(path1, path2) Then
    Log.Message("The text contents of specified PDF files are the same")
  End If

End Sub

Function ComparePDF(path1, path2)
  If path1 <> "" And aqFile.Exists(path1) And aqFileSystem.GetFileExtension(path1) = "pdf" _
    And path2 <> "" And aqFile.Exists(path2) And aqFileSystem.GetFileExtension(path2) = "pdf" Then
      ' Get the text contents of PDF files
      str1 = PDF.ConvertToText(path1)
      str2 = PDF.ConvertToText(path2)

      ' Use the regular expression
      ' to replace the date/time stamp
      Set regEx = New RegExp
      regEx.Pattern = "\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}"
      regEx.IgnoreCase = True
      regEx.Global = True

      str1 = regEx.replace(str1, "<ignore>")
      str2 = regEx.replace(str2, "<ignore>")

      ' Compare the resulting contents
      ComparePDF = (str1 = str2)

  Else
    ComparePDF = False
  End If
End Function

DelphiScript

function ComparePDF(path1, path2);
var str1, str2;
var regEx;
begin

  if ((path1 <> '') and (aqFile.Exists(path1)) and (aqFileSystem.GetFileExtension(path1) = 'pdf'))
    and ((path2 <> '') and (aqFile.Exists(path2)) and (aqFileSystem.GetFileExtension(path2) = 'pdf')) then
    begin
      // Get the text contents of PDF files
      str1 := PDF.ConvertToText(path1);
      str2 := PDF.ConvertToText(path2);

      // Use the regular expression
      // to replace the date/time stamp
      regEx := HISUtils.RegExpr;
      regEx.Expression := '\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}';

      str1 := regEx.Replace(str1, '<ignore>');
      str2 := regEx.Replace(str2, '<ignore>');
      // Compare the resulting contents
      result := (str1 = str2);

    end
  else
    result := false;
end;

procedure Main();
var path1, path2;
begin
  path1 := 'C:\work\baseline.pdf';
  path2 := 'C:\work\report.pdf';

  if ComparePDF(path1, path2) then
    Log.Message('The text contents of specified PDF files are the same');

end;

C++Script, C#Script

function Main()
{
  var path1 = "C:\\work\\baseline.pdf";
  var path2 = "C:\\work\\report.pdf";

  if (ComparePDF(path1, path2))
    Log["Message"]("The text contents of specified PDF files are the same");

}

function ComparePDF(path1, path2)
{
  if (((path1 != "") && (aqFile["Exists"](path1)) && (aqFileSystem["GetFileExtension"](path1) == "pdf"))
    && ((path2 != "") && (aqFile["Exists"](path2)) && (aqFileSystem["GetFileExtension"](path2) == "pdf")))
    {
      // Get the text contents of PDF files
      var str1 = PDF["ConvertToText"](path1);
      var str2 = PDF["ConvertToText"](path2);

      // Use the regular expression
      // to replace the date/time stamp
      regEx = /\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}/gim;

      str1 = str1["replace"](regEx, "<ignore>");
      str2 = str2["replace"](regEx, "<ignore>");

      // Compare the resulting contents
      return (str1 == str2);

    }
  else
    return false;

}
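The comparison above reports only whether the two texts match. When they do not, it can help to see which lines differ. The following is a minimal sketch in plain Python that uses the standard difflib module. The function name describe_differences and the sample strings are hypothetical, and the strings stand in for the results of PDF.ConvertToText.

```python
import difflib
import re

# Same date/time pattern as in the example above, for instance "1/20/2023 10:15:30 AM".
TIMESTAMP = re.compile(r"\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}")


def describe_differences(text1, text2):
    """Mask timestamps, then return the lines that differ between the two texts."""
    lines1 = TIMESTAMP.sub("<ignore>", text1).splitlines()
    lines2 = TIMESTAMP.sub("<ignore>", text2).splitlines()
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines and "? " hint lines are filtered out.
    return [d for d in difflib.ndiff(lines1, lines2) if d.startswith(("- ", "+ "))]


# Stand-ins for the strings that PDF.ConvertToText would return.
baseline = "Report generated 1/20/2023 10:15:30 AM\nTotal: 310.00\nStatus: Open"
report = "Report generated 2/03/2023 09:02:11 PM\nTotal: 325.00\nStatus: Open"

for line in describe_differences(baseline, report):
    print(line)
```

An empty result means the texts are the same once the timestamps are masked. In a TestComplete script, the returned lines could be posted to the test log instead of being printed.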

See Also

Working With External Data Sources
PDF Checkpoints
