Automating PDF Table of Contents Creation Using AI and PDFtk

Adding a Table of Contents (ToC) to a PDF file by hand is tedious, but with the right tools most of the work can be automated. In this guide, we’ll use AI to extract the ToC entries from images and PDFtk to write them into the PDF as bookmarks.

Step 1: Create a CSV File Containing the Table of Contents Entries

To generate the ToC entries, we can use AI to extract them from images of the book’s table of contents. Use a prompt like the one below:

You are a system that, given an image of part of a book's table of contents, extracts the data and tabulates it. The columns are:

- level (1 for chapters, 2 for sections)
- page (for chapters, use the page of the first section)
- title

The output must be the contents of a CSV file.

Do not respond with anything other than the contents of the CSV file. All provided images must be tabulated without asking for confirmation.

Customize the prompt as needed and send it to an AI model along with images of the book’s table of contents. The AI will return structured text, which you can save as a CSV file.
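
Models sometimes wrap their reply in a Markdown code fence even when told to return only the CSV contents. The sketch below shows one way to strip such a fence before saving the file. It assumes the reply is already available as a string (how the model is called is left out), and `strip_code_fence` and `save_reply_as_csv` are hypothetical helper names, not part of any library.

```python
from pathlib import Path

# Three backticks, built programmatically so this example can itself be fenced
FENCE = "`" * 3


def strip_code_fence(reply: str) -> str:
    """Return the reply without a surrounding Markdown code fence, if any."""
    lines = reply.strip().splitlines()
    # Drop an opening fence, with or without a language tag such as "csv"
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    # Drop the closing fence
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines) + "\n"


def save_reply_as_csv(reply: str, path: str = "bookmarks.csv") -> None:
    """Write the cleaned reply to disk as UTF-8 (file name is an assumption)."""
    Path(path).write_text(strip_code_fence(reply), encoding="utf-8")
```

Saving to `bookmarks.csv` matches the file name the conversion script in Step 3 reads.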

Step 2: Check and Clean the CSV File

Before converting the CSV file to a format suitable for PDFtk, review the extracted data. Ensure that:
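
Part of this review can be automated. The sketch below is one possible set of checks, not a definitive list: it assumes the three-column layout requested in the prompt (level, page, title), levels limited to 1 and 2, and page numbers that never decrease from one entry to the next. `find_problems` is a hypothetical helper name.

```python
import csv
import io


def find_problems(csv_text: str) -> list[str]:
    """Return human-readable problems found in level,page,title rows."""
    problems = []
    previous_page = 0
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # ignore blank lines
        if line_no == 1 and not row[0].strip().isdigit():
            continue  # ignore a header row such as "level,page,title"
        if len(row) != 3:
            # Often caused by an unquoted comma inside a title
            problems.append(f"line {line_no}: expected 3 columns, got {len(row)}")
            continue
        level, page, title = (field.strip() for field in row)
        if level not in ("1", "2"):
            problems.append(f"line {line_no}: unexpected level {level!r}")
        if not page.isdigit() or int(page) < 1:
            problems.append(f"line {line_no}: invalid page {page!r}")
        elif int(page) < previous_page:
            problems.append(f"line {line_no}: page {page} is lower than the previous entry")
        else:
            previous_page = int(page)
        if not title:
            problems.append(f"line {line_no}: empty title")
    return problems
```

An empty result does not prove the data is right; it only rules out these structural mistakes, so a visual comparison with the printed table of contents is still worthwhile.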

Step 3: Convert CSV to PDFtk Table of Contents Format

Once the CSV file is cleaned, convert it into the format required by PDFtk using the following Python script:

import csv

# Define the path to your CSV file and the output text file
csv_file = 'bookmarks.csv'
output_file = 'bookmarks.txt'

# Open the CSV file and read the content
with open(csv_file, newline='', encoding='utf-8') as file:
    reader = csv.reader(file)

    # Open the output text file in write mode
    with open(output_file, 'w', encoding='utf-8') as output:
        # Loop through each row in the CSV file
        for row in reader:
            # Skip blank lines and a header row such as "level,page,title"
            if len(row) < 3 or not row[0].strip().isdigit():
                continue

            # Columns follow the order requested in the prompt:
            # level, page, title
            bookmark_level = row[0].strip()
            page_number = row[1].strip()
            title = row[2].strip()

            # Write the formatted bookmark information to the text file
            output.write("BookmarkBegin\n")
            output.write(f"BookmarkTitle: {title}\n")
            output.write(f"BookmarkLevel: {bookmark_level}\n")
            output.write(f"BookmarkPageNumber: {page_number}\n")

print(f"Bookmarks have been written to {output_file}")

This script transforms the structured CSV data into a format that PDFtk can interpret for adding bookmarks to the PDF.

Step 4: Extract Data from the PDF File Using PDFtk

To integrate the Table of Contents, first extract the existing PDF metadata with the following command:

pdftk input.pdf dump_data_utf8 > in-utf8.info

This will produce a file containing the metadata of the PDF, which we will modify in the next step.
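
The dump is plain text made of `Key: Value` lines, which makes it easy to inspect from a script. As a minimal sketch, the function below reads the page count from such a dump. The sample text is abbreviated and its values are invented for illustration, and `number_of_pages` is a hypothetical helper name.

```python
from typing import Optional

# Abbreviated, illustrative dump; a real file contains more records
SAMPLE_DUMP = """\
InfoBegin
InfoKey: Title
InfoValue: Example Book
NumberOfPages: 320
PageMediaBegin
PageMediaNumber: 1
"""


def number_of_pages(dump_text: str) -> Optional[int]:
    """Return the value of the NumberOfPages line, or None if it is absent."""
    for line in dump_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "NumberOfPages":
            return int(value.strip())
    return None
```

Knowing the page count makes it possible to confirm that no bookmark points past the end of the document.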

Step 5: Edit the PDFtk Data File

Open the extracted data file and insert the Table of Contents entries obtained in Step 3. Ensure that:
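
The insertion can also be scripted instead of done in a text editor. The sketch below places the bookmark records directly after the `NumberOfPages` line, which mirrors where PDFtk's own dump lists bookmarks; treat that placement as an assumption rather than a documented requirement. It also assumes the PDF has no existing bookmarks. `insert_bookmarks` and `merge_files` are hypothetical helper names.

```python
from pathlib import Path


def insert_bookmarks(info_text: str, bookmarks_text: str) -> str:
    """Return info_text with the bookmark records placed after NumberOfPages."""
    info_lines = info_text.splitlines()
    bookmark_lines = bookmarks_text.splitlines()
    for index, line in enumerate(info_lines):
        if line.startswith("NumberOfPages:"):
            merged = info_lines[: index + 1] + bookmark_lines + info_lines[index + 1 :]
            break
    else:
        # Assumption: without a NumberOfPages line, append the records at the end
        merged = info_lines + bookmark_lines
    return "\n".join(merged) + "\n"


def merge_files(info_path: str = "in-utf8.info",
                bookmarks_path: str = "bookmarks.txt") -> None:
    """Rewrite the info file in place with the bookmark records added."""
    info = Path(info_path)
    bookmarks = Path(bookmarks_path).read_text(encoding="utf-8")
    info.write_text(
        insert_bookmarks(info.read_text(encoding="utf-8"), bookmarks),
        encoding="utf-8",
    )
```

Calling `merge_files()` with its defaults uses the file names from this guide; keeping a copy of the original info file first makes it easy to start over.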

Step 6: Generate the Final PDF with the Table of Contents

Finally, apply the modified data file to the original PDF using the following command:

pdftk input.pdf update_info_utf8 in-utf8.info output out.pdf

This will create a new version of the PDF with the Table of Contents integrated.
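
A quick way to confirm the result is to dump the new file and count its bookmark records. The following is a minimal sketch, assuming `pdftk` is on the PATH and that `dump_data_utf8` prints to standard output when no output file is given (as the redirect in Step 4 relies on). `dump_pdf` and `count_bookmarks` are hypothetical helper names.

```python
import subprocess


def dump_pdf(pdf_path: str) -> str:
    """Run pdftk's dump_data_utf8 on a PDF and return its text output."""
    result = subprocess.run(
        ["pdftk", pdf_path, "dump_data_utf8"],
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftk fails
    )
    return result.stdout


def count_bookmarks(dump_text: str) -> int:
    """Count bookmark records, each of which starts with a BookmarkBegin line."""
    return sum(1 for line in dump_text.splitlines() if line.strip() == "BookmarkBegin")
```

`count_bookmarks(dump_pdf("out.pdf"))` should match the number of data rows in the CSV file; a lower number suggests some records were not accepted.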
