PDF Datasource

This module provides the configuration, document type, manager factory, parser, and reader for the PDF datasource.

Configuration

`PDFDatasourceConfiguration`

Bases: DatasourceConfiguration

Configuration for PDF data source.

This class defines the configuration parameters required for extracting data from PDF files. It inherits from the base DatasourceConfiguration class.

Source code in src/extraction/datasources/pdf/configuration.py

class PDFDatasourceConfiguration(DatasourceConfiguration):
    """Configuration for PDF data source.

    This class defines the configuration parameters required for extracting data from PDF files.
    It inherits from the base DatasourceConfiguration class.
    """

    name: Literal[DatasourceName.PDF] = Field(
        ..., description="The name of the data source."
    )
    base_path: str = Field(
        ..., description="Base path to the directory containing PDF files"
    )
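
For orientation, the sketch below mirrors the shape of this configuration with a standard-library dataclass rather than the project's own model class. The name `PDFConfigSketch`, the literal value `"pdf"`, and the `export_limit` field are assumptions: the reader documented further down accesses `configuration.export_limit`, so that field presumably comes from `DatasourceConfiguration`, whose source is not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PDFConfigSketch:
    """Hypothetical stand-in for PDFDatasourceConfiguration."""

    name: str                           # required, like the real `name` field
    base_path: str                      # required, like the real `base_path` field
    export_limit: Optional[int] = None  # assumed to be inherited from the base class

    def __post_init__(self) -> None:
        # Emulate the Literal[DatasourceName.PDF] constraint.
        # "pdf" is an assumed value for DatasourceName.PDF.
        if self.name != "pdf":
            raise ValueError(f"unexpected datasource name: {self.name!r}")


config = PDFConfigSketch(name="pdf", base_path="/data/pdfs", export_limit=10)
print(config)
```

Both `name` and `base_path` are declared with `Field(...)`, so omitting either one should fail validation in the real class as well.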

Document

`PDFDocument`

Bases: BaseDocument

Document representation for PDF file content.

Extends BaseDocument to handle PDF-specific document processing including metadata filtering for embeddings and LLM contexts.

Source code in src/extraction/datasources/pdf/document.py

class PDFDocument(BaseDocument):
    """Document representation for PDF file content.

    Extends BaseDocument to handle PDF-specific document processing including
    metadata filtering for embeddings and LLM contexts.
    """

    pass

Manager

`PDFDatasourceManagerFactory`

Bases: Factory

Factory for creating datasource managers.

Provides type-safe creation of datasource managers based on configuration.

Attributes:	`_configuration_class` (`Type`) – Type of configuration object

Source code in src/extraction/datasources/pdf/manager.py

class PDFDatasourceManagerFactory(Factory):
    """Factory for creating datasource managers.

    Provides type-safe creation of datasource managers based on configuration.

    Attributes:
        _configuration_class: Type of configuration object
    """

    _configuration_class: Type = PDFDatasourceConfiguration

    @classmethod
    def _create_instance(
        cls, configuration: PDFDatasourceConfiguration
    ) -> BasicDatasourceManager:
        """Create an instance of the PDF datasource manager.

        This method constructs a BasicDatasourceManager by creating the appropriate
        reader and parser based on the provided configuration.

        Args:
            configuration: Configuration specifying how to set up the PDF datasource
                          manager, reader, and parser.

        Returns:
            A configured BasicDatasourceManager instance for handling PDF documents.
        """
        reader = PDFDatasourceReaderFactory.create(configuration)
        parser = PDFDatasourceParserFactory.create(configuration)
        return BasicDatasourceManager(
            configuration=configuration,
            reader=reader,
            parser=parser,
        )
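
The composition above can be sketched without the project's classes. Everything in this block is a hypothetical stand-in: the `Factory` base class is not shown on this page, so the type check in `create` is only an assumption about what "type-safe creation" might look like, and strings stand in for the real reader and parser objects.

```python
from dataclasses import dataclass
from typing import Any, ClassVar, Type


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for PDFDatasourceConfiguration."""

    base_path: str


class FactorySketch:
    """Assumed shape of the Factory base class (not the project's code)."""

    _configuration_class: ClassVar[Type] = object

    @classmethod
    def create(cls, configuration: Any) -> Any:
        # Reject configurations of the wrong type before building anything.
        if not isinstance(configuration, cls._configuration_class):
            raise TypeError(
                f"expected {cls._configuration_class.__name__}, "
                f"got {type(configuration).__name__}"
            )
        return cls._create_instance(configuration)

    @classmethod
    def _create_instance(cls, configuration: Any) -> Any:
        raise NotImplementedError


class ReaderFactorySketch(FactorySketch):
    _configuration_class = ConfigSketch

    @classmethod
    def _create_instance(cls, configuration: ConfigSketch) -> str:
        return f"reader({configuration.base_path})"


class ParserFactorySketch(FactorySketch):
    _configuration_class = ConfigSketch

    @classmethod
    def _create_instance(cls, configuration: ConfigSketch) -> str:
        return "parser"


class ManagerFactorySketch(FactorySketch):
    _configuration_class = ConfigSketch

    @classmethod
    def _create_instance(cls, configuration: ConfigSketch) -> dict:
        # Same wiring as the manager factory: build both parts, then bundle them.
        return {
            "configuration": configuration,
            "reader": ReaderFactorySketch.create(configuration),
            "parser": ParserFactorySketch.create(configuration),
        }


manager = ManagerFactorySketch.create(ConfigSketch(base_path="/data/pdfs"))
print(manager["reader"])  # reader(/data/pdfs)
```

Callers go through `create` rather than `_create_instance`, so one configuration object drives the reader, the parser, and the manager.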

Parser

`PDFDatasourceParser`

Bases: BaseParser[PDFDocument]

Parser for PDF documents that converts them to structured PDFDocument objects.

Uses MarkItDown to convert PDF files to markdown format for easier processing.

Source code in src/extraction/datasources/pdf/parser.py

class PDFDatasourceParser(BaseParser[PDFDocument]):
    """
    Parser for PDF documents that converts them to structured PDFDocument objects.

    Uses MarkItDown to convert PDF files to markdown format for easier processing.
    """

    def __init__(self, parser: MarkItDown = MarkItDown()):
        """
        Initialize the PDF parser.

        Attributes:
            parser: MarkItDown parser instance for PDF to markdown conversion
        """
        self.parser = parser

    def parse(self, file_path: str) -> PDFDocument:
        """
        Parses the given PDF file into a structured document.

        Args:
            file_path: Path to the PDF file

        Returns:
            PDFDocument object containing the parsed content and metadata
        """
        markdown = self.parser.convert(
            file_path, file_extension=".pdf"
        ).text_content
        metadata = self._extract_metadata(file_path)
        return PDFDocument(text=markdown, metadata=metadata)

    def _extract_metadata(self, file_path: str) -> dict:
        """
        Extract and process PDF metadata from the file.

        Args:
            file_path: Path to the PDF file

        Returns:
            Processed metadata dictionary with standardized fields
        """
        metadata = default_file_metadata_func(file_path)
        metadata.update(
            {
                "datasource": "pdf",
                "format": "pdf",
                "url": None,
                "title": os.path.basename(file_path),
                "last_edited_date": metadata["last_modified_date"],
                "created_date": metadata["creation_date"],
            }
        )
        del metadata["last_modified_date"]
        del metadata["creation_date"]
        return metadata
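
To make the key renaming in `_extract_metadata` concrete, here is a minimal sketch. `fake_file_metadata` is a hypothetical stand-in for `default_file_metadata_func` that models only the two keys the method reads, and the dates are made-up sample values. The sketch uses `dict.pop` where the original uses separate `del` statements; the resulting dictionary is the same.

```python
import os
from typing import Any, Dict


def fake_file_metadata(file_path: str) -> Dict[str, Any]:
    """Hypothetical stand-in for default_file_metadata_func."""
    return {
        "file_path": file_path,
        "creation_date": "2024-01-01",       # made-up sample value
        "last_modified_date": "2024-02-01",  # made-up sample value
    }


def extract_metadata(file_path: str) -> Dict[str, Any]:
    metadata = fake_file_metadata(file_path)
    metadata.update(
        {
            "datasource": "pdf",
            "format": "pdf",
            "url": None,
            "title": os.path.basename(file_path),
            # pop() reads the old key and removes it in one step.
            "last_edited_date": metadata.pop("last_modified_date"),
            "created_date": metadata.pop("creation_date"),
        }
    )
    return metadata


result = extract_metadata("/data/pdfs/report.pdf")
print(result["title"])             # report.pdf
print(result["last_edited_date"])  # 2024-02-01
```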

`__init__(parser=MarkItDown())`

Initialize the PDF parser.

Attributes:	`parser` – MarkItDown parser instance for PDF to markdown conversion

Source code in src/extraction/datasources/pdf/parser.py

def __init__(self, parser: MarkItDown = MarkItDown()):
    """
    Initialize the PDF parser.

    Attributes:
        parser: MarkItDown parser instance for PDF to markdown conversion
    """
    self.parser = parser
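
One Python detail is worth keeping in mind with this signature: a default argument value is evaluated once, when the function is defined, so every parser created without an explicit `parser` argument shares the same `MarkItDown` instance. The sketch below shows the effect with a hypothetical `Converter` stand-in rather than the real `MarkItDown` class. Whether the sharing matters depends on whether the converter holds mutable state, which this page does not say.

```python
class Converter:
    """Hypothetical stand-in for MarkItDown."""


class ParserSketch:
    def __init__(self, parser: Converter = Converter()):
        # The default Converter() was built once, at definition time.
        self.parser = parser


a = ParserSketch()
b = ParserSketch()
c = ParserSketch(parser=Converter())

print(a.parser is b.parser)  # True: both use the single default instance
print(a.parser is c.parser)  # False: c received its own instance
```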

`parse(file_path)`

Parses the given PDF file into a structured document.

Parameters:	`file_path` (`str`) – Path to the PDF file

Returns:	`PDFDocument` – PDFDocument object containing the parsed content and metadata

Source code in src/extraction/datasources/pdf/parser.py

def parse(self, file_path: str) -> PDFDocument:
    """
    Parses the given PDF file into a structured document.

    Args:
        file_path: Path to the PDF file

    Returns:
        PDFDocument object containing the parsed content and metadata
    """
    markdown = self.parser.convert(
        file_path, file_extension=".pdf"
    ).text_content
    metadata = self._extract_metadata(file_path)
    return PDFDocument(text=markdown, metadata=metadata)
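
Because the converter is injected through the constructor, `parse` can be exercised without a real PDF. The following is a sketch under stated assumptions: `FakeConverter`, `ConversionResult`, `DocumentSketch`, and `ParserSketch` are hypothetical stand-ins that only mimic the `convert(...).text_content` call and the `text` and `metadata` fields used above, and the metadata is reduced to a title.

```python
import os
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ConversionResult:
    """Hypothetical result object exposing text_content."""

    text_content: str


class FakeConverter:
    """Hypothetical test double with the same convert() call shape as parse() uses."""

    def convert(self, file_path: str, file_extension: str = "") -> ConversionResult:
        name = os.path.basename(file_path)
        return ConversionResult(text_content=f"# Converted {name}")


@dataclass
class DocumentSketch:
    """Hypothetical stand-in for PDFDocument."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class ParserSketch:
    def __init__(self, parser: Any) -> None:
        self.parser = parser

    def parse(self, file_path: str) -> DocumentSketch:
        # Convert to markdown first, then attach (simplified) metadata.
        markdown = self.parser.convert(
            file_path, file_extension=".pdf"
        ).text_content
        metadata = {"title": os.path.basename(file_path)}
        return DocumentSketch(text=markdown, metadata=metadata)


doc = ParserSketch(FakeConverter()).parse("/data/pdfs/report.pdf")
print(doc.text)
print(doc.metadata)
```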

`PDFDatasourceParserFactory`

Bases: Factory

Factory for creating PDF parser instances.

Creates and configures PDFDatasourceParser objects according to the provided configuration.

Source code in src/extraction/datasources/pdf/parser.py

class PDFDatasourceParserFactory(Factory):
    """
    Factory for creating PDF parser instances.

    Creates and configures PDFDatasourceParser objects according to
    the provided configuration.
    """

    _configuration_class: Type = PDFDatasourceConfiguration

    @classmethod
    def _create_instance(
        cls, _: PDFDatasourceConfiguration
    ) -> PDFDatasourceParser:
        """
        Creates a new instance of the PDF parser.

        Args:
            _: Configuration object for the parser (not used in this implementation)

        Returns:
            PDFDatasourceParser: Configured parser instance
        """
        return PDFDatasourceParser()

Reader

`PDFDatasourceReader`

Bases: BaseReader

Reader that lists the PDF files in a configured directory and yields their paths, up to an optional export limit.

Source code in src/extraction/datasources/pdf/reader.py

class PDFDatasourceReader(BaseReader):

    def __init__(
        self,
        configuration: PDFDatasourceConfiguration,
        logger: logging.Logger = LoggerConfiguration.get_logger(__name__),
    ):
        """Initialize PDF reader.

        Args:
            configuration: Settings for PDF processing
            logger: Logger instance for logging messages
        """
        super().__init__()
        self.export_limit = configuration.export_limit
        self.base_path = configuration.base_path
        self.logger = logger

    async def read_all_async(self) -> AsyncIterator[str]:
        """Asynchronously yield PDF file paths from the configured directory.

        Retrieves a list of PDF files from the base path, applies any configured
        export limit, and yields each file path individually.

        Returns:
            AsyncIterator[str]: An asynchronous iterator of PDF file paths
        """
        self.logger.info(
            f"Fetching PDF files from '{self.base_path}' with limit {self.export_limit}"
        )

        pdf_files = [
            f for f in os.listdir(self.base_path) if f.endswith(".pdf")
        ]
        files_to_load = (
            pdf_files
            if self.export_limit is None
            else pdf_files[: self.export_limit]
        )

        for file_name in tqdm(
            files_to_load, desc="[PDF] Loading files", unit="files"
        ):
            file_path = os.path.join(self.base_path, file_name)
            if os.path.isfile(file_path):
                yield file_path

`__init__(configuration, logger=LoggerConfiguration.get_logger(__name__))`

Initialize PDF reader.

Parameters:
`configuration` (`PDFDatasourceConfiguration`) – Settings for PDF processing
`logger` (`Logger`, default: `get_logger(__name__)`) – Logger instance for logging messages

Source code in src/extraction/datasources/pdf/reader.py

def __init__(
    self,
    configuration: PDFDatasourceConfiguration,
    logger: logging.Logger = LoggerConfiguration.get_logger(__name__),
):
    """Initialize PDF reader.

    Args:
        configuration: Settings for PDF processing
        logger: Logger instance for logging messages
    """
    super().__init__()
    self.export_limit = configuration.export_limit
    self.base_path = configuration.base_path
    self.logger = logger

`read_all_async()` `async`

Asynchronously yield PDF file paths from the configured directory.

Retrieves a list of PDF files from the base path, applies any configured export limit, and yields each file path individually.

Returns:	`AsyncIterator[str]` – An asynchronous iterator of PDF file paths

Source code in src/extraction/datasources/pdf/reader.py

async def read_all_async(self) -> AsyncIterator[str]:
    """Asynchronously yield PDF file paths from the configured directory.

    Retrieves a list of PDF files from the base path, applies any configured
    export limit, and yields each file path individually.

    Returns:
        AsyncIterator[str]: An asynchronous iterator of PDF file paths
    """
    self.logger.info(
        f"Fetching PDF files from '{self.base_path}' with limit {self.export_limit}"
    )

    pdf_files = [
        f for f in os.listdir(self.base_path) if f.endswith(".pdf")
    ]
    files_to_load = (
        pdf_files
        if self.export_limit is None
        else pdf_files[: self.export_limit]
    )

    for file_name in tqdm(
        files_to_load, desc="[PDF] Loading files", unit="files"
    ):
        file_path = os.path.join(self.base_path, file_name)
        if os.path.isfile(file_path):
            yield file_path
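
The listing, limiting, and yielding steps can be tried in isolation with the sketch below. It is a standalone function rather than the reader class, it omits logging and the `tqdm` progress bar, and it adds `sorted()`, which the original does not use: `os.listdir` returns names in arbitrary order, so sorting is what makes the limited subset predictable here. The helper names `read_pdf_paths`, `collect`, and `demo` are hypothetical.

```python
import asyncio
import os
import tempfile
from typing import AsyncIterator, List, Optional


async def read_pdf_paths(
    base_path: str, export_limit: Optional[int] = None
) -> AsyncIterator[str]:
    # sorted() is an addition in this sketch, not part of the original reader.
    pdf_files = sorted(f for f in os.listdir(base_path) if f.endswith(".pdf"))
    files_to_load = (
        pdf_files if export_limit is None else pdf_files[:export_limit]
    )
    for file_name in files_to_load:
        file_path = os.path.join(base_path, file_name)
        if os.path.isfile(file_path):
            yield file_path


async def collect(base_path: str, export_limit: Optional[int] = None) -> List[str]:
    # Consume the async generator with `async for`.
    return [path async for path in read_pdf_paths(base_path, export_limit)]


def demo() -> List[str]:
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.pdf", "b.pdf", "notes.txt"):
            with open(os.path.join(tmp, name), "w") as handle:
                handle.write("placeholder")
        paths = asyncio.run(collect(tmp, export_limit=1))
        return [os.path.basename(p) for p in paths]


print(demo())  # ['a.pdf']
```

As in the original, the limit is applied before the `os.path.isfile` check, so a directory whose name ends in `.pdf` would use up a slot and then be skipped. The `endswith(".pdf")` filter is also case-sensitive, so a file named `REPORT.PDF` would not be picked up.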

`PDFDatasourceReaderFactory`

Bases: Factory

Factory for creating PDF reader instances.

Implements the factory pattern to produce configured PDFDatasourceReader objects based on the provided configuration settings.

Source code in src/extraction/datasources/pdf/reader.py

class PDFDatasourceReaderFactory(Factory):
    """Factory for creating PDF reader instances.

    Implements the factory pattern to produce configured PDFDatasourceReader
    objects based on the provided configuration settings.
    """

    _configuration_class: Type = PDFDatasourceConfiguration

    @classmethod
    def _create_instance(
        cls, configuration: PDFDatasourceConfiguration
    ) -> PDFDatasourceReader:
        """Create a new PDFDatasourceReader with the specified configuration.

        Args:
            configuration: Settings that control PDF processing behavior including
                           base path and export limits

        Returns:
            PDFDatasourceReader: A fully configured reader instance ready for use
        """
        return PDFDatasourceReader(configuration=configuration)
