Spark Extension

openaivec.spark

Asynchronous Spark UDFs for the OpenAI and Azure OpenAI APIs.

This module provides builder classes (ResponsesUDFBuilder, EmbeddingsUDFBuilder) for creating asynchronous Spark UDFs that communicate with either the public OpenAI API or Azure OpenAI. It supports UDFs for generating responses and for creating embeddings. The UDFs operate on Spark DataFrames and use asyncio, which can improve throughput for these I/O-bound operations.

Setup

First, obtain a Spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
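
To follow along with the SQL examples later on this page, you can register a small sample table first (a minimal sketch; the rows and the your_table name are placeholders matching the query below):

# Hypothetical sample data backing the SQL examples below
df = spark.createDataFrame(
    [("I love this product!",), ("The weather is terrible today.",)],
    ["text"],
)
df.createOrReplaceTempView("your_table")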

Next, instantiate UDF builders with your OpenAI API key (or Azure credentials) and model/deployment names, then register the desired UDFs:

import os
from openaivec.spark import ResponsesUDFBuilder, EmbeddingsUDFBuilder
from pydantic import BaseModel

# Option 1: Using OpenAI
resp_builder = ResponsesUDFBuilder.of_openai(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="gpt-4o-mini", # Model for responses
)
emb_builder = EmbeddingsUDFBuilder.of_openai(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-3-small", # Model for embeddings
)

# Option 2: Using Azure OpenAI
# resp_builder = ResponsesUDFBuilder.of_azure_openai(
#     api_key=os.getenv("AZURE_OPENAI_KEY"),
#     endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
#     api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
#     model_name="your-resp-deployment-name", # Deployment for responses
# )
# emb_builder = EmbeddingsUDFBuilder.of_azure_openai(
#     api_key=os.getenv("AZURE_OPENAI_KEY"),
#     endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
#     api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
#     model_name="your-emb-deployment-name", # Deployment for embeddings
# )

# Define a Pydantic model for structured responses (optional)
class Translation(BaseModel):
    en: str
    fr: str
    # ... other languages

# Register the asynchronous responses UDF
spark.udf.register(
    "translate_async",
    resp_builder.build(
        instructions="Translate the text to multiple languages.",
        response_format=Translation,
    ),
)

# Or use a predefined task with the build_from_task method
from openaivec.task import nlp
spark.udf.register(
    "sentiment_async",
    resp_builder.build_from_task(nlp.SENTIMENT_ANALYSIS),
)

# Register the asynchronous embeddings UDF
spark.udf.register(
    "embed_async",
    emb_builder.build(),
)

You can now invoke the UDFs from Spark SQL:

SELECT
    text,
    translate_async(text) AS translation,
    sentiment_async(text) AS sentiment,
    embed_async(text) AS embedding
FROM your_table;
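
The registered UDFs can also be invoked through the DataFrame API via F.expr; a minimal sketch, assuming the your_table view from the setup above:

import pyspark.sql.functions as F

result = (
    spark.table("your_table")
    .withColumn("translation", F.expr("translate_async(text)"))
    .withColumn("sentiment", F.expr("sentiment_async(text)"))
    .withColumn("embedding", F.expr("embed_async(text)"))
)
result.show(truncate=False)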

Note: These UDFs delegate to the asynchronous pandas extensions in openaivec.pandas_ext (pandas_ext.aio.responses and pandas_ext.aio.embeddings).

ResponsesUDFBuilder dataclass

Builder for asynchronous Spark pandas UDFs for generating responses.

Configures and builds UDFs that leverage pandas_ext.aio.responses to generate text or structured responses from OpenAI models asynchronously. An instance stores authentication parameters and the model name.

This builder supports two main methods, as sketched below:

- build(): creates UDFs with custom instructions and response formats
- build_from_task(): creates UDFs from predefined tasks (e.g., sentiment analysis)
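
Both methods return a pyspark.sql.UserDefinedFunction ready for registration; a minimal sketch, reusing the resp_builder instance from the Setup section:

from openaivec.task import nlp

# Custom instructions with plain-text output (response_format defaults to str)
custom_udf = resp_builder.build(instructions="Summarize the text in one sentence.")

# Predefined task with a structured (struct-typed) output schema
task_udf = resp_builder.build_from_task(nlp.SENTIMENT_ANALYSIS)

spark.udf.register("summarize_async", custom_udf)
spark.udf.register("sentiment_async", task_udf)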

Attributes:

- api_key (str): OpenAI or Azure API key.
- endpoint (Optional[str]): Azure endpoint base URL. None for public OpenAI.
- api_version (Optional[str]): Azure API version. Ignored for public OpenAI.
- model_name (str): Deployment name (Azure) or model name (OpenAI) for responses.

Source code in src/openaivec/spark.py
@dataclass(frozen=True)
class ResponsesUDFBuilder:
    """Builder for asynchronous Spark pandas UDFs for generating responses.

    Configures and builds UDFs that leverage `pandas_ext.aio.responses`
    to generate text or structured responses from OpenAI models asynchronously.
    An instance stores authentication parameters and the model name.

    This builder supports two main methods:
    - `build()`: Creates UDFs with custom instructions and response formats
    - `build_from_task()`: Creates UDFs from predefined tasks (e.g., sentiment analysis)

    Attributes:
        api_key (str): OpenAI or Azure API key.
        endpoint (Optional[str]): Azure endpoint base URL. None for public OpenAI.
        api_version (Optional[str]): Azure API version. Ignored for public OpenAI.
        model_name (str): Deployment name (Azure) or model name (OpenAI) for responses.
    """

    # Params for OpenAI SDK
    api_key: str
    endpoint: str | None
    api_version: str | None

    # Params for Responses API
    model_name: str

    @classmethod
    def of_openai(cls, api_key: str, model_name: str) -> "ResponsesUDFBuilder":
        """Creates a builder configured for the public OpenAI API.

        Args:
            api_key (str): The OpenAI API key.
            model_name (str): The OpenAI model name for responses (e.g., "gpt-4o-mini").

        Returns:
            ResponsesUDFBuilder: A builder instance configured for OpenAI responses.
        """
        return cls(api_key=api_key, endpoint=None, api_version=None, model_name=model_name)

    @classmethod
    def of_azure_openai(cls, api_key: str, endpoint: str, api_version: str, model_name: str) -> "ResponsesUDFBuilder":
        """Creates a builder configured for Azure OpenAI.

        Args:
            api_key (str): The Azure OpenAI API key.
            endpoint (str): The Azure OpenAI endpoint URL.
            api_version (str): The Azure OpenAI API version (e.g., "2024-02-01").
            model_name (str): The Azure OpenAI deployment name for responses.

        Returns:
            ResponsesUDFBuilder: A builder instance configured for Azure OpenAI responses.
        """
        return cls(api_key=api_key, endpoint=endpoint, api_version=api_version, model_name=model_name)

    def build(
        self,
        instructions: str,
        response_format: Type[T] = str,
        batch_size: int = 128,  # number of rows per async batch request
        temperature: float = 0.0,
        top_p: float = 1.0,
        max_concurrency: int = 8,
    ) -> UserDefinedFunction:
        """Builds the asynchronous pandas UDF for generating responses.

        Args:
            instructions (str): The system prompt or instructions for the model.
            response_format (Type[T]): The desired output format. Either `str` for plain text
                or a Pydantic `BaseModel` for structured JSON output. Defaults to `str`.
            batch_size (int): Number of rows per async batch request passed to the underlying
                `pandas_ext` function. Defaults to 128.
            temperature (float): Sampling temperature (0.0 to 2.0). Defaults to 0.0.
            top_p (float): Nucleus sampling parameter. Defaults to 1.0.
            max_concurrency (int): Maximum number of concurrent requests. Defaults to 8.

        Returns:
            UserDefinedFunction: A Spark pandas UDF configured to generate responses asynchronously.
                Output schema is `StringType` or a struct derived from `response_format`.

        Raises:
            ValueError: If `response_format` is not `str` or a Pydantic `BaseModel`.
        """
        if issubclass(response_format, BaseModel):
            spark_schema = _pydantic_to_spark_schema(response_format)
            json_schema_string = serialize_base_model(response_format)

            @pandas_udf(returnType=spark_schema)
            def structure_udf(col: Iterator[pd.Series]) -> Iterator[pd.DataFrame]:
                _initialize(self.api_key, self.endpoint, self.api_version)
                pandas_ext.responses_model(self.model_name)

                for part in col:
                    predictions: pd.Series = asyncio.run(
                        part.aio.responses(
                            instructions=instructions,
                            response_format=deserialize_base_model(json_schema_string),
                            batch_size=batch_size,
                            temperature=temperature,
                            top_p=top_p,
                            max_concurrency=max_concurrency,
                        )
                    )
                    yield pd.DataFrame(predictions.map(_safe_dump).tolist())

            return structure_udf

        elif issubclass(response_format, str):

            @pandas_udf(returnType=StringType())
            def string_udf(col: Iterator[pd.Series]) -> Iterator[pd.Series]:
                _initialize(self.api_key, self.endpoint, self.api_version)
                pandas_ext.responses_model(self.model_name)

                for part in col:
                    predictions: pd.Series = asyncio.run(
                        part.aio.responses(
                            instructions=instructions,
                            response_format=str,
                            batch_size=batch_size,
                            temperature=temperature,
                            top_p=top_p,
                            max_concurrency=max_concurrency,
                        )
                    )
                    yield predictions.map(_safe_cast_str)

            return string_udf

        else:
            raise ValueError(f"Unsupported response_format: {response_format}")

    def build_from_task(
        self,
        task: PreparedTask,
        batch_size: int = 128,
        max_concurrency: int = 8,
    ) -> UserDefinedFunction:
        """Builds the asynchronous pandas UDF from a predefined task.

        This method allows users to create UDFs from predefined tasks such as sentiment analysis,
        translation, or other common NLP operations defined in the openaivec.task module.

        Args:
            task (PreparedTask): A predefined task configuration containing instructions,
                response format, temperature, and top_p settings.
            batch_size (int): Number of rows per async batch request passed to the underlying
                `pandas_ext` function. Defaults to 128.
            max_concurrency (int): Maximum number of concurrent requests. Defaults to 8.

        Returns:
            UserDefinedFunction: A Spark pandas UDF configured to execute the specified task
                asynchronously, returning a struct derived from the task's response format.

        Example:
            ```python
            from openaivec.task import nlp

            builder = ResponsesUDFBuilder.of_openai(
                api_key="your-api-key",
                model_name="gpt-4o-mini"
            )

            sentiment_udf = builder.build_from_task(nlp.SENTIMENT_ANALYSIS)

            spark.udf.register("analyze_sentiment", sentiment_udf)
            ```
        """
        # Serialize task parameters for Spark serialization compatibility
        task_instructions = task.instructions
        task_response_format_json = serialize_base_model(task.response_format)
        task_temperature = task.temperature
        task_top_p = task.top_p

        # Deserialize the response format from JSON
        response_format = deserialize_base_model(task_response_format_json)
        spark_schema = _pydantic_to_spark_schema(response_format)

        @pandas_udf(returnType=spark_schema)
        def task_udf(col: Iterator[pd.Series]) -> Iterator[pd.DataFrame]:
            _initialize(self.api_key, self.endpoint, self.api_version)
            pandas_ext.responses_model(self.model_name)

            for part in col:
                predictions: pd.Series = asyncio.run(
                    part.aio.responses(
                        instructions=task_instructions,
                        response_format=response_format,
                        batch_size=batch_size,
                        temperature=task_temperature,
                        top_p=task_top_p,
                        max_concurrency=max_concurrency,
                    )
                )
                yield pd.DataFrame(predictions.map(_safe_dump).tolist())

        return task_udf

of_openai classmethod

of_openai(
    api_key: str, model_name: str
) -> ResponsesUDFBuilder

Creates a builder configured for the public OpenAI API.

Parameters:

- api_key (str, required): The OpenAI API key.
- model_name (str, required): The OpenAI model name for responses (e.g., "gpt-4o-mini").

Returns:

- ResponsesUDFBuilder: A builder instance configured for OpenAI responses.

Source code in src/openaivec/spark.py
@classmethod
def of_openai(cls, api_key: str, model_name: str) -> "ResponsesUDFBuilder":
    """Creates a builder configured for the public OpenAI API.

    Args:
        api_key (str): The OpenAI API key.
        model_name (str): The OpenAI model name for responses (e.g., "gpt-4o-mini").

    Returns:
        ResponsesUDFBuilder: A builder instance configured for OpenAI responses.
    """
    return cls(api_key=api_key, endpoint=None, api_version=None, model_name=model_name)

of_azure_openai classmethod

of_azure_openai(
    api_key: str,
    endpoint: str,
    api_version: str,
    model_name: str,
) -> ResponsesUDFBuilder

Creates a builder configured for Azure OpenAI.

Parameters:

- api_key (str, required): The Azure OpenAI API key.
- endpoint (str, required): The Azure OpenAI endpoint URL.
- api_version (str, required): The Azure OpenAI API version (e.g., "2024-02-01").
- model_name (str, required): The Azure OpenAI deployment name for responses.

Returns:

- ResponsesUDFBuilder: A builder instance configured for Azure OpenAI responses.

Source code in src/openaivec/spark.py
@classmethod
def of_azure_openai(cls, api_key: str, endpoint: str, api_version: str, model_name: str) -> "ResponsesUDFBuilder":
    """Creates a builder configured for Azure OpenAI.

    Args:
        api_key (str): The Azure OpenAI API key.
        endpoint (str): The Azure OpenAI endpoint URL.
        api_version (str): The Azure OpenAI API version (e.g., "2024-02-01").
        model_name (str): The Azure OpenAI deployment name for responses.

    Returns:
        ResponsesUDFBuilder: A builder instance configured for Azure OpenAI responses.
    """
    return cls(api_key=api_key, endpoint=endpoint, api_version=api_version, model_name=model_name)

build

build(
    instructions: str,
    response_format: Type[T] = str,
    batch_size: int = 128,
    temperature: float = 0.0,
    top_p: float = 1.0,
    max_concurrency: int = 8,
) -> UserDefinedFunction

Builds the asynchronous pandas UDF for generating responses.

Parameters:

- instructions (str, required): The system prompt or instructions for the model.
- response_format (Type[T], default str): The desired output format. Either str for plain text or a Pydantic BaseModel for structured JSON output.
- batch_size (int, default 128): Number of rows per async batch request passed to the underlying pandas_ext function.
- temperature (float, default 0.0): Sampling temperature (0.0 to 2.0).
- top_p (float, default 1.0): Nucleus sampling parameter.
- max_concurrency (int, default 8): Maximum number of concurrent requests.

Returns:

- UserDefinedFunction: A Spark pandas UDF configured to generate responses asynchronously. Output schema is StringType or a struct derived from response_format.

Raises:

- ValueError: If response_format is not str or a Pydantic BaseModel.
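
A minimal sketch of the two output modes (resp_builder comes from the Setup section; Answer is a hypothetical Pydantic model):

from pydantic import BaseModel

class Answer(BaseModel):
    answer: str
    reason: str

# Plain text: returns a StringType column
plain_udf = resp_builder.build(instructions="Answer briefly.")

# Structured: returns a struct column derived from the Answer model
struct_udf = resp_builder.build(
    instructions="Answer and explain your reasoning.",
    response_format=Answer,
)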

Source code in src/openaivec/spark.py
def build(
    self,
    instructions: str,
    response_format: Type[T] = str,
    batch_size: int = 128,  # number of rows per async batch request
    temperature: float = 0.0,
    top_p: float = 1.0,
    max_concurrency: int = 8,
) -> UserDefinedFunction:
    """Builds the asynchronous pandas UDF for generating responses.

    Args:
        instructions (str): The system prompt or instructions for the model.
        response_format (Type[T]): The desired output format. Either `str` for plain text
            or a Pydantic `BaseModel` for structured JSON output. Defaults to `str`.
        batch_size (int): Number of rows per async batch request passed to the underlying
            `pandas_ext` function. Defaults to 128.
        temperature (float): Sampling temperature (0.0 to 2.0). Defaults to 0.0.
        top_p (float): Nucleus sampling parameter. Defaults to 1.0.
        max_concurrency (int): Maximum number of concurrent requests. Defaults to 8.

    Returns:
        UserDefinedFunction: A Spark pandas UDF configured to generate responses asynchronously.
            Output schema is `StringType` or a struct derived from `response_format`.

    Raises:
        ValueError: If `response_format` is not `str` or a Pydantic `BaseModel`.
    """
    if issubclass(response_format, BaseModel):
        spark_schema = _pydantic_to_spark_schema(response_format)
        json_schema_string = serialize_base_model(response_format)

        @pandas_udf(returnType=spark_schema)
        def structure_udf(col: Iterator[pd.Series]) -> Iterator[pd.DataFrame]:
            _initialize(self.api_key, self.endpoint, self.api_version)
            pandas_ext.responses_model(self.model_name)

            for part in col:
                predictions: pd.Series = asyncio.run(
                    part.aio.responses(
                        instructions=instructions,
                        response_format=deserialize_base_model(json_schema_string),
                        batch_size=batch_size,
                        temperature=temperature,
                        top_p=top_p,
                        max_concurrency=max_concurrency,
                    )
                )
                yield pd.DataFrame(predictions.map(_safe_dump).tolist())

        return structure_udf

    elif issubclass(response_format, str):

        @pandas_udf(returnType=StringType())
        def string_udf(col: Iterator[pd.Series]) -> Iterator[pd.Series]:
            _initialize(self.api_key, self.endpoint, self.api_version)
            pandas_ext.responses_model(self.model_name)

            for part in col:
                predictions: pd.Series = asyncio.run(
                    part.aio.responses(
                        instructions=instructions,
                        response_format=str,
                        batch_size=batch_size,
                        temperature=temperature,
                        top_p=top_p,
                        max_concurrency=max_concurrency,
                    )
                )
                yield predictions.map(_safe_cast_str)

        return string_udf

    else:
        raise ValueError(f"Unsupported response_format: {response_format}")

build_from_task

build_from_task(
    task: PreparedTask,
    batch_size: int = 128,
    max_concurrency: int = 8,
) -> UserDefinedFunction

Builds the asynchronous pandas UDF from a predefined task.

This method allows users to create UDFs from predefined tasks such as sentiment analysis, translation, or other common NLP operations defined in the openaivec.task module.

Parameters:

- task (PreparedTask, required): A predefined task configuration containing instructions, response format, temperature, and top_p settings.
- batch_size (int, default 128): Number of rows per async batch request passed to the underlying pandas_ext function.
- max_concurrency (int, default 8): Maximum number of concurrent requests.

Returns:

- UserDefinedFunction: A Spark pandas UDF configured to execute the specified task asynchronously, returning a struct derived from the task's response format.

Example
from openaivec.task import nlp

builder = ResponsesUDFBuilder.of_openai(
    api_key="your-api-key",
    model_name="gpt-4o-mini"
)

sentiment_udf = builder.build_from_task(nlp.SENTIMENT_ANALYSIS)

spark.udf.register("analyze_sentiment", sentiment_udf)
Source code in src/openaivec/spark.py
def build_from_task(
    self,
    task: PreparedTask,
    batch_size: int = 128,
    max_concurrency: int = 8,
) -> UserDefinedFunction:
    """Builds the asynchronous pandas UDF from a predefined task.

    This method allows users to create UDFs from predefined tasks such as sentiment analysis,
    translation, or other common NLP operations defined in the openaivec.task module.

    Args:
        task (PreparedTask): A predefined task configuration containing instructions,
            response format, temperature, and top_p settings.
        batch_size (int): Number of rows per async batch request passed to the underlying
            `pandas_ext` function. Defaults to 128.
        max_concurrency (int): Maximum number of concurrent requests. Defaults to 8.

    Returns:
        UserDefinedFunction: A Spark pandas UDF configured to execute the specified task
            asynchronously, returning a struct derived from the task's response format.

    Example:
        ```python
        from openaivec.task import nlp

        builder = ResponsesUDFBuilder.of_openai(
            api_key="your-api-key",
            model_name="gpt-4o-mini"
        )

        sentiment_udf = builder.build_from_task(nlp.SENTIMENT_ANALYSIS)

        spark.udf.register("analyze_sentiment", sentiment_udf)
        ```
    """
    # Serialize task parameters for Spark serialization compatibility
    task_instructions = task.instructions
    task_response_format_json = serialize_base_model(task.response_format)
    task_temperature = task.temperature
    task_top_p = task.top_p

    # Deserialize the response format from JSON
    response_format = deserialize_base_model(task_response_format_json)
    spark_schema = _pydantic_to_spark_schema(response_format)

    @pandas_udf(returnType=spark_schema)
    def task_udf(col: Iterator[pd.Series]) -> Iterator[pd.DataFrame]:
        _initialize(self.api_key, self.endpoint, self.api_version)
        pandas_ext.responses_model(self.model_name)

        for part in col:
            predictions: pd.Series = asyncio.run(
                part.aio.responses(
                    instructions=task_instructions,
                    response_format=response_format,
                    batch_size=batch_size,
                    temperature=task_temperature,
                    top_p=task_top_p,
                    max_concurrency=max_concurrency,
                )
            )
            yield pd.DataFrame(predictions.map(_safe_dump).tolist())

    return task_udf

EmbeddingsUDFBuilder dataclass

Builder for asynchronous Spark pandas UDFs for creating embeddings.

Configures and builds UDFs that leverage pandas_ext.aio.embeddings to generate vector embeddings from OpenAI models asynchronously. An instance stores authentication parameters and the model name.

Attributes:

- api_key (str): OpenAI or Azure API key.
- endpoint (Optional[str]): Azure endpoint base URL. None for public OpenAI.
- api_version (Optional[str]): Azure API version. Ignored for public OpenAI.
- model_name (str): Deployment name (Azure) or model name (OpenAI) for embeddings.
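
A minimal sketch of registering the embeddings UDF and inspecting the vector length in SQL (emb_builder comes from the Setup section; size() is Spark's built-in array function):

spark.udf.register("embed_async", emb_builder.build(batch_size=64, max_concurrency=4))

spark.sql(
    "SELECT text, size(embed_async(text)) AS embedding_dim FROM your_table"
).show()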

Source code in src/openaivec/spark.py
@dataclass(frozen=True)
class EmbeddingsUDFBuilder:
    """Builder for asynchronous Spark pandas UDFs for creating embeddings.

    Configures and builds UDFs that leverage `pandas_ext.aio.embeddings`
    to generate vector embeddings from OpenAI models asynchronously.
    An instance stores authentication parameters and the model name.

    Attributes:
        api_key (str): OpenAI or Azure API key.
        endpoint (Optional[str]): Azure endpoint base URL. None for public OpenAI.
        api_version (Optional[str]): Azure API version. Ignored for public OpenAI.
        model_name (str): Deployment name (Azure) or model name (OpenAI) for embeddings.
    """

    # Params for OpenAI SDK
    api_key: str
    endpoint: str | None
    api_version: str | None

    # Params for Embeddings API
    model_name: str

    @classmethod
    def of_openai(cls, api_key: str, model_name: str) -> "EmbeddingsUDFBuilder":
        """Creates a builder configured for the public OpenAI API.

        Args:
            api_key (str): The OpenAI API key.
            model_name (str): The OpenAI model name for embeddings (e.g., "text-embedding-3-small").

        Returns:
            EmbeddingsUDFBuilder: A builder instance configured for OpenAI embeddings.
        """
        return cls(api_key=api_key, endpoint=None, api_version=None, model_name=model_name)

    @classmethod
    def of_azure_openai(cls, api_key: str, endpoint: str, api_version: str, model_name: str) -> "EmbeddingsUDFBuilder":
        """Creates a builder configured for Azure OpenAI.

        Args:
            api_key (str): The Azure OpenAI API key.
            endpoint (str): The Azure OpenAI endpoint URL.
            api_version (str): The Azure OpenAI API version (e.g., "2024-02-01").
            model_name (str): The Azure OpenAI deployment name for embeddings.

        Returns:
            EmbeddingsUDFBuilder: A builder instance configured for Azure OpenAI embeddings.
        """
        return cls(api_key=api_key, endpoint=endpoint, api_version=api_version, model_name=model_name)

    def build(self, batch_size: int = 128, max_concurrency: int = 8) -> UserDefinedFunction:
        """Builds the asynchronous pandas UDF for generating embeddings.

        Args:
            batch_size (int): Number of rows per async batch request passed to the underlying
                `pandas_ext` function. Defaults to 128.
            max_concurrency (int): Maximum number of concurrent requests. Defaults to 8.

        Returns:
            UserDefinedFunction: A Spark pandas UDF configured to generate embeddings asynchronously,
                returning an `ArrayType(FloatType())` column.
        """

        @pandas_udf(returnType=ArrayType(FloatType()))
        def embeddings_udf(col: Iterator[pd.Series]) -> Iterator[pd.Series]:
            _initialize(self.api_key, self.endpoint, self.api_version)
            pandas_ext.embeddings_model(self.model_name)

            for part in col:
                embeddings: pd.Series = asyncio.run(
                    part.aio.embeddings(batch_size=batch_size, max_concurrency=max_concurrency)
                )
                yield embeddings.map(lambda x: x.tolist())

        return embeddings_udf

of_openai classmethod

of_openai(
    api_key: str, model_name: str
) -> EmbeddingsUDFBuilder

Creates a builder configured for the public OpenAI API.

Parameters:

Name Type Description Default
api_key str

The OpenAI API key.

required
model_name str

The OpenAI model name for embeddings (e.g., "text-embedding-3-small").

required

Returns:

Name Type Description
EmbeddingsUDFBuilder EmbeddingsUDFBuilder

A builder instance configured for OpenAI embeddings.

Source code in src/openaivec/spark.py
@classmethod
def of_openai(cls, api_key: str, model_name: str) -> "EmbeddingsUDFBuilder":
    """Creates a builder configured for the public OpenAI API.

    Args:
        api_key (str): The OpenAI API key.
        model_name (str): The OpenAI model name for embeddings (e.g., "text-embedding-3-small").

    Returns:
        EmbeddingsUDFBuilder: A builder instance configured for OpenAI embeddings.
    """
    return cls(api_key=api_key, endpoint=None, api_version=None, model_name=model_name)

of_azure_openai classmethod

of_azure_openai(
    api_key: str,
    endpoint: str,
    api_version: str,
    model_name: str,
) -> EmbeddingsUDFBuilder

Creates a builder configured for Azure OpenAI.

Parameters:

Name Type Description Default
api_key str

The Azure OpenAI API key.

required
endpoint str

The Azure OpenAI endpoint URL.

required
api_version str

The Azure OpenAI API version (e.g., "2024-02-01").

required
model_name str

The Azure OpenAI deployment name for embeddings.

required

Returns:

Name Type Description
EmbeddingsUDFBuilder EmbeddingsUDFBuilder

A builder instance configured for Azure OpenAI embeddings.

Source code in src/openaivec/spark.py
@classmethod
def of_azure_openai(cls, api_key: str, endpoint: str, api_version: str, model_name: str) -> "EmbeddingsUDFBuilder":
    """Creates a builder configured for Azure OpenAI.

    Args:
        api_key (str): The Azure OpenAI API key.
        endpoint (str): The Azure OpenAI endpoint URL.
        api_version (str): The Azure OpenAI API version (e.g., "2024-02-01").
        model_name (str): The Azure OpenAI deployment name for embeddings.

    Returns:
        EmbeddingsUDFBuilder: A builder instance configured for Azure OpenAI embeddings.
    """
    return cls(api_key=api_key, endpoint=endpoint, api_version=api_version, model_name=model_name)

build

build(
    batch_size: int = 128, max_concurrency: int = 8
) -> UserDefinedFunction

Builds the asynchronous pandas UDF for generating embeddings.

Parameters:

- batch_size (int, default 128): Number of rows per async batch request passed to the underlying pandas_ext function.
- max_concurrency (int, default 8): Maximum number of concurrent requests.

Returns:

- UserDefinedFunction: A Spark pandas UDF configured to generate embeddings asynchronously, returning an ArrayType(FloatType()) column.

Source code in src/openaivec/spark.py
def build(self, batch_size: int = 128, max_concurrency: int = 8) -> UserDefinedFunction:
    """Builds the asynchronous pandas UDF for generating embeddings.

    Args:
        batch_size (int): Number of rows per async batch request passed to the underlying
            `pandas_ext` function. Defaults to 128.
        max_concurrency (int): Maximum number of concurrent requests. Defaults to 8.

    Returns:
        UserDefinedFunction: A Spark pandas UDF configured to generate embeddings asynchronously,
            returning an `ArrayType(FloatType())` column.
    """

    @pandas_udf(returnType=ArrayType(FloatType()))
    def embeddings_udf(col: Iterator[pd.Series]) -> Iterator[pd.Series]:
        _initialize(self.api_key, self.endpoint, self.api_version)
        pandas_ext.embeddings_model(self.model_name)

        for part in col:
            embeddings: pd.Series = asyncio.run(
                part.aio.embeddings(batch_size=batch_size, max_concurrency=max_concurrency)
            )
            yield embeddings.map(lambda x: x.tolist())

    return embeddings_udf

split_to_chunks_udf

split_to_chunks_udf(
    model_name: str, max_tokens: int, sep: List[str]
) -> UserDefinedFunction

Create a pandas UDF that splits text into token-bounded chunks.

Parameters:

- model_name (str, required): Model identifier passed to tiktoken.
- max_tokens (int, required): Maximum tokens allowed per chunk.
- sep (List[str], required): Ordered list of separator strings used by TextChunker.

Returns:

- UserDefinedFunction: A pandas UDF producing an ArrayType(StringType()) column whose values are lists of chunks respecting the max_tokens limit.
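
A minimal sketch of registering and applying the chunking UDF (the separator list and the 512-token limit are illustrative choices):

from openaivec.spark import split_to_chunks_udf

spark.udf.register(
    "split_chunks",
    split_to_chunks_udf(model_name="gpt-4o", max_tokens=512, sep=["\n\n", "\n", " "]),
)

spark.sql("SELECT text, split_chunks(text) AS chunks FROM your_table").show(truncate=False)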

Source code in src/openaivec/spark.py
def split_to_chunks_udf(model_name: str, max_tokens: int, sep: List[str]) -> UserDefinedFunction:
    """Create a pandas‑UDF that splits text into token‑bounded chunks.

    Args:
        model_name: Model identifier passed to *tiktoken*.
        max_tokens: Maximum tokens allowed per chunk.
        sep: Ordered list of separator strings used by ``TextChunker``.

    Returns:
        A pandas UDF producing an ``ArrayType(StringType())`` column whose
            values are lists of chunks respecting the ``max_tokens`` limit.
    """

    @pandas_udf(ArrayType(StringType()))
    def fn(col: Iterator[pd.Series]) -> Iterator[pd.Series]:
        global _TIKTOKEN_ENC
        if _TIKTOKEN_ENC is None:
            _TIKTOKEN_ENC = tiktoken.encoding_for_model(model_name)

        chunker = TextChunker(_TIKTOKEN_ENC)

        for part in col:
            yield part.map(lambda x: chunker.split(x, max_tokens=max_tokens, sep=sep) if isinstance(x, str) else [])

    return fn

count_tokens_udf

count_tokens_udf(
    model_name: str = "gpt-4o",
) -> UserDefinedFunction

Create a pandas UDF that counts tokens for every string cell.

The UDF uses tiktoken to approximate tokenisation and caches the resulting Encoding object per executor.

Parameters:

- model_name (str, default "gpt-4o"): Model identifier understood by tiktoken.

Returns:

- UserDefinedFunction: A pandas UDF producing an IntegerType column with token counts.
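
A minimal sketch, e.g. for estimating prompt sizes before invoking the responses UDFs:

from openaivec.spark import count_tokens_udf

spark.udf.register("num_tokens", count_tokens_udf("gpt-4o"))

spark.sql("SELECT text, num_tokens(text) AS n_tokens FROM your_table").show()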

Source code in src/openaivec/spark.py
def count_tokens_udf(model_name: str = "gpt-4o") -> UserDefinedFunction:
    """Create a pandas‑UDF that counts tokens for every string cell.

    The UDF uses *tiktoken* to approximate tokenisation and caches the
    resulting ``Encoding`` object per executor.

    Args:
        model_name: Model identifier understood by ``tiktoken``.

    Returns:
        A pandas UDF producing an ``IntegerType`` column with token counts.
    """

    @pandas_udf(IntegerType())
    def fn(col: Iterator[pd.Series]) -> Iterator[pd.Series]:
        global _TIKTOKEN_ENC
        if _TIKTOKEN_ENC is None:
            _TIKTOKEN_ENC = tiktoken.encoding_for_model(model_name)

        for part in col:
            yield part.map(lambda x: len(_TIKTOKEN_ENC.encode(x)) if isinstance(x, str) else 0)

    return fn