Spark Extension

openaivec.spark

Asynchronous Spark UDFs for the OpenAI and Azure OpenAI APIs.

This module provides builder classes (ResponsesUDFBuilder, EmbeddingsUDFBuilder) for creating asynchronous Spark UDFs that communicate with either the public OpenAI API or Azure OpenAI. It supports UDFs for generating responses and for creating embeddings. The UDFs operate on Spark DataFrames and use asyncio, which can improve throughput for these I/O-bound operations.

Setup

First, obtain a Spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
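
To follow along with the SQL examples later on this page, you can register a small sample table first (a minimal sketch; the rows and the your_table name are placeholders matching the query below):

# Hypothetical sample data backing the SQL examples below
df = spark.createDataFrame(
    [("I love this product!",), ("The weather is terrible today.",)],
    ["text"],
)
df.createOrReplaceTempView("your_table")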

Next, instantiate UDF builders with your OpenAI API key (or Azure credentials) and model/deployment names, then register the desired UDFs:

import os
from openaivec.spark import ResponsesUDFBuilder, EmbeddingsUDFBuilder
from pydantic import BaseModel

# Option 1: Using OpenAI
resp_builder = ResponsesUDFBuilder.of_openai(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="gpt-4o-mini", # Model for responses
)
emb_builder = EmbeddingsUDFBuilder.of_openai(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-3-small", # Model for embeddings
)

# Option 2: Using Azure OpenAI
# resp_builder = ResponsesUDFBuilder.of_azure_openai(
#     api_key=os.getenv("AZURE_OPENAI_KEY"),
#     endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
#     api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
#     model_name="your-resp-deployment-name", # Deployment for responses
# )
# emb_builder = EmbeddingsUDFBuilder.of_azure_openai(
#     api_key=os.getenv("AZURE_OPENAI_KEY"),
#     endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
#     api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
#     model_name="your-emb-deployment-name", # Deployment for embeddings
# )

# Define a Pydantic model for structured responses (optional)
class Translation(BaseModel):
    en: str
    fr: str
    # ... other languages

# Register the asynchronous responses UDF
spark.udf.register(
    "translate_async",
    resp_builder.build(
        instructions="Translate the text to multiple languages.",
        response_format=Translation,
    ),
)

# Or use a predefined task with the build_from_task method
from openaivec.task import nlp
spark.udf.register(
    "sentiment_async",
    resp_builder.build_from_task(nlp.SENTIMENT_ANALYSIS),
)

# Register the asynchronous embeddings UDF
spark.udf.register(
    "embed_async",
    emb_builder.build(),
)

You can now invoke the UDFs from Spark SQL:

SELECT
    text,
    translate_async(text) AS translation,
    sentiment_async(text) AS sentiment,
    embed_async(text) AS embedding
FROM your_table;
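
The registered UDFs can also be invoked through the DataFrame API via F.expr; a minimal sketch, assuming the your_table view from the setup above:

import pyspark.sql.functions as F

result = (
    spark.table("your_table")
    .withColumn("translation", F.expr("translate_async(text)"))
    .withColumn("sentiment", F.expr("sentiment_async(text)"))
    .withColumn("embedding", F.expr("embed_async(text)"))
)
result.show(truncate=False)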

Note: These UDFs delegate to the asynchronous pandas extensions in openaivec.pandas_ext (pandas_ext.aio.responses and pandas_ext.aio.embeddings).

ResponsesUDFBuilder dataclass

Builder for asynchronous Spark pandas UDFs for generating responses.

Configures and builds UDFs that leverage pandas_ext.aio.responses to generate text or structured responses from OpenAI models asynchronously. An instance stores authentication parameters and the model name.

This builder supports two main methods, as sketched below:

- build(): creates UDFs with custom instructions and response formats
- build_from_task(): creates UDFs from predefined tasks (e.g., sentiment analysis)
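
Both methods return a pyspark.sql.UserDefinedFunction ready for registration; a minimal sketch, reusing the resp_builder instance from the Setup section:

from openaivec.task import nlp

# Custom instructions with plain-text output (response_format defaults to str)
custom_udf = resp_builder.build(instructions="Summarize the text in one sentence.")

# Predefined task with a structured (struct-typed) output schema
task_udf = resp_builder.build_from_task(nlp.SENTIMENT_ANALYSIS)

spark.udf.register("summarize_async", custom_udf)
spark.udf.register("sentiment_async", task_udf)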

Attributes:

- api_key (str): OpenAI or Azure API key.
- endpoint (Optional[str]): Azure endpoint base URL. None for public OpenAI.
- api_version (Optional[str]): Azure API version. Ignored for public OpenAI.
- model_name (str): Deployment name (Azure) or model name (OpenAI) for responses.

Source code in src/openaivec/spark.py
@dataclass(frozen=True)
class ResponsesUDFBuilder:
    """Builder for asynchronous Spark pandas UDFs for generating responses.

    Configures and builds UDFs that leverage `pandas_ext.aio.responses`
    to generate text or structured responses from OpenAI models asynchronously.
    An instance stores authentication parameters and the model name.

    This builder supports two main methods:
    - `build()`: Creates UDFs with custom instructions and response formats
    - `build_from_task()`: Creates UDFs from predefined tasks (e.g., sentiment analysis)

    Attributes:
        api_key (str): OpenAI or Azure API key.
        endpoint (Optional[str]): Azure endpoint base URL. None for public OpenAI.
        api_version (Optional[str]): Azure API version. Ignored for public OpenAI.
        model_name (str): Deployment name (Azure) or model name (OpenAI) for responses.
    """

    # Params for OpenAI SDK
    api_key: str
    endpoint: str | None
    api_version: str | None

    # Params for Responses API
    model_name: str

    @classmethod
    def of_openai(cls, api_key: str, model_name: str) -> "ResponsesUDFBuilder":
        """Creates a builder configured for the public OpenAI API.

        Args:
            api_key (str): The OpenAI API key.
            model_name (str): The OpenAI model name for responses (e.g., "gpt-4o-mini").

        Returns:
            ResponsesUDFBuilder: A builder instance configured for OpenAI responses.
        """
        return cls(api_key=api_key, endpoint=None, api_version=None, model_name=model_name)

    @classmethod
    def of_azure_openai(cls, api_key: str, endpoint: str, api_version: str, model_name: str) -> "ResponsesUDFBuilder":
        """Creates a builder configured for Azure OpenAI.

        Args:
            api_key (str): The Azure OpenAI API key.
            endpoint (str): The Azure OpenAI endpoint URL.
            api_version (str): The Azure OpenAI API version (e.g., "2024-02-01").
            model_name (str): The Azure OpenAI deployment name for responses.

        Returns:
            ResponsesUDFBuilder: A builder instance configured for Azure OpenAI responses.
        """
        return cls(api_key=api_key, endpoint=endpoint, api_version=api_version, model_name=model_name)

    def build(
        self,
        instructions: str,
        response_format: Type[T] = str,
        batch_size: int = 128,  # number of rows per async batch request
        temperature: float = 0.0,
        top_p: float = 1.0,
        max_concurrency: int = 8,
    ) -> UserDefinedFunction:
        """Builds the asynchronous pandas UDF for generating responses.

        Args:
            instructions (str): The system prompt or instructions for the model.
            response_format (Type[T]): The desired output format. Either `str` for plain text
                or a Pydantic `BaseModel` for structured JSON output. Defaults to `str`.
            batch_size (int): Number of rows per async batch request passed to the underlying
                `pandas_ext` function. Defaults to 128.
            temperature (float): Sampling temperature (0.0 to 2.0). Defaults to 0.0.
            top_p (float): Nucleus sampling parameter. Defaults to 1.0.
            max_concurrency (int): Maximum number of concurrent requests. Defaults to 8.

        Returns:
            UserDefinedFunction: A Spark pandas UDF configured to generate responses asynchronously.
                Output schema is `StringType` or a struct derived from `response_format`.

        Raises:
            ValueError: If `response_format` is not `str` or a Pydantic `BaseModel`.
        """
        if issubclass(response_format, BaseModel):
            spark_schema = _pydantic_to_spark_schema(response_format)
            json_schema_string = serialize_base_model(response_format)

            @pandas_udf(returnType=spark_schema)
            def structure_udf(col: Iterator[pd.Series]) -> Iterator[pd.DataFrame]:
                _initialize(self.api_key, self.endpoint, self.api_version)
                pandas_ext.responses_model(self.model_name)

                for part in col:
                    predictions: pd.Series = asyncio.run(
                        part.aio.responses(
                            instructions=instructions,
                            response_format=deserialize_base_model(json_schema_string),
                            batch_size=batch_size,
                            temperature=temperature,
                            top_p=top_p,
                            max_concurrency=max_concurrency,
                        )
                    )
                    yield pd.DataFrame(predictions.map(_safe_dump).tolist())

            return structure_udf

        elif issubclass(response_format, str):

            @pandas_udf(returnType=StringType())
            def string_udf(col: Iterator[pd.Series]) -> Iterator[pd.Series]:
                _initialize(self.api_key, self.endpoint, self.api_version)
                pandas_ext.responses_model(self.model_name)

                for part in col:
                    predictions: pd.Series = asyncio.run(
                        part.aio.responses(
                            instructions=instructions,
                            response_format=str,
                            batch_size=batch_size,
                            temperature=temperature,
                            top_p=top_p,
                            max_concurrency=max_concurrency,
                        )
                    )
                    yield predictions.map(_safe_cast_str)

            return string_udf

        else:
            raise ValueError(f"Unsupported response_format: {response_format}")

    def build_from_task(
        self,
        task: PreparedTask,
        batch_size: int = 128,
        max_concurrency: int = 8,
    ) -> UserDefinedFunction:
        """Builds the asynchronous pandas UDF from a predefined task.

        This method allows users to create UDFs from predefined tasks such as sentiment analysis,
        translation, or other common NLP operations defined in the openaivec.task module.

        Args:
            task (PreparedTask): A predefined task configuration containing instructions,
                response format, temperature, and top_p settings.
            batch_size (int): Number of rows per async batch request passed to the underlying
                `pandas_ext` function. Defaults to 128.
            max_concurrency (int): Maximum number of concurrent requests. Defaults to 8.

        Returns:
            UserDefinedFunction: A Spark pandas UDF configured to execute the specified task
                asynchronously, returning a struct derived from the task's response format.

        Example:
            ```python
            from openaivec.task import nlp

            builder = ResponsesUDFBuilder.of_openai(
                api_key="your-api-key",
                model_name="gpt-4o-mini"
            )

            sentiment_udf = builder.build_from_task(nlp.SENTIMENT_ANALYSIS)

            spark.udf.register("analyze_sentiment", sentiment_udf)
            ```
        """
        # Serialize task parameters for Spark serialization compatibility
        task_instructions = task.instructions
        task_response_format_json = serialize_base_model(task.response_format)
        task_temperature = task.temperature
        task_top_p = task.top_p

        # Deserialize the response format from JSON
        response_format = deserialize_base_model(task_response_format_json)
        spark_schema = _pydantic_to_spark_schema(response_format)

        @pandas_udf(returnType=spark_schema)
        def task_udf(col: Iterator[pd.Series]) -> Iterator[pd.DataFrame]:
            _initialize(self.api_key, self.endpoint, self.api_version)
            pandas_ext.responses_model(self.model_name)

            for part in col:
                predictions: pd.Series = asyncio.run(
                    part.aio.responses(
                        instructions=task_instructions,
                        response_format=response_format,
                        batch_size=batch_size,
                        temperature=task_temperature,
                        top_p=task_top_p,
                        max_concurrency=max_concurrency,
                    )
                )
                yield pd.DataFrame(predictions.map(_safe_dump).tolist())

        return task_udf

of_openai classmethod

of_openai(
    api_key: str, model_name: str
) -> ResponsesUDFBuilder

Creates a builder configured for the public OpenAI API.

Parameters:

- api_key (str, required): The OpenAI API key.
- model_name (str, required): The OpenAI model name for responses (e.g., "gpt-4o-mini").

Returns:

- ResponsesUDFBuilder: A builder instance configured for OpenAI responses.

Source code in src/openaivec/spark.py
@classmethod
def of_openai(cls, api_key: str, model_name: str) -> "ResponsesUDFBuilder":
    """Creates a builder configured for the public OpenAI API.

    Args:
        api_key (str): The OpenAI API key.
        model_name (str): The OpenAI model name for responses (e.g., "gpt-4o-mini").

    Returns:
        ResponsesUDFBuilder: A builder instance configured for OpenAI responses.
    """
    return cls(api_key=api_key, endpoint=None, api_version=None, model_name=model_name)

of_azure_openai classmethod

of_azure_openai(
    api_key: str,
    endpoint: str,
    api_version: str,
    model_name: str,
) -> ResponsesUDFBuilder

Creates a builder configured for Azure OpenAI.

Parameters:

- api_key (str, required): The Azure OpenAI API key.
- endpoint (str, required): The Azure OpenAI endpoint URL.
- api_version (str, required): The Azure OpenAI API version (e.g., "2024-02-01").
- model_name (str, required): The Azure OpenAI deployment name for responses.

Returns:

- ResponsesUDFBuilder: A builder instance configured for Azure OpenAI responses.

Source code in src/openaivec/spark.py
@classmethod
def of_azure_openai(cls, api_key: str, endpoint: str, api_version: str, model_name: str) -> "ResponsesUDFBuilder":
    """Creates a builder configured for Azure OpenAI.

    Args:
        api_key (str): The Azure OpenAI API key.
        endpoint (str): The Azure OpenAI endpoint URL.
        api_version (str): The Azure OpenAI API version (e.g., "2024-02-01").
        model_name (str): The Azure OpenAI deployment name for responses.

    Returns:
        ResponsesUDFBuilder: A builder instance configured for Azure OpenAI responses.
    """
    return cls(api_key=api_key, endpoint=endpoint, api_version=api_version, model_name=model_name)

build

build(
    instructions: str,
    response_format: Type[T] = str,
    batch_size: int = 128,
    temperature: float = 0.0,
    top_p: float = 1.0,
    max_concurrency: int = 8,
) -> UserDefinedFunction

Builds the asynchronous pandas UDF for generating responses.

Parameters:

- instructions (str, required): The system prompt or instructions for the model.
- response_format (Type[T], default str): The desired output format. Either str for plain text or a Pydantic BaseModel for structured JSON output.
- batch_size (int, default 128): Number of rows per async batch request passed to the underlying pandas_ext function.
- temperature (float, default 0.0): Sampling temperature (0.0 to 2.0).
- top_p (float, default 1.0): Nucleus sampling parameter.
- max_concurrency (int, default 8): Maximum number of concurrent requests.

Returns:

- UserDefinedFunction: A Spark pandas UDF configured to generate responses asynchronously. Output schema is StringType or a struct derived from response_format.

Raises:

- ValueError: If response_format is not str or a Pydantic BaseModel.
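
A minimal sketch of the two output modes (resp_builder comes from the Setup section; Answer is a hypothetical Pydantic model):

from pydantic import BaseModel

class Answer(BaseModel):
    answer: str
    reason: str

# Plain text: returns a StringType column
plain_udf = resp_builder.build(instructions="Answer briefly.")

# Structured: returns a struct column derived from the Answer model
struct_udf = resp_builder.build(
    instructions="Answer and explain your reasoning.",
    response_format=Answer,
)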

Source code in src/openaivec/spark.py
def build(
    self,
    instructions: str,
    response_format: Type[T] = str,
    batch_size: int = 128,  # number of rows per async batch request
    temperature: float = 0.0,
    top_p: float = 1.0,
    max_concurrency: int = 8,
) -> UserDefinedFunction:
    """Builds the asynchronous pandas UDF for generating responses.

    Args:
        instructions (str): The system prompt or instructions for the model.
        response_format (Type[T]): The desired output format. Either `str` for plain text
            or a Pydantic `BaseModel` for structured JSON output. Defaults to `str`.
        batch_size (int): Number of rows per async batch request passed to the underlying
            `pandas_ext` function. Defaults to 128.
        temperature (float): Sampling temperature (0.0 to 2.0). Defaults to 0.0.
        top_p (float): Nucleus sampling parameter. Defaults to 1.0.
        max_concurrency (int): Maximum number of concurrent requests. Defaults to 8.

    Returns:
        UserDefinedFunction: A Spark pandas UDF configured to generate responses asynchronously.
            Output schema is `StringType` or a struct derived from `response_format`.

    Raises:
        ValueError: If `response_format` is not `str` or a Pydantic `BaseModel`.
    """
    if issubclass(response_format, BaseModel):
        spark_schema = _pydantic_to_spark_schema(response_format)
        json_schema_string = serialize_base_model(response_format)

        @pandas_udf(returnType=spark_schema)
        def structure_udf(col: Iterator[pd.Series]) -> Iterator[pd.DataFrame]:
            _initialize(self.api_key, self.endpoint, self.api_version)
            pandas_ext.responses_model(self.model_name)

            for part in col:
                predictions: pd.Series = asyncio.run(
                    part.aio.responses(
                        instructions=instructions,
                        response_format=deserialize_base_model(json_schema_string),
                        batch_size=batch_size,
                        temperature=temperature,
                        top_p=top_p,
                        max_concurrency=max_concurrency,
                    )
                )
                yield pd.DataFrame(predictions.map(_safe_dump).tolist())

        return structure_udf

    elif issubclass(response_format, str):

        @pandas_udf(returnType=StringType())
        def string_udf(col: Iterator[pd.Series]) -> Iterator[pd.Series]:
            _initialize(self.api_key, self.endpoint, self.api_version)
            pandas_ext.responses_model(self.model_name)

            for part in col:
                predictions: pd.Series = asyncio.run(
                    part.aio.responses(
                        instructions=instructions,
                        response_format=str,
                        batch_size=batch_size,
                        temperature=temperature,
                        top_p=top_p,
                        max_concurrency=max_concurrency,
                    )
                )
                yield predictions.map(_safe_cast_str)

        return string_udf

    else:
        raise ValueError(f"Unsupported response_format: {response_format}")

build_from_task

build_from_task(
    task: PreparedTask,
    batch_size: int = 128,
    max_concurrency: int = 8,
) -> UserDefinedFunction

Builds the asynchronous pandas UDF from a predefined task.

This method allows users to create UDFs from predefined tasks such as sentiment analysis, translation, or other common NLP operations defined in the openaivec.task module.

Parameters:

- task (PreparedTask, required): A predefined task configuration containing instructions, response format, temperature, and top_p settings.
- batch_size (int, default 128): Number of rows per async batch request passed to the underlying pandas_ext function.
- max_concurrency (int, default 8): Maximum number of concurrent requests.

Returns:

- UserDefinedFunction: A Spark pandas UDF configured to execute the specified task asynchronously, returning a struct derived from the task's response format.

Example
from openaivec.task import nlp

builder = ResponsesUDFBuilder.of_openai(
    api_key="your-api-key",
    model_name="gpt-4o-mini"
)

sentiment_udf = builder.build_from_task(nlp.SENTIMENT_ANALYSIS)

spark.udf.register("analyze_sentiment", sentiment_udf)
Source code in src/openaivec/spark.py
def build_from_task(
    self,
    task: PreparedTask,
    batch_size: int = 128,
    max_concurrency: int = 8,
) -> UserDefinedFunction:
    """Builds the asynchronous pandas UDF from a predefined task.

    This method allows users to create UDFs from predefined tasks such as sentiment analysis,
    translation, or other common NLP operations defined in the openaivec.task module.

    Args:
        task (PreparedTask): A predefined task configuration containing instructions,
            response format, temperature, and top_p settings.
        batch_size (int): Number of rows per async batch request passed to the underlying
            `pandas_ext` function. Defaults to 128.
        max_concurrency (int): Maximum number of concurrent requests. Defaults to 8.

    Returns:
        UserDefinedFunction: A Spark pandas UDF configured to execute the specified task
            asynchronously, returning a struct derived from the task's response format.

    Example:
        ```python
        from openaivec.task import nlp

        builder = ResponsesUDFBuilder.of_openai(
            api_key="your-api-key",
            model_name="gpt-4o-mini"
        )

        sentiment_udf = builder.build_from_task(nlp.SENTIMENT_ANALYSIS)

        spark.udf.register("analyze_sentiment", sentiment_udf)
        ```
    """
    # Serialize task parameters for Spark serialization compatibility
    task_instructions = task.instructions
    task_response_format_json = serialize_base_model(task.response_format)
    task_temperature = task.temperature
    task_top_p = task.top_p

    # Deserialize the response format from JSON
    response_format = deserialize_base_model(task_response_format_json)
    spark_schema = _pydantic_to_spark_schema(response_format)

    @pandas_udf(returnType=spark_schema)
    def task_udf(col: Iterator[pd.Series]) -> Iterator[pd.DataFrame]:
        _initialize(self.api_key, self.endpoint, self.api_version)
        pandas_ext.responses_model(self.model_name)

        for part in col:
            predictions: pd.Series = asyncio.run(
                part.aio.responses(
                    instructions=task_instructions,
                    response_format=response_format,
                    batch_size=batch_size,
                    temperature=task_temperature,
                    top_p=task_top_p,
                    max_concurrency=max_concurrency,
                )
            )
            yield pd.DataFrame(predictions.map(_safe_dump).tolist())

    return task_udf

EmbeddingsUDFBuilder dataclass

Builder for asynchronous Spark pandas UDFs for creating embeddings.

Configures and builds UDFs that leverage pandas_ext.aio.embeddings to generate vector embeddings from OpenAI models asynchronously. An instance stores authentication parameters and the model name.

Attributes:

- api_key (str): OpenAI or Azure API key.
- endpoint (Optional[str]): Azure endpoint base URL. None for public OpenAI.
- api_version (Optional[str]): Azure API version. Ignored for public OpenAI.
- model_name (str): Deployment name (Azure) or model name (OpenAI) for embeddings.
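
A minimal sketch of registering the embeddings UDF and inspecting the vector length in SQL (emb_builder comes from the Setup section; size() is Spark's built-in array function):

spark.udf.register("embed_async", emb_builder.build(batch_size=64, max_concurrency=4))

spark.sql(
    "SELECT text, size(embed_async(text)) AS embedding_dim FROM your_table"
).show()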

Source code in src/openaivec/spark.py
@dataclass(frozen=True)
class EmbeddingsUDFBuilder:
    """Builder for asynchronous Spark pandas UDFs for creating embeddings.

    Configures and builds UDFs that leverage `pandas_ext.aio.embeddings`
    to generate vector embeddings from OpenAI models asynchronously.
    An instance stores authentication parameters and the model name.

    Attributes:
        api_key (str): OpenAI or Azure API key.
        endpoint (Optional[str]): Azure endpoint base URL. None for public OpenAI.
        api_version (Optional[str]): Azure API version. Ignored for public OpenAI.
        model_name (str): Deployment name (Azure) or model name (OpenAI) for embeddings.
    """

    # Params for OpenAI SDK
    api_key: str
    endpoint: str | None
    api_version: str | None

    # Params for Embeddings API
    model_name: str

    @classmethod
    def of_openai(cls, api_key: str, model_name: str) -> "EmbeddingsUDFBuilder":
        """Creates a builder configured for the public OpenAI API.

        Args:
            api_key (str): The OpenAI API key.
            model_name (str): The OpenAI model name for embeddings (e.g., "text-embedding-3-small").

        Returns:
            EmbeddingsUDFBuilder: A builder instance configured for OpenAI embeddings.
        """
        return cls(api_key=api_key, endpoint=None, api_version=None, model_name=model_name)

    @classmethod
    def of_azure_openai(cls, api_key: str, endpoint: str, api_version: str, model_name: str) -> "EmbeddingsUDFBuilder":
        """Creates a builder configured for Azure OpenAI.

        Args:
            api_key (str): The Azure OpenAI API key.
            endpoint (str): The Azure OpenAI endpoint URL.
            api_version (str): The Azure OpenAI API version (e.g., "2024-02-01").
            model_name (str): The Azure OpenAI deployment name for embeddings.

        Returns:
            EmbeddingsUDFBuilder: A builder instance configured for Azure OpenAI embeddings.
        """
        return cls(api_key=api_key, endpoint=endpoint, api_version=api_version, model_name=model_name)

    def build(self, batch_size: int = 128, max_concurrency: int = 8) -> UserDefinedFunction:
        """Builds the asynchronous pandas UDF for generating embeddings.

        Args:
            batch_size (int): Number of rows per async batch request passed to the underlying
                `pandas_ext` function. Defaults to 128.
            max_concurrency (int): Maximum number of concurrent requests. Defaults to 8.

        Returns:
            UserDefinedFunction: A Spark pandas UDF configured to generate embeddings asynchronously,
                returning an `ArrayType(FloatType())` column.
        """

        @pandas_udf(returnType=ArrayType(FloatType()))
        def embeddings_udf(col: Iterator[pd.Series]) -> Iterator[pd.Series]:
            _initialize(self.api_key, self.endpoint, self.api_version)
            pandas_ext.embeddings_model(self.model_name)

            for part in col:
                embeddings: pd.Series = asyncio.run(
                    part.aio.embeddings(batch_size=batch_size, max_concurrency=max_concurrency)
                )
                yield embeddings.map(lambda x: x.tolist())

        return embeddings_udf

of_openai classmethod

of_openai(
    api_key: str, model_name: str
) -> EmbeddingsUDFBuilder

Creates a builder configured for the public OpenAI API.

Parameters:

Name Type Description Default
api_key str

The OpenAI API key.

required
model_name str

The OpenAI model name for embeddings (e.g., "text-embedding-3-small").

required

Returns:

Name Type Description
EmbeddingsUDFBuilder EmbeddingsUDFBuilder

A builder instance configured for OpenAI embeddings.

Source code in src/openaivec/spark.py
@classmethod
def of_openai(cls, api_key: str, model_name: str) -> "EmbeddingsUDFBuilder":
    """Creates a builder configured for the public OpenAI API.

    Args:
        api_key (str): The OpenAI API key.
        model_name (str): The OpenAI model name for embeddings (e.g., "text-embedding-3-small").

    Returns:
        EmbeddingsUDFBuilder: A builder instance configured for OpenAI embeddings.
    """
    return cls(api_key=api_key, endpoint=None, api_version=None, model_name=model_name)

of_azure_openai classmethod

of_azure_openai(
    api_key: str,
    endpoint: str,
    api_version: str,
    model_name: str,
) -> EmbeddingsUDFBuilder

Creates a builder configured for Azure OpenAI.

Parameters:

Name Type Description Default
api_key str

The Azure OpenAI API key.

required
endpoint str

The Azure OpenAI endpoint URL.

required
api_version str

The Azure OpenAI API version (e.g., "2024-02-01").

required
model_name str

The Azure OpenAI deployment name for embeddings.

required

Returns:

Name Type Description
EmbeddingsUDFBuilder EmbeddingsUDFBuilder

A builder instance configured for Azure OpenAI embeddings.

Source code in src/openaivec/spark.py
@classmethod
def of_azure_openai(cls, api_key: str, endpoint: str, api_version: str, model_name: str) -> "EmbeddingsUDFBuilder":
    """Creates a builder configured for Azure OpenAI.

    Args:
        api_key (str): The Azure OpenAI API key.
        endpoint (str): The Azure OpenAI endpoint URL.
        api_version (str): The Azure OpenAI API version (e.g., "2024-02-01").
        model_name (str): The Azure OpenAI deployment name for embeddings.

    Returns:
        EmbeddingsUDFBuilder: A builder instance configured for Azure OpenAI embeddings.
    """
    return cls(api_key=api_key, endpoint=endpoint, api_version=api_version, model_name=model_name)

build

build(
    batch_size: int = 128, max_concurrency: int = 8
) -> UserDefinedFunction

Builds the asynchronous pandas UDF for generating embeddings.

Parameters:

- batch_size (int, default 128): Number of rows per async batch request passed to the underlying pandas_ext function.
- max_concurrency (int, default 8): Maximum number of concurrent requests.

Returns:

- UserDefinedFunction: A Spark pandas UDF configured to generate embeddings asynchronously, returning an ArrayType(FloatType()) column.

Source code in src/openaivec/spark.py
def build(self, batch_size: int = 128, max_concurrency: int = 8) -> UserDefinedFunction:
    """Builds the asynchronous pandas UDF for generating embeddings.

    Args:
        batch_size (int): Number of rows per async batch request passed to the underlying
            `pandas_ext` function. Defaults to 128.
        max_concurrency (int): Maximum number of concurrent requests. Defaults to 8.

    Returns:
        UserDefinedFunction: A Spark pandas UDF configured to generate embeddings asynchronously,
            returning an `ArrayType(FloatType())` column.
    """

    @pandas_udf(returnType=ArrayType(FloatType()))
    def embeddings_udf(col: Iterator[pd.Series]) -> Iterator[pd.Series]:
        _initialize(self.api_key, self.endpoint, self.api_version)
        pandas_ext.embeddings_model(self.model_name)

        for part in col:
            embeddings: pd.Series = asyncio.run(
                part.aio.embeddings(batch_size=batch_size, max_concurrency=max_concurrency)
            )
            yield embeddings.map(lambda x: x.tolist())

    return embeddings_udf

split_to_chunks_udf

split_to_chunks_udf(
    model_name: str, max_tokens: int, sep: List[str]
) -> UserDefinedFunction

Create a pandas UDF that splits text into token-bounded chunks.

Parameters:

- model_name (str, required): Model identifier passed to tiktoken.
- max_tokens (int, required): Maximum tokens allowed per chunk.
- sep (List[str], required): Ordered list of separator strings used by TextChunker.

Returns:

- UserDefinedFunction: A pandas UDF producing an ArrayType(StringType()) column whose values are lists of chunks respecting the max_tokens limit.
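
A minimal sketch of registering and applying the chunking UDF (the separator list and the 512-token limit are illustrative choices):

from openaivec.spark import split_to_chunks_udf

spark.udf.register(
    "split_chunks",
    split_to_chunks_udf(model_name="gpt-4o", max_tokens=512, sep=["\n\n", "\n", " "]),
)

spark.sql("SELECT text, split_chunks(text) AS chunks FROM your_table").show(truncate=False)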

Source code in src/openaivec/spark.py
def split_to_chunks_udf(model_name: str, max_tokens: int, sep: List[str]) -> UserDefinedFunction:
    """Create a pandas‑UDF that splits text into token‑bounded chunks.

    Args:
        model_name: Model identifier passed to *tiktoken*.
        max_tokens: Maximum tokens allowed per chunk.
        sep: Ordered list of separator strings used by ``TextChunker``.

    Returns:
        A pandas UDF producing an ``ArrayType(StringType())`` column whose
            values are lists of chunks respecting the ``max_tokens`` limit.
    """

    @pandas_udf(ArrayType(StringType()))
    def fn(col: Iterator[pd.Series]) -> Iterator[pd.Series]:
        global _TIKTOKEN_ENC
        if _TIKTOKEN_ENC is None:
            _TIKTOKEN_ENC = tiktoken.encoding_for_model(model_name)

        chunker = TextChunker(_TIKTOKEN_ENC)

        for part in col:
            yield part.map(lambda x: chunker.split(x, max_tokens=max_tokens, sep=sep) if isinstance(x, str) else [])

    return fn

count_tokens_udf

count_tokens_udf(
    model_name: str = "gpt-4o",
) -> UserDefinedFunction

Create a pandas UDF that counts tokens for every string cell.

The UDF uses tiktoken to approximate tokenisation and caches the resulting Encoding object per executor.

Parameters:

- model_name (str, default "gpt-4o"): Model identifier understood by tiktoken.

Returns:

- UserDefinedFunction: A pandas UDF producing an IntegerType column with token counts.
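
A minimal sketch, e.g. for estimating prompt sizes before invoking the responses UDFs:

from openaivec.spark import count_tokens_udf

spark.udf.register("num_tokens", count_tokens_udf("gpt-4o"))

spark.sql("SELECT text, num_tokens(text) AS n_tokens FROM your_table").show()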

Source code in src/openaivec/spark.py
def count_tokens_udf(model_name: str = "gpt-4o") -> UserDefinedFunction:
    """Create a pandas‑UDF that counts tokens for every string cell.

    The UDF uses *tiktoken* to approximate tokenisation and caches the
    resulting ``Encoding`` object per executor.

    Args:
        model_name: Model identifier understood by ``tiktoken``.

    Returns:
        A pandas UDF producing an ``IntegerType`` column with token counts.
    """

    @pandas_udf(IntegerType())
    def fn(col: Iterator[pd.Series]) -> Iterator[pd.Series]:
        global _TIKTOKEN_ENC
        if _TIKTOKEN_ENC is None:
            _TIKTOKEN_ENC = tiktoken.encoding_for_model(model_name)

        for part in col:
            yield part.map(lambda x: len(_TIKTOKEN_ENC.encode(x)) if isinstance(x, str) else 0)

    return fn