Basics of Spark
This notebook demonstrates how to integrate Apache Spark with OpenAI's API to perform token counting, embedding generation, and multilingual translation using Spark UDFs.
In [1]:
# Initialize Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
25/05/05 23:13:06 WARN Utils: Your hostname, Hirokis-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.1.12 instead (on interface en0)
25/05/05 23:13:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/05 23:13:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Create Dummy Data
Create a simple DataFrame containing names of fruits.
In [2]:
# Create DataFrame with fruit names
fruit_data = [("apple",), ("banana",), ("cherry",), ("mango",), ("orange",), ("peach",), ("pear",), ("pineapple",), ("plum",), ("strawberry",)]
df = spark.createDataFrame(fruit_data, ["name"])
df.createOrReplaceTempView("fruits")
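Each entry in `fruit_data` is written as `("apple",)`, with a trailing comma. The plain-Python sketch below, which does not touch Spark, shows why: the comma is what makes each entry a one-element tuple, so each tuple can serve as a one-column row. The variable names are illustrative only.

```python
# Build one-element tuples from a plain list of names.
names = ["apple", "banana", "cherry"]
rows = [(name,) for name in names]

# Parentheses alone do not create a tuple; the trailing comma does.
not_a_tuple = ("apple")   # this is just the string "apple"
one_tuple = ("apple",)    # this is a tuple holding a single string

print(rows)
print(type(not_a_tuple).__name__, type(one_tuple).__name__)
```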
In [3]:
# Display the fruits DataFrame
spark.sql("select * from fruits").show()
+----------+
|      name|
+----------+
|     apple|
|    banana|
|    cherry|
|     mango|
|    orange|
|     peach|
|      pear|
| pineapple|
|      plum|
|strawberry|
+----------+
Count Tokens
Count the number of tokens in each fruit name, as measured by the tokenizer for OpenAI's gpt-4o model.
In [4]:
# Register UDF to count tokens using OpenAI GPT model
from openaivec.spark import count_tokens_udf
spark.udf.register("count_tokens", count_tokens_udf("gpt-4o"))
Out[4]:
<pyspark.sql.udf.UserDefinedFunction at 0x10b1e1b40>
In [5]:
# Show token counts for each fruit name
spark.sql("""
    select
        name,
        count_tokens(name) as token_count
    from fruits
""").show()
+----------+-----------+
|      name|token_count|
+----------+-----------+
|     apple|          1|
|    banana|          1|
|    cherry|          2|
|     mango|          2|
|    orange|          1|
|     peach|          2|
|      pear|          1|
| pineapple|          2|
|      plum|          2|
|strawberry|          3|
+----------+-----------+
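Token counts like these are often summed to estimate the size of a request before sending it. The sketch below does that arithmetic in plain Python, using the counts from the table above. The price constant is a made-up placeholder for illustration, not a real OpenAI rate.

```python
# Token counts copied from the output table above.
token_counts = {
    "apple": 1, "banana": 1, "cherry": 2, "mango": 2, "orange": 1,
    "peach": 2, "pear": 1, "pineapple": 2, "plum": 2, "strawberry": 3,
}

# Hypothetical price per one million input tokens (illustration only).
PRICE_PER_MILLION_TOKENS_USD = 1.00

total_tokens = sum(token_counts.values())
estimated_cost_usd = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD

print(f"total tokens: {total_tokens}")
print(f"estimated cost: ${estimated_cost_usd:.8f}")
```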
Generate Embeddings
Generate an embedding vector for each fruit name using OpenAI's text-embedding-3-small model.
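The registration cell reads the API key with `os.getenv("OPENAI_API_KEY")`, which returns `None` when the variable is unset. One way to catch a missing key on the driver, before any Spark job starts, is a small fail-fast helper. `require_env` below is a hypothetical helper written for this sketch; it is not part of openaivec.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (commented out so the sketch runs without a key present):
# api_key = require_env("OPENAI_API_KEY")
```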
In [7]:
# Register UDF to generate embeddings
import os
from openaivec.spark import EmbeddingsUDFBuilder
embeddings_udf = EmbeddingsUDFBuilder.of_openai(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-3-small"
)
spark.udf.register("embed", embeddings_udf.build(batch_size=1024))
Out[7]:
<pyspark.sql.udf.UserDefinedFunction at 0x10e2cc4c0>
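The `batch_size=1024` argument suggests that inputs are grouped before being sent to the API. The sketch below shows the general idea of batching in plain Python. It assumes that `batch_size` means "maximum inputs per group"; it is not openaivec's implementation, and `batched` is a name chosen for this example.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(values: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(values)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


fruits = ["apple", "banana", "cherry", "mango", "orange"]
for batch in batched(fruits, 2):
    print(batch)
```

With ten fruit names and a batch size of 1024, everything would fit in a single group under this assumption.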
In [8]:
# Display embeddings for each fruit name
spark.sql("""
    select
        name,
        embed(name) as embedding
    from fruits
""").show()
+----------+--------------------+
|      name|           embedding|
+----------+--------------------+
|     apple|[0.01764064, -0.0...|
|    banana|[0.013411593, -0....|
|    cherry|[0.036218576, -0....|
|     mango|[0.055494547, -0....|
|    orange|[-0.025922043, -0...|
|     peach|[0.030673496, -0....|
|      pear|[0.023718908, -0....|
| pineapple|[0.020983547, -0....|
|      plum|[0.0049052937, 6....|
|strawberry|[0.020106195, -0....|
+----------+--------------------+
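Embedding vectors such as these are commonly compared with cosine similarity. The sketch below is a plain-Python version of that measure. The three-dimensional vectors are made up for illustration; they are not taken from the truncated output above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up toy vectors (illustration only).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.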
Multilingual Translation
Translate each fruit name into eight languages using OpenAI's gpt-4.1-nano model, with a Pydantic model defining the structure of the response.
In [9]:
# Register UDF for multilingual translation
import os
from openaivec.spark import ResponsesUDFBuilder
from pydantic import BaseModel
udf = ResponsesUDFBuilder.of_openai(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="gpt-4.1-nano",
)
class Translation(BaseModel):
    en: str
    fr: str
    ja: str
    es: str
    de: str
    it: str
    pt: str
    ru: str
spark.udf.register("translate", udf.build(
    instructions="Translate the following text to English, French, Japanese, Spanish, German, Italian, Portuguese, and Russian.",
    response_format=Translation,
))
25/05/05 23:15:38 WARN SimpleFunctionRegistry: The function translate replaced a previously registered function.
Out[9]:
<pyspark.sql.udf.UserDefinedFunction at 0x10b203610>
In [10]:
# Display translations for each fruit name
spark.sql("""
    select
        name,
        translate(name) as t,
        t.en as en,
        t.fr as fr,
        t.ja as ja,
        t.es as es,
        t.de as de,
        t.it as it,
        t.pt as pt,
        t.ru as ru
    from fruits
""").show()
The model name 'gpt-4.1-nano' is not supported by tiktoken. Instead, using the 'o200k_base' encoding.
+----------+-----------------------+----------+------+------------+-------+--------+--------+-------+--------+
|      name|                      t|        en|    fr|          ja|     es|      de|      it|     pt|      ru|
+----------+-----------------------+----------+------+------------+-------+--------+--------+-------+--------+
|     apple| {apple, pomme, リン...|     apple| pomme|      リンゴ|manzana|   Apfel|    mela|   maçã|  яблоко|
|    banana|   {banana, banane, ...|    banana|banane|      バナナ|plátano|  Banane|  banana| banana|   банан|
|    cherry|   {cherry, cerise, ...|    cherry|cerise|  さくらんぼ| cereza| Kirsche|ciliegia| cereja|   вишня|
|     mango|  {mango, mangue, マ...|     mango|mangue|    マンゴー|  mango|   Mango|   mango|  manga|   манго|
|    orange|   {orange, orange, ...|    orange|orange|    オレンジ|naranja|  orange| arancia|laranja|апельсин|
|     peach| {peach, pêche, もも...|     peach| pêche|        もも|durazno|Pfirsich|   pesca|pêssego|  персик|
|      pear|  {pear, poire, 梨, ...|      pear| poire|          梨|   pera|   Birne|    pera|   pêra|   груша|
| pineapple|   {pineapple, anana...| pineapple|ananas|パイナップル|   piña|  Ananas|  ananas|abacaxi|  ананас|
|      plum|{plum, prune, プラム...|      plum| prune|      プラム|ciruela| Pflaume|  prugna| ameixa|   слива|
|strawberry|   {strawberry, frai...|strawberry|fraise|      イチゴ|  fresa|Erdbeere| fragola|morango|клубника|
+----------+-----------------------+----------+------+------------+-------+--------+--------+-------+--------+
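Each `t` value is a struct with one field per language, matching the `Translation` model. As a rough illustration of how a structured response maps onto named fields, the sketch below parses a JSON string into a standard-library dataclass. It assumes a JSON object with these eight keys; the sample payload is assembled from the first output row, `TranslationRow` and `parse_translation` are hypothetical names, and the notebook itself relies on Pydantic and openaivec for this step.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class TranslationRow:
    """Standard-library stand-in for the notebook's Pydantic Translation model."""
    en: str
    fr: str
    ja: str
    es: str
    de: str
    it: str
    pt: str
    ru: str


def parse_translation(payload: str) -> TranslationRow:
    """Parse a JSON object and require every language field to be present."""
    data = json.loads(payload)
    expected = [f.name for f in fields(TranslationRow)]
    missing = [name for name in expected if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return TranslationRow(**{name: data[name] for name in expected})


# Sample payload built from the "apple" row above (assumed wire format).
sample = (
    '{"en": "apple", "fr": "pomme", "ja": "リンゴ", "es": "manzana", '
    '"de": "Apfel", "it": "mela", "pt": "maçã", "ru": "яблоко"}'
)
row = parse_translation(sample)
print(row.fr, row.ja)
```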
Conclusion
This notebook showed how to register Spark UDFs that use OpenAI models for three tasks: token counting, embedding generation, and multilingual translation. Each UDF was then called from Spark SQL.