Survey Data Transformation
Transform unstructured survey responses into structured, analysis-ready datasets using openaivec.
In [1]:
import pandas as pd
from openaivec import pandas_ext
from pydantic import BaseModel
from typing import List, Optional
pandas_ext.responses_model("gpt-4o-mini")
Sample Survey Data
The sample below contains ten free-form survey responses from respondents of different ages, occupations, and locations.
In [2]:
# Sample survey responses
survey_responses = [
    "I'm a 28-year-old software engineer from San Francisco. I love hiking, coding, and coffee. Currently working on AI projects.",
    "45 years old, marketing manager in NYC. Interests include yoga, reading business books, and traveling to Europe.",
    "College student, 20, studying biology in Boston. Enjoys gaming, anime, and volunteer work at animal shelters.",
    "Retired teacher, 62, living in Austin Texas. Passionate about gardening, cooking, and spending time with grandchildren.",
    "35-year-old doctor from Chicago, specializing in pediatrics. Hobbies are running marathons and playing piano.",
    "Freelance graphic designer, 29, based in Portland. Into rock climbing, photography, and sustainable living.",
    "High school student, 17, from Miami. Loves basketball, music production, and dreams of becoming a filmmaker.",
    "Small business owner, 52, runs a bakery in Denver. Enjoys baking (obviously), hiking, and local community events.",
    "Data scientist, 31, working remotely from Vancouver. Interested in machine learning, skiing, and craft beer.",
    "Stay-at-home parent, 38, from Phoenix. Passionate about child development, crafting, and organizing community activities."
]

# Give each response a stable identifier (RESP_001, RESP_002, ...)
survey_df = pd.DataFrame({
    "response_id": [f"RESP_{i:03d}" for i in range(1, len(survey_responses) + 1)],
    "response": survey_responses
})
survey_df.head()
Out[2]:
| | response_id | response |
|---|---|---|
| 0 | RESP_001 | I'm a 28-year-old software engineer from San F... |
| 1 | RESP_002 | 45 years old, marketing manager in NYC. Intere... |
| 2 | RESP_003 | College student, 20, studying biology in Bosto... |
| 3 | RESP_004 | Retired teacher, 62, living in Austin Texas. P... |
| 4 | RESP_005 | 35-year-old doctor from Chicago, specializing ... |
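Define Structured Output Schema
Define nested Pydantic models that describe the demographic and interest profile to extract from each response. The inline comments list intended values only; nothing in the schema enforces them.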
In [3]:
class Demographics(BaseModel):
    age: Optional[int]
    age_group: str  # "18-25", "26-35", "36-45", "46-55", "56+"
    occupation: str
    occupation_category: str  # "technology", "healthcare", "education", etc.
    location: str
    location_type: str  # "urban", "suburban", "rural"
    life_stage: str  # "student", "professional", "parent", "retired"


class Interests(BaseModel):
    primary_interests: List[str]
    hobby_categories: List[str]  # "sports", "arts", "technology", etc.
    lifestyle_indicators: List[str]  # "active", "creative", "social", etc.


class PersonProfile(BaseModel):
    demographics: Demographics
    interests: Interests
    personality_traits: List[str]
    potential_products: List[str]  # Products/services they might be interested in
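Because the comments in the schema are not part of the model definition, the language model is free to answer with labels such as "Young Adult" rather than "26-35", as the outputs in later cells show. One way to pin the vocabulary is typing.Literal, which Pydantic validates and which appears as an enum in the generated JSON schema. The following is a minimal sketch, assuming the structured-output backend honors enum constraints; the class name ConstrainedDemographics and the "under-18" bucket are illustrative additions, not part of the original notebook.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError

# Hypothetical fixed vocabularies. "under-18" is an added bucket so that the
# 17-year-old respondent in the sample has a valid label.
AgeGroup = Literal["under-18", "18-25", "26-35", "36-45", "46-55", "56+"]
LocationType = Literal["urban", "suburban", "rural"]


class ConstrainedDemographics(BaseModel):
    age: Optional[int] = None
    age_group: AgeGroup
    location_type: LocationType


# A label inside the vocabulary validates
ok = ConstrainedDemographics(age=28, age_group="26-35", location_type="urban")

# A free-form label is rejected at validation time
try:
    ConstrainedDemographics(age=28, age_group="Young Adult", location_type="urban")
    rejected = False
except ValidationError:
    rejected = True
```

With constrained fields, downstream value_counts calls group on a known set of labels instead of whatever wording the model happens to choose.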
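Transform Unstructured to Structured
Apply the PersonProfile schema to each free-text response with ai.responses, then expand the resulting profile column into one column per top-level field with ai.extract.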
In [4]:
# Extract structured profiles
structured_df = survey_df.assign(
    profile=lambda df: df.response.ai.responses(
        instructions="""
        Extract comprehensive demographic and interest information from the survey response.
        Infer missing information based on context clues when reasonable.
        Categorize interests and suggest relevant product categories.
        """,
        response_format=PersonProfile
    )
).ai.extract("profile")
structured_df.head()
Out[4]:
| | response_id | response | profile_demographics | profile_interests | profile_personality_traits | profile_potential_products |
|---|---|---|---|---|---|---|
| 0 | RESP_001 | I'm a 28-year-old software engineer from San F... | {'age': 28, 'age_group': 'Young Adult', 'occup... | {'primary_interests': ['Hiking', 'Coding', 'Co... | [Adventurous, Analytical, Creative] | [Hiking gear, Coffee subscriptions, Coding cou... |
| 1 | RESP_002 | 45 years old, marketing manager in NYC. Intere... | {'age': 45, 'age_group': 'Middle-aged Adult', ... | {'primary_interests': ['Yoga', 'Reading Busine... | [Ambitious, Inquisitive, Open-minded] | [Yoga mats, Business book subscriptions, Trave... |
| 2 | RESP_003 | College student, 20, studying biology in Bosto... | {'age': 20, 'age_group': 'Young Adult', 'occup... | {'primary_interests': ['Gaming', 'Anime', 'Vol... | [Creative, Empathetic, Curious] | [Gaming consoles, Anime merchandise, Volunteer... |
| 3 | RESP_004 | Retired teacher, 62, living in Austin Texas. P... | {'age': 62, 'age_group': 'Senior', 'occupation... | {'primary_interests': ['Gardening', 'Cooking',... | [Nurturing, Patient, Creative] | [Gardening tools, Cookbooks, Family activity k... |
| 4 | RESP_005 | 35-year-old doctor from Chicago, specializing ... | {'age': 35, 'age_group': 'Adult', 'occupation'... | {'primary_interests': ['Running Marathons', 'P... | [Disciplined, Compassionate, Creative] | [Running gear, Piano sheet music, Health suppl... |
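Judging from the column names in the output, the extract step turns each top-level key of the profile into a column named with the original column as prefix, and leaves nested values as dictionaries until they are extracted in turn. The sketch below reproduces that shape with plain pandas on toy data. It is not openaivec's implementation; the helper name flatten_column is hypothetical, and the sketch assumes the column holds dictionaries rather than model instances.

```python
import pandas as pd

# Toy frame standing in for the result of the model call (values are made up)
df = pd.DataFrame({
    "response_id": ["RESP_001", "RESP_002"],
    "profile": [
        {"demographics": {"age": 28}, "personality_traits": ["Analytical"]},
        {"demographics": {"age": 45}, "personality_traits": ["Ambitious"]},
    ],
})


def flatten_column(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Expand a column of dicts into one '<column>_<key>' column per key."""
    expanded = pd.DataFrame(frame[column].tolist(), index=frame.index)
    expanded = expanded.add_prefix(f"{column}_")
    return frame.drop(columns=column).join(expanded)


# First pass: one column per top-level key
flat = flatten_column(df, "profile")

# Second pass: expand the nested demographics dictionaries
flatter = flatten_column(flat, "profile_demographics")
print(list(flat.columns))
print(list(flatter.columns))
```

Applying the helper twice mirrors the two-step pattern used in the rest of the notebook, where profile is extracted first and profile_demographics or profile_interests afterwards.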
Demographic Analysis
Expand the nested profile_demographics column and count respondents by age group, occupation category, and life stage.
In [5]:
# Expand the nested demographics once and reuse the result for every count
demographics_df = structured_df.ai.extract("profile_demographics")

# Age distribution
print("AGE GROUP DISTRIBUTION:")
age_dist = demographics_df.profile_demographics_age_group.value_counts()
print(age_dist)
print("\n" + "="*50 + "\n")

# Occupation categories
print("OCCUPATION CATEGORIES:")
occ_dist = demographics_df.profile_demographics_occupation_category.value_counts()
print(occ_dist)
print("\n" + "="*50 + "\n")

# Life stages
print("LIFE STAGE DISTRIBUTION:")
life_dist = demographics_df.profile_demographics_life_stage.value_counts()
print(life_dist)
AGE GROUP DISTRIBUTION:
profile_demographics_age_group
Young Adult          3
Adult                3
Middle-aged Adult    2
Senior               1
Teenager             1
Name: count, dtype: int64

==================================================

OCCUPATION CATEGORIES:
profile_demographics_occupation_category
Education          3
Technology         2
Business           1
Healthcare         1
Creative           1
Food & Beverage    1
Family             1
Name: count, dtype: int64

==================================================

LIFE STAGE DISTRIBUTION:
profile_demographics_life_stage
Professional       6
Student            2
Retired            1
Family-oriented    1
Name: count, dtype: int64
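The age group labels above are chosen by the model and do not match the ranges listed in the schema comment. Since age is extracted as an integer, the group can instead be derived deterministically after extraction. A minimal sketch of that alternative follows; the function name age_to_group is hypothetical, and the "under-18" and "unknown" labels are assumptions added to cover ages the schema comment does not mention.

```python
from collections import Counter
from typing import Optional


def age_to_group(age: Optional[int]) -> str:
    """Map an age to the ranges named in the schema comment."""
    if age is None:
        return "unknown"      # assumption: label for responses without an age
    if age < 18:
        return "under-18"     # assumption: not listed in the schema comment
    if age <= 25:
        return "18-25"
    if age <= 35:
        return "26-35"
    if age <= 45:
        return "36-45"
    if age <= 55:
        return "46-55"
    return "56+"


# Ages stated in the ten sample responses, in order
ages = [28, 45, 20, 62, 35, 29, 17, 52, 31, 38]
group_counts = Counter(age_to_group(a) for a in ages)
print(group_counts)
```

On a DataFrame the same function could be applied to the extracted age column with map, which keeps the grouping reproducible across runs.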
Interest Pattern Analysis
Count how often each hobby category and lifestyle indicator appears across all respondents.
In [6]:
# Expand the nested interests once and reuse the result
interests_df = structured_df.ai.extract("profile_interests")

# Explode interest categories for analysis (one row per hobby category)
interests_expanded = interests_df.explode('profile_interests_hobby_categories')
print("TOP HOBBY CATEGORIES:")
hobby_counts = interests_expanded.profile_interests_hobby_categories.value_counts()
print(hobby_counts.head(10))
print("\n" + "="*50 + "\n")

# Lifestyle patterns (one row per lifestyle indicator)
lifestyle_expanded = interests_df.explode('profile_interests_lifestyle_indicators')
print("LIFESTYLE INDICATORS:")
lifestyle_counts = lifestyle_expanded.profile_interests_lifestyle_indicators.value_counts()
print(lifestyle_counts.head(10))
TOP HOBBY CATEGORIES:
profile_interests_hobby_categories
Outdoor Activities      3
Technology              2
Community Engagement    2
Sports                  2
Music                   2
Culinary                2
Fitness                 2
Entertainment           1
Community Service       1
Home & Garden           1
Name: count, dtype: int64

==================================================

LIFESTYLE INDICATORS:
profile_interests_lifestyle_indicators
Health-conscious             4
Tech-savvy                   3
Family-oriented              2
Adventurous                  2
Creative                     2
Wellness-focused             1
Culturally aware             1
Socially conscious           1
Artistic                     1
Environmentally conscious    1
Name: count, dtype: int64
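The counts above are totals over all respondents. To compare interests between demographic groups, the exploded hobby column can be cross-tabulated against a demographic column. The sketch below shows the idea on toy data with shortened, made-up column names and values; in the notebook the corresponding columns would be the extracted life stage and hobby categories.

```python
import pandas as pd

# Toy rows shaped like the flattened notebook columns (values are made up)
toy = pd.DataFrame({
    "life_stage": ["Professional", "Professional", "Student"],
    "hobby_categories": [
        ["Sports", "Technology"],
        ["Sports"],
        ["Entertainment", "Technology"],
    ],
})

# One row per (respondent, hobby) pair; reset the index so rows are unique
exploded = toy.explode("hobby_categories").reset_index(drop=True)

# Rows are life stages, columns are hobby categories, cells are counts
hobby_by_stage = pd.crosstab(exploded["life_stage"], exploded["hobby_categories"])
print(hobby_by_stage)
```

Each cell counts how many hobby mentions fall into that combination, so a respondent with several hobbies contributes to several cells.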
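Market Segmentation
Build a simple segment label by combining each respondent's age group and occupation category, then list the most frequently suggested products per segment. With only ten responses, every segment contains a single respondent, so the counts below illustrate the mechanics rather than real market patterns.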
In [7]:
from collections import Counter

# Generate market segments
segments_df = structured_df.ai.extract("profile_demographics").assign(
    segment=lambda df: df.apply(
        lambda row: f"{row.profile_demographics_age_group}_{row.profile_demographics_occupation_category}",
        axis=1
    )
)
print("MARKET SEGMENTS:")
segment_counts = segments_df.segment.value_counts()
print(segment_counts)
print("\n" + "="*50 + "\n")

# Product recommendations by segment
print("PRODUCT OPPORTUNITIES BY SEGMENT:")
for segment in segment_counts.index[:5]:  # Top 5 segments
    segment_data = segments_df[segments_df.segment == segment]
    print(f"\n📊 {segment.upper()}:")
    # Get all product suggestions for this segment
    products = []
    for products_list in segment_data.profile_potential_products:
        products.extend(products_list)
    # Count and display top products
    product_counter = Counter(products)
    for product, count in product_counter.most_common(3):
        print(f" • {product} ({count} mentions)")
MARKET SEGMENTS:
segment
Young Adult_Technology               1
Middle-aged Adult_Business           1
Young Adult_Education                1
Senior_Education                     1
Adult_Healthcare                     1
Young Adult_Creative                 1
Teenager_Education                   1
Middle-aged Adult_Food & Beverage    1
Adult_Technology                     1
Adult_Family                         1
Name: count, dtype: int64

==================================================

PRODUCT OPPORTUNITIES BY SEGMENT:

📊 YOUNG ADULT_TECHNOLOGY:
 • Hiking gear (1 mentions)
 • Coffee subscriptions (1 mentions)
 • Coding courses (1 mentions)

📊 MIDDLE-AGED ADULT_BUSINESS:
 • Yoga mats (1 mentions)
 • Business book subscriptions (1 mentions)
 • Travel packages to Europe (1 mentions)

📊 YOUNG ADULT_EDUCATION:
 • Gaming consoles (1 mentions)
 • Anime merchandise (1 mentions)
 • Volunteer opportunities (1 mentions)

📊 SENIOR_EDUCATION:
 • Gardening tools (1 mentions)
 • Cookbooks (1 mentions)
 • Family activity kits (1 mentions)

📊 ADULT_HEALTHCARE:
 • Running gear (1 mentions)
 • Piano sheet music (1 mentions)
 • Health supplements (1 mentions)
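On larger datasets, the per-segment product counts can be collected in a single pass instead of filtering the DataFrame once per segment. The sketch below shows one way to do this with a defaultdict of Counter objects. The rows are illustrative toy data, and lowercasing product names before counting is an assumption made here so that near-identical suggestions are merged.

```python
from collections import Counter, defaultdict

# Toy (segment, suggested products) pairs; values are illustrative only
rows = [
    ("Adult_Technology", ["Hiking gear", "Coding courses"]),
    ("Adult_Technology", ["hiking gear", "Ski passes"]),
    ("Senior_Education", ["Gardening tools"]),
]

# Map each segment to a counter of its normalized product suggestions
products_by_segment = defaultdict(Counter)
for segment, products in rows:
    # Normalize case and whitespace so near-identical suggestions are merged
    products_by_segment[segment].update(p.strip().lower() for p in products)

top_product = products_by_segment["Adult_Technology"].most_common(1)
print(top_product)
```

In the notebook, the pairs could come from iterating over the segment and profile_potential_products columns of segments_df together.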
Export for Analysis
Select the flattened demographic columns into a clean table that can be exported to CSV for business intelligence tools.
In [8]:
# Create clean demographic table
demographics_clean = structured_df.ai.extract("profile_demographics")[[
    'response_id',
    'profile_demographics_age',
    'profile_demographics_age_group',
    'profile_demographics_occupation',
    'profile_demographics_occupation_category',
    'profile_demographics_location',
    'profile_demographics_location_type',
    'profile_demographics_life_stage'
]].copy()
print("📊 CLEAN DEMOGRAPHICS TABLE:")
print(demographics_clean.head())

# Save to CSV for external analysis
# demographics_clean.to_csv('demographics_analysis.csv', index=False)
# print("\n💾 Data exported to demographics_analysis.csv")
📊 CLEAN DEMOGRAPHICS TABLE:
  response_id  profile_demographics_age profile_demographics_age_group  \
0    RESP_001                        28                    Young Adult
1    RESP_002                        45              Middle-aged Adult
2    RESP_003                        20                    Young Adult
3    RESP_004                        62                         Senior
4    RESP_005                        35                          Adult

  profile_demographics_occupation profile_demographics_occupation_category  \
0               Software Engineer                               Technology
1               Marketing Manager                                 Business
2                 College Student                                Education
3                 Retired Teacher                                Education
4                          Doctor                               Healthcare

  profile_demographics_location profile_demographics_location_type  \
0                 San Francisco                              Urban
1                 New York City                              Urban
2                        Boston                              Urban
3                 Austin, Texas                              Urban
4                       Chicago                              Urban

  profile_demographics_life_stage
0                    Professional
1                    Professional
2                         Student
3                         Retired
4                    Professional
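The demographics table holds only scalar values, so it can be written to CSV directly. List-valued columns such as profile_personality_traits would otherwise be written as Python list representations, which most BI tools cannot parse. A minimal sketch of one option, joining each list into a delimited string before export, is shown below on a toy frame; the "; " delimiter is an arbitrary choice, and the CSV is written to an in-memory buffer rather than a file.

```python
import io

import pandas as pd

# Toy frame with one list-valued column, mirroring the first notebook row
toy = pd.DataFrame({
    "response_id": ["RESP_001"],
    "profile_personality_traits": [["Adventurous", "Analytical", "Creative"]],
})

# Join each list into a single delimited string so the CSV cell stays readable
export_df = toy.assign(
    profile_personality_traits=toy["profile_personality_traits"].map("; ".join)
)

# Write to an in-memory buffer; pass a file path instead to save to disk
buffer = io.StringIO()
export_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

If the consuming tool needs one row per trait instead, exploding the column before export is an alternative to joining.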
Conclusion
This notebook demonstrates how openaivec transforms unstructured survey data into:
- Structured Demographics: Age, occupation, location, life stage
- Interest Profiles: Hobbies, lifestyle indicators, personality traits
- Market Segments: Actionable customer groupings
- Product Opportunities: Data-driven recommendation insights
- Analysis-Ready Data: Clean datasets for BI tools
Scale this approach to thousands of survey responses for comprehensive market research.