Survey Data Transformation

Transform unstructured survey responses into structured, analysis-ready datasets using openaivec.

In [1]:

import pandas as pd
from openaivec import pandas_ext
from pydantic import BaseModel
from typing import List, Optional

pandas_ext.responses_model("gpt-4o-mini")

Sample Survey Data

Ten free-form survey responses covering a range of ages, occupations, and locations.

In [2]:

# Sample survey responses
survey_responses = [
    "I'm a 28-year-old software engineer from San Francisco. I love hiking, coding, and coffee. Currently working on AI projects.",
    "45 years old, marketing manager in NYC. Interests include yoga, reading business books, and traveling to Europe.",
    "College student, 20, studying biology in Boston. Enjoys gaming, anime, and volunteer work at animal shelters.",
    "Retired teacher, 62, living in Austin Texas. Passionate about gardening, cooking, and spending time with grandchildren.",
    "35-year-old doctor from Chicago, specializing in pediatrics. Hobbies are running marathons and playing piano.",
    "Freelance graphic designer, 29, based in Portland. Into rock climbing, photography, and sustainable living.",
    "High school student, 17, from Miami. Loves basketball, music production, and dreams of becoming a filmmaker.",
    "Small business owner, 52, runs a bakery in Denver. Enjoys baking (obviously), hiking, and local community events.",
    "Data scientist, 31, working remotely from Vancouver. Interested in machine learning, skiing, and craft beer.",
    "Stay-at-home parent, 38, from Phoenix. Passionate about child development, crafting, and organizing community activities."
]

survey_df = pd.DataFrame({
    "response_id": [f"RESP_{i:03d}" for i in range(1, len(survey_responses) + 1)],
    "response": survey_responses
})

survey_df.head()

Out[2]:

	response_id	response
0	RESP_001	I'm a 28-year-old software engineer from San F...
1	RESP_002	45 years old, marketing manager in NYC. Intere...
2	RESP_003	College student, 20, studying biology in Bosto...
3	RESP_004	Retired teacher, 62, living in Austin Texas. P...
4	RESP_005	35-year-old doctor from Chicago, specializing ...
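
Before sending responses to the model, it can be worth dropping blank and duplicate rows so that no request is spent on them. A minimal sketch with plain pandas; the `raw_df` frame and its contents are illustrative and only mirror the columns of `survey_df`.

```python
import pandas as pd

# Illustrative frame with the same columns as survey_df,
# plus a blank row, a duplicate row, and a missing value.
raw_df = pd.DataFrame({
    "response_id": ["RESP_001", "RESP_002", "RESP_003", "RESP_004"],
    "response": [
        "45 years old, marketing manager in NYC.",
        "   ",
        "45 years old, marketing manager in NYC.",
        None,
    ],
})

# Strip surrounding whitespace, then drop rows with no usable text.
cleaned = raw_df.assign(response=raw_df["response"].str.strip())
cleaned = cleaned[cleaned["response"].notna() & (cleaned["response"] != "")]

# Keep the first occurrence of each distinct response text.
cleaned = cleaned.drop_duplicates(subset="response", keep="first").reset_index(drop=True)

print(cleaned)
```

Only the first row survives here; the blank, duplicate, and missing responses are removed.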

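Define Structured Output Schema

Define Pydantic models for the demographic and interest profile to extract from each response. The inline comments list the intended values, but Python comments are not part of the schema the model receives, so the recorded output further down contains labels such as "Young Adult" rather than "26-35".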

In [3]:

class Demographics(BaseModel):
    age: Optional[int]
    age_group: str  # "18-25", "26-35", "36-45", "46-55", "56+"
    occupation: str
    occupation_category: str  # "technology", "healthcare", "education", etc.
    location: str
    location_type: str  # "urban", "suburban", "rural"
    life_stage: str  # "student", "professional", "parent", "retired"

class Interests(BaseModel):
    primary_interests: List[str]
    hobby_categories: List[str]  # "sports", "arts", "technology", etc.
    lifestyle_indicators: List[str]  # "active", "creative", "social", etc.

class PersonProfile(BaseModel):
    demographics: Demographics
    interests: Interests
    personality_traits: List[str]
    potential_products: List[str]  # Products/services they might be interested in
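
If the categories should follow a fixed vocabulary, one option is to encode the allowed values with `typing.Literal` so that Pydantic rejects anything else. The sketch below is an assumption-laden variant of `Demographics`, not the notebook's schema: the class name `ConstrainedDemographics` is hypothetical, and whether the model honours the resulting enum depends on how the schema is passed to the API. Note that the commented ranges have no bucket for the 17-year-old respondent, so a real vocabulary would need one.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError

# Allowed values copied from the inline comments in the schema above.
AgeGroup = Literal["18-25", "26-35", "36-45", "46-55", "56+"]
LocationType = Literal["urban", "suburban", "rural"]
LifeStage = Literal["student", "professional", "parent", "retired"]


class ConstrainedDemographics(BaseModel):  # hypothetical name, for illustration
    age: Optional[int]
    age_group: AgeGroup
    occupation: str
    occupation_category: str
    location: str
    location_type: LocationType
    life_stage: LifeStage


base = {
    "age": 28,
    "age_group": "26-35",
    "occupation": "Software Engineer",
    "occupation_category": "technology",
    "location": "San Francisco",
    "location_type": "urban",
    "life_stage": "professional",
}

# A record that uses the allowed vocabulary validates.
ok = ConstrainedDemographics(**base)

# A label outside the vocabulary raises a ValidationError.
try:
    ConstrainedDemographics(**{**base, "age_group": "Young Adult"})
    rejected = False
except ValidationError:
    rejected = True

print(ok.age_group, rejected)
```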

Transform Unstructured to Structured

Apply the model to each free-text response with `.ai.responses`, then flatten the resulting `PersonProfile` into `profile_*` columns with `.ai.extract`.

In [4]:

# Extract structured profiles
structured_df = survey_df.assign(
    profile=lambda df: df.response.ai.responses(
        instructions="""
        Extract comprehensive demographic and interest information from the survey response.
        Infer missing information based on context clues when reasonable.
        Categorize interests and suggest relevant product categories.
        """,
        response_format=PersonProfile
    )
).ai.extract("profile")

structured_df.head()

Out[4]:

	response_id	response	profile_demographics	profile_interests	profile_personality_traits	profile_potential_products
0	RESP_001	I'm a 28-year-old software engineer from San F...	{'age': 28, 'age_group': 'Young Adult', 'occup...	{'primary_interests': ['Hiking', 'Coding', 'Co...	[Adventurous, Analytical, Creative]	[Hiking gear, Coffee subscriptions, Coding cou...
1	RESP_002	45 years old, marketing manager in NYC. Intere...	{'age': 45, 'age_group': 'Middle-aged Adult', ...	{'primary_interests': ['Yoga', 'Reading Busine...	[Ambitious, Inquisitive, Open-minded]	[Yoga mats, Business book subscriptions, Trave...
2	RESP_003	College student, 20, studying biology in Bosto...	{'age': 20, 'age_group': 'Young Adult', 'occup...	{'primary_interests': ['Gaming', 'Anime', 'Vol...	[Creative, Empathetic, Curious]	[Gaming consoles, Anime merchandise, Volunteer...
3	RESP_004	Retired teacher, 62, living in Austin Texas. P...	{'age': 62, 'age_group': 'Senior', 'occupation...	{'primary_interests': ['Gardening', 'Cooking',...	[Nurturing, Patient, Creative]	[Gardening tools, Cookbooks, Family activity k...
4	RESP_005	35-year-old doctor from Chicago, specializing ...	{'age': 35, 'age_group': 'Adult', 'occupation'...	{'primary_interests': ['Running Marathons', 'P...	[Disciplined, Compassionate, Creative]	[Running gear, Piano sheet music, Health suppl...
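
The recorded output labels age groups as "Young Adult", "Adult", and so on, while the schema comments describe numeric ranges. Because `age` is extracted as a number, one option is to derive the range deterministically instead of asking the model for it. A minimal sketch using only the standard library; the `age_to_group` helper and the "under 18" label are assumptions added for illustration (the commented ranges start at 18, but one respondent is 17).

```python
from bisect import bisect_left
from typing import Optional

# Inclusive upper bound of every bucket except the last, which is open-ended.
UPPER_BOUNDS = [17, 25, 35, 45, 55]
LABELS = ["under 18", "18-25", "26-35", "36-45", "46-55", "56+"]


def age_to_group(age: Optional[float]) -> Optional[str]:
    """Map an age to its bucket label, or None when the age is unknown."""
    # NaN is the only value that is not equal to itself.
    if age is None or age != age:
        return None
    return LABELS[bisect_left(UPPER_BOUNDS, age)]


# Ages of the first five respondents, plus the 17-year-old.
ages = [28, 45, 20, 62, 35, 17]
print([age_to_group(a) for a in ages])
```

On the extracted frame the same function could be applied with `Series.map(age_to_group)` on the `profile_demographics_age` column.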

Demographic Analysis

Summarize the structured data by age group, occupation category, and life stage.

In [5]:

# Flatten the nested demographics once and reuse the result for every breakdown
demographics_df = structured_df.ai.extract("profile_demographics")

# Age distribution
print("AGE GROUP DISTRIBUTION:")
age_dist = demographics_df.profile_demographics_age_group.value_counts()
print(age_dist)

print("\n" + "="*50 + "\n")

# Occupation categories
print("OCCUPATION CATEGORIES:")
occ_dist = demographics_df.profile_demographics_occupation_category.value_counts()
print(occ_dist)

print("\n" + "="*50 + "\n")

# Life stages
print("LIFE STAGE DISTRIBUTION:")
life_dist = demographics_df.profile_demographics_life_stage.value_counts()
print(life_dist)

AGE GROUP DISTRIBUTION:
profile_demographics_age_group
Young Adult          3
Adult                3
Middle-aged Adult    2
Senior               1
Teenager             1
Name: count, dtype: int64

==================================================

OCCUPATION CATEGORIES:
profile_demographics_occupation_category
Education          3
Technology         2
Business           1
Healthcare         1
Creative           1
Food & Beverage    1
Family             1
Name: count, dtype: int64

==================================================

LIFE STAGE DISTRIBUTION:
profile_demographics_life_stage
Professional       6
Student            2
Retired            1
Family-oriented    1
Name: count, dtype: int64
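
Raw counts are easier to compare across surveys of different sizes when expressed as shares. A small sketch that converts the life-stage counts printed above into percentages; the series is rebuilt by hand here so the snippet runs on its own.

```python
import pandas as pd

# Life-stage counts copied from the output above.
life_counts = pd.Series(
    {"Professional": 6, "Student": 2, "Retired": 1, "Family-oriented": 1},
    name="count",
)

# Convert counts to percentage shares of all responses.
life_share = (life_counts / life_counts.sum() * 100).round(1)
print(life_share)
```

On the live column, `value_counts(normalize=True)` gives the same proportions directly.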

Interest Pattern Analysis

Count hobby categories and lifestyle indicators across all respondents.

In [6]:

# Flatten the nested interests once; each explode below works on this frame
interests_df = structured_df.ai.extract("profile_interests")

# Explode interest categories for analysis (one row per respondent and category)
interests_expanded = interests_df.explode('profile_interests_hobby_categories')

print("TOP HOBBY CATEGORIES:")
hobby_counts = interests_expanded.profile_interests_hobby_categories.value_counts()
print(hobby_counts.head(10))

print("\n" + "="*50 + "\n")

# Lifestyle patterns
lifestyle_expanded = interests_df.explode('profile_interests_lifestyle_indicators')
print("LIFESTYLE INDICATORS:")
lifestyle_counts = lifestyle_expanded.profile_interests_lifestyle_indicators.value_counts()
print(lifestyle_counts.head(10))

TOP HOBBY CATEGORIES:
profile_interests_hobby_categories
Outdoor Activities      3
Technology              2
Community Engagement    2
Sports                  2
Music                   2
Culinary                2
Fitness                 2
Entertainment           1
Community Service       1
Home & Garden           1
Name: count, dtype: int64

==================================================

LIFESTYLE INDICATORS:
profile_interests_lifestyle_indicators
Health-conscious             4
Tech-savvy                   3
Family-oriented              2
Adventurous                  2
Creative                     2
Wellness-focused             1
Culturally aware             1
Socially conscious           1
Artistic                     1
Environmentally conscious    1
Name: count, dtype: int64
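
The counts above are totals over all respondents. To see how hobby categories vary by demographic group, the exploded rows can be cross-tabulated against a demographic column. A minimal sketch with plain pandas; the `toy` frame is made up for illustration and only borrows the column names from the extracted data.

```python
import pandas as pd

# Illustrative rows shaped like the extracted columns; the values are invented.
toy = pd.DataFrame({
    "profile_demographics_age_group": ["Young Adult", "Adult", "Senior"],
    "profile_interests_hobby_categories": [
        ["Outdoor Activities", "Technology"],
        ["Sports", "Music"],
        ["Home & Garden", "Culinary"],
    ],
})

# One row per (respondent, hobby category) pair, with a fresh unique index.
exploded = toy.explode("profile_interests_hobby_categories").reset_index(drop=True)

# Count hobby categories within each age group.
hobby_by_age = pd.crosstab(
    exploded["profile_demographics_age_group"],
    exploded["profile_interests_hobby_categories"],
)
print(hobby_by_age)
```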

Market Segmentation

Create customer segments by combining age group and occupation category, then list the most frequent product suggestions in each segment.

In [7]:

from collections import Counter

# Generate market segments by combining age group and occupation category
segments_df = structured_df.ai.extract("profile_demographics").assign(
    segment=lambda df: df.apply(
        lambda row: f"{row.profile_demographics_age_group}_{row.profile_demographics_occupation_category}",
        axis=1
    )
)

print("MARKET SEGMENTS:")
segment_counts = segments_df.segment.value_counts()
print(segment_counts)

print("\n" + "="*50 + "\n")

# Product recommendations by segment
print("PRODUCT OPPORTUNITIES BY SEGMENT:")
for segment in segment_counts.index[:5]:  # Top 5 segments
    segment_data = segments_df[segments_df.segment == segment]
    print(f"\n📊 {segment.upper()}:")

    # Count product suggestions across all responses in this segment
    product_counter = Counter(
        product
        for products_list in segment_data.profile_potential_products
        for product in products_list
    )

    # Display the three most frequent products
    for product, count in product_counter.most_common(3):
        print(f"   • {product} ({count} mentions)")

MARKET SEGMENTS:
segment
Young Adult_Technology               1
Middle-aged Adult_Business           1
Young Adult_Education                1
Senior_Education                     1
Adult_Healthcare                     1
Young Adult_Creative                 1
Teenager_Education                   1
Middle-aged Adult_Food & Beverage    1
Adult_Technology                     1
Adult_Family                         1
Name: count, dtype: int64

==================================================

PRODUCT OPPORTUNITIES BY SEGMENT:

📊 YOUNG ADULT_TECHNOLOGY:
   • Hiking gear (1 mentions)
   • Coffee subscriptions (1 mentions)
   • Coding courses (1 mentions)

📊 MIDDLE-AGED ADULT_BUSINESS:
   • Yoga mats (1 mentions)
   • Business book subscriptions (1 mentions)
   • Travel packages to Europe (1 mentions)

📊 YOUNG ADULT_EDUCATION:
   • Gaming consoles (1 mentions)
   • Anime merchandise (1 mentions)
   • Volunteer opportunities (1 mentions)

📊 SENIOR_EDUCATION:
   • Gardening tools (1 mentions)
   • Cookbooks (1 mentions)
   • Family activity kits (1 mentions)

📊 ADULT_HEALTHCARE:
   • Running gear (1 mentions)
   • Piano sheet music (1 mentions)
   • Health supplements (1 mentions)
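
With ten responses, every age-by-occupation segment above contains a single respondent, so each "top product" list is one person's suggestions. One option is to fall back to a coarser key and report only segments that reach a minimum size. The sketch below uses the age-group counts printed in the Demographic Analysis section; `MIN_SEGMENT_SIZE` is a hypothetical threshold.

```python
import pandas as pd

# Age-group labels for the ten respondents, matching the counts printed earlier.
age_groups = pd.Series(
    ["Young Adult"] * 3
    + ["Adult"] * 3
    + ["Middle-aged Adult"] * 2
    + ["Senior", "Teenager"],
    name="segment",
)

MIN_SEGMENT_SIZE = 2  # hypothetical threshold; tune to the sample size

# Keep only segments with enough responses to say something about.
segment_counts = age_groups.value_counts()
reportable = segment_counts[segment_counts >= MIN_SEGMENT_SIZE]
print(reportable)
```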

Export for Analysis

Select the flattened demographic columns into a clean table for export to business intelligence tools.

In [8]:

# Create clean demographic table
demographics_clean = structured_df.ai.extract("profile_demographics")[[
    'response_id',
    'profile_demographics_age',
    'profile_demographics_age_group', 
    'profile_demographics_occupation',
    'profile_demographics_occupation_category',
    'profile_demographics_location',
    'profile_demographics_location_type',
    'profile_demographics_life_stage'
]].copy()

print("📊 CLEAN DEMOGRAPHICS TABLE:")
print(demographics_clean.head())

# Save to CSV for external analysis
# demographics_clean.to_csv('demographics_analysis.csv', index=False)
# print("\n💾 Data exported to demographics_analysis.csv")

📊 CLEAN DEMOGRAPHICS TABLE:
  response_id  profile_demographics_age profile_demographics_age_group  \
0    RESP_001                        28                    Young Adult   
1    RESP_002                        45              Middle-aged Adult   
2    RESP_003                        20                    Young Adult   
3    RESP_004                        62                         Senior   
4    RESP_005                        35                          Adult   

  profile_demographics_occupation profile_demographics_occupation_category  \
0               Software Engineer                               Technology   
1               Marketing Manager                                 Business   
2                 College Student                                Education   
3                 Retired Teacher                                Education   
4                          Doctor                               Healthcare   

  profile_demographics_location profile_demographics_location_type  \
0                 San Francisco                              Urban   
1                 New York City                              Urban   
2                        Boston                              Urban   
3                 Austin, Texas                              Urban   
4                       Chicago                              Urban   

  profile_demographics_life_stage  
0                    Professional  
1                    Professional  
2                         Student  
3                         Retired  
4                    Professional
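
The demographics table contains only scalar columns, so it writes to CSV cleanly. List-valued columns such as `profile_potential_products` need a little care, because a Python list written to CSV comes back as a plain string. One way to handle this, sketched under the assumption that the consumer can parse JSON, is to serialise each list before export. The `export_df` rows are illustrative, and the in-memory buffer stands in for a file path.

```python
import io
import json

import pandas as pd

# Illustrative rows with a list-valued column like profile_potential_products.
export_df = pd.DataFrame({
    "response_id": ["RESP_001", "RESP_002"],
    "profile_potential_products": [
        ["Hiking gear", "Coffee subscriptions"],
        ["Yoga mats"],
    ],
})

# Serialise each list as a JSON string so it survives the CSV round trip.
csv_ready = export_df.assign(
    profile_potential_products=export_df["profile_potential_products"].map(json.dumps)
)

buffer = io.StringIO()  # stands in for a file path
csv_ready.to_csv(buffer, index=False)

# Reading back: parse the JSON strings into lists again.
buffer.seek(0)
restored = pd.read_csv(buffer)
restored["profile_potential_products"] = restored["profile_potential_products"].map(json.loads)
print(restored)
```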

Conclusion

This notebook demonstrates how openaivec transforms unstructured survey data into:

Structured Demographics: Age, occupation, location, life stage
Interest Profiles: Hobbies, lifestyle indicators, personality traits
Market Segments: Actionable customer groupings
Product Opportunities: Data-driven recommendation insights
Analysis-Ready Data: Clean datasets for BI tools

Scale this approach to thousands of survey responses for comprehensive market research.
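
When moving to thousands of responses, it may help to process the data in fixed-size chunks and save each chunk's result before starting the next, so that a failure midway does not lose earlier work. This is independent of any batching the library performs internally. A minimal sketch of the chunking step using only the standard library; the `chunked` helper is hypothetical and not part of openaivec.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Ten response IDs in the notebook's format, split into chunks of four.
survey_ids = [f"RESP_{i:03d}" for i in range(1, 11)]
batches = list(chunked(survey_ids, 4))
print([len(batch) for batch in batches])
```

The same slicing idea applies to a DataFrame through `df.iloc[start:start + size]`, with each processed chunk written to disk before the next one starts.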