Data teams are consistently challenged by a rapidly evolving technological landscape and escalating demands. To navigate this environment, staying attuned isn't just beneficial — it's a necessity. This means not only understanding where you stand, but also recognizing how the evolving patterns in the broader industry might align with or diverge from your own data programs.
Over the past four years, we have conducted an industry-wide DataAware Pulse Survey to capture the current state of data teams. In this article, we detail the six main patterns from the 2023 results. The objective is to provide data teams with the essential insights to refine their strategies and maintain a competitive edge.
Pattern 1: Data Teams Are Operating Over Capacity
Persistent Capacity Strain: For four consecutive years, an overwhelming 95% of data teams have reported operating at or above their work capacity, with 57% indicating they're either somewhat above or significantly over capacity.
Disproportionate Burden on Data Engineers: Delving deeper, it's evident that data engineers are feeling this strain most acutely. They're three times more likely to report feeling significantly overburdened compared to their counterparts in data architecture and data analytics.
Hiring Isn't Catching Up: The gap between organizational demands and team growth is widening, especially for data engineers. They're 27% more likely than others to report a stark disparity between the growth of their team and the escalating company demands. Simply put, the demand for data is outpacing team growth at an alarming rate.
Pause for a moment and evaluate the workload dynamics within your data team. Do these findings resonate with your current situation? Are you sensing a similar strain when it comes to capacity and workload? Especially for the data engineers on your team, is there a palpable feeling of being consistently overburdened? Compare the growth rate of your team with the increasing company demands. It's essential to discern if your team mirrors this pattern, as recognizing it is the first step toward strategic adjustments.
Pattern 2: Teams Are Stuck in Reactive Mode
Data teams, rather than driving innovation forward, find themselves constantly playing catch-up:
Maintenance Over Innovation: Alarmingly, instead of pioneering new solutions, the majority of data engineers allocate 50% or more of their time merely to maintaining existing programs and infrastructure.
Reactive Metrics: Data teams are stuck in a reactive mode. They primarily gauge their impact by the number of errors fixed — with 42% saying this metric is their top performance indicator.
The Engineer's Dilemma: Taking a closer look, engineers are the most impacted by this trend. 46% cite "number of errors fixed" as the most common metric used to measure their success, underscoring a prevalent culture of squashing bugs rather than creating value.
Reflect on the mode in which your data team currently operates. Are you finding yourselves entrenched in a reactive loop, constantly addressing errors rather than pioneering new solutions? How does your team's focus on maintenance versus innovation compare with the insights shared? Consider whether your metrics and day-to-day operations truly align with driving innovation forward or if there's an imperative need for a shift toward proactivity.
Pattern 3: Productivity Is Stagnant
Despite advances in data technology, there's a pressing concern in the data community about stagnant productivity:
Tool Sprawl: Data teams now use five tools on average in their data stack and 68% plan to maintain or increase this number in the next year. With so many tools to jump between just to do basic tasks, teams are falling further behind. Individual contributors were 2.1x more likely than their managers to say they want to cut tools out of their stack.
Rising Inefficiencies: Surprisingly, respondents are twice as likely to highlight a decline in their productivity this year when compared to previous years.
Engineers at the Forefront: Among the roles, data engineers feel the dip most acutely, reporting a 2.2x drop in their productivity year-over-year.
Tools or Hindrances?: Data engineers, in particular, identify two core issues stymieing their productivity: constraints within the tools they use and a glaring absence of effective automation.
Are you noticing a similar trend of stagnation, or perhaps even a decline, despite leveraging new technologies? Consider the experiences of your data engineers in particular. Are they confronting issues with the tools they use or lamenting a lack of impactful automation? Assessing whether your team's experiences align with these identified patterns can be a crucial step in pinpointing areas for improvement and intervention.
Pattern 4: Automation Is the Sought-After Remedy
Data professionals are seeking solutions to their productivity issues, and automation consistently emerges as a sought-after remedy:
Surge in Automation Interest: 91% of survey participants are either currently employing data automation technologies or intend to leverage them in the next 12 months.
Momentum Grows, Yet Challenges Persist: There's a marked 20% uptick in respondents who are 'very likely' to integrate data automation technologies within the coming year, and current use of these technologies has grown 110% compared to 2022. However, a disparity emerges when considering the previous year's projections: 40% professed a strong inclination to adopt automation, yet a mere 5% followed through. This stark difference suggests that, despite genuine aspiration, teams may lack the resources or expertise to adopt automation without the aid of supporting platforms.
Top-Down Approach and Disparity: Data team leadership is 74% more inclined to prioritize and fund automation projects. This automation imperative is felt acutely by larger enterprises and teams, with such entities being 1.3 times more likely to experience the setbacks from an automation deficit.
Engineers and Architects Champion Automation: Data engineers and architects are twice as likely to pivot to automation for augmenting team capabilities, as compared to their data analyst counterparts.
Take a moment to evaluate your organization's journey toward automation. Are you among the majority expressing keen interest, or have you been able to fully integrate automation solutions? Given the evident surge in interest and its perceived value, how closely does your team's progression match the general trend?
It's particularly worthwhile to probe into the reasons if there's a lag in the actual implementation despite intentions. Does your data team, especially the engineers and architects, resonate with this call for automation?
Pattern 5: Managers Bridge the Gap for Data Alignment
A significant chasm exists between data professionals on the ground and their executive counterparts, evidenced by variances in perception, objectives, and strategic approaches. This underscores the pivotal role managers and directors must play in bridging the gap and fostering alignment.
Disparate Perspectives on KPIs: A striking difference emerges when considering top impact measures. Executives gravitate toward tangible outputs: they are 3x more inclined to highlight presentations created and 1.5x more likely to name new dashboards as the primary impact measure for their role. By contrast, individual contributors prioritize error resolutions and ticket closures.
Perceptions of Time Efficiency: Executive frustration is palpable, with many feeling basic data tasks are simply taking too long. This sentiment is twice as pronounced among them than the individual contributors on their teams. Further exacerbating their concerns, executives invest an average of 17.8 hours weekly looking for data to do their jobs — 5.6 hours more than individual contributors.
Strategic Dissonance: Disagreements on major initiatives surface. For instance, individual contributors are five times more likely to dismiss plans to adopt data mesh/fabric, while executives are enthusiastic about its imminent implementation. Skepticism around data mesh benefits is also prevalent among team leads, being 2.8 times more widespread than among executives.
Tooling Preferences: A tug-of-war ensues over data stack preferences. Individual contributors are up to 2.1x more likely to favor cutting tools. Team leads favor consolidating to platforms while retaining all functions. On the flip side, executives are up to 1.6x more likely to favor adding more tools to the stack.
AI's Role: The data team sees high potential for Generative AI. The top use cases include test automation, code generation, and documentation. However, individual contributors are only a third as likely to report believing that Generative AI can increase their impact on the business.
As you reflect on your organization's structure and communication dynamics, consider the alignment between your data professionals and executive leaders. Are the differences in perspectives on KPIs, tool preferences, and strategic initiatives evident within your team?
If you're in a managerial or directorial role, assess how effectively you're mediating these distinct viewpoints. If you're an individual contributor, are the discrepancies in priorities palpable? For executives, consider the insights you gain from your middle management.
Recognizing where your organization mirrors or contrasts with this pattern can offer insights into potential areas of focus to enhance understanding and collaboration across your data teams.
Pattern 6: Cloud Costs Are Escalating
The increasing reliance on cloud infrastructure is undeniable, but with it comes the challenge of escalating costs:
Proliferation of Data Pipelines: A staggering 89% of respondents anticipate a growth in the number of data pipelines over the next year, underscoring the continual expansion of data infrastructure.
Cost-Saving Strategies Emerge: As expenses rise, teams are strategizing to curtail them. A prominent 48% of teams are leaning towards optimizing data pipelines as their top tactic to mitigate cloud computing costs, seeking efficiency in operations.
Consolidation Over Volume: Meanwhile, 38% are setting their sights on consolidating data sources. By reducing data volumes, they aim to achieve a dual goal of streamlining operations and saving costs.
Divergence in Strategies: Interestingly, while data team executives, at a rate 1.9 times higher than others, feel confident about consolidating data sources, the broader technical team expresses a preference for pipeline optimizations as their cost-control measure.
Pause for a moment to assess your organization's data strategy in the cloud. Are you experiencing a similar surge in data pipelines and the associated costs? How does your current approach to cost management align with the findings above? As expenses rise, it's crucial to contemplate if your team is adequately equipped with the strategies to optimize operations and manage costs effectively.
Recognizing prevailing patterns is essential to navigating the complexities of the industry. From teams consistently operating at capacity, to the challenges of stagnant productivity, the push for automation solutions, and the indispensable role of management in fostering alignment, each of these patterns provides a unique lens through which we can evaluate and refine our approaches to data.
As you reflect on these findings, consider how they resonate with your own experiences and challenges. Are these patterns reflective of your data team's realities? Or do you see divergences in your organization's journey? We encourage you to delve deep, assess, and recalibrate where necessary — and join the conversation by sharing your thoughts in the comments below!
```python
import json
import os
import pathlib
import subprocess
from typing import Any, Dict, List, Optional, Union


def fetch_commit_history(
    repos: Union[str, List[str], pathlib.Path],
    timeout_seconds: int = 120,
    since_date: Optional[str] = None,
    from_ref: Optional[str] = None,
    to_ref: Optional[str] = None,
) -> Dict[str, List[Dict[str, Any]]]:
    """
    Fetch commit history from one or multiple GitHub repositories using the GitHub CLI.

    Works with both public and private repositories, provided the authenticated
    user has access.
    """
    # Check that the GitHub CLI is installed
    subprocess.run(
        ["gh", "--version"],
        capture_output=True,
        check=True,
        timeout=timeout_seconds,
    )

    # Process the repos input to handle various formats
    if isinstance(repos, pathlib.Path) or (
        isinstance(repos, str) and os.path.exists(repos) and repos.endswith(".json")
    ):
        with open(repos, "r") as f:
            repos = json.load(f)
    elif isinstance(repos, str):
        repos = [repo.strip() for repo in repos.split(",")]

    results = {}
    for repo in repos:
        # Get repository info and default branch
        default_branch_cmd = subprocess.run(
            ["gh", "api", f"/repos/{repo}"],
            capture_output=True,
            text=True,
            check=True,
            timeout=timeout_seconds,
        )
        repo_info = json.loads(default_branch_cmd.stdout)
        default_branch = repo_info.get("default_branch", "main")

        # Build the API query with parameters
        api_path = f"/repos/{repo}/commits"
        query_params = ["per_page=100"]
        if since_date:
            query_params.append(f"since={since_date}T00:00:00Z")
        target_ref = to_ref or default_branch
        query_params.append(f"sha={target_ref}")
        api_url = f"{api_path}?{'&'.join(query_params)}"

        # Fetch commits using the GitHub CLI
        result = subprocess.run(
            ["gh", "api", api_url],
            capture_output=True,
            text=True,
            check=True,
            timeout=timeout_seconds,
        )
        commits = json.loads(result.stdout)
        results[repo] = commits

    return results
```
Key implementation details:
GitHub CLI integration: Uses the `gh` command-line tool for authenticated API access to both public and private repositories
Flexible input handling: Accepts single repos, comma-separated lists, or JSON files containing repository lists
Robust error handling: Validates GitHub CLI installation and repository access before attempting to fetch commits
Configurable date filtering: Supports both date-based and ref-based commit filtering
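The flexible input handling described above can be factored into a small standalone helper. This is an illustrative sketch (the name `normalize_repos` is ours, not from the actual codebase) that mirrors the parsing logic inside `fetch_commit_history`:

```python
import json
import os
import pathlib
from typing import List, Union


def normalize_repos(repos: Union[str, List[str], pathlib.Path]) -> List[str]:
    """Normalize the flexible `repos` argument into a plain list of repo names."""
    # A Path (or a string path to an existing .json file) points at a JSON list
    if isinstance(repos, pathlib.Path) or (
        isinstance(repos, str) and os.path.exists(repos) and repos.endswith(".json")
    ):
        with open(repos, "r") as f:
            return json.load(f)
    # A plain string is treated as a comma-separated list
    if isinstance(repos, str):
        return [repo.strip() for repo in repos.split(",")]
    # Otherwise assume it's already a list of repo names
    return list(repos)
```

Keeping this parsing separate makes the input handling easy to unit-test without invoking the GitHub CLI at all.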
AI-Powered Summarization
```python
import os
from typing import Optional

from openai import OpenAI

# SYSTEM_PROMPT holds the categorization prompt shown later in this post


def summarize_text(content: str, api_key: Optional[str] = None) -> str:
    """
    Summarize provided text content (e.g., commit messages) using the OpenAI API.
    """
    if not content.strip():
        return "No commit data found to summarize"

    # Get the API key from the parameter or the environment
    api_key = api_key or os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError(
            "OpenAI API key not found. Set the OPENAI_API_KEY environment variable."
        )

    client = OpenAI(api_key=api_key)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": content},
        ],
        temperature=0.1,
        max_tokens=1000,
    )
    return response.choices[0].message.content.strip()
```
```python
def summarize_commits(content: str, add_date_header: bool = True) -> str:
    """
    Summarize commit content and optionally add a date header.
    """
    summary_body = summarize_text(content)
    if add_date_header:
        # Add a header with the Monday of the current week
        now_iso = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
        monday = get_monday_of_week(now_iso)
        return f"## 🗓️ Week of {monday}\n\n{summary_body}"
    return summary_body
```
Our initial system prompt for consistent categorization:
You are a commit message organizer. Analyze the commit messages and organize them into a clear summary.
Group similar commits and format as bullet points under these categories:
- 🚀 Features
- ⚠️ Breaking changes
- 🌟 Improvements
- 🛠️ Bug fixes
- 📝 Additional changes
...
Within the Improvements section, do not simply say "Improved X" or "Fixed Y" or "Added Z" or "Removed W".
Instead, provide a more detailed and user-relevant description of the improvement or fix.
Convert technical commit messages to user-friendly descriptions and remove PR numbers and other technical IDs.
Focus on changes that would be relevant to users and skip internal technical changes.
Format specifications:
- Format entries as bullet points: "- [Feature description]"
- Use clear, user-friendly language while preserving technical terms
- For each item, convert technical commit messages to user-friendly descriptions:
- "add line" → "New line functionality has been added"
- "fix css overflow" → "CSS overflow issue has been fixed"
- Capitalize Ascend-specific terms in bullet points such as "Components"
Strictly exclude the following from your output:
- Any mentions of branches (main, master, develop, feature, etc.)
- Any mentions of AI rules such as "Added the ability to specify keywords for rules"
- Any references to branch integration or merges
- Any language about "added to branch" or "integrated into branch"
- Dependency upgrades and version bumps
…
Prompt engineering:
Structured categorization: Our prompt enforces specific emoji-categorized sections for consistent output formatting
User-focused translation: Explicitly instructs the AI to convert technical commits into user-friendly language
Content filtering: Automatically excludes dependency updates, test changes, and internal technical modifications
Low temperature setting: Uses 0.1 temperature for consistent, factual output rather than creative interpretation
Content Integration and File Management
```python
from datetime import datetime, timedelta


def get_monday_of_week(date_str: str) -> str:
    """
    Get the Monday of the week containing the given date, except for Sunday,
    which returns the following Monday.
    """
    date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%SZ")
    if date.weekday() == 6:  # Sunday: use the following Monday
        days_ahead = 1
    else:  # All other days: use the Monday of the current week
        days_ahead = -date.weekday()
    target_monday = date + timedelta(days=days_ahead)
    return target_monday.strftime("%Y-%m-%d")
```
File handling considerations:
Consistent date formatting: Automatically calculates the Monday of the current week for consistent release note headers
Encoding safety: Properly handles Unicode characters in commit messages from international contributors
Atomic file operations: Uses temporary files during processing to prevent corruption if the process is interrupted
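The atomic-write pattern mentioned above can be sketched as a write-then-rename helper. This is a minimal illustration (`atomic_write` is a hypothetical name, not from the actual codebase):

```python
import os
import tempfile


def atomic_write(path: str, content: str) -> None:
    """Write content to path atomically: write a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the destination directory so the rename
    # stays on one filesystem, which keeps os.replace atomic
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
        os.replace(tmp_path, path)  # atomic replace, even if path already exists
    except Exception:
        # Clean up the temp file if anything went wrong before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because readers only ever see the file before or after the rename, an interrupted run can never leave a half-written release notes file behind.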
GitHub Actions: Orchestrating the Automation
Our workflow ties everything together with robust automation that handles the complexities of CI/CD environments.
Workflow Triggers and Inputs
```yaml
name: Weekly Release Notes Update

on:
  workflow_dispatch:
    inputs:
      year:
        description: 'Year (YYYY) of date to start collecting releases from'
        default: '2025'
      month:
        description: 'Month (MM) of date to start collecting releases from'
        default: '01'
      day:
        description: 'Day (DD) of date to start collecting releases from'
        default: '01'
      repo_filters:
        description: 'JSON string defining filters for specific repos'
        required: false
      timeout_seconds:
        description: 'Timeout in seconds for API calls'
        default: '45'
```
Flexible triggering options:
Manual dispatch with granular date control: Separate year, month, day inputs for precise date filtering
Repository-specific filtering: JSON configuration allows different filtering strategies per repository
Configurable timeouts: Adjustable API timeout settings for different network conditions
```yaml
- name: Generate release notes
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    GITHUB_TOKEN: ${{ steps.app-token.outputs.token }}
  run: |
    CONFIG_JSON='${{ steps.repo_config.outputs.config_json }}'
    CONFIG_FILE=$(mktemp)
    echo "$CONFIG_JSON" > "$CONFIG_FILE"

    RAW_OUTPUT=$(python bin/release_notes/generate_release_notes.py \
      --repo-config-string "$(cat "$CONFIG_FILE")" \
      --timeout "${{ github.event.inputs.timeout_seconds }}")

    # Split summary and commits using delimiter
    SUMMARY=$(echo "$RAW_OUTPUT" | sed -n '1,/^### END SUMMARY ###$/p' | sed '$d')
    MONDAY_DATE=$(echo "$SUMMARY" | head -n 1 | grep -oE "[0-9]{4}-[0-9]{2}-[0-9]{2}")

    echo "monday_date=$MONDAY_DATE" >> $GITHUB_OUTPUT
    echo 'summary<<EOF' >> $GITHUB_OUTPUT
    echo "$SUMMARY" >> $GITHUB_OUTPUT
    echo 'EOF' >> $GITHUB_OUTPUT
```
Key implementation lessons:
Temporary file strategy: We learned the hard way that GitHub Actions environments can lose data between steps. Writing to temporary files solved reliability issues where data would appear blank in subsequent steps.
Complex JSON handling: Uses `jq` for safe JSON manipulation and temporary files to avoid shell quoting issues with complex JSON strings
Output parsing: Logic to split AI-generated summaries from raw commit data using delimiter markers
Robust error handling: `set -euo pipefail` ensures the script fails fast on any error, preventing silent failures
Atomic file operations: Uses temporary files and atomic moves to prevent file corruption
Branch management: Creates date-based branches for organized PR tracking
Content preservation: Carefully prepends new content while preserving existing documentation structure
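The delimiter-based output parsing can also be expressed in Python. This is an illustrative sketch mirroring the `### END SUMMARY ###` marker used in the shell step; `split_output` is our name for it, not the actual function in the codebase:

```python
DELIMITER = "### END SUMMARY ###"


def split_output(raw_output: str):
    """Split the script's combined stdout into (summary, raw_commits) at the delimiter."""
    # str.partition returns the text before, the delimiter itself, and the text after
    summary, _, commits = raw_output.partition(DELIMITER)
    return summary.strip(), commits.strip()
```

A unique, unlikely-to-occur delimiter keeps the parsing trivial even when the AI-generated summary itself contains markdown headings and bullet lists.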
Lessons Learned and Best Practices
Building this pipeline taught us valuable lessons about documentation automation that go beyond the technical implementation.
Technical Insights
File persistence matters in CI/CD environments. GitHub Actions environments can be unpredictable—always write important data to files rather than relying on environment variables or memory. We learned this the hard way when release notes would mysteriously appear blank in PRs.
API reliability requires defensive programming. Build retry logic and fallbacks for external API calls (OpenAI, GitHub). Network issues and rate limits are inevitable, especially as your usage scales.
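As a sketch of that defensive approach, a small retry decorator with exponential backoff might look like this (illustrative only; the names and defaults are ours, and a production version would catch only specific transient errors such as rate-limit or network exceptions rather than bare `Exception`):

```python
import time
from functools import wraps


def with_retries(max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a flaky callable with exponential backoff between attempts."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the original error
                    # Back off: 1x, 2x, 4x ... the base delay
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

Wrapping the OpenAI and `gh api` calls in something like this turns transient network hiccups from pipeline failures into invisible retries.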
Prompt engineering is crucial for consistent output. Spend time crafting prompts that consistently produce the format and tone you want. Small changes in wording can dramatically affect AI output quality and consistency.
Human review is essential, even with AI generation. Having team members review PRs catches edge cases, ensures quality, and builds confidence in the automated system. The goal isn't to eliminate human oversight—it's to make it more efficient and focused.
Historical tracking and product evolution insights. Automated generation creates a consistent record of product evolution that's valuable for retrospectives, planning, and onboarding new team members.
Results and Impact
The automation has fundamentally transformed our release process and team dynamics:
Quantifiable Improvements
Dramatic time savings: Reduced release note creation from 2-3 hours of writing time to 15 minutes of review time. That's a 90% reduction in effort while improving quality and consistency.
Perfect consistency: Every release now has properly formatted, comprehensive notes. No more missed releases or inconsistent formatting across different team members.
Increased frequency: We can now generate release notes weekly, providing users with more timely updates about product improvements.
Complete coverage: Captures changes across all repositories without manual coordination, eliminating the risk of missing important updates.
Next Steps and Future Enhancements
We're continuously improving the pipeline based on team feedback and evolving needs:
Immediate Roadmap
Slack integration: Building a Slackbot to automatically share release notes with our community channels, extending the reach beyond just documentation updates.
Repository tracing: Categorize the raw commits by repository and add links so it's easy to (literally) double-click into each PR for additional context.
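That per-repository grouping could be built directly from the `fetch_commit_history` results. This is a hypothetical sketch (`format_commit_links` is our name; the link format assumes standard github.com commit URLs):

```python
from typing import Any, Dict, List


def format_commit_links(results: Dict[str, List[Dict[str, Any]]]) -> str:
    """Group raw commits by repository and render them as markdown links."""
    lines = []
    for repo, commits in results.items():
        lines.append(f"### {repo}")
        for commit in commits:
            sha = commit["sha"]
            # Keep only the first line (subject) of the commit message
            subject = commit["commit"]["message"].splitlines()[0]
            lines.append(f"- [{sha[:7]}](https://github.com/{repo}/commit/{sha}): {subject}")
    return "\n".join(lines)
```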
Future Possibilities
Multi-language support: Generating release notes in different languages for global audiences as we expand internationally.
Ready to automate your own release notes? Start with the requirements above and build incrementally. Begin with a single repository, get the basic workflow running, then expand to multiple repos and add advanced features. Your future self (and your team) will thank you for eliminating this manual drudgery and creating a more consistent, professional release process.
The investment in automation pays dividends immediately—not just in time saved, but in the improved quality and consistency of your user communication. In a time where software moves fast, automated release notes ensure your documentation keeps pace.