20 Data Engineering Interview Questions You'll Actually Get Asked in 2026

Real data engineering interview questions across SQL, Spark, Kafka, AWS, and system design. Plus what interviewers expect at each seniority level.

Most "data engineering interview questions" lists are useless. They give you 50 generic questions with no context on what the interviewer actually expects to hear, how the question changes based on your seniority level, or what separates a passing answer from a strong one.

This list is different. These are 20 questions pulled from patterns I've seen across real interview loops at companies that hire data engineers seriously. For each one, I'll tell you what the interviewer is testing for and where most candidates fall short.

Grouped by topic. Let's go.

SQL (Still the #1 Topic)

If you think SQL questions are too basic for senior roles, you're in for a surprise. Every company tests SQL. The questions just get harder.

1. "Write a query to find the second highest salary in each department."

What they're testing: Window functions. Specifically, whether you reach for ROW_NUMBER(), RANK(), or DENSE_RANK() and whether you know the difference.

Where candidates struggle: Most people get the window function right but forget to handle ties. If two people share the highest salary, what's the "second highest"? Your choice of RANK vs DENSE_RANK answers that, and the interviewer wants you to call it out proactively.
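
To make the tie question concrete, here is a minimal sketch using Python's built-in sqlite3 module (it assumes a SQLite build with window function support, 3.25 or newer). The employees table and its rows are made up for illustration. Two people tie for the top salary in one department, so RANK and DENSE_RANK disagree about who is "second".

```python
import sqlite3

# Hypothetical sample data: "ana" and "bo" tie for the top salary in eng.
rows = [
    ("eng", "ana", 200), ("eng", "bo", 200), ("eng", "cy", 150),
    ("ops", "di", 120), ("ops", "ed", 100),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (dept TEXT, name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

def second_highest(rank_fn):
    """Return (dept, name, salary) rows ranked 2 within each department."""
    sql = f"""
        SELECT dept, name, salary FROM (
            SELECT dept, name, salary,
                   {rank_fn}() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
            FROM employees
        )
        WHERE rnk = 2
        ORDER BY dept, name
    """
    return conn.execute(sql).fetchall()

dense = second_highest("DENSE_RANK")  # ties share rank 1, next salary is rank 2
plain = second_highest("RANK")        # ties share rank 1, next salary is rank 3
print(dense)
print(plain)
```

With DENSE_RANK the eng department returns cy at 150. With RANK it returns nobody, because the ranks there go 1, 1, 3. ROW_NUMBER would return one of the two tied top earners, picked arbitrarily. Saying which of those behaviors you want is the part interviewers listen for.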

2. "You have a query that ran fine for months but suddenly takes 20 minutes. Walk me through debugging it."

What they're testing: Operational SQL knowledge. Reading execution plans, spotting stale statistics, telling full scans from index seeks, checking whether partition pruning still applies, and recognizing data skew.

Where candidates struggle: They jump to "add an index" without diagnosing first. The best answers walk through a systematic approach: check the execution plan, check if data volume changed, check if statistics are stale, check for lock contention. Interviewers want to see a process, not a guess.
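
The "read the plan first" habit can be sketched on a toy scale with SQLite's EXPLAIN QUERY PLAN. The events table, the index name, and the query are hypothetical, and production warehouses have their own EXPLAIN syntax and output, so treat this as an illustration of the workflow and not a recipe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, ts) VALUES (?, ?)",
    [(i % 100, f"2026-01-{(i % 28) + 1:02d}") for i in range(1000)],
)

def plan(sql):
    """Return the planner's step descriptions for a query as one string."""
    steps = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(step[-1] for step in steps)  # last column is the detail text

query = "SELECT count(*) FROM events WHERE user_id = 42"

before = plan(query)  # step 1: look at what the engine is doing today
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
conn.execute("ANALYZE")  # refresh planner statistics
after = plan(query)   # step 2: confirm the change actually altered the plan

print("before:", before)
print("after: ", after)
```

The exact wording of the plan varies by SQLite version, but the first plan reports a scan of the table and the second reports a search using the new index. The point is the order of operations: inspect, change one thing, inspect again.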

3. "Explain the difference between a slowly changing dimension Type 1, Type 2, and Type 3. When would you use each?"

What they're testing: Data modeling fundamentals. This comes up in almost every interview loop that involves warehousing.

Where candidates struggle: Most people can define all three types. Fewer can explain when Type 2 is overkill (low-value attributes that nobody queries historically) or when Type 3 makes more sense than Type 2 (when you only care about "previous" vs "current" and the full history is noise).
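
If it helps to see the three types side by side, here is a small in-memory sketch. The customer and city fields, the date strings, and the function names are all invented for illustration. A real warehouse would do this with MERGE statements or a transformation tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Type2Row:
    customer_id: int
    city: str
    valid_from: str
    valid_to: Optional[str] = None  # None marks the current version

def apply_type1(row: dict, new_city: str) -> dict:
    """Type 1: overwrite in place, no history kept."""
    return {**row, "city": new_city}

def apply_type2(history: list, customer_id: int, new_city: str, change_date: str) -> list:
    """Type 2: close the current row and append a new version."""
    for r in history:
        if r.customer_id == customer_id and r.valid_to is None:
            r.valid_to = change_date
    history.append(Type2Row(customer_id, new_city, change_date))
    return history

def apply_type3(row: dict, new_city: str) -> dict:
    """Type 3: keep only the previous value next to the current one."""
    return {**row, "previous_city": row["city"], "city": new_city}

t1 = apply_type1({"customer_id": 1, "city": "Oslo"}, "Bergen")
t2 = apply_type2([Type2Row(1, "Oslo", "2025-01-01")], 1, "Bergen", "2026-03-01")
t3 = apply_type3({"customer_id": 1, "city": "Oslo"}, "Bergen")
```

Type 1 ends with one row and no trace of Oslo. Type 2 ends with two rows, one closed and one open. Type 3 ends with one row carrying both values. That difference in row count and column count is what drives the "when is Type 2 overkill" discussion.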

4. "How would you deduplicate a table with 2 billion rows where the same event can arrive multiple times?"

What they're testing: Whether you can think about scale. A solution that works on 1,000 rows might choke on 2 billion.

Where candidates struggle: They propose SELECT DISTINCT or GROUP BY without considering memory and sort spill. Strong answers discuss partitioning the dedup (by date, by key range), using window functions with ROW_NUMBER() to pick one representative row, or handling it upstream in the ingestion layer.
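
A rough sketch of the "partition the dedup, then pick one row with ROW_NUMBER()" idea, again on SQLite. The table, column names, and rows are hypothetical, and the sketch assumes every duplicate of an event lands in the same event_date slice. If late duplicates can cross date boundaries, you would need a wider window or an upstream fix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id TEXT, event_date TEXT, ingested_at INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("e1", "2026-01-01", 10, "a"),
        ("e1", "2026-01-01", 20, "a-retry"),  # duplicate delivery of e1
        ("e2", "2026-01-01", 15, "b"),
        ("e3", "2026-01-02", 30, "c"),
        ("e3", "2026-01-02", 31, "c-retry"),  # duplicate delivery of e3
    ],
)

def dedup_one_day(event_date):
    """Keep the earliest-ingested row per event_id, touching one date slice only."""
    sql = """
        SELECT event_id, payload FROM (
            SELECT event_id, payload,
                   ROW_NUMBER() OVER (
                       PARTITION BY event_id ORDER BY ingested_at
                   ) AS rn
            FROM events
            WHERE event_date = ?
        )
        WHERE rn = 1
        ORDER BY event_id
    """
    return conn.execute(sql, (event_date,)).fetchall()

day1 = dedup_one_day("2026-01-01")
day2 = dedup_one_day("2026-01-02")
```

Processing one date at a time keeps each sort small enough to avoid the spill that a single global DISTINCT over 2 billion rows would trigger.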

Apache Spark

Spark questions separate people who've used Spark from people who've read about Spark.

5. "Your Spark job keeps failing with an OOM error on the executors. What do you check?"

What they're testing: Real operational experience. This is the kind of problem every Spark user hits eventually.

Where candidates struggle: Listing configuration knobs (spark.executor.memory, spark.memory.fraction) without diagnosing the root cause first. The interviewer wants to hear: check for data skew (one partition way bigger than others), check for an oversized broadcast (and remember that an accidental collect() blows up the driver, not the executors), check if a shuffle or join is exploding the row count, look at the Spark UI stage details. Configuration tuning is the last step, not the first.
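
The skew check is easy to reason about without a cluster. The sketch below is plain Python, not Spark: it buckets integer keys with a simple modulo (a stand-in for a hash partitioner) and compares the largest bucket to the median. The key distribution is invented. On a real job you would read the same signal from per-task input sizes in the Spark UI.

```python
from collections import Counter
from statistics import median

def skew_report(keys, num_partitions):
    """Bucket integer keys by modulo and return (largest bucket, median bucket)."""
    sizes = Counter(k % num_partitions for k in keys)
    counts = [sizes.get(p, 0) for p in range(num_partitions)]
    return max(counts), median(counts)

# Hypothetical workload: one hot key (0) dominates, 100 other keys appear once.
keys = [0] * 900 + list(range(1, 101))
largest, typical = skew_report(keys, num_partitions=4)
ratio = largest / typical
print(largest, typical, ratio)
```

One partition holding dozens of times the median is the pattern that produces a single executor running out of memory while the others sit idle. Raising spark.executor.memory does not change that ratio.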

6. "Explain the difference between narrow and wide transformations. Why does it matter?"

What they're testing: Whether you understand Spark's execution model at a level deeper than the API.

Where candidates struggle: They define both correctly but can't explain the practical impact. The real answer is about shuffle boundaries and stage planning. Wide transformations create shuffle dependencies, which mean network I/O, disk spill, and new stages. This matters because it determines how your job performs at scale, how it recovers from failures (narrow = recompute just the lost partition from its parent, wide = lost shuffle output can force the upstream stage to rerun), and where your bottlenecks will appear.
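
Here is a toy model of the two dependency types, in plain Python with lists standing in for partitions. Nothing here is Spark API. The function names and the modulo partitioner are assumptions made for the sketch.

```python
def narrow_map(partitions, fn):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[fn(x) for x in part] for part in partitions]

def wide_group_by_key(partitions, num_out):
    """Wide: an output partition may need records from every input partition."""
    out = [dict() for _ in range(num_out)]
    moved = 0
    for src, part in enumerate(partitions):
        for key, value in part:
            dst = key % num_out  # stand-in for a hash partitioner
            if dst != src:
                moved += 1       # this record crosses a partition boundary
            out[dst].setdefault(key, []).append(value)
    return out, moved

parts = [[(0, "a"), (1, "b")], [(0, "c"), (1, "d")]]
mapped = narrow_map(parts, lambda kv: (kv[0], kv[1].upper()))
grouped, moved = wide_group_by_key(mapped, num_out=2)
print(mapped)
print(grouped, moved)
```

The map step never looks outside its own partition, so it needs no data movement and can be pipelined with other narrow steps. The grouping step has to move half the records in this example, and that movement is the shuffle boundary where a new stage starts.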

7. "When would you use repartition() vs coalesce()?"

What they're testing: Whether you understand shuffling costs in practice.

Where candidates struggle: They know coalesce() avoids a full shuffle but don't explain when you'd actually want the full shuffle that repartition() gives you. If your data is heavily skewed after a filter, coalesce() will just merge the skew into fewer partitions. repartition() redistributes evenly. The choice depends on what you're about to do next.
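
The skew point can be shown with a simplified model. These two functions are plain Python imitations, not Spark's implementations: the coalesce here merges neighboring partitions whole, and the repartition deals rows out round-robin. Spark's real coalesce also weighs data locality when grouping partitions, which this sketch ignores.

```python
def coalesce(partitions, n):
    """Merge whole partitions into n groups without moving individual rows."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i * n // len(partitions)].extend(part)
    return out

def repartition(partitions, n):
    """Full shuffle: deal every row round-robin across n partitions."""
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out

# Hypothetical post-filter state: one partition kept 90 rows, three kept 1 each.
skewed = [list(range(90)), [90], [91], [92]]
coalesced_sizes = [len(p) for p in coalesce(skewed, 2)]
repartitioned_sizes = [len(p) for p in repartition(skewed, 2)]
print(coalesced_sizes, repartitioned_sizes)
```

Coalescing leaves one partition with 91 rows and the other with 2. Repartitioning pays for a full shuffle and ends up with 47 and 46. Which one is right depends on whether the next step is a cheap write or an expensive per-partition computation.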

Kafka and Streaming

Streaming questions are getting more common as more companies move to real-time or near-real-time architectures.

8. "How do you guarantee exactly-once processing with Kafka?"

What they're testing: Distributed systems understanding. This question is harder than it sounds.

Where candidates struggle: Saying "use exactly-once semantics in Kafka" as if it's a toggle you flip. The real answer involves idempotent producers (handling retries without duplicate writes), transactional producers (atomic writes across partitions), and the consumer side (committing offsets atomically with your processing). And even then, exactly-once only holds within the Kafka ecosystem. Once you write to an external sink, you need idempotent writes there too or you're back to at-least-once.
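
The external-sink caveat is worth seeing in miniature. The sketch below does not use a Kafka client. It simulates a consumer that crashes before committing offsets and then replays from the start, writing to two hypothetical sinks: one that appends blindly and one that upserts on the record's (topic, partition, offset) coordinates.

```python
class AppendSink:
    """Hypothetical sink that appends every write it receives."""
    def __init__(self):
        self.rows = []

    def write(self, topic, partition, offset, value):
        self.rows.append(value)

class IdempotentSink:
    """Hypothetical sink keyed by (topic, partition, offset)."""
    def __init__(self):
        self.rows = {}

    def write(self, topic, partition, offset, value):
        # Upsert: a redelivered record overwrites itself instead of duplicating.
        self.rows[(topic, partition, offset)] = value

def deliver_at_least_once(sink, records, crash_after):
    """Process some records, 'crash' before committing, then replay everything."""
    for rec in records[:crash_after]:
        sink.write(*rec)
    for rec in records:  # restart: offsets were never committed
        sink.write(*rec)

records = [("orders", 0, 0, "a"), ("orders", 0, 1, "b"), ("orders", 0, 2, "c")]

append_sink, idem_sink = AppendSink(), IdempotentSink()
deliver_at_least_once(append_sink, records, crash_after=2)
deliver_at_least_once(idem_sink, records, crash_after=2)
print(len(append_sink.rows), len(idem_sink.rows))
```

The append sink ends up with five rows for three records. The keyed sink ends up with three. In practice the key is often a business identifier, but the record coordinates work when nothing better exists.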

9. "Your consumer group is lagging behind by 2 hours. How do you catch up without losing data?"

What they're testing: Operational maturity. Lag is one of the most common problems teams hit with Kafka in production.

Where candidates struggle: Jumping to "add more consumers." That's often right, but only if the topic has enough partitions. If you have 10 partitions and 10 consumers, adding an 11th does nothing. Strong answers discuss: checking if one partition is hotter than others (producer-side key skew), increasing partitions (with caveats about rebalancing and ordering), optimizing consumer processing time, or temporarily running a parallel catch-up consumer group that reads from the earliest offset.
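
A back-of-the-envelope model makes the partition ceiling obvious. The rates below are invented, and the function assumes load is spread evenly across partitions, which is exactly what key skew breaks.

```python
def catch_up_seconds(backlog_msgs, produce_rate, per_consumer_rate, consumers, partitions):
    """Estimate seconds to drain a backlog; None if the group can never catch up."""
    active = min(consumers, partitions)  # consumers beyond the partition count sit idle
    drain_rate = active * per_consumer_rate - produce_rate
    if drain_rate <= 0:
        return None
    return backlog_msgs / drain_rate

# Hypothetical numbers: 1,000 msg/s produced, 2 hours behind, 150 msg/s per consumer.
backlog = 1000 * 2 * 3600

ten = catch_up_seconds(backlog, 1000, 150, consumers=10, partitions=10)
eleven = catch_up_seconds(backlog, 1000, 150, consumers=11, partitions=10)
scaled = catch_up_seconds(backlog, 1000, 150, consumers=20, partitions=20)
stuck = catch_up_seconds(backlog, 1000, 150, consumers=5, partitions=10)
print(ten, eleven, scaled, stuck)
```

Ten and eleven consumers on ten partitions both take 14,400 seconds. Doubling partitions and consumers cuts that to 3,600. Five consumers never catch up, because they drain slower than producers write. Running numbers like these out loud is a quick way to show you understand why "add more consumers" has a ceiling.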

10. "What happens if a Kafka broker goes down? Walk through the failure scenario."

What they're testing: Whether you understand replication and leadership at a real level.

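Where candidates struggle: Vague answers about "replicas take over." The interviewer wants specifics. Which component detects the failure? (The controller: a KRaft controller on current clusters, or a ZooKeeper-backed controller on older ones. ZooKeeper mode was removed in Kafka 4.0.) What happens to partitions whose leader was on that broker? (Leader election from the ISR set.) What if the ISR only had one replica? (With unclean.leader.election.enable=false the partition stays offline until that replica returns. With it set to true, an out-of-sync replica can take over and you lose data.) How do producers behave during the election? (Sends to the affected partitions fail and are retried until a new leader is known. With acks=1, writes the old leader acknowledged but had not yet replicated can be lost. Retries can create duplicates unless idempotence is enabled.)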
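
The ISR cases can be walked through with a deliberately simplified model. This is not how the controller is implemented. The function, its arguments, and the "first eligible replica wins" rule are assumptions made to keep the sketch short.

```python
def elect_leader(replicas, isr, failed_broker, unclean_election=False):
    """Return (new_leader, new_isr) after failed_broker dies; leader None means offline."""
    surviving_isr = [b for b in isr if b != failed_broker]
    if surviving_isr:
        return surviving_isr[0], surviving_isr  # clean election from the ISR
    out_of_sync = [b for b in replicas if b != failed_broker and b not in isr]
    if unclean_election and out_of_sync:
        # An out-of-sync replica takes over: anything it had not replicated is lost.
        return out_of_sync[0], [out_of_sync[0]]
    return None, []  # partition stays offline until an in-sync replica returns

healthy = elect_leader([1, 2, 3], isr=[1, 2, 3], failed_broker=1)
offline = elect_leader([1, 2, 3], isr=[1], failed_broker=1)
lossy = elect_leader([1, 2, 3], isr=[1], failed_broker=1, unclean_election=True)
print(healthy, offline, lossy)
```

The three results map to the three answers an interviewer is looking for: a normal failover, an availability loss you chose on purpose, and a durability loss you chose on purpose.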

AWS (The Most Common Cloud in DE Interviews)

11. "Compare Glue, EMR, and Lambda for a batch processing use case. When do you pick each?"

What they're testing: Cloud architecture judgment. They want to see you make trade-offs, not just describe services.

Where candidates struggle: Giving a feature comparison instead of a decision framework. A strong answer anchors on data volume, processing complexity, cost model, and team familiarity. Glue for managed Spark jobs under ~1TB where you want zero infrastructure management. EMR when you need custom Spark configs, large clusters, or long-running workloads. Lambda for lightweight transforms that fit inside its 15-minute timeout and 10GB memory limit. Then mention cost: Glue charges per DPU-hour (expensive at scale), EMR gives you EC2 pricing (cheaper but you manage it), Lambda is dirt cheap for small jobs.

12. "Your S3-to-Redshift pipeline is loading duplicate records. How do you fix it?"

What they're testing: Debugging skills plus understanding of how COPY, Glue, and Redshift interact.

Where candidates struggle: Proposing a DISTINCT on the Redshift side without investigating the root cause. Is the COPY running twice because a Glue job retried? Is the S3 path overlapping between runs? Is the manifest file including the same files? Strong answers trace the problem from source to sink instead of slapping a bandaid on the destination.
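
One way to trace the problem from the source side is to compare what each run actually loaded. The sketch below assumes you can get a list of S3 keys per run (from manifests or load logs). The run IDs and keys are invented.

```python
def overlapping_files(run_manifests):
    """Return {s3_key: [run_ids]} for every key loaded by more than one run."""
    seen = {}
    for run_id, keys in run_manifests.items():
        for key in keys:
            seen.setdefault(key, []).append(run_id)
    return {key: runs for key, runs in seen.items() if len(runs) > 1}

# Hypothetical manifests: run-2 started from a prefix that overlaps run-1.
manifests = {
    "run-1": ["s3://bucket/events/2026-01-01/part-0.parquet",
              "s3://bucket/events/2026-01-01/part-1.parquet"],
    "run-2": ["s3://bucket/events/2026-01-01/part-1.parquet",
              "s3://bucket/events/2026-01-02/part-0.parquet"],
}

dupes = overlapping_files(manifests)
print(dupes)
```

If this comes back non-empty, the duplicates come from overlapping inputs or retried runs, and the fix belongs in the orchestration layer. If it comes back empty, the duplicates are already in the source files and you look further upstream.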

13. "How would you set up a data lake on S3 with proper partitioning for a 10TB/day clickstream?"

What they're testing: Large-scale design thinking with cost awareness.

Where candidates struggle: Forgetting about file size and partition granularity. Partitioning by year/month/day/hour sounds reasonable until you realize that gives you 8,760 partition prefixes per year (365 days x 24 hours), and each one fills up with small files if your writers flush frequently. Small files kill query performance on Athena and Spark. Strong answers discuss compaction jobs, target file sizes (128MB-1GB for Parquet), the trade-off between partition granularity and file count, and using a table format like Iceberg or Delta to manage this.
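
The file-count trade-off is simple arithmetic, and doing it on the spot tends to land well. This sketch assumes the 10TB/day figure is the stored size, spread evenly over 24 hourly partitions. The 1MB file size is a made-up example of what an uncompacted streaming writer might produce.

```python
import math

TB = 1024 ** 4
MB = 1024 ** 2

def files_per_partition(bytes_per_day, partitions_per_day, file_size_bytes):
    """Number of files needed to hold one partition at a given file size."""
    per_partition = bytes_per_day / partitions_per_day
    return math.ceil(per_partition / file_size_bytes)

hourly_prefixes_per_year = 365 * 24

compacted = files_per_partition(10 * TB, 24, 512 * MB)  # files per hour at 512MB
tiny = files_per_partition(10 * TB, 24, 1 * MB)         # files per hour at 1MB
print(hourly_prefixes_per_year, compacted, tiny)
```

At 512MB per file, each hourly partition holds a few hundred files. At 1MB per file, the same hour holds several hundred thousand, and every query planner has to list and open them. That gap is the argument for a compaction job.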

System Design

This is where interviews get hard. There's no single right answer, and the interviewer cares more about your process than your diagram.

14. "Design a real-time fraud detection pipeline."

What they're testing: End-to-end design under real constraints. Latency requirements, accuracy trade-offs, and operational concerns.

Where candidates struggle: Jumping to Kafka + Spark Streaming without establishing requirements first. How fast does "real-time" need to be? (100ms? 5 seconds? 1 minute? These lead to completely different architectures.) What's the false positive tolerance? What happens during a burst? Strong answers start with questions, state assumptions, then build layer by layer.

15. "Your analytics warehouse query performance has degraded 10x over the past quarter. Design a solution."

What they're testing: Whether you can diagnose AND design, not just build greenfield.

Where candidates struggle: Proposing a full migration to a new tool instead of investigating root causes. Maybe it's a missing partition prune. Maybe table statistics are stale. Maybe someone added a new dashboard that runs expensive queries every 5 minutes. The interviewer wants to see you investigate before you rebuild.

16. "Design a data pipeline for a company that has 50 microservices producing events, each with its own schema."

What they're testing: Schema management, evolution, and governance at scale.

Where candidates struggle: Ignoring the schema problem entirely and just designing a Kafka-to-warehouse pipeline. The hard part of this question is the "50 services, each with its own schema" bit. How do you handle schema evolution? (Schema registry.) How do you handle breaking changes? (Compatibility modes.) How do you maintain a unified data model downstream? (Mapping layer, canonical events.) If you skip this, you've missed the point of the question.
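
To ground the "compatibility modes" answer, here is a simplified backward-compatibility check in the spirit of what a schema registry enforces. The schema shape (a dict of field name to type and optional default) and the rules are assumptions for this sketch. Real registries follow the rules of the serialization format. Avro, for example, also allows certain type promotions that this check would reject.

```python
def backward_compat_problems(old_fields, new_fields):
    """List reasons a reader on the new schema could not decode old data."""
    problems = []
    for name, spec in new_fields.items():
        if name not in old_fields:
            if "default" not in spec:
                problems.append(f"added field '{name}' has no default")
        elif old_fields[name]["type"] != spec["type"]:
            problems.append(f"field '{name}' changed type")
    return problems  # removed fields are fine: the new reader just ignores them

# Hypothetical order event schemas.
v1 = {"order_id": {"type": "string"}, "amount": {"type": "double"}}
v2_ok = {**v1, "currency": {"type": "string", "default": "USD"}}
v2_bad = {"order_id": {"type": "long"}, "amount": {"type": "double"},
          "currency": {"type": "string"}}

ok_problems = backward_compat_problems(v1, v2_ok)
bad_problems = backward_compat_problems(v1, v2_bad)
print(ok_problems)
print(bad_problems)
```

Running a check like this at publish time, per service, is what lets 50 teams evolve their schemas independently without breaking the consumers downstream.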

Behavioral (Yes, These Matter)

17. "Tell me about a time a pipeline failed in production. What happened and how did you handle it?"

What they're testing: Incident response maturity. Ownership. Communication.

Where candidates struggle: Either picking a trivial example ("a cron job ran late") or not following the STAR structure (Situation, Task, Action, Result). The best answers involve a real outage with business impact, clear ownership of the fix, communication to stakeholders, and a post-mortem with preventive measures. Bonus points if you changed a process afterward, not just fixed the code.

18. "Describe a situation where you disagreed with a teammate about a technical approach."

What they're testing: Collaboration and influence. Can you disagree without being a jerk? Can you change your mind when you're wrong?

Where candidates struggle: Either saying "I convinced them I was right" (sounds arrogant) or "I just went along with their approach" (sounds passive). The ideal answer shows you advocated clearly for your position with data or evidence, listened to their perspective, and made a decision together. Sometimes you won, sometimes you didn't, and both are fine.

19. "Why are you leaving your current role?"

What they're testing: Self-awareness and professionalism.

Where candidates struggle: Badmouthing their current employer. Even if your manager is terrible and the codebase is a disaster, the interview isn't the place to air it out. Frame it around what you're moving toward, not what you're running from. "I want to work with larger-scale data systems" or "I'm looking for more ownership over the full pipeline" land much better than "our tech stack is outdated."

20. "What questions do you have for me?"

What they're testing: Whether you've thought about this job specifically or just applied everywhere.

Where candidates struggle: Asking nothing, or asking generic questions they could Google. Strong questions show you've researched the company and care about the specifics: "What does your data platform team's on-call rotation look like?" or "How does your team handle schema changes across microservices?" These show genuine engagement and technical curiosity.

How to Actually Prepare

Reading through a list of questions is a start. But reading isn't practice.

The gap between knowing an answer and delivering it well in an interview is huge. You need to actually say your answers out loud, get feedback on whether you're hitting the right depth for your seniority level, and find out which areas you're weaker in than you think.

That's what HireBench is built for. You pick a track (SQL, Spark, Kafka, AWS, system design, behavioral), answer questions in interview conditions, and get scored against rubrics written by senior engineers. Not "great job!" feedback, but specific rubric checkpoints showing which concepts you hit and which you missed. The kind of feedback that actually tells you where to focus your prep.

Try a few questions free and see where you stand.