Most asked interview questions for data analysis

The most asked interview questions for data analysis, with a guide to answering each one: a compilation of the questions employers most often ask candidates for data analysis positions.

MOST ASKED INTERVIEW QUESTIONS FOR DATA ANALYSIS

SQL

1 How do you find duplicate records in a table?

To find duplicates, GROUP BY the column (or columns) that should be unique, then use HAVING to keep only the groups with a count greater than 1.

Example:

sql

SELECT email, COUNT(*)

FROM users

GROUP BY email

HAVING COUNT(*) > 1;

This returns every email that appears more than once in the users table, along with how many times it appears.
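The query above can be tried end to end with Python's built-in sqlite3 module. This is a minimal sketch: the users table layout and the sample emails are made up for illustration, and the second query (listing the full rows behind each duplicate) is an extra step that is often useful when deciding which copy to keep.

```python
import sqlite3

# In-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("ann@example.com",), ("bob@example.com",), ("ann@example.com",)],
)

# Emails that occur more than once, with their counts.
duplicates = conn.execute(
    """
    SELECT email, COUNT(*) AS occurrences
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
    """
).fetchall()

# Full rows behind each duplicate email.
duplicate_rows = conn.execute(
    """
    SELECT id, email
    FROM users
    WHERE email IN (SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1)
    ORDER BY email, id
    """
).fetchall()

print(duplicates)
print(duplicate_rows)
```

Grouping on several columns (for example first name, last name, and birth date) works the same way when no single column defines a duplicate.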

2 Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN

INNER JOIN: Returns rows where there’s a match in both tables

sql

SELECT * FROM orders

INNER JOIN customers ON orders.customer_id = customers.id;

LEFT JOIN: Returns all rows from the left table and matched rows from the right (unmatched rows get NULL)

sql

SELECT * FROM employees

LEFT JOIN departments ON employees.dept_id = departments.id;

RIGHT JOIN: Opposite of LEFT JOIN (all rows from the right table)

FULL OUTER JOIN: Returns all rows from both tables, with NULL where no match exists

sql

SELECT * FROM tableA

FULL OUTER JOIN tableB ON tableA.id = tableB.id;
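To see how the join types differ on the same data, the sketch below runs an INNER JOIN and a LEFT JOIN with Python's built-in sqlite3 module. The tables and rows are invented for illustration. Because older SQLite versions lack RIGHT JOIN and FULL OUTER JOIN, the right join is shown as its equivalent LEFT JOIN with the table order swapped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    -- Order 11 points at a customer that does not exist.
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 3, 20.0);
    """
)

# INNER JOIN: only orders with a matching customer.
inner_rows = conn.execute(
    "SELECT o.id, c.name FROM orders o "
    "INNER JOIN customers c ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# LEFT JOIN: every order; the customer name is NULL (None) when unmatched.
left_rows = conn.execute(
    "SELECT o.id, c.name FROM orders o "
    "LEFT JOIN customers c ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# Equivalent of orders RIGHT JOIN customers: swap the tables and use LEFT JOIN.
right_equiv_rows = conn.execute(
    "SELECT o.id, c.name FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

print(inner_rows)
print(left_rows)
print(right_equiv_rows)
```

Bob has no orders, so he appears only in the right-join equivalent; order 11 has no customer, so it appears only in the left join.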

3 How do you optimize a slow-running query? What are window functions, and how do you use them for ranking, running totals, or moving averages?

Optimization:

Add indexes on frequently queried columns

Avoid SELECT *; fetch only needed columns

Use EXPLAIN to analyze the query execution plan

Window Functions:

Ranking:

sql

SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS salary_rank FROM employees;

Running Total:

sql

SELECT date, revenue, SUM(revenue) OVER (ORDER BY date) AS running_total

FROM sales;

7-Day Moving Average:

sql

SELECT date, AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7d

FROM sales;
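The three window-function patterns can be combined in one query. The sketch below uses Python's built-in sqlite3 module and assumes the bundled SQLite is version 3.25 or later, which is when window functions were added. The sales rows are invented, and a 2-row moving average stands in for the 7-day one so the result is easy to check by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-01", 100.0), ("2024-01-02", 300.0), ("2024-01-03", 200.0)],
)

rows = conn.execute(
    """
    SELECT
        date,
        revenue,
        RANK() OVER (ORDER BY revenue DESC) AS revenue_rank,
        SUM(revenue) OVER (ORDER BY date) AS running_total,
        AVG(revenue) OVER (
            ORDER BY date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
        ) AS moving_avg_2d
    FROM sales
    ORDER BY date
    """
).fetchall()

for row in rows:
    print(row)
```

Unlike GROUP BY, a window function keeps every input row and adds the computed value alongside it, which is why ranking, running totals, and moving averages can sit next to the raw revenue.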

4 What are Common Table Expressions (CTEs), and how do they simplify complex queries?

CTEs simplify complex queries by breaking them into named temporary result sets

Example:

sql

WITH high_sales AS (

SELECT employee_id, SUM(amount) AS total

FROM orders

GROUP BY employee_id

HAVING SUM(amount) > 10000

)

SELECT * FROM employees

WHERE id IN (SELECT employee_id FROM high_sales);

5 How would you filter results using a subquery in another table?

Use IN, EXISTS, or JOIN with a subquery

Example:

sql

SELECT * FROM products

WHERE category_id IN (

SELECT id FROM categories WHERE name = 'Electronics'

);

6 How do you calculate a rolling 7-day average in SQL?

Use a window function with a frame clause:

sql

SELECT date, AVG(sales) OVER (

ORDER BY date

ROWS BETWEEN 6 PRECEDING AND CURRENT ROW

) AS rolling_avg

FROM daily_sales;
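It helps to know exactly what the frame clause computes. The pure-Python sketch below mimics ROWS BETWEEN 6 PRECEDING AND CURRENT ROW on a made-up list of daily values; the function name and the sample numbers are illustrative only.

```python
from collections import deque


def rolling_average(values, window=7):
    """Mimic AVG(...) OVER (ROWS BETWEEN window-1 PRECEDING AND CURRENT ROW)."""
    recent = deque(maxlen=window)  # holds at most the last `window` values
    averages = []
    for value in values:
        recent.append(value)
        # Early rows average over fewer than `window` values, as in SQL.
        averages.append(sum(recent) / len(recent))
    return averages


daily_sales = [10, 20, 30, 40, 50, 60, 70, 80]
result = rolling_average(daily_sales)
print(result)
```

Two details are worth mentioning in an interview: the first six rows average fewer than seven values, and ROWS counts rows, not calendar days, so the result is a true 7-day average only when the table has exactly one row per day.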

7 What’s the difference between WHERE and HAVING? When should you use each? How do you pivot (transpose) a table in SQL?

WHERE: Filters rows before aggregation

sql

SELECT dept_id, AVG(salary)

FROM employees

WHERE salary > 50000

GROUP BY dept_id;

HAVING: Filters after aggregation

sql

SELECT dept_id, AVG(salary)

FROM employees

GROUP BY dept_id

HAVING AVG(salary) > 70000;

Pivoting: Use conditional aggregation with CASE (works in any database) or the PIVOT operator (SQL Server)

sql

SELECT

year,

SUM(CASE WHEN month = 'Jan' THEN revenue END) AS Jan,

SUM(CASE WHEN month = 'Feb' THEN revenue END) AS Feb

FROM sales

GROUP BY year;
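The CASE-based pivot can be checked with Python's built-in sqlite3 module. This is a small sketch with invented sales rows; note what happens for a year that has no February data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (2023, "Jan", 100.0),
        (2023, "Jan", 50.0),
        (2023, "Feb", 70.0),
        (2024, "Jan", 90.0),  # 2024 has no Feb rows
    ],
)

# One output row per year, one output column per month.
pivoted = conn.execute(
    """
    SELECT
        year,
        SUM(CASE WHEN month = 'Jan' THEN revenue END) AS jan,
        SUM(CASE WHEN month = 'Feb' THEN revenue END) AS feb
    FROM sales
    GROUP BY year
    ORDER BY year
    """
).fetchall()

print(pivoted)
```

A month with no rows comes back as NULL (None in Python) because SUM over only NULLs is NULL; adding ELSE 0 inside the CASE returns 0 instead.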

8 How do indexes improve query performance, and when should you use them?

Indexes speed up data retrieval but slow down writes. Use them on columns used in:

WHERE clauses

JOIN conditions

ORDER BY

Example:

sql

CREATE INDEX idx_customer_email ON customers(email);

Best Practices:

Avoid over-indexing

Use clustered indexes for primary keys (SQL Server does this by default when a primary key is created)

Monitor query performance regularly
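The effect of an index shows up in the query plan. The sketch below uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN; the table contents and the plan helper are made up for illustration, and the exact plan wording differs between SQLite versions and between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers (name, email) VALUES (?, ?)",
    [(f"user{i}", f"user{i}@example.com") for i in range(1000)],
)

query = "SELECT name FROM customers WHERE email = ?"


def plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("user42@example.com",)).fetchall()
    # The last column of each plan row is the human-readable step.
    return " | ".join(row[-1] for row in rows)


plan_before = plan(query)  # full table scan
conn.execute("CREATE INDEX idx_customer_email ON customers(email)")
plan_after = plan(query)   # lookup through the new index

print(plan_before)
print(plan_after)
```

Before the index the plan reports a scan of the whole table; afterwards it reports a search using idx_customer_email. The same before-and-after check with EXPLAIN is a reasonable way to confirm that a new index is actually used.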

POWER BI

1 What are the key components of Power BI, and how do they interact?

Components:

Power Query: For data extraction, transformation, and loading (ETL)

Data Model: Stores tables and defines relationships

DAX (Data Analysis Expressions): Creates calculations (measures, calculated columns)

Visualizations: Charts, tables, and dashboards

Power BI Service: Cloud platform for publishing and sharing reports

Interaction:

Power Query cleans and transforms raw data (e.g., merging CSV files)

Data is loaded into the Data Model, where relationships between tables (e.g., Sales ↔ Products) are defined

DAX calculates metrics (e.g., Total Sales = SUM(Sales[Amount])) for use in visuals

Visualizations (e.g., bar charts, maps) are built using the modeled data

Reports are published to the Power BI Service for sharing and collaboration

2 How do you handle relationships between tables in Power BI?

Auto-detect: Power BI automatically links tables via matching column names (e.g., ProductID)

Manual Setup: In Model View, drag a column from one table to another (e.g., Sales[CustomerID] → Customers[ID])

Relationship Types:

One-to-Many (★): Most common (e.g., one customer → many orders)

Many-to-Many: Supported directly, but usually better resolved with a bridge table (e.g., students ↔ courses)

Cross-Filter Direction: Set to Single (default) or Both (bidirectional filtering)

3 What is the difference between calculated columns and measures in DAX?

Aspect | Calculated Column | Measure
Calculation Time | During data refresh (static) | At query runtime (dynamic)
Storage | Stored in the table | Not stored; computed on the fly
Use Case | Row-level calculations (e.g., Profit = [Revenue] - [Cost]) | Aggregations (e.g., Total Sales = SUM(Sales[Amount]))

Example:

Column: Sales[Profit] = Sales[Revenue] - Sales[Cost] (added to the Sales table)

Measure: Total Profit = SUM(Sales[Profit]) (used in a visual)
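The difference between the two can be pictured outside Power BI. The Python sketch below is only an analogy, not DAX: the rows, the region field, and the total_profit function are invented. The stored profit value plays the role of the calculated column, and the function plays the role of the measure, re-evaluated for whatever rows the current filter leaves visible.

```python
sales = [
    {"region": "East", "revenue": 100.0, "cost": 60.0},
    {"region": "East", "revenue": 80.0, "cost": 50.0},
    {"region": "West", "revenue": 120.0, "cost": 90.0},
]

# "Calculated column": evaluated once per row and stored with the table.
for row in sales:
    row["profit"] = row["revenue"] - row["cost"]


# "Measure": not stored; computed on demand over the rows that pass the filter.
def total_profit(rows, region=None):
    visible = [r for r in rows if region is None or r["region"] == region]
    return sum(r["profit"] for r in visible)


print(total_profit(sales))                 # no filter applied
print(total_profit(sales, region="East"))  # filter applied by a slicer or visual
```

The stored column costs memory on every row but is cheap to read; the measure costs nothing to store but is recalculated whenever the filter changes.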

4 How do you optimize Power BI reports for performance?

Data Model:

Use star schema design

Avoid unnecessary columns

DAX:

Avoid complex nested CALCULATE statements

Use SUMMARIZE sparingly

Visuals:

Limit visuals on a single page

Use aggregations (e.g., pre-calculate totals)

Data Mode:

Use Import Mode for small datasets

Use DirectQuery with aggregations for large datasets

Tools: Use Performance Analyzer to identify bottlenecks

5 Explain row-level security (RLS) in Power BI and how to implement it

Definition: Restricts data access based on user roles (e.g., a sales manager sees only their region)

Implementation:

Create Roles: In Power BI Desktop → Modeling → Manage Roles

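Define DAX Filters: For a dynamic rule, compare a column that stores each row owner's sign-in name (here a hypothetical [UserEmail] column) with the signed-in user:

dax

[UserEmail] = USERPRINCIPALNAME()

Publish and Assign Roles: In Power BI Service → Datasets → Security

Example: A static role EastRegion filters the Sales table with [Region] = "East"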
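The logic behind a role filter can be sketched in plain Python. This is an analogy rather than Power BI code: the rows, the rep_email field, and both function names are hypothetical. A static role fixes the filter value in the role itself, while a dynamic role takes it from the signed-in user.

```python
sales = [
    {"order_id": 1, "region": "East", "rep_email": "ann@example.com"},
    {"order_id": 2, "region": "West", "rep_email": "bob@example.com"},
    {"order_id": 3, "region": "East", "rep_email": "ann@example.com"},
]


def static_role_filter(rows, region):
    """Static role (like EastRegion): the filter value is fixed in the role."""
    return [r for r in rows if r["region"] == region]


def dynamic_role_filter(rows, user_principal_name):
    """Dynamic role: the filter value comes from the signed-in user."""
    return [r for r in rows if r["rep_email"] == user_principal_name]


east_ids = [r["order_id"] for r in static_role_filter(sales, "East")]
bob_ids = [r["order_id"] for r in dynamic_role_filter(sales, "bob@example.com")]
print(east_ids, bob_ids)
```

Static roles need one role per segment; a dynamic role needs only one role, but the data must contain a column that identifies who may see each row.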

6 What are the different types of filters in Power BI, and when should you use each? How do you use the CALCULATE function in DAX, and why is it important?

Filter Types:

Visual-Level: Applies to a single visual

Page-Level: Affects all visuals on a page

Report-Level: Applies to the entire report

Drill-Through: Filters data when drilling to another page

CALCULATE Function:

Modifies filter context (e.g., override existing filters)

Example:

dax

East Sales = CALCULATE(SUM(Sales[Amount]), Sales[Region] = "East")

7 How do you create a dynamic date range (e.g., last 7 days) in Power BI?

Create a Date Table:

dax

DateTable = CALENDAR(MIN(Sales[OrderDate]), MAX(Sales[OrderDate]))

Create a Measure:

dax

Sales Last 7 Days =

CALCULATE(

SUM(Sales[Amount]),

DATESINPERIOD('DateTable'[Date], TODAY(), -7, DAY)

)
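The date arithmetic behind such a measure is easy to get wrong by one day. The Python sketch below shows the window this guide reads the measure as selecting, assuming the period counts today itself plus the six days before it; the function name and the sample orders are invented, and today is passed in explicitly so the result is reproducible.

```python
from datetime import date, timedelta


def sales_last_n_days(rows, today, days=7):
    """Sum amounts dated within the `days`-day window that ends on `today`."""
    start = today - timedelta(days=days - 1)  # the window includes `today`
    return sum(amount for order_date, amount in rows if start <= order_date <= today)


orders = [
    (date(2024, 3, 1), 100.0),   # before the window
    (date(2024, 3, 4), 40.0),    # first day of the window
    (date(2024, 3, 10), 60.0),   # "today"
    (date(2024, 3, 11), 999.0),  # future-dated row, excluded
]

total = sales_last_n_days(orders, today=date(2024, 3, 10))
print(total)
```

If the business definition is "the seven days before today, excluding today", shift both ends of the window back by one day.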

8 What is the difference between DirectQuery and Import Mode? When should you use each?

Aspect | DirectQuery | Import Mode
Data Storage | Queries source live (no local copy) | Data imported into Power BI
Performance | Slower (depends on source) | Faster (data cached)
Use Case | Real-time data, large datasets (TB) | Small/medium datasets (GB)
Transformations | Limited to source capabilities | Full Power Query transformations

9 How would you handle large datasets in Power BI without affecting performance?

Aggregations: Pre-summarize data (e.g., daily → monthly totals)

Incremental Refresh: Load only new data (e.g., refresh last 30 days)

Optimize Queries:

Use Query Folding in Power Query

Avoid merging large tables

Data Model:

Use star schema

Limit high-cardinality columns (e.g., unique IDs)

Storage Mode: Use Composite Models (mix Import + DirectQuery)

Example:

An e-commerce dataset with 100M rows uses Aggregations to summarize sales by month and category
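The idea of an aggregation table can be shown with a few rows of plain Python. This is a sketch of the pre-summarizing step only, not of how Power BI stores aggregations; the rows, the field order, and the function name are invented.

```python
from collections import defaultdict


def aggregate_monthly(rows):
    """Collapse (date, category, amount) detail rows to one total per month and category."""
    totals = defaultdict(float)
    for order_date, category, amount in rows:  # order_date is 'YYYY-MM-DD'
        month = order_date[:7]                 # keep 'YYYY-MM'
        totals[(month, category)] += amount
    return dict(totals)


detail = [
    ("2024-01-03", "Books", 20.0),
    ("2024-01-17", "Books", 30.0),
    ("2024-01-20", "Toys", 15.0),
    ("2024-02-02", "Books", 10.0),
]

summary = aggregate_monthly(detail)
print(summary)
```

Four detail rows shrink to three summary rows here; on real data the reduction is far larger, and visuals that only need monthly totals never touch the detail table.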

TABLEAU

1 What are the different types of joins in Tableau, and how do they work?

Joins combine data from multiple tables based on a common field. Tableau supports:

Inner Join: Returns rows where there’s a match in both tables

sql

SELECT * FROM Orders INNER JOIN Customers ON Orders.CustomerID = Customers.ID;

Left Join: Returns all rows from the left table and matched rows from the right (unmatched rows show NULL)

Right Join: Opposite of Left Join (all rows from the right table)

Full Outer Join: Returns all rows from both tables, with NULL for unmatched fields

Union: Stacks tables vertically (works best when the tables share the same column names; fields without a match are filled with NULL)

Example:

Combine Orders (left) and Returns (right) on OrderID using a Left Join to see all orders, including those not returned

2 What is the difference between a live connection and an extract in Tableau? When should you use each?

Aspect | Live Connection | Extract (TDE/Hyper)
Data Freshness | Real-time querying | Snapshot of data at refresh time
Performance | Slower (depends on source speed) | Faster (data stored locally)
Use Case | Real-time dashboards (e.g., stock data) | Large datasets or offline access
Transformations | Limited to source capabilities | Can be filtered and aggregated when the extract is created

When to Use:

Live: Real-time reporting or sensitive data (e.g., live sales dashboards)

Extract: Optimize performance for large datasets (e.g., historical sales analysis)

3 How do you create calculated fields in Tableau? Can you provide an example? What are Level of Detail (LOD) expressions, and how do FIXED, INCLUDE, and EXCLUDE differ?

Calculated Fields: Custom formulas using existing fields

Example:

sql

Profit = [Sales] - [Cost]

LOD Expressions:

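FIXED: Computes values at the listed dimensions regardless of the dimensions in the view (e.g., total sales per region). It is evaluated before ordinary dimension filters, so those are ignored unless they are added to context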

sql

{ FIXED [Region] : SUM([Sales]) }

INCLUDE: Adds a dimension to the calculation (e.g., average sales per product within each category)

sql

{ INCLUDE [Product] : AVG([Sales]) }

EXCLUDE: Removes a dimension from the calculation (e.g., { EXCLUDE [Region] : SUM([Sales]) } gives total sales across all regions even when Region is in the view)
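What a FIXED expression produces can be pictured with plain Python. This sketch is an analogy to { FIXED [Region] : SUM([Sales]) }, not Tableau code; the rows and field names are invented. The total is computed once per region and then attached to every row of that region, whatever finer level of detail the view shows.

```python
from collections import defaultdict

rows = [
    {"region": "East", "product": "A", "sales": 100},
    {"region": "East", "product": "B", "sales": 50},
    {"region": "West", "product": "A", "sales": 70},
]

# One total per region, independent of the product-level detail.
region_totals = defaultdict(int)
for r in rows:
    region_totals[r["region"]] += r["sales"]

# Attach the region total to each row, then derive each row's share of it.
for r in rows:
    r["region_sales"] = region_totals[r["region"]]
    r["share_of_region"] = r["sales"] / r["region_sales"]

for r in rows:
    print(r["region"], r["product"], r["region_sales"], round(r["share_of_region"], 2))
```

This "value at a coarser level repeated on finer rows" pattern is what makes LOD expressions useful for percent-of-group calculations.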

4 How do you create dynamic parameters and filters in Tableau?

Dynamic Parameters:

Create a parameter (e.g., Selected Year)

Use it in a calculated field to filter data:

sql

YEAR([OrderDate]) = [Selected Year]

Dynamic Filters:

Use a parameter to control a dimension (e.g., Top N Products based on a parameter input)

Example: A slider to adjust the date range

5 What are the different types of charts available in Tableau, and how do you decide which one to use?

Chart Type | Use Case
Bar Chart | Comparing categories (e.g., sales by region)
Line Chart | Trends over time (e.g., monthly revenue)
Scatter Plot | Correlations (e.g., marketing spend vs sales)
Heatmap | Density or patterns (e.g., sales by region and product)
Pie Chart | Proportions (e.g., market share)

Best Practice: Use bar/line charts for clarity; avoid pie charts for complex comparisons

6 How do you optimize Tableau dashboards for better performance?

Data Source: Use extracts, filter unused data

Calculations: Simplify complex formulas; avoid unnecessary LODs

Filters: Apply context filters early in the pipeline

Visuals: Limit heavy charts (e.g., avoid 10,000-row tables)

Performance Recording: Use Tableau’s built-in tool to identify bottlenecks

7 How can you implement row-level security (RLS) in Tableau?

Implement RLS using:

User Filters: Map users to data segments (e.g., [Salesperson] = USERNAME())

Data Source Filters: Restrict access at the database level

Dynamic Zone Visibility: Hide/show sheets based on user roles

Example: Sales reps only see data where [Salesperson] = USERNAME()

8 What is the difference between table calculations and calculated fields? When should you use each?

Aspect | Table Calculations | Calculated Fields
Scope | Based on the table structure (e.g., percent of total) | Row-level or aggregate (e.g., Profit = Sales - Cost)
Execution | Computed in the visualization layer | Computed during data processing
Use Case | Running totals, rankings | Pre-computed metrics

Example:

Table Calc: WINDOW_AVG(SUM([Sales]), -6, 0) for a 7-period moving average

Calculated Field: IF [Sales] > 1000 THEN "High" ELSE "Low" END

9 How do you create a dual-axis chart in Tableau, and when is it useful?

Steps:

Drag two measures to the Rows/Columns shelf (e.g., Sales and Profit)

Right-click the second measure → Dual Axis

Synchronize axes if scales are compatible

Use Case: Compare trends with different scales (e.g., sales ($$) vs profit margin (%))

Example: Overlay a line chart (profit margin) on a bar chart (sales) by month

PYTHON BASICS

1 How do you read and manipulate data in Pandas? Can you provide examples of common operations (e.g., filtering, grouping, merging)?

Reading Data:

python

import pandas as pd

# Read CSV file

df = pd.read_csv('data.csv')

# Read Excel file

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

Common Operations:

Filtering:

python

filtered_df = df[df['Age'] > 30] # Rows where Age > 30

Grouping:

python

grouped = df.groupby('City')['Sales'].sum() # Total sales per city

Merging:

python

merged_df = pd.merge(df1, df2, on='ID', how='inner') # Inner join on 'ID'

2 What is the difference between a list, tuple, set, and dictionary in Python? When should you use each?

Type | Mutable | Ordered | Use Case
List | Yes | Yes | Storing sequences (e.g., [1, 2, 3])
Tuple | No | Yes | Immutable data (e.g., coordinates (x, y))
Set | Yes | No | Unique elements (e.g., {1, 2, 3})
Dict | Yes | No (Python <3.7) / Yes (Python ≥3.7) | Key-value pairs (e.g., {'name': 'Alice'})

Example:

python

# List: Modify elements

fruits = ['apple', 'banana']

fruits.append('orange')

# Tuple: Fixed data

dimensions = (1920, 1080)

# Set: Remove duplicates

unique_ids = set([101, 102, 101])

# Dict: Lookup by key

user = {'id': 1, 'name': 'Alice'}
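A common follow-up is how these types combine in practice. The sketch below, with made-up IDs and resolutions, shows two points that often come up: a set removes duplicates but does not keep order, whereas dict keys (insertion-ordered since Python 3.7) do, and a tuple can serve as a dictionary key because it is immutable.

```python
ids = [103, 101, 102, 101, 103]

# Set: duplicates removed, but no order is guaranteed.
unique_unordered = set(ids)

# Dict keys: duplicates removed and first-seen order preserved.
unique_in_order = list(dict.fromkeys(ids))

# Tuple as a dictionary key; a list could not be used here.
resolution_names = {(1920, 1080): "Full HD", (3840, 2160): "4K"}
name = resolution_names[(1920, 1080)]

print(unique_unordered, unique_in_order, name)
```

Membership tests (x in collection) are also much faster on sets and dicts than on lists once the collection is large.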

3 How do you handle missing values in a dataset using Pandas?

Detect Missing Values:

python

df.isna().sum() # Count NaNs per column

Fill Missing Values:

python

df['Age'] = df['Age'].fillna(df['Age'].mean()) # Fill with mean

Drop Rows/Columns:

python

df = df.dropna(axis=0, subset=['Age']) # Drop rows with NaN in 'Age'

4 What is the difference between apply(), map(), and lambda functions in Pandas?

apply(): Applies a function to each value of a Series, or to each row or column of a DataFrame

python

df['Name_Upper'] = df['Name'].apply(lambda x: x.upper())

map(): Substitutes each value in a Series using a dict or function (values missing from the dict become NaN)

python

df['Gender'] = df['Gender'].map({'M': 'Male', 'F': 'Female'})

Lambda: Anonymous functions for quick operations

python

Title	Most asked interview question for data analysis
School	University of Data Science
Major	Data Analysis
Category	Essay
Year published	2023
City	New York

Format
Pages	17
File size	213.31 KB