Most asked interview questions for data analysis, with a guide to answering each question: a compilation of most of the questions recruiters commonly ask candidates for data analysis positions
MOST ASKED INTERVIEW QUESTIONS FOR DATA ANALYSIS
SQL
1 How do you find duplicate records in a table?
To find duplicates, use GROUP BY and HAVING to filter groups with counts greater than 1
Example:
sql
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
This identifies duplicate emails in the users table
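A minimal runnable sketch of the query above, using Python's built-in sqlite3 module and an in-memory table. The sample emails are made up for illustration, and the `occurrences` alias is an addition.

```python
import sqlite3

# In-memory database with a small, made-up users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [
        ("a@example.com",),
        ("b@example.com",),
        ("a@example.com",),
        ("c@example.com",),
        ("a@example.com",),
    ],
)

# Group by email and keep only groups that occur more than once.
duplicates = conn.execute(
    """
    SELECT email, COUNT(*) AS occurrences
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
    """
).fetchall()

print(duplicates)
```

Only the email that appears more than once is returned, together with how many times it occurs.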
2 Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN
INNER JOIN: Returns rows where there’s a match in both tables
sql
SELECT * FROM orders
INNER JOIN customers ON orders.customer_id = customers.id;
LEFT JOIN: Returns all rows from the left table and matched rows from the right (unmatched rows get NULL)
sql
SELECT * FROM employees
LEFT JOIN departments ON employees.dept_id = departments.id;
RIGHT JOIN: Opposite of LEFT JOIN (all rows from the right table)
FULL OUTER JOIN: Returns all rows from both tables, with NULL where no match exists
sql
SELECT * FROM tableA
FULL OUTER JOIN tableB ON tableA.id = tableB.id;
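To see the difference on real rows, here is a small sketch using Python's built-in sqlite3 module. The table contents are invented for illustration. Only INNER JOIN and LEFT JOIN are shown, because older SQLite versions do not support RIGHT JOIN or FULL OUTER JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO departments VALUES (1, 'Sales'), (2, 'Finance');
    -- Ben has no department, so he only survives the LEFT JOIN.
    INSERT INTO employees VALUES (1, 'Ana', 1), (2, 'Ben', NULL), (3, 'Chi', 2);
    """
)

inner_rows = conn.execute(
    """
    SELECT e.name, d.name
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.id
    ORDER BY e.id
    """
).fetchall()

left_rows = conn.execute(
    """
    SELECT e.name, d.name
    FROM employees e
    LEFT JOIN departments d ON e.dept_id = d.id
    ORDER BY e.id
    """
).fetchall()

print(inner_rows)  # matched rows only
print(left_rows)   # every employee; unmatched department comes back as None (NULL)
```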
3 How do you optimize a slow-running query? What are window functions, and how do you use them for ranking, running totals, or moving averages?
Optimization:
Add indexes on frequently queried columns
Avoid SELECT *; fetch only needed columns
Use EXPLAIN to analyze the query execution plan
Window Functions:
Ranking:
sql
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS salary_rank -- RANK is a reserved word in some databases
FROM employees;
Running Total:
sql
SELECT date, revenue, SUM(revenue) OVER (ORDER BY date) AS running_total
FROM sales;
7-Day Moving Average (the frame counts rows, so this assumes one row per date):
sql
SELECT date, AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7d
FROM sales;
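The ranking and running-total patterns can be tried out with Python's built-in sqlite3 module, assuming the bundled SQLite is version 3.25 or later (the first release with window functions). The three sample rows and the `revenue_rank` alias are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-01", 100.0), ("2024-01-02", 300.0), ("2024-01-03", 200.0)],
)

# One pass over the table: rank each day by revenue and accumulate revenue by date.
rows = conn.execute(
    """
    SELECT date,
           revenue,
           RANK() OVER (ORDER BY revenue DESC) AS revenue_rank,
           SUM(revenue) OVER (ORDER BY date)   AS running_total
    FROM sales
    ORDER BY date
    """
).fetchall()

for row in rows:
    print(row)
```

Unlike GROUP BY, the window functions keep one output row per input row and add the computed columns alongside.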
4 What are Common Table Expressions (CTEs), and how do they simplify complex queries?
CTEs simplify complex queries by breaking them into named temporary result sets
Example:
sql
WITH high_sales AS (
SELECT employee_id, SUM(amount) AS total
FROM orders
GROUP BY employee_id
HAVING SUM(amount) > 10000
)
SELECT * FROM employees
WHERE id IN (SELECT employee_id FROM high_sales);
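A runnable sketch of the same CTE with Python's built-in sqlite3 module. The sample employees and order amounts are invented, and the final SELECT joins to the CTE (instead of using IN) so the total can be shown next to the name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, employee_id INTEGER, amount REAL);
    INSERT INTO employees VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders (employee_id, amount) VALUES (1, 6000), (1, 5000), (2, 4000);
    """
)

# The CTE names the aggregated result once; the outer query then reads like a plain join.
top_sellers = conn.execute(
    """
    WITH high_sales AS (
        SELECT employee_id, SUM(amount) AS total
        FROM orders
        GROUP BY employee_id
        HAVING SUM(amount) > 10000
    )
    SELECT e.name, h.total
    FROM employees e
    JOIN high_sales h ON h.employee_id = e.id
    """
).fetchall()

print(top_sellers)
```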
5 How would you filter results using a subquery in another table?
Use IN, EXISTS, or JOIN with a subquery
Example:
sql
SELECT * FROM products
WHERE category_id IN (
SELECT id FROM categories WHERE name = 'Electronics'
);
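The answer names IN, EXISTS, and JOIN but shows only IN. The sketch below, using Python's built-in sqlite3 module and made-up sample rows, runs all three forms and shows that they return the same products here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);
    INSERT INTO categories VALUES (1, 'Electronics'), (2, 'Books');
    INSERT INTO products VALUES (1, 'Phone', 1), (2, 'Novel', 2), (3, 'Laptop', 1);
    """
)

queries = {
    # Subquery returns the matching category ids.
    "in": """
        SELECT p.name FROM products p
        WHERE p.category_id IN (SELECT id FROM categories WHERE name = 'Electronics')
        ORDER BY p.id
    """,
    # Correlated subquery: keep a product if a matching category row exists.
    "exists": """
        SELECT p.name FROM products p
        WHERE EXISTS (
            SELECT 1 FROM categories c
            WHERE c.id = p.category_id AND c.name = 'Electronics'
        )
        ORDER BY p.id
    """,
    # Join form: filter on the joined category row.
    "join": """
        SELECT p.name FROM products p
        JOIN categories c ON c.id = p.category_id
        WHERE c.name = 'Electronics'
        ORDER BY p.id
    """,
}

results = {label: [row[0] for row in conn.execute(sql)] for label, sql in queries.items()}
print(results)
```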
6 How do you calculate a rolling 7-day average in SQL?
Use a window function with a frame clause:
sql
SELECT date, AVG(sales) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_avg
FROM daily_sales;
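What the frame clause computes can be spelled out in plain Python. This is a sketch of the arithmetic only, assuming one value per day in date order; the function name and sample numbers are illustrative.

```python
from collections import deque


def rolling_average(values, window=7):
    """Average of the current value and up to (window - 1) preceding values.

    Mirrors ROWS BETWEEN 6 PRECEDING AND CURRENT ROW: the first rows
    average over fewer than `window` values because no earlier rows exist.
    """
    recent = deque(maxlen=window)  # oldest value drops out automatically
    averages = []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages


daily_sales = [10, 20, 30, 40, 50, 60, 70, 80]
rolling_avg = rolling_average(daily_sales)
print(rolling_avg)  # [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 50.0]
```

Because the frame counts rows rather than calendar days, gaps in the dates would make the window span more than seven days.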
7 What’s the difference between WHERE and HAVING? When should you use each? How do you pivot (transpose) a table in SQL?
WHERE: Filters rows before aggregation
sql
SELECT dept_id, AVG(salary)
FROM employees
WHERE salary > 50000
GROUP BY dept_id;
HAVING: Filters after aggregation
sql
SELECT dept_id, AVG(salary)
FROM employees
GROUP BY dept_id
HAVING AVG(salary) > 70000;
Pivoting: Use conditional aggregation with CASE (works in most databases) or the PIVOT operator (e.g., SQL Server, Oracle)
sql
SELECT
year,
SUM(CASE WHEN month = 'Jan' THEN revenue END) AS Jan,
SUM(CASE WHEN month = 'Feb' THEN revenue END) AS Feb
FROM sales
GROUP BY year;
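The CASE-based pivot can be checked with Python's built-in sqlite3 module. The sample rows are made up; note that a year with no February rows gets NULL (None in Python) in the Feb column, because the CASE has no ELSE branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(2023, "Jan", 100), (2023, "Feb", 150), (2023, "Jan", 50), (2024, "Jan", 200)],
)

# Each CASE picks out one month's revenue; SUM collapses it into a column per month.
pivoted = conn.execute(
    """
    SELECT year,
           SUM(CASE WHEN month = 'Jan' THEN revenue END) AS Jan,
           SUM(CASE WHEN month = 'Feb' THEN revenue END) AS Feb
    FROM sales
    GROUP BY year
    ORDER BY year
    """
).fetchall()

print(pivoted)
```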
8 How do indexes improve query performance, and when should you use them?
Indexes speed up data retrieval but slow down writes. Use them on columns used in:
WHERE clauses
JOIN conditions
ORDER BY
Example:
sql
CREATE INDEX idx_customer_email ON customers(email);
Best Practices:
Avoid over-indexing
Use clustered indexes for primary keys
Monitor query performance regularly
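One way to confirm that an index is actually used is to compare the query plan before and after creating it. The sketch below uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN; the generated customer rows and the `plan` helper are illustrative, and the exact plan wording differs between SQLite versions and between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO customers (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

query = "SELECT name FROM customers WHERE email = 'user500@example.com'"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable step description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = plan(query)  # full table scan: no index on email yet
conn.execute("CREATE INDEX idx_customer_email ON customers(email)")
plan_after = plan(query)   # lookup through idx_customer_email

print("before:", plan_before)
print("after: ", plan_after)
```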
POWER BI
1 What are the key components of Power BI, and how do they interact?
Components:
Power Query: For data extraction, transformation, and loading (ETL)
Data Model: Stores tables and defines relationships
DAX (Data Analysis Expressions): Creates calculations (measures, calculated columns)
Visualizations: Charts, tables, and dashboards
Power BI Service: Cloud platform for publishing and sharing reports
Interaction:
Power Query cleans and transforms raw data (e.g., merging CSV files)
Data is loaded into the Data Model, where relationships between tables (e.g., Sales ↔ Products) are defined
DAX calculates metrics (e.g., Total Sales = SUM(Sales[Amount])) for use in visuals
Visualizations (e.g., bar charts, maps) are built using the modeled data
Reports are published to the Power BI Service for sharing and collaboration
2 How do you handle relationships between tables in Power BI?
Auto-detect: Power BI automatically links tables via matching column names (e.g., ProductID)
Manual Setup: In Model View, drag a column from one table to another (e.g., Sales[CustomerID] → Customers[ID])
Relationship Types:
One-to-Many (1:*): Most common (e.g., one customer → many orders)
Many-to-Many (*:*): Supported directly, but usually modeled with a bridge table (e.g., students ↔ courses)
Cross-Filter Direction: Set to Single (default) or Both (bidirectional filtering)
3 What is the difference between calculated columns and measures in DAX?
Aspect | Calculated Column | Measure
Calculation Time | During data refresh (static) | At query runtime (dynamic)
Storage | Stored in the table | Not stored; computed on the fly
Use Case | Row-level calculations (e.g., Profit = [Revenue] - [Cost]) | Aggregations (e.g., Total Sales = SUM(Sales[Amount]))
Example:
Column: Sales[Profit] = Sales[Revenue] - Sales[Cost] (added to the Sales table)
Measure: Total Profit = SUM(Sales[Profit]) (used in a visual)
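The stored-versus-computed distinction can be mimicked in plain Python. This is only an analogy, not DAX: the `sales` rows and the `total_profit` function are made up, with the "column" written into each row once and the "measure" evaluated on demand for whichever rows a filter leaves visible.

```python
sales = [
    {"Region": "East", "Revenue": 100, "Cost": 60},
    {"Region": "West", "Revenue": 200, "Cost": 150},
    {"Region": "East", "Revenue": 50, "Cost": 20},
]

# "Calculated column": computed once per row and stored with the table.
for row in sales:
    row["Profit"] = row["Revenue"] - row["Cost"]


# "Measure": nothing is stored; the result depends on the rows currently in scope.
def total_profit(rows, region=None):
    visible = [r for r in rows if region is None or r["Region"] == region]
    return sum(r["Profit"] for r in visible)


print(total_profit(sales))                 # 120 (all rows)
print(total_profit(sales, region="East"))  # 70 (same definition, different filter)
```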
4 How do you optimize Power BI reports for performance?
Data Model:
Use star schema design
Avoid unnecessary columns
DAX:
Avoid complex nested CALCULATE statements
Use SUMMARIZE sparingly
Visuals:
Limit visuals on a single page
Use aggregations (e.g., pre-calculate totals)
Data Mode:
Use Import Mode for small datasets
Use DirectQuery with aggregations for large datasets
Tools: Use Performance Analyzer to identify bottlenecks
5 Explain row-level security (RLS) in Power BI and how to implement it
Definition: Restricts data access based on user roles (e.g., a sales manager sees only their region)
Implementation:
Create Roles: In Power BI Desktop → Modeling → Manage Roles
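Define DAX Filters (USERPRINCIPALNAME() returns the signed-in user's login, usually an email address, so compare it with a column that stores logins; [ManagerEmail] is an illustrative column name):
dax
[ManagerEmail] = USERPRINCIPALNAME()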
Publish and Assign Roles: In Power BI Service → Datasets → Security
Example: A role EastRegion filters the Sales table to Region = "East"
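The effect of a role filter can be sketched in plain Python. This is an analogy for what RLS does to the rows a user can see, not how Power BI implements it; the user-to-region mapping and the email addresses are hypothetical.

```python
sales = [
    {"Region": "East", "Amount": 100},
    {"Region": "West", "Amount": 200},
    {"Region": "East", "Amount": 50},
]

# Hypothetical mapping from signed-in user to the region they may see.
user_regions = {
    "east.manager@example.com": "East",
    "west.manager@example.com": "West",
}


def rows_visible_to(user, rows):
    """Return only the rows whose Region matches the user's assigned region."""
    region = user_regions.get(user)  # None for unknown users, so they see nothing
    return [row for row in rows if row["Region"] == region]


east_rows = rows_visible_to("east.manager@example.com", sales)
print(sum(row["Amount"] for row in east_rows))          # 150
print(rows_visible_to("stranger@example.com", sales))   # []
```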
6 What are the different types of filters in Power BI, and when should you use each? How do you use the CALCULATE function in DAX, and why is it important?
Filter Types:
Visual-Level: Applies to a single visual
Page-Level: Affects all visuals on a page
Report-Level: Applies to the entire report
Drill-Through: Filters data when drilling to another page
CALCULATE Function:
Modifies filter context (e.g., override existing filters)
Example:
dax
East Sales = CALCULATE(SUM(Sales[Amount]), Sales[Region] = "East")
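"Modifies filter context" can be pictured with a simplified Python model, assuming the filter context is just a mapping of column to required value (real DAX filter context is richer than this). The function names and sample rows are illustrative. The point is that the Region filter coming from the visual is replaced, while the other filters stay in force.

```python
sales = [
    {"Region": "East", "Year": 2023, "Amount": 100},
    {"Region": "West", "Year": 2023, "Amount": 200},
    {"Region": "East", "Year": 2024, "Amount": 50},
]


def sum_amount(rows, filter_context):
    """SUM(Sales[Amount]) over the rows that satisfy every filter in the context."""
    return sum(
        r["Amount"]
        for r in rows
        if all(r[col] == val for col, val in filter_context.items())
    )


def calculate_sum_amount(rows, filter_context, **overrides):
    """Evaluate the same sum after replacing filters on the overridden columns."""
    return sum_amount(rows, {**filter_context, **overrides})


# Filters applied by a visual showing West in 2023.
visual_context = {"Region": "West", "Year": 2023}

plain_total = sum_amount(sales, visual_context)
east_sales = calculate_sum_amount(sales, visual_context, Region="East")

print(plain_total)  # 200: West, 2023
print(east_sales)   # 100: Region overridden to East, Year filter kept
```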
7 How do you create a dynamic date range (e.g., last 7 days) in Power BI?
Create a Date Table:
dax
DateTable = CALENDAR(MIN(Sales[OrderDate]), MAX(Sales[OrderDate]))
Create a Measure:
dax
Sales Last 7 Days =
CALCULATE(
SUM(Sales[Amount]),
DATESINPERIOD('DateTable'[Date], TODAY(), -7, DAY)
)
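The date window behind this measure can be sketched in Python, assuming the period is the seven calendar days ending on the given date, inclusive. The function names and sample orders are illustrative; passing the date in explicitly (rather than calling today's date inside) keeps the example reproducible.

```python
from datetime import date, timedelta


def last_n_days(today, n=7):
    """The n calendar dates ending on `today`, inclusive."""
    start = today - timedelta(days=n - 1)
    return [start + timedelta(days=i) for i in range(n)]


def sales_last_n_days(rows, today, n=7):
    """Sum the amounts of (order_date, amount) rows that fall inside the window."""
    window = set(last_n_days(today, n))
    return sum(amount for order_date, amount in rows if order_date in window)


orders = [
    (date(2024, 3, 3), 10),   # one day too early
    (date(2024, 3, 4), 20),   # first day of the window
    (date(2024, 3, 10), 30),  # "today"
    (date(2024, 3, 11), 40),  # in the future
]

window = last_n_days(date(2024, 3, 10))
total = sales_last_n_days(orders, date(2024, 3, 10))
print(window[0], window[-1], total)  # 2024-03-04 2024-03-10 50
```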
8 What is the difference between DirectQuery and Import Mode? When should you use each?
Aspect | DirectQuery | Import Mode
Data Storage | Queries source live (no local copy) | Data imported into Power BI
Performance | Slower (depends on source) | Faster (data cached)
Use Case | Real-time data, large datasets (TB) | Small/medium datasets (GB)
Transformations | Limited to source capabilities | Full Power Query transformations
9 How would you handle large datasets in Power BI without affecting performance?
Aggregations: Pre-summarize data (e.g., daily → monthly totals)
Incremental Refresh: Load only new data (e.g., refresh last 30 days)
Optimize Queries:
Use Query Folding in Power Query
Avoid merging large tables
Data Model:
Use star schema
Limit high-cardinality columns (e.g., unique IDs)
Storage Mode: Use Composite Models (mix Import + DirectQuery)
Example:
An e-commerce dataset with 100M rows uses Aggregations to summarize sales by month and category
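The idea of pre-summarizing can be shown in a few lines of Python. This sketches the aggregation step only, not the Power BI Aggregations feature itself; the four daily rows are made up and stand in for a much larger fact table.

```python
from collections import defaultdict

# (date, category, amount) rows at daily grain.
daily_sales = [
    ("2024-01-05", "Toys", 10.0),
    ("2024-01-20", "Toys", 5.0),
    ("2024-01-07", "Books", 8.0),
    ("2024-02-01", "Toys", 12.0),
]

# Roll up to one row per (month, category); visuals can then read this smaller table.
monthly_sales = defaultdict(float)
for day, category, amount in daily_sales:
    month = day[:7]  # "YYYY-MM" prefix of an ISO date string
    monthly_sales[(month, category)] += amount

for key, total in sorted(monthly_sales.items()):
    print(key, total)
```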
TABLEAU
1 What are the different types of joins in Tableau, and how do they work?
Joins combine data from multiple tables based on a common field. Tableau supports:
Inner Join: Returns rows where there’s a match in both tables
sql
SELECT * FROM Orders INNER JOIN Customers ON Orders.CustomerID = Customers.ID;
Left Join: Returns all rows from the left table and matched rows from the right (unmatched rows show NULL)
Right Join: Opposite of Left Join (all rows from the right table)
Full Outer Join: Returns all rows from both tables, with NULL for unmatched fields
Union: Stacks tables vertically (strictly not a join; the tables should have the same columns)
Example:
Combine Orders (left) and Returns (right) on OrderID using a Left Join to see all orders, including those not returned
2 What is the difference between a live connection and an extract in Tableau? When should you use each?
Aspect | Live Connection | Extract (TDE/Hyper)
Data Freshness | Real-time querying | Snapshot of data at refresh time
Performance | Slower (depends on source speed) | Faster (data stored locally)
Use Case | Real-time dashboards (e.g., stock data) | Large datasets or offline access
Transformations | Limited to source capabilities | Full Power Query-like transformations
When to Use:
Live: Real-time reporting or sensitive data (e.g., live sales dashboards)
Extract: Optimize performance for large datasets (e.g., historical sales analysis)
3 How do you create calculated fields in Tableau? Can you provide an example? What are Level of Detail (LOD) expressions, and how do FIXED, INCLUDE, and EXCLUDE differ?
Calculated Fields: Custom formulas using existing fields
Example:
sql
Profit = [Sales] - [Cost]
LOD Expressions:
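FIXED: Computes values at the listed dimensions, independent of the dimensions in the view (e.g., total sales per region; ordinary dimension filters are ignored, but context and data source filters still apply)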
sql
{ FIXED [Region] : SUM([Sales]) }
INCLUDE: Adds a dimension to the calculation (e.g., average sales per product within each category)
sql
{ INCLUDE [Product] : AVG([Sales]) }
EXCLUDE: Removes a dimension from the calculation (e.g., { EXCLUDE [Region] : SUM([Sales]) } gives total sales with Region left out of the view's level of detail)
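What a FIXED expression returns can be imitated in plain Python. This is an analogy for { FIXED [Region] : SUM([Sales]) }, not Tableau's implementation; the rows and the `RegionSales` and `ShareOfRegion` field names are made up. The per-region total is computed once and then attached to every row of that region, whatever finer detail the rows have.

```python
from collections import defaultdict

rows = [
    {"Region": "East", "Product": "A", "Sales": 100},
    {"Region": "East", "Product": "B", "Sales": 50},
    {"Region": "West", "Product": "A", "Sales": 70},
]

# Step 1: aggregate at the fixed level of detail (Region only).
region_total = defaultdict(int)
for row in rows:
    region_total[row["Region"]] += row["Sales"]

# Step 2: attach the region-level value to each product-level row.
for row in rows:
    row["RegionSales"] = region_total[row["Region"]]
    row["ShareOfRegion"] = row["Sales"] / row["RegionSales"]

for row in rows:
    print(row["Region"], row["Product"], row["RegionSales"], round(row["ShareOfRegion"], 2))
```

A typical use is exactly this kind of ratio: a row-level value divided by a total computed at a coarser level.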
4 How do you create dynamic parameters and filters in Tableau?
Dynamic Parameters:
Create a parameter (e.g., Selected Year)
Use it in a calculated field to filter data:
sql
YEAR([OrderDate]) = [Selected Year]
Dynamic Filters:
Use a parameter to control a dimension (e.g., Top N Products based on a parameter input)
Example: A slider to adjust the date range
5 What are the different types of charts available in Tableau, and how do you decide which one to use?
Chart Type | Use Case
Bar Chart | Comparing categories (e.g., sales by region)
Line Chart | Trends over time (e.g., monthly revenue)
Scatter Plot | Correlations (e.g., marketing spend vs sales)
Heatmap | Density or patterns (e.g., sales by region and product)
Pie Chart | Proportions (e.g., market share)
Best Practice: Use bar/line charts for clarity; avoid pie charts for complex comparisons
6 How do you optimize Tableau dashboards for better performance?
Data Source: Use extracts, filter unused data
Calculations: Simplify complex formulas; avoid unnecessary LODs
Filters: Apply context filters early in the pipeline
Visuals: Limit heavy charts (e.g., avoid 10,000-row tables)
Performance Recording: Use Tableau’s built-in tool to identify bottlenecks
7 How can you implement row-level security (RLS) in Tableau?
Implement RLS using:
User Filters: Map users to data segments (e.g., [Salesperson] = USERNAME(), where the field holds the username allowed to see each row)
Data Source Filters: Restrict access at the database level
Dynamic Zone Visibility: Hide/show sheets based on user roles
Example: Sales reps only see data where [Salesperson] = USERNAME()
8 What is the difference between table calculations and calculated fields? When should you use each?
Aspect | Table Calculations | Calculated Fields
Scope | Based on the table structure (e.g., percent of total) | Row-level or aggregate (e.g., Profit = Sales - Cost)
Execution | Computed in the visualization layer | Computed during data processing
Use Case | Running totals, rankings | Pre-computed metrics
Example:
Table Calc: WINDOW_AVG(SUM([Sales]), -2, 0) for a 3-period moving average (without the offsets, WINDOW_AVG averages over the whole partition)
Calculated Field: IF [Sales] > 1000 THEN "High" ELSE "Low" END
9 How do you create a dual-axis chart in Tableau, and when is it useful?
Steps:
Drag two measures to the Rows/Columns shelf (e.g., Sales and Profit)
Right-click the second measure → Dual Axis
Synchronize axes if scales are compatible
Use Case: Compare trends with different scales (e.g., sales ($$) vs profit margin (%))
Example: Overlay a line chart (profit margin) on a bar chart (sales) by month
PYTHON BASICS
1 How do you read and manipulate data in Pandas? Can you provide examples of common operations (e.g., filtering, grouping, merging)?
Reading Data:
python
import pandas as pd
# Read CSV file
df = pd.read_csv('data.csv')
# Read Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
Common Operations:
Filtering:
python
filtered_df = df[df['Age'] > 30] # Rows where Age > 30
Grouping:
python
grouped = df.groupby('City')['Sales'].sum() # Total sales per city
Merging:
python
merged_df = pd.merge(df1, df2, on='ID', how='inner') # Inner join on 'ID'
2 What is the difference between a list, tuple, set, and dictionary in Python? When should you use each?
Type | Mutable | Ordered | Use Case
List | Yes | Yes | Storing sequences (e.g., [1, 2, 3])
Tuple | No | Yes | Immutable data (e.g., coordinates (x, y))
Set | Yes | No | Unique elements (e.g., {1, 2, 3})
Dict | Yes | No (Python <3.7) / Yes (Python ≥3.7) | Key-value pairs (e.g., {'name': 'Alice'})
Example:
python
# List: Modify elements
fruits = ['apple', 'banana']
fruits.append('orange')
# Tuple: Fixed data
dimensions = (1920, 1080)
# Set: Remove duplicates
unique_ids = set([101, 102, 101])
# Dict: Lookup by key
user = {'id': 1, 'name': 'Alice'}
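The mutability and ordering properties from the comparison can be checked directly. A short sketch reusing the same sample values; the extra `email` key and the `tuple_is_immutable` flag are additions for illustration.

```python
# List: mutable and ordered, so appended items keep their position.
fruits = ['apple', 'banana']
fruits.append('orange')

# Tuple: immutable, so item assignment raises TypeError.
dimensions = (1920, 1080)
try:
    dimensions[0] = 1280
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

# Set: duplicates collapse into a single element.
unique_ids = set([101, 102, 101])

# Dict: mutable; keys keep insertion order on Python 3.7 and later.
user = {'id': 1, 'name': 'Alice'}
user['email'] = 'alice@example.com'

print(fruits)              # ['apple', 'banana', 'orange']
print(tuple_is_immutable)  # True
print(len(unique_ids))     # 2
print(list(user))          # ['id', 'name', 'email']
```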
3 How do you handle missing values in a dataset using Pandas?
Detect Missing Values:
python
df.isna().sum() # Count NaNs per column
Fill Missing Values:
python
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Fill with mean (assign back instead of chained inplace=True)
Drop Rows/Columns:
python
df = df.dropna(axis=0, subset=['Age']) # Drop rows with NaN in 'Age' (dropna returns a new DataFrame)
4 What is the difference between apply(), map(), and lambda functions in Pandas?
apply(): Apply a function to a DataFrame/Series
python
df['Name_Upper'] = df['Name'].apply(lambda x: x.upper())
map(): Transform each value of a Series using a dict or function (values missing from the dict become NaN)
python
df['Gender'] = df['Gender'].map({'M': 'Male', 'F': 'Female'})
Lambda: Anonymous functions for quick operations
python