Treinamento em R
Your director walks in:
“We need to identify the top 5 firms with the largest VAT gaps in the retail sector. Break them down by firm size—small, medium, and large—based on annual taxable income from CIT data. Can you have this ready by tomorrow?”
🤔 You think:
This module teaches you exactly how to solve this problem.
What data do we have?
What do we need to do?
Key question: How do we get from messy, separate datasets to one clean analysis table?
Answer: By mastering reshaping and joins!
What is Tidy Data?
Tidy data organizes tax administrative data into a consistent, analysis-ready format:

| Taxpayer ID | Tax Type | 2021 Q1 | 2021 Q2 | 2021 Q3 | 2021 Q4 |
|---|---|---|---|---|---|
| 101 | Income Tax | 500 | 600 | 450 | 700 |
| 102 | VAT | 300 | 400 | 350 | 500 |
What’s the problem with this dataset?
Problems:
What are the variables?
Taxpayer ID, Tax Type, Quarter, Payment Amount.
What constitutes a single observation in this dataset?
One observation is a specific tax payment for a taxpayer during a particular quarter.
How would you reshape the dataset to meet the three tidy characteristics?
| Taxpayer ID | Tax Type | Quarter | Payment Amount |
|---|---|---|---|
| 101 | Income Tax | 2021 Q1 | 500 |
| 101 | Income Tax | 2021 Q2 | 600 |
| 101 | Income Tax | 2021 Q3 | 450 |
| 101 | Income Tax | 2021 Q4 | 700 |
| 102 | VAT | 2021 Q1 | 300 |
| 102 | VAT | 2021 Q2 | 400 |
| 102 | VAT | 2021 Q3 | 350 |
| 102 | VAT | 2021 Q4 | 500 |
Tidy data makes analysis easier:
The Goal
Most of our work involves transforming messy data into tidy data, then analyzing it.
Unit of Observation
What each row represents in your raw data
Unit of Analysis
What you need for your analysis
Example: You have firm-quarter observations but need firm-year analysis
Each time period is a column
| firm_id | Q1_vat | Q2_vat | Q3_vat | Q4_vat |
|---|---|---|---|---|
| FIRM_01 | 1000 | 1200 | 1100 | 1300 |
| FIRM_02 | 800 | 900 | 950 | 1000 |
Good for:
Each observation is a row
| firm_id | quarter | vat_amount |
|---|---|---|
| FIRM_01 | Q1 | 1000 |
| FIRM_01 | Q2 | 1200 |
| FIRM_01 | Q3 | 1100 |
| FIRM_01 | Q4 | 1300 |
| FIRM_02 | Q1 | 800 |
Good for:
Most common transformation in tax data analysis
Wide quarterly VAT data:
| firm_id | vat_q1 | vat_q2 | vat_q3 | vat_q4 |
|---|---|---|---|---|
| FIRM_01 | 5000 | 5200 | 4800 | 5500 |
↓ Transform to long ↓
| firm_id | quarter | vat_amount |
|---|---|---|
| FIRM_01 | Q1 | 5000 |
| FIRM_01 | Q2 | 5200 |
| FIRM_01 | Q3 | 4800 |
| FIRM_01 | Q4 | 5500 |
Why? Required for:
pivot_longer() SyntaxBreaking it down:
cols: Columns to reshape (the ones with repeated measurements)names_to: Name for the new column that will hold the old column namesvalues_to: Name for the new column that will hold the valuesThink of it as “unpacking”
You’re taking columns (Q1, Q2, Q3, Q4) and unpacking them into rows, storing the column name (Q1) and its value (5000) separately.
# Show the wide format first
head(vat_wide, 3)
# Transform to long format
vat_long <- vat_wide %>%
pivot_longer(
cols = c(vat_q1, vat_q2, vat_q3, vat_q4),
names_to = "quarter",
values_to = "vat_amount"
)
# Show the result
head(vat_long, 6)
# Check dimensions
cat("Wide format:", nrow(vat_wide), "rows\n")
cat("Long format:", nrow(vat_long), "rows (4x more!)\n")Task: Reshape the provided wide CIT data to long format
exercise_04_01_template.Rdata/Intermediate/cit_wide.csvpivot_longer() to transform year columns to long format10:00 Less common, but important for specific tasks
Long format data:
| firm_id | year | tax_type | amount |
|---|---|---|---|
| FIRM_01 | 2023 | VAT | 20000 |
| FIRM_01 | 2023 | CIT | 15000 |
| FIRM_02 | 2023 | VAT | 18000 |
| FIRM_02 | 2023 | CIT | 12000 |
↓ Transform to wide ↓
| firm_id | year | VAT | CIT |
|---|---|---|---|
| FIRM_01 | 2023 | 20000 | 15000 |
| FIRM_02 | 2023 | 18000 | 12000 |
Why? Useful for:
pivot_wider() SyntaxBreaking it down:
id_cols: Columns that uniquely identify each row in the resultnames_from: Which column contains the values that will become new column namesvalues_from: Which column contains the values to fill the new columnsWatch out for duplicates!
If you have multiple rows with the same combination of id_cols and names_from, pivot_wider() will create a list-column. Always check your data first!
# Create sample data with VAT and CIT
tax_long <- tibble(
firm_id = rep(c("FIRM_01", "FIRM_02"), each = 2),
year = rep(2023, 4),
tax_type = rep(c("VAT", "CIT"), 2),
amount = c(20000, 15000, 18000, 12000)
)
# Show long format
print(tax_long)
# Transform to wide format
tax_wide <- tax_long %>%
pivot_wider(
id_cols = c(firm_id, year),
names_from = tax_type,
values_from = amount
)
# Show result
print(tax_wide)
# Now we can easily calculate the VAT/CIT ratio
tax_wide <- tax_wide %>%
mutate(vat_cit_ratio = VAT / CIT)
print(tax_wide)In tax administration, data lives in separate systems:
VAT System
CIT System
Firm Registry
The Goal
Combine these datasets to analyze firm behavior across multiple tax types and characteristics.
Three separate tables linked by firm_id:
Table 1: VAT (panel_vat)
| firm_id | quarter | vat_amount |
|---|---|---|
| FIRM_01 | Q1 | 5000 |
| FIRM_01 | Q2 | 5200 |
Table 2: CIT (panel_cit)
| firm_id | year | cit_amount |
|---|---|---|
| FIRM_01 | 2023 | 15000 |
| FIRM_02 | 2023 | 12000 |
Table 3: Firms (dt_firms)
| firm_id | industry | size |
|---|---|---|
| FIRM_01 | Retail | Medium |
| FIRM_02 | Services | Small |
The key: firm_id appears in all three tables, allowing us to link them together.
Join key = The column(s) used to match rows between tables
Table A (VAT)
| firm_id | vat |
|---|---|
| FIRM_01 | 5000 |
| FIRM_02 | 4500 |
| FIRM_03 | 6000 |
Table B (Firms)
| firm_id | industry |
|---|---|
| FIRM_01 | Retail |
| FIRM_02 | Services |
+
= ?
R matches rows where firm_id is the same
What happens with FIRM_03?
That depends on the join type!
left_join(): The WorkhorseMost commonly used join (80% of real-world cases)
Keeps all rows from left table
Table A (left)
| firm_id | vat |
|---|---|
| FIRM_01 | 5000 |
| FIRM_02 | 4500 |
| FIRM_03 | 6000 |
Table B (right)
| firm_id | industry |
|---|---|
| FIRM_01 | Retail |
| FIRM_02 | Services |
Result: left_join(A, B)
| firm_id | vat | industry |
|---|---|---|
| FIRM_01 | 5000 | Retail |
| FIRM_02 | 4500 | Services |
| FIRM_03 | 6000 | NA |
Note: FIRM_03 kept, but industry is NA
Why it’s the default
Preserves your primary dataset (the left table). Perfect for enriching existing data with additional attributes.
left_join() in Action# Create sample data
vat_data <- tibble(
firm_id = c("FIRM_01", "FIRM_02", "FIRM_03"),
quarter = c("Q1", "Q1", "Q1"),
vat_amount = c(5000, 4500, 6000)
)
firm_data <- tibble(
firm_id = c("FIRM_01", "FIRM_02"),
industry = c("Retail", "Services"),
size = c("Medium", "Small")
)
# Show original tables
cat("VAT data:\n")
print(vat_data)
cat("\nFirm data:\n")
print(firm_data)
# Perform left join
vat_enriched <- left_join(vat_data, firm_data, by = "firm_id")
cat("\nAfter left_join:\n")
print(vat_enriched)
# Check row counts
cat("\nRow count check:\n")
cat("VAT data:", nrow(vat_data), "rows\n")
cat("Result:", nrow(vat_enriched), "rows (same!)\n")Task: Join CIT data to firm characteristics
exercise_04_02_template.Rpanel_cit.csv and dt_firms.csvleft_join() to add firm characteristics to CIT data10:00 inner_join(): Only MatchesKeeps only rows that exist in BOTH tables
Table A
| firm_id | vat |
|---|---|
| FIRM_01 | 5000 |
| FIRM_02 | 4500 |
| FIRM_03 | 6000 |
Table B
| firm_id | industry |
|---|---|
| FIRM_01 | Retail |
| FIRM_02 | Services |
Result: inner_join(A, B)
| firm_id | vat | industry |
|---|---|---|
| FIRM_01 | 5000 | Retail |
| FIRM_02 | 4500 | Services |
Note: FIRM_03 dropped (no match in Table B)
full_join(): EverythingKeeps ALL rows from BOTH tables
Table A
| firm_id | vat |
|---|---|
| FIRM_01 | 5000 |
| FIRM_02 | 4500 |
Table B
| firm_id | cit |
|---|---|
| FIRM_02 | 12000 |
| FIRM_03 | 15000 |
Result: full_join(A, B)
| firm_id | vat | cit |
|---|---|---|
| FIRM_01 | 5000 | NA |
| FIRM_02 | 4500 | 12000 |
| FIRM_03 | NA | 15000 |
Note: All firms kept, with NAs where no match
anti_join(): The Detective ToolReturns rows from left table that DON’T have a match in right table
Table A (VAT filers)
| firm_id | vat |
|---|---|
| FIRM_01 | 5000 |
| FIRM_02 | 4500 |
| FIRM_03 | 6000 |
Table B (CIT filers)
| firm_id | cit |
|---|---|
| FIRM_01 | 15000 |
| FIRM_02 | 12000 |
Result: anti_join(A, B)
| firm_id | vat |
|---|---|
| FIRM_03 | 6000 |
Note: Only FIRM_03 returned (filed VAT but not CIT)
Critical for diagnostics!
Use anti_join() to find:
Same Example Data, Different Join Types:
Table A (VAT)
| firm_id | vat |
|---|---|
| FIRM_01 | 5000 |
| FIRM_02 | 4500 |
| FIRM_03 | 6000 |
Table B (CIT)
| firm_id | cit |
|---|---|
| FIRM_01 | 15000 |
| FIRM_02 | 12000 |
| FIRM_04 | 18000 |
left_join(A, B)
| firm_id | vat | cit |
|---|---|---|
| FIRM_01 | 5000 | 15000 |
| FIRM_02 | 4500 | 12000 |
| FIRM_03 | 6000 | NA |
Rows: 3 (all from A)
inner_join(A, B)
| firm_id | vat | cit |
|---|---|---|
| FIRM_01 | 5000 | 15000 |
| FIRM_02 | 4500 | 12000 |
Rows: 2 (only matches)
full_join(A, B)
| firm_id | vat | cit |
|---|---|---|
| FIRM_01 | 5000 | 15000 |
| FIRM_02 | 4500 | 12000 |
| FIRM_03 | 6000 | NA |
| FIRM_04 | NA | 18000 |
Rows: 4 (all from both)
anti_join(A, B)
| firm_id | vat |
|---|---|
| FIRM_03 | 6000 |
Rows: 1 (in A, not in B)
Sometimes one key isn’t enough
Problem: Same firm appears in multiple years
VAT data (panel)
| firm_id | year | quarter | vat |
|---|---|---|---|
| FIRM_01 | 2022 | Q1 | 5000 |
| FIRM_01 | 2023 | Q1 | 5200 |
Firm characteristics (also panel)
| firm_id | year | industry |
|---|---|---|
| FIRM_01 | 2022 | Retail |
| FIRM_01 | 2023 | Retail |
If you join only by firm_id, you’ll get duplicate matches!
Real data often has inconsistent naming
Table A uses ‘id’
| id | vat |
|---|---|
| FIRM_01 | 5000 |
Table B uses ‘firm_id’
| firm_id | industry |
|---|---|
| FIRM_01 | Retail |
⚠️ Horror Story
You run a join. Everything looks fine. You send results to your director.
😱 The next day:
“These numbers look wrong. Why do we have 50,000 firm-year observations when we only have 5,000 firms?”
What happened? Duplicate keys in the right table caused a row explosion.
Lesson: Trust, but verify. ALWAYS validate your joins.
by = "firm_id" but one table uses "firm_ID"BEFORE joining:
# 1. Check for duplicate keys in both tables
panel_vat %>%
count(firm_id, year) %>%
filter(n > 1)
dt_firms %>%
count(firm_id, year) %>%
filter(n > 1)
# 2. Verify key columns exist and have same type
str(panel_vat$firm_id)
str(dt_firms$firm_id)
# 3. Check for missing values in key columns
sum(is.na(panel_vat$firm_id))
sum(is.na(dt_firms$firm_id))AFTER joining:
# 4. Check row count - does it make sense?
cat("Before join:", nrow(panel_vat), "rows\n")
cat("After join:", nrow(merged_data), "rows\n")
# For left_join, should match left table (unless right has duplicates)
# For inner_join, should be less than or equal to either table
# 5. Find non-matches using anti_join
unmatched <- anti_join(panel_vat, dt_firms, by = c("firm_id", "year"))
cat("Unmatched rows:", nrow(unmatched), "\n")
# 6. Check for unexpected NAs in joined columns
merged_data %>%
summarize(
na_industry = sum(is.na(industry)),
na_size = sum(is.na(size))
)
# 7. Spot check a few rows
merged_data %>%
filter(firm_id == "FIRM_001") %>%
select(firm_id, year, vat_amount, industry, size)Task: The provided join produces the wrong number of rows. Find and fix the problem.
exercise_04_03_template.R15:00 Realistic Scenario: Computing VAT Gap by Firm Size
Your director wants to identify top 5 firms with largest VAT gaps in retail, segmented by size.
The Pipeline:
# Step 1: Reshape VAT from wide to long (if needed)
# In this example, panel_vat is already long, so we'll aggregate directly
# Step 2: Aggregate VAT to firm-year level
vat_annual <- panel_vat %>%
mutate(year = lubridate::year(declaration_date)) %>%
group_by(firm_id, year) %>%
summarize(
actual_vat = sum(vat_outputs - vat_inputs, na.rm = TRUE),
vat_inputs = sum(vat_inputs, na.rm = TRUE),
vat_outputs = sum(vat_outputs, na.rm = TRUE),
quarters_filed = n(),
.groups = "drop"
)
# Step 3: Aggregate CIT to get taxable income
cit_annual <- panel_cit %>%
mutate(year = lubridate::year(declaration_date)) %>%
group_by(firm_id, year) %>%
summarize(
taxable_income = sum(taxable_income, na.rm = TRUE),
.groups = "drop"
)
# Step 4: Join VAT with CIT data
vat_cit <- left_join(
vat_annual,
cit_annual,
by = c("firm_id", "year")
)
# Step 5: Join to firm characteristics
vat_with_firms <- left_join(
vat_cit,
dt_firms,
by = c("firm_id", "year")
)
# Step 6: Create size categories based on taxable income
vat_with_firms <- vat_with_firms %>%
mutate(
firm_size = case_when(
taxable_income < 50000 ~ "Small",
taxable_income >= 50000 & taxable_income < 125000 ~ "Medium",
taxable_income >= 125000 ~ "Large",
is.na(taxable_income) ~ "Unknown",
TRUE ~ "Unknown"
)
)
# Step 7: Calculate VAT gap (simplified example)
# VAT gap = Expected VAT - Actual VAT
vat_analysis <- vat_with_firms %>%
mutate(
expected_vat = vat_outputs * 0.15, # Assuming 15% VAT rate
vat_gap = expected_vat - actual_vat
)
# Step 8: Filter to retail and find top 5 by size
top_gaps <- vat_analysis %>%
filter(industry == "Retail", !is.na(vat_gap)) %>%
group_by(firm_size) %>%
slice_max(order_by = vat_gap, n = 5) %>%
ungroup()
# Step 9: Validate
cat("Analysis summary:\n")
cat("Total firms analyzed:", nrow(vat_analysis), "\n")
cat("Retail firms:", sum(vat_analysis$industry == "Retail", na.rm = TRUE), "\n")
cat("Top gaps by size:\n")
print(top_gaps %>% count(firm_size))
# Save results
fwrite(
vat_analysis,
here("data", "final", "vat_gap_analysis.csv")
)
cat("\n✓ VAT gap analysis complete!\n")Task: Identify top 5 retail firms with largest VAT gaps, by size category
exercise_04_final_template.Rdata/Final/vat_gap_analysis.csvThis analysis will help identify compliance risks by firm size!
20:00 pivot_longer() for most reshaping tasks
left_join() as your default join
anti_join() to find non-matches📖 Further Reading
R4DS Chapter 6: Data tidying
https://r4ds.hadley.nz/data-tidy
R4DS Chapter 20: Joins
https://r4ds.hadley.nz/joins
dplyr cheatsheet: Join functions reference
tidyr cheatsheet: Pivot functions reference
🎂 You can now:
You’ve solved the challenge!
You now know how to: - Reshape quarterly data to annual - Join firm characteristics - Calculate metrics like VAT gap - Segment by categories (firm size, industry) - Rank and identify top performers/risks
Practice makes perfect
The best way to master joins is to practice with real data. Don’t hesitate to experiment with different join types and see what happens!