dplyr: R Training
In this module, we’ll learn how to wrangle (clean, transform, and prepare) tax data using dplyr.
What does “data wrangling” mean?
Data wrangling is the process of transforming raw data into a format that’s ready for analysis.
Think of it like preparing ingredients before cooking: you wash, chop, and measure everything before the actual cooking starts.
By the end of this module, you'll be able to filter, select, transform, summarize, and group tax data with dplyr, and work with dates using lubridate.
dplyr is part of the tidyverse - a collection of R packages designed to work together seamlessly.
Advantages:
✅ Readable - verbs match what you want to do
✅ Consistent - similar syntax across operations
✅ Beginner-friendly - clear and logical
✅ Well-documented - tons of resources
Note
We’ll explain what %>% means in just a moment!
Before we dive into data wrangling, we need to understand a crucial concept:
Changes Don’t Stick Unless You Save Them!
When you transform data in R, the changes are NOT automatically saved. You must use the assignment operator <- to store results.
The assignment operator <- stores results in an object:
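A minimal sketch (the object name vat_due and the numbers are made up for illustration):
# Store a result with <- ; nothing is printed yet
vat_due <- 150 + 61

# Typing the object's name prints the stored value
vat_due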
[1] 211
Create a NEW object when you want to keep the original data unchanged; overwrite an existing object only when you are sure you no longer need the previous version.
The assignment operator <- has a keyboard shortcut:
Tip
Alt + - (minus) (Windows/Linux)
Option + - (minus) (Mac)
This types <- automatically!
The pipe operator (%>%) is the key to writing clear, readable code.
Think of it as “and then”:
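For example, a pipeline on a hypothetical tax_data table might look like this (tax_data and its columns are assumptions used only to illustrate the reading order):
tax_data %>%
  filter(filing_year == 2023) %>%        # and then keep only 2023 rows
  select(firm_id, vat_outputs) %>%       # and then keep specific columns
  arrange(desc(vat_outputs))             # and then sort the results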
How to read it:
Start with tax_data, and then (%>%) filter to 2023, and then (%>%) select specific columns, and then (%>%) sort the results.
Tip
Keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)
Let’s understand pipes with an everyday example:
Making coffee (without pipes):
Making coffee (with pipes):
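A sketch of the idea with placeholder functions (grind(), brew(), and pour() are made up here just to show the reading order):
# Placeholder functions defined only for this example
grind <- function(x) paste(x, "-> ground")
brew  <- function(x) paste(x, "-> brewed")
pour  <- function(x) paste(x, "-> poured")

# Without pipes: nested calls, read from the inside out
pour(brew(grind("beans")))

# With pipes: read left to right, "and then"
"beans" %>% grind() %>% brew() %>% pour()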
Let’s remind ourselves what data we’re working with:
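The overview below comes from a call like this (assuming the VAT panel is loaded as panel_vat, as in the exercises):
glimpse(panel_vat)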
Rows: 1,000
Columns: 8
$ firm_id <chr> "FIRM_0096", "FIRM_0081", "FIRM_0024", "FIRM_0091",…
$ declaration_date <date> 2023-01-31, 2022-05-31, 2023-03-31, 2023-08-31, 20…
$ vat_inputs <int> 45262, 9225, 15289, 9347, 47317, 44471, 23806, 4907…
$ vat_outputs <int> 45914, 29661, 10005, 14410, 33378, 45639, 19855, 25…
$ reimbursement_date <date> 2023-12-31, 2021-07-31, 2023-09-30, 2021-11-30, 20…
$ filing_year <int> 2023, 2022, 2023, 2023, 2023, 2022, 2021, 2022, 202…
$ filing_quarter <int> 1, 2, 1, 3, 3, 4, 1, 3, 4, 4, 4, 2, 2, 1, 3, 1, 4, …
$ filing_month <int> 1, 5, 3, 8, 9, 12, 2, 7, 12, 11, 12, 4, 6, 3, 9, 2,…
Note
glimpse() is a dplyr function that gives us a quick overview of our data - the column names, data types, and first few values.
dplyr has six main verbs (functions) for working with data:
| Verb | What it does | Example |
|---|---|---|
| filter() | Choose rows based on conditions | “Show me only 2023 declarations” |
| select() | Choose columns | “I only need firm ID and VAT amount” |
| mutate() | Create or modify columns | “Calculate net VAT” |
| arrange() | Sort rows | “Order by VAT amount, highest first” |
| summarize() | Calculate summaries | “What’s the average VAT payment?” |
| group_by() | Group data for grouped operations | “Average VAT by industry” |
Let’s learn each one with real tax data examples!
filter() lets you select specific rows based on conditions.
Basic example:
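A sketch consistent with the output below (the exact condition used in the original example may differ; the threshold is an assumption):
panel_vat %>%
  filter(vat_outputs > 45000) %>%   # keep only rows meeting the condition
  head(3)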
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0096 2023-01-31 45262 45914 2023-12-31
2: FIRM_0090 2022-12-31 44471 45639 2022-10-31
3: FIRM_0078 2021-06-30 46685 46336 2022-04-30
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2023 1 1
2: 2022 4 12
3: 2021 2 6
Important
When testing whether two values are equal, use == (two equal signs), not = (one equal sign): == means “is equal to”, while = is for assigning values.
You can filter using various comparison operators:
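For example, using the greater-than operator (the threshold is again an assumption):
panel_vat %>%
  filter(vat_outputs > 45000) %>%   # greater-than comparison
  head(3)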
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0096 2023-01-31 45262 45914 2023-12-31
2: FIRM_0090 2022-12-31 44471 45639 2022-10-31
3: FIRM_0078 2021-06-30 46685 46336 2022-04-30
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2023 1 1
2: 2022 4 12
3: 2021 2 6
Available operators:
- == equal to
- != not equal to
- > greater than
- < less than
- >= greater than or equal to
- <= less than or equal to

Combine conditions using & (AND) or | (OR):
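A sketch with made-up thresholds (not necessarily the ones used to produce the outputs below):
# AND: both conditions must be TRUE
panel_vat %>%
  filter(vat_outputs > 40000 & vat_inputs < 20000) %>%
  head(3)

# OR: at least one condition must be TRUE
panel_vat %>%
  filter(vat_inputs > 48000 | vat_outputs > 49000) %>%
  head(3)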
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0095 2023-10-31 8254 43505 2021-12-31
2: FIRM_0015 2022-02-28 1444 44201 2023-01-31
3: FIRM_0099 2023-01-31 16775 47471 2023-05-31
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2023 4 10
2: 2022 1 2
3: 2023 1 1
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0080 2022-07-31 49071 25070 2022-06-30
2: FIRM_0040 2022-12-31 43741 49806 2021-07-31
3: FIRM_0100 2021-10-31 48270 14916 2021-06-30
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2022 3 7
2: 2022 4 12
3: 2021 4 10
1. Check if value is in a list: %in%
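A sketch (the firm IDs in the vector are assumptions):
panel_vat %>%
  filter(firm_id %in% c("FIRM_0005", "FIRM_0010")) %>%
  head(3)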
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0010 2021-02-28 30557 40239 2022-03-31
2: FIRM_0005 2022-09-30 22196 30998 2022-04-30
3: FIRM_0005 2021-04-30 5016 30653 2021-10-31
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2021 1 2
2: 2022 3 9
3: 2021 2 4
Tip
%in% is much easier than writing firm_id == "FIRM_0001" | firm_id == "FIRM_0005" | ...
2. Check if value is within a range: between()
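A sketch (the bounds are assumptions):
panel_vat %>%
  filter(between(vat_outputs, 25000, 35000)) %>%   # inclusive range
  head(3)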
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0081 2022-05-31 9225 29661 2021-07-31
2: FIRM_0014 2023-09-30 47317 33378 2021-12-31
3: FIRM_0080 2022-07-31 49071 25070 2022-06-30
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2022 2 5
2: 2023 3 9
3: 2022 3 7
select() lets you choose which columns to keep.
Select specific columns:
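A sketch matching the output below:
panel_vat %>%
  select(firm_id, vat_inputs, vat_outputs) %>%
  head(3)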
firm_id vat_inputs vat_outputs
<char> <int> <int>
1: FIRM_0096 45262 45914
2: FIRM_0081 9225 29661
3: FIRM_0024 15289 10005
Remove columns with -:
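For example, dropping the reimbursement_date column:
panel_vat %>%
  select(-reimbursement_date) %>%
  head(3)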
firm_id declaration_date vat_inputs vat_outputs filing_year filing_quarter
<char> <Date> <int> <int> <int> <int>
1: FIRM_0096 2023-01-31 45262 45914 2023 1
2: FIRM_0081 2022-05-31 9225 29661 2022 2
3: FIRM_0024 2023-03-31 15289 10005 2023 1
filing_month
<int>
1: 1
2: 5
3: 3
Drop multiple columns:
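A sketch consistent with the output below:
panel_vat %>%
  select(-reimbursement_date, -filing_year) %>%
  head(3)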
firm_id declaration_date vat_inputs vat_outputs filing_quarter
<char> <Date> <int> <int> <int>
1: FIRM_0096 2023-01-31 45262 45914 1
2: FIRM_0081 2022-05-31 9225 29661 2
3: FIRM_0024 2023-03-31 15289 10005 1
filing_month
<int>
1: 1
2: 5
3: 3
mutate() creates new columns or modifies existing ones.
Basic calculation:
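A sketch consistent with the output below:
panel_vat %>%
  mutate(net_vat = vat_outputs - vat_inputs) %>%   # new column
  select(firm_id, vat_outputs, vat_inputs, net_vat) %>%
  head(3)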
firm_id vat_outputs vat_inputs net_vat
<char> <int> <int> <int>
1: FIRM_0096 45914 45262 652
2: FIRM_0081 29661 9225 20436
3: FIRM_0024 10005 15289 -5284
Note
The original data is NOT changed. mutate() creates a new version with the additional column.
You can create several columns at once:
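A sketch consistent with the output below (the 40,000 cutoff for large_taxpayer matches the flag used later in this module):
panel_vat %>%
  mutate(
    net_vat        = vat_outputs - vat_inputs,
    vat_ratio      = vat_inputs / vat_outputs,
    large_taxpayer = vat_outputs > 40000
  ) %>%
  select(firm_id, net_vat, vat_ratio, large_taxpayer) %>%
  head(3)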
firm_id net_vat vat_ratio large_taxpayer
<char> <int> <num> <lgcl>
1: FIRM_0096 652 0.9857995 TRUE
2: FIRM_0081 20436 0.3110145 FALSE
3: FIRM_0024 -5284 1.5281359 FALSE
if_else() creates columns based on conditions:
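A sketch consistent with the output below:
panel_vat %>%
  mutate(
    net_vat   = vat_outputs - vat_inputs,
    is_refund = if_else(net_vat < 0, TRUE, FALSE)   # TRUE when inputs exceed outputs
  ) %>%
  select(firm_id, net_vat, is_refund) %>%
  head(5)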
firm_id net_vat is_refund
<char> <int> <lgcl>
1: FIRM_0096 652 FALSE
2: FIRM_0081 20436 FALSE
3: FIRM_0024 -5284 TRUE
4: FIRM_0091 5063 FALSE
5: FIRM_0014 -13939 TRUE
Syntax: if_else(condition, value_if_TRUE, value_if_FALSE)
Tip
You can use text too: if_else(net_vat < 0, "Refund", "Payment")
arrange() sorts your data by one or more columns.
Ascending order (smallest to largest):
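A sketch consistent with the output below:
panel_vat %>%
  arrange(vat_outputs) %>%            # smallest values first
  select(firm_id, vat_outputs) %>%
  head(3)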
firm_id vat_outputs
<char> <int>
1: FIRM_0066 1014
2: FIRM_0044 1045
3: FIRM_0020 1073
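You can also sort by several columns, using desc() for descending order; a sketch consistent with the output below:
panel_vat %>%
  arrange(filing_year, desc(vat_outputs)) %>%   # by year, then highest VAT first
  select(firm_id, filing_year, vat_outputs) %>%
  head(5)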
firm_id filing_year vat_outputs
<char> <int> <int>
1: FIRM_0075 2021 49710
2: FIRM_0056 2021 49409
3: FIRM_0060 2021 49309
4: FIRM_0062 2021 49283
5: FIRM_0051 2021 49280
Note
This is useful for ranking: “Who are the top taxpayers each year?”
summarize() calculates summary statistics and reduces your data to a single row.
Basic example:
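A sketch consistent with the output below (assuming the summaries are computed on vat_outputs):
panel_vat %>%
  summarize(
    total_vat_collected = sum(vat_outputs, na.rm = TRUE),
    average_vat         = mean(vat_outputs, na.rm = TRUE),
    median_vat          = median(vat_outputs, na.rm = TRUE),
    num_declarations    = n()
  )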
total_vat_collected average_vat median_vat num_declarations
1 25883247 25883.25 26212.5 1000
Important
Always use na.rm = TRUE to remove missing values, otherwise you’ll get NA as the result!
| Function | What it calculates | Example |
|---|---|---|
| sum() | Total | Total VAT collected |
| mean() | Average | Average payment |
| median() | Middle value | Median VAT amount |
| min() | Smallest value | Minimum payment |
| max() | Largest value | Maximum payment |
| n() | Count of rows | Number of declarations |
| n_distinct() | Count unique values | Number of unique firms |
group_by() is powerful - it lets you perform operations separately for each group.
Example: Average VAT by year
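A sketch consistent with the output below:
panel_vat %>%
  group_by(filing_year) %>%                          # one bucket per year
  summarize(
    avg_vat          = mean(vat_outputs, na.rm = TRUE),
    num_declarations = n(),
    .groups = "drop"
  )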
# A tibble: 3 × 3
filing_year avg_vat num_declarations
<int> <dbl> <int>
1 2021 26088. 329
2 2022 25410. 332
3 2023 26148. 339
Note
Think of group_by() as creating separate “buckets” - one for each unique value. Then calculations happen within each bucket.
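You can also group by more than one variable. A sketch consistent with the output below (restricting to 2021-2022 is an assumption, made to match the 8 groups shown):
panel_vat %>%
  filter(filing_year < 2023) %>%                     # assumption: 2021-2022 only
  group_by(filing_year, filing_quarter) %>%          # one bucket per year-quarter
  summarize(
    avg_vat          = mean(vat_outputs, na.rm = TRUE),
    num_declarations = n(),
    .groups = "drop"
  )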
# A tibble: 8 × 4
filing_year filing_quarter avg_vat num_declarations
<int> <int> <dbl> <int>
1 2021 1 25976. 72
2 2021 2 27633. 77
3 2021 3 25752. 95
4 2021 4 25158. 85
5 2022 1 25747. 90
6 2022 2 24686. 78
7 2022 3 27352. 86
8 2022 4 23604. 78
Why .groups = "drop"?
After summarize(), R keeps the grouping structure by default. Adding .groups = "drop" removes this grouping, so later operations (filters, mutates, joins) work on the whole table rather than within leftover groups.
Always add .groups = "drop" after group_by() + summarize()!
count() is a shortcut for group_by() + summarize(n = n()).
Simple frequency table:
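A sketch consistent with the output below:
panel_vat %>%
  count(filing_year)   # number of declarations per year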
filing_year n
<int> <int>
1: 2021 329
2: 2022 332
3: 2023 339
distinct() keeps only unique rows.
Example: Get unique firms
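A sketch consistent with the output below:
panel_vat %>%
  distinct(firm_id) %>%   # one row per unique firm
  head(5)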
firm_id
<char>
1: FIRM_0096
2: FIRM_0081
3: FIRM_0024
4: FIRM_0091
5: FIRM_0014
Keep all columns:
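A sketch consistent with the output below:
panel_vat %>%
  distinct(firm_id, .keep_all = TRUE) %>%   # first row of each firm, all columns kept
  head(3)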
firm_id declaration_date vat_inputs vat_outputs reimbursement_date
<char> <Date> <int> <int> <Date>
1: FIRM_0096 2023-01-31 45262 45914 2023-12-31
2: FIRM_0081 2022-05-31 9225 29661 2021-07-31
3: FIRM_0024 2023-03-31 15289 10005 2023-09-30
filing_year filing_quarter filing_month
<int> <int> <int>
1: 2023 1 1
2: 2022 2 5
3: 2023 1 3
Tip
.keep_all = TRUE means “keep all the other columns too, not just the ones I specified”.
Let’s combine multiple operations in one pipeline:
# Complete analysis: Top 5 firms by average VAT in 2023
panel_vat %>%
filter(filing_year == 2023) %>% # Only 2023
mutate(net_vat = vat_outputs - vat_inputs) %>% # Calculate net VAT
group_by(firm_id) %>% # Group by firm
summarize(avg_net_vat = mean(net_vat, na.rm = TRUE), .groups = "drop") %>% # Average per firm
arrange(desc(avg_net_vat)) %>% # Sort highest first
  head(5)                                           # Top 5
# A tibble: 5 × 2
firm_id avg_net_vat
<chr> <dbl>
1 FIRM_0060 40856
2 FIRM_0053 27003
3 FIRM_0012 21663
4 FIRM_0022 21581
5 FIRM_0018 21551
Note
Read this from top to bottom like a recipe - each step builds on the previous one!
You can find the exercise in the file “Exercises/exercise_01_template.R”.
Your tasks (10 minutes):
Filtering: Filter panel_vat to declarations with vat_outputs greater than 30,000 and save as high_vat. Check how many rows using nrow().
Selecting: From panel_vat, select only firm_id, declaration_date, and vat_outputs. Display the first few rows with head().
Mutating: Create a new column called net_vat (vat_outputs - vat_inputs) and a flag is_refund (TRUE if net_vat < 0). Display the first few rows.
Arranging: Sort panel_cit by tax_paid (highest first) and show the top 10 taxpayers.
Summarizing: Calculate total, average, and median taxable_income from panel_cit.
Grouping: Calculate average vat_outputs by filing_year and count declarations per year using group_by() and summarize().
Challenge: Find the top 3 firms by total net VAT paid in 2022, showing firm_id and total amount.
# Load required packages
library(dplyr)
library(data.table)
library(here)
library(lubridate)
# Load data
panel_vat <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_vat.csv"))
panel_cit <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_cit.csv"))
# Convert dates
panel_vat$declaration_date <- as.Date(panel_vat$declaration_date)
panel_cit$declaration_date <- as.Date(panel_cit$declaration_date)
# 1. Filtering
high_vat <- panel_vat %>%
filter(vat_outputs > 30000)
nrow(high_vat)
# 2. Selecting
vat_selected <- panel_vat %>%
select(firm_id, declaration_date, vat_outputs)
head(vat_selected)
# 3. Mutating
vat_with_flags <- panel_vat %>%
mutate(
net_vat = vat_outputs - vat_inputs,
is_refund = if_else(net_vat < 0, TRUE, FALSE)
)
head(vat_with_flags)
# 4. Arranging
top_taxpayers <- panel_cit %>%
arrange(desc(tax_paid)) %>%
head(10)
top_taxpayers
# 5. Summarizing
cit_summary <- panel_cit %>%
summarize(
total_income = sum(taxable_income, na.rm = TRUE),
avg_income = mean(taxable_income, na.rm = TRUE),
median_income = median(taxable_income, na.rm = TRUE)
)
cit_summary
# 6. Grouping
vat_by_year <- panel_vat %>%
mutate(filing_year = year(declaration_date)) %>%
group_by(filing_year) %>%
summarize(
avg_vat = mean(vat_outputs, na.rm = TRUE),
num_declarations = n(),
.groups = "drop"
)
vat_by_year
# 7. Challenge
top_firms_2022 <- panel_vat %>%
mutate(filing_year = year(declaration_date)) %>%
filter(filing_year == 2022) %>%
mutate(net_vat = vat_outputs - vat_inputs) %>%
group_by(firm_id) %>%
summarize(total_vat = sum(net_vat, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(total_vat)) %>%
head(3)
top_firms_2022
Tax data is inherently time-based:
📆 Filing dates - to check if declarations are late
📆 Tax periods - to calculate quarterly, annual totals
📆 Payment deadlines - to calculate penalties
📆 Time gaps - to identify irregular filing patterns
We need to extract, calculate, and analyze dates properly!
The lubridate package makes working with dates much easier!
Common tasks include converting text to dates, extracting components such as year, month, and quarter, and calculating the time between dates.
Dates are often stored as text. We need to convert them to Date format:
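A sketch (the example dates are made up to match the output below):
date_text <- c("2023-01-15", "2023-06-20", "2023-12-31")

as.Date(date_text)                                   # base R
ymd(date_text)                                       # lubridate: year-month-day
dmy(c("15/01/2023", "20/06/2023", "31/12/2023"))     # lubridate: day-month-year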
[1] "2023-01-15" "2023-06-20" "2023-12-31"
[1] "2023-01-15" "2023-06-20" "2023-12-31"
[1] "2023-01-15" "2023-06-20" "2023-12-31"
Tip
Use the function that matches your format: ymd() for year-month-day, dmy() for day-month-year, etc.
Pull out specific parts of a date:
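A sketch consistent with the output below:
panel_vat %>%
  mutate(
    filing_year    = year(declaration_date),
    filing_month   = month(declaration_date),
    filing_quarter = quarter(declaration_date)
  ) %>%
  select(firm_id, declaration_date, filing_year, filing_month, filing_quarter) %>%
  head(5)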
firm_id declaration_date filing_year filing_month filing_quarter
<char> <Date> <int> <int> <int>
1: FIRM_0096 2023-01-31 2023 1 1
2: FIRM_0081 2022-05-31 2022 5 2
3: FIRM_0024 2023-03-31 2023 3 1
4: FIRM_0091 2023-08-31 2023 8 3
5: FIRM_0014 2023-09-30 2023 9 3
Note
This is useful for grouping declarations by time period!
Calculate how many days between dates:
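A sketch (the dates are chosen to match the output below):
as.Date("2023-01-25") - as.Date("2023-01-15")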
Time difference of 10 days
With real data:
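One possibility (an assumption about how the figure below was produced) is the span between the earliest and latest declaration dates:
max(panel_vat$declaration_date) - min(panel_vat$declaration_date)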
Time difference of 1064 days
Note
Date arithmetic in R is straightforward - just subtract one date from another!
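You can also add days to a date; a sketch consistent with the output below:
as.Date("2023-04-30") + 30   # 30 days later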
[1] "2023-05-30"
You can find the exercise in the file “Exercises/exercise_02_template.R”.
Your tasks (10 minutes):
Parse dates: Load panel_vat and convert declaration_date to proper Date format using as.Date(). Check with class().
Extract components: Create three new columns:
- filing_year using year()
- filing_quarter using quarter()
- filing_month using month()

Save as panel_vat_dates and display the first few rows.
Calculate differences: For each firm, calculate days_since_last (days between consecutive declarations) using arrange(), group_by(), lag(), and ungroup(). Display results.
Filing analysis: Calculate quarter_end, filing_deadline (45 days after quarter end), and days_late. Then summarize: how many declarations were late?
Recent filers: Find firms that filed in the last 180 days from today() using filter() and distinct().
# Load required packages
library(dplyr)
library(lubridate)
library(data.table)
library(here)
# Load data
panel_vat <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_vat.csv"))
# 1. Parse dates
panel_vat$declaration_date <- as.Date(panel_vat$declaration_date)
class(panel_vat$declaration_date)
# 2. Extract date components
panel_vat_dates <- panel_vat %>%
mutate(
filing_year = year(declaration_date),
filing_quarter = quarter(declaration_date),
filing_month = month(declaration_date)
)
head(panel_vat_dates)
# 3. Calculate days between declarations
panel_vat_gaps <- panel_vat_dates %>%
arrange(firm_id, declaration_date) %>%
group_by(firm_id) %>%
mutate(days_since_last = as.numeric(declaration_date - lag(declaration_date))) %>%
ungroup()
panel_vat_gaps %>%
select(firm_id, declaration_date, days_since_last) %>%
head(10)
# 4. Filing analysis
panel_vat_deadlines <- panel_vat_dates %>%
mutate(
quarter_end = ceiling_date(declaration_date, "quarter") - 1,
filing_deadline = quarter_end + 45,
days_late = as.numeric(declaration_date - filing_deadline)
)
panel_vat_deadlines %>%
summarize(
total_declarations = n(),
late_filers = sum(days_late > 0, na.rm = TRUE),
pct_late = round(100 * late_filers / total_declarations, 2)
)
# 5. Recent filers
recent_filers <- panel_vat_dates %>%
filter(declaration_date >= (today() - 180)) %>%
distinct(firm_id)
nrow(recent_filers)
head(recent_filers)
# Combine all transformations
panel_vat_dates <- panel_vat_deadlines %>%
left_join(
panel_vat_gaps %>% select(firm_id, declaration_date, days_since_last),
by = c("firm_id", "declaration_date")
  )
The real power of dplyr comes from combining operations in a logical sequence:
# Complete workflow example
analysis_result <- panel_vat %>%
# Step 1: Calculate new variables
mutate(
net_vat = vat_outputs - vat_inputs,
filing_quarter = quarter(declaration_date)
) %>%
# Step 2: Filter to recent data
filter(filing_year == 2023) %>%
# Step 3: Analyze by quarter
group_by(filing_quarter) %>%
summarize(
total_vat = sum(net_vat, na.rm = TRUE),
avg_vat = mean(net_vat, na.rm = TRUE),
num_firms = n_distinct(firm_id),
.groups = "drop"
)
analysis_result
# A tibble: 4 × 4
filing_quarter total_vat avg_vat num_firms
<int> <int> <dbl> <int>
1 1 442405 5395. 54
2 2 15185 167. 59
3 3 -104600 -1113. 62
4 4 179021 2486. 49
Sometimes it’s better to break long pipelines into smaller pieces:
Good for complex workflows:
# Step 1: Add derived variables
vat_enhanced <- panel_vat %>%
mutate(
net_vat = vat_outputs - vat_inputs,
filing_quarter = quarter(declaration_date)
)
# Step 2: Analyze
vat_summary <- vat_enhanced %>%
group_by(filing_year, filing_quarter) %>%
summarize(avg_vat = mean(net_vat, na.rm = TRUE), .groups = "drop")
head(vat_summary)
# A tibble: 6 × 3
filing_year filing_quarter avg_vat
<int> <int> <dbl>
1 2021 1 -519.
2 2021 2 2447.
3 2021 3 219.
4 2021 4 442.
5 2022 1 100.
6 2022 2 -1321.
Tip
Break into steps when the pipeline gets long, when you want to inspect intermediate results, or when you will reuse an intermediate dataset in later steps.
Build up your analysis with meaningful calculated columns:
# Create a rich dataset with many derived variables
panel_vat_enhanced <- panel_vat %>%
mutate(
# Financial calculations
net_vat = vat_outputs - vat_inputs,
vat_ratio = vat_inputs / vat_outputs,
# Business logic flags
is_refund = if_else(net_vat < 0, TRUE, FALSE),
large_taxpayer = if_else(vat_outputs > 40000, TRUE, FALSE),
high_ratio = if_else(vat_ratio > 0.9, TRUE, FALSE)
)
# Check what we created
panel_vat_enhanced %>%
select(firm_id, net_vat, is_refund, large_taxpayer, high_ratio) %>%
  head(5)
     firm_id net_vat is_refund large_taxpayer high_ratio
<char> <int> <lgcl> <lgcl> <lgcl>
1: FIRM_0096 652 FALSE TRUE TRUE
2: FIRM_0081 20436 FALSE FALSE FALSE
3: FIRM_0024 -5284 TRUE FALSE TRUE
4: FIRM_0091 5063 FALSE FALSE FALSE
5: FIRM_0014 -13939 TRUE FALSE TRUE
Save your enhanced datasets for use in later modules:
# Save enhanced VAT panel to Clean folder
panel_vat_clean <- panel_vat %>%
mutate(
net_vat = vat_outputs - vat_inputs,
filing_quarter = quarter(declaration_date),
is_refund = net_vat < 0,
large_taxpayer = vat_outputs > 40000
)
fwrite(panel_vat_clean,
here("r_training_datax", "Exercises", "data", "Clean", "panel_vat_clean.csv"))
# Save enhanced CIT panel to Clean folder
panel_cit_clean <- panel_cit %>%
mutate(
filing_quarter = quarter(declaration_date),
effective_tax_rate = tax_paid / taxable_income
)
fwrite(panel_cit_clean,
here("r_training_datax", "Exercises", "data", "Clean", "panel_cit_clean.csv"))Keep your files organized:
r_training_datax/
└── Exercises/
└── data/
├── Raw/ # Original, untouched data
│ ├── firm_characteristics.csv
│ ├── vat_declarations.dta
│ └── cit_declarations.xlsx
├── intermediate/ # After Module 2 cleaning
│ ├── dt_firms.csv
│ ├── panel_vat.csv
│ └── panel_cit.csv
└── Clean/ # After Module 3 transformation
├── panel_vat_clean.csv
└── panel_cit_clean.csv
Tip
Save enhanced versions to Clean folder so you can use them in future modules!
You can find the exercise in the file “Exercises/exercise_03_template.R”.
Your tasks (15 minutes):
Create fully transformed datasets ready for analysis:
Transform panel_vat:
- Add net_vat, filing_year, filing_quarter, filing_month
- Calculate vat_ratio and days_since_last per firm
- Flag is_refund, large_taxpayer (>40000), high_ratio (>0.9)
- Save to data/Clean/panel_vat_clean.csv using fwrite()

Transform panel_cit:
- Add filing_year, filing_quarter
- Calculate effective_tax_rate (tax_paid / taxable_income)
- Flag has_adjustments (adjustments != 0)
- Save to data/Clean/panel_cit_clean.csv using fwrite()

Transform dt_firms:
- Clean the column names with clean_names()
- Save to data/Clean/dt_firms_clean.csv using fwrite()

Verify your work:
- Check the files exist with file.exists()
- Inspect the columns with names() and glimpse()

Remember: These cleaned datasets will be used in Modules 4, 5, and 6!
# Load required packages
library(dplyr)
library(lubridate)
library(data.table)
library(here)
library(janitor)
# Load data from intermediate folder
panel_vat <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_vat.csv"))
panel_cit <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "panel_cit.csv"))
dt_firms <- fread(here("r_training_datax", "Exercises", "data", "intermediate", "dt_firms.csv"))
# Convert dates
panel_vat$declaration_date <- as.Date(panel_vat$declaration_date)
panel_cit$declaration_date <- as.Date(panel_cit$declaration_date)
# 1. Transform panel_vat
panel_vat_clean <- panel_vat %>%
# Add date components
mutate(
filing_year = year(declaration_date),
filing_quarter = quarter(declaration_date),
filing_month = month(declaration_date)
) %>%
# Calculate financial metrics
mutate(
net_vat = vat_outputs - vat_inputs,
vat_ratio = vat_inputs / vat_outputs
) %>%
# Create business flags
mutate(
is_refund = if_else(net_vat < 0, TRUE, FALSE),
large_taxpayer = if_else(vat_outputs > 40000, TRUE, FALSE),
high_ratio = if_else(vat_ratio > 0.9, TRUE, FALSE, missing = FALSE)
) %>%
# Calculate days between declarations per firm
arrange(firm_id, declaration_date) %>%
group_by(firm_id) %>%
mutate(days_since_last = as.numeric(declaration_date - lag(declaration_date))) %>%
ungroup()
# Save to Clean folder
fwrite(panel_vat_clean,
here("r_training_datax", "Exercises", "data", "Clean", "panel_vat_clean.csv"))
# 2. Transform panel_cit
panel_cit_clean <- panel_cit %>%
# Add date components
mutate(
filing_year = year(declaration_date),
filing_quarter = quarter(declaration_date)
) %>%
# Calculate financial metrics
mutate(
effective_tax_rate = tax_paid / taxable_income
) %>%
# Create business flags
mutate(
has_adjustments = if_else(adjustments != 0, TRUE, FALSE)
)
# Save to Clean folder
fwrite(panel_cit_clean,
here("r_training_datax", "Exercises", "data", "Clean", "panel_cit_clean.csv"))
# 3. Transform dt_firms
dt_firms_clean <- dt_firms %>%
clean_names()
# Save to Clean folder
fwrite(dt_firms_clean,
here("r_training_datax", "Exercises", "data", "Clean", "dt_firms_clean.csv"))
# 4. Verify the work
file.exists(here("r_training_datax", "Exercises", "data", "Clean", "panel_vat_clean.csv"))
file.exists(here("r_training_datax", "Exercises", "data", "Clean", "panel_cit_clean.csv"))
file.exists(here("r_training_datax", "Exercises", "data", "Clean", "dt_firms_clean.csv"))
# Load them back
vat_verify <- fread(here("r_training_datax", "Exercises", "data", "Clean", "panel_vat_clean.csv"))
cit_verify <- fread(here("r_training_datax", "Exercises", "data", "Clean", "panel_cit_clean.csv"))
firms_verify <- fread(here("r_training_datax", "Exercises", "data", "Clean", "dt_firms_clean.csv"))
# Check columns
names(vat_verify)
names(cit_verify)
names(firms_verify)
# Quick summary
glimpse(vat_verify)
glimpse(cit_verify)
glimpse(firms_verify)
Core dplyr verbs:
- filter() - select rows
- select() - choose columns
- mutate() - create/modify columns
- arrange() - sort data
- summarize() + group_by() - aggregate
- count() - frequencies
- distinct() - remove duplicates

Date operations:
- Use lubridate to parse dates, extract components, and calculate time differences

✅ Always use pipes (%>%) to chain operations clearly
✅ Save your work with <- for variables you’ll use again
✅ Include na.rm = TRUE in calculations
✅ Use meaningful variable names (net_vat, not x1)
✅ Comment your code so others understand it
✅ Save transformed datasets to Clean folder for future modules
✅ Add .groups = "drop" after group_by() + summarize()