R Training
Average VAT gap by industry:
As a Table:
| industry | avg_gap |
|---|---|
| Retail | 8209 |
| Services | 12158 |
| Manufacturing | 8861 |
| Technology | 17238 |
As Numbers:
| min | mean | max |
|---|---|---|
| -84038 | 11411 | 125974 |
Which tells the story instantly?
Now we instantly see: Technology has the highest average gap!
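As a rough sketch, the table above can be turned into a sorted bar chart like this. The numbers are copied from the table; the object name `gap_by_industry` and the styling choices are illustrative assumptions.

```r
library(ggplot2)

# Values copied from the "As a Table" slide
gap_by_industry <- data.frame(
  industry = c("Retail", "Services", "Manufacturing", "Technology"),
  avg_gap  = c(8209, 12158, 8861, 17238)
)

# Sort bars by value so the largest gap stands out
p <- ggplot(gap_by_industry, aes(x = reorder(industry, avg_gap), y = avg_gap)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Average VAT gap by industry", x = NULL, y = "Average gap") +
  theme_minimal()
p
```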
For tax administration:
Note
A good visualization answers a question immediately
Three rules:
❌ Bad:
- 3D charts
- Rainbow colors with no meaning
- Alphabetical sorting
✓ Good:
- Simple, clear charts
- Meaningful colors
- Sorted by value
Source: ActiveWizards
Today’s focus: Bar charts, line charts, scatter plots
Most tools: “Make me a bar chart”
ggplot2: Build plots layer by layer with a systematic grammar
Note
If you understand the grammar, you can create ANY plot
Three essential components:
Every plot follows this pattern:
Example:
Important
Note the + at the end of the first line - this connects layers together!
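A minimal sketch of the pattern, assuming a small made-up data frame. The `firms` table and its values are hypothetical and only stand in for the course dataset.

```r
library(ggplot2)

# Hypothetical data for illustration only
firms <- data.frame(
  expected_vat = c(10, 20, 30, 40),
  actual_vat   = c(9, 18, 31, 35)
)

# Template: ggplot(data, aes(...)) + geom_*()
p <- ggplot(firms, aes(x = expected_vat, y = actual_vat)) +  # the + connects layers
  geom_point()
p
```

If the `+` were moved to the start of the second line, R would treat the first line as a complete statement and the geom would not be added.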
Step 1: Tell ggplot what data
Empty gray box - ggplot is ready but doesn’t know what to plot
Step 2: Map variables to x and y
Now we have axes, but no data yet!
Step 3: Add geometric shapes
Complete plot! 🎉
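The three steps can be sketched as three objects built one after another. The `firms` data frame is a hypothetical stand-in, not the course dataset.

```r
library(ggplot2)

# Hypothetical data for illustration only
firms <- data.frame(
  expected_vat = c(10, 20, 30, 40),
  actual_vat   = c(9, 18, 31, 35)
)

# Step 1: data only, which draws an empty gray panel
step1 <- ggplot(firms)

# Step 2: map variables to x and y, which adds axes but no data
step2 <- ggplot(firms, aes(x = expected_vat, y = actual_vat))

# Step 3: add a geom, and the points appear
step3 <- step2 + geom_point()
step3
```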
Aesthetics map your data to visual properties
Three main aesthetics you’ll use:
The foundation of every plot:
Position tells you which firm has which values
Color can show categories:
Now we see: Large firms (orange) cluster in top-right
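A short sketch of mapping a category to color, assuming a made-up `firms` table with a `firm_size` column. The values are hypothetical and the colors ggplot2 picks by default may differ from the slide.

```r
library(ggplot2)

# Hypothetical data for illustration only
firms <- data.frame(
  expected_vat = c(10, 15, 20, 60, 70, 80),
  actual_vat   = c(9, 14, 18, 55, 68, 75),
  firm_size    = c("Small", "Small", "Small", "Large", "Large", "Large")
)

# x and y give position; color distinguishes the firm-size categories
p <- ggplot(firms, aes(x = expected_vat, y = actual_vat, color = firm_size)) +
  geom_point()
p
```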
Different geoms use different aesthetics:
color - for points and lines:

Use color for geom_point() and geom_line()
Bars can have both fill (interior) and color (border):
Tip
Quick rule: Points use color, Bars use fill
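The quick rule can be sketched side by side. Both data frames below are hypothetical examples.

```r
library(ggplot2)

# Hypothetical data for illustration only
firms <- data.frame(
  expected_vat = c(10, 20, 60, 80),
  actual_vat   = c(9, 18, 55, 75),
  firm_size    = c("Small", "Small", "Large", "Large")
)
totals <- data.frame(
  industry  = c("Retail", "Services", "Technology"),
  total_vat = c(120, 90, 150)
)

# Points: use color
points_plot <- ggplot(firms, aes(x = expected_vat, y = actual_vat, color = firm_size)) +
  geom_point()

# Bars: fill sets the interior, color sets the border
bars_plot <- ggplot(totals, aes(x = industry, y = total_vat, fill = industry)) +
  geom_bar(stat = "identity", color = "black")

points_plot
bars_plot
```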
Variables go inside aes(), fixed values go outside:
Important
If it’s in your data → use aes()
If it’s a fixed choice → outside aes()
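A small sketch of the rule, assuming a made-up `firms` table (hypothetical values):

```r
library(ggplot2)

# Hypothetical data for illustration only
firms <- data.frame(
  expected_vat = c(10, 20, 60, 80),
  actual_vat   = c(9, 18, 55, 75),
  firm_size    = c("Small", "Small", "Large", "Large")
)

# firm_size is a column in the data, so it goes inside aes()
by_size <- ggplot(firms, aes(x = expected_vat, y = actual_vat)) +
  geom_point(aes(color = firm_size))

# "blue" is a fixed choice, so it goes outside aes()
all_blue <- ggplot(firms, aes(x = expected_vat, y = actual_vat)) +
  geom_point(color = "blue")

by_size
all_blue
```

Writing `aes(color = "blue")` instead would treat "blue" as a category label, so the points would get a default palette color and a legend entry named "blue".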
Answer: Only C is correct!
- aes()
- industry should be inside aes()
Best for: Comparing values across categories
In tax administration:
- Total VAT by industry
- Number of audits by region
- Compliance rates by sector
Problem: Hard to read vertical labels
Much better! coord_flip() rotates the plot
reorder(industry, total_vat) sorts industries by VAT amount
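A hedged sketch of a sorted, flipped bar chart. The `industry_totals` data frame and its values are hypothetical; only the column names `industry` and `total_vat` come from the slides.

```r
library(ggplot2)

# Hypothetical totals for illustration only
industry_totals <- data.frame(
  industry  = c("Retail", "Services", "Manufacturing", "Technology"),
  total_vat = c(120, 90, 75, 150)
)

# reorder() sorts the bars by value; coord_flip() makes the labels horizontal
p <- ggplot(industry_totals, aes(x = reorder(industry, total_vat), y = total_vat)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(x = NULL, y = "Total VAT") +
  theme_minimal()
p
```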
Task: Create a grouped bar chart showing VAT collection by industry and firm size
Steps:
1. Summarize actual_vat by industry AND firm_size
2. Map industry to the x-axis and total_vat to the y-axis
3. Set fill = firm_size to color bars by firm size
4. Use position = "dodge" to make bars side-by-side
5. Add coord_flip()
6. Add theme_minimal()
Question: Which combination (industry + firm size) collects the most VAT?
When you have TWO categorical variables:
Create side-by-side bars with position = "dodge":
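A minimal sketch of grouped bars. The `industry_size` name and its columns match the stacked example in these slides, but the values here are made up for illustration.

```r
library(ggplot2)

# Hypothetical summary table: one row per industry and firm size
industry_size <- data.frame(
  industry  = rep(c("Retail", "Services", "Technology"), each = 2),
  firm_size = rep(c("Small", "Large"), times = 3),
  total_vat = c(40, 80, 30, 60, 50, 100)
)

# position = "dodge" places the firm-size bars next to each other
p <- ggplot(industry_size, aes(x = industry, y = total_vat, fill = firm_size)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(x = NULL, y = "Total VAT", fill = "Firm Size") +
  theme_minimal()
p
```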
Show total AND breakdown with stacked bars:
```r
ggplot(industry_size, aes(x = reorder(industry, total_vat), y = total_vat, fill = firm_size)) +
  geom_bar(stat = "identity", position = "stack") +  # position = "stack" is default
  coord_flip() +
  labs(
    title = "VAT Collection by Industry (by Firm Size)",
    x = NULL,
    y = "Total VAT (Millions USD)",
    fill = "Firm Size"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()
```

Each bar shows the total; colors show the contribution from each firm size
Use Grouped Bars when:
Example: Compare small vs large firms across industries
Use Stacked Bars when:
Example: Total VAT with breakdown by firm size
Warning
Avoid stacking when readers need to compare individual segments: the middle and top segments do not share a baseline, so they are hard to read!
Best for: Showing relationships between two continuous variables
In tax administration:
- Expected vs actual VAT (compliance)
- Firm size vs tax liability
- Inputs vs outputs
Shows relationship between expected and actual VAT
Now we see: Pattern differs by firm size
Red line shows perfect compliance (actual = expected)
geom_smooth(method = "lm") adds linear trend line
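The scatter plot with a reference line and a trend line might look like the sketch below. The data are simulated (hypothetical), with actual VAT set to roughly 90% of expected VAT plus noise.

```r
library(ggplot2)

# Simulated firms for illustration only
set.seed(42)
firms <- data.frame(expected_vat = runif(60, min = 10, max = 100))
firms$actual_vat <- 0.9 * firms$expected_vat + rnorm(60, sd = 5)
firms$firm_size  <- ifelse(firms$expected_vat > 60, "Large", "Small")

p <- ggplot(firms, aes(x = expected_vat, y = actual_vat)) +
  geom_point(aes(color = firm_size)) +
  # Reference line: actual = expected (perfect compliance)
  geom_abline(intercept = 0, slope = 1, color = "red") +
  # Linear trend fitted to all firms
  geom_smooth(method = "lm") +
  theme_minimal()
p
```

Points below the red line are firms whose actual VAT falls short of expected VAT.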
Task: Create a scatter plot of VAT inputs vs outputs
Steps:
1. Use the vat_gap_analysis dataset
2. Map vat_inputs to the x-axis and vat_outputs to the y-axis
3. Add geom_point() with color by industry
4. Add geom_abline(intercept = 0, slope = 1)
Question: What does the reference line represent?
Best for: Showing trends over time
In tax administration:
- Monthly VAT collection
- Compliance rates over years
- Quarterly trends by sector
Best practice: Add geom_point() to show actual data points
Color automatically creates separate lines for each group!
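A small sketch of a line chart with one line per group. The `yearly` data frame and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical yearly totals for two industries
yearly <- data.frame(
  year      = rep(2021:2023, times = 2),
  industry  = rep(c("Retail", "Services"), each = 3),
  total_vat = c(100, 110, 125, 80, 95, 90)
)

# Mapping color to industry draws one line per industry
p <- ggplot(yearly, aes(x = year, y = total_vat, color = industry)) +
  geom_line() +
  geom_point() +                            # show the actual data points
  scale_x_continuous(breaks = 2021:2023) +  # whole years on the axis
  theme_minimal()
p
```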
Task: Create a line chart showing average VAT gap over time
Steps:
1. Average vat_gap by year
2. Use geom_line() and geom_point()
3. Map year to the x-axis and the average gap to the y-axis
4. Add theme_minimal()
Question: Is the VAT gap increasing or decreasing?
When you have many categories, faceting is better than color:
Each industry gets its own panel - much clearer!
ncol = 3 controls number of columns
Use faceting when:
Syntax:
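A minimal faceting sketch, assuming simulated data. The six industry names and all values are hypothetical; `taxable_income` and `actual_vat` are the column names used in the exercise.

```r
library(ggplot2)

# Simulated firms for illustration only: 10 per industry
set.seed(1)
firms <- data.frame(
  industry = rep(c("Retail", "Services", "Manufacturing",
                   "Technology", "Construction", "Agriculture"), each = 10),
  taxable_income = runif(60, min = 50, max = 500)
)
firms$actual_vat <- 0.15 * firms$taxable_income + rnorm(60, sd = 8)

# facet_wrap(~ variable) gives each category its own panel
p <- ggplot(firms, aes(x = taxable_income, y = actual_vat)) +
  geom_point() +
  facet_wrap(~ industry, ncol = 3) +  # 6 industries in 3 columns = 2 rows
  theme_minimal()
p
```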
Task: Create faceted scatter plots by industry
Steps:
1. Plot taxable_income vs actual_vat
2. Use geom_point()
3. Add facet_wrap(~ industry)
4. Add geom_smooth(method = "lm")
Always include:
```r
# head() takes the first 100 rows whether the data is a data.frame, tibble, or data.table
ggplot(head(vat_gap_analysis, 100), aes(x = expected_vat / 1000, y = actual_vat / 1000)) +
  geom_point(aes(color = firm_size)) +
  labs(
    title = "VAT Compliance by Firm Size",
    subtitle = "Fiscal Years 2021-2023",
    x = "Expected VAT (Thousands USD)",
    y = "Actual VAT (Thousands USD)",
    color = "Firm Size",
    caption = "Source: Tax Administration Database"
  ) +
  theme_minimal()
```

Recommended theme for reports:
theme_minimal() gives clean, modern look
ColorBrewer palettes: “Set1”, “Set2”, “Dark2” are good defaults
```r
# First, create the plot and save it to an object
my_plot <- ggplot(vat_gap_analysis, aes(x = expected_vat, y = actual_vat)) +
  geom_point() +
  theme_minimal()

# Save as PNG for presentations
ggsave(
  filename = "vat_compliance.png",
  plot = my_plot,
  width = 10,
  height = 6,
  dpi = 300
)

# Save as PDF for reports
ggsave("vat_compliance.pdf", plot = my_plot, width = 10, height = 6)
```

Before sharing any visualization:
- Clean theme applied (theme_minimal())
Task: Create three professional plots for a tax report
Create:
All plots must have:
- Proper titles and labels
- theme_minimal()
- Professional colors
- Data source caption
Save all three as PNG files (300 dpi)
Core concepts:
- When to use bar charts, scatter plots, and line charts
- The ggplot2 template: ggplot() + aes() + geom_*()
- Mapping data to aesthetics (x, y, color, fill)
- Fill vs color: points use color, bars use fill
- The aes() rule: variables inside, fixed values outside
Practical skills:
- Create professional visualizations
- Grouped and stacked bars for multiple categories
- Add labels and themes
- Use faceting for multiple groups
- Save high-quality plots
⚠️ Forgetting the +
⚠️ Wrong use of aes()
- Variables → inside aes()
- Fixed values → outside aes()
Bar chart:

```r
# Simple bar chart
ggplot(data, aes(x = category, y = value)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip()

# Grouped bars
ggplot(data, aes(x = category, y = value, fill = group)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip()

# Stacked bars
ggplot(data, aes(x = category, y = value, fill = group)) +
  geom_bar(stat = "identity", position = "stack") +
  coord_flip()
```

Scatter plot:
Line chart:
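In the same spirit as the bar chart templates, scatter and line templates could be sketched as follows. The placeholder `data` frame and the column names `x_var`, `y_var`, and `group` are assumptions; swap in your own.

```r
library(ggplot2)

# Placeholder data; replace with your own data frame and column names
data <- data.frame(
  x_var = c(1, 2, 3, 1, 2, 3),
  y_var = c(2, 4, 5, 1, 3, 6),
  group = rep(c("A", "B"), each = 3)
)

# Scatter plot
scatter_plot <- ggplot(data, aes(x = x_var, y = y_var, color = group)) +
  geom_point()

# Line chart
line_chart <- ggplot(data, aes(x = x_var, y = y_var, color = group)) +
  geom_line() +
  geom_point()

scatter_plot
line_chart
```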
Remember: Points use color, Bars use fill
Documentation:
- R for Data Science, Chapter 2
- ggplot2 Cheat Sheet
Examples:
- R Graph Gallery
Tip
When stuck: Google “ggplot how to…” - there’s almost always an example!
You can now:
- Choose the right plot type for your question
- Create professional tax visualizations
- Communicate data insights effectively
Practice makes perfect:
- Try these techniques with your own data
- Start simple, add complexity gradually
- Share plots with colleagues for feedback
Questions?