Tidy Data | Data Frosch Guides

Introduction: What is tidy data and why it matters

Tidy data is a specific, consistent way of organising a dataset so that every tool be it a spreadsheet, a script, or a visualisation library can read it without surprises. The concept comes from statistician Hadley Wickham, but the underlying idea is simple enough to apply in Excel.

Three rules of tidy data

Each variable forms one column. A variable is a single type of measurement or attribute: population, year, country name, contract value. If two different things are crammed into the same column, or if one thing is spread across several columns, the data is not tidy.

✗ Not tidy — year as column headers

country	2021	2022	2023
Germany	83m	84m	84m
France	68m	68m	68m

✓ Tidy — year as a variable

country	year	population
Germany	2021	83m
Germany	2022	84m
France	2021	68m

✗ Not tidy — one variable split across columns

city	male_count	female_count
Berlin	1,823,000	1,901,000
Hamburg	912,000	954,000

✓ Tidy — gender as a variable

city	gender	count
Berlin	male	1,823,000
Berlin	female	1,901,000
Hamburg	male	912,000
Hamburg	female	954,000

✗ Not tidy — two variables in one column

contract_id	location	amount
C-001	Amsterdam (NL)	€120,000
C-002	Berlin (DE)	€45,000
C-003	Paris (FR)	€80,000

✓ Tidy — city and country as separate columns

contract_id	city	country	amount
C-001	Amsterdam	NL	€120,000
C-002	Berlin	DE	€45,000
C-003	Paris	FR	€80,000

Each observation forms one row. An observation is one complete record: the population of one country in one year, or the value of one procurement contract. If a single observation spans multiple rows, or if one row contains several observations, the data is not tidy.

✗ Not tidy — one party per two rows

party	measure	value
SPD	votes	8,254,861
SPD	vote_share	25.7%
CDU	votes	8,775,471
CDU	vote_share	27.4%

✓ Tidy — one party per row

party	votes	vote_share
SPD	8,254,861	25.7%
CDU	8,775,471	27.4%

Each type of observational unit forms one table. Don't mix things that belong in separate tables. Contract details and buyer addresses should live in two linked tables, not one flat sheet with repeated address fields on every row.

✗ Not tidy — buyer address repeated on every contract row

contract_id	amount	buyer	buyer_city	buyer_country
C-001	€120k	Min. Finance	Berlin	DE
C-002	€45k	Min. Finance	Berlin	DE
C-003	€80k	Min. Finance	Berlin	DE

✓ Tidy — contracts table + buyers table, linked by ID

contract_id	amount	buyer_id
C-001	€120k	B-01
C-002	€45k	B-01

buyer_id	buyer	city	country
B-01	Min. Finance	Berlin	DE

Most real-world data you receive such as government spreadsheets, scraped tables, exported reports will break at least one of these rules. That is not a moral failing on anyone's part. It reflects the fact that data is often laid out for human readers, not for machines. Tidying it up is a routine part of data journalism.

Other common issues

These problems don't break the three tidy rules, but they will break your analysis just as surely.

Merged cells & colour-coded meaning

Spreadsheets designed for human readers use visual tricks — merged region headers, highlighted rows — that tools cannot read. Unmerge and fill down; replace colour with a text column.

✗ Merged region header + colour-coded row

North Region
city	year	contracts
Hamburg	2023	142
Bremen	2023	38
Kiel	2023	61

✓ Region filled down; status as a column

region	city	year	contracts	status
North	Hamburg	2023	142
North	Bremen	2023	38	disputed
North	Kiel	2023	61

Inconsistent values & typos

A group-by treats every unique string as a separate category. "Netherlands" and "The Netherlands" will produce two separate rows in any aggregate.

✗ Four spellings → four groups in any count

country	contracts
Netherlands	14
The Netherlands	9
NL	3
nederland	2

✓ Standardised → one group, correct total

country	contracts
Netherlands	14
Netherlands	9
Netherlands	3
Netherlands	2

Numbers & dates stored as plain text

A currency symbol or stray space inside a cell turns a number into text. SUM() returns zero; sorting by date gives wrong order.

✗ Amounts and dates stored as text

contract_id	amount	signed
C-001	€ 120,000	15/03/2023
C-002	€45.000	3-9-2023
C-003	80000 EUR	Nov 2023

✓ Clean numbers; ISO 8601 dates

contract_id	amount_eur	signed
C-001	120000	2023-03-15
C-002	45000	2023-09-03
C-003	80000	2023-11-01

Subtotal rows mixed in with data

A grand-total or subtotal row is not an observation — it will skew every aggregate you run and produce phantom records in any join.

✗ Subtotal row mixed into data

region	sales
North	142
South	98
Total	240
East	76
West	113
Grand total	429

✓ Data only — totals computed by your tool

region	sales
North	142
South	98
East	76
West	113

Tidy check

Run through this confirmation list before passing the data to analysis, a visualisation tool, or a colleague.

Every column header is a variable name (not a value, not a blank, not a merged label spanning multiple columns).
Every row is exactly one observation (no subtotals, no grand totals, no blank separator rows, no duplicate records).
Every cell contains a single value for a single variable (no combined strings that encode multiple facts, no lists inside a cell).
All values in a column are the same data type (all numbers, all dates, or all text; no mixed columns).
Categorical values are consistent and standardised. Run a final unique-values check on every categorical column.
You have saved a clean, unmodified copy of the raw data separately, and you have a change log (even a simple one) describing what you altered and why.
A colleague who has never seen this dataset could open it and immediately understand what each column means. If not, add a data dictionary.