Tidy Data
Most data you receive was laid out for human readers, not for machines. This guide explains what tidy data is, shows the most common ways data goes wrong with concrete examples, and ends with a checklist to confirm your dataset is ready for analysis.
Introduction: What is tidy data and why it matters
Tidy data is a specific, consistent way of organising a dataset so that every tool be it a spreadsheet, a script, or a visualisation library can read it without surprises. The concept comes from statistician Hadley Wickham, but the underlying idea is simple enough to apply in Excel.
Three rules of tidy data
-
Each variable forms one column. A variable is a single type of measurement or attribute: population, year, country name, contract value. If two different things are crammed into the same column, or if one thing is spread across several columns, the data is not tidy.
✗ Not tidy — year as column headers
country 2021 2022 2023 Germany 83m 84m 84m France 68m 68m 68m ✓ Tidy — year as a variable
country year population Germany 2021 83m Germany 2022 84m France 2021 68m ✗ Not tidy — one variable split across columns
city male_count female_count Berlin 1,823,000 1,901,000 Hamburg 912,000 954,000 ✓ Tidy — gender as a variable
city gender count Berlin male 1,823,000 Berlin female 1,901,000 Hamburg male 912,000 Hamburg female 954,000 ✗ Not tidy — two variables in one column
contract_id location amount C-001 Amsterdam (NL) €120,000 C-002 Berlin (DE) €45,000 C-003 Paris (FR) €80,000 ✓ Tidy — city and country as separate columns
contract_id city country amount C-001 Amsterdam NL €120,000 C-002 Berlin DE €45,000 C-003 Paris FR €80,000 -
Each observation forms one row. An observation is one complete record: the population of one country in one year, or the value of one procurement contract. If a single observation spans multiple rows, or if one row contains several observations, the data is not tidy.
✗ Not tidy — one party per two rows
party measure value SPD votes 8,254,861 SPD vote_share 25.7% CDU votes 8,775,471 CDU vote_share 27.4% ✓ Tidy — one party per row
party votes vote_share SPD 8,254,861 25.7% CDU 8,775,471 27.4% -
Each type of observational unit forms one table. Don't mix things that belong in separate tables. Contract details and buyer addresses should live in two linked tables, not one flat sheet with repeated address fields on every row.
✗ Not tidy — buyer address repeated on every contract row
contract_id amount buyer buyer_city buyer_country C-001 €120k Min. Finance Berlin DE C-002 €45k Min. Finance Berlin DE C-003 €80k Min. Finance Berlin DE ✓ Tidy — contracts table + buyers table, linked by ID
contract_id amount buyer_id C-001 €120k B-01 C-002 €45k B-01 buyer_id buyer city country B-01 Min. Finance Berlin DE
Most real-world data you receive such as government spreadsheets, scraped tables, exported reports will break at least one of these rules. That is not a moral failing on anyone's part. It reflects the fact that data is often laid out for human readers, not for machines. Tidying it up is a routine part of data journalism.
Other common issues
These problems don't break the three tidy rules, but they will break your analysis just as surely.
Merged cells & colour-coded meaning
Spreadsheets designed for human readers use visual tricks — merged region headers, highlighted rows — that tools cannot read. Unmerge and fill down; replace colour with a text column.
✗ Merged region header + colour-coded row
| North Region | ||
|---|---|---|
| city | year | contracts |
| Hamburg | 2023 | 142 |
| Bremen | 2023 | 38 |
| Kiel | 2023 | 61 |
✓ Region filled down; status as a column
| region | city | year | contracts | status |
|---|---|---|---|---|
| North | Hamburg | 2023 | 142 | |
| North | Bremen | 2023 | 38 | disputed |
| North | Kiel | 2023 | 61 |
Inconsistent values & typos
A group-by treats every unique string as a separate category. "Netherlands" and "The Netherlands" will produce two separate rows in any aggregate.
✗ Four spellings → four groups in any count
| country | contracts |
|---|---|
| Netherlands | 14 |
| The Netherlands | 9 |
| NL | 3 |
| nederland | 2 |
✓ Standardised → one group, correct total
| country | contracts |
|---|---|
| Netherlands | 14 |
| Netherlands | 9 |
| Netherlands | 3 |
| Netherlands | 2 |
Numbers & dates stored as plain text
A currency symbol or stray space inside a cell turns a number into text. SUM() returns zero; sorting by date gives wrong order.
✗ Amounts and dates stored as text
| contract_id | amount | signed |
|---|---|---|
| C-001 | € 120,000 | 15/03/2023 |
| C-002 | €45.000 | 3-9-2023 |
| C-003 | 80000 EUR | Nov 2023 |
✓ Clean numbers; ISO 8601 dates
| contract_id | amount_eur | signed |
|---|---|---|
| C-001 | 120000 | 2023-03-15 |
| C-002 | 45000 | 2023-09-03 |
| C-003 | 80000 | 2023-11-01 |
Subtotal rows mixed in with data
A grand-total or subtotal row is not an observation — it will skew every aggregate you run and produce phantom records in any join.
✗ Subtotal row mixed into data
| region | sales |
|---|---|
| North | 142 |
| South | 98 |
| Total | 240 |
| East | 76 |
| West | 113 |
| Grand total | 429 |
✓ Data only — totals computed by your tool
| region | sales |
|---|---|
| North | 142 |
| South | 98 |
| East | 76 |
| West | 113 |
Tidy check
Run through this confirmation list before passing the data to analysis, a visualisation tool, or a colleague.