Data Frosch logo

Data Frosch

Guides

← All Guides

Checklist

Tidy Data

Most data you receive was laid out for human readers, not for machines. This guide explains what tidy data is, shows the most common ways data goes wrong with concrete examples, and ends with a checklist to confirm your dataset is ready for analysis.

Introduction: What is tidy data and why it matters

Tidy data is a specific, consistent way of organising a dataset so that every tool be it a spreadsheet, a script, or a visualisation library can read it without surprises. The concept comes from statistician Hadley Wickham, but the underlying idea is simple enough to apply in Excel.

Three rules of tidy data

  1. Each variable forms one column. A variable is a single type of measurement or attribute: population, year, country name, contract value. If two different things are crammed into the same column, or if one thing is spread across several columns, the data is not tidy.

    ✗ Not tidy — year as column headers

    country202120222023
    Germany83m84m84m
    France68m68m68m

    ✓ Tidy — year as a variable

    countryyearpopulation
    Germany202183m
    Germany202284m
    France202168m

    ✗ Not tidy — one variable split across columns

    citymale_countfemale_count
    Berlin1,823,0001,901,000
    Hamburg912,000954,000

    ✓ Tidy — gender as a variable

    citygendercount
    Berlinmale1,823,000
    Berlinfemale1,901,000
    Hamburgmale912,000
    Hamburgfemale954,000

    ✗ Not tidy — two variables in one column

    contract_idlocationamount
    C-001Amsterdam (NL)€120,000
    C-002Berlin (DE)€45,000
    C-003Paris (FR)€80,000

    ✓ Tidy — city and country as separate columns

    contract_idcitycountryamount
    C-001AmsterdamNL€120,000
    C-002BerlinDE€45,000
    C-003ParisFR€80,000
  2. Each observation forms one row. An observation is one complete record: the population of one country in one year, or the value of one procurement contract. If a single observation spans multiple rows, or if one row contains several observations, the data is not tidy.

    ✗ Not tidy — one party per two rows

    partymeasurevalue
    SPDvotes8,254,861
    SPDvote_share25.7%
    CDUvotes8,775,471
    CDUvote_share27.4%

    ✓ Tidy — one party per row

    partyvotesvote_share
    SPD8,254,86125.7%
    CDU8,775,47127.4%
  3. Each type of observational unit forms one table. Don't mix things that belong in separate tables. Contract details and buyer addresses should live in two linked tables, not one flat sheet with repeated address fields on every row.

    ✗ Not tidy — buyer address repeated on every contract row

    contract_idamountbuyerbuyer_citybuyer_country
    C-001€120kMin. FinanceBerlinDE
    C-002€45kMin. FinanceBerlinDE
    C-003€80kMin. FinanceBerlinDE

    ✓ Tidy — contracts table + buyers table, linked by ID

    contract_idamountbuyer_id
    C-001€120kB-01
    C-002€45kB-01
    buyer_idbuyercitycountry
    B-01Min. FinanceBerlinDE

Most real-world data you receive such as government spreadsheets, scraped tables, exported reports will break at least one of these rules. That is not a moral failing on anyone's part. It reflects the fact that data is often laid out for human readers, not for machines. Tidying it up is a routine part of data journalism.

Other common issues

These problems don't break the three tidy rules, but they will break your analysis just as surely.

Merged cells & colour-coded meaning

Spreadsheets designed for human readers use visual tricks — merged region headers, highlighted rows — that tools cannot read. Unmerge and fill down; replace colour with a text column.

✗ Merged region header + colour-coded row

North Region
cityyearcontracts
Hamburg2023142
Bremen202338
Kiel202361

✓ Region filled down; status as a column

regioncityyearcontractsstatus
NorthHamburg2023142
NorthBremen202338disputed
NorthKiel202361

Inconsistent values & typos

A group-by treats every unique string as a separate category. "Netherlands" and "The Netherlands" will produce two separate rows in any aggregate.

✗ Four spellings → four groups in any count

countrycontracts
Netherlands14
The Netherlands9
NL3
nederland2

✓ Standardised → one group, correct total

countrycontracts
Netherlands14
Netherlands9
Netherlands3
Netherlands2

Numbers & dates stored as plain text

A currency symbol or stray space inside a cell turns a number into text. SUM() returns zero; sorting by date gives wrong order.

✗ Amounts and dates stored as text

contract_idamountsigned
C-001€ 120,00015/03/2023
C-002€45.0003-9-2023
C-00380000 EURNov 2023

✓ Clean numbers; ISO 8601 dates

contract_idamount_eursigned
C-0011200002023-03-15
C-002450002023-09-03
C-003800002023-11-01

Subtotal rows mixed in with data

A grand-total or subtotal row is not an observation — it will skew every aggregate you run and produce phantom records in any join.

✗ Subtotal row mixed into data

regionsales
North142
South98
Total240
East76
West113
Grand total429

✓ Data only — totals computed by your tool

regionsales
North142
South98
East76
West113

Tidy check

Run through this confirmation list before passing the data to analysis, a visualisation tool, or a colleague.