Hacker Noon - Can We Make Data Tidy?


Ship the data importer you always dreamed of

 

Can We Make Data Tidy?

 
Imagine sitting down with a freshly fetched dataset, excited about the insights it will bring, only to find that it is unusable. If you’ve been there, you know exactly what an untidy dataset is.
A statistician from New Zealand once said: tidy datasets are all alike, but every messy dataset is messy in its own way. Data comes in many shapes and forms, and sometimes we are inundated with it. As a result, data science teams lose sight of the goal and become disillusioned by mountains of unworkable data. Keeping data clean and organized is what makes analysis feasible.

What is tidy data?

 
Tidy data is a term coined by Hadley Wickham in his Tidy Data paper (remember that statistician from NZ?). He defined tidy data as data that is neatly organized and ready for analysis. Organizing data this way makes it easy to produce charts, diagrams, and summary statistics. Not all data comes out of the database clean, so cleaning it is essential for efficient analysis.
 
Without further ado, let us break down the principles that allow you to keep your data nice and clean.
 

Tidy Data Principles

 

1. Each row is an observational unit.

 

We’ll start with one of the basic principles. When you give your data the once-over, make sure each row contains exactly one observation.
An observation is the individual unit under study. If we look at the table above, the observational unit could be called ‘people’. Each person has their own row in the table, and all of the information about that person is in that row. Observations go in rows, variables go in columns, and there is only one type of observational unit per table. Now THIS is tidy data.
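As a minimal sketch of this principle, the check below treats a table as a list of rows and verifies two things: every row carries the same variables, and no person appears in more than one row. The records and the helper names (`is_rectangular`, `one_row_per_unit`) are hypothetical and chosen for illustration; they are not taken from the article's table.

```python
# Hypothetical records: one row per person, one key per variable.
people = [
    {"name": "Ann", "age": 34, "hair_color": "brown", "height": 165},
    {"name": "Ben", "age": 28, "hair_color": "black", "height": 180},
    {"name": "Cara", "age": 41, "hair_color": "red", "height": 172},
]


def is_rectangular(rows):
    """Return True if every row has exactly the same set of variables."""
    if not rows:
        return True
    expected = set(rows[0])
    return all(set(row) == expected for row in rows)


def one_row_per_unit(rows, key):
    """Return True if no observational unit (identified by `key`) repeats."""
    ids = [row[key] for row in rows]
    return len(ids) == len(set(ids))


looks_tidy = is_rectangular(people) and one_row_per_unit(people, "name")
print(looks_tidy)
```

Using `name` as the identifier is an assumption made for this sketch; a real dataset would normally carry a dedicated ID column.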
 

2. Each column is a variable.

 
A variable is an attribute you measure for each observation. In our table above, age, hair_color and height are variables. In tidy data, each variable gets its own column.
 
Okay, now a one-second quiz: What is wrong with this dataset?
 
Yep, you guessed right. Never put multiple variables in one column: you would have to split the column before you could filter, sort, or summarize by any one of them.
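To make the fix concrete, here is a rough sketch of splitting a combined column into one column per variable. The quiz table itself is not reproduced here, so the `age_hair` column, its "age, color" format, and the sample values are all assumptions made for illustration.

```python
# Hypothetical messy rows: age and hair color share a single column.
messy = [
    {"name": "Ann", "age_hair": "34, brown"},
    {"name": "Ben", "age_hair": "28, black"},
]


def split_age_hair(row):
    """Split the combined 'age_hair' value into separate age and hair_color."""
    age_text, hair_color = (part.strip() for part in row["age_hair"].split(",", 1))
    return {"name": row["name"], "age": int(age_text), "hair_color": hair_color}


tidy = [split_age_hair(row) for row in messy]
print(tidy)
```

Once the variables live in their own columns, each one can be sorted or summarized on its own.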
 

3. Each cell is a value.

If you have a handle on the first two principles, this one should be a no-brainer, but we’ll lay it out anyway. Each cell should contain only one value. It is also important that all values in the same column are formatted the same way.
In this dataset, we have a table with four variables and three observations. Each cell contains one piece of information, and the values in each column share a format: all of the age values are digits, all of the hair color values are whole words, and so on. Therefore, this dataset is tidy and almost fit for analysis.
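When values in a column do not share a format, a small normalization pass can bring them in line. The sketch below assumes hypothetical raw values in which ages arrive as mixed text and numbers and hair colors vary in case and spacing; the helper names are illustrative only.

```python
import re


def normalize_age(value):
    """Pull the first run of digits out of the value and return it as an int."""
    match = re.search(r"\d+", str(value))
    if match is None:
        raise ValueError(f"no digits in age value: {value!r}")
    return int(match.group())


def normalize_hair_color(value):
    """Trim surrounding whitespace and lowercase the color name."""
    return value.strip().lower()


# Hypothetical raw rows with inconsistent formatting.
raw = [
    {"age": "34", "hair_color": "Brown"},
    {"age": "28 years", "hair_color": " black "},
    {"age": 41, "hair_color": "RED"},
]

clean = [
    {
        "age": normalize_age(row["age"]),
        "hair_color": normalize_hair_color(row["hair_color"]),
    }
    for row in raw
]
print(clean)
```

Raising an error on an unparseable age is a deliberate choice in this sketch: silently guessing a value would hide the very problem the cleanup is meant to expose.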
 

4. Each column has a unique name.

 
In an ideal dataset, columns have specific, descriptive names. Here is an example of this principle.
The third column is labeled hair_color. This is a more specific heading than plain ‘hair’, which could refer to anything from hair length to hair style. This level of specificity helps speed up the analysis process.
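A quick way to enforce this principle is to check a header row for duplicates and rename vague columns before analysis. The following sketch assumes a hypothetical header list and rename mapping; only the `hair` to `hair_color` rename comes from the example above.

```python
from collections import Counter


def duplicate_names(columns):
    """Return the column names that appear more than once, sorted."""
    counts = Counter(columns)
    return sorted(name for name, count in counts.items() if count > 1)


def rename_columns(columns, mapping):
    """Replace vague names using `mapping`; leave all other names unchanged."""
    return [mapping.get(name, name) for name in columns]


# Hypothetical header row with one vague name.
columns = ["name", "age", "hair", "height"]
renamed = rename_columns(columns, {"hair": "hair_color"})

print(renamed)
print(duplicate_names(renamed))
```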
 

The Final Word

Tidy data is an essential part of realizing the full potential of your data. Once your data is tidy, it can be used as input to a wide range of other functions.
 
While we are still on this topic, we’d like to say a big thank-you to our sponsor. Flatfile Portal is the elegant import button for web apps that integrates in minutes, and makes sure your spreadsheets are clean and ready to use.


Copyright © 2020 Hacker Noon. All rights reserved.
