Hacker Noon - Can We Make Data Tidy?


Ship the data importer you always dreamed of

 

Can We Make Data Tidy?

 
Imagine: you sit down with a newly fetched dataset, excited about the insights it will bring, and then you find out it is unusable. If you’ve been there, you know exactly what an untidy dataset is.
A statistician from New Zealand once said: “Tidy datasets are all alike, but every messy dataset is messy in its own way.” Indeed, data comes in many forms and shapes, and sometimes we are inundated with it. As a result, a data science team can lose sight of the bigger picture and become disillusioned by mountains of unworkable data. Keeping data clean and organized is how data specialists make analysis possible.

What is tidy data?

 
Tidy data is a term coined by Hadley Wickham in his Tidy Data paper (remember that statistician from NZ?). He defined tidy data as data that is neatly organized and ready for analysis. Organizing data this way makes it easy to produce charts, diagrams, and summary statistics. Not all data comes out of the database clean, so cleaning it is essential for efficient analysis.
 
Without further ado, let us break down the principles that allow you to keep your data nice and clean.
 

Tidy Data Principles

 

1. Each row is an observation.

 

We’ll start with one of the basic principles. When you give your data the once-over, make sure each row contains exactly one observation.
By definition, an observation is the individual unit under study. In the table above, the observational unit could be called ‘people’: each person has their own row, and all of the information for that person sits in that row. Observations go in rows, variables go in columns, and there is only one type of observational unit per table. Now THIS is tidy data.
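As a rough sketch of this principle in plain Python, each observation can be held as one dictionary and each variable as one key. The names and values below are invented for illustration (the unit of height is assumed to be centimeters), and has_consistent_variables is a hypothetical helper, not part of any library.

```python
# Hypothetical data: one dict per person (observation), one key per variable.
people = [
    {"name": "Ana", "age": 34, "hair_color": "brown", "height": 168},
    {"name": "Ben", "age": 27, "hair_color": "black", "height": 181},
    {"name": "Cleo", "age": 45, "hair_color": "red", "height": 159},
]


def has_consistent_variables(rows):
    """Return True if every row records exactly the same set of variables."""
    if not rows:
        return True
    expected = set(rows[0])
    return all(set(row) == expected for row in rows)


print(has_consistent_variables(people))  # True
```

A row that is missing a variable, or carries an extra one, makes the check return False, which is usually a sign that the row does not describe the same kind of observation as the others.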
 

2. Each column is a variable.

 
A variable is an attribute you measure for each observation. Again, if we turn to our table above, age, hair_color and height are all variables. In tidy data, each variable is represented in a separate column.
 
Okay, now a one-second quiz: What is wrong with this dataset?
 
Yep, you guessed right. Never put multiple variables in one column; otherwise, your data analysis is doomed.
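To make the fix concrete, here is a minimal sketch that splits one overloaded column into separate variables. The column name age_hair, the "/" separator, and the split_column helper are all assumptions made up for this example.

```python
def split_column(rows, column, new_names, sep="/"):
    """Return new rows with `column` split into one column per name in `new_names`."""
    tidy = []
    for row in rows:
        parts = row[column].split(sep)
        if len(parts) != len(new_names):
            raise ValueError(f"expected {len(new_names)} parts in {row[column]!r}")
        # Copy every other column, then add the newly separated variables.
        new_row = {key: value for key, value in row.items() if key != column}
        new_row.update(zip(new_names, (part.strip() for part in parts)))
        tidy.append(new_row)
    return tidy


# Hypothetical messy data: age and hair color crammed into a single column.
messy = [
    {"name": "Ana", "age_hair": "34/brown"},
    {"name": "Ben", "age_hair": "27/black"},
]
tidy = split_column(messy, "age_hair", ["age", "hair_color"])
print(tidy[0])  # {'name': 'Ana', 'age': '34', 'hair_color': 'brown'}
```

Note that the split values are still strings; converting age to a number would be a separate cleaning step.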
 

3. Each cell is a value.

If you have grasped the first two principles, this one should already be a no-brainer. Anyway, we’ll make an extra effort to lay it all out. Each cell should contain only one value. It is also important that all values in the same column are formatted the same way.
In this dataset, you can see that we have a table with four variables and three observations. Each cell contains one piece of information, and our values all match. All of our age values are digits, hair color values are whole words, and so on. You get the idea. Therefore, this dataset is tidy and almost fit for analysis.
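One simple way to spot inconsistent formatting is to look at which Python types appear in each column. The sketch below assumes rows are stored as dictionaries; mixed_format_columns is a hypothetical helper and the data is invented.

```python
def mixed_format_columns(rows):
    """Map each column that holds more than one value type to its sorted type names."""
    seen = {}
    for row in rows:
        for column, value in row.items():
            seen.setdefault(column, set()).add(type(value).__name__)
    return {column: sorted(types) for column, types in seen.items() if len(types) > 1}


# Hypothetical data: one age is a number, the other is text.
rows = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": "27"},
]
print(mixed_format_columns(rows))  # {'age': ['int', 'str']}
```

This only catches type mismatches. Values of the same type can still be formatted differently (for example "Brown" and "brown"), so a real cleanup would likely need column-specific checks as well.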
 

4. Each column has a unique name.

 
In an ideal dataset, columns should have specific and descriptive names. Let us show you an example of this principle.
The third column is labeled hair_color. This is a more specific heading than if we were simply to call it hair. The word ‘hair’ can refer to anything from hair length to hair style. This level of specificity will help you speed up the analysis process.
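Uniqueness is easy to check mechanically. The following sketch flags column names that collide once case and surrounding whitespace are ignored; the header list and the duplicate_names helper are assumptions for illustration only.

```python
from collections import Counter


def duplicate_names(header):
    """Return the sorted column names that occur more than once after normalizing."""
    normalized = [name.strip().lower() for name in header]
    counts = Counter(normalized)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical header with two columns that both boil down to "hair".
print(duplicate_names(["name", "age", "hair", "Hair ", "height"]))  # ['hair']
```

Renaming the colliding columns to something specific, such as hair_color and hair_length, removes the duplicate and makes each heading descriptive at the same time.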
 

The Final Word

Tidy data is essential to realizing the full potential of your data. Once your data is tidy, it can be used as input to a wide range of other functions.
 
While we are still on this topic, we’d like to say a big thank-you to our sponsor. Flatfile Portal is the elegant import button for web apps that integrates in minutes, and makes sure your spreadsheets are clean and ready to use.


Copyright © 2020 Hacker Noon. All rights reserved.

