Hacker Noon - Can We Make Data Tidy?


Ship the data importer you always dreamed of

 

Can We Make Data Tidy?

 
Imagine you sit down with a freshly fetched dataset, excited about the insights it will bring, only to find out it is unusable. If you’ve been there, you know exactly what an untidy dataset is.
A statistician from New Zealand once said: “Tidy datasets are all alike, but every messy dataset is messy in its own way.” Indeed, data comes in so many shapes and forms that it can easily overwhelm us. A data science team buried under mountains of unworkable data quickly loses sight of the bigger picture and, worse, becomes disillusioned. Keeping data clean and organized is what makes analysis possible.

What is tidy data?

 
Tidy data is a term coined by Hadley Wickham in his paper “Tidy Data” (remember that statistician from NZ?). He defined tidy data as data with a standard structure that is ready for analysis: each variable is a column, each observation is a row, and each type of observational unit is a table. Organizing data this way makes it easy to produce charts, diagrams, and summary statistics. Data rarely comes out of a database clean, so cleaning it first is essential for efficient analysis.
 
Without further ado, let us break down the principles that help you keep your data nice and clean.
 

Tidy Data Principles

 

1. Each row is an observation.

 

We’ll start with one of the basic principles. When you give your data the once-over, make sure each row contains exactly one observation.
By definition, an observation is an individual unit under study. In the example table, the observational unit is ‘people’: each person has their own row, and all of the information about that person sits in that row. Observations go in rows, variables go in columns, and there is only one type of observational unit per table. Now THIS is tidy data.
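The example table itself is not reproduced here, so the following is only a minimal sketch of the idea, assuming a small table of people. The names and values are invented for illustration, and `is_rectangular` is a hypothetical helper, not part of any library.

```python
# Hypothetical data: one dictionary per person (one row per observation).
# Heights are assumed to be in centimeters.
people = [
    {"name": "Alice", "age": 34, "hair_color": "brown", "height": 170},
    {"name": "Bob", "age": 29, "hair_color": "black", "height": 182},
    {"name": "Carla", "age": 41, "hair_color": "red", "height": 165},
]


def is_rectangular(rows):
    """Return True if every row records the same set of variables."""
    if not rows:
        return True
    expected = set(rows[0])
    return all(set(row) == expected for row in rows)


# Everything known about a person lives in that person's row.
for person in people:
    print(person)
```

A check like `is_rectangular(people)` is a cheap first test: if some rows carry different keys than others, the table is probably mixing observational units or missing values.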
 

2. Each column is a variable.

 
A variable is an attribute you measure for each observation. In our people table, age, hair_color and height are all variables. In tidy data, each variable gets its own column.
 
Okay, now a one-second quiz: What is wrong with this dataset?
 
Yep, you guessed it. Never put multiple variables in one column; otherwise your data analysis is doomed.
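The quiz dataset is not reproduced here, so here is an invented example of the same problem and one possible fix. It assumes a column named `age_height` that packs two values into one string separated by a slash; both the column name and the separator are hypothetical.

```python
# Hypothetical untidy rows: "age_height" holds two variables in one column.
untidy = [
    {"name": "Alice", "age_height": "34/170"},
    {"name": "Bob", "age_height": "29/182"},
]


def split_age_height(row, sep="/"):
    """Return a copy of the row with age and height in separate columns."""
    age, height = row["age_height"].split(sep)
    tidy_row = {key: value for key, value in row.items() if key != "age_height"}
    tidy_row["age"] = int(age)
    tidy_row["height"] = int(height)
    return tidy_row


tidy = [split_age_height(row) for row in untidy]
```

After the split, each variable can be sorted, filtered, or averaged on its own, which is not possible while the two values share a single string.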
 

3. Each cell is a value.

If you have a handle on the first two principles, this one should be a no-brainer. Still, we’ll make the extra effort to lay it all out. Each cell should contain only one value. It is also important that all values in the same column are formatted the same way.
In this dataset, we have a table with four variables and three observations. Each cell contains one piece of information, and the values in each column match in format: all of the age values are digits, all of the hair color values are whole words, and so on. Therefore, this dataset is tidy and almost fit for analysis.
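One way to check that values in a column are formatted consistently is to look at the type of every cell. The sketch below is an illustration with invented data; `inconsistent_columns` is a hypothetical helper, and comparing Python types is only a rough proxy for consistent formatting.

```python
def inconsistent_columns(rows):
    """Map each column that mixes value types to the set of type names found."""
    types = {}
    for row in rows:
        for column, value in row.items():
            types.setdefault(column, set()).add(type(value).__name__)
    return {column: found for column, found in types.items() if len(found) > 1}


# Hypothetical rows: one age is a number, the other is spelled out.
rows = [
    {"name": "Alice", "age": 34, "hair_color": "brown"},
    {"name": "Bob", "age": "twenty-nine", "hair_color": "black"},
]

problems = inconsistent_columns(rows)
```

Here `problems` flags only the `age` column, because it mixes an integer with a string while the other columns are uniform.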
 

4. Each column has a unique name.

 
In an ideal dataset, columns have specific, descriptive names. Let us show you an example of this principle.
The third column is labeled hair_color. This is a more specific heading than if we simply called it hair. The word ‘hair’ could refer to anything from hair length to hair style. This level of specificity will help you speed up the analysis process.
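Uniqueness is easy to verify mechanically; descriptiveness still needs a human eye. A minimal sketch, assuming the header is available as a list of strings (`duplicate_names` is a hypothetical helper, and treating names that differ only in case or surrounding spaces as duplicates is an assumption of this example):

```python
from collections import Counter


def duplicate_names(header):
    """Return column names that occur more than once.

    Names are compared after stripping spaces and lowercasing.
    """
    counts = Counter(name.strip().lower() for name in header)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical header with a vague, repeated column name.
header = ["name", "age", "hair", "Hair ", "height"]
repeated = duplicate_names(header)
```

With this header, `repeated` contains `hair`, which is a hint to rename the columns to something specific such as hair_color.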
 

The Final Word

Tidy data is essential to realizing the full potential of your data. Once your data is tidy, it can be fed directly into a wide range of analysis functions.
 
While we are still on this topic, we’d like to say a big thank-you to our sponsor. Flatfile Portal is the elegant import button for web apps that integrates in minutes, and makes sure your spreadsheets are clean and ready to use.


Copyright © 2020 Hacker Noon. All rights reserved.
