Data Science Weekly - Issue 498

Curated news, articles and jobs related to Data Science

Jun 9

Issue #498
June 08 2023

Hello and thank you for tuning in to Issue #498.

Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.

***

Seeing this for the first time? Subscribe here:

***

Want to support us? Become a paid subscriber here.

***

If you don’t find this email useful, please unsubscribe here.

***

And now, let's dive into some interesting links from this week:

Hope you enjoy it!

:)

Editor's Picks

The Data Modeling Divide
Despite the existence of “best practices”, most data practitioners I know still describe their data warehouse as a complete mess. The most innocent question of “why are these two numbers different?” can send an analyst or data engineer down a deep rabbit hole for hours or days. With so much discourse about data modeling, why are we still in such a fragile situation? In this post, I argue that the root cause is the unnatural division of “data modeling” into two separate workflows: transformation and semantic modeling…

What are the brutal truths about working in Data Science? [Reddit Discussion]
What are the brutal truths about working in Data Science (DS)?

WONKY: An exploration of rhythm and grooves that break the rules
The beat that captivated Questlove was made by the legendary James Dewitt Yancey, better known as J Dilla. Although he only lived to 32 and never had a mainstream hit, he is now considered one of the most influential producers in hip hop and popular music. The songs he produced reshaped our understanding of rhythm and time, and their influence persists among many musicians today…

A Message from this week's Sponsor:

LogicLoop AI SQL Copilot

Spending too much time fiddling with SQL?

LogicLoop AI SQL Copilot can auto suggest, generate, fix and optimize SQL queries on your data schema in seconds. Whether you want to pull ad-hoc data, fix a data bug or reduce query runtime costs, do it 10x faster. Also embeddable as an API, so your customers can ask data questions in natural language.

Save bandwidth and budget by trying LogicLoop AI SQL Copilot for free today. Sign up here.

PS: Need inspiration? Check out 100+ SQL templates. Have feedback or a feature request? Chat with us.

Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org

Data Science Articles & Videos

Leadership needs us to do generative AI. What do we do?
The idea for the talk came from many conversations I’ve had recently with friends who need to figure out their generative AI strategy, but aren’t sure what exactly to do…This talk is a simple framework to explore what to do with generative AI. Many ideas are still being fleshed out. I hope to convert this into a proper post when I have more time. In the meantime, I’d love to hear from your experience through this process…

What are embeddings?
Ok, so LLMs are a Thing. How do they work? Embeddings. WTF are embeddings? I spent a year doing a deep dive. But when I was researching, I couldn't find anything that explained them in business, engineering, AND math contexts. So I wrote a thing. 🚀…

Privacy and Security in Data Science and Machine Learning [June 19]
Katharine Jarmul is a Principal Data Scientist at Thoughtworks Germany focusing on privacy, ethics, and security for data science workflows…In this live-streamed recording of Vanishing Gradients, Katharine joins your host, Hugo Bowne-Anderson, to talk about all things privacy and security in data science and machine learning…

What is the Leetcode equivalent for Data Engineering?
I'm actively interviewing, so I need some prep material for data wrangling questions, if there is a single source out there.
I'm looking for a source around questions like:
- Given source data (JSON, CSV), derive insights to answer questions
- Clean up a given dataset to answer questions, etc.
- Python dictionary / JSON API response manipulation
Thank you…

Do LLMs eliminate the need for programming languages?
Given new Large Language Model (LLM) powered developer tools like Copilot and Ghostwriter, many developers are wondering about the future of programming – do programming languages still matter when AI writes the code?…This is a great question! It cuts to the heart of developer workflows, allows us to reflect on the core purpose of programming tools more broadly, and encourages us to share our perspective on where coding technologies are going over the long term. First, let’s explore what a programming language is for across three critical dimensions…

Top AI researcher dismisses AI ‘extinction’ fears, challenges ‘hero scientist’ narrative
In a recent interview, Kyunghyun Cho (who is highly regarded for his foundational work on neural machine translation, which helped lead to the development of the Transformer architecture that ChatGPT is based on) expressed disappointment about the lack of concrete proposals at the recent Senate hearings related to regulating AI’s current harms, as well as a lack of discussion on how to boost beneficial uses of AI…

Daft: the distributed Python dataframe for complex data
Daft is a fast, Pythonic and scalable open-source dataframe library built for Python and Machine Learning workloads…The Daft dataframe is a table of data with rows and columns. Columns can contain any Python objects, which allows Daft to support rich complex data types such as images, audio, video and more…

We Are Writing Python Polars: The Definitive Guide
Polars is a highly performant DataFrame library for manipulating structured data. The core is written in Rust, and the library is officially available in Python, Rust, NodeJS, R, and SQL. Its three key selling points are:
- Record-breaking speed on common DataFrame operations
- Processing of larger than memory datasets
- Explicit, concise, and flexible syntax…

Approximating Shapley Values for Machine Learning
In a previous post, I explained the theory behind Shapley values. I also explained that calculating Shapley values for real world machine learning use cases is typically computationally infeasible, which is why in practice, methods that approximate them are used instead. In this article, we will explore a simple approach for approximating Shapley values. This sets the foundation for discussing the foremost technique for estimating Shapley values: SHAP…
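
For readers who want a feel for what "approximating" means here, below is a minimal sketch of one common approach, Monte Carlo sampling over random feature orderings. It is not taken from the linked article, which may use a different method. The function names, the toy value function, and its weights are all hypothetical and chosen only for illustration.

```python
import random


def approximate_shapley(value_fn, players, num_samples=2000, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled orderings of the players (features)."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_samples):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = value_fn(frozenset(coalition))
        for p in order:
            coalition.add(p)
            current = value_fn(frozenset(coalition))
            # Marginal contribution of p given the players added before it.
            totals[p] += current - previous
            previous = current
    return {p: totals[p] / num_samples for p in players}


def toy_value(coalition):
    """Hypothetical payoff: additive weights plus a bonus when 'a' and 'b'
    appear together (assumed purely for illustration)."""
    weights = {"a": 1.0, "b": 2.0, "c": 3.0}
    total = sum(weights[p] for p in coalition)
    if "a" in coalition and "b" in coalition:
        total += 1.0
    return total


estimates = approximate_shapley(toy_value, ["a", "b", "c"])
print(estimates)
```

In this toy game the exact Shapley values are 1.5, 2.5, and 3.0, because the interaction bonus is split evenly between "a" and "b". The sampled estimates should land close to those numbers, and they always sum to the value of the full coalition.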

Understanding GPT tokenizers
Large language models such as GPT-3/4, LLaMA and PaLM work in terms of tokens. They take text, convert it into tokens (integers), then predict which tokens should come next. Playing around with these tokens is an interesting way to get a better idea of how this stuff actually works under the hood. OpenAI offer a Tokenizer tool for exploring how tokens work. I’ve built my own, slightly more interesting tool as an Observable notebook…
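
To make the "text in, integers out" idea concrete, here is a toy sketch. It is not how GPT tokenizers are built: real ones use byte-pair encoding with vocabularies of tens of thousands of entries, whereas this example assumes a tiny hand-written vocabulary and a simple greedy longest-match rule. The vocabulary and function names are hypothetical.

```python
# Hypothetical miniature vocabulary mapping text pieces to integer token IDs.
TOY_VOCAB = {
    "un": 0,
    "believ": 1,
    "able": 2,
    " token": 3,
    "s": 4,
    " are": 5,
    " fun": 6,
}


def encode(text, vocab):
    """Split text into the longest matching vocabulary pieces, left to right."""
    ids = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            raise ValueError(f"no vocabulary entry covers position {i}")
        ids.append(vocab[match])
        i += len(match)
    return ids


def decode(ids, vocab):
    """Map token IDs back to their text pieces and join them."""
    inverse = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(inverse[token_id] for token_id in ids)


ids = encode("unbelievable tokens are fun", TOY_VOCAB)
print(ids)
print(decode(ids, TOY_VOCAB))
```

Note how a single word can become several tokens and how the leading space is part of a token, two behaviours that tokenizer explorers tend to make visible.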

Examples of Good DS Portfolios? [Reddit Discussion]
Is there a data scientist portfolio that you're really proud of? A friend's portfolio that you envy? A standard which you aspire to have your portfolio come within light years of? I want to see it. I'm looking for some examples of really stellar data science portfolios so that I know what I should be striving towards myself. I need training data…

ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models
We put ChatGPT's sense of humor to the test. In a series of exploratory experiments around jokes, i.e., generation, explanation, and detection, we seek to understand ChatGPT's capability to grasp and reproduce human humor. Since the model itself is not accessible, we applied prompt-based experiments. Our empirical evidence indicates that jokes are not hard-coded but mostly also not newly generated by the model. Over 90% of 1008 generated jokes were the same 25 jokes…

Jobs

(Remote) Data Scientist at Cloudflare

As part of this initiative, we are looking for a strong Data Scientist to join Cloudflare and help us drive predictive analytic insights and best practices at scale from the ground up. This is a high visibility role and success in this role comes from marrying a strong data & modeling background with acute product and business acumen to deliver highly strategic and compelling insights that accelerate our business growth and influence our product decisions within Cloudflare.

What we look for:

Predictive modeling techniques, machine learning, model creation and deployment, storytelling and visualization, strong business & product acumen, cross-functional collaboration, creative problem solving, agile mindset

Apply here

Want to post a job here? Email us for details --> team@datascienceweekly.org

Training & Resources

The {marginaleffects} 📦 book is now online
25 chapters on post-estimation analyses and interpretation with #Rstats. The 📖 is full of tutorials, case studies, tips, and technical notes. Please check it out and let us know how we can improve this resource…

Your guide to AI: June 2023
Welcome to the latest issue of your guide to AI, an editorialized newsletter covering key developments in AI policy, research, industry, and startups during May 2023…

A gentle introduction to DeepScatter: visualizing millions of points in the browser
Benjamin Schmidt, the author of Deepscatter, has an impressive & feature-full example of what Deepscatter can do in this notebook. Here, however, is a far more bare-bones example using the same data…

Last Week's Newsletter's 3 Most Clicked Links

The Next Larger Context

All the Hard Stuff Nobody Talks About when Building Products with LLMs

Best way to defer on a question I don't know in an interview?

* Based on unique clicks (25,659 opens, 46% Substack open rate).
** Find last week's issue #497 here.

Cutting Room Floor

Thanks for joining us this week :)

All our best,
Hannah & Sebastian

P.S.,
If you found this newsletter helpful, consider supporting us by becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe

:)
