Hello and thank you for tuning in to Issue #498.
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now, let's dive into some interesting links from this week:
Hope you enjoy it!
:)
The Data Modeling Divide
Despite the existence of “best practices”, most data practitioners I know still describe their data warehouse as a complete mess. The most innocent question of “why are these two numbers different?” can send an analyst or data engineer down a deep rabbit hole for hours or days. With so much discourse about data modeling, why are we still in such a fragile situation? In this post, I argue that the root cause is the unnatural division of “data modeling” into two separate workflows: transformation and semantic modeling…
WONKY: An exploration of rhythm and grooves that break the rules
The beat that captivated Questlove was made by the legendary James Dewitt Yancey, better known as J Dilla. Although he only lived to 32 and never had a mainstream hit, he is now considered one of the most influential producers in hip hop and popular music. The songs he produced reshaped our understanding of rhythm and time, and their influence persists among many musicians today…
Spending too much time fiddling with SQL?
LogicLoop AI SQL Copilot can auto suggest, generate, fix and optimize SQL queries on your data schema in seconds. Whether you want to pull ad-hoc data, fix a data bug or reduce query runtime costs, do it 10x faster. Also embeddable as an API, so your customers can ask data questions in natural language.
Save bandwidth and budget by trying LogicLoop AI SQL Copilot for free today. Sign up here.
PS: Need inspiration? Check out 100+ SQL templates. Have feedback or a feature request? Chat with us.
Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Leadership needs us to do generative AI. What do we do?
The idea for the talk came from many conversations I’ve had recently with friends who need to figure out their generative AI strategy, but aren’t sure what exactly to do…This talk is a simple framework to explore what to do with generative AI. Many ideas are still being fleshed out. I hope to convert this into a proper post when I have more time. In the meantime, I’d love to hear from your experience through this process…
What are embeddings?
Ok, so LLMs are a Thing. How do they work? Embeddings. WTF are embeddings? I spent a year doing a deep dive. But when I was researching, I couldn't find anything that explained them in business, engineering, AND math contexts. So I wrote a thing. 🚀…
Privacy and Security in Data Science and Machine Learning [June 19]
Katharine Jarmul is a Principal Data Scientist at Thoughtworks Germany focusing on privacy, ethics, and security for data science workflows…In this live-streamed recording of Vanishing Gradients, Katharine joins your host Hugo Bowne-Anderson to talk about all things privacy and security in data science and machine learning…
What is the Leetcode equivalent for Data Engineering?
Actively interviewing so I need some prep material for Data wrangling questions if there is a single source out there.
I'm looking for a source around questions like:
- Given source data (JSON, CSV), derive insights to answer questions
- Clean up a given dataset to answer questions, etc.
- Python dictionary / JSON API response manipulation.
Thank you…
Do LLMs eliminate the need for programming languages?
Given new Large Language Model (LLM) powered developer tools like Copilot and Ghostwriter, many developers are wondering about the future of programming – do programming languages still matter when AI writes the code?…This is a great question! It cuts to the heart of developer workflows, allows us to reflect on the core purpose of programming tools more broadly, and encourages us to share our perspective on where coding technologies are going over the long term. First, let’s explore what a programming language is for across three critical dimensions…
Top AI researcher dismisses AI ‘extinction’ fears, challenges ‘hero scientist’ narrative
In a recent interview Kyunghyun Cho — who is highly regarded for his foundational work on neural machine translation, which helped lead to the development of the Transformer architecture that ChatGPT is based on — expressed disappointment about the lack of concrete proposals at the recent Senate hearings related to regulating AI’s current harms, as well as a lack of discussion on how to boost beneficial uses of AI…
Daft: the distributed Python dataframe for complex data
Daft is a fast, Pythonic and scalable open-source dataframe library built for Python and Machine Learning workloads…The Daft dataframe is a table of data with rows and columns. Columns can contain any Python objects, which allows Daft to support rich complex data types such as images, audio, video and more…
We Are Writing Python Polars: The Definitive Guide
Polars is a highly performant DataFrame library for manipulating structured data. The core is written in Rust, and the library is officially available in Python, Rust, NodeJS, R, and SQL. Its three key selling points are:
- Record-breaking speed on common DataFrame operations
- Processing of larger-than-memory datasets
- Explicit, concise, and flexible syntax…
Approximating Shapley Values for Machine Learning
In a previous post, I explained the theory behind Shapley values. I also explained that calculating Shapley values for real world machine learning use cases is typically computationally infeasible, which is why in practice, methods that approximate them are used instead. In this article, we will explore a simple approach for approximating Shapley values. This sets the foundation for discussing the foremost technique for estimating Shapley values: SHAP…
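To make the idea of approximation concrete, here is a minimal sketch of one common simple approach: averaging each feature's marginal contribution over randomly sampled orderings. This is not necessarily the method the linked article uses. The function `approx_shapley`, the toy payout `toy_value`, and the feature names and weights are all hypothetical, chosen only for illustration.

```python
import random


def approx_shapley(value_fn, players, n_samples=1000, seed=0):
    """Estimate Shapley values by sampling random orderings of the players.

    value_fn takes a frozenset of players and returns that coalition's payout.
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in order:
            # Marginal contribution of p when it joins the players before it.
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            totals[p] += cur - prev
            prev = cur
    return {p: totals[p] / n_samples for p in players}


# Hypothetical toy "model": additive weights plus one interaction bonus.
WEIGHTS = {"age": 2.0, "income": 3.0, "tenure": 1.0}


def toy_value(coalition):
    payout = sum(WEIGHTS[p] for p in coalition)
    if "age" in coalition and "income" in coalition:
        payout += 1.0  # bonus shared between the two interacting features
    return payout


estimates = approx_shapley(toy_value, list(WEIGHTS), n_samples=2000)
print(estimates)
```

In this toy game the exact values are 2.5 for `age`, 3.5 for `income`, and 1.0 for `tenure`, so the estimates should land close to those. With real models the payout function is a model prediction, and each evaluation is far more expensive, which is why sampling is used instead of enumerating every ordering.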
Understanding GPT tokenizers
Large language models such as GPT-3/4, LLaMA and PaLM work in terms of tokens. They take text, convert it into tokens (integers), then predict which tokens should come next. Playing around with these tokens is an interesting way to get a better idea for how this stuff actually works under the hood. OpenAI offer a Tokenizer tool for exploring how tokens work. I’ve built my own, slightly more interesting tool as an Observable notebook…
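As a rough illustration of the text-to-integers step, here is a toy word-level tokenizer. It is a sketch only: real GPT tokenizers use byte pair encoding over subword units with a fixed, pre-trained vocabulary, and the helper names `build_vocab`, `encode`, and `decode` are hypothetical.

```python
def build_vocab(corpus):
    """Assign each distinct whitespace-separated word an integer ID, in first-seen order."""
    vocab = {}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab, unk_id=-1):
    """Map each word to its ID; unknown words get unk_id."""
    return [vocab.get(word, unk_id) for word in text.split()]


def decode(ids, vocab):
    """Map IDs back to words; unknown IDs become '<unk>'."""
    inverse = {i: word for word, i in vocab.items()}
    return " ".join(inverse.get(i, "<unk>") for i in ids)


vocab = build_vocab("the cat sat on the mat")
ids = encode("the mat sat", vocab)
print(ids)                 # [0, 4, 2]
print(decode(ids, vocab))  # the mat sat
```

A language model never sees the words themselves, only sequences of IDs like these, and its output is a prediction over which ID comes next.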
ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models
We put ChatGPT's sense of humor to the test. In a series of exploratory experiments around jokes, i.e., generation, explanation, and detection, we seek to understand ChatGPT's capability to grasp and reproduce human humor. Since the model itself is not accessible, we applied prompt-based experiments. Our empirical evidence indicates that jokes are not hard-coded but mostly also not newly generated by the model. Over 90% of 1008 generated jokes were the same 25 jokes…
As part of this initiative, we are looking for a strong Data Scientist to join Cloudflare and help us drive predictive analytic insights and best practices at scale from the ground up. This is a high visibility role and success in this role comes from marrying a strong data & modeling background with acute product and business acumen to deliver highly strategic and compelling insights that accelerate our business growth and influence our product decisions within Cloudflare.
What we look for:
Predictive modeling techniques, machine learning, model creation and deployment, storytelling and visualization, strong business & product acumen, cross-functional collaboration, creative problem solving, agile mindset
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
The {marginaleffects} 📦 book is now online
25 chapters on post-estimation analyses and interpretation with #Rstats. The 📖 is full of tutorials, case studies, tips, and technical notes. Please check it out and let us know how we can improve this resource…
Your guide to AI: June 2023
Welcome to the latest issue of your guide to AI, an editorialized newsletter covering key developments in AI policy, research, industry, and startups during May 2023…
A gentle introduction to DeepScatter: visualizing millions of points in the browser
Benjamin Schmidt, the author of Deepscatter, has an impressive & feature-full example of what deepscatter can do in this notebook. Here, however, is a far more bare-bones example using the same data…
* Based on unique clicks (25,659 opens, 46% Substack open rate).
** Find last week's issue #497 here.
Thanks for joining us this week :)
All our best,
Hannah & Sebastian
P.S.,
If you found this newsletter helpful, consider supporting us by becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe
:)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.