Hello and thank you for tuning in to Issue #508!
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
Seeing this for the first time? Subscribe here:
Want to support us? Become a paid subscriber here.
If you don’t find this email useful, please unsubscribe here.
And now, let's dive into some interesting links from this week :)
Tips for Writing NLP Papers
Over the years I’ve developed a certain standard for writing papers (and doing research in general) that I share verbally with my students. I recently realized I’m repeating myself — and worse than that, editing the same things over and over again in paper drafts. So I decided to document my paper writing tips. This blog post is first and foremost intended for my students, although others might find it useful too. Some of the tips here are specific to NLP papers, although many of them are general and might be useful for other fields as well…
Analysis of the data job market using "Ask HN: Who is hiring?" posts
I have worked in “big data”, “data science” or something adjacent for around 12 years, and in that time I have observed these fields (and their associated roles) change a lot…so I parsed HackerNews (HN) “Ask HN: Who is hiring?” posts from 2013 to the time of writing and analyzed them to better understand the trends in the data job market, with a focus on the fate of data science. Here are my main conclusions…
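If you want to poke at the same data yourself, here is a minimal sketch of one way to pull a monthly thread, assuming the public Algolia Hacker News Search API; the thread id and the keyword filter below are placeholders, not the author's actual pipeline:

    # Sketch: fetch one "Ask HN: Who is hiring?" thread via the public
    # Algolia HN Search API and count top-level comments mentioning "data".
    import requests

    THREAD_ID = 36152014  # placeholder: id of one monthly "Who is hiring?" story

    item = requests.get(
        f"https://hn.algolia.com/api/v1/items/{THREAD_ID}", timeout=30
    ).json()

    top_level = [c for c in item.get("children", []) if c.get("text")]
    data_jobs = [c for c in top_level if "data" in c["text"].lower()]
    print(f"{len(data_jobs)} of {len(top_level)} job comments mention 'data'")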
The AI revolution has already begun to rewire Wall St. Using it, a Harvard data scientist and his crack team have allowed everyday people to now benefit from a previously “off-limits” investment.
The company that makes it possible is called Masterworks, whose unique investment platform enables savvy investors to invest in blue-chip art for a fraction of the cost. Their proprietary database of art market returns provides an unrivaled quantitative edge in analyzing the art market.
So far, it's been right on the money. With all 15 of its exits, Masterworks has achieved a profit, delivering +17.8%, +21.5%, and +35.0% annualized net returns.
Intrigued? Data Science Weekly readers can skip the waitlist with this referral link.
See important disclosures at masterworks.com/cd
Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
AI Hardware, Explained - a16z Podcast
In this episode – the first in our three-part series – we explore the terminology and technology that is now the backbone of the AI models taking the world by storm. We’ll explore what GPUs are, how they work, the key players like Nvidia competing for chip dominance, and also… whether Moore’s Law is dead? Look out for the rest of our series, where we dive even deeper; covering supply and demand mechanics, including why we can’t just “print” our way out of a shortage, how founders get access to inventory, whether they should own or rent, where open source plays a role, and of course… how much all of this truly costs!…
Open challenges in LLM research
Never before in my life had I seen so many smart people working on the same goal: making LLMs better. After talking to many people working in both industry and academia, I noticed 10 major research directions emerging. The first two directions, hallucinations and context learning, are probably the most talked about today. I’m most excited about numbers 3 (multimodality), 5 (new architecture), and 6 (GPU alternatives)…
Failed an interviewee because they wouldn't shut up about LLMs at the end of the interview [Reddit Discussion]
Last week I was interviewing a candidate who was very borderline. Then, as I was trying to end the interview and let the candidate ask questions about our company, they insisted on talking about how they could use LLMs to help with the regression problem we were discussing. It made no sense. This is essentially what tipped them from a soft thumbs up to a soft thumbs down… EDIT: This was for a senior role. They had more work experience than me…
An illusion of predictability in scientific results: Even experts confuse inferential uncertainty and outcome variability
In many fields, there has been a long-standing emphasis on inference (precisely estimating an unknown quantity, such as a population average) over prediction (forecasting individual outcomes). Here, we show that this focus on inference over prediction can mislead readers into thinking that the results of scientific studies are more definitive than they actually are. Through a series of randomized experiments, we demonstrate that this confusion arises for one of the most basic ways of presenting statistical findings and affects even experts whose jobs involve producing and interpreting such results. In contrast, we show that communicating both inferential and predictive information side by side provides a simple and effective alternative, leading to calibrated interpretations of scientific results…
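To make the distinction concrete, here is a small illustrative sketch (ours, not the paper's) contrasting a 95% confidence interval for a mean, which shrinks as the sample grows, with a 95% prediction interval for a single new outcome, which does not:

    # Illustration: inferential uncertainty (CI for the mean) vs.
    # outcome variability (prediction interval for one new observation).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    y = rng.normal(loc=50, scale=10, size=500)  # simulated outcomes

    n, mean, sd = len(y), y.mean(), y.std(ddof=1)
    t = stats.t.ppf(0.975, df=n - 1)

    ci = (mean - t * sd / np.sqrt(n), mean + t * sd / np.sqrt(n))
    pi = (mean - t * sd * np.sqrt(1 + 1 / n), mean + t * sd * np.sqrt(1 + 1 / n))

    print(f"95% CI for the mean:     ({ci[0]:.1f}, {ci[1]:.1f})")  # narrow
    print(f"95% prediction interval: ({pi[0]:.1f}, {pi[1]:.1f})")  # wide

With 500 observations the interval for the mean comes out narrower than the interval for a single new outcome by roughly a factor of √n, which is exactly the gap readers tend to miss.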
ChatGPT Has Been Slumping On Google Trends Since April -- And Outranked By Leading Metaverse Platforms Roblox, Minecraft & Fortnite
ChatGPT only briefly outpaced the leading metaverse platforms Roblox, Minecraft and Fortnite on Google Trends in April. Before and since, however, it has trailed behind, usually by quite a lot. Why? For one thing, platforms made up of live, active masses of creative people will always be more popular than an automated service. For another, ChatGPT, like all other Large Language Models, is by definition a mediocre content generator…
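If you want to check the comparison yourself, a rough sketch using pytrends, an unofficial Google Trends client (and not the article's own code), might look like this:

    # Rough sketch: compare worldwide search interest for ChatGPT vs. the
    # big metaverse platforms using pytrends (pip install pytrends).
    from pytrends.request import TrendReq

    pytrends = TrendReq(hl="en-US", tz=0)
    terms = ["ChatGPT", "Roblox", "Minecraft", "Fortnite"]
    pytrends.build_payload(terms, timeframe="2022-11-01 2023-08-01")

    interest = pytrends.interest_over_time()  # weekly 0-100 relative index
    print(interest[terms].mean().sort_values(ascending=False))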
What kind of SQL are you writing and how complex is it? [Reddit Discussion]
Doing a data engineering project right now for one of my courses in data engineering; it's just a supplementary course, not actually college, but it covers the basics of data engineering and such... I wanted to know if there's anything practical I could work on, so I'm learning some advanced SQL. Curious what SQL data engineers actually use. Is it stored procedures, or writing APIs? Anything specific to practice?…
Polars raised a $4M seed round
Polars has grown into one of the fastest open-source OLAP query engines, and adoption has grown beyond what I ever anticipated. In GitHub stars, it has been the fastest-growing data processing project I am aware of. At the time of writing, Polars has over 6 million total downloads and 19,000 GitHub stars, closing in on Apache Spark and Pandas, the most popular DataFrame implementations in existence…
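If you have not tried Polars yet, here is a minimal taste of its lazy API (the file and column names are made up):

    # Minimal Polars sketch: a lazy query that is optimized before execution.
    # "events.csv" and its columns are placeholder names.
    import polars as pl

    result = (
        pl.scan_csv("events.csv")          # lazy: nothing is read yet
        .filter(pl.col("country") == "NL")
        .group_by("user_id")               # .groupby() on older Polars versions
        .agg(pl.col("amount").sum().alias("total_amount"))
        .collect()                         # the optimized plan runs here
    )
    print(result.head())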
What's your approach when it comes to deciding whether or not to eliminate variables from a dataset? [Reddit Discussion]
I have several datasets that in total represent about 250 different variables, and I'm doing the preliminary EDA on them before doing the actual modeling… I'm trying to go into the modeling with a dataset as "light" as possible, but I don't want to lose valuable information in the process… So my question is: what do you usually do in these cases? Do you keep them until the modeling confirms their uselessness for prediction, do you delete them outright, or do you decide what to do based on a preliminary analysis, like a correlation or Cramér's V analysis of said variable in relation to the target variable(s)?…
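For the last option mentioned there, a quick Cramér's V screening between a categorical feature and a categorical target can be sketched like this (the column names are placeholders):

    # Sketch: Cramér's V between a categorical feature and a categorical target.
    # V ranges from 0 (no association) to 1 (perfect association).
    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    def cramers_v(x: pd.Series, y: pd.Series) -> float:
        table = pd.crosstab(x, y)
        chi2, _, _, _ = chi2_contingency(table)
        n = table.to_numpy().sum()
        r, k = table.shape
        return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

    # df, "feature" and "target" are placeholders for your own data:
    # print(cramers_v(df["feature"], df["target"]))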
The Marginal Effects Zoo
Interpreting the parameters estimated by complex statistical models is often challenging. Many applied researchers are keen to report simple quantities that carry clear scientific meaning but, in doing so, they face three primary obstacles:
Intuitive estimands—and their standard errors—are often tedious to compute.
The terminology to describe these estimands is not standardized, and varies tremendously across disciplines.
Modeling packages in R and Python produce inconsistent objects which require users to write custom (and error-prone) code to interpret statistical results.
The “Marginal Effects Zoo” book and the marginaleffects packages for R and Python are designed to help analysts overcome these challenges. The free online book provides a unified framework to describe and compute a wide range of estimands. The marginaleffects package implements this framework and offers a consistent interface to interpret the estimates from over 85 classes of statistical models…
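To give a flavour of the kind of estimand the book covers, here is a small sketch using plain statsmodels on simulated data (not the marginaleffects API itself): it reports average marginal effects from a logistic regression instead of raw coefficients.

    # Sketch: average marginal effects (AMEs) from a logistic regression,
    # i.e. the kind of interpretable quantity the book is about.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 2))
    p = 1 / (1 + np.exp(-(0.5 + 1.0 * X[:, 0] - 0.7 * X[:, 1])))
    y = rng.binomial(1, p)

    model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    print(model.get_margeff(at="overall").summary())  # AMEs, not raw log-odds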
For the Gucci Global Data Science team based in Milan, we are currently seeking an English speaking Senior Data Scientist.
In this role, you will report to the Global Corporate Director of Data Science and help the business in central decision-making processes. You will have the opportunity to lead the technical development of a small team of bright and driven data scientists and to collaborate with teams across different regions and areas of the business, leveraging Gucci’s rich data sources, infrastructure, and the power of machine learning and advanced analytics.
Influential, innovative and progressive, Gucci is reinventing a wholly modern approach to fashion.
The Gucci Data Science team is the new kid on the block, bringing fresh perspectives and a new way of working that will help the company continue on its innovation path by leveraging the power of data and ML.
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Accelerate your success with AE's elite team of experts!
🚀 Get ahead with swift development of Minimum Viable Products (MVPs).
🚀 Lead the way in innovation with Digital Transformation Initiatives.
🚀 Boost your ROI with tailored AI/ML solutions.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
CMU’s Deep Learning Systems: Algorithms and Implementation
The goal of this course is to provide students an understanding and overview of the “full stack” of deep learning systems, ranging from the high-level modeling design of modern deep learning systems, to the basic implementation of automatic differentiation tools, to the underlying device-level implementation of efficient algorithms…
Probabilistic Machine Learning: Advanced Topics (3rd in the Trilogy)
An advanced counterpart to Probabilistic Machine Learning: An Introduction, this high-level textbook provides researchers and graduate students with detailed coverage of cutting-edge topics in machine learning, including deep generative modeling, graphical models, Bayesian inference, reinforcement learning, and causality. This volume puts deep learning into a larger statistical context and unifies approaches based on deep learning with ones based on probabilistic modeling and inference…
Lecture materials on i) deep generative models (VAEs, diffusion models, flows) and ii) simulation-based inference
The SLAC Summer Institute (SSI) is an annual two-week summer school that has been a tradition since 1973. The theme of the 51st SLAC Summer Institute is “Artificial Intelligence in Fundamental Physics”. These SSI lectures will introduce methods for Artificial Intelligence and Machine Learning and their successful applications across fundamental physics. The SSI intends to inspire invigorated efforts toward new insights into how the rapidly developing field of Artificial Intelligence can change the way data is analyzed in fundamental physics…
** Find last week's issue #507 here.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S.
If you found this newsletter helpful to your job, please consider becoming a paid subscriber here:
https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.