Hello and thank you for tuning in to Issue #499.
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now, let's dive into some interesting links from this week. Hope you enjoy them! :)
GPT-3-driven pedagogical agents for training children's curious question-asking skills
In order to train children's ability to ask curiosity-driven questions, previous research has explored designing specific exercises relying on providing semantic and linguistic cues to help formulate such questions. But despite showing pedagogical efficiency, this method is still limited as it relies on generating the said cues by hand, which can be a very costly process. In this context, we propose to leverage advances in the natural language processing field (NLP) and investigate the efficiency of using a large language model (LLM) for automating the production of the pedagogical content of a curious question-asking (QA) training…
Reconstructing indoor spaces with NeRF
We describe the work put into delivering these (Google Maps) indoor views in Immersive View. We build on neural radiance fields (NeRF), a state-of-the-art approach for fusing photos to produce a realistic, multi-dimensional reconstruction within a neural network. We describe our pipeline for creation of NeRFs, which includes custom photo capture of the space using DSLR cameras, image processing and scene reproduction. We take advantage of Alphabet’s recent advances in the field to design a method matching or outperforming the prior state-of-the-art in visual fidelity. These models are then embedded as interactive 360° videos following curated flight paths, enabling them to be available on smartphones…
The Value of Personal Data in Internet Commerce: A High-Stake Field Experiment on Data Regulation Policy
In collaboration with the largest e-commerce platform in China (Alibaba), we conduct a large-scale field experiment to measure the potential impact of data regulation policy, and to understand the value of personal data in Internet Commerce. For a random subset of 555,800 customers on the Alibaba platform, we simulate the regulation by banning the use of personal data in the homepage recommendation algorithm and record the matching process and outcomes between these customers and merchants…
Snowplow powers 1.9m websites globally including Strava, Steve Madden and WeTransfer to fuel analytics and predictive models with rich behavioral data.
In our recent blog, Federico Castanedo, Academic Director for AI at IE University and previously Director of Data Science at DataRobot, examines the benefits of using behavioral data and Snowplow to improve the accuracy of ML models.
Interested in using highly predictive features like time engaged in seconds, scroll depth and more in your next model? Try Snowplow free for 14 days, no credit card required.
Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Data Jockey - DJing with Data Science
This is a passion project of mine that combines my two professions: data science and DJing. This project applies my data science expertise towards analyzing my collection of songs. With machine learning's increasing ability to process, synthesize, and even generate music, I became inspired to dive in and see if big data algorithms could help me better understand my musical oeuvre and perhaps optimize my routine DJ activities…
Understanding DeepMind's Sorting Algorithm
A few days ago, DeepMind published a blog post talking about a paper they wrote, where they discovered tinier kernels for sorting algorithms. They did this by taking their deep learning wisdom, which they gained by building AlphaGo, and applying it to the discipline of superoptimization…Let's start with the assembly code they published for sorting an array with three items, translated from pseudo-assembly into assembly…
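For readers who want a feel for what a fixed-size sorting kernel does, here is a minimal Python sketch of a generic three-element sorting network. It is not DeepMind's published assembly; the names compare_exchange and sort3 are hypothetical, and the comparator order is just one textbook arrangement. The point is that the same three min/max steps run for every input, which is the kind of branch-free routine that superoptimization tries to shorten.

```python
from itertools import permutations


def compare_exchange(values, i, j):
    """Place the smaller of values[i], values[j] at i and the larger at j."""
    lo = min(values[i], values[j])
    hi = max(values[i], values[j])
    values[i], values[j] = lo, hi


def sort3(a, b, c):
    """Sort three items with a fixed sequence of three compare-exchanges."""
    values = [a, b, c]
    compare_exchange(values, 0, 2)  # now values[0] <= values[2]
    compare_exchange(values, 0, 1)  # now values[0] is the overall minimum
    compare_exchange(values, 1, 2)  # order the remaining two items
    return tuple(values)


# Every ordering of three distinct items ends up sorted.
all_sorted = all(sort3(*p) == (1, 2, 3) for p in permutations((1, 2, 3)))
```

Because the sequence of steps never depends on the data, each compare-exchange can map to a handful of conditional-move or min/max instructions, which is why shaving even one instruction from such a kernel matters.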
Taking Algorithms to Courts: A Relational Approach to Algorithmic Accountability
In this paper, we demonstrate that the courts are the primary route—and the primary roadblock—in the pursuit of redress for algorithmic harms. Courts often find algorithmic harms non-cognizable and rarely require developers to address material claims of harm. To address the core challenges of taking algorithms to court, we develop a relational approach to algorithmic accountability that emphasizes not what the actors do nor the results of their actions, but rather how interlocking relationships of accountability are constituted in a triadic relationship between actors, forums, and public(s)….
Knowledge Graphs & LLMs: Multi-Hop Question Answering: Retrieve information that spans across multiple documents
Excited to share our newest blog post focusing on multi-hop question-answering in retrieval-augmented LLMs! Discover how knowledge graphs bridge data from diverse sources, driving accurate answers to complex queries…
You don't need the Modern Data Stack to get sh*t done
The fading hype around the "Modern Data Stack" has recently been a popular topic amongst the data community…According to search trends, it very well may be a Bay Area + New York fad that saw hyper-growth but is starting to flatten out…We work with many enterprises, including many in the Fortune 500. I know for a fact that most of them use something other than the Modern Data Stack. Statistically, less than 50% of our enterprise customers use another major solution in the MDS…Let's dig in, starting with a bit of history on what got us here in the first place…
The Annotated S4: Efficiently Modeling Long Sequences with Structured State Spaces
The Structured State Space for Sequence Modeling (S4) architecture is a new approach to very long-range sequence modeling tasks for vision, language, and audio, showing a capacity to capture dependencies over tens of thousands of steps. Especially impressive are the model’s results on the challenging Long Range Arena benchmark, showing an ability to reason over sequences of up to 16,000+ elements with high accuracy…
Can you trust ChatGPT’s package recommendations?
We have discovered that attackers can easily use ChatGPT to help them spread malicious packages into developers’ environments. Given the widespread, rapid proliferation of AI tech for essentially every business use case, the nature of software supply chains, and the broad adoption of open-source code libraries, we feel an early warning to cyber and IT security professionals is necessary, timely, and appropriate…In this blog post we will detail our findings including a PoC of the attack…
Does anyone else hate Pandas?
I’ve been in data for ~8 years - from DBA, Analyst, Business Intelligence, to Consultant. Through all this I finally found what I actually enjoy doing and it’s Data Engineering work. With that said - I absolutely hate Pandas. It’s almost like the developers of Pandas said “Hey. You know how everyone knows SQL? Let’s make a program that uses completely different syntax. I’m sure users will love it” Spark on the other hand did it right. Curious for opinions from other experienced DEs - what do you think about Pandas?..
The Dimensions of Data Labor: A Road Map for Researchers, Activists, and Policymakers to Empower Data Producers
By synthesizing literature that reconceptualizes the production of data for computing as "data labor", we outline opportunities for researchers, policymakers, and activists to empower data producers in their relationship with tech companies, e.g., advocating for transparency about data reuse, creating feedback channels between data producers and companies, and potentially developing mechanisms to share data's revenue more broadly. In doing so, we characterize data labor with six important dimensions - legibility, end-use awareness, collaboration requirement, openness, replaceability, and livelihood overlap - based on the parallels between data labor and various other types of labor in the computing literature…
Different development paths of LLMs
In industry, open-source, and academia, each of these giant pools of talent is driven by different incentives and will create very different language models…In the immediate months around ChatGPT's launch, it actually seemed like every large language model (LLM) of the future needed to be what ChatGPT was, but this is turning out to be far from the case…Given all of the opportunities, reproducing ChatGPT is more of a vibes goal for the open-source community rather than a real necessity because of how different open-source's stakeholders are. This article focuses on LLMs because they're timely, but I expect we see these dynamics play out for many other types of ML models in the future. Open-source will develop LLMs that are more capable over a specific set of needs, but less cumulatively capable. What this looks like is instead of taking the giant scorecard that GPT4 was touted on, you take 10-50% as the targets for an open-source model and beat GPT4…
Fitting many statistical models at once using dplyr
One common task in applied statistics is to fit and interpret a number of statistical models at once. For example, fitting a model with the same structure to a number of different outcome or explanatory variables, or fitting several models with different structure to the same data. Here are some examples of how I usually do this, using features that were introduced with dplyr version 1.1.0…
After building a user base for their machine learning test suites (2,700+ stars and 650,000+ downloads), Deepchecks is now making another bold move: They’ve just open-sourced their ML Monitoring solution. Deepchecks is paving the way for a collaborative evolution of ML model monitoring, on their mission to enable continuous AI/ML validation for all. This comes just a week after they released support for NLP modules in their testing offering, so seems like the potential audience is constantly growing!
Try it out by following the monitoring quick start from their GitHub (and please ⭐️ star it ⭐️ if you like!)…
*Sponsored
As part of this initiative, we are looking for a strong Data Scientist to join Cloudflare and help us drive predictive analytic insights and best practices at scale from the ground up. This is a high visibility role and success in this role comes from marrying a strong data & modeling background with acute product and business acumen to deliver highly strategic and compelling insights that accelerate our business growth and influence our product decisions within Cloudflare.
What we look for:
Predictive modeling techniques, machine learning, model creation and deployment, storytelling and visualization, strong business & product acumen, cross-functional collaboration, creative problem solving, agile mindset
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
The Matrix Cookbook [PDF]
These pages are a collection of facts (identities, approximations, inequalities, relations, ...) about matrices and matters relating to them. It is collected in this form for the convenience of anyone who wants a quick desktop reference…
An Elementary Introduction to Information Geometry
In this survey, we describe the fundamental differential-geometric structures of information manifolds, state the fundamental theorem of information geometry, and illustrate some use cases of these information manifolds in information sciences. The exposition is self-contained by concisely introducing the necessary concepts of differential geometry. Proofs are omitted for brevity…
The entire ETH Zürich Deep Learning in Scientific Computing Master's course is now on YouTube
Very excited to announce that our entire ETH Zürich Deep Learning in Scientific Computing Master's course is now on YouTube! 📖 Prof. Siddhartha Mishra and I (Ben Moseley) will talk you through PINNs, neural operators, neural ODEs, differentiable physics and more…
* Based on unique clicks.
** Find last week's issue #498 here.
Thanks for joining us this week :)
All our best,
Hannah & Sebastian
P.S.,
If you found this newsletter helpful, consider supporting us by becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe
:)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.