Hello and thank you for tuning in to Issue #506!
Once a week, we write this email to share the links we think are worth your time in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
Seeing this for the first time? Subscribe here:
Want to support us? Become a paid subscriber here.
If you don’t find this email useful, please unsubscribe here.
And now, let's dive into some interesting links from this week :)
Patterns for Building LLM-based Systems & Products
This post is about practical patterns for integrating large language models (LLMs) into systems and products. We’ll draw from academic research, industry resources, and practitioner know-how, and try to distill them into key ideas and practices…There are seven key patterns. I’ve also organized them along the spectrum of improving performance vs. reducing cost/risk, and closer to the data vs. closer to the user.
Evals: To measure performance
RAG: To add recent, external knowledge
Fine-tuning: To get better at specific tasks
Caching: To reduce latency & cost
Guardrails: To ensure output quality
Defensive UX: To anticipate & manage errors gracefully
Collect user feedback: To build our data flywheel
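As a rough illustration of the caching pattern above (our sketch, not the post's code), exact-match caching can be as simple as keying stored responses on a hash of the prompt; call_llm below is a hypothetical stand-in for whatever client you actually use:

import hashlib

# Hypothetical stand-in for the real LLM client call.
def call_llm(prompt: str) -> str:
    return f"response to: {prompt}"

_cache: dict[str, str] = {}

def cached_completion(prompt: str) -> str:
    # Exact-match caching: identical prompts skip the LLM call entirely.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]

cached_completion("What is a data flywheel?")  # hits the model
cached_completion("What is a data flywheel?")  # served from the cache

In practice you would also decide how to handle near-duplicate prompts and when cached answers should expire, which is where most of the real design work lies.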
AI and Power: The Ethical Challenges of Automation, Centralization, and Scale
Over the past decade, topics such as explainability (having computers generate an explanation of why they compute the outputs they do) and fairness/bias (addressing when algorithms have worse accuracy on some groups of people than others) have gained more attention within the field of AI and in the media. Some computer scientists and journalists have stopped there, assuming that a computer program that can explain the logic behind its decision-making, or a program that has the same accuracy on light-skinned men as on dark-skinned women, must now be ethical. While these principles are important, on their own they are not enough to address or prevent the harms of AI systems…
Robotics: An Idiosyncratic Snapshot in the Age of LLMs [PDF]
The goal of this document is to help me think through the state of the art in robotics today, along with the primary research challenges that persist. To this end, I will consider two main sets of questions:
• What are state-of-the-art approaches in different application domains? What are the primary challenges in each area?
• What are approaches that the field is currently excited about? What are the main opportunities and challenges?
Note: This document was primarily written for myself. It is not meant to be an academic survey; it is thoroughly incomplete and biased towards directions I was interested in learning more about. But I am making it accessible in case others find it useful as a source of further sources!…
Accelerate your success with AE's elite team of experts!
🚀 Get ahead with swift development of Minimum Viable Products (MVPs).
🚀 Lead the way in innovation with Digital Transformation Initiatives.
🚀 Boost your ROI with tailored AI/ML solutions.
Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
Using {tidymodels} to Detect Heart Murmurs
When the sounds of heartbeats are recorded, the time series of the observed frequencies can be analyzed to support a diagnosis of whether a patient has a heart murmur…In this talk, Nicola gives a brief overview of the {tidymodels} framework for machine learning, discusses the extraction of time series features, demonstrates the process of fitting machine learning models in R, and considers different approaches to model evaluation. The difficulties and concerns that come along with using machine learning to automate the detection of health conditions will also be discussed…
The Data Chief Podcast:
Balancing long-term vision with near-term action with Vercel’s VP of Data
Starting with a role at the Hubble Space Telescope, Alex Viana, VP of Data at Vercel, found his way into the data space by way of data security and searching for leaked data assets. Today, he leads the data organization at Vercel, where he views building – teams, technology processes, and metrics – as his primary responsibility. In this episode Alex shares his thoughts on leading data teams at different (but fast-growing) tech companies, the importance of building scalable data platforms, delivering value through stakeholder engagement, and balancing long-term vision with short-term action as a key to success…
Why Edge Detection Doesn’t Explain Line Drawing
Why do line drawings work?…A classic answer to this question is what I will call the Lines-As-Edges hypothesis. It says that drawings simulate natural images because line features activate edge receptors in the human visual system….The purpose of this blog post is to explain why I think you should be skeptical of the Lines-As-Edges hypothesis…I don’t claim that Lines-As-Edges is necessarily false, but I do argue that it is unsatisfyingly incomplete…
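If you want to see what a low-level "edge response" actually looks like before reading the argument, here is a tiny sketch (ours, not the author's) that runs a Sobel filter, the textbook edge detector, over a toy image in Python:

import numpy as np
from scipy import ndimage

# Toy image: a bright square on a dark background.
img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0

# Sobel filtering approximates the image gradient; a large gradient
# magnitude is the classic low-level "edge" signal.
gx = ndimage.sobel(img, axis=1)
gy = ndimage.sobel(img, axis=0)
edges = np.hypot(gx, gy)

print(edges[32, 16], edges[32, 32])  # strong response on the square's border, ~0 inside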
Rust ML and LLM community [Twitter/X Thread]
Lots of folks reached out to me yesterday about the Rust ML and LLM community. It seems like a supportive and intellectually curious community, so I wanted to highlight some of the projects that you should check out 🧵…
Against theory-motivated experimentation in science
Scientists must choose which among many experiments to perform. We study the epistemic success of experimental choice strategies proposed by philosophers of science or executed by scientists themselves. We develop a multi-agent model of the scientific process that jointly formalizes its core aspects: active experimentation, theorizing, and social learning. We find that agents who choose new experiments at random develop the most accurate theories of the world…
Dirty imputation done dirt cheap: implementing Multiple Imputation by Chained Equations in one blog post
Coding up an algorithm is a great way to make sure you really understand how the details work. In this post I’m going to implement multiple imputation and the MICE algorithm (Van Buuren, 2007), albeit in much simplified form: only considering missing data in numeric variables, only using Normal-distribution Bayesian linear regression to generate the imputed data, no concerns about robustness of the code for production purposes. If the standard version of this method is called MICE, think of this as a smaller, cuter, maybe slightly endangered variation. Perhaps a fat-tailed dunnart…
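To make the chained-equations idea concrete, here is a heavily simplified sketch in Python/NumPy (ours, not the post's code) with the same restrictions: numeric columns only, one linear model per column, and ordinary least squares plus Gaussian residual noise standing in for the fully Bayesian draws the post uses:

import numpy as np

def mice_impute(X: np.ndarray, n_iter: int = 10, rng=None) -> np.ndarray:
    # Very simplified chained-equations imputation for a numeric matrix X,
    # where NaN marks missing values. Each pass regresses one incomplete
    # column on the others and refills its missing entries.
    rng = np.random.default_rng(rng)
    X = X.astype(float).copy()
    missing = np.isnan(X)
    # Start from column-mean imputation.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            obs = ~missing[:, j]
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])  # intercept + predictors
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid_sd = np.std(X[obs, j] - A[obs] @ beta)
            pred = A[~obs] @ beta
            X[~obs, j] = pred + rng.normal(0.0, resid_sd, size=pred.shape)
    return X

Proper multiple imputation then repeats this to produce several completed datasets and pools the downstream analyses, which is the part that makes the uncertainty estimates honest.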
Functions are Vectors
Conceptualizing functions as infinite-dimensional vectors lets us apply the tools of linear algebra to a vast landscape of new problems, from image and geometry processing to curve fitting, light transport, and machine learning…
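A quick way to see the idea (our toy example, not the article's): sample a function on a grid so it becomes an ordinary vector, and a linear operator on functions, such as differentiation, becomes an ordinary matrix acting on that vector:

import numpy as np

# Discretize f(x) = sin(x) on a uniform grid: the function becomes a plain vector.
n = 200
x = np.linspace(0.0, 2 * np.pi, n)
f = np.sin(x)
h = x[1] - x[0]

# Differentiation via central differences is now just a matrix.
D = (np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)) / (2 * h)
df = D @ f

# Away from the boundary rows, D @ sin should be close to cos.
print(np.abs(df[1:-1] - np.cos(x[1:-1])).max())  # small discretization error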
Catching up on the weird world of LLMs
I gave a talk on Sunday at North Bay Python where I attempted to summarize the last few years of development in the space of LLMs—Large Language Models, the technology behind tools like ChatGPT, Google Bard and Llama 2…My goal was to help people who haven’t been completely immersed in this space catch up to what’s been going on. I cover a lot of ground: What they are, what you can use them for, what you can build on them, how they’re trained and some of the many challenges involved in using them safely, effectively and ethically….
Open Buildings Data Set Version #3, new release
Building footprints are useful for a range of important applications, from population estimation, urban planning and humanitarian response, to environmental and climate science. This large-scale open dataset (1.8 billion building detections) contains the outlines of buildings derived from high-resolution satellite imagery in order to support these types of uses. The project is based in Ghana, with an initial focus on the continent of Africa and new updates on South Asia, South-East Asia, Latin America and the Caribbean…
Avenging Polanyi's Revenge: Exploiting the Approximate Omniscience of LLMs in Planning without Deluding Yourself In the Process
LLMs are on track to reverse what seemed like an inexorable shift of AI from explicit to tacit knowledge tasks. Trained as they are on everything ever written on the web, LLMs exhibit "approximate omniscience"--they can provide answers to all sorts of queries, with nary a guarantee. This could herald a new era for knowledge-based AI systems--with LLMs taking the role of (blowhard?) experts. But first, we have to stop confusing the impressive form of the generated knowledge for correct content, and resist the temptation to ascribe reasoning powers to approximate retrieval by these n-gram models on steroids. We have to focus instead on LLM-Modulo techniques that complement the unfettered idea generation of LLMs with careful vetting by model-based AI systems. In this talk, I will reify this vision and attendant caveats in the context of the role of LLMs in planning tasks…
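For readers who want the shape of the LLM-Modulo idea in code, here is a minimal generate-and-verify loop (our sketch of the concept, not the speaker's system); propose_plan and verify_plan are hypothetical placeholders for an LLM client and a sound model-based plan checker:

def propose_plan(goal: str, feedback: str | None = None) -> list[str]:
    # Placeholder: in practice this would prompt an LLM, optionally with
    # the verifier's feedback appended to the prompt.
    return ["pick up block A", "stack A on B"]

def verify_plan(plan: list[str]) -> str | None:
    # Placeholder: a model-based checker that returns None if the plan is
    # valid, or a message describing the first violated precondition.
    return None

def llm_modulo(goal: str, max_rounds: int = 5) -> list[str] | None:
    feedback = None
    for _ in range(max_rounds):
        plan = propose_plan(goal, feedback)
        feedback = verify_plan(plan)
        if feedback is None:
            return plan  # the verifier, not the LLM, certifies correctness
    return None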
Barclays Capital Inc. seeks Assistant Vice President, Data Scientist in New York, NY (multiple positions available):
* Write Extract, Transform, Load (ETL) code to read from our data sources, and load data for analysis using source control (git; bitbucket) to version-control code contributions
* Encapsulate analysis code built on the ETL code to make work reusable by the team
* Automate analysis processes using Spark, Python, Pandas, SpaCy, Tensorflow, Keras, PyTorch, and other open-source large-scale computing and statistical software
* Create and maintain a Reddit data pipeline, with ad hoc maintenance to serve requests
* Review other coworkers’ contributions to our shared repository
* Telecommuting benefits permitted
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Intro to Maths and Stats Programming: DeepIndaba X Zim 2023
This is a recording of a presentation from the workshop series for Deep Learning Indaba X Zimbabwe 2023. We cover basic maths and statistics concepts for Data Science work, and show examples of how these are used in practice using Python…
Deep Learning in Computer Vision
Computer Vision is broadly defined as the study of recovering useful properties of the world from one or more images. In recent years, Deep Learning has emerged as a powerful tool for addressing computer vision tasks. This course will cover a range of foundational topics at the intersection of Deep Learning and Computer Vision…
ML⇄DB (Machine Learning for Databases + Databases for Machine Learning) Seminar Series
The union of databases and ML is a testament to the virtuous circle of progress, where each empowers the other in a perpetual cycle of advancement. Given this, the Carnegie Mellon University Database Research Group celebrates this grand convergence of data storage and computational mastery with the ML⇄DB Seminar Series (Machine Learning for Databases + Databases for Machine Learning). Each speaker will present the implementation details of their respective systems and examples of the technical challenges they faced when working with real-world customers. All talks are on-line and open to the public via Zoom. You do not need to be a current CMU student to attend. Random people off of the internet are especially welcome…
* Based on unique clicks.
** Find last week's issue #505 here.
Thank you for joining us this week :)
All our best,
Hannah & Sebastian
P.S.
If you found this newsletter helpful, please become a paid subscriber here:
https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.