15 minute read

Note: To see the description for each problem, simply click on the problem name

LeetCode

‘Hard’ Difficulty

601. Human Traffic of Stadium

This problem can be tricky if you let it lead you down the wrong path. For example, it’s easy to look at this and immediately think you need to use window functions. Unfortunately, you’ll have to use multiple window functions on successive CTEs, each derived from the one before, and I don’t think it would be an elegant solution.

Instead, we can use what I like to call the “old reliable” - self-joins. Let’s self-join the table to itself twice so that we get 3 successive IDs in a single row. From there we simply filter for rows where the number of People is >= 100 in a given row. This will get us all required IDs, but we still need to get them in the correct format.

In Spark and Databricks SQL, you can use UNPIVOT() clause. But here I am using MySQL which does not have this functionality, so I will use an alternative method: UNION. This does require a scan of the whole table which each execution of UNION, and so it isn’t the fastest solution when datasets get large, but it will get the job done here.

After getting all required IDs, we can simply INNER JOIN to the Stadium table to get the desired results


# Write your MySQL query statement below

with self_joined_stadium as(

    select
        a.id as a_id,
        a.visit_date,
        a.people as a_people,
        b.id as b_id,
        -- b.visit_date,
        b.people as b_people,
        c.id as c_id,
        c.people as c_people
    from
        Stadium a
    left join
        Stadium b on a.id + 1 = b.id
    left join
        Stadium c on b.id + 1 = c.id
    order by a.id, b.id, c.id

),

filtered as(

    select
        *
    from
        self_joined_stadium 
    where
        a_people >= 100
        and b_people >= 100
        and c_people >= 100
),

required_id as(

    select a_id from filtered

    union

    select b_id from filtered

    union

    select c_id from filtered

)

select
    s.*
from
    Stadium s
inner join
    required_id ri on s.id = ri.a_id



262. Trips and Users

The thing to be careful with in this query is understanding that the Users table needs to be joined to the Trips table twice, once on Client ID and once on Drivers ID. This is necessary to get the correct Status (Banned vs Unbanned) for both Clients and Drivers. After that, it is a simple procedure of grouping by Day and finding the number Cancelled trips as a proportion of total trips


# Write your MySQL query statement below
with cte as(
    select
        t.id as trip_id,
        t.client_id,
        u1.banned as client_status,
        t.driver_id,
        u2.banned as driver_status,
        t.status as trip_status,
        t.request_at,
        case when t.status in ('cancelled_by_driver', 'cancelled_by_client') then 1 else 0 end as binary_status
    from
        trips t
    left join
        users u1 on t.client_id = u1.users_id
    left join
        users u2 on t.driver_id = u2.users_id
    where
        u1.banned != 'Yes'
        and u2.banned != 'Yes'
)

select
    request_at as Day,
    round(sum(binary_status) / count(request_at), 2) as 'Cancellation Rate'
from
    cte
group by 1
order by 1


185. Department Top Three Salaries

Approach: Join the Employee table to the Department table using the Primary-Foreign Key combination. Because the question asks for the top 3 unique salaries, we must be careful to use DENSE_RANK() instead of RANK() or ROW_NUMBER(), as the former will assign the same value to equal salaries when we partition by department


with cte as(
    select
        e.name as employee_name,
        e.salary as employee_salary,
        d.name as department_name,
        dense_rank() over (partition by d.name order by e.salary DESC) as ranked
    from
        Employee e
    left join
        Department d on e.departmentId = d.id
)

select
    department_name as Department,
    employee_name as Employee,
    employee_salary as Salary
from
    cte
where
    ranked <= 3



‘Medium’ Difficulty

1341. Movie Rating

Approach: Find the user with the most reviews using one query. Call this t1. Find the movie with the highest average rating in February 2020. Call this t2. These queries are simple enough.

The difficulty lies in getting the required output (the username and movie title) without the extra information (count of user_name and average rating of movies, respectively) and in a single column.

To accomplish this, we construct both t1 and t2 as subqueries and use UNION to get the required outputs in one column:


# Write your MySQL query statement below
select
    t1.name as results from

        (select
            u.name,
            count(u.name)
        from
            movies m
        inner join
            MovieRating r on m.movie_id = r.movie_id
        inner join
            Users u on u.user_id = r.user_id
        group by
            1
        order by
            2 DESC, 1 ASC
        limit
            1) t1

union

select
    t2.title as results from

        (select
            m.title,
            avg(rating) as avg_rating
        from
            movies m
        inner join
            MovieRating r on m.movie_id = r.movie_id
        where
            year(created_at) = 2020 and month(created_at) = 2
        group by
            1
        order by
            2 DESC, 1 ASC
        limit
            1) t2

1264. Page Recommendations

The key to solving this problem is to use a CASE statement to collect all the user_id of individuals who are friends of user_id = 1 in a CTE. All that is left to do is to join the Likes table on this CTE to find out what pages the friends of user_id = 1 like and filter out those pages that user_id = 1 already Likes


with rel as(

    select
        1 as user1,
        case when user1_id = 1 then user2_id
             when user2_id = 1 then user1_id end as user1_friends
    from
        Friendship
)

select distinct
    l.page_id as recommended_page
from
    rel r
inner join
    likes l on r.user1_friends = l.user_id
where
    l.page_id not in (select page_id from Likes where user_id = 1)

1205. Monthly Transactions II

This is one of the tricker Medium-difficulty questions I’ve come across. The trickiness lay in the fact that we need to recognise that chargebacks for a given transaction may not occur in the same month in which the transaction itself occurred.

This means I needed to use UNION to collect all transactions (those that were approved, decline and charged back) and their months in a CTE, which I could then query to get the required result.

Finally, the HAVING clause exists comply with the condition of the question which asks us to ignore rows where all of the latter 4 columns are 0


with t1 as       
        (select
            id,
            date_format(trans_date, '%Y-%m') as month,
            country,
            state as status,
            amount
        from
            transactions

        union

        select
           c.trans_id,
           date_format(c.trans_date, '%Y-%m') as month,
           t.country,
           'chargeback' as status,
           t.amount
        from
            chargebacks c
        inner join
            transactions t on c.trans_id = t.id
        )

select
    t1.month,
    t1.country,
    ifnull(sum(case when status = 'approved' then 1 else 0 end), 0) as approved_count,
    ifnull(sum(case when status = 'approved' then amount else 0 end), 0) as approved_amount,
    ifnull(sum(case when status = 'chargeback' then 1 else 0 end), 0) as chargeback_count,
    ifnull(sum(case when status = 'chargeback' then amount else 0 end), 0) as chargeback_amount
from
    t1
group by
    1, 2
having
    approved_count > 0 or
    approved_amount > 0 or
    chargeback_count > 0 or
    chargeback_amount > 0

1107. New Users Daily Count

The key here was to use DATEDIFF in the HAVING clause of the subquery with min(activity_date) instead of in the WHERE clause of the subquery (which would give the wrong answer because we cannot use min() in the WHERE clause)


select
    t.login_date,
    count(distinct t.user_id) as user_count
from

(select
    user_id,
    min(activity_date) as login_date
from
    Traffic
where
    activity = 'login'
group by
    1
having
    abs(datediff('2019-06-30', min(activity_date))) <= 90 ) t

group by
    1


176. Second Highest Salary


select
      ifnull(

      (select distinct
            salary
      from
            Employee
      order by
            Salary DESC
      limit
            1,1), null) as SecondHighestSalary

180. Consecutive Numbers


select distinct
      a.num as ConsecutiveNums
from
      Logs a
join
      Logs b on a.id = b.id + 1
join
      Logs c on b.id = c.id + 1
where
      a.num = b.num and
      b.num = c.num

184. Department Highest Salary

with sal as(

select
      d.id,
      d.name,
      max(e.salary) as maximum
from
      department d
inner join
      employee e on d.id = e.departmentId
group by
      1, 2)

select
      sal.name as department,
      e.name as employee,
      sal.maximum as salary
from
	    employee e
inner join
      sal on sal.id = e.departmentId and
      sal.maximum = e.salary

570. Managers with at Least 5 Direct Reports

select
      b.name as name
from
      Employee a
left join
      Employee b on a.managerId = b.id
where
      b.id is not null
group by
      b.name
having
      count(*) >=5

580. Count Student Number in Departments

select
      d.dept_name,
      count(s.student_name) as student_number
from
      Department d
left join
      student s on d.dept_id = s.dept_id
group by
      d.dept_name
order by
      student_number DESC, dept_name ASC

585. Investments in 2016

select
      round(sum(tiv_2016), 2) as tiv_2016
from
      Insurance
where
      tiv_2015 not in


      (select
            tiv_2015
       from
            Insurance
       group by
            1
       having
            count(*) = 1)

and
      concat(lat, lon) not in

      (select
            concat(lat, lon)
       from
            Insurance
       group by
            lat, lon
       having
            count(*) > 1)

602. Friend Requests II: Who Has the Most Friends

select distinct
      tab1.requester_id as id,
      sum(tab1.accepted) as num
from

(

      (select distinct
            requester_id,
            count(accepter_id) as accepted
       from
            RequestAccepted
       group by
            1)

       union all

      (select distinct
            accepter_id,
            count(requester_id) as requested
       from
            RequestAccepted
       group by
            1)

) tab1

group by
      tab1.requester_id
order by
      num DESC
limit
      1

608. Tree Node


with cte1 as

(select
      a.p_id,
      a.id,
      b.p_id as b_p_id
from
      Tree a
left join
      Tree b on b.p_id = a.id)

select distinct
      cte1.id,
      case
          when cte1.p_id is null then "Root"
          when cte1.p_id is not null and cte1.b_p_id is not null then "Inner"
          else "Leaf"
          end as type
from
      cte1

612. Shortest Distance in a Plane


with cte1 as(

select
      a.x as a_x,
      a.y as a_y,
      b.x as b_x,
      b.y as b_y
from
      Point2D a, Point2D b
where
      concat(a.x, ' ', a.y) != concat(b.x, ' ', b.y)
order by
      a_x, a_y)


select
      round(abs(min(sqrt(power((cte1.b_x-cte1.a_x), 2) + power((cte1.b_y-cte1.a_y), 2)))), 2) as shortest
from
      cte1

1045. Customers Who Bought All Products


with cte1 as(
select
      customer_id,
      count(distinct product_key) as total
from
	    Customer
group by
	    1)

select
      cte1.customer_id
from
      cte1
where
      cte1.total = (select count(distinct product_key) from Product)

1070. Product Sales Analysis III


select
      product_id,
      year as first_year,
      quantity,
      price
from
      Sales
where
      (product_id, year) in (select product_id, min(year) from Sales group by 1)

1077. Project Employees III


with cte1 as(

select
      p.project_id,
      max(e.experience_years) as max_exp
from
      Project p
inner join
      employee e using(employee_id)
group by
      p.project_id),

cte2 as(

select
      p.project_id,
      e.employee_id,
      e.name,
      e.experience_years
from
      Project p
inner join
      employee e using(employee_id))

select
      cte1.project_id,
      cte2.employee_id
from
      cte1, cte2
where
      cte1.project_id = cte2.project_id and
      cte1.max_exp = cte2.experience_years


DataLemur

“Hard” Difficulty

Facebook - Active User Retention

The following shows my first solution which has more parts to it than necessary. After it we can see my second solution, which pares down the unnecessary parts of the query and makes it cleaer.

The approach here is as follows. I need to use a self-join on the user_actions table so that I can compare, for the same user_id, whether or not the user_id had transactions in consecutive months. The question defines “active users” as those who have a required interaction type in both July and June, so I want to look for these users only.

I will join using a compound key where a.user_id is equal to b.user_id and also the event_date of table alias b occurs after the event date of table alias b.

Next, I need to restrict this self join to only those cases where the difference between b.event_date and a.event_date is only 1 month, again because we are interested in July 2022 and June 2022 solely, and where the interaction type in both months is one of “sign-in”, “like”, or “comment”.

After this is done, I wrap this query in a CTE. From this CTE, I find the COUNT of DISTINCT user_id values of those users whose second interaction occurred in July 2022; by necessity, this means their first interaction occurred in June 2022 due to the way our CTE is set up. This is exactly what we’re looking for.


-- First solutions

with t as(

    SELECT
          a.user_id,
          a.event_id as event_id_a,
          a.event_date as a_date,
          b.event_id as event_id_b,
          b.event_date as b_date
    FROM
          user_actions a
    JOIN
          user_actions b on a.user_id = b.user_id AND b.event_date > a.event_date
    WHERE
          (DATE_PART('year', b.event_date::timestamp) - DATE_PART('year', a.event_date::timestamp)) * 12 +
          (DATE_PART('month', b.event_date::timestamp) - DATE_PART('month', a.event_date::timestamp)) = 1 AND
          a.event_type in ('sign-in', 'like', 'comment') AND
          b.event_type in ('sign-in', 'like', 'comment')
         )
SELECT
    extract(month from t.b_date) as month,
    count(distinct t.user_id) as monthly_active_users
FROM
    t
WHERE
     extract(month from t.b_date) = 7 AND
     extract(year from t.b_date) = 2022
GROUP BY
    1

A less wordy, cleaner solution



    SELECT
          extract(month from b.event_date) as month,
          count(distinct a.user_id) as monthly_active_users
    FROM
          user_actions a
    JOIN
          user_actions b on a.user_id = b.user_id AND b.event_date > a.event_date
    WHERE

          (DATE_PART('month', b.event_date::timestamp) - DATE_PART('month', a.event_date::timestamp)) = 1 AND
          a.event_type in ('sign-in', 'like', 'comment') AND
          b.event_type in ('sign-in', 'like', 'comment') AND
          extract(month from b.event_date) = 7 AND
          extract(year from b.event_date) = 2022
    GROUP BY
        1

Wayfair - Y-on-Y Growth Rate


with table1 as(select
      product_id,
      extract(year from transaction_date) as year,
      sum(spend) as total_spend
FROM
      user_transactions
group by
      1, 2
ORDER BY
      1, 2)

SELECT
      a.year,
      a.product_id,
      a.total_spend as curr_year_spend,
      b.total_spend as prev_year_spend,
      round(((a.total_spend - b.total_spend)/(b.total_spend))*100.0, 2)
FROM
      table1 a
LEFT JOIN
      table1 b on a.product_id = b.product_id AND
                  a.year= b.year + 1

“Medium” Difficulty

Google - Odd and Even Measurements

The key here is to use the ROW_NUMBER() window function and to use day only (using EXTRACT()) in the PARTITION BY clause while using the entire datetime value (named measurement_time) in the ORDER BY clause


with t as(

    SELECT
        measurement_id,
        measurement_value,
        measurement_time,
        row_number() over (partition by extract(day from measurement_time) order by measurement_time)
    FROM
        measurements)
SELECT
    measurement_time::date as measurement_day,
    sum(case WHEN row_number % 2 != 0 then measurement_value else 0 end) as odd_sum,
    sum(case when row_number % 2 = 0 then measurement_value else 0 end) as even_sum
FROM
    t
GROUP BY
    1
ORDER BY
    1

UnitedHealth - Patient Support Analysis (Part 4)

The key to solving this problem is to use the LAG() window function to compare the number of long calls in one month to that of the previous month and to cast the percentage calculation as a decimal to be able to get negative percentages


with t as(

    SELECT
        extract(year from call_received) as year,
        extract(month from call_received) as month,
        sum(case when call_duration_secs > 300 then 1 else 0 end) as lc,
        lag( sum(case when call_duration_secs > 300 then 1 else 0 end), 1 )
        OVER( order by extract(year from call_received), extract(month from call_received) ) as previous_lc

    FROM
        callers
    GROUP BY
        1, 2
    ORDER BY
        1, 2)


SELECT
    year,
    month,
    round(((lc-previous_lc)/previous_lc::decimal*100), 1)
FROM
    t

Spotify - Spotify Streaming History

The key here is to use UNION ALL instead of multiple CTEs


with t as(

    (SELECT
        user_id,
        song_id,
        count(song_id) as listens
    FROM
        songs_weekly
    WHERE
        extract(year from listen_time) <= 2022 and
        extract(month from listen_time) <= 8 and
        extract(day from listen_time) <= 4
    GROUP BY
        1, 2)

    union all


    (SELECT
        user_id,
        song_id,
        sum(song_plays) as listens
    FROM
        songs_history
    GROUP BY
        1, 2)

      )

SELECT
    user_id,
    song_id ,
    sum(listens) as song_plays
FROM
    t
GROUP BY
    1, 2
ORDER BY
    3 DESC

Uber - Second Ride Delay

This question gave me more problems than usual and taught me about a crucial difference between rank() and row_number() which is obvious in hindsight. My approach here is as follows: find the order of trips per user_id dependent on the date. Isolate those user_id who booked a first ride on the same day as their registration (‘in the moment’ users). Find the difference between the 2nd ride date and the registration date for these latter users.

I kept getting the wrong answer initially until I finally realised I was using rank(), when I should have been using row_number().

If two values are equal (in this case, we are looking at ride dates), rank() will of course return the same value for both. However, row_number() will not; it will instead return the next value. For example, rank() will return 1 and 1 for the same ride date but row_number() will return 1 and 2.

This realisation finally enabled me to get the solution


with ride_record as(

    SELECT
        u.user_id,
        u.registration_date,
        r.ride_date,
        row_number() over (partition by u.user_id order by r.ride_date) as trip_no
    FROM
        users u
    inner JOIN
        rides r on u.user_id = r.user_id),

    in_moment as(

    select DISTINCT
        ride_record.user_id
    FROM
        ride_record
    WHERE
        ride_record.registration_date = ride_record.ride_date)

SELECT
    ROUND(AVG(ride_date - registration_date),2) AS average_delay
FROM
    ride_record
inner JOIN
    in_moment on ride_record.user_id = in_moment.user_id
WHERE
    trip_no = 2

CVS Health - Pharmacy Analytics (Part 4)


with t as
    (SELECT
        manufacturer,
        drug,
        rank() over(partition by manufacturer order by sum(units_sold) DESC)
    FROM
        pharmacy_sales
    GROUP BY
        1, 2)
SELECT
    t.manufacturer,
    t.drug as top_drugs
FROM
    t
WHERE
    rank <= 2
ORDER BY
    t.manufacturer asc

JPMorgan Chase - Card Launch Success


with t as(

    SELECT
        card_name,
        issued_amount,
        rank() over (partition by card_name order by min(issue_year), min(issue_month))
    FROM
        monthly_cards_issued
    group by
        1, 2
    ORDER BY
        2 DESC)

SELECT
  card_name,
  issued_amount
FROM
    t
WHERE
    rank = 1

Stitch Fix - Repeat Purchases on Multiple Days

The following is not the most elegant solution but I had to think about this problem for longer than usual and I eventually got the answer.

The approach is to first find user_id and product_id combinations that occur over at least 2 different days. I accomplish this using COUNT(DISTINCT), GROUP BY and CONCAT. I use CONCAT because the purchase_date column has both the date and a timestamp, so simply using COUNT(DISTINCT) will not necessarily only capture different days as needed by the problem. It would also capture values for the same day, just at different times.

I use COUNT(DISTINCT) to then get only those user_id-product_id combinations which occur on at least 2 different days. Then it is a simple matter of wrapping these steps in a CTE and using COUNT(DISTINCT user_id) to find the number of users who made purchases on at least 2 different days.

with t as(             -- begin CTE
    SELECT
        user_id,
        product_id,
        count(DISTINCT -- begin COUNT
            concat(    -- begin CONCAT
            extract(day from purchase_date),
            extract(month from purchase_date),
            extract(year from purchase_date)
                  )    -- end CONCAT
             )         -- end COUNT
    FROM
        purchases
    GROUP BY
        1, 2
    HAVING
        count(DISTINCT -- begin COUNT
            concat(    -- begin CONCAT
            extract(day from purchase_date),
            extract(month from purchase_date),
            extract(year from purchase_date)
                  )    -- end CONCAT
             ) > 1     -- end COUNT
         )             -- end CTE

SELECT
    count(distinct user_id) as repeat_purchasers
FROM
    t


Etsy - First Transaction


with t as(

select
      user_id,
      spend,
      dense_rank() over (partition by user_id order by transaction_date) as rank
from
      user_transactions
)

select count(distinct t.user_id) as users
from t
where t.rank = 1 and t.spend >= 50

Google - Ad Campaign ROAS


select
    advertiser_id, round(((sum(revenue)/sum(spend))::DECIMAL), 2) as ROAS
from
    ad_campaigns
group by
      1

Amazon - Average Review Ratings


select
      extract(month from submit_date) as month,
      product_id as product,
      round(avg(stars), 2) as avg_stars
from
      reviews
group by
      1, 2
order by
      1, 2

Spotify - Top 5 Artists


with t as(

    SELECT
        artist_name,
        dense_rank() over (order by count(artist_name) DESC) as artist_rank
    FROM
        artists a
    inner JOIN
        songs s on a.artist_id = s.artist_id
    inner JOIN
        global_song_rank g on s.song_id = g.song_id
    WHERE
        g.rank <= 10
    GROUP BY
        1)

SELECT
    artist_name,
    artist_rank
FROM
    t
WHERE
    artist_rank <= 5

Google - Consulting Bench Time

select
      s.employee_id,
      365 - sum(((c.end_date-c.start_date) + 1))
from
    staffing s
inner join
      consulting_engagements c on s.job_id = c.job_id
where
      s.is_consultant = 'true'
group by
      1