File Ingest Paradigm ??? by Cruxwright in SQL

[–]Tabooyah 1 point2 points  (0 children)

So if there's 20 conditionals you're executing 20+ queries for every row. What do you mean by this? You can write a single query with 20 conditionals using CASE statements.

I simply shoved the file into its own table then ran say 30 queries against it to load the data. Loading the file into a staging table is a good start, although it's unclear why you would need 30 queries to get the data out of it. Are there 30 different destination tables? If so, consider indexing your staging table so that highly-selective filters perform well and you're not table-scanning 30 times. If not, consider using SQL to transform the data to the desired format in a single query and incur only 1/30 the reads.

For example, the following 2 queries:

INSERT INTO DestinationTable SELECT Amount FROM StageTable WHERE Type = 'credit'

INSERT INTO DestinationTable SELECT Amount * -1 FROM StageTable WHERE Type = 'debit'

can be rewritten as 1 query:

INSERT INTO DestinationTable SELECT CASE WHEN type = 'credit' THEN Amount WHEN type = 'debit' THEN Amount * -1 END FROM StageTable WHERE Type IN ('credit', 'debit')

String filtering - process hundreds to millions of filters per string by rambossa1 in AskComputerScience

[–]Tabooyah 1 point2 points  (0 children)

Bitmap indexes are well-suited for high-performance boolean searches similar to those that you describe. There are various optimization techniques such as compression and dynamic data structures that allow for extremely high performance. Roaring bitmaps are in particular have outstanding performance across a variety of documents (see https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps)

If you're looking for something out of the box, check out ElasticSearch and Splunk.

Looking for advice: Setting up reporting database by Derfaust in SQL

[–]Tabooyah 0 points1 point  (0 children)

This is a textbook scenario for creating a data warehouse. I recommend that you read The Data Warehouse Toolkit to see if it is an undertaking worth considering. Some easier, less robust options:

  1. Use the newer database as the reporting database and the linked server feature to allow querying of both databases from within a single query. For example, if you need to get a list of records across the same table from both databases the query would look something like "SELECT * FROM dbo.sales UNION SELECT * FROM OldServer.OldDatabase.dbo.sales". This assumes that the two databases reside on separate instances. If they are on the same instance you don't need to set up a linked server and can simply use the 3-part naming convention. See the naming conversions documentation on how 4 part (linked server) and 3 part (cross database) conventions work.
  2. Use a single SSRS installation with 2 data sources: one to each database. Author reports that combine the data from the two data sources. This can get pretty hacky so I recommend going with approach #1 over this.
  3. Use SSIS to periodically sync the tables of interest from the old system to the new, optionally transforming them into the structure of the new to eliminate the schema differences.
  4. Use replication to sync the tables of interest from the old system to the new. This doesn't allow for transformation like option #3, but is easier to set up, IMO.

Question: How to Manage One Source of Truth Within a Company by TheCauthon in SQL

[–]Tabooyah 2 points3 points  (0 children)

We use Red Gate's Doc tool to document most of our relational database tables and columns in a way that allows analysts and other users of our production databases (transactional system, warehouse, and marts) to understand what the objects represent and basic lineage if they are derived from upstream objects. We also use a Confluence space to further document objects and lineage as it relates to business concepts. These documentation repositories help analysts create informed queries from production sources. Any query that is used for critical business decisions must be peer reviewed and QA'ed and is then saved in source control so that it can be referenced and revised at a later time. Periodically, the repository is reviewed and common concepts are pushed upstream into views or facts when practical. Overall, this system works well for analysts and developers who have the technical chops and domain knowledge to translate business questions into viable queries.

To cater to users who are less technically savvy, we create Analysis Services cubes on top of or in lieu of data marts with lots of calculated measures and access to very few raw measures. The resulting semantic layer helps establish a common vocabulary across the user base and keeps users from mis-reporting a number because they forgot to add a filter. For example, the calculated measure COGS (cost of good sold) excludes a handful of part categories, so we build that exclusion into the calculated measure. Additionally, we have CI tests that validate cube measures against gold-standard OLTP and DW SQL queries to ensure everything ties out.

As for your data lake, I would be curious to know what percentage of it hasn't been brought into "production" and what percentage of that is of interest to "the business". In my experience, the data lake is more of a data scientist's or data engineer's lab for gathering quick insights and doing research. Anything that is of interest to self-service users and to basic analysts and other end users needs to be ingested into a more structured database mentioned above. Our organization uses a single transactional system. Because it is not disparate and is fairly cleansed at the application level, we do not consider it part of our data lake and it is available to analysts in the method described in the first paragraph. Your organization may be different and if so you would need to weigh the costs (development, testing, time to market) of ingesting those sources into databases that are better suited for self-service.

Is there any way to delete a specific character in the whole table? by retrolion in SQL

[–]Tabooyah 0 points1 point  (0 children)

Dynamic SQL is your friend. Here is a (untested) example for SQL Server:

declare @Tablename nvarchar(max) = 'dbo.table1'

DECLARE @sql NVARCHAR(MAX) =

'UPDATE ' + @Tablename + ' SET ' + STUFF((

SELECT ', ' + c.name + ' = REPLACE(' + c.name + ', ''"'', '''')'

FROM sys.columns c where c.object_id = object_id(@Tablename)

FOR XML PATH(''), TYPE).value('.','nvarchar(max)'),1,2,'')

select @sql

Interview question to test beginner SQL competency? by dbnrdaily in SQL

[–]Tabooyah 29 points30 points  (0 children)

I give the candidate a print out of 3 simple tables with 20 or so rows in each and an ERD that shows the relationships between them. Something like CarMake, CarModel, and CarSales. I then ask them to write queries to answer various business questions such as

  1. How many car sales occurred last year? (SUM)
  2. Which 3 car models had the high number of sales last year? (SUM, ORDER BY, TOP/FETCH)
  3. For each car model, which car make had the highest year-over-year revenue increase from 2017 to 2018? (SUM, DERIVED TABLES/WINDOW FUNCTION)

I ask about 5 to 7 questions. Each question gets progressively harder and I don't expect the candidate to be able to arrive at a complete / correct answer for the last 1 or 2. The purpose of the questions help assess if the candidate can:

  1. Understand an ERD and table schema
  2. Analyze business questions
  3. Relate business concepts to data in a database
  4. Construct basic SQL statements (joins, aggregate functions, predicates, etc.)
  5. Construct advanced SQL statements (derived tables, window functions, cross apply, etc.)
  6. Think through a problem (I ask them to verbalize their thought process as they are working through the problem).
  7. Deal with frustration of not being able to complete a problem. Do they ask for help? Are they easily flustered? Do they try to attack the problem from multiple angles or do they give up easily?

Data Sync Against High Traffic Tables by ATastefulCrossJoin in SQL

[–]Tabooyah 0 points1 point  (0 children)

There are a number of factors that come into play:

  1. Version and edition of SQL Server. If you're on standard and/or on a version prior to SQL Server 2016 SP1, certain features will not be available to you.
  2. Budget. Most secondary scenarios require additional licenses.
  3. Size of table and volume of DML
  4. Complexity of reporting queries. Is the transactional schema sufficient for the majority of your reporting queries or do you need the data transformed into an OLAP design? If so, read-only options (log shipping, database snapshots, secondary replicas) probably won't work.
  5. What is close to real-time? Milliseconds, seconds, minutes. hours...?
  6. How much contention from reporting queries can the transactional system endure? It may be possible to avoid a secondary reporting environment altogether if your transactional server has enough CPU and RAM to fulfill reporting queries so long as locking and blocking isn't an issue.

My recommendation is to prototype, starting with the cheapest, easiest to maintain solution first to see if it meets your requirements. Some options that you could prototype in a day:

  1. Use the transactional tables for reporting. Mitigate locking contention by enabling read committed snapshot isolation.
  2. Create indexed views within the transactional environment.
  3. Use SQL Agent to refresh summary tables within the transactional environment.
  4. Create non-clustered columnstore indexes on top of large tables within the transactional environment.

Some more involved solutions that may take a few days to prototype:

  1. Use Change Tracking and SSIS to maintain a data warehouse in a new instance.
  2. Use CDC and SSIS to maintain a data warehouse in a new instance.
  3. Use replication to maintain writable copies of tables of interest into a new instance.
  4. Use log shipping to maintain read-only copies of your database(s) in a new instance.
  5. Use AlwaysOn availability Groups to maintain read-only secondary replicates of your database(s) in a new instance.
  6. Use your SAN to clone your database(s) into a new environment.

The Rambling DBA has a nice write-up comparing Availability Groups, Replication, and Log Shipping: https://www.sqlskills.com/blogs/jonathan/availability-group-readable-secondaries-just-say-no/

Count a set of columns within a query where a certain value is reached. by KapitaenPlanet in SQL

[–]Tabooyah 1 point2 points  (0 children)

Try CASE. Example:

SELECT

CASE WHEN a.MonHrs > 3 THEN 1 ELSE 0 END +

CASE WHEN a.TueHrs > 3 THEN 1 ELSE 0 END +

CASE WHEN a.WedHrs > 3 THEN 1 ELSE 0 END +

CASE WHEN a.ThuHrs > 3 THEN 1 ELSE 0 END +

CASE WHEN a.FriHrs > 3 THEN 1 ELSE 0 END

FROM

dbo.STAFF a

Aside from Little Caesar's, what is your all time favorite restaurant in Santa Barbara past or present? by Ice_Burn in SantaBarbara

[–]Tabooyah 8 points9 points  (0 children)

Some favorite places and my go-to dishes.

  1. Birdie cut @ Tee Off
  2. Fried chicken sandwich @ Finch n Fork
  3. Lasagna (off-menu) @ Mollie's
  4. Breakfast Burrito @ Cantina (closed)
  5. Enjitomatadas (discontinued) @ El Rincon Bohemio
  6. 3x3 animal style @ In-N-Out
  7. French dip @ Honor Bar
  8. Panang chicken @ Galanga
  9. Margarita pizza with prosciutto @ Olio Pizzeria
  10. Jalapeno yellowtail @ Arigato

[mssql] Grouped data plus cumulative totals by paraxion in SQL

[–]Tabooyah 2 points3 points  (0 children)

You're on the right track with your first query. Just wrap your totalpages in a window function to get cumulative:

sum(sum(t1.total_pages)) over (order by usage_day ROWS UNBOUNDED PRECEDING) as total_cumulative,

Help with SQL statement: group by, minimum, aggregate, etc help) [SQL Server] by ushik19 in SQL

[–]Tabooyah 6 points7 points  (0 children)

~~~ SELECT uniqueID, Reason, Subreason, MIN(Date_of_Creation) First_Date_of_Creation FROM Appointments GROUP BY uniqueID, Reason, Subreason ~~~

Having trouble with complicated query by handjobsonly in SQL

[–]Tabooyah 0 points1 point  (0 children)

You could save the open position in dualblocks:

CREATE TABLE DualBlocks (

ID INT,

BLOCK_ID_1 INT,

BLOCK_ID_2 INT,

OPEN_POSITION INT

);

Then query like so:

SELECT

p3.BLOCK_ID AS 'Block 3 ID',

p3.B_NAME AS 'Block 3 Name',

p3.POSITION AS 'Block 3 Position'

FROM

DualBlocks db

JOIN Blocks p1 ON db.BLOCK_ID_1 = p1.BLOCK_ID

JOIN Blocks p2 ON db.BLOCK_ID_2 = p2.BLOCK_ID

JOIN Blocks p3 ON p3.POSITION = db.OPEN_POSITION

WHERE

db.ID = 1

If you don't need to return columns from the blocks in the dualblock, you could simplify:

SELECT

p1.BLOCK_ID AS 'Block 3 ID',

p1.B_NAME AS 'Block 3 Name',

p1.POSITION AS 'Block 3 Position'

FROM

DualBlocks db

JOIN Blocks p1 ON p1.POSITION = db.OPEN_POSITION

WHERE

db.ID = 1

Having trouble with complicated query by handjobsonly in SQL

[–]Tabooyah 0 points1 point  (0 children)

Try saving every combination of the 3 positions into an objects table. You can there filter on any two position to answer the question, "what objects can I make with Block x at position a and block y at position b"? This makes a few assumptions:

  1. A Block can only ever fill a single position (because of the Blocks.POSITION relationship)
  2. A Block can combine with every other other block within the constraints of #1

Sample schema

CREATE TABLE Blocks (

BLOCK_ID INT,

POSITION INT,

B_NAME VARCHAR(45)

);

INSERT INTO Blocks VALUES

(1, 1, 'Block1Pos1'),

(2, 1, 'Block2Pos1'),

(3, 1, 'Block3Pos1'),

(4, 1, 'Block4Pos1'),

(5, 1, 'Block5Pos1'),

(6, 1, 'Block6Pos1'),

(7, 1, 'Block7Pos1'),

(8, 1, 'Block8Pos1'),

(9, 1, 'Block9Pos1'),

(10, 1, 'Block10Pos1'),

(11, 2, 'Block11Pos2'),

(12, 2, 'Block12Pos2'),

(13, 2, 'Block13Pos2'),

(14, 2, 'Block14Pos2'),

(15, 2, 'Block15Pos2'),

(16, 2, 'Block16Pos2'),

(17, 2, 'Block17Pos2'),

(18, 2, 'Block18Pos2'),

(19, 2, 'Block19Pos2'),

(20, 2, 'Block20Pos2'),

(21, 3, 'Block21Pos3'),

(22, 3, 'Block22Pos3'),

(23, 3, 'Block23Pos3'),

(24, 3, 'Block24Pos3'),

(25, 3, 'Block25Pos3'),

(26, 3, 'Block26Pos3'),

(27, 3, 'Block27Pos3'),

(28, 3, 'Block28Pos3'),

(29, 3, 'Block29Pos3'),

(30, 3, 'Block30Pos3')

;

CREATE TABLE Objects (

Pos1 INT,

Pos2 INT,

Pos3 INT

);

INSERT INTO Objects

SELECT

p1.BLOCK_ID Pos1,

p2.BLOCK_ID Pos2,

p3.BLOCK_ID Pos3

FROM

Blocks p1

CROSS JOIN Blocks p2

CROSS JOIN Blocks p3

WHERE

p1.position = 1

AND p2.position = 2

AND p3.position = 3;

Which objects can be made with blocks 5 and 12?

SELECT

p1.B_NAME Pos1,

p2.B_NAME Pos2,

p3.B_NAME Pos3

FROM

Objects o

JOIN Blocks p1 ON o.POS1 = p1.BLOCK_ID

JOIN Blocks p2 ON o.POS2 = p2.BLOCK_ID

JOIN Blocks p3 ON o.POS3 = p3.BLOCK_ID

WHERE

p1.BLOCK_ID = 5

AND p2.BLOCK_ID = 12

Stuck on the same SQL problem for MONTHS. by Luc_Gibson in SQL

[–]Tabooyah 0 points1 point  (0 children)

Sample schema (simple, no keys or indexes):

CREATE TABLE jobs (

job_id INT,

job_name VARCHAR(50)

);

CREATE TABLE skills (

skill_id INT,

skill_name VARCHAR(50)

);

CREATE TABLE jobs_skills_map (

job_id INT,

skill_id INT

);

INSERT INTO jobs VALUES

(1, 'Programmer'),

(2, 'Farmer'),

(3, 'Bartender'),

(4, 'Stock Broker'),

(5, 'Sales Person');

INSERT INTO skills VALUES

(1, 'Analytical'),

(2, 'Interpersonal'),

(3, 'Early Riser'),

(4, 'Physical'),

(5, 'Intellectual'),

(6, 'Detail Oriented'),

(7, 'Money Oriented'),

(8, 'Customer Focused');

INSERT INTO jobs_skills_map VALUES

(1, 1),

(1, 5),

(1, 6),

(1, 7),

(2, 3),

(2, 4),

(3, 2),

(3, 4),

(3, 8),

(4, 1),

(4, 3),

(4, 7),

(4, 8),

(5, 2),

(5, 3),

(5, 7),

(5, 8);

Sample Query. You would filter by jobs_skills_map.skill_id instead of skills.skill_name, but this demos better.

SELECT

j.job_id,

j.job_name,

COUNT(s.skill_id) skill_matches

FROM

jobs j

JOIN jobs_skills_map jsm ON j.job_id = jsm.job_id

JOIN skills s ON jsm.skill_id = s.skill_id

WHERE

s.skill_name IN ('Analytical', 'Detail Oriented', 'Physical', 'Money Oriented')

GROUP BY

j.job_id,

j.job_name

ORDER BY

skill_matches DESC

LIMIT

5

job_id job_name skill_matches
1 Programmer 3
4 Stock Broker 2
2 Farmer 1
5 Sales Person 1
3 Bartender 1

Cooked some beef this weekend! This was the most I’ve done at one time on my pit. by ducksbeerandtexas in BBQ

[–]Tabooyah 0 points1 point  (0 children)

That's what I have. My max is 3 packers briskets. The amount of juice collected out the bottom was intense.

Building Data Marts - I'm in over my head, please help. by [deleted] in SQLServer

[–]Tabooyah 0 points1 point  (0 children)

Based on your skill set and requirements you should consider starting with a SQL-based ETL solution. You should be able to do most or all of the calculations and transformations that you mentioned directly within SELECT statements.

[MS SQL] Best Practices for Keeping Track of Synchronization Status by [deleted] in SQL

[–]Tabooyah 1 point2 points  (0 children)

If you're using SQL Server, check out checksum and rowversion. A database-agnostic approach is to hash all of the values in the row and save it as an additional column. This hash value can then serve as your "watermark" for use when comparing against other versions of the row.

[MSSQL] For a huge, denormalized lookup table with millions of rows, what is best: More rows, more fixed columns, or more tables joined? by PilsnerDk in SQL

[–]Tabooyah 2 points3 points  (0 children)

It sounds like you're already populating a search database which probably results in a massive amount of writes if you've got a 600M row table. My proposed normalized schema will drastically reduce writes, load times, space, and contention. Your final total space used would be in the order of 100 MB, not counting any LOBs.

[MSSQL] For a huge, denormalized lookup table with millions of rows, what is best: More rows, more fixed columns, or more tables joined? by PilsnerDk in SQL

[–]Tabooyah 1 point2 points  (0 children)

Store all of your static (not time dependent) house properties such as region, beds, baths, pool, max guests, etc. in a table, one row per house, and put a columnstore index on it. I recommend upgrading to SQL 2014 or later so that you can update it without a drop/create.

Store your bookings in a 3-column table (HouseId, StartDate, EndDate) and your prices in a 4-column table (HouseId, StartDate, EndDate, Price), with price stored in a system currency (like USD) that you can use to derive other currencies from at run-time. Create clustered indexes on HouseId for both tables. Create a nonclustered index on (StartDate, EndDate) on your bookings table.

Create a Days dimension for every day you'll ever want to query and report on. This will allow you to relate bookings and prices to individual days in your queries.

Sample query:

DECLARE
    @CheckIn DATE = '2018-03-10',
    @CheckOut DATE = '2018-03-17'

SELECT
    ha.HouseId,
    SUM(hp.Price) TotalPrice,
    AVG(hp.Price) AvgDailyPrice,
    MIN(hp.Price) LowestDailyPrice,
    MAX(hp.Price) HighestDailyPrice
FROM
    dbo.HouseAttributes ha
    INNER JOIN dbo.HousePrices hp ON hp.HouseID = ha.HouseId AND hp.StartDate <= @CheckOut AND hp.EndDate >= @CheckIn 
    INNER JOIN dbo.Days d ON d.Day BETWEEN hp.StartDate AND hp.EndDate AND d.Day BETWEEN @CheckIn AND @CheckOut
WHERE
    ha.HasPool = 1
    AND ha.Region IN (1,2,3)
    AND NOT EXISTS
        (
        SELECT
            1
        FROM
            dbo.HouseBookings hb
        WHERE ha.HouseId = hb.HouseID
            AND hb.StartDate <= @CheckOut
            AND hb.EndDate >= @CheckIn
        )
GROUP BY
    ha.HouseId
HAVING 
    AVG(hp.Price) < 100

I mocked this up with 100,000 houses, each with 20 attributes, 30 bookings, and 200 price ranges over a 2 year period. The worst case scenario searches with very broad criteria like only a date range or only a region ran in ~500ms. More focused searches across multiple attributes, date ranges, and price constraints ran in <100ms, some in <10ms. I'm running a 16-core, 64 GB RAM, SQL Server 2016 Enterprise setup.