I just finished the SQLPASS pre-conference from Brian Knight – Building a Microsoft Data Warehousing Platform, and I’m so pumped to do more BI! Reporting, Data Warehousing, Data Mining – all fun stuff.
In a gist, this is what we learned today (and it’s a lot of acronyms):
– SSIS, ETL
– SSAS
– Cubes, Data Warehouse
– Data Mining
– MDX and DMX (on a very very high level)
It was a lot of details and demos jammed in one day, but I won’t have it any other way, and overall the session was just great! Loved it!
Here are some bits and pieces of information / tips / useful sites / resources (probably mostly for my benefit, so I remember. But if you find it useful, then all the better):
- 65% of data warehousing projects fail, and one of the reasons for this is lack of communication. Business users want something, but devs deliver something else.
- Waterfall model for DW projects – hard to react. Agile is better – and Brian recommends 1 month iterations (ie complete deliverable cycle per month)
- Why data warehouse? Consolidate data, aggregates, reporting, analytics, archived data (ex if you need to keep data for years and years)
- Business Key = Alternate Key = PK from source
- Surrogate Keys – insulates data warehouse from source changes
- In one of Pragmatic Works’ projects, they lowered SSIS loading time from 4.5 hours to 3 mins – all about choosing “better” way. In SSIS most components are in memory and are synchronous.
- Most SSIS components are synchronous – good for performance
- Some are asynchronous and partially blocking, not so good – for example UNION ALL and MERGE – slight overhead
- Some are asynchronous and fully blocking – not good at all – for example sort or aggregate transforms – but sometimes there are reasons to use these
- In BIDS, devenv.exe -NOSPLASH to prevent splash screen
- In SSIS – Protection Levels for Packages – EncryptSensitiveWithUserKey generally a bad idea because it associates the package with your profile; if you transfer package, might say package is corrupt. Best to Encrypt All With Password; ServerStorage relies on msdb
- Different types of dimensions
- Type 0 – Changing Attribute – updates; no history kept
- Type 1 – Fixed Attribute – cannot change under any circumstance; ignores change
- Type 2 – helps track changes; biggest disadvantage is additional storage; it creates a new record per update
- Good practice – smaller packages; easier to change and maintain
- Slowly Changing Dimension Wizard (SCD) – data types need to be lined up, need to be very careful; easy to break; also when you need to customize later on and relaunch the wizard, previous customizations will be dropped. You should keep documentation on how to recreate your customizations
- Another problem with SCD – similar to a cursor, goes through row by row; alternative way is to create staging table first. Use T-SQL for what it’s good at, SSIS for what it’s good at. If you can do sorts in T-SQL, do the sorts in T-SQL
- Inferred members – data that not in source, but is in Fact; for example, if it takes 2 weeks to add a salesperson in HR, but salesperson already has sales from Day 1
- Better load – “Fast Load”; also configure “Maximum Insert Commit Size”
- Every member should has an unknown member
- ALWAYS alias your columns
- Alternatives to detecting updates – instead of specifying all columns in a multi OR expression, use CHECKSUM or HASHBYTES. Problem with CHECKSUM is it is not guaranteed to be unique. Ex.
HASHBYTES('sha', ISNULL(col1, '') + ISNULL(col2, ''))
So in your expression column can just have something similar to: (DT_WSTR, 64)yourhashcol_src != (DT_WSTR,64)yourhashcol_dw
- SQL Cache Transform – creates cache file, can be reused multiple times; 80% faster in SQL Server 2008; SQL Server 2005 was bad with threading
- Terminologies – data warehouse (SSAS 2008)
- Measure Group = Fact Table
- Attributes = Columns in a Dimension
- Member = Row in an Attribute
- Hierarchies are shortcuts for end users; you can ultimately hide actual attributes (AttributeHierarchyVisible = True), and just expose the hierarchies
- Always reconnect to cube after dimension or attribute changes so changes are apparent
- Set up attribute relationships ex:
Date SK -- Date -- Month -- Quarter -- Year
Note that uniqueness of quarter is really Year + Quarter.
- Set up constraints/uniqueness through the “Key Column”. Set up what’s being displayed to user using “Name Column”. You will probably need to explore your dimensions and see which attributes need this.
- For any time dimensions, add “Time Intelligence”. As a side effect, this automatically adds additional filters like “Yesterday”, “This Week”, “Last Week” etc in Excel
- “Mushroom Cloud” – Invalid queries, but you are getting invalid results. For example if you drag Product then EmployeeCount. This gives you the EmployeeCount value for Product.
- In Cube Editor, MeasureGroup, set IgnoreUnrelatedDimension = False helps prevent “mushroom clouds”; result will display blank instead of giving you invalid value
- Discretization Method – allows you to create dynamic buckets of data
- Automatic
- EqualAddress – your just specify number of groups
- Clusters – groups based on your data
- HideMemberIf (trying to remember why this option was highlighted in the session)
- MDX is language for querying the cube/SSAS; used for creating KPIs, calculations
- Tuple in MDX points to a cell; you can work with six (6) different axes. Good book is MS Press MDX Step by Step
- When you want to show data, put in SELECT; if you don’t but you want to filter by it, put in WHERE
- Parenthesis in an MDX helps you represent a tuple
- SSRS and Excel can generate some MDX
- Calculations < Calculated Member does not store anything in the cube, just the metadata; actual calculations done on the fly
- PRO: Only formula is stored in the cube; don’t need extra storage
- CON: Only formula is stored in the cube; performance will be worse
- Usage Based Optimization : Go to SSMS < Connect to SSAS < Set CreateQueryLogTable = true; you can also choose to “Review Aggregation Usage”
- How to lock data in your cube : Go to your dimension then go to Role, and add roles/users
- Option: Enable Visual Totals – like an implicit WHERE; Off by default, which means users will still see all data even if data is locked down; for example, if user can only see US data, his/her totals should be just for US, but s/he will see US Totals, then a different Grand Total. Enable this to lock down these totals to only the data they can see
- Deployment wizard in SSAS – creates .asdatabase file that contains metadata required to deploy the cube
- You dont need to always process the whole cube; you can do incremental processing
- Partitioning in SSAS – in about 50K records, you will start to feel the slowness
- SQL Server Standard allows 3 partitions; SQL Server Enterprise allows any number
- New feature in SQL Server 2008 Enterprise – Proactive Caching
- In SQL Server 2008, you can use profiler on SSAS to deconstruct MDX queries. In Excel, you can also right click on some columns then check MDX
- There is an option to do offline OLAP : Excel < Options < OLAP Tools < Offline OLAP. You can get to your data, but actions and drillthrough are off
- Double hop issue in SSRS and SSAS when you build reports on top of the cube : check Kerberos, SPN
- You don’t need a cube to do data mining! You can also allow your users to do predictions based on flat tables or views.
- Sample data mining scenario: used to auto approve insurance claims; can be used for fraud detection; can be integrated with SSIS in conditional tasks
- For data mining structures, Decision Trees is the best way to start; as far as alogrithms go, you can even create your own, or purchase 3rd party algorithms
- Users can get/do data mining through Excel
- Users can do drillthroughs with data mining
Useful Resources / Sites
- pragmaticworks.com
- sqlshare.com
- ssas-info.com
- dtsxchange.com
- sqlserverdatamining.com
- Marco Russo’s Expert Cube Development with Microsoft SQL Server 2008 Analysis Services
- Brian Knight’s 24-Hour Trainer: Microsoft SQL Server 2008 Integration Services (Wrox Programmer to Programmer)
- MS Press Microsoft SQL Server 2008 MDX Step by Step (Step By Step (Microsoft))