Channel: SAP – Cloud Data Architect

SAP seeks to simplify IT with a beefier new version of Hana


SAP has updated its flagship Hana in-memory computing platform with a raft of new features designed to make IT simpler while giving organizations a better handle on their data.

The updates, announced Tuesday at the company’s annual Sapphire Now conference in Florida, include a new hybrid data management service in the cloud and a new version of the company’s Hana Edge edition for SMBs.

“We’ve taken an already rock solid platform and further hardened security, enhanced availability, unified the development and administration experience, and expanded advanced analytic capabilities,” Michael Eacrett, vice president of product management for SAP, wrote in a blog post detailing the new release.

Launched more than five years ago, Hana forms the basis for S/4Hana, the enterprise suite SAP released in early 2015.

Among Hana’s key new features is a graph data processing capability that allows organizations to visualize data connections for a better understanding of the complex relationships among people, places, and things. It can help companies detect fraud, for instance, or uncover new business opportunities, SAP said.

A new “capture and replay” feature, meanwhile, helps IT departments capture live workloads and replay them on a target system. The result is that the IT staff can evaluate new features, assess upgrade options, and measure impact before making changes to the live production system.

The new Hana version also introduces an expanded maintenance lifecycle program that lets companies choose between consistent maintenance of their Hana environment for up to three years or adopting the latest Hana innovations twice a year.

Hybrid data management services are now available in the cloud through an invitation-only beta program for strategic customers. By requiring less hardware infrastructure and offering rapid integration across cloud and on-premise deployments, the hybrid model delivers a lower total cost of ownership, SAP said.

Finally, SAP also released an advanced version of its Hana Edge edition for small to mid-sized businesses. The latest edition includes SAP Predictive Analytics software and supports a 32 GB database along with 128GB of dynamic tiering.

One benefit for users is that admins can “cost-effectively manage ‘hot’ data between pricey memory and lower-cost storage,” noted Charles King, principal analyst with Pund-IT.

Time will tell whether the new features help SAP win new Hana customers. Last October SAP announced that more than 1,300 companies had already signed on with S/4Hana, but a survey released shortly thereafter found lingering hesitation over issues including licensing.


Dell Services Launches Innovation Lab for SAP HANA


Dell Services is introducing its own innovation lab for SAP HANA, designed to help organizations evaluate, understand, and benefit from their investments in the SAP HANA platform.

"We're really trying to build standard industry based solutions that sit on top of S4," said Simon Spence, global director of Dell's SAP practice. "We've really built a set of assets around value and identification for HANA and we can help customers work through that."

With Dell Services' SAP HANA lab, customers can load up to 30 terabytes (TB) of data to fully configure solutions for prototyping or proofs of concept, to help them better understand and articulate the business value.

The new lab is built on the SAP HANA advanced in-memory platform and offers a personalized user experience covering mission-critical business processes, such as sales, finance, manufacturing and procurement.

Special features and services include infrastructure provisioning, co-innovation lab, and consulting services that help customers identify potential value, create a business case, provide specific industry solutions with smart factory and IoT, and develop zero-impact migration plans.

“This is the next great transformational moment in the SAP space,” Spence said. “It’s really going to drive and enable the move to cloud and having infrastructure provided as a service.”

Existing SAP customers who’ve had SAP for many years but have not kept up with upgrades will benefit the most, according to Spence.

“It’s exciting times in the SAP ecosystem because we all now have a different digital core to sell,” Spence said. “We had incremental non-core capabilities that have added on to SAP ecosystem to make it bigger. Now I’m really seeing the capability to push forward with S4.”

For more information about this innovation lab for SAP HANA, visit www.dell.com.

 

C4 Instance Family certified for SAP Applications


Feed: AWS for SAP.
Author: Vanessa Alvarez.

The new compute-optimized EC2 instances of the C4 family are now certified for general SAP usage. The availability of the entire C4 instance family is documented in the SAP platform note 1656099 (requires SAP customer authentication).
This new family of instances allows SAP customers to improve performance and price/performance for compute-intensive SAP workloads compared to the older C3 instance family.
A quick look at the SAP AWS platform note 1656099 also shows that the flagship instance type c4.8xlarge is the AWS EC2 instance type with the highest certified total SAP throughput measured in SAPS. SAPS is an SAP throughput metric used to size SAP systems.
What's so exciting about the C4 family?
C4 is the successor to the SAP-certified C3 instance family. The new C4 instances are based on the Intel Xeon E5-2666 v3 (code name Haswell) processor. C4 instances leverage a custom version of this processor, designed specifically for EC2, which runs at a base speed of 2.9 GHz and can achieve clock speeds as high as 3.5 GHz with Turbo Boost. These instances are designed to deliver the highest level of processor performance on EC2.
The C4 instances do well with SAP applications since they allow you to achieve significantly higher packet per second (PPS) performance, lower network jitter, and lower network latency using Enhanced Networking.
I use SAP. What's in it for me?
The C4 instance family provides immediate value and savings for everyone who already uses C3 instances for compute-intensive SAP workloads.
The table below is a compilation of the facts in the SAP platform note, with price/performance information for the Northern Virginia region added as a reference:
Instance Name  RAM      C3 vCPU Count  C4 vCPU Count  C3 SAPS  C4 SAPS  Throughput C4 over C3  Cost per SAPS, C4 over C3*
c4.large       3.75GiB  2              2              1989     2379     +20%                   -16.4%
c4.xlarge      7.5GiB   4              4              3978     4758     +20%                   -12.0%
c4.2xlarge     15GiB    8              8              7957     9515     +20%                   -10.8%
c4.4xlarge     30GiB    16             16             15915    19030    +20%                   -11.4%
c4.8xlarge     60GiB    32             36             31830    37950    +19%                   -11.4%
* Price/performance: cost per hour per SAPS for Windows instances in the Northern Virginia region, as of April 2015
The table above shows that the SAP throughput of the C4 family is 19-20% above that of the predecessor generation. For an SAP user this means:
• Lower risk of resource exhaustion, thanks to 19-20% more CPU headroom
• Potentially ~20% fewer application servers needed in client-server configurations
• Better scalability inside the box. Features like Enhanced Networking lower the effort needed to achieve the maximum throughput.
The table also shows that the price/performance improves by between 11% and 16%, depending on the instance type.
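To make the last column concrete, here is a minimal sketch of how such a cost-per-SAPS comparison can be computed. The hourly prices below are placeholders rather than actual AWS list prices; only the SAPS figures come from the table above.

# Illustrative sketch: deriving the "cost per SAPS" comparison.
# The hourly prices below are placeholders, not actual AWS list prices.
C3_SAPS, C4_SAPS = 7957, 9515        # c3.2xlarge vs c4.2xlarge, from the table above

c3_price_per_hour = 1.00             # placeholder USD/hour
c4_price_per_hour = 1.07             # placeholder USD/hour

c3_cost_per_saps = c3_price_per_hour / C3_SAPS
c4_cost_per_saps = c4_price_per_hour / C4_SAPS

change = (c4_cost_per_saps - c3_cost_per_saps) / c3_cost_per_saps
print(f"Cost per SAPS, C4 over C3: {change:+.1%}")   # negative means C4 is cheaper per SAPS
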
How do I use the C4 family in my SAP fleet?
C4 instances are the perfect choice for CPU-hungry SAP applications that don't need much memory.
They're also a great choice for CPU-hungry single-user test and development systems.
They're a great starting point for standalone systems which may not need a lot of memory. Note that if more memory is needed, you can easily switch to an instance in the R3 family with a few API calls.
SAP HANA One, available on the AWS Marketplace, is currently supported on the c4.8xlarge instance type. For HANA deployments needing more than 60GB of RAM, please continue to use the r3.8xlarge instance type.
How do I use C4 instances to optimize my SAP Instance Fleet?
First and foremost: Use HVM virtualization for all your SAP instances. This is mandatory for Windows instances. It’s optional (and strongly recommended) for Linux instances.
The HVM virtualization will allow you to switch your instance type through a simple reboot. This will work as well for future instance types!
SAP administrators will want to fine-tune their EC2 instances whenever they have a maintenance window. This lets them stop the instance, migrate it to a new instance type, and restart it, as the sketch below illustrates. The time needed for this exercise varies depending on the individual configuration, but is typically within the range it takes to have a cup of coffee.
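For illustration, here is a minimal sketch of such a resize using the AWS SDK for Python (boto3); the instance ID, region and target instance type are placeholders, not values from this post.

# Illustrative sketch: resize an EC2 instance during a maintenance window.
# The instance ID, region and target type are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"

# Stop the instance and wait until it is fully stopped.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Change the instance type, e.g. from c3.2xlarge to c4.2xlarge.
ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "c4.2xlarge"},
)

# Start the instance again and wait until it is running.
ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
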
There are two reasons to migrate an instance:
• Preemptive maintenance: using a new instance type will lower the risk of forced shutdowns which may be required for security updates.
• Newer instance types are likely to have a longer total product lifetime combined with better price/performance.
This leads to a recommended migration path:
Rightsizing through stop/starts with instance migration: the C4 instances are financially more attractive than the R3 instances since they offer the same number of CPUs with less memory. This makes the C4 instances very attractive for use as SAP application servers.
SAP system performance requirements may change over time. The AWS platform allows SAP users to pick a larger or smaller system in the family whenever needed. You no longer have to invest upfront in hardware you may or may not need in the next five years; adjust your compute needs, and therefore your costs, by the hour.
Scale up and down in the C4 instance family as you need!
The R3 instance family offers a similar number of vCPUs with more memory compared to the C4 family. The R3 instance family allows you to operate SAP systems with up to 244GiB of RAM.
This gives SAP users a fine-grained migration path:
• Scale up and down in a family if more CPU is needed
• Migrate from the C4 family to the R3 family by reboot if you need more or less memory
The diagram below shows the choices AWS customers have to find the best fit for the needs of the day:

Summary
The entire C4 family is certified for SAP use. SAP customers will want to use this instance family instead of the previous-generation C3 family, since the C4 family offers better performance and better overall price/performance.
Plan your migration to C4 instances today during your next maintenance window.

How SAP Business One Solution Providers Benefit from Building an AWS Practice


Feed: AWS for SAP.
Author: Vanessa Alvarez.

SAP has been a member of the AWS Partner Network (APN) since 2011, and since then we have worked together to certify multiple SAP solutions on AWS, including a number of configurations of Business One and, most recently, SAP Business One on HANA, to run on the AWS Cloud. Small and medium enterprises can now access all of the benefits of SAP Business One while leveraging the increased agility, cost savings, security and reliability that the AWS Cloud provides.
When SAP customers of all types move to AWS they almost always enjoy a lower TCO, increased agility and flexibility, and the advantages of moving to an OPEX model, which makes it a very attractive option. AWS has a very robust APN partner ecosystem with thousands of Consulting and Technology partners to support our customers globally. APN Partners are seeing that they can better serve their customers when they build a cloud practice around AWS and offer cloud solutions to their customers.
Provide Customers Solutions the Way They Want to Consume Them
If you look at the B2B software market as a whole, customers are interested in hearing about ways they can trade what is usually a large upfront CapEx expense for a lower, predictable monthly recurring payment for software solutions that includes infrastructure consumption. The explosion of SaaS (software-as-a-service) in the market is evidence of this; the model is becoming the new standard for the way customers want to buy. When SAP Business One VARs (value-added resellers) work with AWS to build a cloud practice, they have the ability to offer their customers that exact same experience. It gives them the ability to provide their customers the SAP Business One solution they want, on the terms they expect in this new cloud-based economy.
New Stable and Predictable Revenue Stream
Many software VARs operate off of a larger, transactional, CapEx-heavy license sales model that is measured on a quarterly basis. The revenue that they collect comes in large chunks in the quarter in which they complete the sales process and obtain a payment from a customer (typically a PO). They then go through the process of implementing and customizing the software over the coming months. While this does provide an influx of cash, this model is difficult to predict with a degree of certainty and typically puts companies into annual revenue peaks and valleys. These peaks and valleys have an effect on gross profit, which affects marketing and other selling expenses, and can ultimately result in a potential decrease in your net income. VARs who build a fully managed SAP Business One offering on AWS introduce a new, predictable recurring revenue stream to their business that over time begins to smooth out these peaks and valleys. As they grow this business and earn back the initial customer acquisition costs, they gain an increased level of predictability and more stable cash flows. The services they offer and the business model they operate in today will ultimately dictate how they will shift their expenses to accommodate this new revenue stream.

Reduce Time Consuming Steps in Your Customer Acquisition Cycle
Most of the time, SAP Business One VARs have to add multiple steps in their customer acquisition cycle around infrastructure in order to deliver a solution. They will generally have to provide a customer with the needed infrastructure requirements and allow the customer to procure new, or sometimes re-provision existing infrastructure on their own. This often requires a customer to start a second procurement track to obtain the infrastructure they need to deploy the solution they want to run. This can add weeks or months to the time it takes to deliver a solution to a customer. Regardless of the model, with AWS, you have instant access to the infrastructure you need to deploy SAP Business One globally. This allows SAP Business One VARs to provide customers more transparency around the total cost of the solution, the time it takes to deploy, and ultimately reduces the time it takes for the VAR to provide the SAP Business One solution to the customer. This allows the customer to gain access to the SAP Business One system faster and eliminate additional buying cycles.
Increase Your Agility and Provide your Customers Solutions Faster
AWS has a number of different tools that we provide our customers and APN Partners to automate the provisioning of the underlying resources needed to support different applications. This technology allows customers and APN Partners to reduce the deployment process on AWS to minutes and a few mouse clicks. AWS CloudFormation allows you to quickly and easily build and deploy a template that includes a collection of AWS resources (called a stack) to support SAP Business One, or any application that can run on AWS. You can take these templates and deploy them in any of the 30 Availability Zones offered globally by AWS in a matter of minutes, as sketched below. This allows you to automate the infrastructure provisioning, reduce or even eliminate potential mistakes, and dramatically reduce the deployment time of SAP Business One for a customer.
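
As a rough illustration of what launching such a stack looks like programmatically, here is a minimal boto3 sketch; the stack name, template URL and parameters are placeholders rather than the actual SAP Business One Quick Start values.

# Illustrative sketch: launch a CloudFormation stack from a template.
# Stack name, template URL and parameters are placeholders.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="sap-b1-demo",
    TemplateURL="https://s3.amazonaws.com/my-bucket/sap-b1-template.json",  # placeholder
    Parameters=[
        {"ParameterKey": "InstanceType", "ParameterValue": "r3.8xlarge"},
    ],
    Capabilities=["CAPABILITY_IAM"],
)

# Block until the stack and everything it provisions is ready.
cfn.get_waiter("stack_create_complete").wait(StackName="sap-b1-demo")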

Want to learn more about building an AWS Cloud practice around SAP Business One? We are here to help, contact us here.
Want to learn more about the AWS Partner Network (APN)? Click here to learn about the APN Consulting Partner program and all the benefits it provides.
Want to try Business One on HANA yourself? Click here to download our SAP Business One on HANA Quick Start. This will allow you to automatically provision an SAP Business One on HANA environment in under 90 minutes!

Amazon Web Services Can Now Support SAP Business Warehouse (BW) Workloads on SAP HANA up to 4TB


Feed: AWS for SAP.
Author: Vanessa Alvarez.

AWS has worked with SAP to certify additional scale-out architectures for SAP Business Warehouse on HANA and can now support workloads requiring up to 4TB of memory. This sets the record for SAP HANA scale-out nodes in the cloud and validates that AWS is an ideal solution for enterprise customers to cost-effectively run SAP HANA. Customers now have the ability to experience all the benefits of HANA analytics while utilizing the agility, cost savings, security and reliability the AWS Cloud provides. AWS offers a comprehensive, end-to-end portfolio of cloud computing services to help manage analytics by reducing costs, scaling to meet demand, and increasing the speed of innovation.

Dynamically Scale Out to Meet Demand and Increasing Data Volumes
Customers can now scale out their AWS infrastructure to support up to 17 HANA nodes* using the AWS r3.8xlarge instance type with 244GiB of RAM. This gives customers the ability to support up to 4TB* of data in memory for Business Warehouse and/or OLAP-based HANA workloads. Additionally, AWS has developed the SAP HANA on AWS Quick Start Reference Deployment, a step-by-step guide for deploying SAP HANA on AWS in an SAP-supported manner. This guide includes field-tested best practices and links to AWS CloudFormation templates, which automate the provisioning of the AWS infrastructure components as well as the SAP HANA deployment. By utilizing these templates, customers can have a production-ready SAP HANA environment on AWS up and running in as little as 30 minutes.

SAP HANA Secure Back-up and Recovery Options with Amazon S3
SAP HANA on AWS leverages Amazon Elastic Block Store (EBS) as the primary storage for the HANA database. These SSD-backed EBS volumes are automatically replicated within their Availability Zone to protect from component failure, offering high availability and durability. Customers also have the ability to back up their HANA database to Amazon Simple Storage Service (S3), which is designed to provide 99.999999999% durability for as little as $0.03 per GB per month. When customers store any object in Amazon S3, the data is automatically replicated across multiple facilities and on multiple devices within each facility in the selected AWS region. Furthermore, Amazon S3 also supports SSL encryption in transit and 256-bit AES encryption at rest, which can be automatically configured.
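
As an illustration, a HANA backup file could be copied to S3 with server-side AES-256 encryption using a boto3 sketch like the one below; the bucket name and file paths are placeholders.

# Illustrative sketch: copy a HANA backup file to S3 with server-side
# AES-256 encryption. Bucket name and paths are placeholders.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="/backup/data/COMPLETE_DATA_BACKUP_databackup_0_1",   # placeholder local path
    Bucket="my-hana-backups",                                      # placeholder bucket
    Key="PRD/2016-09-19/COMPLETE_DATA_BACKUP_databackup_0_1",
    ExtraArgs={"ServerSideEncryption": "AES256"},
)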

Replicate SAP HANA Systems in Multiple Availability Zones
When building out any system on AWS, it is a best practice to start by configuring your Amazon Virtual Private Cloud (VPC). Amazon VPC allows you to logically isolate your AWS resources within a virtual network you define. You have the ability to completely control the rules within this virtual network, including the IP address range, subnets, route tables and gateways. An Amazon VPC can span multiple Availability Zones within a given region. Once the VPC is created, customers can provision their HANA instances within that VPC in multiple Availability Zones within the same region. This allows customers to reduce single points of failure by replicating their HANA database clusters across multiple Availability Zones using a single VPC.
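
For illustration, the sketch below creates a VPC and one subnet in each of two Availability Zones with boto3; the CIDR ranges, region and zone names are placeholders.

# Illustrative sketch: a VPC spanning two Availability Zones, with one
# subnet per zone. CIDR ranges, region and zone names are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

subnet_a = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.1.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]["SubnetId"]

subnet_b = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.2.0/24", AvailabilityZone="us-east-1b"
)["Subnet"]["SubnetId"]

# HANA primary and secondary instances can then be launched into
# subnet_a and subnet_b respectively, e.g. with ec2.run_instances(...).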

AWS Enterprise Accelerator
Finally, AWS provides its customers with a number of self-service guides on how to deploy SAP HANA at scale on AWS. AWS Professional Services has developed an Enterprise Accelerator program to help customers jumpstart their migration of SAP workloads onto the AWS platform. The program includes AWS workshops for SAP, SAP implementation/migration plans, reference architecture development, and more. Click here to learn more.
*HANA scale-out clusters larger than 5 nodes are currently in controlled availability. SAP will need to verify your BW sizing report result before you implement a HANA scale-out cluster larger than 5 nodes on AWS. Please contact SAP at HWC@sap.com and sap-on-aws@amazon.com before you implement HANA scale-out clusters of this size.

Refer to SAP OSS Note 1964437

Nimble Storage Certifies Predictive All Flash Arrays for Use with SAP HANA


Feed: Database Trends and Applications : All Articles.

Nimble Storage’s Predictive AF-Series All Flash arrays are now certified by SAP as an enterprise storage solution for the SAP HANA platform, enabling Nimble to offer SAP HANA tailored data center integration using its certified solutions.

As a result, Nimble customers can leverage their existing hardware and infrastructure components for their SAP HANA-based environments, providing an additional choice for organizations working in heterogeneous environments.

This certification adds to the SAP HANA certification Nimble previously obtained for its Adaptive Flash CS-Series arrays for use as enterprise storage solutions for the SAP HANA platform.

“This complements our existing certification with the CS-Series so now we have all our products in our existing portfolio certified,” said Ray Austin, solutions marketing manager at Nimble.

According to the company, the Nimble Predictive Flash platform leveraged with SAP HANA allows enterprise IT organizations to quickly deploy and enhance the performance of workloads running on SAP HANA. The Nimble All Flash and Adaptive Flash arrays enable companies to maintain capacity, data protection, and availability in alignment with changing business requirements.

Combined with SAP HANA, SAP solution deployments are accelerated, business response times are improved, and instant SAP system copies can be created with zero-copy cloning, according to Nimble.

The solution will help users integrate with SAP more easily by allowing data to move across storage tiers more flexibly.

“We have the idea of unified flash fabric and, with the inclusion of the AF series as a broader family, customers can now seamlessly migrate their data workloads across hybrid arrays as well as all flash arrays and that fits very nicely in what SAP calls ‘dynamic tiering,’ ” Austin said. “It’s kind of a deployment methodology around where to store your hot data, your cold data, and your warm data.”

Storage admins and basis admins that straddle the line between infrastructure and applications will benefit the most from this certification, according to Austin.

For more information about this news, visit www.nimblestorage.com.


25 Predictions About The Future Of Big Data


Feed: Featured Blog Posts – Data Science Central.
Author: Vincent Granville.

Guest blog post by Robert J. Abate.

In the past, I have published on the value of information, big data, advanced analytics and the Abate Information Triangle and have recently been asked to give my humble opinion on the future of Big Data.

I have been fortunate to have been on three panels recently at industry conferences which discussed this very question with such industry thought leaders as: Bill Franks (CTO, Teradata), Louis DiModugno (CDAO, AXA US), Zhongcai Zhang (CAO, NY Community Bank), Dewey Murdick (CAO, Department Of Homeland Security), Dr. Pamela Bonifay Peele (CAO, UPMC Insurance Services), Dr. Len Usvyat (VP Integrated Care Analytics, FMCNA), Jeffrey Bohn (Chief Science Officer, State Street), Kenneth Viciana (Business Analytics Leader, Equifax) and others.

Each brought their unique perspective to the challenges of Big Data and their insights into their "premonitions" as to the future of the field. I would like to summarize their thoughts, adding some color to the discussion.

Recent Article By Bernard Marr

If you haven’t had the opportunity, I believe that a recent article published by Bernard Marr entitled: 17 Predictions About Big Data was a great start (original version posted here). Many of the industry thought leaders that I mentioned above had hit on these points.

What Was Missing…

I agree with all of Bernard's list, but I believe he missed some predictions that the industry has called out. I would like to add the following:

18. Data Governance and Stewardship around Master Data and Reference Data is rapidly becoming the key area where focus is required as data volumes and in turn insights grow.

19. Data Visualization is the key to understanding the overwhelming V's of Big Data (IBM data scientists break big data into four dimensions: volume, variety, velocity and veracity) and, in turn, the advanced analytics built on it; this is an area where much progress is being made with new toolsets.

20. Data Fabrics will become the key delivery mechanism to the enterprise by providing a “single source of the truth” with regard to the right data source. Today the enterprise is full of “spreadmarts” where people get their “trusted information” and this will have to change.

21. More than one human sensory input source (multiple screens, 3D, sound, etc.) is required to truly capture the information that is being conveyed by big data today. The human mind has so many ways to compare information sources that it requires more feeds today in order to find correlations and find clusters of knowledge.

22. Empowerment of business partners is the key to getting information into the hands of decision makers, and self-service, cleansed and governed data sources and visualization toolsets (such as those provided by Tableau, QlikView, etc.) will become the norm of delivery. We have to provide a "single source of the truth" and eliminate the pervasive sharing of information from untrusted sources.

23. Considering Moore's Law (our computing power is increasing rapidly) and the fact that the technologies to look through vast quantities of data are improving with each passing year, our analytical capabilities, and in turn our insights, are starting to grow exponentially and will soon change organizations to become more data driven and less "business instinct" driven.

24. Data is going to become the next global currency (late addition) and is already being globally monetized by corporations.

25. Data toolsets will become more widely used by corporations to discover, profile and govern data assets within the confines of a data fabric or marketplace. Toolsets will include the management of metadata and the automatic classification of assets and liabilities (e.g., Global IDs).

The Four V’s Of Big Data

IBM uses an infographic, "The Four V's of Big Data", that discusses the myriad challenges of Big Data – it is mostly self-explanatory and hits many of the points that were mentioned in Bernard's article.

What this infographic exemplifies is that there is a barrage of data coming at businesses today, and this has changed the information landscape for good. No longer are enterprises (or even small businesses, for that matter) living with mostly internal data; the shift has happened, and data is now primarily coming from external sources, at a pace that would make any organization's head spin.

Today’s Best Practice “Data Insights Process”

Today, external data sources (SFDC, POS, market share, consumer demographics, psychographics, census data, CDC, Bureau of Labor, etc.) provide well over half of the information coming into the enterprise, and the norm is to create value in weeks. How is this done, you may ask? Let's call it the Data Insights process. The best practice today has turned the development of business intelligence solutions upside down; the process is:

  • Identify a number of disparate data sources of interest to start the investigation
  • Connect them together (data integration using common keys)
  • Cleanse the data (as Data Governance has not been applied) creating your own master and reference data
  • Learn about what the data is saying and visualize it (what insight or trend has been uncovered?)
  • Create a model that gives you answers
  • Formalize data source (cleanse and publish) to the myriad of enterprise data consumers with governance (if applicable)
  • Use the answers to change your business
  • Repeat (adding new sources, creating new models, etc.)

This process utilizes data experts to find data sources of value (1 to 2 weeks); quickly connects them together and scans them to determine suitability, eliminating information which is incomplete or lacking value or connection to other sources (integrating and cleansing takes about 2 weeks); visualizes what value these sources provide using data visualization toolsets, to find interesting value statements or features of the data to pursue, like store clustering and customer segmentation (1 to 2 weeks); develops a model or advanced analytic, with a data scientist, to see what your value statement found (2 weeks); and then presents to the business to determine next steps. This whole process happens in about 6-8 weeks and usually creates the "interest" in the business to invest in developing a data warehouse or BI solution.

Yes, the new process is completely reusable – as what is learned can be turned into a data source (governed data store or warehouse which is part of a data fabric) for future usage in BI and in turn for self-service; but what is important is that we now go from data to insights in weeks rather than months, and it forms the foundation for our business requirements – yes, I said that.

The long-term investment of a BI solution (often six months or more) is proven rapidly, and then the formal process of capturing the business requirements and rules (transformations in ETL language can be taken from rapid prototyping tools like Alteryx) has a head start, typically with the added advantage of cutting the BI process down to 3-4 months.

Recent Advances In Data Engineering

We can thank recent technological advancements for the changes in delivery of information with the advent of a number of toolsets providing self-service to tech-savvy business partners.

The recent tech and analytics advances in the past decade include but are not limited to:

  • Massively parallel processing data platforms
  • Advanced in-database analytical functions
  • Analytics on un-structured data sources (Hadoop, MapReduce)
  • Data visualizations across multiple mediums and devices
  • Linking structured and unstructured data using semantics and linking
  • Self-service BI toolsets and models of delivery
  • Data discovery, profiling, matching, ELT and data enrichment capabilities
  • Self-provisioning of analytics sandboxes enabling collaboration

But there is still a need for managing the information and this process is not going away. I will elaborate further in the paragraph below.

The Need For Enterprise Information Management

The myriad of data sources is changing the way we as business intelligence and analytics experts behave and likewise it has created a demand for data management and governance (with Master data and in turn Reference data) – so this element was added to the predictions. It’s a very important piece of the puzzle and should not be overlooked or downplayed. It was even added to my latest information triangle (see my Linked-In page).

The role of enterprise data management in IT has been evolving from "A Single Source of Truth" into becoming "The Information Assurance Flexible Delivery Mechanism". Back in March of 2008, I published at the DAMA International Symposium the need for a flexible information delivery environment, including:

  • Metadata management for compliance enforcement, audit support, analysis, and reporting
  • Master data integration and control
  • Near-real time business information
  • Source data management for controlling data quality at the transaction level
  • Effective governance for a successful managed data environment
  • Integration of analytics, reporting, and transaction control
  • Control of business processes and information usage

A flexible structure is just as important today, as business needs are changing at an accelerating pace; it allows IT to be responsive in meeting new business requirements – hence the need for an information architecture for the ingestion, storage, and consumption of data sources.

The Need For Knowing Where Your Data Is Coming From (And Going To)

One of the challenges facing enterprises today is that they have an ERP (like SAP, Oracle, etc.), internal data sources and external data sources, and what ends up happening is that "spread-marts" (Excel spreadsheets acting as data marts) start proliferating data. Different resources download data from differing (and sometimes the same) sources, creating dissimilar answers to the same question. This proliferation of data within the enterprise consumes precious storage that is already overflowing, causing duplication and wasted resources without standardized or common business rules.

Not to mention that these end up being passed around as inputs to others' work – without knowledge of the data lineage. This is where many organizations are today – many disparate data sets with little to no knowledge of whether they are "trusted" data sources.

Enterprise Data Fabric (or Data Marketplace)

An enterprise data fabric or marketplace (I've used both terms) is one location that everyone in the enterprise can go to for their data – providing quality, semantic consistency and security. This can be accomplished with data lakes, data virtualization or a number of integration technologies (like APIs, services, etc.). The point is to give the enterprise a common point of access to data that has been cleansed and is ready for use with master data. Here are a couple of reasons why you should consider this approach:

  • Business mandate to obtain more value out of the data (get answers)
  • Need to adapt and become agile to information and industry-wide changes
  • Variety of sources, amount and granularity of data that customers want to integrate is growing exponentially
  • Need to shrink the latency between the business event and the data availability for analysis and decision-making

Summation – Data Is The New Global Currency

In summation, consider that information is increasingly produced outside the enterprise, combined with information across a set of partners, and consumed by ever more participants. Data is the new global currency of the information age, and we all pass around currency – so let's get cracking at delivering it to our enterprise (or the enterprise will go elsewhere to find it).

To the point: Big Data is an old term, and the new one is "Smart Data", if you ask me.

I would welcome any comments or input on the above, and have posted this, including pictures, on my LinkedIn page – let's start a dialog around best practices in today's information age…

Robert J. Abate, CBIP, CDMP

About the Author

Credited as one of the first to publish on Service-Oriented Architecture and the Abate Information Triangle, Robert is a respected IT thought leader. He is the author of the Big Data & Analytics chapter of DAMA's DMBoK publication and is on the technology advisory board for Nielsen. He was on the governing body for the 2015 CDO Forums and an expert panelist at the 2016 CAO Forum / Big Data Innovation Summit.

havenfarm@tds.net

http://www.linkedin.com/in/robertjabate

Minimizing tuple overhead


Feed: Planet PostgreSQL.

I quite often hear people being disappointed by how much space PostgreSQL wastes for each row it stores. I'll try to show here some tricks to minimize this effect and allow more efficient storage.

What overhead?

If you don't have tables with more than a few hundred million rows, it's likely that you haven't had an issue with this.

For each row stored, Postgres will store additional data for its own needs. This is documented here. The documentation says:

Field        Type             Length   Description
t_xmin       TransactionId    4 bytes  insert XID stamp
t_xmax       TransactionId    4 bytes  delete XID stamp
t_cid        CommandId        4 bytes  insert and/or delete CID stamp (overlays with t_xvac)
t_xvac       TransactionId    4 bytes  XID for VACUUM operation moving a row version
t_ctid       ItemPointerData  6 bytes  current TID of this or newer row version
t_infomask2  uint16           2 bytes  number of attributes, plus various flag bits
t_infomask   uint16           2 bytes  various flag bits
t_hoff       uint8            1 byte   offset to user data

That adds up to 23 bytes on most architectures (you have either t_cid or t_xvac, since they overlap).

You can see some of these fields as hidden columns present on any table by adding them to the SELECT part of a query, or by looking for negative attribute numbers in the pg_attribute catalog:

# \d test
     Table "public.test"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |

# SELECT xmin, xmax, id FROM test LIMIT 1;
 xmin | xmax | id
------+------+----
 1361 |    0 |  1

# SELECT attname, attnum, atttypid::regtype, attlen
FROM pg_class c
JOIN pg_attribute a ON a.attrelid = c.oid
WHERE relname = 'test'
ORDER BY attnum;
 attname  | attnum | atttypid | attlen
----------+--------+----------+--------
 tableoid |     -7 | oid      |      4
 cmax     |     -6 | cid      |      4
 xmax     |     -5 | xid      |      4
 cmin     |     -4 | cid      |      4
 xmin     |     -3 | xid      |      4
 ctid     |     -1 | tid      |      6
 id       |      1 | integer  |      4

If you compare with the previous table, you can see that not all of these columns are stored on disk. Obviously, PostgreSQL doesn't store the table's oid in each row. It's added afterwards, while constructing a tuple.

If you want more technical details, you should take a look at htup_detail.c, starting with the TupleHeaderData struct.

How costly is it?

As the overhead is fixed, it becomes more and more negligible as the row size grows. If you only store a single int column (4 bytes), each row will need:

23B + 4B = 27B

So, it’s 85% overhead, pretty horrible.

On the other hand, if you store 5 integer, 3 bigint and 2 text columns (let’s
say ~80B average), you’ll have:

23B + 5*4B + 3*8B + 2*80B = 227B

That’s “only” 10% overhead.
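
As a quick sanity check of these figures, here is a trivial Python sketch computing the share of the stored row taken by the fixed header:

# Quick check of the per-row overhead figures above (23-byte tuple header).
HEADER = 23

def overhead(payload_bytes):
    """Fraction of the stored row taken by the fixed header."""
    return HEADER / (HEADER + payload_bytes)

print(round(overhead(4), 2))                  # single int column       -> 0.85
print(round(overhead(5*4 + 3*8 + 2*80), 2))   # 5 int, 3 bigint, 2 text -> 0.1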

So, how to minimize this overhead

The idea is to store the same data with fewer records. How to do that? By aggregating data in arrays. The more records you put in a single array, the less overhead you have. And if you aggregate enough data, you can benefit from transparent compression thanks to the TOAST mechanism.

Let's try with a table containing a single integer column and 10M rows:

# CREATE TABLE raw_1 (id integer);

# INSERT INTO raw_1 SELECT generate_series(1,10000000);

# CREATE INDEX ON raw_1 (id);

The user data should need 10M * 4B, i.e. around 38MB, while this table will consume 348MB. Inserting the data takes around 23 seconds.

NOTE: If you do the maths, you'll find out that the overhead is slightly more than 32B, not 23B. This is because each block also has some overhead, plus NULL handling and alignment issues. If you want more information on this, I recommend this presentation.

Let’s compare with aggregated versions of the same data:

# CREATE TABLE agg_1 (id integer[]);

# INSERT INTO agg_1 SELECT array_agg(i)
FROM generate_series(1,10000000) i
GROUP BY i % 2000000;

This will insert 5 elements per row. I’ve done the same test with 20, 100, 200
and 1000 elements per row. Results are below:

NOTE: The size for 1000 elements per row is a little higher than for the lower values. This is because it's the only one which is big enough to be TOAST-ed, but not big enough to be compressed. We can see a little TOAST overhead here.

So far so good: we can see quite good improvements, both in size and in INSERT time, even for very small arrays. Let's see the impact on retrieving rows. I'll try to retrieve all the rows, then only one row with an index scan (for these tests I've used EXPLAIN ANALYZE to minimize the time spent representing the data in psql):

# SELECT id FROM raw_1;

# CREATE INDEX ON raw_1 (id);

# SELECT * FROM raw_1 WHERE id = 500;

To properly index this array, we need a GIN index. To get the values from the aggregated data, we need to unnest() the arrays, and be a little more creative to get a single record:

# SELECT unnest(id) AS id FROM agg_1;

# CREATE INDEX ON agg_1 USING gin (id);

# WITH s(id) AS (
    SELECT unnest(id)
    FROM agg_1
    WHERE id && array[500]
)
SELECT id FROM s WHERE id = 500;

Here’s the chart comparing index creation time and index size:

The GIN index is a little more than twice the size of the btree index; if I add the table size, the total size is almost the same as without aggregation. That's not a big issue, since this example is naive; we'll see later how to avoid using a GIN index to keep the total size low. The index is also way slower to build, meaning that INSERTs will also be slower.

Here's the chart comparing index creation time, index size, and the time to retrieve all rows and a single row:

Getting all the rows is probably not an interesting example, but as soon as the array contains enough elements it starts to be faster than the original table. We can see that getting only one element is much faster than with the btree index, thanks to GIN efficiency. It's not tested here, but since only btree indexes are sorted, if you need to get a lot of data sorted, using a GIN index will require an extra sort, which will be way slower than a btree index scan.

A more realistic example

Now that we've seen the basics, let's see how to go further: aggregating more than one column, and avoiding the use of too much disk space with a GIN index. For this, I'll present how PoWA stores its data.

For each datasource collected, two tables are used: the historic (aggregated) one and the current one. These tables store data in a custom type instead of plain columns. Let's see the tables related to pg_stat_statements:

The custom type contains basically all the counters present in pg_stat_statements, plus the timestamp associated with this record:

powa=# \d powa_statements_history_record
   Composite type "public.powa_statements_history_record"
       Column        |           Type           | Modifiers
---------------------+--------------------------+-----------
 ts                  | timestamp with time zone |
 calls               | bigint                   |
 total_time          | double precision         |
 rows                | bigint                   |
 shared_blks_hit     | bigint                   |
 shared_blks_read    | bigint                   |
 shared_blks_dirtied | bigint                   |
 shared_blks_written | bigint                   |
 local_blks_hit      | bigint                   |
 local_blks_read     | bigint                   |
 local_blks_dirtied  | bigint                   |
 local_blks_written  | bigint                   |
 temp_blks_read      | bigint                   |
 temp_blks_written   | bigint                   |
 blk_read_time       | double precision         |
 blk_write_time      | double precision         |

The current table stores the pg_stat_statements unique identifier (queryid, dbid, userid) and a record of counters:

powa=# \d powa_statements_history_current
    Table "public.powa_statements_history_current"
 Column  |              Type              | Modifiers
---------+--------------------------------+-----------
 queryid | bigint                         | not null
 dbid    | oid                            | not null
 userid  | oid                            | not null
 record  | powa_statements_history_record | not null

The aggregated table contains the same unique identifier, an array of records
and some special fields:

powa=# \d powa_statements_history
            Table "public.powa_statements_history"
     Column     |               Type               | Modifiers
----------------+----------------------------------+-----------
 queryid        | bigint                           | not null
 dbid           | oid                              | not null
 userid         | oid                              | not null
 coalesce_range | tstzrange                        | not null
 records        | powa_statements_history_record[] | not null
 mins_in_range  | powa_statements_history_record   | not null
 maxs_in_range  | powa_statements_history_record   | not null
Indexes:
    "powa_statements_history_query_ts" gist (queryid, coalesce_range)

We also store the timestamp range (coalesce_range) for all the counters aggregated in a row, and the minimum and maximum values of each counter in two dedicated records. These extra fields don't consume too much space, and they allow very efficient indexing and computation, based on the access pattern of the related application.

This table is used to know how many resources a query consumed in a given time range. The GiST index won't be too big, since it only indexes one row per X aggregated counters, and it will efficiently find the rows matching a given queryid and time range.

Then, computing the resources consumed can be done efficiently, since the
pg_stat_statements counters are strictly monotonic. The algorithm would be:

  • if the row's time range is entirely contained in the asked time range, we only
    need to compute the delta of the summary records:
    maxs_in_range.counter - mins_in_range.counter
  • if not (meaning only two rows for each queryid), we unnest the array, filter
    out records that aren't in the asked time range, keep the first and last values
    and compute, for each counter, the maximum minus the minimum. The unnest will
    only be needed for these two boundary rows.

NOTE: Actually, the PoWA interface always unnests all records overlapping the asked time interval, since the interface is designed to show the evolution of these counters over a relatively small time range, but with great precision. Fortunately, unnesting the records is not that expensive, especially compared to the disk space saved.

And here's the size needed for the aggregated and non-aggregated values. For this I let PoWA generate 12,331,366 records (configuring a snapshot every 5 seconds for some hours, with the default aggregation of 100 records per row), and used a btree index on (queryid, ((record).ts)) to simulate the index present on the aggregated table:

Pretty efficient, right?

Limitations

There are some limitations with aggregating records. If you do this, you can't enforce constraints such as foreign keys or unique constraints. This approach is therefore suited to non-relational data, such as counters or metadata.

Bonus

Using custom types also allows some nice things, like defining custom operators. For instance, the 3.1.0 release of PoWA will provide two operators for each custom type defined:

  • the - operator, to get the difference between two records
  • the / operator, to get the difference per second

You'll therefore be able to run this kind of query:

# SELECT (record - lag(record) over()).*
FROM powa_statements_history_current
WHERE queryid = 3589441560 AND dbid = 16384;
      intvl      | calls  |    total_time    |  rows  | ...
-----------------+--------+------------------+--------+ ...
 <NULL>          | <NULL> |           <NULL> | <NULL> | ...
 00:00:05.004611 |   5753 | 20.5570000000005 |   5753 | ...
 00:00:05.004569 |   1879 | 6.40500000000047 |   1879 | ...
 00:00:05.00477  |  14369 | 48.9060000000006 |  14369 | ...
 00:00:05.00418  |      0 |                0 |      0 | ...

# SELECT (record / lag(record) over()).*
FROM powa_statements_history_current
WHERE queryid = 3589441560 AND dbid = 16384;

  sec   | calls_per_sec | runtime_per_sec  | rows_per_sec | ...
--------+---------------+------------------+--------------+ ...
 <NULL> |        <NULL> |           <NULL> |       <NULL> | ...
      5 |        1150.6 |  4.1114000000001 |       1150.6 | ...
      5 |         375.8 | 1.28100000000009 |        375.8 | ...
      5 |        2873.8 | 9.78120000000011 |       2873.8 | ...

If you're interested in how to implement such operators, you can look at the PoWA implementation.

Conclusion

You now know the basics of working around the per-tuple overhead. Depending on your needs and your data's specificities, you should find a way to aggregate your data and add some extra columns to keep performance good.


Deciphering Glyph :: Hitting The Wall


Feed: Planet Python.
Author: .

I’m an introvert.

I say that with a full-on appreciation of
just how awful
thinkpieces on “introverts” are.

However, I feel compelled to write about this today because of a certain type
of social pressure that a certain type of introvert faces. Specifically, I am
a high-energy introvert.

Cementing this piece’s place in the hallowed halls of just awful thinkpieces,
allow me to compare my mild cognitive fatigue with the plight of those
suffering from chronic illness and disability. There’s a social phenomenon
associated with many chronic illnesses,
“but you don’t LOOK sick”, where
well-meaning people will look at someone who is suffering, with no obvious
symptoms, and imply that they really ought to be able to “be normal”.

As a high-energy introvert, I frequently participate in social events. I go to
meet-ups and conferences and I engage in plenty of
public speaking. I am, in a sense,
comfortable extemporizing in front of large groups of strangers.

This all sounds like extroverted behavior, I know. But there’s a key
difference.

Let me posit two axes for personality type: on the X axis, “introvert” to
“extrovert”, and on the Y, “low energy” up to “high energy”.

The X axis describes what kinds of activities give you energy, and the Y axis
describes how large your energy reserves are for the other type.

Notice that I didn’t say which type of activity you enjoy.

Most people who would self-describe as “introverts” are in the
low-energy/introvert quadrant. They have a small amount of energy available
for social activities, which they need to frequently re-charge by doing
solitary activities. As a result of frequently running out of energy for
social activities, they don’t enjoy social activities.

Most people who would self-describe as “extroverts” are also on the
“low-energy” end of the spectrum. They have low levels of patience for
solitary activity, and need to re-charge by spending time with friends, going
to parties, etc, in order to have the mental fortitude to sit still for a while
and focus. Since they can endlessly get more energy from the company of
others, they tend to enjoy social activities quite a bit.

Therefore we have certain behaviors we expect to see from “introverts”. We
expect them to be shy, and quiet, and withdrawn. When someone who behaves this
way has to bail on a social engagement, this is expected. There’s a certain
affordance for it. If you spend a few hours with them, they may be initially
friendly but will visibly become uncomfortable and withdrawn.

This “energy” model of personality is of course an oversimplification – it’s my
personal belief that everyone needs some balance of privacy and socialization
and solitude and eventually overdoing one or the other will be bad for anyone –
but it’s a useful one.

As a high-energy introvert, my behavior often confuses people. I’ll show up
at a week’s worth of professional events, be the life of the party, go out to
dinner at all of them, and then disappear for a month. I'm not visibly shy –
quite the opposite, I’m a gregarious raconteur. In fact, I quite visibly
enjoy the company of friends. So, usually, when I try to explain that I am
quite introverted, this claim is met with (quite understandable) skepticism.

In fact, I am quite functionally what society expects of an “extrovert” – until
I hit the wall.


In endurance sports, one is said to
“hit the wall” at the point
where all the short-term energy reserves in one’s muscles are exhausted, and
there is a sudden, dramatic loss of energy. Regardless, many people enjoy
endurance sports; part of the challenge of them is properly managing your
energy.

This is true for me and social situations. I do enjoy social situations
quite a bit! But they are nevertheless quite taxing for me, and without
prolonged intermissions of solitude, eventually I get to the point where I can
no longer behave as a normal social creature without an excruciating level of
effort and anxiety.

Several years ago, I attended a prolonged social event where I hit the
wall, hard. The event itself was several hours too long for me, involved
meeting lots of strangers, and in the lead-up to it I hadn’t had a weekend to
myself for a few weeks due to work commitments and family stuff. Towards the
end I noticed I was developing a completely
flat affect, and had to
start very consciously performing even basic body language, like looking at
someone while they were talking or smiling. I’d never been so exhausted and
numb in my life; at the time I thought I was just stressed from work.

Afterwards though, I started having a lot of weird nightmares,
even during the daytime.
This concerned me, since I’d never had such a severe reaction to a social
situation, and I didn’t have good language to describe it. It was also a
little perplexing that what was effectively a nice party, the first half of
which had even been fun for me, would cause such a persistent negative reaction
after the fact. After some research, I eventually discovered that such
involuntary thoughts are
a hallmark of PTSD.

While I’ve managed to avoid this level of exhaustion before or since, this was
a real learning experience for me that the consequences of incorrectly managing
my level of social interaction can be quite severe.

I’d rather not do that again.


The reason I’m writing this, though, is not to avoid future anxiety. My
social energy reserves are quite large enough, and I now have enough
self-knowledge, that it is extremely unlikely I’d ever find myself in that
situation again.

The reason I'm writing is to help people understand that I'm not blowing them off because I don't like them. Many times now, I've declined or bailed on an invitation from someone, and later heard that they felt hurt that I was passive-aggressively refusing to be friendly.

I certainly understand this reaction. After all, if you see someone at a party
and they’re clearly having a great time and chatting with everyone, but then
when you invite them to do something, they say “sorry, too much social
stuff”, that seems like a pretty passive-aggressive way to respond.

You might even still be skeptical after reading this. “Glyph, if you were
really an introvert, surely, I would have seen you looking a little shy and
withdrawn. Surely I’d see some evidence of stage fright before your talks.”

But that’s exactly the problem here: no, you wouldn’t.

At a social event, since I have lots of energy to begin with, I’ll build up a
head of steam on burning said energy that no low-energy introvert would ever
risk. If I were to run out of social-interaction-juice, I’d be in the middle
of a big crowd telling a long and elaborate story when I find myself exhausted.
If I hit the wall in that situation, I can’t feel a little awkward and make
excuses and leave; I’ll be stuck creepily faking a smile like a sociopath and
frantically looking for a way out of the conversation for an hour, as the
pressure from a large crowd of people rapidly builds up months worth of
nightmare fuel from my spiraling energy deficit.

Given that I know that’s what’s going to happen, you won’t see me when I’m
close to that line. You won't be at my desk when I silently sit and type
for a whole day, or on my couch when I quietly read a book for ten hours at a
time. My solitary side is, by definition, hidden.

But, if I don’t show up to your party, I promise: it’s not you, it’s me.

Build cloud apps at warp speed


Feed: Microsoft Azure Blog.
Author: James Staten.

One of your best customers just tweeted about a problem with your product and you want to respond to them ASAP. It would be great if you could automatically catch this type of communication and automagically respond with either the right documentation or an escalation to your support team. But the thought of writing an application to handle this event, with all that entails – allocating VMs, assigning staff to manage either the IaaS instances or the cloud service, not to mention the cost of development (which might include software licenses) – seems like a lot just to recognize and handle a tweet.

What if you could catch the tweet, direct it to the right person and respond to the customer quickly with no code and no infrastructure hassles: no systems-level programming, no server configuration step, not even code required – just the workflow. Just the business process.

It's possible in the new era of cloud computing. With newly introduced capabilities in the Microsoft Cloud – Microsoft Flow, Microsoft PowerApps, and Azure Functions – you can design your workflow in a visual designer and just deploy it.

Now in preview, these new cloud offerings foreshadow the future of cloud applications.

Intrigued? Read on.

Take a look to the left. There’s the Microsoft Flow designer being set up to tell your Slack channel any time somebody complains about your product. 

That’s it. One click and voila: your workflow is running!

(And there’s the result in Slack!)


But perhaps your smart support representative contacts the unhappy customer – who it turns out has a valid issue. Your rep takes down the relevant information and starts a new workflow to have the issue looked at.

Need a server for that? No! With Microsoft PowerApps, you can visually design a form for your rep, and it can kick off a Flow. Want that app mobile-enabled on any smartphone? No problem, as you see below. And as shown, it uses the Common Data Model available in PowerApps, enabling a lingua franca between applications.


If you need more sophisticated or custom processing, your developers can create Azure Functions on the event – say, updating an on-premises or cloud-based sentiment analysis engine with the tweet, or invoking a marketing application to offer an incentive. Again: no server. (In fact, no IDE either: your devs write their business logic code directly in the Azure portal and deploy from there.)
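
As a rough illustration only – Python support for Azure Functions arrived after this post was written, and the payload shape and keyword list below are my own assumptions, not part of the product – an HTTP-triggered function that flags a complaint-sounding tweet could look something like this (binding configuration omitted):

import json

import azure.functions as func

NEGATIVE_WORDS = {"broken", "terrible", "refund"}  # assumed keyword list


def main(req: func.HttpRequest) -> func.HttpResponse:
    """Classify an incoming tweet payload as a complaint or not (toy heuristic)."""
    tweet = req.get_json()  # assumed payload shape: {"text": "..."}
    words = set(tweet.get("text", "").lower().split())
    is_complaint = bool(words & NEGATIVE_WORDS)
    return func.HttpResponse(
        json.dumps({"complaint": is_complaint}),
        mimetype="application/json",
    )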

So why do I say Microsoft Flow, PowerApps and Functions presage a new model of cloud applications? Because increasingly, cloud apps are evolving toward a Lego-block model of “serverless” computing: you create and pay only for your business logic, and chunks of processing logic are connected together to create an entire business application.

Infrastructure? Of course it’s there (“serverless” may not be the best term), but it’s under the covers: Azure manages the servers, configures them, updates them and ensures their availability. Your concern is what it should be: your business logic.

This is potentially a seismic shift in how we think about enterprise computing.

Think about it: with PowerApps your business users can quickly create apps, and with Microsoft Flow, create business processes with a few clicks. With Flow’s bigger cousin, Azure Logic Apps, you can quickly connect to any industry-standard enterprise data source such as your local ERP system, a data warehouse, support tools and many others via open protocols and interfaces such as EDIFACT/X.12, AS2, or XML. And you can easily connect to a wide variety of social media and internet assets, like Twitter, Dropbox, Slack, Facebook and many others. With Functions you can catch events generated by Logic Apps and make decisions in real time.

And you haven’t deployed a single server. What code you’ve written is business logic only, not administration scripts or other code with no business value. Your developers have focused on growing your business. And, most importantly, you’ve created a rich, intelligent end-to-end application – by simply attaching together existing blocks of logic.

Like Lego blocks. Other cloud platforms offer serverless options, but none as deep and as varied as Microsoft’s, empowering everyone in your organization, from business analyst to developer, with tools appropriate to their skills. For enterprises, the implications could not be more profound.

Maybe it’s appropriate, on this fiftieth anniversary of Star Trek, that with tools on the Microsoft Cloud, you can run your business at warp speed using Azure.

Planet Python

$
0
0

Feed: Planet Python.
Author: .

Last update: September 20, 2016 04:51 AM

September 19, 2016


Curtis Miller

An Introduction to Stock Market Data Analysis with Python (Part 1)

This post is the first in a two-part series on stock data analysis using Python, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah. In these posts, I will discuss basics such as obtaining the data from Yahoo! Finance using pandas, visualizing stock data, moving…Read more An Introduction to Stock Market Data Analysis with Python (Part 1)

September 19, 2016 03:00 PM


Andre Roberge

Backward incompatible change in handling permalinks with Reeborg coming soon

About two years ago, I implemented a permalink scheme which was intended to facilitate sharing various programming tasks in Reeborg’s World. As I added new capabilities, the number of possible items to include grew tremendously. In fact, for rich enough worlds, the permalink can be too long for the browser to handle. To deal with such situations, I had to implement a clumsy way to import and

September 19, 2016 02:17 PM


Doug Hellmann

dbm — Unix Key-Value Databases — PyMOTW 3

dbm is a front-end for DBM-style databases that use simple string values as keys to access records containing strings. It uses whichdb() to identify databases, then opens them with the appropriate module. It is used as a back-end for shelve, which stores objects in a DBM database using pickle. Read more… This post is … Continue reading dbm — Unix Key-Value Databases — PyMOTW 3

September 19, 2016 01:00 PM


Python Piedmont Triad User Group

PYPTUG Monthly meeting September 27th 2016 (Just bring Glue)

Come join PYPTUG at our next monthly meeting (September 27th 2016) to learn more about the Python programming language, modules and tools. Python is the perfect language to learn if you’ve never programmed before, and at the other end, it is also the perfect tool that no expert would do without. Monthly meetings are in addition to our project nights.

What

Meeting will start at 6:00pm.

We will open with an intro to PYPTUG and how to get started with Python, followed by PYPTUG activities and member projects – in particular some updates on the Quadcopter project – and then news from the community.


 

Main Talk: Just Bring Glue – Leveraging Multiple Libraries To Quickly Build Powerful New Tools

by Rob Agle

Bio:

Rob Agle is a software engineer at Inmar, where he works on the high-availability REST APIs powering the organization’s digital promotions network. His technical interests include application and network security, machine learning and natural language processing.

Abstract:

It has never been easier for developers to create simple-yet-powerful data-driven or data-informed tools. Through case studies, we’ll explore a few projects that use a number of open source libraries or modules in concert. Next, we’ll cover strategies for learning these new tools. Finally, we wrap up with pitfalls to keep in mind when gluing powerful things together quickly.

Lightning talks! 

We will have some time for extemporaneous “lightning talks” of 5-10 minute duration. If you’d like to do one and are looking for inspiration, some suggestions for talks were provided here. Or talk about a project you are working on.

When

Tuesday, September 27th 2016
Meeting starts at 6:00PM

Where

Wake Forest University, close to Polo Rd and University Parkway:

Wake Forest University, Winston-Salem, NC 27109

And speaking of parking: parking after 5pm is on a first-come, first-served basis. The official parking policy is:

Visitors can park in any general parking lot on campus. Visitors should avoid reserved spaces, faculty/staff lots, fire lanes or other restricted area on campus. Frequent visitors should contact Parking and Transportation to register for a parking permit.

Mailing List

Don’t forget to sign up to our user group mailing list:

It is the only step required to become a PYPTUG member.

RSVP on meetup:

https://www.meetup.com/PYthon-Piedmont-Triad-User-Group-PYPTUG/events/233759543/

September 19, 2016 12:58 PM


Mike Driscoll

PyDev of the Week: Benedikt Eggers

This week we welcome Benedikt Eggers (@be_eggers) as our PyDev of the Week. Benedikt is one of the core developers working on the IronPython project. IronPython is the version of Python that is integrated with Microsoft’s .NET framework, much like Jython is integrated with Java. If you’re interested in seeing what Benedikt has been up to lately, you might want to check out his GitHub profile. Let’s take a few minutes to get to know our fellow Pythoneer!

Could you tell us a little about yourself (hobbies, education, etc):

My name is Benedikt Eggers and I was born and live in Germany (23 years old). I’ve been working as a software developer and engineer and studied business informatics. In my little spare time I do sports and work on open source projects, like IronPython.

Why did you start using Python?

To be honest, I’ve started using Python by searching for a script engine for .net. That way I came to IronPython and established it in our company. There we are using it to extend our software and writing and using Python modules in both worlds. After a while I got more into Python and thought that’s a great concept of a dynamic language. So it’s a good contrast to C#. It is perfect for scripting and other nice and quick stuff.

What other programming languages do you know and which is your favorite?

The language I’m most familiar with is C#. To be honest, this is also my “partly” favorite language to write larger application and complex products. But I also like Python/IronPython very much, cause it allows me to achieve my goals very quickly with less and readable code. So a favorite language is hard to pick, cause I like to use the best technology in its specific environment (Same could be said about relational and document based database, …)

What projects are you working on now?

Mostly I’m working on my projects at work. We (http://simplic-systems.com/) are continuously working on creating more and more open source projects and also contribute to other open source projects. So I spend a lot of time there. But I also can use a lot of this time to work on IronPython. So I’m able to mix this up and work a few projects parallel. But spending time working on IronPython is something I really like, so I’m doing it, cause I enjoy it.

Which Python libraries are your favorite (core or 3rd party)?

I really like requests and all the packages to easily work with web-services and other modern technologies. On the other side, I use a lot of Python Modules in our continuous integration environment, to automate our build process. So there I also use the core libraries to move, rename files by reading JSON configurations and so on. So there are a lot of libraries I like. Because they make my life much easier every day.

Is there anything else you’d like to say?

Yes – I’d love to see how fast we are growing and that we found people who are willing to contribute to IronPython. I think we are on a good way and hope that we can achieve all of our goals. I hope that IronPython 3 and all other releases are coming soon. Furthermore I’d like to thank Jeff Hardy a lot, who has contributed to the project in the past years and is always very helpful. Finally, a thanks also goes to Alex Earl, who has been working on this project in the last years and now wants to bring it back together with the community. I think we will work great together!

Thanks so much for doing the interview!

September 19, 2016 12:30 PM


Wesley Chun

Accessing Gmail from Python (plus BONUS)

NOTE: The code covered in this blogpost is also available in a video walkthrough here.

UPDATE (Aug 2016): The code has been modernized to use oauth2client.tools.run_flow() instead of the deprecated oauth2client.tools.run(). You can read more about that change here.

Introduction

The last several posts have illustrated how to connect to public/simple and authorized Google APIs. Today, we’re going to demonstrate accessing the Gmail (another authorized) API. Yes, you read that correctly… “API.” In the old days, you accessed mail services with standard Internet protocols such as IMAP/POP and SMTP. However, while they are standards, they haven’t kept up with modern day email usage and developers’ needs that go along with it. In comes the Gmail API, which provides CRUD access to email threads and drafts along with messages, search queries, management of labels (like folders), and domain administration features that are an extra concern for enterprise developers.

Earlier posts demonstrate the structure and “how-to” use Google APIs in general, so the most recent posts, including this one, focus on solutions and apps, and use of specific APIs. Once you review the earlier material, you’re ready to start with Gmail scopes then see how to use the API itself.

Gmail API Scopes

Below are the Gmail API scopes of authorization. We’re listing them in most-to-least restrictive order because that’s the order you should consider using them in: use the most restrictive scope you possibly can while still allowing your app to do its work. This makes your app more secure and may prevent inadvertently going over any quotas, or accessing, destroying, or corrupting data. Also, users are less hesitant to install your app if it asks only for more restricted access to their inboxes.

  • 'https://www.googleapis.com/auth/gmail.readonly' — Read-only access to all resources + metadata
  • 'https://www.googleapis.com/auth/gmail.send' — Send messages only (no inbox read nor modify)
  • 'https://www.googleapis.com/auth/gmail.labels' — Create, read, update, and delete labels only
  • 'https://www.googleapis.com/auth/gmail.insert' — Insert and import messages only
  • 'https://www.googleapis.com/auth/gmail.compose' — Create, read, update, delete, and send email drafts and messages
  • 'https://www.googleapis.com/auth/gmail.modify' — All read/write operations except for immediate & permanent deletion of threads & messages
  • 'https://mail.google.com/' — All read/write operations (use with caution)

Using the Gmail API

We’re going to create a sample Python script that goes through your Gmail threads and looks for those which have more than 2 messages, for example, if you’re seeking particularly chatty threads on mailing lists you’re subscribed to. Since we’re only peeking at inbox content, the only scope we’ll request is ‘gmail.readonly’, the most restrictive scope. The API string is ‘gmail’ which is currently on version 1, so here’s the call to apiclient.discovery.build() you’ll use:

GMAIL = discovery.build('gmail', 'v1', http=creds.authorize(Http()))

Note that all of the code above this line is predominantly boilerplate (which was explained in earlier posts). Anyway, once you have an established service endpoint with build(), you can use the list() method of the threads service to request the thread data. The one required parameter is the user’s Gmail address. A special value of ‘me’ has been set aside for the currently authenticated user.

threads = GMAIL.users().threads().list(userId='me').execute().get('threads', [])

If all goes well, the (JSON) response payload will (not be empty or missing and) contain a sequence of threads that we can loop over. For each thread, we need to fetch more info, so we issue a second API call for that. Specifically, we care about the number of messages in a thread:

for thread in threads:
    tdata = GMAIL.users().threads().get(userId='me', id=thread['id']).execute()
    nmsgs = len(tdata['messages'])

We’re seeking only threads with more than 2 (that is, at least 3) messages, discarding the rest. If a thread meets that criterion, we scan the first message and cycle through the email headers looking for the “Subject” line to display to users, skipping the remaining headers as soon as we find it:

    if nmsgs > 2:
        msg = tdata['messages'][0]['payload']
        subject = ''
        for header in msg['headers']:
            if header['name'] == 'Subject':
                subject = header['value']
                break
        if subject:
            print('%s (%d msgs)' % (subject, nmsgs))

If you’re on many mailing lists, this may give you more messages than desired, so feel free to up the threshold from 2 to 50, 100, or whatever makes sense for you. (In that case, you should use a variable.) Regardless, that’s pretty much the entire script save for the OAuth2 code that we’re so familiar with from previous posts. The script is posted below in its entirety, and if you run it, you’ll see an interesting collection of threads… YMMV depending on what messages are in your inbox:

$ python3 gmail_threads.py
[Tutor] About Python Module to Process Bytes (3 msgs)
Core Python book review update (30 msgs)
[Tutor] scratching my head (16 msgs)
[Tutor] for loop for long numbers (10 msgs)
[Tutor] How to show the listbox from sqlite and make it searchable? (4 msgs)
[Tutor] find pickle and retrieve saved data (3 msgs)

BONUS: Python 3!

As of Mar 2015 (formally in Apr 2015 when the docs were updated), support for Python 3 was added to Google APIs Client Library (3.3+)! This update was a long time coming (relevant GitHub thread), and allows Python 3 developers to write code that accesses Google APIs. If you’re already running 3.x, you can use its pip command (pip3) to install the Client Library:

$ pip3 install -U google-api-python-client

Because of this, unlike previous blogposts, we’re deliberately going to avoid use of the print statement and switch to the print() function instead. If you’re still running Python 2, be sure to add the following import so that the code will also run in your 2.x interpreter:

from __future__ import print_function

Conclusion

To find out more about the input parameters as well as all the fields that are in the response, take a look at the docs for threads().list(). For more information on what other operations you can execute with the Gmail API, take a look at the reference docs and check out the companion video for this code sample. That’s it!

Below is the entire script for your convenience which runs on both Python 2 and Python 3 (unmodified!):

from __future__ import print_function

from apiclient import discovery
from httplib2 import Http
from oauth2client import file, client, tools

SCOPES = 'https://www.googleapis.com/auth/gmail.readonly'
store = file.Storage('storage.json')
creds = store.get()
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets('client_secret.json', SCOPES)
    creds = tools.run_flow(flow, store)
GMAIL = discovery.build('gmail', 'v1', http=creds.authorize(Http()))

threads = GMAIL.users().threads().list(userId='me').execute().get('threads', [])
for thread in threads:
    tdata = GMAIL.users().threads().get(userId='me', id=thread['id']).execute()
    nmsgs = len(tdata['messages'])

    if nmsgs > 2:
        msg = tdata['messages'][0]['payload']
        subject = ''
        for header in msg['headers']:
            if header['name'] == 'Subject':
                subject = header['value']
                break
        if subject:
            print('%s (%d msgs)' % (subject, nmsgs))


You can now customize this code for your own needs, for a mobile frontend, a server-side backend, or to access other Google APIs. If you want to see another example of using the Gmail API (displaying all your inbox labels), check out the Python Quickstart example in the official docs or its equivalent in Java (server-side, Android), iOS (Objective-C, Swift), C#/.NET, PHP, Ruby, JavaScript (client-side, Node.js), or Go. That’s it… hope you find these code samples useful in helping you get started with the Gmail API!

EXTRA CREDIT: To test your skills and challenge yourself, try writing code that allows users to perform a search across their email, or perhaps creating an email draft, adding attachments, then sending them! Note that to prevent spam, there are strict Program Policies that you must abide by… any abuse could rate limit your account or get it shut down. Check out those rules plus other Gmail terms of use here.

September 19, 2016 12:04 PM


Jeff Knupp

Writing Idiomatic Python Video Four Is Out!

After an unplanned two-year hiatus, the fourth video in the Writing Idiomatic Python Video Series is out! This was long overdue, and for that I sincerely apologize. All I can do now is continue to produce the rest at a steady clip and get them out as quickly as possible. I hope you find the video useful! Part 5 will be out soon…

September 19, 2016 06:07 AM

September 18, 2016


Omaha Python Users Group

September 21 Meeting

Lightning Talks, discussion, and topic selection for this season’s meetings.

Event Details:

  • Where: DoSpace @ 7205 Dodge Street / Meeting Room #2
  • When: September 21, 2016 @ 6:30pm – 8:00pm
  • Who: People interested in programming with Python

September 18, 2016 11:37 PM


Experienced Django

KidsTasks – Working on Models

This is part two in the KidsTasks series where we’re designing and implementing a django app to manage daily task lists for my kids. See part 1 for details on requirements and goals.

Model Design Revisited

As I started coding up the models and the corresponding admin pages for the design I presented in the last section it became clear that there were several bad assumptions and mistakes in that design. (I plan to write up a post about designs and their mutability in the coming weeks.)

The biggest conceptual problem I had was the difference between “python objects” and “django models”. Django models correspond to database tables and thus do not map easily to things like “I want a list of Tasks in my DayOfWeekSchedule”.

After building up a subset of the models described in part 1, I found that the CountedTask model wasn’t going to work the way I had envisioned. Creating it as a direct subclass of Task caused unexpected (initially, at least) behavior in that all CountedTasks were also Tasks and thus showed up in all lists where Tasks could be added. While this behavior makes sense, it doesn’t fit the model I was working toward. After blundering with a couple of other ideas, it finally occurred to me that the main problem was the fundamental design. If something seems really cumbersome to implement it might be pointing to a design error.

Stepping back, it occurred to me that the idea of a “Counted” task was putting information at the wrong level. An individual task shouldn’t care if it’s one of many similar tasks in a Schedule, nor should it know how many there are. That information should be part of the Schedule models instead.

Changing this took more experimenting than I wanted, largely due to a mismatch in my thinking and how django models work. The key for working through this level of confusion was by trying to figure out how to add multiple Tasks of the same type to a Schedule. That led me to this Stack Overflow question which describes using an intermediate model to relate the two items. This does exactly what I’m looking for, allowing me to say that Kid1 needs to Practice Piano twice on Tuesdays without the need for a CountedTask model.
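
To make that concrete, here is a rough sketch of the intermediate-model approach (the model and field names below are my own illustration of that Stack Overflow idea, not the actual KidsTasks code):

from django.db import models


class Task(models.Model):
    name = models.CharField(max_length=256)

    def __str__(self):
        return self.name


class DayOfWeekSchedule(models.Model):
    name = models.CharField(max_length=256)
    tasks = models.ManyToManyField(Task, through='TaskCount')

    def __str__(self):
        return self.name


class TaskCount(models.Model):
    # Intermediate model: records how many times a task appears in a schedule.
    task = models.ForeignKey(Task, on_delete=models.CASCADE)
    schedule = models.ForeignKey(DayOfWeekSchedule, on_delete=models.CASCADE)
    count = models.PositiveIntegerField(default=1)  # e.g. practice piano twice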

Changing this created problems for our current admin.py, however. I found ideas for how to clean that up here, which describes how to use inlines as part of the admin pages.

Using inlines and intermediate models, I was able to build up a schedule for a kid in a manner similar to my initial vision.  The next steps will be to work on views for this model and see where the design breaks!
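
A hedged sketch of the corresponding admin wiring, reusing the hypothetical TaskCount model from the sketch above (again, not the actual KidsTasks admin.py):

from django.contrib import admin

from .models import DayOfWeekSchedule, TaskCount


class TaskCountInline(admin.TabularInline):
    # Edit TaskCount rows directly on the schedule's admin page.
    model = TaskCount
    extra = 1


class DayOfWeekScheduleAdmin(admin.ModelAdmin):
    inlines = [TaskCountInline]


admin.site.register(DayOfWeekSchedule, DayOfWeekScheduleAdmin)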

Wrap Up

I’m going to stop this session here but I want to add a few interesting points and tidbits I’ve discovered on the way:

  • If you make big changes to the model and you don’t yet have any significant data, you can wipe out the database easily and start over with the following steps:
$ rm db.sqlite3 tasks/migrations/ -rf
$ ./manage.py makemigrations tasks
$ ./manage.py migrate
$ ./manage.py createsuperuser
$ ./manage.py runserver
  • For models, it’s definitely worthwhile to add a __str__ (note: maybe __unicode__?) method and a Meta class to each one.  The __str__ method controls how the class is described, at least in the admin pages. The Meta class allows you to control the ordering when items of this model are listed. Cool!
  • I found (and forgot to note where) in the official docs an example of using a single char to store the name of the week while displaying the full day name. This looks like this:
    day_of_week_choices = (
        ('M', 'Monday'),
        ('T', 'Tuesday'),
        ('W', 'Wednesday'),
        ('R', 'Thursday'),
        ('F', 'Friday'),
        ('S', 'Saturday'),
        ('N', 'Sunday'),
    )
	...
    day_name = models.CharField(max_length=1, choices=day_of_week_choices)
  • NOTE that we’re going to have to tie the “name” fields in many of these models to the kid to which it’s associated. I’m considering if the kid can be combined into the schedule, but I don’t think that’s quite right. Certainly changes are coming to that part of the design.

That’s it! The state of the code at the point I’m writing this can be found here:

git@github.com:jima80525/KidTasks.git
git checkout blog/02-Models-first-steps

Thanks for reading!

September 18, 2016 10:48 PM


François Dion

Something for your mind: Polymath Podcast launched

Some episodes
will have more Art content, some will have more Business content, some will have more Science content, and some will be a nice blend of different things. But for sure, the show will live up to its name and provide you with “something for your mind”. It might raise more questions than it answers, and that is fine too.
Episode 000
Listen to Something for your mind on http://Artchiv.es
Francois Dion
@f_dion

September 18, 2016 09:23 PM


Weekly Python Chat

Tips for learning Django

Making mistakes is a great way to learn, but some mistakes are kind of painful to make. Special guest Melanie Crutchfield and I are going to chat about things you’ll wish you knew earlier when making your first website with Django.

September 18, 2016 05:00 PM


Krzysztof Żuraw

Python & WebDAV- part two

In the last post, I set up owncloud with WebDAV server. Now it’s time to use it.

Table of Contents:

I was searching for a good Python library to work with WebDAV for a long time.
I finally found it: easywebdav. It works
nicely, but the problem is that it doesn’t have support for Python 3. Let’s jump quickly
to my simple project for a CLI tool: webdav editor.

I decided to create a CLI tool to work with a WebDAV server: webdav editor. Right now
it supports only basic commands like login, listing the content of directories, and uploading
and downloading files.

I started by creating the file webdav_utility.py:

import pickle
from urlparse import urlparse

import easywebdav


class Client(object):

    def login(self, *args):
        argparse_namespace = args[0]
        url_components = urlparse(argparse_namespace.server)
        host, port = url_components.netloc.split(':')
        webdav_client = easywebdav.connect(
            host=host,
            port=port,
            path=url_components.path,
            username=argparse_namespace.user,
            password=argparse_namespace.password
        )
        pickle.dump(webdav_client, open('webdav_login', 'wb'))

    def list_content(self, *args):
        argparse_namespace = args[0]
        print [i.name for i in webdav_client.ls(argparse_namespace.path)]

    def upload_file(self, *args):
        argparse_namespace = args[0]
        webdav_client.upload(
            argparse_namespace.from_path, argparse_namespace.to_path
        )

    def download_file(self, *args):
        argparse_namespace = args[0]
        webdav_client.download(
            argparse_namespace.from_path, argparse_namespace.to_path
        )

In the Client class, I write simple functions that are wrappers around the easywebdav
API. In login I parse the provided URL, in a form like localhost:8888/owncloud/remote.php/webdav,
to get the host, port and path that easywebdav.connect needs to establish a proper connection.

Another method worth mentioning is list_content, where I retrieve the names of the files under a
directory on the WebDAV server. In every method I accept a *args argument and pull an argparse_namespace
out of it, which leads us to another component of the application: the cli.py module:

import argparse

from webdav_utility import Client

client = Client()

parser = argparse.ArgumentParser(description='Simple command line utility for WebDAV')
subparsers = parser.add_subparsers(help='Commands')

login_parser = subparsers.add_parser('login', help='Authenticate with WebDAV')
login_parser.add_argument('-s', '--server', required=True)
login_parser.add_argument('-u', '--user', required=True)
login_parser.add_argument('-p', '--password', required=True)
login_parser.set_defaults(func=client.login)

ls_parser = subparsers.add_parser('ls', help='List content of directory under WebDAV')
ls_parser.add_argument('-p', '--path', required=True)
ls_parser.set_defaults(func=client.list_content)

upload_parser = subparsers.add_parser('upload', help='Upload files to WebDAV')
upload_parser.add_argument('-f', '--from', metavar='PATH')
upload_parser.add_argument('-t', '--to', metavar='PATH')
upload_parser.set_defaults(func=client.upload_file)

download_parser = subparsers.add_parser('download', help='Download files from WebDAV')
download_parser.add_argument('-f', '--from', metavar='PATH')
download_parser.add_argument('-t', '--to', metavar='PATH')
download_parser.set_defaults(func=client.download_file)

if __name__ == '__main__':
    args = parser.parse_args()
    args.func(args)

There I use argparse. I create the main parser
with four additional subparsers for login, ls, upload and download. Thanks to that,
I have a different namespace for each of the previously mentioned subparsers.

The problem is that this solution is not generic enough: after running my command with the login parameter I get
Namespace(server='localhost:8888', user='admin', password='admin'), while running the same command
with ls I receive Namespace(path='path_to_file'). To handle that I used set_defaults for
every subparser, telling argparse to invoke the function specified by the func keyword (which is different for every command).
Thanks to that I only need to call this code once:

if __name__ == '__main__':
    args = parser.parse_args()
    args.func(args)

That’s the reason I introduce the argparse_namespace argument in Client.

OK, the tool now works nicely, but there is no place to store information about whether I am logged in or not. So
calling python cli.py login -s localhost -u admin -p admin works, but python cli.py ls -p / does not.
To overcome that I came up with the idea to pickle webdav_client like this:

class Client(object):

  def login(self, *args):
    # login user etc
    pickle.dump(webdav_client, open('webdav_login', 'wb'))

  def list_content(self, *args):
    webdav_client = pickle.load(open('webdav_login', 'rb'))
    # rest of the code

Then I can easily run:

$ python cli.py login --server example.org/owncloud/remote.php/webdav --user admin --password admin
$ python cli.py ls --path '/'
['/owncloud/remote.php/webdav/', '/owncloud/remote.php/webdav/Documents/', '/owncloud/remote.php/webdav/Photos/', '/owncloud/remote.php/webdav/ownCloud%20Manual.pdf']

In this series, I set up an owncloud server and wrote a simple tool just to show the capabilities of WebDAV. I believe
that some work, especially on the webdav editor CLI, can still be done: a better way to handle user auth than pickle,
and separating the Client class from the argparse dependencies. If you have additional comments or thoughts, please
write a comment! Thank you for reading.

Other blog posts in this series:

Github repo for this blog post: link.

Special thanks to Kasia for being editor for this post. Thank you.

Cover image by kleuske under CC BY-SA 2.0.

September 18, 2016 08:00 AM


Michał Bultrowicz

Choosing a CI service for your open-source project

I host my code on GitHub, as probably many of you do.
The easiest way to have it automatically tested in a clean environment (what everyone should do)
is, of course, to use one of the hosted CI services integrated with GitHub.

September 18, 2016 12:00 AM

September 17, 2016


Philip Semanchuk

Thanks for PyData Carolinas

My PyData Pass

Thanks to all who made PyData Carolinas 2016 a success! I had conversations about eating well while on the road, conveyor belts, and a Fortran algorithm to calculate the interaction of charged particles. Great stuff!

My talk was on getting Python to talk to compiled languages; specifically C, Fortran, and C++.

Once the video is online I’ll update this post with a link.

September 17, 2016 10:02 PM


Abu Ashraf Masnun

Python: Using the `requests` module to download large files efficiently

If you use Python regularly, you might have come across the wonderful requests library. I use it almost every day to read URLs or make POST requests. In this post, we shall see how we can download a large file using the requests module with low memory consumption.

To Stream or Not to Stream

When downloading large files/data, we probably would prefer the streaming mode while making the get call. If we use the stream parameter and set it to True, the download will not immediately start. The file download will start when we try to access the content property or try to iterate over the content using iter_content / iter_lines.

If we set stream to False, all the content is downloaded immediately and put into memory. If the file size is large, this can soon cause issues with higher memory consumption. On the other hand, if we set stream to True, the content is not downloaded, but the headers are downloaded and the connection is kept open. We can now choose to proceed with downloading the file or simply cancel it.

But we must also remember that if we decide to stream the file, the connection will remain open and cannot go back to the connection pool. If we’re working with many large files, this might lead to some inefficiency. So we should carefully choose where we should stream. And we should take proper care to close the connections and dispose of any unused resources in such scenarios.
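
For example (this snippet is mine, not from the original post, and the URL is a placeholder), wrapping the response in contextlib.closing makes sure the streaming connection is released even if we stop reading early:

import contextlib
import requests

url = "https://example.com/big-file.bin"  # placeholder URL

# The connection is released when the block exits, even if we bail out early.
with contextlib.closing(requests.get(url, stream=True)) as response:
    response.raise_for_status()
    # Peek at only the first chunk; closing() will clean up the connection.
    first_chunk = next(response.iter_content(chunk_size=512), b"")
    print(len(first_chunk))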

Iterating The Content

By setting the stream parameter, we have delayed the download and avoided taking up large chunks of memory. The headers have been downloaded but the body of the file still awaits retrieval. We can now get the data by accessing the content property or choosing to iterate over the content. Accessing the content directly would read the entire response data to memory at once. That is a scenario we want to avoid when our target file is quite large.

So we are left with the choice to iterate over the content. We can use iter_content, where the content would be read chunk by chunk. Or we can use iter_lines, where the content would be read line by line. Either way, the entire file will not be loaded into memory at once, keeping memory usage down.

Code Example

import requests

# `url` and `target_path` are assumed to be defined, as in the discussion above.
response = requests.get(url, stream=True)
handle = open(target_path, "wb")
for chunk in response.iter_content(chunk_size=512):
    if chunk:  # filter out keep-alive new chunks
        handle.write(chunk)
handle.close()

The code should be self-explanatory. We are opening the url with stream set to True. Then we are opening a file handle to the target_path (where we want to save our file). Finally, we iterate over the content, chunk by chunk, write the data to the file, and close the handle when we are done.

That’s it!

September 17, 2016 09:48 PM


Glyph Lefkowitz

Hitting The Wall

I’m an introvert.

I say that with a full-on appreciation of
just how awful
thinkpieces on “introverts” are.

However, I feel compelled to write about this today because of a certain type
of social pressure that a certain type of introvert faces. Specifically, I am
a high-energy introvert.

Cementing this piece’s place in the hallowed halls of just awful thinkpieces,
allow me to compare my mild cognitive fatigue with the plight of those
suffering from chronic illness and disability. There’s a social phenomenon
associated with many chronic illnesses,
“but you don’t LOOK sick”, where
well-meaning people will look at someone who is suffering, with no obvious
symptoms, and imply that they really ought to be able to “be normal”.

As a high-energy introvert, I frequently participate in social events. I go to
meet-ups and conferences and I engage in plenty of
public speaking. I am, in a sense,
comfortable extemporizing in front of large groups of strangers.

This all sounds like extroverted behavior, I know. But there’s a key
difference.

Let me posit two axes for personality type: on the X axis, “introvert” to
“extrovert”, and on the Y, “low energy” up to “high energy”.

The X axis describes what kinds of activities give you energy, and the Y axis
describes how large your energy reserves are for the other type.

Notice that I didn’t say which type of activity you enjoy.

Most people who would self-describe as “introverts” are in the
low-energy/introvert quadrant. They have a small amount of energy available
for social activities, which they need to frequently re-charge by doing
solitary activities. As a result of frequently running out of energy for
social activities, they don’t enjoy social activities.

Most people who would self-describe as “extroverts” are also on the
“low-energy” end of the spectrum. They have low levels of patience for
solitary activity, and need to re-charge by spending time with friends, going
to parties, etc, in order to have the mental fortitude to sit still for a while
and focus. Since they can endlessly get more energy from the company of
others, they tend to enjoy social activities quite a bit.

Therefore we have certain behaviors we expect to see from “introverts”. We
expect them to be shy, and quiet, and withdrawn. When someone who behaves this
way has to bail on a social engagement, this is expected. There’s a certain
affordance for it. If you spend a few hours with them, they may be initially
friendly but will visibly become uncomfortable and withdrawn.

This “energy” model of personality is of course an oversimplification – it’s my
personal belief that everyone needs some balance of privacy and socialization
and solitude and eventually overdoing one or the other will be bad for anyone –
but it’s a useful one.

As a high-energy introvert, my behavior often confuses people. I’ll show up
at a week’s worth of professional events, be the life of the party, go out to
dinner at all of them, and then disappear for a month. I’m not visibly shy –
quite the opposite, I’m a gregarious raconteur. In fact, I quite visibly
enjoy the company of friends. So, usually, when I try to explain that I am
quite introverted, this claim is met with (quite understandable) skepticism.

In fact, I am quite functionally what society expects of an “extrovert” – until
I hit the wall.


In endurance sports, one is said to
“hit the wall” at the point
where all the short-term energy reserves in one’s muscles are exhausted, and
there is a sudden, dramatic loss of energy. Regardless, many people enjoy
endurance sports; part of the challenge of them is properly managing your
energy.

This is true for me and social situations. I do enjoy social situations
quite a bit! But they are nevertheless quite taxing for me, and without
prolonged intermissions of solitude, eventually I get to the point where I can
no longer behave as a normal social creature without an excruciating level of
effort and anxiety.

Several years ago, I attended a prolonged social event where I hit the
wall, hard. The event itself was several hours too long for me, involved
meeting lots of strangers, and in the lead-up to it I hadn’t had a weekend to
myself for a few weeks due to work commitments and family stuff. Towards the
end I noticed I was developing a completely
flat affect, and had to
start very consciously performing even basic body language, like looking at
someone while they were talking or smiling. I’d never been so exhausted and
numb in my life; at the time I thought I was just stressed from work.

Afterwards though, I started having a lot of weird nightmares,
even during the daytime.
This concerned me, since I’d never had such a severe reaction to a social
situation, and I didn’t have good language to describe it. It was also a
little perplexing that what was effectively a nice party, the first half of
which had even been fun for me, would cause such a persistent negative reaction
after the fact. After some research, I eventually discovered that such
involuntary thoughts are
a hallmark of PTSD.

While I’ve managed to avoid this level of exhaustion before or since, this was
a real learning experience for me that the consequences of incorrectly managing
my level of social interaction can be quite severe.

I’d rather not do that again.


The reason I’m writing this, though, is not to avoid future anxiety. My
social energy reserves are quite large enough, and I now have enough
self-knowledge, that it is extremely unlikely I’d ever find myself in that
situation again.

The reason I’m writing is to help people understand that I’m not blowing them off because I don’t like them. Many times now, I’ve declined or bailed on an invitation from someone, and later heard that they felt hurt that I was passive-aggressively refusing to be friendly.

I certainly understand this reaction. After all, if you see someone at a party
and they’re clearly having a great time and chatting with everyone, but then
when you invite them to do something, they say “sorry, too much social
stuff”, that seems like a pretty passive-aggressive way to respond.

You might even still be skeptical after reading this. “Glyph, if you were
really an introvert, surely, I would have seen you looking a little shy and
withdrawn. Surely I’d see some evidence of stage fright before your talks.”

But that’s exactly the problem here: no, you wouldn’t.

At a social event, since I have lots of energy to begin with, I’ll build up a
head of steam on burning said energy that no low-energy introvert would ever
risk. If I were to run out of social-interaction-juice, I’d be in the middle
of a big crowd telling a long and elaborate story when I find myself exhausted.
If I hit the wall in that situation, I can’t feel a little awkward and make
excuses and leave; I’ll be stuck creepily faking a smile like a sociopath and
frantically looking for a way out of the conversation for an hour, as the pressure from a large crowd of people rapidly builds up months’ worth of
nightmare fuel from my spiraling energy deficit.

Given that I know that’s what’s going to happen, you won’t see me when I’m
close to that line. You won’t be at my desk when I silently sit and type
for a whole day, or on my couch when I quietly read a book for ten hours at a
time. My solitary side is, by definition, hidden.

But, if I don’t show up to your party, I promise: it’s not you, it’s me.

September 17, 2016 09:18 PM


Podcast.__init__

Episode 75 – Sandstorm.io with Asheesh Laroia

Summary

Sandstorm.io is an innovative platform that aims to make self-hosting applications easier and more maintainable for the average individual. This week we spoke with Asheesh Laroia about why running your own services is desirable, how they have made security a first priority, how Sandstorm is architected, and what the installation process looks like.

Brief Introduction

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show you can visit our site at pythonpodcast.com
  • Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project
  • We are also sponsored by Rollbar. Rollbar is a service for tracking and aggregating your application errors so that you can find and fix the bugs in your application before your users notice they exist. Use the link rollbar.com/podcastinit to get 90 days and 300,000 errors for free on their bootstrap plan.
  • Hired has also returned as a sponsor this week. If you’re looking for a job as a developer or designer then Hired will bring the opportunities to you. Sign up at hired.com/podcastinit to double your signing bonus.
  • Visit our site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch.
  • To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
  • Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
  • I would also like to mention that the organizers of PyCon Zimbabwe are looking to the global Python community for help in supporting their event. If you would like to donate the link will be in the show notes.
  • Your hosts as usual are Tobias Macey and Chris Patti
  • Today we’re interviewing Asheesh Laroia about Sandstorm.io, a project that is trying to make self-hosted applications easy and secure for everyone.
Linode Sponsor Banner

Use the promo code podcastinit20 to get a $20 credit when you sign up!

Rollbar Logo

I’m excited to tell you about a new sponsor of the show, Rollbar.

One of the frustrating things about being a developer, is dealing with errors… (sigh)

  • Relying on users to report errors
  • Digging thru log files trying to debug issues
  • A million alerts flooding your inbox ruining your day…

With Rollbar’s full-stack error monitoring, you get the context, insights and control you need to find and fix bugs faster. It’s easy to get started tracking the errors and exceptions in your stack. You can start tracking production errors and deployments in 8 minutes – or less, and Rollbar works with all major languages and frameworks, including Ruby, Python, Javascript, PHP, Node, iOS, Android and more. You can integrate Rollbar into your existing workflow such as sending error alerts to Slack or Hipchat, or automatically create new issues in Github, JIRA, Pivotal Tracker etc.

We have a special offer for Podcast.__init__ listeners. Go to rollbar.com/podcastinit, sign up, and get the Bootstrap Plan free for 90 days. That’s 300,000 errors tracked for free. Loved by developers at awesome companies like Heroku, Twilio, Kayak, Instacart, Zendesk, Twitch and more. Help support Podcast.__init__ and give Rollbar a try today. Go to rollbar.com/podcastinit

Hired Logo

On Hired, software engineers & designers can get 5+ interview requests in a week and each offer has salary and equity upfront. With full-time and contract opportunities available, users can view the offers and accept or reject them before talking to any company. Work with over 2,500 companies, from startups to large public companies, hailing from 12 major tech hubs in North America and Europe. Hired is totally free for users, and if you get a job you’ll get a $2,000 “thank you” bonus. If you use our special link to sign up, then that bonus will double to $4,000 when you accept a job. If you’re not looking for a job but know someone who is, you can refer them to Hired and get a $1,337 bonus when they accept a job.

Interview with Asheesh Laroia

  • Introductions
  • How did you get introduced to Python? – Tobias
  • Can you start by telling everyone about the Sandstorm project and how you got involved with it? – Tobias
  • What are some of the reasons that an individual would want to self-host their own applications rather than using comparable services available through third parties? – Tobias
  • How does Sandstorm try to make the experience of hosting these various applications simple and enjoyable for the broadest variety of people? – Tobias
  • What does the system architecture for Sandstorm look like? – Tobias
  • I notice that Sandstorm requires a very recent Linux kernel version. What motivated that choice and how does it affect adoption? – Chris
  • One of the notable aspects of Sandstorm is the security model that it uses. Can you explain the capability-based authorization model and how it enables Sandstorm to ensure privacy for your users? – Tobias
  • What are some of the most difficult challenges facing you in terms of software architecture and design? – Tobias
  • What is involved in setting up your own server to run Sandstorm and what kinds of resources are required for different use cases? – Tobias
  • You have a number of different applications available for users to install. What is involved in making a project compatible with the Sandstorm runtime environment? Are there any limitations in terms of languages or application architecture for people who are targeting your platform? – Tobias
  • How much of Sandstorm is written in Python and what other languages does it use? – Tobias

Keep In Touch

Picks

  • Tobias
  • Chris
  • Asheesh

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA


September 17, 2016 08:52 PM


BangPypers

Ansible Workshop – BangPypers September Meetup

The September BangPypers meetup happened at the Red Hat office on Bannerghatta Road. 31 people attended the event.

In the previous meetup Abraham presented a talk on Ansible. Many participants were interested in it, so we planned a workshop this time.

Abraham started the workshop with a brief explanation of VirtualBox, Vagrant and Ansible, and helped participants set them up.

After that he explained simple Ansible modules like ping, shell, etc., and how to run them on target machines.

Later he explained Ansible playbooks and how to configure them.

We had a lunch break of about 30 minutes. After resuming from the break, he showed a demo of deploying a Django web app. Here he used 4 machines (1 load balancer and 3 web apps) and showed how to automatically configure and orchestrate them.

Then he showed how to update all webservers with zero downtime.

Here are a few photos from the workshop.


The workshop content can be found on GitHub.

Thanks to Abraham for conducting the workshop and to Red Hat for hosting the event.

September 17, 2016 06:26 PM


End Point

Executing Custom SQL in Django Migrations

Since version 1.7, Django has natively supported database migrations similar to Rails migrations. The biggest difference fundamentally between the two is the way the migrations are created: Rails migrations are written by hand, specifying changes you want made to the database, while Django migrations are usually automatically generated to mirror the database schema in its current state.

Usually, Django’s automatic schema detection works quite nicely, but occasionally you will have to write some custom migration that Django can’t properly generate, such as a functional index in PostgreSQL.

Creating an empty migration

To create a custom migration, it’s easiest to start by generating an empty migration. In this example, it’ll be for an application called blog:

$ ./manage.py makemigrations blog --empty -n create_custom_index
Migrations for 'blog':
  0002_create_custom_index.py:

This generates a file at blog/migrations/0002_create_custom_index.py that will look something like this:

# -*- coding: utf-8 -*-
# Generated by Django 1.9.4 on 2016-09-17 17:35
from __future__ import unicode_literals

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('blog', '0001_initial'),
    ]

    operations = [
    ]

Adding Custom SQL to a Migration

The best way to run custom SQL in a migration is through the migrations.RunSQL operation. RunSQL lets you provide SQL for migrating both forwards and backwards, that is, for applying the migration and for unapplying it. In this example, the first string passed to RunSQL is the forward SQL and the second is the reverse SQL.

# -*- coding: utf-8 -*-
# Generated by Django 1.9.4 on 2016-09-17 17:35
from __future__ import unicode_literals

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('blog', '0001_initial'),
    ]

    operations = [
        migrations.RunSQL(
            "CREATE INDEX i_active_posts ON posts(id) WHERE active",
            "DROP INDEX i_active_posts"
        )
    ]

Unless you’re using Postgres for your database, you’ll need to install the sqlparse library, which allows Django to break the SQL strings into individual statements.
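
For completeness, RunSQL also accepts a list of statements for each direction (or a single string containing several statements, which is where that statement splitting comes into play). Here is a hedged sketch of a follow-up migration, not taken from the original post; the posts table and its title column are assumptions for illustration, and this time the index is an actual functional index:

# Illustrative only: a later migration using the list form of RunSQL.
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('blog', '0002_create_custom_index'),
    ]

    operations = [
        migrations.RunSQL(
            [
                "CREATE INDEX i_posts_lower_title ON posts (lower(title))",
                "ANALYZE posts",
            ],
            [
                "DROP INDEX i_posts_lower_title",
            ],
        )
    ]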

Running the Migrations

Running your migrations is easy:

$ ./manage.py migrate
Operations to perform:
  Apply all migrations: blog, sessions, auth, contenttypes, admin
Running migrations:
  Rendering model states... DONE
  Applying blog.0002_create_custom_index... OK

Unapplying migrations is also simple. Just provide the name of the app to migrate and the id of the migration you want to go to, or “zero” to reverse all migrations on that app:

$ ./manage.py migrate blog 0001
Operations to perform:
  Target specific migration: 0001_initial, from blog
Running migrations:
  Rendering model states... DONE
  Unapplying blog.0002_create_custom_index... OK

Hand-written migrations can be used for many other operations, including data migrations. Full documentation for migrations can be found in the Django documentation.
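
As a quick illustration of a data migration, here is a hedged sketch using migrations.RunPython; it is not from the original post, and the Post model with its created and active fields is assumed purely for the example:

# Illustrative only: a hand-written data migration with RunPython.
from django.db import migrations


def deactivate_old_posts(apps, schema_editor):
    # Use the historical model, not a direct import of blog.models.Post
    Post = apps.get_model('blog', 'Post')
    Post.objects.filter(created__year__lt=2015).update(active=False)


def reactivate_old_posts(apps, schema_editor):
    Post = apps.get_model('blog', 'Post')
    Post.objects.filter(created__year__lt=2015).update(active=True)


class Migration(migrations.Migration):

    dependencies = [
        ('blog', '0002_create_custom_index'),
    ]

    operations = [
        migrations.RunPython(deactivate_old_posts, reactivate_old_posts),
    ]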


(This post originally covered South migrations and was updated by Phin Jensen to illustrate the now-native Django migrations.)

September 17, 2016 03:20 PM


Anatoly Techtonik

Python Usability Bugs: subprocess.Popen executable

subprocess.Popen seems to be designed as a “swiss army knife” for managing external processes, and while that task is pretty hard to solve in a cross-platform way, the people who have contributed to it did manage to achieve it. But it still comes with some drawbacks and complications. Let’s study the one that I think is the worst from a usability point of view, because it confuses people a lot.

I’ve got a simple program that prints its own name and arguments (forgive me the Windows code, as I was debugging the issue on Windows, but this works the same way on Linux). The program is written in Go to get a single executable, because subprocess has special handling for child Python processes (another usability bug for another time).

>argi.exe 1 2 3 4
prog: E:\argi.exe
args: [1 2 3 4]

Let’s execute it with subprocess.Popen. For that I almost always look up the official documentation for the Popen prototype:

subprocess.Popen(args, bufsize=0, executable=None, stdin=None, stdout=None, stderr=None, preexec_fn=None, close_fds=False, shell=False, cwd=None, env=None, universal_newlines=False, startupinfo=None, creationflags=0)

Quite scary, right? But let’s skip the confusing parts and quickly figure something out of it (because time is scarce). Looks like this should do the trick:

import subprocess

args = "1 2 3 4".split()
p = subprocess.Popen(args, executable="argi.exe")
p.communicate()

After saving this code to “subs.py” and running it, you’d probably expect something like this:

> python subs.py
prog: E:\argi.exe
args: [1 2 3 4]

And… you won’t get this. What you get is this:

> python subs.py
prog: 1
args: [2 3 4]

And that’s kind of crazy: not only was the executable renamed, but the first argument was lost, and it turns out this is actually documented behavior. So let’s define a Python usability bug as something that is documented but not expected (by most folks who are going to read the code). The trick to get the code to do what is expected is to never use the executable argument to subprocess.Popen:

import subprocess

args = "1 2 3 4".split()
args.insert(0, "argi.exe")
p = subprocess.Popen(args)
p.communicate()

>python suby.py
prog: argi.exe
args: [1 2 3 4]

The explanation for the former “misbehavior” is that executable is a hack that allows you to rename the program when running a subprocess. It should have been named substitute, or even better altname, since it works as an alternative name to pass to the child process (rather than an alternative executable for the original name). To make subprocess.Popen even more intuitive, the args argument should have been named command.
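
To make the intended semantics concrete, here is a minimal sketch using the demo argi.exe from above (“renamed” is just an arbitrary label I made up): args[0] supplies the name the child process sees, while executable supplies the program that actually runs.

import subprocess

# args[0] is only the display name; "executable" is what actually runs.
args = ["renamed", "1", "2", "3", "4"]
p = subprocess.Popen(args, executable="argi.exe")
p.communicate()

With the demo program above, this should print something close to “prog: renamed” and “args: [1 2 3 4]”.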

From a high-level design point of view, the drawback of this function is that it *does way too much* and its arguments are not always intuitive. It takes *a lot of time to grok the official docs*, and I need to re-read them *every time*, because there are too many little but important details of Popen behavior (has anybody tried to create its state machine?), so even after 5 years I still discover new problems with it. Today I just wanted to save you some of the hours that I’ve wasted myself while debugging pymake on Windows.

That’s it for now. Bonus points for updating this post with links when I get more time / mana for it:

  • [ ] people who have contributed to it
  • [ ] it came with drawbacks
  • [ ] have anybody tried to create its state machine?
  • [ ] subprocess has special handling for child Python processes

September 17, 2016 11:46 AM


Weekly Python StackOverflow Report

(xxxvii) stackoverflow python report

These are the ten highest-rated questions on Stack Overflow last week.
Between brackets: [question score / answers count]
Build date: 2016-09-17 10:31:28 GMT


  1. What’s the closest I can get to calling a Python function using a different Python version? – [13/4]
  2. How to traverse cyclic directed graphs with modified DFS algorithm – [11/2]
  3. Difference between generators and functions returning generators – [8/3]
  4. Why does python’s datetime.datetime.strptime(‘201412’, ‘%Y%m%d’) not raise a ValueError? – [8/3]
  5. Accessing the choices passed to argument in argparser? – [7/3]
  6. Where’s the logic that returns an instance of a subclass of OSError exception class? – [7/2]
  7. TypeError: str object is not an iterator – [6/7]
  8. condensing multiple if statements in python – [6/4]
  9. How to fillna() with value 0 after calling resample? – [6/3]
  10. How to read strange csv files in Pandas? – [6/3]

September 17, 2016 10:32 AM


Nick Coghlan

The Python Packaging Ecosystem

From Development to Deployment

There have been a few recent articles reflecting on the current status of
the Python packaging ecosystem from an end user perspective, so it seems
worthwhile for me to write up my perspective, as one of the lead architects
for that ecosystem, on how I characterise the overall problem space of
software publication and distribution, where I think we are at the moment,
and where I’d like to see us go in the future.

For context, the specific articles I’m replying to are:

These are all excellent pieces considering the problem space from different
perspectives, so if you’d like to learn more about the topics I cover here,
I highly recommend reading them.

Since it heavily influences the way I think about packaging system design in
general, it’s worth stating my core design philosophy explicitly:

  • As a software consumer, I should be able to consume libraries, frameworks,
    and applications in the binary format of my choice, regardless of whether
    or not the relevant software publishers directly publish in that format
  • As a software publisher working in the Python ecosystem, I should be able to
    publish my software once, in a single source-based format, and have it be
    automatically consumable in any binary format my users care to use

This is emphatically not the way many software packaging systems work – for a
great many systems, the publication format and the consumption format are
tightly coupled, and the folks managing the publication format or the
consumption format actively seek to use it as a lever of control over a
commercial market (think operating system vendor controlled application stores,
especially for mobile devices).

While we’re unlikely to ever pursue the specific design documented in the
rest of PEP 426 (hence that PEP’s “Deferred” status), its
Development, Distribution, and Deployment of Python Software
section provides additional details on how this philosophy applies
in practice.

I’ll also note that while I now work on software supply chain management
tooling at Red Hat, that wasn’t the case when I first started actively
participating in the upstream Python packaging ecosystem
design process. Back then I was working
on Red Hat’s main
hardware integration testing system, and
growing increasingly frustrated with the level of effort involved in
integrating new Python level dependencies into Beaker’s RPM based development
and deployment model. Getting actively involved in tackling these problems on
the Python upstream side of things then led to also getting more actively
involved in addressing them on the
Red Hat downstream side.

When talking about the design of software packaging ecosystems, it’s very easy
to fall into the trap of only considering the “direct to peer developers” use
case, where the software consumer we’re attempting to reach is another developer
working in the same problem domain that we are, using a similar set of
development tools. Common examples of this include:

  • Linux distro developers publishing software for use by other contributors to
    the same Linux distro ecosystem
  • Web service developers publishing software for use by other web service
    developers
  • Data scientists publishing software for use by other data scientists

In these more constrained contexts, you can frequently get away with using a
single toolchain for both publication and consumption:

  • Linux: just use the system package manager for the relevant distro
  • Web services: just use the Python Packaging Authority’s twine for publication
    and pip for consumption
  • Data science: just use conda for everything

For newer languages that start in one particular domain with a preferred
package manager and expand outwards from there, the apparent simplicity arising
from this homogeneity of use cases may frequently be attributed as an essential
property of the design of the package manager, but that perception of inherent
simplicity will typically fade if the language is able to successfully expand
beyond the original niche its default package manager was designed to handle.

In the case of Python, for example, distutils was designed as a consistent
build interface for Linux distro package management, setuptools for plugin
management in the Open Source Applications Foundation’s Chandler project, pip
for dependency management in web service development, and conda for local
language-independent environment management in data science.
distutils and setuptools haven’t fared especially well from a usability
perspective when pushed beyond their original design parameters (hence the
current efforts to make it easier to use full-fledged build systems like
Scons and Meson as an alternative when publishing Python packages), while pip
and conda both seem to be doing a better job of accommodating increases in
their scope of application.

This history helps illustrate that where things really have the potential to
get complicated (even beyond the inherent challenges of domain-specific
software distribution) is when you start needing to cross domain boundaries.
For example, as the lead maintainer of contextlib in the Python
standard library, I’m also the maintainer of the contextlib2 backport
project on PyPI. That’s not a domain specific utility – folks may need it
regardless of whether they’re using a self-built Python runtime, a pre-built
Windows or Mac OS X binary they downloaded from python.org, a pre-built
binary from a Linux distribution, a CPython runtime from some other
redistributor (homebrew, pyenv, Enthought Canopy, ActiveState,
Continuum Analytics, AWS Lambda, Azure Machine Learning, etc), or perhaps even
a different Python runtime entirely (PyPy, PyPy.js, Jython, IronPython,
MicroPython, VOC, Batavia, etc).

Fortunately for me, I don’t need to worry about all that complexity in the
wider ecosystem when I’m specifically wearing my contextlib2 maintainer
hat – I just publish an sdist and a universal wheel file to PyPI, and the rest
of the ecosystem has everything it needs to take care of redistribution
and end user consumption without any further input from me.

However, contextlib2 is a pure Python project that only depends on the
standard library, so it’s pretty much the simplest possible case from a
tooling perspective (the only reason I needed to upgrade from distutils to
setuptools was so I could publish my own wheel files, and the only reason I
haven’t switched to using the much simpler pure-Python-only flit instead of
either of them is that that doesn’t yet easily support publishing backwards
compatible setup.py based sdists).

This means that things get significantly more complex once we start wanting to
use and depend on components written in languages other than Python, so that’s
the broader context I’ll consider next.

When it comes to handling the software distribution problem in general, there
are two main ways of approaching it:

  • design a plugin management system that doesn’t concern itself with the
    management of the application framework that runs the plugins
  • design a platform component manager that not only manages the plugins
    themselves, but also the application frameworks that run them

This “plugin manager or platform component manager?” question shows up over and
over again in software distribution architecture designs, but the case of most
relevance to Python developers is in the contrasting approaches that pip and
conda have adopted to handling the problem of external dependencies for Python
projects:

  • pip is a plugin manager for Python runtimes. Once you have a Python runtime
    (any Python runtime), pip can help you add pieces to it. However, by design,
    it won’t help you manage the underlying Python runtime (just as it wouldn’t
    make any sense to try to install Mozilla Firefox as a Firefox Add-On, or
    Google Chrome as a Chrome Extension)
  • conda, by contrast, is a component manager for a cross-platform platform
    that provides its own Python runtimes (as well as runtimes for other
    languages). This means that you can get pre-integrated components, rather
    than having to do your own integration between plugins obtained via pip and
    language runtimes obtained via other means

What this means is that pip, on its own, is not in any way a direct
alternative to conda. To get comparable capabilities to those offered by conda,
you have to add in a mechanism for obtaining the underlying language runtimes,
which means the alternatives are combinations like:

  • apt-get + pip
  • dnf + pip
  • yum + pip
  • pyenv + pip
  • homebrew (Mac OS X) + pip
  • python.org Windows installer + pip
  • Enthought Canopy
  • ActiveState’s Python runtime + PyPM

This is the main reason why “just use conda” is excellent advice to any
prospective Pythonista that isn’t already using one of the platform component
managers mentioned above: giving that answer replaces an otherwise operating
system dependent or Python specific answer to the runtime management problem
with a cross-platform and (at least somewhat) language neutral one.

It’s an especially good answer for Windows users, as Chocolatey/OneGet/Windows
Package Management isn’t remotely comparable to pyenv or homebrew at this point
in time, other runtime managers don’t work on Windows, and getting folks
bootstrapped with MinGW, Cygwin or the new (still experimental) Windows
Subsystem for Linux is just another hurdle to place between them and whatever
goal they’re learning Python for in the first place.

However, conda’s pre-integration based approach to tackling the external
dependency problem is also why “just use conda for everything” isn’t a
sufficient answer for the Python software ecosystem as a whole.

If you’re working on an operating system component for Fedora, Debian, or any
other distro, you actually want to be using the system provided Python
runtime, and hence need to be able to readily convert your upstream Python
dependencies into policy compliant system dependencies.

Similarly, if you’re wanting to support folks that deploy to a preconfigured
Python environment in services like AWS Lambda, Azure Cloud Functions, Heroku,
OpenShift or Cloud Foundry, or that use alternative Python runtimes like PyPy
or MicroPython, then you need a publication technology that doesn’t tightly
couple your releases to a specific version of the underlying language runtime.

As a result, pip and conda end up existing at slightly different points in the
system integration pipeline:

  • Publishing and consuming Python software with pip is a matter of “bring your
    own Python runtime”. This has the benefit that you can readily bring your
    own runtime (and manage it using whichever tools make sense for your use
    case), but also has the downside that you must supply your own runtime
    (which can sometimes prove to be a significant barrier to entry for new
    Python users, as well as being a pain for cross-platform environment
    management).
  • Like Linux system package managers before it, conda takes away the
    requirement to supply your own Python runtime by providing one for you.
    This is great if you don’t have any particular preference as to which
    runtime you want to use, but if you do need to use a different runtime
    for some reason, you’re likely to end up fighting against the tooling, rather
    than having it help you. (If you’re tempted to answer “Just add another
    interpreter to the pre-integrated set!” here, keep in mind that doing so
    without the aid of a runtime independent plugin manager like pip acts as a
    multiplier on the platform level integration testing needed, which can be a
    significant cost even when it’s automated)

In case it isn’t already clear from the above, I’m largely happy with the
respective niches that pip and conda are carving out for themselves as a
plugin manager for Python runtimes and as a cross-platform platform focused
on (but not limited to) data analysis use cases.

However, there’s still plenty of scope to improve the effectiveness of the
collaboration between the upstream Python Packaging Authority and downstream
Python redistributors, as well as to reduce barriers to entry for participation
in the ecosystem in general, so I’ll go over some of the key areas I see for
potential improvement.

Sustainability and the bystander effect

It’s not a secret that the core PyPA infrastructure (PyPI, pip, twine,
setuptools) is
nowhere near as well-funded
as you might expect given its criticality to the operations of some truly
enormous organisations.

The biggest impact of this is that even when volunteers show up ready and
willing to work, there may not be anybody in a position to effectively wrangle
those volunteers, and help keep them collaborating effectively and moving in a
productive direction.

To secure long term sustainability for the core Python packaging infrastructure,
we’re only talking amounts on the order of a few hundred thousand dollars a
year – enough to cover some dedicated operations and publisher support staff for
PyPI (freeing up the volunteers currently handling those tasks to help work on
ecosystem improvements), as well as to fund targeted development directed at
some of the other problems described below.

However, rather than being a true
“tragedy of the commons”,
I personally chalk this situation up to a different human cognitive bias: the
bystander effect.

The reason I think that is that we have so many potential sources of the
necessary funding that even folks that agree there’s a problem that needs to be
solved are assuming that someone else will take care of it, without actually
checking whether or not that assumption is entirely valid.

The primary responsibility for correcting that oversight falls squarely on the
Python Software Foundation, which is why the Packaging Working Group was
formed in order to investigate possible sources of additional funding, as well
as to determine how any such funding can be spent most effectively.

However, a secondary responsibility also falls on customers and staff of
commercial Python redistributors, as this is exactly the kind of ecosystem
level risk that commercial redistributors are being paid to manage on behalf of
their customers, and they’re currently not handling this particular situation
very well. Accordingly, anyone that’s actually paying for CPython, pip, and
related tools (either directly or as a component of a larger offering), and
expecting them to be supported properly as a result, really needs to be asking
some very pointed questions of their suppliers right about now. (Here’s a sample
question: “We pay you X dollars a year, and the upstream Python ecosystem is
one of the things we expect you to support with that revenue. How much of what
we pay you goes towards maintenance of the upstream Python packaging
infrastructure that we rely on every day?”).

One key point to note about the current situation is that as a 501(c)(3) public
interest charity, any work the PSF funds will be directed towards better
fulfilling that public interest mission, and that means focusing primarily on
the needs of educators and non-profit organisations, rather than those of
private for-profit entities.

Commercial redistributors are thus far better positioned to properly
represent their customers’ interests in areas where their priorities may
diverge from those of the wider community (closing the “insider threat”
loophole in PyPI’s current security model is a particular case that comes to
mind – see Making PyPI security independent of SSL/TLS).

Migrating PyPI to pypi.org

An instance of the new PyPI implementation (Warehouse) is up and running at
https://pypi.org/ and connected directly to the
production PyPI database, so folks can already explicitly opt-in to using it
over the legacy implementation if they prefer to do so.

However, there’s still a non-trivial amount of design, development and QA work
needed on the new version before all existing traffic can be transparently
switched over to using it.

Getting at least this step appropriately funded and a clear project management
plan in place is the main current focus of the PSF’s Packaging Working Group.

Making the presence of a compiler on end user systems optional

Between the wheel format and the manylinux1 usefully-distro-independent
ABI definition, this is largely handled now, with conda available as an
option to handle the relatively small number of cases that are still a problem
for pip.

The main unsolved problem is to allow projects to properly express the
constraints they place on target environments so that issues can be detected
at install time or repackaging time, rather than only being detected as
runtime failures. Such a feature will also greatly expand the ability to
correctly generate platform level dependencies when converting Python
projects to downstream package formats like those used by conda and Linux
system package managers.

Making PyPI security independent of SSL/TLS

PyPI currently relies entirely on SSL/TLS to protect the integrity of the link
between software publishers and PyPI, and between PyPI and software consumers.
The only protections against insider threats from within the PyPI
administration team are ad hoc usage of GPG artifact signing by some projects,
personal vetting of new team members by existing team members and 3rd party
checks against previously published artifact hashes unexpectedly changing.

A credible design for end-to-end package signing that adequately accounts for
the significant usability issues that can arise around publisher and consumer
key management has been available for almost 3 years at this point (see
Surviving a Compromise of PyPI
and
Surviving a Compromise of PyPI: the Maximum Security Edition).

However, implementing that solution has been gated not only on being able to
first retire the legacy infrastructure, but also on the PyPI administrators being
able to credibly commit to the key management obligations of operating the
signing system, as well as to ensuring that the system-as-implemented actually
provides the security guarantees of the system-as-designed.

Accordingly, this isn’t a project that can realistically be pursued until the
underlying sustainability problems have been suitably addressed.

Automating wheel creation

While redistributors will generally take care of converting upstream Python
packages into their own preferred formats, the Python-specific wheel format
is currently a case where it is left up to publishers to decide whether or
not to create them, and if they do decide to create them, how to automate that
process.

Having PyPI take care of this process automatically is an obviously desirable
feature, but it’s also an incredibly expensive one to build and operate.

Thus, it currently makes sense to defer this cost to individual projects, as
there are quite a few commercial continuous integration and continuous
deployment service providers willing to offer free accounts to open source
projects, and these can also be used for the task of producing release
artifacts. Projects also remain free to only publish source artifacts, relying
on pip’s implicit wheel creation and caching and the appropriate use of
private PyPI mirrors and caches to meet the needs of end users.

For downstream platform communities already offering shared build
infrastructure to their members (such as Linux distributions and conda-forge),
it may make sense to offer Python wheel generation as a supported output option
for cross-platform development use cases, in addition to the platform’s native
binary packaging format.

September 17, 2016 03:46 AM

September 16, 2016


pythonwise

Simple Object Pools

Sometimes we need object pools to limit the number of resources consumed. The most common example is database connections.

In Go we sometimes use a buffered channel as a simple object pool.

In Python, we can do something similar with a Queue. Python’s context managers make the resource handling automatic, so clients don’t need to remember to return the object.
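
Here is a minimal Python 3 sketch of that idea (not the exact code behind the output below, so the numbers will differ slightly): a fixed-size pool backed by queue.Queue, wrapped in a context manager so the resource is always returned.

from contextlib import contextmanager
from queue import Queue
from threading import Thread

pool = Queue()
for i in range(3):           # three shared "resources"
    pool.put(i)


@contextmanager
def borrow():
    resource = pool.get()    # blocks until a resource is free
    try:
        yield resource
    finally:
        pool.put(resource)   # returned even if the worker raises


def worker(n):
    with borrow() as resource:
        print('worker %d got resource %d' % (n, resource))


threads = [Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()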

Here’s the output of both programs:


$ go run pool.go
worker 7 got resource 0
worker 0 got resource 2
worker 3 got resource 1
worker 8 got resource 2
worker 1 got resource 0
worker 9 got resource 1
worker 5 got resource 1
worker 4 got resource 0
worker 2 got resource 2
worker 6 got resource 1

$ python pool.py
worker 5 got resource 1
worker 8 got resource 2
worker 1 got resource 3
worker 4 got resource 1
worker 0 got resource 2
worker 7 got resource 3
worker 6 got resource 1
worker 3 got resource 2
worker 9 got resource 3
worker 2 got resource 1

September 16, 2016 04:49 PM


Enthought

Canopy Data Import Tool: New Updates

In May of 2016 we released the Canopy Data Import Tool, a significant new feature of our Canopy graphical analysis environment software. With the Data Import Tool, users can now quickly and easily import CSVs and other structured text files into Pandas DataFrames through a graphical interface, manipulate the data, and create reusable Python scripts to speed future data wrangling.

Watch a 2-minute demo video to see how the Canopy Data Import Tool works:

With the latest version of the Data Import Tool released this month (v. 1.0.4), we’ve added new capabilities and enhancements, including:

  1. The ability to select and import a specific table from among multiple tables on a webpage,
  2. Intelligent alerts regarding the saved state of exported Python code, and
  3. Unlimited file sizes supported for import.

Download Canopy and start a free 7 day trial of the data import tool

New: Choosing from multiple tables on a webpage

Example of page with multiple tables for selection

The latest release of the Canopy Data Import Tool supports the selection of a specific table from a webpage for import, such as this Wikipedia page

In addition to CSVs and structured text files, the Canopy Data Import Tool (the Tool) provides the ability to load tables from a webpage. If the webpage contains multiple tables, by default the Tool loads the first table.

With this release, we provide the user with the ability to choose from multiple tables to import using a scrollable index parameter to select the table of interest for import.

Example: loading and working with tables from a Wikipedia page

For example, let’s try to load a table from the Demography of the UK wiki page using the Tool. In total, there are 10 tables on that wiki page.

  • As you can see in the screenshot below, the Tool initially loads the first table on the wiki page.
  • However, we are interested in loading the table ‘Vital statistics since 1960’, which is the fifth table on the page. (Note that indexing starts at 0; for a quick history lesson on why Python uses zero-based indexing, see Guido van Rossum’s explanation here.)
  • After the initial read-in, we can click on the ‘Table index on page’ scroll bar, choose ‘4’ and click on ‘Refresh Data’ to load the table of interest in the Data Import Tool.

See how the Canopy Data Import Tool loads a table from a webpage and prepares the data for manipulation and interaction:

The Data Import Tool allows you to select a specific table from a webpage where multiple tables are present, with a simple drop-down menu. Once you’ve selected your table, you can readily toggle between 3 views: the Pandas DataFrame generated by the Tool, the raw data, and the corresponding auto-generated Python code. Subsequently, you can export the DataFrame to the IPython console for plotting and further analysis.

  • Further, as you can see, the first row contains column names and the first column looks like an index for the Data Frame. Therefore, you can select the ‘First row is column names’ checkbox and again click on ‘Refresh Data’ to prompt the Tool to re-read the table but, this time, use the data in the first row as column names. Then, we can right-click on the first column and select the ‘Set as Index’ option to make column 0 the index of the DataFrame.
  • You can toggle between the DataFrame, Raw Data and Python Code tabs in the Tool, to peek at the raw data being loaded by the Tool and the corresponding Python code auto-generated by the Tool.
  • Finally, you can click on the ‘Use DataFrame’ button, in the bottom right, to send the DataFrame to the IPython kernel in the Canopy User Environment, for plotting and further analysis.

New: Keeping track of exported Python scripts

The Tool generates Python commands for all operations performed by the user and provides the user with the ability to save the generated Python script. With this new update, the Tool keeps track of the saved and current states of the generated Python script and intelligently alerts the user if he/she clicks on the ‘Use DataFrame’ button without saving changes in the Python script.

New: Unlimited file sizes supported for import

In the initial release, we chose to limit the file sizes that can be imported using the Tool to 70 MB to ensure optimal performance. With this release, we removed that restriction and allow files of any size to be loaded with the Tool. For files over 70 MB, we now warn the user that interaction, manipulation, and operations on the imported DataFrame might be slower than normal, and let them choose whether to continue or to begin with a smaller subset of the data to develop a script that can then be applied to the larger data set.

Additions and Fixes

Along with the feature additions discussed above, based on continued user feedback, we implemented a number of UI/UX improvements and bug fixes in this release. For a complete list of changes introduced in version 1.0.4 of the Data Import Tool, please refer to the Release Notes page in the Tool’s documentation. If you have any feedback regarding the Data Import Tool, we’d love to hear from you at canopy.support@enthought.com.

Additional resources:

Download Canopy and start a free 7 day trial of the data import tool

See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging:

September 16, 2016 04:47 PM


CubicWeb

Monitor all the things! … and early too!

Following the “release early, release often” mantra, I thought it
might be a good idea to apply it to monitoring on one of our client
projects. So right from the demo stage, where we deliver a new version
every few weeks (and sometimes every few days), we set up some
monitoring.

https://www.cubicweb.org/file/15338085/raw/66511658.jpg

Monitoring performance

The project is an application built with the CubicWeb platform, with
some ElasticSearch for indexing and searching. As with any complex
stack, there are a great number of places where one could monitor
performance metrics.

https://www.cubicweb.org/file/15338628/raw/Screenshot_2016-09-16_12-19-21.png

Here are a few things we have decided to monitor, and with what tools.

Monitoring CubicWeb

To monitor our running Python code, we have decided to use statsd, since it is already built into
CubicWeb’s core. Out of the box, you can configure a
statsd server address in your all-in-one.conf configuration. That will
send out some timing statistics about some core functions.

The statsd server (there are numerous implementations; we use a simple
one: python-pystatsd) gets the raw metrics and outputs
them to carbon, which
stores the time series data in whisper files (these can be
swapped out for a different technology if need be).

https://www.cubicweb.org/file/15338392/raw/Screenshot_2016-09-16_11-56-44.png

If we are curious about a particular function or view that might be
taking too long to generate or slowing down the user experience, we can
just add the @statsd_timeit
decorator there. Done. It’s monitored.

statsd monitoring is fire-and-forget, UDP-based monitoring, so it
should not have any impact on the performance of what you are
monitoring.
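
As an illustration only (this is not CubicWeb’s implementation), a statsd timing decorator can be as small as a UDP socket and a wrapper; the host, port and metric naming below are assumptions:

import socket
import time
from functools import wraps

STATSD_ADDR = ('127.0.0.1', 8125)   # assumed statsd host and port
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def timeit(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.time() - start) * 1000
            # statsd timing metric: "<name>:<value>|ms"; lost packets are fine
            metric = '%s:%f|ms' % (func.__name__, elapsed_ms)
            _sock.sendto(metric.encode('ascii'), STATSD_ADDR)
    return wrapper


@timeit
def some_view():
    time.sleep(0.1)   # stand-in for real work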

Monitoring Apache

Simply enough, we reuse the statsd approach by plugging in an Apache
module
to time the HTTP responses sent back by Apache. With nginx
and varnish, this is also really easy.

https://www.cubicweb.org/file/15338407/raw/Screenshot_2016-09-16_11-56-54.png

One of the nice things about this part is that we can then get
graphs of errors, since we differentiate 200-type (OK) status codes
from 500-type (error) status codes.

Monitoring ElasticSearch

ElasticSearch exposes metrics at its GET /_stats endpoint; the
same goes for individual nodes, individual indices and even the cluster
level. Some popular tools can be installed through the ElasticSearch
plugin system or with Kibana (which has a plugin system too).

We decided on a different approach that fitted well with our other
tools (and demonstrates their flexibility!) : pull stats out of
ElasticSearch with SaltStack,
push them to Carbon, pull them out with Graphite and display
them in Grafana (next to our other metrics).

https://www.cubicweb.org/file/15338399/raw/Screenshot_2016-09-16_11-56-34.png

On the SaltStack side, we wrote a small execution module (elasticsearch.py):

import requests

def stats():
    return requests.get('http://localhost:9200/_stats').json()

This gets shipped using the custom execution modules mechanism
(_modules and saltutil.sync_modules), and is executed every minute
(or more frequently) by the salt scheduler. The
resulting dictionary is fed to the carbon returner, which is configured
to talk to a carbon server somewhere nearby.

# salt demohost elasticsearch.stats
[snip]
  { "indextime_inmillis" : 30,
[snip]

Monitoring web metrics

To evaluate parts of the performance of a web page we can look at some
metrics such as the number of assets the browser will need to
download, the size of the assets (js, css, images, etc) and even
things such as the number of subdomains used to deliver assets. You
can take a look at such metrics in most developer tools available in
the browser, but we want to graph this over time. A nice tool for this
is sitespeed.io (written in javascript
with phantomjs). Out of the box, it has a graphite outputter so
we just have to add --graphiteHost FQDN. sitespeed.io even
recommends using grafana to visualize the
results and publishes some example dashboards that can be adapted to
your needs.

https://www.cubicweb.org/file/15338109/raw/sitespeed-logo-2c.png

The sitespeed.io command is configured and run by salt using pillars
and its scheduler.

We will have to take a look at using their jenkins plugin with our
jenkins continuous integration instance.

Monitoring crashes / errors / bugs

Applications will have bugs (in particular when released often to get
a client to validate some design choices early). Level 0 is having
your client call you up saying the application has crashed. The
next level is watching some log somewhere to see those errors pop
up. The next level is centralised logs on which you can monitor the
numerous pieces of your application (rsyslog over UDP helps here;
graylog might be a good solution for
visualisation).

https://www.cubicweb.org/file/15338139/raw/Screenshot_2016-09-16_11-30-53.png

Things start getting useful and usable when your bugs get
reported with some rich context. That’s where Sentry comes in. It’s free software developed on GitHub (although the website does not
really show that) and it is written in Python, so it was a good match
for our culture. And it is pretty awesome too.

We plug Sentry into our WSGI pipeline (thanks to cubicweb-pyramid) by installing
and configuring the sentry cube: cubicweb-sentry. This will catch
bugs with rich context and provide us with vital information about what the user
was doing when the crash occurred.

This also helps sharing bug information within a team.

The sentry cube reports on errors raised when using the web
application, but can also catch errors from
maintenance or import commands (ccplugins in CubicWeb). In this
particular case, a lot of importing is being done and Sentry can
detect and help us triage the import errors, with context on which
files are failing.

Monitoring usage / client side

This part is a bit neglected for the moment. Client side, we can use
Javascript to monitor usage. Some basic metrics can come from piwik, which is usually used for audience
statistics. To get more precise statistics we’ve been told Boomerang has an interesting
approach, enabling a closer look at how fast a page was displayed
client side, how much time was spent on DNS, etc.

On the client side, we’re also looking at two features of the Sentry
project : the raven-js
client which reports Javascript errors directly from the browser to
the Sentry server, and the user feedback form which captures some
context when something goes wrong or a user/client wants to report
that something should be changed on a given page.

Load testing – coverage

To wrap up, we also often generate traffic to catch some bugs and
performance metrics automatically:

  • wget --mirror $URL
  • linkchecker $URL
  • for search_term in $(cat corpus); do wget $URL/$search_term; done
  • wapiti $URL --scope page
  • nikto $URL

Then watch the graphs and the errors in Sentry… Fix them. Restart.

Graphing it in Grafana

We’ve spent little time on the dashboard yet since we’re concentrating on collecting the metrics for now. But here is a glimpse of the “work in progress” dashboard, which combines various data sources and various metrics on the same screen and the same time scale.

https://www.cubicweb.org/file/15338648/raw/Screenshot_2016-09-13_09-41-45.png

Further plans

  • internal health checks: we’re taking a look at python-hospital and healthz: Stop
    reverse engineering applications and start monitoring from the
    inside (Monitorama)
    (the idea is to
    distinguish between “the app is running” and “the app is serving its
    purpose”), and pyramid_health
  • graph the number of Sentry errors and the number of types of errors:
    the sentry API should be able to give us this information. Feed it to
    Salt and Carbon.
  • set up some alerting: upcoming versions of Grafana will support that, or we can use elastalert
  • set up “release version X” events in Graphite that are displayed in
    Grafana, maybe with some manual command or a postcreate command when
    using docker-compose up?
  • make it easier for devs to have this kind of setup. Using this suite
    of tools in development might sometimes be overkill, but it can be
    useful.

September 16, 2016 11:34 AM

Future of Retail:  5 Quick Insights From Top European Retailers – Hortonworks

$
0
0

Feed: Hortonworks Blog – Hortonworks.
Author: Eric Thorsen.

BlogImage

Last week I had a unique opportunity to present to a group of C-level retail industry leaders. Here are five stories I heard that you might find interesting.

These are leaders in Merchandising, Marketing, Infrastructure and IT in top European companies.  The common link was dinner and retail.

I spoke briefly about my experience in retail and adoption of open source big data, such as Hadoop. What was fascinating was hearing the stories about technology and how it impacts their organizations.

At the highest level, my first insight was that they all had a common goal – creative use of technology to drive key business initiatives.

VR Retail: Wave For Bacon

One architect from a well-known restaurant chain talked about the ordering process. Traditional counter sales and drive-through ordering are now changing in fascinating ways. Eventually, he believes, they will use kiosks to place custom orders, and even let us wave our hands in virtual reality.

Remember Tom Cruise in Minority Report “directing” his report like an orchestra conductor? Consider reaching out to “build” your sandwich in virtual reality, and the final product is delivered to your table! What a fascinating way to get a personalized order in a “hands-on” fashion.

We also discussed biometric identification. Who needs usernames and passwords if your biometric fingerprint is unique and unhackable? By simply providing a view of your fingerprint, your past orders can be displayed, and you can easily repeat orders or get other personalized treatment.

Embedded Wearables … in The Skin!

IoT has become the trending acronym, impacting everything from vehicles and telemetry to plant automation and tracking human behavior via wearables. The concept of wearables providing feedback about our bodies is a brilliant way to respond to trending issues.

Apparently the latest wearable tech is capable of predicting a heart attack three hours in advance of it happening. This has incredible value to those at risk, and can be used by insurance companies, physicians, and of course patients to track their health and help maintain good practices of exercise and healthy eating.

Wearable tech can also detect alcohol in the bloodstream and determine when it is safe to drive, and send haptic feedback with vibrations to the skin, indicating when to relax, reduce stress, and exercise. This fascinating tech was worn as an arm band, but can actually be embedded subcutaneously, simply slipped beneath the skin. A poll was taken to see who was interested in pursuing the procedure. We have a candidate and look forward to reports and updates afterwards!

IT Inertia Restricts But Doesn’t Stop Innovation

I was inspired by the concept of “Pioneers, Settlers, and Town Planners” referenced in this blog. As tech is created and consumed, we move from the forward-thinking ‘pioneers’ to the ‘settlers’ who arrange new tech for their business, and then to the ‘town planners’ who in turn create governance structures and operational guidelines.

Sometimes IT is seen as a force of inertia that can reduce momentum. One interesting part of our discussion addressed when delays are used to manage deployment without risk. This was balanced with excitement about new tech and how the value it delivers can fuel adoption and business enablement.

Open Hadoop and The Competing Appliances

From an analytics perspective there are now many appliances on the market. Memory-based, columnar, and modified-schema designs have all recently been introduced in order to provide rapid-response, sub-second performance. Teradata and Netezza are just two companies doing this. When combined with Oracle Exadata, SAP HANA, and IBM BLU, there are now many commercial offerings.

As much of the conversation was focused on technical collaboration, we spent a good bit of time talking about Spark, Hadoop, and the importance of open source and leveraging the power of the community. A quick back-of-the-napkin architectural design showed how any of these in-memory appliances could easily co-exist with, and actually benefit from, a Hadoop distribution connected to the appliance in order to “right-size the spend, and protect your data.”

Open Source is a Positive-Sum game

As I was thinking about these conversations, I noticed a tweet about the ‘positive-sum game’ of open source. See this article on ‘Everyone Wins With Open Source Software’.

The net of this is that, rather than the zero-sum game of proprietary software, open source provides a positive sum for retail, and everyone wins. It’s a beneficial way to focus on mutual benefit and common interest. By adding new open source data platforms, organizations can participate as much as they wish in shaping the future direction, leverage the community to drive success and accelerate innovation, and collaborate with existing systems to drive business value.

As part of my role managing Industry Solutions for Hortonworks, I see this frequently with our customers. Such an approach can dramatically impact any company. As retailers struggle to engage Millennials and compete with Amazon, this creative approach to tech via open source is even more appropriate. Over time the organization evolves to become more agile, nimble, and responsive to the true business goals.

The stories I heard this week were fascinating, and show a wide variety of technology trends, coupled with a conservative approach to deployment. There is no question that tech can assist the business in creative ways. I would love to hear your opinions on where you are going.

Comment below or connect online to continue the discussion!

Adventist Health Cures Aging Legacy IT Systems with Oracle Cloud

$
0
0

Feed: All Oracle Press Releases.
Adventist Health, a faith-based, nonprofit integrated health system, has selected Oracle Applications Cloud as part of an end-to-end Cloud deployment to modernize its IT systems and transform the performance of key business functions including HR and Finance. The move to Oracle Applications Cloud will enable Adventist Health to support its strategic growth initiatives, simplify its technology infrastructure and lower costs.
An innovator for more than 150 years, with 20 hospitals, over 260 clinics and a workforce of 32,700, Adventist Health was struggling with aging legacy technology that was built for a different era and unable to meet the dynamic healthcare challenges of today. Following a competitive review that included Workday, Infor, and SAP, the organization selected Oracle Applications Cloud, including Oracle Enterprise Resource Planning (ERP) Cloud, Oracle Human Capital Management (HCM) Cloud, Oracle Analytics Cloud, and Oracle Enterprise Performance Management Cloud on the strength of its integrated breadth and depth of offerings, embedded analytics and the modern, easy-to-use interface of its applications.
“We wanted to modernize our systems by moving to the Cloud, but we also wanted a single, unified solution that comprehensively addressed a wide range of our business areas, with room to grow. Oracle met all these needs,” said Chip Dickinson, vice president for business solutions, Adventist Health. “With the Oracle Applications Cloud, we will have the modern tools that will enable us to best serve our workforce, our business operations, and most importantly, our patients and communities.”
Adventist Health will take advantage of Oracle ERP Cloud’s complete suite of capabilities across financials, procurement, and project portfolio management to help optimize its hospital and clinic network, and drive operational efficiencies while reducing costs. The solution’s available physical, data, and functional security, along with proven scalability, help ensure privacy and performance. Critically, Oracle ERP Cloud’s seamless integration with Oracle HCM Cloud will allow Adventist Health to tightly align talent requirements and investments to strategic growth plans.
With Oracle HCM Cloud, Adventist Health will be able to build a global common model for HR processes, deliver streamlined processes that reduce manual and duplicative efforts, and ensure tight HR process integration to drive complete and impactful reporting, analytics and payroll on a centralized system. Oracle HCM Cloud provides this through an enhanced user experience via a simple, scalable and intuitive design with mobile and self-service capabilities.
“We are proud to partner with Adventist Health to help them best manage and connect with staff, whether it’s through performance management, strategically aligning competencies to positions, or by providing deeper insight into organizational structuring,” said Gretchen Alarcon, group vice president of HCM Product Strategy for Oracle. “Oracle’s single unified Cloud shares data across business processes, allowing organizational leaders to keep employees engaged to best serve patient needs.”
With the Oracle Cloud, Oracle delivers the industry’s broadest suite of enterprise-grade Cloud services, including Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS), and Data as a Service (DaaS).
For additional information, visit Adventist Health, Oracle Cloud and connect with Oracle Cloud on Facebook and Twitter.

Microsoft Azure Germany now available via first-of-its-kind cloud for Europe

$
0
0

Feed: Microsoft Azure Blog.
Author: Tom Keane.

Today, Microsoft Azure is generally available from the new Microsoft Cloud Germany, a first-of-its-kind model in Europe developed in response to customer needs. It represents a major accomplishment for our Azure team.

The Microsoft Cloud Germany provides a differentiated option to the Microsoft Cloud services already available across Europe, creating increased opportunities for innovation and economic growth for highly regulated partners and customers in Germany, the European Union (EU) and the European Free Trade Association (EFTA).

Customer data in these new datacenters, in Magdeburg and Frankfurt, is managed under the control of a data trustee, T-Systems International, an independent German company and subsidiary of Deutsche Telekom. Microsoft’s commercial cloud services in these datacenters adhere to German data handling regulations and give customers additional choices of how and where data is processed.

With Azure available in Germany, Microsoft now has announced 34 Azure regions, and Azure is available in 30 regions around the world — more than any other major cloud provider. Our global cloud is backed by billions of dollars invested in building a highly secure, scalable, available and sustainable cloud infrastructure on which customers can rely. 

Built on Microsoft’s Trusted Cloud principles of security, privacy, compliance and transparency, the Microsoft Cloud Germany brings data residency, in transit and at rest in Germany, and data replication across German datacenters for business continuity. Azure Germany offers a comprehensive set of cloud computing solutions providing customers with the ability to transition to the cloud on their terms through services available today.

  • For businesses, including automotive, healthcare and construction, that rely on SAP enterprise applications, SAP HANA is now certified to run in production on Azure, which will simplify infrastructure management, improve time to market and lower costs. Specifically, customers and partners can now take advantage of storing and processing their most sensitive data.
  • Addressing the global scale of IoT while ensuring data resides in-country, Azure IoT Suite enables businesses, including the robust industrial and manufacturing sector in Germany, to adopt the latest cloud and IoT solutions. Azure IoT Suite enables enterprises to quickly get started connecting their devices and assets, uncovering actionable intelligence and ultimately modernizing their business.
  • With Industry 4.0-compatible integration of OPC Unified Architecture into Azure IoT Suite, customers and partners can connect their existing machines to Azure for sending telemetry data for analysis to the cloud and for sending commands to their machines from the cloud (i.e. control them from anywhere in the world) without making any changes to their machines or infrastructure, including firewall settings.
  • Microsoft, and particularly Azure, has been a significant and growing contributor to open source projects supporting numerous open source programming models, libraries and Linux distributions. Startups, independent software vendors and partners can take advantage of a robust open source ecosystem including Linux environments, Web/LAMP implementations and e-commerce PaaS solutions from partners.
  • Furthermore, with the open source .NET Standard reference stack and sample applications that Microsoft has recently contributed to the OPC Foundation’s GitHub, customers and partners can quickly create cross-platform OPC UA applications and save money maintaining them; these applications easily connect to the cloud via the OPC Publisher samples available for .NET, .NET Standard, Java and ANSI-C.
  • Azure ExpressRoute provides enterprise customers with the option of private connectivity to our German cloud. It offers greater reliability, faster speeds, lower latencies and more predictable performance than typical internet connections and is delivered in partnership with a number of the leading network service providers including Colt Telekom, e-Shelter, Equinix, Interxion and T-Systems International.

The Microsoft Cloud Germany is our response to the growing demand for Microsoft cloud services in Germany and across Europe. Customers in the EU and EFTA can continue to use Microsoft cloud options as they do today, or, for those who want the option, they’re able to use the services from German datacenters.

Read more about customers choosing the Microsoft Cloud Germany at the Microsoft News Centre Europe, and learn more on the Azure Germany product page.

Deploying SAP HANA on AWS — What Are Your Options?


Feed: AWS for SAP.
Author: Sabari Radhakrishnan.

Sabari Radhakrishnan is a Partner Solutions Architect at Amazon Web Services (AWS).

Are you planning to migrate your SAP applications to the SAP HANA platform or start a new implementation with SAP HANA? If so, you might be wondering what options Amazon Web Services (AWS) provides to run your SAP HANA workloads. In this blog post, I want to discuss the core infrastructure components required for SAP HANA and the building blocks that AWS provides to help you build your virtual appliance for SAP HANA on AWS. I hope that this information will help you understand deployment options at a high level. This is the first in a series of blog posts that we will be publishing about various SAP on AWS topics, so check back frequently.

If you’re following the SAP HANA Tailored Data Center Integration (TDI) model, memory, compute, storage, and network are the four key infrastructure components that are required for SAP HANA. Among these, memory is the only variable that depends on your data size. Requirements for compute, storage, and network are either preset or derived from the memory size. For example, there are standard core-to-memory ratio requirements that SAP has put in place to determine the number of cores you need for compute, based on the memory size. When it comes to storage, regardless of memory size, you need to be able to meet certain throughput requirements for different block sizes and other KPIs, as laid out in the SAP HANA Hardware Configuration Check Tool (HWCCT) guide. Finally, for network, especially for scale-out scenarios, you need to be able to drive a minimum of 9.5 Gbps of network throughput between the SAP HANA nodes, regardless of memory size.

Over the past several years, AWS has worked closely with SAP to certify compute and storage configurations for running SAP HANA workloads on the AWS platform. How have we been able to achieve that? The answer is that AWS has engineered Amazon Elastic Compute Cloud (Amazon EC2) instances with different memory sizes to meet all of SAP’s stringent performance requirements for SAP HANA, including proper core-to-memory ratios for compute. In addition, Amazon Elastic Block Store (Amazon EBS) meets, and, in many cases, exceeds, the storage KPIs of the TDI model. Finally, the network bandwidth of EC2 instances meets or exceeds the 9.5 Gbps requirement for internode communications in scale-out mode.

Let’s take a closer look at these building blocks and configuration options.

Memory and compute

AWS provides several EC2 instance types to support different types of workloads. There are two EC2 instance families that are well suited for SAP HANA workloads: memory-optimized R3 and R4 instances, and high-memory X1 instances. These instance families have been purpose-built for in-memory workloads such as SAP HANA. These instance families and the instance types they contain give you a variety of compute options for running your SAP HANA workload. For online analytical processing (OLAP) workloads (for example, SAP Business Warehouse on HANA, SAP BW/4HANA, data marts, etc.), you can scale vertically starting from 244 GiB to 2 TB, and horizontally all the way to 14 TB with full support from SAP. Note, too, that we have tested up to 25-node deployments or a total of 50 TB of RAM successfully in the AWS lab. For online transaction processing (OLTP) workloads (for example, SAP Business Suite on HANA, SAP S/4HANA, SAP CRM, etc.), you can scale vertically from 244 GiB to 2 TB today. As AWS continues to introduce new instance types with the latest CPU generations, we will be working closely with SAP to certify these instance types for SAP HANA workloads. Check the Certified IaaS Platforms page in the Certified and Supported SAP HANA Hardware Directory from SAP to see all the certified AWS instance types that you can use in production for SAP HANA workloads. You can always use smaller instance sizes such as r3.2xlarge, r4.2xlarge, etc. within a given instance family for non-production workloads to reduce your total cost of ownership (TCO). Remember, these are cloud-native instances that give you the flexibility to seamlessly change the memory footprint of your SAP HANA system from 64 GiB to 2 TB and vice versa in a matter of minutes, which brings unprecedented agility to your SAP HANA implementation.

The following diagram and table summarize the memory and compute options that I just described.


Options for production workloads
Instance type Memory (GiB) vCPU SAPS
x1.32xlarge 1952 128 131,500
x1.16xlarge 976 64 65,750
r4.16xlarge 488 64 76,400
r3.8xlarge 244 32 31,920
Additional options for non-production workloads
Instance type Memory (GiB) vCPU SAPS
r4.8xlarge 244 32 38,200
r4.4xlarge 122 16 19,100
r4.2xlarge 61 8 9,550
r3.4xlarge 122 16 15,960
r3.2xlarge 61 8 7,980
Note
For SAP Business One, version for SAP HANA, additional instances and memory sizes are available. Look for another blog post on this topic.
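One practical consequence of using cloud-native instances is that resizing an SAP HANA system, for example from r4.8xlarge for non-production work to x1.16xlarge for production-scale testing, is a stop/modify/start operation. The following is a minimal AWS CLI sketch with a hypothetical instance ID; confirm that the target type is SAP-certified and re-apply any size-specific OS tuning before relying on it.

    # Hypothetical instance ID; stop the instance, change its type, and start it again.
    aws ec2 stop-instances --instance-ids i-0123456789abcdef0
    aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0

    # Change the instance type to a larger SAP-certified size.
    aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
        --instance-type "{\"Value\": \"x1.16xlarge\"}"

    aws ec2 start-instances --instance-ids i-0123456789abcdef0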

Storage

AWS provides multiple options when it comes to persistent block storage for SAP HANA. We have two SSD-backed EBS volume types (gp2 and io1) for your performance-sensitive data and log volumes, and cost-optimized / high-throughput magnetic EBS volumes (st1) for SAP HANA backups.

  • With the General Purpose SSD (gp2) volume type, you are able to drive up to 160 MB/s of throughput per volume. To achieve the maximum required throughput of 400 MB/s for the TDI model, you have to stripe three volumes together for SAP HANA data and log files.
  • Provisioned IOPS SSD (io1) volumes provide up to 320 MB/s of throughput per volume, so you need to stripe at least two volumes to achieve the required throughput.
  • With Throughput Optimized HDD (st1) volumes, you can achieve up to 500 MB/s of throughput with sequential read and write with large block sizes, which makes st1 an ideal candidate for storing SAP HANA backups.

One key point is that each EBS volume is automatically replicated within its AWS Availability Zone to protect you from failure, offering high availability and durability. Because of this, you can configure a RAID 0 array at the operating-system level for maximum performance and not have to worry about additional protection (RAID 10 or RAID 5) for your volumes.
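The SAP HANA Quick Start discussed later in this post configures striping for you, but if you assemble the volumes by hand, a RAID 0 array for /hana/data might look roughly like the sketch below. It assumes three gp2 volumes are already attached as /dev/xvdf, /dev/xvdg, and /dev/xvdh; the device names and file system choice are illustrative.

    # Stripe three attached EBS volumes into a single RAID 0 array (no redundancy needed,
    # because each EBS volume is already replicated within its Availability Zone).
    sudo mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/xvdf /dev/xvdg /dev/xvdh

    # Create a file system and mount it for SAP HANA data.
    sudo mkfs.xfs /dev/md0
    sudo mkdir -p /hana/data
    sudo mount /dev/md0 /hana/data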

Network

Network performance is another critical factor for SAP HANA, especially for scale-out systems. Every EC2 instance provides a certain amount of network bandwidth, and some of the latest instance families like X1 provide up to 20 Gbps of network bandwidth for your SAP HANA needs. In addition, many instances provide dedicated network bandwidth for the Amazon EBS storage backend. For example, the largest X1 instance (x1.32xlarge) provides 20 Gbps of network bandwidth and 10 Gbps of dedicated storage bandwidth. R4 (r4.16xlarge) provides 20 Gbps of network bandwidth in addition to dedicated 12 Gbps of storage bandwidth. Here’s a quick summary of network capabilities of SAP-certified instances.

Options for production workloads
Instance type Network bandwidth (Gbps) Dedicated Amazon EBS bandwidth (Gbps)
x1.32xlarge 20 10
x1.16xlarge 10 5
r4.16xlarge 20 12
r3.8xlarge 10*

* Network and storage traffic share the same 10-Gbps network interface

Operating system (OS)

SAP supports running SAP HANA on SUSE Linux Enterprise Server (SLES) or Red Hat Enterprise Linux (RHEL). Both OS distributions are supported on AWS. In addition, you can use the SAP HANA-specific images for SUSE and Red Hat in the AWS Marketplace to get started easily. You also have the option of bringing your own OS license. Look for details on OS options for SAP HANA on AWS in a future blog post.
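For example, you can look up a current SLES for SAP image from the AWS CLI. The name filter and query below are assumptions; the exact AMI naming pattern varies by OS version, region, and whether you use the Marketplace SAP HANA images or bring your own subscription.

    # Find the most recently published SLES for SAP AMI in a given region
    # (the "suse-sles-sap-*" name filter is an assumption; adjust for your OS version).
    aws ec2 describe-images --owners amazon --region us-west-2 \
        --filters "Name=name,Values=suse-sles-sap-*" \
        --query 'sort_by(Images, &CreationDate)[-1].{Id:ImageId,Name:Name}'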

Putting this all together

You might ask, “It’s great that AWS offers these building blocks for SAP HANA similar to TDI, but how do I put these components together to build a system that meets SAP’s requirements on AWS?” AWS customers asked this question a few years ago, and that’s why we built the AWS Quick Start for SAP HANA. This Quick Start uses AWS CloudFormation templates (infrastructure as code) and custom scripts to help provision AWS infrastructure components, including storage and network. The Quick Start helps set up the operating system prerequisites for the SAP HANA installation, and optionally installs SAP HANA software when you bring your own software and license. Quick Starts are self-service tools that can be used in many AWS Regions across the globe. They help provision infrastructure for your SAP HANA system in a consistent, predictable, and repeatable fashion, whether it is a single-node or a scale-out system, in less than an hour. Check out this recorded demo of the SAP HANA Quick Start in action, which was presented jointly with SAP during the AWS re:Invent 2016 conference.
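For reference, launching the Quick Start from the AWS CLI rather than the console might look roughly like the sketch below. The stack name, TEMPLATE_URL variable, and parameter keys are placeholders; the authoritative template location and parameter list are documented in the Quick Start guide.

    # TEMPLATE_URL is the S3 URL of the SAP HANA Quick Start template, taken from the guide.
    # The parameter keys shown here are illustrative placeholders, not the actual Quick Start parameters.
    aws cloudformation create-stack \
        --stack-name sap-hana-quickstart \
        --template-url "$TEMPLATE_URL" \
        --parameters ParameterKey=VPCID,ParameterValue=vpc-0123456789abcdef0 \
                     ParameterKey=HANAInstanceType,ParameterValue=r4.16xlarge \
        --capabilities CAPABILITY_IAM

    # Wait until provisioning completes.
    aws cloudformation wait stack-create-complete --stack-name sap-hana-quickstart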

We strongly recommend using the AWS Quick Start to provision infrastructure for your SAP HANA deployment. However, if you can’t use the Quick Start (for example, because you want to use your own OS image), you can provision a SAP HANA environment manually and put the building blocks together yourself. Just make sure to follow the recommendations in the Quick Start guide for storage and instance types. For this specific purpose, we’ve also provided step-by-step instructions in the SAP HANA on AWS – Manual Deployment Guide. (The manual deployment guide will be updated soon to include instructions for the latest OS versions, including RHEL.)

Backup and recovery

The ability to back up and restore your SAP HANA database in a reliable way is critical for protecting your business data. You can use native SAP HANA tools to back up your database to an EBS volume, and then move the backup files to Amazon Simple Storage Service (Amazon S3) for increased durability. Amazon S3 is a highly scalable and durable object storage service. Objects in Amazon S3 are stored redundantly across multiple facilities within a region and are designed for eleven 9s of durability. You also have the choice to use enterprise-class backup solutions like Commvault, EMC NetWorker, Veritas NetBackup, and IBM Spectrum Protect (Tivoli Storage Manager), which integrate with Amazon S3 as well as the SAP HANA Backint interface. These partner solutions can help you back up your SAP HANA database directly to Amazon S3 and manage your backup and recovery using enterprise-class software.
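As a simple illustration, once SAP HANA has written a file-based backup to an EBS volume, copying it to Amazon S3 can be a one-line sync. The backup path and bucket name below are hypothetical.

    # Copy completed HANA backups to S3 for durable, low-cost retention.
    aws s3 sync /backup/data/HDB s3://my-sap-hana-backups/HDB/ --storage-class STANDARD_IA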

High availability (HA) and disaster recovery (DR)

HA and DR are key for your business-critical applications running on SAP HANA. AWS provides several building blocks, including multiple AWS Regions across the globe and multiple Availability Zones within each AWS Region, for you to set up your HA and DR solution, tailored to your uptime and recovery requirements (RTO/RPO). Whether you are looking for a cost-optimized solution or a downtime-optimized solution, there are some unique options available for your SAP HANA HA/DR architecture — take a look at the SAP HANA HA/DR guide to learn more about these. We will dive deeper into this topic in future blog posts.

Migration

When it is time for actual migration, you could use standard SAP toolsets like SAP Software Provisioning Manager (SWPM) and the Database Migration Option (DMO) of the Software Update Manager (SUM), or third-party migration tools to migrate your SAP application running on any database to SAP HANA on AWS. The SAP to AWS migration process isn’t much different from a typical on-premises migration scenario. In an on-premises scenario, you typically have source and target systems residing in the same data center. When you migrate to AWS, the only difference is that your target system is residing on AWS, so you can think of AWS as an extension of your own data center. There are also a number of options for transferring your exported data from your on-premises data center to AWS during migration. I recommend that you take a look at Migrating SAP HANA Systems to X1 Instances on AWS to understand your options better.

Additional considerations include operations, sizing, scaling, integration with other AWS services like Amazon CloudWatch, and big data solutions. We will discuss these in detail in future blog posts. In the meantime, we encourage you to get started with SAP HANA on AWS by using the AWS Quick Start for SAP HANA. To learn more about running SAP workloads on AWS, see the whitepapers listed on the SAP on AWS website.

Finally, if you need a system to scale beyond the currently available system sizes, please contact us. We’ll be happy to discuss your requirements and work with you on your implementation.

– Sabari


Using the SAP Database Migration Option (DMO) to Migrate to AWS


Feed: AWS for SAP.
Author: Somckit Khemmanivanh.

Somckit Khemmanivanh is an SAP Solutions Architect at Amazon Web Services (AWS).

This blog post discusses how you can use the database migration option (DMO), which is a feature of the SAP Software Update Manager (SUM) tool, to migrate your anyDB database to SAP HANA on Amazon Web Services (AWS). SAP uses the term anyDB to refer to any SAP-supported, non-HANA source database (such as DB2, Oracle, or SQL Server). In this blog post, we will cover migration options from an on-premises architecture to AWS. (Note that there are many other migration options when SAP HANA is not your target platform; see the Migrating SAP HANA Systems to X1 Instances on AWS whitepaper for details.)

SAP HANA is a fully in-memory, columnar-optimized, and compressed database. The SAP HANA systems certified by SAP enable you to run your SAP HANA databases on systems ranging from 160+ GB of RAM up to 2 TB of RAM in a scale-up configuration. Certified scale-out configurations of up to 14 TB are also available. This system configuration flexibility enables AWS to scale to fit your business and IT needs. Please contact us if you have workloads requiring more memory—we’d like to work with you to satisfy your requirements.

For an introduction to DMO, see the SAP Community Network. At a high level, you can use DMO to migrate an SAP system that is running on anyDB to run on an SAP HANA database. You can also use DMO to upgrade your SAP system’s software components and to perform a Unicode conversion as part of your migration. (Note that as of Enhancement Package (EHP) 8, Unicode is mandatory.) The standard DMO process is an online and direct migration from your source anyDB to your target SAP HANA database.

Figure 1: SAP HANA DMO process

When your migration target is an SAP HANA system in the AWS Cloud, you must have your network connectivity in place to facilitate this direct migration process. Additionally, with the standard DMO process, SAP has specific restrictions, as detailed in SAP Note 2277055 – Database Migration Option (DMO) of SUM 1.0 SP18. (The SAP Notes referenced in this blog post require SAP Service Marketplace credentials.) This SAP note mentions restrictions when performing DMO transfers over a network connection (that is, between data centers). In our own testing and experience, latency does have some impact on the DMO runtime. We can help you evaluate your architecture and design and suggest possible solutions; please contact us at sap-on-aws@amazon.com.

On AWS, you have additional migration options beyond standard DMO. You can use automated migration products like ATAmotion, by AWS Advanced Technology Partner ATADATA, to replicate your source system (from your on-premises network) to AWS, and then use DMO to migrate the replicated system to SAP HANA. Please use the AWS Partner Solutions Finder to locate other AWS Partners who offer migration products and services.

Once your source systems are in AWS, you can leverage the agility and flexibility of AWS services to test your migration and to perform live migrations. An example scenario would be to run multiple tests to optimize your DMO downtime. With each test, you can take advantage of EC2 instance resizing to try different system sizes and combinations.

Now that we’ve covered the high-level processes and tools, let’s discuss the steps involved in the two migration options. We will start with the standard DMO migration process. This process assumes that your existing SAP systems reside in your on-premises data center, and that the single component that needs access to AWS is your DMO server. During the migration, the DMO server exports data from your source database into the target SAP HANA database. The migration process consists of the following steps:

      1. Create an AWS account, if you don’t already have one.
      2. Establish virtual private network (VPN) or AWS Direct Connect connectivity between your data center and the AWS Region, as shown in this illustration from the AWS Single VPC Design brief published on the AWS Answers website. See the “Internal-Only VPC” section of the brief for design considerations and details.

        Figure 2: VPC connectivity for DMO migration

      3. Deploy your AWS SAP HANA database instance and SAP application instance. (For deployment instructions, see the SAP HANA Quick Start.)
      4. Deploy your DMO server. (You can choose to use a DMO server on premises or in AWS.) The DMO server can be a separate server or co-located with your SAP application server, depending on your source database size, performance needs, system resources, and downtime requirements. (See the SAP documentation for details on this step.)
      5. Test connectivity between your on-premises DMO server and your SAP HANA database instance on AWS. (See the Amazon VPC for On-Premises Network Engineers series on the APN blog for more information. A minimal connectivity check is sketched after this list.)
      6. Migrate your on-premises database to your SAP HANA database on AWS using DMO. (See the SAP documentation for details.)
      7. Install new SAP application servers on AWS that are connected to your SAP HANA database instance. (See the SAP documentation for details.)
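As referenced in step 5, a quick way to verify that the DMO server can reach the target database is to test the SAP HANA SQL port. The sketch below assumes instance number 00, so the port is 30015 following the 3<nn>15 convention, and uses a hypothetical private IP address for the SAP HANA instance on AWS.

    # Test TCP reachability from the on-premises DMO server to the HANA SQL port on AWS.
    nc -zv 10.0.2.25 30015

    # Optionally confirm OS-level access to the HANA host as well.
    ssh ec2-user@10.0.2.25 'hostname && echo reachable'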

Figure 3: Standard DMO architecture

The second migration option we’ll discuss is to use an AWS Partner tool to replicate your source system on AWS, and then to migrate your system to SAP HANA. To use this process, your existing SAP systems must reside in your on-premises data center. The steps are similar to the standard DMO process:

      1. Create an AWS account, if you don’t already have one.
      2. Establish VPN or AWS Direct Connect connectivity between your data center and the AWS Region, as shown in this illustration from the AWS Single VPC Design brief published on the AWS Answers website. See the “Internal-Only VPC” section of the brief for design considerations and details.
      3. Replicate your source system on AWS. If you’re using ATAmotion, the ATADATA console is used to perform this replication.
      4. Deploy your SAP HANA database instance and SAP application instance on AWS. (For deployment instructions, see the SAP HANA Quick Start.)
      5. Deploy your DMO server on AWS. The DMO server can be a separate server or co-located with your SAP application server, depending on your source database size, performance needs, system resources, and downtime requirements. (See the SAP documentation for details on this step.)
      6. Run the DMO migration process against the source system that you replicated in step 3. Finalize the migration process on AWS. (See the SAP documentation for details.)
      7. Install new SAP application servers on AWS that are connected to your SAP HANA database instance. (See the SAP documentation for details.)

Figure 4: DMO architecture with ATADATA

Now let’s discuss some of the aspects of the architecture and design that are common to both migration options.

Network connection (bandwidth and latency)

For the network connection between your data center and AWS, you can use either a VPN or AWS Direct Connect, depending on how much data you need to transfer and how quickly you want to complete the migration. The amount of data to transfer correlates to the size of your target SAP HANA database. For example, if your source database size is ~2 TiB, the target SAP HANA database size may range from 250 to 600 GiB. (This estimate assumes a typical HANA compression ratio of 4:1 to 5:1, although we have observed even higher compression ratios.) You would need to transfer 200-250 GiB over the network. You can get a good estimate of your target database size from the SAP sizing reports in SAP Note 1793345 – Sizing for SAP Suite on HANA and SAP Note 1736976 – Sizing Report for BW on HANA.
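To put that in perspective, here is a rough back-of-the-envelope transfer-time calculation. The data size and bandwidth values are illustrative, and real-world throughput will be lower than the nominal link speed.

    # Estimate transfer time for ~250 GiB of export data over a given link speed.
    awk 'BEGIN {
        gib  = 250;                              # data to transfer, in GiB
        gbps = 1;                                # usable bandwidth, in Gbit/s
        secs = gib * 1024^3 * 8 / (gbps * 1e9);  # bits divided by bits per second
        printf "~%.1f hours at %g Gbps\n", secs / 3600, gbps
    }'
    # Prints roughly 0.6 hours at 1 Gbps; at 0.1 Gbps (100 Mbps) the same transfer takes about 6 hours.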

We recommend that you establish a reliable network connection to avoid interruptions and having to resend data during the migration process. AWS Direct Connect offers greater reliability and bandwidth than a VPN connection.

SAP application servers

DMO only migrates the source database to the target SAP HANA database—it does not set up SAP application servers. You will need to install new SAP application servers in AWS after your DMO migration is complete. You can choose from various installation scenarios, including:

You will need to decide on the best installation option based on your organization’s requirements, constraints, sizing, cost, complexity, and other tradeoffs.

There are many AWS technologies available to help streamline your SAP application server installation and configuration process. These include using AWS CloudFormation templates, creating an Amazon Machine Image (AMI) from a snapshot, and using Amazon EC2 Run Command. We will cover these topics in upcoming blog posts.

SAP virtual names

SAP systems can resolve host names via DNS or through your local hosts file. We recommend that you use DNS in combination with SAP virtual names for your SAP systems. Using SAP virtual names will make the migration easier by allowing you to keep the same virtual name on AWS. For details, see SAP Note 962955 – Use of virtual or logical TCP/IP host names.
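If you manage your internal DNS with Amazon Route 53, mapping an SAP virtual host name to the migrated instance is a single record change. The hosted zone ID, record name, and target host name below are hypothetical.

    # Point the SAP virtual host name at the migrated instance (UPSERT creates or updates the record).
    aws route53 change-resource-record-sets --hosted-zone-id Z0123456789EXAMPLE \
        --change-batch '{
          "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "sapci.example.corp",
              "Type": "CNAME",
              "TTL": 300,
              "ResourceRecords": [{"Value": "ip-10-0-1-10.us-west-2.compute.internal"}]
            }
          }]
        }'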

Start small

SAP migrations are complex, and proper planning is needed to minimize any potential issues. We recommend that you first target smaller, standalone SAP systems to familiarize yourself with AWS and SAP on AWS. Such systems could potentially be sandbox, development, training, demo, and SAP Internet Demonstration and Evaluation System (IDES) environments. If you decide not to proceed with the SAP DMO migration, your existing source system will still be fully functional—you would just need to re-enable it.

Drop us a line at sap-on-aws@amazon.com if you have any questions. For specific SAP on AWS support issues, please create an SAP Support Message with the components BC-OP-LNX-AWS or BC-OP-NT-AWS.

Thanks for reading!

— Somckit

VPC Subnet Zoning Patterns for SAP on AWS


Feed: AWS for SAP.
Author: Somckit Khemmanivanh.

This post is by Harpreet Singh and Derek Ewell, Solutions Architects at Amazon Web Services (AWS).

SAP landscapes that need to reside within the corporate firewall are comparatively easy to architect, but that’s not the case for SAP applications that need to be accessed both internally and externally. In these scenarios, there is often confusion regarding which components are required and where they should be placed.

In this series of blog posts, we’ll introduce Amazon Virtual Private Cloud (Amazon VPC) subnet zoning patterns for SAP applications, and demonstrate their use through examples. We’ll show you several architectural design patterns based on access routes, and then follow up with detailed diagrams based on potential customer scenarios, along with configuration details for security groups, route tables, and network access control lists (ACLs).

To correctly identify the subnet an application should be placed in, you’ll want to understand how the application will be accessed. Let’s look at some possible ways in which SAP applications can be accessed:

  • Internal-only access: These applications are accessed only internally. They aren’t allowed to be accessed externally under any circumstances, except by SAP support teams. In this case, the user or application needs to be within the corporate network, either connected directly or through a virtual private network (VPN). SAP Enterprise Resource Planning (ERP), SAP S/4HANA, SAP Business Warehouse (BW), and SAP BW/4HANA are examples of applications for which most organizations require internal-only access.
  • Internal and controlled external access: These applications are accessed internally, but limited access is also provided to known external parties. For example, SAP Process Integration (PI) or SAP Process Orchestration (PO) can be used from internal interfaces, but known external parties might be also allowed to interface with the software from whitelisted IPs. Additionally, integration with external software as a service (SaaS) solutions, such as SAP SuccessFactors, SAP Cloud Platform, and SAP Ariba, may be desirable, to add functionality to SAP solutions running on AWS.
  • Internal and uncontrolled external access: Applications like SAP hybris or an external-facing SAP Enterprise Portal fall into this category. These applications are mostly accessible publicly, but they have components that are meant for internal access, such as components for administration, configuration, and integration with other internal applications.
  • External-only access: This is a rare scenario, because an application will need to be accessible internally for basic administration tasks such as backups, access management, and interfaces, even if most of its components are externally accessible. Due to the infrequency of this scenario, we won’t cover it in this series of blog posts.

In this blog post, we’ll cover possible architecture patterns for the first category of applications (applications that are accessible only internally). We’ll cover the two other scenarios in upcoming blog posts. In our discussions for all three scenarios, we’ll assume that you will access the Internet from your AWS-based resources, via a network address translation (NAT) device (for IPv4), an egress-only Internet gateway (for IPv6), or similar means, to deal with patching, updates, and other related scenarios. This access will still be controlled in a way to limit or eliminate inbound (Internet to AWS Cloud) access requests.

Architectural design patterns for internal-only access

We’ll look at two design patterns for this category of SAP applications, based on where the database and app server are placed: in the same private subnet or in separate private subnets.

Database and app server in a single private subnet

This setup contains three subnets:

  • Public subnet: An SAProuter, along with a NAT gateway or NAT instance, is placed in this subnet. Only the public IPs specified by SAP are allowed to connect to the SAProuter. See SAP Note 28976, Remote connection data sheet, for details. (SAP notes require SAP Support Portal access.)
  • Management private subnet: Management tools like Microsoft System Center Configuration Manager (SCCM), and admin or jump hosts are placed in this subnet. Applications in this subnet aren’t accessed by end users directly, but are required for supporting end users. Applications in this subnet can access the Internet via NAT gateways or NAT instances placed in the public subnet.
  • Apps & database private subnet: Applications and databases are placed in this subnet. SAP applications can be accessed by end users via SAPGUI or over HTTP/S via the SAP Web Dispatcher. End users aren’t allowed to access databases directly.

Database and app server in different private subnets

This setup includes four subnets. The public subnet and the management private subnet have the same functions as in the previous scenario, but the third subnet (the apps & database private subnet) has been divided into two separate private subnets for applications and databases. The database private subnet is not accessible from the user environment.

As you can see, there isn’t much difference between these two approaches. However, the second approach protects the database better by shielding the database subnet with separate route tables and network ACLs. This gives you finer control and lets you manage access to the database layer more effectively.
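For example, the instances in the database subnet can carry a security group that accepts SAP HANA SQL traffic only from the application servers' security group, not from the user network. The security group IDs below are hypothetical, and port 30015 assumes instance number 00.

    # Allow the app-server security group, and nothing else, to reach the HANA SQL port.
    aws ec2 authorize-security-group-ingress \
        --group-id sg-0bbb000000000000b \
        --protocol tcp --port 30015 \
        --source-group sg-0aaa000000000000a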

Putting our knowledge to use

Let’s put this in context by discussing an example implementation.

Example scenario

You need to deploy SAP S/4HANA, SAP Fiori, and SAP BW (ABAP and Java) on HANA. These applications should be accessible only from the corporate network. The applications will require integration with Active Directory (AD) or Active Directory Federation Services (ADFS) for single sign-on (SSO) based on Security Assertion Markup Language (SAML). SAP BW will have file-based integration with legacy applications as well, and will communicate with an SSH File Transfer Protocol (SFTP) server for this purpose. SAP should be able to access these systems for support. SAP Solution Manager is based on SAP ASE and will be used for central monitoring and change management of SAP applications. All applications are assumed to be on SUSE Linux Enterprise Server (SLES).

Solution on AWS

In this example, we are presuming only one EC2 instance per solution element. If workloads are scaled horizontally, or high availability is necessary, you may choose to include multiple, functionally similar, EC2 instances in the same security group. In this case, you’ll need to add a rule to your security groups. You will use an IPsec-based VPN for connectivity between your corporate network and the VPC. If Red Hat Enterprise Linux (RHEL) or Microsoft Windows Server are used, some configuration changes may be necessary in the security groups, route tables, and network ACLs. You can refer to the operating system product documentation, or other sources such as the Security Group Rules Reference in the Amazon Elastic Compute Cloud (EC2) documentation, for more information. Certain systems will remain on premises, such as the primary AD or ADFS servers, and the legacy SFTP server.

Here’s an architectural diagram of the solution:

The architecture diagram assumes the following example setup:

The following table shows the security group sample configurations. This represents a high-level view of the rules to be defined in the security groups. For exact port numbers or ranges, please refer to the SAP product documentation.

The flow of network traffic is managed by these sample route tables:

*AWS Data Provider requires access to AWS APIs for Amazon CloudWatch, Amazon S3, and Amazon EC2. Further details are available in the AWS Data Provider for SAP Installation and Operations Guide.

For an additional layer of security for our instances, we can use network ACLs, such as those shown in the following table. Network ACLs are fast and efficient, and provide another layer of control in addition to the security groups shown in the previous table. For additional security recommendations, see AWS Security Best Practices.
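As an illustration, a network ACL on the database subnet could admit HANA SQL traffic only from the application subnet. The ACL ID, rule number, and application-subnet CIDR below are hypothetical.

    # Allow inbound traffic to the HANA SQL port only from the application subnet.
    # (Network ACLs are stateless, so a matching egress rule for ephemeral ports is also required.)
    aws ec2 create-network-acl-entry \
        --network-acl-id acl-0123456789abcdef0 \
        --ingress --rule-number 100 \
        --protocol tcp --port-range From=30015,To=30015 \
        --cidr-block 10.0.1.0/24 \
        --rule-action allow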

In certain cases (for example, for OS patches), you may need additional Internet access from EC2 instances; and route tables, network ACLs, and security groups will be adjusted to allow this access temporarily.

Summary and what’s next

In this post, we have defined and demonstrated by example the subnet zoning pattern for applications that require internal-only access. Stay tuned for the next blog post in this series for a discussion of the other subnet zoning patterns we introduced in this post.

We would like to hear about your experiences in setting up VPCs for SAP applications on AWS. If you have any questions or suggestions about this blog post, feel free to contact us.

Getting Started with Architecting SAP on the AWS Cloud


Feed: AWS for SAP.
Author: Rahul Kabra.

Rahul Kabra is a Partner Solutions Architect at Amazon Web Services (AWS).

If you are considering moving your SAP workloads to AWS, you’re probably wondering what your SAP architecture would look like on AWS, how different it would be to run SAP on AWS vs. running it on premises or in your private cloud, and how your business might benefit from migrating to AWS. This introductory article will touch on these topics and will provide you with key data points to prepare you for migrating and operating SAP workloads on AWS.

Architecting your SAP landscape on AWS does require a minor mind-set change to take advantage of the agility and scalability that AWS offers for SAP workloads. You’ll also want to understand how SAP architectures leverage various AWS services, and how the AWS environment provides better security and availability than a traditional environment. In this post, we’ll explore architectural components to give you an overview of SAP on AWS.

Core AWS services that are relevant for SAP

Here are some of the AWS core services you should be familiar with to get started with your SAP deployments:

  • Amazon VPC – The Amazon Virtual Private Cloud (Amazon VPC) service provides a logically isolated section of the AWS Cloud where all your AWS resources get deployed. Amazon VPC provides virtual networking features and security, which you can control to help ensure that only relevant and approved network traffic flow into and out of the VPC. For additional information on VPC design and architectural patterns for SAP, see VPC Subnet Zoning Patterns for SAP on AWS, published earlier on this blog.
  • Amazon EC2 – The Amazon Elastic Compute Cloud (Amazon EC2) service provides virtualized hosts where SAP application servers and databases can be installed. AWS provides multiple instance families that are certified for running SAP workloads, ranging from small, multi-purpose instances to high-memory instances on which you can run in-memory workloads like SAP HANA.
  • Amazon EBS – Amazon Elastic Block Store (Amazon EBS) is a block-based storage service that is used for hosting SAP application and database data, log files, and backup volumes. AWS provides multiple EBS volume types to meet the SAPS, IOPS, and throughput requirements of your SAP applications.
  • Amazon EFS – Amazon Elastic File System (Amazon EFS) provides a shared file system that can be attached across various EC2 hosts. This file storage can be particularly useful for SAP scale-out instances, where shared files like /usr/sap/trans or /sapmnt/ need to be mounted.
  • Amazon S3 – Amazon Simple Storage Service (Amazon S3) is a scalable and durable object-based storage service, which is used for storing SAP application backups and snapshots.
  • Amazon Glacier – This service provides highly scalable, durable, and cost-effective object storage that can be used for long-term backups and for data that needs to be retained for compliance or regulatory reasons.
  • IAM – AWS Identity and Access Management (IAM) is used to create and manage AWS users and groups. You can securely control access to AWS resources using IAM roles.
  • CloudWatch – Amazon CloudWatch is a monitoring service for AWS resources. It is critical for SAP workloads, where it’s used to collect resource utilization logs and to create alarms to automatically react to changes in AWS resources.
  • CloudTrail – AWS CloudTrail keeps track of all API calls made within your AWS account. It captures key metrics about the API calls and can be useful for audit trail creation for your SAP resources.
  • AWS CLI – AWS Command Line Interface (AWS CLI) is a tool for managing and operating AWS services via command line or scripts. We’ve provided some examples in the “AWS Automation” section later in this blog post.
  • Route 53 – Amazon Route 53 is a scalable and highly available DNS service. You can use it to create hosted zones, traffic policies, and domains on AWS, and to connect to non-AWS resources.

AWS sizing and performance

The AWS platform allows enterprises to provision the right type and size of computing resources in a self-service manner, without the need for large, upfront investment in hardware. One of the primary advantages of moving to AWS is that you gain the flexibility and agility to adjust your resources to your changing business needs. For example, AWS provides a global footprint that currently includes 16 AWS Regions and 42 Availability Zones to host your SAP landscapes near your customer base in a secure, highly available manner. Unlike a traditional on-premises setup, the AWS Cloud environment lets you resize your SAP EC2 instances with a few clicks in the Amazon EC2 console, as shown in the following screen illustration, or with a simple API call.

Resizing your existing EC2 instance from the Amazon EC2 console

You get almost instant access to as many resources as you need, and you pay only for what you use. This means that infrastructure architects don’t have to guess or include a buffer when determining compute/memory sizing requirements for new projects. Also, during special events like month-end or open-enrollment periods, it becomes very easy to add capacity to existing hosts, with the option of scaling out your SAP application or database tiers. AWS provides both memory-optimized and compute-optimized instance types to meet your resource requirements, and has worked closely with SAP to get these certified for SAP workloads. See SAP Note 1656099 to get a list of certified AWS instances for SAP. (SAP notes require SAP Service Marketplace credentials.)

When designing SAP landscapes on AWS, keep in mind that not all SAP systems need to be operational at all times. Non-critical SAP systems such as sandbox, training, and demo systems can be shut down when they're not in use, resulting in significant cost savings.
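A simple way to act on this is to stop tagged non-production instances outside business hours, for example from a scheduled job. The tag key and value below are assumptions about how you label your systems.

    # Stop all running instances tagged Environment=sandbox (for example, from a nightly cron job).
    aws ec2 describe-instances \
        --filters "Name=tag:Environment,Values=sandbox" "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].InstanceId' --output text | \
        xargs -r aws ec2 stop-instances --instance-ids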

One of the performance enablers on AWS is the way storage is configured. EBS volumes are replicated within an Availability Zone, so they’re already protected against data loss. For this reason, EBS volumes can be striped via RAID 0, which provides the maximum performance. This isn’t possible in most on-premises environments. For details on various EBS volume types and performance characteristics, see the post Deploying SAP HANA on AWS – What Are Your Options on this blog. These EBS volumes can be utilized for non-HANA databases as well.

AWS automation

AWS is a leader in cloud automation and provides multiple options for programmatically scripting your resources to operate or scale them in a predictable and repeatable manner. You can execute scripts to automate SAP operations such as creating new SAP application servers, taking backups and snapshots, and building new instances from the ground up.

Here are some examples that illustrate how easy it is to take snapshots of your existing SAP HANA data volumes for backup, and to restore those volumes on your server. These commands can be easily scripted and executed as part of your backup/recovery process via crontab or any other time-based job scheduler tools.

Example 1: Taking a snapshot of existing SAP HANA data volumes

  1. Stop the SAP HANA instance:
    sudo su - <sid>adm -c "HDB stop"
  2. Quiesce the file system:
    umount /hana/data
  3. Identify the SAP HANA volume IDs attached to the instance:
    aws ec2 describe-volumes --region=us-west-2 --filters Name=attachment.instance-id,Values=i-03add123456789012 Name=tag-value,Values="HANA*" --query 'Volumes[*].{ID:VolumeId,Tag:Tags}'

    Response:
    [
        {
            "Tag": [
                {
                    "Value": "HANA_root",
                    "Key": "Name"
                }
            ],
            "ID": "vol-071111111111111"
        },
        {
            "Tag": [
                {
                    "Value": "HANA_data #1",
                    "Key": "Name"
                }
            ],
            "ID": "vol-082222222222222"
        },
        {
            "Tag": [
                {
                    "Value": "HANA_data #2",
                    "Key": "Name"
                }
            ],
    		"ID": "vol-0733333333333"
         }
    ]
    
  4. Take a snapshot of the SAP HANA volumes:
    aws ec2 create-snapshot --region=us-west-2 --volume-id vol-082222222222222 --description "HANA Server Data volume #1"; aws ec2 create-snapshot --region=us-west-2 --volume-id vol-0733333333333 --description "HANA Server Data volume #2"
  5. Mount back the SAP HANA data volumes (this can be done while the snapshot is in a pending state):
    mount /hana/data
  6. Start SAP HANA:
    sudo su - hxeadm -c "HDB start"

Example 2: Restoring data from a snapshot and attaching it to your existing host

  1. Create a new volume from the snapshot:
    aws ec2 create-volume --region us-west-2 --availability-zone us-west-2a --snapshot-id snap-1234abc123a12345a --volume-type gp2
  2. Attach the newly created volume to your EC2 host:
    aws ec2 attach-volume --region=us-west-2 --volume-id vol-4567c123e45678dd9 --instance-id i-03add123456789012 --device /dev/sdf
  3. Initialize the volume to optimize the read operation (highly recommended for production scenarios):
    fio --filename=/dev/xvdf --rw=randread --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --name=volume-initialize
  4. Mount the logical volume associated with SAP HANA data on the host:
    mount /dev/mapper/hanaexpress-lvhanaexpressdata /hana/data
  5. Start SAP HANA:
    sudo su - hxeadm -c "HDB start"

There are other ways you can leverage automation for SAP on AWS:

  • Use Amazon Machine Images (AMIs) to create new copies of your SAP instances (a minimal sketch follows this list).
  • Automatically create new SAP instances using AWS CloudFormation, or use the AWS Quick Start for new SAP HANA deployments in the cloud.
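A minimal sketch of the first option, with a hypothetical instance ID; --no-reboot avoids downtime, but a stopped or quiesced system gives a more consistent image.

    # Create an AMI from an existing SAP application server so new copies can be launched from it.
    aws ec2 create-image --instance-id i-0123456789abcdef0 \
        --name "sap-app-server-$(date +%Y%m%d)" \
        --description "Image of SAP application server for cloning" \
        --no-reboot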

HA and DR using multiple Availability Zones and Regions

In traditional on-premises SAP deployments, high availability (HA) is handled within a single, physical data center. In contrast, AWS allows HA deployments across multiple Availability Zones in the same AWS Region. Availability Zones are physically isolated collections of data centers within the same geographical region, connected through low-latency networking. For some customers, this HA setup may be sufficient for disaster recovery (DR) as well. If not, you can take advantage of the multiple options AWS offers with flexible pricing to leverage a second region in a different part of the country or world. The following diagram shows a sample distributed architecture for SAP.

Distributed SAP architecture on AWS

SAP operations

Compared with on-premises environments, AWS provides more transparency and accountability in the operation of SAP systems, with the help of monitoring services such as Amazon CloudWatch and AWS CloudTrail. You can use IAM roles and policies, security groups, and network ACLs for fine-grained security access control to AWS resources.

  • Monitoring services – Services like CloudWatch and CloudTrail can be helpful in monitoring resource utilization. CloudWatch enables you to create alarms to monitor and automatically recover an instance if it becomes impaired due to an underlying hardware failure; a minimal recovery alarm is sketched after this list. For details, see the AWS documentation.
  • Security – IAM allows fine-grained, controlled access to AWS resources and lets you create roles to enable secure communication between different services without having to copy AWS access keys. For details, see the AWS documentation.
  • Backups – If your organization uses a backup tool that supports direct Amazon S3 integration, you can back up your database directly into your S3 bucket. If not, you can use your existing backup tools to first export your files to an EBS volume, and then sync it with Amazon S3.
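As referenced in the monitoring item above, an automatic-recovery alarm can be created directly from the CLI. The instance ID and Region below are hypothetical, and the recover action applies only to instance types and configurations that support EC2 instance recovery.

    # Recover the instance automatically if the system status check fails for two consecutive minutes.
    aws cloudwatch put-metric-alarm \
        --alarm-name sap-app01-auto-recover \
        --namespace AWS/EC2 --metric-name StatusCheckFailed_System \
        --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
        --statistic Minimum --period 60 --evaluation-periods 2 \
        --threshold 0 --comparison-operator GreaterThanThreshold \
        --alarm-actions arn:aws:automate:us-west-2:ec2:recover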

AWS Simple Monthly Calculator

Unlike the on-premises world, where it is often difficult to get a comprehensive understanding of hardware and operational costs, AWS provides tools such as the AWS Simple Monthly Calculator to estimate your infrastructure costs across the various payment models for EC2 instances, including On-Demand and Reserved Instances.

Next steps

For more details on SAP on AWS, use the following resources:

  • Read SAP on AWS whitepapers – Great material for understanding the architectures and deployment options for SAP on AWS.
  • Read the SAP on AWS blog – We have quite a few blog posts that dive deep into various SAP topics.
  • Sign up for an AWS training class – AWS provides great training resources to help you get up to speed on various AWS services. Consider taking a class like AWS Technical Essentials to get prepared for AWS.
  • Get AWS certified – Getting an AWS Solutions Architect Associate or Professional certification can provide you with architectural and best practices guidance for deploying SAP and non-SAP applications on AWS.
  • Reach out to your consulting partner – If there’s a consulting partner that you actively work with, contact them about getting involved with your SAP migration to AWS.

– Rahul

FAST SAP Migrations to AWS with the SAP Rapid Migration Test Program


Feed: AWS for SAP.
Author: Somckit Khemmanivanh.

Somckit Khemmanivanh is an SAP Solutions Architect at Amazon Web Services (AWS).

If you have an on-premises SAP application running on a non-HANA (anyDB) database, you can now use the SAP Rapid Migration Test program with Amazon Web Services (AWS) to migrate to an SAP HANA (or ASE) version of your application on AWS. The SAP Rapid Migration Test program (also known as FAST, which stands for Fast AWS and SAP Transformation) provides a set of processes, procedures, and tools that SAP developed in collaboration with AWS to help customers running SAP applications (SAP ECC and SAP Business Warehouse) on anyDB to migrate to SAP HANA or SAP ASE on AWS.

You can use FAST to migrate your SAP system to AWS and upgrade it to SAP HANA in record time, using your own in-house resources, remote consulting, or a consulting partner. AWS was chosen as the development and launch partner for this initiative due to the flexibility and scale of the AWS platform. SAP and AWS partnered on the pilot phase of the program and found that migrations can be completed in as little as 48 hours and for as little as $1,000 in infrastructure costs.

As part of the FAST program, SAP has enhanced the database migration option (DMO) of their Software Update Manager (SUM) (see SAP Note 2377305) to accelerate the migration testing of SAP applications. (SAP notes require SAP login credentials.) This enhancement in SUM 1.0 SP 20, called DMO with System Move, enables you to migrate your SAP system from your on-premises site to AWS by using a special export and import process. You can use the AWS Quick Start for SAP HANA to rapidly provision SAP HANA instances and build your SAP applications on AWS. You can additionally use AWS services such as Amazon S3, Amazon EFS (over AWS Direct Connect), AWS Storage Gateway file interface, and AWS Snowball to transfer your SAP files to AWS.

The SUM DMO tool can convert data from anyDB to SAP HANA, with OS migrations, release/enhancement pack upgrades, and Unicode conversions occurring at the same time. Results are written to flat files, which are transferred to a rapidly provisioned SAP HANA system on the AWS platform. The second phase of DMO with System Move imports the flat files and builds the migrated SAP application with the extracted data, code, and configuration. Here’s a conceptual flow of the major steps involved:

In step 1, you use the SUM DMO tool to export the SAP source to a storage location in the form of flat files. Depending on your use case and requirements, this location could be an existing AWS service you’ve already configured (such as Amazon EFS or a file gateway) or a local file system on the source server. Step 2 involves transferring the exported flat files to AWS. (Note: Depending on your storage location technology, you might not have to explicitly transfer files to AWS. For example, if you used Amazon EFS or a file gateway with AWS Storage Gateway, your files would be transferred to AWS automatically.) While the files are being transferred, you can use the SAP HANA Quick Start to provision the SAP HANA system and optionally install SAP HANA. Typical provisioning times for SAP HANA range from 30 minutes to less than one hour for larger SAP HANA scale-out systems. Lastly, step 3 involves the actual import of the flat files into the newly provisioned SAP HANA system.

With the SAP Rapid Migration Test program, SAP customers who don’t have SAP HANA licenses can get a limited test license from their SAP account team. SAP customers can use the DMO with System Move process themselves or partner with an AWS Partner Network (APN) SAP partner.

We’d like to hear about your experiences using AWS for SAP applications. If you have any questions or suggestions, feel free to contact us.

Thank you!

– Somckit

VPC Subnet Zoning Patterns for SAP on AWS, Part 2: Network Zoning


Feed: AWS for SAP.
Author: Somckit Khemmanivanh.

This post is by Harpreet Singh and Derek Ewell, Solutions Architects at Amazon Web Services (AWS).

In part one of this article series on VPC subnet zoning patterns, we described possible ways in which SAP applications may be accessed, and then discussed Amazon Virtual Private Cloud (Amazon VPC) subnet zoning patterns for internal-only access in detail. In this second article in the series, we’ll discuss how traditional application network zoning can be mapped to AWS.

In a traditional on-premises deployment model, applications are segregated into various network zones:

  • Restricted zone: This is the most secure zone, and it hosts confidential data. For example, databases for finance and HR solutions, content repositories, and file servers may reside here.
  • Intranet zone: This zone is meant for application servers that access databases in the restricted zone. For example, SAP Advanced Business Application Programming (ABAP) or Java Central Services, or SAP application servers may reside here. End users’ devices, when connected to the corporate network, are also in this zone.
  • Extranet zone: Middleware like SAP Process Orchestration (PO) or SAP Process Integration (PI), SSH File Transfer Protocol (SFTP) servers, or SAP TREX reside in this zone.  This zone acts as the intermediate zone between the external and internal zones.
  • External zone: This zone hosts applications and appliances that are directly Internet facing, and acts as the entry or exit point for your applications in the internal zone. Network address translation (NAT) instances, reverse proxies, and SAProuter are examples of some of the solutions that belong in this zone.
  • Management / shared services zone: Applications like Active Directory, monitoring servers, SAP Solution Manager, or DNS servers, which are required by all the zones described above, are hosted in this zone.
  • Internet zone: You do not control this zone, but you interact with it when you access applications hosted by your business partner, SaaS provider, etc.

Figure 1: Network zones

In the traditional world, the flow of traffic between zones is controlled by defining firewall rules. On AWS, you control traffic flow by using network access control lists (ACLs), which are stateless firewalls at the subnet level; security groups, which are stateful firewalls at the instance or elastic network interface level; and route tables, which are sets of rules that determine where traffic is directed.

So how do these zones fit into a deployment on AWS?

In the architecture we described in the previous blog post, we implicitly defined zones and separated the applications at the subnet level. We covered all zones except for the extranet zone.

Figure 2: Network zone mapping with subnet-level separation

However, subnets aren’t the only option for separating applications on AWS. You can also confine an application by using different VPCs for different zones. For example, you can have a dedicated VPC for the management zone, and connect it to other zones (which reside in their own VPCs) by using VPC peering. However, we don’t recommend splitting closely related components, such as SAP application servers and databases, into separate VPCs.

Which criteria can help you decide whether to use multiple VPCs or multiple subnets within a single VPC for separation?

There is no rule of thumb, and the choice is generally driven by ease of management and segregation of duties in your organization. For example, in your organization, you might have other applications besides SAP that use services from the management zone, such as Microsoft Active Directory (for SAP user single sign-on and user management), email, and anti-malware. These shared applications may be managed by separate teams, and may require completely different change management processes, due to their wider impact. So, you may decide to create a separate VPC for shared services. You can also manage this shared access through a multi-account strategy, for example with AWS Organizations.

Let’s see how the architecture from the previous article would look, if we decide to host the external zone and the management zone in separate VPCs, while keeping the intranet zone and the restricted zone together in one VPC.

Figure 3: Network zone mapping with multiple VPCs

We’ll need to make a few configuration changes to accommodate this design:

  • Use separate VPC CIDRs for each VPC planned. For example, we had only 10.0.0.0/16 as the VPC CIDR, but now we’ll need:
    • 10.0.0.0/16 – Single VPC CIDR for intranet zone (subnet 10.0.1.0/24) and restricted zone (subnet 10.0.2.0/24)
    • 10.2.0.0/16 – VPC CIDR for external zone (subnet 10.2.1.0/24)
    • 10.3.0.0/16 – VPC CIDR for management zone (subnet 10.3.1.0/24)
  • Use VPC peering to establish connections with the other VPCs (see the sketch after this list).
  • Make adjustments to network ACLs, route tables, and security groups, to route traffic from one VPC to another, as needed.
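A minimal sketch of the peering and routing changes, using hypothetical VPC, peering-connection, and route-table IDs that follow the CIDR plan above:

    # Peer the management-zone VPC (10.3.0.0/16) with the intranet/restricted-zone VPC (10.0.0.0/16).
    aws ec2 create-vpc-peering-connection \
        --vpc-id vpc-0aaa000000000000a --peer-vpc-id vpc-0bbb000000000000b
    aws ec2 accept-vpc-peering-connection \
        --vpc-peering-connection-id pcx-0123456789abcdef0

    # Route traffic destined for the management zone through the peering connection
    # (a mirror-image route is needed in the management VPC's route tables).
    aws ec2 create-route --route-table-id rtb-0123456789abcdef0 \
        --destination-cidr-block 10.3.0.0/16 \
        --vpc-peering-connection-id pcx-0123456789abcdef0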

What’s next?

In this post, we mapped traditional network zones to equivalent constructs in the AWS Cloud. In an upcoming blog post, we’ll conclude this series with our take on subnet zoning patterns for SAP applications that require internal and controlled or uncontrolled external access.

We would like to hear from you about your experiences setting up VPCs on AWS for SAP applications. If you have any questions or suggestions about this article series, feel free to contact us.
