BYOBD: Bring Your Own Big Data

Robert Plant
comments
Page 1 / 5   >   >>
Saul Sherry, User Rank: Blogger
3/21/2013 | 9:51:28 AM


Re: steps to gain benefits
Hi Ahmed... have you made any progress with this project?


Saul

Keith.Grinsted, User Rank: Petabyte Pathfinder
2/28/2013 | 8:37:12 PM


Re: A collaborative future
@Robert, picking up on your entrepreneurial angle: other discussions have identified that solutions are needed out there, and certainly someone is required to coordinate pools of big data and merge them into seas or oceans of big data that can then be interrogated to provide solutions from this collaborative effort.

ahmed farag, User Rank: Bit Player
2/1/2013 | 3:23:21 PM


Re: steps to gain benefits
Hi @Saul Sherry, how are you? I have a question and I hope you can help me.

How do I acquire, organize, and analyze big data using Hadoop? I mean, what tools are used in each step? And how do I transform the unstructured data and combine it with the structured data?

ahmed farag, User Rank: Bit Player
1/25/2013 | 5:46:38 PM


Re: steps to gain benefits
Thank you, guys. I found some open-source C# code that crawls, but it returns "garbage": plenty of HTML tags that need a lot of effort to parse into meaningful data. You mentioned legalities; I want to crawl public data only, but I'll take care on this point anyway.

I hope I can find a short way to crawl and parse the data. @AlphaEdge, why did you say that Java or Python is the best? Could you give me more explanation, please? Also, if you know of any open-source code written in Java or Python, please share it with me.
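
On the point about turning tag-heavy HTML into meaningful data, here is a minimal sketch using only Python's standard-library `html.parser`. The class name `TextExtractor`, the helper `extract_text`, and the sample page are made up for illustration; a real crawler would likely need something sturdier for messy pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []       # cleaned text fragments, in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of visible text fragments found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page content, standing in for whatever the crawler fetched.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Acme news</h1><p>Great <b>product</b>!</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))
```

The idea is simply to throw away the markup and keep the text, which is usually the first step before any analysis.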


AlphaEdge, User Rank: Exabyte Executive
1/23/2013 | 6:02:45 PM


Re: steps to gain benefits
Agree. It might involve copyright infringement issues. I think you guys may have to seriously consider this. It seems like either Java or Python would be the best tool for a web crawler, especially given that Java integrates with Hadoop so well. Maybe that can be the tool you choose? Any comments from other experts?

Saul Sherry, User Rank: Blogger
1/23/2013 | 6:23:16 AM


Re: steps to gain benefits
Hi @Ahmed. There are plenty of crawler guides out there that let you do some interesting stuff and pull interesting data. I recommend checking out sites like Stack Overflow to see what people with similar needs have done.

A caveat, though: the legalities of doing this are never particularly clear-cut. According to this great article on PeteSearch, they nearly got sued by Facebook for accessing such data, being told that "the only legal way to access any web site with a crawler was to obtain prior written permission."

So I would make sure you get clarity on the legal side of what you are doing first.

From there, I would recommend taking this one step at a time. For instance, you won't know the best kind of storage system until you know the size and types of data your crawling efforts throw up. Anyone have thoughts to add?
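
One small technical check that often goes alongside the legal question is reading a site's robots.txt before crawling it. The sketch below uses Python's standard-library `urllib.robotparser`. The robots.txt body, the crawler name `my-crawler`, and the example.com URLs are all made up for illustration. Note that robots.txt is only a convention stating what a site asks crawlers to avoid; it is not the written permission mentioned in the quote above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the
# site's own /robots.txt instead of using a hard-coded string.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def allowed(robots_body, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(robots_txt, "my-crawler", "https://example.com/blog/post-1"))
print(allowed(robots_txt, "my-crawler", "https://example.com/private/data"))
```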

ahmed farag, User Rank: Bit Player
1/22/2013 | 4:12:28 PM


Re: steps to gain benefits
Let me introduce myself: I'm Ahmed Farag from Egypt. I'm currently in a training program in business analytics and working with a team on the project I mentioned to you. I should say that I have only just started taking my first steps in big data analytics, so please help me as much as you can.

ahmed farag, User Rank: Bit Player
1/22/2013 | 4:04:05 PM


Re: steps to gain benefits
Hi @Saul, let me explain again what I want to do and what the case is, so you and the guys can see how to help me.
I'm working on a project in which we first need to crawl data from social websites (Facebook, Twitter). We don't want to use the Facebook or Twitter APIs. We need to extract exactly the data related to the company. For example, from the company's official page we need to know all of the members of the page: their names, their comments on each post, their likes, and then, from their profiles, their ages, their contact details if possible, and so on. We also need to search all of the pages on Facebook that talk about the company to find out who their authors are, their names, contacts, all possible details inside every page, and so on. Of course, it will be public data only. We also need to apply the same concept to some other sites (one news website and one blog) to extract the data related to the company. Hopefully, once we have this data, we will merge it with some structured data from the company and apply analytics to gain insights.

So I think the scope of the project should be:
- Build the crawling and parsing engine to extract and parse the crawled data.
- Store the unstructured data, in HBase for example.
- Merge it with the data from inside the company.
- Analyse the new, valuable data that we have.

Now I need to know whether these steps are okay, and whether there is a way to crawl the data in the way I described. Simply put, how can I go ahead with this project?

Any suggestions, please?
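
The four steps listed above can be sketched end to end in a few lines of plain Python. This is only an illustration under stated assumptions: the record fields, the `customers` table, and the join on user name are hypothetical, and an in-memory dict stands in for a store such as HBase.

```python
from collections import Counter

# Step 1 (stand-in): records that a crawling and parsing engine might emit.
# The field names are hypothetical.
crawled_comments = [
    {"user": "alice", "post_id": "p1", "text": "Love the new phone"},
    {"user": "bob", "post_id": "p1", "text": "Battery is poor"},
    {"user": "alice", "post_id": "p2", "text": "Great support"},
]


def store(records):
    """Step 2: keep each raw record under a row key (a dict stands in for HBase)."""
    table = {}
    for i, rec in enumerate(records):
        row_key = f"{rec['post_id']}#{i:06d}"
        table[row_key] = rec
    return table


# Hypothetical structured data from inside the company, keyed by user name.
customers = {"alice": {"segment": "premium"}, "bob": {"segment": "basic"}}


def merge(table, customers):
    """Step 3: attach company attributes to each stored record."""
    merged = []
    for row_key, rec in sorted(table.items()):
        extra = customers.get(rec["user"], {"segment": "unknown"})
        merged.append({**rec, **extra})
    return merged


def comments_per_segment(rows):
    """Step 4: a simple analysis, counting comments per customer segment."""
    return Counter(row["segment"] for row in rows)


table = store(crawled_comments)
rows = merge(table, customers)
print(comments_per_segment(rows))
```

In practice the hard part is usually step 3: a public profile name rarely matches a key in the company's own data, so the join rule would need real thought.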

Saul Sherry, User Rank: Blogger
1/22/2013 | 9:21:34 AM


Re: steps to gain benefits
@Ahmed - sounds to me like this approach is to pull data from a FB page you are administrator on - is this what you were after or were you thinking of a broader pull from Facebook?

ahmed farag, User Rank: Bit Player
1/18/2013 | 1:25:53 PM


Re: steps to gain benefits
Hi @Saul, I'm really very thankful to you and AlphaEdge. Your link is great, but I have another issue: will this crawler work with Facebook? Could you take a look at these links and tell me what you understand from them:

http://onestopdotnet.wordpress.com/2012/06/25/talk-to-facebook-graph-api-via-fqlc/
https://developers.facebook.com/docs/getting-started/graphapi/

It seems to me to be a pipe or bridge from Facebook to developers.

Take a look and tell me what you find.

