Text files come from various sources (e.g. local or remote) and in various formats (e.g. txt, csv/tsv, json, pkl). This post assumes that the reader is using Colab (https://colab.research.google.com/). Some parts of the code are based on the NLTK guide (NLTK - Processing Raw Text), while others use the sample data files that ship with Colab.
Sources:
(Source.1) Local machine
- Get the file path and read the content.
# print the current directory from within Python
import os
print(os.listdir('.'))

# assuming this is a Colab project that starts
# in the /content/ directory, which contains
# a sub-directory sample_data, run a magic
# command to change into the sample_data directory
%cd sample_data

# open a raw file and print its content
f = open('README.md')
raw = f.read()
print(raw[:150])
(Source.2) Remote machine
- Read the content directly from the remote location, or download it to the local machine first, then get the path and read the content.
There are three commonly used packages for fetching remote files, i.e. urllib, http.client and Requests. The Requests package is recommended for a higher-level HTTP client interface (Python Docs).
# open directly from remote storage
import requests

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
raw = requests.get(url).content.decode('utf-8-sig')
print(raw)
# download from remote storage and open locally
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
file = url.split("/")[-1]
!wget {url}

f = open(file)
raw = f.read()
print(raw[:150])
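Since urllib was mentioned above but not demonstrated, here is a minimal sketch of the same direct fetch using the standard-library urllib.request (no third-party package needed), against the same Project Gutenberg URL:

# open directly from remote storage using urllib
from urllib.request import urlopen

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
raw = urlopen(url).read().decode('utf-8-sig')
print(raw[:150])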
(Source.3) Web Page
# open a web page and scrape the list of
# emotion types using Beautiful Soup
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Emotion_classification"
response = requests.get(url).content.decode('utf-8-sig')
soup = BeautifulSoup(response, 'html.parser')

listEmoType = []
# the first div-col division holds the list of emotion types
for a in soup.find_all("div", class_="div-col")[:1]:
    for b in a.find_all("a"):
        listEmoType.append(b.text)

print(len(listEmoType))
print("\n".join(listEmoType))
Formats:
(Format.1) Text (TXT).
filepath = '/content/sample_data/README.md'
f = open(filepath)
raw = f.read()
print(raw[:150])
(Format.2) Comma Separated Value (CSV).
# import csv with the default delimiter
import csv

sourcefilepath = '/content/sample_data/california_housing_test.csv'
with open(sourcefilepath, newline='') as f:
    reader = csv.reader(f)
    data = list(reader)
print(data[:150])
# export csv to tsv format
import csv

sourcefilepath = '/content/sample_data/california_housing_test.csv'
targetfilepath = sourcefilepath.replace(".csv", ".tsv")
csv.writer(open(targetfilepath, 'w+'), delimiter='\t') \
   .writerows(csv.reader(open(sourcefilepath)))
# import csv with tab delimiter
import csv

filepath = '/content/sample_data/california_housing_test.tsv'
with open(filepath, newline='') as f:
    reader = csv.reader(f, delimiter="\t")  # set the delimiter
    data = list(reader)
print(data[:150])
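As a side note, when columns are better addressed by header name than by position, the csv module's DictReader can be used instead; a minimal sketch against the same sample file:

# import csv with rows keyed by the column headers
import csv

sourcefilepath = '/content/sample_data/california_housing_test.csv'
with open(sourcefilepath, newline='') as f:
    reader = csv.DictReader(f)
    rows = list(reader)
print(rows[0])  # first data row as a dict keyed by column name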
(Format.3) JavaScript Object Notation (JSON).
JSON is a format for storing and exchanging data. Python has a built-in module called json for working with JSON data.
# open a json file
import json

filepath = '/content/sample_data/anscombe.json'
with open(filepath, 'r') as f:
    data = json.load(f)
print(data)
# continue from the previous exercise:
# save the json data into a new file
import json

targetfilepath = '/content/sample_data/newdata.json'
with open(targetfilepath, 'w') as json_file:
    json.dump(data, json_file)
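JSON is also commonly exchanged as a plain string (e.g. over HTTP); the json module provides dumps and loads for this. A minimal sketch continuing from the data object loaded above:

# serialize to / from a JSON string (no file involved)
import json

jsonString = json.dumps(data)       # Python object -> JSON string
dataAgain = json.loads(jsonString)  # JSON string -> Python object
print(dataAgain == data)            # True: the round trip is lossless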
(Format.4) Pickle (PKL).
Pickle is a Python module that serializes data for storage. The data is stored in a binary format. Pickle is useful for efficient data loading in machine learning activities.
# write pickle
import pickle

# create data
listData = [{'good': 10, 'morning': 4}]
print(type(listData))

# open a file and dump the data;
# a new file will be created
with open('fileData.pkl', 'wb') as fileData:
    pickle.dump(listData, fileData)
# read pickle
import pickle

# open the file in binary mode and load the data
with open('fileData.pkl', 'rb') as fileData:
    listData = pickle.load(fileData)
print(type(listData))
print(listData)